EVE has a deadly cobra strike force team alpha of extremely dedicated and proficient developers fighting the Lagmonster tooth and nail as their only mission. Through their work and the work of others, there's now, at this moment, a perfect storm within the company and we have a great number of fixes in the pipes that will knock your socks off.
Our hope is that very soon our beloved Tranquility will be able to support fleet fights of a scale that far exceeds anything you've seen before, hopefully going beyond the roof of roughly one thousand on a dedicated node.
That's tough talk, and we mean it. We will continue to go into detail in our ongoing series of dev blogs, some of which have previously been outlined in a dev blog by CCP Zulu. As soon as the optimizations are ready they will be pushed out to Tranquility individually and you will be able to gauge the difference yourself.
Character Nodes
In this blog I am going to talk about one of the optimizations we've been working on over the past few months: Character Nodes, which we have already started deploying to Tranquility with phenomenal success.
The EVE Server is architected in such a way that functionality is split into logical load balancing units, and these units are statically assigned to a node, which is a server process running on a single CPU core (since the server process is single-threaded). As long as any individual load balancing unit does not exceed the capacity of that CPU core we're fine, since we can just add more nodes if the load becomes too high. However, if a single load balancing unit needs to do more work than a single CPU core can handle then we're in trouble.
A market region (Forge, Lonetrek, etc) is an example of a load balancing unit. We have multiple market regions living on a single node and currently four nodes servicing all the market regions. If the load on the market increases we can just increase the number of nodes dedicated to that task and decrease the number of markets on a given node. What this means is that when you're in Jita and browsing the market, you're not talking to the Jita node at all and don't feel any Jita lag effects (up until the point where you buy or sell something in which case the Jita inventory system gets involved).
Another example of a load balancing unit is a solar system. Typically we will have multiple, even hundreds of solar systems living on a single node. We call these types of nodes Location Nodes. Typically these solar systems have such low load that they can be mapped onto a node with a lot of other systems in the same way as the market is. However, now comes the gotcha: If a single solar system exceeds the capacity of the CPU core we have lost the ability to further balance it.
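To make the gotcha concrete, here is a minimal sketch of packing load balancing units onto single-core nodes. The load figures, the system names other than Jita and the heaviest-first packing heuristic are all assumptions made for illustration; this is not how the real cluster maps systems to nodes.

```python
# Hypothetical sketch: load figures, names and the heaviest-first packing
# heuristic are assumptions, not the cluster's actual node mapping.
CORE_CAPACITY = 1.0  # one node = one single-threaded process on one CPU core


def assign_units(unit_loads, capacity=CORE_CAPACITY):
    """Pack load balancing units onto nodes, heaviest first.

    Returns (nodes, overloaded): nodes is a list of {unit: load} dicts and
    overloaded lists the units that exceed one core all by themselves.
    """
    nodes = []
    overloaded = []
    for unit, load in sorted(unit_loads.items(), key=lambda kv: -kv[1]):
        if load > capacity:
            # The unit gets a dedicated node, but adding nodes cannot help it.
            nodes.append({unit: load})
            overloaded.append(unit)
            continue
        for node in nodes:
            if sum(node.values()) + load <= capacity:
                node[unit] = load
                break
        else:
            # No existing node has room, so start a new one.
            nodes.append({unit: load})
    return nodes, overloaded


loads = {"Jita": 1.3, "quiet system A": 0.02, "quiet system B": 0.03,
         "quiet system C": 0.05, "busy system": 0.6}
nodes, overloaded = assign_units(loads)
print(len(nodes), overloaded)
```

The quiet systems share a node with the busy one, while the overloaded unit sits alone on its node and is still over capacity. The only remaining lever is to give it less work to do.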
This is the problem in solar systems like Jita and in systems where fleet fights are occurring. We cannot split the solar system up into more units and spread them out and we cannot spread the work out onto multiple cores. We are effectively stuck between a rock and a hard place (stupid GIL!).
Because of these absolute constraints it's important that any work that doesn't need to be done by the location node is taken elsewhere. Therefore things like planets (in Planetary Interaction), markets, corporations and alliances are load balanced separately from the solar system you're in, and if your EVE Client calls these services (such as when you are viewing your corp bulletins) then it's talking to a different node than your location node. Your client is at any one time talking to half a dozen different nodes, depending on the call context.
Because of the very strict request-response model that we employ (you click a button (make a request), the server does work, you get a response back) in our business logic a lot of the game systems lend themselves very well to distributed load balancing (e.g. away from your location). However, historically the location node has been viewed as your 'primary' node for a lot of the auxiliary logic that runs on the cluster. If a particular call doesn't have a specific place to go to it will get routed to your location node. Now, that's just lazy.
Today, after years of optimizations, the only true remaining bottleneck on the Tranquility cluster is the solar system location and we need to shave off every single CPU cycle that we can. With this in mind we introduced the concept of the Character Node in Tyrannis and have been moving services over to this paradigm since.
Figure 1: Node configuration
Figure 1 shows what this looks like. Calls that would otherwise have gone to 'Location' are now routed to 'Character', which is a set of nodes that is very easily load balanced according to the number of logged-in characters. This fits nicely with the schema that is already in place. We need roughly 8 character nodes (out of the total of 204 sol nodes in the cluster) to handle the load and this is very predictable.
Now that most of these changes have been deployed, when your client makes a call for things that aren't directly related to your location, that call will typically be routed to your dedicated character node where your character happily ‘lives' along with tens of thousands of other characters.
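The routing change can be sketched roughly as follows. The service names, the node names and the hash used to pick a character node are made up for illustration; the only parts taken from the text are the fallback order (dedicated service node first, then location or character node) and the figure of roughly 8 character nodes.

```python
import zlib

# Hypothetical service and node names; only the routing idea is from the text.
DEDICATED_SERVICES = {"market": "market-node", "corporation": "corp-node"}
LOCATION_SERVICES = {"inventory", "movement"}  # assumed to be location-bound
CHARACTER_NODES = ["char-node-%d" % i for i in range(8)]  # "roughly 8"


def character_node(character_id):
    # Assumption: a stable hash keeps a character on the same node every call.
    index = zlib.crc32(str(character_id).encode()) % len(CHARACTER_NODES)
    return CHARACTER_NODES[index]


def route(service, character_id, location_node, use_character_nodes=True):
    """Return the node that should service a call."""
    if service in DEDICATED_SERVICES:
        return DEDICATED_SERVICES[service]
    if service in LOCATION_SERVICES or not use_character_nodes:
        # Old behaviour: anything without a home lands on the location node.
        return location_node
    return character_node(character_id)


# "mail" stands in for any service that does not care where you are.
print(route("mail", 42, "jita-node", use_character_nodes=False))  # jita-node
print(route("mail", 42, "jita-node") in CHARACTER_NODES)          # True
```

With the flag off, the homeless call loads the Jita node. With it on, the same call goes to a character node regardless of which solar system the pilot is in.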
Since the work coming from an individual client has a long way to go before it exceeds the capacity of a single CPU core and is easily predictable, we're in a good place. The client still makes the same number of calls to the cluster as before and the amount of work done on the cluster is unchanged. All that is different is that other nodes perform the work.
Changing this isn't very glorious or filled with a lot of eureka! moments. It's mostly just eating your vegetables and rearranging logic. The benefits, however, are substantial since, as it turned out, the majority of the calls that a client makes in a given period can be serviced by any node and therefore should not go to the location node.
The results
Okay, all the wall-of-text above was just to explain how the system works. Now that you've eaten your vegetables it's time for the sugar-laden dessert. Let's see the effect this had on Jita last week, comparing the performance before and after phase 2 of these changes was deployed to Tranquility on 12 August.
Figure 2: Network traffic on the Jita node
In Figure 2 you can see the number of calls made onto the Jita node in four consecutive runs. After moving some services over to the character nodes, up to 80% of the calls were routed elsewhere, freeing Jita up for important things like inventory operations and scams in local.
Figure 3: CPU utilization on the Jita node
As expected, moving services away from the Jita node had a good effect on CPU and has allowed us to scale Jita beyond the 1400 pilot limit that has capped its population for a while. Hopefully you should not be towed to another star system when you log in on Sunday evenings for a while.
Even though the metrics that we have gathered are for the Jita node, this change will have a positive effect on all loaded nodes in the cluster, with Jita and Fleet Fight nodes benefiting the most. The reason I use Jita here is that it has a very predictable load pattern whereas fleet fights are anything but. However, the same principles apply. Before this change you would be making something like 5-10 server calls to your location node to finish jumping, and each one of these calls could take a long time to complete. Now you'll be making something like 4 to your location node, with the rest going elsewhere and returning very quickly.
We're hoping you'll be able to tell the difference the next time you decide to invade your nearest friendly neighbor. :-)
Also keep in mind that the other benefit of this type of offloading is that you don't "feel" the lag as much. Take Jita for example. If you're just browsing the market you don't notice that it's laggy since you're not talking to that node; you're talking to the market node. With the character node changes that we're doing, much of the user experience will be improved greatly since the buttons that you're clicking end up on a lightly loaded node, even if clicking around in space is laggy. This can make a big difference in the overall playing experience.
We have been deploying the character node changes piecemeal to Tranquility throughout August and have a few more going out over the next few weeks. These should give you better performance in fleet fights as well as allow us to push Jita's concurrent player boundaries further.
This is all well and good, but keep in mind that 'lag' hasn't been 'fixed' once and for all and it might not be in the foreseeable future. We are, however, pushing the boundaries of the cluster more aggressively these days than we have been for a while. Like I mentioned before, this is just one of the many things that we are working on right now. More services are being moved to character nodes this autumn and we will have plenty more awesome fixes aimed at fleet fights coming in over the next weeks and months.
Fleet fight tests on Singularity
As stated before, it is very important for us to get as many people as possible involved when we are testing on the Singularity test server (SiSi). Please join in when the tests are advertised and help us test the performance improvements so that they can be deployed onto Tranquility as quickly as possible.
A word about fleet fight notifications
A fleet fight which happens on a node with a hundred other systems is not a good experience for anyone involved, since the node often has only 30% of its CPU capacity left when the fight starts. You can help make sure that the solar system you will be fighting in is on a dedicated node.
Use the Fleet Fight Notification form to let us know about a pending fleet fight. Sending in a petition is not the right way to report this; you must use the form. If you report the fight well in advance we can make sure that the fight is as smooth as we can make it.
Jon Bjarnason Technical Director EVE Online, CCP Games
From the lookout post of the impregnable CCP stronghold high in the Equus Mortuus mountains, we can see the world turning beneath us. As the universe slowly cools and we turn into ever older geezers, so also do the tools of our trade evolve.
As you may know, EVE has at its core the programming language known as Stackless Python. When development of EVE was started, all we had to go by was a pocket watch, a piece of twisted wire, Stackless Python 1.5 and an LP with a band called Randver. Stackless Python, back then, was centered around the concept of "Continuations" and it was rather tricky to use.
Later we moved to Stackless Python 2.0. Along with the changes to Python itself, Stackless had grown the "Tasklets" and "Channels" that we have come to know and love.
We have since tried to keep abreast of developments in Python. At various points in time we have upgraded our codebase to use Python 2.1, 2.3 and 2.5 successively.
Due to the amount of work that it entails, both the act of integration and the impact on developers, we have successfully dodged every other version. But with each new version come benefits:
New language features make development simpler and more effective
The Python library is improved in terms of features and performance
The language itself gains performance improvements.
We stay current with a language in development.
Now, after a few years of sticking with Stackless Python 2.5, we have finally taken the leap and upgraded to 2.7. Python 2.7 was released a few weeks ago, and subsequently, Stackless Python 2.7.
When we do an upgrade like this, it is not a simple task. There are a lot of private modifications to Python that need to be migrated. There is a lot of C code that needs recompiling. And there is a lot of Python code that may need some minor modifications.
For this reason, we are very careful to make sure that no unintended problems show up. We run an extensive set of regression tests to ensure that everything works as before, and a full performance benchmark suite to verify that performance does not degrade.
As an example, we found during testing that some client entry fields started working differently. A user would enter a percentage, say 5.55%, and have it rounded to one fractional digit as 5.5% rather than the previous value of 5.6%. It turned out that the rounding of floating point numbers had been subtly corrected in version 2.7 and that 5.5% is indeed the correct value (because 5.55 is actually stored as roughly 5.54999). To make such UI-sensitive rounding cases work as expected, we switched to using the decimal module, a toolkit for doing decimal floating-point arithmetic.
Python 2.7 is the last of a line. There are no plans to continue with a Python 2.8 version. All Python development is now focused on the 3.2 version of Python. But we won't be going there any time soon. Moving from version 2 to 3 is a much bigger leap: There are widespread incompatibilities on the API level and there are substantial language changes too. And at this time there appears to be no immediate benefit to us in switching.
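The rounding difference is easy to reproduce. The snippet below is a small illustration using only the standard decimal module. The half-up rounding mode is an assumption about what a player expects from a percentage field, not a statement of what the client code actually does.

```python
from decimal import Decimal, ROUND_HALF_UP

# The float literal 5.55 is stored as the nearest binary double, which is
# slightly below 5.55, so correct rounding to one digit gives 5.5.
exact = Decimal(5.55)           # exact value of the stored double
float_rounded = round(5.55, 1)

# Building the Decimal from the text the user typed keeps the value at
# exactly 5.55, and the rounding mode can be chosen explicitly.
typed = Decimal("5.55")
ui_rounded = typed.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

print(exact < typed)    # True
print(float_rounded)    # 5.5
print(ui_rounded)       # 5.6
```

The important detail is constructing the Decimal from the string, not from a float. Once the value has passed through a binary float the missing fraction cannot be recovered.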
What effect will the upgrade have on the player? In the short term probably none. But in the longer term, it will make our job of providing quality services to you simpler, more enjoyable, and easier on our arthritic fingers and receding hairlines.
I've also got us a new stylus for our gramophone. Stackless Python 2.7 and Randver will keep things moving for quite a while yet.
As we're all too painfully aware, when a few hundred pod pilots get together and exchange ammunition, the dreaded "lag monster" often comes around. There are a number of distinct phenomena that are referred to as lag. It is important to distinguish between them, as they typically have different causes/solutions (even if the symptoms seem the same, like "my guns didn't work"). This blog is about the type called "module lag," in which modules start to respond very poorly - sometimes minutes late. Specifically, this blog is about the bug that causes this for repeats/deactivations, and why it's not fixed on Tranquility (TQ) yet.
At the Council of Stellar Management (CSM) summit in June, the CSM made an excellent presentation of the issues around fleet-fight lag, specifically about the issues of modules becoming ‘stuck' and the workarounds used by players. This provided useful insight into a problem which, until recently, we'd never been able to reproduce in a development environment. This made the problem very tangible, and gave us a great symptom to begin digging into.
Like all the best bugs, this one has layers. We'll start at the first thing we noticed when digging at it: the system responsible for telling the server when modules should be turned off or repeated would get minutes behind in processing when fleet fights happen while other systems remain reasonably responsive. This system, named Dogma, handles module activation/repeat/deactivation, as well as the actual effects of those modules. It also handles the various skill/ship modifiers you see in the game.
This sort of lopsided performance should be easily tweaked. Tasks on the EVE servers use a time-sharing technique called cooperative multitasking which, in short, means that a task has to willingly yield execution to other tasks, otherwise it will run forever. In this case, it would seem the part of Dogma handling module repeat and deactivation was being too nice - yielding execution too much.
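For readers who have not met cooperative multitasking before, here is a toy round-robin scheduler. It uses plain Python generators as a stand-in for Stackless tasklets, so the scheduler and the task names are illustrative only and say nothing about how the server's real scheduler is written.

```python
from collections import deque


def polite_task(name, steps, log):
    for i in range(steps):
        log.append((name, i))
        yield  # willingly hand control back to the scheduler


def hog_task(name, steps, log):
    # Does all of its work before yielding, so nothing else runs meanwhile.
    for i in range(steps):
        log.append((name, i))
    yield


def run(tasks):
    """Round-robin over the tasks until every one of them has finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run until the task's next yield
            ready.append(current)  # not finished: back of the line
        except StopIteration:
            pass                   # finished: drop it


log = []
run([polite_task("a", 2, log), hog_task("h", 3, log), polite_task("b", 2, log)])
print(log)
```

The log shows "a" and "b" taking turns one step at a time, while "h" finishes all three of its steps in a single turn. A task that yields after every tiny step gets very little done per turn, and one that never yields starves everyone else.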
Looking at the code some, a theory emerged as to why. There was an error case that stuck out as odd - if an effect was supposed to be stopped or repeated, but the effect system itself didn't agree that it was time yet, the code would throw up its hands and give up. If that error case gets hit, the processing loop would yield to other systems early. A code comment was very reassuring though - this error was supposedly "rare."
The "rare" error happened 1.5 million times in the month of June, 2010 on TQ.
Above: A re-enactment of my expression after learning this fact
So, there's the first layer. This error happens, Dogma doesn't finish what it wanted to do (process every module event that should happen), the work load backs up and your modules start taking minutes to respond. The typical thing to do here would be to run a test to verify this. Unfortunately at that point the only environment we had to test on was the Singularity server mass tests. So, off we go, putting in a setting to test to see if simply continuing past this error and finishing processing would be enough to alleviate the module repeat/deactivation problem. There was a solid chance that it would.
Participants of the July 15th mass test: this is what I tested that made the node go horribly, horribly wrong. Instead of clearing through the errors and charging on to victory, the effects processing loop instead choked on them, constantly hitting them and not yielding execution to anyone else until minutes later. Interesting result. The next mass test was three weeks away, so back to the lab I went to cook up something to test then.
And then, our savior arrived. The Thin Client, as outlined in CCP Atropos' blog. These lovely little beasties, once properly tamed, allowed us to get a few hundred clients doing simple behaviors in the lab. First, there were lazors:
Then there were drones:
Then there was orbiting:
And then there was lag! The server I was running against was positively unhappy. But still, modules were reasonably fine - the error was not to be seen. (Remember what I said above about there being many manifestations of lag? Well, this was a different sort, but not what we needed to reproduce the 'stuck module' symptom.) Some fantastic sleuthing by my teammate CCP Masterplan led us to some simple steps that could be done to induce this error. Module delays quickly followed.
Once we have a reproduction in the lab that we can poke at, test new code on, and generally just play with, only the craziest bugs stand a chance. This one took only a couple hours to pin down once this setup was in place.
So layer two! The error, as you may recall from a few paragraphs up, was that Dogma was saying that the effect wasn't ready to be stopped or repeated yet. Well, why? The calculation is as easy as you'd think it would be - start time plus duration. It would take a lot to miscalculate that, so what the hell?
Turns out, the errors were coming about because the same effect was in the stop/repeat manager queue twice (or more). One of the copies would process fine, and then all of the duplicates would generate the error because the effect was already repeated by the first one that processed fine. So, duplicates in the manager caused the error, the error caused the process loop to yield early, the process loop yielding early caused module problems.
Next up, we put together a test to see if enforcing that there be no duplicates in the manager queue fixes the error, and therefore the module problem. Cue the thin clients, they happily pew pew themselves for a few hours while we get the code right, and the result is a positive - fixing this layer does fix the symptom.
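A stripped-down model shows how a duplicate entry produces the "not ready yet" error and how refusing duplicates removes it. The class, its fields and the choice to count errors (instead of yielding early) are all hypothetical; only the "start time plus duration" check and the duplicate-entry cause come from the description above.

```python
import heapq


class StopRepeatQueue:
    """Hypothetical stand-in for the stop/repeat manager queue."""

    def __init__(self, dedupe):
        self.dedupe = dedupe
        self.heap = []        # entries are (due_time, effect_id)
        self.pending = set()  # effect ids currently queued

    def schedule(self, effect_id, due_time):
        if self.dedupe and effect_id in self.pending:
            return False      # already queued: ignore the duplicate request
        self.pending.add(effect_id)
        heapq.heappush(self.heap, (due_time, effect_id))
        return True

    def process(self, now, effects):
        """Repeat every due effect; return the number of 'not ready' errors."""
        errors = 0
        while self.heap and self.heap[0][0] <= now:
            _, effect_id = heapq.heappop(self.heap)
            self.pending.discard(effect_id)
            start, duration = effects[effect_id]
            if start + duration > now:
                errors += 1   # duplicate entry: effect was already repeated
                continue
            effects[effect_id] = (now, duration)  # repeat: new cycle starts
            self.schedule(effect_id, now + duration)
        return errors


def run_demo(dedupe):
    effects = {1: (0, 10)}    # effect 1 started at t=0 and lasts 10 seconds
    queue = StopRepeatQueue(dedupe)
    queue.schedule(1, 10)
    queue.schedule(1, 10)     # the same effect requested a second time
    return queue.process(10, effects)


print(run_demo(dedupe=False), run_demo(dedupe=True))  # 1 0
```

Without the check, the first copy repeats the effect and the second copy then finds an effect that is not due for another ten seconds. With the check, the second request never enters the queue.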
But good engineers don't just fix a problem once; they fix it at as many layers as they can. At this point, we've prevented duplicate effects from entering, but they shouldn't be trying to get in there in the first place. A quick trace dropped in where the duplicate is requested pointed us right at the top layer of this bug. The code that's responsible for making an entity (a drone, an NPC, things like that) go docile was calling the very same function that the stop/repeat manager was calling to stop its effects! So specific timing cases, which are rare under normal circumstances but extremely common in high-load situations, were causing the effect to repeat instead of stop. One quick tweak there and the top layer is fixed as well.
Thin client tests confirmed that fixing either layer fixes the symptom. Cue the mass test!
The August 5th mass test had three rounds of combat in order to help us identify the impact of these bug fixes. The first round had the fixes turned off and the second had them turned on, with no difference in style of fighting. The fixes did exactly what they were intended to do - repeating/disabling effects went from being 155 seconds behind at the height of round one to never getting more than seven seconds behind. Success!
Well, almost. As anyone who was at that test would tell you, the game was in bad, bad shape. If you could get a lock and get a module to start repeating, it would repeat with gusto, but it rarely actually happened. Locking took minutes, same with turning on or off a module. Even warping away took a very long time.
There are two problems evident from that test. Firstly, it's clear that while relying on a bug to yield execution from this loop is a bad idea, not yielding at all isn't a good plan either. Secondly, and more importantly, reasonable numbers of pilots can generate enough load for the effects system that it can't keep up. That's a dangerous game, taking more than one second to process one second worth of effects - it quickly leads to doing nothing but processing effects. That's what we started to see happen in the second phase of combat, which is scary business.
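The arithmetic behind that tipping point is simple enough to write down. In this back-of-the-envelope sketch the cost figures are invented; the point is only that any cost above one second of CPU per second of effects produces lag that grows without bound.

```python
def lag_after(seconds_of_effects, cost_per_second):
    """Seconds the effects system is behind after the given amount of game time.

    cost_per_second is the CPU time (in seconds) needed to process one
    second worth of effects; both numbers are made up for illustration.
    """
    lag = 0.0
    for _ in range(seconds_of_effects):
        # Each second adds cost_per_second of work and gives 1 second to do it.
        lag = max(0.0, lag + cost_per_second - 1.0)
    return lag


print(lag_after(60, 0.9))  # 0.0
print(lag_after(60, 1.5))  # 30.0
```

Below a cost of 1.0 the system idles part of each second and stays caught up. Above it, the backlog grows by the excess every second, which is why a small extra load such as drones can be enough to tip things over.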
In the third phase we wanted to see where the tipping point was, so we started off with a decreased load by requesting no drones. Everything was peachy, the system could keep up just fine. Adding drones in doesn't add very much load, but it was enough to tip the effects system over the edge of being able to keep up. We were able to find where that tipping point was, and that will help guide what we do for TQ in the short term.
Essentially what we have to do is re-introduce yielding in the effect processing queue, only under our control instead of at the whim of some defect. It will be after a configurable amount of time, and I'll be present at some big fights to try to tune in on where that value should be to balance the needs of processing effects versus every other system. Longer term, we're working on some designs that could allow the effects system to keep up under very high load, which should finally put the nail in the coffin on module response issues.
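A controlled yield of that kind might look like the sketch below: process queued effects for a configurable time slice, then hand control back. The function, its parameters and the use of a generator in place of a Stackless tasklet are assumptions made for illustration, and the 5 ms budget is an arbitrary number.

```python
import time
from collections import deque


def process_effects(queue, handle, budget_seconds, clock=time.monotonic):
    """Process queued effects, yielding to other systems after each slice.

    Yields the size of the remaining backlog every time the budget runs out
    or the queue empties. All names here are hypothetical.
    """
    while queue:
        slice_start = clock()
        while queue and clock() - slice_start < budget_seconds:
            handle(queue.popleft())
        yield len(queue)  # give everything else a turn; report the backlog


# Usage: a scheduler would interleave other work between slices.
pending = deque(range(10))
handled = []
for backlog in process_effects(pending, handled.append, budget_seconds=0.005):
    pass  # other systems would run here
print(handled == list(range(10)))  # True
```

Tuning then comes down to one number: a larger budget favours the effects system, a smaller one favours everything else running on the node.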
In the meantime, we've been pointing the thin clients at various hotspots of performance and tackling them. More on that as soon as the results are out for you to enjoy!
~CCP Veritas
PS - Big shout out to the CSM for their help to date pin pointing specific issues like this one and notifying us of fights as they happen so we can be there monitoring. Also, thanks to all the players turning out for the Singularity mass-tests. Much love~
I like planning. I really do. I think calendars are pretty sexy and my OCD acts up when things aren’t color coded, have point values or do barrel rolls. This summer, I planned to sneak a few smaller items onto my team’s backlog and positively surprise the Council of Stellar Management (CSM) when they got here. I ended up going to all these CSM summits, talking about features new and old, but never actually got to tell the CSM about some of the small things we had lined up. My plan didn’t work out, but the job got done, which is what matters.
Usually, smaller changes get bunched up and released with the coming expansion. This particular change is a little different though, as part of the code is needed for a bugfix, meaning we will be deploying it in the near future instead of this winter. The change we are introducing is going to end ghost datacore accumulation.
Currently, any character collecting research points from an agent will continue to do so, even after the account lapses due to inactivity. This is a pretty massive loophole for making substantial amounts of money and it is now being closed. When this change is deployed, characters will stop collecting research points when their account lapses into inactivity. You won’t lose the points you’ve accumulated up to that point, but you simply won’t gain any more. When you reactivate your account, your character will automatically begin earning RPs again. This change is long overdue and will hopefully benefit our active players.
On a sidenote, this was an issue our team picked off the CSM’s list of priorities. It’s a great list that outlines both big and small concerns in the community and hopefully we can continue to address it. Summers are pretty quiet here due to holidays, and it’s a great time to go over the list and pick a few items for your team. Hopefully, we can bring you more items off the CSM’s list, but until then, here’s one of them.
Over the last few days you will have seen the blogs coming out from the boffins in our engineering teams about Mass Testing, Long Lag and Thin Clients. In some of these blogs you will have seen reference to ‘Core' teams and so I wanted to take this chance to introduce the Core Technology Group (CTG), let you know what it is we do and where we sit within the CCP development structure.
At the end of June, our CEO Hilmar Veigar Petursson referred to CCP's Core Technology Platform at his opening keynote of China GDC. For some time now, we have been internally branding the framework we use to build all of our games as "Carbon."
Giving the framework a name helps us a lot with communication internally and of course to you, the wonderful players of EVE Online. There is also the old Icelandic saying of "if you know its name, you can kill it", which basically talks about the power of knowing the names of things and how that gives you the ability to control said phenomena. As complex and multifaceted as our Carbon framework is, we certainly benefit from all the help we can get to wield it. And the custodians of Carbon are the Core Technology Group.
This video shows some of the Carbon technology in action as was presented by two CTG members at GDC this year.
So, why have Carbon and what is it really?
As CCP and EVE continue to grow, it makes sense for us to consider our existing technologies and re-use these, if appropriate, across our projects as part of the Carbon framework. This becomes more appealing if those technologies are proven (or, as we call it, 'battletested') in the crucible that is a game being used by a passionate, resourceful and sometimes devious player base. Furthermore, as our new projects continue to mature, we can take the technologies developed by them and apply them to EVE. This means that the EVE development teams get some great new technologies which they haven't had to spend any developer time creating. Having the CTG take the strain of making this happen ensures the EVE developers can concentrate on developing EVE and make best use of this common technology.
It also makes sense for a central group (the CTG) to build brand new Carbon technologies which we know can be shared. This group can then deploy and support the technology across CCP to whoever needs or wants to use it.
So what is the Carbon technology framework?
Well, it is mostly things which operate at a low level in our games. The idea is that the framework will consist of all of the key technology pieces for an MMO that our game teams can pick up and use. The Trinity2 graphics engine was the first piece of Carbon technology (even if we didn't know it as Carbon back then) introduced back in late 2007 by the newly formed Core group. Since then, various parts of our technology have been ‘Carbonised' such as the graphics refresh that happened as part of Apocrypha. You can refer back to these Dev Blogs for details of things which have been made part of Carbon in the past.
Of course, just creating and deploying new or existing technology, even if it is ‘battletested', is only part of the story. The CTG also spends a significant portion of its time re-working, re-writing and re-engineering parts of our codebase in a continuous effort to keep it up to date as technology advances. This also allows us to prepare the codebase to accept new technologies into the Carbon framework, allowing us to keep pushing the boundaries of what EVE Online can be.
Who is in the Core Technology Group?
Well, the CTG started, as mentioned earlier, with a few graphics programmers working under the direction of our Chief Technology Officer to produce the next generation of the EVE graphics engine. From that small beginning, we have been hiring new people into CCP in order to fill a number of teams. The CTG is a separate group within CCP; it is not part of EVE. We do not count as part of the EVE headcount and we have a separate budget and hiring plan. However, in reality we work very closely with EVE, although Core does not work on game features. We provide the reprocessed minerals that the game teams then use to, in the case of EVE, build your serious internet spaceships.
We currently have 5 teams within the CTG, although we do use carefully chosen 3rd parties when needed. Futuremark is a good example where we are working with them on a new part of the graphics rendering engine. It must be noted that whilst the CTG teams all work on Carbon, some parts of Carbon are not maintained by the CTG. These are usually specialist areas which can be managed by the team which has that expertise. Animation is a prime example. The animation team is based in Atlanta and the work they are doing will be part of Carbon.
The 5 Core Technology Group teams consist of the following...
2 x Core Graphics Teams (12 people):
These teams are those working in the bowels of Trinity2, making it perform better, provide sexier visuals and use more up-to-date technology. The Carbon Character Technology video from earlier in this blog demonstrates a small portion of their work-in-progress. As you can see, this team has been working on the graphics technology for avatars and the environments they will inhabit. For EVE that's Incarna. These guys also develop and maintain our in-house tools system, ‘Jessica', which is used by pretty much every developer in CCP and contains the Trinity2 rendering engine. As requests for improvements, new functionality or bug fixes come in from the EVE developers, the Core Graphics guys get on the case and deliver. Jessica is also used by the EVE video team in making all of our trailers, allowing them serious time-saving shortcuts in staging assets for dramatic narrative effect.
Core EVE Graphics Team (3 people):
This team consists of three Core Graphics developers who are 100% assigned to EVE, working as part of a nine person EVE development scrum. Due to the demands of working in a cross-project, multinational company, parts of the CTG must occasionally shift their focus onto a particular project on a temporary basis. When this concentration is not on EVE (currently it is), the Core EVE Graphics team provides the link and continuity of Core involvement in the graphics side of EVE. By staying 100% laser focused, they make sure EVE continues to get the Core graphics programmer support it needs. You may have seen the recent Dev Blog about Tyrannis Performance Improvements by CCP Blaze, who is on this team. These guys are instrumental in working closely with the EVE Art team to bring you new ships, planets and other graphical wonders.
Core Infrastructure Team (4 people):
This team is really at the heart of everything we are able to put out to our customers at CCP. These guys have created a common set of productivity software, tools and technology which allows us to build EVE and its patches from the source code more efficiently, significantly reducing the time it takes to make EVE. This team has also built the repair tool which CCP Mandrake blogged about recently. In addition, this team has been working for a long time on delivering a way for our developers to be able to stress test EVE when they are developing new features or investigating and fixing bugs. We are now rolling this out and you can read more in the Dev Blog from CCP Atropos about the Thin Clients. I believe this is a significant step forward in how we are able to build and maintain EVE, allowing us to deliver a virtual world that we are much more confident is able to operate as we intend when we have record numbers of players hammering our servers and software.
Core Cluster Team (7 people)
What is one of the cornerstones of EVE? Lots of people interacting in a single universe without sharding. Obviously, EVE has scaled to dizzying heights over the last 7+ years and we now have more players than ever immersed in New Eden. However, we know that there are problems and we know we have to be constantly fighting to help the EVE cluster scale well beyond where it is now. This important task rests on many people all across CCP, from operations and virtual world staff to the EVE software developers to the EVE game designers and beyond. A key element in the fight to improve how we scale and reduce ‘Lag' in the game is the work of the Core Cluster team. This team has the high level goal of developing our cluster technology to allow EVE to scale well into the future, even as we put more demands on it with more pilots inhabiting a more dynamic and developed EVE universe. This is no overnight fix and there have been some great discussions (as well as some painful ones) on EVE-O and other forums. Many of these threads have a good grasp, if not of the specifics, then of the general idea that we are trying to solve some of the hardest problems in a number of very complex disciplines. CCP Warlock works as the Distributed Systems Architect for CCP and the Core Cluster team and recently released an excellent blog on some of the challenges we are facing.
As recent blogs have started to describe, we are doing things right now to attack lag. We have people looking at specific bugs and issues which have been plaguing fleet fights. We are well aware that there are lag issues around large fleet fights (and some not so large ones) which are reducing your enjoyment of this part of EVE. We know because you have been telling us about your experiences regarding things like module lag, jump in lag and blackscreens to name just a few. We also know because we, as players ourselves (long time lurker, first time poster checking in after 5 years playing) experience these issues. I will leave the specifics of our progress against these problems to the actual developers working on "lag" who can get low down and dirty with some real techy blogs.
Genuine progress on these issues is being made thanks to the tools being produced by Core Infrastructure that are being used by the Core Cluster and EVE development teams. The nature of the lag problem means that it takes time to diagnose and address the problems, but we have invested significantly in getting the right tools and people to help us identify and improve the situation as fast as possible. Once fixes have been deployed to TQ and we have real evidence that they are making a positive difference to your playing experience, you will see more Dev Blogs detailing the processes and fixes.
So in terms of teams and what we work on, that is the Core Technology Group as it stands right now. You can find some more information here. We will continue to grow proportionately to CCP's development teams and the growing scope of the Carbon framework. We are actively recruiting into the CTG and if you are interested, you can apply at the usual place, the CCP jobs page.
Now that you hopefully have a bit more clarity on how CCP's Core teams are structured, it is important that you, the EVE players, know that the focus of the Core Technology Group is on supporting EVE and the experience of playing it. Through our Carbon strategy we make sure that all future product development at CCP feeds back into EVE, as it all integrates back into Carbon. We also make sure that CCP has a robust, battle-tested core framework to win our war against the impossible.
Over the last few months I've been working on something very cool that some of you may have heard about; it's a project to rework the guts of the EVE game client to remove the audio and visual aspects of the game. In other words, to slim the client down as much as possible so that it can be considered 'lite' or 'thin'.
And thus the Thin Client(TM) was born!
What can this thin client do?
The basis for the thin client is the very EVE client you use yourselves; it takes that core and extends and overrides parts of it, so that you no longer need to have a sound card (insert generic EVE has sound meme) or a graphics card to run it.
Why should I care?
The thin client requires fewer system resources than a traditional 'full fat' client. As a result we can run more of them in parallel on one computer. Whereas a (normal?) EVE player might run 2 or maybe 3 accounts simultaneously with a traditional client, it's possible to run many times this number with the thin clients.
The obvious benefit of a client like this is one of scale; we can start up many hundreds of these clients and have them do something, anything.
It now becomes possible to set them up so that we can undertake a controlled, large scale test; you can submit a new change to the code base and retest with the same setup to examine the effects of the code change. The level of control and precision these tests now give us is unprecedented.
Such practices have been used to load test websites for a long time, by repeatedly making requests to websites in an effort to discover the bottlenecks of the system. However, for EVE the closest we have gotten is the mass tests on Singularity that CCP Tanis runs.
The mass tests provide us with valuable data, but they can be very hard to exercise control over, since you are dealing with anywhere from 200 to 500 living breathing EVE players. The thin clients on the other hand are mindless automatons; if we say jump off a cliff, they will, metaphorically, go straight for the edge.
Ok, but what does this actually mean for me?
The thin clients themselves aren't any smarter than a normal client. If you start up a normal EVE client it doesn't suddenly start trying to take over the world à la Skynet (hopefully), and the same is true of the thin client. To bridge this gap we've created two ways of telling the clients what to do: internal projects called Orchestrator and the Automaton Project, both of which I'll touch on later.
By being able to tell a client what to do we have created for ourselves a massive new tool box, which can be used to great effect. Allow me to elaborate:
It becomes possible to examine the behavior of massive amounts of mission running in the same system. To examine, as it were, the Rens Effect.
We can systematically examine the behavior of clients when they're fighting large scale fleet fights, allowing us to recreate and diagnose the unique problems that large scale fleet fights create, in the lab.
We can place Jita under the microscope so that the impact of many thousands of market transactions can be understood in detail.
We can examine just why the black screen of impending death appears when one fleet jumps into another, and work out more precise reproduction steps than "get a big fleet and jump into another one".
We can determine not only the theoretical threshold but also the actual performance threshold for fully loaded systems, whether it's pilots idling in space, hunting NPCs, shopping, afk-ing, anything.
And finally, we can evaluate the impact, at large scales, of new gameplay mechanics. Older players will recall many gameplay changes over the years that were made to improve server performance. Two obvious examples are the limiting of a ship to 5 drones from 10, and the changes to rate of fire and damage modifiers on weapons to limit the impact that high rate-of-fire weaponry would have on the server.
This new tool box allows us to load and stress test some of the oldest and most intricate components of the game.
Enough of the hurf blurf, give me the juicy stuff...
This is where I get technical, so if you fear techy, geeky nerd talk, skip this.
The obvious question is how did we achieve this? The core of the solution was through the application of two simple things:
Mocking and mock objects
Python class inheritance
For those of you with no programming knowledge I'll clarify: mocking is the practice of replacing one object with another that is almost identical but allows a lot more control. It's a process that is used in unit testing and allows the developer to test a piece of code in isolation. Mocking allows us to replace the pieces of the traditional codebase that rely upon a GUI with mock objects that do nothing. If you want to know more, Wikipedia has a nice page on the subject.
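To make the idea concrete, here is a minimal sketch of mocking using Python's standard `unittest.mock` module. The `TargetTracker` class and its `show_target_icon` call are hypothetical names invented for illustration; they are not taken from the EVE codebase, and the real thin client may well use its own mock objects rather than this library.

```python
from unittest.mock import MagicMock


class TargetTracker:
    """Hypothetical game-logic class that also pokes the UI."""

    def __init__(self, ui):
        self.ui = ui        # normally a real GUI layer
        self.locked = []

    def on_target_locked(self, target_name):
        self.locked.append(target_name)        # logic we want to keep
        self.ui.show_target_icon(target_name)  # GUI call we want to neutralise


# A MagicMock accepts any attribute access or call and does nothing,
# so the logic above can run without any graphics layer present.
fake_ui = MagicMock()
tracker = TargetTracker(fake_ui)
tracker.on_target_locked("Veldspar asteroid")

# The mock also records how it was used, which is the extra control
# that mocking gives you over the real object.
fake_ui.show_target_icon.assert_called_once_with("Veldspar asteroid")
print(tracker.locked)
```

The game logic still runs and updates its own state; only the GUI side has been swapped for something inert.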
The second step was the use of standard inheritance within Python to allow us to override particular pieces of the code; to explain I'll run through a simple example:
Consider the targeting system: when you initiate a target lock on a ship, asteroid, or whatever, you're telling the server to lock a target and to let you know when that is successful, or not, if they're out of your range.
This is represented on your client as a new target appearing at the top of your screen. With the thin clients, we don't have a GUI and so when the client gets the message from the server and attempts to load up the icon, it can go a bit haywire and raise errors about UI components missing and such.
By inheriting the class that handles the targeting, we can replace the single function causing the error with something that gracefully handles this new set of circumstances.
Of course, this can be very beneficial for us. It allows us to highlight areas of the codebase that are ripe for refactoring, where the game logic and the UI are too closely tied together.
In a lot of these cases, we're reviewing and touching on older code and so we are getting ancillary benefits from reviewing these files from a more up-to-date viewpoint.
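The targeting example can be sketched in a few lines of Python. Everything here (`TargetingService`, `draw_target_icon`, `MissingUIError`) is a made-up stand-in for illustration and says nothing about how the real client is organised; the point is only the pattern of inheriting a class and overriding the one method that needs a GUI.

```python
class MissingUIError(Exception):
    """Raised by the stand-in base class when no GUI exists (hypothetical)."""


class TargetingService:
    """Stand-in for the full client's targeting code (hypothetical)."""

    def __init__(self):
        self.targets = []

    def on_lock_confirmed(self, target_id):
        self.targets.append(target_id)    # game state update
        self.draw_target_icon(target_id)  # UI update

    def draw_target_icon(self, target_id):
        # The full client would draw an icon here; with no GUI this fails.
        raise MissingUIError("no UI layer available")


class ThinTargetingService(TargetingService):
    """Thin variant: inherit everything, override the one UI method."""

    def draw_target_icon(self, target_id):
        # Nothing to draw, so handle the call gracefully by doing nothing.
        return None


thin = ThinTargetingService()
thin.on_lock_confirmed(1001)
print(thin.targets)
```

Note that `on_lock_confirmed` mixes state changes and drawing in one method, which is exactly the kind of tight coupling between game logic and UI that this exercise tends to expose.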
So what's the performance like?
The average thin client has a memory footprint of between 150 and 200 MB. Now this may not be listed under the definition of 'thin' in everyone's dictionary, but it's a very good start. As we progress there will be more ways that we can reduce this footprint even more. As for CPU, the client requires very little; almost all the CPU required is in the first 30 seconds as all the Python libraries and code are loaded into memory. Once that's complete the clients become relatively quiescent. Unfortunately when you run a few hundred of these at the same time, even minor CPU fluctuations, occurring across every client at the same time, can cause problems, so it's something we're keen to keep to a minimum.
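For a rough feel of what that footprint means at scale, here is a back-of-the-envelope calculation. The per-client figures come from the paragraph above; the client counts are assumptions chosen purely for illustration, not numbers from any actual test.

```python
MB_PER_CLIENT_LOW = 150   # lower footprint quoted above
MB_PER_CLIENT_HIGH = 200  # upper footprint quoted above


def total_memory_gb(n_clients, mb_per_client):
    """Total memory in GB (1 GB = 1024 MB) for n identical clients."""
    return n_clients * mb_per_client / 1024


# Client counts below are illustrative assumptions only.
for n in (100, 300, 500):
    low = total_memory_gb(n, MB_PER_CLIENT_LOW)
    high = total_memory_gb(n, MB_PER_CLIENT_HIGH)
    print(f"{n} clients: {low:.1f} to {high:.1f} GB")
```

This ignores any memory the operating system could share between processes, so treat it as an upper-end estimate.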
But what about control? How do you tell them what to do?
Orchestrator is the framework that we've developed for running our system tests. Its primary function is to set up a server and client and to run a particular test on the client with traditional pass/fail mechanisms.
The only problem with this scenario is that Orchestrator is a very possessive system; it wants to have full control of everything, proxies, server and connecting clients, and for what we're doing it proves a little too greedy. Because of the architecture of Orchestrator it's not the ideal candidate for large scale control of clients, but it does allow us to run targeted tests making use of fewer slaved clients.
As for the Automaton Project, well, I have to point out, no one but me calls it that, it's just my pet name for it. The project is a way of bootstrapping the client and having it execute arbitrary code locally, rather than having its movements dictated by a controller elsewhere on the network.
The difference between the two methodologies we have for controlling the clients is that one is a master/slave paradigm, whereas the other is a group of fully autonomous actors; each has its pros and cons and we don't want to blindly follow one particular path only to find that it's actually the cause of our problems rather than the salvation.
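The contrast between the two control styles can be sketched as follows. This is a toy model under stated assumptions: the class names, commands and scripts are all invented for illustration, and neither Orchestrator nor the Automaton Project is claimed to look like this internally.

```python
import random


class SlavedClient:
    """Does nothing until a controller sends it a command."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def execute(self, command):
        self.log.append(command)


class Controller:
    """Master side: decides every action for every client."""

    def __init__(self, clients):
        self.clients = clients

    def broadcast(self, command):
        for client in self.clients:
            client.execute(command)


class AutonomousClient:
    """Carries its own script and picks its next action locally."""

    def __init__(self, name, script, seed):
        self.name = name
        self.script = script
        self.rng = random.Random(seed)  # seeded so a run can be repeated
        self.log = []

    def step(self):
        self.log.append(self.rng.choice(self.script))


# Master/slave: one decision point, identical behaviour everywhere.
slaves = [SlavedClient(f"slave-{i}") for i in range(3)]
Controller(slaves).broadcast("jump")

# Autonomous actors: no controller, each client decides for itself.
actors = [AutonomousClient(f"actor-{i}", ["mine", "trade", "idle"], seed=i)
          for i in range(3)]
for actor in actors:
    for _ in range(2):
        actor.step()

print([s.log for s in slaves])
print([a.log for a in actors])
```

In the first style the controller is a single point of coordination (and a potential bottleneck); in the second there is nothing to bottleneck on, but getting the actors to behave as an organized group takes extra work.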
So when can I get my hands on this?
Never, sorry :) The client is a developer tool only and whilst many people may want a less resource intensive client this isn't the one you're looking for </jedi>.
What now?
Now that we've got these tools, there's work to be done creating tests for them. CCP Veritas has been toying around with them recently and has uncovered some interesting pointers whilst hunting the infamous lag monster, but I'm sure he'll detail that in his own blog.
As for myself, there are lots more APIs that need coding to allow the client to do more. Our primary goal is solving the lag issue, but beyond that I want to create a market interface that will allow us to set up mass trading so we can emulate Jita. There's also turning a herd (flock? what do you call a group of these things? an army?) of asynchronous automatons into an organized fleet, then there's work to be done on slimming them down further, etc., etc., ad infinitum.
There is one thing I want to stress though: we still need your help. Once we've used these clients and other tools to track down problems and submit fixes, we're going to need each and every one of you to lend us your time and effort in checking them on Singularity. EVE players have massive amounts of ingenuity and resilience, and we need both to help us stress test these fixes.
As the CSM has been kind enough to request, and with apologies for a holiday-related delay, here is the presentation I gave at the Para 2010 conference a couple of months ago.
To give you a little background, this presentation was based on my Ph.D. research, which examined the problem of group organization and co-operation between semi-autonomous robots in scenarios without centralized control. What I was trying to get the robots to do was visual analysis on a shared scene, which also turned out to be a harder problem than I initially expected. However, this was just the demonstration. The primary goal of the research was to explore the problems you have to solve in order to create groups of robots that can co-operate on shared tasks when centralized control isn't possible.
Every EVE player who has been in a large fleet fight has experienced this set of constraints, just in the work the individual fleet commanders and coordinators try to do in handing out targets and coordinating actions. Similar management is done by the CEOs and Directors of the EVE corporations on a day-to-day basis. The problems are not, in principle, any different in the game's software layers, except that software message processing speed is considerably faster and more reliable, whilst the human layers are arguably more fault tolerant. Arguably.
What both software and human groups are fighting at a very fundamental level is a nasty set of relationships between organizational structure and available real time communication capacity. These relationships do not scale at all well as groups get bigger, especially as the shared task gets more complicated and requires more communication. Get everybody to spam the chat channel at the same time - easy. Coordinating a complicated operation with different fleet components, loadouts and staggered waves of attack is much harder. These are problems that arise from simple mathematical limits on what can be done or communicated by each node in a distributed system in any singular instance of real time. A very simple example is a human meeting. Four people in a one-hour slot can each speak for 15 minutes, which is usually plenty of time to provide the insights that each person can bring to the topic. Twenty people in the same time period can only talk for 3 minutes each. Meetings just don't scale well as a way of exchanging information.
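The meeting arithmetic is simple enough to write down directly. The sketch below just divides a fixed slot evenly between speakers; the 100-person case is an extra data point added for illustration.

```python
def minutes_per_speaker(slot_minutes, people):
    """Speaking time each person gets if a fixed slot is shared evenly."""
    return slot_minutes / people


# 4 and 20 people are the cases from the text; 100 is an added illustration.
for people in (4, 20, 100):
    share = minutes_per_speaker(60, people)
    print(f"{people} people: {share:.1f} minutes each")
```

The slot is a fixed budget, so each participant's share shrinks as 1/N while the amount each one might usefully say does not.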
Where I personally think this gets particularly interesting is that there turns out to be a topology (arrangement of links between nodes) constraint on the total amount of instantaneous information that any given group of nodes can process. At the extremes you have a strictly hierarchical topology (think dictator), which can communicate the same message (orders) to everybody in the network very quickly; and a distributed topology (think democracy), within which a much larger amount of different messages can be communicated, but actually getting everybody coordinated quickly becomes a distinct problem. For shared tasks whose requirements are at the extremes, it's not a problem, but what do you do if the task requires both quick coordinated action, and sharing a lot of information between nodes?
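One crude way to see the difference between the two extremes is to count links. The sketch below makes two simplifying assumptions that are not from the presentation itself: the strict hierarchy is modelled as a star (one hub connected to everyone else), and the distributed case as a full mesh (everyone connected to everyone).

```python
def star_links(n):
    """Links in a star: one hub connected to each of the other n - 1 nodes."""
    return n - 1


def mesh_links(n):
    """Links in a full mesh: one link per unordered pair of nodes."""
    return n * (n - 1) // 2


for n in (10, 100, 1000):
    print(f"{n} nodes: star={star_links(n)} links, mesh={mesh_links(n)} links")
```

The star reaches everybody from the hub in a single hop, which is why the same order spreads quickly, but every other conversation has to pass through that one hub. The mesh has many more links to carry different messages at once, at the cost of having no single place from which to coordinate.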
The presentation at Para 2010 was an attempt to put this into a very high level framework for designing large scale distributed systems. Personally, I think the scaling limitations themselves are fairly obvious especially once they've been pointed out. A lot of the design problem in practice comes from hierarchical designs being chosen due to ease of initial implementation. They won't scale very well - but that's not necessarily very obvious when you're working in the lab with too few nodes for the scaling constraints to be a problem. Something akin to a state change occurs in these systems as they grow beyond a certain size.
In practice, both hierarchical and distributed approaches end up getting used to solve the same problem, and there is a fairly complex set of tradeoffs that have to be evaluated when designing individual systems to determine which is actually the most appropriate. Games like EVE Online are not really a single distributed system - they are effectively a superposition of multiple distributed systems which simultaneously provide different aspects of game play, and have to be somehow designed to play well with each other.
Our goal is to not only give you the best possible performance across the cluster as a whole, but also for specific activities like fleet fights, measured against the theoretical limits. We have a lot of work to do on that and plans to do it. From the cluster team's perspective, it's work that would have to be done regardless of whether Incarna or DUST 514 was being rolled out or not, as we build the cluster out to the next level of scaling for EVE itself.
From time to time we also discuss scaling issues with game design, since that is the only place where some of these distributed scaling problems can be solved. The longer term view on fleet fight performance lag is that whilst we can and will maximize performance within any single server's area of space, we are going to have to continue to work on game design to somehow limit the number of people that can be simultaneously in that space. Fully granted, given the vast physical immensity that is actual Space, it is a little hard to make a game case that there isn't enough room for a piddling few thousand spaceships.
For the specific issue of fleet fight lag where players are getting stuck jumping into a system (aka long lag), we know this is an issue. To be technical, it's a non-reproducible, stochastic problem that predominantly manifests itself only under high load. The less formal name for this sort of thing is "every developer's worst nightmare."
Debugging distributed real time systems is a little different from dealing with sequential programs. For one thing, especially on a cluster the size of EVE, putting in extra logging can slow down the entire game if we're not careful. Data mining the resulting logs then becomes its own set of issues. You just know you're in trouble when the program you've written to do the analysis itself takes hours to run. There can also be Schrodinger effects, where examining the state of the system changes its behaviour enough that the issue doesn't manifest itself. We also have to be generally careful with what we do, since we can easily take down the entire cluster if we're not. It slows us down in the short term, but it hopefully means we don't introduce any more problems than we have to.
What I told the CSM was that we are going to fix this, it is going to take time, and we apologize for how long it has already taken. All I can really say beyond that is that these really are hard problems to solve. I know that sounds pretty lame, but that is also the unfortunate reality. They pretty much have to be tackled scientifically, which is to say that you form a hypothesis, figure out how to test that theory, or to get more information that will help form better theories, swear a lot when your pet theory gets shot down in flames by the empirical evidence and come in the next day and get to do it all again. It takes time, no matter how smart we try to be about it.
A number of developers have been working on this problem. A number of possible causes have so far been identified, tested, and turned out not to be the direct cause of this particular issue. We thought for a few days that the "Heroic Measures" DB fix that CCP Atlas shepherded through would be it. It certainly fixed some significant problems, but it just didn't fix the problem. I had a pet theory that it was a TCP rate adaptation issue, in conjunction with a system lock affecting multiple clients. No such luck. We narrowed it down a little after I accidentally left myself logged on while stuck overnight, and came in the next morning and found I had successfully jumped into the system. (Unfortunately due to the nature of EVE that really only works in an invulnerable ship.) But that still leaves us with a pretty wide target. We are just starting to reap the benefits of some of the longer term projects that were initiated back in the spring as backup plans: improved cluster monitoring and a much improved testing environment (major props to the Core Infrastructure team for that btw), being the major ones. That and the results from the mass testing efforts are helping to focus our suspicions.
In the meantime, I can guarantee that the team here feels just as frustrated about this problem as developers as you do as players. Probably the most frustrating part is that based on past experience, when we do find this issue (or issues) it will be something that, in retrospect, appears incredibly obvious and silly to have caused so much pain.
So we will continue to beat our heads against this problem until we solve it, and then I suspect we will beat our heads against the nearest wall for quite a while afterwards.
As CCP Zulu mentioned in his recent blog, we want to shed more light on the work that has been done to combat lag and generally improve the performance of EVE. I'll be focusing on the QA end of things, especially about our mass-testing program and the work we've been doing. All told, we've gotten quite a few improvements out to EVE through the pipeline, made several adjustments to the mass-testing program, gathered a ton of logs and data, made some good progress on the fight against "lag," and began working directly with the CSM to build an action plan for the future. All this is pretty good stuff, and we're very excited to finally be able to share it with you all.
What is mass-testing, anyway?
Mass-testing is a program we've been running for a while now that brings developers and players together, on the test server, to hammer away at various new changes and features while involving players directly in how we assess the quality of EVE. At its core, the mass-testing program does three things: allows us to gather performance trend-data for EVE from each new major build; allows us to more rapidly get high-priority changes tested by a large number of EVE players; and allows us to get critical feedback about new features/changes to EVE, before we release them to Tranquility (or "TQ," EVE's production server). It is important to note that mass-testing is not just to investigate "lag," but is really a framework to get vital changes tested and, most importantly, quickly get feedback from EVE players about those changes.
This all comes together to give CCP's teams, producers and managers a much better picture of where we stand and where we need to focus our attention to make things better. This is especially true when it comes to player feedback. It can be said that "perception is reality" and if that's considered to be true, then what EVE players think of the quality of the game is a truly vital indicator about the overall quality of EVE that we cannot ignore. We've taken this idea to heart, in several ways, and will endeavor to use mass-testing to retain an open dialog between CCP's developers and the EVE playerbase, especially through the CSM.
What've we been up to with mass-testing, all this time?
That's a very valid question that we've been seeing often. We've been running test after test for many months without really talking much about what's going on behind the scenes, and we've realized that that isn't the best way to go about things. So, without further ado, I'll give you all a short summary of what we've done via the mass-testing program over the last several tests or so, and what that has provided for EVE.
In mid-February, I published a dev blog announcing the start of the 2nd iteration of the mass-testing program
o The first in this series of tests was on February 20th
o We try to run a mass-test event at least every two weeks
o A brief summary of the results of this test, and all subsequent ones, can be found here.
The primary issues we were testing during this time have been the "long-load" issue, or "jumping lag" issue, overview misbehavior, and module activation issues. All of these can be lumped into the general "lag" category.
In addition to "lag" testing, we also ran tests of the Planetary Interaction feature, and the new EVE Gate website.
As a result of this testing, we've been able to make several fixes and adjust features based on your feedback. Some, but not all, of these changes are:
Server-side changes to how we handle session changes - making them behave better
Improvement to module cycling - changed how the server processes these types of calls
Fixes to the overview to make it update more appropriately
Improvements to the Planetary Interaction UI, making it easier to use
Changes to EVE Gate, making it a bit more user-friendly
Server-side fixes for module cycling, making it less likely to ‘bug out' during high load situations
Internal data-gathering improvements (better data = faster fixes)
There's more, a lot more, but I'll let the awesome programmers talk more about those, in later blogs
This may also raise the question: "why haven't we seen any blogs about this until now?!" That's a much trickier question to answer, but the long and the short of it is: we were, and still are, in the middle of our investigations. We realize that that answer isn't the greatest, but it's the truth. Quite simply, it's difficult to talk about things while you're in the middle of working on them. That being said, we're not happy with the lack of communication either, and we've been taking steps to improve things on our end and will continue to do so until we're happy with it. This blog is just part of that process. We will also be working with the CSM to ensure communication between the folks working on these issues and the players continues to flow, and that we're at least addressing the big questions on players' minds.
What does mass-testing really show us?
Those of you who have been to mass-test events in the past probably have heard, "we're getting lots of good data!" What exactly have we gathered? I'll give you a glimpse into the data analysis we do, with examples from the latest mass-test.
Let's start out by defining what we were looking at: module activation and cycling during fleet fights. Specifically, we had a new server-side flag that, when toggled, changes how the server handles the calls made by the EVE client when you activate and cycle your modules. CCP Veritas will have another blog going into much greater detail on this particular issue and fix, so I won't go into it much here.
So, the test:
We had three fleets total, which first jumped en masse between two pre-defined systems (MHC and F67), in order to generate a baseline for server performance. This was then repeated with one fleet camping the final system, and the other two fleets jumping into them and engaging (Poitot and X-BV98). This was done without the new server-side flag turned on. This gave us a good baseline of how things performed in a high-load situation which involved jumping, combat, and multiple fleets all at once. We then stopped combat, turned on the new server flag, and had everyone duke it out at a planet, with two fleets warping into the third, who was 'defending.' This was then repeated one more time, initially without drones, to further isolate the changes being tested and determine how effective the changes were.
The results:
I'll start by showing you the server performance graphs from that test:
Note: Not all of the log-markers are placed exactly (i.e. where it says "jump 1 start" etc.). Unfortunately, lag during the test caused the log-inserts to be delayed in some cases. Even we get pwned by the ebil lag sometimes :)
Looking at this first chart, we see how the server handled the first jump, and the exit portion of the second jump. You can see the node building up in resource usage as everyone enters the first test-system and then it spikes, very noticeably, when everyone jumps between the two systems. You can immediately see, from the timestamps alone, that the node was hurting and the jump took several minutes to complete for approximately 550 pilots involved. This is our first indicator for where to look for bottlenecks and where we could benefit from improvements. More on that later.
(MHC-R3, F67E-Q, and Poitot)
Now let's look at the really interesting graph, from the node where all the combat took place, X-BV98. Here we see the same "background noise" from everyone getting into position and whatnot; this is always to be expected and should effectively be ignored for the purposes of performance analysis. Anyway, we see again a long spike of maxed out resources on this node, though we must also account for the fact that this test added combat into the mix, not just jumping. The extra load is expected, but it also serves as a baseline for the later combat tests where we will activate the new module-cycling flag on the server.
On the chart, the three big spikes in CPU usage coincide exactly with the three rounds of combat that took place (I said there'd be five, but we ran long and had to cut two rounds). After setting the baseline for "combat lag" for that test, we then proceeded to a planet to duke it out again, this time with the new server flag turned on. You'll also notice a spike in memory usage on the node, roughly at the same time. It was here that we still saw some fairly epic lag, but the important thing to note is that modules appeared to be cycling much more often than before. This doesn't mean there wasn't lag, and we must note that module activation/de-activation is a different type of call than module cycling (i.e. repeating). This was made most evident by the fact that drones were still cycling in and doing damage and modules did far fewer "single shot" cycles when on auto-repeat. We did find, however, that module activation and de-activation was unchanged from the previous test. The net result is a module that still takes a while to realize it's been turned on, but once it does, it should cycle more-or-less properly after that. This is further supported when you look at the tasklet data from that portion of the test, compared to previous runs. But I'll let CCP Veritas get into that in more detail in his upcoming blog.
(X-BV98, combat system)
We then repeated the combat at the planet one more time, this time without drones, at first. In this test we saw far less "lag" due to the lack of drones. This allowed us to confirm our results from the previous round of combat, i.e. that the module flag works, but does not affect module activation or de-activation, only cycling.
In the case of the last mass-test, it was the combination of hard data gathered from the server and player feedback, reports, and descriptions that came together to give us a much better and more accurate picture of what the end-user experience was both before and after the new fix was activated. It is this combination of empirical data and player feedback that makes mass-testing so valuable and enables us to root out the causes of hard-to-find issues, such as lag, much more easily than by doing either one alone. This is where everyone who participates in these tests helps out so much, and really does contribute towards improving the situation much faster than it would be otherwise.
There is a lot more that goes on behind the scenes, with many, many hours being spent on analyzing logs and other data sources, plus time carefully reviewing all player feedback. I hope that this has given just a bit more insight into the process and helped answer people's questions about how player participation on the test server really does have a positive effect for everyone in EVE.
Blockers, grumpiness, and finding solutions
Of course, all of this doesn't go off without a hitch. When trying to set up and coordinate public testing with hundreds of players, we certainly run into our fair share of problems. That is exactly why we chose to handle mass-testing in an evolving manner. We must adapt in order to better achieve our desired result: improving EVE. The following is a rundown of the major problems we've faced, along with how we're planning on addressing those issues as we move forward.
Player Attendance
Mass-testing benefits most from having at least 300 players, preferably more. This is because the factors that contribute to "lag" as a whole are generally those of scalability. As a result, we simply won't trigger these conditions if we don't get enough people to show up. This means that even if we have a new batch of potential "lag fixes" we cannot really test them without getting enough people to cause these scalability flaws to manifest. We've tried moving tests to weekends, and to various times of day, all with limited success. We then concluded that this is an issue of perception. Quite simply, if people don't feel that it's a good use of their time, they won't do it. This seems sensible enough, so we're now exploring various types of feedback that we can give to the participants of these tests to make it more readily apparent to everyone how what they're doing is helping. Really, this boils down to my second point, communication.
That said, communication isn't all of it. We're also looking at various ways we can provide better incentive to participate in testing, as well as to make it more fun. We've been considering things like "Singularity rewards," giving game-time for participating in "x" number of tests, turning it into a "Red vs Blue" style rivalry to make it more fun, and loads more. Our goal with this is to make testing more rewarding, in various ways, while still keeping it about the testing and not just about getting stuff, and certainly not unbalancing gameplay on TQ. This is all still being discussed and debated internally, but I wanted to be clear that we are working on it and give some idea of possible options we've been thinking about.
Communication is key
It doesn't take a rocket scientist to figure out that no one likes being kept in the dark. While our test results have been a bit opaque in the past, we're trying to keep that in the past. Since I'm not one who likes excuses I'll just skip to the solution here. We've started working directly with the CSM members to ensure we're keeping open lines of communication, feedback, and information between CCP and the players of EVE about mass-tests, their impact, and their findings. This will manifest in several ways, some sooner than others. Initially this will be mostly on the forums, where we will work with the CSM to refine the post-test reports that we publish to ensure they provide the right information. From there we are also building internal tools that will allow us to publish IGB-viewable pages with information about testing, detailed instructions, FAQs, guides, etc. We're also looking at trying to put out a short blog after we test big or important changes, all in an effort to keep folks well informed and free from having to guess what's going on.
Staffing and resources
Along with the changes to communication above, we're also ramping up our allocation to mass-testing internally. We're allocating more QA resources to the test events, and building better test scenarios. We're getting more time from programmers to be able to speak directly with the CSM and key others about what exactly the issues are, what testing has found, and what our "next step" is likely to be. All told, we're seeing a lot of firm commitment from both our QA and Software teams not only to keep working on the fixes until they're done, but also to continue supporting these tests and the playerbase as we work through our endless effort to improve EVE's overall performance.
Hardware, hardware, hardware
Many of you have no doubt seen the recent blogs about our upgrades to TQ. While upgrades to TQ are great, they present a different set of challenges to those of us in QA. This comes in the form of hardware, and how much it differs between Singularity (our primary test server, also called SiSi) and Tranquility (our live, production server). On the surface this may not seem like such a big deal: they both work, both let people log in and fly spaceships, but that's about where the similarities stop. I could go into all the specifications about how different the two servers are, but that's a lot of techno-mumbo-jumbo that's best left for 2am discussions down at the Brickstore pub. The long and short of it is: right now, Singularity is far behind TQ in hardware. That causes test results to become more and more questionable, especially when looking at cluster performance.
Have no fear: upgrades are (almost) here! Our awesome Operations team took up the task of upgrading SiSi to bring it more in line with TQ, and thus make it a much better test cluster for TQ. These upgrades will happen in phases, but they've actually already started. Now, this will also mean SiSi won't always be available while we upgrade, but it does mean that the situation is already getting better. As of the next mass-test, on August 19th, we should have the test systems running on spiffy new TQ-style hardware. This means that the server should behave much more like the nodes on TQ do.
But let's not forget the big hurdle for many players: Patching the Singularity client
Believe me when I say, we're right there along with you in not liking how this currently works. Not only can it be a bit of a task, but it serves as a barrier to entry for participating in the mass-tests. With as often as we update Singularity at times, it can be very difficult to keep your test client up to date, much less find the right patches, get it all applied, and then figure out how to connect to the server. Well, we decided enough is enough and we're in the process of completely changing how this works, all for the better.
We're still in the initial phases of acceptance testing, but we've now got a new "Singularity updater" tool that we think will go a long way to making everyone's lives easier here. This is just a simple executable that you'd download, run, and point at your working TQ client. The tool then takes care of copying the client and updating it to the correct Singularity build, automatically. It will also give you a new link to the client, which connects it directly to Singularity. All of this will be with just three easy clicks for the end-user. We're still working out some kinks, and trying to make it run a bit faster, but we hope to have it out in public use very soon.
Eventually, we plan to expand upon the "Singularity updater" with additional functionality, like diagnostics, bug reporting helpers, logserver helpers, etc., but that's all still very much in the realm of "Tanis's wet dreams" so we'll see how much of that we can feasibly implement over the next few months vs. what is just wishful thinking.
Plucking the harp one more time
I wanted to bring up the CSM again, before signing off on this blog. Mass-testing is not just about data, or testing fixes, it is about involving the EVE community in assessing the overall quality of EVE. We feel very strongly that EVE's players must be involved, at some level, in the discussion about the quality of the game. You folks are, after all, the ones who play it day in and day out. You spend your free hours in the universe which we've built, so you should always have a say in how good, or bad, you think that universe is working. Obviously, it isn't feasible for us to have one-on-one discussions with everyone, so we have to find some more workable middle ground. We believe we've found a very appropriate one in the CSM. These are the people whom you all elect to represent you to CCP. These are the people who will carry your issues, your gripes, and your kudos to us.
We have struck up a new commitment with the CSM to build a sustainable and open dialog between CCP and the players about the quality and performance of EVE. This means that the CSM will be able to bring concerns to us more readily and that we can, in return, work together to ensure that we effectively communicate about those issues with all of you. This isn't limited to just the current causes of lag, but covers any issue that may crop up later that makes EVE run poorly, or limits the ability for people to have amazing 1000+ fleet fights once again. We feel this is a very positive change and look forward to working more closely with the CSM towards more effective communication and better mass-tests.
We have a series of blogs coming out to give a better insight into the technical side of CCP and progress in our long battle on lag. We know this issue is important to you, and there's plenty of room for further explanation, so we hope you tune in here in the coming weeks and months to follow our efforts.
We call this the long battle on lag because there's not a single issue that creates lag or removes it. It's a constant, slow battle that has many possible warriors standing against us on the opposing side. You may also remember some of our more focused initiatives, like Need For Speed, which we started in 2006. It has been a priority within CCP since then as we've taken a holistic approach to EVE's growing population and the emergent behavior of its pilots. In EVE's long history, we've made continual progress towards the promised land of minimal lag, sometimes in incremental steps and sometimes punctuated by large leaps such as StacklessIO and EVE64.
Shortly after Dominion, as we've all noticed, things began reverting. This has been a problem, as we know how powerful and unique an experience a 1000+ player fleet fight can truly be. While this is all relative, as we're dealing with an exceptional gaming situation where the universe is large and un-sharded and allows for freedom of movement in the gamespace that is rarely seen elsewhere, CCP's goal is to return you to those epic space battles and then well beyond.
In the past we've tried to give the right mix of information for the general EVE audience, which has led us a bit away from uber-technical blogs. However, the more technical blogs we have put out have been well-received and since we're talking lag, the blogs following this will follow suit.
In the coming series...
Topic: Singularity Mass Testing Report, CCP Tanis
This is a report on some of the findings of the last mass testing on Singularity and includes some of the work and investigations being done as a result. These kinds of scenarios can seldom be created on Tranquility with the amount of logging, probing and debugging in place, so this should show why each test is of great importance to us and why the brave, patient players joining us on Sisi are integral to helping us tackle lag.
Topic: The Long Lag and MMO Scaling, CCP Warlock
This blog is about distributed applications (of which EVE is one of the more complex), designing them, and how CCP scales them to give you not only the best possible performance across the cluster as a whole, but also for specific activities like fleet fights. This comes from Warlock's presentation at Para 2010, which was presented to the CSM 5 during a break in deliberations and requested in this EVE-O post. This should give another angle of understanding of the behind-the-scenes efforts we've been taking.
Topic: "Thin Clients" and Automated Testing from CCP Atropos
This is an explanation of an effort to rework the guts of the EVE client to slim it down as much as possible so we can undertake controlled, large-scale tests in a more automated way. This new testing tool will benefit both quality and scalability moving forward and can help us simulate fleet fights "on demand". We aren't yet at the point where we can imitate player behavior though, so this goes alongside mass-testing as one of our investigative tools.
Topic: Carbon, CCP's Core Technology, CCP Unifex
CCP has a Core Technology Group which creates and maintains all of the core functionality of what runs EVE and will run our future games. These are groups of superspecialists who helped to produce things like StacklessIO and the Trinity2 graphics engine. This blog will help clarify the overall approach we take to making games, and show that more parts of CCP's organizational structure than just the dedicated EVE Dev Team contribute directly to the development of EVE Online.
In conclusion, we've been a bit too heads-down on this anti-lag effort and realize the importance of our progress within the EVE player base. You can expect us to update you more frequently on our anti-lag efforts as well as the work of teams developing our core technology. We thank you for your patience, which has been quite long lasting, and hope you realize we will be throwing every last thing we've got at the terrible, multi-ship fleet known as lag.
Insurance changes, planetary interactions and an improved demographic section are now all included in the latest Quarterly Economic Newsletter for EVE Online. In this issue we also have the standard price level discussion, indices and the market snapshots. In Q2 of 2010 there were some very interesting fluctuations seen in the overall economy, the most significant one being a break in the standard economic cycle around expansions. This shift in the cycle is attributed to insurance and loot changes, but what do you think? Are there some underlying factors in place that have changed the EVE economy? Read the QEN and comment as if there is no tomorrow!
On behalf of the QEN team
Dr. EyjoG
Copyright Notice
EVE Online and the EVE logo are the registered trademarks of CCP hf. All rights are reserved worldwide. All other trademarks are the property of their respective owners. EVE Online, the EVE logo, EVE and all associated logos and designs are the intellectual property of CCP hf. All artwork, screenshots, characters, vehicles, storylines, world facts or other recognizable features of the intellectual property relating to these trademarks are likewise the intellectual property of CCP hf. CCP hf. has granted permission to 'Static Corp / DaOpa's EVE-Online Fansite' to use EVE Online and all associated logos and designs for promotional and information purposes on its website but does not endorse, and is not in any way affiliated with, 'Static Corp / DaOpa's EVE-Online Fansite'. CCP is in no way responsible for the content on or functioning of this website, nor can it be liable for any damage arising from the use of this website.