There are a few interesting stories about a recent 3000-player battle that took place in EVE Online. PC Gamer gives a broad overview of what happened, which seems to stem from a mis-click by a player (UI lessons to be learned here, I expect!). They have a very impressive video of part of the battle. Losses on both sides were significant and apparently ran to tens of thousands of dollars' worth of in-game assets.
Penny Arcade then has an interesting article about how the server coped with the load. Two techniques are mentioned: moving solar systems off the server where the battle was taking place, and time dilation. As per previous posts, EVE Online partitions its load by having each server deal with a certain number of solar systems. It appears that under heavy load they can migrate solar systems, but the article notes that they can’t do this without forcing players to disconnect, which makes sense: it would be extremely hard to serialise the state and seamlessly migrate players.
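To make the first technique concrete, here is a minimal sketch of how a system-to-server map might be rebalanced so that the busy solar system gets a server to itself. This is my own illustration, not how EVE Online does it: the function `isolate_hot_system`, the node names and the system names are all hypothetical, and the "least-loaded node" rule is an assumption.

```python
from collections import Counter


def isolate_hot_system(assignment: dict[str, str], hot_system: str) -> dict[str, str]:
    """Return a new system -> node map in which hot_system is alone on its node.

    Hypothetical sketch: every other system on the hot node is moved to
    whichever remaining node currently hosts the fewest systems. Per the
    article, players in the moved systems would be disconnected.
    """
    hot_node = assignment[hot_system]
    other_nodes = sorted({node for node in assignment.values() if node != hot_node})
    if not other_nodes:
        raise ValueError("no other node available to migrate systems to")

    # Count how many systems each of the other nodes already hosts.
    load = Counter(node for node in assignment.values() if node != hot_node)

    result = dict(assignment)
    for system in sorted(assignment):
        if system != hot_system and assignment[system] == hot_node:
            # Pick the least-loaded node, breaking ties by name.
            target = min(other_nodes, key=lambda node: (load[node], node))
            result[system] = target
            load[target] += 1
    return result


# Hypothetical layout: three systems share node1 when the battle starts in sys-A.
before = {"sys-A": "node1", "sys-B": "node1", "sys-C": "node1",
          "sys-D": "node2", "sys-E": "node3"}
after = isolate_hot_system(before, "sys-A")
print(after)
```

The map itself is the easy part; the hard part, as noted above, is moving the live state of each system, which is presumably why players get disconnected.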
Time dilation is an interesting technique that deserves a brief analysis. It works by slowing down game time, so that the simulation progresses for all clients and the server at a given, reduced rate. There is an interesting video on this embedded in the Penny Arcade article, or found on YouTube here. Why does time dilation work? In a server-based network simulation the server is the bottleneck, especially when there are lots of player interactions. The bottleneck could simply be in-bound or out-bound bandwidth, or the time required to compute the interactions between the players. One hypothesis is that time dilation lets the server instruct clients to slow simulation time: clients then send packets/instructions less frequently, so instructions or packets do not pile up in buffers at the network level or the game-logic level inside the server. The server can then process all the instructions in a timely manner, without the risk that some buffer will overflow or an interaction will be missed. The alternative would appear to be just to let the server buffer and process events as quickly as possible, but congestion would get worse and worse, because clients would assume time was progressing as fast as their local clocks and would probably resend events.
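The hypothesis above can be sketched in a few lines. This is only an illustration under my own assumptions, not the actual EVE Online implementation: the names `dilation_factor`, `DilatedClock` and `commands_per_real_second`, the one-second tick and the 10% floor are all made up for the example.

```python
from dataclasses import dataclass

TICK_SECONDS = 1.0   # assumed wall-clock budget for one server tick
MIN_DILATION = 0.1   # assumed floor: game time never runs below 10% speed


def dilation_factor(work_seconds: float,
                    tick_seconds: float = TICK_SECONDS,
                    floor: float = MIN_DILATION) -> float:
    """Rate at which game time should advance relative to real time.

    If the server needs work_seconds of computation for one tick's worth
    of events, slow game time so that the work fits into the tick budget.
    """
    if work_seconds <= tick_seconds:
        return 1.0  # server is keeping up, no dilation needed
    return max(floor, tick_seconds / work_seconds)


@dataclass
class DilatedClock:
    """Game clock that advances more slowly than the wall clock."""
    game_time: float = 0.0

    def advance(self, wall_dt: float, factor: float) -> float:
        self.game_time += wall_dt * factor
        return self.game_time


def commands_per_real_second(commands_per_game_second: float, factor: float) -> float:
    """Client send rate as seen by the server once time is dilated."""
    return commands_per_game_second * factor


# The server needs 4 s to process what should have been a 1 s tick.
factor = dilation_factor(4.0)
clock = DilatedClock()
clock.advance(wall_dt=1.0, factor=factor)
print(factor, clock.game_time, commands_per_real_second(2.0, factor))
```

The point of the sketch is that the server's load and the clients' send rate both scale with the same factor, so the backlog stops growing instead of being buffered.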
Time dilation is thus quite a simple technique that is applicable in many situations. Of course, there are probably many implementation gotchas (e.g. making sure all the clients slow down and speed up appropriately). It does leave me wondering how they deal with “calendar drift”, i.e. how they resynchronise the solar system’s local time to galactic time (assuming that there is such a notion). Time dilation would only appear to be applicable in situations where temporal continuity across regions is not important. For example, if you jumped from a region that was not time-dilated into one that was, then measured against a global clock you would have gone back in time. Perhaps once the battle has died down, time accelerates again.
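To put some numbers on the drift question, here is a small worked sketch. It assumes, purely for illustration, that a dilated system falls behind a global clock by `wall_seconds * (1 - factor)` and that it could catch up by running faster than real time afterwards. Both helper functions are hypothetical, and I have no evidence that EVE Online resynchronises this way.

```python
def accumulated_drift(intervals):
    """Seconds a dilated system falls behind a global clock.

    intervals is an iterable of (wall_seconds, dilation_factor) pairs.
    """
    return sum(wall * (1.0 - factor) for wall, factor in intervals)


def catch_up_seconds(drift: float, catch_up_factor: float) -> float:
    """Wall-clock seconds needed to erase drift when running at
    catch_up_factor times real time (must be greater than 1)."""
    if catch_up_factor <= 1.0:
        raise ValueError("catch_up_factor must be greater than 1")
    return drift / (catch_up_factor - 1.0)


# One hour of battle at 25% speed leaves the system 45 minutes behind.
drift = accumulated_drift([(3600.0, 0.25)])
print(drift)                          # 2700.0
# Running at 150% speed afterwards would take 90 minutes to catch up.
print(catch_up_seconds(drift, 1.5))   # 5400.0
```

Even a modest battle would leave a sizeable gap, which suggests that either the drift is simply ignored or local and galactic time are never compared in the first place.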