Optimizations roadmap

artch

Will the creep memory be passed along? I’d love to at least give it some purpose once it’s on the other side.

No it won’t, since the player may have custom Memory structure. But the creep’s name will be transferred, and you will be able to pass custom text messages to other shard as well (somehow).

You could probably make a memory segment public per shard, which you can load in other segments. It could be loosely coupled

This would require to connect runtime workers to all shards’ memory storages which creates to much coupling between all backend components. The idea is that inter-shard connectivity is being done only by one single isolated component running in background, passing creeps and text messages along between shards.

K-C

I think that the concerns here regarding the competitive advantage of young vs old shards are well expressed. I think that it should be possible to address these concerns by scaling the core progression statistics based on each shard's time dilation.

Base resources scaling could be something like y = 2 - 2/x so that at a 2s tick there is no benefit but the bonus is not linear (encouraging DoS).

For example, if with 3 shards exist: A (the origin, 5s tick), B (first new shard, 3s tick), and C (brand new shard, 2s tick) then the benefit to several resources should exist:

* GCL gained on an account might be scaled up per y = 2 - 2/x. Shard A would contribute 60% additional GC per RC, B would contribute 33% additional GC per RC.

* Power processed could have limited scaling perhaps on the function y = 1.5 - 1/x. Shard A would contribute 25% additional Power per Power processed. B would contribute a 16.7% bonus.

* The 5% market surcharge may be reduced to y = 0.05 * (0.5/x +0.75). Shard A would charge 4.38% on market orders. Shard B would charge 4.58%.

CPU expenditure would not be affected.

There's probably some edge cases I haven't considered, but scaling of this kind would slightly blunt the advantage players get from aggressively populating new shards, and provide some incentive to spanning new and old shards.

Atavus

Dear Screeps devs,

I'm very happy to see the team's dedication to handling the performance issues in the game. This is certainly a core issue for the game to continue thriving and should take precedence over the development of new features.

On the subject of performance, I believe the first and most important aspect is CPU stability. From the various individuals I've spoken with, tick times are usually a secondary concern compared to tick variability. The VM proposition for this topic sounds fantastic. Looking very much forward to that.

Regarding the proposal of world sharding. I am honestly quite worried.

1. It fundamentally changes the game without specifically improving gameplay. It adds another layer of complexity which players have to take into account.
3. It adds a significant layer of game engine complexity which will likely require major dev investment and maintenance.
4. Its effects on the community are hard to quantify. It's incredibly risky to play around with such a small, fragile community.
5. It's hard to accept that all technical options have been explored with regards to optimizing performance.

To me (not a specialist in the industry/field), it feels like it should be possible to parallelize and scale the processing of the game world. Is the DB the only bottleneck?

Afaik the concept of DB scalability is quite a studied field of computer science. What is fundamentally different about the Screeps world compared to other large-scale services/applications?

Afaik Screeps uses Mongodb. Why is sharding not a valid option?

Perhaps if additional information would be provided to the community about this problem, we might be able to assist. There's quite a number of individuals in the community with experience in the programming field :).

Kind regards,
Atavus

Timendainum

Aterm,

Have you guys considered talking to CCP? They have a large single sharded game, and while they are using a different database platform they may have some helpful advice on how to scale a large single instance database.

W4rl0ck

eve online is completely different. While everything is one cluster, every star system is a seperated server that runs for itself. It doesn’t matter if jita lags, your star system is fine. The same in wow. It doesn’t matter if ironforge lags, stormwind will be fine. In screeps the next tick can’t begin before the last one is finished with everyone and everything.

In screeps there is stuff that can’t run in parallel. The market system for example... if you sell something in eve online, it is directly taken out of your possession. So all transactions can run in a seperated database. In screeps all teurminal transfers and market transactions have to run after all rooms are processed... and they can’t really be run in parallel. It has to check if the resource is still available in the sellers terminal and if the destination terminal still has space free. That’s why they introduced the terminal cool down and why they don’t like using terminals for messaging.

tedivm

> 1. It fundamentally changes the game without specifically improving gameplay. It adds another layer of complexity which players have to take into account.

I actually think adding an extra dimension to the game could add some interesting gameplay mechanics. It would allow the devs to give out larger CPU rewards for higher GCL (with the limitation that each shard has a maximum of 300). The pathfinding challenge would be *amazing*, as would the ability to send surprise raids through other shards. I think it could be a cool way to expand the game- as long as each shard is roughly "equal" with the others in things like resources and tick rates.

Hell, this could bring my ultimate dream by creating a fully automated shard (no console or manual placement of flags or structures).

> 3. It adds a significant layer of game engine complexity which will likely require major dev investment and maintenance.

I don't see why this would be the case. Each shard is complete separate with the exception of a communication API for specific functions (transfer creeps, credits or "messages"). I don't see this as a huge project- the rewrite of the front end is probably a larger project.

> 4. Its effects on the community are hard to quantify. It's incredibly risky to play around with such a small, fragile community.

Agreed, although I think it can be mitigated by making sure the worlds interact as much as possible.

> 5. It's hard to accept that all technical options have been explored with regards to optimizing performance.

Also agreed.

tynstar

@Dissi/Artem: have you considered switching to Apache Cassandra? We use it at my company, and I can say this is a database that is built for horizontal scalability from the ground up. Originating at Facebook, it is now used at other very large companies. Citing their homepage:

Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

I can't believe that Cassandra would really have any performance problems with handling all Screeps world data in a single database cluster.

Dravo

If a big difference remains between the tick times of the different shards, those who don't utilize the fast shard will be at a disadvantage on the slower shards.

The fast shard could be used to create an attack force to send back to the slow shard. When power creeps are implemented, the people that utilize the fast shard will be at a huge advantage if they decide to attack someone that has not used the fast shard. They will level their power creep on the fast shard.

artch

To me (not a specialist in the industry/field), it feels like it should be possible to parallelize and scale the processing of the game world. Is the DB the only bottleneck?

This is correct. Processing of the game world is well-parallelized, it is the database that is the bottleneck, not processing.

Afaik the concept of DB scalability is quite a studied field of computer science. What is fundamentally different about the Screeps world compared to other large-scale services/applications?

Because other services are quite different. W4rl0ck got it quite well in the post above. This proposed world sharding change is what can make Screeps closer to traditional use cases, and thus more applicable for traditional solutions.

Afaik Screeps uses Mongodb. Why is sharding not a valid option?

Because we tried it in all possible ways, and it made performance worse rather than better. Distributing every DB request (tens of thousands of them every second) among a cluster of shards incurs huge network and CPU overhead. This topic is not something that we are not competent in, we literally spent months learning this area and all possible options. Database sharding always comes at a cost. It is better to fix the flaw in our architecture than to keep trying looking for a solution for a man-made problem.

artch

If a big difference remains between the tick times of the different shards, those who don’t utilize the fast shard will be at a disadvantage on the slower shards.

On the other hand, people on the old shard can benefit from the well developed market and relations to established players. Either way has its pros and cons, and every player is free to choose on which shard he is willing to play.

artch

@Dissi/Artem: have you considered switching to Apache Cassandra?

Very interesting, have not considered it yet. We’ll make some benchmarks on our dataset and workload profile using it, thanks for the tip.

tynstar

> Very interesting, have not considered it yet. We’ll make some benchmarks on our dataset and workload profile using it, thanks for the tip.

You may have to model your data differently than what you're used to to really reap the benefits of Cassandra. Look closely at the way partition keys work. I'm willing to answer questions and give advice on data modelling, no strings attached, NDA is okay if needed.

artch

It looks like Cassandra is a better fit than MongoDB for big data set cases, not for higher read/write throughput. See this benchmark for example. In our case the data set is relatively small and completely fits into RAM of one single machine, but it is the requests per second rate that is crucial.

tynstar

> It looks like Cassandra is a better fit than MongoDB for big data set cases, not for higher read/write throughput.

Well, but read/write throughput in Cassandra scales linearly if you add more machines. So you don't have the problem of "overhead due to replication" killing the performance benefit of scaling horizontally.

Without going too much into details, the way Cassandra achieves this is because the partition key allows calculating which node(s) are responsible for the given query, and the driver will only ask these nodes. As a simple example (without duplicating data across nodes for fault tolerance), if I have 5 nodes, each of them will contain one fifth of the data, so only one fifth of queries will be handled by it. Thus, throughput load is spread evenly, and adding more nodes helps improving performance.

Atavus

Fair points.

If the primary goal of the world shard is to isolate the data structures/processing better, then I would be in favor of forcing the tick rate of all shards to be synchronized.

Effectively, you split the world into shards for the necessary performance benefit, but it's still a single synchronized game world.

W4rl0ck

Synchronized ticks would mean all shards would have the tick rate of the slowest shard. I don't think that would make moving attractive. Starting in a empty world with a 5 sec tick.

I don't think the current plan is to crush the world into pieces ... but to add alternatives worlds that are loosely connected. I don't know if / how shrinking the current world would work.

artch

Without going too much into details, the way Cassandra achieves this is because the partition key allows calculating which node(s) are responsible for the given query, and the driver will only ask these nodes. As a simple example (without duplicating data across nodes for fault tolerance), if I have 5 nodes, each of them will contain one fifth of the data, so only one fifth of queries will be handled by it. Thus, throughput load is spread evenly, and adding more nodes helps improving performance.

This looks pretty much close to how Redis Cluster works. And as far as I understand, Cassandra doesn't have secondary indexes support as well. Does it provide any benefits over Redis then?

Effectively, you split the world into shards for the necessary performance benefit, but it’s still a single synchronized game world.

The fact that it is not synchronized doesn’t make the world non-single, since they are connected through persistent portals.

Synchronized ticks would mean all shards would have the tick rate of the slowest shard. I don’t think that would make moving attractive. Starting in a empty world with a 5 sec tick.

Exactly, since the current world will become the first (and the slowest) shard. The idea is to provide better tick rate experience, and making new shards as slow as the current world doesn’t make any sense.

artch

That being said, if we manage to reduce the tick rate of the first shard to some acceptable value due to players moving to another shard, then synchronizing tick rates might become an option.

coteyr

I hate to say it but it sounds more like you don't know how to fix the problem so your goring to throw solutions at it till it sticks. In other words that the entire situation is fundamentally flawed and perhaps a rewrite is needed and not just a "muck with DB settings" approach. It seems that "sharding" is your rewrite.

However I worry that you have not solved the fundamental problem. You hinted at it in one of your last posts. X reads/writes = suck. Sharding may make that less common (cause your doing less read/writes per shard) but the limit still exists. Why not focus on fixing "X".

Actions don't need to be written to the database all that frequently. They can be stored in memory then flushed to the database. Maybe once every 1000 ticks you can flush to the database. Yes that means a crash is an auto roll back, but just don't crash (gotta love that one).

As to world interactions, what does it matter. Rarely in all those read writes are players interaction with more then 1-2 other players. Some maybe 10-20 players. But some kind of status propagation based on visibility could fix that. I mean what really needs to be passed, not much changes per tick "normally". When an attack happens (or when two players creeps are in the same room) more information needs to be passed around, but that's not "often" compared to a creep in it's own rooms.

In other words I see

State info stored in memory most ticks
That state shared to people that have visibility to that room and otherwise ignored
state is flushed to database infrequently.

Now this bounds your hardware and database and what not to number of rooms and not number of players, or what those players are doing. It also removes the "slow part" from the "fast part". If processing isn't a problem and just database-ing is a problem then just don't database.

ALL THAT SAID

I am very happy that you guys are doing something. At the current rate, you won't have a game in 6 months. These changes are the first glimpse of light at the end of that tunnel. Tick speeds per shard are going to be a huge issue. They need to be "equal" somehow. The screeps world needs to be "one world" somehow. But the fact that your making progress in some direction, is a great thing. It needs to happen. It may cause a "burp" while the player base adjusts, but don't do "something" and you won't have a player base. If your not growing your shrinking. Staying still isn't really an option.

artch

@cotyr We do know how to fix the problem - the world sharding change is the solution much better than the optimizations you propose. If we just optimize things here and there, we can grow the world 10%, 50%, 100% more, and eventually get back to the same issues. With world sharding, we can grow infinitely with the desired performance.

Regarding the visibility thing - don't forget about observers, globally synced operations like market and terminals, and external APIs fetching game state (we're going to introduce API keys in the future). Getting rid of the persistent database will complicate things to nearly unmanageable state. You basically propose to develop our own distributed database management system, I don't think such a task can be managed by a team of 2 developers.