Optimizations roadmap


  • Dev Team

    > To me (not a specialist in the industry/field), it feels like it should be possible to parallelize and scale the processing of the game world. Is the DB the only bottleneck?

    This is correct. Processing of the game world is well-parallelized; it is the database that is the bottleneck, not processing.

    > Afaik the concept of DB scalability is quite a studied field of computer science. What is fundamentally different about the Screeps world compared to other large-scale services/applications?

    Because other services are quite different. W4rl0ck explained it quite well in the post above. This proposed world sharding change is what can bring Screeps closer to traditional use cases, and thus make traditional solutions more applicable.

    > Afaik Screeps uses MongoDB. Why is sharding not a valid option?

    Because we tried it in all possible ways, and it made performance worse rather than better. Distributing every DB request (tens of thousands of them every second) among a cluster of shards incurs huge network and CPU overhead. This is not a topic we lack competence in: we literally spent months studying this area and all the possible options. Database sharding always comes at a cost. It is better to fix the flaw in our architecture than to keep looking for a solution to a man-made problem.


  • Dev Team

    > If a big difference remains between the tick times of the different shards, those who don’t utilize the fast shard will be at a disadvantage on the slower shards.

    On the other hand, people on the old shard can benefit from the well-developed market and relations with established players. Either way has its pros and cons, and every player is free to choose which shard to play on.


  • Dev Team

    > @Dissi/Artem: have you considered switching to Apache Cassandra?

    Very interesting, have not considered it yet. We’ll make some benchmarks on our dataset and workload profile using it, thanks for the tip.



  •  

    > Very interesting, have not considered it yet. We’ll make some benchmarks on our dataset and workload profile using it, thanks for the tip.

    You may have to model your data differently than what you're used to to really reap the benefits of Cassandra. Look closely at the way partition keys work. I'm willing to answer questions and give advice on data modelling, no strings attached, NDA is okay if needed.


  • Dev Team

    It looks like Cassandra is a better fit than MongoDB for big data set cases, not for higher read/write throughput. See this benchmark for example. In our case the data set is relatively small and fits completely into the RAM of a single machine, but it is the requests-per-second rate that is crucial.



  • > It looks like Cassandra is a better fit than MongoDB for big data set cases, not for higher read/write throughput.

    Well, but read/write throughput in Cassandra scales linearly if you add more machines. So you don't have the problem of "overhead due to replication" killing the performance benefit of scaling horizontally.

    Without going into too much detail, Cassandra achieves this because the partition key lets the driver calculate which node(s) are responsible for a given query, and it only asks those nodes. As a simple example (without duplicating data across nodes for fault tolerance), if I have 5 nodes, each of them will contain one fifth of the data, so each handles only one fifth of the queries. Thus, the load is spread evenly, and adding more nodes helps improve performance.
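
To make the routing idea above concrete, here is a minimal sketch in Python. It is not Cassandra's actual implementation: as far as I know, Cassandra's default partitioner uses a Murmur3 hash and a token ring, while this sketch substitutes MD5 plus a simple modulo to keep things short. The node names and the use of room names as partition keys are assumptions made up for illustration.

```python
import hashlib
from collections import Counter

# Hypothetical 5-node cluster; the names are placeholders.
NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]

def node_for(partition_key, nodes=NODES):
    """Map a partition key to exactly one node using a stable hash.

    Simplification: hash modulo node count instead of a token ring,
    and no replication.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return nodes[token % len(nodes)]

# Assumption: room names serve as partition keys.
rooms = [f"W{x}N{y}" for x in range(100) for y in range(100)]

# Count how many keys (and therefore single-key queries) land on each node.
load = Counter(node_for(room) for room in rooms)
for node in NODES:
    print(node, load[node])
```

Each node ends up owning roughly one fifth of the keys, and a query for a single key only ever touches the one node that owns it. A query that cannot be expressed in terms of the partition key has to ask every node, which is where the overhead comes back.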



  • Fair points.

    If the primary goal of the world shard is to isolate the data structures/processing better, then I would be in favor of forcing the tick rate of all shards to be synchronized.

    Effectively, you split the world into shards for the necessary performance benefit, but it's still a single synchronized game world.


  • YP

    Synchronized ticks would mean all shards would have the tick rate of the slowest shard. I don't think that would make moving attractive. Starting in an empty world with a 5 sec tick.

    I don't think the current plan is to crush the world into pieces ... but to add alternative worlds that are loosely connected. I don't know if / how shrinking the current world would work.


  • Dev Team

    > Without going into too much detail, Cassandra achieves this because the partition key lets the driver calculate which node(s) are responsible for a given query, and it only asks those nodes. As a simple example (without duplicating data across nodes for fault tolerance), if I have 5 nodes, each of them will contain one fifth of the data, so each handles only one fifth of the queries. Thus, the load is spread evenly, and adding more nodes helps improve performance.

    This looks pretty close to how Redis Cluster works. And as far as I understand, Cassandra doesn't support secondary indexes either. Does it provide any benefits over Redis then?

    > Effectively, you split the world into shards for the necessary performance benefit, but it’s still a single synchronized game world.

    The fact that it is not synchronized doesn’t make the world non-single, since they are connected through persistent portals.

    > Synchronized ticks would mean all shards would have the tick rate of the slowest shard. I don’t think that would make moving attractive. Starting in an empty world with a 5 sec tick.

    Exactly, since the current world will become the first (and the slowest) shard. The idea is to provide a better tick rate experience, and making new shards as slow as the current world doesn’t make any sense.


  • Dev Team

    That being said, if we manage to reduce the tick time of the first shard to some acceptable value due to players moving to another shard, then synchronizing tick rates might become an option.



  • I hate to say it, but it sounds more like you don't know how to fix the problem, so you're going to throw solutions at it till one sticks. In other words, the entire situation is fundamentally flawed and perhaps a rewrite is needed and not just a "muck with DB settings" approach. It seems that "sharding" is your rewrite.

    However, I worry that you have not solved the fundamental problem. You hinted at it in one of your last posts: X reads/writes = suck. Sharding may make that less common (because you're doing fewer reads/writes per shard), but the limit still exists. Why not focus on fixing "X"?

    Actions don't need to be written to the database all that frequently. They can be stored in memory and then flushed to the database, maybe once every 1000 ticks. Yes, that means a crash is an automatic rollback, but just don't crash (gotta love that one).

    As to world interactions, what does it matter? Rarely, in all those reads and writes, are players interacting with more than 1-2 other players. Some maybe with 10-20 players. But some kind of status propagation based on visibility could fix that. I mean, what really needs to be passed? Not much changes per tick "normally". When an attack happens (or when two players' creeps are in the same room) more information needs to be passed around, but that's not "often" compared to a creep in its own room.

    In other words I see

    • State info stored in memory most ticks
    • That state shared to people that have visibility to that room and otherwise ignored
    • State is flushed to the database infrequently

    Now this ties your hardware and database load to the number of rooms, not to the number of players or what those players are doing. It also separates the "slow part" from the "fast part". If processing isn't a problem and just database-ing is a problem, then just don't database.
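
The scheme described in this post is usually called a write-behind (or write-back) cache. A minimal Python sketch follows, assuming a single process and using plain dicts as stand-ins for both the in-memory state and the persistent database. The class name, the room name and the `flush_every` parameter are all hypothetical, and this is not how the Screeps server is implemented.

```python
class WriteBehindStore:
    """Keep room state in memory and persist it only every `flush_every` ticks."""

    def __init__(self, flush_every):
        self.flush_every = flush_every
        self.memory = {}      # authoritative state during normal ticks
        self.dirty = set()    # rooms changed since the last flush
        self.database = {}    # stand-in for the persistent database
        self.db_writes = 0    # counts writes that reached the "database"
        self.tick = 0

    def update_room(self, room, state):
        # Fast path: touch memory only.
        self.memory[room] = state
        self.dirty.add(room)

    def end_tick(self):
        self.tick += 1
        if self.tick % self.flush_every == 0:
            self.flush()

    def flush(self):
        # Slow path: write only the rooms that actually changed.
        for room in self.dirty:
            self.database[room] = self.memory[room]
            self.db_writes += 1
        self.dirty.clear()


store = WriteBehindStore(flush_every=100)
for tick in range(1000):
    store.update_room("W1N1", {"energy": tick})
    store.end_tick()

print(store.db_writes)  # 10 database writes instead of 1000
```

The trade-off is the one already named in the post: anything not yet flushed is lost on a crash, so up to `flush_every` ticks roll back. The sketch also ignores readers outside the process, which would still need some way to see the in-memory state.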

     

    ALL THAT SAID

    I am very happy that you guys are doing something. At the current rate, you won't have a game in 6 months. These changes are the first glimpse of light at the end of that tunnel. Tick speeds per shard are going to be a huge issue. They need to be "equal" somehow. The Screeps world needs to be "one world" somehow. But the fact that you're making progress in some direction is a great thing. It needs to happen. It may cause a "burp" while the player base adjusts, but if you don't do "something", you won't have a player base. If you're not growing, you're shrinking. Staying still isn't really an option.


  • Dev Team

    @coteyr We do know how to fix the problem: the world sharding change is a much better solution than the optimizations you propose. If we just optimize things here and there, we can grow the world 10%, 50%, 100% more, and eventually get back to the same issues. With world sharding, we can grow infinitely with the desired performance.

    Regarding the visibility thing - don't forget about observers, globally synced operations like market and terminals, and external APIs fetching game state (we're going to introduce API keys in the future). Getting rid of the persistent database will complicate things to a nearly unmanageable state. You are basically proposing that we develop our own distributed database management system, and I don't think such a task can be managed by a team of 2 developers.



  • "we do know how to fix the problem - the world sharding change is the solution"

     

    That's kind of my point. Tweaking isn't going to do it. A big change is needed.

     

    As for:

     

    "Regarding the visibility thing - don't forget about observers,"

    I would think that it's rare compared to the number of "in room" reads and writes.

    "globally synced operations like market and terminals,"

    That can't be taking up that much DB time. Limit it to one market call per tick. Terminals are already limited in such a way with intents (essentially).

    " external APIs fetching game state (we're going to introduce API keys in the future)."

    Turn it off for now. I know that makes our pretty graphs go away, but that's a fair trade for better tick times/stability. It can be brought back when you flesh out the API key stuff. Lots of games and other companies do that when a secondary part like the "Unofficial" API gets to be too burdensome. If the API really is the problem then ditch it. We all know it is "Unofficial" anyway.

    "Getting rid of the persistent database will complicate things to nearly unmanageable state."

    Maybe, but maybe not. I can't see your MongoDB database engine, but if it's like the open-source one, then you could come up with a way to just not write to the DB. Yes, it would add some work, but I'm not sure that it would add that much work. A pure in-memory MongoDB that flushes to disk every 100 ticks or so could be an easy start.

    "You basically propose to develop our own distributed database management system, I don't think such a task can be managed by a team of 2 developers." 

    IDK, two developers can do quite a bit 🙂  But yes, that is kind of my point. It seems like you're asking "what can we do in budget?" (meaning time and money, not just money), but what I am saying is that you might not be able to solve the problem "in budget". It may be much bigger than that. To me (not knowing anything internal about the team), it's a big enough problem that all future dev should stop on anything that isn't this problem/solution. For example, stability, tick rates, and sharding are "Must Haves", while GUI, new clients, API keys, etc. all become "Nice to Haves".

     

    But as I said in my last post, it doesn't matter. The fact that you guys are going down a path, regardless of the path, is light at the end of a tunnel. It may be a long, twisty, narrow tunnel, but look, there's hope, and I think that's the important takeaway: "It's being worked on".



  • > This looks pretty close to how Redis Cluster works. And as far as I understand, Cassandra doesn't support secondary indexes either. Does it provide any benefits over Redis then?

    Cassandra does have support for secondary indexes, but using them has a drawback: as secondary indexes are local, queries always have to involve all nodes (see https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes for background), whereas for regular tables (and even materialized views) queries are directed to only a part of your cluster nodes. As an alternative, I usually recommend to my developers and customers using materialized views and/or specialized "lookup tables" which redundantly store data with primary keys optimized for the respective queries. This approach yields the best performance for load profiles where data is read more often than it is written (which I guess may be the case for you).
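
As a rough illustration of the "lookup table" idea, here is a Python sketch that uses plain dicts in place of real tables. The schema (game objects keyed by id, with a second copy keyed by room) is a hypothetical example chosen for this thread, not something taken from Screeps or from Cassandra itself.

```python
# Hypothetical schema: every game object is stored twice, under two keys.
objects_by_id = {}    # "primary table", keyed by object id
objects_by_room = {}  # "lookup table", keyed by room name, then object id

def write_object(obj_id, room, data):
    """Write one object to both tables so every read is a single-key lookup."""
    previous = objects_by_id.get(obj_id)
    if previous is not None and previous["room"] != room:
        # The object moved: remove the stale copy from the old room's entry.
        objects_by_room[previous["room"]].pop(obj_id, None)
    record = {"id": obj_id, "room": room, **data}
    objects_by_id[obj_id] = record
    objects_by_room.setdefault(room, {})[obj_id] = record

def objects_in_room(room):
    """Read path served entirely by the lookup table, no scan of the primary table."""
    return list(objects_by_room.get(room, {}).values())

write_object("creep-1", "W1N1", {"type": "creep"})
write_object("tower-1", "W1N1", {"type": "tower"})
write_object("creep-1", "W2N1", {"type": "creep"})  # creep-1 moves rooms

print([o["id"] for o in objects_in_room("W1N1")])
print([o["id"] for o in objects_in_room("W2N1")])
```

Every write now costs two updates (three when an object moves), which is why this layout pays off mainly when reads outnumber writes.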

    I'm not too familiar with Redis Cluster. But from what I read (http://bigdataconsultants.blogspot.de/2013/12/difference-between-cassandra-and-redis.html and https://www.quora.com/Which-is-better-Redis-cluster-or-Cassandra) Redis uses a master slave architecture, whereas Cassandra nodes are all equal. I find the latter approach superior as it allows spreading not only read loads, but also writes.


  • YP

    >> " external APIs fetching game state (we're going to introduce API keys in the future)."

    > Turn it off for now. I know that makes our pretty graphs go away, but that's a fair trade for better tick times/stability. It can be brought back when you flesh out the API key stuff. Lots of games and other companies do that when a secondary part like the "Unofficial" API gets to be to burdensome. If API really is the problem then ditch it. We all know it is "Unofficial" anyway.

    @coteyr: yeah.. it's the "unofficial" api because it is the api the game client uses to communicate with the servers... so it's unofficial to use it for other purposes.

    do you really think it's a good idea to turn the api off that is used by the game client to communicate with the servers? how do you want to play the game?

    I think the game will get really boring if the code only runs inside the server and no one can see the gamestate anymore because they removed all external access.



  • @W4rl0ck: It's very easy to limit API access to clients only if you control the clients. Hell, it's a browser based game, with a great community. Set a cookie, or better yet a header, have the web server look for it and tell the player base to back off.  It doesn't need to be super secure, hell, if we were told to stop using it outside the clients most everyone would comply. 


  • YP

    @coteyr: that's not the point. The client is getting the gamestate out of the database. If you don't update the database there is nothing the client can show. You are suggesting to turn off the API that is needed by the client... you just suggest stuff without thinking or understanding how stuff works.

    It doesn't matter if it's the client that is accessing the data or a script that creates stats. Web requests can get cached and are already rate limited to prevent problems... web access is not the problem the server has.


  • Culture

    > @W4rl0ck: It's very easy to limit API access to clients only if you control the clients. Hell, it's a browser based game, with a great community. Set a cookie, or better yet a header, have the web server look for it and tell the player base to back off.  It doesn't need to be super secure, hell, if we were told to stop using it outside the clients most everyone would comply. 

    Honestly I'd prefer they just asked us to rate limit the requests. It would be easy enough to add that into the client API code we wrote so all projects could start using it.

    That being said, I'm curious how much third-party usage is really causing issues:

    * The League website only updates every 6 hours, and it already has rate limiting built in to prevent it from hammering the API (this is why the update takes more than 20 minutes). All it's doing is reading; there are no writes occurring.

    * The "screeps-stats" project is set to buffer and rate limit as well. Every 10 seconds or so it reads a point in memory, optionally reads a point from segments, and then it writes back a single console command. There's a secondary system which also reads market orders every few minutes, which in turn also does a room lookup (to see who owns the room, similar to the lookup that happens when you browse the map). This happens once per transaction at most though.

    * The "screeps-console" project mostly just reads the websocket. I'm not sure how much load this puts on things, but if it would help I could look into setting a timer to kill the socket when the console has been idle for too long. This was something I was thinking about doing anyway.

    Outside of all that, I can't imagine that the third-party tools are putting a larger load on the system than the game client itself is. That being said, I also can't see what intel tools each alliance has built for themselves, just the ones the culture has, so while I know we're being smart about rate limiting and using external resources (like the league's "rooms.js" file), I'm not sure others are. So if it really is worthwhile to put in some rate limiting on the Node and Python clients (since I'm assuming they're the most used ones), just let us know what you think reasonable numbers would be.
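
For reference, client-side rate limiting of the kind discussed here is often done with a token bucket. The Python sketch below is a generic illustration under assumed numbers (2 requests per second, bursts of 2). It is not the limiter used by any of the projects mentioned, and the class and parameter names are made up.

```python
import time

class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False if it should wait."""
        now = self.clock()
        # Refill according to elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demo with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])

results = [bucket.try_acquire() for _ in range(3)]  # burst of 3 at t=0
fake_now[0] = 0.5                                   # half a second later
results.append(bucket.try_acquire())
print(results)  # [True, True, False, True]
```

A shared API client would call `try_acquire()` before each HTTP request and sleep or queue the request when it returns False. The actual rate and burst size would have to come from whatever numbers the developers consider reasonable.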


  • Culture

    Another note: the API is already rate limited. We can only hit it up to a few times per second before it heavily rate limits. The websocket shouldn't be that much load either; it's using Redis pub/sub in the background, so it shouldn't even be touching the database. Many things, such as Memory, are in Redis already and so don't touch the database. There are still API calls that hit it, but overall our third-party code shouldn't be hammering the database nearly enough to be an issue.


  • Culture

    On another point, I think you are all grossly underestimating the appeal of a second shard. I do not think that tick times should be the way to drive people to the new shard, as I think there are a ton of other reasons why players would be smart to use both shards.

    1. New players will use the less densely populated shard.

    Seriously, the new player experience sucks. If you spawn next to one of the "undiplomatic" alliances you are either going to eventually get absorbed or killed. If you spawn next to one of the alliances that's more friendly you're still going to either have to join up or limit your expansion space, and as a single player you're going to get squashed if you try and attack. Now imagine you've got five friends who want to join, possibly as an alliance. Where are you going to plop down?

    As territory has solidified there's less room for new players. Expanding the edges works, but doesn't scale. I love the idea of scaling "up" (on the Z axis, with rooms stacked on top of each other) as a way to deal with this. Also, if we're totally honest with ourselves here, even if the novice and spawn zones keep showing up on the old shard, new players are more likely to go to the less densely packed area anyway.

    2. Established players will use the shard for defense.

    If I'm a player who has rooms on the "ground floor" (the existing shard), and I know that only a small number of players have code to move between shards, I would be absolutely stupid not to have at least one backup room positioned above my main world rooms. If someone attacks me and wipes out all of my rooms on the main world, but has no ability to go to the second world to take out my other rooms, then I can simply continue to send upgraders and room builders in to retake my lost rooms. It would be absolutely silly for established players not to take advantage of that.

    3. Established players will need to establish presence for strategic purposes.

    Why send troops through the main shard to attack another player when you can shove them up a level, have them walk over to the rooms and bypass creeps and observers in order to mount a surprise attack? To do this you're going to need pathfinding on both worlds, and to do that properly you'll want to have creeps and possibly observers in the other shard.

     

    4. Smaller alliances may just up and move.

    If you open up enough space, some of the smaller alliances, who have shown themselves willing to mass respawn together in the past, might just move upwards to avoid having to fight for space in an already crowded world.

    5. Open territory is valuable enough to drive people.

    Finally, I really feel that simply having the territory available will make it get used. If your options are to engage in a two week fight over rooms, or expand upwards for basically free, then you're going to expand upwards.

     

    For these reasons I find the whole argument about needing a shorter tick time to motivate people premature. There are a ton of reasons why people would use the second shard, and I think tick times are pretty far down the list. The tick time issue is also the largest issue that people seem to be upset about with this shard idea, so if you eliminate that difference I think this whole thing will go much more smoothly.