Server issues



  • Can you guys do some work to make the impact of server issues less bad?

    For an hour today shard 3 was ticking very slowly. Some people's code wasn't ticking at all. Other people had large CPU penalties.

    Now it's ticking, but I've got two globals, causing a whole bunch of excess CPU overhead and other issues from things happening that IVM says should be impossible.

    I understand that servers fail, but having fallbacks and failsafes to detect those failures and stop the game from breaking would be really nice. Detecting that a player's code isn't even reaching their loop should be possible, no? (See the heartbeat sketch at the end of this post.) I'm less sure about the two-globals issue.

    I'm also not monitoring 24/7, and while I can file a support ticket when it goes wrong, we've had the same issues for a long time, so that doesn't really feel worthwhile. I feel you should be able to detect the most common issues without having to be poked. These issues are pretty common; all the shards have had them in the past week. Maybe I could autodetect the issues and autofile support tickets?

    Lost a room today due to the servers, and they've been broken for nearly two hours (first slow ticks, now two globals). Not entirely pleased.
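
    A user-side approximation of that kind of detection (a server-side check would look different) is a heartbeat: record the last tick the loop actually ran, and flag any gap the next time it does run. This is a minimal sketch; the names (lastLoopTick, MAX_GAP) are illustrative rather than anything official, and it can only report a gap after the fact.

    const MAX_GAP = 5; // ticks; assumed threshold, tune to taste

    module.exports.loop = function() {
        const last = Memory.lastLoopTick;
        if (last !== undefined && Game.time - last > MAX_GAP) {
            // The loop did not run for (Game.time - last) ticks.
            // Game.notify sends an in-game mail; the second argument groups repeats.
            Game.notify(`Main loop skipped ${Game.time - last} ticks (last ran at tick ${last})`, 60);
        }
        Memory.lastLoopTick = Game.time;
        // ... rest of the loop
    };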


  • Dev Team

    We're working hard on catching infrastructure issues which cause the described situation.


  • Dev Team

    @tigga said in Server issues:

    I understand that servers fail, but having fallbacks and failsafes to detect those failures and stop the game from breaking would be really nice.

    Could you please elaborate a bit more, what is your suggestion here? Imagine that a failure is detected while running some player's code. What measures should be taken by the system then in order to "stop the game from breaking" for this player?



  • @artch said in Server issues:

    @tigga said in Server issues:

    I understand that servers fail, but having fallbacks and failsafes to detect those failures and stop the game from breaking would be really nice.

    Could you please elaborate a bit more, what is your suggestion here? Imagine that a failure is detected while running some player's code. What measures should be taken by the system then in order to "stop the game from breaking" for this player?

    There have been recent issues where some players' main loops just aren't run at all, while other players continue as normal. If that happens (and can be detected), the shard should just be paused until a dev can check it out. Alternatively, an automatic restart, with a pause if the problem persists after that. I don't know if you can restart individual runners; maybe that'd be sufficient.

    The two-global bug is less severe. I'm not quite sure what the "official" line is on persistent global state. There have been people reporting that it causes them to spawn twice as many creeps as normal. It also drains the CPU of anyone using the persistent memory trick. It's less game-breaking since it doesn't leave you completely open to attack, but it's pretty annoying.
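
    For context, the "persistent memory trick" presumably refers to the common community hack of caching the parsed Memory object on the heap to skip the implicit JSON.parse each tick. A rough sketch of one form of it is below; RawMemory._parsed is an undocumented engine internal, so treat this as a community hack rather than an official API. With two alternating globals, each instance keeps its own cache, so the cache is always stale when that global runs: either the parse happens again every tick (the CPU drain mentioned above) or stale data overwrites fresh writes.

    let cachedMemory; // lives on the heap, one copy per global instance
    let cachedTick;

    module.exports.loop = function() {
        if (cachedMemory && cachedTick === Game.time - 1) {
            // Reuse last tick's already-parsed object instead of letting the
            // Memory getter JSON.parse the raw string again.
            delete global.Memory;
            global.Memory = cachedMemory;
            RawMemory._parsed = cachedMemory; // undocumented: what gets serialized at end of tick
        } else {
            cachedMemory = Memory; // first tick in this global (or after a gap): pay for the parse once
        }
        cachedTick = Game.time;
        // ... rest of the loop
    };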


  • Dev Team

    @tigga If there is a failure on one runner (even after an automatic restart, which we have in place BTW, you observe how it works all the time), it would affect, say, 10 players. Should we pause the entire shard if 10 players are failing?



  • @artch said in Server issues:

    @tigga If there is a failure on one runner (even after an automatic restart, which we have in place BTW, you observe how it works all the time), it would affect, say, 10 players. Should we pause the entire shard if 10 players are failing?

    Yes.

    One reason: @Orlet enjoys making his creeps live a very long time. A very long time. He has some that were spawned in 2018. When his code stops executing but the shard continues, those creeps can die. He's lost a few over the outages we've had so far, and last week there were some that survived with barely any ticks remaining. I think it's a bad thing that server outages can kill creeps he's worked hard to keep alive for so long.

    The more obvious example is combat. Losing a room or rooms because your code didn't execute for 2000 ticks and all your creeps timed out would suck. Not sure if it's happened yet.

    Often when this happens the whole shard is running 4x slower, and it's fixed within a few hours, so it's not like many ticks globally are skipped with a pause.


  • Dev Team

    @tigga So your point is that 1000 healthy players should be paused if 10 players are failing?



  • @artch said in Server issues:

    @tigga So your point is that 1000 healthy players should be paused if 10 players are failing?

    Yes. I don't know if that ratio reflects reality, though. Judging from Slack conversations during the outages, it seems like more than 1% are failing. I think these events are infrequent enough that the server wouldn't spend a lot of time paused, yet frequent enough to be frustrating for players.

    I don't know what the difference is between your automatic procedures and what happens when an admin comes in to fix things. Clearly there is a difference, as most issues are fixed within hours, which seems too slow for an automatic fix, but also too fast and too frequent for anything much more complicated than "admin wakes up because the server broke in the middle of the night and frees disk space/power cycles it". I imagine it's probably a bit more involved than that, but I don't really have a clue what's breaking.



  • Daaaang... Orlet deserves a medal for that, the more you know.

    I kinda agree with Tigga: time should stop if there are execution issues that aren't player-related.
    If the only option is to stop the entire shard, so be it. I think that's the only way that's fair and able to preserve such gems.



  • @mrfaul Agree! I think pausing time is acceptable, but having my code stop running while the server is still ticking is pretty annoying. Last time, my creeps carrying commodities just died, and there was nothing I could do.



  • @artch I think I see where you're coming from. Expanding an issue from 10 players to 1000 players is obviously making the problem worse.

    Consider this example. A chat app is having an issue and 1% of sent messages are black-holed. Do they keep it up because it's mostly working, or do they block all messages from being sent? In this case the right choice is to shut it down. If people start accepting the idea that messages sometimes get lost, they may move to another app, even if it's only online 98% of the time. Not losing messages is a fundamental feature of a messaging app.

    Shared/fair ticks may or may not be a fundamental feature of a shard. Of course you guys have to deal with the reality of complex tradeoffs. 100% of players sharing a 60-second tick rate isn't ideal either.

    I expect the real ideal is something like: 99.9% of (well-behaved) players share 90% of the ticks over the last hour, or drastic measures are taken, e.g. 10-second (60?) ticks, or, if that isn't enough, shut it down until a human can turn it back on.
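
    A sketch of what that rule could look like as a server-side watchdog. Everything here is hypothetical: the thresholds come from the paragraph above, and the helpers (shardHealthy, setMinTickDuration, pause, the stats object) are made-up names, not real Screeps server internals.

    const WINDOW_TICKS = 1000;          // roughly "the last hour" at normal tick rates
    const HEALTHY_PLAYER_RATIO = 0.999; // 99.9% of well-behaved players...
    const REQUIRED_TICK_SHARE = 0.9;    // ...should have run in 90% of recent ticks

    function shardHealthy(stats) {
        // stats.ranTicks[playerId] = number of ticks in the window where this player's
        // loop actually executed (players whose own code throws are excluded upstream).
        const players = Object.keys(stats.ranTicks);
        if (players.length === 0) {
            return true;
        }
        const healthy = players.filter(
            id => stats.ranTicks[id] / WINDOW_TICKS >= REQUIRED_TICK_SHARE
        );
        return healthy.length / players.length >= HEALTHY_PLAYER_RATIO;
    }

    function watchdogStep(stats, shard) {
        if (shardHealthy(stats)) {
            shard.setMinTickDuration(0);          // healthy: run at normal speed
        } else if (!shard.isSlowedDown()) {
            shard.setMinTickDuration(10 * 1000);  // first drastic measure: 10-second ticks
        } else {
            shard.pause();                        // still unhealthy after slowing down: stop until a human looks
        }
    }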



  • @tigga said in Server issues:

    @Orlet enjoys making his creeps live a very long time. A very long time. He has some that were spawned in 2018.

    Currently only 9 left from 2018. Lost over 3/4ths of my old bois to the tombstone .withdraw bug a few months back (Dec 2019). Was a very sad day.

    As of this moment, memorial wall on Shard 1 stands at 126 entries. They all will be remembered.



  • I have two instances of global right now. As I use generators a lot in my latest codebase, this had bad consequences: creeps were moving back and forth because different global instances had different targets for them. I made a fix for now: it detects if the current globalId is different from lastGlobalId and restarts the threads. But I can imagine how other players' code that relies on global or generators could suffer.

    Maybe you could add global detection on the server side. Simply adding an id to global (maybe name that property _id) and storing lastGlobalId per player in some storage would allow you to monitor how often it happens and restart the runner containers for those players.

    Here is the method I used to detect multiple globals:

    // Runs once per global reset: bump the shared counter and remember which
    // global instance wrote it last.
    let globalIndex = Memory.globalIndex = (Memory.globalIndex || 0) + 1;
    Memory.lastGlobalIndex = globalIndex;

    function checkGlobal() {
        if (globalIndex !== Memory.lastGlobalIndex) {
            // Another global instance ran since this one last did, so at least
            // two globals are alternating.
            console.log('2 globals!', 'globalIndex:', globalIndex, 'lastGlobalIndex:', Memory.lastGlobalIndex);
            Memory.lastGlobalIndex = globalIndex;
            fixGlobals(); // some logic to fix (e.g. restart the generator-based threads)
        } else {
            console.log('globalIndex:', globalIndex);
        }
    }

    module.exports.loop = function() {
        checkGlobal();
        // ... other code
    };

    You could use a similar method, but instead of Memory, save these two variables (globalIndex and lastGlobalIndex) in some database per player (maybe per shard as well).
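
    A rough sketch of what that server-side version might look like. All of it is hypothetical: db, restartRunnerFor, and the sandbox object are placeholder names, not actual Screeps server APIs.

    let nextGlobalId = 0;

    function onGlobalCreated(sandbox) {
        sandbox.global._id = ++nextGlobalId; // stamp every freshly created global with a unique id
    }

    async function checkPlayerGlobal(playerId, sandbox) {
        const currentId = sandbox.global._id;
        const lastId = await db.get(`lastGlobalId:${playerId}`);
        await db.set(`lastGlobalId:${playerId}`, currentId);
        if (lastId === undefined || lastId === currentId) {
            return; // same global as last tick, nothing to do
        }
        // The id changed. Legitimate global resets land here too, so only act
        // when it happens suspiciously often within a short window.
        const switches = await db.increment(`globalSwitches:${playerId}`, { expire: 100 });
        if (switches > 3) {
            await restartRunnerFor(playerId); // recycle this player's runner container
        }
    }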



  • I have a guess. If MMO uses multiple clusters or containers that run players' code, maybe sometimes the balancer picks the wrong instance for a particular player and alternates between them on different ticks. Maybe hash collisions?