Server issues


  • Dev Team

    @tigga If there is a failure on one runner (even after an automatic restart, which we have in place BTW, you observe how it works all the time), it would affect, say, 10 players. Should we pause the entire shard if 10 players are failing?



  • @artch said in Server issues:

    @tigga If there is a failure on one runner (even after an automatic restart, which we have in place BTW, you observe how it works all the time), it would affect, say, 10 players. Should we pause the entire shard if 10 players are failing?

    Yes.

    One reason: @Orlet enjoys making his creeps live a very long time. A very long time. He has some that were spawned in 2018. When his code stops executing but the shard continues, those creeps can die. He's lost a few over the outages we've had so far, and last week there were some that survived with barely any ticks remaining. I think it's a bad thing that server outages can kill creeps he's worked hard to keep alive for so long.

    The more obvious example is combat. Losing a room or rooms because your code didn't execute for 2000 ticks and all your creeps timed out would suck. Not sure if it's happened yet.

    Often when this happens the whole shard is running 4x slower, and it's fixed within a few hours, so it's not like a pause would skip many ticks globally.


  • Dev Team

    @tigga So your point is that 1000 healthy players should be paused if 10 players are failing?



  • @artch said in Server issues:

    @tigga So your point is that 1000 healthy players should be paused if 10 players are failing?

    Yes. I don't know if that ratio reflects reality, though. Judging from Slack conversations during the outages, it seems like more than 1% are failing. I think these events are infrequent enough that the server wouldn't spend a lot of time paused, yet frequent enough to be frustrating for the players affected.

    I don't know what the difference is between your automatic procedures and what happens when an admin comes in to fix things. Clearly there is a difference, as most issues are fixed within hours, which seems too slow for an automatic fix, but also too fast and too frequent for anything much more complicated than "admin wakes up because the server broke in the middle of the night and frees disk space / power cycles it". I imagine it's probably a bit more involved than that, but I don't really have a clue what's breaking.



  • Daaaang... Orlet deserves a medal for that, the more you know.

    I kinda agree with Tigga: time should stop if there are execution issues that aren't player related.
    If the only option is to stop the entire shard, so be it; I think that is the only way that is fair and able to preserve such gems.



  • @mrfaul Agree! I think pausing time is acceptable, but having my code stop running while the server is still ticking is pretty annoying. Last time, my creeps carrying commodities just died, and there was nothing I could do.



  • @artch I think I see where you're coming from. Expanding an issue affecting 10 players into one affecting 1000 players is obviously making the problem worse.

    Consider this example. A chat app is having an issue and 1% of messages sent are black-holed. Do they keep it up because it's mostly working, or do they block all messages from being sent? In this case the right choice is to shut it down. If people start accepting the idea that messages sometimes get lost, they may move to another app, even if it's only online 98% of the time. Not losing messages is a fundamental feature of a messaging app.

    Shared/fair ticks may or may not be a fundamental feature of a shard. Of course you guys have to deal with the reality of complex tradeoffs. 100% of players sharing a 60-second tick rate isn't ideal either.

    I expect the real ideal is something like: 99.9% of (well behaved) players share 90% of the ticks over the last hour, or drastic measures are taken, e.g. 10 (60?) second ticks, or, if that isn't enough, shut it down until a human can turn it back on.
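
    Something like the sketch below is what I mean. The stats shape and the intermediate thresholds are made up for illustration and have nothing to do with the actual server code:

    // Sketch: decide what to do based on how many well-behaved players actually got
    // their ticks executed over the last hour. `stats` is assumed to look like
    // [{ playerId, ticksExecuted, ticksTotal, wellBehaved }].
    function decideShardAction(stats) {
    	const wellBehaved = stats.filter(s => s.wellBehaved);
    	const healthy = wellBehaved.filter(s => s.ticksExecuted / Math.max(s.ticksTotal, 1) >= 0.9);
    	const healthyShare = wellBehaved.length ? healthy.length / wellBehaved.length : 1;
    
    	if (healthyShare >= 0.999) return 'run normally';
    	if (healthyShare >= 0.99) return 'slow ticks to 10 seconds';  // drastic measure
    	if (healthyShare >= 0.95) return 'slow ticks to 60 seconds';  // more drastic
    	return 'pause the shard until a human turns it back on';      // last resort
    }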



  • @tigga said in Server issues:

    @Orlet enjoys making his creeps live a very long time. A very long time. He has some that were spawned in 2018.

    Currently only 9 left from 2018. Lost over 3/4ths of my old bois to the tombstone .withdraw bug a few months back (Dec 2019). Was a very sad day.

    As of this moment, memorial wall on Shard 1 stands at 126 entries. They all will be remembered.



  • I have two instances of global right now. As I use generators a lot in my latest codebase, this had bad consequences: creeps were moving back and forth because different global instances had different targets for them. I made a fix for now; it detects if the current globalId is different from lastGlobalId and restarts the threads. But I can imagine how other players' code that relies on global or generators can suffer.

    Maybe you can add global detection on the server side. Simply adding an id to global (maybe name that property _id) and storing lastGlobalId per player in some storage would allow you to monitor how often it happens and restart the runner containers for those players.

    Here is the method I used to detect multiple globals:

    // Each new global bumps Memory.globalIndex once at load time and keeps
    // its own copy in the module-scoped variable globalIndex.
    let globalIndex = Memory.globalIndex = (Memory.globalIndex || 0) + 1;
    Memory.lastGlobalIndex = globalIndex;
    
    function checkGlobal() {
    	// If another global ran more recently, it has written its own index into
    	// Memory.lastGlobalIndex, so a mismatch means two globals are alive.
    	if (globalIndex !== Memory.lastGlobalIndex) {
    		console.log('2 globals!', 'globalIndex:', globalIndex, 'lastGlobalIndex:', Memory.lastGlobalIndex);
    		Memory.lastGlobalIndex = globalIndex;
    		fixGlobals(); // some logic to fix
    	} else {
    		console.log('globalIndex:', globalIndex);
    	}
    }
    
    module.exports.loop = function() {
    	checkGlobal();
    	// ... other code
    };
    

    You can use a similar method, but instead of Memory, save these 2 variables (globalIndex and lastGlobalIndex) in some database per player (maybe per shard as well).
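
    A sketch of what that could look like server-side, with db standing in for whatever per-player storage the server already has (all names below are made up):

    // Sketch only: `db` is a stand-in for some per-player storage, keyed by
    // "<playerId>:<shard>". The runner would call this after each player tick with
    // the id of the global that actually executed.
    const db = new Map();
    
    function recordGlobalRun(playerId, shard, globalId) {
    	const key = `${playerId}:${shard}`;
    	const lastGlobalId = db.get(key);
    	if (lastGlobalId !== undefined && lastGlobalId !== globalId) {
    		// Two different globals served this player on consecutive ticks:
    		// count it and flag that player's runner container for a restart.
    		console.log(`player ${playerId} switched globals: ${lastGlobalId} -> ${globalId}`);
    	}
    	db.set(key, globalId);
    }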



  • I have a guess. If the MMO uses multiple clusters or containers that run players' code, maybe sometimes the balancer picks the wrong instance for a particular player and alternates between them on different ticks. Maybe hash collisions?
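
    For illustration only (made-up code, not the actual balancer): with naive modulo hashing, a player's assigned runner changes whenever the runner count changes, so the same player could land on different instances on different ticks.

    // djb2 hash of the player id, modulo the number of runner instances.
    function pickRunner(playerId, runnerCount) {
    	let h = 5381;
    	for (const c of playerId) {
    		h = ((h * 33) + c.charCodeAt(0)) >>> 0;
    	}
    	return h % runnerCount;
    }
    
    // If one runner drops out of the pool, the same player can be routed
    // to a different instance on the next tick.
    console.log(pickRunner('somePlayer', 8)); // some runner index
    console.log(pickRunner('somePlayer', 7)); // likely a different index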