Hard Resets - Outside Causes


  • Culture

    I've gotten 7 hard resets in just the last 12 hours. They keep occurring on the half hour (either at :00 or :30).

    I don't do timers like that, so there's no reason my code should be triggering this on its own. It's frustrating to keep getting penalized for something I'm not doing.

    It only just started like this today. As you can see, there are plenty of other emails at other times, but most of these come in specifically at :00 or :30.

     

    Emails: http://imgur.com/a/wYxnc



  • I get these too. I was getting them last night with an empty main loop. Something is certainly wrong.


  • Culture

    Same, I'm not getting them as often as others, but I do get random ones that should never happen. Stats indicate nothing abnormal either.



  • I feel like the half hour is just the default send interval for errors, but still. I've also been getting these intermittently, and as far as I can tell I shouldn't be. I'll admit, I don't have any mechanism to stop my script once I reach a certain amount of used cpu, but I'm also pretty sure I have nothing that should ever cause a hard reset. I suppose I wouldn't be surprised at the very rare soft reset (although from my stat tracking I feel like this hasn't happened in ages either), but again, I see no cause in my code for hard resets. I know others have had intermittent issues with this as well.
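
    For anyone else missing a mechanism to stop work once a certain amount of CPU has been used, here is a minimal sketch of one way it could look. The helper names (`hasCpuBudget`, `runWithBudget`), the task shape, and the `safetyMargin` parameter are made up for illustration; the only Screeps pieces assumed are `Game.cpu.getUsed()` and `Game.cpu.tickLimit`, passed in as `cpu` so the snippet also runs outside the game.

    ```javascript
    // Hypothetical helper: true while there is still room to start more work this tick.
    // `cpu` is expected to look like Screeps' Game.cpu (getUsed(), tickLimit).
    function hasCpuBudget(cpu, safetyMargin) {
        return cpu.getUsed() < cpu.tickLimit - safetyMargin;
    }

    // Runs tasks in order and skips the rest once the budget is used up.
    // Each task is assumed to be { name: string, run: function }.
    // Returns the names of the tasks that were skipped.
    function runWithBudget(cpu, tasks, safetyMargin) {
        const skipped = [];
        for (const task of tasks) {
            if (!hasCpuBudget(cpu, safetyMargin)) {
                skipped.push(task.name);
                continue;
            }
            task.run();
        }
        return skipped;
    }
    ```

    In game this would be called as something like `runWithBudget(Game.cpu, tasks, 50)`. A guard like this only checks between tasks, so it cannot help if a single task blows past the limit on its own, and it would not prevent resets that come from outside the script.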


  • Culture

    I'm having the same issue, so I'm glad this has been brought up again.

    I wonder if it's related to what other users are doing on the same server. I've noticed a pattern: one of the global reset storms will occur, my code will stabilize on a new node, and performance will be awful until the next global reset storm, after which performance goes back to normal. It's during these periods that this hard reset is more likely to happen, although even then it displays the weird timing hernanduer has noticed, where the resets occur on the half-hour mark.


  • Culture

    I also feel like I am probably getting hard resets more often than I would expect, especially since I don't have a common history of soft resets.



  • I'm also tempted to believe that my timeouts are related to something outside my code. I get about 12 per day, and they seem to occur at fairly random places in my loop. These aren't actually hard resets for the most part, although I have seen a couple of those as well.

    I've set up some profiler code and noticed some interesting patterns. My loop gets processed in phases, and I have alarms that trigger if any single phase uses an unusual amount of CPU. Phases are further subdivided as each room gets processed for that phase, and I have alarms there as well. The vast majority of these processes take less than 1 CPU per tick, but when a timeout happens, several of these alarms get triggered together. Processes that usually take less than 1 CPU per tick take more than 100 CPU. Sometimes this results in a timeout, sometimes just VERY heavy CPU use for that tick.
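
    To make the phase alarms concrete, here is a rough sketch of the idea, not the actual profiler described above. `createPhaseProfiler`, `measure`, and `alarmThreshold` are hypothetical names; the only Screeps call assumed is `Game.cpu.getUsed()`, reached through the `cpu` argument so the snippet can run outside the game too.

    ```javascript
    // Hypothetical phase profiler: wraps a piece of work, measures the CPU it used,
    // and records an alarm when the cost exceeds alarmThreshold.
    // `cpu` is expected to look like Screeps' Game.cpu (only getUsed() is used).
    function createPhaseProfiler(cpu, alarmThreshold) {
        const alarms = [];
        return {
            // label identifies the phase (and optionally the room), fn does the work.
            measure(label, fn) {
                const start = cpu.getUsed();
                const result = fn();
                const cost = cpu.getUsed() - start;
                if (cost > alarmThreshold) {
                    alarms.push({ label, cost });
                }
                return result;
            },
            alarms,
        };
    }
    ```

    Usage in game might look like `profiler.measure('actions:' + room.name, () => runActions(room))`, where `runActions` stands in for whatever the phase does. One limitation of this sketch is that if the tick is cut off in the middle of `fn`, `measure` never returns and no alarm is recorded for that phase.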

    (One explanation for several processes experiencing heavy use simultaneously could be logic that depends on some variant of `Game.time % x === 0`, but I don't do that with anything that could take a significant amount of CPU.)
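
    As an illustration of why that kind of logic can cause simultaneous spikes: jobs gated on `Game.time % x === 0` with different intervals all fire together on ticks that are common multiples of those intervals. The sketch below uses made-up names (`isDue`, `dueJobs`, and a per-job `offset`) to show both the clustering and one common way of spreading the jobs out.

    ```javascript
    // Hypothetical scheduler check: a job is due when (tick + offset) is a multiple
    // of its interval. With offset 0 this is the plain `Game.time % x === 0` pattern.
    function isDue(tick, interval, offset = 0) {
        return (tick + offset) % interval === 0;
    }

    // Returns the names of all jobs due on the given tick.
    // Each job is assumed to be { name, interval, offset? }.
    function dueJobs(tick, jobs) {
        return jobs
            .filter((job) => isDue(tick, job.interval, job.offset))
            .map((job) => job.name);
    }
    ```

    With intervals of 10, 20 and 50 and no offsets, all three jobs land on tick 100 together. Giving each job a different offset (for example 0, 3 and 7) keeps them from coinciding on that tick. In game the `tick` argument would be `Game.time`.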

    The easiest scenario would have been a regular pattern that suggested one or more parts of my code that are causing the problem. Since this isn't the case, it seems we can narrow it down to two explanations:

    1) Some process that gets executed throughout my loop

    2) Some extraneous cause (e.g., garbage collection)

    I have very few processes that get executed throughout my loop. One example is my profiler code, the very code that is triggering the alarms, and I'm currently trying to rule that out. Another example is my pathfinding code, which gets run predominantly in my "actions" phase and only very occasionally somewhere else. Since I'm not seeing the majority of my timeouts in the actions phase, I think I can rule that out.

    I think one thing that could be very helpful for diagnosing cpu issues is some sort of data about how often they are occurring for others on the server. If you could see a history of your rate of timeouts perhaps together with the average rate on the server, that would be excellent. I don't think the timeouts due to extraneous causes are a big deal if they are affecting everyone to the same degree. It would also be nice not to sink time into investigating the cause when it turns out to be outside the scope of your code.

    I'll go ahead and make this suggestion in the appropriate forum.