I'm also tempted to believe that my timeouts are related to something outside my code. I get about 12 per day, and they seem to occur at fairly random places in my loop. These aren't actually hard resets for the most part, although I have seen a couple of those as well.
I've set up some profiler code and noticed some interesting patterns. My loop gets processed in phases, and I have alarms that trigger if any single phase uses an unusual amount of cpu. Phases are further subdivided as each room gets processed for that phase, and I have alarms there as well. The vast majority of these processes take less than 1 cpu per tick, but when a timeout happens, several of these alarms get triggered together. Processes that usually take less than 1 cpu per tick take more than 100 cpu. Sometimes this results in a timeout, sometimes just VERY heavy cpu use for that tick.
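For anyone who wants to set up something similar, here is a minimal sketch of a per-phase alarm. It is not my actual profiler: `createProfiler`, the labels, and the threshold are hypothetical names and values chosen for illustration. The cpu reading is passed in as a function so the sketch can run outside the game; in Screeps it would be `() => Game.cpu.getUsed()`.

```javascript
// Hypothetical helper: wraps a phase (or a room within a phase) and records
// an alarm when that single call uses more cpu than `threshold`.
// `getUsed` is injected; in Screeps it would be () => Game.cpu.getUsed().
function createProfiler(getUsed, threshold) {
  const alarms = [];
  return {
    alarms,
    run(label, fn) {
      const start = getUsed();
      try {
        return fn();
      } finally {
        // Measure even if fn throws, so a failing phase is still accounted for.
        const used = getUsed() - start;
        if (used > threshold) {
          alarms.push({ label, used });
        }
      }
    },
  };
}

// Example with a fake cpu clock (the numbers are made up for illustration).
let fakeCpu = 0;
const profiler = createProfiler(() => fakeCpu, 5);
profiler.run("spawning:W1N1", () => { fakeCpu += 0.5; }); // typical cost, no alarm
profiler.run("actions:W1N1", () => { fakeCpu += 120; });  // spike, alarm recorded
```

Collecting the alarms in one array per tick makes it easy to see when several unrelated phases spike together, which is the pattern described above.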
(One explanation for several processes experiencing heavy use simultaneously could be logic that depends on some variant of `Game.time % x === 0`, but I don't do that with anything that could take a significant amount of cpu.)
The easiest scenario would have been a regular pattern pointing to one or more parts of my code as the cause of the problem. Since this isn't the case, it seems we can narrow it down to two explanations:
1) Some process that gets executed throughout my loop
2) Some extraneous cause (e.g., garbage collection)
I have very few processes that get executed throughout my loop. One example is my profiler code, the very code that is triggering the alarms, and I'm currently trying to rule that out. Another example is my pathfinding code, which gets run predominantly in my "actions" phase and only very occasionally somewhere else. Since I'm not seeing the majority of my timeouts in the actions phase, I think I can rule that out.
I think one thing that could be very helpful for diagnosing cpu issues is some sort of data about how often they are occurring for others on the server. If you could see a history of your rate of timeouts perhaps together with the average rate on the server, that would be excellent. I don't think the timeouts due to extraneous causes are a big deal if they are affecting everyone to the same degree. It would also be nice not to sink time into investigating the cause when it turns out to be outside the scope of your code.
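Until something like that exists, a rough personal history can be kept in code. The sketch below is only an approximation under stated assumptions: `recordCompletedTick` and the `timeoutStats` field are hypothetical names, `memory` stands in for the game's `Memory` object, and every gap between completed ticks is counted as a missed tick, even though a gap could have causes other than a timeout.

```javascript
// Hypothetical helper: call as the LAST statement of the main loop with
// (Memory, Game.time). It only runs on ticks that finish, so a gap between
// consecutive calls means one or more ticks never reached the end.
function recordCompletedTick(memory, tick) {
  const stats = memory.timeoutStats ||
    (memory.timeoutStats = { completed: 0, missed: 0, lastCompleted: null });
  if (stats.lastCompleted !== null) {
    // Ticks strictly between the last completed tick and this one were missed.
    stats.missed += tick - stats.lastCompleted - 1;
  }
  stats.completed += 1;
  stats.lastCompleted = tick;
  return stats;
}

// Example with a plain object in place of Memory: tick 102 never completes.
const fakeMemory = {};
for (const tick of [100, 101, 103]) {
  recordCompletedTick(fakeMemory, tick);
}
const timeoutStats = fakeMemory.timeoutStats;
const missedRate = timeoutStats.missed / (timeoutStats.completed + timeoutStats.missed);
```

Counting only from completed ticks means the tally does not depend on whether anything written during a timed-out tick is kept. It still says nothing about the server-wide average, which is the part only the developers could provide.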
I'll go ahead and make this suggestion in the appropriate forum.