IVM heap usage & game objects



  • I've been hitting:

    Error: Isolate was disposed during execution due to memory limit

    a lot recently. It seems to happen between ticks: after my code has returned and before the code starts on the next tick. Alternatively, it may be suppressing logs on the tick it occurs; impossible to say.

    So I've been digging into heap size. Specifically, I've been logging:

    stats["total_heap_size"] + stats["externally_allocated_size"];

    in various places. I'm on shard 2 and I believe my limit is ~315MB. A few findings:

    • My heap size at the start of a new global (i.e. at the top of Main.js) is ~135MB. After Memory parsing this goes to ~142MB. I guess this is related to compiled code size plus one instance of all game objects plus terrain data?
    • My heap size will regularly climb ~70MB from the end of one tick to the start of the next. My belief is that this is new game objects being created and the old ones not being cleared. The next tick it drops back down by ~70MB. That means that, before I even use any heap myself, about 210MB is gone.
    • As it happens, the two ~70MB oscillations run between ~210MB and ~280MB, which suggests there may be another copy of all the game objects lying around (~140 at code load, +70 to 210, +70 to 280).
    • The memory limit is always hit when I end my tick on a high: in other words, on the "280MB" ticks. My guess is that yet another copy of the game objects is being created before the last lot has been cleared out, and that's the point where it fails.
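
    For reference, a minimal sketch of the logging behind these numbers (it assumes the usual Game.cpu.getHeapStatistics() call as the source of stats):

    const stats = Game.cpu.getHeapStatistics();
    // total_heap_size + externally_allocated_size is the figure tracked above
    const usedMB = (stats.total_heap_size + stats.externally_allocated_size) / 1024 / 1024;
    const limitMB = stats.heap_size_limit / 1024 / 1024;
    console.log(`heap: ${usedMB.toFixed(1)}MB of ~${limitMB.toFixed(1)}MB limit`);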

    To be clear, I'm not putting game objects on global or anything like that. I do put game objects on other game objects, but that sort of circular reference should be fine for the GC. I am sometimes able to get 1000+ ticks out of the same global, and then observe repeated crashes with memory limits every 10-20 ticks or so... then another period of calm. If I am leaking anything, it's not enough to break things in 1000+ ticks, yet suddenly enough to break them in 10-20 ticks.
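
    For illustration, the kind of cross-referencing I mean is along these lines (the property names here are made up):

    for (const name in Game.creeps) {
        const creep = Game.creeps[name];
        const source = creep.pos.findClosestByRange(FIND_SOURCES);
        if (source) {
            creep.assignedSource = source; // game object stored on another game object
            source.assignedCreep = creep;  // circular reference; V8's GC collects cycles fine
        }
    }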

    It's my belief that at high GCL, when you have a lot of room visibility (I can typically see ~160 rooms on any one tick), the overhead of game objects on the heap severely limits how much of the heap the user can actually use, and can cause crashes that the user has very little control over. My suspicion is that my crash-calm cycle happens because the crashes kill my bucket, so my code stops spawning so much and the game object count drops. In the calm period the spawns climb back up, and I tip over the threshold again.

    Now obviously I can't rule out my own code. You never can. I've tried commenting out large chunks of it, but there's no obvious thing on my end that's causing the issue. I'd be interested in hearing stories from other people with very high general room visibility to see if they suffer from the same issues.

    Can we consider raising the limit for high GCL players? Or maybe just in general? Alternatively, can the game objects be optimized so they don't take up 70MB of heap per tick?





  • I'd also like to add that I cleared out about 5000 flags, and since then I've only had a few OOM errors. It's been much more stable. I'm still pretty close to the threshold though, and given I'm only GCL 23, there's plenty of scope for more room object visibility.



  • Hm, so we should have a base level of heap that then scales with room visibility or global object count?
    That actually sounds pretty reasonable; the question is how demanding the dynamic heap level management would be on the system.

    But since big (slow-computing) Screeps colonies should be running together with other big colonies on one server node, there should be a surplus of heap memory anyway, since those nodes have fewer scripts to handle.
    Still, that's just in theory...

    But since memory usage should not have extreme jumps, the adjustment could be done every 20, 50 or even 100 ticks.
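
    Just to make the idea concrete, a hypothetical sketch of what such server-side scaling might look like (the constants and the setIsolateHeapLimit() hook are invented for illustration, not real API):

    const BASE_HEAP_MB = 256;          // everyone gets at least this much
    const MB_PER_VISIBLE_ROOM = 0.5;   // extra headroom per visible room
    const ADJUST_INTERVAL = 50;        // re-evaluate every 50 ticks, since usage changes slowly

    function adjustHeapLimit(user, tick) {
        if (tick % ADJUST_INTERVAL !== 0) return;
        const targetMB = BASE_HEAP_MB + MB_PER_VISIBLE_ROOM * user.visibleRoomCount;
        setIsolateHeapLimit(user.isolate, targetMB); // hypothetical server-side hook
    }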



  • Deleting my flags turned out to be only a short-term solution for me. My bot used the extra CPU it got from not crashing out so frequently to expand a bit, gaining more room/structure/creep visibility and pushing the game object memory size back up into the danger zone. OOM errors aren't as common as they were pre-flag-deletion, but they're still happening with reasonable frequency.



  • With 1000+ Game.creeps, 4450+ Game.structures and 220+ Game.rooms I'm definitely at the limit (my rooms are only on Shard0). My heap usage [(heapStats.total_heap_size + heapStats.externally_allocated_size) / heapStats.heap_size_limit] runs between 76% and 105%. I only trigger gc() myself when it's >110%, because only then does it lead to that memory error. Normally it's OK for me, but at bad server times I get too much automatic garbage collection time (~150-300 CPU) added to my tick time. With this I get ticks where Game.cpu.getUsed() > 600.
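
    A rough sketch of that check (the 1.10 threshold mirrors the 110% figure above):

    const heap = Game.cpu.getHeapStatistics();
    const usage = (heap.total_heap_size + heap.externally_allocated_size) / heap.heap_size_limit;
    if (usage > 1.10 && typeof gc === "function") {
        gc(); // manual full collection; costs CPU, but avoids the isolate being disposed
    }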

    This was a bad server day for me (20.3., from 7am onwards): (screenshot attached)

    It normalized in the evening...



  • Currently the server is performing at its best; I haven't seen it this strong in a long time. (screenshot attached)

    My script also runs very well with this. I can hardly detect any auto-GC within my tick time, which had lately been dragging my script performance down. So currently I don't have that many problems with my big heap; it's constantly at ~91%. Nice server tweaks. Thx!

    Only RoomHistory does not seem to work that well.



  • Since Monday I've been getting massive "out-of-tick" heap spikes again. Even with an empty main file like:

    "use strict";
    module.exports.loop = function() {
        console.log(JSON.stringify(Game.cpu.getHeapStatistics()));
    };
    

    I'm getting this output:

    [02:21:50][shard0]{"total_heap_size":136630272,"total_heap_size_executable":3670016,"total_physical_size":132199368,"total_available_size":253438720,"used_heap_size":115367752,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":3157248,"does_zap_garbage":0,"externally_allocated_size":10403655}
    [02:21:54][shard0]{"total_heap_size":224628736,"total_heap_size_executable":3670016,"total_physical_size":223458688,"total_available_size":172878960,"used_heap_size":197190144,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":10404068}
    [02:22:00][shard0]{"total_heap_size":319496192,"total_heap_size_executable":3670016,"total_physical_size":317757464,"total_available_size":88642040,"used_heap_size":280341456,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":104044810}
    [02:22:04][shard0]{"total_heap_size":205230080,"total_heap_size_executable":3670016,"total_physical_size":203091360,"total_available_size":180387432,"used_heap_size":194743176,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":104040682}
    * [02:22:08][shard0]{"total_heap_size":304816128,"total_heap_size_executable":3670016,"total_physical_size":303294784,"total_available_size":87892272,"used_heap_size":281215952,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":104044810}
    [02:22:13][shard0]{"total_heap_size":202608640,"total_heap_size_executable":3670016,"total_physical_size":200794288,"total_available_size":180377760,"used_heap_size":194807280,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":104040682}
    [02:22:17][shard0]{"total_heap_size":304291840,"total_heap_size_executable":3670016,"total_physical_size":302934176,"total_available_size":88121808,"used_heap_size":280890640,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":8342992,"does_zap_garbage":0,"externally_allocated_size":104044810}
    [02:22:21][shard0]Error: Isolate was disposed during execution due to memory limit
        at ()
        at runUser (/opt/engine/dist/core/runtime/make.js:162:34)
        at _tickDomainCallback (internal/process/next_tick.js:228:7)
    [02:22:25][shard0]Script execution has been terminated: your isolate disposed unexpectedly, restarting virtual machine
    


  • I had a bit of an empire collapse due to a big war. I've recovered a bit now, and recently added some code that leads to quite a bit more room visibility (checking for power banks pre-observer with static creeps).

    This has led to my heap hitting the memory limit repeatedly once more. It seems that game objects are still a major issue when it comes to heap.

    @stuy486 mentioned in Slack that maybe this is due to JS GC behaviour, and that maybe it could be tweaked. He said:

    Now armed with 15 minutes of reading on Node GC and nothing else... I wonder if, for the Screeps use case, the new-space heap size that's being used isn't large enough? This would cause a bunch of tick-only aged stuff to get moved to old-space, which might be the reason we see the heap size step up like that... This would affect larger players much more since the amount of heap used to generate Game is much higher.

    I do hope this is being looked at. I don't feel we should be penalized with resets/skipped ticks just because we can see a lot of game objects. Room visibility often seems to be the culprit.
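
    Purely to illustrate the knob @stuy486 is describing: in a plain Node process the new-space (semi-space) size can be raised via V8 flags, so that more short-lived per-tick objects die young instead of being promoted to old-space. Whether the Screeps runtime can or does tune this per isolate is not something I know.

    // Illustration only (plain Node, not Screeps): naming the V8 new-space knob.
    // V8 flags generally need to be set at process startup to take reliable effect.
    const v8 = require("v8");
    v8.setFlagsFromString("--max-semi-space-size=32"); // new-space size in MB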



  • I'm not sure about the new-space/old-space thing; as I mentioned in my comment, I've only read a quick post about it. It seems like something that should be investigated if it hasn't been already.

    My current experience is that I'm getting 5-10 "CPU time limit reached" messages an hour, and they are very clearly occurring when GC hits in the middle of my tick. For a period of about a week, ending a couple of days ago, it seemed that the 500 CPU/tick limit was relaxed. During that time I never got errors, but I frequently saw ticks that took 700-800 CPU. Given that I short-circuit my code at 350 CPU and average ~215 CPU, this means that GC is taking 400-500 CPU when it hits my code mid-tick.
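
    For context, the short-circuit I mean is nothing more elaborate than this (350 is just my number; lowPriorityTasks and runLowPriorityTask() stand in for my own code):

    for (const task of lowPriorityTasks) {
        // Stop doing optional work once the soft CPU budget is spent,
        // leaving headroom in case GC lands in the middle of the tick.
        if (Game.cpu.getUsed() > 350) break;
        runLowPriorityTask(task); // hypothetical per-task logic
    }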

    I suspect these heap/GC issues have always been around and are now just more obvious with IVM and the addition of get_heap_statistics(). I also suspect that if we're able to figure out what's going on, the servers as a whole will really benefit. Similar to @demawi, I see my used heap climb ~70MB every tick before my code even runs. It only increases 3-5MB while my code is running.



  • I'm having more problems with this (I don't know if it's also a heap problem?):

    [00:06:38][shard0]Script execution timed out ungracefully
    [00:06:43][shard0]Script execution timed out ungracefully
    [00:06:48][shard0]Script execution timed out ungracefully
    [00:06:53][shard0]Script execution timed out ungracefully
    [00:06:58][shard0]Script execution timed out ungracefully
    [00:07:03][shard0]Script execution timed out ungracefully
    [00:07:07][shard0]Script execution timed out ungracefully
    [00:07:12][shard0]Script execution timed out ungracefully
    

    Currently this happens every tick, without the server ever getting to my code. I already had a phase like this a few days ago, where it lasted over 200 ticks.



  • Since Wednesday it has gone back to the previously described heap problems, and it doesn't stabilize any more. Currently I have to call gc() myself every 3 ticks, but I'm still getting "Isolate was disposed during execution due to memory limit" errors regularly. Frustrating... 😢

    With an empty main file:

    [20:00:57][shard0]{"total_heap_size":140148736,"total_heap_size_executable":3670016,"total_physical_size":131834272,"total_available_size":252732968,"used_heap_size":115534184,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":3157248,"does_zap_garbage":0,"externally_allocated_size":104036698}
    [20:01:01][shard0]{"total_heap_size":234536960,"total_heap_size_executable":3670016,"total_physical_size":225970904,"total_available_size":175003304,"used_heap_size":195191992,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:06][shard0]{"total_heap_size":323624960,"total_heap_size_executable":3670016,"total_physical_size":322073392,"total_available_size":76740336,"used_heap_size":291719864,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104047668}
    [20:01:10][shard0]{"total_heap_size":206999552,"total_heap_size_executable":3670016,"total_physical_size":202776616,"total_available_size":182927992,"used_heap_size":192229592,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:14][shard0]{"total_heap_size":314101760,"total_heap_size_executable":3670016,"total_physical_size":303882056,"total_available_size":87456760,"used_heap_size":280090032,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104045242}
    [20:01:19][shard0]{"total_heap_size":205426688,"total_heap_size_executable":3670016,"total_physical_size":201582624,"total_available_size":182969808,"used_heap_size":192228672,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:23][shard0]{"total_heap_size":308334592,"total_heap_size_executable":3670016,"total_physical_size":301953816,"total_available_size":89031408,"used_heap_size":279926960,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104045242}
    [20:01:27][shard0]{"total_heap_size":205950976,"total_heap_size_executable":3670016,"total_physical_size":201367544,"total_available_size":182943552,"used_heap_size":192254952,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:31][shard0]{"total_heap_size":312528896,"total_heap_size_executable":3670016,"total_physical_size":302439016,"total_available_size":88842216,"used_heap_size":280037088,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104045242}
    [20:01:36][shard0]{"total_heap_size":206475264,"total_heap_size_executable":3670016,"total_physical_size":202223896,"total_available_size":182919864,"used_heap_size":192277880,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:40][shard0]{"total_heap_size":312528896,"total_heap_size_executable":3670016,"total_physical_size":302239240,"total_available_size":86718288,"used_heap_size":280431392,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104045242}
    [20:01:45][shard0]{"total_heap_size":208048128,"total_heap_size_executable":3670016,"total_physical_size":203117480,"total_available_size":182891712,"used_heap_size":192277784,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104040970}
    [20:01:49][shard0]{"total_heap_size":310681600,"total_heap_size_executable":3670016,"total_physical_size":301337400,"total_available_size":85588216,"used_heap_size":280931136,"heap_size_limit":380633088,"malloced_memory":8192,"peak_malloced_memory":7507824,"does_zap_garbage":0,"externally_allocated_size":104045242}
     
    


  • Me again!

    It's better than it was, but I'm still seeing heap limits hit when I get too much room visibility. I'm up to 200 visible rooms before it starts crapping out now.

    I still feel fairly strongly that we shouldn't be seeing errors as users just because we can see too much stuff. It's a really weird game mechanic.


  • AYCE

    I've been moving rooms from shard0 to shard1/2 to reduce heap size, and I have a similar experience to Tigga: my visible room count was close to 200. I've now reduced it to below 180 and no longer have problems with heap size. It is probably a combination of your improvements and my room reduction.

    I'll add a graph of global age for the last month; most resets in the last 10 days have been code uploads.

    (graph attached: global age over the last month)



  • As long as you're tweaking the server and those heap spikes remain a regular problem, it would be nice if you could grant more heap memory (in general, GCL-dependent, or per claimed room; I don't know what's possible), so nobody has to worry about being thrown out of the game after a server tweak.



  • Still having issues at about 200 rooms. It usually starts going at around 202/203.

    Is there an ETA for a fix? I heard rumours that persistent game objects were planned. It'd be nice to get this sooner rather than later, or maybe a +20% heap allocation while we're waiting.


  • Dev Team

    @tigga

    maybe a +20% heap allocation while we're waiting

    That would just shift the visible rooms limit to ~240. Do you think the 20% increase solves the issue for you?



  • @artch I guess there are other things going into the equation (such as creeps/structures).

    I think it would probably be sufficient until we get persistent game objects. My understanding was that the ETA for those was not too far away. I'm GCL 24 right now, so there's only 20% more CPU I can get.


  • Dev Team

    Alright, we've temporarily increased the heap limit by 64 MB. We'll keep an eye on our servers to see if they can actually sustain such memory usage.



  • Just some individual feedback: for me, 64MB is not enough to withstand the ~100MB spikes. For some time before the tweak my memory had magically stabilized; currently it doesn't. I no longer have to call gc() myself to prevent out-of-memory IVM resets, but in-tick auto-GC triggers regularly and consumes a significant amount of CPU. Before, I averaged ~265 CPU/tick; currently it's over 300.


  • Dev Team

    @demawi Please elaborate: what tweak are you referring to?