memory leak/OOM kill

eOn code for long time scale dynamics

inu
Posts: 2
Joined: Wed Jun 05, 2013 7:04 pm

memory leak/OOM kill

Post by inu »

Hi, after noticing that my box was running cool and the system was unusually slow (everything had been moved to swap), I found out that the kernel had killed all the running eOn clients (I have 4GB mem, 2GB swap).

kern.log:

Jun  5 12:09:44 cygnus kernel: [2974759.069810] Killed process 6271 (eonclient_5.00_) total-vm:1063100kB, anon-rss:842092kB, file-rss:208kB
Jun  5 12:12:08 cygnus kernel: [2974898.801129] Killed process 6272 (eonclient_5.00_) total-vm:1063100kB, anon-rss:911912kB, file-rss:0kB
Jun  5 12:13:31 cygnus kernel: [2974986.922215] Killed process 6296 (eonclient_5.00_) total-vm:1063100kB, anon-rss:725892kB, file-rss:0kB
Jun  5 12:13:31 cygnus kernel: [2974986.990299] Killed process 6299 (eonclient_5.00_) total-vm:1063100kB, anon-rss:725916kB, file-rss:0kB
Jun  5 12:13:31 cygnus kernel: [2974987.125096] Killed process 6297 (eonclient_5.00_) total-vm:1063100kB, anon-rss:720416kB, file-rss:8kB
Jun  5 12:16:14 cygnus kernel: [2975139.441500] Killed process 6315 (eonclient_5.00_) total-vm:774156kB, anon-rss:514640kB, file-rss:0kB
1 gig seems awfully huge. Normally, I think they run at 100-300 megs. My BOINC version is "6.12.40 x86_64-pc-linux-gnu", with eon as "eonclient_5.00_x86_64-pc-linux-gnu". I use the stock/downloaded binary for eon. I have been using these two flawlessly together for some time now (since 5.0's release), so I guess something's wrong with the workunits.
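
For anyone who wants to check their own kern.log for the same symptom, here is a minimal sketch that pulls the sizes out of "Killed process" lines in the format shown above. The function name `parse_oom_kills` and the regular expression are illustrative assumptions; other kernel versions may word these lines differently.

```python
import re

# Matches "Killed process" lines in the format quoted in this post.
# Other kernel versions may use a different layout (assumption).
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]*)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kills(lines):
    """Return one dict per OOM-kill line, with sizes converted to MiB."""
    kills = []
    for line in lines:
        m = KILL_RE.search(line)
        if m is None:
            continue  # not an OOM-kill line
        kills.append({
            "pid": int(m.group("pid")),
            "name": m.group("name"),
            "total_vm_mib": int(m.group("total_vm")) / 1024,
            "anon_rss_mib": int(m.group("anon_rss")) / 1024,
        })
    return kills


# Two of the lines from the log above, used as sample input.
sample = [
    "Jun  5 12:09:44 cygnus kernel: [2974759.069810] Killed process 6271 "
    "(eonclient_5.00_) total-vm:1063100kB, anon-rss:842092kB, file-rss:208kB",
    "Jun  5 12:16:14 cygnus kernel: [2975139.441500] Killed process 6315 "
    "(eonclient_5.00_) total-vm:774156kB, anon-rss:514640kB, file-rss:0kB",
]

for k in parse_oom_kills(sample):
    print(f"pid {k['pid']} ({k['name']}): "
          f"{k['total_vm_mib']:.0f} MiB virtual, {k['anon_rss_mib']:.0f} MiB resident")
```

To run it on a real log, pass an open file object instead of `sample`, since the function only iterates over lines.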

Task pages:
http://eon.ices.utexas.edu/eon2/result. ... =229032325
http://eon.ices.utexas.edu/eon2/result. ... =229032310
http://eon.ices.utexas.edu/eon2/result. ... =229031981
http://eon.ices.utexas.edu/eon2/result. ... =229031508

This also happened for four other tasks.
In case they get removed from the database, the "error message" on the workunit page says "Too many total results".

EDIT:
After some rough investigation: the client sometimes climbs to 1-1.3GiB over a few minutes, then drops back down to a few KB. When it doesn't, everything works as expected. Limiting the amount of memory BOINC is allowed to use prevents the doomsday sluggishness and OOM kills, but the tasks still error out with code -177 (0xffffffffffffff4f):

<core_client_version>6.12.40</core_client_version>
<![CDATA[
<message>
Maximum memory exceeded
</message>
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (2 frames):
[0x83ae89e]
[0xf77bb400]

Exiting...
SIGSEGV: segmentation violation
Stack trace (2 frames):
[0x83ae89e]
[0xf772e400]

Exiting...
SIGSEGV: segmentation violation
Stack trace (2 frames):
[0x83ae89e]
[0xf77c8400]

Exiting...
SIGSEGV: segmentation violation
Stack trace (2 frames):
[0x83ae89e]
[0xf77d8400]

Exiting...

</stderr_txt>
]]>
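
The two forms of the exit code mentioned above, -177 and 0xffffffffffffff4f, are the same value: the hex form is the 64-bit two's-complement representation of -177. A small sketch of the conversion, where the helper names `to_unsigned` and `to_signed` are only illustrative:

```python
def to_unsigned(value, bits=64):
    """Reinterpret a signed integer as an unsigned two's-complement value."""
    return value & ((1 << bits) - 1)


def to_signed(value, bits=64):
    """Reinterpret an unsigned two's-complement value as a signed integer."""
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(hex(to_unsigned(-177)))          # 0xffffffffffffff4f
print(to_signed(0xffffffffffffff4f))   # -177
```

The low byte works out as 256 - 177 = 79 = 0x4f, and every higher byte is 0xff.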
Augustine
Posts: 8
Joined: Sat Sep 04, 2010 5:30 pm

Re: memory leak/OOM kill

Post by Augustine »

Same here.

Please advise.
stauff
Posts: 4
Joined: Mon Feb 13, 2012 4:59 pm

Re: memory leak/OOM kill

Post by stauff »

Hi guys,

We are trying to resolve this issue. Can either of you provide a workunitid or resultid of jobs that have crashed? This will greatly speed up our search for the source of this issue.
Augustine
Posts: 8
Joined: Sat Sep 04, 2010 5:30 pm

Re: memory leak/OOM kill

Post by Augustine »

I aborted these when I noticed that they had grown to about 1GB of virtual memory and were suspended waiting for other projects to finish: http://bit.ly/11JCyV4. Others finished with a SIGSEGV, like this one.

HTH
Conan
Posts: 18
Joined: Wed Sep 08, 2010 1:03 pm

Re: memory leak/OOM kill

Post by Conan »

ALL current eOn work units are failing on my Windows 32 bit machine.

They reach 1 GB in 1 minute and 2 GB in 3 minutes, then fail. Chewing up 2 GB of RAM on a 32 bit computer leaves 1 GB for every other process the computer runs, which doesn't work.

Conan
Conan
Posts: 18
Joined: Wed Sep 08, 2010 1:03 pm

Re: memory leak/OOM kill

Post by Conan »

Just noticed that my Linux machine with 8 GB RAM got very sluggish.
Checking System Monitor showed that an eOn task consumed up to 6.1 GB of Memory and 6.9 GB Virtual memory after just 11 minutes.
It then dropped back to 24.4 MB of memory and 39.4 MB of virtual memory and started to run normally. The computer then started to respond as quickly as it normally does.
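
For anyone who would rather log these numbers than watch System Monitor, here is a rough, Linux-only sketch that reads the resident (VmRSS) and virtual (VmSize) sizes of a process from /proc/<pid>/status. The function names are illustrative, and the sample text only mimics the layout of that file using the figures above as approximate KiB values.

```python
import re


def parse_status_kib(status_text):
    """Extract VmRSS and VmSize (in KiB) from the text of /proc/<pid>/status."""
    fields = {}
    for key in ("VmRSS", "VmSize"):
        m = re.search(rf"^{key}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
        if m:
            fields[key] = int(m.group(1))
    return fields


def read_memory(pid):
    """Read the current memory figures for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kib(f.read())


# Sample text in the layout of /proc/<pid>/status (values are illustrative).
SAMPLE_STATUS = "Name:\teonclient\nVmSize:\t   40346 kB\nVmRSS:\t   24986 kB\n"

mem = parse_status_kib(SAMPLE_STATUS)
print(f"resident {mem['VmRSS'] / 1024:.1f} MiB, virtual {mem['VmSize'] / 1024:.1f} MiB")
```

Calling `read_memory(pid)` in a loop with a short sleep, using the PID of the eOn task, would show the climb and the later drop as they happen.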

I can't run work units like this. They will only just run on my 8 GB machine, all my others only have 4 GB, and the 2 Windows machines only have 3 GB available, so all work units on my Windows host are failing.

Conan
Yacob
Posts: 5
Joined: Mon Mar 18, 2013 8:59 pm

Re: memory leak/OOM kill

Post by Yacob »

Same here with Windows 7 64 bits.

232158948 	241325083 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	231.74 	--- 	eOn Client v5.00
232158904 	241325039 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	232.60 	--- 	eOn Client v5.00
232158903 	241325038 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	230.69 	--- 	eOn Client v5.00
232158902 	241325037 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Error while computing 	277.82 	230.82 	--- 	eOn Client v5.00
232158901 	241325036 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	230.93 	--- 	eOn Client v5.00
232158900 	241325035 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	231.13 	--- 	eOn Client v5.00
Conan
Posts: 18
Joined: Wed Sep 08, 2010 1:03 pm

Re: memory leak/OOM kill

Post by Conan »

Well that's it for eOn on my 32 bit Windows computers- NNW. Not requesting any more work for this project.

I have been running this project for ages as it is one of my favorites and hardly ever has a problem.
However after wrestling for control of my computer for over an hour due to eOn tasks consuming all memory on the computer, to the point that the computer no longer did anything, I have stopped getting new tasks.
As I stated before, I only have 3 GB of memory available, but these eOn tasks had a Windows Commit Charge of almost 5 GB and they were trying to go higher but couldn't get any more memory, so the computer froze.

Until recently this project has been very reliable, so I don't know what has changed, as there has not been an application update that I can see.

My Linux 64 bit computer with 8 GB RAM can handle a couple of these eOn tasks running together but not my Windows computer.

Conan
Yacob
Posts: 5
Joined: Mon Mar 18, 2013 8:59 pm

Re: memory leak/OOM kill

Post by Yacob »

Not requesting any more work for this project.
Me too!!
Please, update when the issues are fixed so we can resume the work on this project.

Thanks!!!
Sebastien
Posts: 3
Joined: Thu Sep 30, 2010 8:02 pm

Re: memory leak/OOM kill

Post by Sebastien »

stauff wrote:Hi guys,

We are trying to resolve this issue. Can either of you provide a workunitid or resultid of jobs that have crashed? This will greatly speed up our search for the source of this issue.
The problem seems to be located in the function ParallelReplicaJob::dephase()
100,000 dephaseSteps is too high.
felixonmars
Posts: 2
Joined: Fri Jul 05, 2013 3:57 am

Re: memory leak/OOM kill

Post by felixonmars »

stauff wrote:Hi guys,

We are trying to resolve this issue. Can either of you provide a workunitid or resultid of jobs that have crashed? This will greatly speed up our search for the source of this issue.
http://eon.ices.utexas.edu/eon2/result. ... =233022466

My box has 16GB RAM and I also enabled 16GB of zram, which lets some of the WUs complete without an error, but that's still not stable, so I have to select "no new tasks" for now. Waiting for this issue to be fixed :)
felixonmars
Posts: 2
Joined: Fri Jul 05, 2013 3:57 am

Re: memory leak/OOM kill

Post by felixonmars »

ZPC2THLgate wrote:The client binary has been updated and should now work.
Thanks! It works fine here.
losyguy
Posts: 1
Joined: Mon Aug 19, 2013 2:10 am

Re: memory leak/OOM kill

Post by losyguy »

Yacob wrote:Same here with Windows 7 64 bits.
Here is the code

232158948 	241325083 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	231.74 	--- 	eOn Client v5.00
232158904 	241325039 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	232.60 	--- 	eOn Client v5.00
232158903 	241325038 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	230.69 	--- 	eOn Client v5.00
232158902 	241325037 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Error while computing 	277.82 	230.82 	--- 	eOn Client v5.00
232158901 	241325036 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	230.93 	--- 	eOn Client v5.00
232158900 	241325035 	33859 	22 Jun 2013 16:47:31 UTC 	22 Jun 2013 17:17:04 UTC 	Aborted by user 			277.82 	231.13 	--- 	eOn Client v5.00
Hello Yacob,

I am having the same problem and I am also on a Windows 7 machine. Did you ever figure out a solution to this? Thanks so much.