Computation errors

Current eOn code and boinc distributed computing

rebirther
Posts: 3
Joined: Sun Sep 05, 2010 9:12 am

Computation errors

Post by rebirther »

Got some more computation errors here.

Looks like it's not the app, just a wrong parameter.
matt
Posts: 37
Joined: Thu Jul 17, 2008 10:51 pm

Re: Computation errors

Post by matt »

Are you finding that the majority of your work units are crashing or just a few?
rebirther
Posts: 3
Joined: Sun Sep 05, 2010 9:12 am

Re: Computation errors

Post by rebirther »

matt wrote: Are you finding that the majority of your work units are crashing or just a few?
Around 10 out of 100, but it's lost computation time.
matt
Posts: 37
Joined: Thu Jul 17, 2008 10:51 pm

Re: Computation errors

Post by matt »

The issue should be fixed. Are you still getting bad WUs?
rebirther
Posts: 3
Joined: Sun Sep 05, 2010 9:12 am

Re: Computation errors

Post by rebirther »

matt wrote: The issue should be fixed. Are you still getting bad WUs?
My last files are looking good!
Paladin*

Re: Computation errors

Post by Paladin* »

I'm still getting them this morning; it seems anything starting with 129 errors out. Just one box of errors: http://eon.ices.utexas.edu/eon2/results.php?hostid=411
Sorceress
Posts: 9
Joined: Fri Sep 10, 2010 1:39 pm

Re: Computation errors

Post by Sorceress »

I am having computation errors as well. WU3770305 errored out. Can someone tell me what's wrong before I go very far? I have another one waiting.
b0b3r
Posts: 1
Joined: Fri Sep 10, 2010 3:36 pm

Re: Computation errors

Post by b0b3r »

Hello Everyone

I also have problems with the new application version for Windows:

http://eon.ices.utexas.edu/eon2/results.php?hostid=1100

As you can see, with version 5.04 it was OK, but with 5.07 all tasks end with an error.
PinkPenguin
Posts: 3
Joined: Fri Sep 10, 2010 1:25 pm

Re: Computation errors

Post by PinkPenguin »

I can confirm Paladin's statement that all the WUs that error out begin with 129. This seems to happen only on Windows boxes, as my Linux and Mac boxes are returning them OK, with one exception.

It looks like a memory handling error, as the Windows WUs all return the following error, an attempt to read address 0:
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00414168 read attempt to address 0x00000000

The only Linux WU that failed had a segmentation violation, which would tend to confirm the Windows error. The following is the output, which seems to relate to Matter.cpp:
This code only supports cubic cells.
This code only supports cubic cells.
This code only supports cubic cells.
This code only supports cubic cells.
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
This code only supports cubic cells.
This code only supports cubic cells.
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
A carriage return ('\r') has been detected. To work correctly, new lines should be indicated by the new line character (\n).
This code only supports cubic cells.
This code only supports cubic cells.
This code only supports cubic cells.
This code only supports cubic cells.
06:14:58 (28199): called boinc_finish
SIGSEGV: segmentation violation
Stack trace (2 frames):
[0x80a36e7]
[0xc37400]

Exiting...
This is the Linux example: http://eon.ices.utexas.edu/eon2/result. ... id=3755192

All other 129... WUs on Linux seem to be OK according to the validator, though all WUs on the Linux boxes, whether 129... or otherwise, generate a SIGSEGV segmentation violation even if they are validated correctly... ;)

Only the Mac WUs don't generate errors...
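
A note on the carriage-return warning in the Linux output above: it suggests that an input file with Windows-style line endings reached the parser. The following is only a sketch of how such line endings could be normalized before parsing, written in Python for brevity. It is not the eOn fix; the helper names and the sample text are made up for illustration.

```python
# Hypothetical helpers, not part of eOn or BOINC.

def has_carriage_return(text: str) -> bool:
    # True if the text still contains a carriage return character.
    return "\r" in text


def normalize_newlines(text: str) -> str:
    # Convert Windows (CR LF) endings first, then any lone CR, to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Made-up sample input mixing all three line-ending styles.
sample = "line one\r\nline two\rline three\n"
cleaned = normalize_newlines(sample)
```

Replacing the two-character sequence before the lone carriage return matters: doing it the other way round would turn each Windows line ending into two new lines.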
Sorceress
Posts: 9
Joined: Fri Sep 10, 2010 1:39 pm

Re: Computation errors

Post by Sorceress »

Why are we still getting the 129*** WUs? Now, instead of erroring out, they are being aborted by the project. Please stop sending them if they aren't any good.
I haven't had a single 129** WU validate! Geesh...
chill
Posts: 96
Joined: Tue Jul 28, 2009 9:04 pm

Re: Computation errors

Post by chill »

Sorceress wrote: Why are we still getting the 129*** WUs? Now, instead of erroring out, they are being aborted by the project. Please stop sending them if they aren't any good.
I haven't had a single 129** WU validate! Geesh...
Our project is still working on making our calculations as efficient as possible on a distributed computing platform. Currently, in order to parallelize as well as possible, we are making more work units than we actually need. I added the ability to cancel these extra work units when they are no longer needed, but I didn't realize that users would not get credit for the canceled work. I removed this for the 129* work units, so now they will not be canceled. Soon we won't have this problem, as we are implementing a method to make only as many work units as we need.

We have gotten many users on this project very quickly! We are working hard to make good use of all this new processing power.
mitrichr
Posts: 27
Joined: Mon Sep 13, 2010 10:42 pm

Re: Computation errors

Post by mitrichr »

I guess add me to the list. I did not keep the WU IDs, but I just signed on to your project. I have a wee bit of credit, so some were O.K. But I had several that had run for over 30 minutes with no work done. I aborted them.
mitrichr
Posts: 27
Joined: Mon Sep 13, 2010 10:42 pm

Re: Computation errors

Post by mitrichr »

Sorry for the bother, I could not find where to contact the admins. My signature is not showing up, even though I selected to show it by default.

>>RSM
Neil Polson
Posts: 4
Joined: Wed Sep 15, 2010 11:00 am

Re: Computation errors

Post by Neil Polson »

Just got another error. stderr output:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Error: size of hessian is zero. Try with smaller min_Displacement_Hessian

</stderr_txt>
]]>
Get a few of these each day. Noticed a lot of others getting them too. The last one ran for nearly 15 mins. Losing credit when it's not my fault is annoying. Any chance of these being eradicated?
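
The stderr report above has a regular shape, so the exit code and the error message can be pulled out mechanically when comparing many failed tasks. Here is a minimal sketch in Python, assuming the report is available as a plain string. The function names are hypothetical and are not part of BOINC or eOn.

```python
import re

# The report text as posted above, held in a plain string.
REPORT = """<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Error: size of hessian is zero. Try with smaller min_Displacement_Hessian

</stderr_txt>
]]>"""


def extract_exit_code(report: str):
    # Return the process exit code as an int, or None if it is absent.
    match = re.search(r"process exited with code (-?\d+)", report)
    return int(match.group(1)) if match else None


def extract_stderr(report: str) -> str:
    # Return the text between the stderr_txt tags, trimmed of whitespace.
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", report, re.DOTALL)
    return match.group(1).strip() if match else ""


exit_code = extract_exit_code(REPORT)
message = extract_stderr(REPORT)
```

Regular expressions are used here instead of an XML parser because the report is not a single well-formed XML document.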
mitrichr
Posts: 27
Joined: Mon Sep 13, 2010 10:42 pm

Re: Computation errors

Post by mitrichr »

I show a whole bunch of WUs completed successfully. But whenever I check my tasks running in BOINC Manager, I see tasks with time spent and 0% work done. This is a waste of my time. My computers run 24/7; I do not and cannot watch them all of the time.

You have been given precious space on my new i7-840QM. If I cannot be sure that I can go away for 10-12 hours and things are going to be O.K., I cannot afford the waste of time and electricity.
Here are just two randomly selected tasks, one succeeded, the other failed.

Succeeded: 809877401_669_7571100

No work done, aborted: 1728637339_675_27309900

>>RSM