Problems should be solved shortly

Current eOn code and boinc distributed computing

Moderator: moderators

ryeterrell
Posts: 23
Joined: Fri Jun 27, 2008 8:49 pm

Problems should be solved shortly

Post by ryeterrell »

The problems with recent work units have been diagnosed and a fix is in the works. Work units should be working again shortly. Thanks again for your patience and for alerting us to the problems.
upstatelabs
Posts: 20
Joined: Wed Oct 27, 2010 2:07 pm

Re: Problems should be solved shortly

Post by upstatelabs »

I now have "Completed and Validated" WUs being sent back. Thank you!

Many of them have stderr output that looks something like this:
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
error: bundle: cannot rename client.log to client_0.log: Permission denied
19:12:47 (4932): called boinc_finish

</stderr_txt>
]]>

Is this still an issue to be resolved or is this normal?
_MaRiO
Posts: 3
Joined: Thu Feb 16, 2012 9:24 am

Re: Problems should be solved shortly

Post by _MaRiO »

How about scoring?

http://eon.ices.utexas.edu/eon2/workuni ... =117800459
988.36 - - 976.30 - - 0.22

http://eon.ices.utexas.edu/eon2/workuni ... =117800580
985.28 - - 976.71 - - 11.03
Slave_Mac
Posts: 9
Joined: Sat Oct 30, 2010 7:26 pm

Re: Problems should be solved shortly

Post by Slave_Mac »

As of this morning, XP and Lion machines are completing tasks successfully.
mitrichr
Posts: 27
Joined: Mon Sep 13, 2010 10:42 pm

Re: Problems should be solved shortly

Post by mitrichr »

O.K., you need to tell us when we can go back to tasks. And you need to email out the notice, because alerts for replies to posts on this forum do not appear to be working.
Ged T
Posts: 3
Joined: Sat Feb 18, 2012 11:25 am

Re: Problems should be solved shortly

Post by Ged T »

Well, I just tried workunit 580125238_12_219839_0 on one of my W7 x64 machines. The client does increment the percentage complete as elapsed time is consumed, but it's 'chunky': a few tens of seconds pass, then the percentage complete is updated. However, the general trend is that as processing executes, the time to complete also increases. I thought that, maybe, the chunkiness was due to checkpointing, so I checked that and found that it still doesn't checkpoint a workunit. I aborted the workunit to liberate compute resources for the much more able client projects I support.

The upshot is that, whilst one could see the irony of the eOn 4.x client's search for rarely, if ever, occurring events (like checkpointing) over a very long timescale (eons...), I have to conclude that:

1 - The lack of care and attention to the behaviour of the client software is damaging to the other BOINC projects I run, to contribute towards science; to be clear, I mean 'damaging' in the sense that an open-ended completion time is contrary to the spirit of BOINC and disrespectful of those other projects

2 - With still no checkpointing, and especially now that the runtime of an eOn workunit is potentially unbounded, any restart of the client software (as a result of a BOINC framework update or a restart of the host machine) makes the problem worse: other projects are denied even more processor cycles because eOn workunits are needlessly reprocessed (when these had execution times of up to a handful of minutes, it wasn't a huge problem - still wasteful, but manageable...)

3 - The eOn client's tendency towards unbounded execution times and no checkpointing is ironically compounded by the short deadlines set on eOn workunits. The BOINC scheduler attempts to walk the line between the deadline, the project-level proportions of compute resources that have been set, and the availability of those compute resources. For eOn, this all too often results in a "running at high priority" status being granted to eOn workunits, i.e. maximise the compute resource, at the expense of everything else (other BOINC projects), to conclude these 'expressed' workunits in the shortest time possible, so the deadline is not breached. Irony in bucket loads, given that eOn is about testing for events that occur rarely, if ever, over very long intervals of time, yet it is being 'rushed' by the BOINC scheduler...

Given the above, I'm pulling the plug on my support for this project (detaching), as it seems unable to resolve these types of issues, both historically and currently.
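For readers unfamiliar with the term, the checkpointing discussed above can be sketched roughly as follows. This is only a minimal illustration in Python, not eOn or BOINC code: the function name `run_with_checkpoints`, the JSON state file, and the `stop_after` parameter (which simulates an interruption such as a client restart) are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, checkpoint_path, every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress every `every` steps.

    Hypothetical sketch: returns the final sum, or None if the run was
    "interrupted" after `stop_after` steps (simulating a client restart).
    """
    step, acc = 0, 0
    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
        step, acc = state["step"], state["acc"]

    done_this_run = 0
    while step < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption; work since last checkpoint is lost
        acc += step
        step += 1
        done_this_run += 1
        if step % every == 0:
            # Write to a temporary file first, then swap it into place,
            # so a crash mid-write cannot corrupt the checkpoint.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "acc": acc}, f)
            os.replace(tmp, checkpoint_path)
    return acc


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "state.json")
    print(run_with_checkpoints(1000, path, stop_after=250))  # interrupted run
    print(run_with_checkpoints(1000, path))                  # resumes, then finishes
```

Without the checkpoint file, the second run would have to start again from step 0, which is the wasted reprocessing described in point 2.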
mitrichr
Posts: 27
Joined: Mon Sep 13, 2010 10:42 pm

Re: Problems should be solved shortly

Post by mitrichr »

Ged T-

I completely understand your position and conclusion. I am not technically proficient, just a grunt cruncher.

So, what I am going to do is give these guys a chance. I put all my machines on NNT (No New Tasks) for this project. I want to wait until I see an admin or project scientist come here and tell us that it is safe to go back in the water.
Ged T
Posts: 3
Joined: Sat Feb 18, 2012 11:25 am

Re: Problems should be solved shortly

Post by Ged T »

@mitrichr -

I've detached from the project because, even though we can suspend work fetch from any project, its presence in the scheduling list seems to affect BOINC manager scheduling decisions regarding other projects I'm running (Milkyway, Einstein and LHC, currently...). I don't know if you've seen this happen, but when BOINC loads up and the eOn project is present, it seems to automatically schedule a project status update. That is something I haven't seen in any other BOINC project I've ever run, and it implies that eOn somehow "pokes" the scheduler/manager...

In conclusion, when/if there is fresh 'grey smoke' for this project, I'll reattach, so that should cause the latest project software artefacts/versions to be downloaded (to be as certain as one can that the expected runtime environment is clean, at least at the BOINC/project level). This project still represents worthy science and it has willing participants, but it is in serious need of some project/admin TLC! I understand your position and, to be clear, I'll be checking back on this forum to see if anyone from the project has responded and has indicated that we can get back in the pool -;) I'll reiterate that I think this project is about important science and would truly like to see it work and perform well, but it needs a lot of frustrating wrinkles ironing out, right now!
jhankin1
Posts: 2
Joined: Fri May 11, 2012 5:10 am

Re: Problems should be solved shortly

Post by jhankin1 »

What's this "Communication de..." error message? How can it be fixed?
CMmoose
Posts: 46
Joined: Tue Apr 17, 2012 9:39 pm

Re: Problems should be solved shortly

Post by CMmoose »

Hi!
What's the full error message?
jhankin1
Posts: 2
Joined: Fri May 11, 2012 5:10 am

Re: Problems should be solved shortly

Post by jhankin1 »

That's all the Advanced View will tell me. When using Terminal, I get a little more information:

[a bunch of boring startup commands]
dir_open: Could not open directory 'slots'.
11-May-2012 18:17:07 [---] GUI RPC bind to port 31416 [hmm, that number looks familiar] failed: 98
gstate.init() failed
Error Code: -180

I don't know whether or not that's related. I'm shutting the laptop down for now.
CMmoose
Posts: 46
Joined: Tue Apr 17, 2012 9:39 pm

Re: Problems should be solved shortly

Post by CMmoose »

It's a BOINC communication problem - see this thread.
You might well have two BOINC managers running (make sure you've not got some old processes still running) or need to change your firewall/security settings.
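One rough way to test the "something else is already running" theory: on Linux, error 98 is EADDRINUSE (address already in use), and 31416 is the port named in the log above. The following is a minimal Python sketch, assuming the client binds on the loopback address; `port_in_use` is a hypothetical helper name and is not part of BOINC.

```python
import errno
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails with 'address already in use'.

    Hypothetical helper for illustration; the loopback host is an assumption.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem (e.g. permissions); do not hide it
        return False


if __name__ == "__main__":
    # 31416 is the GUI RPC port from the log output quoted earlier in the thread.
    if port_in_use(31416):
        print("Port 31416 is already taken; another client may be running.")
    else:
        print("Port 31416 is free.")
```

If the port turns out to be taken, the next step would be to find and stop the process holding it before starting the client again.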
CMmoose
Posts: 46
Joined: Tue Apr 17, 2012 9:39 pm

Re: Problems should be solved shortly

Post by CMmoose »

Yes, or you can use a free program like Whats Running which gives lots of details - see http://www.whatsrunning.net/.