Problems should be solved shortly

Posted: Mon Feb 20, 2012 5:50 pm
by ryeterrell
The problems with recent work units have been diagnosed and a fix is in the works. Work units should be working again shortly. Thanks again for your patience and for alerting us to the problems.

Re: Problems should be solved shortly

Posted: Tue Feb 21, 2012 12:58 am
by upstatelabs
I now have "Completed and Validated" WUs being sent back. Thank you!

Many of them have stderr output that looks something like this:
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
error: bundle: cannot rename client.log to client_0.log: Permission denied
19:12:47 (4932): called boinc_finish

</stderr_txt>
]]>

Is this still an issue to be resolved or is this normal?

Re: Problems should be solved shortly

Posted: Tue Feb 21, 2012 1:07 pm
by _MaRiO
How about scoring?

http://eon.ices.utexas.edu/eon2/workuni ... =117800459
988.36 - - 976.30 - - 0.22

http://eon.ices.utexas.edu/eon2/workuni ... =117800580
985.28 - - 976.71 - - 11.03

Re: Problems should be solved shortly

Posted: Tue Feb 21, 2012 2:34 pm
by Slave_Mac
As of this A.M., XP and Lion machines are completing tasks successfully.

Re: Problems should be solved shortly

Posted: Thu Feb 23, 2012 1:05 am
by mitrichr
O.K., you need to tell us when we can go back to tasks. And you need to email out the notice, because alerts for replies to posts on this forum do not appear to be working.

Re: Problems should be solved shortly

Posted: Sat Feb 25, 2012 6:55 am
by Ged T
Well, I just tried workunit item 580125238_12_219839_0 on one of my W7 x64 machines. The client does increment the percentage complete as elapsed time is consumed, but it's 'chunky' - a few tens of seconds pass, then the percentage complete is updated. However, the general trend is that as processing continues, the time to complete also increases. I thought that, maybe, the chunkiness was due to checkpointing, so I checked that and found that it still doesn't checkpoint a workunit; I aborted the workunit to liberate compute resources for the much more able client projects I support.

The upshot is that, whilst one could see the irony of the eOn 4.x client's search for rarely, if ever, occurring events (like checkpointing) over a very long timescale (eons...), I have to conclude that:

1 - The lack of care and attention to the behaviour of the client software is damaging to the other BOINC projects I run, to contribute towards science; to be clear, I mean 'damaging' in the sense that an open-ended completion time is contrary to the spirit of BOINC and disrespectful of those other projects

2 - With still no checkpointing, and especially now that the runtime of an eOn workunit is potentially unbounded, any restart of the client software, whether from a BOINC framework update or a restart of the host machine, compounds the problem: other projects are denied even more processor cycles because eOn workunits are needlessly reprocessed from the start (when these had execution times of up to a handful of minutes, it wasn't a huge problem - still wasteful, but manageable...)

3 - The eOn client's tendency towards unbounded execution times and its lack of checkpointing are ironically compounded by the short deadlines set for the eOn workunits: the BOINC scheduler attempts to walk the line between the deadline, the project-level proportions of compute resources that have been set, and the availability of those compute resources - For eOn, this all too often results in a "running at high priority" status being granted to eOn workunits - i.e. maximise the compute resource, at the expense of everything else (the other BOINC projects), to conclude these 'expressed' workunits in the shortest time possible, so the deadline is not breached - Irony in bucket loads, given eOn is about testing for events that occur rarely, if ever, over very long intervals of time being 'rushed' by the BOINC scheduler...
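
For anyone unfamiliar with the term, checkpointing as used in points 2 and 3 means saving progress periodically so that a restart resumes from the last saved state instead of starting over. Below is a minimal, generic sketch in Python. It is not the eOn client's code and does not use the BOINC API; the names save_checkpoint, load_checkpoint and run, the JSON state layout and the toy "work" in the loop are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, stop_after=None):
    """Toy workload: sum 0..total_steps-1, checkpointing after each step.

    stop_after simulates the process being killed after that many steps.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        state["acc"] += state["step"]  # the hypothetical unit of work
        state["step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
        if stop_after is not None and done_this_run >= stop_after:
            break
    return state
```

Calling run("ckpt.json", 10, stop_after=4) and then run("ckpt.json", 10) finishes the remaining six steps rather than redoing the first four. A real client would checkpoint far less often than every step; that is the trade-off between I/O cost and lost work.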

Given the above, I'm pulling the plug for my support of this project (detaching) as it seems unable to resolve these types of issues both historically and currently.

Re: Problems should be solved shortly

Posted: Sat Feb 25, 2012 11:41 am
by mitrichr
Ged T-

I completely understand your position and conclusion. I am not technically proficient, just a grunt cruncher.

So, what I am going to do is give these guys a chance. I put all my machines on NNT for this project. I want to wait until I see an admin or project scientist come here and tell us that it is safe to go back in the water.

Re: Problems should be solved shortly

Posted: Mon Feb 27, 2012 9:34 am
by Ged T
@mitrichr -

I've detached from the project because even though we can suspend work fetch from any project, its presence in the scheduling list seems to affect BOINC manager scheduling decisions regarding other projects I'm running (Milkyway, Einstein and LHC, currently...). I don't know if you've seen this happen, but when BOINC loads up and the eOn project is present, it seems to automatically schedule a project status update - Something I haven't seen in any other BOINC project I've ever run and that implies it somehow "pokes" the scheduler/manager...

In conclusion, when/if there is fresh 'grey smoke' for this project, I'll reattach, so that should cause the latest project software artefacts/versions to be downloaded (to be as certain as one can that the expected runtime environment is clean, at least at the BOINC/project level.) This project still represents worthy science and it has willing participants but is in serious need of some project/admin TLC! I understand your position and, to be clear, I'll be checking back on this forum to see if anyone from the project has responded and has indicated that we can get back in the pool -;) I'll reiterate that I think this project is about important science and would truly like to see it work and perform well but it needs a lot of frustrating wrinkles ironing out, right now!

Re: Problems should be solved shortly

Posted: Wed Feb 29, 2012 6:56 am
by _MaRiO

Re: Problems should be solved shortly

Posted: Fri May 11, 2012 5:14 am
by jhankin1
What's this "Communication de..." error message? How can it be fixed?

Re: Problems should be solved shortly

Posted: Fri May 11, 2012 5:53 am
by CMmoose
Hi!
What's the full error message?

Re: Problems should be solved shortly

Posted: Fri May 11, 2012 10:23 pm
by jhankin1
That's all the Advanced View will tell me. When using Terminal, I get a little more information:

[a bunch of boring startup commands]
dir_open: Could not open directory 'slots'.
11-May-2012 18:17:07 [---] GUI RPC bind to port 31416 [hmm, that number looks familiar] failed: 98
gstate.init() failed
Error Code: -180

I don't know whether or not that's related. I'm shutting the laptop down for now.

Re: Problems should be solved shortly

Posted: Sat May 12, 2012 4:39 am
by CMmoose
It's a BOINC communication problem - see this thread.
You might well have two BOINC managers running (make sure you've not got an old copy still running) or need to change your firewall/security settings.
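
The "failed: 98" in the Terminal output fits the first possibility: on Linux, errno 98 is EADDRINUSE ("address already in use"), which is what a second client would get if something else already holds port 31416. A rough way to check, sketched in Python; port_in_use is a hypothetical helper and not part of BOINC, and it assumes the client binds on 127.0.0.1:

```python
import errno
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem, e.g. permissions
        return False


if __name__ == "__main__":
    # 31416 is the GUI RPC port shown in the log above.
    print("port 31416 in use:", port_in_use(31416))
```

If this reports the port as in use while you believe BOINC is not running, an old client process is probably still alive and needs to be stopped before starting a new one.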

Re: Problems should be solved shortly

Posted: Thu May 24, 2012 9:10 pm
by CMmoose
Yes, or you can use a free program like Whats Running, which gives lots of details - see http://www.whatsrunning.net/.