leibold (Premium, MVM), joined 2002-07-09, Sunnyvale, CA, kudos: 6
[F@H] Bad workunit keeps terminating immediately (forever?)
I seem to have a bad workunit on my FAH client that has already been attempted 19 times and doesn't make any progress.
I tried restarting both FAHClient and FAHControl with no change.
Is it time to terminate that workunit with extreme prejudice?
The attached log file shows the end of the most recent successful workunit with the download of the bad workunit interleaved. Ignore the lines starting with WU00 if you only want to see the messages related to the bad workunit.
The bad workunit is Project: 7611 (Run 3, Clone 56, Gen 180)
and always fails with:
FahCore returned: INTERRUPTED (102 = 0x66)
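Following the advice about the WU00 lines, here is a minimal Python sketch that keeps only the log lines not starting with `WU00`. It assumes the log is plain text with the slot tag at the very start of each line, as the post describes; if your log lines begin with a timestamp instead, the prefix test would need adjusting. The function name `lines_for_bad_wu` is hypothetical.

```python
from typing import Iterable, Iterator


def lines_for_bad_wu(lines: Iterable[str], skip_prefix: str = "WU00") -> Iterator[str]:
    """Yield only the log lines that do not start with skip_prefix.

    Per the post, lines starting with WU00 belong to the previous (successful)
    workunit, so skipping them leaves the messages about the bad workunit
    plus any general client messages.
    """
    for line in lines:
        if not line.startswith(skip_prefix):
            yield line
```

Usage, with a hypothetical file name: `with open("log.txt") as f: sys.stdout.writelines(lines_for_bad_wu(f))`.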
The operating system (Linux) indicates that FahCore_a4 fails with a segmentation violation (attempt to access memory not allocated to the process):
[17338259.062045] FahCore_a4[834]: segfault at fffffffe00da6460 ip 00000000004aaf36 sp 00007fbf32fb13a8 error 4 in FahCore_a4[400000+5e9000]
--
Got some spare CPU cycles? Join Team Helix or Team Starfire!
jaynick (lit up, Premium), joined 2001-02-06, Sterling Heights, MI, kudos: 2
This could be related and might be worth a look: foldingforum.org/viewtopic.php?f=19&t=23072
It sounds like a similar issue on a Linux machine with the same project number.
leibold (Premium, MVM), joined 2002-07-09, Sunnyvale, CA, kudos: 6
The segfault is at the same location in the FahCore, so it is very likely the same issue.
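The comparison behind that conclusion can be checked mechanically. A minimal Python sketch, assuming an x86-64 Linux kernel (where the `error` field is the page-fault error code) and a log line formatted like the one in the first post: it parses the `segfault at` line and computes the instruction pointer's offset inside the FahCore_a4 mapping (0x4aaf36 - 0x400000 = 0xaaf36). Two reports with the same binary and the same offset crashed at the same instruction. The names `Segfault`, `parse_segfault`, `describe_error` and `same_crash_site` are hypothetical.

```python
import re
from dataclasses import dataclass

# Pattern modeled on the single kernel "segfault at" line quoted in this
# thread; other kernel versions may format the message differently.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


@dataclass
class Segfault:
    binary: str      # mapped object the ip falls in, e.g. FahCore_a4
    fault_addr: int  # address the process tried to access
    ip: int          # instruction pointer at the time of the fault
    error: int       # x86 page-fault error code (assumed x86-64 kernel)
    base: int        # start address of the binary's mapping
    size: int        # length of the binary's mapping

    @property
    def ip_offset(self) -> int:
        """Instruction pointer relative to the start of the mapped binary."""
        return self.ip - self.base


def parse_segfault(line: str) -> Segfault:
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    return Segfault(
        binary=m["obj"],
        fault_addr=int(m["addr"], 16),
        ip=int(m["ip"], 16),
        error=int(m["error"]),
        base=int(m["base"], 16),
        size=int(m["size"], 16),
    )


def describe_error(code: int) -> str:
    """Decode the low three bits of the x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user mode" if code & 4 else "kernel mode"
    return f"{mode} {access}, {cause}"


def same_crash_site(a: Segfault, b: Segfault) -> bool:
    """Same binary and same ip offset means the same faulting instruction."""
    return a.binary == b.binary and a.ip_offset == b.ip_offset
```

For the line quoted in the first post, `describe_error(4)` returns "user mode read, page not present", consistent with the core reading memory that is not mapped for it. Comparing offsets rather than raw ip values keeps the check valid even if the binary were loaded at a different base address.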
sortofageek (Not Trouble; Premium, Mod), joined 2001-08-19, There & Then, kudos: 14
reply to leibold
FWIW, nobody else has completed Project: 7611 (Run 3, Clone 56, Gen 180) at this time.
leibold (Premium, MVM), joined 2002-07-09, Sunnyvale, CA, kudos: 6
said by tjlane (Pande Group):
Hi All,
I am sorry for the recurrence of this issue. Unfortunately I can't always guarantee this issue won't crop up, but I've done my best to mitigate it. The root cause is a bug in the A4 core, and I've reported it to the correct people. I think that because this issue occurs in <0.1% of WUs, it hasn't been a priority for our dev team, which is usually swamped.
This issue will be resolved eventually, but in the meantime please feel free to dump these WUs. Not only are the points bad, but when this issue occurs the returned WU is meaningless. Therefore it's beneficial for the science (and the donor) to dump the WU.
I do apologize and will bug the core dev team again. Please let me know if there are any further questions or concerns.
Thanks,
TJ
There is a confirmed problem with the A4 core affecting some project 7611 workunits. On Linux it seems to cause segmentation faults (violations), while on other systems these workunits simply run extremely slowly.
I'm concerned that FAHClient keeps retrying these bad workunits indefinitely (20 attempts before I finally deleted mine). This problem could take a lot of Linux F@H clients that run unattended out of circulation!
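Until the client itself gives up on such workunits, an unattended machine could at least be watched from outside. A minimal Python sketch, assuming the client log is readable as plain text, that each attempt writes one `FahCore returned:` line like the one quoted in the first post, and that only a single folding slot is logging (interleaved slots would mix their streaks): it counts consecutive INTERRUPTED results so a script can alert you. The threshold and the names `consecutive_interrupts` and `looks_stuck` are hypothetical, and the sketch only detects the loop; dumping the workunit is still up to you.

```python
from typing import Iterable

# Result text copied from the failure message quoted in this thread.
INTERRUPTED = "FahCore returned: INTERRUPTED"


def consecutive_interrupts(lines: Iterable[str]) -> int:
    """Count how many FahCore runs in a row ended with INTERRUPTED.

    Any other "FahCore returned:" result resets the streak, so the count
    reflects only the most recent run of failures. Assumes a single slot.
    """
    streak = 0
    for line in lines:
        if "FahCore returned:" not in line:
            continue  # not a core result line
        if INTERRUPTED in line:
            streak += 1
        else:
            streak = 0
    return streak


def looks_stuck(lines: Iterable[str], max_attempts: int = 5) -> bool:
    """Hypothetical watchdog check: flag the client for attention once the
    current workunit has been interrupted max_attempts times in a row."""
    return consecutive_interrupts(lines) >= max_attempts
```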