ohNet hang after TimerManager thread crashes
19-04-2013, 08:00 AM (This post was last modified: 19-04-2013 08:13 AM by andreww.)
Post: #11
RE: ohNet hang after TimerManager thread crashes
I had another look at the stack dumps. I think LWP 8015 could well be the timer thread. It does not appear in the Java stack dump, and it appears early on between two other threads both created by ohNet. Since it is not in the Java stack dump, I would assume that it has never run Java code at all.

For reference, here's a mapping between Java thread names and LWP numbers:
Code:
Thread-378 is LWP 19630
Thread-377 is LWP 19629
Thread-354 is LWP 17824
Thread-353 is LWP 17823
Thread-17 is LWP 8023
Thread-16 is LWP 8022
Timer-0 is LWP 8046
Thread-14 is LWP 8021
Thread-12 is LWP 8020
Thread-10 is LWP 8013
Thread-9 is LWP 8025
Thread-8 is LWP 8019
Thread-7 is LWP 8018
Thread-6 is LWP 8017
Thread-5 is LWP 8016
Thread-4 is LWP 8012
Thread-3 is LWP 8011
Thread-1 is LWP 8002
Thread-0 is LWP 8001
Service Thread is LWP 7997
C1 CompilerThread0 is LWP 7996
Signal Dispatcher is LWP 7995
Finalizer is LWP 7994
Reference Handler is LWP 7993
main is LWP 7991
VM Thread is LWP 7992

Is the hang reproducible? If it's a Debian-based distro, have you installed the libc6-dbg package? I think that package should provide the debug information gdb needs to give an accurate stack trace through frames in libc. Without that debug information gdb has to work out the frames automatically, and I think it is a bit poor at doing so on armel.

Also, here's a snippet of Python (2.7) that takes the Java stack dump and prints out the mapping of thread names to LWP numbers:

Code:
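import re
# Capture the quoted thread name and the hex native thread ID (nid) from each thread header line.
regex = re.compile('"([^"]*)".* nid=0x([0-9a-f]+) .*')
for match in map(regex.match, open('qnaphangjava.txt')):
    if match is not None:
        # The nid is hexadecimal; converted to decimal it is the LWP number that gdb reports.
        print "{} is LWP {}".format(match.group(1), int(match.group(2), 16))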
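The same idea can be taken one step further to list the threads that gdb sees but the Java dump does not, which is the check used above to pick out LWP 8015. This is only a sketch in Python 3: the function names and the two sample inputs are made up for illustration (the addresses and thread states are placeholders, not taken from the real dumps), and it assumes gdb's "info threads" output contains "(LWP n)" for each thread.

```python
import re

# Java thread dump header, e.g.: "Timer-0" daemon prio=10 tid=0x... nid=0x1f6e in Object.wait()
JAVA_THREAD = re.compile(r'"([^"]*)".* nid=0x([0-9a-f]+) ')
# gdb "info threads" prints each native thread with "(LWP <decimal>)".
GDB_LWP = re.compile(r'\(LWP (\d+)\)')


def java_lwps(dump_text):
    """Return {LWP number: Java thread name} for every thread in the Java dump."""
    result = {}
    for line in dump_text.splitlines():
        m = JAVA_THREAD.match(line)
        if m:
            result[int(m.group(2), 16)] = m.group(1)
    return result


def native_only_lwps(gdb_text, dump_text):
    """Return the sorted LWPs that gdb lists but the Java dump does not."""
    known = java_lwps(dump_text)
    return sorted(int(n) for n in GDB_LWP.findall(gdb_text) if int(n) not in known)


# Hypothetical sample inputs, shaped like the real dumps.
JAVA_SAMPLE = '''
"Timer-0" daemon prio=10 tid=0x00000000 nid=0x1f6e in Object.wait() [0x00000000]
"main" prio=10 tid=0x00000000 nid=0x1f37 waiting on condition [0x00000000]
'''
GDB_SAMPLE = '''
  3 Thread 0x40a1b460 (LWP 8046)  0x4003d8b4 in ?? ()
  2 Thread 0x4091b460 (LWP 8015)  0x4003d8b4 in ?? ()
* 1 Thread 0x40024000 (LWP 7991)  0x4003d8b4 in ?? ()
'''

print(native_only_lwps(GDB_SAMPLE, JAVA_SAMPLE))  # [8015]
```

A thread that shows up only in the gdb list was never attached to the JVM, so it is a candidate for a thread created natively by ohNet.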
19-04-2013, 08:55 AM
Post: #12
RE: ohNet hang after TimerManager thread crashes
Thanks for looking into this. See comments inline below.

(19-04-2013 08:00 AM)andreww Wrote:  I had another look at the stack dumps. I think LWP 8015 could well be the timer thread. It does not appear in the Java stack dump, and it appears early on between two other threads both created by ohNet. Since it is not in the Java stack dump, I would assume that it has never run Java code at all.

I don't think LWP 8015 is the TimerManager thread. I've done a thread dump of MinimServer running normally on the QNAP (see attachment). In that dump the TimerManager thread has an LWP number that is 3 less than the NetworkAdapterChangeNotifier thread, indicating that it was created before the NetworkAdapterChangeNotifier thread. There is also a thread in the "normal" dump (LWP 5293) whose LWP number is 1 more than the NetworkAdapterChangeNotifier thread (as is LWP 8015 in the hang dump). Its stack trace looks the same as that of LWP 8015, and it isn't the TimerManager thread.
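The offset comparison can be written down as a small sketch. The LWP values for NetworkAdapterChangeNotifier (5292 and 8014) and for TimerManager in the normal dump (5289) are derived from the offsets described above rather than read from the dumps, and the function name and dictionary keys are made up for illustration.

```python
def creation_offsets(lwps_by_name, anchor):
    """Return each thread's LWP number relative to the anchor thread's LWP."""
    base = lwps_by_name[anchor]
    return {name: lwp - base for name, lwp in lwps_by_name.items()}


ANCHOR = "NetworkAdapterChangeNotifier"

# Normal run: LWP 5293 is anchor + 1, TimerManager is anchor - 3 (derived values).
normal = {ANCHOR: 5292, "TimerManager": 5289, "unnamed": 5293}
# Hang: LWP 8015 is anchor + 1 (derived value for the anchor).
hang = {ANCHOR: 8014, "candidate": 8015}

normal_offsets = creation_offsets(normal, ANCHOR)
hang_offsets = creation_offsets(hang, ANCHOR)

# The candidate sits at the same offset as the unnamed thread, not TimerManager.
print(hang_offsets["candidate"] == normal_offsets["unnamed"])       # True
print(hang_offsets["candidate"] == normal_offsets["TimerManager"])  # False
```

This only works as a rough guide: on Linux, LWP numbers are allocated system-wide, so the offsets match between runs only if no other process creates a thread or process in between.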

Quote:For reference, here's a mapping between Java thread names and LWP numbers:
Code:
Thread-378 is LWP 19630
Thread-377 is LWP 19629
Thread-354 is LWP 17824
Thread-353 is LWP 17823
Thread-17 is LWP 8023
Thread-16 is LWP 8022
Timer-0 is LWP 8046
Thread-14 is LWP 8021
Thread-12 is LWP 8020
Thread-10 is LWP 8013
Thread-9 is LWP 8025
Thread-8 is LWP 8019
Thread-7 is LWP 8018
Thread-6 is LWP 8017
Thread-5 is LWP 8016
Thread-4 is LWP 8012
Thread-3 is LWP 8011
Thread-1 is LWP 8002
Thread-0 is LWP 8001
Service Thread is LWP 7997
C1 CompilerThread0 is LWP 7996
Signal Dispatcher is LWP 7995
Finalizer is LWP 7994
Reference Handler is LWP 7993
main is LWP 7991
VM Thread is LWP 7992

Thanks!

Quote:Is the hang reproducible?

No, I have only seen it once. I have seen other hangs that might have been similar, but at the time of these other hangs I didn't have a working gdb on the QNAP so I couldn't get a native thread dump.

Quote:If it's a Debian-based distro, have you installed the libc6-dbg package? I think that package should provide the debug information needed for gdb to give an accurate stack trace through frames in libc. When it's missing debug information it tries to figure it out automatically, and I think gdb is a bit poor at doing so on armel.

The QNAP doesn't use Debian. I'll look at libc6-dbg to see if it's possible to retrofit it into the QNAP environment.

Quote:Also, here's a snippet of Python (2.7) that takes the Java stack dump and prints out the mapping of thread names to LWP numbers:

Code:
import re
regex = re.compile('"([^"]*)".* nid=0x([0-9a-f]+) .*')
for match in map(regex.match, open('qnaphangjava.txt')):
    if match is not None:
        print "{} is LWP {}".format(match.group(1), int(match.group(2),16))

Thanks!


Attached File(s)
.txt  qnapnormal.txt (Size: 14.86 KB / Downloads: 1)
25-04-2013, 09:30 AM
Post: #13
RE: ohNet hang after TimerManager thread crashes
I'd like to fix the MinimServer bug of not terminating the ohNet process after an unhandled exception fatal error call.

To ensure the Visual Studio debugger is called on Windows, I'd like to do the process termination by calling Os::Quit or abort(). I don't think this is possible with the current Java bindings. Would you be willing to accept a patch to the Java bindings to provide this capability?
25-04-2013, 09:50 AM
Post: #14
RE: ohNet hang after TimerManager thread crashes
(25-04-2013 09:30 AM)simoncn Wrote:  I'd like to fix the MinimServer bug of not terminating the ohNet process after an unhandled exception fatal error call.

To ensure the Visual Studio debugger is called on Windows, I'd like to do the process termination by calling Os::Quit or abort(). I don't think this is possible with the current Java bindings. Would you be willing to accept a patch to the Java bindings to provide this capability?

That sounds like a good idea. It'd be great if you were able to provide a patch for this.
29-04-2013, 10:28 AM
Post: #15
RE: ohNet hang after TimerManager thread crashes
(25-04-2013 09:50 AM)simonc Wrote:  That sounds like a good idea. It'd be great if you were able to provide a patch for this.

The patch is attached. It adds a new exitProcess() method to the Library class.

The following files are affected:
OpenHome/Net/Bindings/Java/org/openhome/net/core/Library.java
OpenHome/Net/Bindings/Java/Library.c
OpenHome/Net/Bindings/Java/Library.h
OpenHome/Net/Bindings/C/OhNetC.cpp
OpenHome/Net/Bindings/C/OhNet.h

I've tested this on Windows, Linux and Mac.


Attached File(s)
.zip  exitprocess.zip (Size: 2 KB / Downloads: 1)
30-04-2013, 09:14 AM
Post: #16
RE: ohNet hang after TimerManager thread crashes
(29-04-2013 10:28 AM)simoncn Wrote:  
(25-04-2013 09:50 AM)simonc Wrote:  That sounds like a good idea. It'd be great if you were able to provide a patch for this.

The patch is attached. It adds a new exitProcess() method to the Library class.

Thanks very much. I've applied this locally so it should be on github later today.

Note that I made a couple of small changes to your patch: I renamed the function to abortProcess() (I thought this made it slightly clearer that the function was not intended to be called during normal execution of a program) and added an equivalent C# API.
30-04-2013, 10:05 AM
Post: #17
RE: ohNet hang after TimerManager thread crashes
(30-04-2013 09:14 AM)simonc Wrote:  Thanks very much. I've applied this locally so it should be on github later today.

Note that I made a couple of small changes to your patch - the function is now called abortProcess() (I thought this made it slightly clearer that the function was not intended to be called during normal execution of a program) and added an equivalent C# API.

Thanks very much! I chose the name exitProcess() to correspond to the Java System.exit() call, but I think you're right that it might cause misunderstanding.