Race conditions in EventServerUpnp
03-02-2013, 11:32 AM
Post: #1
Race conditions in EventServerUpnp
I've seen a couple of ohNet control point crashes recently. The symptoms were different, but debugging using detailed trace logs has shown that the root cause was the same in both cases.

The control point has a number of EventSessionUpnp instances receiving property update event notifications from the same server. Each of these instances runs asynchronously on its own thread, which sets up a race condition where these event notifications can be reordered at various stages of processing, as described in detail below; a minimal sketch of the race pattern follows the numbered list.

1) The calls to Socket::Accept on line 552 of Network.cpp happen in the correct sequence.

2) Each of these Socket::Accept calls is made from a different thread, which then calls EventSessionUpnp::Run to process the event message. The code in EventSessionUpnp::Run runs in parallel on different threads, with no synchronization.

3) At line 110 of EventUpnp.cpp, the subscription sequence number is checked. If the event messages have been reordered by thread parallelism when they reach this point, the sequence number check will fail, triggering an unsubscribe and resubscribe. For more details, see the description of trace1.log below.

4) This unsubscribe and resubscribe can trigger a bug in CpiSubscriptionManager::WaitForPendingAdd, which I've described in this thread. The symptom of this bug is an assertion failure, as shown in line 1074 of trace1.log.

5) At line 151 of EventUpnp.cpp, ProcessNotification() is called to process the updated property values. Even if the event messages made it through line 110 in the correct order, they may be in the wrong order by the time they arrive here.

6) If the calls to ProcessNotification() are made in the wrong order, event message 1 can be processed by this code before message 0. The processing of message 1 triggers an InitialEvent callback to the application, which attempts to get the initial property values. Because message 0 hasn't yet been processed by ProcessNotification(), this attempt to get the initial values fails with an assertion message. For more details, see the description of trace2.log below.
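To make the race concrete, here is a minimal, self-contained model of the pattern above (illustrative only, not ohNet code; the thread setup, timings and handler are my own simplification). Two session threads each read one message for the same subscription and then process it; nothing forces the thread holding message 0 to reach the processing stage before the thread holding message 1.

Code:
// Illustrative model only -- not ohNet code. Two "session" threads each receive
// one eventing message for the same subscription; nothing orders their
// processing, so message 1 can reach the handler before message 0.
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

static std::mutex gLock;
static unsigned gExpectedSeq = 0;   // next sequence number the handler expects

// Stands in for the sequence-number check and ProcessNotification() stage.
static void ProcessNotification(unsigned aSeq)
{
    std::lock_guard<std::mutex> guard(gLock);
    if (aSeq != gExpectedSeq) {
        std::printf("out of order: got %u, expected %u -> unsubscribe/resubscribe\n", aSeq, gExpectedSeq);
        return;
    }
    std::printf("processed message %u in order\n", aSeq);
    ++gExpectedSeq;
}

// Stands in for EventSessionUpnp::Run on its own thread: read, then process.
static void SessionThread(unsigned aSeq, int aReadMillis)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(aReadMillis));   // "reading" the message
    ProcessNotification(aSeq);
}

int main()
{
    std::thread t0(SessionThread, 0u, 50);   // message 0: large, slow to read
    std::thread t1(SessionThread, 1u, 5);    // message 1: small, fast to read
    t0.join();
    t1.join();   // typically takes the "out of order" path
    return 0;
}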

I'm attaching two trace logs which demonstrate the reorderings described in 3) and 6) above.

trace1.log:

On line 96 of trace1.log, thread 215 completes reading subscription event message 31 and discovers that the sequence number is out of order, so it calls SetNotificationError (line 111 of EventUpnp.cpp) which initiates an unsubscribe and resubscribe. The relevant sequence numbers are:

28, read by thread 215 on lines 51 to 54 of trace1.log
29, read by thread 212 on lines 72 to 80 of trace1.log
30, read by thread 216 on lines 91 to 106 of trace1.log
31, read by thread 215 on lines 92 to 96 of trace1.log
32, read by thread 212 on lines 126 to 129 of trace1.log
33, read by thread 215 on lines 97 to 132 of trace1.log

Because of the unsubscribe/resubscribe previously initiated by thread 215 for message 31, thread 212 is unable to find a subscription for message 32 (line 94 of EventUpnp.cpp), so it calls WaitForPendingAdd (line 100 of EventUpnp.cpp) in an attempt to find the subscription later. This happens on line 129 of trace1.log, and the same thing happens with thread 213 for message 33 on line 132 of trace1.log.

On the final line 1074 of trace1.log, thread 212 awakes and triggers an assertion error:
Assertion failed. Build/Include/OpenHome/OsWrapper.inl:44
This is caused by the bug in CpiSubscriptionManager::WaitForPendingAdd described in this thread.

trace2.log:

On line 239 of trace2.log, thread 60 calls EventSessionUpnp::Run to read event message 0. This message is more than 1024 bytes long, so two read operations are needed, and the read is completed on line 279 of trace2.log. Meanwhile, thread 61 starts reading event message 1 on line 250 of trace2.log and finishes reading this message on line 254 of trace2.log, overtaking the processing of event message 0. When event message 1 is processed by ProcessNotification(), the InitialEvent notification is delivered to the application on line 264 of trace2.log. On line 275 of trace2.log, the application attempts to read the value of the AllContextStatus property, which fails with an assertion error because event message 0 hasn't yet reached ProcessNotification(). The assertion message is on line 275 of trace2.log:
Assertion failed. OpenHome/Net/Service.cpp:311

Proposed solution:

There's no easy solution for this problem. When event messages are received, they might be for the same subscription or for different subscriptions. If they are for the same subscription, they need to be processed in the same order as they were received. If the messages are for different subscriptions, it's fine to process them in parallel and reorder them.

The complication is that it takes a reasonable amount of processing to identify the correct subscription. During this initial processing, the messages must be kept in strict order in case it turns out that they are for the same subscription. When the subscription is known, the messages for that subscription can be kept in the right order by using a mutex on the CpiSubscription object, and messages for other subscriptions can be processed in parallel.

I think a global mutex on EventServerUpnp is required to ensure correct ordering of event message processing for the initial stages until the subscription for each event message is known. This will require some reorganization of the current code.
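To illustrate what I mean, here is a rough sketch of the two-stage locking (names such as EventServer, Subscription and IdentifySubscription are invented for illustration, not the actual ohNet classes). The per-subscription lock is taken before the server-wide lock is released, so two messages for the same subscription cannot swap places during the hand-over, while messages for different subscriptions run in parallel once the server lock is dropped.

Code:
// Sketch only -- invented names, not a patch against ohNet.
#include <mutex>

struct Message { /* parsed eventing NOTIFY request */ };

class Subscription                     // stand-in for CpiSubscription
{
public:
    std::mutex& Lock() { return iLock; }
    void ProcessNotification(const Message& /*aMsg*/) { /* apply property updates */ }
private:
    std::mutex iLock;
};

class EventServer                      // stand-in for EventServerUpnp
{
public:
    void HandleEvent(const Message& aMsg)
    {
        // Stage 1: all sessions are serialised until the subscription is known.
        std::unique_lock<std::mutex> serverGuard(iServerLock);
        Subscription* subscription = IdentifySubscription(aMsg);   // parse SID, check sequence number
        if (subscription == nullptr) {
            return;                    // unknown SID or bad sequence number -> resubscribe path
        }
        // Stage 2: hand over to the per-subscription lock before dropping the
        // server lock, so same-subscription messages stay in arrival order.
        std::unique_lock<std::mutex> subGuard(subscription->Lock());
        serverGuard.unlock();          // other subscriptions may now proceed on other threads
        subscription->ProcessNotification(aMsg);
    }
private:
    Subscription* IdentifySubscription(const Message& /*aMsg*/) { return nullptr; /* stub */ }
    std::mutex iServerLock;
};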

As a temporary workaround, I will set NumEventSessionThreads to 1. This will avoid the problem at some cost to performance.
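For anyone wanting the same stop-gap, this is roughly how to set it from the control point application (the header path, InitParams::Create, SetNumEventSessionThreads and the Library constructor are written from memory here, so please check them against the ohNet version you build with):

Code:
// Workaround sketch: a single event session thread means eventing messages
// cannot be reordered by thread scheduling. API names are from memory; verify
// them against your ohNet headers.
#include <OpenHome/Net/Core/OhNet.h>

using namespace OpenHome::Net;

int main()
{
    InitParams* initParams = InitParams::Create();
    initParams->SetNumEventSessionThreads(1);   // serialise event processing (slower, but ordered)
    Library* lib = new Library(initParams);     // the library takes ownership of initParams (check for your version)
    // ... start the control point stack and run the application as usual ...
    delete lib;
    return 0;
}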


Attached File(s)
.zip  logs.zip (Size: 8.82 KB / Downloads: 0)
04-02-2013, 11:51 AM
Post: #2
RE: Race conditions in EventServerUpnp
(03-02-2013 11:32 AM)simoncn Wrote:  I've seen a couple of ohNet control point crashes recently. The symptoms were different, but debugging using detailed trace logs has shown that the root cause was the same in both cases.

The control point has a number of EventSessionUpnp instances receiving property update event notifications from the same server. Each of these instances runs asynchronously on its own thread, which sets up a race condition where these event notifications can be reordered incorrectly at various stages of processing, as described in detail below.

Thanks, another great spot and very helpful diagnosis.

I've committed a likely fix locally. This doesn't attempt to prevent the races which cause resubscribes. It just ensures that, after a successful call to UpdateSequenceNumber, this event will be the next to be processed and notified to client code.

I don't plan on doing anything about the races which cause resubscribes. As well as the thread scheduling causes you've found, the same symptoms can be prompted by the network stack. This latter cause is out of our control, so there's limited benefit in worrying about the subset of resubscribe races we potentially could control.
04-02-2013, 04:58 PM
Post: #3
RE: Race conditions in EventServerUpnp
(04-02-2013 11:51 AM)simonc Wrote:  Thanks, another great spot and very helpful diagnosis.

I've committed a likely fix locally. This doesn't attempt to prevent the races which cause resubscribes. It just ensures that, after a successful call to UpdateSequenceNumber, this event will be the next to be processed and notified to client code.

OK, thanks.

Quote:I don't plan on doing anything about the races which cause resubscribes. As well as the thread scheduling causes you've found, the same symptoms can be prompted by the network stack. This latter cause is out of our control, so there's limited benefit in worrying about the subset of resubscribe races we potentially could control.

I don't quite understand how the network stack could cause this. Are you saying that if ohNet were to accept a sequence of connections on a single thread (or using a mutex to serialize multiple threads), these accepts might not reflect the order in which the TCP messages were received by the network stack? Doesn't TCP guarantee ordered delivery?
04-02-2013, 05:06 PM
Post: #4
RE: Race conditions in EventServerUpnp
(04-02-2013 04:58 PM)simoncn Wrote:  I don't quite understand how the network stack could cause this. Are you saying that if ohNet were to accept a sequence of connections on a single thread (or using a mutex to serialize multiple threads), these accepts might not reflect the order in which the TCP messages were received by the network stack? Doesn't TCP guarantee ordered delivery?

Each event server thread represents a different TCP connection between device and control point. The UPnP protocol which is layered over TCP may treat some of these streams as related but, as far as TCP is concerned, each of these connections is entirely separate.

I'm not sure that TCP guarantees any ordering of delivery for these multiple, effectively independent streams.
04-02-2013, 05:53 PM (This post was last modified: 04-02-2013 05:57 PM by simoncn.)
Post: #5
RE: Race conditions in EventServerUpnp
(04-02-2013 05:06 PM)simonc Wrote:  Each event server thread represents a different TCP connection between device and control point. The UPnP protocol which is layered over TCP may treat some of these streams as related but, as far as TCP is concerned, each of these connections is entirely separate.

I'm not sure that TCP guarantees any ordering of delivery for these multiple, effectively independent streams.

I think ordering of connection establishments is guaranteed as long as the server waits for the first connection (used for sending message n) to become established before it initiates a second connection (used for sending message n+1). This is because TCP connection establishment requires a handshake from both the client and server sides. This means there is no possibility of connection establishment being reordered on the control point side as long as the server uses a sequence like the following:

1) Establish connection n
2) Send message n using connection n
3) Establish connection n+1
4) Send message n+1 using connection n+1
etc.

The key point is that the sequence of connection establishments is guaranteed to be ordered, not the ordering of TCP messages sent on different connections. So the control point might receive message n+1 on connection n+1 before it has received message n on connection n, but this won't matter as long as the control point reads and processes messages from the first connection that was accepted (connection n) before it reads and processes messages from the second connection that was accepted (connection n+1).
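In other words, the control point only has to preserve accept order when it decides which connection to read and process first. A minimal illustration of that rule (invented names, not ohNet code):

Code:
// Illustrative sketch only -- invented names, not ohNet code. Even if message
// n+1 arrives on the wire before message n, connection n was accepted first, so
// processing accepted connections strictly in accept order keeps the messages
// ordered; TCP simply buffers the bytes on connection n+1 until they are read.
class Connection
{
public:
    void ReadAndProcessMessage() { /* read the NOTIFY request and dispatch it */ }
};

class EventAcceptor
{
public:
    void AcceptLoop()
    {
        for (;;) {
            Connection* conn = Accept();       // connection n, in establishment order
            conn->ReadAndProcessMessage();     // finish message n before touching connection n+1
            delete conn;
        }
    }
private:
    Connection* Accept() { return new Connection(); /* stub for a blocking accept */ }
};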
08-02-2013, 01:53 PM
Post: #6
RE: Race conditions in EventServerUpnp
(04-02-2013 11:51 AM)simonc Wrote:  Thanks, another great spot and very helpful diagnosis.

I've committed a likely fix locally. This doesn't attempt to prevent the races which cause resubscribes. It just ensures that, after a successful call to UpdateSequenceNumber, this event will be the next to be processed and notified to client code.

I haven't seen this fix (or the related fix to WaitForPendingAdd) appear on Github yet. Will these two fixes be available soon?
08-02-2013, 01:57 PM
Post: #7
RE: Race conditions in EventServerUpnp
(08-02-2013 01:53 PM)simoncn Wrote:  
(04-02-2013 11:51 AM)simonc Wrote:  Thanks, another great spot and very helpful diagnosis.

I've committed a likely fix locally. This doesn't attempt to prevent the races which cause resubscribes. It just ensures that, after a successful call to UpdateSequenceNumber, this event will be the next to be processed and notified to client code.

I haven't seen this fix (or the related fix to WaitForPendingAdd) appear on Github yet. Will these two fixes be available soon?

We've had problems with our automated test / publication process. These are hopefully resolved now; I'll push the code manually if it doesn't make it to github over the weekend.

Note that I've changed (and may have broken) one of your fixes. The changes to CpiSubscriptionManager::WaitForPendingAdd weren't quite right as they resulted in us deleting a semaphore while a different thread was waiting on it. I've changed this, but I'm not convinced I haven't reintroduced a variant of the original problem. I'll look at this more next week.
14-02-2013, 01:40 PM
Post: #8
RE: Race conditions in EventServerUpnp
(08-02-2013 01:57 PM)simonc Wrote:  Note that I've changed (and may have broken) one of your fixes. The changes to CpiSubscriptionManager::WaitForPendingAdd weren't quite right as they resulted in us deleting a semaphore while a different thread was waiting on it. I've changed this, but I'm not convinced I haven't reintroduced a variant of the original problem. I'll look at this more next week.

It looks like the code here is now back to how it was originally, so this would have broken my attempted fix. :)

I've had another look at this. Instead of trying to do a quick fix with a couple of lines of code, I've taken a step back to think about what's needed to make this mechanism work reliably in all cases. From this, I've produced a patch (attached), which I've tested as far as I'm able.

The main change is to split the RemovePendingAdd(const Brx& aSid) method into two separate methods:

1) a method RemovePendingAdd(PendingSubscription* aSubscription) which removes a single specified subscription object.

2) a method RemovePendingAdds(const Brx& aSid) which removes all subscription objects matching a specified Sid.

The calls to the existing RemovePendingAdd(const Brx& aSid) method are allocated to these two methods as follows:

a) CpiSubscriptionManager::WaitForPendingAdd calls RemovePendingAdd(pending) to remove the same subscription object that it created earlier in the method, without removing any other subscription objects.

b) CpiSubscriptionManager::Add calls RemovePendingAdds(sid) because it needs to signal all threads that are waiting on this Sid. It is currently signalling only a single waiting thread.

c) CpiSubscriptionManager::HandleInterfaceChange calls RemovePendingAdd(iPendingSubscriptions[0]) from within the loop to remove all subscriptions for all waiting threads.

I've left the delete operation in its current position in WaitForPendingAdd. This is sufficient to delete all removed PendingSubscription objects, because calling RemovePendingAdd(s) will signal the appropriate thread(s) and each signalled thread will delete its own PendingSubscription object.
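For reference, the two methods end up looking roughly like this. This is reconstructed from the description above rather than copied from the patch, and details such as the container used for iPendingSubscriptions, the Sid()/Signal() accessors and the locking (which the real code holds around the list) are my guesses at the ohNet internals:

Code:
// Reconstructed sketch of the split, not the actual patch. Locking of
// iPendingSubscriptions is assumed to be handled by the enclosing code.
void CpiSubscriptionManager::RemovePendingAdd(PendingSubscription* aSubscription)
{
    // Remove exactly one pending-add entry and wake the single thread waiting on it.
    for (auto it = iPendingSubscriptions.begin(); it != iPendingSubscriptions.end(); ++it) {
        if (*it == aSubscription) {
            iPendingSubscriptions.erase(it);
            aSubscription->Signal();   // the signalled thread deletes its own PendingSubscription
            break;
        }
    }
}

void CpiSubscriptionManager::RemovePendingAdds(const Brx& aSid)
{
    // Remove every pending-add entry matching aSid and wake all threads waiting
    // on it (the old single-argument RemovePendingAdd woke only one of them).
    for (auto it = iPendingSubscriptions.begin(); it != iPendingSubscriptions.end(); ) {
        if ((*it)->Sid().Equals(aSid)) {
            PendingSubscription* pending = *it;
            it = iPendingSubscriptions.erase(it);
            pending->Signal();         // again, deletion is left to the waiting thread
        }
        else {
            ++it;
        }
    }
}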


Attached File(s)
.zip  PendingAdd.zip (Size: 699 bytes / Downloads: 2)
14-02-2013, 03:55 PM
Post: #9
RE: Race conditions in EventServerUpnp
(14-02-2013 01:40 PM)simoncn Wrote:  I've produced a patch (attached), which I've tested as far as I'm able.

Thanks! Looking at this now...
14-02-2013, 05:20 PM
Post: #10
RE: Race conditions in EventServerUpnp
(14-02-2013 03:55 PM)simonc Wrote:  
(14-02-2013 01:40 PM)simoncn Wrote:  I've produced a patch (attached), which I've tested as far as I'm able.

Thanks! Looking at this now...

Now committed. I changed the patch very slightly (RemovePendingAdds iterates from end to start of its list rather than tweaking a loop counter), but it's otherwise as you suggested.

Thanks again for taking the time to do this!