UTF-8 problem in Java bindings
06-02-2014, 04:26 PM
Post: #1
UTF-8 problem in Java bindings
The Java bindings don't handle some UTF-8 characters correctly. This is because the code in DvInvocation.c for InvocationWriteString and InvocationReadString uses the JNI functions GetStringUTFChars and NewStringUTF to convert between Java strings and C strings.

These APIs use a modified form of UTF-8 for the C strings. This modified UTF-8 has two differences from standard UTF-8:

1) A Unicode surrogate pair is encoded as a 6-byte UTF-8 sequence by encoding the high surrogate and low surrogate parts separately as two 3-byte sequences. The correct way to encode these characters is by combining the high surrogate and low surrogate parts into a single Unicode character and encoding that character as a 4-byte UTF-8 sequence.

2) The NUL character (0x00) is encoded as the two byte sequence 0xC0 0x80 instead of the one-byte sequence 0x00.
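Both differences can be observed from Java itself, because DataOutputStream.writeUTF also produces modified UTF-8. The sketch below (class name is just for illustration) encodes the same string both ways and compares the byte counts:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) throws IOException {
        // 'A', a NUL, and U+1F600 (a supplementary character, stored in
        // Java as the surrogate pair D83D DE00).
        String s = "A\u0000\uD83D\uDE00";

        // Standard UTF-8: NUL is one byte, U+1F600 is four bytes.
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("standard UTF-8: " + standard.length + " bytes"); // 1+1+4 = 6

        // Modified UTF-8, as produced by writeUTF (and by the JNI function
        // GetStringUTFChars): NUL becomes 0xC0 0x80 and each surrogate half
        // becomes a separate 3-byte sequence.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        int modified = bos.size() - 2; // skip writeUTF's 2-byte length prefix
        System.out.println("modified UTF-8: " + modified + " bytes"); // 1+2+3+3 = 9
    }
}
```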

The first difference has been reported as a bug by a user, and I have verified it. It causes Kinsky to crash when browsing tracks served by MinimServer whose names contain Unicode surrogate pairs. See this thread for confirmation of this issue.

To fix this, the code in DvInvocation.c for InvocationWriteString and InvocationReadString needs to be changed to call Java methods that encode and decode these strings correctly.
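One way to structure the Java side of this (the class and method names here are illustrative, not necessarily what the patch uses) is a pair of static helpers that the native code invokes via JNI in place of GetStringUTFChars and NewStringUTF:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the actual patch may organise this differently.
public final class Utf8Codec {
    private Utf8Codec() {}

    // Called from native code instead of GetStringUTFChars: returns
    // standard UTF-8 bytes. No NUL terminator is appended, so the
    // native side must carry an explicit length alongside the bytes.
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Called from native code instead of NewStringUTF: decodes a
    // standard UTF-8 byte sequence (surrogate pairs are reassembled
    // from 4-byte sequences automatically).
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With this shape, the C side only ever handles (pointer, length) pairs, which is what motivates the "...AsBuffer" API discussed below.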

As a side-effect of making this change, the encoded C strings might now contain embedded 0x00 characters, which could cause them to be truncated prematurely by the C bindings APIs that are used by the Java bindings. This requires adding one new API function to the C bindings that accepts a length for the encoded byte sequence instead of assuming that it is terminated by the first 0x00 character.

The new API function is named DvInvocationWriteStringAsBuffer and is an analogue of the existing function DvInvocationReadStringAsBuffer. By changing the Java bindings to use the "...AsBuffer" forms of these two calls, the possibility of premature truncation by an embedded 0x00 character is eliminated.
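The truncation hazard is easy to demonstrate by simulating in Java what a strlen-style scan does to a correctly encoded string containing U+0000 (the scan below is a sketch of the C behaviour, not ohNet code):

```java
import java.nio.charset.StandardCharsets;

public class TruncationDemo {
    public static void main(String[] args) {
        // Standard UTF-8 encodes U+0000 as a literal 0x00 byte.
        byte[] utf8 = "ab\u0000cd".getBytes(StandardCharsets.UTF_8);

        // A NUL-terminated C API effectively measures the string like this:
        int strlen = 0;
        while (strlen < utf8.length && utf8[strlen] != 0) strlen++;

        System.out.println("explicit length:     " + utf8.length); // 5
        System.out.println("strlen-style length: " + strlen);      // 2 - "cd" is lost
    }
}
```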

I checked what the C# bindings do about this. These have been coded to prevent the possibility of premature truncation by calling Brn methods that take an explicit length instead of using implicit NUL termination. This change to the C bindings allows Java to handle this case in the same way as C#.

I am attaching a patch with the changes. This is (hopefully) the last in the current series of patches. :-)


Attached File(s)
.zip  utf8.zip (Size: 2.39 KB / Downloads: 3)
06-02-2014, 04:36 PM
Post: #2
RE: UTF-8 problem in Java bindings
Thanks for diagnosing this. I'll try to get the fix integrated today; failing that, it'll go in tomorrow at the latest.

Out of interest, can you point me to the spec that mandates embedding nul characters in the middle of a UTF-8 sequence please? I'm happy to assume you're correct and integrate the diff but I'd be interested to understand the need for the change better... I have vague memories of some libraries using utf8 for all text and assuming that a nul-terminated char* can always describe all utf8-encoded strings. If these libraries are all slightly wrong, it'd be interesting to understand where/why.
06-02-2014, 05:31 PM (This post was last modified: 06-02-2014 05:33 PM by simoncn.)
Post: #3
RE: UTF-8 problem in Java bindings
(06-02-2014 04:36 PM)simonc Wrote:  Thanks for diagnosing this. I'll try to get the fix integrated today; failing that, it'll go in tomorrow at the latest.

Out of interest, can you point me to the spec that mandates embedding nul characters in the middle of a UTF-8 sequence please? I'm happy to assume you're correct and integrate the diff but I'd be interested to understand the need for the change better... I have vague memories of some libraries using utf8 for all text and assuming that a nul-terminated char* can always describe all utf8-encoded strings. If these libraries are all slightly wrong, it'd be interesting to understand where/why.

You might be right about this.

The UPnP Device Architecture 1.1 document (section 2.5, page 53) says under <dataType>:

REQUIRED. Same as data types defined by XML Schema, Part 2: Datatypes.

....

string
Unicode string. No limit on length.


Following the 'XML Schema, Part 2: Datatypes' reference leads to this section:

3.2.1 string

[Definition:] The string datatype represents character strings in XML. The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)]. A character is an atomic unit of communication; it is not further specified except to note that every character has a corresponding Universal Character Set code point, which is an integer.


Following the 'XML 1.0 (Second Edition)' reference leads to this section:

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]

Character Range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */


This seems to indicate that #x0 would not be legal as a character in a UPnP service state variable of type string, along with #x1, #x2, #x3, etc.
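The Char production quoted above translates directly into a predicate over code points; the method below is just an illustration of that rule:

```java
public class XmlCharCheck {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x0));     // false - NUL is excluded
        System.out.println(isXmlChar(0x1));     // false - as are other C0 controls
        System.out.println(isXmlChar(0x1F600)); // true - supplementary chars are legal
    }
}
```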

My conclusion from this is that there is no problem with the current C implementation of ohNet. It would also be possible to avoid adding the new API that I have included in the patch, but this would have performance implications: the Java bindings would need an extra copy and memory allocation to append a NUL character to the UTF-8 byte sequence obtained from the Java getBytes() method, and the Brn class would then need to scan the string to find its length. These strings can be very long for some Browse responses.
06-02-2014, 10:16 PM
Post: #4
RE: UTF-8 problem in Java bindings
(06-02-2014 04:36 PM)simonc Wrote:  I have vague memories of some libraries using utf8 for all text and assuming that a nul-terminated char* can always describe all utf8-encoded strings. If these libraries are all slightly wrong, it'd be interesting to understand where/why.

To clarify my previous post, the comments there about legal characters are specific to XML strings and shouldn't (I believe) be generalised to all UTF-8-encoded strings. The code point U+0000 is a legal Unicode code point, so it would be legal in a UTF-8-encoded Unicode string unless specifically excluded, as it has been by the XML specification.

