Working Around JNI UTF-8 Strings

tubo posted @ September 3, 2014 16:13 in Uncategorized, 747 reads

class Example {
  ...
  private static native void printString(String text);
  ...
  void examplePrintString() {
    String str = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";
    System.out.println("String = " + str);
    printString(str);
  }
}

To access the string, C++ needs to retrieve the bytes of the string using a function from the JNI library, GetStringUTFChars(), like so:

JNIEXPORT void JNICALL Java_Example_printString(JNIEnv *env, jclass, jstring text) {
  const char* text_input = env->GetStringUTFChars(text, NULL);
  for (int i = 0; text_input[i] != 0; ++i) {
    printf("jni[%d] = %x\n", i, ((unsigned char *) text_input)[i]);
  }
  env->ReleaseStringUTFChars(text, text_input);
}

In a sample run, I get the following output:

String = AêñüC
jni[0] = 41
jni[1] = c3
jni[2] = aa
jni[3] = c3
jni[4] = b1
jni[5] = c3
jni[6] = bc
jni[7] = 43

The five-character string “AêñüC” is encoded in eight bytes under UTF-8, because three of the characters occupy two bytes each.

Now this works fine in this example. What isn’t yet apparent is that UTF-8 strings generated by JNI are not standard, but instead are modified UTF-8. According to the JNI spec:

There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.

If there’s a technical reason JNI does not use standard UTF-8 format, I have not seen a discussion, and I cannot fathom why. A case may be made for the non-embedded nulls, but that’s easy to work around by relying on a length variable instead of null to mark the end. The avoidance of four-byte UTF-8 characters seems more mysterious.
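The embedded-null difference can be observed from plain Java, without any native code. The following is a minimal sketch, assuming that the modified UTF-8 documented for DataOutputStream.writeUTF is the same format GetStringUTFChars returns. The class and helper names (EmbeddedNullDemo, modifiedUtf8, hex) are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EmbeddedNullDemo {
  // Encodes s as modified UTF-8 with DataOutputStream.writeUTF, then drops
  // the two-byte length prefix that writeUTF writes ahead of the data.
  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(s);
    } catch (IOException e) {
      // writeUTF only fails here if the encoded form exceeds 65535 bytes.
      throw new UncheckedIOException(e);
    }
    byte[] withPrefix = buffer.toByteArray();
    return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
  }

  // Formats bytes as space-separated two-digit hex values.
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = "A\0B"; // three chars with a null in the middle
    System.out.println("standard: " + hex(s.getBytes(StandardCharsets.UTF_8)));
    System.out.println("modified: " + hex(modifiedUtf8(s)));
  }
}
```

The standard encoding comes out as 41 00 42 and the modified one as 41 c0 80 42. A loop like the one in Java_Example_printString, which stops at the first zero byte, would see only “A” in the standard form but all four bytes in the modified form.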

Here’s an example of passing a valid four-byte character. The Java routine now passes in the following string:

class Example {
  ...
  void examplePrintString() {
    byte[] bb = new byte[4];
    bb[0] = (byte) 0xf0;
    bb[1] = (byte) 0xa0;
    bb[2] = (byte) 0x9c;
    bb[3] = (byte) 0x8e;
    String str = new String(bb, java.nio.charset.StandardCharsets.UTF_8); // Charset overload: no checked exception
    System.out.println("String = " + str);
    printString(str);
  }
}

And the output is now:

String = <unprintable>*
jni[0] = ed
jni[1] = a1
jni[2] = 81
jni[3] = ed
jni[4] = bc
jni[5] = 8e

* This blog can’t handle that character.

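The Java example sets the four bytes of the character explicitly, so it is obvious that this character was converted to a six-byte sequence: two three-byte sequences, one for each half of the UTF-16 surrogate pair that Java uses to represent the character.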
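The bytes in that output can be reproduced by hand. This is a small sketch, not JNI's actual implementation: it splits the code point U+2070E (the character whose standard UTF-8 form is f0 a0 9c 8e) into its UTF-16 surrogate pair and pushes each surrogate through the ordinary three-byte UTF-8 bit layout. The class and method names are hypothetical.

```java
class SurrogateBytes {
  // Encodes one UTF-16 code unit in the three-byte UTF-8 bit layout
  // (1110xxxx 10xxxxxx 10xxxxxx), which is what the "two-times-three-byte
  // format" applies to each surrogate.
  static int[] threeByteForm(char unit) {
    return new int[] {
      0xE0 | (unit >> 12),
      0x80 | ((unit >> 6) & 0x3F),
      0x80 | (unit & 0x3F)
    };
  }

  public static void main(String[] args) {
    // U+2070E lies outside the Basic Multilingual Plane, so Java stores it
    // as two chars: a high surrogate followed by a low surrogate.
    char[] units = Character.toChars(0x2070E);
    for (char unit : units) {
      int[] b = threeByteForm(unit);
      System.out.printf("U+%04X -> %x %x %x%n", (int) unit, b[0], b[1], b[2]);
    }
  }
}
```

This prints U+D841 -> ed a1 81 and U+DF0E -> ed bc 8e, matching the six bytes the native side received.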

Suppose your native function relies on a string-processing library to manipulate the strings it receives from Java, and suppose that library expects and produces standard UTF-8 (why would it not use the standard?). Faced with non-standard or, more politely, “modified” encoding, it may react unpredictably. At best, it discards the characters it can’t interpret. At worst, it crashes. The same problem exists in the other direction: when passing strings from native code back to Java, the JNI can crash if the bytes are not correctly modified UTF-8.
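To get a feel for how a standards-following library might treat these bytes, Java's own UTF-8 decoder can serve as a stand-in. The sketch below is only an illustration under that assumption; a native library may behave differently. It configures a CharsetDecoder to report malformed input instead of silently replacing it, and the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
  // Returns true only if the bytes decode cleanly as standard UTF-8.
  static boolean isStandardUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false; // the decoder rejected some byte sequence
    }
  }

  public static void main(String[] args) {
    // The same character in its standard and its modified UTF-8 form.
    byte[] standard = {(byte) 0xf0, (byte) 0xa0, (byte) 0x9c, (byte) 0x8e};
    byte[] modified = {(byte) 0xed, (byte) 0xa1, (byte) 0x81,
                       (byte) 0xed, (byte) 0xbc, (byte) 0x8e};
    System.out.println("standard form accepted: " + isStandardUtf8(standard));
    System.out.println("modified form accepted: " + isStandardUtf8(modified));
  }
}
```

The four-byte form is accepted, while the six-byte form is rejected because a standard decoder does not allow surrogates encoded as individual three-byte sequences.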

Chances are you’d never encounter this lurking problem, because four-byte characters seem sufficiently rare. But I wouldn’t want to rely on their scarcity to avoid a potential bug. As I’ve learned from running code that drives popular websites, once code runs on enough data, even the unlikeliest bugs become commonplace.

So how to work around this without needing to write a converter in native code? Well, it turns out converting to UTF-8 in Java (as opposed to JNI) produces standard encoding. Therefore, the workaround is to convert in Java, and send a byte array in lieu of a String.

Now, the Java example looks like:

class Example {
  ...
  private static native void printBytes(byte[] text);
  ...
  void examplePrintString() {
    byte[] bb = new byte[4];
    bb[0] = (byte) 0xf0;
    bb[1] = (byte) 0xa0;
    bb[2] = (byte) 0x9c;
    bb[3] = (byte) 0x8e;
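    String str = new String(bb, java.nio.charset.StandardCharsets.UTF_8);
    System.out.println("String = " + str);
    // Do the conversion here; the Charset overloads throw no checked exception.
    printBytes(str.getBytes(java.nio.charset.StandardCharsets.UTF_8));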
  }
}

JNIEXPORT void JNICALL Java_Example_printBytes(JNIEnv *env, jclass, jbyteArray text) {
  jbyte* text_input = env->GetByteArrayElements(text, NULL);
  jsize size = env->GetArrayLength(text);
  for (int i = 0; i < size; ++i) {
    printf("bytes[%d] = %x\n", i, ((const unsigned char *) text_input)[i]);
  }
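  // The last argument is a jint mode, not a pointer. JNI_ABORT frees the
  // buffer without copying back, which is fine because we only read it.
  env->ReleaseByteArrayElements(text, text_input, JNI_ABORT);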
}

Now this prints the following expected four bytes:

String = <unprintable>
bytes[0] = f0
bytes[1] = a0
bytes[2] = 9c
bytes[3] = 8e

When using a UTF-8 library in JNI, I prefer to pass data from Java as a byte array rather than a String.
