Breaking java.lang.String
Jul 11, 2023
Let's abuse a bug in java.lang.String
to make some weird Strings.
We'll make "hello world" not start with "hello", and show that not all empty Strings are equal to each other.
Intro: Equality among Strings
Before we get started, we look at what it takes for two Strings to be equal in the JDK.
Why is "foo".equals("fox")
false?
Because Strings are compared character by character, and the third character of those Strings is different.
Why is "foo".equals("foo")
true?
You might think that in this case the Strings are also compared character by character.
But String literals are interned.
When the same String appears as a constant multiple times in the source code, it's not just another String with the same content.
It is the same instance of String.
And the first thing that String.equals does is if (this == anObject) { return true; }
, so it doesn't even look at the contents.
Why is "foo!".equals("foo⁉")
false?
Since JDK 9 (since JEP 254: Compact Strings), a String represents its content internally as a byte array.
"foo!" only contains simple characters, with a codepoint less than 256. The String class internally encodes such values using the latin-1 encoding, with one byte per character.
"foo⁉" contains a character (⁉) that cannot be represented using latin-1, so it encodes the whole String using UTF-16, with two bytes per character.
The String.coder
field keeps track of which of the two encodings was used.
When comparing two Strings with a different coder
, then String.equals
always returns false.
It does not even look at the contents, because if one String can be represented in latin-1, and the other one can not, then they can't be the same. Or can they?
Note: The compact strings feature can be disabled, but it's enabled by default. This blog post assumes it is enabled.
Creating a broken String
How are Strings created? How exactly does java.lang.String
choose to use latin-1 or not?
Strings can be created in multiple ways, we'll focus here on the String constructor that takes a char[]
.
It first tries to encode the characters as latin-1 using StringUTF16.compress
. If that fails, it returns null
and the constructor falls back to using UTF-16.
Here is a simplified version of how that is implemented.
(For readability I removed irrelevant indirections, checks and arguments from the actual implementation
here
and here)
There is the bug. This code does not always preserve the assumptions that String.equals
makes that we talked about above.
Do you see it?
The javadoc points out that "subsequent modification of the character array does not affect the newly created string".
But how about concurrent modification?
The String constructor has a race condition.
The contents of value
may have changed between trying to encode it as latin-1, and encoding it as UTF-16.
That way we can end up with a String that only contains latin-1 characters, but is encoded as UTF-16.
We can trigger that race condition like this:
The broken Strings we can create this way have some interesting properties. Strange. I didn't expect this weird behaviour from such a fundamental Java class.
Spooky action at a distance
Just because we can, let's have some more fun with this. We can not only create a broken String, we can also break a String at a distance, in another class.
If we write that in IntelliJ then it warns us that Result of '"hello world".startsWith("hello")' is always 'true'.
This code doesn't even take any input, but we can still make it print false
, by injecting a broken "hello" into it, through interning:
We break a String containing hello before any other code has literally mentioned or explicitly interned it, and we intern that broken version.
That way, we have broken every "hello"
String literal in our JVM.
Challenge: Empty or not?
Using our breakIt method, we can create an equivalent but not-actually-equal String of any latin-1 String. But it doesn't work for the empty String, because that one doesn't have any characters to trigger the race condition. Yet, we can still create a broken empty String. I'll leave that as a challenge to the reader.
Concretely:
Can you make a java.lang.String
object, for which this is true: s.isEmpty() && !s.equals("")
.
No cheating: You're only allowed to use public APIs for this, e.g. no .setAccessible
to access privates, no instrumentation.
If you find it, let me know here.
I'll update this page with submitted answers later.
Submissions are now closed. If you still want to challenge yourself to find it, don't read ahead yet.
Challenge solutions
The simplest way to make a broken empty String, is by using breakIt(" ").trim()
. That is because the trim method correctly assumes that if the original string contained non-latin-1 characters, then the result of trimming (which only removed latin-1 characters) must still have non-latin-1 characters.
This answer was submitted by multiple people: Zac Kologlu, Jan, ichttt, Robert (who correctly pointed out that my "> 256" check was off by one). Nice!
I also got two original solutions that use StringBuilder, and only work on JDK 19.
Ihor Herasymenko submitted this code which triggers a race condition with StringBuilder.deleteCharAt. Great find!
Ihor's solution using deleteCharAt (click to expand)
I saved the best for last. Xavier Cooney came up with this gem, that does not even involve any race condition.
It throws an exception from a CharSequence.charAt
to put a StringBuilder
in an inconsistent state.
That looks like another JDK bug to me. Amazing.
Xavier's solution throwing from CharSequence.charAt (click to expand)
Thanks for these interesting solutions!
Discussion about this post: Twitter, Mastodon, Reddit, Hacker News
JDK bug report: JDK-8311906