Breaking java.lang.String

Jul 11, 2023

Let's abuse a bug in java.lang.String to make some weird Strings. We'll make "hello world" not start with "hello", and show that not all empty Strings are equal to each other.

Intro: Equality among Strings

Before we get started, we look at what it takes for two Strings to be equal in the JDK.

Why is "foo".equals("fox") false?

Because Strings are compared character by character, and the third character of those Strings is different.

Why is "foo".equals("foo") true?

You might think that in this case the Strings are also compared character by character. But String literals are interned. When the same String appears as a constant multiple times in the source code, it's not just another String with the same content. It is the same instance of String. And the first thing that String.equals does is if (this == anObject) { return true; }, so it doesn't even look at the contents.

Why is "foo!".equals("foo⁉") false?

Since JDK 9 (since JEP 254: Compact Strings), a String represents its content internally as a byte array. "foo!" only contains simple characters, with a codepoint less than 256. The String class internally encodes such values using the latin-1 encoding, with one byte per character. "foo⁉" contains a character (⁉) that cannot be represented using latin-1, so it encodes the whole String using UTF-16, with two bytes per character. The String.coder field keeps track of which of the two encodings was used. When comparing two Strings with a different coder, then String.equals always returns false. It does not even look at the contents, because if one String can be represented in latin-1, and the other one can not, then they can't be the same. Or can they?

Note: The compact strings feature can be disabled, but it's enabled by default. This blog post assumes it is enabled.

Creating a broken String

How are Strings created? How exactly does java.lang.String choose to use latin-1 or not?

Strings can be created in multiple ways, we'll focus here on the String constructor that takes a char[]. It first tries to encode the characters as latin-1 using StringUTF16.compress. If that fails, it returns null and the constructor falls back to using UTF-16. Here is a simplified version of how that is implemented. (For readability I removed irrelevant indirections, checks and arguments from the actual implementation here and here)

/**
 * Allocates a new {@code String} so that it represents the sequence of
 * characters currently contained in the character array argument. The
 * contents of the character array are copied; subsequent modification of
 * the character array does not affect the newly created string.
 */
public String(char value[]) {
  byte[] val = StringUTF16.compress(value);
  if (val != null) {
    this.value = val;
    this.coder = LATIN1;
    return;
  }
  this.coder = UTF16;
  this.value = StringUTF16.toBytes(value);
}

There is the bug. This code does not always preserve the assumptions that String.equals makes that we talked about above. Do you see it?

The javadoc points out that "subsequent modification of the character array does not affect the newly created string". But how about concurrent modification? The String constructor has a race condition. The contents of value may have changed between trying to encode it as latin-1, and encoding it as UTF-16. That way we can end up with a String that only contains latin-1 characters, but is encoded as UTF-16. We can trigger that race condition like this:

/**
 * Given a latin-1 String, creates a copy that is
 * incorrectly encoded as UTF-16.
 */
static String breakIt(String original) {
  if (original.chars().max().orElseThrow() > 256) {
    throw new IllegalArgumentException(
        "Can only break latin-1 Strings");
  }

  char[] chars = original.toCharArray();

  // in another thread, flip the first character back
  // and forth between being encodable as latin-1 or not
  Thread thread = new Thread(() -> {
    while (!Thread.interrupted()) {
      chars[0] ^= 256;
    }
  });
  thread.start();

  // at the same time call the String constructor,
  // until we hit the race condition
  while (true) {
    String s = new String(chars);
    if (s.charAt(0) < 256 && !original.equals(s)) {
      thread.interrupt();
      return s;
    }
  }
}

The broken Strings we can create this way have some interesting properties.

String a = "foo";
String b = breakIt(a);

// they are not equal to each other
System.out.println(a.equals(b));
// => false

// they do contain the same series of characters
System.out.println(Arrays.equals(a.toCharArray(),
    b.toCharArray()));
// => true

// compareTo does consider them equal (even though its javadoc
// specifies that "compareTo returns 0 exactly when the
// equals(Object) method would return true")
System.out.println(a.compareTo(b));
// => 0

// they have the same length, and one startsWith the other,
// but not the other way around (because if it wasn't broken,
// a latin-1 string cannot start with a non-latin-1
// substring).
System.out.println(a.length() == b.length());
// => true
System.out.println(b.startsWith(a));
// => true
System.out.println(a.startsWith(b));
// => false
Strange. I didn't expect this weird behaviour from such a fundamental Java class.

Spooky action at a distance

Just because we can, let's have some more fun with this. We can not only create a broken String, we can also break a String at a distance, in another class.

class OtherClass {
  static void startWithHello() {
    System.out.println("hello world".startsWith("hello"));
  }
}

If we write that in IntelliJ then it warns us that Result of '"hello world".startsWith("hello")' is always 'true'. This code doesn't even take any input, but we can still make it print false, by injecting a broken "hello" into it, through interning: We break a String containing hello before any other code has literally mentioned or explicitly interned it, and we intern that broken version. That way, we have broken every "hello" String literal in our JVM.

breakIt("hell".concat("o")).intern();
OtherClass.startWithHello(); // prints false

Challenge: Empty or not?

Using our breakIt method, we can create an equivalent but not-actually-equal String of any latin-1 String. But it doesn't work for the empty String, because that one doesn't have any characters to trigger the race condition. Yet, we can still create a broken empty String. I'll leave that as a challenge to the reader.

Concretely: Can you make a java.lang.String object, for which this is true: s.isEmpty() && !s.equals(""). No cheating: You're only allowed to use public APIs for this, e.g. no .setAccessible to access privates, no instrumentation.

If you find it, let me know here. I'll update this page with submitted answers later.

Submissions are now closed. If you still want to challenge yourself to find it, don't read ahead yet.

Challenge solutions

The simplest way to make a broken empty String, is by using breakIt(" ").trim(). That is because the trim method correctly assumes that if the original string contained non-latin-1 characters, then the result of trimming (which only removed latin-1 characters) must still have non-latin-1 characters. This answer was submitted by multiple people: Zac Kologlu, Jan, ichttt, Robert (who correctly pointed out that my "> 256" check was off by one). Nice!

I also got two original solutions that use StringBuilder, and only work on JDK 19. Ihor Herasymenko submitted this code which triggers a race condition with StringBuilder.deleteCharAt. Great find!

Ihor's solution using deleteCharAt (click to expand)
public class BrokenEmptyStringChallenge {
    public static void main(String[] args) {
        String s = breakIt();
        System.out.println("s.isEmpty() && !s.equals(\"\") = " + (s.isEmpty() && !s.equals("")));
    }

    static String breakIt() {
        String notLatinString = "\u0457";
        AtomicReference<StringBuilder> sb = new AtomicReference<>(new StringBuilder(notLatinString));

        Thread thread = new Thread(() -> {
            while (!Thread.interrupted()) {
                sb.get().deleteCharAt(0);
                sb.set(new StringBuilder(notLatinString));
            }
        });
        thread.start();

        while (true) {
            String s = sb.get().toString();
            if (s.isEmpty() && !s.equals("")) {
                thread.interrupt();
                return s;
            }
        }
    }

}

I saved the best for last. Xavier Cooney came up with this gem, that does not even involve any race condition. It throws an exception from a CharSequence.charAt to put a StringBuilder in an inconsistent state. That looks like another JDK bug to me. Amazing.

Xavier's solution throwing from CharSequence.charAt (click to expand)

Gist

// requires Java 19+

class WatSequence implements CharSequence {
    public CharSequence subSequence(int start, int end) {
        return this;
    }
    public int length() {
        return 2;
    }
    public char charAt(int index) {
        // look Ma, no concurrency!
        if (index == 1) throw new RuntimeException();
        return '⁉';
    }
    public String toString() {
        return "wat";
    }
}

class Wat {
    static String wat() {
        if (Runtime.version().feature() < 19) {
            throw new RuntimeException("this doesn't work pre java-19 :(");
        }

        StringBuilder sb = new StringBuilder();
        try {
            sb.append(new WatSequence());
        } catch (RuntimeException e) {}

        return new String(sb);
    }

    public static void main(String[] args) {
        String s = wat();
        System.out.println(s.isEmpty() && !s.equals(""));
    }
}

Thanks for these interesting solutions!

Discussion about this post: Twitter, Mastodon, Reddit, Hacker News
JDK bug report: JDK-8311906