Java 21 Regex Word-boundary matcher Unicode change - Stack Overflow

I noticed that the semantics of the Java Regex word-boundary matcher \b changed significantly with Java 21. Up until (at least) Java 17, it used to support Unicode, so the regex a\b.* DID NOT match the string "aß".

Apparently with Java 21 it is now defined in terms of the \w character class, which by default is not Unicode-enabled. So now a\b.* suddenly DOES match "aß". The only way I can see to "fix" \b is to enable the UNICODE_CHARACTER_CLASS flag, but that of course changes ALL the character classes, which is also different from the pre-Java-21 behavior.
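To make the trade-off concrete, here is a minimal sketch (the class and method names are made up for illustration). It compiles the same pattern with and without UNICODE_CHARACTER_CLASS and also checks \w on its own, which shows that the flag repairs \b for "aß" but changes \w at the same time.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class WordBoundaryFlagDemo {
    // The pattern from the question, with default flags and with the Unicode flag.
    static final Pattern DEFAULT = Pattern.compile("a\\b.*");
    static final Pattern UNICODE =
            Pattern.compile("a\\b.*", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesDefault(String s) {
        return DEFAULT.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE.matcher(s).matches();
    }

    // \w alone, to show the side effect of the flag on the character classes.
    static boolean isWordCharDefault(String s) {
        return Pattern.compile("\\w").matcher(s).matches();
    }

    static boolean isWordCharUnicode(String s) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // This first result depends on the Java version (see the MWE).
        System.out.println("default a\\b.* on \"aß\": " + matchesDefault("aß"));
        // With the flag, ß counts as a word character, so there is no boundary.
        System.out.println("unicode a\\b.* on \"aß\": " + matchesUnicode("aß"));
        System.out.println("default \\w on \"ß\":     " + isWordCharDefault("ß"));
        System.out.println("unicode \\w on \"ß\":     " + isWordCharUnicode("ß"));
    }
}
```

Only the first printed result should differ between Java versions; the flagged variants are what the workaround costs in terms of changed \w semantics.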

Weirdly, I cannot find any information on this breaking change. Nothing in the Java 21 release notes, and various googling attempts did not yield anything helpful. For such breaking changes of essential core libs I would at least expect a big fat warning and also a feature flag to re-enable the old behavior. Anyone know anything about that?

MWE:

echo 'System.out.println("aß".matches("a\\b.*"))' | /usr/lib/jvm/java-17-openjdk/bin/jshell -> false

vs.

echo 'System.out.println("aß".matches("a\\b.*"))' | /usr/lib/jvm/java-21-openjdk/bin/jshell -q -> true

asked Jan 31 at 13:36 by Victor Mataré, edited Jan 31 at 14:33
  • Deleted my comment as I misread the pattern – g00se Commented Jan 31 at 14:02
  • So what do you want in the end? Make \b not Unicode-aware and the shorthand character classes Unicode-aware? Then just enclose \b with the flags: a(?-U)\\b(?U).*. I think your example is not representative enough, have a look at my test here. – Wiktor Stribiżew Commented Jan 31 at 14:17
  • I want Java to not break existing code ;-) (or at least have some control over whether or not I accept the breakage, i.e. a feature flag) – Victor Mataré Commented Jan 31 at 14:19
  • 1 The MWE clearly shows how the behavior changed from Java 17 to 21. That is the entire problem. It's a severe, breaking behavior change that any responsible language developer would add a feature flag for. I can't find one and I wanna know if I'm missing something or if the Java developers were just irresponsible here. – Victor Mataré Commented Jan 31 at 14:28
  • 6 Here's the bug report that changed the old behaviour. The change is in Java 19, apparently. The old documentation didn't mention that \b works for all Unicode. IMO there is little reason to think that the old behaviour is intentional. You still depended on this behaviour anyway, so IMO at least part of the responsibility is on you. – Sweeper Commented Jan 31 at 14:51
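The inline-flag idea from the comments can also be turned around for the question's goal: switch UNICODE_CHARACTER_CLASS on only for \b and off again right after it. The following is a rough sketch under that assumption (class and method names are hypothetical). The flagged \b is defined via the Unicode \w, so it may not be identical to the pre-Java-19 \b in every edge case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class ScopedUnicodeBoundary {
    // (?U) enables UNICODE_CHARACTER_CLASS just before \b, (?-U) disables it after.
    static final Pattern SCOPED = Pattern.compile("a(?U)\\b(?-U).*");

    // Same toggling, followed by \w+ which is compiled with the flag off again.
    static final Pattern SCOPED_WORD = Pattern.compile("(?U)\\b(?-U)\\w+");

    static boolean matches(String s) {
        return SCOPED.matcher(s).matches();
    }

    // Returns the first run of (ASCII) \w characters that starts at a word boundary.
    static String firstWord(String s) {
        Matcher m = SCOPED_WORD.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matches("aß"));   // no boundary between a and ß
        System.out.println(matches("a ß"));  // boundary between a and the space
        System.out.println(firstWord("aß")); // \w+ stays ASCII-only here
    }
}
```

In this sketch only the boundary check sees ß as a word character; the \w+ that follows the (?-U) still stops before it.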

1 Answer

Thanks to @Sweeper for digging it out. Here is the original bug report:

https://bugs.openjdk.org/browse/JDK-8282129

So the change was released with Java 19, and it is in fact mentioned in the release notes: https://www.oracle.com/java/technologies/javase/19all-relnotes.html

From what I gather there, there is no feature flag for it because the bug's compatibility risk was classified as "low", due to the following opinion:

"The existing behavior of the \b metacharacter in Java regex strings is longstanding and changing it may impact existing regular expressions that rely on this inconsistent (with respect to Unicode characters) behavior. However, the use of \b is less common and code that focuses on ASCII-encoded data or similar will be unaffected."

So effectively everyone who depends on the old (admittedly inconsistent) behavior will have problems to deal with :-/
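For code that has to behave the same on Java 17 and Java 21, one option is to avoid \b and spell the boundary out with lookarounds. This is only a sketch: it assumes that the old \b treated Unicode letters, decimal digits and the underscore as word characters, and I have not checked that it reproduces every edge case of the old implementation (combining marks, for example). The class name and constants are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LegacyWordBoundary {
    // Assumption: "word character" = any Unicode letter, decimal digit, or underscore.
    static final String WORD = "[\\p{L}\\p{Nd}_]";

    // A boundary is a position with a word character on exactly one side.
    static final String BOUNDARY =
            "(?:(?<=" + WORD + ")(?!" + WORD + ")"
          + "|(?<!" + WORD + ")(?=" + WORD + "))";

    // The pattern from the question, with \b replaced by the explicit boundary.
    static final Pattern PATTERN = Pattern.compile("a" + BOUNDARY + ".*");

    static boolean matches(String s) {
        return PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aß")); // ß is a letter, so no boundary after a
        System.out.println(matches("a!")); // ! is not a word character: boundary
    }
}
```

Because the character class is written out explicitly, the result does not depend on how a given Java version defines \b or \w.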
