[PHP-DEV] pcre extended character class support

Hi internals

On PHP 8.5-dev, we ship with pcre2lib 10.45.

This includes a new opt-in feature called "PCRE2_ALT_EXTENDED_CLASS".
It enables the use of complex character set operations in accordance with UTS#18 (Unicode Technical Standard 18).
This means it becomes possible to nest character sets, perform set operations on them, etc.
One example of such a set operation is set subtraction, e.g. the regex "[\p{L}--[QW]]" means "Unicode letters other than Q and W".
Or a more realistic example (inspired by [1]): the regex "[\p{N}--[0-9]]" matches all non-ASCII Unicode numbers.
You can also do ORs, ANDs, etc.

This is opt-in in pcre2lib because the interpretation of existing regexes may change.
This standard is being adopted in other languages too, also opt-in, for example in JavaScript [1].
To expose this functionality in PHP, we also have to make it opt-in via a modifier.

In JavaScript, this is enabled via the /v modifier at the end of the regex [1].
It does the same thing as the /u modifier and additionally enables the UTS#18 set notation.
PHP already has /u, which enables UTF-8 Unicode mode, so we could do the same as JavaScript and add a /v modifier that extends /u and also enables PCRE2_ALT_EXTENDED_CLASS. Technically, you don't need Unicode processing to enable PCRE2_ALT_EXTENDED_CLASS, but as it comes from a Unicode standard (and at least JavaScript does this too), it may make sense to enable both.

The actual patch is trivial:

diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
index 8e0fb2cce5f..4a4727545ad 100644
--- a/ext/pcre/php_pcre.c
+++ b/ext/pcre/php_pcre.c
@@ -718,6 +718,9 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, bo
 			case 'S':	/* Pass. */					break;
 			case 'X':	/* Pass. */					break;
 			case 'U':	coptions |= PCRE2_UNGREEDY;		break;
+#ifdef PCRE2_ALT_EXTENDED_CLASS
+			case 'v':	coptions |= PCRE2_ALT_EXTENDED_CLASS; ZEND_FALLTHROUGH;
+#endif
 			case 'u':	coptions |= PCRE2_UTF;
 	/* In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
 	   characters, even in UTF-8 mode. However, this can be changed by setting

What do we think?

[1] https://github.com/tc39/proposal-regexp-v-flag (UTS18 set notation in regular expressions)

Kind regards
Niels

On 25 July 2025 23:17:43 BST, Niels Dossche <dossche.niels@gmail.com> wrote:

What do we think?

Yes, please.

cheers
Derick

On Friday, 25 July 2025 at 23:20, Niels Dossche <dossche.niels@gmail.com> wrote:

What do we think?

I'm in favour of this. As it is small and self-contained, I don't think it should require an RFC.

Best regards,

Gina P. Banyard

+1000 for me.

Cheers.

Hi Niels,
I'm also very much in favor of adding it. Thank you.

In JavaScript, all current browsers seem to support[^1] it, and at
least for Firefox, it even defaults[^2] to the `/v` flag for the HTML
input `pattern` attribute.

[^1]: RegExp.prototype.unicodeSets - JavaScript | MDN
[^2]: HTML attribute: pattern - HTML | MDN