Re: [PHP-DEV] [RFC] Unicode Text Processing

On Fri, 16 Dec 2022 at 0:34, Derick Rethans <derick@php.net> wrote:

Hi,

I have just published an initial draft of the “Unicode Text Processing”
RFC, a proposal to have performant unicode text processing always
available to PHP users, by introducing a new “Text” class.

You can find it at:
https://wiki.php.net/rfc/unicode_text_processing

I’m looking forwards to hearing your opinions, additions, and
suggestions — the RFC specifically asks for these in places.

cheers,
Derick


https://derickrethans.nl | https://xdebug.org | https://dram.io

Author of Xdebug. Like it? Consider supporting me: https://xdebug.org/support
Host of PHP Internals News: https://phpinternals.news

mastodon: @derickr@phpc.social @xdebug@phpc.social
twitter: @derickr and @xdebug


PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Hi, Derick and Internals

Is this topic still open?
I am interested in this Text class.
I would be glad to be able to work with strings by grapheme cluster, as with Swift’s String type.

I have a few ideas.

  1. Move it into the Intl extension, for example as \Intl\Text
  • I think this would keep the implementation simple.
  2. Accept the Text type in the grapheme_* functions only, for example as string|Text.
  • This is somewhat complex to implement, but simple for userland.
  3. If UTF-8 validation fails, throw an exception

Having the __toString method return a string seems good.
Please consider this.

Regards
Yuya

On Wed, 13 May 2026 at 19:27, Derick Rethans <derick@php.net> wrote:

On Tue, 12 May 2026, youkidearitai wrote:

> On Fri, 16 Dec 2022 at 0:34, Derick Rethans <derick@php.net> wrote:
>
> > I have just published an initial draft of the "Unicode Text
> > Processing" RFC, a proposal to have performant unicode text
> > processing always available to PHP users, by introducing a new
> > "Text" class.
> >
> > You can find it at:
> > PHP: rfc:unicode_text_processing
> >
> > I'm looking forwards to hearing your opinions, additions, and
> > suggestions — the RFC specifically asks for these in places.
>
> Is this topic still open?
> I am interested in this Text class.
> I would be glad to be able to work with strings by grapheme cluster, as with Swift's String type.

I am still interested in developing this further so that it supports
even more things. Since I wrote that draft RFC, I have added a few more
features:

Commits · derickr/php-text · GitHub

>
> I have a few ideas.
>
> 1. Move it into the Intl extension, for example as \Intl\Text
> * I think this would keep the implementation simple.

I don't agree with this, as although it builds on top of ICU like the
classes in the Intl extension, it isn't following ICU's API style at
all.

It is meant to be a much more opinionated API that does the simple 80%
case well.

> 2. Accept the Text type in the grapheme_* functions only, for example as string|Text.
> * This is somewhat complex to implement, but simple for userland.

I am not too sure about this. The grapheme_* functions closely match
ICU's internal, and powerful, API. If you want them to accept a Text
object too, that means these grapheme_* functions' signatures need to be
overloaded.

For example:

grapheme_strstr(string $haystack, string $needle, bool $beforeNeedle = false, string $locale = "" ): string|false

would need to change into:

grapheme_strstr(string|Text $haystack, string|Text $needle, bool $beforeNeedle = false, string $locale = "" ): string|false

And then '$locale' makes no sense, as this is already part of each of
the Text objects themselves.

Instead, the 'contains' method on the Text object already does something
very similar:

php-text/tests/text-contains.phpt at main · derickr/php-text · GitHub

I think the grapheme functions should stay as they are, and additional
methods can be added on the Text class, where there is currently
functionality missing that the grapheme_* functions already support.

The RFC document also already lists more functions than I have
implemented so far.
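
As a rough illustration of the point about '$locale', here is a minimal userland sketch. It is not the php-text implementation: the class name LocaleBoundText, its contains() method, and getLocale() are hypothetical and exist only for this example. The only existing function it calls is grapheme_strpos() from the Intl extension, which must be loaded. The sketch assumes PHP 8.0 or later.

```php
<?php
// Hypothetical illustration only; this is NOT the Text class from the RFC.
final class LocaleBoundText
{
    public function __construct(
        private string $value,
        private string $locale = 'en_US', // stored once per object
    ) {}

    // Search using the existing Intl function grapheme_strpos().
    // There is no $locale parameter because the object already carries one.
    // This sketch only stores the locale; it does not use it for matching.
    public function contains(string $needle): bool
    {
        return grapheme_strpos($this->value, $needle) !== false;
    }

    public function getLocale(): string
    {
        return $this->locale;
    }
}

$text = new LocaleBoundText('Unicode text processing', 'nl_NL');
var_dump($text->contains('text'));    // bool(true)
var_dump($text->contains('missing')); // bool(false)
```

With a function such as grapheme_strstr(), the caller supplies the locale on every call; with an object like the one sketched here, a second locale argument would have nothing sensible to override.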

> 3. If UTF-8 validation fails, throw an exception

It already does that, see this test case:
php-text/tests/text-in-out-basic.phpt at main · derickr/php-text · GitHub
— although the exception message itself could be improved.
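
For readers who want to see the general idea without the extension, validate-on-input can be sketched in plain PHP. This is an assumption-laden example and not what php-text does internally: the helper name assertValidUtf8() and the choice of InvalidArgumentException are hypothetical, and the check relies on mb_check_encoding() from the mbstring extension.

```php
<?php
// Hypothetical helper; the name and the exception type are assumptions,
// not the behaviour of the php-text extension.
function assertValidUtf8(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $bytes;
}

// Well-formed UTF-8 is returned unchanged.
assertValidUtf8("caf\u{00E9}");

try {
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    assertValidUtf8("\xC3\x28");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Rejecting malformed input at construction time means every later operation can assume well-formed UTF-8 instead of re-checking it.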

> Having the __toString method return a string seems good.
> Please consider this.

This is already implemented too:
php-text/text.c at main · derickr/php-text · GitHub

cheers,
Derick


Thanks, Derick.

I confirmed that almost all of this is already implemented.
Indeed, the grapheme_* functions already take a `$locale` parameter,
which would conflict with `Text::$locale`.

Regards
Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

Hello Youkidearitai!

On Thu, May 14, 2026 at 7:04 AM youkidearitai <youkidearitai@gmail.com> wrote:

> I confirmed that almost all of this is already implemented.
> Indeed, the grapheme_* functions already take a `$locale` parameter,
> which would conflict with `Text::$locale`.

That's amazing news :) It is much needed.

As the initial RFC mentioned "performant", I wonder if simdutf
(simdutf/simdutf on GitHub: Unicode routines for UTF-8, UTF-16, UTF-32
and Base64) was considered for the relevant operations. It covers many
of the API's needs and is a proven, highly optimized implementation.
Other areas in PHP could also benefit from it (JSON, for example). It is
obviously not a must for an initial version, but it could be worth
considering, as it is used in so many projects we all use already
(Chrome, WebKit, Node.js, etc.).

Best,
--
Pierre

@pierrejoye