[PHP-DEV][DISCUSSION] Add locale and strength for grapheme functions

Hi, Internals

I have updated the RFC below.
- PHP: rfc:grapheme_add_locale_for_case_insensitive
The pull request is below:
- [RFC] Add a locale for grapheme case-insensitive functions by youkidearitai · Pull Request #18792 · php/php-src · GitHub

The changes are as follows:
- Add a strength parameter to the grapheme_* functions
  - This affects characters from all over the world, e.g. Ideographic
    Variation Sequences (IVS).
  - It uses the Collator class constant values.

The $locale parameter does not change anything, because I could not find
any way to make it have an effect.

I may have overlooked something, so please point it out if I did.

Regards
Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

On Wed, 9 Jul 2025, youkidearitai wrote:

> Hi, Internals
>
> I have updated the RFC below.
> - PHP: rfc:grapheme_add_locale_for_case_insensitive
> The pull request is below:
> - [RFC] Add a locale for grapheme case-insensitive functions by youkidearitai · Pull Request #18792 · php/php-src · GitHub
>
> The changes are as follows:
> - Add a strength parameter to the grapheme_* functions
>   - This affects characters from all over the world, e.g. Ideographic
>     Variation Sequences (IVS).
>   - It uses the Collator class constant values.

These settings are indeed important for these functions, but I can't get
around the fact that it makes these APIs really cluttered and
complicated — something that many functions in the grapheme_ / intl
extension already suffer from.

Is this API really the best way?

> The $locale parameter does not change anything, because I could not find
> any way to make it have an effect.

It seems that I came to a similar conclusion, but locales are much more
complicated than just languageCode_regionCode (for example, see
php-text/tests/text-contains.phpt at main · derickr/php-text · GitHub)

You also don't really need a strength argument, as you can 'encode' that
in the locale name, like: 'nb_NO-u-ks-primary' (I know, it's rather ugly
and the list of options is vast:
Unicode Locale Data Markup Language (LDML) Part 5: Collation).

cheers,
Derick

On Mon, 14 Jul 2025 at 19:22, Derick Rethans <derick@php.net> wrote:

Hi, Derick

Thank you very much for your response.

> Is this API really the best way?

I reconsidered the function signature based on what you said.

> It seems that I came to a similar conclusion, but locales are much more
> complicated than just languageCode_regionCode (for example, see
> php-text/tests/text-contains.phpt at main · derickr/php-text · GitHub)
>
> You also don't really need a strength argument, as you can 'encode' that
> in the locale name, like: 'nb_NO-u-ks-primary' (I know, it's rather ugly
> and the list of options is vast:
> Unicode Locale Data Markup Language (LDML) Part 5: Collation

Indeed, since the strength can be specified in the locale, I think it is
better to specify it there rather than through a separate strength
parameter.

For example, the grapheme_* functions can then detect the difference
made by an IVS:

$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("\u{908A}", "\u{908A}\u{E0101}", locale: "ja_JP-u-ks-identic"));'
int(1)
$ sapi/cli/php -r 'var_dump(grapheme_levenshtein("\u{908A}", "\u{908A}\u{E0101}"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_strpos("\u{908A}", "\u{908A}\u{E0101}"));'
int(0)
$ sapi/cli/php -r 'var_dump(grapheme_strpos("\u{908A}", "\u{908A}\u{E0101}", locale: "ja_JP-u-ks-identic"));'
bool(false)
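
For comparison, the same distinction can be sketched with the existing
intl Collator class, which does not depend on the proposed locale:
argument. This is only a minimal illustration under assumptions: it sets
the strength through Collator::setStrength() instead of encoding it in
the locale name, and the variable names are made up for the example. The
exact result of the default-strength comparison depends on the ICU
collation data, so no output is claimed here.

```php
<?php
// Sketch only: uses the existing Collator API, not the grapheme_*
// functions or the locale: parameter proposed in the RFC.

$base = "\u{908A}";           // U+908A on its own
$ivs  = "\u{908A}\u{E0101}";  // U+908A followed by variation selector U+E0101

// Collator with whatever strength the ja_JP collation data defaults to.
$default = new Collator('ja_JP');

// Collator with the strength explicitly raised to "identical".
$identical = new Collator('ja_JP');
$identical->setStrength(Collator::IDENTICAL);

// compare() returns 0 when the collator treats both strings as equal,
// and a non-zero integer otherwise.
var_dump($default->compare($base, $ivs));
var_dump($identical->compare($base, $ivs));
```

At identical strength the two strings differ by one code point, so the
second comparison should report them as different.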

Since ideographic characters also carry identity (e.g., in names), I
would like the functions to be able to handle IVS.
However, the API should stay simple, so we have to compromise somewhere.

Regards
Yuya


On Tue, 15 Jul 2025 at 16:05, youkidearitai <youkidearitai@gmail.com> wrote:

Hi, Internals

I have revised this RFC.

I believe I have done my best to address the complexity of Unicode.
I would like to move on to the "Voting" phase.

If there are no objections, I would like to start voting this week.

Regards
Yuya
