[PHP-DEV][DISCUSSION] Limit of code point for grapheme cluster in programming languages

Hi, Internals

I noticed that UAX #29 (Unicode Text Segmentation) does not limit the number of code points in a grapheme cluster.

An ICU issue (Jira) also confirmed that Unicode defines no such limit.

This means a single grapheme cluster can contain an arbitrarily large number of code points,
which can crash a program because computing resources are limited.

For example, the following command writes about 200 MB to emoji_bomb.txt, yet the file contains only 1 grapheme cluster:

php -r 'echo(mb_trim(str_repeat("\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}", 10000000), "\u{200d}"));' -d memory_limit=600M > emoji_bomb.txt

(Please be careful when opening emoji_bomb.txt, because it may crash the program you open it with.)
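
To see the numbers at a smaller scale, here is a scaled-down sketch of the same construction. It uses preg_replace() instead of mb_trim() so that it does not require PHP 8.4, and the repetition count of 1000 is an arbitrary choice for illustration. The cluster count it prints depends on the Unicode version of the PCRE2 library in use, so treat that figure as indicative only.

```php
<?php
declare(strict_types=1);

// One repetition: ZWJ, MAN, ZWJ, BOY, ZWJ, BOY (6 code points, 21 bytes in UTF-8).
$unit = "\u{200d}\u{1f468}\u{200d}\u{1f466}\u{200d}\u{1f466}";
$n = 1000; // scaled down from 10,000,000 in the original command (assumption for illustration)

// Strip leading/trailing ZWJ with a regex instead of mb_trim().
$s = preg_replace('/^\x{200d}+|\x{200d}+$/u', '', str_repeat($unit, $n));

$bytes      = strlen($s);                  // 21 * $n - 3 (only the leading ZWJ is removed)
$codepoints = preg_match_all('/./su', $s); // 6 * $n - 1
$clusters   = preg_match_all('/\X/u', $s); // result depends on the PCRE2 Unicode version

printf("bytes=%d codepoints=%d clusters=%d\n", $bytes, $codepoints, $clusters);
```

With the original count of 10,000,000 the same arithmetic gives 21 * 10,000,000 - 3 = 209,999,997 bytes, which matches the "about 200 MB" figure.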

So I think we (php-src, at the programming-language level) need to add a new
function that enforces a custom limit.
My idea is below:

grapheme_limit_codepoints(string $str, int $max_codepoints = 32): bool

I don't have a strong opinion on 32 as the default for $max_codepoints.
However, 32 code points should be enough for a grapheme cluster, because
the longest cluster in a human language is probably Hakṣhmalawarayaṁ (ཧྐྵྨླྺྼྻྂ), at
9 code points.

If more code points per grapheme cluster are needed,
userland can increase $max_codepoints.
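
As a rough illustration of the intended check, here is a minimal userland sketch. It is not the proposed implementation: the function name grapheme_limit_codepoints_sketch is hypothetical, and it assumes PCRE's \X (extended grapheme cluster) is an acceptable stand-in for ICU's segmentation. It also assumes that invalid UTF-8 should simply fail validation.

```php
<?php
declare(strict_types=1);

// Hypothetical userland stand-in for the proposed grapheme_limit_codepoints().
// Returns true when no grapheme cluster exceeds $maxCodepoints code points.
function grapheme_limit_codepoints_sketch(string $str, int $maxCodepoints = 32): bool
{
    // \X matches one extended grapheme cluster; the u flag enables UTF-8 mode.
    $count = preg_match_all('/\X/u', $str, $matches);
    if ($count === false) {
        // Invalid UTF-8 or a PCRE error: treated as a failed validation (assumption).
        return false;
    }

    foreach ($matches[0] as $cluster) {
        // Count the code points in this cluster (one match per code point).
        if (preg_match_all('/./su', $cluster) > $maxCodepoints) {
            return false;
        }
    }

    return true;
}

// A base letter followed by 40 combining acute accents is a single cluster of 41 code points.
$stacked = "e" . str_repeat("\u{0301}", 40);

var_dump(grapheme_limit_codepoints_sketch("abc"));        // ordinary text passes
var_dump(grapheme_limit_codepoints_sketch($stacked));     // exceeds the default of 32
var_dump(grapheme_limit_codepoints_sketch($stacked, 64)); // passes with a raised limit
```

Note that this sketch materializes every cluster in an array, so it is only meant to show the semantics; a native implementation could stop at the first oversized cluster without that overhead.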

Please also see my slides on Speaker Deck: "Limit of code point for grapheme cluster in programming language side."

What do you think about this idea?

Regards
Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

Hi Yuya,

I think this is a good idea. While spec compliance is generally desirable, DoS via unbounded grapheme clusters is a real threat, and it’s reasonable for a language-level implementation to impose practical limits that the Unicode spec itself doesn’t define. This kind of gap between a general-purpose spec and a concrete implementation is not unusual.

The default of 32 code points sounds sensible given that natural language grapheme clusters top out well below that.

One minor note: it might help to clarify the intended behavior of grapheme_limit_codepoints a bit more — for instance, whether it is meant as a validation check (returning false when a cluster exceeds the limit) or something else.

Regards,
Kentaro Takeda

On Mon, Feb 23, 2026 at 20:28, youkidearitai <youkidearitai@gmail.com> wrote:

On Tue, Feb 24, 2026 at 11:38, Kentaro Takeda <takeda@youmind.jp> wrote:


Hi, Kentaro

Thank you very much for your feedback.

> One minor note: it might help to clarify the intended behavior of `grapheme_limit_codepoints` a bit more, for instance, whether it is meant as a validation check (returning false when a cluster exceeds the limit) or something else.

Okay, here is an example.

// $_POST['text'] holds some user-supplied string.
// Reject input that has a grapheme cluster with too many code points.
if (grapheme_limit_codepoints($_POST['text'], 32) !== true) {
    throw new InvalidArgumentException("Found invalid or too many code points in a grapheme cluster");
}

// Validate the length in grapheme clusters.
if (grapheme_strlen($_POST['text']) > 100) {
    throw new InvalidArgumentException("Text is longer than 100 graphemes");
}

// do anything...

The intention is to count graphemes correctly while avoiding DoS.
I also want to address, for the grapheme_strlen function, the concern behind
symfony/symfony Pull Request #13527 ("[Validator] drop grapheme_strlen in LengthValidator" by nicolas-grekas).

Feel free to comment further.
Regards
Yuya.


On Tue, Feb 24, 2026 at 16:21, youkidearitai <youkidearitai@gmail.com> wrote:


Hi, Internals

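I created a PoC and an RFC:
- [RFC][PoC] Add grapheme_limit_codepoints function (Pull Request #21311, php/php-src)
- PHP: rfc:grapheme_limit_codepoints

I also asked Unicode to add a limit on the number of code points per
grapheme cluster to UAX #29.
Unicode may adopt my suggestion if it makes sense to them, but I
don't know what will happen.

In any case, I think it makes sense for PHP to limit the number of code points in a grapheme cluster on its own side.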

Feel free to comment.

Regards
Yuya


On Sat, Feb 28, 2026 at 0:59, youkidearitai <youkidearitai@gmail.com> wrote:


Hi, Internals

I reported this topic to Unicode and received the following reply:

> Thank you for your feedback and your interest in Unicode.
> Your feedback will be reviewed by one of Unicode’s working groups.
> If appropriate, it may be posted to the PRI feedback page or be made part of a list of general feedback that will be considered for the next quarterly UTC meeting.

My understanding is that, if deemed appropriate, it will be handled as a PRI (Public Review Issue) or taken up at a UTC meeting.

I'm going to wait and see.

Regards
Yuya
