[PHP-DEV][DISCUSSION] Multibyte for levenshtein function

Hello Internals.

I tried implementing an mb_levenshtein function and created an RFC:
https://wiki.php.net/rfc/mb_levenshtein

I would like to open the discussion; please feel free to comment.

Thank you.
Yuya.

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

Hi

On 2024-09-25 09:21, youkidearitai wrote:

> I tried implementing an mb_levenshtein function and created an RFC.
> RFC: https://wiki.php.net/rfc/mb_levenshtein
> PR: [Draft][Require RFC] mb_levenshtein function by youkidearitai (php/php-src pull request #16043)
>
> I would like to open the discussion; please feel free to comment.

Thank you for your RFC. I share the concern raised by cmb in the PR discussion (php/php-src pull request #16043, "[Draft][Require RFC] mb_levenshtein function").

Working with codepoints is generally confusing for users, but it is sometimes necessary when dealing with external systems that themselves work with codepoints (MySQL comes to mind). However, calculating the Levenshtein distance is almost certainly a purely "user-facing" operation that is not constrained by external systems. Calculating the distance in codepoints is going to be extremely confusing when dealing with things like emoji. It would probably be best to either offer only a `grapheme_*` function here or to leave this entirely to userland.

Best regards
Tim Düsterhus
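
To make the codepoint-versus-grapheme concern concrete, here is a minimal sketch. It is not taken from the RFC or the PR. It is written in Python purely for illustration, and the helper names `levenshtein` and `simple_clusters` are hypothetical. `simple_clusters` is only a rough approximation of grapheme segmentation (combining marks, variation selectors, skin-tone modifiers and ZWJ sequences), not the full Unicode rules that a real `grapheme_*` function would rely on.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, glues several emoji into one visible symbol


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def simple_clusters(text):
    """Rough stand-in for grapheme segmentation (assumption: NOT full UAX #29)."""
    clusters = []
    for ch in text:
        extends = (
            unicodedata.category(ch) in ("Mn", "Me")  # combining marks, variation selectors
            or ch == ZWJ
            or "\U0001F3FB" <= ch <= "\U0001F3FF"      # emoji skin-tone modifiers
        )
        if clusters and (extends or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch  # keep attaching to the current cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
man = "\U0001F468"

by_bytes = levenshtein(family.encode("utf-8"), man.encode("utf-8"))
by_codepoints = levenshtein(family, man)
by_clusters = levenshtein(simple_clusters(family), simple_clusters(man))

print(by_bytes, by_codepoints, by_clusters)  # 14 4 1
```

A user sees one symbol replaced by another, which matches the cluster-level distance of 1. The codepoint-level distance of 4 and the byte-level distance of 14 reflect how the family emoji is encoded rather than what is displayed.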

On Sat, Oct 5, 2024 at 1:20, Tim Düsterhus <tim@bastelstu.be> wrote:
> [...]
> Calculating the distance of codepoints is going to be extremely
> confusing when dealing with things like Emoji. It would probably be
> best to either only offer a `grapheme_*` function here or to leave
> this fully to userland.

Hi, Tim

Thank you for your response.
I have been thinking about what users expect from a Levenshtein distance.
Certainly, I think Levenshtein distance should be measured in terms of
grapheme clusters.

Most userland code is based on UTF-8, so moving this to a grapheme
function seems to make sense.
I will think more about the use cases of levenshtein, but I will probably go with a grapheme function.

Thanks
Yuya


On Sun, Oct 6, 2024 at 14:45, youkidearitai <youkidearitai@gmail.com> wrote:
> [...]
> Surely, I think Levenshtein distance should be measured in terms of
> grapheme clusters.
> [...]
> Probably I'm going to grapheme function.

Hi, internals

I have been thinking more about the use cases for mb_levenshtein.
I added a test case for mb_levenshtein that compares emoji per code point.

It suggests that comparing by Unicode code point does make sense.
I think we need mb_levenshtein, and we also need grapheme_levenshtein.

What do you think?

Regards
Yuya
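
One possible reading of the argument above, as a hedged sketch: the test case from the PR is not reproduced here, and the emoji below are chosen only for illustration. The code is Python, and the helper name `levenshtein` is hypothetical. Comparing per code point can tell apart emoji sequences that share most of their components from ones that share none, whereas a comparison per grapheme cluster would count either change as a single substitution.

```python
def levenshtein(a, b):
    """Unit-cost edit distance between two sequences of comparable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


# Python strings iterate per code point, so these are code point distances.
family_girl = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
family_boy = "\U0001F468\u200d\U0001F469\u200d\U0001F466"   # man + ZWJ + woman + ZWJ + boy
party = "\U0001F389"                                        # party popper, unrelated emoji

near = levenshtein(family_girl, family_boy)  # only the last component differs
far = levenshtein(family_girl, party)        # nothing in common

print(near, far)  # 1 5
```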
