[PHP-DEV][Discussion] Should All String Functions Become Multi-Byte Safe?

Nick_Lockheart · August 11, 2024, 3:50pm

HTML 5 was adopted in 2014, over ten years ago. HTML 5 only supports
the UTF-8 multi-byte character encoding.

It seems like there's still a lot of string functions that assume that
a character is a single byte, and these may actually work as expected
when dealing with Latin characters, but may fail unexpectedly if a
sequence is more than one byte.

Are there any use cases for PHP where **single-byte** characters are
the norm?

It seems that if everything on the Internet is multi-byte encoded now,
then all of the PHP string functions should be multi-byte safe.

The WHATWG Encoding Standard:

https://encoding.spec.whatwg.org/

Also, according to Mozilla, "[The meta charset] attribute declares the
document's character encoding. If the attribute is present, its value
must be an ASCII case-insensitive match for the string "utf-8", because
UTF-8 is the only valid encoding for HTML5 documents."

Tim_Dusterhus · August 11, 2024, 4:18pm

Hi

On 8/11/24 17:50, Nick Lockheart wrote:

It seems like there's still a lot of string functions that assume that
a character is a single byte, and these may actually work as expected
when dealing with Latin characters, but may fail unexpectedly if a
sequence is more than one byte.

PHP's strings are byte-strings containing arbitrary sequences of bytes. Unless you specifically select functions that interpret the byte-strings as something else, you get a byte-string interpretation. There is nothing unexpected about that.
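
As a small illustration of the byte-string model (a sketch using only core functions; the sample word is an arbitrary choice, not something from this thread):

```php
<?php
// "é" is written as a code point escape so the bytes are unambiguous.
$word = "caf\u{00E9}";            // UTF-8 bytes: 63 61 66 c3 a9

$byteLength = strlen($word);      // counts bytes, not characters
$hex        = bin2hex($word);     // shows the raw bytes
$lastByte   = bin2hex($word[4]);  // offset access is per byte as well

echo $byteLength, "\n"; // 5
echo $hex, "\n";        // 636166c3a9
echo $lastByte, "\n";   // a9
```

None of these calls interpret the bytes as text, which is the byte-string behaviour described above.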

Are there any use cases for PHP where **single-byte** characters are
the norm?

Dealing with binary formats.

It seems that if everything on the Internet is multi-byte encoded now,
then all of the PHP string functions should be multi-byte safe.

The premise is false. Everything on the Internet is a byte-string (also called an "octet string").

--------

You might be interested in [RFC] Unicode Text Processing - Externals.

Best regards
Tim Düsterhus

Bilge · August 11, 2024, 4:22pm

Are we going back to PHP 6?

Anton_Smirnov · August 11, 2024, 4:33pm

To mbstring.func_overload

Anton_Smirnov · August 11, 2024, 4:38pm

Hi Nick,

As a developer who often deals with binary data (like bencode, ipv6 addresses and my own hacks for multibyte arithmetic) I would prefer that functions and syntaxes that allow me to work with bytes keep working with bytes, not characters or code points. So the closest solution would be separate binary/text strings, but then we have PHP6 all over again. Maybe this time it might work in some form, who knows.
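
To make the binary use case concrete, here is a rough sketch of handling an IPv6 address as bytes. It assumes a PHP build with IPv6 support, and the address is a documentation-range example rather than anything from this thread:

```php
<?php
// inet_pton() returns the packed binary form of an address as a PHP string.
$packed = inet_pton('2001:db8::1');

$length    = strlen($packed);          // 16 bytes for IPv6
$firstWord = unpack('n', $packed)[1];  // first 16 bits, big-endian
$lastByte  = ord($packed[15]);         // byte-offset access

printf("%d bytes, first group %04x, last byte %d\n", $length, $firstWord, $lastByte);
```

Code like this relies on `strlen` and string offsets meaning bytes; if they counted code points instead, the offsets would no longer be predictable.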

Crell · August 11, 2024, 4:58pm

Some background and history, for those not familiar...

After PHP 5.2, there was a huge effort to move PHP to using Unicode internally. It was to be released as PHP 6. Unfortunately, it ran into a whole host of problems, among them:

1. It tried to use UTF-16 internally, as there were good libraries for it, but it was much, much slower than was acceptable.
2. It required rewriting basically everything.
3. Trying to support two string variants at the same time (because binary strings are still very useful) in almost the same syntax turned out to be, um, kinda hard.

After a number of years of work, it was eventually concluded that it was a dead end. So the non-Unicode-related bits of what would have been PHP 6 got renamed to PHP 5.3 and released to much fanfare, kicking off the PHP Renaissance Era.

When PHP 5.6+1 was released, there was a vote to decide if it should be called 6 or 7. 7 won, mainly on the grounds that a number of very stupid book publishers had released "PHP 6" books in anticipation of PHP 6's release that were now completely useless and misleading. So we skipped 6 entirely, and PHP 6-compatibility is a running joke among those who have been around a while.

Fortunately, the vast majority of single-byte strings are ASCII, and ASCII is, by design, a strict subset of UTF-8, so in practice the lack of native UTF-8 strings rarely causes an issue.

Trying to introduce Unicode strings to the language now as a native type would... probably break just as much if not more. If anything it's probably harder today than it was in 2008, because the engine, and the body of existing code that must not break, have grown considerably.

A much better approach would be something like this RFC from Derick a few years ago:

https://wiki.php.net/rfc/unicode_text_processing

If you need something today, then Symfony has a user-space approximation of it:

Creating and Manipulating Strings (Symfony Docs)

--Larry Garfield

youkidearitai · August 11, 2024, 5:03pm

Hi Nick

I'm not sure what "multibyte safe" means.

PHP's string type is normally binary.

If you want to work with multibyte characters, you can use the mbstring functions.
(Does "multibyte safe" refer to the mbstring functions?)

I don't think there is one consistent solution, because multibyte
characters require a lot of case-by-case thought.

Regards
Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

Alain_D_D_Williams · August 11, 2024, 6:10pm

On Mon, Aug 12, 2024 at 02:03:08AM +0900, youkidearitai wrote:

I'm not sure what "multibyte safe" means.

I think that he means that the bytes are only valid UTF-8 sequences.

This would mean that some byte sequences would not be allowed.

-1 to this idea.

--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT Lecturer.
+44 (0) 787 668 0256 https://www.phcomp.co.uk/
Parliament Hill Computers. Registration Information: How to contact Parliament Hill Computers Ltd
#include <std_disclaimer.h>

Nick_Lockheart · August 11, 2024, 9:36pm

I think that when people think of "strings", they think of human
readable text.

I wasn't suggesting that unicode strings be a native type, but rather
that functions that have "string" in the name should be UTF-8 safe.

There's a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause unexpected surprises.

The compatibility between ASCII and UTF-8 for Latin characters is both
a curse and a blessing. An application may work fine in testing, but
then break when a user submits an emoji.
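
A sketch of that failure mode, using only core functions. The comment text and the 5-byte cut-off are made-up values for illustration:

```php
<?php
$comment = "Hi \u{1F600}";          // "Hi " (3 bytes) + U+1F600 (4 bytes)

// A byte-based cut that lands in the middle of the emoji.
$preview = substr($comment, 0, 5);

// preg_match() with the /u modifier returns false for malformed UTF-8.
$commentIsUtf8 = preg_match('//u', $comment) === 1;
$previewIsUtf8 = preg_match('//u', $preview) === 1;

var_dump(strlen($comment));   // int(7)
var_dump($commentIsUtf8);     // bool(true)
var_dump($previewIsUtf8);     // bool(false)
```

With ASCII-only test data the same `substr` call can never split a character, which is why the problem tends to show up only in production.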

It seems like it would be good to have a set of functions, each for an
intended use case, that behave in accordance with their intended usage.

For example:

Math and number functions for calculations; string functions for human
readable text (which are UTF-8 safe), and byte functions for binary
processing that are binary safe.

Using the functions for certain use cases right now requires knowing
the internals of the function, where developers should be able to rely
on the name to know that it will work for a specific use case.

For many functions, the manual doesn't specify if it is safe for multi-
byte characters or not.

`ltrim` doesn't mention multi-byte:

The `trim` page doesn't mention it either, except there is a user
contributed note at the bottom: "Note that trim() is not aware of
Unicode points that represent whitespace (e.g., in the General
Punctuation block), except, of course, for the ones mentioned in this
page. There is no Unicode-specific trim function in PHP at the time of
writing (July 2023), but you can try some examples of trims using
multibyte strings posted on the comments for the mbstring extension:
https://www.php.net/manual/en/ref.mbstring.php".
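
For illustration, the behaviour that note describes can be reproduced as follows. The regular expression is one possible workaround, not an official recommendation, and the sample text is made up:

```php
<?php
// EM SPACE (U+2003) on both sides of the word.
$input = "\u{2003}hello\u{2003}";

// trim() only strips its default set of single-byte characters.
$byteTrimmed = trim($input);

// \p{Z} covers Unicode separators; \s covers ASCII whitespace such as tabs and newlines.
$unicodeTrimmed = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $input);

var_dump($byteTrimmed === $input);   // bool(true): nothing was removed
var_dump($unicodeTrimmed);           // string(5) "hello"
```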

So what I would propose is:

(1) All string functions should state in the official man page if they
are safe for UTF-8 or not.

(2) Functions intended for working with text should be made UTF-8 safe.

(3) Functions intended for processing binary should be added if
necessary, and should be named something like "binary" or "byte".

Ayesh_Karunaratne · August 11, 2024, 10:39pm

There's a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause unexpected surprises.

The compatibility between ASCII and UTF-8 for Latin characters is both
a curse and a blessing. An application may work fine in testing, but
then break when a user submits an emoji.

[snip]

(1) All string functions should state in the official man page if they
are safe for UTF-8 or not.

The official documentation source lives at GitHub - php/doc-en: English PHP documentation.
It is open source, and often towards the end of the year, before the PHP
major version release, the team and contributors put a tremendous
amount of work into updating the documentation to match the latest new
features, deprecations, etc. Contributions are always welcome,
including ones that warn about certain functions not being
multi-byte safe.

(2) Functions intended for working with text should be made UTF-8 safe.

Generally speaking, all functions that deal with strings are in fact
UTF-8 safe because UTF-8 strings are also a sequence of bytes, just
like the other strings are. The problems occur only if you try to
modify or inspect the text in a way that depends on how it reads as
human-readable text.
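
A quick sketch of the "safe as long as you stay at the byte level" case, using core functions only (the sample phrase is an assumption for illustration):

```php
<?php
$text = "na\u{00EF}ve caf\u{00E9}";   // "naïve café" in UTF-8

// Byte-wise search and replace still produce valid, correct UTF-8.
$replaced   = str_replace("caf\u{00E9}", 'tea', $text);
// The result of strpos() is a byte offset, not a character index.
$byteOffset = strpos($text, "caf\u{00E9}");

var_dump($replaced === "na\u{00EF}ve tea");  // bool(true)
var_dump($byteOffset);                        // int(7), although "c" is the 7th character (index 6)
```

The search works because a valid UTF-8 needle can only match at character boundaries in a valid UTF-8 haystack; the offset is where the byte and character views start to diverge.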

Take the _text_ "å" for example, stored here as "a" followed by a combining ring above (U+030A). What is the length of the string?

strlen("a\u{030A}"); // 3
mb_strlen("a\u{030A}"); // 2
grapheme_strlen("a\u{030A}"); // 1

The correct length of the string above (`a\xCC\x8A`) is... well, all of them:

- `strlen` is useful if you validate the length of a user input
before saving it to a database field with a `varchar` limit, or to
avoid exceeding an index length.
- `mb_strlen` is useful if you want to count how many
code points are used in that string. The mbstring extension decodes
"\xCC\x8A" as a single code point (U+030A). It will
only consider up to 4 bytes per character because the UTF-8 representation
limits a code point to 4 bytes.
- `grapheme_strlen` counts the actual human-perceived characters
(grapheme clusters), which is what you should really be using if you
are formatting text for a specific length.
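
The three measurements can be compared side by side with a sketch like the one below. It assumes the mbstring and intl extensions may or may not be loaded, so those calls are guarded; the comparison with the precomposed form U+00E5 is an addition for illustration:

```php
<?php
$text = "a\u{030A}";   // "a" + COMBINING RING ABOVE, rendered as "å"

$lengths = ['bytes' => strlen($text)];
if (function_exists('mb_strlen')) {
    $lengths['code points'] = mb_strlen($text, 'UTF-8');
}
if (function_exists('grapheme_strlen')) {
    $lengths['graphemes'] = grapheme_strlen($text);
}
// The precomposed form U+00E5 looks identical but is 2 bytes and 1 code point.
$lengths['bytes (precomposed)'] = strlen("\u{00E5}");

print_r($lengths);
```

Two strings that render identically can therefore differ in both byte and code point length, which is one more reason the "right" length depends on what it is used for.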

It's also important to understand and appreciate that a lot of PHP
functionality today has been there for a very long time. You can't
simply change a critical function like `strpos` this late in a
programming language. See the excellent reply Larry made about what
happened the time PHP tried to do exactly what you are suggesting.

Replacing all `strlen` calls in a code base with `mb_strlen` or
`grapheme_strlen` is not a good idea because they serve a different
requirement from `strlen`, and they should only be used intentionally
where necessary. The latter functions also have to inspect the strings
sequentially because UTF-8 is not fixed-length. This is quite slow and
it adds up when you process thousands of strings.

(3) Functions intended for processing binary should be added if
necessary, and should be named something like "binary" or "byte".

We are already doing it, just the other way around. See `mb_*` and
`grapheme_*` functions: All of them are purposefully built to support
those features, and are clearly named as such.

The rest of the functions consistently consider all strings as a
sequence of bytes.

This naming pattern is arguably the correct way, because the majority
of functions do not need to care whether the strings they deal with
need to be human-perceived characters or not. For example,
`base64_encode`/`decode` functions, `file_(get|put)_contents`,
`pack`/`unpack`, etc will work with any string regardless of their
UTF-8 correctness. Why should those strings need to be UTF-8 formatted
in the first place?
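
As a minimal sketch of that point, here is a round trip through `base64_encode` with bytes that are deliberately not valid UTF-8 (the byte values are arbitrary):

```php
<?php
// Four bytes that are not valid UTF-8 (0xFF and 0xFE never occur in UTF-8).
$binary = "\xFF\xFE\x00\x01";

$encoded = base64_encode($binary);
$decoded = base64_decode($encoded, true);   // strict mode

// preg_match() with the /u modifier returns false for malformed UTF-8.
$isUtf8 = preg_match('//u', $binary) === 1;

var_dump($encoded);              // string(8) "//4AAQ=="
var_dump($decoded === $binary);  // bool(true)
var_dump($isUtf8);               // bool(false)
```

If these functions rejected non-UTF-8 input, they could not be used for binary payloads at all.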

Anton_Smirnov · August 12, 2024, 3:30am

On 12/08/2024 00:36, Nick Lockheart wrote:

So what I would propose is:

(1) All string functions should state in the official man page if they
are safe for UTF-8 or not.

Reasonable but see below

(2) Functions intended for working with text should be made UTF-8 safe.

Define "UTF-8 safe" precisely. Also, what about BC breaks here?

(3) Functions intended for processing binary should be added if
necessary, and should be named something like "binary" or "byte".

That would require renaming and deprecating most of the standard string library; I guess no one would agree to that.

But generally they are already named differently: str* are binary, while mb_* and grapheme_* are text-oriented.

Rowan_Tommins_IMSoP · August 12, 2024, 6:53am

On 11 August 2024 16:50:52 BST, Nick Lockheart <lists@ageofdream.com> wrote:

It seems that if everything on the Internet is multi-byte encoded now,
then all of the PHP string functions should be multi-byte safe.

The phrase "multibyte safe" may have made sense about 30 years ago, when it was thought that a "universal character set" could just be a "wide ASCII", encoding a straightforward list of characters, just more of them.

Modern Unicode is so much more than that, because the world's writing systems don't all work the same way. Should strlen() measure bytes, code points, or graphemes? Should strtoupper() accept a locale, so it can handle cases like Turkish "dotless i" where "I" is not the uppercase of "i"? And so on, and so on.

I've seen plenty of languages boast that they are "Unicode aware" but few actually engaging with the question of what that actually means. Often they equate "character" with "code point" and stop there, which leads to results that are just as useless to most of the world as if they'd equated it with "byte".
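
The Turkish case mentioned above can be sketched like this. It assumes the default "C" locale, and the `mb_strtoupper` call is guarded in case the mbstring extension is not loaded:

```php
<?php
$dottedCapitalI = "\u{0130}";   // "İ", the Turkish uppercase of "i"

$results = ['strtoupper' => strtoupper('i')];
if (function_exists('mb_strtoupper')) {
    // mb_strtoupper() takes an encoding, not a language or locale.
    $results['mb_strtoupper'] = mb_strtoupper('i', 'UTF-8');
}

foreach ($results as $function => $upper) {
    printf("%s('i') = %s, Turkish form: %s\n", $function, $upper,
        $upper === $dottedCapitalI ? 'yes' : 'no');
}
```

Both functions are given no language to work with, so neither can return the Turkish result; being multi-byte aware does not by itself make a function language-aware.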

Regards,
Rowan Tommins
[IMSoP]

danielhaber · August 12, 2024, 9:50am

Feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)"

youkidearitai · August 12, 2024, 10:33am

Hi, there

Feels appropriate to link to this:
"The Absolute Minimum Every Software Developer Must Know About Unicode
in 2023 (Still No Excuses!)"
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) @ tonsky.me

I think it's the same as the quoted site.
However, in programming, there are times when you want to operate on
bytes, code points, or grapheme clusters.
UTF-8 can't solve everything; what matters is what the programmer is
trying to do (byte processing, character processing, etc.).

Also, other character encodings are still important, mainly for CJK.
Character sets involve a lot of considerations.

Regards
Yuya
