Some background and history, for those not familiar...
After PHP 5.2, there was a huge effort to move PHP to using Unicode
internally. It was to be released as PHP 6. Unfortunately, it ran
into a whole host of problems, among them:
1. It tried to use UTF-16 internally, as there were good libraries
for it, but it was much, much slower than was acceptable.
2. It required rewriting basically everything.
3. Trying to support two string variants at the same time (because
binary strings are still very useful) in almost the same syntax
turned out to be, um, kinda hard.
After a number of years of work, it was eventually concluded that it
was a dead end. So the non-Unicode-related bits of what would have
been PHP 6 got renamed to PHP 5.3 and released to much fanfare,
kicking off the PHP Renaissance Era.
When PHP 5.6+1 was released, there was a vote to decide if it should
be called 6 or 7. 7 won, mainly on the grounds that a number of very
stupid book publishers had released "PHP 6" books in anticipation of
PHP 6's release that were now completely useless and misleading. So
we skipped 6 entirely, and PHP 6-compatibility is a running joke
among those who have been around a while.
Fortunately, the vast majority of single-byte strings are ASCII, and
ASCII is, by design, a strict subset of UTF-8, so in practice the
lack of native UTF-8 strings rarely causes an issue.
Trying to introduce Unicode strings to the language now as a native
type would... probably break just as much if not more. If anything
it's probably harder today than it was in 2008, because the engine
and the body of existing code that must not break have both grown
considerably.
A much better approach would be something like this RFC from Derick a
few years ago:
PHP: rfc:unicode_text_processing
If you need something today, then Symfony has a user-space
approximation of it:
Creating and Manipulating Strings (Symfony Docs)
--Larry Garfield
I think that when people think of "strings", they think of human
readable text.
I wasn't suggesting that unicode strings be a native type, but rather
that functions that have "string" in the name should be UTF-8 safe.
There are a lot of pitfalls here, and I don't think the documentation
clearly calls out which functions are OK to use with UTF-8 and which
ones may cause unexpected surprises.
The compatibility between ASCII and UTF-8 for Latin characters is both
a curse and a blessing. An application may work fine in testing, but
then break when a user submits an emoji.
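To make that concrete, here is a minimal sketch of how the byte-oriented and mbstring functions diverge once an emoji shows up. It assumes PHP 7 or later (for the `\u{...}` escape) with the mbstring extension loaded; the variable names are just for illustration.

```php
<?php
// Byte-oriented vs. character-oriented functions on the same input.
// Assumes the mbstring extension is available.

$ascii = "Hello";
$emoji = "Hi \u{1F600}";   // grinning face; 4 bytes in UTF-8

// On pure ASCII the two families agree.
$asciiBytes = strlen($ascii);              // 5
$asciiChars = mb_strlen($ascii, 'UTF-8');  // 5

// With an emoji they do not.
$emojiBytes = strlen($emoji);              // 7
$emojiChars = mb_strlen($emoji, 'UTF-8');  // 4

// Truncating by bytes cuts the emoji in half and leaves invalid UTF-8.
$cutBytes = substr($emoji, 0, 5);
$cutChars = mb_substr($emoji, 0, 4, 'UTF-8');

$cutBytesValid = mb_check_encoding($cutBytes, 'UTF-8'); // false
$cutCharsValid = mb_check_encoding($cutChars, 'UTF-8'); // true
```

With ASCII-only test data every one of these calls looks correct, which is why the problem tends to surface only in production.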
It seems like it would be good to have a set of functions, each for an
intended use case, that behave in accordance with their intended usage.
For example:
Math and number functions for calculations, string functions for
human-readable text (which are UTF-8 safe), and byte functions for
binary processing (which are binary safe).
Using the functions for certain use cases right now requires knowing
the internals of the function, whereas developers should be able to
rely on the name to know that it will work for a specific use case.
For many functions, the manual doesn't specify whether they are safe
for multi-byte characters.
The `ltrim` page doesn't mention multi-byte handling.
The `trim` page doesn't mention it either, except there is a user
contributed note at the bottom: "Note that trim() is not aware of
Unicode points that represent whitespace (e.g., in the General
Punctuation block), except, of course, for the ones mentioned in this
page. There is no Unicode-specific trim function in PHP at the time of
writing (July 2023), but you can try some examples of trims using
multibyte strings posted on the comments for the mbstring extension:
https://www.php.net/manual/en/ref.mbstring.php".
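A small sketch of what that note describes, using U+2003 EM SPACE from the General Punctuation block. The `utf8_trim()` helper is hypothetical (my own name, not a PHP function); it leans on `preg_replace()` with the `/u` modifier and the `\p{Z}` separator class, and assumes PHP 7 or later.

```php
<?php
// trim() only strips a fixed set of single bytes by default,
// so Unicode whitespace such as EM SPACE survives.

$text = "\u{2003}hello\u{2003}";   // EM SPACE on both sides

$plainTrim = trim($text);          // unchanged: EM SPACE is not in trim()'s list

// Hypothetical user-space helper, not part of PHP.
function utf8_trim(string $s): string
{
    // \s covers ASCII whitespace, \p{Z} covers Unicode separators.
    $result = preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $s);

    // preg_replace() returns null on failure (e.g. invalid UTF-8 input);
    // fall back to the original string in that case.
    return $result ?? $s;
}

$unicodeTrim = utf8_trim($text);   // "hello"
```

As far as I know, PHP 8.4 adds `mb_trim()`, `mb_ltrim()`, and `mb_rtrim()` to mbstring, which would make a helper like this unnecessary on newer versions.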
So what I would propose is:
(1) All string functions should state in the official man page if they
are safe for UTF-8 or not.
(2) Functions intended for working with text should be made UTF-8 safe.
(3) Functions intended for processing binary should be added if
necessary, and should be named something like "binary" or "byte".
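To illustrate the kind of naming split I have in mind, here is a rough user-space sketch. None of these functions exist in PHP; the `text_*` and `byte_*` names are made up for the example, and it assumes PHP 7.4 or later with mbstring (for `mb_str_split()`).

```php
<?php
// Hypothetical naming split: text_* functions require valid UTF-8,
// byte_* functions treat the string as raw bytes.

function byte_length(string $bytes): int
{
    return strlen($bytes);
}

function text_length(string $text): int
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('text_length() expects valid UTF-8');
    }
    return mb_strlen($text, 'UTF-8');
}

function byte_reverse(string $bytes): string
{
    return strrev($bytes);
}

function text_reverse(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('text_reverse() expects valid UTF-8');
    }
    // Reverses code points, not grapheme clusters.
    return implode('', array_reverse(mb_str_split($text, 1, 'UTF-8')));
}

$word = "h\u{E9}llo";                  // "héllo"; é is 2 bytes in UTF-8

$byteLen  = byte_length($word);        // 6
$textLen  = text_length($word);        // 5
$reversed = text_reverse($word);       // "olléh"

// Reversing byte-by-byte scrambles the 2-byte sequence for é.
$byteReversedValid = mb_check_encoding(byte_reverse($word), 'UTF-8'); // false
```

The point is only that the name tells the caller which behavior to expect; whether the text variants should throw on invalid input or do something else is a separate design question.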