[PHP-DEV] Potential RFC: mb_rawurlencode() ?

Paul_M_Jones · March 18, 2025, 5:48pm

Hi all,

The discussion around WHATWG-URL on this list, as well as my work coordinating Uri-Interop <https://github.com/uri-interop/interface>, has led me to think PHP needs a multibyte equivalent of rawurlencode().

Broadly speaking, as far as I can tell:

- For an RFC 3986 URI, delimiters need to be percent-encoded, as well as non-ASCII characters.
- For an RFC 3987 IRI, delimiters need to be percent-encoded, but UCS characters do not.

(There are other details but I think you get the idea.)

The rawurlencode() function does fine for URIs, but not for IRIs. Using rawurlencode() for an IRI will encode multibyte characters when it should leave them alone. For example:

$val = 'fü bar';

$uriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($uriPath === '/heads/f%C3%BC%20bar/tails/'); // true

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false

(This might apply to WHATWG-URL component construction as well.)

Have I missed something, either in the specs or in PHP itself?

If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.

Thoughts?

* * *

function mb_rawurlencode(string $string) : string
{
    $encoded = '';

    foreach (mb_str_split($string) as $char) {
        $encoded .= match ($char) {
            chr(0) => "%00",
            chr(1) => "%01",
            chr(2) => "%02",
            chr(3) => "%03",
            chr(4) => "%04",
            chr(5) => "%05",
            chr(6) => "%06",
            chr(7) => "%07",
            chr(8) => "%08",
            chr(9) => "%09",
            chr(10) => "%0A",
            chr(11) => "%0B",
            chr(12) => "%0C",
            chr(13) => "%0D",
            chr(14) => "%0E",
            chr(15) => "%0F",
            chr(16) => "%10",
            chr(17) => "%11",
            chr(18) => "%12",
            chr(19) => "%13",
            chr(20) => "%14",
            chr(21) => "%15",
            chr(22) => "%16",
            chr(23) => "%17",
            chr(24) => "%18",
            chr(25) => "%19",
            chr(26) => "%1A",
            chr(27) => "%1B",
            chr(28) => "%1C",
            chr(29) => "%1D",
            chr(30) => "%1E",
            chr(31) => "%1F",
            chr(127) => "%7F",
            "!" => '%21',
            "#" => '%23',
            "$" => '%24',
            "%" => '%25',
            "&" => '%26',
            "'" => '%27',
            "(" => '%28',
            ")" => '%29',
            "*" => '%2A',
            "+" => '%2B',
            "," => '%2C',
            "/" => '%2F',
            ":" => '%3A',
            ";" => '%3B',
            "=" => '%3D',
            "?" => '%3F',
            "[" => '%5B',
            "]" => '%5D',
            default => $char,
        };
    }

    return $encoded;
}

* * *

-- pmj

youkidearitai · March 20, 2025, 6:31am

---------- Forwarded message ---------
From: youkidearitai <youkidearitai@gmail.com>
Date: Thu, Mar 20, 2025, 14:41
Subject: Re: [PHP-DEV] Potential RFC: mb_rawurlencode() ?
To: Paul M. Jones <pmjones@pmjones.io>

On Wed, Mar 19, 2025 at 2:52, Paul M. Jones <pmjones@pmjones.io> wrote:

[snip]

Hi, Paul.

I think the signature should be as follows:

function mb_rawurlencode(string $string, string $encode): string {}

This is because the mbstring functions also handle encodings other than Unicode (ISO-8859-1 to ISO-8859-16, Shift_JIS, EUC-*, etc.).
Other than that, I don't know yet.

Oops, I forgot to send this to internals.
Sorry, resending it now.

Yuya

--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- youkidearitai (tekimen) · GitHub
-----------------------------

Paul_M_Jones · March 20, 2025, 4:46pm

Hi all,

On Mar 20, 2025, at 01:31, youkidearitai <youkidearitai@gmail.com> wrote:

On Wed, Mar 19, 2025 at 2:52, Paul M. Jones <pmjones@pmjones.io> wrote:

If not, how do we feel about an RFC for mb_rawurlencode()? A naive userland implementation might look something like the code below.

Thoughts?

* * *
function mb_rawurlencode(string $string) : string

Hi, Paul.

I think the signature should be as follows:

function mb_rawurlencode(string $string, string $encode): string {}

Ah yes, you're right -- probably `?string $encode = null` to match with mb_substr().

Oops, I forgot to send this to internals.
Sorry, resending it now.

Not to worry, thank you!

-- pmj

Tim_Dusterhus · March 21, 2025, 11:17am

Hi

Am 2025-03-20 17:46, schrieb Paul M. Jones:

function mb_rawurlencode(string $string, string $encode): string {}
Ah yes, you're right -- probably `?string $encode = null` to match with mb_substr().

I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable. As one example paragraph from RFC 3987:

Conversions from URIs to IRIs MUST NOT use any character encoding
other than UTF-8 in steps 3 and 4, even if it might be possible to
guess from the context that another character encoding than UTF-8 was
used in the URI.

The correct solution to me is to build a proper thought-through API as part of the proposed new Uri namespace and not adding new standalone functions without a clear vision.

Best regards
Tim Düsterhus

Tim_Dusterhus · March 21, 2025, 11:22am

Hi

Am 2025-03-18 18:48, schrieb Paul M. Jones:

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false

From my reading of RFC 3987 that result is incorrect. The space is listed neither as `iunreserved` nor as `sub-delims`, and thus isn't a valid `ipchar`. The space therefore needs to be encoded as %20 for IRIs as well. The same mistake applies to the reference userland implementation below.

Best regards
Tim Düsterhus

Paul_M_Jones · March 22, 2025, 1:43pm

Hi Tim & all,

On Mar 21, 2025, at 06:22, Tim Düsterhus <tim@bastelstu.be> wrote:

Am 2025-03-18 18:48, schrieb Paul M. Jones:

$iriPath = '/heads/' . rawurlencode($val) . '/tails/';
assert($iriPath === '/heads/fü bar/tails/'); // false

From my reading of RFC 3987 that result is incorrect. The space is listed neither as `iunreserved` nor as `sub-delims`, and thus isn't a valid `ipchar`. The space therefore needs to be encoded as %20 for IRIs as well. The same mistake applies to the reference userland implementation below.

Agreed; the naive implementation would need to be less naive and pay closer attention to the ABNF for `ucschar` and `ipchar` in the spec.
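
As a rough illustration of what a less naive version might look like, here is a minimal sketch that leaves only RFC 3987 `iunreserved` characters (ASCII unreserved plus `ucschar`) untouched and percent-encodes everything else, including the space. The function name `iri_encode_segment()` is hypothetical, the input is assumed to be UTF-8, and the sketch deliberately encodes `sub-delims`, ":" and "@" as well (as rawurlencode() does), even though `ipchar` would permit them.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP): percent-encode a UTF-8 string for
 * use as an IRI path segment value, keeping RFC 3987 `iunreserved`
 * characters as they are and encoding every other character byte by byte.
 */
function iri_encode_segment(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Input must be valid UTF-8.');
    }

    // ucschar ranges taken from the RFC 3987 ABNF.
    $ucschar = '\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}';
    for ($plane = 1; $plane <= 13; $plane++) {
        // Planes 1 to 13: %x10000-1FFFD through %xD0000-DFFFD.
        $ucschar .= sprintf('\x{%X0000}-\x{%XFFFD}', $plane, $plane);
    }
    $ucschar .= '\x{E1000}-\x{EFFFD}';

    // Match anything that is not ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar.
    $pattern = '/[^A-Za-z0-9\-._~' . $ucschar . ']/u';

    $encoded = preg_replace_callback(
        $pattern,
        static fn (array $m): string => rawurlencode($m[0]),
        $value
    );

    if ($encoded === null) {
        throw new RuntimeException('Percent-encoding failed.');
    }

    return $encoded;
}

// The example from the start of the thread, with the space encoded:
echo '/heads/' . iri_encode_segment('fü bar') . '/tails/', PHP_EOL;
// prints: /heads/fü%20bar/tails/
```

Whether `sub-delims` should be encoded depends on which component the value is destined for, so a real API would likely need to know the target component.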

Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:

- `http_build_query()` would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
- `parse_str()` would need a corresponding `mb_parse_str()`.

-- pmj

Rowan_Tommins_IMSoP · March 22, 2025, 3:20pm

On 21/03/2025 11:17, Tim Düsterhus wrote:

I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.

This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same things as rawurlencode, but for multi-byte strings", but that's not what is being proposed.

Notably, a similar feature is actually slated for removal; to quote the Deprecated Features page of the PHP manual:

> Usage of the QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is deprecated for all MBString functions. Unlike all the other text encodings supported by MBString, these do not encode a sequence of Unicode codepoints, but rather a sequence of raw bytes. It is not clear what the correct return values for most MBString functions should be when one of these non-encodings is specified.

The same applies here: if you write mb_rawurlencode($my_string, 'SHIFT-JIS'), does that mean convert what you can to ASCII, and percent encode the rest for a URI; or does it mean convert to UTF-8, and percent encode as necessary for an IRI? If the input contains sequences which are not valid SHIFT-JIS, are those bytes treated as unencodable (producing errors or substitution characters), or are they directly percent encoded?
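
The ambiguity can be made concrete with functions that already exist. The following is only an illustration, assuming ext/mbstring is loaded; the two "readings" are labels of convenience for two ways an encoding parameter could be interpreted, not proposed behavior.

```php
<?php
declare(strict_types=1);

// U+3042 (HIRAGANA LETTER A) encoded as Shift_JIS is the two bytes 0x82 0xA0.
$sjis = mb_convert_encoding("\u{3042}", 'SJIS', 'UTF-8');

// Reading 1: percent-encode the Shift_JIS bytes exactly as given.
$rawBytes = rawurlencode($sjis);

// Reading 2: convert to UTF-8 first, then percent-encode those bytes.
$viaUtf8 = rawurlencode(mb_convert_encoding($sjis, 'UTF-8', 'SJIS'));

echo $rawBytes, PHP_EOL; // prints: %82%A0
echo $viaUtf8, PHP_EOL;  // prints: %E3%81%82
```

Both outputs are plausible answers for the same call, and only the second one is interoperable with UTF-8-based IRI processing.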

The correct solution to me is to build a proper thought-through API as part of the proposed new Uri namespace and not adding new standalone functions without a clear vision.

I completely agree.

For instance, the IRI standard does include an algorithm for converting a non-Unicode IRI representation to a URI - but it requires a Unicode Normalization step, which is a complex algorithm not included in ext/standard or ext/mbstring, only ext/intl. However, a function in the URI namespace that only handled the UTF-8 input case might still be useful.
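
A minimal sketch of that UTF-8-only case might look like the following. The name `iri_to_uri_utf8()` is hypothetical, the input is assumed to already be a syntactically valid UTF-8 IRI, and the sketch skips the normalization step and any special handling of the host component; it simply percent-encodes every non-ASCII character, which is slightly broader than the `ucschar` / `iprivate` set the RFC names.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP): map a UTF-8 IRI to a URI by
 * percent-encoding the UTF-8 bytes of every non-ASCII character.
 * ASCII characters, including existing %XX escapes, are left untouched.
 */
function iri_to_uri_utf8(string $iri): string
{
    // With the /u modifier, [^\x00-\x7F] matches whole non-ASCII code points.
    // preg_replace_callback() returns null when $iri is not valid UTF-8.
    return preg_replace_callback(
        '/[^\x00-\x7F]+/u',
        static fn (array $m): string => rawurlencode($m[0]),
        $iri
    ) ?? throw new InvalidArgumentException('Input must be valid UTF-8.');
}

echo iri_to_uri_utf8('/heads/fü%20bar/tails/'), PHP_EOL;
// prints: /heads/f%C3%BC%20bar/tails/
```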

Along those lines, I think there might need to be two additional changes/additions to help with encoding for RFC 3987 and WHATWG-URL component values:

- `http_build_query()` would need PHP_QUERY_3987 and PHP_QUERY_WHATWG flags and corresponding logic (or entirely new functions); and
- `parse_str()` would need a corresponding `mb_parse_str()`.

I haven't followed the other URI thread at all, but isn't replacing the scattered standard library functions with a consistent API the whole point of that effort?

parse_str() in particular has a non-descriptive name, and a weird function signature because it used to directly overwrite variables by name.

As a comparison, we didn't extend the shuffle() function with an algorithm parameter, we added a shuffleArray() method to the new Randomizer class.

--
Rowan Tommins
[IMSoP]

Paul_M_Jones · March 22, 2025, 4:08pm

Hi Rowan & all,

On Mar 22, 2025, at 10:20, Rowan Tommins [IMSoP] <imsop.php@rwec.co.uk> wrote:

On 21/03/2025 11:17, Tim Düsterhus wrote:

I am not sure if that signature makes sense and if the proposed functionality fits into mbstring for that reason. IRIs are defined as UTF-8, any other encoding results in invalid output / results that are not interoperable.

This confirms a nagging feeling I had when I first saw the thread: the name "mb_rawurlencode" implies "do the same things as rawurlencode, but for multi-byte strings", but that's not what is being proposed.

[snip]

No argument; my point is more "if we are going to do IRI and WHATWG-URL, we're going to need some additional support functionality around encoding component values for them." How that is achieved is up for grabs. If this discussion has revealed a tentative consensus that it needs to happen, I consider it a success.

Next up: what exactly should the API around this functionality look like? I suggested functions but that's clearly a non-starter; what do we feel is a good alternative, and can it be achieved independently from (but in support of) the URI+WHATWG-URL proposal?

-- pmj

Rowan_Tommins_IMSoP · March 23, 2025, 12:04pm

On 22 March 2025 16:08:54 GMT, "Paul M. Jones" <pmjones@pmjones.io> wrote:

Next up: what exactly should the API around this functionality look like? I suggested functions but that's clearly a non-starter; what do we feel is a good alternative, and can it be achieved independently from (but in support of) the URI+WHATWG-URL proposal?

As I say, I haven't followed the previous conversation at all, but from a glance at the RFC, it seems the proposed classes are called "Url"/"Uri", not "UrlParser"/"UriParser", so could maybe be expanded to creating *from* parts. I don't know where exactly IRIs should fit in, but maybe as a new object in the same hierarchy?

There's also definitely a place for standalone functions for handling specific jobs on fragments of URIs. It would actually be really great to have a replacement for parse_str which didn't carry the baggage of old PHP versions - no by-reference output, no name mangling of keys (at least not by default). http_build_query isn't as urgently in need of replacement, but a clean start could default the separator to '&' rather than pulling from an INI setting.
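
To make the parse_str() baggage concrete, here is a small sketch of what a cleaner starting point could look like. The function name `parse_query_pairs()` is hypothetical, and the sketch makes simplifying assumptions: it splits only on "&", does not interpret "[]" nesting, and decodes with rawurldecode(), so a "+" is not turned into a space.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP): split a query string into
 * [name, value] pairs. It returns its result instead of writing to a
 * by-reference argument, and keeps names exactly as decoded.
 */
function parse_query_pairs(string $query): array
{
    $pairs = [];

    foreach (explode('&', $query) as $part) {
        if ($part === '') {
            continue; // skip empty segments such as in "a=1&&b=2"
        }

        // A name without "=" gets an empty string as its value.
        [$name, $value] = array_pad(explode('=', $part, 2), 2, '');
        $pairs[] = [rawurldecode($name), rawurldecode($value)];
    }

    return $pairs;
}

// For comparison, parse_str() writes by reference and rewrites the dot:
parse_str('a.b=1', $mangled);
// $mangled is ['a_b' => '1']

$pairs = parse_query_pairs('a.b=1&x=%20y');
// $pairs is [['a.b', '1'], ['x', ' y']]
```

Returning a list of pairs rather than a map also preserves repeated names and their order, which is one possible design choice among several.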

Whether each function should take an enum flag for encoding variants, or be split into a family of similar functions, I don't know. At the moment, http_build_query accepts constants, (raw)urlencode is split into two functions, and parse_str doesn't give any option.

I don't want to have to memorise a bunch of RFC numbers in order to know whether spaces will be encoded as plus signs, but maybe we can find something more descriptive than "raw" to distinguish them.
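
For reference, this is how the existing variants differ on a single space today, using only current functions and constants:

```php
<?php
declare(strict_types=1);

$value = 'a b';

// Two separate functions select the space handling:
$form = urlencode($value);    // 'a+b'
$raw  = rawurlencode($value); // 'a%20b'

// http_build_query() selects it with a constant instead:
$queryDefault = http_build_query(['q' => $value]);
// 'q=a+b' (PHP_QUERY_RFC1738 is the default)

$queryRfc3986 = http_build_query(['q' => $value], '', '&', PHP_QUERY_RFC3986);
// 'q=a%20b'
```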

In short, I wouldn't start from the point of "how do we extend current functions to handle IRIs?", I'd start from the point of "what functions do we need for handling URI/URL/IRI parts, and what variations of each?"

Rowan Tommins
[IMSoP]