[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Tim_Dusterhus · March 30, 2025, 12:36pm

Hi

Apologies for getting back to you just now.

On 3/2/25 23:00, Máté Kocsis wrote:

What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain? Is it
perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since `$errors` is not applicable for 3986?

[…]

The $errors property will contain an empty array though, as you supposed. I
don't see much problem with using the same exception in both cases, however
I'm also fine with making the $errors property nullable in order to indicate
that returning errors is not supported by the implementation triggering the
error.

I think I would prefer:

     namespace Uri {
         class InvalidUriException extends \Uri\UriException
         {
         }
     }

     namespace Uri\WhatWg {
         class InvalidUrlException extends \Uri\InvalidUriException {
             /** @var list<UrlValidationError> */
             public readonly array $errors;
         }
     }

(note the use of Url in the name of the sub-exception)

While this would result in a little more boilerplate, it would make static analysis tools more useful, since the `$errors` array could be properly typed instead of being just `array<mixed>`.
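
To illustrate the static-analysis point, here is a minimal, self-contained sketch. The `Sketch*` classes are hypothetical stand-ins for the proposed `Uri\UriException`, `Uri\InvalidUriException`, `Uri\WhatWg\InvalidUrlException` and `UrlValidationError`; they are not the real extension classes, and the `context` field is invented purely for the example.

```php
<?php
// Hypothetical stand-ins for the proposed exception hierarchy; the real
// classes would live in the Uri and Uri\WhatWg namespaces.
class SketchUriException extends Exception {}
class SketchInvalidUriException extends SketchUriException {}

// Placeholder error type; its shape is an assumption made for this sketch.
final class SketchUrlValidationError
{
    public function __construct(public readonly string $context) {}
}

class SketchInvalidUrlException extends SketchInvalidUriException
{
    /** @param list<SketchUrlValidationError> $errors */
    public function __construct(string $message, public readonly array $errors)
    {
        parent::__construct($message);
    }
}

// Catching the parent type handles both parsers; only the WHATWG-specific
// subclass exposes $errors, so the base exception needs no empty array.
function describeFailure(SketchInvalidUriException $e): string
{
    if ($e instanceof SketchInvalidUrlException) {
        return sprintf('%s (%d validation errors)', $e->getMessage(), count($e->errors));
    }

    return $e->getMessage();
}
```

With this shape, code that only cares about "the input was invalid" catches the parent type, while WHATWG-aware code can narrow to the subclass and read a typed list.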

7.

In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it's important
to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is
decoded to `=`. This really is the same case as with `%2F` in a path.
The explanation

[…]
The relevant sections will give a little more reasoning why I went with
these rules.

I've tested some of the examples against the implementation, but it does not match the description. Is the implementation up to date?

<?php

$url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");

var_dump($url->getPath()); // /foo/bar%2Fbaz
var_dump($url->getRawPath()); // /foo/bar%2Fbaz

results in:

string(12) "/foo/bar/baz"
string(14) "/foo/bar%2Fbaz"

The implementation for Rfc3986 appears to be correct.

"the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components."

alone is not clear to me. “reserved characters that are not allowed in a
component”. I assume this means that `%2F` (/) in a path will not be
decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?

I hope that this question is also clear after my clarifications + the
reconsidered logic.

Please also give an explicit example for `%3F` in a path. I know that it is reserved from reading the Rfc3986, but I think it's a little unintuitive. You can adjust the last example in the component retrieval section to make it show all cases. So:

$uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");

     echo $uri->getHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getRawHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getPath(); // /foo/bar%3Fbaz
     echo $uri->getRawPath(); // /foo/bar%3Fbaz
     echo $uri->getQuery(); // foo=bar%26baz%3Dqux
     echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

During testing I also noticed that the Rfc3986 implementation removes trailing slashes from the path when using the normalized version. This was a little unexpected, because to me this is the difference between a directory and a file. I don't think there are clear examples showing that. So:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");

echo $uri->getPath(); // /foo/bar
echo $uri->getRawPath(); // /foo/bar/

9.

In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode `?` and `#` as necessary. Will the same happen
for Rfc3986? Will the encoding of `#` also happen for the query-string
component? The RFC only mentions the path component.

The above referenced sections will give a clear answer for this question as
well.
TLDR: after your message, I realized that automatic percent-encoding also
triggers a (soft)
error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri,
so it won't do any automatic percent-encoding. It's unfortunate, because
this behavior is not
consistent with WHATWG, but it's more consistent with the parsing rules of its
own specification,
where there are only hard errors, and there's no such thing as "automatic
correction".

Is the implementation already up to date with this change? When I try:

     var_dump(
       (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
         ->withPath('some/path?foo=bar')
         ->toString()
     );

I get

string(36) "https://example.comsome/path?foo=bar"

which is completely wrong.

-------

It also surprised me, but IP address normalization is only performed by
WHATWG
during recomposition! But nowhere else...

I think this might be a misunderstanding of the WHATWG specification. It seems to be also normalized during parsing:

When I do the following in my Google Chrome:

(new URL('https://[0:0::1]')).host;

I get `[::1]`, which indicates the normalization happening. And likewise will:

(new URL('https://[2001:db8:0:0:0:0:0:1]')).host;

result in `[2001:db8::1]`.

I've also tested this with the implementation to see if this is just something that is not clear in the RFC text, but correctly handled in the implementation and noticed that the behavior is pretty broken.

Consider this script:

<?php
$url = 'https://[2001:db8:0:0:0:0:0:1]/foo/path';

var_dump((new Uri\Rfc3986\Uri($url))->getHost());
var_dump((new Uri\WhatWg\Url($url))->getAsciiHost());

This outputs:

string(20) "2001:db8:0:0:0:0:0:1"
string(23) "[8193:3512:0:0:0:0:0:1]"

For Rfc3986: The square brackets are missing.
For WhatWg: The IPv6 is completely broken.
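
As a rough point of comparison, the compressed form can be reproduced with PHP's built-in `inet_pton()` and `inet_ntop()`. This is only a sketch: `compressIpv6()` is a hypothetical helper and not part of the proposed API. The last lines also note that the broken groups happen to equal the decimal values of the original hexadecimal groups, which is an observation about the numbers and not a confirmed diagnosis of the bug.

```php
<?php
// Hypothetical helper: round-trips an IPv6 literal (without brackets)
// through its binary form to obtain the compressed text representation.
function compressIpv6(string $address): ?string
{
    // inet_pton() emits a warning on invalid input, hence the suppression.
    $packed = @inet_pton($address);

    // Reject invalid input and IPv4 addresses (4 bytes instead of 16).
    if ($packed === false || strlen($packed) !== 16) {
        return null;
    }

    $text = inet_ntop($packed);

    return $text === false ? null : $text;
}

var_dump(compressIpv6('2001:db8:0:0:0:0:0:1'));
var_dump(compressIpv6('0:0::1'));

// The groups "8193" and "3512" in the output above match these values:
var_dump(hexdec('2001')); // int(8193)
var_dump(hexdec('db8'));  // int(3512)
```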

My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and `[2001:db8::1]` for WhatWg. I have also tested the behavior of `withHost()` when leaving out the square brackets. The Rfc3986 implementation correctly throws an Exception, but WhatWg silently does nothing:

$url = 'https://example.com/foo/path';

var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());

results in

string(28) "https://example.com/foo/path"

Best regards
Tim Düsterhus

Tim_Dusterhus · March 30, 2025, 12:42pm

Hi

Am 2025-03-27 23:49, schrieb Ignace Nyamagana Butera:

Hi Máté,

   for RFC 3986:
   RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax), and then
   this string is parsed and validated. Unfortunately, I recently
   realized that this approach may leave room for some kind of parsing
   confusion attack, namely when the scheme is for example "https", the
   authority is empty, and the path is "example.com". This will
   result in a https://example.com
   URI. I believe a similar bug is not possible with the rest of the
   components because they have their delimiters. So possibly some
   other solution will be needed, or maybe adding some additional
   validation (?).

This is not correct according to RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax):

*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*

So in your example it should throw a Uri\InvalidUriException for RFC3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC3986, the path component can never be `null`, but that's another discussion. Like I said, having a `fromComponents` named constructor would allow the "removal" of the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme, but I can understand it being left out of the current implementation; it might be brought back for future improvements.

I just tested this with the implementation and it also appears to not yet be correct:

     var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"

and

var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"
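
The RFC 3986 rule quoted above can be sketched as a small check. `isPathValidForAuthority()` is a hypothetical helper written for illustration, not something the implementation is known to contain; `null` stands for "no authority component".

```php
<?php
// Hypothetical check for the RFC 3986 section 3.3 constraint between the
// authority and path components.
function isPathValidForAuthority(?string $authority, string $path): bool
{
    if ($authority !== null) {
        // With an authority, the path must be empty or start with "/".
        return $path === '' || str_starts_with($path, '/');
    }

    // Without an authority, the path must not start with "//", otherwise
    // the recomposed string would be re-parsed as having an authority.
    return !str_starts_with($path, '//');
}

var_dump(isPathValidForAuthority('example.com', 'some/path')); // bool(false)
var_dump(isPathValidForAuthority(null, '//foo/bar'));          // bool(false)
var_dump(isPathValidForAuthority(null, 'example.com'));        // bool(true)
```

Under this reading, a wither would have to reject `withPath('//foo/bar')` on a URI without an authority instead of silently producing a host.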

Best regards
Tim Düsterhus

nyamsprod_the_funky · March 30, 2025, 8:53pm

On 30/03/2025 14:42, Tim Düsterhus wrote:

Hi

Am 2025-03-27 23:49, schrieb Ignace Nyamagana Butera:

Hi Máté,

for RFC 3986:
RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax), and then
this string is parsed and validated. Unfortunately, I recently
realized that this approach may leave room for some kind of parsing
confusion attack, namely when the scheme is for example "https", the
authority is empty, and the path is "example.com". This will
result in a https://example.com
URI. I believe a similar bug is not possible with the rest of the
components because they have their delimiters. So possibly some
other solution will be needed, or maybe adding some additional
validation (?).

This is not correct according to RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax):

*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//").*

So in your example it should throw a Uri\InvalidUriException for RFC3986, and in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC3986, the path component can never be `null`, but that's another discussion. Like I said, having a `fromComponents` named constructor would allow the "removal" of the need for a UriBuilder (in your future section) and would IMHO be useful outside of the context of the http(s) scheme, but I can understand it being left out of the current implementation; it might be brought back for future improvements.

I just tested this with the implementation and it also appears to not yet be correct:

     var_dump((new Uri\Rfc3986\Uri("example.com"))->getHost()); // NULL
     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->getHost()); // string(11) "example.com"
     var_dump((new Uri\Rfc3986\Uri("example.com"))->withScheme('https')->toRawString()); // string(19) "https://example.com"

and

var_dump((new Uri\Rfc3986\Uri("foo/bar"))->withPath('//foo/bar')->getHost()); // string(3) "foo"

Best regards
Tim Düsterhus

Hi Tim and Máté, upon further inspection and verification of RFC3986, I also see an issue with the example used for normalization in the RFC. According to RFC3986 (RFC 3986: Uniform Resource Identifier (URI): Generic Syntax):

    The reg-name syntax allows percent-encoded octets in order to
    represent non-ASCII registered names in a uniform way that is
    independent of the underlying name resolution technology. Non-ASCII
    characters must first be encoded according to UTF-8 [STD63], and then
    each octet of the corresponding UTF-8 sequence must be percent-
    encoded to be represented as URI characters. URI producing
    applications must not use percent-encoding in host unless it is used
    to represent a UTF-8 character sequence. When a non-ASCII registered
    name represents an internationalized domain name intended for
    resolution via the DNS, the name must be transformed to the IDNA
    encoding [RFC3490] prior to name lookup.

From this we can infer that:

- Host encoding can only happen for UTF-8 sequences, but your example uses "ex%61mple.com", which does not conform to the rules (i.e. it should throw an InvalidUriException IMHO for the Uri class). I presume that for WhatWg URL it will get correctly converted with a soft error (??).

- When available, IDNA is preferred to percent-encoded sequences.

Best regards

Ignace Nyamagana Butera

nyamsprod_the_funky · March 31, 2025, 7:15pm

Hi Maté and all,

I spotted another inconsistency in the normalization under RFC3986

According to the RFC (RFC 3986: Uniform Resource Identifier (URI): Generic Syntax)

For consistency, URI producers and normalizers should use uppercase hexadecimal
digits for all percent-encodings.

So during normalization, uppercase percent-encodings should be used for every component, which is not the case for the example in the RFC. See for instance:

$uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
echo $uri->toString(); // https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com

the `toString` method should return
`https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com` instead.
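
The case normalization described here can be sketched with a regular expression over percent-encoded triplets. `uppercasePercentEncoding()` is a hypothetical helper for illustration only; it says nothing about how the uriparser library implements the step.

```php
<?php
// Hypothetical helper: uppercases the hex digits of every well-formed
// percent-encoded triplet and leaves all other characters untouched.
function uppercasePercentEncoding(string $component): string
{
    $result = preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static fn (array $match): string => strtoupper($match[0]),
        $component
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $component;
}

echo uppercasePercentEncoding('%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com'), "\n";
// %E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com
```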

Best regards

Ignace Nyamagana Butera

Mate_Kocsis · April 2, 2025, 5:59pm

Hi Ignace,

I spotted another inconsistency in the normalization under RFC3986

Thanks for spotting this: apparently, it is due to a small bug in the uriparser library, which I managed to fix locally, PR is on its way to upstream.

Máté

Mate_Kocsis · April 2, 2025, 8:41pm

Hi Ignace,

upon further inspection and verification of RFC3986 I also see an issue with the example used for normalization in the RFC. According to RFC3986 (https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2) :
 The reg-name syntax allows percent-encoded octets in order to
  represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [[STD63](https://www.rfc-editor.org/rfc/rfc3986.html#ref-STD63)], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters.  URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [[RFC3490](https://www.rfc-editor.org/rfc/rfc3490)] prior to name lookup.
From this we can infer that:

Host encoding can only happen for UTF-8 sequence but in your example “ex%61mple.com” is used which is not conforming to the rules (ie it should throw an InvalidUriException IMHO for the Uri class) I presume for WhatWg URL it will get correctly converted with a soft error (??).

Oh, that’s a very interesting catch again. If your interpretation is correct, then I think it must also be some bug
with the parser library, but I have to dig into the code first, or reach out to its author.

I have some suspicion though that the “URI producing applications” part may not apply to this case; at least I have a hard time
deciding what this expression really means. The RFC also uses “URI reference parsers”, which is really
a straightforward name, while “URI producers” isn’t. For example, there is a paragraph in the RFC:

URI producers and normalizers should omit the “:” delimiter that separates host from port if the port component is empty. Some schemes do not allow the userinfo and/or port subcomponents.

Clearly, omitting “:” is not done at parse time, but when a URI (reference) is produced. So I find it possible that
“URI producing” means when the URI string is created, not when the URI is parsed, although the RFC usually
uses URI and URI reference consistently. So I’m not sure. Maybe it’s a typo, and it should have been “URI normalizers”.

Regards,
Máté

nyamsprod_the_funky · April 4, 2025, 5:46pm

On 02/04/2025 19:59, Máté Kocsis wrote:

Hi Ignace,

I spotted another inconsistency in the normalization under RFC3986

Thanks for spotting this: apparently, it is due to a small bug in the uriparser library, which I managed to fix locally, PR is on its way to upstream.

Máté

Hi Máté I have a couple of questions regarding RFC3986\Uri

- I believe that during normalization of an IPv6 host the letters a-f should be lowercased in accordance with the RFC, since RFC3986 follows RFC 3513: Internet Protocol Version 6 (IPv6) Addressing Architecture, which has been replaced by RFC 4291: IP Version 6 Addressing Architecture, which is updated by RFC 5952: A Recommendation for IPv6 Address Text Representation, which recommends lowercasing the letters. (Yeah, that was quite a bit of digging, I know.)

- Since the withers expect well-encoded components, does that mean the same holds for the constructor? What is the expected result for the following code?

$uri = new Uri\Rfc3986\Uri("https://example,com/?foo[]=1&foo[]=2");

Should the above trigger an exception because the query component contains invalid characters, or
is it acceptable? Asking because currently our dear old parse_url does not fail on this and
probably most PHP developers expect this not to fail.

IMHO I am in favor of it failing to get a consistent experience when using the class, because
otherwise you introduce an inconsistency between the constructor behaviour and the rest of the class
API.
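
For context, the RFC 3986 grammar defines `query = *( pchar / "/" / "?" )`, which does not include literal `[` or `]`. A minimal sketch of that check follows; `isValidRfc3986Query()` is a hypothetical helper and not the validation the implementation actually performs.

```php
<?php
// Hypothetical validator for the RFC 3986 query grammar:
//   query = *( pchar / "/" / "?" )
//   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
function isValidRfc3986Query(string $query): bool
{
    $pattern = '/^(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\/?]|%[0-9A-Fa-f]{2})*$/D';

    return preg_match($pattern, $query) === 1;
}

var_dump(isValidRfc3986Query('foo[]=1&foo[]=2'));         // bool(false)
var_dump(isValidRfc3986Query('foo%5B%5D=1&foo%5B%5D=2')); // bool(true)
```

A strict constructor would therefore require the brackets to arrive percent-encoded, which is the behaviour change relative to `parse_url` raised in the question.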

Best regards,
Ignace Nyamagana Butera

Mate_Kocsis · April 7, 2025, 11:00pm

Hey Ignace,

(let me answer in the original thread, as apparently the discussion continued in a separate thread from the main one)

I believe that during normalization of an IPv6 host the letters a-f should be lowercased in accordance with the RFC, since RFC3986 follows https://www.rfc-editor.org/rfc/rfc3513 which has been replaced by https://www.rfc-editor.org/rfc/rfc4291 which is updated by https://www.rfc-editor.org/rfc/rfc5952#section-4.3 which recommends lowercasing the letters. (Yeah, that was quite a bit of digging, I know.)

That’s quite a long chain of RFC updates… But yes, RFC 3986 explicitly mentions this here:

Although host is case-insensitive, producers and normalizers should use lowercase for registered names and hexadecimal addresses for the sake of uniformity, while only using uppercase letters for percent-encodings.

And that’s what the current implementation does.

Since the withers expect well-encoded components, does that mean the same holds for the constructor? What is the expected result for the following code?

$uri = new Uri\Rfc3986\Uri("https://example,com/?foo[]=1&foo[]=2");

Should the above trigger an exception because the query component contains invalid characters, or
is it acceptable? Asking because currently our dear old parse_url does not fail on this and
probably most PHP developers expect this not to fail.

IMHO I am in favor of it failing to get a consistent experience when using the class, because
otherwise you introduce an inconsistency between the constructor behaviour and the rest of the class
API.

Yes, generally, creation or any mutation of Uri\Rfc3986\Uri fails when the URI is invalid, exactly in order to offer a consistent experience.

Regards,
Máté

Mate_Kocsis · April 7, 2025, 11:27pm

Hi Ignace,

it might be brought back for future improvements.

Yes, surely!

I have one last question regarding the URI implementations which are raised by my polyfill:

Did you also take the delimiters into account when submitting data via the withers? In other words, is

```php
$uri->withQuery('?foo=bar');
//the same as 
$uri->withQuery('foo=bar');
```

I know it is the case in the WHATWG specification, but I do not know if you kept this behaviour in your implementation for the WhatWgUrl, for the Rfc3986, or for both. I would lean toward not accepting this "normalization", but since this is not documented in the RFC I wanted to know what the expected behaviour is.

Yes, very good question! As you said, this aspect is not defined by either the RFC 3986, or the present PHP RFC… But yes, this normalization
won’t be accepted by the RFC implementation. I’ve just included this piece of information in the relevant section (https://wiki.php.net/rfc/url_parsing_api#component_modification).

Regards,
Máté

Mate_Kocsis · April 13, 2025, 12:10pm

Hi Tim,

I think I would prefer:

     namespace Uri {
         class InvalidUriException extends \Uri\UriException
         {
         }
     }

     namespace Uri\WhatWg {
         class InvalidUrlException extends \Uri\InvalidUriException {
             /** @var list<UrlValidationError> */
             public readonly array $errors;
         }
     }

(note the use of Url in the name of the sub-exception)

While this would result in a little more boilerplate, it would make
static analysis tools more useful, since the $errors array could be
properly typed instead of being just array<mixed>.

OK, this makes sense to me, and I’ve just implemented it.

In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it’s important
to know if %26 is decoded to & in a query-string. Or if %3D is
decoded to =. This really is the same case as with %2F in a path.
The explanation

[…]
The relevant sections will give a little more reasoning why I went with
these rules.

I’ve tested some of the examples against the implementation, but it does
not match the description. Is the implementation up to date?

<?php

$url = new Uri\WhatWg\Url("https://example.com/foo/bar%2Fbaz");

var_dump($url->getPath()); // /foo/bar%2Fbaz
var_dump($url->getRawPath()); // /foo/bar%2Fbaz

results in:

string(12) "/foo/bar/baz"
string(14) "/foo/bar%2Fbaz"

Yes, it is currently up-to-date, but I made some changes in WHATWG encoding not long ago and I didn’t notice that
the chosen behavior negatively affects this case… Let me share the details, because decoding of WHATWG
URLs seems very problematic.

Originally, my intention was to percent-decode characters based on the individual components’ “percent-encode set” (i.e.
https://url.spec.whatwg.org/#fragment-percent-encode-set for the fragment). These are the characters that are
automatically percent-encoded when encountered. One of my problems with this behavior was that characters in “percent-encode sets”
are not entirely in line with “URL code points” (basically the valid characters in a URL: https://url.spec.whatwg.org/#url-code-points).
Most notably, the “#”, the “[”, and “]” characters are present in some percent-encoding sets, while missing from the valid URL
code points.

If characters were percent-decoded based on the “percent-encode sets”, then there would be some issues when the result is
passed to a wither: the WHATWG setter algorithms emit a soft error in these cases (e.g. in case of the query string, the
https://url.spec.whatwg.org/#dom-url-search steps trigger https://url.spec.whatwg.org/#query-state, where step 3.1 comes
into play). To be fair, soft errors are not exposed in case of WHATWG withers, so it’s currently a theoretical problem
rather than an actual one (but I’m still considering adding a $softErrors parameter to WHATWG withers).

In any case, I believe the end of the “Component modification” section of the RFC shares some background information
regarding percent-decoding behavior.

Finally, when I changed the RFC so that only those characters were percent-decoded which were “URL code points”, I didn’t notice
that the example you referred to above would become outdated: as “/” is a URL code point, it’s currently percent-decoded by getPath().
Unfortunately, I still don’t know what the best approach would be.

Please also give an explicit example for %3F in a path. I know that it
is reserved from reading the Rfc3986, but I think it’s a little
unintuitive. You can adjust the last example in the component retrieval
section to make it show all cases. So:

$uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");

echo $uri->getHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getRawHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
echo $uri->getPath(); // /foo/bar%3Fbaz
echo $uri->getRawPath(); // /foo/bar%3Fbaz
echo $uri->getQuery(); // foo=bar%26baz%3Dqux
echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

Why is this behavior unintuitive? I think the already added examples should already make it clear that percent-encoded
characters are never percent-decoded (the component modification part also has one example).

During testing I also noticed that the Rfc3986 implementation removes
trailing slashes from the path when using the normalized version. This
was a little unexpected, because to me this is the difference between a
directory and a file. I don’t think there are clear examples showing
that. So:

$uri = new Uri\Rfc3986\Uri("https://example.com/foo/bar/");

echo $uri->getPath(); // /foo/bar
echo $uri->getRawPath(); // /foo/bar/

Yes, I agree it’s weird. I’ll have a look at the code again if the normalizer removes the trailing slash, or I messed up something.

In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode ? and # as necessary. Will the same happen
for Rfc3986? Will the encoding of # also happen for the query-string
component? The RFC only mentions the path component.

I think the question for RFC 3986 is answered in the PHP RFC by the following paragraph:

In order to offer consistent behavior with the parsing rules of RFC 3986,
withers of Uri\Rfc3986\Uri also only accept properly formatted input, meaning characters
that are not allowed to be present in a component must be
percent-encoded. Let’s see what this means in practice through the following example

Effectively, RFC 3986 has different behavior than what WHATWG does.

The latter question (“Will the encoding of # also happen for the query-string component?”)
was supposed to be answered by the RFC, because of this sentence:

WHATWG algorithm automatically percent-encodes characters that fall into the percent-encoding
character set of the given component

It may be possible that “the given” part is misleading, but the behavior actually follows the WHATWG spec
for all components. In any case, I change a few words to make this clear.

Is the implementation already up to date with this change? When I try:

var_dump(
    (new Uri\Rfc3986\Uri('https://example.com/foo/path'))
        ->withPath('some/path?foo=bar')
        ->toString()
);

I get

string(36) "https://example.comsome/path?foo=bar"

which is completely wrong.

I haven’t completely implemented withers yet for RFC 3986 (first and foremost validation is missing),
so that’s why you experienced this behavior. I would fix this later, but only if the vote succeeds. I’ve already
worked a lot on the implementation without having any promise of the RFC to succeed.

I think this might be a misunderstanding of the WHATWG specification. It
seems to be also normalized during parsing:

When I do the following in my Google Chrome:

(new URL('https://[0:0::1]')).host;

I get `[::1]`, which indicates the normalization happening. And likewise
will:

(new URL('https://[2001:db8:0:0:0:0:0:1]')).host;

result in `[2001:db8::1]`.

Yes, I realized that you are right. IPv6 support was indeed incomplete or buggy until now,
but I took some time and corrected the behavior.

My expectation be be [2001:db8:0:0:0:0:0:1] for Rfc3986 and
[2001:db8::1] for WhatWg. I have also tested the behavior of
withHost() when leaving out the square brackets. The Rfc3986 correctly
throws an Exception, but WhatWg silently does nothing:

$url = ‘https://example.com/foo/path’;

var_dump((new
Uri\WhatWg\Url($url))->withHost(‘2001:db8:0:0:0:0:0:1’)->toAsciiString());

results in

string(28) "https://example.com/foo/path"

This looks like the result of WHATWG’s host setter algorithm (https://url.spec.whatwg.org/#dom-url-hostname).
After debugging the behavior, I noticed that `new Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')` only fails when trying to parse
the port after the first ":" character. However, the setter algorithm obviously doesn’t reach this point, since it only tries to
parse the host, and then it stops (because of the state override). So I’m not sure this gotcha can be cured.

I tried to reproduce the problem in Chrome, but I realized that the URL properties are not validated at all
when they are set (`url.hostname = "2001:db8:0:0:0:0:0:1";` will change the hostname no problem)…

Regards,
Máté

Tim_Dusterhus · April 15, 2025, 2:20pm

Hi

On 2025-04-13 14:10, Máté Kocsis wrote:

     namespace Uri {
         class InvalidUriException extends \Uri\UriException
         {
         }
     }

     namespace Uri\WhatWg {
         class InvalidUrlException extends \Uri\InvalidUriException {
             /** @var list<UrlValidationError> */
             public readonly array $errors;
         }
     }

(note the use of Url in the name of the sub-exception)

While this would result in a little more boilerplate, it would make
static analysis tools more useful, since the `$errors` array could be
properly typed instead of being just `array<mixed>`.

OK, this makes sense to me, and I've just implemented it.

Great. Don't forget to adjust the RFC text (that's the more important part :-)).

At last, when I changed the RFC so that only those characters were
percent-decoded which were "URL code points", I didn't notice
that the example you referred to above would go outdated: as "/" is an URL
code point, it's currently percent-decoded by getPath().
Unfortunately, I still don't know what the best approach would be.

I see, thank you. I did some tests myself and read the spec. I've also checked whatwg/url issue #565, “How should parser handle percent-encoded characters like `%66` U+0066 (f) in path segments?”.

Perhaps the correct solution would be to offer only the non-raw methods for WHATWG URL and to not attempt any additional percent-decoding there? My reasoning is that the WHATWG URL is a living standard anyways, so trying to add additional semantics on top will result in sadness. My understanding is also that it is primarily intended for interaction with web browsers or to embed these URLs into HTML. For access control, e.g. in your framework the RFC3986 URI should be used. It's what HTTP uses internally and it supports well-defined normalization.

What do you think?

Please also give an explicit example for `%3F` in a path. I know that it
is reserved from reading the Rfc3986, but I think it's a little
unintuitive. You can adjust the last example in the component retrieval
section to make it show all cases. So:

     $uri = new Uri\Rfc3986\Uri("https://[2001:0db8:0001:0000:0000:0ab9:C0A8:0102]/foo/bar%3Fbaz?foo=bar%26baz%3Dqux");

     echo $uri->getHost();     // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getRawHost();  // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]
     echo $uri->getPath();     // /foo/bar%3Fbaz
     echo $uri->getRawPath();  // /foo/bar%3Fbaz
     echo $uri->getQuery();    // foo=bar%26baz%3Dqux
     echo $uri->getRawQuery(); // foo=bar%26baz%3Dqux

Why is this behavior unintuitive? I think the already added examples should

Unintuitive probably is not the best word. But I expect users to primarily interact with the path component of a URL (e.g. within their framework’s router). So I think it makes sense to be extra explicit with examples there. As an example, I recently learned that Symfony's router does not support (encoded) slashes within a component:

#[Route('/test/{message}', name: 'test')]

will work for http://localhost:8000/test/foo, but not for http://localhost:8000/test/foo%2Fbar, resulting in:

No route found for "GET http://localhost:8000/test/foo%2Fbar"

So if you would just extend the: “Let's have a look at some other tricky example with Uri\Rfc3986\Uri:” to my suggestion, I would be happy
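
To illustrate why an encoded slash in a path needs explicit treatment, here is a minimal sketch in JavaScript, where the WHATWG `URL` class is readily available. The helper names `pathSegments` and `naiveSegments` are hypothetical and belong neither to Symfony nor to the proposed PHP API. Splitting the still-encoded path first and decoding each segment afterwards keeps `%2F` inside one segment, whereas decoding first changes the number of segments:

```javascript
// Hypothetical helper: split the encoded path into segments, then decode each one.
function pathSegments(urlString) {
  const { pathname } = new URL(urlString); // WHATWG parsing keeps %2F encoded
  return pathname
    .split('/')
    .filter((segment) => segment !== '')
    .map((segment) => decodeURIComponent(segment));
}

// Decoding before splitting loses the distinction between "/" and "%2F".
function naiveSegments(urlString) {
  const { pathname } = new URL(urlString);
  return decodeURIComponent(pathname)
    .split('/')
    .filter((segment) => segment !== '');
}

console.log(pathSegments('http://localhost:8000/test/foo%2Fbar'));  // [ 'test', 'foo/bar' ]
console.log(naiveSegments('http://localhost:8000/test/foo%2Fbar')); // [ 'test', 'foo', 'bar' ]
```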

Note: I believe there is a small mistake in the example when you last modified it. It says:

echo $uri->getHost(); // [2001:0db8:0001:0000:0000:0ab9:C0a8:0102]

Should the 'C' in 'C0a8' also be lowercased?

In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode `?` and `#` as necessary. Will the same happen
for Rfc3986? Will the encoding of `#` also happen for the query-string
component? The RFC only mentions the path component.

I think the question for RFC 3986 is answered in the PHP RFC by the
following paragraph:

In order to offer consistent behavior with the parsing rules of RFC 3986,
withers of Uri\Rfc3986\Uri also only accept properly formatted input,
meaning characters that are not allowed to be present in a component must be
percent-encoded. Let's see what this means in practice through the
following example

Yes, thank you for pointing that out.

Effectively, RFC 3986 has different behavior than what WHATWG does.

Understood, makes sense.

The latter question ("Will the encoding of `#` also happen for the
query-string component?")
was supposed to be answered by the RFC, because of this sentence:

WHATWG algorithm automatically percent-encodes characters that fall into
the percent-encoding character set of the given component

It may be possible that "the given" part is misleading, but the behavior
actually follows the WHATWG spec
for all components. In any case, I change a few words to make this clear.

Yes, that makes sense. It's also explained in the “Percent-encoding & decoding” subsection of the “Important concepts” section, but I already forgot about that when I got down to the “Component recomposition” bit. My mistake!
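
The per-component encoding described above can be observed with the WHATWG `URL` class in JavaScript. This is only a sketch of how the WHATWG setters behave in Node.js or a browser; it makes no claim about the final PHP wither behavior. The path setter encodes `?` and the query setter encodes `#`, because each character belongs to the percent-encode set of the component being set:

```javascript
const url = new URL('https://example.com/foo/path');

// "?" is in the path percent-encode set, so it cannot start a query here.
url.pathname = 'some/path?foo=bar';
console.log(url.pathname); // /some/path%3Ffoo=bar

// "#" is in the query percent-encode set, so it cannot start a fragment here.
url.search = 'foo=bar#baz';
console.log(url.search); // ?foo=bar%23baz
console.log(url.hash === ''); // true

console.log(url.href); // https://example.com/some/path%3Ffoo=bar?foo=bar%23baz
```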

I haven't completely implemented withers yet for RFC 3986 (first and
foremost validation is missing),
so that's why you experienced this behavior. I would fix this later, but
only if the vote succeeds. I've already
worked a lot on the implementation without having any promise of the RFC
to succeed.

Understood.

My expectation would be `[2001:db8:0:0:0:0:0:1]` for Rfc3986 and
`[2001:db8::1]` for WhatWg. I have also tested the behavior of
`withHost()` when leaving out the square brackets. The Rfc3986 correctly
throws an Exception, but WhatWg silently does nothing:

     $url = 'https://example.com/foo/path';

     var_dump((new Uri\WhatWg\Url($url))->withHost('2001:db8:0:0:0:0:0:1')->toAsciiString());

results in

     string(28) "https://example.com/foo/path"

This looks like this is the result of WHATWG's host setter algorithm (
URL Standard).
After debugging the behavior, I noticed that "new
Uri\WhatWg\Url('2001:db8:0:0:0:0:0:1')" only fails when trying to parse
the port after the first ":" character. However, the setter algorithm
obviously doesn't reach this point, since it only tries to
parse the host, and then it stops (because of the state override). So I'm
not sure this gotcha can be cured.

I tried to reproduce the problem in Chrome, but I realized that the URL
properties are not validated at all
when they are set ("url.hostname = "2001:db8:0:0:0:0:0:1";" will change the
hostname no problem)...

I just tested it with node.js:

     > u = new URL('https://example.com/foo/path');
     URL {
       href: 'https://example.com/foo/path',
       origin: 'https://example.com',
       protocol: 'https:',
       username: '',
       password: '',
       host: 'example.com',
       hostname: 'example.com',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.hostname = '2001:db8:0:0:0:0:0:1'
     '2001:db8:0:0:0:0:0:1'
     > u
     URL {
       href: 'https://example.com/foo/path',
       origin: 'https://example.com',
       protocol: 'https:',
       username: '',
       password: '',
       host: 'example.com',
       hostname: 'example.com',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.toString()
     'https://example.com/foo/path'
     > u.hostname = '[2001:db8:0:0:0:0:0:1]'
     '[2001:db8:0:0:0:0:0:1]'
     > u
     URL {
       href: 'https://[2001:db8::1]/foo/path',
       origin: 'https://[2001:db8::1]',
       protocol: 'https:',
       username: '',
       password: '',
       host: '[2001:db8::1]',
       hostname: '[2001:db8::1]',
       port: '',
       pathname: '/foo/path',
       search: '',
       searchParams: URLSearchParams {},
       hash: ''
     }
     > u.toString()
     'https://[2001:db8::1]/foo/path'

So it indeed seems to be a limitation of the WHATWG specification and your PHP implementation is consistent with node.js. That is a good thing and when a user stumbles upon this, we can point them towards node.js / the spec. Not great, but this is workable!
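
Since the setter ignores the invalid value without any signal, a caller who wants to notice this has to check the result afterwards. A minimal sketch of such a check, using the same node.js `URL` class as above (the helper name `setHostnameChecked` is hypothetical, and comparing before and after is just one possible approach):

```javascript
// Hypothetical helper: report whether a WHATWG hostname setter took effect.
// Limitation: setting the hostname to its current value also reports false.
function setHostnameChecked(url, hostname) {
  const before = url.hostname;
  url.hostname = hostname; // does not throw; an invalid value is silently ignored
  return url.hostname !== before;
}

const u = new URL('https://example.com/foo/path');

console.log(setHostnameChecked(u, '2001:db8:0:0:0:0:0:1')); // false
console.log(u.href); // https://example.com/foo/path

console.log(setHostnameChecked(u, '[2001:db8:0:0:0:0:0:1]')); // true
console.log(u.href); // https://[2001:db8::1]/foo/path
```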

Best regards
Tim Düsterhus

nyamsprod_the_funky · April 15, 2025, 5:12pm

Perhaps the correct solution would be to offer only the non-raw methods for WHATWG URL and to not attempt any additional percent-decoding there? My reasoning is that the WHATWG URL is a living standard anyways, so trying to add additional semantics on top will result in sadness. My understanding is also that it is primarily intended for interaction with web browsers or to embed these URLs into HTML. For access control, e.g. in your framework the RFC3986 URI should be used. It's what HTTP uses internally and it supports well-defined normalization.

What do you think?

Hi Tim and Maté

As a primary user of RFC 3986/3987, and given my experience with WHATWG URL, I fully support removing the `raw` methods from the WhatWgUrl implementation. The specification defines parsing, validation, and normalization in one go via a state machine, so you basically always work with normalized URLs. I believe JavaScript developers and browser vendors expect normalization out of the box for security and coherence between browsers. So in the context of browsers, raw values are never expected nor wanted. I have always wondered how you could extract raw values, since the specification always talks about code points and normalizes the input while parsing the URL.

As Tim also pointed out, WHATWG URL is a living standard, so the URL produced today may not be the one produced tomorrow. That would add more burden on the maintenance side if you constantly need to update how raw values are extracted from a specification that does not even consider them (it does not offer an official way to access them).

Last but not least, I tried several times to implement a polyfill for the WHATWG URL, and I failed for that specific reason. I always go back to my initial comment: both specs are great in that they complement each other. They may overlap, but they are fundamentally different, so their public APIs should probably reflect that as well (e.g. WhatWgUrl supports IDN hosts while RFC 3986 does not, encoding differs for the query string, and so on). Trying to offer the same API for both, even for the `raw` methods, is IMHO not helping. It may even ease your implementation, since you would not have to worry about as many edge cases.

Best regards,

Ignace Nyamagana Butera

Mate_Kocsis · April 15, 2025, 9:55pm

Hi Tim,

Perhaps the correct solution would be to offer only the non-raw methods
for WHATWG URL and to not attempt any additional percent-decoding there?
My reasoning is that the WHATWG URL is a living standard anyways, so
trying to add additional semantics on top will result in sadness. My
understanding is also that it is primarily intended for interaction with
web browsers or to embed these URLs into HTML. For access control, e.g.
in your framework the RFC3986 URI should be used. It’s what HTTP uses
internally and it supports well-defined normalization.

What do you think?

This was one of my (unspoken) ideas as well. I used to think there must have been a correct logic
for percent-decoding of WHATWG components, but if none of us can come up with a sensible
idea, then it’s best not to try it, I agree.

Unintuitive probably is not the best word. But I expect users to primarily
interact with the path component of a URL (e.g. within their
framework’s router). So I think it makes sense to be extra explicit with
examples there. As an example, I recently learned that Symfony’s router
does not support (encoded) slashes within a component:

#[Route('/test/{message}', name: 'test')]

will work for http://localhost:8000/test/foo, but not for
http://localhost:8000/test/foo%2fbar, resulting in:

No route found for "GET http://localhost:8000/test/foo%2fbar"

So if you would just extend the: “Let’s have a look at some other tricky
example with Uri\Rfc3986\Uri:” to my suggestion, I would be happy

Alright, I’ll add it. It won’t hurt for sure!

Note: I believe there is a small mistake in the example when you last
modified it. It says:

echo $uri->getHost(); //
[2001:0db8:0001:0000:0000:0ab9:C0a8:0102]

Should the ‘C’ in ‘C0a8’ also be lowercased?

Yes, nice catch! I swear I double-checked multiple times whether there were any uppercase letters that should
be lowercased…

So it indeed seems to be a limitation of the WHATWG specification and
your PHP implementation is consistent with node.js. That is a good thing
and when a user stumbles upon this, we can point them towards node.js /
the spec. Not great, but this is workable!

Thank you for the test! To be honest, I pretty much don’t like how WHATWG
setters are specified, they seem to behave very “ad hoc” based on what I saw so far.

Regards,
Máté

Tim_Dusterhus · April 17, 2025, 7:22am

Hi

On 2025-04-15 23:55, Máté Kocsis wrote:

This was one of my (unspoken) ideas as well. I used to think there must
have been a correct logic
for percent-decoding of WHATWG components, but if none of us can come up
with a sensible
idea, then it's best not to try it, I agree.

Sweet. I believe this was/is the last remaining blocker for the RFC, or is there still anything else from your side that needs to be discussed? I need to give the RFC another read once you have made the adjustment to remove the WhatWg raw methods (and adjusted the corresponding explanations), but I think I'm happy then.

-----

For the latest changes from Tuesday, I see that you added the WhatWg-specific `InvalidUrlException`. The `Uri\InvalidUriException` however still has the `$errors` property. I think you might have forgotten to remove it, since the Rfc3986 implementation / the base exception does not expose any errors, right?

Best regards
Tim Düsterhus

Mate_Kocsis · April 17, 2025, 11:18am

Hi,

Tim Düsterhus <tim@bastelstu.be> wrote (on Thu, Apr 17, 2025, 9:22):

Hi

On 2025-04-15 23:55, Máté Kocsis wrote:

This was one of my (unspoken) ideas as well. I used to think there must
have been a correct logic
for percent-decoding of WHATWG components, but if none of us can come
up
with a sensible
idea, then it’s best not to try it, I agree.

Sweet. I believe this was/is the last remaining blocker for the RFC, or
is there still anything else from your side that needs to be discussed? I
need to give the RFC another read once you made the adjustment to remove
the WhatWg raw methods (and adjusted the corresponding explanations),
but I think I’m happy then

No, I also think that was the last one, as I don’t have any questions left. However,
we should finalize what the WHATWG getters should be named. I like the explicit “raw”
that you suggested, but I can also see that it may be confusing for some people. Altogether,
I think I prefer adding “raw” so that it’s clear that they behave similarly to how the raw RFC 3986 getters
do.

For the latest changes from Tuesday, I see that you added the
WhatWg-specific InvalidUrlException. The Uri\InvalidUriException
however still has the $errors property. I think you might have
forgotten to remove it, since the Rfc3986 implementation / the base
exception does not expose any errors, right?

I made the changes in the RFC in a hurry, so yes, I forgot to remove the property. Thanks!

Máté

nyamsprod_the_funky · April 17, 2025, 11:49am

I still have one last question regarding the RFC 3986 URI path component.
Currently the path is nullable, but according to the RFC the path cannot be null.
According to the RFC, the path can have one of 5 ABNF representations:

      path          = path-abempty    ; begins with "/" or is empty
                    / path-absolute   ; begins with "/" but not "//"
                    / path-noscheme   ; begins with a non-colon segment
                    / path-rootless   ; begins with a segment
                    / path-empty      ; zero characters

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>

None of these is null: the path can only be a string, empty or not. So I would change the getPath and withPath signatures to highlight that fact. Apart from that, I have no more comments.

Mate_Kocsis · April 17, 2025, 11:53am

Hi Ignace,

Currently the path is nullable but according to the RFC the path can not be nullable
According to the RFC the path can have up to 5 ABNF representation

Uh, this is something that I also forgot to sync between the implementation and the RFC. I also recently found out that
the get*Path() methods should be non-nullable for both classes, so I recently fixed them. Sorry for the confusion!

Regards,
Máté

Mate_Kocsis · April 17, 2025, 12:04pm

Hi Ignace,

Uh, this is something that I also forgot to sync between the implementation and the RFC. I also recently found out that
the get*Path() methods should be non-nullable for both classes, so I recently fixed them. Sorry for the confusion!

Actually, I realized after checking the RFC that it was up-to-date this time with the recent changes. So maybe you read an older version, didn’t you?

Regards,
Máté

Paul_M_Jones · April 17, 2025, 8:47pm

Hi Maté and all,

A one-off comment about the exceptions:

The RFC posits that _Uri\UriException_ extends _Exception_, and _Uri\InvalidUriException_ extends _UriException_.

However, pre-existing userland solutions to the URI problem lean more heavily on the native PHP _InvalidArgumentException_, which extends _LogicException_. (Cf. interface/README-RESEARCH.md at 1.x · uri-interop/interface · GitHub)

(_LogicException_ "represents an error in the program logic. This kind of exception should lead directly to a fix in your code.")

As such, the _InvalidUriException_ would better extend from _LogicException_.

What then to do with _UriException_? It's a base, it never gets thrown anywhere. If a base is actually necessary, perhaps it should be renamed _UriLogicException_ and extend _LogicException_; then _InvalidUriException_ can extend from that base. This leaves room for a _UriRuntimeException_ base, should one ever be needed.

-- pmj

Tim_Dusterhus · April 17, 2025, 8:58pm

Hi

On 4/17/25 22:47, Paul M. Jones wrote:

As such, the _InvalidUriException_ would better extend from _LogicException_.

No. There is a de facto policy of “not using SPL exceptions in new code”. The replacement for LogicException is the Error hierarchy.

Also, as you quoted yourself, LogicException would not be appropriate as the base for InvalidUriException, since passing invalid URIs is not a programming error. The point of the URI classes is that they validate URIs, thus malformed inputs are expected in correctly written code.

See also Add ext/random Exception hierarchy by TimWolla · Pull Request #9220 · php/php-src · GitHub for the rationale behind the exception hierarchy in ext/random (which is the first API that was rewritten for “modern PHP”). The choices there also served as the basis for the new ext/date hierarchy in PHP 8.3: PHP: rfc:datetime-exceptions
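
The calling pattern behind “malformed inputs are expected in correctly written code” can be sketched with the WHATWG `URL` class in JavaScript. This only illustrates the pattern: JavaScript happens to signal a parse failure with a `TypeError`, which is unrelated to the exception hierarchy proposed for PHP, and the helper name `parseOrNull` is hypothetical:

```javascript
// Hypothetical helper: treat a malformed URL as an expected outcome, not a bug.
function parseOrNull(input) {
  try {
    return new URL(input);
  } catch (error) {
    if (error instanceof TypeError) {
      return null; // invalid input from the outside world
    }
    throw error; // anything else is unexpected
  }
}

console.log(parseOrNull('https://example.com/foo')?.hostname); // example.com
console.log(parseOrNull('not a url')); // null
```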

Best regards
Tim Düsterhus