[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Paul_M_Jones · April 17, 2025, 9:14pm

On Apr 17, 2025, at 15:58, Tim Düsterhus <tim@bastelstu.be> wrote:

Hi

On 4/17/25 22:47, Paul M. Jones wrote:

As such, the _InvalidUriException_ would better extend from _LogicException_.

No. There is a de facto policy of “not using SPL exceptions in new code”. The replacement for LogicException is the Error hierarchy.

Ah so -- I was not aware. I retract the comment, and thanks for the correction.

-- pmj

Tim_Dusterhus · April 17, 2025, 9:19pm

Hi

On 4/17/25 23:14, Paul M. Jones wrote:

On 4/17/25 22:47, Paul M. Jones wrote:

As such, the _InvalidUriException_ would better extend from _LogicException_.

No. There is a de facto policy of “not using SPL exceptions in new code”. The replacement for LogicException is the Error hierarchy.

Ah so -- I was not aware. I retract the comment, and thanks for the correction.

Yes, we absolutely should make this an official policy in the new-ish policies repository (php/policies on GitHub) to give folks an official resource to reference and hopefully make it easier for RFC authors to make the “correct choice” without someone needing to remember the existing gentleman’s agreement.

I've put writing such a policy RFC onto my TODO list to handle when I have the time.

Best regards
Tim Düsterhus

Tim_Dusterhus · April 23, 2025, 10:50am

Hi

On 2025-04-17 13:18, Máté Kocsis wrote:

Sweet. I believe this was/is the last remaining blocker for the RFC or
is there still anyone else from your side that needs to be discussed? I
need to give the RFC another read once you made the adjustment to remove
the WhatWg raw methods (and adjusted the corresponding explanations),
but I think I'm happy then

No, I also think that was the last one, as I don't have any questions left.
Although, we should finalize what the WHATWG getters should be named. I like
the explicit "raw" that you suggested, but I can also see that it may be
confusing for some people. Altogether, I think I prefer adding "raw" so that
it's clear that they behave similarly to how the raw RFC 3986 getters do.

In https://news-web.php.net/php.internals/127114 I suggested providing only the "non-raw" methods, so I believe you misread that. I've just given the RFC another read and thought about the naming, and I believe I still prefer not having the "raw" in the name:

- Having the `raw` in the name makes the API very clunky / verbose to use.
- Other implementations, such as in browsers or node.js, also simply use the component name without any indication of the output being raw.
- Future changes to the WHATWG URL specification might introduce some normalization for components that currently don't have normalization. This would make the `raw` naming a misnomer and might require new methods / deprecations on PHP's end.

So it seems to be safer to use the naming without the `raw` and then in the documentation explain what happens with useful examples, just like the RFC already does.

------------

Other than that, I noticed the following small issues:

1.

The `UrlValidationError` class is `final` in the implementation, but not in the RFC text. I assume that is an oversight.

2.

In the "Advanced examples" section, the "another tricky example". There is a duplicate `?foo=bar%26baz%3Dqux` in the query-string. I assume that is unintentional and not part of the example.

3.

In the "Advanced examples" section, the "another tricky example". I think it would be useful to have an explicit comparison to the output of the WHATWG URL, especially around the IPv6 normalization. I've seen that this is also mentioned later, but it's probably useful to have here as well.

4.

In the "Component modification" section, for the "In order to offer consistent behavior with the parsing rules of RFC 3986, withers of Uri\Rfc3986\Uri also only accept properly formatted input," example:

There is a `echo $uri->getRawHost(); // [2001:0db8:0001:0000:0000:0ab9:C0A8:0102]` call, but the host is never modified. That appears to be an error.

5.

In the "Serialization" section: The explanation of the serialization format is overly specific regarding the implementation details. I would simplify that to just say "it supports serialization by using the toRawString() output and performs strict checks during unserialization" or similar. The reason is that I want to make some suggestions to the serialization format to provide greater flexibility for future changes during the technical review of the implementation.

------------

I did not give the implementation another test, since with the removal of the percent-decoding for WHATWG, the RFC just does what the other specifications already require. So this all makes sense to me and any differences would simply be a regular bug in the code, rather than the RFC text.

Best regards
Tim Düsterhus

nyamsprod_the_funky · April 27, 2025, 8:30pm

Hi Maté

I see you updated the RFC, but I believe there are still some errors in the examples:

$url = Uri\WhatWg\Url::parse("/foo", ".com");                 // Throws Uri\WhatWg\InvalidUrlException because of $baseUri

Since parse is used, shouldn’t it return null instead of throwing?

$uri = Uri\Rfc3986\Uri::parse("https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com"); // percent-encoded form of https://你好你好.com
echo $uri->toString();                             // https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com

RFC 3986 normalization states that the hexadecimal digits of percent-encoded octets should be uppercased, so after normalization:

https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com should be https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com
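The case normalization described above can be sketched in userland PHP. This is only an illustration under stated assumptions: `uppercasePercentEncoding()` is a hypothetical helper, not part of the proposed `Uri` API, and it covers just the uppercasing of percent-encoded triplets, not the rest of RFC 3986 normalization.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of the RFC): uppercases the hex digits of
// every percent-encoded triplet and leaves all other characters untouched.
function uppercasePercentEncoding(string $component): string
{
    return preg_replace_callback(
        '/%[0-9a-fA-F]{2}/',
        static fn (array $match): string => strtoupper($match[0]),
        $component
    ) ?? $component; // fall back to the input if the regex engine fails
}

echo uppercasePercentEncoding('%e4%bd%a0%e5%a5%bd.com'), "\n";
// %E4%BD%A0%E5%A5%BD.com
```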

I updated my polyfill to reflect the latest changes in the RFC

Tim_Dusterhus · April 27, 2025, 8:32pm

Hi

On 4/27/25 22:30, ignace nyamagana butera wrote:

$url = Uri\WhatWg\Url::parse("/foo", ".com"); //
Throws Uri\WhatWg\InvalidUrlException because of $baseUri

Since parse is used shouldn't it return null instead of throwing ?

In this case the `$baseUri` is invalid. Since this is not expected to be an untrusted value, it makes sense to me to throw an `InvalidUrlException` here. The `null` return should only be used for an invalid `$uri`.

Best regards
Tim Düsterhus

nyamsprod_the_funky · April 27, 2025, 8:50pm

On Sun, Apr 27, 2025, 22:32, Tim Düsterhus <tim@bastelstu.be> wrote:

Hi

On 4/27/25 22:30, ignace nyamagana butera wrote:

$url = Uri\WhatWg\Url::parse("/foo", ".com"); //
Throws Uri\WhatWg\InvalidUrlException because of $baseUri

Since parse is used shouldn’t it return null instead of throwing ?

In this case the $baseUri is invalid. Since this is not expected to be
an untrusted value, it makes sense to me to throw an
InvalidUrlException here. The null return should only be used for an
invalid $uri.

Best regards
Tim Düsterhus

Hi,

I understand that, but then I fail to see the added value of the parse method vs the default constructor, since from the RFC the only difference is that the parse named constructor should return null instead of throwing. If the parse method can still throw, from a consumer POV it loses much of its utility. If I really want that level of knowledge, using the constructor should be the only way to go AFAIK.

Best regards,
Ignace Nyamagana Butera

Tim_Dusterhus · April 27, 2025, 9:05pm

Hi

On 4/27/25 22:50, ignace nyamagana butera wrote:

I understand that but then I fail to see the added value of the parse
method vs the default constructor since from the RFC the only difference is
that the parse named constructor should instead of throwing return null. If
the parse method can still throw from a consumer POV it loses much of its
utility. If I really want that level of knowledge using the constructor
should be the only way to go AFAIK.

Since the `$baseUri` is a known existing URI, I expect it to always be valid; otherwise it would be a programming error. The (relative) $uri is the bit that comes from an untrusted source. Handling both cases by returning `null` would make the API much worse, since it is no longer clear which of the values is invalid.

Perhaps as a solution, it would make sense to change the signature to:

(string $uri, ?self $baseUri = null)

instead to enforce that the $baseUri must be valid. This might also improve performance by avoiding repeated parsing of the $baseUri, e.g. when bulk processing a number of relative links.
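A rough sketch of how that signature could play out when bulk processing links. Everything here is an assumption made for illustration: `SketchUrl` is a hypothetical stand-in class, its validity check is deliberately simplistic, and the string join is not real reference resolution. It only shows the control flow of a base that is parsed once and an untrusted `$uri` that yields `null` when it is invalid.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in, not the class proposed in the RFC.
final class SketchUrl
{
    private function __construct(public readonly string $value) {}

    // Returns null only for an unusable (untrusted) $uri. The base is already
    // a SketchUrl instance, so it cannot be invalid at this point.
    public static function parse(string $uri, ?self $baseUrl = null): ?self
    {
        if ($uri === '' || preg_match('/\s/', $uri) === 1) {
            return null; // simplistic stand-in for real validation
        }
        if (preg_match('~^[a-z][a-z0-9+.-]*://~i', $uri) === 1) {
            return new self($uri); // absolute input, base not needed
        }
        if ($baseUrl === null) {
            return null; // relative reference without a base
        }
        // Naive join, NOT the resolution algorithm of any specification.
        return new self(rtrim($baseUrl->value, '/') . '/' . ltrim($uri, '/'));
    }
}

$base = SketchUrl::parse('https://example.com'); // parsed once, reused below

foreach (['/a', 'b c', '/d'] as $link) {
    $url = SketchUrl::parse($link, $base);
    echo $url === null ? "invalid link\n" : $url->value . "\n";
}
```

With this shape, a `null` result can only refer to the relative link, which is the ambiguity the suggestion is meant to remove.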

Best regards
Tim Düsterhus

Mate_Kocsis · April 27, 2025, 9:47pm

Hi Tim,

In https://news-web.php.net/php.internals/127114 I suggest to only
provide the “non-raw” methods, so I believe you misread that. I’ve just
given the RFC another read and thought about the naming and I believe I
still prefer not having the “raw” in the name:

Having the raw in the name makes the API very clunky / verbose to
use.

Other implementations, such as in browsers or node.js, also simply use
the component name without any indication of the output being raw.

Future changes to the WHATWG URL specification might introduce some
normalization for components that currently don’t have normalization.
This would make the raw naming a misnomer and might require new
methods / deprecations on PHP’s end.

So it seems to be safer to use the naming without the raw and then in
the documentation explain what happens with useful examples, just like
the RFC already does.

We discussed this off the list, and the recommendation made sense to me at last.
I described the rationale in the RFC around the end of the “Component retrieval” / “Basic examples” section.

Additionally, I recorded a few WHATWG related “deviations” from the specified getter and setter steps along with
the rationale of these choices.

Other than that, I noticed the following small issues:

I fixed all these small errors, thanks for pointing them out.

In the “Serialization” section: The explanation of the serialization
format is overly specific regarding the implementation details. I would
simplify that to just say “it supports serialization by using the
toRawString() output and performs strict checks during unserialization”
or similar. The reason is that I want to make some suggestions to the
serialization format to provide greater flexibility for future changes
during the technical review of the implementation

After an off-the-list discussion, I updated the RFC text so that it reflects the
desired behavior (that is consistent with the serialization format of ext/random).

Regards,
Máté

Tim_Dusterhus · April 27, 2025, 10:33pm

Hi

On 4/27/25 23:47, Máté Kocsis wrote:

[…]

Thank you. I have just given the RFC another full read (the 2025/04/27 21:44 version) and I do not have any further remarks. I'm happy with everything that is said in the RFC and I'm really looking forward to vote “Yes”

Best regards
Tim Düsterhus

nyamsprod_the_funky · April 28, 2025, 7:05am

Hi Maté,

I found another typo in the RFC examples due to the use of boolean as parameters

// The fragment component of Uri\WhatWg\Url can also be taken into account
$url = new Uri\WhatWg\Url("https://example.com#foo");
$url->equals(new Uri\WhatWg\Url("https://example.com"), true); // false

The $excludeFragment is true by default so in the example it should be false instead. Perhaps using an Enum instead would make the DX easier than using a boolean? I believe the same issue is in all examples regarding the use of that parameter.

Best regards,
Ignace Nyamagana Butera

nyamsprod_the_funky · April 28, 2025, 8:42am

Hi, I would propose using the following enum in the Uri namespace:

enum UriComparison {
    case IncludeFragment;
    case ExcludeFragment;
}

It is a bit verbose but less error-prone, and by default the equals method on both classes would use `UriComparison::ExcludeFragment`. PS: the naming can change as long as the enum reduces the errors.
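To make the difference at the call site concrete, here is a minimal sketch of an `equals()` method that takes such an enum. `ComparableUrl` is a hypothetical class written for this illustration, and it compares plain strings without any normalization, so it does not reflect how the RFC's classes actually determine equivalence.

```php
<?php
declare(strict_types=1);

enum UriComparison
{
    case IncludeFragment;
    case ExcludeFragment;
}

// Hypothetical class for illustration only; compares raw strings.
final class ComparableUrl
{
    public function __construct(private string $url) {}

    public function equals(
        self $other,
        UriComparison $mode = UriComparison::ExcludeFragment
    ): bool {
        if ($mode === UriComparison::ExcludeFragment) {
            return self::stripFragment($this->url) === self::stripFragment($other->url);
        }

        return $this->url === $other->url;
    }

    // Drops everything from the first "#" onward, if there is one.
    private static function stripFragment(string $url): string
    {
        $pos = strpos($url, '#');

        return $pos === false ? $url : substr($url, 0, $pos);
    }
}

$a = new ComparableUrl('https://example.com#foo');
$b = new ComparableUrl('https://example.com');

var_dump($a->equals($b));                                 // bool(true)
var_dump($a->equals($b, UriComparison::IncludeFragment)); // bool(false)
```

Unlike a bare `true` or `false`, the second argument states which behavior is requested, which is the kind of mistake the enum is meant to prevent.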

Paul_M_Jones · April 28, 2025, 7:49pm

Hi Maté and all,

On Apr 27, 2025, at 16:47, Máté Kocsis <kocsismate90@gmail.com> wrote:

Hi Tim,

...

So it seems to be safer to use the naming without the `raw` and then in
the documentation explain what happens with useful examples, just like
the RFC already does.

We discussed this off the list, and the recommendation made sense to me at last.

I am glad to see it!

* * *

Removing the `raw()` methods from the Whatwg\Url class opens up another opportunity.

The Rfc3986\Uri `raw()` methods present a departure from existing userland expectations when working with URIs. No existing URI package that I'm aware of retains the normalized values as their "main" values; the values are generally retained-as-given (i.e. "raw"). Nor do they afford getting two versions of the retained values (one raw, one normalized).

This might be solved by renaming the Rfc3986\Uri methods so that the "main" methods return the raw values, and the alternative methods return the normalized versions. For example, getPath() would become getNormalizedPath(), and getRawPath() would become getPath().

But that's pretty verbose, and on considering it further, I think there are two classes combined inside Rfc3986\Uri.

Proposal:

Instead of a single Rfc3986\Uri class that tries to hold *both* raw *and* normalized values and logic at the same time, introduce a NormalizedUri class to operate with normalized values, and treat the current Uri class as operating with raw values. That would, among other things:

- fulfill existing userland expectations;
- eliminate the getRaw() methods;
- replace the toString()/toRawString() with a single idiomatic __toString() in each class;
- move normalization logic into the NormalizedUri class.

Optionally, there could be one additional method on one or both classes, toNormalizedUri(), to create and return a normalized instance. For Uri the return would be a new NormalizedUri; for NormalizedUri, the return would either be itself ($this) or a clone of itself.

If the RFC pursues that approach, it will also lend itself to either an abstract they each extend or (preferably) an interface they each implement. If an interface, I opine it should be called Uri; the current Uri class might become RawUri (with NormalizedUri not needing a rename).

Thoughts?

-- pmj

nyamsprod_the_funky · April 28, 2025, 8:47pm

Hi Paul,

The Rfc3986\Uri raw() methods present a departure from existing userland expectations when working with URIs. No existing URI package that I’m aware of retains the normalized values as their “main” values; the values are generally retained-as-given (i.e. “raw”). Nor do they afford getting two versions of the retained values (one raw, one normalized).

As a maintainer of a userland URI package I disagree with this approach. I believe offering both raw and normalized methods in a single class, while representing a new approach in PHP, also offers a better representation of URIs in general. The current approach in userland mixes both raw and half normalized components as well as RFC3986 and RFC3987 specification with ambiguity around normalization, input, constructor, what needs to be encoded where and when, something this proposal has been successful at avoiding by using the raw and normalized methods.

fulfill existing userland expectations;

Existing userland expectations are mostly built around parse_url which is one of the reasons the RFC exists to improve the status quo and to introduce in PHP valid parsers against recognizable URI specifications. Yes some adaptation will be needed to use them in userland but I believe this work is easy to do, talking from the POV of a URI package maintainer.

replace the toString()/toRawString() with a single idiomatic __toString() in each class;

For all the reasons explained in the RFC, adding a __toString method is a bad architectural design for a URI. There are so many ways to represent a URI that having a __toString for string representation gives a false sense of “there can be only one true representation for a single URI” which is not true. A URI can be normalized, raw, and have different representations depending on the context in which it will be used. So again, I believe the RFC made the right call to not implement the Stringable interface to force the developer to make the right call or to encapsulate the value object into a proper URI representational class or method that can use the exposed raw and normalized representation of each component to produce the expected URI representation.

move normalization logic into the NormalizedUri class.

The classes follow specifications that describe how normalization should be. Why would you split the responsibilities in other classes? What would be the added value?

Again, I understand this is new code and current URI packages, mine included, will have to adapt, but in the longer run I believe the proposed API is more predictable and easier to reason about. To quote someone: “Comfort and the fear of change are the greatest enemies of success.”

Best regards,
Ignace Nyamagana Butera

Mate_Kocsis · April 28, 2025, 9:20pm

Hi Ignace,

The $excludeFragment is true by default so in the example it should be false instead. Perhaps using an Enum instead would make the DX easier than using a boolean ?
I believe the same issue is in all examples regarding the use of that parameter.

You are right, I completely messed up the value of the $excludeFragment variables in the examples. After having thought about your suggestion, I’m fine with adding the enum.
It’s a bit verbose indeed, but at least it properly conveys the meaning of the parameter, so hopefully it will reduce the number of WTFs when people start to use the new API.

I fiddled a little bit with the implementation, and I went with the Uri\UriComparisonMode enum name at last. I hope that it is OK on your side.

Regards,
Máté

nyamsprod_the_funky · April 28, 2025, 9:31pm

Hi Maté,

I fiddled a little bit with the implementation, and I went with the Uri\UriComparisonMode enum name at last. I hope that it is OK on your side.

If no one objects with your name choice I am fine with it, as long as it is not a boolean I will adapt my polyfill. I think I have no more remarks from my side of things, great job! Since I do not have the right to vote I hope this one will pass when time for voting comes!

nyamsprod_the_funky · April 29, 2025, 8:54am

Hi Maté and Tim,

I have one last question while reviewing my polyfill implementation. Is it worth adding a SensitiveParameter attribute on the argument of the following methods?

Uri\Rfc3986\Uri::withUserInfo
Uri\WhatWg\Url::withPassword

I’m fine with any answer. Does it warrant a paragraph in the RFC? That I do not know, but I feel the question may be raised.

Best regards,
Ignace Nyamagana Butera

Paul_M_Jones · April 29, 2025, 1:55pm

Hi Ignace & Maté and all,

tl;dr: I argue against Ignace's objections to splitting the URI class into two classes (one that retains raw URI values and another that normalizes values as-it-goes). Jump to the very end for a discussion regarding the with() methods (search for the word "asymmetry" herein).

* * *

On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The current approach in userland mixes both raw and half normalized components as well as RFC3986 and RFC3987 specification with ambiguity around normalization, input, constructor, what needs to be encoded where and when

Based on my research into existing URI projects <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I don't think that's an accurate assessment of the ecosystem.

For example, can you point out which projects mix "raw and half-normalized components"? Nette is the only one that comes to mind, in that (during parsing) it applies rawurldecode() to the host, user, password, and fragment; but that's only one of the 18 projects.

Likewise, of the 15 URI-centric projects, only one of them (league/uri) offers both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri and rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL centric. So I don't see much ambiguity in any projects there.

As far as normalization, only one project (opis) affords the ability to normalize at creation time, though five of them offer a normalize() method with various effects (<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md>). So, again, I don't see much ambiguity there either; they don't do normalizing as-you-go, it's something you have to apply explicitly.

Regarding inputs, they all presume "raw" inputs. Regarding constructors, they mostly side with a full URI string. Regarding encoding, they mostly retain values in their encoded form (there are three outliers, cf. <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md>).

With all that in mind, we can see that the various authors of userland projects have settled on remarkably similar patterns of usage that they found valuable and useful for working with URIs.

> - fulfill existing userland expectations;

Existing userland expectations are mostly built around `parse_url`

That's kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement the RFC 3986 parsing algorithm ...

which is one of the reasons the RFC exists to improve the status quo and to introduce in PHP valid parsers against recognizable URI specifications. Yes some adaptation will be needed to use them in userland but I believe this work is easy to do, talking from the POV of a URI package maintainer.

... but I don't imagine that replacing parse_url() in those projects with the RFC 3986 algo would cause those projects to change any of their other design decisions. What adaptations do you think would be needed around that replacement?

> - replace the toString()/toRawString() with a single idiomatic __toString() in each class;

For all the reasons explained in the RFC, adding a `__toString` method is a bad architectural design for an URI. There are so many ways to represent an URI that having a `__toString` for string representation gives a false sense of "there can be only one true representation for a single URI" which is not true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as ASCII, but I'm not addressing that part of the RFC here.)

> - move normalization logic into the NormalizedUri class.

The classes follow specifications that describe how normalization should be. Why would you split the responsibilities in other classes ? What would be the added value ?

For one, unless I am missing something, there is an asymmetry between the get() methods and the with() methods. What I'm seeing is that (e.g.) Uri::withPath() expects a raw path argument, but getPath() returns the normalized version. For symmetry, I would expect either:

- `Uri::withPath(raw_value) : self` and `Uri::getPath() : raw_value`, or
- `Uri::withRawPath(raw_value) : self` and `Uri::getRawPath() : raw_value`

Thus my first intuition that the "main" values in the URI need to be the raw ones, and that getting the normalized ones should be the more verbose case (e.g. `getNormalizedPath() : normalized_value`).
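The asymmetry described here can be modeled with a small sketch. `PathOnlyUri` is a hypothetical class written only for this illustration, and its "normalization" is limited to uppercasing percent-encoded triplets; it models the behavior as described in this message and is not taken from the RFC's implementation.

```php
<?php
declare(strict_types=1);

// Hypothetical model of the described behavior: the wither accepts a raw
// value, the "main" getter returns a normalized one.
final class PathOnlyUri
{
    public function __construct(private string $rawPath = '') {}

    public function withPath(string $rawPath): self
    {
        return new self($rawPath);
    }

    public function getRawPath(): string
    {
        return $this->rawPath;
    }

    public function getPath(): string
    {
        // Stand-in normalization: uppercase percent-encoded triplets only.
        return preg_replace_callback(
            '/%[0-9a-fA-F]{2}/',
            static fn (array $match): string => strtoupper($match[0]),
            $this->rawPath
        ) ?? $this->rawPath;
    }
}

$input = '/caf%c3%a9';
$uri = (new PathOnlyUri())->withPath($input);

var_dump($uri->getPath() === $input);    // bool(false): normalized on the way out
var_dump($uri->getRawPath() === $input); // bool(true): round-trips unchanged
```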

So, one value added by splitting the classes is to resolve that asymmetry. Consumers expecting to get back from the URI what they put into it can use the raw Uri variation; "API clients or signers fall in this category that want to avoid introducing any unnecessary changes to URIs, in order to avoid causing subtle bugs."

Other consumers, who want to do things this new and different way (normalized as-you-go, unlike anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri class, and the raw version is RawUri.)

-- pmj

Tim_Dusterhus · April 29, 2025, 6:55pm

Hi

On 4/29/25 10:54, ignace nyamagana butera wrote:

I have one last question while reviewing my polyfill implementation. Is it
worth it adding a SensitiveParameter attribute on the argument of the
following methods ?

- Uri\Rfc3986\Uri::withUserInfo
- Uri\WhatWg\Url::withPassword

I'm fine with any answer ? Does it warrant a paragraph in the RFC ? That I
do not know but I feel the question may be raised ?

Good catch. Since they may throw an exception for malformed inputs, they should have the attribute. Especially since folks might try to use special characters in passwords, which might need encoding.

No paragraph in the RFC needed, but the attribute should be added to the “stub”.
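For readers unfamiliar with the attribute, a minimal sketch of its effect follows. `CredentialSketch` and `SketchInvalidUriException` are hypothetical names, and the space check is only a stand-in for real validation. The `#[\SensitiveParameter]` attribute itself exists since PHP 8.2.

```php
<?php
declare(strict_types=1);

// Hypothetical exception and class, for illustration only.
final class SketchInvalidUriException extends Exception {}

final class CredentialSketch
{
    public function withPassword(#[\SensitiveParameter] string $password): self
    {
        if (str_contains($password, ' ')) { // stand-in for real validation
            throw new SketchInvalidUriException('Malformed password component');
        }

        return clone $this;
    }
}

try {
    (new CredentialSketch())->withPassword('p@ss word');
} catch (SketchInvalidUriException $e) {
    // Arguments are only collected when zend.exception_ignore_args is off.
    $args = $e->getTrace()[0]['args'] ?? [];
    foreach ($args as $arg) {
        // Reports SensitiveParameterValue instead of exposing the string.
        echo get_debug_type($arg), "\n";
    }
}
```

The password never appears in the recorded trace arguments, which matters precisely because malformed input is the case in which an exception is thrown.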

Best regards
Tim Düsterhus

nyamsprod_the_funky · April 29, 2025, 8:08pm

Hi Paul,

I will try to address your concerns. Keep in mind that I am not the author of the RFC but I do like how it is currently shaped with some caveats but those can be put under future improvements.

So, one value added by splitting the classes is to resolve that asymmetry.

First, I agree with you. The method naming in the Uri\Rfc3986\Uri class could be improved even though it does not represent a showstopper to me. Adding the raw prefix or indeed flipping the raw* method and using normalized* would perhaps make for some clarification but I will leave that decision to Máté.

Apart from that, I believe the current RFC (especially around RFC3986) does address most if not all the issues regarding the specification. RFC3986 provides information around 3 key URI features: parsing, resolution and equivalence. In order to offer resolution and equivalence you ought to address normalization and thus encoding. Any userland package that does offer those features is required to handle component encoding/normalization first before performing the expected operation. Hence I believe that if the new URI class offers equivalence, it can and should expose URI component normalization out of the box. A separate class is IMHO not needed.

For example, can you point out which projects mix “raw and half-normalized components”?

Laminas, for example, or any PSR-implementing class, will try to encode the input string regardless of whether it is already encoded; hence the wording about not double-encoding the string that you often encounter in mutator method docblocks. The Uri class, on the other hand, only expects well-formed and encoded strings, which leaves no room for misinterpretation. Encoding arbitrary input is an area left to be filled by URI packages, for instance.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

If there are at least two possible representations, then a __toString method is still a bad design, because it may lead the developer to think that there is only one string representation, which is not true. Both representations are equivalent and represent the URI equally well. And as a bonus, not having a __toString method prevents accidental URI comparison using the == operator instead of the correct equals method. (I know this because I’ve seen codebases where PSR-7 URI instances are compared via the class’s __toString method, which is just wrong.)
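To illustrate the comparison pitfall, here is a minimal sketch. It assumes a simplified, hypothetical normalizer (`sketchNormalize`) that only lowercases the scheme and authority and uppercases percent-encoding hex digits. It is not the RFC's API, and it ignores userinfo, dot segments and every other normalization step.

```php
<?php

// Hypothetical helper: a deliberately tiny subset of RFC 3986 normalization.
function sketchNormalize(string $uri): string
{
    // Lowercase the scheme and authority (assumes no case-sensitive userinfo).
    $uri = (string) preg_replace_callback(
        '~^[a-z][a-z0-9+.-]*://[^/?#]*~i',
        static fn (array $m): string => strtolower($m[0]),
        $uri
    );

    // Uppercase the hex digits of percent-encoded triplets.
    return (string) preg_replace_callback(
        '/%[0-9a-f]{2}/i',
        static fn (array $m): string => strtoupper($m[0]),
        $uri
    );
}

// Hypothetical equivalence check built on the sketch normalizer.
function sketchEquals(string $a, string $b): bool
{
    return sketchNormalize($a) === sketchNormalize($b);
}

$a = 'HTTPS://Example.COM/a%2fb';
$b = 'https://example.com/a%2Fb';

var_dump($a == $b);              // bool(false): the raw strings differ
var_dump(sketchEquals($a, $b));  // bool(true): equivalent once normalized
```

Comparing whatever a single __toString happens to return is the first check; an explicit equals method corresponds to the second.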

PS1: I do appreciate the work you put into your study of URI packages in the PHP ecosystem, but we should not restrict the new API to resolving or aligning with those existing solutions; instead, we should try to expose an API that allows more flexibility than what PHP currently offers.
PS2: I do not think the new API will replace the URI packages. We will still need them because, in the case of the RFC3986 URI class, parsing is just one aspect of URI consumption; we still need scheme-specific validation that only PHP userland packages can offer.

Best regards,
Ignace Nyamagana Butera

On Tue, Apr 29, 2025 at 3:55 PM Paul M. Jones <pmjones@pmjones.io> wrote:

Hi Ignace & Maté and all,

tl;dr: I argue against Ignace’s objections to splitting the URI class into two classes (one that retains raw URI values and another that normalizes values as-it-goes). Jump to the very end for a discussion regarding the with() methods (search for the word “asymmetry” herein).

On Apr 28, 2025, at 15:47, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The current approach in userland mixes both raw and half normalized components as well as RFC3986 and RFC3987 specification with ambiguity around normalization, input, constructor, what needs to be encoded where and when

Based on my research into existing URI projects <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md> I don’t think that’s an accurate assessment of the ecosystem.

For example, can you point out which projects mix “raw and half-normalized components”? Nette is the only one that comes to mind, in that (during parsing) it applies rawurldecode() to the host, user, password, and fragment; but that’s only one of the 18 projects.

Likewise, of the 15 URI-centric projects, only one of them (league/uri) offers both RFC3986 and 3987 parsing; the two IRI-centric projects (ml/iri and rmccue/requests) are explicitly IRIs; and rowbot is clearly WHATWG-URL centric. So I don’t see much ambiguity in any projects there.

As far as normalization, only one project (opis) affords the ability to normalize at creation time, though five of them offer a normalize() method with various effects (<https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#normalizing>). So, again, I don’t see much ambiguity there either; they don’t do normalizing as-you-go, it’s something you have to apply explicitly.

Regarding inputs, they all presume “raw” inputs. Regarding constructors, they mostly side with a full URI string. Regarding encoding, they mostly retain values in their encoded form (there are three outliers, cf. <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md#component-encoding>).

With all that in mind, we can see that the various authors of userland projects have settled on remarkably similar patterns of usage that they found valuable and useful for working with URIs.

fulfill existing userland expectations;

Existing userland expectations are mostly built around parse_url

That’s kind of true; 9 of the 18 projects use parse_url(), and 7/18 implement the RFC 3986 parsing algorithm …

which is one of the reasons the RFC exists to improve the status quo and to introduce in PHP valid parsers against recognizable URI specifications. Yes some adaptation will be needed to use them in userland but I believe this work is easy to do, talking from the POV of a URI package maintainer.

… but I don’t imagine that replacing parse_url() in those projects with the RFC 3986 algo would cause those projects to change any of their other design decisions. What adaptations do you think would be needed around that replacement?

replace the toString()/toRawString() with a single idiomatic __toString() in each class;

For all the reasons explained in the RFC, adding a __toString method is a bad architectural design for an URI. There are so many ways to represent an URI that having a __toString for string representation gives a false sense of “there can be only one true representation for a single URI” which is not true.

For Rfc3986\Uri, it looks like there are only two that are recognized: raw and normalized. Are there other string representations you feel the Uri class should recognize?

(For Whatwg\Url, it looks like there are also only two: as-parsed, and as ASCII, but I’m not addressing that part of the RFC here.)

move normalization logic into the NormalizedUri class.

The classes follow specifications that describe how normalization should be. Why would you split the responsibilities in other classes ? What would be the added value ?

For one, unless I am missing something, there is an asymmetry between the get() methods and the with() methods. What I’m seeing is that (e.g.) Uri::withPath() expects a raw path argument, but getPath() returns the normalized version. For symmetry, I would expect either:

Uri::withPath(raw_value) : self and Uri::getPath() : raw_value, or

Uri::withRawPath(raw_value) : self and Uri::getRawPath() : raw_value

Thus my first intuition that the “main” values in the URI need to be the raw ones, and that getting the normalized ones should be the more verbose case (e.g. getNormalizedPath() : normalized_value).
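The asymmetry described above can be sketched with a hypothetical `SketchUri` class. The class name and the toy normalization rule (decode percent-encoded unreserved characters, uppercase the remaining hex digits) are assumptions for illustration, not the RFC's implementation.

```php
<?php

// Hypothetical class for illustration; not the RFC's Uri\Rfc3986\Uri.
final class SketchUri
{
    public function __construct(private string $rawPath = '') {}

    // The wither stores the raw value exactly as given.
    public function withPath(string $rawPath): static
    {
        $clone = clone $this;
        $clone->rawPath = $rawPath;

        return $clone;
    }

    // Returns what was put in.
    public function getRawPath(): string
    {
        return $this->rawPath;
    }

    // Returns a normalized form, which may differ from what was put in.
    public function getPath(): string
    {
        return (string) preg_replace_callback(
            '/%([0-9a-f]{2})/i',
            static function (array $m): string {
                $char = chr((int) hexdec($m[1]));

                // Unreserved characters are decoded; anything else keeps its
                // percent-encoding with uppercase hex digits.
                return preg_match('/^[A-Za-z0-9._~-]$/', $char) === 1
                    ? $char
                    : '%' . strtoupper($m[1]);
            },
            $this->rawPath
        );
    }
}

$uri = (new SketchUri())->withPath('/%7Euser/a%2fb');

echo $uri->getRawPath(), PHP_EOL; // /%7Euser/a%2fb
echo $uri->getPath(), PHP_EOL;    // /~user/a%2Fb
```

A consumer that calls withPath() and then getPath() does not get back the value it supplied, which is the round-trip surprise at issue here.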

So, one value added by splitting the classes is to resolve that asymmetry. Consumers expecting to get back from the URI what they put into it can use the raw Uri variation; “API clients or signers fall in this category that want to avoid introducing any unnecessary changes to URIs, in order to avoid causing subtle bugs.”

Other consumers, who want to do things this new and different way (normalized as-you-go, unlike anything currently in userland) can use the NormalizedUri.

(Or you could flip it around and say that the normalized variation is the Uri class, and the raw version is RawUri.)

– pmj

nyamsprod_the_funky · April 30, 2025, 7:58am

Hi Máté and Tim

I read the following in the RFC

Withers of Uri\WhatWg\Url follow the relevant “setter steps” that are defined by WHATWG URL. Unfortunately, these algorithms sometimes have surprising behavior where modification fails silently, and the original values are kept. For example. Even though this RFC acknowledges the fact that the WHATWG URL “setter steps” have gotchas, it doesn’t try to prevent them - as doing so would be spec-incompliant.

Reading the WHATWG URL specification and checking how

Chrome,
Firefox
and even https://github.com/TRowbotham/URL-Parser

behave, I see that mutators either silently reject invalid input or normalize it. I was wondering whether it still makes sense to say that URL mutators can throw InvalidUrlException, since AFAIK only a TypeError could actually be thrown if the wrong input type is given. No specially crafted string can make the spec throw, unless I have overlooked something.

On Tue, Apr 29, 2025 at 8:55 PM Tim Düsterhus <tim@bastelstu.be> wrote:

Hi

On 4/29/25 10:54, ignace nyamagana butera wrote:

I have one last question while reviewing my polyfill implementation. Is it
worth adding a SensitiveParameter attribute on the argument of the
following methods?

Uri\Rfc3986\Uri::withUserInfo

Uri\WhatWg\Url::withPassword

I’m fine with any answer. Does it warrant a paragraph in the RFC? That I
do not know, but I feel the question may be raised.

Good catch. Since they may throw an exception for malformed inputs, they
should have the attribute. Especially since folks might try to use
special characters in passwords, which might need encoding.

No paragraph in the RFC needed, but the attribute should be added to the
“stub”.

Best regards
Tim Düsterhus