[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

What’s wrong with declaring all the methods as final, e.g. https://github.com/lnear-dev/ada-url/blob/main/ada_url.stub.php

On Mon, Feb 24, 2025, 7:00 a.m. Gina P. Banyard internals@gpb.moe wrote:

On Monday, 24 February 2025 at 11:08, Nicolas Grekas <nicolas.grekas+php@gmail.com> wrote:

I’m seeing a push to make the classes final. Please don’t!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.

The situation I’m talking about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there’s no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we’re going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.

By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define a native UrlInterface, but that’d increase complexity for little to no gain IMHO - although that’d solve my main concern).

The open/closed principle does not mean “open to inheritance”.
Just pulling in the Wikipedia definition: [1]

In object-oriented programming, the open–closed principle (OCP) states “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”;

You can extend a class by using a decorator or the delegation pattern.
But most importantly, I would like to focus on the “closed for modification” part of the principle.
Unless we make all the methods final, inheritance allows you to modify the behaviour of the methods, which is in opposition to the principle.
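For illustration, a rough sketch of extension via delegation rather than inheritance, assuming the RFC’s proposed Uri\WhatWg\Url class; the UrlInterface and LoggingUrl names are hypothetical userland code, not part of the RFC:

// Hypothetical userland interface; the RFC does not ship one.
interface UrlInterface
{
    public function getHost(): string;
    public function toString(): string;
}

// Decorator: adds behaviour by delegating to the native object,
// without modifying (or inheriting from) the built-in class.
final class LoggingUrl implements UrlInterface
{
    public function __construct(private readonly \Uri\WhatWg\Url $inner) {}

    public function getHost(): string
    {
        error_log('host accessed: ' . $this->inner->getHost());

        return $this->inner->getHost();
    }

    public function toString(): string
    {
        return $this->inner->toString();
    }
}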

Moreover, if you extend a WhatWg\Uri to behave differently to the WhatWg spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.

Quoting Dijkstra:

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where, when you are in possession of such a type, you know with absolute certainty how it behaves and what you can do with it, and know that if a consumer needs a WhatWg URI it will not reject it.
This also means consumers of said WhatWg\Uri type do not need to care about validation of it.

If one is able to extend a WhatWg URI, then none of the above applies, and you just have a raw string with fancy methods.

I.e. you are now vague, and any consumer of the type needs to do validation because it cannot trust the type, and you have created a useless abstraction.

It also seems you did not read the relevant “Why a common URI interface is not supported?” [2] section of the RFC.
The major reason why this RFC has had so many iterations and been in discussion for so long is because Máté tried, again and again, to have a common interface.
But this just does not make any sense, you cannot make something extremely concrete vague and abstract, unless you want to lose all the benefits of the abstraction.

Best regards,

Gina P. Banyard

[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle

[2] https://wiki.php.net/rfc/url_parsing_api#why_a_common_uri_interface_is_not_supported

On 24.02.2025 at 14:57, Marco Pivetta wrote:

The `DateTimeImmutable` type should've been `final` from the start: it is trivial to declare a userland interface, and then use the `DateTimeImmutable` type as an implementation detail of a userland- provided interface.

+1

On Mon, 24 Feb 2025 at 14:57, Gina P. Banyard internals@gpb.moe wrote:

On Monday, 24 February 2025 at 11:08, Nicolas Grekas <nicolas.grekas+php@gmail.com> wrote:

I’m seeing a push to make the classes final. Please don’t!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.

The situation I’m talking about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there’s no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we’re going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.

By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define a native UrlInterface, but that’d increase complexity for little to no gain IMHO - although that’d solve my main concern).

The open/closed principle does not mean “open to inheritance”.
Just pulling in the Wikipedia definition: [1]

In object-oriented programming, the open–closed principle (OCP) states “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”;

You can extend a class by using a decorator or the delegation pattern.

Yes.
You can do decoration with a non-final class (and no base interface), that’s my point.

But most importantly, I would like to focus on the “closed for modification” part of the principle.
Unless we make all the methods final, inheritance allows you to modify the behaviour of the methods, which is in opposition to the principle.

Moreover, if you extend a WhatWg\Uri to behave differently to the WhatWg spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.

Quoting Dijkstra:

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where, when you are in possession of such a type, you know with absolute certainty how it behaves and what you can do with it, and know that if a consumer needs a WhatWg URI it will not reject it.
This also means consumers of said WhatWg\Uri type do not need to care about validation of it.

If one is able to extend a WhatWg URI, then none of the above applies, and you just have a raw string with fancy methods.

I.e. you are now vague, and any consumer of the type needs to do validation because it cannot trust the type, and you have created a useless abstraction.

A couple of non-final Url classes would still be absolutely useful: e.g. as a consumer/callee, I would have stated very clearly that I need an object that behaves like native Url objects. Then, if the implementation doesn’t, that’s on the caller. The abstraction would do its job. I don’t think the extra guarantees you’re describing would be useful in practice (but you could still do an exact ::class comparison if you’d really want to).

It also seems you did not read the relevant “Why a common URI interface is not supported?” [2] section of the RFC.

This sentence comes to me as unnecessarily confrontational. I’d really like to keep this discussion as constructive as possible so that php-internals remains a welcoming space for everyone.

The major reason why this RFC has had so many iterations and been in discussion for so long is because Máté tried, again and again, to have a common interface.
But this just does not make any sense, you cannot make something extremely concrete vague and abstract, unless you want to lose all the benefits of the abstraction.

I was considering the alternative of providing TWO interfaces indeed. Sorry if that wasn’t clear enough.

Nicolas

Hi

On 2025-02-24 15:05, Hammed Ajao wrote:

What's wrong with declaring all the methods as final, e.g.
https://github.com/lnear-dev/ada-url/blob/main/ada_url.stub.php

It is not possible to construct a subclass in a generic fashion, because you don't know the constructor’s signature and you also don’t know if it added some properties with certain semantics. That means that the `with*()`ers are unable to return an instance of the subclass, leading to confusing behavior in cases like these:

     final class HttpUrl extends \Uri\Rfc3986\Uri {
         public function __construct(string $uri, public readonly bool $allowInsecure) {
             parent::__construct($uri);

             if ($this->getScheme() !== 'https') {
                 if ($allowInsecure) {
                    if ($this->getScheme() !== 'http') {
                        throw new ValueError('Scheme must be https or http');
                    }
                 } else {
                     throw new ValueError('Scheme must be https');
                 }
             }
         }
     }

     $httpUrl = (new HttpUrl('https://example.com', false))->withPath('/foo');
     get_class($httpUrl); // \Uri\Rfc3986\Uri

Best regards
Tim Düsterhus

Hi there,

On Feb 24, 2025, at 03:36, Tim Düsterhus <tim@bastelstu.be> wrote:

...

but had a look at the “after-action summary” thread and specifically Côme’s response, which you apparently agreed with:

My take on that is more that functionality in core needs to be «perfect», or at least near unanimous.

Côme Chilliet's full quote goes on to say "And of course it’s way easier to find a solution which pleases everyone when it’s for something quite simple" -- the example was of `str_contains()`.

Or perhaps phrased differently, like I did just a few days ago in the “Introduction - Sam Lewis” thread on Externals:

The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application.

*Correctly* processing URIs is a common need for developers and it’s complicated to do right, thus it qualifies as a “building block”.

Agreed. Add to that:

On Feb 23, 2025, at 18:48, Gina P. Banyard <internals@gpb.moe> wrote:

Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections do not apply here.

(The previous objections being that this ought to be left in userland.)

* * *

I'm repeatedly on record as saying that PHP, as a web-centric language, ought to have more web-centric objects available in core. A _Request_ would be one of those; a _Response_ another; and as being discussed here, a _Url_.

However, if it is true that ...

- "it’s way easier to find a solution which pleases everyone when it’s for something quite simple"

- "The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application."

- "one of the other stated goals of this RFC is to provide this API to other core extensions"

- "Parsing is the single most important operation to use with URIs where a URI string is decomposed into multiple components during the process." (from the RFC)

... then an extensive set of objects and exceptions is not strictly necessary.

Something like `function parse_url_whatwg(string $url_string, ?string $base_url = null) : array`, with an array of returned components, would meet all of those needs.

Similarly, something like a `function parse_url_rfc3986(string $uri_string, ?string $base_url = null) : array` does the same for RFC 3986 parsing.

Those things combined provide solid parsing functionality to userland, and make that parsing functionality available to other core extensions.
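To illustrate, a hypothetical call; the function and the exact array shape are only a sketch of the suggestion above, not an existing API:

// Hypothetical function; the key names are illustrative only.
$components = parse_url_whatwg('https://user@example.com:8080/path?q=1#frag');

/*
Possible shape of the result:
[
    'scheme'   => 'https',
    'user'     => 'user',
    'host'     => 'example.com',
    'port'     => 8080,
    'path'     => '/path',
    'query'    => 'q=1',
    'fragment' => 'frag',
]
*/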

-- pmj

On 25/02/2025 13:36, Paul M. Jones wrote:

Hi there,

On Feb 24, 2025, at 03:36, Tim Düsterhus <tim@bastelstu.be> wrote:

...

but had a look at the “after-action summary” thread and specifically Côme’s response, which you apparently agreed with:

My take on that is more that functionality in core needs to be «perfect», or at least near unanimous.

Côme Chilliet's full quote goes on to say "And of course it’s way easier to find a solution which pleases everyone when it’s for something quite simple" -- the example was of `str_contains()`.

Or perhaps phrased differently, like I did just a few days ago in the “Introduction - Sam Lewis” thread on Externals:

The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application.

*Correctly* processing URIs is a common need for developers and it’s complicated to do right, thus it qualifies as a “building block”.

Agreed. Add to that:

On Feb 23, 2025, at 18:48, Gina P. Banyard <internals@gpb.moe> wrote:

Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections do not apply here.

(The previous objections being that this ought to be left in userland.)

* * *

I'm repeatedly on record as saying that PHP, as a web-centric language, ought to have more web-centric objects available in core. A _Request_ would be one of those; a _Response_ another; and as being discussed here, a _Url_.

However, if it is true that ...

- "it’s way easier to find a solution which pleases everyone when it’s for something quite simple"

- "The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application."

- "one of the other stated goals of this RFC is to provide this API to other core extensions"

- "Parsing is the single most important operation to use with URIs where a URI string is decomposed into multiple components during the process." (from the RFC)

... then an extensive set of objects and exceptions is not strictly necessary.

Something like `function parse_url_whatwg(string $url_string, ?string $base_url = null) : array`, with an array of returned components, would meet all of those needs.

Similarly, something like a `function parse_url_rfc3986(string $uri_string, ?string $base_url = null) : array` does the same for RFC 3986 parsing.

Those things combined provide solid parsing functionality to userland, and make that parsing functionality available to other core extensions.

-- pmj

Hi Paul,

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

Also keep in mind that URL construction may also differ between specifications, so instead of just 2 functions you may end up with 4, not counting error handling. So using an OOP approach, while more complex, is IMHO the better approach.
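As a hedged illustration of that point, using the getter naming from the RFC (the exact output is my assumption, not verified against the implementation):

$uri = new \Uri\Rfc3986\Uri('HTTPS://EXAMPLE.COM/%7Euser');

echo $uri->getRawHost(); // "EXAMPLE.COM" - the component exactly as written
echo $uri->getHost();    // "example.com" - normalized
echo $uri->getRawPath(); // "/%7Euser"
echo $uri->getPath();    // "/~user"     - unreserved characters are percent-decoded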

On 24/02/2025 12:08, Nicolas Grekas wrote:

Hi,

Thanks for all the efforts making this RFC happen, it'll be a game changer in the domain!

I'm seeing a push to make the classes final. Please don't!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.

The situation I'm talking about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there's no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we're going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.

By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define a native UrlInterface, but that'd increase complexity for little to no gain IMHO - although that'd solve my main concern).

     > 5 - Can the returned array from __debugInfo be used in a "normal"
     > method like `toComponents` naming can be changed/improved to ease
     > migration from parse_url or is this left for userland library ?

    I would prefer not expose this functionality for the same reason that
    there are no raw properties provided: The user must make an explicit
    choice whether they are interested in the raw or in the normalized
    version of the individual components.

The RFC is also missing whether __debugInfo returns raw or non-raw components. Then, I'm wondering if we need this per-component breakdown for debugging at all? It might be less confusing (on this encoding aspect) to dump basically what __serialize() returns (under another key than __uri of course).
This would also close the avenue of calling __debugInfo() directly (at the cost of making it possibly harder to move away from parse_url(), but I don't think we need to make this simpler - getting familiar with the new API first would be required, and welcome actually).

    It can make sense to normalize a hostname, but not the path. My usual
    example against normalizing the path is that SAML signs the *encoded*
    URI instead of the payload and changing the case in percent-encoded
    characters is sufficient to break the signature

I would be careful with this argument: signature validation should be done on raw bytes. Requiring an object to preserve byte-level accuracy while the very purpose of OOP is to provide abstractions might be conflicting. The signing topic can be solved by keeping the raw signed payload around.

Hi Nicolas,

> > 5 - Can the returned array from __debugInfo be used in a "normal"
> > method like `toComponents` naming can be changed/improved to ease
> > migration from parse_url or is this left for userland library ?
>
> I would prefer not expose this functionality for the same reason that
> there are no raw properties provided: The user must make an explicit
> choice whether they are interested in the raw or in the normalized
> version of the individual components.

I only mention this because I saw the debugInfo method being implemented. TBH I would be more in favor of removing the method altogether; I fail to see the added value of such a method, unless we want to hide the class's internal property, in which case it should then "just" show the raw URL and nothing more.

On Feb 25, 2025, at 09:55, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url() does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).

All of this is to say that the parsing functionality does not have to be in an object to be useful *both* to the internal API *and* to userland.

Recall that I'm responding at least in part to the comment that "Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections [to the Request/Response objects going into core] do not apply here." If the only reason they don't apply is that the core extensions need a parsing API, that reason becomes obviated by using just functions for the parsing elements.

Unless I'm missing something; happy to hear what that might be.

-- pmj

Hi,

On Thu, 27 Feb 2025, 20:55 Paul M. Jones, <pmjones@pmjones.io> wrote:

On Feb 25, 2025, at 09:55, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url() does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).

All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.

It most definitely needs to be an object. Arrays are awful DX-wise; there are array shapes, which modern IDEs like PhpStorm and static analysis tools support, but the overall experience remains subpar compared to classes (and objects).

Recall that I’m responding at least in part to the comment that “Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections [to the Request/Response objects going into core] do not apply here.” If the only reason they don’t apply is that the core extensions need a parsing API, that reason becomes obviated by using just functions for the parsing elements.

Unless I’m missing something; happy to hear what that might be.

– pmj

Imho Request and Response objects do belong in core, but with a very good api, something which would replace http foundation/PSR7 altogether.

On Thu, Feb 27, 2025, at 22:01, Faizan Akram Dar wrote:

Hi,

On Thu, 27 Feb 2025, 20:55 Paul M. Jones, <pmjones@pmjones.io> wrote:

On Feb 25, 2025, at 09:55, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url() does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).

All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.

It most definitely needs to be an object. Arrays are awful DX-wise; there are array shapes, which modern IDEs like PhpStorm and static analysis tools support, but the overall experience remains subpar compared to classes (and objects).

I’m curious why you say this other than an opinion about developer experience? Arrays are values, objects are not. A parsed uri seems more like a value and less like an object. Just reading through the comments so far, it appears that whatever is used will just be wrapped in library code regardless, for userland code, but the objective is to be useful for other extensions and core code. In that case, a hashmap is much easier to work with than a class.

Looking at the objectives of the RFC and the comments here, it almost sounds like it is begging to be a simple array instead of an object.

— Rob

On Fri, Feb 28, 2025 at 12:05 AM Rob Landers rob@bottled.codes wrote:

On Thu, Feb 27, 2025, at 22:01, Faizan Akram Dar wrote:

Hi,

On Thu, 27 Feb 2025, 20:55 Paul M. Jones, <pmjones@pmjones.io> wrote:

On Feb 25, 2025, at 09:55, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url() does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).

All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.

It most definitely needs to be an object. Arrays are awful DX-wise; there are array shapes, which modern IDEs like PhpStorm and static analysis tools support, but the overall experience remains subpar compared to classes (and objects).

I’m curious why you say this other than an opinion about developer experience? Arrays are values, objects are not. A parsed uri seems more like a value and less like an object. Just reading through the comments so far, it appears that whatever is used will just be wrapped in library code regardless, for userland code, but the objective is to be useful for other extensions and core code. In that case, a hashmap is much easier to work with than a class.

Looking at the objectives of the RFC and the comments here, it almost sounds like it is begging to be a simple array instead of an object.

— Rob

It depends on whether the intention is to have it as a parameter type. If it’s designed to be passed around to functions I really don’t want it to be an array. I am maintaining a legacy codebase where arrays are being used as hashmaps pretty much everywhere, and it’s error prone. We lose all kinds of features like “find usages” and refactoring key/property names. Silly typos in array keys with no actual validation of any kind cause null values and annoying-to-find bugs.

I agree that hashmaps can be really easy to use, but not as data structures outside of the function/method scope they were defined in. If value vs object semantics are important here, then something that is forward compatible with whatever structs may hold in the future could be interesting.
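A tiny example of the failure mode I mean, versus a typed object (assuming the RFC's proposed Url class; the misspelled key is deliberate):

// Array result: a typo in a key silently yields null and the bug ships.
$parts = ['scheme' => 'https', 'host' => 'example.com'];
$host  = $parts['hsot'] ?? null; // null, no warning thanks to the null-coalescing operator

// Typed object: the same mistake fails loudly and is refactorable.
$url = \Uri\WhatWg\Url::parse('https://example.com', null);
$url->getHsot(); // Error: Call to undefined method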

On Fri, Feb 28, 2025, at 09:38, Lynn wrote:

On Fri, Feb 28, 2025 at 12:05 AM Rob Landers rob@bottled.codes wrote:

On Thu, Feb 27, 2025, at 22:01, Faizan Akram Dar wrote:

Hi,

On Thu, 27 Feb 2025, 20:55 Paul M. Jones, <pmjones@pmjones.io> wrote:

On Feb 25, 2025, at 09:55, ignace nyamagana butera <nyamsprod@gmail.com> wrote:

The problem with your suggestion is that the specifications from WHATWG and RFC 3986/3987 are so different that the function you are proposing won't be able to cover the outcome correctly (i.e. give the developer all the needed information). This is why, for instance, Máté added the getRaw* methods alongside the normalized getters (methods without the Raw prefix).

The two functions need not return an identical array of components; e.g., the 3986 parsing function might return an array much like parse_url() does now, and the WHATWG function might return a completely different array of components (one that includes the normalized and/or raw components).

All of this is to say that the parsing functionality does not have to be in an object to be useful both to the internal API and to userland.

It most definitely needs to be an object. Arrays are awful DX-wise; there are array shapes, which modern IDEs like PhpStorm and static analysis tools support, but the overall experience remains subpar compared to classes (and objects).

I’m curious why you say this other than an opinion about developer experience? Arrays are values, objects are not. A parsed uri seems more like a value and less like an object. Just reading through the comments so far, it appears that whatever is used will just be wrapped in library code regardless, for userland code, but the objective is to be useful for other extensions and core code. In that case, a hashmap is much easier to work with than a class.

Looking at the objectives of the RFC and the comments here, it almost sounds like it is begging to be a simple array instead of an object.

— Rob

It depends on whether the intention is to have it as a parameter type. If it’s designed to be passed around to functions I really don’t want it to be an array. I am maintaining a legacy codebase where arrays are being used as hashmaps pretty much everywhere, and it’s error prone. We lose all kinds of features like “find usages” and refactoring key/property names. Silly typos in array keys with no actual validation of any kind cause null values and annoying-to-find bugs.

I agree that hashmaps can be really easy to use, but not as data structures outside of the function/method scope they were defined in. If value vs object semantics are important here, then something that is forward compatible with whatever structs may hold in the future could be interesting.

I meant hashmaps from within C, not within PHP. If it is just going to be wrapped in userland libraries as people seem to be suggesting in this thread, then you only have to get it right once, and it is easy to use from C.

— Rob

Hi Tim,

Thank you again for the thorough review!

The naming of these methods seems to be a little inconsistent. It should
either be:

->getHostForDisplay()
->toStringForDisplay()

or

->getDisplayHost()
->toDisplayString()

but not a mix between both of them.

Yes, I completely agree with your concern. I’m just not sure yet which combination I’d prefer.
Probably the latter one?

Yes. Besides the remark above, my previous arguments still apply (e.g.
with()ers not being able to construct instances for subclasses,
requiring to override all of them). I’m also noticing that serialization
is unsafe with subclasses that add a $__uri property (or perhaps any
property at all?).

Hm, yes, you are right indeed that withers cannot really create new instances on
their own because the whole URI string is needed to instantiate a new object… which is only
accessible if it’s reconstructed by swapping the relevant component with its new value.

Please note that trying to serialize a $__uri property will result in an exception.

The toDisplayString() method that you mentioned above is not in the
RFC. Did you mean toHumanFriendlyString()? Which one is correct?

The toHumanFriendlyString() method stuck there from a previous version of the proposal,
since then I converted it to toDisplayString().

The example output of the $errors array does not match the stub. It
contains a failure property, should that be softError instead?

The $softError property is also an outdated name: I recently changed it to $failure
to be consistent with the wording that the WHATWG specification uses.

The RFC states “When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results in
a failure.”

What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain? Is it
perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since $errors is not applicable for 3986?

The first two questions are answered right at the top of the parsing section:

“the constructor: It expects a URI, and optionally, a base URL in order to support reference resolution.
When parsing is unsuccessful, a Uri\InvalidUriException is thrown.”

The $errors property will contain an empty array though, as you supposed. I don’t see much problem
with using the same exception in both cases, however I’m also fine with making the $errors property
nullable in order to indicate that returning errors is not supported by the implementation triggering
the error.

The RFC does not specify when UninitializedUriException is thrown.

That’s a very good catch! I completely forgot about some exceptions. This one is used
for indicating that a URI is not correctly initialized: when a URI instance is created
without actually invoking the constructor, or the parse method, or __unserialize(),
then any methods that try to use the internally stored URI will trigger this exception.

The RFC does not specify when UriOperationException is thrown.

Generally speaking I believe it would help understanding if you would
add a /** @throws InvalidUriException */ to each of the methods in the
stub to make it clear which ones are able to throw (e.g. resolve(), or
the withers). It’s harder to find this out from “English” rather than
“code” :-)

Good idea! I’ve added the PHPDoc as well as created a dedicated “Exceptions”
section.

In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it’s important
to know if %26 is decoded to & in a query-string. Or if %3D is
decoded to =. This really is the same case as with %2F in a path.
The explanation

Thanks for calling these cases out, I’ve significantly reworked the relevant sections.
First of all, I added many more details to the general overview about percent-encoding:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding as well as extended
the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section with more information
about the two component representations, and added a general clarification related to reserved
characters. Additionally, the https://wiki.php.net/rfc/url_parsing_api#component_modification section
makes it clear how percent-encoding is performed when the withers are used.

After thinking about the question a lot, finally the current encoding-decoding rules seem
logical to me, but please double-check them. It’s easy to misinterpret such long and complex
specifications.

Long story short: when parsing a URI or modifying a component, RFC 3986 fails hard if an invalid character is found, while the WHATWG implementation automatically percent-encodes it, also triggering a soft error.

While retrieving the “normalized-decoded” representation of a URI component, percent-decoding is
performed when possible:

  • in case of RFC3986: reserved and invalid characters are not percent-decoded (only unreserved ones are)
  • in case of WHATWG: invalid characters and characters with special meaning (that fall into the percent-encode set
    of the given component) are not percent-decoded

The relevant sections will give a little more reasoning why I went with these rules.
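A rough illustration of these rules, assuming the proposed classes (the outputs are my reading of the rules above, so treat them as examples rather than guarantees):

// RFC 3986: an invalid character (the space) is a hard error.
try {
    new \Uri\Rfc3986\Uri('https://example.com/a b');
} catch (\Uri\InvalidUriException $e) {
    // parsing failed
}

// WHATWG: the space is automatically percent-encoded, with a soft error recorded.
$url = \Uri\WhatWg\Url::parse('https://example.com/a b', null);
echo $url->toString(); // assumed "https://example.com/a%20b"

// Retrieval: characters that keep a special meaning in the component are assumed
// to stay encoded, so an encoded "/" inside a path segment is not decoded.
$url = \Uri\WhatWg\Url::parse('https://example.com/a%2Fb', null);
echo $url->getPath(); // assumed "/a%2Fb"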

“the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components.”

alone is not clear to me. “reserved characters that are not allowed in a
component”. I assume this means that %2F (/) in a path will not be
decoded, but %3F (?) will, because a bare ? can’t appear in a path?

I hope that this question is also clear after my clarifications + the reconsidered logic.

In the “Component retrieval” section: You compare the behavior of
WhatWgUrl and Rfc3986Uri. It would be useful to add something like:

$url->getRawScheme() // does not exist, because WhatWgUrl always
normalizes the scheme

Done.

to better point out the differences between the two APIs with regard to
normalization (it’s mentioned, but having it in the code blocks would
make it more visible).

Done.

In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode ? and # as necessary. Will the same happen
for Rfc3986? Will the encoding of # also happen for the query-string
component? The RFC only mentions the path component.

The above referenced sections will give a clear answer for this question as well.
TLDR: after your message, I realized that automatic percent-encoding also triggers a (soft) error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri: it won’t do any automatic percent-encoding. It’s unfortunate, because this behavior is not consistent with WHATWG, but it’s more consistent with the parsing rules of its own specification, where there are only hard errors and there’s no such thing as “automatic correction”.

I’m also wondering if there are cases where the withers would not
round-trip, i.e. where $url->withPath($url->getPath()) would not
result in the original URL?

I am currently not aware of any such situation… I even wrote about this aspect at some length, because I think “roundtripability” is a very important attribute. Thank you for raising awareness of this!
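For what it’s worth, the kind of check in question would look like this (assuming the proposed API; whether it always holds is exactly the open question):

$url = \Uri\WhatWg\Url::parse('https://example.com/foo%20bar?x=1', null);

// Does re-setting a component from its own getter reproduce the original URL?
var_dump($url->withPath($url->getPath())->toString() === $url->toString());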

Can you add examples where the authority / host contains IPv6 literals?
It would be useful to specifically show whether or not the square
brackets are returned when using the getters. It would also be
interesting to see whether or not IPv6 addresses are normalized (e.g.
shortening 2001:db8:0:0:0:0:0:1 to 2001:db8::1).

Good idea again! I’ve added an example containing an IPv6 host at the very end of the component retrieval section. And yes, they will be enclosed within a pair of brackets as per the spec.

It also surprised me, but IP address normalization is only performed by WHATWG
during recomposition! But nowhere else…
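As a sketch of what such an example might look like (bracket handling follows the answer above; the exact outputs are assumptions):

$url = \Uri\WhatWg\Url::parse('https://[2001:db8:0:0:0:0:0:1]:8080/', null);

echo $url->getHost();  // assumed "[2001:db8:0:0:0:0:0:1]" - brackets kept, address untouched
echo $url->toString(); // assumed "https://[2001:db8::1]:8080/" - shortened only during recomposition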

In “Component Recomposition” the RFC states “The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string”.

Does this mean that toString() for Rfc3986 will always return the
original input?

Yes, effectively that’s the case; only WHATWG modifies the input, to my knowledge. In the past, I had the impression that RFC 3986 also made a few changes, but then I realized that was not the case after digging deep into the code of uriparser.

It would be useful to know whether or not the classes implement
__debugInfo() / how they appear when var_dump()ing them.

I’ve added an example.

That’s all I managed to write for now, but I’ll try to answer the rest of the messages and feedback
as soon as possible. :-)

Regards,
Máté

On Feb 16, 2025, at 3:01 PM, Máté Kocsis kocsismate90@gmail.com wrote:

Hi Dennis,

I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.

I think Ignace’s examples already highlighted that the two specifications differ in nuances so much that even I had to admit after months of trying to squeeze them into the same interface that doing so would be irresponsible.
The Uri\Rfc3986\Uri will be useful for many use-cases (e.g. representing URNs or URIs with scheme-specific behavior - like ldap apparently), but even the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other URLs for the web (e.g. an HTTP application router component should use this class).

Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?

I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding. The other main difference is how the host component is stored: WHATWG automatically percent-decodes it, while RFC3986 doesn’t. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit below).

This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that https://example.com does not replace the actual host part if one is provided in $url. For example, this code should work.

$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'

Yes. it’s the case. Both classes only use the base URL for relative URIs.

Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. “xn--google.com” is not in fact a Google domain, but an IDNA domain decoding to “䕮䕵䕶䕱.com”.

I got your point, so I implemented your suggestion. Actually, I made yet another larger API change in the meanwhile, but in any case, the WHATWG implementation now supports IDNA the following way:

$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);

echo $url->getHost();                // xn--go8h.com
echo $url->getHostForDisplay();      // 🐘.com
echo $url->toString();               // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString();        // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

Unfortunately, RFC3986 doesn’t support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC3987 (and therefore IRIs) would be a very heavy amount of work; it’s just not feasible within this RFC. :-( To make things worse, its code would have to be written from scratch, since I haven’t found any suitable C library yet for this purpose. That’s why I’ll leave them for later.

On other notes, let me share some of the changes since my previous message to the mailing list:

  • First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud’s feedback. Now, both the normalized (and decoded), as well as the non-normalized representation can equally be retrieved from the same URI instance. This was necessary to change in order for users to be able to consistently use URIs. Now, if someone needs an exact URI component value, they can use the getRaw*() getter. If they want the normalized and percent-decoded form then a get*() getter should be used. For more information, the

https://wiki.php.net/rfc/url_parsing_api#component_retrieval section should be consulted.

This seems like a good change.

  • I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability) etc.

Love this.

  • I fixed quite a few smaller details of the implementation along with a very important spec incompatibility: until now, the “path” component didn’t contain the leading “/” character when it should have. Now, both classes conform to their respective specifications with regards to path handling.

This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set path methods and the issue of the / and %2F.

  • If we exposed getPathIterator() or getPathSegments() could we not report these in their fully-decoded forms? That is, because the path segments are separated by some invocation or array element, they could be decoded?
  • Probably more valuably, if withPath() accepted an array, could we not allow fully non-escaped PHP strings as path segments which the URL class could safely and by-default handle the escaping for the caller?

Right now, if someone haphazardly joins path segments in order to set withPath() they will likely be unaware of that nuance and get the path wrong. On the grand scale of things, I suspect this is a really minor risk. However, if they could send in an array then they would never need to be aware of that nuance in order to provide a fully-reliable URL, up to the class rejecting path segments which cannot be represented.
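A hypothetical sketch of what I have in mind; none of these signatures are in the RFC:

$url = \Uri\WhatWg\Url::parse('https://example.com/a%2Fb/c', null);

// Hypothetical: segments come back fully decoded, so the encoded "/" is unambiguous.
// $url->getPathSegments() === ['a/b', 'c'];

// Hypothetical: the class escapes each segment itself, so callers never need to
// know about the "/" vs "%2F" nuance when building a path.
// $url->withPath(['a/b', 'c'])->toString() === 'https://example.com/a%2Fb/c';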

I think the RFC is now mature enough to consider voting in the foreseeable future, since most of the concerns which came up until now are addressed some way or another. However, the only remaining question that I still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes should be final? Personally, I don’t see much problem with opening them for extension (other than some technical challenges that I already shared a few months ago), and I think people will have legitimate use cases for extending these classes. On the other hand, having final classes may allow us to make slightly more significant changes without BC concerns until we have a more battle-tested API, and of course completely eliminate the need to overcome the said technical challenges. According to Tim, it may also result in safer code because spec-compliant base classes cannot be extended by possibly non-spec compliant/buggy children. I don’t necessarily fully agree with this specific concern, but here it is.

I’ve done another fresh and full review of the RFC and I just want to share my appreciation for how well-written it seems, and how meticulously you have taken everyone’s feedback and incorporated it. It seems mature enough to me as well, and I think it’s in a good place. Still, here are some additional thoughts (and a previous one again) related to some aspects, mostly naming.

The HTML5 library has ::createFromString() instead of parse(). Did you consider following this form? It doesn’t seem that important, but could be a nice improvement in consistency among the newer spec-compliant APIs. Further, I think createFromString() is a little more obvious in intent, as parse() is so generic.

Given the issues around equivalence, what about isEquivalent() instead of equals()? In the RFC I think you have been careful to use the “equivalence” terminology, but then in the actual interface we fall back to equals() and lose some of the nuance.

Something about not implementing getRawScheme() and friends in the WHATWG class seems off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw untranslated components, particularly since the “raw” part of the name already suggests some kind of danger or risk in using it as some semantic piece.

Tim brought up the naming of getHost() and getHostForDisplay() as well as the correspondence with the toString() methods. I’m not sure if it was overlooked or I missed the followup, but I wonder what your thoughts are on passing an enum to these methods indicating the rendering context. Here’s why: I see developers reach for the first method that looks right. In this case, that would almost always be getHost(), yet getHost() or toString() or whatever is going to be inappropriate in many common cases. I see two ways of baking education into the API surface: creating two symmetric methods (e.g. getDisplayableHost() and getNonDisplayableHost()); or requiring an enum forcing the choice (e.g. getHost( ForDisplay | ForNonDisplay )). In the case of an enum this could be equally applied across all of the relevant methods where such a distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the other hand it can also be seen as a safeguard against an extremely-common foot-gun, making such an easy oversight impossible.
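A sketch of the enum-based variant (the enum and the changed signature are purely illustrative, not part of the RFC):

// Hypothetical rendering-context enum.
enum HostContext
{
    case ForDisplay;    // human-readable form, e.g. Unicode IDN
    case ForNonDisplay; // machine form, e.g. punycode
}

// Hypothetical signature that forces the caller to make the choice:
// public function getHost(HostContext $context): string;
//
// $url->getHost(HostContext::ForDisplay);    // "🐘.com"
// $url->getHost(HostContext::ForNonDisplay); // "xn--go8h.com"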

Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that browsers used to decode percent-encodings in the fragment but they all stopped, and this was removed from the WHATWG HTML spec (“no-percent-escaping”). The RFC currently shows getFragment() decoding percent-encoded fragments. However, I believe that the WHATWG URL spec only indicates percent-encoding when setting the fragment. You can test this in a browser with the following example; Chrome, Firefox, and Safari exhibit the same behavior.

u = new URL(window.location)
u.hash = 'one and two';
u.hash === '#one%20and%20two';
u.toString() === '….#one%20and%20two';

So I think it may be more accurate and consistent to handle Whatwg\Url::getFragment in the same way as getScheme(). When setting a fragment we should percent-encode the appropriate characters, but when reading it, we should never interpret those characters — it should always return the “raw” value of the fragment.

Once again, thank you for the great work you’ve put into this. I’m so excited to have it. All my comments should be understood exclusively within the WHATWG domain as I don’t have the same experience with the RFC3986 side.

Dennis Snell

Regards,
Máté

Hi Gina,

The paragraph at the beginning of the RFC in the “Relevant URI specifications > WHATWG URL” section seems to be incomplete.

Hopefully it’s good now. Although I know this section doesn’t include much information.

I don’t really understand how the UninitializedUriException exception can be thrown?
Is it somehow possible to create an instance of a URI without initializing it?
This seems unwise in general.

I think I’ve already answered this since then in my previous email (and in the RFC as well), but yes, it’s possible via reflection.
I don’t really have an idea how this possibility could be avoided without also making the classes final.

I’m not really convinced by using the constructor to be able to create a URI object.
I think it would be better for it to be private/throwing and have two static constructor parse and tryParse,
mimicking the API that exists for creating an instance of a backed enum from a scalar.

I’m not completely against using parse() and tryParse(), but I think the constructor already makes it clear that it either returns
a valid object or throws.
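The two styles next to each other, assuming the proposed API (tryParse() is the suggested addition and is not currently in the RFC):

// Constructor: either a valid object or an exception.
try {
    $url = new \Uri\WhatWg\Url('certainly not a url');
} catch (\Uri\InvalidUriException $e) {
    // handle the failure
}

// Hypothetical tryParse(), mirroring BackedEnum::tryFrom(): null instead of an exception.
// $url = \Uri\WhatWg\Url::tryParse('certainly not a url') ?? $fallback;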

I think changing the name of the toString method to toRawString better matches the rest of the proposed API,
and also removes the question as to why it isn’t the magic method __toString.

For RFC 3986, we could go with toString() instead of toNormalizedString() and toRawString() instead of toString() so that we use
the same convention as for getters.

Recently I learnt that for some reason WHATWG normalizes the IP address during component recomposition, so its toString() is not really the most raw (at least not in the same way as the “raw getters” are). So for WHATWG, I think keeping toString() and toDisplayString() probably still makes sense.

I will echo Tim’s concerns about the non-final-ity of the URI classes.
This seems like a recipe for disaster.
I can maybe see the usefulness of extending Rfc3986\Uri by a subclass Ldap\Uri,
but being able to extend the WhatWg URI makes absolutely no sense.
The point of these classes is that if you have an instance of one of these, you know that you have a valid URI.
Being able to subclass a URI and mess with the equals, toString, toNormalizedString methods throws away all the safety guarantees provided by possessing a Uri instance.

I’m sure that people will find their use-cases to subclass all these new classes, including the WHATWG implementation. As Nicolas mentioned,
his main use-case is adding convenience and new factory methods that don’t specifically need all methods to be reimplemented.

While I share your opinion that leaving the URI classes open for extension is somewhat risky and it’s difficult to assess its impacts right now, I can also
sympathise with what Nicolas wrote in a later message (https://externals.io/message/123997#126489): that we shouldn’t close the door on userland
using interchangeable implementations.

I know that going final without any interfaces is the most “convenient” for the PHP project itself, because the solution has much less BC surface to maintain,
so we are relatively free and safe to make future changes. This is useful for an API in its early days that is huge like this. Besides the interests of the maintainers,
we should also take two important things into account:

  • Heterogeneous use-cases: it’s beyond question that the current API won’t fit all use-cases, especially because we have already identified some followup tasks
    that should be implemented (see the “Future Scope” section in the RFC).
  • Interoperability: Since URI handling is a very widespread problem, many people and libraries will start to use the new extension once it’s available. But because
    of the above reason, many of them will want to use their own abstraction, and that’s exactly why a common ground is needed: there’s simply no single right
    implementation - everyone has their own, given the complexity of the topic.

So we should try to take these factors into account one way or another. So far, we have four options:

  • Making the classes open for extension: this solution has acknowledged technical challenges (https://github.com/php/php-src/pull/14461#discussion_r1847316607),
    and it limits our possibilities of making future changes the most, but users can effectively add any behavior that they need. Of course, they are free to introduce bugs and
    spec-incompatible behavior into their own implementation, but none of the other solutions could prevent such bugs either, since people will write their custom code
    wherever they can: if they can’t have it in a child class, then they will have it in MyUri, or in UriHelper, or just in a 200-line-long function.

Being able to extend the built-in classes also means that child classes can use the behavior of their parent by default - there’s no need to create wrapper
classes around the built-in ones (aka using composition), which is a tedious task to implement and would also incur some performance penalty because of the
extra method calls.

  • Making the classes open for extension, but making some methods final: same benefits as above, without the said technical challenges - in theory. I am currently
    trying to figure out if there is a combination of methods that could be made final so that the known challenges become impossible to trigger - although I haven’t
    managed to come up with a sensible solution yet.

  • Making the classes final: It avoids some edge-cases for the built-in classes (the uninitialized state most prominently), while it leaves the most room for making future
    changes. Projects that may want to ship their own abstractions for the two built-in classes can use composition to create their own URI implementations.
    They can instantiate these implementations however they want to (e.g. $myUri = new MyUri($uri)). If they need to pass a URI to other libraries then they could extract
    the wrapped built-in class (e.g. $myUri->getUri()); see the sketch after this list.

On the flipside, backporting methods added in future PHP versions (aka polyfills) will become impossible to implement for URIs according to my knowledge, and mocking
in PHPUnit will also be a lost feature (I’m not sure if it’s a good or a bad thing, but it may be worth pointing out).

Also, the current built-in implementations may have alternative implementations that couldn’t be used instead of them. For example, the ADA URL library (which is mentioned
in the RFC) also implements the WHATWG specification - possibly the very same way as Lexbor (the currently used library) does.
different performance characteristics, platform requirements, or level of maintenance/support, which may qualify them as more suitable for some use-cases than what the built-in
ones can offer. If we make these classes final, there’s no way to use alternative implementations as a replacement for the default ones, although they all implement the same
specification having mostly clear semantics.

  • Making the classes final, but adding a separate interface for each: The impact of making the built-in classes final would be mitigated by adding one interface
    for each specification (I didn’t like this idea in the past, but it now looks much more useful in light of the final vs. non-final debate). Because of the interfaces,
    there would be a common denominator for the different possible implementations. I’m sure that someone would suggest that the community (aka PHP-FIG)
    should come up with such an interface, but I think we shouldn’t expect someone else to do the work when we are in the best position to do it, as those interfaces
    should be internal ones, since the built-in URI classes should also implement them.

If we had these interfaces, projects could use whatever abstraction they want via composition, while still being able to conveniently pass around the same object everywhere.
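
To make the last two options a bit more concrete, here is a rough sketch: the WhatWgUrlInterface and AppUrl names are hypothetical (under the interface option, such an interface would ship with the extension itself), and it assumes the built-in Uri\WhatWg\Url class exposes toAsciiString() as described in the RFC.

<?php

namespace App;

use Uri\WhatWg\Url;

// Hypothetical per-specification interface (the fourth option); not part of the RFC.
interface WhatWgUrlInterface
{
    public function toAsciiString(): string;
}

// Userland abstraction built on top of the final built-in class via
// composition (the third option); under the fourth option it can also
// implement the interface above.
final class AppUrl implements WhatWgUrlInterface
{
    public function __construct(private Url $url) {}

    // Extract the wrapped built-in instance when another library expects it.
    public function getUrl(): Url
    {
        return $this->url;
    }

    public function toAsciiString(): string
    {
        // Delegation instead of inheritance.
        return $this->url->toAsciiString();
    }
}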

I intentionally don’t try to draw a conclusion for now, first of all because it already took me a lot of time to compare the different possibilities as objectively as I could, and
I hope that we can find more pros and cons (or correct my reasoning where I made mistakes) in order to finally reach some kind of consensus.

Similarly, I don’t understand why the WhatWgError is not final.
Even if subclassing of the Uri classes is allowed, any error it would have would not be a WhatWg one,
so why should you be able to extend it?

I made it final now.

Thank you for your comments!
Máté

Hi Juris and Tim,

On 2025-02-23 18:47, Juris Evertovskis wrote:

As those are URI validation errors, maybe something like
Uri\WhatWg\ValidationError would be both less clashy and less
redundant?

I like that suggestion.

Best regards
Tim Düsterhus

I liked it as well, so I changed the related classes the following way:

  • Uri\WhatWg\WhatWgError became Uri\WhatWg\UrlValidationError
  • Uri\WhatWg\WhatWgErrorType became Uri\WhatWg\UrlValidationErrorType

This way, WhatWg is not duplicated in the FQCN, but the class name is still specific enough that it’s unlikely to clash with anything else.
I could also imagine removing the Url prefix, but I like it, since it highlights that it’s related to WHATWG URLs.

Regards,
Máté

On Mon, Mar 10, 2025, at 5:51 PM, Máté Kocsis wrote:

I'm sure that people will find use-cases for subclassing all these new classes, including the WHATWG implementation. As Nicolas mentioned, his main use-case is mainly adding convenience and new factory methods that don't specifically need all methods to be reimplemented.

While I share your opinion that leaving the URI classes open for extension is somewhat risky and it's difficult to assess its impacts right now, I can also sympathise with what Nicolas wrote in a later message ([RFC] [Discussion] Add WHATWG compliant URL parsing API - Externals): that we shouldn't close the door on the public using interchangeable implementations.

I know that going final without any interfaces is the most "convenient" for the PHP project itself, because the solution has much less BC surface to maintain, so we are relatively free and safe to make future changes. This is especially useful for an API as big as this one in its early days.

Thought: make the class non-final, but all of the defined methods final, and any internal data properties private. That way we know that a child class cannot break any of the existing guarantees, but can still add convenience methods or static method constructors on top of the existing API, without the need for an interface and a very verbose composing class.

--Larry Garfield
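
A minimal sketch of that pattern, using a hypothetical Example\Uri class (and a made-up fromHostAndPath() factory) purely as a stand-in for the built-in classes:

<?php

namespace Example;

// Not the RFC's API - just an illustration of "non-final class, final
// methods, private state".
class Uri
{
    // Private state: child classes cannot read or corrupt it.
    private string $raw;

    final public function __construct(string $uri)
    {
        $this->raw = $uri; // the real class would parse and validate here
    }

    // Final methods: the existing guarantees cannot be overridden.
    final public function toRawString(): string
    {
        return $this->raw;
    }
}

// A child class can still add convenience helpers and static constructors.
class BaseUri extends Uri
{
    public static function fromHostAndPath(string $host, string $path): static
    {
        return new static('https://' . $host . $path);
    }
}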

Hi Tim,

The same is true for UriOperationException. The RFC says that it can
happen for memory issues. Can this actually happen? My understanding is
that the engine bails out when an allocation fails. In any case if a
more graceful handling is desired it should be some generic
OutOfMemoryError rather than an extension-specific exception.

After checking the code of emalloc et al., I agree with you: the exception won’t actually
be thrown for memory errors. Therefore, I removed this part of the RFC.

With regard to unserialization, let me refer to:
https://externals.io/message/118311. ext/random uses \Exception and I
suggest ext/uri to do the same. This should also be handled in a
consistent way across extensions, e.g. by reproposing
https://wiki.php.net/rfc/improve_unserialize_error_handling.

Thanks for bringing this RFC to my attention. I agree with the motivation, so I
changed this aspect of the RFC as well: it now throws an \Exception.

And with “Theoretically, URI component reading may also trigger this
exception” being a theoretical issue only, the UriOperationException
is not actually necessary at all.

I wanted to reserve the right for any 3rd-party internal URI implementation to fail for any reason that prevents reading. The built-in implementations
certainly don’t fail, but that doesn’t mean that 3rd-party implementations can’t. Since potential errors can be handled in some way, I think it makes sense
to keep this exception, especially because it’s now basically non-triggerable for the built-in implementations.

I’m not sure if I’m entirely correct, but it’s possible that a 3rd-party URI implementation won’t (or cannot) use PHP’s memory manager and relies on regular
malloc instead: in this case, even memory errors could lead to failures.
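
To illustrate how a consumer could still handle such a failure gracefully, here is a short sketch; it assumes the exception keeps the UriOperationException name used here and lives in the Uri namespace, which may of course change.

<?php

// The built-in implementations are not expected to throw while reading a
// component, but a hypothetical 3rd-party backend could.
function hostOrNull(Uri\Rfc3986\Uri $uri): ?string
{
    try {
        return $uri->getHost();
    } catch (Uri\UriOperationException $e) {
        // e.g. a failure in an implementation that bypasses PHP's memory
        // manager and allocates with plain malloc()
        return null;
    }
}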

Regards,
Máté

Hi Ignace,

All URI components - with the exception of the host - can be
retrieved in two formats:

I believe you mean - with the exception of the port

Even though I specifically meant WHATWG’s host, which is available in only one format, you are right: the port is never available in two formats. So I’ve
changed the wording accordingly.

0 - It is unfortunate that there’s no IDNA support for RFC 3986. I understand the reasoning behind that decision, but I was wondering if it
was possible to opt in to its use when the ext-intl extension is present?

Good question! I think that’s probably not the main concern, though. My specific concern is that
RFC 3987 is roughly as long as RFC 3986; in a lot of cases it uses the exact
wording of the original RFC but changes URI to IRI, and of course it adds the
IDNA-specific parts. Maybe it’s just me, but it’s not easy to find out exactly what
has to be implemented on top of RFC 3986, and also how it can best be achieved.
By extending the RFC 3986 class? By creating a totally separate class that can
transform itself into an RFC 3986 URI? These and quite a few other questions have
to be answered first, which I would like to postpone.

1 - Does it mean that if/when Rfc3986\Uri gets RFC 3987 support, it
will also get Uri::toDisplayString and Uri::getHostForDisplay?
Maybe this should be stated in the Future Scope?

It’s a question that I also asked myself. For now, I’d say that
Rfc3986\Uri shouldn’t have these methods, since it doesn’t support any such
capabilities. But Rfc3986\Iri should likely have these toString methods.

4 - For consistency I would use toRawString and toString just like it is
done for components.

I’m fine with this; I also think doing so would reasonably continue the convention
the getters follow.

5 - Can the returned array from __debugInfo be used in a “normal” method
like toComponents (the naming can be changed/improved) to ease migration from
parse_url, or is this left for userland libraries?

I intend to add the __debugInfo() method purely to help with debugging. Without this
method, even I had a hard time comparing the expected vs. actual
URIs in my tests.

But more importantly, sometimes the recomposed string is not enough to understand
exactly what value each component has. For example,
one can naively assume that the “mailto:kocsismate@php.net” URI has a
user(info) component of “kocsismate” and a hostname of “php.net” (I probably
assumed so too before reading the RFCs). The representation provided by
__debugInfo() quickly highlights that “kocsismate@php.net” is in fact the path.
One could try to call the individual getters to find the needed component, but having
a method like __debugInfo() provides a much clearer picture of the anatomy of
the URI.
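
As a quick illustration (assuming the Uri\Rfc3986\Uri::parse() factory and the getter names from the current draft of the RFC):

<?php

$uri = Uri\Rfc3986\Uri::parse('mailto:kocsismate@php.net');

var_dump($uri->getUserInfo()); // NULL - not "kocsismate"
var_dump($uri->getHost());     // NULL - not "php.net"
var_dump($uri->getRawPath());  // "kocsismate@php.net" - the whole thing is the path

// var_dump()-ing the object itself goes through __debugInfo() and shows all
// raw components at once, which makes the anatomy above obvious at a glance.
var_dump($uri);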

But otherwise I don’t know how useful this method would be. Is there anything else
besides helping the migration?

Regards,
Máté

Hi Nicolas,

For now, let me just quickly respond to your question regarding __debugInfo():

The RFC is also missing whether __debugInfo returns raw or non-raw components. Then, I’m wondering if we need this per-component breakdown for debugging at all? It might be less confusing (on this encoding aspect) to dump basically what __serialize() returns (under another key than __uri of course).
This would also close the avenue of calling __debugInfo() directly (at the cost of making it possibly harder to move away from parse_url(), but I don’t think we need to make this simpler - getting familiar with the new API first would be required and actually welcome.)

I have mostly answered this already in my latest message to Ignace: yes, I think it makes sense to provide a clear picture of the anatomy of a URL in some cases. The method uses raw component values in order not to skew the original data.

Máté