[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

On Thu, Dec 5, 2024, at 5:16 PM, Christoph M. Becker wrote:

On 05.12.2024 at 22:49, Máté Kocsis wrote:

I'm not sure I like the `Interface` suffix on the `UriInterface`
interface. Just `Uri\Uri` would be equally expressive.

Yes, I was expecting this debate :slight_smile: To be honest, I never liked interfaces without an "Interface" suffix, and my dislike didn't go away when I had to use such an interface somewhere, because it was difficult for me to find out what the symbol I was typing actually referred to.

By the same argument, you could come up with code like

<?php
class User {
    const defaultGroupNameConstant = "users";
    private string $nameVariable;
    public function getNameMethod() {…}
    …
}
?>

But apart from my personal experiences, I prefer to stay with "UriInterface" because the two most well-known internal PHP interfaces also have the same suffix (DateTimeInterface, SessionHandlerInterface), and this name definitely conveys that people should not try to instantiate it.

DateTimeInterface has been introduced after there had already been
DateTime. Otherwise, we would likely have DateTime, DateTimeMutable and
DateTimeImmutable. (Or only DateTime as immutable class.)

SessionHandler/SessionHandlerInterface have been bad naming choices, in
my opinion. The interface could have been SessionHandler, and the class
DefaultSessionHandler (and should have been final). I dislike these
Interface and Implementation (or abbreviations of these) suffixes.

Christoph

I used to be in favor of *Interface, but over time realized how useless it was. :slight_smile: I have stopped doing it in my own code and my code reads way better. Also, the majority of PHP's built-in interfaces (Traversable, Countable, etc.) are not suffixed, AFAIK, so it's better to avoid it for consistency. As noted, DateTimeInterface is a special-case outlier.

--Larry Garfield

It seems that I’ve mucked up the mailing list again by deleting an old message I intended to reply to. Apologies all around for replying to an older message of my own.
Máté, thanks for your continued work on the URL parsing RFC. I’m only now returning from a bit of an extended leave, so I appreciate your diligence and patience. Here are some thoughts in response to your message from Nov. 19, 2024.

even though the majority does, not everyone builds a browser application

with PHP, especially because URIs are not necessarily accessible on the web

This has largely been touched on elsewhere, but I will echo the idea that it seems valid to have two separate parsers for the two standards; they truly diverge enough that sharing an interface could be little more than superficial.

I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable attack vector targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.

Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?

Coming from the XHTML/HTML/XML side I know that there was substantial effort to enforce standards on browsers and that led to decades of security exploits and confusion, when the “official” standards never fully existed in the way people thought. I don’t mean to start any flame wars, but is the URL story at all similar here?

I’m mostly worried that we could accidentally encourage risky behavior for developers who aren’t familiar with the nuances of having two URL specifications vs. having the simplest, least-specific interface point them in the right direction for what they will probably be doing. `parse_url()` is a great example of how the thing that looks _right_ is actually terribly prone to failure.

The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use

this variant of the parse() method if you want to parse a WhatWg compliant
URL

If this means passing something like the following then I suppose it’s okay. It would be nice to be able to know without passing the second parameter, as there are a multitude of cases where no such base URL would be available, and some dummy parameter would need to be provided.

    $url = Uri\WhatWgUri::parse( $url, 'https://example.com' )
    var_dump( $url->is_relative_or_something_like_that );

This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that `https://example.com` does not replace the actual host part if one is provided in `$url`. For example, this code should work.

    $url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
    $url->domain === 'wiki.php.net'

The forDisplay() method also seems to be useful at first glance, but since this may be a controversial optional feature, I'd defer it for later…

Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com".

This is a misleading URL to human readers, which is why the WhatWG indicates that “browsers should render a URL’s host by running domain to Unicode with the URL’s host and false.” [URL Standard].

The lack of a standard method here means that (a) most code won’t render the URLs the way a human would recognize them, and (b) those who do will run to inefficient and likely-incomplete user-space code to try and decode/render these hosts.
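As a sketch of what that user-space fallback looks like today (assuming the intl extension is loaded; this is not part of any proposed URI API), PHP's `idn_to_utf8()` can recover the Unicode form of a Punycode host:

```php
<?php
// Hedged illustration: requires ext/intl; not part of the proposed URI API.
// "xn--go8h.com" is the Punycode (ASCII) form of the 🐘.com domain.
$host = 'xn--go8h.com';
$unicode = idn_to_utf8($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

echo $unicode, "\n"; // 🐘.com
```

Doing this correctly for every host (IP literals, mixed-script confusables, failure cases) is exactly the kind of incomplete user-space work a native method would avoid.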

It may be something fine for a follow-up to this work, but it’s also something I personally consider essential for any native support of handling URLs that are destined for human review. If sending to an `href` attribute it should be the normalized URL; but if displayed as text it should be easy to prevent tricking people in this way.

In my HTML decoding RFC I tried to bake in this decision in the type of the function using an enum. Since I figured most people are unaware of the role of the context in which HTML text is decoded, I found the enum to be a suitable convenience as well as educational tool.

    $url->toString( Uri\WhatWg\RenderContext::ForHumans ); // 䕮䕵䕶䕱.com
    $url->toString( Uri\WhatWg\RenderContext::ForMachines ); // xn--google.com

The names probably are terrible in all of my code snippets, but at this point I’m not proposing actual names, just code samples good enough to illustrate the point. By forcing a choice here (no default value) someone will see the options and probably make the right call.

----

This is all looking quite nice. I’m happy to see how the RFC continues to develop, and I’m eagerly looking forward to being able to finally rely on PHP’s handling of URLs.

Happy new year,
Dennis Snell

On 03/01/2025 08:18, Dennis Snell wrote:


Hi Dennis,

> I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.

Here's my take on both specs. RFC3986/87 is a "parsing" RFC which leaves validation to each individual scheme. For instance, the following URL is valid under RFC3986 but will be problematic under the WHATWG URL spec:

ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)

The LDAP URL is RFC3986 compliant but adds its own validation rules on top of the RFC. This means that LDAP URL generation would be problematic
if we only implemented the WHATWG spec, hence why having an RFC3986/87 URI in PHP is crucial.
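As an aside, PHP's current parse_url() (which is closer to the RFC 3986 grammar here) splits that LDAP URL without complaint; this is only an illustration with today's function, not the proposed API:

```php
<?php
// Illustration: parse_url() accepts the RFC3986-valid LDAP URL as-is,
// leaving the scheme-specific parts opaque for LDAP itself to interpret.
$ldap = 'ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)';
$parts = parse_url($ldap);

echo $parts['scheme'], "\n"; // ldap
echo $parts['host'], "\n";   // ldap1.example.net
echo $parts['port'], "\n";   // 6666
var_dump($parts['path'], $parts['query'] ?? null); // scheme-specific parts left opaque
```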

Furthermore, the WHATWG spec not only parses but at the same time validates and more aggressively normalizes the URL, something RFC3986 does not do; more precisely, RFC3986 recognizes and categorizes normalizations into two categories, the non-destructive and the destructive ones. These normalizations affect the scheme, the path, and also the host, which can be very impactful in your application.

For the following URL 'https://0073.0232.0311.0377/b':

RFC3986:    'https://0073.0232.0311.0377/b'
WHATWG URL: 'https://59.154.201.255/b'
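That divergence can be reproduced with stock PHP; in this sketch, octdec() merely mimics the WHATWG host parser's numeric-label handling (a leading 0 means octal):

```php
<?php
// parse_url() behaves like the RFC 3986 side here: the host stays opaque.
$parts = parse_url('https://0073.0232.0311.0377/b');
echo $parts['host'], "\n"; // 0073.0232.0311.0377

// A WHATWG parser instead interprets each dotted label as a number:
$labels = array_map('octdec', explode('.', $parts['host']));
echo implode('.', $labels), "\n"; // 59.154.201.255
```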

So this can be a source of confusion for developers. Last but not least, RFC3986 alone will never be able to parse IDN domain names and requires support for RFC3987 IDN domains to do so.

Hopefully with those examples you will understand the strengths and weaknesses of each spec and why IMHO PHP needs both to be up to date.

Hi Dennis,

I only harp on the WhatWG spec so much because for many people this will be the only one they are aware of, if they are aware of any spec at all, and this is a sizable vector of attack targeting servers from user-supplied content. I’m curious to hear from folks here what fraction of the actual PHP code deals with RFC3986 URLs, and of those, if the systems using them are truly RFC3986 systems or if the common-enough URLs are valid in both specs.

I think Ignace’s examples already highlighted that the two specifications differ in nuances so much that even I had to admit, after months of trying to squeeze them into the same interface, that doing so would be irresponsible.
The Uri\Rfc3986\Uri class will be useful for many use-cases (e.g. representing URNs or URIs with scheme-specific behavior, like ldap apparently), and even the UriInterface of PSR-7 can build upon it. On the other hand, Uri\WhatWg\Url will be useful for representing browser links and any other URLs for the web (e.g. an HTTP application router component should use this class).

Just to enlighten me and possibly others with less familiarity, how and when are RFC3986 URLs used and what are those systems supposed to do when an invalid URL appears, such as when dealing with percent-encodings as you brought up in response to Tim?

I am not 100% sure what I brought up to Tim, but certainly, the biggest difference between the two specs regarding percent-encoding was recently documented in the RFC: https://wiki.php.net/rfc/url_parsing_api#percent-encoding . The other main difference is how the host component is stored: WHATWG automatically percent-decodes it, while RFC3986 doesn’t. This is summarized in the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section (a bit below).
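A quick way to see why that component-aware decoding matters, using only today's string functions (not the proposed API): blanket percent-decoding of a query string changes its structure.

```php
<?php
// One parameter whose *value* contains an encoded '&' (%26).
$query = 'a=1&b=%262';

parse_str($query, $ok);
var_dump($ok['b']); // string(2) "&2": the value survives intact

// Decoding the whole string first turns the encoded '&' into a separator:
parse_str(rawurldecode($query), $broken);
var_dump($broken['b']); // string(0) "": the value was torn apart
```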

This would be fine, knowing in hindsight that it was originally a relative path. Of course, this would mean that it’s critical that `https://example.com` does not replace the actual host part if one is provided in `$url`. For example, this code should work.

$url = Uri\WhatWgUri::parse( 'https://wiki.php.net/rfc', 'https://example.com' );
$url->domain === 'wiki.php.net'

Yes, that’s the case. Both classes only use the base URL for relative URIs.

Hopefully this won’t be too controversial, even though the concept was new to me when I started having to reliably work with URLs. I chose the example I did because of human risk factors in security exploits. "xn--google.com" is not in fact a Google domain, but an IDNA domain decoding to "䕮䕵䕶䕱.com"

I got your point, so I implemented your suggestion. Actually, I made yet another larger API change in the meanwhile, but in any case, the WHATWG implementation now supports IDNA the following way:

$url = Uri\WhatWg\Url::parse("https://🐘.com/🐘?🐘=🐘", null);

echo $url->getHost();                // xn--go8h.com
echo $url->getHostForDisplay();      // 🐘.com
echo $url->toString();               // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString();        // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

Unfortunately, RFC3986 doesn’t support IDNA (as Ignace already pointed out at the end of https://externals.io/message/126182#126184), and adding support for RFC3987 (and therefore IRIs) would be a very heavy amount of work; it’s just not feasible within this RFC :frowning: To make things worse, its code would have to be written from scratch, since I haven’t found any suitable C library yet for this purpose. That’s why I’ll leave them for later.

On other notes, let me share some of the changes since my previous message to the mailing list:

  • First and foremost, I removed the Uri\Rfc3986\Uri::normalize() method from the proposal after Arnaud’s feedback. Now, both the normalized (and decoded) and the non-normalized representations can equally be retrieved from the same URI instance. This change was necessary so that users can consistently work with URIs. Now, if someone needs an exact URI component value, they can use the getRaw*() getters. If they want the normalized and percent-decoded form, then the get*() getters should be used. For more information, the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section should be consulted.
  • I made a few less important API changes, like converting the WhatWgError class to an enum, adding a Uri\Rfc3986\IUri::getUserInfo() method, changing the return type of some getters (removing nullability) etc.
  • I fixed quite a few smaller details of the implementation, along with a very important spec incompatibility: until now, the “path” component didn’t contain the leading “/” character when it should have. Now, both classes conform to their respective specifications with regard to path handling.

I think the RFC is now mature enough to consider voting in the foreseeable future, since most of the concerns which came up until now have been addressed one way or another. However, the only remaining question that I still have is whether the Uri\Rfc3986\Uri and Uri\WhatWg\Url classes should be final. Personally, I don’t see much problem with opening them for extension (other than some technical challenges that I already shared a few months ago), and I think people will have legitimate use cases for extending these classes. On the other hand, having final classes may allow us to make slightly more significant changes without BC concerns until we have a more battle-tested API, and would of course completely eliminate the need to overcome the said technical challenges. According to Tim, it may also result in safer code, because spec-compliant base classes cannot be extended by possibly non-spec-compliant/buggy children. I don’t necessarily fully agree with this specific concern, but here it is.

Regards,
Máté

Hi

Am 2025-02-16 23:01, schrieb Máté Kocsis:

I only harp on the WhatWG spec so much because for many people this will
be the only one they are aware of, if they are aware of any spec at all,
and this is a sizable vector of attack targeting servers from user-supplied
content. I’m curious to hear from folks here what fraction of the actual PHP
code deals with RFC3986 URLs, and of those, if the systems using them are
truly RFC3986 systems or if the common-enough URLs are valid in both specs.

I think Ignace's examples already highlighted that the two specifications
differ in nuances so much that even I had to admit after months of trying
to squeeze them into the same interface that doing so would be
irresponsible.

I think this is also a good argument in favor of finally making the classes final. Not making them final would allow for irresponsible sub-classes :slight_smile:

echo $url->getHost();           // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString();          // https://xn--go8h.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98
echo $url->toDisplayString();   // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

The naming of these methods seems to be a little inconsistent. It should either be:

     ->getHostForDisplay()
     ->toStringForDisplay()

or

     ->getDisplayHost()
     ->toDisplayString()

but not a mix between both of them.

I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until now are
addressed some way or another. However, the only remaining question that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
should be final? Personally, I don't see much problem with opening them for

Yes. Besides the remark above, my previous arguments still apply (e.g. `with()`ers not being able to construct instances for subclasses, requiring to override all of them). I'm also noticing that serialization is unsafe with subclasses that add a `$__uri` property (or perhaps any property at all?).

--------------------

We already had extensive off-list discussion about the RFC and I agree it's in a good shape now. I've given it another read and here's my remarks:

1.

The `toDisplayString()` method that you mentioned above is not in the RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?

2.

The example output of the `$errors` array does not match the stub. It contains a `failure` property, should that be `softError` instead?

3.

The RFC states "When trying to instantiate a WHATWG Url via its constructor, a Uri\InvalidUriException is thrown when parsing results in a failure."

What happens for Rfc3986 when passing an invalid URI to the constructor? Will an exception be thrown? What will the error array contain? Is it perhaps necessary to subclass Uri\InvalidUriException for use with WhatWgUrl, since `$errors` is not applicable for 3986?

4.

The RFC does not specify when `UninitializedUriException` is thrown.

5.

The RFC does not specify when `UriOperationException` is thrown.

6.

Generally speaking I believe it would help understanding if you would add a `/** @throws InvalidUriException */` to each of the methods in the stub to make it clear which ones are able to throw (e.g. resolve(), or the withers). It's harder to find this out from “English” rather than “code” :slight_smile:

7.

In the “Component retrieval” section: Please add even more examples of what kind of percent-decoding will happen. For example, it's important to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is decoded to `=`. This really is the same case as with `%2F` in a path. The explanation

"the URI is normalized (when applicable), and then the reserved characters in the context of the given component are percent-decoded. This means that only those reserved characters are percent-decoded that are not allowed in a component. This behavior is needed to be able to unambiguously retrieve components."

alone is not clear to me. “reserved characters that are not allowed in a component”. I assume this means that `%2F` (/) in a path will not be decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
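If my reading is right, a tiny sketch with plain string functions shows why `%2F` has to stay encoded in a path (again, today's functions, not the proposed getters):

```php
<?php
// A path whose middle segment is literally named "a/b", encoded as a%2Fb.
$path = '/files/a%2Fb/c';

var_dump(count(explode('/', $path)));                // int(4): boundaries intact
var_dump(count(explode('/', rawurldecode($path))));  // int(5): a boundary appeared
```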

8.

In the “Component retrieval” section: You compare the behavior of WhatWgUrl and Rfc3986Uri. It would be useful to add something like:

     $url->getRawScheme() // does not exist, because WhatWgUrl always normalizes the scheme

to better point out the differences between the two APIs with regard to normalization (it's mentioned, but having it in the code blocks would make it more visible).

9.

In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode `?` and `#` as necessary. Will the same happen for Rfc3986? Will the encoding of `#` also happen for the query-string component? The RFC only mentions the path component.

I'm also wondering if there are cases where the withers would not round-trip, i.e. where `$url->withPath($url->getPath())` would not result in the original URL?

10.

Can you add examples where the authority / host contains IPv6 literals? It would be useful to specifically show whether or not the square brackets are returned when using the getters. It would also be interesting to see whether or not IPv6 addresses are normalized (e.g. shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).
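For comparison, the canonical shortening can already be observed by round-tripping through PHP's binary address functions (an illustration only; whether the proposed getters do this is exactly the open question):

```php
<?php
// inet_pton()/inet_ntop() canonicalize an IPv6 address to its compressed form.
$bin = inet_pton('2001:db8:0:0:0:0:0:1');
echo inet_ntop($bin), "\n"; // 2001:db8::1
```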

11.

In “Component Recomposition” the RFC states "The Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".

Does this mean that toString() for Rfc3986 will always return the original input?

12.

It would be useful to know whether or not the classes implement `__debugInfo()` / how they appear when `var_dump()`ing them.

Best regards
Tim Düsterhus

Hi

[dropping Dennis from the Cc list]

Am 2025-02-21 13:06, schrieb Tim Düsterhus:

We already had extensive off-list discussion about the RFC and I agree it's in a good shape now. I've given it another read and here's my remarks:

One more thing that came to my mind, but where I'm not sure what the correct choice is:

Naming of `WhatWgError` and `WhatWgErrorType`. They are placed within the `Uri\WhatWg` namespace, making the `WhatWg` in their name a little redundant.

For Exceptions the recommendation is to use this kind of redundant naming, to make implicit imports for catch blocks more convenient compared to needing to alias each and every `Exception` class. The same reasoning *could* also apply here, but here I find it less obvious.

The alternative would probably be `Uri\WhatWg\Error` and `Uri\WhatWg\Error\Type`.

No strong opinion from my side, but wanted to mention it nevertheless.

Best regards
Tim Düsterhus

Hi Máté,

I’ve read the latest version of the RFC and while I very much like the RFC, I have some remarks.

The paragraph at the beginning of the RFC, in the Relevant URI specifications > WHATWG URL section, seems to be incomplete.

I don’t really understand how the UninitializedUriException exception can be thrown?
Is it somehow possible to create an instance of a URI without initializing it?
This seems unwise in general.

I’m not really convinced by using the constructor to be able to create a URI object.
I think it would be better for it to be private/throwing and have two static constructor parse and tryParse,
mimicking the API that exists for creating an instance of a backed enum from a scalar.

I think changing the name of the toString method to toRawString better matches the rest of the proposed API,
and also removes the question as to why it isn’t the magic method __toString.

I will echo Tim’s concerns about the non-final-ity of the URI classes.
This seems like a recipe for disaster.
I can maybe see the usefulness of extending Rfc3986\Uri by a subclass Ldap\Uri,
but being able to extend the WhatWg URI makes absolutely no sense.
The point of these classes is that if you have an instance of one of these, you know that you have a valid URI.
Being able to subclass a URI and mess with the equals, toString, toNormalizedString methods throws away all the safety guarantees provided by possessing a Uri instance.

Moreover, like Tim previously mentioned, if you subclass you need to override all the methods,
and you might end up in a situation similar to the one which led to the removal of the common Uri interface in the first place.
Which basically suggests creating a new Uri class instead of extending anyway.

Making these classes final just removes a lot of edge cases, some that I don’t think we can anticipate,
while also simplifying other aspects, like serialization.
As you won’t need that weird __uri property any longer.

Similarly, I don’t understand why the WhatWgError is not final.
Even if subclassing of the Uri classes is allowed, any error it would have would not be a WhatWg one,
so why should you be able to extend it.

Parsing API and why Monads wouldn’t solve the soft error case anyway.
This is just a remark, but you wouldn’t be able to really implement a monad if you want to support partial success.
So I’m not sure mentioning the lack of monadic support in PHP is the best argument against them for this RFC.

···

Best regards,

Gina P. Banyard

On Friday, 28 June 2024 at 21:06, Máté Kocsis kocsismate90@gmail.com wrote:

Hi Everyone,

I’ve been working on a new RFC for a while now, and time has come to present it to a wider audience.

Last year, I learnt that PHP doesn’t have built-in support for parsing URLs according to any well established standards (RFC 1738 or the WHATWG URL living standard), since the parse_url() function is optimized for performance instead of correctness.

In order to improve compatibility with external tools consuming URLs (like browsers), my new RFC would add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.

You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api

Regards,
Máté

-----Original Message-----
From: Tim Düsterhus <tim@bastelstu.be>
Sent: Sunday, February 23, 2025 5:05 PM
To: Máté Kocsis <kocsismate90@gmail.com>
Cc: Internals <internals@lists.php.net>
Subject: Re: [PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing
API

Naming of `WhatWgError` and `WhatWgErrorType`. They are placed within the

`Uri\WhatWg` namespace, making the `WhatWg` in their name a little
redundant.

Hey,

As those are URI validation errors, maybe something like
`Uri\WhatWg\ValidationError` would be both less clashy and less redundant?

If I'd see `WhatWgError` without seeing the "Uri" keyword I'd probably think
it's related to other aspects of the spec, e.g. something went wrong with
the HTML parsing. Although I understand it's validating against the WhatWg
spec, `UriError` would seem clearer to me.

BR,
Juris

Hi all,

In earlier discussions on the Server-Side Request and Response Objects (v2) RFC and its after-action summary, one of the common non-technical objections was that this would be better handled in userland.

Seeing as there are at least two other WHATWG-URL projects in userland now ...

- rowbot/url - Packagist
- esperecyan/url - Packagist

... does the same objection continue to hold?

-- pmj

On Sunday, 23 February 2025 at 17:57, Paul M. Jones <pmjones@pmjones.io> wrote:

Hi all,

In earlier discussions on the Server-Side Request and Response Objects RFC and the after-action summary, one of the common non-technical objections was that this would be better handled in userland.

Seeing as there are at least two other WHATWG-URL projects in userland now ...

- rowbot/url - Packagist
- esperecyan/url - Packagist

... does the same objection continue to hold?

Considering that one of the other stated goals of this RFC is to provide this API to other core extensions, the previous objections do not apply here.

Best regards,

Gina P. Banyard

Hi

Am 2025-02-23 18:30, schrieb Gina P. Banyard:

2.
I don't really understand how the UninitializedUriException exception can be thrown?
Is it somehow possible to create an instance of a URI without initializing it?

It's mentioned in the RFC (it was not yet, when I read through the RFC):

This can happen for example when the object is instantiated via ReflectionClass::newInstanceWithoutConstructor().

Incidentally this is *also* something that would be fixed by making the classes `final`, since it's illegal to bypass the constructor for final internal classes:

     <?php

     $r = new ReflectionClass(Random\Engine\Mt19937::class);
     $r->newInstanceWithoutConstructor();

results in:

     Fatal error: Uncaught ReflectionException: Class Random\Engine\Mt19937 is an internal class marked as final that cannot be instantiated without invoking its constructor

This seems unwise in general.

I agree. This exception is not really actionable by the user and more of a “should never happen” case. It should be prevented from appearing.

The same is true for `UriOperationException`. The RFC says that it can happen for memory issues. Can this actually happen? My understanding is that the engine bails out when an allocation fails. In any case if a more graceful handling is desired it should be some generic `OutOfMemoryError` rather than an extension-specific exception.

With regard to unserialization, let me refer to: What type of Exception to use for unserialize() failure? - Externals. ext/random uses `\Exception` and I suggest ext/uri to do the same. This should also be handled in a consistent way across extensions, e.g. by reproposing PHP: rfc:improve_unserialize_error_handling.

And with “Theoretically, URI component reading may also trigger this exception” being a theoretical issue only, the `UriOperationException` is not actually necessary at all.

3.
I'm not really convinced by using the constructor to be able to create a URI object.
I think it would be better for it to be private/throwing and to have two static constructors, `parse` and `tryParse`,
mimicking the API that exists for creating an instance of a backed enum from a scalar.

Enums are a little different in that they are singletons. The Dom\HTMLDocument class with only named constructors might be a better comparison. But I don't have a strong opinion on constructor vs. named constructor here.
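For illustration, here is a minimal userland sketch of the `parse`/`tryParse` shape being discussed, modeled on `BackedEnum::from()`/`tryFrom()`. The class name and the `filter_var()` stand-in validation are hypothetical; the real extension would run an actual spec-compliant parser.

```php
final class SketchUrl
{
    private function __construct(private readonly string $raw) {}

    /** Throwing variant, analogous to BackedEnum::from(). */
    public static function parse(string $uri): self
    {
        return self::tryParse($uri)
            ?? throw new InvalidArgumentException("Invalid URL: $uri");
    }

    /** Null-returning variant, analogous to BackedEnum::tryFrom(). */
    public static function tryParse(string $uri): ?self
    {
        // Stand-in validation only; not a real URL parser.
        return filter_var($uri, FILTER_VALIDATE_URL) !== false
            ? new self($uri)
            : null;
    }

    public function toString(): string
    {
        return $this->raw;
    }
}
```

The pair gives callers an explicit choice between an exception and a `null` on failure, which is the main ergonomic argument for the enum-style naming.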

Best regards
Tim Düsterhus

Hi

Am 2025-02-23 18:47, schrieb Juris Evertovskis:

As those are URI validation errors, maybe something like
`Uri\WhatWg\ValidationError` would be both less clashy and less redundant?

I like that suggestion.

Best regards
Tim Düsterhus

On 21/02/2025 13:06, Tim Düsterhus wrote:

Hi

Am 2025-02-16 23:01, schrieb Máté Kocsis:

I only harp on the WhatWG spec so much because for many people this will
be the only one they are aware of, if they are aware of any spec at all,
and this is a sizable vector of attack targeting servers from user-supplied
content. I’m curious to hear from folks here hat fraction of the actual PHP
code deals with RFC3986 URLs, and of those, if the systems using them are
truly RFC3986 systems or if the common-enough URLs are valid in both specs.

I think Ignace's examples already highlighted that the two specifications
differ in nuances so much that even I had to admit after months of trying
to squeeze them into the same interface that doing so would be
irresponsible.

I think this is also a good argument in favor of finally making the classes final. Not making them final would allow for irresponsible sub-classes :slight_smile:

echo $url->getHost();           // xn--go8h.com
echo $url->getHostForDisplay(); // 🐘.com
echo $url->toString();          // https://🐘.com/🐘?🐘=🐘
echo $url->toDisplayString();   // https://🐘.com/%F0%9F%90%98?%F0%9F%90%98=%F0%9F%90%98

The naming of these methods seems to be a little inconsistent. It should either be:

->getHostForDisplay()
->toStringForDisplay()

or

->getDisplayHost()
->toDisplayString()

but not a mix between both of them.

I think the RFC is now mature enough to consider voting in the
foreseeable future, since most of the concerns which came up until now are
addressed some way or another. However, the only remaining question that I
still have is whether the Uri\Rfc3986\Uri and the Uri\WhatWg\Url classes
should be final? Personally, I don't see much problem with opening them for

Yes. Besides the remark above, my previous arguments still apply (e.g. `with()`ers not being able to construct instances for subclasses, requiring to override all of them). I'm also noticing that serialization is unsafe with subclasses that add a `$__uri` property (or perhaps any property at all?).

--------------------

We already had extensive off-list discussion about the RFC and I agree it's in a good shape now. I've given it another read and here's my remarks:

1.

The `toDisplayString()` method that you mentioned above is not in the RFC. Did you mean `toHumanFriendlyString()`? Which one is correct?

2.

The example output of the `$errors` array does not match the stub. It contains a `failure` property, should that be `softError` instead?

3.

The RFC states "When trying to instantiate a WHATWG Url via its constructor, a Uri\InvalidUriException is thrown when parsing results in a failure."

What happens for Rfc3986 when passing an invalid URI to the constructor? Will an exception be thrown? What will the error array contain? Is it perhaps necessary to subclass Uri\InvalidUriException for use with WhatWgUrl, since `$errors` is not applicable for 3986?

4.

The RFC does not specify when `UninitializedUriException` is thrown.

5.

The RFC does not specify when `UriOperationException` is thrown.

6.

Generally speaking I believe it would help understanding if you would add a `/** @throws InvalidUriException */` to each of the methods in the stub to make it clear which ones are able to throw (e.g. resolve(), or the withers). It's harder to find this out from “English” rather than “code” :slight_smile:

7.

In the “Component retrieval” section: Please add even more examples of what kind of percent-decoding will happen. For example, it's important to know if `%26` is decoded to `&` in a query-string. Or if `%3D` is decoded to `=`. This really is the same case as with `%2F` in a path. The explanation

"the URI is normalized (when applicable), and then the reserved characters in the context of the given component are percent-decoded. This means that only those reserved characters are percent-decoded that are not allowed in a component. This behavior is needed to be able to unambiguously retrieve components."

alone is not clear to me. “reserved characters that are not allowed in a component”. I assume this means that `%2F` (/) in a path will not be decoded, but `%3F` (?) will, because a bare `?` can't appear in a path?
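To make the ambiguity concrete, here is a self-contained illustration using plain `rawurldecode()` (not the proposed API): decoding `%2F` inside a path merges the delimiter into the data, so the segment structure is lost.

```php
$path = '/a%2Fb/c';             // two segments: "a/b" and "c"
var_dump(explode('/', trim($path, '/')));
// array(2) { [0]=> string(5) "a%2Fb" [1]=> string(1) "c" }

$decoded = rawurldecode($path); // "/a/b/c"
var_dump(explode('/', trim($decoded, '/')));
// array(3) { [0]=> string(1) "a" [1]=> string(1) "b" [2]=> string(1) "c" }
```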

8.

In the “Component retrieval” section: You compare the behavior of WhatWgUrl and Rfc3986Uri. It would be useful to add something like:

$url->getRawScheme() // does not exist, because WhatWgUrl always normalizes the scheme

to better point out the differences between the two APIs with regard to normalization (it's mentioned, but having it in the code blocks would make it more visible).

9.

In the “Component Modification” section, the RFC states that WhatWgUrl will automatically encode `?` and `#` as necessary. Will the same happen for Rfc3986? Will the encoding of `#` also happen for the query-string component? The RFC only mentions the path component.

I'm also wondering if there are cases where the withers would not round-trip, i.e. where `$url->withPath($url->getPath())` would not result in the original URL?

10.

Can you add examples where the authority / host contains IPv6 literals? It would be useful to specifically show whether or not the square brackets are returned when using the getters. It would also be interesting to see whether or not IPv6 addresses are normalized (e.g. shortening `2001:db8:0:0:0:0:0:1` to `2001:db8::1`).
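As a point of reference for the normalization question, PHP's built-in `inet_pton()`/`inet_ntop()` round-trip already yields the compressed form, so documenting whether the URI classes match this behavior would be useful:

```php
$addr = '2001:db8:0:0:0:0:0:1';
// Round-tripping through the packed binary form compresses the zero runs.
echo inet_ntop(inet_pton($addr)), "\n"; // 2001:db8::1
```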

11.

In “Component Recomposition” the RFC states "The Uri\Rfc3986\Uri::toString() returns the unnormalized URI string".

Does this mean that toString() for Rfc3986 will always return the original input?

12.

It would be useful to know whether or not the classes implement `__debugInfo()` / how they appear when `var_dump()`ing them.

Best regards
Tim Düsterhus

Hi Máté, I just read the final proposal and here are my quick remarks; others may already have highlighted some of them:

I believe there's a typo in the RFC

> All URI components - with the exception of the host - can be retrieved in two formats:

I believe you mean: with the exception of the port.

0 - It is unfortunate that there's no IDNA support for RFC 3986. I understand the reasoning behind that decision, but I was wondering if it would be possible to opt in to it when the ext-intl extension is present?

1 - Does it mean that if/when Rfc3986\Uri gets RFC 3987 support, it will also get `Uri::toDisplayString` and `Uri::getHostForDisplay`? Maybe this should be stated in the Future Scope section?

2 - I would make both classes final and promote decoration for extension. This would reduce security issues a lot.

3 - I would make the constructor private and use `from`/`tryFrom` or `parse`/`tryParse` methods to highlight the difference in results.

4 - For consistency, I would use `toRawString` and `toString`, just as is done for components.

5 - Could the array returned from `__debugInfo` be exposed in a "normal" method like `toComponents` (naming can be changed/improved) to ease migration from `parse_url`, or is this left for userland libraries?

Hi

Am 2025-02-23 18:57, schrieb Paul M. Jones:

In earlier discussions on the [Server-Side Request and Response objects](RFC: Server-Side Request and Response Objects (v2) - Externals) RFC and the [after-action summary]([RFC] [EPILOGUE] Server-Side Request and Response Objects (v2) - Externals), one of the common non-technical objections was that it would be better handled in userland.

I did not read through the entire discussion, but had a look at the “after-action summary” thread and specifically Côme’s response, which you apparently agreed with:

My take on that is more that functionality in core needs to be «perfect», or at least near unanimous.

Or perhaps phrased differently, like I did just a few days ago in: Introduction - Sam Lewis - Externals

The type of functionality that is nowadays added to PHP’s standard library is “building block” functionality: Functions that a userland developer would commonly need in their custom library or application.

*Correctly* processing URIs is a common need for developers and it’s complicated to do right, thus it qualifies as a “building block”.

PHP also already has this functionality in `parse_url()`, but it's severely broken. To me it clearly makes sense to gradually provide better-designed and safer replacement functionality for broken parts of the standard library. This worked for the randomness functionality in PHP 8.2, for DOM in PHP 8.4 and hopefully for URIs in PHP 8.5.
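Two quick illustrations of `parse_url()`'s rough edges (well-known behavior, not taken from the RFC): malformed input yields a bare `false` with no error details, and components come back verbatim with no normalization or percent-decoding.

```php
// A seriously malformed URL just returns false.
var_dump(parse_url('http:///example.com')); // bool(false)

// No case normalization and no percent-decoding of components.
$parts = parse_url('HTTP://EXAMPLE.com/a%2Fb');
var_dump($parts['scheme']); // string(4) "HTTP"
var_dump($parts['path']);   // string(6) "/a%2Fb"
```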

Best regards
Tim Düsterhus

Hi

Am 2025-02-24 10:18, schrieb Ignace Nyamagana Butera:

5 - Could the array returned from `__debugInfo` be exposed in a "normal" method like `toComponents` (naming can be changed/improved) to ease migration from `parse_url`, or is this left for userland libraries?

I would prefer not expose this functionality for the same reason that there are no raw properties provided: The user must make an explicit choice whether they are interested in the raw or in the normalized version of the individual components.

It can make sense to normalize a hostname, but not the path. My usual example against normalizing the path is that SAML signs the *encoded* URI instead of the payload and changing the case in percent-encoded characters is sufficient to break the signature, e.g. `%2f` is different than `%2F` from a SAML signature perspective, requiring workarounds like this: php-saml/lib/Saml2/Utils.php at c89d78c4aa398767cf9775d9e32d445e64213425 · SAML-Toolkits/php-saml · GitHub
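To make the pitfall concrete (illustrative values only, not from the SAML specification): the two spellings below identify the same resource, but a signature check compares raw bytes.

```php
$signed     = '/acs?SAMLRequest=a%2fb'; // as signed (lowercase hex)
$normalized = '/acs?SAMLRequest=a%2Fb'; // after case normalization

var_dump($signed === $normalized);
// bool(false): byte-level signature verification fails
var_dump(rawurldecode($signed) === rawurldecode($normalized));
// bool(true): both decode to the same resource
```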

Best regards
Tim Düsterhus

Hi,

Thanks for all the efforts making this RFC happen, it’ll be a game changer in the domain!

I’m seeing a push to make the classes final. Please don’t!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.

The situation I’m telling about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there's no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we’re going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.

By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define a native UrlInterface, but that'd increase complexity for little to no gain IMHO - although that'd solve my main concern).

5 - Can the returned array from __debugInfo be used in a “normal”
method like toComponents naming can be changed/improve to ease
migration from parse_url or is this left for userland library ?

I would prefer not expose this functionality for the same reason that
there are no raw properties provided: The user must make an explicit
choice whether they are interested in the raw or in the normalized
version of the individual components.

The RFC is also missing whether __debugInfo returns raw or non-raw components. Then, I’m wondering if we need this per-component break for debugging at all? It might be less confusing (on this encoding aspect) to dump basically what __serialize() returns (under another key than __uri of course).
This would also close the avenue of calling __debugInfo() directly (at the cost of making it possibly harder to move away from parse_url(), but I don’t think we need to make this simpler - getting familiar with the new API before would be required and welcome actually.)

It can make sense to normalize a hostname, but not the path. My usual
example against normalizing the path is that SAML signs the encoded
URI instead of the payload and changing the case in percent-encoded
characters is sufficient to break the signature

I would be careful with this argument: signature validation should be done on raw bytes. Requiring an object to preserve byte-level accuracy while the very purpose of OOP is to provide abstractions might be conflicting. The signing topic can be solved by keeping the raw signed payload around.

Hi

Am 2025-02-24 12:08, schrieb Nicolas Grekas:

The situation I'm telling about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot be
achieve because there's no type to compose.

Yes, that's the point: The behavior and the type are intimately tied together. The Uri/Url classes represent values, not services. You wouldn't extend an int either. For DateTimeImmutable, inheritance being legal causes a ton of needless bugs (especially around serialization behavior).

Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By making

For a given specification (RFC 3986 / WHATWG) there is exactly one correct interpretation of a given URL. “Fine-tuning” means that you are no longer following the specification.

the type non-final, we keep things open enough for userland to build on it.

This works:

     final class HttpUrl {
         private readonly \Uri\Rfc3986\Uri $uri;
         public function __construct(string $uri) {
             $this->uri = new \Uri\Rfc3986\Uri($uri);
             if ($this->uri->getScheme() !== 'http') {
                 throw new ValueError('Scheme must be http');
             }
         }
         public function toRfc3986(): \Uri\Rfc3986\Uri {
             return $this->uri;
         }
     }

Userland can easily build their convenience wrappers around the classes, they just need to export them to the native classes which will then guarantee that the result is fully validated and actually a valid URI/URL. Keep in mind that the ext/uri extension will always be available, thus users can rely on the native implementation.

By making the classes non-final, there will be one base type to build upon
for userland.
(the alternative would be to define native UrlInterface, but that'd
increase complexity for little to no gain IMHO - althought that'd solve my
main concern).

Mate already explained why a native UriInterface was intentionally removed from the RFC in php.internals: Re: [RFC] [Discussion] Add WHATWG compliant URL parsing API.

The RFC is also missing whether __debugInfo returns raw or non-raw
components. Then, I'm wondering if we need this per-component break for
debugging at all? It might be less confusing (on this encoding aspect) to
dump basically what __serialize() returns (under another key than __uri of
course).

That would also work for me.

It can make sense to normalize a hostname, but not the path. My usual
example against normalizing the path is that SAML signs the *encoded*
URI instead of the payload and changing the case in percent-encoded
characters is sufficient to break the signature

I would be careful with this argument: signature validation should be done
on raw bytes. Requiring an object to preserve byte-level accuracy while the
very purpose of OOP is to provide abstractions might be conflicting. The
signing topic can be solved by keeping the raw signed payload around.

Yes, the SAML signature behavior is wrong, but I did not write the SAML specification. I just pointed out a possible use-case where choosing the raw or normalized form depends on the component and where a “get all components” function would be dangerous.

Best regards
Tim Düsterhus

Am 2025-02-24 12:08, schrieb Nicolas Grekas:

The situation I’m telling about is when one will accept an argument
described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible
implementation can ever be passed: the native one. Composition cannot
be
achieve because there’s no type to compose.

Yes, that’s the point: The behavior and the type are intimately tied
together. The Uri/Url classes are representing values, not services. You
wouldn’t extend an int either. For DateTimeImmutable inheritance being
legal causes a ton of needless bugs (especially around serialization
behavior).

DateTimeImmutable is a good example of the community-proven usefulness of inheritance:
the Carbon package is hugely successful because it adds a ton of nice helpers (that are better maintained in userland) while still providing compatibility with functions that accept the native type.

The fact that the native implementation had bugs when inheritance was used doesn’t mean inheritance is a problem. It’s just bugs that need to be fixed. Conceptually nothing makes those bugs inevitable.

Closing the class would have hindered community-innovation. The same applies here.

Then, if people make mistakes in their child classes, their problem. But the community shouldn’t be forbidden to extend a class just because mistakes can happen.

Fine-tuning the behavior provided by the RFC is what we might be most
interested in, but we should not forget that we also ship a type. By
making

For a given specification (RFC 3986 / WHATWG) there is exactly one
correct interpretation of a given URL. “Fine-tuning” means that you are
no longer following the specification.

See Carbon example, it’s not specifically about fine-tuning. We cannot anticipate how creative people are. Nor should we prevent them from being so, from the PoV of the PHP engine designers.

the type non-final, we keep things open enough for userland to build on
it.

This works:

     final class HttpUrl {
         private readonly \Uri\Rfc3986\Uri $uri;
         public function __construct(string $uri) {
             $this->uri = new \Uri\Rfc3986\Uri($uri);
             if ($this->uri->getScheme() !== 'http') {
                 throw new ValueError('Scheme must be http');
             }
         }
         public function toRfc3986(): \Uri\Rfc3986\Uri {
             return $this->uri;
         }
     }

Userland can easily build their convenience wrappers around the classes,
they just need to export them to the native classes which will then
guarantee that the result is fully validated and actually a valid
URI/URL. Keep in mind that the ext/uri extension will always be
available, thus users can rely on the native implementation.

This is an example of what I call community-fragmentation: one hardcoded type that should only be used as an implementation detail, but will leak at type-boundaries and will make things inflexible. Each project will have to think about such designs, and many more will get it wrong. (We will be the ones to blame since we’re the ones educated on the topic.)

By making the classes non-final, there will be one base type to build
upon
for userland.
(the alternative would be to define native UrlInterface, but that’d
increase complexity for little to no gain IMHO - althought that’d solve
my
main concern).

Mate already explained why a native UriInterface was intentionally
removed from the RFC in https://news-web.php.net/php.internals/126425.

Then only one option remains: making the classes non-final.

Nicolas

TBH, data-point from someone that spends time removing Carbon usages here :stuck_out_tongue:

The DateTimeImmutable type should’ve been final from the start: it is trivial to declare a userland interface, and then use the DateTimeImmutable type as an implementation detail of a userland-provided interface.

PSR-7, for example, will benefit greatly from this new RFC, without ever having to expose the underlying value type to userland.

Inheritance is a tool to be used when there is LSP-compliant divergence from the original type, and here, the PHP RFC aims at modeling something that doesn’t have alternative implementations: it’s closed for modification, and that’s good.


Marco Pivetta

https://mastodon.social/@ocramius

https://ocramius.github.io/

[1] https://en.wikipedia.org/wiki/Open%E2%80%93closed_principle

[2] https://wiki.php.net/rfc/url_parsing_api#why_a_common_uri_interface_is_not_supported


I’m seeing a push to make the classes final. Please don’t!
This would badly break the open/closed principle to me.

When shipping a new class, one ships two things: a behavior and a type. The behavior is what some want to close by making the class final. But the result is that the type will also be final. And this would lead to a situation where people tightly couple their code to one single implementation - the internal one.

The situation I’m telling about is when one will accept an argument described as
function (\Uri\WhatWg\Url $url)

If the Url class is final, this signature means only one possible implementation can ever be passed: the native one. Composition cannot be achieved because there’s no type to compose.

Fine-tuning the behavior provided by the RFC is what we might be most interested in, but we should not forget that we also ship a type. By making the type non-final, we keep things open enough for userland to build on it. If not, we’re going to end up with a fragmented community: some will tightly couple to the native Url implementation, some others will define a UriInterface of their own and will compose it with the native implementation, all these with non-interoperable base types of course, because interop is hard.

By making the classes non-final, there will be one base type to build upon for userland.
(the alternative would be to define a native UrlInterface, but that’d increase complexity for little to no gain IMHO - although that’d solve my main concern).

The open/closed principle does not mean “open to inheritance”.
Just pulling in the Wikipedia definition: [1]

In object-oriented programming, the open–closed principle (OCP) states “software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification”;

You can extend a class by using a decorator or the delegation pattern.
But most importantly, I would like to focus on the “closed for modification” part of the principle.
Unless we make all the methods final, inheritance allows you to modify the behaviour of the methods, which is in opposition to the principle.

Moreover, if you extend a WhatWg\Uri to behave differently from the WhatWg spec, then you do not have a WhatWg URI.
Which means the type becomes meaningless.

Quoting Dijkstra:

The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise.

A concrete WhatWg\Uri type is abstracting over a raw string.
And it creates a new semantic level where, when you are in possession of such a type, you know with absolute certainty how it behaves and what you can do with it, and know that if a consumer needs a WhatWg URI it will not reject it.
This also means consumers of said WhatWg\Uri type do not need to care about validation of it.

If one is able to extend a WhatWg URI, then none of the above applies, and you just have a raw string with fancy methods.

I.e. you are now vague, and any consumer of the type needs to do validation because it cannot trust the type, and you have created a useless abstraction.

It also seems you did not read the relevant “Why a common URI interface is not supported?” [2] section of the RFC.
The major reason why this RFC has had so many iterations and been in discussion for so long is because Máté tried, again and again, to have a common interface.
But this just does not make any sense, you cannot make something extremely concrete vague and abstract, unless you want to lose all the benefits of the abstraction.

Best regards,

Gina P. Banyard