[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

nyamsprod_the_funky · April 30, 2025, 4:42pm

Hi Máté and Tim

Why can’t the Url::resolve method also expose the $errors parameter, like the constructor and the static parse method? As far as I understand, nothing prevents the API from exposing the errors during URI resolution, which is a proxy for the constructor call, just like the parse named constructor.

On Wed, Apr 30, 2025 at 9:58 AM ignace nyamagana butera <nyamsprod@gmail.com> wrote:

Hi Máté and Tim

I read the following in the RFC

Withers of Uri\WhatWg\Url follow the relevant “setter steps” that are defined by WHATWG URL. Unfortunately, these algorithms sometimes have surprising behavior where modification fails silently, and the original values are kept. For example. Even though this RFC acknowledges the fact that the WHATWG URL “setter steps” have gotchas, it doesn’t try to prevent them - as doing so would be spec-incompliant.

Reading the WHATWG URL specification and checking how Chrome, Firefox and even https://github.com/TRowbotham/URL-Parser behave, I see that mutators either silently reject invalid input or normalize it. I was wondering whether it still makes sense to say that URL mutators can throw InvalidUrlException? As far as I know, only a TypeError could actually be thrown if the wrong input type is given; no specially crafted string can make the spec throw, unless I have overlooked something.

On Tue, Apr 29, 2025 at 8:55 PM Tim Düsterhus <tim@bastelstu.be> wrote:

Hi

On 4/29/25 10:54, ignace nyamagana butera wrote:

I have one last question while reviewing my polyfill implementation. Is it
worth adding a SensitiveParameter attribute to the argument of the
following methods?

Uri\Rfc3986\Uri::withUserInfo

Uri\WhatWg\Url::withPassword

I’m fine with either answer. Does it warrant a paragraph in the RFC? That I
do not know, but I feel the question may be raised.

Good catch. Since they may throw an exception for malformed inputs, they
should have the attribute. Especially since folks might try to use
special characters in passwords, which might need encoding.

No paragraph in the RFC needed, but the attribute should be added to the
“stub”.

Best regards
Tim Düsterhus

Mate_Kocsis · May 3, 2025, 9:05pm

Hi Ignace,

I have just added the SensitiveParameter attribute to the Uri\Rfc3986\Uri::withUserInfo() and Uri\WhatWg\Url::withPassword() methods.
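To illustrate why the attribute matters when a wither throws, here is a minimal sketch. It assumes PHP 8.2 or later, and `withPasswordSketch()` is a hypothetical stand-in, not the RFC's implementation. It only shows that an argument marked with `#[\SensitiveParameter]` does not appear in clear text in the exception's stack trace.

```php
<?php
// Hypothetical stand-in for a wither such as withPassword(); not the real implementation.
function withPasswordSketch(#[\SensitiveParameter] string $password): never
{
    // Pretend the input was rejected as malformed.
    throw new \InvalidArgumentException('The specified password is malformed');
}

$leaked = false;

try {
    withPasswordSketch('s3cr3t:p@ss');
} catch (\InvalidArgumentException $e) {
    // With zend.exception_ignore_args=0 the frame carries its arguments,
    // but the marked one is wrapped in a SensitiveParameterValue object.
    foreach ($e->getTrace()[0]['args'] ?? [] as $arg) {
        if ($arg === 's3cr3t:p@ss') {
            $leaked = true;
        }
    }
}

var_dump($leaked); // bool(false)
```

Without the attribute, and with argument collection enabled, the raw password string would be present in the trace frame.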

Reading the WHATWG URL specification and checking how Chrome, Firefox and even https://github.com/TRowbotham/URL-Parser behave, I see that mutators either silently reject invalid input or normalize it. I was wondering whether it still makes sense to say that URL mutators can throw InvalidUrlException? As far as I know, only a TypeError could actually be thrown if the wrong input type is given; no specially crafted string can make the spec throw, unless I have overlooked something.

I double-checked the implementation and quickly found a case where an exception is thrown:

$url = new Uri\WhatWg\Url("https://example.com");
$url->withHost("[1.2.3.4");

The above code will throw a Uri\WhatWg\InvalidUrlException that refers to the “IPv6-unclosed” WHATWG URL error,
so I think it makes sense to keep the current behavior, especially with respect to possible future changes of the specification.

Regards,
Máté

Mate_Kocsis · May 3, 2025, 9:07pm

Hi Ignace,

Why can’t the Url::resolve method also expose the $errors parameter, like the constructor and the static parse method? As far as I understand, nothing prevents the API from exposing the errors during URI resolution, which is a proxy for the constructor call, just like the parse named constructor.

Sure, that’s also a good catch! It was an omission until now, and I’ve recently fixed this: so now Uri\WhatWg\Url::resolve() has a 2nd parameter ($softErrors).
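A minimal sketch of how the second parameter might be used. It assumes a PHP build that ships `Uri\WhatWg\Url`, that `$softErrors` is filled by reference as described above, and that a `toAsciiString()` method exists as in the RFC draft. The base URL and relative reference are made-up inputs.

```php
<?php
// Sketch only: runs the example just when the proposed class is available.
if (class_exists(\Uri\WhatWg\Url::class)) {
    $base = new \Uri\WhatWg\Url('https://example.com/a/b');

    // Assumed to be populated by reference with any soft (non-fatal) errors.
    $softErrors = [];
    $resolved = $base->resolve('../c', $softErrors);

    echo $resolved->toAsciiString(), PHP_EOL;
    echo count($softErrors ?? []), ' soft error(s) reported', PHP_EOL;
}
```

Passing the variable is optional, so callers who do not care about soft errors can keep calling `resolve()` with a single argument.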

Regards,
Máté

Mate_Kocsis · May 3, 2025, 9:18pm

Hey Ignace,

I see you updated the RFC, but I believe there are still some errors in the example:
$url = Uri\WhatWg\Url::parse("/foo", ".com"); // Throws Uri\WhatWg\InvalidUrlException because of $baseUri

Following Tim’s suggestion, I have finally changed the type of the $baseUrl parameters: a URI/URL instance is now
accepted instead of a string. As Tim mentioned, this can indeed fix some performance issues when the same
base URL is used for instantiating multiple URIs/URLs.

RFC 3986 host normalization states that percent-encoded parts should use uppercase hexadecimal digits, so on normalization:

https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com should be https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com

Yes, indeed! This example output was left over from before I fixed the implementation, so thanks for pointing it out!
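The case normalization discussed here can be sketched in userland as follows. `uppercasePercentTriplets()` is a hypothetical helper, not part of the RFC's API. It covers only the uppercasing of hex digits in percent-encoded triplets, not the rest of RFC 3986 normalization.

```php
<?php
// Hypothetical helper, not part of the RFC's API.
function uppercasePercentTriplets(string $uri): string
{
    // Match each percent-encoded triplet and uppercase its hex digits only.
    return preg_replace_callback(
        '/%[0-9A-Fa-f]{2}/',
        static fn (array $m): string => strtoupper($m[0]),
        $uri
    ) ?? $uri;
}

echo uppercasePercentTriplets('https://%e4%bd%a0%e5%a5%bd%e4%bd%a0%e5%a5%bd.com'), PHP_EOL;
// prints https://%E4%BD%A0%E5%A5%BD%E4%BD%A0%E5%A5%BD.com
```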

Regards,
Máté

Mate_Kocsis · May 5, 2025, 9:32pm

Hi Paul,

I would not presume that the dedicated value objects are what “makes the [Rowbot] library much slower” than the RFC – instead,

my first intuition is that the parsing operations are slower in userland than in C, and are primarily responsible for the comparative slowness.

Speedwise, creation of multiple objects from the parsed results would be a rounding error compared to the parsing itself.

Yes, I may have arrived at the wrong conclusion based on the right factors: the Rowbot library uses objects for not just representing the components,
but even the parser states and other things, whereas in the C library, parsing is just an enormous switch-case. I know that instantiating objects doesn’t
take a lot of time, but I guess the performance difference between a very nicely written, full OO PHP code and an optimized C code will start to be
very much noticeable with a larger iteration number. Anyway, I shouldn’t have tried to compare the performance of the two solutions, since it’s really not
a fair comparison, and not the main point.

I think that’s fair. The main thing that stands out to me is not the Scheme, Host, etc. value objects, but that the RFC presents no UrlRecord –

which is very definitely part of the WHATWG-URL specification. That is, from reading the spec, I’d expect to see a UrlRecord, and a Url composed from it.

I believe the UrlRecord is a minor detail of the specification that is possible to omit without sacrificing anything useful: having a record in addition
to the URL class doesn’t bring much to the table. For similar reasons, the RFC doesn’t implement the WHATWG getters either, and the pure
components are exposed instead (the “Component retrieval” section writes about this). So the RFC does not entirely implement the API prescribed by the
WHATWG URL spec, however it accurately follows the parsing details – which is the main benefit in my opinion.

Meanwhile, AFAICT, neither Rowbot nor the RFC provide a percent encoding mechanism, for consumers to put together properly-encoded values.

Have I missed it in the RFC, or is it somehow not necessary, or something else?

Percent-encoding is usually automatically done for WHATWG (even if soft errors may be triggered during the process), so it was not a top priority for me just yet.
But I definitely want to include some sort of percent-encoding support in the followup I plan. But in any case, thanks for raising awareness of this topic.
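Until dedicated support exists, one possible stopgap is PHP's existing `rawurlencode()`, which encodes according to RFC 3986. The sketch below uses made-up credentials and a made-up host. Note that the WHATWG percent-encode sets differ per component, so this is an approximation rather than a spec-exact encoder.

```php
<?php
// Assumed inputs; rawurlencode() leaves only A-Z a-z 0-9 - _ . ~ untouched.
$user = rawurlencode('jane');
$password = rawurlencode('p@ss:w/rd');

// Assemble the URL string from the pre-encoded userinfo parts.
$url = sprintf('https://%s:%s@example.com/', $user, $password);

echo $url, PHP_EOL;
// prints https://jane:p%40ss%3Aw%2Frd@example.com/
```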

Because it is part of the WHATWG-URL spec, I think it deserves first-class treatment in this RFC …

Having yet another class in the proposal would open the possibility for a whole lot of new discussion. We should draw the line somewhere in order not
to waste everyone’s time, or the PHPFoundation’s budget any longer, should the RFC fail for any reason. And I just draw the line here, since it’s a nice to have
feature, and we have a meaningful set of functionality even without it.

Which leads to my last point: I would really like to see at least two separate RFCs here. They would be a lot easier to review and critique that way:

one for dealing with URIs as they exist now, especially one that honors the ways-of-working that exist in userland; and,

one for dealing with WHATWG-URL in its entirety, with all its differences (some subtle, some not) from URIs.

I can see arguments for either one being the “base” on which the other would build.

I might have agreed to pursue two separate RFCs a few months ago, but not anymore, this close to the end. I should mention that the original
RFC tried to deal with WHATWG URLs only; RFC 3986 URIs were added later, due to public demand. Possibly I should have stepped in around the time
when I included RFC 3986 support. However, working on both specifications in parallel helped me understand a lot of the subtle
differences between them, and after bringing these differences to the surface, the final API design could reflect and tackle them.

Regards,
Máté

Mate_Kocsis · May 5, 2025, 9:36pm

Hello Internals,

After more than a hundred emails refining even the tiniest details, we have reached a point where I’d like to call for a vote.
I know that the new API still doesn’t support many use-cases, it still has missing pieces, but now it includes a cohesive set
of functionality that could be a very useful basic building block for most people.

That said, I don’t intend to change anything about the RFC anymore, unless there’s still some factual error in it. There are a lot of
possibilities how such a large API can look like, and this RFC approaches the problem the way it is currently described,
and not in any other way.

So unless some very serious issues arise, I’m going to start the vote on 8th May, possibly in the morning (according to UTC).

Regards,
Máté

Paul_M_Jones · May 7, 2025, 7:16pm

Hi Máté and all,

On May 5, 2025, at 16:36, Máté Kocsis <kocsismate90@gmail.com> wrote:

Hello Internals,

After more than a hundred emails refining even the tiniest details, we have reached a point where I'd like to call for a vote.
I know that the new API still doesn't support many use-cases, it still has missing pieces, but now it includes a cohesive set
of functionality that could be a very useful basic building block for most people.

That said, I don't intend to change anything about the RFC anymore, unless there's still some factual error in it. There are a lot of
possibilities how such a large API can look like, and this RFC approaches the problem the way it is currently described,
and not in any other way.

So unless some very serious issues arise, I'm going to start the vote on 8th May, possibly in the morning (according to UTC).

I am on record as wanting very much to see some decent web-centric objects in core PHP (Request, Response, Uri/Url, etc).

To my chagrin, despite the fact that its goals are laudable, I do not think this RFC is in a ready state to provide such objects.

Among other things I find troubling, the RFC as presented ...

- is too broad in scope;
- acknowledges it is incomplete, with work left undone;
- admits to standards non-compliance; and,
- has an uncertain API.

## Too Broad In Scope

The RFC attempts to do too much at once: not just making URI/URL parsing "pluggable" for internals, and providing an RFC 3986 compliant parser, but also creating from scratch entirely new RFC 3986 URI and related Exception classes for userland consumption, along with entirely new WHATWG-URL classes and Exceptions.

The RFC itself remarks on "[t]he already large scope of the RFC" -- and the same has been observed during the on-list discussions. Even Máté's message above mentions "There are a lot of possibilities how such a large API can look like".

It would be better to narrow the scope of the RFC to something more manageable.

## Incomplete, Work Left Undone

This is a consequence of the overly-broad scope. The work remaining is by no means certain to be completed or voted in after followup RFCs, either on a short timeline or a long one.

Máté notes above that the RFC "has missing pieces" -- and here are some examples from the RFC itself:

- "Builder classes are not offered by the present RFC just yet. ... this feature is one of the top candidates of a followup RFC."

- "The topic of query parameter manipulation should be discussed as a followup to the current RFC."

- "There are multiple planned features in future scope that should be supported."

- "There are immediate plans to add new capabilities to the new API"

- "the position of this RFC is not to include this interface [URLSearchParams] yet"

It would be better to present a single finished product instead of multiple partially-finished products.

## Standards Non-Compliance

The RFC states early on that "the parse_url() function is offered for parsing URLs, however, it isn't compliant with any standards. ... Incompatibility with current standards is a serious issue" -- but later it says:

Getters of Uri\WhatWg\Url have a few gotchas for the ones who are inherently familiar with the WHATWG URL specification: they don't (entirely) follow the “getter steps” that are defined by the specification, but the individual components are returned directly without any other changes that the “getter steps” would otherwise specify.

The RFC doesn't fully follow the WHATWG-URL standard. This is reminiscent of the complaint regarding parse_url().

Further, "the WHATWG URL specification contains a URLSearchParams interface" but "the position of this RFC is not to include this interface yet".

It would be better to actually follow the WHATWG-URL standard, and not add a partially-compliant and somewhat-nonstandard implementation to core.

## Uncertain API

Because of the unfinished work, and because of the "living standard" nature of WHATWG-URL, the foundation of the API is unsteady:

WHATWG URL doesn't specify percent-decoding rules for most components ... But since the WHATWG URL specification is subject to constant updates, it's possible that normalization or percent-decoding rules change in the future.

"Constant updates" makes me think it is too early to include a WHATWG-URL implementation in core.

Then we have this ...

the current RFC chooses to make the built-in URI implementations final ... until the new API becomes mature enough and becomes tested in practice.

... and this:

Once the API settles, we plan to lift these restrictions [around final classes] at some extent.

If the API needs to "become tested in practice" so that it can "mature" and "settle", it would be better to do that in userland (maybe published on Packagist or PECL) instead of in core.

## Remedies

I think all of the above can be remedied, so that we can finally have some decent web-centric objects in core. But that's a discussion for a later time, one we can have if the RFC does not pass.

-- pmj

Gina_P_Banyard · May 7, 2025, 10:02pm

On Wednesday, 7 May 2025 at 20:20, Paul M. Jones <pmjones@pmjones.io> wrote:

Hi Máté and all,

> On May 5, 2025, at 16:36, Máté Kocsis kocsismate90@gmail.com wrote:
>
> Hello Internals,
>
> After more than a hundred emails refining even the tiniest details, we have reached a point where I'd like to call for a vote.
> I know that the new API still doesn't support many use-cases, it still has missing pieces, but now it includes a cohesive set
> of functionality that could be a very useful basic building block for most people.
>
> That said, I don't intend to change anything about the RFC anymore, unless there's still some factual error in it. There are a lot of
> possibilities how such a large API can look like, and this RFC approaches the problem the way it is currently described,
> and not in any other way.
>
> So unless some very serious issues arise, I'm going to start the vote on 8th May, possibly in the morning (according to UTC).

I am on record as wanting very much to see some decent web-centric objects in core PHP (Request, Response, Uri/Url, etc).

To my chagrin, despite the fact that its goals are laudable, I do not think this RFC is in a ready state to provide such objects.
[...]

-- pmj

Considering that this RFC was in discussion for over 10 months,
and you only started providing input 2 months ago after there have already been serious alterations to it _twice_.
I am not sure your "rant" is something that is at all productive.

You are free to vote against it, but stalling the work someone has committed just because you don't think it is ready is not how any of this works.
Looking from the sidelines, you seem to have the opinion that we should be standardizing existing userland design.
This is not what you want, because if you do this you get POSIX, and POSIX is notoriously inconsistent and kinda bad.
And maybe this is what FIG did, which whatever, but core is not FIG nor userland.

So let's go through your points:

- is too broad in scope;

An RFC author is allowed to choose whatever scope they want.

- acknowledges it is incomplete, with work left undone;

Using multiple RFCs to provide incremental improvements to the language is a standard thing we do.
Therefore, this point is moot.

- admits to standards non-compliance; and,

Non-compliance with what?
WHATWG which is a living standard?
Not having one component of the WHATWG spec?
The same way, the new 8.4 DOM classes don't implement the whole living DOM spec?

- has an uncertain API.

Frankly, 90% of the recent uncertainty has seemingly come from you trying to "rework" the RFC to your own taste.
If you think this should first be an extension or a userland package then feel free to do it, regardless of the result of this vote.
Considering that one of the main maintainers of an actual popular userland URI library has actively been participating in the discussions since the beginning and helped shape this RFC,
makes me believe this is very much ready to vote, compared to the opinion of someone that is trying to chime in last minute.

Sincerely,

Gina P. Banyard

Paul_M_Jones · May 8, 2025, 5:38pm

Hi all,

On May 7, 2025, at 17:02, Gina P. Banyard <internals@gpb.moe> wrote:

On Wednesday, 7 May 2025 at 20:20, Paul M. Jones <pmjones@pmjones.io> wrote:

I am on record as wanting very much to see some decent web-centric objects in core PHP (Request, Response, Uri/Url, etc).

To my chagrin, despite the fact that its goals are laudable, I do not think this RFC is in a ready state to provide such objects.
[...]

-- pmj

Putting these together, one from the beginning ...

Considering that this RFC was in discussion for over 10 months, and you only started providing input 2 months ago after there have already been serious alterations to it _twice_.

... and one from the end:

the opinion of someone that is trying to chime in last minute.

Sure, I can see what it looks like: Johnny-come-lately starts making noise only as things are finishing up.

I can say only that (at least from my perspective) there was no sure way to tell how much longer discussion would go on, or how many more changes might be considered. Based on other experiences, conscience demanded that I offer comments as requested on the RFC, as many others had before me.

I am not sure your "rant" is something that is at all productive.

I am glad for the scare quotes; it was a factual analysis, not a rant. As to whether it is productive, well, one never knows until afterwards.

You are free to vote against it,

I might? If I do, it strikes me as constructive (and polite) to have provided reasons why -- thus my message.

but stalling the work someone has committed just because you don't think it is ready is not how any of this works.

To be fair, just because someone has committed work does not mean the work should be accepted; but, my individual opinion is of relatively little weight there.

Looking from the sidelines, you seem to have the opinion that we should be standardizing existing userland design.

Not exactly. My opinion is that the RFC should consider the approaches taken by the many others that have produced working URI solutions; and, if those approaches are discarded, then articulate the reasons for doing so. To ignore them out of hand is insufficiently diligent.

So let's go through your points:

- is too broad in scope;

An RFC author is allowed to choose whatever scope they want.

I did not say otherwise; whether the scope chosen is a good one or not is something else.

- has an uncertain API.

Frankly, 90% of the recent uncertainty has seemingly come from you trying to "rework" the RFC to your own taste.

First, the uncertainty I referred to was from the RFC itself, when it states that the API needs to become "mature enough" and "tested in practice" until it "settles." That tells me the authors aren't too sure of it.

Then, to be clear, those observations and suggestions were not based on my "own taste." They are based on the decisions of a dozen or so developers working in the URI space, research into which is summarized at <https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md>.

- admits to standards non-compliance; and,

Non-compliance with what?

I mentioned this in the earlier message, but to reiterate: the WHATWG-URL getters "don't (entirely) follow the “getter steps” that are defined by the specification, but the individual components are returned directly without any other changes that the “getter steps” would otherwise specify."

* * *

So, those are some of my concerns around the RFC. Take them or leave them, as you see fit. If the RFC passes, it won't be the worst thing that ever happened to PHP, and if it turns out that my concerns were unfounded, so much the better.

-- pmj