[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

On Sun, Jul 7, 2024, at 11:13, Máté Kocsis wrote:

Hi Ignace,

As far as I understand it, if this RFC were to pass as is it will model

PHP URLs to the WHATWG specification. While this specification is

getting a lot of traction lately I believe it will restrict URL usage in

PHP instead of making developer life easier. While PHP started as a

“web” language it is first and foremost a server side general purpose

language. The WHATWG spec, on the other hand, is created by browser

vendors and is geared toward browsers (client side), and because of

browser history it restricts by design a lot of what PHP developers can

currently do using parse_url(). In my view the Url class in

PHP should allow dealing with any IANA registered scheme, which is not

the case for the WHATWG specification.

Supporting IANA registered schemes is a valid request, and is definitely useful.

However, I think this feature is not strictly required to have in the current RFC.

Anyone who needs to support features that are not offered by the WHATWG

standard can still rely on parse_url(). And of course, we can (and should) add

support for other standards later. If we wanted to do all these in the same

RFC, then the scope of the RFC would become way too large IMO. That’s why I

opt for incremental improvements.

It’s also worth pointing out (as another reason not to do this) that IANA registrations may or may not be valid on the current network. For example, TOR, Handshake, IPFS, Freenet, etc. all have their own naming schemes and do not (usually) use IANA-registered schemes, and many people create sites that cater to those networks.

Besides, I fail to see why a WHATWG compliant parser wouldn’t be useful in PHP:

yes, PHP is server side, but it still interacts with browsers very heavily. Among other

use cases I cannot yet imagine, the major one is most likely validating user-supplied URLs

for opening in the browser. As far as I can see, there is currently no acceptably

reliable way to decide whether a URL can be opened in a browser or not.

Looking at the WHATWG spec, it appears that example%2Ecom will be parsed as a valid URL and transformed to example.com, while this doesn’t currently happen in parse_url():

https://3v4l.org/NtqQm

I don’t know if that is an issue, but it might be if you are expecting the string to remain URL-encoded.
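For anyone who wants to verify this behavior, Node’s built-in `URL` class implements the WHATWG standard; a quick sketch (not part of the proposed PHP API):

```javascript
// Node's built-in URL class follows the WHATWG URL standard.
// The WHATWG host parser percent-decodes the host before
// domain-to-ASCII processing, so %2E becomes ".".
const u = new URL('https://example%2Ecom/');

console.log(u.host); // "example.com"
```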

  • parse_url and parse_str predate RFC3986

  • URLSearchParams was ratified before PSR-7 BUT the first implementation

landed a year AFTER PSR-7 was released and already implemented.

Thank you for the historical context!

Based on your and others’ feedback, it has now become clear for me that parse_url()

is still useful and ext/url needs quite some additional capabilities until this function

really becomes superfluous. That’s why it now seems to me that the behavior of

parse_url() could be leveraged in ext/url so that it would work with a Url\Url class (e.g.

we had a PhpUrlParser class extending Url\UrlParser, or a Url\Url::fromPhpParser()

method, depending on which object model we choose. Of course the names are TBD).

For all these arguments I would keep the proposed Url free of all

these concerns and lean toward a nullable string for the query string

representation, and defer this debate to its own RFC regarding query

string parsing and handling in PHP.

My WIP implementation still uses nullable properties and return types. I only changed those

when I wrote the RFC. Since I see that PSR-7 compatibility is very low prio for everyone

involved in the discussion, then I think making these types nullable is fine. It wasn’t my

top priority either, but I had to start the object design somewhere, so I went with this.

The spec contains elements and their types. It would be good to adhere to the spec (it simplifies documentation):

  1. scheme may be null or empty string

  2. port may be null

  3. path is never null, but may be empty string

  4. query may be null

  5. fragment may be null

  6. user/password may be null (to differentiate between an empty password or no password)

  7. host may be null (for relative URLs)

Again, thank you for your constructive criticism.

Regards,

Máté

— Rob

On Sun, Jul 7, 2024, at 12:55, ignace nyamagana butera wrote:

Hi Máté,

Supporting IANA registered schemes is a valid request, and is

definitely useful. However, I think this feature is not strictly

required to have in the current RFC.

True. Having a WHATWG compliant parser in PHP source code is a big +1

from me; I have nothing against that inclusion.

Based on your and others’ feedback, it has now become clear for me

that parse_url() is still useful and ext/url needs quite some additional

capabilities until this function really becomes superfluous.

parse_url can only be deprecated when an RFC3986 compliant parser is

added to php-src, hence why I insist on having that parser present

too.

I will also add that everything up to now in PHP uses RFC3986 as the basis

for generating or representing URLs (cURL extension, streams, etc.).

Having the first and only OOP representation of a URL in the language

not following that same specification seems odd to me. It opens the door

to inconsistencies that will only be resolved once an equivalent RFC3986

URL object makes its way into the source code.

On the public API side I would recommend the following:

  • if you are to strictly follow the WHATWG specification, no URI

component can be null. They must all be strings. If we plan to

use the same object for an RFC3986 compliant parser, then all components

should be nullable except for the path component, which can never be null

as it is always present.

This isn’t true. In the language the spec is written in, any element can be null; the spec itself states which components may be null: URL Standard (whatwg.org)
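To illustrate the distinction (a sketch using Node’s WHATWG `URL`): the spec’s internal model allows components such as port, query, and fragment to be null, but the *API* exposed to scripts returns empty strings for absent components.

```javascript
// The WHATWG API (browsers, Node) returns empty strings for absent
// components, while the spec's internal model allows some of them
// (port, query, fragment, host) to be null.
const u = new URL('https://example.com');

console.log(u.port);     // "" (internally: null port)
console.log(u.search);   // "" (internally: null query)
console.log(u.hash);     // "" (internally: null fragment)
console.log(u.pathname); // "/" (path is never null)
```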

— Rob

On Sun, Jul 7, 2024, at 12:40, Rob Landers wrote:


Here’s a list of examples worth adding to the RFC:

//example.com?

ftp://user@example.com/path/to/ffile

https://user:@example.com

https://user:pass@example%2Ecom/?something=other&bool#heading

etc.
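For reference, here is how a WHATWG parser (Node’s built-in `URL`; `other.test` below is a hypothetical base) treats a couple of these edge cases:

```javascript
// A scheme-relative URL cannot be parsed on its own...
let relativeFails = false;
try {
  new URL('//example.com?');
} catch (e) {
  relativeFails = true; // TypeError: Invalid URL
}

// ...but it resolves fine against a base URL:
const resolved = new URL('//example.com', 'https://other.test/').href;
console.log(resolved); // "https://example.com/"

// An empty password stays distinct from the username:
const u = new URL('https://user:@example.com');
console.log(u.username); // "user"
console.log(u.password); // ""
```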

— Rob

Hi Máté,

Fantastic RFC :)

On Sun, Jul 7, 2024 at 11:17, Máté Kocsis <kocsismate90@gmail.com> wrote:

Hi Ignace,

As far as I understand it, if this RFC were to pass as is it will model
PHP URLs to the WHATWG specification. While this specification is
getting a lot of traction lately I believe it will restrict URL usage in
PHP instead of making developer life easier. While PHP started as a
“web” language it is first and foremost a server side general purpose
language. The WHATWG spec, on the other hand, is created by browser
vendors and is geared toward browsers (client side), and because of
browser history it restricts by design a lot of what PHP developers can
currently do using parse_url(). In my view the Url class in
PHP should allow dealing with any IANA registered scheme, which is not
the case for the WHATWG specification.

Supporting IANA registered schemes is a valid request, and is definitely useful.
However, I think this feature is not strictly required to have in the current RFC.
Anyone who needs to support features that are not offered by the WHATWG
standard can still rely on parse_url().

If I may, parse_url is showing its age and issues like https://github.com/php/php-src/issues/12703 make it unreliable. We need an escape plan from it.

FYI, we’re discussing whether a Uri component should make it into Symfony precisely to work around parse_url’s issues in https://github.com/php/php-src/issues/12703
Your RFC would be the perfect answer to this discussion, but IRI would need to be part of it.

I agree with everything Ignace said. Supporting RFC3986 from day-1 would be absolutely great!

Note that we use parse_url for http-URLs, but also to parse DSNs like redis://localhost and the likes.


On Fri, Jun 28, 2024 at 3:38 PM Máté Kocsis <kocsismate90@gmail.com> wrote:

Hi Everyone,

I’ve been working on a new RFC for a while now, and time has come to present it to a wider audience.

Last year, I learnt that PHP doesn’t have built-in support for parsing URLs according to any well established standards (RFC 1738 or the WHATWG URL living standard), since the parse_url() function is optimized for performance instead of correctness.

In order to improve compatibility with external tools consuming URLs (like browsers), my new RFC would add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.

You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api

Regards,
Máté

I was exploring wrapping ada_url for PHP (https://github.com/lnear-dev/ada-url). It works, but it’s a bit slower, likely due to the implementation of the objects. I was planning to embed the zvals directly in the object, similar to PhpToken, but I haven’t had the chance and don’t really need it anymore. Shouldn’t be too much work to clean it up though

Hey Ignace, Nicolas,

Based on your request for adding support for RFC 3986 spec compatible parsing,
I evaluated another library (https://github.com/uriparser/uriparser/) in recent days
in order to add support for the requested functionality. As far as I can tell, the results

were very promising, so I’m OK to include this in my proposal (I haven’t pushed my
changes yet and haven’t updated the RFC yet).

Regarding the reference resolution (https://uriparser.github.io/doc/api/latest/#resolution)
feature, which has also already been asked for, I’m genuinely wondering what the use case is.
But in any case, I’m fine with incorporating this as well into the RFC, since apparently
both Lexbor and uriparser support this (naturally).

What I became puzzled about is the correct object structure and naming. Now that uriparser
which can deal with URIs came into the picture, while Lexbor can parse URLs, I don’t
know if it’s a good idea to have a dedicated URI and a URL class extending the former one…
If it is, then in my opinion, the logical behavior would be that Lexbor always instantiates URL
classes, while uriparser would have to decide if the passed-in URI is actually a URL, and
choose the instantiated class based on this factor… But in this case the differences between
the RFC 3986 and WHATWG specifications couldn’t be spelled out, since URL objects
could hold URLs parsed based on both specs (and therefore having a unified interface is required).

Or rather, should we have separate URI and WhatwgUrl classes, so that the former would

always be created by uriparser, while the latter by Lexbor? This way we could have a dedicated
object interface for both standards (e.g. the RFC 3986 related one could have a getUserInfo() method,
while the WHATWG related one could have both getUser() and getPassword() methods). But then
the question is how interchangeable these classes should be? I.e. should we be able to convert them
back and forth, or should there be an interface that is implemented by the two classes?

I’d appreciate any suggestions regarding these questions.

P.S. due to its poor reception, I got rid of the UrlParser class as well as the UrlComponent enum from my
implementation in the meantime.

Regards,
Máté

On Mon, Jul 15, 2024, at 9:20 AM, Máté Kocsis wrote:


I apologize if I missed this up-thread somewhere, but what precisely are the differences between URI and URL? My understanding was that URL is a subset of URI (all URLs are URIs, but not all URIs are URLs). You're saying they're slightly disjoint sets? Can you give some concrete examples of where the parsing rules would produce different results? That may give us a better sense of what the logic should be.

--Larry Garfield

On 15/07/2024 11:20, Máté Kocsis wrote:


Hi Máté,

> As far as I can tell, the results were very promising, so I'm ok to include this into my proposal (I haven't pushed my changes yet and haven't updated the RFC yet).

This is great news. If it is indeed possible to ship both specifications at the same time, that would be really great.

> Regarding the reference resolution (https://uriparser.github.io/doc/api/latest/#resolution)
> feature which has also already been asked for, I'm genuinely wondering what the use-case is?

Resolution is common when using an HTTP client: you define a base URI and then construct
subsequent URIs relative to it using resolution.
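This use case can be sketched with Node’s WHATWG `URL`, whose resolution behavior closely follows RFC 3986 §5 for these cases (`api.example.com` is a hypothetical host):

```javascript
// Reference resolution against a base URL, as an HTTP client would do:
const base = 'https://api.example.com/v1/users/';

console.log(new URL('42', base).href);       // "https://api.example.com/v1/users/42"
console.log(new URL('../teams', base).href); // "https://api.example.com/v1/teams"
console.log(new URL('/health', base).href);  // "https://api.example.com/health"
```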

> What I became puzzled about is the correct object structure and naming. Now that uriparser
> which can deal with URIs came into the picture, while Lexbor can parse URLs, I don't
> know if it's a good idea to have a dedicated URI and a URL class extending the former one...

Both specifications parse into, and can be represented by, a URL value object. The main differences between the two
implementations are around normalization and encoding.

RFC3986 only allows non-destructive normalization, which is not true of the WHATWG spec:

Here's a simple example to illustrate the differences:

`HttPs://0300.0250.0000.0001/path?query=foo%20bar`

- with RFC3986 you will end up with `https://0300.0250.0000.0001/path?query=foo%20bar`
- with WHATWG you will end up with `https://192.168.0.1/path?query=foo+bar`

In the case of WHATWG, the host is changed and the query string follows a distinctive encoding spec.
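A quick check of this example against Node’s WHATWG implementation (a sketch; note the parser itself keeps `%20` in the query, and the `+` form appears when the query is round-tripped through URLSearchParams’ form-urlencoded serialization):

```javascript
const u = new URL('HttPs://0300.0250.0000.0001/path?query=foo%20bar');

console.log(u.protocol); // "https:" (scheme lowercased)
console.log(u.host);     // "192.168.0.1" (octal IPv4 components normalized)
console.log(u.search);   // "?query=foo%20bar" (kept as-is by the parser)

// application/x-www-form-urlencoded serialization uses "+":
console.log(u.searchParams.toString()); // "query=foo+bar"
```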

From my POV you have two choices: either you use one URL object for both specifications, with distinct named constructors fromRFC3986 and fromWhatwg, or you have one interface and two distinct implementations.
I do not think that one can be extended to create the other; at least that's my POV.

Hope this helps you in your implementation.

Best regards,
Ignace

On Mon, Jul 8, 2024 at 11:24 AM Lanre <lnearwaju@gmail.com> wrote:


I’ve updated the implementation, and with Ada 2.9.0, the performance is now closer to parse_url for short URLs and even outperforms it for longer URLs. You can see the benchmarks in the “Run benchmark script” section of this GitHub Actions run.

cheers,
Lanre

On 28/06/2024 22:06, Máté Kocsis wrote:


Hi Máté

Something that I thought about lately is how the existing URL parser in PHP is used in various different places.
So for example, in the http fopen wrapper or in the filter extension we rely on the built-in URL parser.
I think it would be beneficial if a URL parser was "pluggable" and the url extension could be used instead of the current one for those usages (opt-in).

Kind regards
Niels


Hi Niels,
As mentioned before, I believe the “pluggable” system can only be applied once an RFC3986 URL object is available; using the WHATWG URL
would constitute a major BC break. I would even go a step further and state that even by using the RFC3986 URL object you would still face some issues, for instance
with regard to file scheme based URLs. Those are not parsed the same way with the parse_url() function and RFC3986 rules.
Maybe that change may land in PHP 9, or the behaviour may be deprecated and removed in PHP 10, whenever that one happens.

On Sun, Jul 21, 2024 at 1:22 PM Niels Dossche <dossche.niels@gmail.com> wrote:


Hi Ignace, Niels,

Sorry for being silent for so long, I was working hard on the implementation besides some summer activities :) I can say that I had
really good progress in the last month, and now I think (hope) that I managed to address most of the concerns/suggestions people mentioned
in this thread. To summarize the most important changes:

  • The uriparser library is now used for parsing URIs based on RFC 3986.

  • I renamed the extension from “url” to “uri” in order to make the name more generic and to express the new use-case.

  • There is no Url\UrlParser class anymore. The Uri\Uri class now includes the relevant factory methods.

  • Uri\Uri is now an abstract class which is implemented by 2 concrete classes: Uri\Rfc3986Uri and Uri\WhatwgUri.

  • WhatWG URL parsing now returns the exact error code according to the specification (although a reference parameter is used for now - but this is TBD)

  • As suggested by Niels, it’s now possible to plug a URI parsing implementation into PHP. A new uri.default_handler INI option is also added.

Currently, integration is only implemented for FILTER_VALIDATE_URL though. The approach also makes it possible to register additional 3rd party
libraries for parsing URIs (like ADA URL).

  • It looks like performance significantly improved according to the rough benchmarks performed in CI.

Please re-read the RFC as it shares a bit more details than my quick summary above: https://wiki.php.net/rfc/url_parsing_api

There are some questions I still didn’t manage to find an answer for though. Most importantly, the URI parser libraries used don’t support modification
of the URI. That’s why I had to get rid of the “wither” methods for now which were originally part of the API. I think it’s unfortunate, and I’ll try to do my
best to reclaim them.

Additionally, due to technical reasons, extending the Uri\Uri class in userland is only possible if all the methods are overridden by the child. It’s because
I had to use “computed” properties in the implementation (roughly, they are stored in an internal C struct unlike regular properties). That’s why it may be
better if userland code could use (and possibly implement) a Uri\Uri interface instead.

In one of my previous emails, I raised concerns about whether RFC 3986 and the WhatWg spec can really share the same interface (they do in my current implementation,
despite being different classes). I still have this concern, because WhatWg specifies the “user” and “password” URL components, while RFC 3986
only specifies the notion of “userinfo” (which is usually just user:password, but not necessarily, as far as I understood). The RFC’s implementation
of the RFC 3986 parser currently splits the “userinfo” component at the “:” character, but doing so doesn’t seem very spec compliant.
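To illustrate the concern: “userinfo” is an opaque component in RFC 3986, and a password may itself contain “:”, so splitting at the first colon is an interpretation the spec never mandates. A userland sketch of that split:

```php
<?php
// Splitting RFC 3986 "userinfo" at the first ":" (roughly what the
// proposed parser does). "user:pa:ss" is a valid userinfo value.
$userinfo = 'user:pa:ss';
[$user, $password] = explode(':', $userinfo, 2) + [1 => null];

var_dump($user);     // string(4) "user"
var_dump($password); // string(5) "pa:ss"
```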

Arnaud suggested that it would be better if the query parameters could be retrieved both escaped and unescaped after parsing. I haven’t had time to investigate
the possibilities, but my gut feeling is that this can only be achieved with some custom code. Arnaud also had questions regarding canonicalization. Currently,
it’s not performed when calling the __toString() method, because only the uriparser library supports this feature, and I didn’t want to make the two implementations diverge.
I’m not even sure that it’s a good idea to always do it, so I’m thinking about the possibility of selectively enabling this feature (i.e. adding a separate “toCanonizedString”
method).

Regards,
Máté

Hi Everyone,

I’ve been working on a new RFC for a while now, and time has come to
present it to a wider audience.

Last year, I learnt that PHP doesn’t have built-in support for parsing URLs
according to any well-established standard (RFC 1738 or the WHATWG URL
living standard), since the parse_url() function is optimized for
performance instead of correctness.
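For illustration, parse_url() decomposes a URL into raw components but performs none of the validation or normalization that RFC 3986 or the WHATWG standard mandates:

```php
<?php
// parse_url() splits a URL into components as-is: the scheme and
// host are not lowercased, and nothing is validated against a spec.
$parts = parse_url('HTTPS://user:pass@Example.COM:8080/path?x=1#top');

var_dump($parts['scheme']); // string(5) "HTTPS" (not normalized)
var_dump($parts['host']);   // string(11) "Example.COM" (not normalized)
var_dump($parts['port']);   // int(8080)
```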

In order to improve compatibility with external tools consuming URLs (like
browsers), my new RFC would add a WHATWG compliant URL parser functionality
to the standard library. The API itself is not final by any means; the RFC
only reflects how I first imagined it.

You can find the RFC at the following link:
https://wiki.php.net/rfc/url_parsing_api

Regards,
Máté

Máté, thanks for putting this together.

Whenever I need to work with URLs there are a few things missing that I would love to see incorporated into any change in PHP that brings us a spec-compliant parsing class.

First of all, I typically care most about WhatWG URLs because the PHP code I’m working with is making decisions about HTML that a browser will interpret. Paramount above all other concerns is that code on the server can understand content in the same way that browsers will; otherwise we invite security issues. People may have valid critiques of the WhatWG specification, but it’s also the most relevant specification for users of much or most of the PHP code we write, and it’s valuable because it allows us to talk about URLs in the same way a browser would.

I’m worried about the side-effects that having a global uri.default_handler could have with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check ini_get( ‘uri.default_handler’ ) before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.

One thing I feel is missing is a method to parse a (partial) URL relative to another

Being able to parse a relative URL and know whether a URL is relative or absolute would help WordPress, which often makes decisions based on this property (for instance, when reading the href attribute of a link). I know these aren’t spec-compliant URLs, but they still represent valid values for URL fields in HTML, and knowing whether they are relative currently requires duplicating parsing-specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP’s new URL parser decodes document.querySelector( ‘a’ ).getAttribute( ‘href’ ), which should be the same as document.querySelector( ‘a’ ).href, and indicates whether it found a full URL or only a portion of one.

  • $url->is_relative or $url->is_absolute
  • $url->specificity = URL::Relative | URL::Absolute
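Lacking such a flag, userland code has to approximate the check by hand. A minimal sketch (the function name is hypothetical; per RFC 3986, a reference is absolute exactly when it starts with a scheme):

```php
<?php
// Hypothetical stand-in for the requested $url->is_relative flag:
// a reference is absolute when it begins with a scheme, i.e.
// ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":".
function isRelativeUrl(string $url): bool
{
    return preg_match('#^[A-Za-z][A-Za-z0-9+.\-]*:#', $url) !== 1;
}

var_dump(isRelativeUrl('/wp-content/a.png'));       // true
var_dump(isRelativeUrl('//cdn.example.com/a.png')); // true (protocol-relative)
var_dump(isRelativeUrl('https://example.com/'));    // false
var_dump(isRelativeUrl('mailto:hi@example.com'));   // false
```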

the URI parser libraries used don’t support modification of the URI

Having methods to add query arguments, change the path, etc… would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).
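A minimal userland sketch of that use-case (the helper name is hypothetical; a native wither would return a new URI object and handle existing keys and encoding per the spec):

```php
<?php
// Hypothetical helper: append a query argument to a URL string.
function withQueryArg(string $url, string $key, string $value): string
{
    $separator = str_contains($url, '?') ? '&' : '?';
    return $url . $separator . rawurlencode($key) . '=' . rawurlencode($value);
}

// Add a query argument only when the path ends in ".png".
$url = 'https://example.com/logo.png';
if (str_ends_with((string) parse_url($url, PHP_URL_PATH), '.png')) {
    $url = withQueryArg($url, 'width', '100');
}
var_dump($url); // string(38) "https://example.com/logo.png?width=100"
```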

Was it intended to add this to the RFC before it’s finalized?

I would not make Url final. “OMG but then people can extend it!” Exactly.

My counter-point to this argument is that I see security exploits appear everywhere that functions which implement specifications are pluggable and extendable. It’s easy to see the need to create a class that limits possible URLs, but that also doesn’t require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.

A problem that can arise with adding additional rules onto a specification like this is that the subclass gets used in more places than it should and then somewhere some PHP code allows a malicious URL because it failed to parse and then the inspection rules weren’t applied.


Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse while the other is a “plain string” in PHP that’s easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it’s multiple segments that each have their own decoding rules.
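The two representations can be seen with ext/intl’s IDNA functions (assuming the intl extension is available): the ASCII form is what machines and DNS consume, while the Unicode form is what should be shown to humans:

```php
<?php
// Machine-facing (punycode) vs human-facing (Unicode) host forms,
// converted with ext/intl's IDNA functions (requires ext/intl).
if (function_exists('idn_to_ascii')) {
    $display = 'bücher.example';
    $machine = idn_to_ascii($display, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    var_dump($machine); // "xn--bcher-kva.example"
    var_dump(idn_to_utf8($machine, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46));
    // "bücher.example"
}
```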

Having this in the RFC would give everyone the tools they need to effectively and safely set links within an HTML document.


All the best,
Dennis Snell

On Mon, Aug 26, 2024, at 2:40 AM, Máté Kocsis wrote:

Hi Ignace, Niels,

Sorry for being silent for so long; I was working hard on the
implementation besides some summer activities :-) I can say that I made
really good progress in the last month, and now I think (hope) that I
managed to address most of the concerns/suggestions people mentioned
in this thread. To summarize the most important changes:

I'm not fluent enough in the different parsing styles to comment on the difference there.

I do have concerns about the class design, though. Given the improvements to the language, the accessor methods offer zero benefit at all. Public-read properties (readonly or otherwise) would be faster and offer no less of a guarantee. If you want to allow someone to extend the class and provide some custom logic, use aviz instead of readonly and extenders can use hooks instead of the methods. The getters don't offer any value anymore.

It took me a while to realize that, I think, the fromWhatWg() method is using an in/out parameter for error handling. That is an insta-no on my part. in/out reference parameters make sense in C, maybe C++, and basically nowhere else. I view them as a code smell everywhere they're used in PHP. Better alternatives include exceptions or union returns.

It looks like you've removed the with*() methods. Why? That means it cannot be used as a builder mechanism, which is plenty valuable. (Though could be an issue with query as a string vs array.)

The WhatWgError looks to me like it's begging to be an Enum.

I am confused by the new ini value. It's for use in cases where you're NOT parsing the URL yourself, but relying on some other extension that does URL parsing internally as a side effect?

As usual, I am not a fan of an ini setting, but I cannot think of a different alternative off hand.

--Larry Garfield

Hi Dennis,

Even though I didn’t answer for a long time, I was improving my RFC implementation in the meantime, as well as evaluating your suggestions.

I’m worried about the side-effects that having a global uri.default_handler could have with code running differently for no apparent reason, or differently based on what is calling it. If someone is writing code for a controlled system I could see this being valuable, but if someone is writing a framework like WordPress and has no control over the environments in which code runs, it seems dangerous to hope that every plugin and every host runs compatible system configurations. Nobody is going to check ini_get( ‘uri.default_handler’ ) before every line that parses URLs. Beyond this, even just allowing a pluggable parser invites broken deployments because PHP code that is reading from a browser or sending output to one needs to speak the language the browser is speaking, not some arbitrary language that’s similar to it.

You convinced me with your arguments regarding the issues a global uri.default_handler INI config can cause, especially after having read a blog post by Daniel Stenberg about the topic (https://daniel.haxx.se/blog/2022/01/10/dont-mix-url-parsers/). That’s why I removed this from the RFC in favor of relying on configuring the parser at the individual feature level. However, I don’t agree with removing a pluggable parser because of the following reasons:

  • the current parse_url()-based parser is already doomed: it isn’t compliant with any spec, so it already doesn’t speak the language the browser speaks
  • even though the majority does, not everyone builds a browser-facing application with PHP, especially because URIs are not necessarily accessible on the web
  • in addition, there are tools which aren’t compliant with the WhatWg spec, but with some other one. Most prominently, cURL is mostly RFC 3986 compliant with some additional flavour of WhatWg, according to https://everything.curl.dev/cmdline/urls/browsers.html

That’s why I intend to keep support for pluggability.

Being able to parse a relative URL and know if a URL is relative or absolute would help WordPress, which often makes decisions differently based on this property (for instance, when reading an href property of a link). I know these aren’t spec-compliant URLs, but they still represent valid values for URL fields in HTML and knowing if they are relative or not requires some amount of parsing specific details everywhere, vs. in a class that already parses URLs. Effectively, this would imply that PHP’s new URL parser decodes document.querySelector( ‘a’ ).getAttribute( ‘href’ ), which should be the same as document.querySelector( ‘a’ ).href, and indicates whether it found a full URL or only a portion of one.

  • $url->is_relative or $url->is_absolute
  • $url->specificity = URL::Relative | URL::Absolute

The Uri\WhatWgUri::parse() method accepts a (relative) URI parameter when the 2nd (base URI) parameter is provided. So essentially you need to use this variant of the parse() method if you want to parse a relative WhatWg compliant URL, and then WhatWgUri should let you know whether the originally passed-in URI was relative, did I get you right? This feature is certainly possible with RFC 3986 URIs (even without the base parameter), but WhatWg requires the above-mentioned workaround for parsing, plus I have to look into how this can be implemented…

Having methods to add query arguments, change the path, etc… would be a great way to simplify user-space code working with URLs. For instance, read a URL and then add a query argument if some condition within the URL warrants it (for example, the path ends in .png).

I managed to retain support for the “wither” methods that were originally part of the proposal. This required custom code for the uriparser library, while the maintainer of Lexbor was kind enough to add native support for modification after I submitted a feature request. However, convenience methods for manipulating query parameters are still not part of the RFC, because they would increase its scope even more, and due to other issues highlighted by Ignace in his prior email: https://externals.io/message/123997#124077. As I really want such a feature, I’d be eager to create a follow-up RFC dedicated to handling query strings.

My counter-point to this argument is that I see security exploits appear everywhere that functions which implement specifications are pluggable and extendable. It’s easy to see the need to create a class that limits possible URLs, but that also doesn’t require extending a class. A class can wrap a URL parser just as it could extend one. Magic methods would make it even easier.

Right now, it’s only possible to plug internal URI implementations into PHP; userland classes cannot be used, so this probably reduces the issue. However, I recently bumped into a technical issue with URIs not being final, which I am currently trying to assess how to solve. More information is available in one of my comments on my PR: https://github.com/php/php-src/pull/14461/commits/8e21e6760056fc24954ec36c06124aa2f331afa8#r1847316607. As far as I see the situation currently, it would probably be better to make these classes final so that similar unforeseen issues and inconsistencies cannot happen again (we can unfinalize them later anyway).

Finally, I frequently find the need to be able to consider a URL in both the display context and the serialization context. With Ada we have normalize_url(), parse_search_params(), and the IDNA functions to convert between the two representations. In order to keep strong boundaries between security domains, it would be nice if PHP could expose the two variations: one is an encoded form of a URL that machines can easily parse while the other is a “plain string” in PHP that’s easier for humans to parse but which might not even be a valid URL. Part of the reason for this need is that I often see user-space code treating an entire URL as a single text span that requires one set of rules for full decoding; it’s multiple segments that each have their own decoding rules.

Even though I didn’t entirely implement this suggestion, I added normalization support:

  • the normalize() method can be used to create a new URI instance whose components are normalized based on the current object
  • the toNormalizedString() method can be used when only the normalized string representation is needed
  • the newly added equalsTo() method also makes use of normalization to better identify equal URIs

For more information, please refer to the relevant section of the RFC: https://wiki.php.net/rfc/url_parsing_api#api_design. The forDisplay() method also seems useful at first glance, but since it may be a controversial optional feature, I’d defer it for later…
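As a rough userland illustration of what such normalization involves (a deliberately minimal sketch; the RFC’s normalize() follows the full RFC 3986 / WHATWG rules, which also cover percent-encoding case, dot-segment removal, and more):

```php
<?php
// Minimal normalization sketch: lowercase the scheme and host, and
// omit the scheme's default port. The real algorithms do much more.
function normalizeUrlSketch(string $url): string
{
    $p = parse_url($url);
    $scheme = strtolower($p['scheme'] ?? '');
    $host = strtolower($p['host'] ?? '');
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $p['port'] ?? null;
    if ($port === ($defaultPorts[$scheme] ?? null)) {
        $port = null; // default ports are omitted in the normalized form
    }
    return $scheme . '://' . $host
        . ($port !== null ? ':' . $port : '')
        . ($p['path'] ?? '/');
}

// Two spellings of the same resource compare equal once normalized,
// which is the idea behind the proposed equalsTo() method.
var_dump(normalizeUrlSketch('HTTPS://Example.COM:443/a')); // "https://example.com/a"
var_dump(normalizeUrlSketch('https://example.com/a'));     // "https://example.com/a"
```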

Regards,
Máté

Hi Larry,

I do have concerns about the class design, though. Given the improvements to the language, the accessor methods offer zero benefit at all. Public-read properties (readonly or otherwise) would be faster and offer no less of a guarantee. If you want to allow someone to extend the class and provide some custom logic, use aviz instead of readonly and extenders can use hooks instead of the methods. The getters don’t offer any value anymore.

Yes, I knew you wouldn’t like my traditional style of private properties + getters… :-) So let me try to answer your suggestions: first of all, I believe the readonly class modifier serves its purpose, and I definitely want to keep it because it ensures that all URI instances are immutable. That’s why I cannot use property hooks, since they are incompatible with readonly. So only the possibility of using asymmetric visibility remains; however, since extenders still cannot hook such properties, this idea should also be rejected. Otherwise, I would consider using readonly with public read, although I believe traditional methods are better suited for overriding (easier syntax, decades of experience) than property hooks (my 2 cents).

It took me a while to realize that, I think, the fromWhatWg() method is using an in/out parameter for error handling. That is an insta-no on my part. in/out reference parameters make sense in C, maybe C++, and basically nowhere else. I view them as a code smell everywhere they’re used in PHP. Better alternatives include exceptions or union returns.

Yes, originally the RFC used a reference parameter to return the error during parsing. I knew it was controversial, but it was a choice consistent with other internal functions/methods.
After your feedback, I changed this behavior to a union return type:

public static function parse(string $uri, ?string $baseUrl = null): static|array {}

So that in case of failure, an array of Uri\WhatWgError objects is returned. This practice is not really idiomatic PHP, so personally I’m not sure I like it, but neither did I particularly like passing a parameter by reference…
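A self-contained stand-in (all names are hypothetical, since ext/uri is not released) shows how calling code would consume such a static|array return shape:

```php
<?php
// Hypothetical stand-ins for the proposed error object and the
// union return: the value on success, a list of errors on failure.
final class FakeParseError
{
    public function __construct(
        public readonly string $code,
        public readonly int $position,
    ) {}
}

function parseDemo(string $uri): string|array
{
    if (!str_contains($uri, ':')) {
        return [new FakeParseError('missing-scheme', 0)]; // failure
    }
    return $uri; // success (a real parser would return a URI object)
}

$result = parseDemo('example.com');
if (is_array($result)) {
    var_dump($result[0]->code); // string(14) "missing-scheme"
}
```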

It looks like you’ve removed the with*() methods. Why? That means it cannot be used as a builder mechanism, which is plenty valuable. (Though could be an issue with query as a string vs array.)

As I answered to Dennis, they were reclaimed in the meantime.

The WhatWgError looks to me like it’s begging to be an Enum.

It’s probably not that visible at first glance, but Uri\WhatWgError has 2 properties: an error code and a position, so it’s not feasible to make it an enum. I could, however, create a separate Uri\WhatWgErrorCode enum containing all the error codes, so that the class constants could be removed from Uri\WhatWgError, but I felt that would be overengineering, so I decided against it.

Regards,
Máté

Hi

Am 2024-08-26 09:40, schrieb Máté Kocsis:

Please re-read the RFC as it shares a bit more details than my quick
summary above: https://wiki.php.net/rfc/url_parsing_api

I have now finally found the time to go through the discussion thread and make a first pass through the RFC and have the following remarks.

1.

The RFC is not listed in the overview page: https://wiki.php.net/rfc

2.

I agree with Dennis' remark that the `Rfc3986Uri` and `WhatWgUri` classes must be final. The RFC makes the argument that:

Having separate classes for the two standards makes it possible to indicate explicit intent at the type level that one specific standard is required.

Developers extending the classes could accidentally violate the respective standard, which nullifies the benefit of making invalid states unrepresentable at the type-level.

This also means that the return type of the “withers” should be `self` instead of `static`, which also means that the “withers” in the interface must be `self`. Perhaps this means that they should not exist on the interface at all. `DateTimeInterface` only provides the getters, likely for a similar reason.

3.

I believe the `UriException` class as the base exception should not be `abstract`. There is no real benefit to it, especially since it doesn't specify any additional abstract methods.

See also the PR introducing the Exception hierarchy for ext/random for some opinions / arguments regarding the Exception class design: https://github.com/php/php-src/pull/9220

4.

I'm not sure I like the `Interface` suffix on the `UriInterface` interface. Just `Uri\Uri` would be equally expressive.

5.

I am not sure about the `*User()` and `*Password()` methods existing on the interface. As the RFC acknowledges, RFC 3986 only specifies a “userinfo” segment. Should the `*User()` and `*Password()` methods perhaps be specific to the `WhatWgUri` class?

-------

I'll give the RFC another read later and expect some additional commentary when I think about this more.

Best regards
Tim Düsterhus

Hi

Am 2024-11-24 21:40, schrieb Máté Kocsis:

It took me a while to realize that, I think, the fromWhatWg() method is
using an in/out parameter for error handling. That is an insta-no on my
part. in/out reference parameters make sense in C, maybe C++, and
basically nowhere else. I view them as a code smell everywhere they're
used in PHP. Better alternatives include exceptions or union returns.

Yes, originally the RFC used a reference parameter to return the error
during parsing. I knew it was controversial, but that's what was a
consistent choice with other internal functions/methods.
After your feedback, I changed this behavior to an union type return type:

public static function parse(string $uri, ?string $baseUrl = null):
static|array {}

So that in case of failure, an array of Uri\WhatWgError objects are
returned. This practice is not really idiomatic with PHP, so personally I'm
not sure I like it, but neither did I particularly like passing a parameter
by reference...

I disagree with this change and believe that, with the current capabilities of PHP, the out-parameter is the correct API design choice, because then the “failure” case would be returning a falsy value, which IMO is pretty idiomatic PHP:

     if (($uri = WhatWgUri::parse($someUri, errors: $errors)) !== null) {
         printf("Your URI '%s' is valid. Here it is: %s\n", $someUri, $uri);
     } else {
         printf("Your URI '%s' is invalid, there were %d errors.\n", $someUri, count($errors));
     }

It would also unify the API between Rfc3986Uri and WhatWgUri.

Best regards
Tim Düsterhus

Hi Tim,

Thanks for your feedback!

The RFC is not listed in the overview page: https://wiki.php.net/rfc

Uh, indeed! I’ve just fixed it.

I agree with Dennis’ remark that the Rfc3986Uri and WhatWgUri
classes must be final. The RFC makes the argument that:

Having separate classes for the two standards makes it possible to
indicate explicit intent at the type level that one specific standard
is required.

Developers extending the classes could accidentally violate the
respective standard, which nullifies the benefit of making invalid
states unrepresentable at the type-level.

On the one hand, I also have some concerns about whether to make these classes final,
as you probably saw in my last email (the concern came up with a question about an implementation
detail: https://github.com/php/php-src/pull/14461#discussion_r1847316607). On the other hand,
if someone overrides a URI implementation, then I assume there’s definitely a purpose for doing
so (i.e. the child class has additional capabilities, or it can handle additional protocols). If developers cannot
achieve this via inheritance, then they will do so otherwise (by using composition, putting the custom logic
in a helper class, etc.). It’s just not realistic to prevent logical bugs by making classes final.

I would rather ask whether it’s possible to make the two built-in URI implementations, which have
quite a lot of special internal behavior, behave consistently with userland classes, even when they are overridden.
For now, the answer seems to be yes (especially after hearing Niels’ solution in the GitHub thread linked above),
but of course new issues may arise later which we don’t know about yet. And of course, it’s much easier to make
a class final first and relax the inheritance rules later than the other way around… So these are the only reasons
why I’d make the classes final; otherwise it would be useful to be able to extend them.

This also means that the return type of the “withers” should be self
instead of static, which also means that the “withers” in the
interface must be self. Perhaps this means that they should not exist
on the interface at all. DateTimeInterface only provides the getters,
likely for a similar reason.

Using the self return type over static would be counterproductive in my opinion:
mostly because static is the correct type semantically, and it can be useful for
forward compatibility later if we ever want to remove the final modifier.

Regarding the analogy with DateTimeInterface, I think this one is wrong: the ext/uri API is
completely immutable, while ext/date has the mutable DateTime implementation,
so it’s not possible to include setters in the interface, otherwise one couldn’t know
what to expect after modification.

I believe the UriException class as the base exception should not be
abstract. There is no real benefit to it, especially since it doesn’t
specify any additional abstract methods.

I have no hard feelings regarding this. If I make it a concrete class, then implementations
will likely start to throw it instead of more specific subclasses. That’s
probably not an issue; people are not usually interested in the exact reason for an exception.
Since ext/date also recently added a generic parent exception (DateError) which wasn’t abstract,
I’m fine with doing the same in ext/uri.

I’m not sure I like the Interface suffix on the UriInterface
interface. Just Uri\Uri would be equally expressive.

Yes, I was expecting this debate :-) To be honest, I never liked interfaces without an “Interface”
suffix, and my dislike didn’t go away when I had to use such an interface somewhere, because it
was difficult for me to find out what the symbol I was typing actually referred to. But apart from my personal
experiences, I prefer to stay with “UriInterface” because the two most well-known internal PHP interfaces
also have the same suffix (DateTimeInterface, SessionHandlerInterface), and this name definitely conveys that
people should not try to instantiate it.

I am not sure about the *User() and *Password() methods existing on
the interface. As the RFC acknowledges, RFC 3986 only specifies a
“userinfo” segment. Should the *User() and *Password() methods
perhaps be specific to the WhatWgUri class?

Really good question, and I hesitated a lot about the same (even in some of my messages to the mailing list).
In fact, RFC 3986 has some notion of user/password, because the specification mentions the “user:password”
format as deprecated [in favor of passing authentication information in other places]. So I think the *User() and
*Password() methods are legitimately part of the interface. And it’s not even without precedent to have them in
an interface: PSR-7 made use of the “user” and “password” notions in the UriInterface::withUserInfo() method,
which accepts a $user and a $password parameter. I know people on this list generally don’t like PSR-7,
but it would be useful to know why PHP FIG chose to use these two parameters.

Due to the reasons above, the question for me is really whether we want to add the *UserInfo() methods to the
interface or at least to Uri\Rfc3986Uri. Since WhatWg doesn’t even mention user info (apart from the “userinfo
percent-encode set”, which refers to something else), I’d prefer not to add the methods in question to Uri\UriInterface.
If people insist on it, then I’m fine with adding the methods to Uri\Rfc3986Uri though.

I disagree with this change and believe that with the current
capabilities of PHP the out-parameter is the correct API design choice,
because then the “failure” case would be returning a falsy value, which
IMO is pretty idiomatic PHP:

Yes, I can live with any of the solutions; I’m just not sure which is less bad. :-) If only we had real out parameters… But wishful
thinking aside, I am fine with whatever the majority of people prefer. Probably being able to unify the API of the two
implementations is a good argument, which no one has raised so far, for using pass-by-reference…

Regards,
Máté

On 05.12.2024 at 22:49, Máté Kocsis wrote:

I'm not sure I like the `Interface` suffix on the `UriInterface`
interface. Just `Uri\Uri` would be equally expressive.

Yes, I was expecting this debate :-) To be honest, I never liked interfaces
without an "Interface"
suffix, and my dislike didn't go away when I had to use such an interface
somewhere, because it
was difficult for me to find out what the symbol I was typing actually
referred to.

By the same argument, you could come up with code like

<?php
class User {
    const defaultGroupNameConstant = "users";
    private string $nameVariable;
    public function getNameMethod() {…}
    …
}
?>

But apart from my personal
experiences, I prefer to stay with "UriInterface" because the 2 most well
known internal PHP interfaces
also have the same suffix (DateTimeInterface, SessionHandlerInterface), and this
name definitely conveys that
people should not try to instantiate it.

DateTimeInterface has been introduced after there had already been
DateTime. Otherwise, we would likely have DateTime, DateTimeMutable and
DateTimeImmutable. (Or only DateTime as immutable class.)

SessionHandler/SessionHandlerInterface have been bad naming choices, in
my opinion. The interface could have been SessionHandler, and the class
DefaultSessionHandler (and should have been final). I dislike these
Interface and Implementation (or abbreviations of these) suffixes.

Christoph