[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Hi Everyone,

I’ve been working on a new RFC for a while now, and time has come to present it to a wider audience.

Last year, I learnt that PHP doesn't have built-in support for parsing URLs according to any well-established standard (RFC 1738 or the WHATWG URL living standard), since the parse_url() function is optimized for performance instead of correctness.

In order to improve compatibility with external tools consuming URLs (like browsers), my new RFC would add WHATWG-compliant URL parsing functionality to the standard library. The API itself is not final by any means; the RFC only represents how I first imagined it.

You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api

Regards,
Máté

Hey Máté,

On Fri, 28 Jun 2024, 22:06 Máté Kocsis, <kocsismate90@gmail.com> wrote:


So far, amazing! :clap:

On Fri, Jun 28, 2024 at 10:08 PM Máté Kocsis <kocsismate90@gmail.com> wrote:


This is a great addition to have! I see there's nothing specifically about __toString in the RFC; is this aiming to do the same as PSR-7?

On 28/06/2024 22:35, Niels Dossche wrote:

- Why did you choose UrlParser to be a "static" class?

Because "static class" is the hip new cool :wink:

Bilge

On 28/06/2024 22:06, Máté Kocsis wrote:


Hi Máté

+1 from me, I'm all for modern web-related APIs as you know.

Some questions/remarks:
- Why did you choose UrlParser to be a "static" class? Right now it's just a fancy namespace.
  I can see the point of having a UrlParser class where you can e.g. configure it with which URL standard you want,
  but as it is now there is no such capability.
- It's a bit of a shame that the PSR interface treats queries as strings.
  In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.
  This Javascript class also handles escaping them nicely.
- Why is UrlComponent a backed enum?
- A nit: We didn't bundle the entire Lexbor engine, only select parts of it. Just thought I'd make it clear.
- About edge cases: e.g. what happens if I call the Url constructor and leave every string field empty?
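For concreteness, the JavaScript class referenced above behaves roughly like this in Node.js (the class handles the escaping itself):

```javascript
// URLSearchParams is a mutable multi-map view over the query string;
// values are decoded on access and re-encoded on serialization.
const params = new URLSearchParams('a=1&b=two words');

console.log(params.get('b')); // 'two words' (decoded on access)
params.set('redirect', 'https://example.com/?x=1');
console.log(params.toString());
// 'a=1&b=two+words&redirect=https%3A%2F%2Fexample.com%2F%3Fx%3D1'

// It is also live-bound to a URL instance:
const u = new URL('https://example.com/?page=1');
u.searchParams.set('page', '2');
console.log(u.href); // 'https://example.com/?page=2'
```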

Overall seems good.

Kind regards
Niels

On Fri, Jun 28, 2024, at 8:06 PM, Máté Kocsis wrote:


I am all for proper data modeling of all the things, so I support this effort.

Comments:

* There's no need for UrlComponent to be backed.

* I don't understand why UrlParser is a static class. We just had a whole big debate about that. :slight_smile:

There's a couple of ways I could see it working, and I'm not sure which I prefer:

1. Better if we envision the parser getting options or configuration in the future.
$url = new UrlParser()->parseUrl($string); // returns Url

2. The named-constructor pattern is quite common.
$url = Url::parseFromString($string);
$parts = Url::parseToArray($string);

* I... do not understand the point of having public properties AND getters/withers. A readonly class with withers, OK, a bit clunky to implement but it would be your problem in C, not mine, so I don't care. :slight_smile: But why getters AND public properties? If going that far, why not finish up clone-with and then we don't need the withers, either? :slight_smile:

* Making all the parameters to Url required except port makes little sense to me. User/pass is more likely to be omitted 99% of the time than port. In practice, most components are optional, in which case it would be inaccurate to not make them nullable. Empty string wouldn't be quite the same, as that is still a value and code that knows to skip empty string when doing something is basically the same as code that knows to skip nulls. We should assume people are going to instantiate this class themselves often, not just get it from the parser, so it should be designed to support that.

* I would not make Url final. "OMG but then people can extend it!" Exactly. I can absolutely see a case for an HttpUrl subclass that enforces scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or even an InternalUrl that assumes the host is one particular company, or something. (If this sounds like scope creep, it's because I am confident that people will want to creep this direction and we should plan ahead for it.)

* If the intent of the withers is to mimic PSR-7, I don't think it does so effectively. Without the interface, it couldn't be a drop-in replacement for UriInterface anyway. And we cannot extend it to add the interface if it's final. Widening the parameters in PSR-7 interfaces to support both wouldn't work, as that would be a hard-BC break for any existing implementations. So I don't really see what the goal is here.

* If we ever get "data classes", this would be a good candidate. :slight_smile:

* Crazy idea:

new UrlParser(HttpUrl::class)->parse($string);

To allow a more restrictive set of rules. Or even just to cast the object to that child class.
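The kind of subclass being suggested is easy to sketch; here it is against the JavaScript URL class (HttpUrl is the hypothetical name from the mail above, not an existing API):

```javascript
// Hypothetical: a subclass that narrows the accepted schemes, as
// suggested for an HttpUrl/FtpUrl family in PHP. A real version
// would also have to guard the protocol setter against later changes.
class HttpUrl extends URL {
  constructor(input, base) {
    super(input, base);
    if (this.protocol !== 'http:' && this.protocol !== 'https:') {
      throw new TypeError(`HttpUrl requires http(s), got ${this.protocol}`);
    }
  }
}

console.log(new HttpUrl('https://example.com/').href); // 'https://example.com/'

try {
  new HttpUrl('ftp://example.com/');
} catch (e) {
  console.log(e.message); // 'HttpUrl requires http(s), got ftp:'
}
```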

--Larry Garfield

On Jun 28, 2024, at 15:09, Máté Kocsis kocsismate90@gmail.com wrote:


The RFC states:

The Url\Url class is intentionally compatible with the PSR-7 UriInterface.

It mirrors the interface, but it can’t be swapped out for a UriInterface instance, especially since it can’t be extended, so I wouldn’t consider it compatible. I would still need to write a compatibility layer that composes Url\Url and implements UriInterface.

This makes it possible for a next iteration of the PSR-7 standard to use Url\Url directly instead of requiring implementations to provide their own Psr\Http\Message\UriInterface implementation.

Since PSRs are concerned with shared interfaces and this class is final and does not implement any interfaces, I’m not sure how you envision “a next iteration” of PSR-7 to use this directly, unless what you mean is that UriInterface would be deprecated and applications would type directly against Url\Url.

Cheers,
Ben

On 28/06/2024 22:06, Máté Kocsis wrote:


As a maintainer of a PHP userland URI toolkit I have a couple of questions/remarks on the proposal. First, I look forward to finally having a real URL parser AND validator in PHP core. Any effort in that direction is always welcome news.

As far as I understand it, if this RFC were to pass as is, it would model PHP URLs on the WHATWG specification. While this specification is getting a lot of traction lately, I believe it will restrict URL usage in PHP instead of making developers' lives easier. While PHP started as a "web" language, it is first and foremost a server-side general-purpose language. The WHATWG spec, on the other hand, is created by browser vendors and is geared toward browsers (the client side), and because of browser history it restricts by design a lot of what PHP developers can currently do using `parse_url`. In my view the `Url` class in PHP should allow dealing with any IANA-registered scheme, which is not the case for the WHATWG specification.
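The restriction being described here is observable in practice: the WHATWG spec hard-codes a short list of "special" schemes (http, https, ws, wss, ftp, file) and treats every other scheme as opaque, e.g. (Node.js):

```javascript
// Special schemes get full host parsing (lowercasing, punycode,
// default ports); all other schemes get an "opaque host" that is
// left mostly untouched.
console.log(new URL('https://EXAMPLE.com/').host); // 'example.com'
console.log(new URL('ssh://EXAMPLE.com/').host);   // 'EXAMPLE.com' (opaque host, not normalized)

// Relative references resolve against hierarchical bases only:
console.log(new URL('/b', 'https://example.com/a').href); // 'https://example.com/b'
try {
  new URL('/b', 'mailto:user@example.com'); // opaque-path base: parse failure
} catch (e) {
  console.log('cannot resolve against an opaque path');
}
```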

Therefore, I would rather suggest we ALSO include proper support for the RFC 3986 and RFC 3987 specifications and give both specs a go (at the same time!), with a clear way to instantiate your `Url` with one or the other spec. To be clear, my ideal situation would be to add to the parser at least two named constructors, `UrlParser::fromRFC3986` and `UrlParser::fromWHATWG`, or something similar (names can be changed or improved).

While this is an old article by Daniel Stenberg (One URL standard please | daniel.haxx.se), it conveys, with more in-depth analysis, my issues with the WHATWG spec and its usage in PHP if it were to be used as the ONLY available URL parser in PHP core.

The PSR-7 relation is also unfortunate from my POV: PSR-7's UriInterface is designed to be, at its core, an HTTP URI representation (so it shares the same type of issue as the WHATWG spec!), meaning that in the absence of a scheme it falls back to HTTP scheme validation. This is why the interface can forgo nullable components: the HTTP spec allows that, while other schemes do not. For instance, the FTP scheme prohibits the presence of the query and fragment components, which means they MUST be `null` in that case.

By removing PSR-7 constraints we could add

- a `Url::(get|to)Components` method: it would mimic `parse_url`'s return value and as such ease migration from `parse_url`
- `Url::getUsername` and `Url::getPassword` to access the username and password components individually. You would still use the `withUserInfo` method to update them, but the developer gains the ability to access both components directly from the `Url` object.

These additions would remove the need for

- `UrlParser::parseUrlToArray`
- `UrlParser::parseUrlComponent`
- `UrlComponent` Enum

Cheers,
Ignace

On 29 Jun 2024, at 04:48, Niels Dossche <dossche.niels@gmail.com> wrote:

- It's a bit of a shame that the PSR interface treats queries as strings.
In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.
This Javascript class also handles escaping them nicely.

Agreed this is a weird choice to me, but I'm also not surprised by weird choices via
php-fig (log level constants I'm looking at you)

We hear all the time how userland is more flexible and can change quicker - and yet here we see a potential built in class having a worse api because it wants to be compatible with an existing userland interface with the same bad api....

Cheers

Stephen

On Sat, Jun 29, 2024, at 11:57, Stephen Reay wrote:


I personally ignore PSR when it doesn’t make sense to use it. They’re nice for library compatibility, but I will happily toss compatibility when it doesn’t make sense to be compatible. This might be one of those cases as there is no reason it has to be PSR compliant. In fact, a wrapper may be written to make it compliant, if one so chooses. I suspect it is better to be realistic and learn from the short-comings of PSR and apply those learnings here, vs. reiterating them and “engraving them in stone” (so to speak).

— Rob

On 2024-06-28 23:06, Máté Kocsis wrote:


Hey,

That’s great that you’ve made the Url class readonly. Immutability is reliable. And I fully agree that a better parser is needed.

I agree with the others that

  • the enum might be fine without the backing, if it’s needed at all
  • I’m not convinced a separate UrlParser is needed; Url::someFactory($str) should be enough
  • getters seem unnecessary; they should only be added if you can be sure they are going to be needed for compatibility with PSR-7
  • treating $query as a single string is clumsy; having some kind of bag, or at least an array, to represent it would be cooler and easier to build and manipulate

I wanted to add that it might be more useful to make all the Url constructor arguments optional, either nullable or with reasonable defaults. Then you could do `$url = new Url(path: 'robots.txt'); foreach ($domains as $d) $r[] = file_get_contents($url->withHost($d));` and the like.

Similar modifiers would be very useful for the query stuff, e.g. $u = Url::current(); return $u->withQueryParam('page', $u->queryParam->page + 1);.
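The query-modifier idea maps directly onto what JavaScript already offers, for comparison (withQueryParam itself is the hypothetical PHP name from above):

```javascript
// Incrementing a query parameter via the searchParams view, roughly
// what an immutable withQueryParam() wither would do in PHP.
const u = new URL('https://example.com/list?page=3&sort=asc');
const next = new URL(u); // copy first, since URL is mutable in JS
next.searchParams.set('page', String(Number(next.searchParams.get('page')) + 1));

console.log(next.href); // 'https://example.com/list?page=4&sort=asc'
console.log(u.href);    // 'https://example.com/list?page=3&sort=asc' (original untouched)
```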

Sure, all of that can be done in userland as long as you drop final :slight_smile:

BR,
Juris

On Jun 29, 2024, at 03:20, nyamsprod the funky webmaster <nyamsprod@gmail.com> wrote:

Therefore, I would rather suggest we ALSO include support for RFC3986 and RFC3987 specification properly and give both specs a go (at the same time!) and a clear way to instantiate your `Url` with one or the other spec.
In clear, my ideal situation would be to add to the parser at least 2 named constructors `UrlParser::fromRFC3986` and `UrlParser::fromWHATWG`
or something similar (name can be changed or improved).

While this is an old article by Daniel Stenberg (One URL standard please | daniel.haxx.se), it conveys with more in depth analysis my issues with the WHATWG spec and its usage in PHP if it were to be use as the ONLY available parser in PHP core for URL.

I agree that I would love to see a more general IRI parser, with maybe a URI parser being a subtype of an IRI parser.

Cheers,
Ben

On Fri, 28 Jun 2024, at 21:06, Máté Kocsis wrote:

[…] add a WHATWG compliant URL parser functionality to the standard library. The API itself is not final by any means, the RFC only represents how I imagined it first.

You can find the RFC at the following link: https://wiki.php.net/rfc/url_parsing_api

First-pass comments/thoughts.

As others have mentioned, it seems the class would/could not actually satisfy PSR-7. Realistically, the PSR-7 interface package or someone else would need to create a new class that combines the two, potentially as part of a transition away from it to the built-in class, with future PSRs building directly on Url. If we take that as given, we might as well design for the end state, and accept that there will be a (minimal) transition. This end state would benefit from being designed with the logical constraints of PSR-7 (so that migration is possible without major surprises), but without restricting us to its exact API shape, since an intermediary class would come into existence either way.

For example, Url could be a value class with merely 8 public properties. Possibly with a UrlImmutable subclass, akin to DateTime, where the properties are read-only (or instead a clone method could return a modified Url?).

It might be more ergonomic to leave the parser as an implementation detail, allowing the API to be accessed from a single import rather than requiring two. This could look like Url::parse() or Url::parseFromString().

For the Url::parseComponent() method, did you consider accepting the existing PHP_URL_* constants? They appear to fit exactly, in naming, description, and associated return types.

Without UrlParser/UrlComponent, I’d adopt it directly in applications and frameworks. With them, further wrapping seems likely for improved usability. This is sometimes beneficial when exposing low-level APIs, but it seems like this is close to fitting in a single class, as demonstrated by the WHATWG URL API.

One thing I feel is missing, is a method to parse a (partial) URL relative to another. E.g. to expand or translate paths between two URLs. Consider expanding “/w/index.php”, or “index.php” relative to “https://wikipedia.org/w/”. Or expanding “//example.org” relative to either “https://wikipedia.org” vs “http://wikipedia.org”. The WHATWG URL API does this in the form of a second optional string|Stringable parameter to Url::parse(). Implementing “expand URL” with parsing of incomplete URLs is error-prone and hard to get right. Including this would be valuable.
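For reference, this is the two-argument form being described here, as it behaves in the WHATWG URL API (Node.js):

```javascript
// new URL(input, base) implements WHATWG relative resolution.
console.log(new URL('/w/index.php', 'https://wikipedia.org/w/').href);
// 'https://wikipedia.org/w/index.php'
console.log(new URL('index.php', 'https://wikipedia.org/w/').href);
// 'https://wikipedia.org/w/index.php'

// Protocol-relative references inherit the scheme of the base:
console.log(new URL('//example.org', 'https://wikipedia.org').href); // 'https://example.org/'
console.log(new URL('//example.org', 'http://wikipedia.org').href);  // 'http://example.org/'
```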

See also Net_URL2 and its resolve() method https://pear.php.net/package/Net_URL2 https://github.com/pear/Net_URL2

Timo Tijhof

https://timotijhof.net/

Hi Larry,

Thank you very much for your feedback! I think I have already partially answered some of your questions in my previous email to Niels,
but let me answer your other questions below:

  • I… do not understand the point of having public properties AND getters/withers. A readonly class with withers, OK, a bit clunky to implement but it would be your problem in C, not mine, so I don’t care. :slight_smile: But why getters AND public properties? If going that far, why not finish up clone-with and then we don’t need the withers, either? :slight_smile:

I know it’s disappointing, but the public modifiers are just a typo, forgotten there from the very first iteration of the API :slight_smile: However, I’m fine with having public readonly properties without getters as well, as long as we declare this a policy that we are going to adopt… Withers are indeed a must for now (and their implementation indeed requires some magic in C…).

  • Making all the parameters to Url required except port makes little sense to me. User/pass is more likely to be omitted 99% of the time than port. In practice, most components are optional, in which case it would be inaccurate to not make them nullable. Empty string wouldn’t be quite the same, as that is still a value and code that knows to skip empty string when doing something is basically the same as code that knows to skip nulls. We should assume people are going to instantiate this class themselves often, not just get it from the parser, so it should be designed to support that.

I may have misunderstood what you wrote, but all the parameters - including port - are required. If you really meant “nullable” instead of “required”, then you are right. Apart from this, I’m completely fine with making these parameters optional, especially if we decide not to have the UrlParser (my initial assumption was that the Url class is going to be instantiated via UrlParser::parseUrl() calls).

  • I would not make Url final. “OMG but then people can extend it!” Exactly. I can absolutely see a case for an HttpUrl subclass that enforces scheme as http/https, or an FtpUrl that enforces a scheme of ftp, etc. Or even an InternalUrl that assumes the host is one particular company, or something. (If this sounds like scope creep, it’s because I am confident that people will want to creep this direction and we should plan ahead for it.)

Without having thought much about its consequences on the implementation, I’m fine with removing the final modifier.

  • If the intent of the withers is to mimic PSR-7, I don’t think it does so effectively. Without the interface, it couldn’t be a drop-in replacement for UriInterface anyway. And we cannot extend it to add the interface if it’s final. Widening the parameters in PSR-7 interfaces to support both wouldn’t work, as that would be a hard-BC break for any existing implementations. So I don’t really see what the goal is here.

I’ve just answered this to Ben, but let me reiterate: PSR-7’s UriInterface is only needed because PHP doesn’t have a Url internal class. :slight_smile:

Máté

On 29/06/2024 11:57, Stephen Reay wrote:


While I do not think the debate should be about compatibility with PSR-7, some historical context should be brought to light for a fair discussion:

- parse_url and parse_str predate RFC 3986
- URLSearchParams was ratified before PSR-7, BUT the first implementation landed a year AFTER PSR-7 was released and already implemented.
- PHP's historical query parser, parse_str, has logic so bad (mangled parameter names, for instance) that PSR-7 was right not to embed that parsing algorithm in its specification.
- If you set aside the URI Template specification and now URLSearchParams, there is no official, referenced, and/or agreed-upon document on how a query string MUST or SHOULD be parsed.
- Last but not least, URLSearchParams' encoding/decoding rules follow neither RFC 1738 nor RFC 3986 (they follow the form-data rules, which are kind of a mix between both RFCs).

This means that just adding a method or a class that mimics URLSearchParams 100% would constitute a major departure in how PHP treats query strings: you would no longer have a 1:1 relation between the data inside your `$_GET` array and the data in URLSearchParams, for better or for worse.
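The encoding divergence described here is easy to demonstrate (Node.js):

```javascript
// URLSearchParams serializes with application/x-www-form-urlencoded
// rules, which differ from RFC 3986 percent-encoding.
const params = new URLSearchParams({ q: 'a b', tilde: '~x' });

console.log(params.toString());         // 'q=a+b&tilde=%7Ex' (space -> '+', '~' escaped)
console.log(encodeURIComponent('a b')); // 'a%20b' (RFC 3986: space -> %20)
console.log(encodeURIComponent('~x'));  // '~x'    (RFC 3986: '~' is unreserved)
```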

For all these arguments I would keep the proposed `Url` free of all these concerns and lean toward a nullable string for the query string representation, and defer this debate to its own RFC about query string parsing and handling in PHP.

It mirrors the interface, but it can’t be swapped out for a UriInterface instance, especially since it can’t be extended, so I wouldn’t consider it compatible. I would still need to write a compatibility layer that composes Url\Url and implements UriInterface.

I guess my words were slightly misleading: what I should have written is that the methods themselves are compatible.

Since PSRs are concerned with shared interfaces and this class is final and does not implement any interfaces, I’m not sure how you envision “a next iteration” of PSR-7 to use this directly, unless what you mean is that UriInterface would be deprecated and applications would type directly against Url\Url.

Yes, I meant the latter exactly. If we had a well-usable URL object representation in the standard library, then there would be no need to have a userland interface as well (unless they have different behavior or purpose). Analogously, we have DateTimeImmutable, and there is no PSR for a date-time interface. (I know there are Carbon and other libraries, but they are for convenience, not for interoperability.)

Hi Niels,

First of all, thank you for your support!

Why did you choose UrlParser to be a “static” class? Right now it’s just a fancy namespace.

That’s a good question, let me explain the reason: one of my major design goals was to make the UrlParser class extendable and configurable (e.g. via an “engine” property similar to what Random\Randomizer has). Of course, UrlParser doesn’t support any of this yet, but at least the possibility is there for follow-up RFCs due to the class not being final.

Since I knew it would be overkill to require instantiating a UrlParser instance for a task which is stateless (URL parsing), I finally settled on using static methods for the purpose. Later, if the need arises, the static methods could be converted to non-static ones with minimal BC impact.

It’s a bit of a shame that the PSR interface treats queries as strings.

In Javascript we have the URLSearchParams class that we can use as a key-value storage for query parameters.

Hm, yes, that’s an observation I can agree with. However, this restriction shouldn’t prevent follow-ups from adding key-value storage support for query parameters. Although, as far as I could determine, Lexbor isn’t currently capable of such a thing either.

Why is UrlComponent a backed enum?

To be honest, there’s no specific reason beyond that being what I’m used to. I’m fine with whatever choice, even with getting rid of UrlComponent completely. I added the UrlParser::parseUrlComponent() method (and hence the UrlComponent enum) to the proposal in order to have a direct replacement for parse_url() when it’s called with the $component parameter set, but I wasn’t really sure whether this is needed at all… So I’m eager to hear any recommendations regarding this problem.

A nit: We didn’t bundle the entire Lexbor engine, only select parts of it. Just thought I’d make it clear.

Yes, my wording was slightly misleading. I’ll clarify this in the RFC.

About edge cases: e.g. what happens if I call the Url constructor and leave every string field empty?

Nothing :slight_smile: The Url class in its current form can store invalid URLs. I know that URLs are generally modeled as value objects (that’s also why the proposed class is immutable), and generally speaking, value objects should protect their invariants. However, due to separating the parser into its own class, I abandoned this “rule”. So this is one more downside of the current API.
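For contrast, the WHATWG-style design enforces its invariants at construction time, which is the alternative trade-off being discussed (Node.js):

```javascript
// A WHATWG URL object can never hold an invalid URL: the constructor
// throws instead of producing a half-initialized value object.
try {
  new URL('http://exa mple.com/'); // space in host is a parse error
} catch (e) {
  console.log(e instanceof TypeError); // true
}

// Mutation goes through setters that silently ignore invalid input,
// so the object also stays valid afterwards:
const u = new URL('https://example.com/');
u.port = 'not-a-port';
console.log(u.port); // '' (invalid assignment ignored, object still valid)
```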

Regards,
Máté

On Sun, Jun 30, 2024, at 1:00 AM, Máté Kocsis wrote:

It mirrors the interface, but it can’t be swapped out for a UriInterface instance, especially since it can’t be extended, so I wouldn’t consider it compatible. I would still need to write a compatibility layer that composes Url\Url and implements UriInterface.

I guess my words were slightly misleading: what I should have written
is that the methods themselves are compatible.

Since PSRs are concerned with shared interfaces and this class is final and does not implement any interfaces, I’m not sure how you envision “a next iteration” of PSR-7 to use this directly, unless what you mean is that UriInterface would be deprecated and applications would type directly against Url\Url.

Yes, I meant the latter exactly. If we had a well-usable URL object
representation in the standard library, then there would be no need to
have an userland interface as well (unless they have different behavior
or purpose). Analogically, we have DateTimeImmutable, and there is no
PSR for a date time interface. (I know there is Carbon and other
libraries, but they are for convenience, not for interoperability).

I cannot speak on behalf of FIG here, but as a long-time member of FIG and a member of the Core Committee, I would urge you to *not* try to make a core Url object compatible with UriInterface. It's solving a slightly different problem, using a language that is very different (PHP 8.4 vs 5.5 or so), with somewhat different constraints.

Instead, let's make sure that any new Url object is *composable* by PSR-7 UriInterface. A UriInterface implementation that is backed by a Url object internally should be an easy task to do if anyone wants, and we should make sure we don't do anything that makes that unnecessarily hard, but right now the language is simply not capable of making core Url a drop-in replacement for UriInterface, so let's not even try.

Whether that means Url should be readonly or not, have getters/setters/withers, split authority into user:pass, etc. are things we should discuss on their own merits, not based on "well PSR-7 did it this way over 10 years ago, so we'll just do that."

--Larry Garfield

Hi Ignace,

As far as I understand it, if this RFC were to pass as is, it would model
PHP URLs on the WHATWG specification. While this specification is
getting a lot of traction lately, I believe it would restrict URL usage in
PHP instead of making developers' lives easier. While PHP started as a
“web” language, it is first and foremost a server-side, general-purpose
language. The WHATWG spec, on the other hand, is created by browser
vendors and is geared toward browsers (the client side), and because of
browser history it restricts by design a lot of what PHP developers can
currently do using parse_url(). In my view, the Url class in
PHP should allow dealing with any IANA-registered scheme, which is not
the case for the WHATWG specification.
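
This scheme restriction is easy to see in practice. The sketch below uses the native URL class in Node.js, since browsers and Node implement the WHATWG spec directly and the proposed PHP API does not exist yet; it shows how the spec fully parses only its "special" schemes, while other IANA-registered schemes are treated opaquely:

```javascript
// "Special" schemes (http, https, ws, wss, ftp, file) get full host parsing.
const web = new URL('https://example.com/a/b');
console.log(web.host);      // 'example.com'

// Non-special schemes are parsed opaquely: no host is recognized, and
// everything after the scheme lands in the path.
const mail = new URL('mailto:user@example.com');
console.log(mail.host);     // ''
console.log(mail.pathname); // 'user@example.com'
```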

Supporting IANA-registered schemes is a valid request, and is definitely useful.
However, I don’t think this feature is strictly required in the current RFC.
Anyone who needs features that are not offered by the WHATWG
standard can still rely on parse_url(). And of course, we can (and should) add
support for other standards later. If we wanted to do all of this in the same
RFC, then its scope would become way too large IMO. That’s why I
opt for incremental improvements.

Besides, I fail to see why a WHATWG compliant parser wouldn’t be useful in PHP:
yes, PHP is server-side, but it still interacts with browsers very heavily. Among other
use cases I cannot yet imagine, the major one is most likely validating user-supplied URLs
before opening them in the browser. As far as I can see, there is currently no acceptably
reliable way to decide whether a URL can be opened in a browser or not.
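
A minimal sketch of that validation use case, written in JavaScript against the WHATWG URL class built into Node.js and browsers (the behavior the RFC proposes to bring to PHP); the helper name is my own invention:

```javascript
// Returns true only if the input parses under the WHATWG URL spec,
// i.e. a browser would accept it as an absolute URL.
function isBrowserOpenableUrl(input) {
  try {
    new URL(input);   // throws TypeError on parse failure
    return true;
  } catch {
    return false;
  }
}

console.log(isBrowserOpenableUrl('https://example.com/path')); // true
// No browser will open either of these:
console.log(isBrowserOpenableUrl('https://exa mple.com'));     // false (space in host)
console.log(isBrowserOpenableUrl('http://'));                  // false (empty host)
```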

  • parse_url() and parse_str() predate RFC 3986
  • URLSearchParams was ratified before PSR-7, BUT the first implementation
    landed a year AFTER PSR-7 was released and already implemented.

Thank you for the historical context!

Based on your and others’ feedback, it has now become clear to me that parse_url()
is still useful, and ext/url needs quite a few additional capabilities before this function
really becomes superfluous. That’s why it now seems to me that the behavior of
parse_url() could be leveraged in ext/url so that it would work with a Url\Url class (e.g.
we could have a PhpUrlParser class extending Url\UrlParser, or a Url\Url::fromPhpParser()
method, depending on which object model we choose; the names are TBD, of course).

For all these arguments I would keep the proposed Url free of all
these concerns and lean toward a nullable string for the query string
representation, and defer this debate to its own RFC on query
string handling in PHP.

My WIP implementation still uses nullable properties and return types; I only changed them
when I wrote the RFC. Since I see that PSR-7 compatibility is a very low priority for everyone
involved in the discussion, I think making these types nullable is fine. It wasn’t my
top priority either, but I had to start the object design somewhere, so I went with this.

Again, thank you for your constructive criticism.

Regards,
Máté

Hi Máté,

> Supporting IANA registered schemes is a valid request, and is definitely useful. However, I think this feature is not strictly required to have in the current RFC.

True. Having a WHATWG compliant parser in the PHP source code is a big +1 from me; I have nothing against that inclusion.

> Based on your and others' feedback, it has now become clear to me that parse_url() is still useful, and ext/url needs quite a few additional capabilities before this function really becomes superfluous.

`parse_url` can only be deprecated when an RFC 3986 compliant parser is added to php-src, hence why I insist on that parser being present too.

I will also add that everything in PHP up to now uses RFC 3986 as the basis for generating or representing URLs (the cURL extension, streams, etc.). Having the first and only OOP representation of a URL in the language not follow that same specification seems odd to me. It opens the door to inconsistencies that will only be resolved once an equivalent RFC 3986 URL object makes its way into the source code.

On the public API side I would recommend the following:

- if you are to strictly follow the WHATWG specification, no URL component can be null; they must all be strings. If we plan to use the same object for an RFC 3986 compliant parser, then all components should be nullable except for the path component, which can never be null as it is always present.

- As others have mentioned, we should add a method to resolve a URL against a base URL, something like Url::resolve(string $url, Url|string|null $baseUrl), where the $baseUrl argument should be an absolute URL if present. If absent, the $url argument must be absolute; otherwise an exception should be thrown.

- last but not least, the WHATWG specification is not only a URL parser but also a URL validator: it can apply some "corrections" to malformed URLs and report them. The specification includes a provision for a structure reporting malformed-URL errors, yet I fail to see this mechanism mentioned anywhere in the RFC. Will the URL object only throw exceptions, or will it also trigger warnings? For inspiration, the excellent userland WHATWG URL parser from Trevor Rowbotham (TRowbotham/URL-Parser on GitHub) allows using a PSR-3 logger to record those errors.
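
The three points above can be sketched against the WHATWG URL class shipped with Node.js (JavaScript here, since it is the spec's reference behavior; names like Url::resolve() in the text are only proposals):

```javascript
// 1. No component is ever null: absent parts come back as empty strings.
const u = new URL('https://example.com');
console.log(u.port === '' && u.search === '' && u.hash === ''); // true

// 2. Resolution against a base URL is built into the parser itself.
const resolved = new URL('../img/logo.png', 'https://example.com/docs/page');
console.log(resolved.href); // 'https://example.com/img/logo.png'

// 3. The parser silently "corrects" recoverable errors: the scheme and host
// are lowercased, the default port is dropped, and dot segments are removed.
const fixed = new URL('HTTPS://EXAMPLE.com:443/a/../b');
console.log(fixed.href);    // 'https://example.com/b'

// Unrecoverable input throws instead of being corrected.
try { new URL('https://'); } catch (e) {
  console.log(e instanceof TypeError); // true
}
```

How the recoverable "validation errors" from point 3 should be surfaced to the caller (exceptions, warnings, a logger) is exactly the open question for the PHP API.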

Best regards,
Ignace