[PHP-DEV] [RFC] [Discussion] Add WHATWG compliant URL parsing API

Hi Hammed,

What’s wrong with declaring all the methods as final? E.g. https://github.com/lnear-dev/ada-url/blob/main/ada_url.stub.php

I’ve just noticed your message, sorry. Coincidentally, as I wrote a few days ago, I’m also experimenting with making methods final.

Máté

Hi,

It depends on whether it’s intended to be used as a parameter type. If it’s designed to be passed around to functions, I really don’t want it to be an array. I maintain a legacy codebase where arrays are used as hashmaps pretty much everywhere, and it’s error prone. We lose all kinds of features like “find usages” and refactoring of key/property names. Silly typos in array keys, with no validation of any kind, cause null values and annoying-to-find bugs.
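To illustrate the point, a minimal sketch (the class and key names are made up for the example):

```php
<?php
// Illustrative only: a typo in an array key silently yields null,
// while a typed object property gives tooling and runtime feedback.
$parts = ['scheme' => 'https', 'host' => 'example.com'];
$host = $parts['hosst'] ?? null; // typo goes unnoticed; $host is null

final class UrlParts
{
    public function __construct(
        public readonly string $scheme,
        public readonly string $host,
    ) {}
}

$url = new UrlParts('https', 'example.com');
var_dump($url->host); // string(11) "example.com"
// Reading $url->hosst raises an "Undefined property" warning, and IDEs,
// "find usages", and rename refactoring all catch the typo; the array
// access above fails silently.
```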

I agree that hashmaps can be really easy to use, but not as data structures outside of the function/method scope they were defined in. If value vs object semantics are important here, then something that is forward compatible with whatever structs may hold in the future could be interesting.

Yes, I agree here, even when we talk about simple data without behavior. But as the length of the current RFC also suggests, URIs have a surprising amount of behavior, so I think it’s natural to use OO to model them.

Máté

On 14/03/2025 20:45, Máté Kocsis wrote:

Hi Ignace,

      > All URI components - with the exception of the host - can be
    retrieved in two formats:

    I believe you mean - with the exception of the port

Even though I specifically meant WHATWG's host, which is only available in one
format, you are right: the port is never available in two formats. So I've
changed the wording accordingly.

    0 - It is unfortunate that there's no IDNA support for RFC 3986. I
    understand the reasoning behind that decision, but I was wondering if it
    would be possible to opt in to its use when the ext-intl extension is present?

Good question, though I think that's probably not the main concern. My specific concern is that
RFC 3987 is around the same length as RFC 3986; in a lot of cases it uses the exact
wording of the original RFC but changes URI to IRI, and of course it adds the
IDNA-specific parts. Maybe it's just me, but it's not easy to find out exactly what
has to be implemented on top of RFC 3986, and also how it can best be achieved.
By extending the class for RFC 3986? By creating a totally separate class that can
transform itself to an RFC 3986 URI? These and quite a few other questions have
to be answered first, which is why I would like to postpone this.
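As a side note, ext-intl already exposes the UTS #46 conversions that such an opt-in could build on. A minimal sketch (this uses the existing intl extension, not the proposed RFC classes, and is guarded in case intl is absent):

```php
<?php
// Convert an IDN host between its Unicode and ASCII (punycode) forms
// using ext-intl. Guarded so the sketch degrades gracefully without intl.
if (function_exists('idn_to_ascii')) {
    $ascii = idn_to_ascii('bücher.example', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    var_dump($ascii); // string(21) "xn--bcher-kva.example"

    $unicode = idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    var_dump($unicode); // string(15) "bücher.example"
}
```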

    1 - Does it mean that if/when Rfc3986\Uri gets RFC 3987 support, it
    will also get `Uri::toDisplayString` and `Uri::getHostForDisplay`?
    Maybe this should be stated in the Future Scope?

It's a question that I also asked myself. For now, I'd say that
Rfc3986\Uri shouldn't have these methods, since it doesn't support any such
capabilities. But an Rfc3987\Iri class should likely have these toString methods.

    4 - For consistency I would use toRawString and toString just like it is
    done for components.

I'm fine with this; I also think doing so would reasonably continue the convention
the getters follow.

    5 - Can the returned array from __debugInfo be used in a "normal" method
    like `toComponents`? Naming can be changed/improved to ease migration from
    parse_url, or is this left for a userland library?

I intend to add the __debugInfo() method purely to help debugging. Without this
method, even I had a hard time trying to compare the expected vs actual
URIs in my tests.

But more importantly, sometimes the recomposed string is not enough to understand
exactly what value each component has. For example,
one can naively assume that the "mailto:kocsismate@php.net" URI has a
user(info) component of "kocsismate" and a hostname of "php.net" (I probably
also did so before reading the RFCs). The representation provided by
__debugInfo() can quickly highlight that "kocsismate@php.net" is in fact the path.
One could try to call the individual getters to find the needed component, but a
method like __debugInfo() provides a much clearer picture of the anatomy of
the URI.
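The mailto example is easy to check with today's parse_url(), which reveals the same anatomy:

```php
<?php
// parse_url() already demonstrates the anatomy described above:
// "kocsismate@php.net" is the path, not userinfo + host.
$components = parse_url('mailto:kocsismate@php.net');
var_dump($components);
// array(2) {
//   ["scheme"]=> string(6) "mailto"
//   ["path"]=> string(18) "kocsismate@php.net"
// }
```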

But otherwise I don't know how useful this method would be. Is there anything else
besides helping the migration?

Regards,
Máté

Thanks for the clarification.

I have other questions upon further readings:

1) around `Uri\UninitializedUriException`: if I look at the behaviour of `DateTimeImmutable` in the same scenario, or of a userland object, instead of an exception an Error is thrown

see:

- Online PHP editor | output for d4VrY
- Online PHP editor | output for Wn7En

Shouldn't the URI feature follow the same path for consistency? Instead of throwing an exception, it should throw an Error on the uninitialized issue
at least.

2) around Normalization. In the case of query normalization, sorting the query string is not mentioned. Does it mean that with the current feature

`http://example.com?foo=bar&foo=rab`
is different from
`http://example.com?foo=rab&foo=bar`?

Hi Dennis,

This is a late thought, and surely amenable to a later RFC, but I was thinking about the get/set path methods and the issue of the / and %2F.

  • If we exposed getPathIterator() or getPathSegments() could we not report these in their fully-decoded forms? That is, because the path segments are separated by some invocation or array element, they could be decoded?
  • Probably more valuably, if withPath() accepted an array, could we not allow fully non-escaped PHP strings as path segments which the URL class could safely and by-default handle the escaping for the caller?

Yes, these are very good ideas, and they are actually in line with how I would imagine a second iteration. Probably, getPathSegments() could return
the “%2F” (percent-encoded form of “/”) percent-decoded, sure. But the rest of the reserved characters will also be an issue, since they can also appear
percent-encoded within the path (e.g. “&” inside “Document & Settings”). So percent-decoding of reserved characters should still be taken into account.

Right now, if someone haphazardly joins path segments to pass to withPath(), they will likely be unaware of that nuance and get the path wrong. In the grand scheme of things, I suspect this is a really minor risk. However, if they could send in an array, they would never need to be aware of that nuance in order to produce a fully-reliable URL, up to the class rejecting path segments which cannot be represented.

Yes, consuming an array is also a good idea, but for the same reason as above, correctly percent-encoding “/” alone is not enough
to end up with a valid URI. (Of course, I’m still talking about RFC 3986; WHATWG already performs automatic percent-encoding.)
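The nuance in question can be sketched with plain PHP string functions (the helper below is hypothetical, not part of the RFC):

```php
<?php
// Hypothetical helper: build an RFC 3986 path from already-decoded segments.
// rawurlencode() escapes "/" as "%2F" and "&" as "%26", so a literal "/"
// inside a segment survives as data instead of becoming a separator.
function buildPath(array $segments): string
{
    return '/' . implode('/', array_map('rawurlencode', $segments));
}

$path = buildPath(['Document & Settings', 'a/b']);
var_dump($path); // string(32) "/Document%20%26%20Settings/a%2Fb"
```

Naively imploding the raw segments with "/" would instead have produced "/Document & Settings/a/b", where the embedded slash is indistinguishable from a separator.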

The HTML5 library has ::createFromString() instead of parse(). Did you consider following this form? It doesn’t seem that important, but could be a nice improvement in consistency among the newer spec-compliant APIs. Further, I think createFromString() is a little more obvious in intent, as parse() is so generic.

Given the issues around equivalence, what about isEquivalent() instead of equals()? In the RFC I think you have been careful to use the “equivalence” terminology, but then in the actual interface we fall back to equals() and lose some of the nuance.

In my implementation, I tried to choose terminology that people are familiar with instead of using the terminus technicus of URIs: instead of recompose(), I used toString() (or some variant of it); instead of isEquivalent(),
I used equals(). parse() is probably an outlier, since it’s the correct name of the exact process. But in any case, I consider these names adequately short, and I think they very clearly convey their intent. Using the terminus
technicus would probably suit those who have deep familiarity with URIs even better, but this group will likely be the minority forever. For the rest of the people, the current names make more sense, so I’d prefer keeping them as-is.

Something about not implementing getRawScheme() and friends in the WHATWG class seems off. Your rationale makes sense, but then I wonder what the problem is in exposing the raw untranslated components, particularly since the “raw” part of the name already suggests some kind of danger or risk in using it as some semantic piece.

Hm, interesting remark. Do I understand correctly that you are suggesting exposing getRawScheme() and getRawHost() with their original values? If so, this has technical challenges: the WHATWG parser doesn’t store
the original value of these two components, so they are effectively lost when the automatic transformation happens during parsing. But this is normal, since the WHATWG specification doesn’t really care about the original value of these components.

Tim brought up the naming of getHost() and getHostForDisplay() as well as the correspondence with the toString() methods. I’m not sure if it was overlooked or I missed the followup, but I wonder what your thoughts are on passing an enum to these methods indicating the rendering context. Here’s why: I see developers reach for the first method that looks right. In this case, that would almost always be getHost(), yet getHost() or toString() or whatever is going to be inappropriate in many common cases. I see two ways of baking education into the API surface: creating two symmetric methods (e.g. getDisplayableHost() and getNonDisplayableHost()); or requiring an enum forcing the choice (e.g. getHost( ForDisplay | ForNonDisplay )). In the case of an enum, this could be equally applied across all of the relevant methods where such a distinction exists. On one hand this could be seen as forcing callers to make a choice, but on the other hand it can also be seen as a safeguard against an extremely common foot-gun, making such an easy oversight impossible.
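The enum-based variant could look something like the following sketch (purely illustrative; none of these names are in the RFC, and the host values are pre-computed here for brevity):

```php
<?php
// Purely illustrative: forcing callers to choose a rendering context.
enum HostContext
{
    case ForDisplay;
    case ForMachine;
}

final class Url
{
    public function __construct(
        private readonly string $asciiHost,
        private readonly string $unicodeHost,
    ) {}

    // No zero-argument getHost() exists, so callers must state their intent.
    public function getHost(HostContext $context): string
    {
        return match ($context) {
            HostContext::ForDisplay => $this->unicodeHost,
            HostContext::ForMachine => $this->asciiHost,
        };
    }
}

$url = new Url('xn--bcher-kva.example', 'bücher.example');
var_dump($url->getHost(HostContext::ForMachine)); // "xn--bcher-kva.example"
var_dump($url->getHost(HostContext::ForDisplay)); // "bücher.example"
```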

I am myself also a bit lost on the countless names that I tried out in the implementation, but I think I had toHumanFriendlyString() and toDisplayFriendlyString() methods at some point. These then ended up being toString() and toDisplayString() after some iterations. I would be ok with renaming getHost() and toString() so that their names suggest they don’t use IDNA, but I’d clearly need a good enough suggestion, since neither “MachineFriendly”, nor “NonDisplayable” sound like the best alternative for me. I was also considering using getIdnaHost() and toIdnaString(), but I realized these are the worst looking names I have come up with so far.

Just this week I stumbled upon an issue with escaping the hash/fragment part of a URL. I think that browsers used to decode percent-encodings in the fragment, but they all stopped, and this was removed from the WHATWG HTML spec (“no-percent-escaping”). The RFC currently shows getFragment() decoding percent-encoded fragments; however, I believe that the WHATWG URL spec only indicates percent-encoding when setting the fragment. You can test this in a browser with the following example; Chrome, Firefox, and Safari exhibit the same behavior.

u = new URL(window.location);
u.hash = 'one and two';
u.hash === '#one%20and%20two';
u.toString() === '….#one%20and%20two';

So I think it may be more accurate and consistent to handle Whatwg\Url::getFragment in the same way as getScheme(). When setting a fragment we should percent-encode the appropriate characters, but when reading it, we should never interpret those characters — it should always return the “raw” value of the fragment.

Thank you for the suggestion and for noticing this problem. I believe you must have read a version of the RFC where I was still trying to work out the correct percent-decoding rules for WHATWG. At some point, I was completely misunderstanding what the specification prescribed, so I had to make quite a few changes in the RFC regarding this aspect, and I finally managed to describe the reasoning behind the choices in detail. Now I think the rules make sense.

Yes, my implementation automatically percent-encodes the input when parsing or modifying a WHATWG URL. You are also right that WHATWG never percent-decodes the output due to the following reason:

… the point of view of a maintainer of the WHATWG specification is that webservers may legitimately choose to consider encoded and decoded paths distinct, and a standard cannot force them not to do so.

The said author made this clear in multiple comments, but this one is linked in the RFC: https://github.com/whatwg/url/issues/606#issuecomment-926395864

So basically all the non-raw getters return a value that WHATWG considers non-equivalent to the original input. This is also explained in more detail in the “Component retrieval” section (https://wiki.php.net/rfc/url_parsing_api#component_retrieval). I hope it's clearer now.

Regards,
Máté

Hi Ignace,

  1. around Uri\UninitializedUriException: if I look at the behaviour of
    DateTimeImmutable in the same scenario, or of a userland object,
    instead of an exception an Error is thrown

see:

Shouldn’t the URI feature follow the same path for consistency? Instead
of throwing an exception, it should throw an Error on the uninitialized
issue at least.

Yes, you are right! Uri\UninitializedUriException should indeed extend Error,
since people shouldn’t try to catch it either.

  2. around Normalization. In the case of query normalization, sorting the
    query string is not mentioned. Does it mean that with the current feature

`http://example.com?foo=bar&foo=rab`
is different from
`http://example.com?foo=rab&foo=bar`?

Yes, that’s the case; this feature is not implemented. As far as I can see though, it’s better
not to change the order of query parameters, especially the order of duplicated
parameters, in order not to accidentally change the intended meaning of the query string.
What’s your stance here?
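Reordering duplicated parameters is indeed observable with PHP's own parse_str(), where the last occurrence of a key wins:

```php
<?php
// With duplicated keys, parse_str() keeps the last occurrence, so
// reordering the query string changes the parsed result.
parse_str('foo=bar&foo=rab', $a);
parse_str('foo=rab&foo=bar', $b);
var_dump($a['foo']); // string(3) "rab"
var_dump($b['foo']); // string(3) "bar"
```

Other servers and frameworks may instead keep the first occurrence or collect all values, which is exactly why silently sorting could change the meaning.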

Máté

Hi Maté and all,

There is a pre-existing userland implementation of WHATWG URL at https://github.com/TRowbotham/URL-Parser. Packagist reports 600K+ downloads (https://packagist.org/packages/rowbot/url). It is from Trevor Rowbotham, who is acknowledged in the WHATWG URL Standard itself (https://url.spec.whatwg.org/).

(There is one alternative implementation, https://packagist.org/packages/esperecyan/url (https://github.com/esperecyan/url), but it does not look as recent or robust.)

If we want a full-featured WHATWG-URL implementation in core, would it not be preferable (with Trevor's permission) to convert rowbot/url from userland to core instead? Surely conversion from an existing, well-tested, widely-used implementation would be easier/better/faster than writing an implementation from scratch.

-- pmj

Hi Paul,

If we want a full-featured WHATWG-URL implementation in core, would it not be preferable (with Trevor’s permission) to convert rowbot/url from userland to core instead? Surely conversion from an existing, well-tested, widely-used implementation would be easier/better/faster than writing an implementation from scratch.

There’s no way I would have written an implementation from scratch. I’m using the url module of the Lexbor C library (https://github.com/lexbor/lexbor/) for handling WHATWG URLs. It’s already bundled in core, it’s battle tested, and it has exceptional maintenance. All I had to implement was the glue between userland and the C library.

Máté

Hi Maté,

On Mar 18, 2025, at 15:15, Máté Kocsis <kocsismate90@gmail.com> wrote:

There's no way I would have written an implementation from scratch. I'm using the url module of the Lexbor C library (https://github.com/lexbor/lexbor/) for handling WHATWG URLs. It's already bundled in core, it's battle tested, and it has exceptional maintenance.

I did not mean to imply writing a parser from scratch; my apologies for phrasing it poorly.

All I had to implement is the glue between userland and the C library.

That is more what I was getting at. Rowbot has a lot of what looks to be good design work on structures that come out of the parsing, in addition to a separate parser class.

The RFC might benefit from an explicit and intentional review of, and maybe incorporation of, some of the pre-existing Rowbot design work. At least one thing from Rowbot is absolutely not applicable to the RFC (e.g. the PSR-3 logging); maybe none of the rest of it will be applicable either, but as prior art from someone acknowledged in the WHATWG-URL spec, I think it bears your close attention.

As an overview, the following is a brief comparison between Rowbot and the RFC; any missed or misrepresented functionality is unintentional.

* * *

## RFC

One non-final readonly Url class:

- 5 getRaw...() methods, 8 get...() methods, and one get...ForDisplay() method
- immutability via 8 with...() methods, broadly expecting properly-encoded arguments, and soft-erroring on invalid characters
- a static parse() method, with relative parsing capability and a place to capture errors
- equals() to compare two URLs
- toString() for machine-friendly string recomposition
- toDisplayString() for human-friendly string recomposition
- resolve() to resolve a relative URL using the current URL as the base
- serialize/deserialize; "the serialized form only includes the recomposed URI itself exposed as the `__uri` field, but the individual properties or URI components are not present."
- no URLSearchParams implementation

## Rowbot

(None of the classes are readonly or final; these look to hew closely to the WHATWG-URL spec.)

A BasicURLParser class:

- affords relative parsing capability and an option parameter for the target URLRecord
- returns a URLRecord

A URLRecord class:

- public mutable properties for the URL components
- $scheme is a Scheme implementation with equals() and other is...() methods
- $host is a HostInterface (and implementations) with equals() and other is...() methods
- $path is a PathInterface (and PathList implementation) with PathSegment manipulation methods
- setUsername() and setPassword() mutators
- serializing
- getOrigin(), includesCredentials(), isEqual()

A URL class:

- Composed of a URLRecord and a URLSearchParams object
- Constructor takes a string, parses it to a URLRecord, and retains the URLRecord
- a static parse() method with relative parsing, as a convenience method
- __toString() and toString() return the serialized URLRecord
- Virtual properties for $href, $origin, $protocol, $username, $password, $host, $hostname, $port, $pathname, $search, $searchParams, $hash
- Mutability of virtual properties via magic __set()
- Readability of virtual properties via magic __get()

A URLSearchParams class:

- search params manipulation methods
- implements Countable, Iterator, Stringable
- composed of a QueryList implementation and (optionally) the originating URLRecord

* * *

-- pmj

On 17/03/2025 20:58, Máté Kocsis wrote:

Hi Ignace,

    1) around `Uri\UninitializedUriException`: if I look at the behaviour of
    `DateTimeImmutable` in the same scenario, or of a userland object,
    instead of an exception an Error is thrown

    see:

    - Online PHP editor | output for d4VrY
    - Online PHP editor | output for Wn7En

    Shouldn't the URI feature follow the same path for consistency? Instead
    of throwing an exception, it should throw an Error on the uninitialized
    issue at least.

Yes, you are right! Uri\UninitializedUriException should indeed extend Error,
since people shouldn't try to catch it either.

    2) around Normalization. In the case of query normalization, sorting the
    query string is not mentioned. Does it mean that with the current feature

    `http://example.com?foo=bar&foo=rab`
    is different from
    `http://example.com?foo=rab&foo=bar`?

Yes, that's the case; this feature is not implemented. As far as I can see though, it's better
not to change the order of query parameters, especially the order of duplicated
parameters, in order not to accidentally change the intended meaning of the query string.
What's your stance here?

Máté

Hi Maté,

Thanks for the clarifications. I asked about the latter because I am trying to create a polyfill using league/uri-interfaces, so my questions essentially come from trying to create a correct polyfill to better understand the new classes (specifically the RFC 3986 Uri). You can find the ongoing work here if you want to take a look.

While implementing the polyfill I am finding it easier, DX-wise, to make the constructor private and use named constructors for instantiation instead. I would be in favor of `Uri::parse` and `Uri::tryParse`, like it is currently done with enums and the `from` and `tryFrom` named constructors.

My reasoning is as follows:

There's no right or wrong way to instantiate a URI; there are only contexts. While the parse method is all about parsing a string, one could legitimately use other named constructors like `Uri::fromComponents`, which would take, for instance, the result of parse_url to build a new URI. This can come in handy in the case of RFC 3986 URIs if you need to create a new URI not related to the http scheme that does not use all the components, like the email, data, or FTP schemes.

By allowing URIs to be created from their respective component values, you make the class easier for devs to use. This also means that, if we want a balanced API, a `toComponents` method should come hand in hand with the named constructor.

I would understand if the idea of adding both component-related methods were rejected (they could be implemented in userland), but the main point was to show that, from the VO or the developer POV, in the absence of a clearly defined instantiation process, a traditional constructor fails to convey all the different ways to create a URI.
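A minimal userland sketch of that shape (method names follow the suggestion above; the validation via parse_url is only a stand-in, nothing here is part of the RFC):

```php
<?php
// Illustrative sketch of enum-style named constructors for a URI value object.
final class Uri
{
    private function __construct(private readonly string $uri) {}

    /** Throws on malformed input, like Enum::from(). */
    public static function parse(string $uri): static
    {
        if (parse_url($uri) === false) {
            throw new InvalidArgumentException("Invalid URI: $uri");
        }
        return new static($uri);
    }

    /** Returns null on malformed input, like Enum::tryFrom(). */
    public static function tryParse(string $uri): ?static
    {
        return parse_url($uri) === false ? null : new static($uri);
    }

    public function toString(): string
    {
        return $this->uri;
    }
}

$uri = Uri::parse('https://example.com/path');
var_dump($uri->toString()); // string(24) "https://example.com/path"
var_dump(Uri::tryParse('http://')); // NULL (seriously malformed)
```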

Hi Ignace & Maté & all,

On Mar 19, 2025, at 16:18, Ignace Nyamagana Butera <nyamsprod@gmail.com> wrote:

https://github.com/bakame-php/aide-uri/blob/main/src/Uri.php
While implementing the polyfill I am finding easier DX wise to make the constructor private and use instead named constructors for instantiation. I would be in favor of `Uri::parse` and `Uri::tryParse` like it is done currently with Enum and the `from` and `tryfrom` named constructors.

Hear, hear. Uri-Interop (https://github.com/uri-interop/interface) has discovered two interfaces in existing projects:

- one method with a `parseUri(string|Stringable $uriString): UriComponents` signature to parse a string and create a URI instance; and,

- a separate method with a `newUri(?string $scheme, ?string $username, ..., ?string $fragment) : UriComponents` signature that creates a URI instance from the individual components.

Neither of them dictates a constructor signature, but having the parser method separated from the factory method turns out to be quite useful. Presenting the two options as separate methods would reflect existing implementations.
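A rough PHP sketch of the two shapes described (the interface and method names follow the Uri-Interop description above; the full parameter list of `newUri` is abbreviated in the original, so the one shown here is a guess):

```php
<?php
// Rough sketch: the parser and the factory as separate concerns.
interface UriComponents
{
    public function getScheme(): ?string;
    public function getHost(): ?string;
    // ...other component getters elided
}

interface UriParser
{
    // Parses a URI string into a URI instance.
    public function parseUri(string|\Stringable $uriString): UriComponents;
}

interface UriFactory
{
    // Builds a URI instance from individual components.
    public function newUri(
        ?string $scheme = null,
        ?string $username = null,
        ?string $password = null,
        ?string $host = null,
        ?int $port = null,
        ?string $path = null,
        ?string $query = null,
        ?string $fragment = null,
    ): UriComponents;
}
```

Keeping the two apart means an implementation can satisfy only the side it actually supports, and neither forces a particular constructor signature.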

* * *

As a side note, the RFC uses the term `user` for the identifying part of the user info. It's perfectly reasonable, as `user` is the most commonly used term in existing URI projects (https://github.com/uri-interop/interface/blob/1.x/README-RESEARCH.md).

However, WHATWG-URL consistently calls it `username`, putting the URL portion of the RFC at odds with the WHATWG-URL spec. Calling it `username` would be more in line with the spec. That would likely mean calling it `username` in the URI portion of the RFC as well. (Uri-Interop reviewers found `username` more suitable as well: https://github.com/uri-interop/interface.)

-- pmj

Hi Paul,

Rowbot

(None of the classes are readonly or final; these look to hew closely to the WHATWG-URL spec.)

A BasicURLParser class:

  • affords relative parsing capability and an option parameter for the target URLRecord
  • returns a URLRecord

A URLRecord class:

  • public mutable properties for the URL components
  • $scheme is a Scheme implementation with equals() and other is…() methods
  • $host is a HostInterface (and implementations) with equals() and other is…() methods
  • $path is a PathInterface (and PathList implementation) with PathSegment manipulation methods
  • setUsername() and setPassword() mutators
  • serializing
  • getOrigin(), includesCredentials(), isEqual()

A URL class:

  • Composed of a URLRecord and a URLSearchParams object
  • Constructor takes a string, parses it to a URLRecord, and retains the URLRecord
  • a static parse() method with relative parsing, as a convenience method
  • __toString() and toString() return the serialized URLRecord
  • Virtual properties for $href, $origin, $protocol, $username, $password, $host, $hostname, $port, $pathname, $search, $searchParams, $hash
  • Mutability of virtual properties via magic __set()
  • Readability of virtual properties via magic __get()

I like some of the solutions this library uses (the usage of dedicated value objects for some components: Scheme, HostInterface, PathInterface), but
these features are what make the implementation extremely slow compared to the one the RFC proposes. I didn’t dig into the details when
I performed a very quick benchmark last week, so I can only assume that the excessive usage of objects makes the library much slower than what’s possible
even for a userland library (obviously, an internal C implementation will always be faster). According to my results, the RFC’s implementation was
two orders of magnitude faster than the Rowbot library when parsing a very basic “https://example.com” URL 1000 times (~0.002 sec vs ~0.56 sec).

What I want to say with this is that it’s perfectly fine to optimize a userland library for ergonomics and advanced OOP usage, but an internal
implementation should also keep efficiency in mind besides developer experience. That’s why I don’t see myself implementing separate objects for some of
the components for now. But nothing would block us from doing so later, if we found out it’s necessary.

I believe the most fundamental difference between the Rowbot library and the RFC is that the RFC has native support for percent-decoding (because
most properties are accessible in two variants), while the library completely leaves this task to the user. Apart from that, the mutable design of the library
is fragile for the same reason the DateTime class is not safe to use in most cases, so that’s definitely a no-go for me.

This RFC is a synthesis of almost a year of discussion and refinement, with contributions from some very clever folks who have a lot of hands-on experience with
URL parsing and handling. That’s why I would say that input from Trevor Rowbotham is also welcome in the discussion (especially his experience with some
edge cases he had to deal with), but the said library is nowhere near widely adopted enough to qualify as something we must definitely take into consideration
when designing PHP’s new URL parsing API.

A URLSearchParams class:

  • search params manipulation methods
  • implements Countable, Iterator, Stringable
  • composed of a QueryList implementation and (optionally) the originating URLRecord

I like this concept too. In fact, support for such a class is on my to-do list and is mentioned in the “Future Scope”. I just didn’t want to make the RFC
even longer, because we already have a lot of details to discuss.

Máté

Hi Dennis,

I am myself also a bit lost on the countless names that I tried out in the implementation, but I think I had toHumanFriendlyString() and toDisplayFriendlyString() methods at some point. These then ended up being toString() and toDisplayString() after some iterations. I would be ok with renaming getHost() and toString() so that their names suggest they don’t use IDNA, but I’d clearly need a good enough suggestion, since neither “MachineFriendly”, nor “NonDisplayable” sound like the best alternative for me. I was also considering using getIdnaHost() and toIdnaString(), but I realized these are the worst looking names I have come up with so far.

What about getPunycodeHost(), getUnicodeHost(), toPunycodeString(), toUnicodeString()? Or getAsciiHost() and toAsciiString() may also work. These are the best names I managed to come up with so far.

In the meantime, I renamed RFC 3986’s toString() methods too according to another suggestion:

  • toString() became toRawString()
  • toNormalizedString() became toString()

The new names mirror exactly what their getter counterparts do.

Máté

On Mar 25, 2025, at 3:23 PM, Máté Kocsis kocsismate90@gmail.com wrote:

Hi Dennis,

I am myself also a bit lost on the countless names that I tried out in the implementation, but I think I had toHumanFriendlyString() and toDisplayFriendlyString() methods at some point. These then ended up being toString() and toDisplayString() after some iterations. I would be ok with renaming getHost() and toString() so that their names suggest they don’t use IDNA, but I’d clearly need a good enough suggestion, since neither “MachineFriendly”, nor “NonDisplayable” sound like the best alternative for me. I was also considering using getIdnaHost() and toIdnaString(), but I realized these are the worst looking names I have come up with so far.

What about getPunycodeHost(), getUnicodeHost(), toPunycodeString(), toUnicodeString()? Or getAsciiHost() and toAsciiString() may also work. These are the best names I managed to come up with so far.

In the meantime, I renamed RFC 3986’s toString() methods too according to another suggestion:

  • toString() became toRawString()
  • toNormalizedString() became toString()

The new names mirror exactly what their getter counterparts do.

Máté

Hi Máté,

I’ve been pondering these names for the past week and a half and I couldn’t think of anything, but at first glance I like getUnicodeHost() and getAsciiHost(). These communicate the nuance a little, though they aren’t totally in-your-face (and in this case I wish there were a more obvious pair that is).

Other pairs I was toying with but don’t like are:

  • getPrintHost() / getDataHost()
  • getDisplayHost() / getAPIHost()
  • getDisplayHost() / getEncodedHost()
  • getDisplayHost() / getEscapedHost()

(the same pairs would apply to the other methods, like toDisplayString() / toEncodedString())

This seems to be taking a lot of effort and time, but thank you still for engaging with it — naming is hard! But it’s worth it.

On Mar 25, 2025, at 4:06 PM, Dennis Snell dennis.snell@automattic.com wrote:


Hi Máté,

I’ve been pondering these names for the past week and a half and I couldn’t think of anything, but at first glance I like getUnicodeHost() and getAsciiHost(). These communicate the nuance a little, though they aren’t totally in-your-face (and in this case I wish there were a more obvious pair that is).

Other pairs I was toying with but don’t like are:

  • getPrintHost() / getDataHost()
  • getDisplayHost() / getAPIHost()
  • getDisplayHost() / getEncodedHost()
  • getDisplayHost() / getEscapedHost()

(the same pairs would apply to the other methods, like toDisplayString() / toEncodedString())

This seems to be taking a lot of effort and time, but thank you still for engaging with it — naming is hard! But it’s worth it.

Just for fun I have tossed this into DeepSeek-R1 671B

WHATWG URLs have two representations: one for humans and one for machines. The reason for having two is that URLs may have IDNA domains which are punycode encoded and there are security issues around showing that to humans. For example, if a person reads “https://xn--google.com” they may assume that the domain belongs to Google, when in fact it points to “https://䕮䕵䕶䕱.com”. You are a modern programming language designer working on a standard library to expose a URL parser and you want the interface of this library to educate developers on where to use the appropriate representation. Given a URL object $u of class URL, propose two methods for converting that URL to a string. The name of the methods should communicate their use, and when a developer searches for the right method to get the string form, they should not be presented with a non-prefixed and prefixed pair like toString() and toHumanString(). Instead, the methods names should form a kind of symmetric pair like toEncodedString() and toDisplayString(). Use your knowledge of WHATWG URL nuances, browser security issues, human developers making typical mistakes, and propose at least ten pairs of words that could be used for returning these two different representations.

A few of the ideas that it returned which stuck out were:

  • toDataString() / toViewString() and getDataHost() / getViewHost()
  • toSerializedString() / toReadableString() and getSerializedHost() / getReadableHost()
  • toProcessingString() / toSafeDisplayString() and getProcessingHost() / getSafeDisplayHost()

After checking in the Gecko source code, I sadly only found helper methods which take a URL/URI and transform them:

  • prepareUrlForDisplay()
  • unEscapeURIForUI()

Node seems to punt on this by providing URL.format() with a { unicode: boolean } option. These all seem to miss the mark, in my opinion, because of how easy it is to assume that toString() or .host is what you’re after.
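As an aside, the two representations under discussion can be illustrated with ext-intl's idn_to_ascii()/idn_to_utf8() functions. This is only a sketch, assuming ext-intl is available; the xn--google.com / 䕮䕵䕶䕱.com pairing is the classic IDN homograph example:

```php
<?php
// Sketch: the two host representations, using ext-intl (assumed available).
// "xn--google.com" is the ASCII/punycode (ACE) form that DNS and URL
// comparisons operate on; it decodes to the Unicode form "䕮䕵䕶䕱.com".
$unicode = '䕮䕵䕶䕱.com';

// Machine-facing form (what a getAsciiHost()-style method would return).
$ascii = idn_to_ascii($unicode, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $ascii, "\n"; // xn--google.com

// Human-facing form (what a getUnicodeHost()-style method would return).
echo idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46), "\n"; // 䕮䕵䕶䕱.com
```

A library that only exposes one of these forms forces users to remember to call the intl functions themselves, which is exactly the kind of mistake the naming debate is trying to prevent.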

Thanks for entertaining the extra follow-up here.

Warmly,
Dennis Snell

Hi Ignace,

While implementing the polyfill, I am finding it easier, DX-wise, to make the constructor private and use named constructors for instantiation instead. I would be in favor of

Uri::parse and Uri::tryParse, like it is currently done with enums and their from and tryFrom named constructors.

My reasoning is as follows:

There is no right or wrong way to instantiate a URI; there are only contexts. While the parse method is all about parsing a string, one could legitimately use other named constructors, like Uri::fromComponents, which would for instance take the result of parse_url to build a new URI. This can come in handy in the case of an RFC 3986 URI if you need to create a new URI that is not related to the http scheme and does not use all the components, like the email, data, or FTP schemes.

By allowing URIs to be created from their respective component values, you make the class easier for developers to use. This also means that if we want a balanced API, a toComponents method should come hand in hand with the named constructor.

I would understand if the idea of adding both component-related methods were rejected; they could be implemented in userland. But the main point was to show that, from the VO's or the developer's POV, in the absence of a clearly defined instantiation process, a traditional constructor fails to convey all the different ways to create a URI.

There are a few things which came to my mind:

  • Currently, the underlying C libraries don’t support a fromComponents feature. How I could naively imagine this to work is that the components are recomposed to a URI string based on the relevant algorithm (for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3), and then this string is parsed and validated. Unfortunately, I recently realized that this approach may leave room for some kind of parsing confusion attack, namely when the scheme is for example “https”, the authority is empty, and the path is “example.com”. This will result in a https://example.com URI. I believe a similar bug is not possible with the rest of the components because they have their delimiters. So possibly some other solution will be needed, or maybe adding some additional validation (?).

  • Nicolas raised my awareness that if URIs didn't have a proper constructor, then one wouldn't be able to use URI objects as parameter default values, like below:
    function (Uri $foo = new Uri('blah'))
    I think this omission would cause some usability regression. For this reason, it may make sense to have a distinguished way of instantiating a Uri.

  • I have a similar feeling for a toComponents() method as for another named constructor instead of __construct(): I am not completely against it, but I’m not totally convinced about it.
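To make the recomposition concern in the first bullet concrete, here is an abridged sketch of the RFC 3986 section 5.3 algorithm (query and fragment omitted); recompose() is a hypothetical helper for illustration, not part of the RFC's API:

```php
<?php
// Abridged RFC 3986 section 5.3 recomposition (query/fragment omitted).
// recompose() is a hypothetical helper used only for illustration.
function recompose(?string $scheme, ?string $authority, string $path): string
{
    $result = '';
    if ($scheme !== null) {
        $result .= $scheme . ':';
    }
    if ($authority !== null) {
        // An authority that is *present but empty* still emits "//".
        $result .= '//' . $authority;
    }
    return $result . $path;
}

// Empty authority plus path "example.com" recomposes to "https://example.com"...
$uri = recompose('https', '', 'example.com');
echo $uri, "\n"; // https://example.com

// ...which reparses with "example.com" as the host, not the path.
var_dump(parse_url($uri)['host']); // string(11) "example.com"
```

This is the parsing confusion in a nutshell: the path component silently becomes the host on the round trip, which is why extra validation (or rejecting such component combinations outright) would be needed.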

Máté

On 27/03/2025 22:04, Máté Kocsis wrote:


Hi Máté,

    for RFC 3986: https://datatracker.ietf.org/doc/html/rfc3986#section-5.3),
    and then this string is parsed and validated. Unfortunately, I recently
    realized that this approach may leave room for some kind of parsing
    confusion attack, namely when the scheme is for example "https", the
    authority is empty, and the path is "example.com". This will result in
    a https://example.com URI. I believe a similar bug is not possible with
    the rest of the components because they have their delimiters. So
    possibly some other solution will be needed, or maybe adding some
    additional validation (?).

This is not correct according to RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax), section 3.3:

*When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//"). *

So in your example it should throw a Uri\InvalidUriException 🙂 for RFC 3986, while in the case of the WhatwgUrl algorithm it should trigger a soft error and correct the behaviour for the http(s) schemes.
This is also one of the many reasons why, at least for RFC 3986, the path component can never be `null`, but that's another discussion. Like I said, having a `fromComponents` named constructor would allow the "removal" of the need for a UriBuilder (in your Future Scope section) and would IMHO be useful outside of the context of the http(s) scheme, but I can understand it being left out of the current implementation; it might be brought back as a future improvement.

I have one last question regarding the URI implementations, which was raised by my polyfill work:

Did you also take into account the delimiters when submitting data via the withers? In other words, is

$uri->withQuery('?foo=bar');
//the same as
$uri->withQuery('foo=bar');

I know it is the case in the WHATWG specification, but I do not know whether you kept this behaviour in your implementation for WhatWgUrl, for Rfc3986, or for both. I would lean toward not accepting this "normalization", but since it is not documented in the RFC, I wanted to know the expected behaviour.
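For reference, the WHATWG behaviour in question can be sketched as follows; normalizeQueryInput() is a hypothetical helper mirroring the URL `search` setter, which strips a single leading "?" from its input (a sketch, not the RFC's implementation):

```php
<?php
// Hypothetical helper mirroring the WHATWG URL `search` setter, which
// drops one leading "?" delimiter before storing the query.
function normalizeQueryInput(string $query): string
{
    return str_starts_with($query, '?') ? substr($query, 1) : $query;
}

// Under this normalization, both calls store the same query:
var_dump(normalizeQueryInput('?foo=bar')); // string(7) "foo=bar"
var_dump(normalizeQueryInput('foo=bar'));  // string(7) "foo=bar"
```

Not applying this normalization would instead percent-encode or reject the leading "?", so the two calls would produce different URIs; that is the design choice the question is really about.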

Thanks for the hard work

Hi Maté and all,

On Mar 25, 2025, at 03:45, Máté Kocsis <kocsismate90@gmail.com> wrote:

Regarding Rowbot slowness compared to the RFC:

I can only assume that the excessive usage of objects makes the library much slower than what's possible
even for a userland library (obviously, an internal C implementation will always be faster). According to my results, the RFC's implementation was
**two orders of magnitude** faster than the Rowbot library for parsing a very basic "https://example.com" URL 1000 times (~0.002 sec vs ~0.56 sec).

I would not presume that the dedicated value objects are what "makes the [Rowbot] library much slower" than the RFC -- instead, my first intuition is that the *parsing* operations are slower in userland than in C, and are primarily responsible for the comparative slowness. Speedwise, creation of multiple objects from the parsed results would be a rounding error compared to the parsing itself.

What I want to say with this is that it's perfectly fine to optimize a userland library for ergonomics and for the usage of advanced OOP in mind, but an internal
implementation should also keep efficiency in mind besides developer experience. That's why I don't see myself implement separate objects for some of
the components for now. But nothing would block us from doing it later, if we found out it's necessary.

I think that's fair. The main thing that stands out to me is not the Scheme, Host, etc. value objects, but that the RFC presents no UrlRecord -- which is very definitely part of the WHATWG-URL specification. That is, from reading the spec, I'd expect to see a UrlRecord, and a Url composed from it.

I believe the most fundamental difference between the Rowbot library and the RFC is that the RFC has native support for percent-decoding (because
most properties are accessible in 2 variants), while the library completely leaves this task for the user.

I have some thoughts on that, but I'll save them for later.

Meanwhile, AFAICT, neither Rowbot nor the RFC provide a percent *en*coding mechanism, for consumers to put together properly-encoded values. Have I missed it in the RFC, or is it somehow not necessary, or something else?
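For what it's worth, PHP's standard library does ship low-level building blocks for percent-encoding, even if neither API wraps them in a component-aware way; a sketch:

```php
<?php
// rawurlencode() applies RFC 3986 percent-encoding to a single component
// value; http_build_query() assembles an encoded query string from pairs.
$value = 'a b&c/d';

echo rawurlencode($value), "\n"; // a%20b%26c%2Fd

// PHP_QUERY_RFC3986 encodes spaces as %20 (RFC 1738 would use "+").
echo http_build_query(['q' => $value], '', '&', PHP_QUERY_RFC3986), "\n"; // q=a%20b%26c%2Fd
```

These are context-free, though: they don't know which characters are allowed unescaped in which component, which is presumably what a first-class encoding mechanism in the API would add.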

This RFC is a synthesis of almost a year of discussion and refinement, a collaboration among some very clever folks who have a lot of hands-on experience with URL parsing and handling.

I would not presume otherwise! Even so, everyone makes mistakes and oversights from time to time, including very clever folks of the kind you describe above.

That's why I would say that input from Trevor Rowbotham is also welcome in the discussion (especially his experience of some edge cases he had to deal with)

I agree -- it would be great for the RFC team to seek him out and invite him to comment in this thread.

but the said library is nowhere near widely adopted enough to qualify as something we must definitely take into consideration
when designing PHP's new URL parsing API.

Not to be too blunt, but the Rowbot library is far more widely adopted than the RFC currently is; I think Rowbot represents an intersection of theory and practice that one would be unwise to discard without intentional and extensive consideration.

A URLSearchParams class:

I like this concept too. And in fact, support for such a class is on my to-do list, and is mentioned in the "Future Scope".

Because it is part of the WHATWG-URL spec, I think it deserves first-class treatment in this RFC ...

I just didn't want to make the RFC even longer, because we already have a lot of details to discuss.

... but yeah, the sheer volume of the RFC makes it difficult to review and pick apart.

Which leads to my last point: I would really like to see at least two separate RFCs here. They would be a lot easier to review and critique that way:

- one for dealing with URIs as they exist now, especially one that honors the ways-of-working that exist in userland; and,
- one for dealing with WHATWG-URL in its entirety, with all its differences (some subtle, some not) from URIs.

I can see arguments for either one being the "base" on which the other would build.

-- pmj

Hi Larry and everyone who took part in the final vs non-final debate,

Thought: make the class non-final, but all of the defined methods final, and any internal data properties private. That way we know that a child class cannot break any of the existing guarantees, but can still add convenience methods or static method constructors on top of the existing API, without the need for an interface and a very verbose composing class.

I thought about this a lot, hesitating over all the possibilities. In the end, I went with final classes. I know this is disappointing for everyone who wanted an unlocked implementation,
and I am still sympathetic to providing some kind of extension point. I synthesized my thoughts in a very lengthy section: https://wiki.php.net/rfc/url_parsing_api#why_should_the_uri_rfc3986_uri_and_the_uri_whatwg_url_classes_be_final
so please read my full reasoning there.

TLDR: First of all, let me clarify that I want to open up the API as soon as it becomes mature enough. However, based on the heated debate, we would surely need a lot more time to find the best solution, one that won't have unforeseen surprises. Since the final vs non-final question is a very small (but important) detail of the proposal, I would like to discuss it on its own, without affecting the whole work, and without risking the PHP 8.5 deadline. I really hope that this decision will bring the focus back to the most essential parts of the proposal that cannot be changed (or only with a lot of difficulty) once the feature goes live: I am mostly thinking about the percent encoding/decoding related behavior, just to name one thing.

Máté

Hi

On 3/12/25 23:00, Máté Kocsis wrote:

I'm not sure if I'm entirely correct, but it's possible that a 3rd party
URI implementation
won't (or cannot) use PHP's memory manager, and it relies on the regular
malloc:
in this case, even memory errors could lead to failures.

We already discussed this in private and the UriOperationException was removed from the RFC, but for public record:

Something like a memory allocation error is not actionable by the user. Thus it should be an `Error` (rather than an exception) or a bail out. Perhaps the engine will one day support gracefully handling the memory limit being exceeded with an OutOfMemoryError being thrown in that situation. Then it would also fit nicely for any URI implementation.

Best regards
Tim Düsterhus

Hi

On 3/22/25 15:01, Paul M. Jones wrote:

However, WHATWG-URL consistently calls it `username`, putting the URL portion of the RFC at odds with the WHATWG-URL spec. Calling it `username` would be more in line with the spec. That would likely mean calling it `username` in the URI portion of the RFC as well.

This makes sense to me. The WHATWG URL standard uses `username`; RFC 3986 uses `user`, but considers it deprecated in favor of the generic `userinfo`. `user` alone might be somewhat ambiguous, since it could refer to the entire `userinfo` section or to just the part before the first colon.
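PHP's own parse_url() illustrates the split being described here: it divides the RFC 3986 userinfo subcomponent at the first colon into 'user' and 'pass' keys, so its 'user' corresponds to what WHATWG calls `username`:

```php
<?php
// parse_url() splits userinfo ("alice:s3cr3t") at the first colon.
$parts = parse_url('https://alice:s3cr3t@example.com/');

var_dump($parts['user']); // string(5) "alice"
var_dump($parts['pass']); // string(6) "s3cr3t"
```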

Best regards
Tim Düsterhus