Hi Tim,
Thank you again for the thorough review!
The naming of these methods seems to be a little inconsistent. It should
either be:
->getHostForDisplay()
->toStringForDisplay()
or
->getDisplayHost()
->toDisplayString()
but not a mix between both of them.
Yes, I completely agree with your concern. I’m just not sure yet which combination I’d prefer.
Probably the latter one?
Yes. Besides the remark above, my previous arguments still apply (e.g.
with()
ers not being able to construct instances for subclasses,
requiring to override all of them). I’m also noticing that serialization
is unsafe with subclasses that add a $__uri
property (or perhaps any
property at all?).
Hm, yes, you are right indeed that withers cannot really create new instances on
their own because the whole URI string is needed to instantiate a new object… which is only
accessible if it’s reconstructed by swapping the relevant component with its new value.
Please note that trying to serialize a $__uri property will result in an exception.
The toDisplayString()
method that you mentioned above is not in the
RFC. Did you mean toHumanFriendlyString()
? Which one is correct?
The toHumanFriendlyString() method stuck there from a previous version of the proposal,
since then I converted it to toDisplayString().
The example output of the $errors
array does not match the stub. It
contains a failure
property, should that be softError
instead?
The $softError property is also an outdated name: I recently changed it to $failure
to be consistent with the wording that the WHATWG specification uses.
The RFC states “When trying to instantiate a WHATWG Url via its
constructor, a Uri\InvalidUriException is thrown when parsing results in
a failure.”
What happens for Rfc3986 when passing an invalid URI to the constructor?
Will an exception be thrown? What will the error array contain? Is it
perhaps necessary to subclass Uri\InvalidUriException for use with
WhatWgUrl, since $errors
is not applicable for 3986?
The first two questions are answered right at the top of the parsing section:
“the constructor: It expects a URI, and optionally, a base URL in order to support reference resolution.
When parsing is unsuccessful, a Uri\InvalidUriException is thrown.”
The $errors property will contain an empty array though, as you supposed. I don’t see much problem
with using the same exception in both cases, however I’m also fine with making the $errors property
nullable in order to indicate that returning errors is not supported by the implementation triggering
the error.
The RFC does not specify when UninitializedUriException
is thrown.
That’s a very good catch! I completely forgot about some exceptions. This one is used
for indicating that an URI is not correctly initialized: when a URI instance is created
without actually invoking the constructor, or the parse method, or __unserialize(),
then any methods that try to use the internally stored URI will trigger this exception.
The RFC does not specify when UriOperationException
is thrown.
Generally speaking I believe it would help understanding if you would
add a /** @throws InvalidUriException */
to each of the methods in the
stub to make it clear which ones are able to throw (e.g. resolve(), or
the withers). It’s harder to find this out from “English” rather than
“code” 
Good idea! I’ve added the PHPDoc as well as created a dedicated “Exceptions”
section.
In the “Component retrieval” section: Please add even more examples of
what kind of percent-decoding will happen. For example, it’s important
to know if %26
is decoded to &
in a query-string. Or if %3D
is
decoded to =
. This really is the same case as with %2F
in a path.
The explanation
Thanks for calling these cases out, I’ve significantly reworked the relevant sections.
First of all, I added much more details to the general overview about percent-encoding:
https://wiki.php.net/rfc/url_parsing_api#percent-encoding_decoding as well as extended
the https://wiki.php.net/rfc/url_parsing_api#component_retrieval section with more information
about the two component representations, and added a general clarification related to reserved
characters. Additionally, the https://wiki.php.net/rfc/url_parsing_api#component_modification section
makes it clear how percent-encoding is performed when the withers are used.
After thinking about the question a lot, finally the current encoding-decoding rules seem
logical to me, but please double-check them. It’s easy to misinterpret such long and complex
specifications.
Long story short: when parsing an URI or modifying a component, RFC 3986 fails hard if
an invalid character is found, while WHATWG implementation automatically percent-encodes
it while also triggering a soft-error.
While retrieving the “normalized-decoded” representation of a URI component, percent-decoding is
performed when possible:
- in case of RFC3986: reserved and invalid characters are not percent-decoded (only unreserved ones are)
- in case of WHATWG: invalid characters and characters with special meaning (that fall into the percent-encode set
of the given component) are not percent-decoded
The relevant sections will give a little more reasoning why I went with these rules.
“the URI is normalized (when applicable), and then the reserved
characters in the context of the given component are percent-decoded.
This means that only those reserved characters are percent-decoded that
are not allowed in a component. This behavior is needed to be able to
unambiguously retrieve components.”
alone is not clear to me. “reserved characters that are not allowed in a
component”. I assume this means that %2F
(/) in a path will not be
decoded, but %3F
(?) will, because a bare ?
can’t appear in a path?
I hope that this question is also clear after my clarifications + the reconsidered logic.
In the “Component retrieval” section: You compare the behavior of
WhatWgUrl and Rfc3986Uri. It would be useful to add something like:
$url->getRawScheme() // does not exist, because WhatWgUrl always
normalizes the scheme
Done.
to better point out the differences between the two APIs with regard to
normalization (it’s mentioned, but having it in the code blocks would
make it more visible).
Done.
In the “Component Modification” section, the RFC states that WhatWgUrl
will automatically encode ?
and #
as necessary. Will the same happen
for Rfc3986? Will the encoding of #
also happen for the query-string
component? The RFC only mentions the path component.
The above referenced sections will give a clear answer for this question as well.
TLDR: after your message, I realized that automatic percent-encoding also triggers a (soft)
error case for WHATWG, so I changed my mind with regards to Uri\Rfc3986\Uri,
so it won’t do any automatic percent-encoding. It’s unfortunate, because this behavior is not
consistent with WHATWG, but it’s more consistent with the parsing rules of its own specification,
where there are only hard errors, and there’s no such thing as “automatic correction”.
I’m also wondering if there are cases where the withers would not
round-trip, i.e. where $url->withPath($url->getPath())
would not
result in the original URL?
I am currently not aware of any such situation… I even wrote about this aspect fairly
long, because I think “roundtripability” is a very important attribute. Thank you for
raising awareness of this!
Can you add examples where the authority / host contains IPv6 literals?
It would be useful to specifically show whether or not the square
brackets are returned when using the getters. It would also be
interesting to see whether or not IPv6 addresses are normalized (e.g.
shortening 2001:db8:0:0:0:0:0:1
to 2001:db8::1
).
Good idea again! I’ve added an example containing an IPv6 host at the very
end of the component retrieval section. And yes, they will be enclosed within a pair as
per the spec.
It also surprised me, but IP address normalization is only performed by WHATWG
during recomposition! But nowhere else…
In “Component Recomposition” the RFC states “The
Uri\Rfc3986\Uri::toString() returns the unnormalized URI string”.
Does this mean that toString() for Rfc3986 will always return the
original input?
Yes, effectively that’s the case, only WHATWG modifies the input according to my knowledge.
In the past, I had the impression that RFC 3986 also did a few changes,
but then I had to realize that it was not the case after I had dug deep into the code of uriparser.
It would be useful to know whether or not the classes implement
__debugInfo()
/ how they appear when var_dump()
ing them.
I’ve added an example.
That’s all I managed to write for now, but I’ll try to answer the rest of the messages and feedback
as soon as possible. 
Regards,
Máté