Re: [PHP-DEV] [RFC] Decoding HTML and the Ambiguous Ampersand

On Aug 22, 2024, at 5:01 PM, Niels Dossche <dossche.niels@gmail.com> wrote:

On 20/08/2024 00:45, Dennis Snell wrote:

On Jul 9, 2024, at 4:55 PM, Dennis Snell <dennis.snell@a8c.com> wrote:

Greetings all,

The html_entity_decode( … ENT_HTML5 … ) function has a number of issues that I’d like to correct.

  • It’s missing 720 of HTML5’s specified named character references.
  • 106 of these are named character references which do not require a trailing semicolon, such as &acute
  • It’s unaware of the ambiguous ampersand rule, which allows these 106 in special circumstances.

HTML5 asserts that the list of named character references will not expand in the future. It can be found authoritatively at the following URL:

https://html.spec.whatwg.org/entities.json

The ambiguous ampersand rule smooths over legacy behavior from before HTML5, when ampersands were not properly encoded in attribute values, specifically in URL values. For example, in a query string for a search, one might find ?q=dog&not=cat. The &not in that value would decode to U+00AC (¬), but since it’s in an attribute value it will be left as plaintext. Inside normal HTML markup it would transform into ?q=dog¬=cat. There are related nuances when numeric character references are found at the end of a string or boundary without the semicolon.
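To illustrate the attribute-value side of the rule, here is a minimal sketch (the helper name is mine, purely for illustration):

function is_ambiguous_ampersand( string $html, int $after_name ): bool {
	// Per the HTML5 tokenizer, a semicolon-less named reference inside an
	// attribute value is left as plain text when the character that follows
	// it is "=" or an ASCII alphanumeric.
	if ( $after_name >= strlen( $html ) ) {
		return false;
	}
	$next = $html[ $after_name ];
	return '=' === $next || ctype_alnum( $next );
}

// In "?q=dog&not=cat" the character after the name "not" is the "=" at
// offset 10, so inside an attribute value "&not" must be left undecoded.
var_dump( is_ambiguous_ampersand( '?q=dog&not=cat', 10 ) ); // bool(true)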

The function signature of html_entity_decode() does not currently allow for correcting this behavior. I’d like to propose an RFC or a bug fix which either extends the function (perhaps by adding a new flag like ENT_AMBIGUOUS_AMPERSAND) or preferably creates a new function. For the missing character references I wonder if it would be enough to add them to the list of default translatable references.

One challenge with the existing function is that the concept of the translation table stands in contrast with the fixed and static nature of HTML5’s replacement tables. A new function or set of functions could open up spec-compliant decoding while providing helpful methods that are necessary in many common server-side operations:

  • html_decode( 'attribute' | 'data', $raw_text, $input_encoding = 'utf-8' )
  • html_text_contains( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )
  • html_text_starts_with( 'attribute' | 'data', $raw_haystack, $needle, $input_encoding = 'utf-8' )

These methods are handy for inspecting things like encoded attribute values in a memory- and processing-efficient way, when it’s not necessary to decode the entire value. In common situations, one encounters data URIs with potentially megabytes of image data, and processing only the first few or tens of bytes can save a lot of overhead.
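As a sketch of how that might look at a call site (html_text_starts_with() is only the proposal above, nothing that exists yet):

// Hypothetical usage: classify an image data URI from the raw, still-encoded
// attribute value without decoding the megabytes of base64 that follow.
$raw_src = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAAB';
if ( html_text_starts_with( 'attribute', $raw_src, 'data:image/png;base64,' ) ) {
	// Treat as a PNG without ever materializing the decoded value.
}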

We’re exploring pure-PHP solutions to these problems in WordPress in an attempt to improve the reliability and safety of handling HTML. I’d love to hear your thoughts and to know if anyone is willing to work with me to create an RFC or directly propose patches. We’ve created a step function which allows finding the next character reference and decoding it separately, enabling some novel features like highlighting the character references in source text.

Should I propose an RFC for this?

Warmly,
Dennis Snell
Automattic Inc.

Thanks everyone for your feedback so far on the decode_html() RFC [https://wiki.php.net/rfc/decode_html]

I’ve updated it, replacing the proposed constants with a new HtmlContext enum, and the interface seems much nicer this way. I particularly like how PHP enforces passing a valid value, vs. hoping that the right flag is used.

Additionally, I added a section that I had previously forgotten, which highlights the source of the infamous mojibake/gremlins: HTML has special rules for remapping the C1 control characters, as if they had been stored or encoded as Windows-1252.
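For a taste of what that remapping means in practice, here is a partial sketch of the table (the full 32-entry mapping is defined by the HTML5 numeric character reference rules):

// HTML5 remaps numeric references in the C1 range (0x80–0x9F) as if the
// document had been encoded as Windows-1252. A few entries:
const C1_REMAP = [
	0x80 => 0x20AC, // "&#x80;" decodes to € (EURO SIGN), not U+0080
	0x82 => 0x201A, // "&#x82;" decodes to ‚ (SINGLE LOW-9 QUOTATION MARK)
	0x93 => 0x201C, // "&#x93;" decodes to “ (LEFT DOUBLE QUOTATION MARK)
	0x99 => 0x2122, // "&#x99;" decodes to ™ (TRADE MARK SIGN)
];

function remap_c1( int $code_point ): int {
	return C1_REMAP[ $code_point ] ?? $code_point;
}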

Warmly,
Dennis Snell

Hi Dennis

+1 on the concept.
I just have two concerns:

Thanks Niels. I appreciate the help you’ve already provided on this process, and the work you’ve done with lexbor.

  1. I’m not so sure that the name “decode_html” is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

The existing html_entity_decode() is very close in naming but ties this concept into entities, and overlooks other basic text decoding concerns (newline normalization and NULL byte handling).

Originally I had “utf8” in the name but someone else thought it was too long and specific. I want the name to educate developers and also be terse. Naming is hard.

  2. I would strongly suggest exploring an implementation based on Lexbor. I’m pretty confident that it can be done by reusing the internal APIs. The advantage is that it will be less code to maintain. You pull off some fancy tricks in your implementation for performance reasons, but that also adds to complexity and maintenance burden. Also, since this is C, we must be extra careful when implementing tricks.

Yeah I agree and I’ll share more below. The tricks I’m using in my PR implementing the RFC are partly there to propose adoption into PHP and partly there to get a real sense of my algorithm vs. those found in Chrome, Firefox, Safari, and lexbor. I’ve attempted to build a search algorithm for named character references that optimizes for cache locality in contrast to algorithmic complexity where RAM access is assumed to be free.
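To be clear about what the lookup has to do, here is the naive shape of the problem (not the PR’s algorithm, just the behavior it must reproduce):

// Named references must match greedily: at "&notin;" the name "notin;"
// wins over the semicolon-less "not". A naive decoder scans the whole
// table; the PR's trick is making this lookup cache-friendly.
const REFS = [
	'not'    => "\u{00AC}", // ¬ (semicolon optional)
	'not;'   => "\u{00AC}", // ¬
	'notin;' => "\u{2209}", // ∉
];

function longest_named_match( string $text, int $at ): ?array {
	$best = null;
	foreach ( REFS as $name => $decoded ) {
		if ( substr( $text, $at, strlen( $name ) ) === $name
			&& ( null === $best || strlen( $name ) > strlen( $best[0] ) ) ) {
			$best = [ $name, $decoded ];
		}
	}
	return $best; // [ matched name, decoded text ] or null
}

var_dump( longest_named_match( '&notin;', 1 ) ); // [ "notin;", "∉" ]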

My code isn’t currently well documented and doesn’t meet the php-src coding standards, but the algorithm is pretty basic and easy to explain. It’s also mostly “unoptimized” for C. I think there are still large gains to be made that so far I’ve been unable to visualize incorporating into the lexbor parser. For example, decode_html() assumes we’re starting with a span of text that is already HTML text. We’re not making conditional decisions on whether the next byte produces a token that escapes out of the text parsing mode.

If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches, but I’ve so far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification, while much of the actual decoding can be safely done without this.

The other part is the extension aspect. I hear you that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use DOMDocument (though it shouldn’t anyway, given the wild problems it has when attempting to parse HTML).

People resort to html_entity_decode() because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.

I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that the language of the web (PHP) doesn’t have the tools to speak the language of the web (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or DOM\HtmlDocument would be such a non-starter for WordPress (accounting for 43% of the web today) that the new functionality would be completely unavailable to it.

And I do have the following thoughts:

  1. We should amend the ENT_HTML5-related docs now to note that it’s not spec-compliant.
  2. Perhaps ENT_HTML5 should be deprecated. E.g. you could say in your RFC that ENT_HTML5 will be deprecated in the release after the version that ships decode_html(). The reason I suggest the release after, and not the same release, is that I strongly believe we should have at least one version where the proper alternative is available before forcing a deprecation on users.

I love this suggestion. Just for reference, since I’ve looked before and not found it: can someone indicate where the PHP function documentation is maintained? There are a number of updates I would love to propose, but I don’t know where to find the source of the content that appears in https://www.php.net/manual/en/function.html-entity-decode.php, for instance.

Kind regards
Niels

Mad respect to the work you’ve brought to lexbor and to PHP. I’m excited to start relying on \DOM\HtmlDocument and have started using it in my benchmarks and HTML analysis as we develop the WordPress HTML API (a streaming, low memory-overhead, reentrant HTML parsing and manipulation framework in user-space PHP).

Dennis Snell

There is a link on all doc pages (named “Submit a Pull Request”); in this specific instance it leads to https://github.com/php/doc-en/blob/master/reference/strings/functions/html-entity-decode.xml

···

Best regards,
Bruce Weirdan <weirdan@gmail.com>

On 23.08.2024 at 01:02, Dennis Snell wrote:

If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches, but I’ve so far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification, while much of the actual decoding can be safely done without this.

The other part is the extension aspect. I hear you that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (though it shouldn’t anyway, given the wild problems it has when attempting to parse HTML).

People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.

I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that the new functionality would be completely unavailable to it.

Well, I don't think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it's only used
by ext/dom so it's bundled there, but if other parts of the php-src code
base would use it, we could put it elsewhere.

Christoph

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I’m not so sure that the name “decode_html” is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called “decode_html_text” and “decode_html_attribute”?
Consider the following reasons:

  1. The function doesn’t actually decode html as such, it decodes either an html text node string or an html attribute string.
  2. Saves the $context parameter and the constants/enums, making the call significantly shorter.
  3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen. But I somehow don’t foresee a lot of situations where text and attribute strings end up in the same code path?
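A sketch of what I mean, with hypothetical names, assuming the RFC’s decode_html() and enum underneath:

// Hypothetical convenience wrappers over the RFC's decode_html():
function decode_html_text( string $html ): string {
	return decode_html( HtmlContext::Text, $html );
}

function decode_html_attribute( string $html ): string {
	return decode_html( HtmlContext::Attribute, $html );
}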

A couple of other options that would silence anyone opposed to implicitly favouring utf-8:
html_text_to_utf8 and html_attribute_to_utf8

Best,
Jakob

On Aug 24, 2024, at 2:56 PM, Jakob Givoni <jakob@givoni.dk> wrote:

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I’m not so sure that the name “decode_html” is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called “decode_html_text” and “decode_html_attribute”?
Consider the following reasons:

  1. The function doesn’t actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

  2. Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to strike if we want to ride the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to already know that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and finds Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and decode_foreign_text_node() and decode_html_style(). Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.
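For reference, the enum I’ve been exploring in the PR looks roughly like this (case names from the list above; the final shape may differ):

enum HtmlContext {
	case Attribute;   // attribute values (the ambiguous-ampersand rule applies)
	case BodyText;    // normal text nodes in the BODY
	case ForeignText; // text inside SVG and MathML foreign content
	case Script;      // SCRIPT contents (no character reference decoding)
	case Style;       // STYLE contents (no character reference decoding)
}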

  3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using html_entity_decode(). The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose. Thankfully, in most places I’ve seen them blurred together, the intent of the surrounding code makes clear which is which.

preg_replace_callback(
	'~<a[^>]+href="([^"]+)"[^>]*>([^<]+)~',
	function ( $m ) {
		// e.g. rewriting HTML links as markdown-style links.
		$title = str_replace( ']', '\]', html_entity_decode( $m[2] ) );
		$url   = str_replace( ')', '\)', html_entity_decode( $m[1] ) );
		return "[{$title}]({$url})";
	},
	$post_content
);

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that Script is a different context, that’s interesting, I’ll try that and see how it’s different.”

But I somehow don’t foresee a lot of situations where text and attribute strings end up in the same code path?

The underlying reason I started this work was in support of building an HTML parser. We have a streaming parser which relies on a different parsing model than those built purely on the state machine in the specification, taking advantage of what we can to eke out performance in PHP code. For this, the strings are in the same path, and in this work I’ve come across a number of other common use-cases where the flow is the same but the decoder needs to know the context.

  • Normalizing HTML from “tag soup” to standard serialized form.
  • Sanitizing code wanting to inspect values from different parts of the markup.
  • Sanitizing rules engines providing configurations or DSLs for sanitization.
  • Live optimizers or analyzers to improve the output HTML leaving a server.

It’s one of those things that when it becomes trivial to start getting reliable transforms from the HTML syntax to the decoded text, more opportunities appear that never seemed practical before.

A couple of other options that would silence anyone opposed to implicitly favouring utf-8:
html_text_to_utf8 and html_attribute_to_utf8

The names started with these 😀. I do agree that it gets a bit excessive, though, to the point where it risks people not adopting them purely because they don’t want to type that long a name every time they use it. Perhaps some of these 🙃

str_from_html( HtmlContext $context, string $html ): string {}

utf8_from_html( HtmlContext $context, string $html ): string {}

html_to_utf8( HtmlContext $context, string $html ): string {}

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Dennis Snell

On Aug 24, 2024, at 7:47 AM, Christoph M. Becker <cmbecker69@gmx.de> wrote:

On 23.08.2024 at 01:02, Dennis Snell wrote:

If we could have a single implementation, that would be great. I do understand of course your concern that DOM is not a required extension, and therefore basing the internals on Lexbor makes it tied to the DOM extension which may not be available. I however suspect that a large chunk of people needing a function like this have DOM available (as DOM is required by many HTML-processing-related packages). I can also look into it sometime soon if you want; anyway feel free to ping me.

I’m also very open to lexbor-based approaches, but I’ve so far found it more complicated than I expected. In some part this is because it involves setting up the parser and state machine for the HTML specification, while much of the actual decoding can be safely done without this.

The other part is the extension aspect. I hear you that you would expect calling code to have the DOM extensions available, but that’s simply not the case when developing a platform like WordPress, which I do. We don’t have control over the servers or environments where people are deploying this, and the availability of the DOM extensions is low enough that WordPress code simply cannot use `DOMDocument` (though it shouldn’t anyway, given the wild problems it has when attempting to parse HTML).

People resort to `html_entity_decode()` because that’s the only option. In WordPress we now have a spec-compliant decoder, but as it’s in user-space PHP its performance is far below what’s possible.

I’d love your help in setting up lexbor’s state machine to decode text nodes. I’d love it even more if this could be part of the PHP language. It constantly surprises me that _the language of the web_ (PHP) doesn’t have the tools to speak _the language of the web_ (HTML). This RFC is all about taking a step towards ensuring that PHP developers can rely on PHP to be a reliable middle-man between the HTML domain and the PHP domain.

In other words, requiring the DOM extension or `DOM\HtmlDocument` would be such a non-starter for WordPress (accounting for 43% of the web today) that the new functionality would be completely unavailable to it.

Well, I don't think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it's only used
by ext/dom so it's bundled there, but if other parts of the php-src code
base would use it, we could put it elsewhere.

Having a DOM parser for HTML in PHP itself without requiring an extension would open up many new possibilities. For example, WordPress test suites don’t have any functional “assertEquivalentMarkup()” functions because there’s no spec-compliant parser in PHP. We finally wrote our own user-space HTML parser, but relying on `DOM\HtmlDocument` would be much easier.
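Even a naive sketch on top of PHP 8.4’s \Dom\HTMLDocument would go a long way (a real version would compare the DOM trees rather than serializations):

// Parse both fragments with a spec-compliant parser and compare the
// re-serialized markup; LIBXML_NOERROR suppresses parse-error warnings.
function assertEquivalentMarkup( string $expected, string $actual ): void {
	$a = \Dom\HTMLDocument::createFromString( $expected, LIBXML_NOERROR );
	$b = \Dom\HTMLDocument::createFromString( $actual, LIBXML_NOERROR );
	assert( $a->saveHtml() === $b->saveHtml() );
}

// Tag soup and its normalized form parse to the same document:
assertEquivalentMarkup( '<p>one<p>two', '<p>one</p><p>two</p>' );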

These test suites need to run on a variety of environments and PHP versions, so it’s moot thinking we could hasten the use of a native class to get the job done, but if it remains locked inside an optional extension, it may be borderline impossible to ever migrate to it.

Christoph

Dennis Snell

On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell <dennis.snell@automattic.com> wrote:

On Aug 24, 2024, at 2:56 PM, Jakob Givoni <jakob@givoni.dk> wrote:

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I’m not so sure that the name “decode_html” is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called “decode_html_text” and “decode_html_attribute”?
Consider the following reasons:

  1. The function doesn’t actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

  2. Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to strike if we want to ride the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to already know that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and finds Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and decode_foreign_text_node() and decode_html_style(). Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.

Yes. With 5 different contexts it’s starting to shift in favor of a single function 🙂
I only saw the RFC, which from what I can tell still only features 2 of them. I haven’t seen the PR (the RFC’s Implementation section says “Yet to come”).

  3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using html_entity_decode(). The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose.

Well, that was kind of what I meant - even if the differences are usually absent or subtle, they are significant (i.e. not necessarily big, but meaningful), meaning using it wrong would give the wrong result, right? Saying that they are not significantly different to me means that the result would just be a little less good sometimes, not directly wrong.

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that Script is a different context, that’s interesting, I’ll try that and see how it’s different.”

That is a good point and using enums favours that learning push since they are inherently grouped together.

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Cheers!

On Aug 25, 2024, at 3:15 AM, Jakob Givoni <jakob@givoni.dk> wrote:

On Sat, Aug 24, 2024 at 10:31 PM Dennis Snell <dennis.snell@automattic.com> wrote:

On Aug 24, 2024, at 2:56 PM, Jakob Givoni <jakob@givoni.dk> wrote:

Hi Dennis,

Overall it sounds like a reasonable RFC.

Dennis:

Niels:

I’m not so sure that the name “decode_html” is self-descriptive enough, it sounds very generic.

The name is not very important to me. For the sake of history, the reason I have chosen “decode HTML” is because, unlike an HTML parser, this is focused on taking a snippet of HTML “text” content and decoding it into a “plain PHP string.”

Why not make it two methods called “decode_html_text” and “decode_html_attribute”?
Consider the following reasons:

  1. The function doesn’t actually decode html as such, it decodes either an html text node string or an html attribute string.

Thanks Jakob. In WordPress I did just this.
https://developer.wordpress.org/reference/classes/wp_html_decoder/

Part of the reason for that was the inability to require something like an enum (due to PHP version support requirements). The Enum solution feels very nice too.

  2. Saves the $context parameter and the constants/enums, making the call significantly shorter.

In my PR I’ve actually expanded the Enum to include a few other contexts. I feel like there’s a balance we have to strike if we want to ride the line between fully reliable and fully convenient. On one hand, we could say “don’t send the text content of a SCRIPT element to this function!” But on the other hand, that kind of forces people to already know that SCRIPT content is different.

With the Enum there is that in-built training material when someone looks and finds Attribute | BodyText | ForeignText | Script | Style (the contexts I’ve explored in my PR).

We could make the same argument for decode_html_script() and decode_foreign_text_node() and decode_html_style(). Somehow the context feels cleaner to me, and like a single entry point for learning instead of five.

Yes. With 5 different contexts it’s starting to shift in favor of a single function 🙂
I only saw the RFC, which from what I can tell still only features 2 of them. I haven’t seen the PR (the RFC’s Implementation section says “Yet to come”).

Oops, I’ll get to this!

  3. It feels like decoding either text or attribute are two significantly different things. I admit I could be wrong, if code like decode_html($e->isAttribute() ? HtmlContext::Attribute : HtmlContext::Text, $e->getContent()) is likely to be seen.

None of these contexts are significantly different, which is one of the major dangers of using html_entity_decode(). The results will look just about right most of the time. It’s the subtle differences that matter most, I suppose.

Well, that was kind of what I meant - even if the differences are usually absent or subtle, they are significant (i.e. not necessarily big, but meaningful), meaning using it wrong would give the wrong result, right? Saying that they are not significantly different to me means that the result would just be a little less good sometimes, not directly wrong.

In hindsight I think I misunderstood what you were saying and got it backwards. I meant that the algorithms are subtly different but, as you point out, yes, the outcomes can be significant. In the better cases we get data corruption, but these differences can also lead to misidentification of unsafe content.

For example, “&#x6a;a\x00vascript” should decode to “javascript” when rendered by a browser inside the BODY of a page (the NULL byte is ignored in body text), but as an attribute value it should read “ja�vascript” (the NULL byte becomes U+FFFD).
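With the RFC’s interface, that difference is explicit at the call site (hypothetical calls, using the enum case names from my PR):

$raw = "&#x6a;a\x00vascript";

decode_html( HtmlContext::BodyText, $raw );  // "javascript": the NULL byte is ignored in body text
decode_html( HtmlContext::Attribute, $raw ); // "ja\u{FFFD}vascript": the NULL byte becomes U+FFFD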

The lesson I have drawn is that people frequently have what they understand to be a text node or an attribute value, but they aren’t aware that they are supposed to decode differently, and they also aren’t reaching to interact with a full parser to get these values. If PHP could train people as they use these functions, purely through their interfaces, I think that could help elevate the level of reliability out there in the wild, as long as they aren’t too cumbersome (hence explicitly no default context argument or using separately-named functions).

Having the Enum I think enhances the ease with which people can reliably also decode things like SCRIPT and STYLE nodes. “I know html_decode_text() but I don’t know what the rules for SCRIPT are or if they’re different so I’ll just stick with that.” vs “My IDE suggests that Script is a different context, that’s interesting, I’ll try that and see how it’s different.”

That is a good point and using enums favours that learning push since they are inherently grouped together.

Best,
Jakob

Thanks for your input. I’m grateful for the discussions and that people are sharing.

Cheers!

Warmly,
Dennis Snell

Hi Christoph, Dennis,

Well, I don’t think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it’s only used
by ext/dom so it’s bundled there, but if other parts of the php-src code
base would use it, we could put it elsewhere.

Exactly. You might be aware that I’m working on an “uri” extension (https://externals.io/message/123997)
and it also needs some parts of lexbor. My implementation currently depends on ext/dom
for simplicity’s sake; however, once the vote passes, this temporary solution will have to be changed.
Therefore we previously agreed with Niels that we would make lexbor an “internal extension” (similar to mysqlnd), or
at least find a way for it to be always available, just as Christoph said.

Regards,
Máté

On Aug 25, 2024, at 4:17 PM, Máté Kocsis <kocsismate90@gmail.com> wrote:

Hi Christoph, Dennis,

Well, I don’t think it would be a big deal to move the bundled lexbor to
somewhere where it is always available. I mean, so far it’s only used
by ext/dom so it’s bundled there, but if other parts of the php-src code
base would use it, we could put it elsewhere.

Exactly. You might be aware that I’m working on an “uri” extension (https://externals.io/message/123997)

Yes, and I had only briefly seen that before, but I’m excited, because I’ve long wanted to be able to properly parse URLs within PHP. I was myself also interested in seeing if we could get Ada into the language.

As with HTML parsing, I see much value in having additional interfaces that aren’t a DOM interface but which are designed for specific software purposes.

and it also needs some parts of lexbor. My implementation currently depends on ext/dom
for simplicity’s sake; however, once the vote passes, this temporary solution will have to be changed.
Therefore we previously agreed with Niels that we would make lexbor an “internal extension” (similar to mysqlnd), or
at least find a way for it to be always available, just as Christoph said.

With all the improvements going around PHP these days, I find it extremely important to finally be able to reliably and safely understand some of the most basic content that we produce and parse: HTML and URLs.

Although the user-space libraries are of varying completeness and quality, all of them suffer from the fact that it’s so challenging to efficiently parse most content using PHP. Getting these things baked into the language of the web will bring a potent uplift to the entire ecosystem, both because there will be less corruption and because performance won’t suffer in getting there.

Regards,
Máté