[PHP-DEV] [RFC][Concept] Data classes (a.k.a. structs)

ilutov · April 2, 2024, 12:17am

Hi everyone!

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

In a nutshell, data classes are classes with value semantics.
Instances of data classes are implicitly copied when assigned to a
variable, or when passed to a function. When the new instance is
modified, the original instance remains untouched. This might sound
familiar: It's exactly how arrays work in PHP.

$a = [1, 2, 3];
$b = $a;
$b[] = 4;
var_dump($a); // [1, 2, 3]
var_dump($b); // [1, 2, 3, 4]

You may think that copying the array on each assignment is expensive,
and you would be right. PHP uses a trick called copy-on-write, or CoW
for short. `$a` and `$b` actually share the same array until
`$b[] = 4;` modifies it. It's only at this point that the array is copied and
replaced in `$b`, so that the modification doesn't affect `$a`. As
long as a variable is the sole owner of a value, or none of the
variables modify the value, no copy is needed. Data classes use the
same mechanism.
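
The copy-on-write behaviour described above can be observed indirectly through memory usage. The following is only a rough sketch for current PHP: the array size is arbitrary and the exact byte counts depend on the PHP version and allocator, so it prints them rather than promising specific values.

```php
<?php
// Sketch: watch copy-on-write through memory usage (numbers vary by PHP version).
$a = range(1, 100000);

$before = memory_get_usage();
$b = $a;                           // no copy yet: both variables share one array
$afterAssign = memory_get_usage();

$b[] = 4;                          // first write to $b separates (copies) the array
$afterWrite = memory_get_usage();

$assignCost = $afterAssign - $before;
$writeCost  = $afterWrite - $afterAssign;

printf("assignment: %d bytes, first write: %d bytes\n", $assignCost, $writeCost);
```

The assignment should cost next to nothing, while the first write pays for the full copy.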

But why value semantics in the first place? There are two major flaws
with by-reference semantics for data structures:

1. It's very easy to forget cloning data that is referenced somewhere
else before modifying it. This will lead to "spooky actions at a
distance". Having recently used JavaScript (where all data structures
have by-reference semantics) for an educational IR optimizer,
accidental mutations of shared arrays/maps/sets were my primary source
of bugs.
2. Defensive cloning (to avoid issue 1) will lead to useless work when
the value is not referenced anywhere else.
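
Both issues can be reproduced with ordinary classes in current PHP. In this sketch the `Settings` class and the two helper functions are hypothetical names made up for illustration:

```php
<?php
// An ordinary class: assignment and argument passing share the same object.
final class Settings
{
    public array $flags = [];
}

function withDebug(Settings $settings): Settings
{
    // Issue 1: the clone was forgotten, so the caller's object is mutated too.
    $settings->flags[] = 'debug';
    return $settings;
}

function withDebugDefensive(Settings $settings): Settings
{
    // Issue 2: always safe, but wasted work if nobody else holds $settings.
    $copy = clone $settings;
    $copy->flags[] = 'debug';
    return $copy;
}

$base = new Settings();
$debug = withDebug($base);
var_dump($base->flags);   // now contains 'debug': action at a distance

$base2 = new Settings();
$debug2 = withDebugDefensive($base2);
var_dump($base2->flags);  // still empty
```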

PHP offers readonly properties and classes to address issue 1.
However, they further promote issue 2 by making it impossible to
modify values without cloning them first, even if we know they are not
referenced anywhere else. Some APIs further exacerbate the issue by
requiring multiple copies for multiple modifications (e.g.
`$response->withStatus(200)->withHeader('X-foo', 'foo');`).
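
To make the cost of chained "wither" calls concrete, here is a small sketch in current PHP. The `Response` class is a hypothetical stand-in (not any real HTTP library), and the static `$copies` counter exists only to count clones:

```php
<?php
// Hypothetical immutable-style Response; $copies counts clones for illustration only.
final class Response
{
    public static int $copies = 0;

    private int $status = 200;
    private array $headers = [];

    public function __clone()
    {
        self::$copies++;
    }

    public function withStatus(int $status): self
    {
        $copy = clone $this;
        $copy->status = $status;
        return $copy;
    }

    public function withHeader(string $name, string $value): self
    {
        $copy = clone $this;
        $copy->headers[$name] = $value;
        return $copy;
    }
}

$response = new Response();
$final = $response->withStatus(200)->withHeader('X-foo', 'foo');

echo Response::$copies, "\n"; // 2
```

Two modifications cost two clones, even though the intermediate object is thrown away immediately.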

As you may have noticed, arrays already solve both of these issues
through CoW. Data classes allow implementing arbitrary data structures
with the same value semantics in core, extensions or userland. For
example, a `Vector` data class may look something like the following:

data class Vector {
    private $values;

    public function __construct(...$values) {
        $this->values = $values;
    }

    public mutating function append($value) {
        $this->values[] = $value;
    }
}

$a = new Vector(1, 2, 3);
$b = $a;
$b->append!(4);
var_dump($a); // Vector(1, 2, 3)
var_dump($b); // Vector(1, 2, 3, 4)

An internal Vector implementation might offer a faster and stricter
alternative to arrays (e.g. Vector from php-ds).

Some other things to note about data classes:

* Data classes are ordinary classes, and as such may implement
interfaces, methods and more. I have not decided whether they should
support inheritance.
* Mutating method calls on data classes use a slightly different
syntax: `$vector->append!(42)`. All methods mutating `$this` must be
marked as `mutating`. The reason for this is twofold: 1. It signals to
the caller that the value is modified. 2. It allows `$vector` to be
cloned before knowing whether the method `append` is modifying, which
hugely reduces implementation complexity in the engine.
* Data classes customize identity (`===`) comparison, in the same way
arrays do. Two data objects are identical if all their properties are
identical (including order for dynamic properties).
* Sharing data classes by-reference is possible using references, as
you would for arrays.
* We may decide to auto-implement `__toString` for data classes,
amongst other things. I am still undecided whether this is useful for
PHP.
* Data classes protect from interior mutability. More concretely,
mutating nested data objects stored in a `readonly` property is not
legal, whereas it would be if they were ordinary objects.
* In the future, it should be possible to allow using data classes in
`SplObjectStorage`. However, because hashing is complex, this will be
postponed to a separate RFC.
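
For comparison, this is how arrays and ordinary objects behave today on three of the points above (identity, sharing by reference, and interior mutability). It is a sketch for PHP 8.1 or later; `Point` and `Circle` are hypothetical ordinary classes, not data classes:

```php
<?php
final class Point
{
    public function __construct(public int $x = 0, public int $y = 0) {}
}

// 1. Identity: arrays compare by content and order, ordinary objects by handle.
$sameArrays     = [1, 2] === [1, 2];                             // true
$reorderedKeys  = ['a' => 1, 'b' => 2] === ['b' => 2, 'a' => 1]; // false: order differs
$sameObjects    = new Point(1, 2) === new Point(1, 2);           // false: distinct handles
$looselyEqual   = new Point(1, 2) == new Point(1, 2);            // true: compares properties
var_dump($sameArrays, $reorderedKeys, $sameObjects, $looselyEqual);

// 2. Sharing a value type by reference works as it always has for arrays.
$a = [1, 2, 3];
$b = &$a;
$b[] = 4;
var_dump($a); // [1, 2, 3, 4], because $a and $b are the same variable slot

// 3. Interior mutability: readonly protects the property, not the object in it.
final class Circle
{
    public function __construct(public readonly Point $position) {}
}

$circle = new Circle(new Point(1, 2));
$circle->position->x = 0; // legal today for ordinary objects
var_dump($circle->position->x); // int(0)
```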

One known gotcha is that we cannot trivially enforce placement of
`mutating` on methods without a performance hit. It is the
responsibility of the user to correctly mark such methods.

Here's a fully functional PoC, excluding JIT:

https://github.com/php/php-src/pull/13800

[RFC] Implement data classes (WIP)
php:master ← iluuu1994:data-classes, opened by iluuu1994 on 25 Mar 2024 at 01:37 UTC, +2131 -90

This is some early experimentation on data classes, which are objects with value semantics (the same as arrays and strings). The main motivation is making data structures modeled with classes more ergonomic. For example, there's a desire for a faster `Vector` implementation (like the one from [php-ds](https://medium.com/p/9dda7af674cd)). However, reference types can lead to defensive copying to avoid accidental changes at a distance to an owned value, or bugs if copying is mistakenly omitted. `readonly` solves the latter by completely disallowing mutation. However, this essentially just forces a copy, which is bad for performance, especially for big data structures.

Data classes instead automatically "separate" (clone) themselves from any other reference only once a modification is made to the object. If it is only referenced from a single place, no copy is needed. These are also called CoW (copy-on-write) semantics.

Method calls pose an interesting problem: When there is a chain of property accesses, ending in a method call, the entire chain must be separated to avoid the change from leaking to other references. However, with the standard method call syntax, it is not clear in advance whether a call will refer to a self-mutating method (imagine a `Vector::push()` method, requiring separation), or an immutable method (not requiring separation). To signal to the engine that the chain needs to be separated, as it would for arrays on `$a['b']['c'] = 'c';`, we use the `$parent->children->push!($child);` syntax instead. If `$parent` is referenced from multiple places, it will be cloned and the clone will be stored in `$parent`. The same happens for the `children` property. This ensures that any other reference to the same instance as `$parent` remains unmodified.

*Side note*: We're looking for a different syntax for method calls, because Bob would like to use it for macros at some point.

**TODOs**:

- [x] `mutating` at decl-site support
- [x] Disallow `SplObjectStorage`/`WeakMap`. Hashing data classes is complex and should be solved in a separate RFC.
  - References are problematic for hashes. If a data object contains references, the reference may implicitly change the hash for objects that are already stored in some bucket. Any future lookup will be impossible. So, using data objects containing references as keys must necessarily clone the object and unref any references.
  - `SplObjectStorage` internally adds objects to an array with the handle as the key. This key is guaranteed to be unique. Using a hash for data classes will not be straightforward, because we cannot avoid hash collisions. Thus, we will either need support for data classes in arrays themselves, or we'll need to add a higher-level bucket to the underlying array so that we can handle collisions ourselves.
  - `WeakMap` is a weird use-case for data values, because their RCs change unpredictably. It might be best to disallow them as `WeakMap` keys. Moreover, `WeakMap` will not add a refcount for its object key, making it possible to change the hash by modifying the object in-place, leaving the object in the wrong bucket.
- [x] Disallow `ArrayObject` to avoid uncontrolled changes and integer keys
- [ ] Opcache/Optimizer
- [ ] RC inference
- [ ] JIT
- [x] Disallow `ReflectionProperty::setValue()` and `ReflectionMethod::invoke()`. They require `@prefer-ref` in userland for overrides of these methods.

**Benchmark**: Valgrind shows a small performance regression. The real-time showed a small slowdown of +0.07% for mean and +0.10% for fastest of 20 runs with `-T10,300` of the Symfony Demo benchmark.

Details:

```
Before:
2.995101 + 2.997169 + 2.998048 + 3.000360 + 3.000639 +
3.003076 + 3.003258 + 3.003537 + 3.003724 + 3.003811 +
3.004047 + 3.004114 + 3.004636 + 3.004637 + 3.004900 +
3.005175 + 3.005566 + 3.006546 + 3.007003 + 3.007804

After:
2.997989 + 3.000772 + 3.003236 + 3.003272 + 3.003358 +
3.003549 + 3.004557 + 3.004694 + 3.004716 + 3.004723 +
3.004893 + 3.005222 + 3.005734 + 3.007125 + 3.007395 +
3.007596 + 3.008235 + 3.008681 + 3.009244 + 3.010537

Mean: 100 / 3.00315755 * 3.0052764 = +0.07%
Fastest: 100 / 2.995101 * 2.997989 = +0.10%
```
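
The slowdown figures quoted in the benchmark can be recomputed from the 20 listed timings. This is just the arithmetic spelled out in PHP; the helper closures `$mean` and `$slowdown` are names chosen for this sketch:

```php
<?php
// Timings copied from the "Before" and "After" lists in the PR description.
$before = [
    2.995101, 2.997169, 2.998048, 3.000360, 3.000639,
    3.003076, 3.003258, 3.003537, 3.003724, 3.003811,
    3.004047, 3.004114, 3.004636, 3.004637, 3.004900,
    3.005175, 3.005566, 3.006546, 3.007003, 3.007804,
];
$after = [
    2.997989, 3.000772, 3.003236, 3.003272, 3.003358,
    3.003549, 3.004557, 3.004694, 3.004716, 3.004723,
    3.004893, 3.005222, 3.005734, 3.007125, 3.007395,
    3.007596, 3.008235, 3.008681, 3.009244, 3.010537,
];

$mean = fn(array $xs): float => array_sum($xs) / count($xs);
// Relative change in percent: (new / old - 1) * 100.
$slowdown = fn(float $old, float $new): float => ($new / $old - 1) * 100;

$meanSlowdown    = $slowdown($mean($before), $mean($after));
$fastestSlowdown = $slowdown(min($before), min($after));

printf("mean:    %+.2f%%\n", $meanSlowdown);    // mean:    +0.07%
printf("fastest: %+.2f%%\n", $fastestSlowdown); // fastest: +0.10%
```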

Let me know what you think. I will start working on an RFC draft once
work on property hooks concludes.

Ilija

Deleu · April 2, 2024, 12:56am

Hi everyone!

I’d like to introduce an idea I’ve played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

In a nutshell, data classes are classes with value semantics.
Instances of data classes are implicitly copied when assigned to a
variable, or when passed to a function. When the new instance is
modified, the original instance remains untouched. This might sound
familiar: It’s exactly how arrays work in PHP.
$a = [1, 2, 3];
$b = $a;
$b[] = 4;
var_dump($a); // [1, 2, 3]
var_dump($b); // [1, 2, 3, 4]
You may think that copying the array on each assignment is expensive,
and you would be right. PHP uses a trick called copy-on-write, or CoW
for short. $a and $b actually share the same array until $b[] = 4; modifies it. It’s only at this point that the array is copied and
replaced in $b, so that the modification doesn’t affect $a. As
long as a variable is the sole owner of a value, or none of the
variables modify the value, no copy is needed. Data classes use the
same mechanism.

But why value semantics in the first place? There are two major flaws
with by-reference semantics for data structures:

It’s very easy to forget cloning data that is referenced somewhere
else before modifying it. This will lead to “spooky actions at a
distance”. Having recently used JavaScript (where all data structures
have by-reference semantics) for an educational IR optimizer,
accidental mutations of shared arrays/maps/sets were my primary source
of bugs.

Defensive cloning (to avoid issue 1) will lead to useless work when
the value is not referenced anywhere else.

PHP offers readonly properties and classes to address issue 1.
However, they further promote issue 2 by making it impossible to
modify values without cloning them first, even if we know they are not
referenced anywhere else. Some APIs further exacerbate the issue by
requiring multiple copies for multiple modifications (e.g.
$response->withStatus(200)->withHeader('X-foo', 'foo');).

As you may have noticed, arrays already solve both of these issues
through CoW. Data classes allow implementing arbitrary data structures
with the same value semantics in core, extensions or userland. For
example, a Vector data class may look something like the following:
data class Vector {
    private $values;

    public function __construct(...$values) {
        $this->values = $values;
    }

    public mutating function append($value) {
        $this->values[] = $value;
    }
}

$a = new Vector(1, 2, 3);
$b = $a;
$b->append!(4);
var_dump($a); // Vector(1, 2, 3)
var_dump($b); // Vector(1, 2, 3, 4)
An internal Vector implementation might offer a faster and stricter
alternative to arrays (e.g. Vector from php-ds).

Exciting times to be a PHP Developer!

Some other things to note about data classes:

Data classes are ordinary classes, and as such may implement
interfaces, methods and more. I have not decided whether they should
support inheritance.

I’d argue in favor of not including inheritance in the first version. Taking inheritance out is an impossible BC Break. Not introducing it in the first stable release gives users a chance to evaluate whether it’s something we will drastically miss.

Mutating method calls on data classes use a slightly different
syntax: $vector->append!(42). All methods mutating $this must be
marked as mutating. The reason for this is twofold: 1. It signals to
the caller that the value is modified. 2. It allows $vector to be
cloned before knowing whether the method append is modifying, which
hugely reduces implementation complexity in the engine.

I’m not sure if I understood this one. Do you mean that the ! modifier here (at call-site) is helping the engine clone the variable before even diving into whether append() has been tagged as mutating? From outside it looks odd that a clone would happen ahead-of-time while talking about copy-on-write. Would this syntax break for non-mutating methods?

Data classes customize identity (===) comparison, in the same way
arrays do. Two data objects are identical if all their properties are
identical (including order for dynamic properties).

Sharing data classes by-reference is possible using references, as
you would for arrays.

We may decide to auto-implement __toString for data classes,
amongst other things. I am still undecided whether this is useful for
PHP.

Data classes protect from interior mutability. More concretely,
mutating nested data objects stored in a readonly property is not
legal, whereas it would be if they were ordinary objects.

In the future, it should be possible to allow using data classes in
SplObjectStorage. However, because hashing is complex, this will be
postponed to a separate RFC.

One known gotcha is that we cannot trivially enforce placement of
mutating on methods without a performance hit. It is the
responsibility of the user to correctly mark such methods.

Here’s a fully functional PoC, excluding JIT:
https://github.com/php/php-src/pull/13800

Let me know what you think. I will start working on an RFC draft once
work on property hooks concludes.

Ilija

Looking forward to this!!!


Marco Deleu

Alexander_Pravdin · April 2, 2024, 2:53am

On Tue, Apr 2, 2024 at 9:18 AM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:

Hi everyone!

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).
data class Vector {
    private $values;

    public function __construct(...$values) {
        $this->values = $values;
    }

    public mutating function append($value) {
        $this->values[] = $value;
    }
}

$a = new Vector(1, 2, 3);
$b = $a;
$b->append!(4);
var_dump($a); // Vector(1, 2, 3)
var_dump($b); // Vector(1, 2, 3, 4)

While I like the idea, I would like to suggest something else in
addition or as a separate feature. As an active user of readonly
classes with all promoted properties for data-holding purposes, I
would be happy to see the possibility of cloning them with passing
some properties to modify:

readonly class Data {
    function __construct(
        public string $foo,
        public string $bar,
        public string $baz,
    ) {}
}

$data = new Data(foo: 'A', bar: 'B', baz: 'C');

$data2 = clone $data with (bar: 'X', baz: 'Y');

Under the hood, this "clone" would copy all promoted property values
as they are, replacing only those the user specifies. Implementing
this in userland destroys the beauty of readonly classes with promoted
properties. A manual implementation takes many lines of code that tell
the reader nothing. The cloning methods end up bigger than the
meaningful part of the class (the constructor with its property
declarations), because I have to redeclare all the properties as
method arguments and then initialize each property with the
corresponding value. I love readonly classes with promoted properties
for data-holding purposes, and the above feature is the only one I'm
missing to be completely happy.
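
For context, one way to approximate this in userland today without redeclaring every property is a generic helper built on named-argument unpacking. This is only a sketch, assuming PHP 8.1 or later; the `with()` method is a hypothetical name, and it relies on the constructor parameters matching the property names:

```php
<?php
final class Data
{
    public function __construct(
        public readonly string $foo,
        public readonly string $bar,
        public readonly string $baz,
    ) {}

    // Hypothetical helper: rebuilds the object from its current properties,
    // overriding the ones passed as named arguments.
    public function with(string ...$changes): static
    {
        return new static(...array_merge(get_object_vars($this), $changes));
    }
}

$data = new Data(foo: 'A', bar: 'B', baz: 'C');
$data2 = $data->with(bar: 'X', baz: 'Y');

var_dump($data2->foo, $data2->bar, $data2->baz); // "A", "X", "Y"
var_dump($data->bar);                            // "B", the original is untouched
```

The trade-off is that the names in `with()` are not checked statically; an unknown name only fails at runtime when the constructor rejects it.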

In my personal experience, I have never needed to copy data classes
the way arrays are copied; immutability protects against unwanted
changes well enough. But copying only references helps to save
memory: some datasets I work with can be very big.

--
Best,
Alex

ilutov · April 2, 2024, 9:08am

Hi Marco

On Tue, Apr 2, 2024 at 2:56 AM Deleu <deleugyn@gmail.com> wrote:

On Mon, Apr 1, 2024 at 9:20 PM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

snip

Some other things to note about data classes:

* Data classes are ordinary classes, and as such may implement
interfaces, methods and more. I have not decided whether they should
support inheritance.

I'd argue in favor of not including inheritance in the first version. Taking inheritance out is an impossible BC Break. Not introducing it in the first stable release gives users a chance to evaluate whether it's something we will drastically miss.

I would probably agree. I believe the reasoning some languages don't
support inheritance for value types is because they are stored on the
stack. Inheritance encourages large structures, but copying very large
structures over and over on the stack may be slow.

In PHP, objects always live on the heap, and due to CoW we don't have
this problem. Still, it may be beneficial to disallow inheritance
first, and relax this restriction if it is necessary.

* Mutating method calls on data classes use a slightly different
syntax: `$vector->append!(42)`. All methods mutating `$this` must be
marked as `mutating`. The reason for this is twofold: 1. It signals to
the caller that the value is modified. 2. It allows `$vector` to be
cloned before knowing whether the method `append` is modifying, which
hugely reduces implementation complexity in the engine.

I'm not sure if I understood this one. Do you mean that the `!` modifier here (at call-site) is helping the engine clone the variable before even diving into whether `append()` has been tagged as mutating?

Precisely. The issue comes from deeper nested values:

$circle->position->zero();

Imagine that Circle is a data class with a Position, which is also a
data class. Position::zero() is a mutating method that sets the
coordinates to 0:0. For this to work, not only the position needs to
be copied, but also $circle. However, the engine doesn't yet know
ahead of time whether zero() is mutating, and as such needs to perform
a copy.

One idea was to evaluate the left-hand-side of the method call, and
repeat it with a copy if the method is mutating. However, this is not
trivially possible, because opcodes consume their operands. So, for an
expression like `getCircle()->position->zero()`, the return value of
`getCircle()` is already gone. `!` explicitly distinguishes the call
from non-mutating calls, so the engine knows that a copy will be needed.
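
To illustrate what separating the whole chain means, here is the manual equivalent with ordinary classes in current PHP. It is a sketch only: `Circle` and `Position` are plain classes here, and the explicit `clone` calls stand in for what the engine would do for a `!` call on data classes.

```php
<?php
final class Position
{
    public function __construct(public int $x, public int $y) {}

    public function zero(): void
    {
        $this->x = 0;
        $this->y = 0;
    }
}

final class Circle
{
    public function __construct(public Position $position) {}
}

$original = new Circle(new Position(3, 4));
$circle = $original; // a second handle to the same object

// Separate every level of the chain before mutating the innermost value.
$circle = clone $circle;                     // outer object first (clone is shallow)
$circle->position = clone $circle->position; // then the nested object
$circle->position->zero();

var_dump($original->position->x); // int(3), unaffected
var_dump($circle->position->x);   // int(0)
```

Skipping either clone would let the change leak into `$original`.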

But as mentioned previously, I think a different syntax offers
additional benefits for readability.

From outside it looks odd that a clone would happen ahead-of-time while talking about copy-on-write. Would this syntax break for non-mutating methods?

If by break you mean the engine would error, then yes. Only mutating
methods may (and must) be called with the $foo->bar!() syntax.

Ilija

ilutov · April 2, 2024, 9:24am

Hi Alexander

On Tue, Apr 2, 2024 at 4:53 AM Alexander Pravdin <alex.pravdin@interi.co> wrote:

On Tue, Apr 2, 2024 at 9:18 AM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:
>
> I'd like to introduce an idea I've played around with for a couple of
> weeks: Data classes, sometimes called structs in other languages (e.g.
> Swift and C#).

While I like the idea, I would like to suggest something else in
addition or as a separate feature. As an active user of readonly
classes with all promoted properties for data-holding purposes, I
would be happy to see the possibility of cloning them with passing
some properties to modify:

readonly class Data {
    function __construct(
        public string $foo,
        public string $bar,
        public string $baz,
    ) {}
}

$data = new Data(foo: 'A', bar: 'B', baz: 'C');

$data2 = clone $data with (bar: 'X', baz: 'Y');

What you're asking for is part of the "Clone with" RFC:
https://wiki.php.net/rfc/clone_with

This issue is valid and the RFC would improve the ergonomics of
readonly classes.

However, note that it really only addresses a small part of what this
RFC tries to achieve:

> Some APIs further exacerbate the issue by
> requiring multiple copies for multiple modifications (e.g.
> `$response->withStatus(200)->withHeader('X-foo', 'foo');`).

Readonly works fine for compact data structures, even if it is copied
more than it needs. For large data structures, like large lists, a
copy for each modification would be detrimental.

See how the performance of an insert into an array tanks if a copy of
the array is performed in each iteration (due to an additional
reference to it). Readonly is just not viable for data structures such
as lists, maps, sets, etc.
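
The effect can be reproduced with a small timing experiment in current PHP. This is a rough sketch rather than a rigorous benchmark: the function name `appendAll` and the element count are arbitrary, and absolute timings depend on the machine.

```php
<?php
// Appends $n integers to an array. When $keepSecondReference is true, a second
// variable holds the array before every write, so each append must copy it first.
function appendAll(int $n, bool $keepSecondReference): array
{
    $list = [];
    $snapshot = [];
    $start = hrtime(true);
    for ($i = 0; $i < $n; $i++) {
        if ($keepSecondReference) {
            $snapshot = $list; // refcount is now 2: the next write separates
        }
        $list[] = $i;
    }
    $elapsedMs = (hrtime(true) - $start) / 1e6;

    return [$elapsedMs, count($list), count($snapshot)];
}

$n = 10000;
[$fastMs] = appendAll($n, false);
[$slowMs, $count, $snapshotCount] = appendAll($n, true);

printf("in-place appends:   %.2f ms\n", $fastMs);
printf("copy before append: %.2f ms\n", $slowMs);
```

The first loop does linear work overall; the second copies the whole array on every iteration, so its cost grows quadratically with the number of elements.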

Ilija

Crell · April 2, 2024, 3:30pm

On Tue, Apr 2, 2024, at 12:17 AM, Ilija Tovilo wrote:

Hi everyone!

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

*gets popcorn*

In a nutshell, data classes are classes with value semantics.
Instances of data classes are implicitly copied when assigned to a
variable, or when passed to a function. When the new instance is
modified, the original instance remains untouched. This might sound
familiar: It's exactly how arrays work in PHP.
$a = [1, 2, 3];
$b = $a;
$b[] = 4;
var_dump($a); // [1, 2, 3]
var_dump($b); // [1, 2, 3, 4]
You may think that copying the array on each assignment is expensive,
and you would be right. PHP uses a trick called copy-on-write, or CoW
for short. `$a` and `$b` actually share the same array until
`$b[] = 4;` modifies it. It's only at this point that the array is copied and
replaced in `$b`, so that the modification doesn't affect `$a`. As
long as a variable is the sole owner of a value, or none of the
variables modify the value, no copy is needed. Data classes use the
same mechanism.

But why value semantics in the first place? There are two major flaws
with by-reference semantics for data structures:

1. It's very easy to forget cloning data that is referenced somewhere
else before modifying it. This will lead to "spooky actions at a
distance". Having recently used JavaScript (where all data structures
have by-reference semantics) for an educational IR optimizer,
accidental mutations of shared arrays/maps/sets were my primary source
of bugs.
2. Defensive cloning (to avoid issue 1) will lead to useless work when
the value is not referenced anywhere else.

PHP offers readonly properties and classes to address issue 1.
However, they further promote issue 2 by making it impossible to
modify values without cloning them first, even if we know they are not
referenced anywhere else. Some APIs further exacerbate the issue by
requiring multiple copies for multiple modifications (e.g.
`$response->withStatus(200)->withHeader('X-foo', 'foo');`).

As you may have noticed, arrays already solve both of these issues
through CoW. Data classes allow implementing arbitrary data structures
with the same value semantics in core, extensions or userland. For
example, a `Vector` data class may look something like the following:
data class Vector {
    private $values;

    public function __construct(...$values) {
        $this->values = $values;
    }

    public mutating function append($value) {
        $this->values[] = $value;
    }
}

$a = new Vector(1, 2, 3);
$b = $a;
$b->append!(4);
var_dump($a); // Vector(1, 2, 3)
var_dump($b); // Vector(1, 2, 3, 4)
An internal Vector implementation might offer a faster and stricter
alternative to arrays (e.g. Vector from php-ds).

Some other things to note about data classes:

* Data classes are ordinary classes, and as such may implement
interfaces, methods and more. I have not decided whether they should
support inheritance.

What would be the reason not to? As you indicated in another reply, the main reason some languages don't is to avoid large stack copies, but PHP doesn't have large stack copies for objects anyway so that's a non-issue.

I've long argued that the fewer differences there are between service classes and data classes, the better, so I'm not sure what advantage this would have other than "ugh, inheritance is such a mess" (which is true, but that ship sailed long ago).

* Mutating method calls on data classes use a slightly different
syntax: `$vector->append!(42)`. All methods mutating `$this` must be
marked as `mutating`. The reason for this is twofold: 1. It signals to
the caller that the value is modified. 2. It allows `$vector` to be
cloned before knowing whether the method `append` is modifying, which
hugely reduces implementation complexity in the engine.

As discussed in R11, it would be very beneficial if this marker could be on the method definition, not the method invocation. You indicated that would be Hard(tm), but I think it's worth some effort to see if it's surmountably hard. (Or at least less hard than just auto-detecting it, which you indicated is Extremely Hard(tm).)

* Data classes customize identity (`===`) comparison, in the same way
arrays do. Two data objects are identical if all their properties are
identical (including order for dynamic properties).
* Sharing data classes by-reference is possible using references, as
you would for arrays.

* We may decide to auto-implement `__toString` for data classes,
amongst other things. I am still undecided whether this is useful for
PHP.

For reference:

Java record classes auto-generate equals(), toString(), hashCode(), and same-name methods (we don't need that).

Kotlin data classes auto-generate equals(), toString(), hashCode(), same-name methods, and a copy() method that is basically what we've been discussing as clone-with.

C# record classes auto-generate equals() and ToString(), and are immutable. They also support "with expressions" ($foo with { new args }, basically clone-with).

C# record structs auto-generate equals() and ToString(), and are mutable. (Go figure.)

Python data classes are highly configurable, but by default generate toString(), a var-dump-targeted string (__repr__), a hash function, and some other Python-specific things with no PHP-equivalent. They can also opt-in to generating ordering overrides (op overloads), being readonly (frozen), or being named-args-only.

Swift structs, from what I can find just briefly, don't seem to auto-generate anything. (I could be wrong here.)

(In basically all cases above, providing your own implementation in the data class overrides the default generated one.)

The concept doesn't exist in C/++, Go, or Rust, at least not in a usefully equivalent way. TypeScript doesn't seem to have them from what I can find.

So to the extent there is a consensus, equality, stringifying, and a hashcode (which we don't have yet, but will need in the future for some things I suspect) seem to be the rough expected defaults.

* Data classes protect from interior mutability. More concretely,
mutating nested data objects stored in a `readonly` property is not
legal, whereas it would be if they were ordinary objects.
* In the future, it should be possible to allow using data classes in
`SplObjectStorage`. However, because hashing is complex, this will be
postponed to a separate RFC.

Would data class properties only be allowed to be other data classes, or could they hold a non-data class? My knee-jerk response is they should be data classes all the way down; the only counter-argument I can think of would be how much existing code is out there that is a "data class" in all but name. I still fear someone adding a DB connection object to a data class and everything going to hell, though.

One known gotcha is that we cannot trivially enforce placement of
`mutating` on methods without a performance hit. It is the
responsibility of the user to correctly mark such methods.

Here's a fully functional PoC, excluding JIT:
https://github.com/php/php-src/pull/13800

Let me know what you think. I will start working on an RFC draft once
work on property hooks concludes.

Ilija

--Larry Garfield

Robert_Landers · April 2, 2024, 4:02pm

On Tue, Apr 2, 2024 at 2:20 AM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:

Hi everyone!

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

In a nutshell, data classes are classes with value semantics.
Instances of data classes are implicitly copied when assigned to a
variable, or when passed to a function. When the new instance is
modified, the original instance remains untouched. This might sound
familiar: It's exactly how arrays work in PHP.
$a = [1, 2, 3];
$b = $a;
$b[] = 4;
var_dump($a); // [1, 2, 3]
var_dump($b); // [1, 2, 3, 4]
You may think that copying the array on each assignment is expensive,
and you would be right. PHP uses a trick called copy-on-write, or CoW
for short. `$a` and `$b` actually share the same array until
`$b[] = 4;` modifies it. It's only at this point that the array is copied and
replaced in `$b`, so that the modification doesn't affect `$a`. As
long as a variable is the sole owner of a value, or none of the
variables modify the value, no copy is needed. Data classes use the
same mechanism.

But why value semantics in the first place? There are two major flaws
with by-reference semantics for data structures:

1. It's very easy to forget cloning data that is referenced somewhere
else before modifying it. This will lead to "spooky actions at a
distance". Having recently used JavaScript (where all data structures
have by-reference semantics) for an educational IR optimizer,
accidental mutations of shared arrays/maps/sets were my primary source
of bugs.
2. Defensive cloning (to avoid issue 1) will lead to useless work when
the value is not referenced anywhere else.

PHP offers readonly properties and classes to address issue 1.
However, they further promote issue 2 by making it impossible to
modify values without cloning them first, even if we know they are not
referenced anywhere else. Some APIs further exacerbate the issue by
requiring multiple copies for multiple modifications (e.g.
`$response->withStatus(200)->withHeader('X-foo', 'foo');`).

As you may have noticed, arrays already solve both of these issues
through CoW. Data classes allow implementing arbitrary data structures
with the same value semantics in core, extensions or userland. For
example, a `Vector` data class may look something like the following:
data class Vector {
    private $values;

    public function __construct(...$values) {
        $this->values = $values;
    }

    public mutating function append($value) {
        $this->values[] = $value;
    }
}

$a = new Vector(1, 2, 3);
$b = $a;
$b->append!(4);
var_dump($a); // Vector(1, 2, 3)
var_dump($b); // Vector(1, 2, 3, 4)
An internal Vector implementation might offer a faster and stricter
alternative to arrays (e.g. Vector from php-ds).

Some other things to note about data classes:

* Data classes are ordinary classes, and as such may implement
interfaces, methods and more. I have not decided whether they should
support inheritance.
* Mutating method calls on data classes use a slightly different
syntax: `$vector->append!(42)`. All methods mutating `$this` must be
marked as `mutating`. The reason for this is twofold: 1. It signals to
the caller that the value is modified. 2. It allows `$vector` to be
cloned before knowing whether the method `append` is modifying, which
hugely reduces implementation complexity in the engine.
* Data classes customize identity (`===`) comparison, in the same way
arrays do. Two data objects are identical if all their properties are
identical (including order for dynamic properties).
* Sharing data classes by-reference is possible using references, as
you would for arrays.
* We may decide to auto-implement `__toString` for data classes,
amongst other things. I am still undecided whether this is useful for
PHP.
* Data classes protect from interior mutability. More concretely,
mutating nested data objects stored in a `readonly` property is not
legal, whereas it would be if they were ordinary objects.
* In the future, it should be possible to allow using data classes in
`SplObjectStorage`. However, because hashing is complex, this will be
postponed to a separate RFC.
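
The array behaviour that the identity bullet refers to can be checked on current PHP. The sketch below puts it next to what ordinary objects do today; `Point` is a made-up class and nothing here uses the proposed syntax.

```php
<?php
final class Point
{
    public function __construct(public int $x, public int $y) {}
}

// Arrays: === compares keys, values, types and order.
var_dump(['x' => 1, 'y' => 2] === ['x' => 1, 'y' => 2]); // bool(true)
var_dump(['x' => 1, 'y' => 2] === ['y' => 2, 'x' => 1]); // bool(false): order differs
var_dump(['x' => 1, 'y' => 2] ==  ['y' => 2, 'x' => 1]); // bool(true)

// Ordinary objects: === compares the handle, == compares class and properties.
$p = new Point(1, 2);
$q = new Point(1, 2);
$alias = $p;
var_dump($p === $q);     // bool(false): two separate objects
var_dump($p == $q);      // bool(true)
var_dump($p === $alias); // bool(true): same handle
```

Under the proposal, `$p === $q` would presumably behave like the first array comparison rather than like the handle comparison.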

One known gotcha is that we cannot trivially enforce placement of
`mutating` on methods without a performance hit. It is the
responsibility of the user to correctly mark such methods.

Here's a fully functional PoC, excluding JIT:
[RFC] Implement data classes (WIP) by iluuu1994 · Pull Request #13800 · php/php-src · GitHub

Let me know what you think. I will start working on an RFC draft once
work on property hooks concludes.

Ilija

Neat! I've been playing around with "value-like" objects for a while now.

Having inheritance supported would be useful. For example, consider an ID type:

data class Id {
    public function __construct(public string $id) {}
}

Maybe you want to extend it to a UserId:

data class UserId extends Id {}

Now you can't accidentally pass a VideoId as a UserId, but underlying
ORMs can still use both as an Id.

Robert Landers
Software Engineer
Utrecht NL

Bruce_Weirdan · April 2, 2024, 6:51pm

On Tue, Apr 2, 2024 at 8:05 PM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:

Equality for data objects is based on data, rather than the object
handle.

I believe equality should always consider the type of the object.

new Problem(size:'big') === new Universe(size:'big')
&& new Problem(size:'big') === new Shoe(size:'big');

If the above can ever be true then I'm not sure how big the problem is
(but probably very big).
Also see the examples of non-comparable ids: `new CompanyId(1)`
should not be equal to `new PersonId(1)`.

And I'd find it very confusing if the following crashed:

function f(Universe $_u): void {}
$universe = new Universe(size:'big');
$shoe = new Shoe(size:'big');

if ($shoe === $universe) {
    // shoe is *identical* to the universe, so it should be
    // accepted wherever the universe is
    f($shoe);
}

--
Best regards,
Bruce Weirdan mailto:weirdan@gmail.com

ilutov · April 2, 2024, 6:04pm

Hi Larry

On Tue, Apr 2, 2024 at 5:31 PM Larry Garfield <larry@garfieldtech.com> wrote:

On Tue, Apr 2, 2024, at 12:17 AM, Ilija Tovilo wrote:
> Hi everyone!
>
> I'd like to introduce an idea I've played around with for a couple of
> weeks: Data classes, sometimes called structs in other languages (e.g.
> Swift and C#).
>
> * Data classes are ordinary classes, and as such may implement
> interfaces, methods and more. I have not decided whether they should
> support inheritance.

What would be the reason not to? As you indicated in another reply, the main reason some languages don't is to avoid large stack copies, but PHP doesn't have large stack copies for objects anyway so that's a non-issue.

I've long argued that the fewer differences there are between service classes and data classes, the better, so I'm not sure what advantage this would have other than "ugh, inheritance is such a mess" (which is true, but that ship sailed long ago).

One issue that just came to mind is object identity. For example:

class Person {
    public function __construct(
        public string $firstname,
        public string $lastname,
    ) {}
}

class Manager extends Person {
    public function bossAround() {}
}

$person = new Person('Boss', 'Man');
$manager = new Manager('Boss', 'Man');
var_dump($person === $manager); // ???

Equality for data objects is based on data, rather than the object
handle. How does this interact with inheritance? Technically, Person
and Manager represent the same data. Manager contains additional
behavior, but does that change identity?

I'm not sure what the answer is. That's just the first thing that came
to mind. I'm confident we'll discover more such edge cases. Of course,
I can invest the time to find the questions before deciding to
disallow inheritance.

> * Mutating method calls on data classes use a slightly different
> syntax: `$vector->append!(42)`. All methods mutating `$this` must be
> marked as `mutating`. The reason for this is twofold: 1. It signals to
> the caller that the value is modified. 2. It allows `$vector` to be
> cloned before knowing whether the method `append` is modifying, which
> hugely reduces implementation complexity in the engine.

As discussed in R11, it would be very beneficial if this marker could be on the method definition, not the method invocation. You indicated that would be Hard(tm), but I think it's worth some effort to see if it's surmountably hard. (Or at least less hard than just auto-detecting it, which you indicated is Extremely Hard(tm).)

I think you misunderstood. The intention is to mark both call-site and
declaration. Call-site is marked with ->method!(), while declaration
is marked with "public mutating function". Call-site is required to
avoid the engine complexity, as previously mentioned. But
declaration-site is required so that the user (and IDEs) even know
that you need to use the special syntax at the call-site.

So to the extent there is a consensus, equality, stringifying, and a hashcode (which we don't have yet, but will need in the future for some things I suspect) seem to be the rough expected defaults.

I'm just skeptical whether the default __toString() is ever useful. I
can see an argument for it for quick debugging in languages that don't
provide something like var_dump(). In PHP this seems much less useful.
It's impossible to provide a default implementation that works
everywhere (or pretty much anywhere, even).

Equality is already included. Hashing should be added separately, and
probably not just to data classes.

> * In the future, it should be possible to allow using data classes in
> `SplObjectStorage`. However, because hashing is complex, this will be
> postponed to a separate RFC.

Would data class properties only be allowed to be other data classes, or could they hold a non-data class? My knee-jerk response is they should be data classes all the way down; the only counter-argument I can think of would be how much existing code is out there that is a "data class" in all but name. I still fear someone adding a DB connection object to a data class and everything going to hell, though.

Disallowing ordinary by-ref objects is not trivial without additional
performance penalties, and I don't see a good reason for it. Can you
provide an example of when that would be problematic?

Ilija

Niels_Dossche · April 2, 2024, 6:14pm

Hi Ilija

Thank you for this proposal, I like the idea of having value semantic objects available.
I pulled your branch and played with it a bit.

As already hinted in the thread, I also think inheritance may be dangerous in a first version.
I want to add to that: if you extend a data-class with a non-data-class, the data-class behaviour gets lost, which is logical in a sense but also surprised me in a way.

Also, FWIW, I'm not sure about the name "data" class, perhaps "value" class or something alike is what people may be more familiar with wrt semantics, although dataclass is also a known term.

I do have a question about iterator behaviour. Consider this code:

data class Test {
        public $a = 1;
        public $b = 2;
}

$test = new Test;
foreach ($test as $k => &$v) {
        if ($k === "b")
                $test->a = $test;
        var_dump($k);
}

This will reset the iterator of the object on separation, so we will get an infinite loop.
Is this intended?
If so, is it because the right hand side is the original object while the left hand side gets the clone?
Is this consistent with how arrays separate?
(Note: I haven't really looked at your code)

Kind regards
Niels

Rob_Landers · April 2, 2024, 7:37pm

On Tue, Apr 2, 2024, at 20:51, Bruce Weirdan wrote:
> On Tue, Apr 2, 2024 at 8:05 PM Ilija Tovilo <tovilo.ilija@gmail.com> wrote:
> > Equality for data objects is based on data, rather than the object
> > handle.
>
> I believe equality should always consider the type of the object.
>
> new Problem(size:'big') === new Universe(size:'big')
> && new Problem(size:'big') === new Shoe(size:'big');
>
> If the above can ever be true then I'm not sure how big is the problem
> (but probably very big).
> Also see the examples of non-comparable ids - `new CompanyId(1)`
> should not be equal to `new PersonId(1)`
>
> And I'd find it very confusing if the following crashed
>
> function f(Universe $_u): void {}
> $universe = new Universe(size:'big');
> $shoe = new Shoe(size:'big');
>
> if ($shoe === $universe) {
>     // shoe is *identical* to the universe, so it should be
>     // accepted wherever the universe is
>     f($shoe);
> }
>
> --
> Best regards,
> Bruce Weirdan mailto:weirdan@gmail.com

I’d love to see it so that equality was more like == for regular objects. If the type matches and the data matches, it’s true. It’d be really helpful to be able to downcast types though. Such as in my user id example I gave earlier. Once it reaches a certain point in the code, it doesn’t matter that it was once a UserId, it just matters that it is currently an Id.
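
For reference, this is what `==` does for regular objects on current PHP: it requires the same class as well as equal properties. The classes below reuse the ID example and are ordinary classes, since the `data class` syntax is not available.

```php
<?php
class Id
{
    public function __construct(public string $id) {}
}

final class UserId extends Id {}
final class VideoId extends Id {}

var_dump(new UserId('1') == new UserId('1'));  // bool(true): same class, same data
var_dump(new UserId('1') == new VideoId('1')); // bool(false): same data, different class
var_dump(new UserId('1') == new Id('1'));      // bool(false): a subclass is a different class too
```

With these rules a `UserId` never compares equal to a plain `Id`, so treating one as the other would need an explicit conversion step.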

Now that I think about it, decoration might be better than inheritance here and inheritance might make more sense to be banned. In other words, this might be just as simple and easy to use:

data class Id {
    public function __construct(public string $id) {}
}

data class UserId {
    public function __construct(public Id $id) {}
}

Though it would be really interesting to use them as “traits” for each other to say “this data class can be converted to another type, but information will be lost” where they are 100% separate types but can be “cast” to specified types.

// "use" has all the same rules as extends, but
// UserId is not an Id; it can be converted to an Id
data class UserId use Id {
    public function __construct(public string $id, public string $name) {}
}

$user = new UserId('123', 'rob');
$id = (Id) $user;
var_dump($user !== $id); // true

$id is 100% Id and lost all its “userness.” Hmm. Interesting indeed. Probably not practical, but interesting.

— Rob

Rowan_Tommins_IMSoP · April 2, 2024, 8:10pm

On 02/04/2024 01:17, Ilija Tovilo wrote:

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

Hi Ilija,

I'm really interested to see how this develops. A couple of thoughts that immediately occurred to me...

I'm not sure if you've considered it already, but mutating methods should probably be constrained to be void (or maybe "mutating" could occupy the return type slot). Otherwise, someone is bound to write this:

$start = new Location('Here');
$end = $start->move!('There');

Expecting it to mean this:

$start = new Location('Here');
$end = $start;
$end->move!('There');

When it would actually mean this:

$start = new Location('Here');
$start->move!('There');
$end = $start;

I seem to remember that, when this was discussed before, the argument was made that separating value objects completely means you have to spend time deciding how they interact with every feature of the language.

Does the copy-on-write optimisation actually require the entire class to be special, or could it be triggered by a mutating method on any object? To allow direct modification of properties as well, we could move the call-site marker slightly to a ->! operator:

$foo->!mutate();
$foo->!bar = 42;

The first would be the same as your current version: it would perform a CoW reference separation / clone, then call the method, which would require a "mutating" marker. The second would essentially be an optimised version of $foo = clone $foo with [ 'bar' => 42 ]

During the method call or write operation, readonly properties would allow an additional write, as is the case in __clone and the "clone with" proposal. So a "pure" data object would simply be declared with the existing "readonly class" syntax.

The main drawback I can see (outside of the implementation, which I can't comment on) is that we couldn't overload the === operator to use value semantics. In exchange, a lot of decisions would simply be made for us: they would just be objects, with all the same behaviour around inheritance, serialization, and so on.

Regards,

--
Rowan Tommins
[IMSoP]

Deleu · April 2, 2024, 8:49pm

If there is a class made up of 90% data struct and 10% non-data struct, the 90% could be extracted into a true data struct and be referenced in the existing regular class, making it even more organized in terms of establishing what’s “data” and what’s “service”. I would really favor making it “data class” all the way down.

I understand you disagree with the argument against inheritance, but to me the same logic applies here. Making it data class only allows for lifting the restriction in the future, if necessary (requiring another RFC vote). Making it mixed on version 1 means that support for the mixture of them can never be undone.

Marco Deleu

ilutov · April 2, 2024, 10:35pm

Hi Niels

On Tue, Apr 2, 2024 at 8:16 PM Niels Dossche <dossche.niels@gmail.com> wrote:

On 02/04/2024 02:17, Ilija Tovilo wrote:
> Hi everyone!
>
> I'd like to introduce an idea I've played around with for a couple of
> weeks: Data classes, sometimes called structs in other languages (e.g.
> Swift and C#).

As already hinted in the thread, I also think inheritance may be dangerous in a first version.
I want to add to that: if you extend a data-class with a non-data-class, the data-class behaviour gets lost, which is logical in a sense but also surprised me in a way.

Yes, that's definitely not intended. I haven't implemented any
inheritance checks yet. But if inheritance is allowed, then it should
be restricted to classes of the same kind (by-ref or by-val).

Also, FWIW, I'm not sure about the name "data" class, perhaps "value" class or something alike is what people may be more familiar with wrt semantics, although dataclass is also a known term.

I'm happy with value class, struct, record, data class, what have you.
I'll accept whatever the majority prefers.

I do have a question about iterator behaviour. Consider this code:

data class Test {
        public $a = 1;
        public $b = 2;
}

$test = new Test;
foreach ($test as $k => &$v) {
        if ($k === "b")
                $test->a = $test;
        var_dump($k);
}

This will reset the iterator of the object on separation, so we will get an infinite loop.
Is this intended?
If so, is it because the right hand side is the original object while the left hand side gets the clone?
Is this consistent with how arrays separate?

That's a good question. I have not really thought about iterators yet.
Modification of an array iterated by-reference does not restart the
iterator. Actually, by-reference capturing of the value also captures
the array by-reference, which is not completely intuitive.

My initial gut feeling is to handle data classes the same, i.e.
capture them by-reference when iterating the value by reference, so
that iteration is not restarted.
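
The array behaviour described above can be observed on current PHP (7.0 or later). This sketch uses plain arrays only; the variable names are made up for the example.

```php
<?php
// By-value iteration works on a snapshot of the array.
$byValue = ['a' => 1, 'b' => 2];
$seenByValue = [];
foreach ($byValue as $k => $v) {
    if ($k === 'b') {
        $byValue['c'] = 3; // separates $byValue from the snapshot being iterated
    }
    $seenByValue[] = $k;
}

// By-reference iteration walks the live array.
$byRef = ['a' => 1, 'b' => 2];
$seenByRef = [];
foreach ($byRef as $k => &$v) {
    if ($k === 'b') {
        $byRef['c'] = 3; // visible to the running loop, which does not restart
    }
    $seenByRef[] = $k;
}
unset($v); // drop the reference left over from the loop

echo implode(', ', $seenByValue), "\n"; // a, b
echo implode(', ', $seenByRef), "\n";   // a, b, c
```

In the by-reference loop the appended element is visited once and the earlier keys are not repeated, which is the behaviour data classes would mirror under the approach sketched above.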

Ilija

Crell · April 2, 2024, 10:02pm

On Tue, Apr 2, 2024, at 6:04 PM, Ilija Tovilo wrote:

What would be the reason not to? As you indicated in another reply, the main reason some languages don't is to avoid large stack copies, but PHP doesn't have large stack copies for objects anyway so that's a non-issue.

I've long argued that the fewer differences there are between service classes and data classes, the better, so I'm not sure what advantage this would have other than "ugh, inheritance is such a mess" (which is true, but that ship sailed long ago).

One issue that just came to mind is object identity. For example:

class Person {
    public function __construct(
        public string $firstname,
        public string $lastname,
    ) {}
}

class Manager extends Person {
    public function bossAround() {}
}

$person = new Person('Boss', 'Man');
$manager = new Manager('Boss', 'Man');
var_dump($person === $manager); // ???

Equality for data objects is based on data, rather than the object
handle. How does this interact with inheritance? Technically, Person
and Manager represent the same data. Manager contains additional
behavior, but does that change identity?

I'm not sure what the answer is. That's just the first thing that came
to mind. I'm confident we'll discover more such edge cases. Of course,
I can invest the time to find the questions before deciding to
disallow inheritance.

As Bruce already demonstrated, equality should include type, not just properties. Even without inheritance that is necessary.

There may be good reason to omit inheritance, as we did on enums, but that shouldn't be the starting point. (I'd have to research and see what other languages do. I think it's a mixed bag.) We should try to ferret out those edge cases and see if there's reasonable solutions to them.

> * Mutating method calls on data classes use a slightly different
> syntax: `$vector->append!(42)`. All methods mutating `$this` must be
> marked as `mutating`. The reason for this is twofold: 1. It signals to
> the caller that the value is modified. 2. It allows `$vector` to be
> cloned before knowing whether the method `append` is modifying, which
> hugely reduces implementation complexity in the engine.

As discussed in R11, it would be very beneficial if this marker could be on the method definition, not the method invocation. You indicated that would be Hard(tm), but I think it's worth some effort to see if it's surmountably hard. (Or at least less hard than just auto-detecting it, which you indicated is Extremely Hard(tm).)

I think you misunderstood. The intention is to mark both call-site and
declaration. Call-site is marked with ->method!(), while declaration
is marked with "public mutating function". Call-site is required to
avoid the engine complexity, as previously mentioned. But
declaration-site is required so that the user (and IDEs) even know
that you need to use the special syntax at the call-site.

Ah, OK. That's... unfortunate, but I defer to you on the implementation complexity.

So to the extent there is a consensus, equality, stringifying, and a hashcode (which we don't have yet, but will need in the future for some things I suspect) seem to be the rough expected defaults.

I'm just skeptical whether the default __toString() is ever useful. I
can see an argument for it for quick debugging in languages that don't
provide something like var_dump(). In PHP this seems much less useful.
It's impossible to provide a default implementation that works
everywhere (or pretty much anywhere, even).

Equality is already included. Hashing should be added separately, and
probably not just to data classes.

The equivalent of Python's __repr__ (which it auto-generates) would be __debugInfo(). Arguably its current output is what the default would likely be anyway, though. I believe the typical auto-toString output is the same data, but presented in a more human-friendly way. (So yes, mainly useful for debugging.)

Equality, well, we've already debated whether or not we should make that a general feature. Of note, though, in languages with equals(), it's also user-overridable.

> * In the future, it should be possible to allow using data classes in
> `SplObjectStorage`. However, because hashing is complex, this will be
> postponed to a separate RFC.

I believe this is where we would want/need a __hash() method or similar; Derick and I encountered that while researching collections in other languages. Leaving it out for now is fine, but it would be important for any future list-of functionality.

Would data class properties only be allowed to be other data classes, or could they hold a non-data class? My knee-jerk response is they should be data classes all the way down; the only counter-argument I can think of would be how much existing code is out there that is a "data class" in all but name. I still fear someone adding a DB connection object to a data class and everything going to hell, though.

Disallowing ordinary by-ref objects is not trivial without additional
performance penalties, and I don't see a good reason for it. Can you
provide an example of when that would be problematic?

Ilija

There's two aspects to it, that I see.

data class A {
  public function __construct(public string $name) {}
}

data class B {
  public function __construct(
    public A $a,
    public PDO $conn,
  ) {}
}

$b = new B(new A('Bob'), $pdoConnection);

function stuff(B $b2) {
  $b2->a->name = 'Larry';
  // This triggers a CoW on $b2, separating it from $b, and also creating a new instance of A. What about $conn?
  // Does it get cloned? That would be bad. Does it not get cloned? That seems weird that it's still the same on
  // a data object.

  $b2->conn->beginTransaction();
  // This I would say is technically a modification, since the state of the connection is changing. But then
  // should this trigger $b2 cloning from $b? Neither answer is obvious to me.
}

In a sense, it's similar to the "PSR-7 is immutable, asterisk, streams" issue that has often been pointed out. "Data objects are safe to pass around and will self-clone when needed, asterisk, unless there's a normal object in it and then it's non-obvious" doesn't sound like a good mental model to give people.

Or consider DateTime. It's mutable. Should mutating it clone an object that has a DateTime property? I can realistically argue both ways, and I'm not convinced either is right; just that neither is intuitive.

"Data classes all the way down" resolves this problem.

The caveat would be that a genuinely immutable object would (probably?) be safe (DateTimeImmutable, or a readonly class), so maybe we can make readonly classes an exception? Ah, no, we cannot, because despite what PHPStan insists, there's no reason that the single write to a readonly property must happen at construction. It can easily happen as a side effect of another method (eg, a cache value), meaning readonly objects are not truly immutable. In fact, readonly objects can have non-readonly objects on their properties, too. So I don't think that's safe, either.
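
Both readonly caveats can be reproduced on current PHP (8.1 or later). The sketch below uses a hypothetical `Report` class: one readonly property is written lazily from a method rather than the constructor, and another holds a mutable `ArrayObject`.

```php
<?php
final class Report
{
    public readonly string $checksum; // intentionally not set in the constructor

    public function __construct(
        public readonly string $body,
        public readonly ArrayObject $tags, // readonly handle, mutable object
    ) {}

    public function checksum(): string
    {
        if (!isset($this->checksum)) {
            // the single permitted write happens here, as a side effect
            $this->checksum = md5($this->body);
        }
        return $this->checksum;
    }
}

$report = new Report('hello', new ArrayObject());

$before = isset($report->checksum); // false: not yet initialized
$report->checksum();
$after = isset($report->checksum);  // true: state changed after construction

$report->tags->append('draft');     // legal: interior mutability
$tagCount = count($report->tags);   // 1

try {
    $report->tags = new ArrayObject(); // replacing the handle is what readonly forbids
    $reassigned = true;
} catch (Error $e) {
    $reassigned = false;
}

var_dump($before, $after, $tagCount, $reassigned);
```

So a readonly object can change observably after construction, which is why exempting readonly classes from a "data classes all the way down" rule would not be safe.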

The other aspect is, eg, serialization. People will come to expect (reasonably) that a data class will have certain properties (in the abstract sense, not lexical sense). For instance, most classes are serializable, but a few are not. (Eg, if they have a reference to PDO or a file handle or something unserializable.) Data classes seem like they should be safe to serialize always, as they're "just data". If data classes are limited to primitives and data classes internally, that means we can effectively guarantee that they will be serializable, always. If one of the properties could be a non-serializable object, that assumption breaks.
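
A small sketch of that serialization concern on current PHP follows. Both classes are made up, and a `Closure` stands in for PDO as the non-serializable member so that the example needs no database extension.

```php
<?php
final class PlainData
{
    public function __construct(public string $name, public array $tags = []) {}
}

final class WithHandle
{
    // Assumption for the example: Closure plays the role of a PDO connection.
    public function __construct(public string $name, public Closure $callback) {}
}

// Primitives and arrays round-trip without any extra work.
$copy = unserialize(serialize(new PlainData('Larry', ['php'])));
var_dump($copy == new PlainData('Larry', ['php'])); // bool(true)

// One non-serializable property makes the whole object non-serializable.
try {
    serialize(new WithHandle('Larry', fn () => null));
    $failed = false;
} catch (Throwable $e) {
    $failed = true;
}
var_dump($failed); // bool(true)
```

If properties were restricted to primitives and other data classes, the second case could not arise.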

There are probably other similar examples besides serialization where "think of this as data" and "think of this as logic" is how you'd want to think, which leads to different assumptions, which we shouldn't stealthily break.

--Larry Garfield

ilutov · April 2, 2024, 11:01pm

Hi Rowan

On Tue, Apr 2, 2024 at 10:10 PM Rowan Tommins [IMSoP]
<imsop.php@rwec.co.uk> wrote:

On 02/04/2024 01:17, Ilija Tovilo wrote:

I'd like to introduce an idea I've played around with for a couple of
weeks: Data classes, sometimes called structs in other languages (e.g.
Swift and C#).

I'm not sure if you've considered it already, but mutating methods should probably be constrained to be void (or maybe "mutating" could occupy the return type slot). Otherwise, someone is bound to write this:

$start = new Location('Here');
$end = $start->move!('There');

Expecting it to mean this:

$start = new Location('Here');
$end = $start;
$end->move!('There');

When it would actually mean this:

$start = new Location('Here');
$start->move!('There');
$end = $start;

I think there are some valid patterns for mutating methods with a
return value. For example, Set::add() might return a bool to indicate
whether the value was already present in the set.
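
As a rough illustration of that pattern, here is a hypothetical userland set written as an ordinary class on current PHP, so without `data class` or `mutating`. The name `IntSet` and the exact meaning of the return value are assumptions made for the example.

```php
<?php
final class IntSet
{
    /** @var array<int, true> values stored as keys for constant-time lookup */
    private array $items = [];

    // Mutates the set and reports whether the value was already present.
    public function add(int $value): bool
    {
        $alreadyPresent = isset($this->items[$value]);
        $this->items[$value] = true;
        return $alreadyPresent;
    }

    public function count(): int
    {
        return count($this->items);
    }
}

$set = new IntSet();
var_dump($set->add(1)); // bool(false): newly added
var_dump($set->add(1)); // bool(true): was already there
var_dump($set->count()); // int(1)
```

With the proposed syntax the declaration would presumably read `public mutating function add(...)` and the call `$set->add!(1)`, with the return value used in the same way.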

I seem to remember that, when this was discussed before, the argument was made that separating value objects completely means you have to spend time deciding how they interact with every feature of the language.

Data classes are classes with a single additional
zend_class_entry.ce_flags flag. So unless customized, they behave as
classes. This way, we have the option to tweak any behavior we would
like, but we don't need to.

Of course, this will still require an analysis of what behavior we
might want to tweak.

Does the copy-on-write optimisation actually require the entire class to be special, or could it be triggered by a mutating method on any object? To allow direct modification of properties as well, we could move the call-site marker slightly to a ->! operator:

$foo->!mutate();
$foo->!bar = 42;

I suppose this is possible, but it puts the burden for figuring out
what to separate onto the user. Consider this example, which would
work with the current approach:

$shapes[0]->position->zero!();

The left-hand-side of the mutating method call is fetched by
"read+write". Essentially, this ensures that any array or data class
is separated (copied if RC >1).

Without such a class-wide marker, you'll need to remember to add the
special syntax exactly where applicable.

$shapes![0]!->position!->zero();

In this case, $shapes, $shapes[0], and $shapes[0]->position must all
be separated. This seems very easy to mess up, especially since only
zero() is actually known to be separating and can thus be verified at
runtime.
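
Nested arrays already behave the way the "read+write" fetch is described above, which can be checked on current PHP. The data in this sketch is made up.

```php
<?php
$shapes = [
    ['position' => ['x' => 3, 'y' => 4]],
    ['position' => ['x' => 5, 'y' => 6]],
];
$copy = $shapes; // shares storage with $shapes for now

// Writing through the chain separates $copy, $copy[0] and $copy[0]['position'],
// without any marker on the intermediate steps.
$copy[0]['position']['x'] = 0;
$copy[0]['position']['y'] = 0;

var_dump($shapes[0]['position'] === ['x' => 3, 'y' => 4]); // bool(true): original untouched
var_dump($copy[0]['position'] === ['x' => 0, 'y' => 0]);   // bool(true)
var_dump($copy[1] === $shapes[1]);                         // bool(true): untouched element still equal
```

The single marker on `zero!()` in the proposed syntax plays the role that the assignment plays here: it tells the engine to fetch the whole left-hand chain for writing.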

The main drawback I can see (outside of the implementation, which I can't comment on) is that we couldn't overload the === operator to use value semantics. In exchange, a lot of decisions would simply be made for us: they would just be objects, with all the same behaviour around inheritance, serialization, and so on.

Right, this would either require some other marker that switches to
this mode of comparison, or operator overloading.

Ilija

ilutov · April 3, 2024, 6:09pm

Hi Larry

On Wed, Apr 3, 2024 at 12:03 AM Larry Garfield <larry@garfieldtech.com> wrote:

On Tue, Apr 2, 2024, at 6:04 PM, Ilija Tovilo wrote:

> I think you misunderstood. The intention is to mark both call-site and
> declaration. Call-site is marked with ->method!(), while declaration
> is marked with "public mutating function". Call-site is required to
> avoid the engine complexity, as previously mentioned. But
> declaration-site is required so that the user (and IDEs) even know
> that you need to use the special syntax at the call-site.

Ah, OK. That's... unfortunate, but I defer to you on the implementation complexity.

As I've argued, I believe the different syntax is a positive. This
way, data classes are known to stay unmodified unless:

1. You're explicitly modifying it yourself.
2. You're calling a mutating method, with its associated syntax.
3. You're creating a reference from the value, either explicitly or by
passing it to a by-reference parameter.

By-reference argument passing is the only way that mutations of data
classes can be hidden (given that they look exactly like normal
by-value arguments), and it's arguably a flaw of by-reference passing
itself. In all other cases, you can expect your value _not_ to
unexpectedly change. For this reason, I consider it as an alternative
approach to readonly classes.

> Disallowing ordinary by-ref objects is not trivial without additional
> performance penalties, and I don't see a good reason for it. Can you
> provide an example on when that would be problematic?

There's two aspects to it, that I see.

data class A {
  public function __construct(public string $name) {}
}

data class B {
  public function __construct(
    public A $a,
    public PDO $conn,
  ) {}
}

$b = new B(new A('Ilija'), $pdoConnection);

function stuff(B $b2) {
  $b2->a->name = 'Larry';
  // This triggers a CoW on $b2, separating it from $b, and also creating a new instance of A. What about $conn?
  // Does it get cloned? That would be bad. Does it not get cloned? That seems weird that it's still the same on
  // a data object.

  $b2->conn->beginTransaction();
  // This I would say is technically a modification, since the state of the connection is changing. But then
  // should this trigger $b2 cloning from $b? Neither answer is obvious to me.
}

IMO, the answer is relatively straight-forward: PDO is a reference
type. For all intents and purposes, when you're passing B to stuff(),
B is copied. Since B::$conn is a "reference" (read pointer), copying B
doesn't copy the connection, only the reference to it. B::$a, however,
is a value type, so copying B also copies A. The fact that this isn't
_exactly_ what happens under the hood due to CoW is an implementation
detail; it doesn't need to change how you think about it. From the
user's standpoint, $b and $b2 are already separate values once stuff()
is called.

This is really no different from arrays:

$b = ['a' => ['name' => 'Larry'], 'conn' => $pdoConnection];
$b2 = $b; // $b is detached from $b2, $b['conn'] remains a shared object.
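
The array analogy can be checked in today's PHP. The following is a minimal sketch, assuming a plain stdClass object as a hypothetical stand-in for the PDO connection and an invented `inTransaction` property; it is not part of the proposal itself.

```php
<?php
// Hypothetical stand-in for a reference type such as a PDO connection.
$conn = new stdClass();
$conn->inTransaction = false;

$b = ['a' => ['name' => 'Ilija'], 'conn' => $conn];
$b2 = $b; // Conceptually a copy; the engine defers it until a write.

$b2['a']['name'] = 'Larry';        // Separates $b2 and its nested 'a' array.
$b2['conn']->inTransaction = true; // Mutates the one shared object.

var_dump($b['a']['name']);            // string(5) "Ilija"
var_dump($b2['a']['name']);           // string(5) "Larry"
var_dump($b['conn'] === $b2['conn']); // bool(true)
var_dump($b['conn']->inTransaction);  // bool(true)
```

The nested array behaves as a value and is separated on write, while the object stored under 'conn' stays shared between both arrays.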

The other aspect is, eg, serialization. People will come to expect (reasonably) that a data class will have certain properties (in the abstract sense, not lexical sense). For instance, most classes are serializable, but a few are not. (Eg, if they have a reference to PDO or a file handle or something unserializable.) Data classes seem like they should be safe to serialize always, as they're "just data". If data classes are limited to primitives and data classes internally, that means we can effectively guarantee that they will be serializable, always. If one of the properties could be a non-serializable object, that assumption breaks.

I'm not sure that's a convincing argument to fully disallow reference
types, especially since it would prevent you from storing
DateTimeImmutables and other immutable values in data classes and thus
break many valid use-cases. That would arguably be very limiting.

There's probably other similar examples besides serialization where "think of this as data" and "think of this as logic" is how you'd want to think, which leads to different assumptions, which we shouldn't stealthily break.

I think your assumption here is that non-data classes cannot contain
data. This doesn't hold, and especially will not until data classes
become more common. Readonly classes can be considered strict versions
of data classes in terms of mutability, minus some of the other
semantic changes (e.g. identity).

Ilija

Kevin_Dunglas · April 4, 2024, 11:49am

Data classes will be a very useful addition to “API Platform”.

API Platform is a “resource-oriented” framework that strongly encourages the use of “data-only” classes:
we use PHP classes both as a specification language to document the public shape of web APIs (like an OpenAPI specification, but written in PHP instead of JSON or YAML),
and as Data Transfer Objects containing the data to be serialized into JSON (read), or the JSON payload deserialized into PHP objects (write).

Being able to encourage users to use structs (that’s what we already call this type of behavior-less class in our workshops) for these objects will help us a lot.

Kévin

Rowan_Tommins_IMSoP · April 4, 2024, 10:28pm

> Data classes are classes with a single additional
> zend_class_entry.ce_flags flag. So unless customized, they behave as
> classes. This way, we have the option to tweak any behavior we would
> like, but we don’t need to.
>
> Of course, this will still require an analysis of what behavior we
> might want to tweak.

Regardless of the implementation, there are a lot of interactions we will want to consider; and we will have to keep considering new ones as we add to the language. For instance, the Property Hooks RFC would probably have needed a section on “Interaction with Data Classes”.

On the other hand, maybe having two types of objects to consider each time is better than having to consider combinations of lots of small features.

On a practical note, a few things I’ve already thought of to consider:

- Can a data class have readonly properties (or be marked “readonly data class”)? If so, how will they behave?
- Can you explicitly use the “clone” keyword with an instance of a data class? Does it make any difference?
- Tied into that: can you implement __clone(), and when will it be called?
- If you implement __set(), will copy-on-write be triggered before it’s called?
- Can you implement __destruct()? Will it ever be called?

> Consider this example, which would
> work with the current approach:
>
> $shapes[0]->position->zero!();

I find this concise example confusing, and I think there’s a few things to unpack here…

Firstly, there’s putting a data object in an array:

$numbers = [ new Number(42) ];
$cow = $numbers;
$cow[0]->increment!();
assert($numbers !== $cow);

This is fairly clearly equivalent to this:

$numbers = [ 42 ];
$cow = $numbers;
$cow[0]++;
assert($numbers !== $cow);

CoW is triggered on the array for both, because ++ and ->increment!() are both clearly modifications.

Second, there’s putting a data object into another data object:

$shape = new Shape(new Position(42,42));
$cow = $shape;
$cow->position->zero!();
assert($shape !== $cow);

This is slightly less obvious, because it presumably depends on the definition of Shape. Assuming Position is a data class:

- If Shape is a normal class, changing the value of $cow->position just happens in place, and the assertion fails.
- If Shape is a readonly class (or position is a readonly property on a normal class), changing the value of $cow->position shouldn’t be allowed, so this will presumably give an error.
- If Shape is a data class, changing the value of $cow->position implies a “mutation” of $cow itself, so we get a separation before anything is modified, and the assertion passes.

Unlike in the array case, this behaviour can’t be resolved until you know the run-time type of $shape.

Now, back to your example:

$shapes = [ new Shape(new Position(42,42)) ];
$cow = $shapes;
$shapes[0]->position->zero!();
assert($cow !== $shapes);

This combines the two, meaning that now we can’t know whether to separate the array until we know (at run-time) whether Shape is a normal class or a data class.

But once that is known, the whole of “->position->zero!()” is a modification to $shapes[0], so we need to separate $shapes.

Without such a class-wide marker, you'll need to remember to add the
special syntax exactly where applicable.

$shapes![0]!->position!->zero();

The array access doesn’t need any special marker, because there’s no ambiguity. The ambiguous call is the reference to ->position: in your current proposal, this represents a modification if Shape is a data class, and is itself being modified. My suggestion (or really, thought experiment) was that it would represent a modification if it has a ! in the call.

So if Shape is a readonly class:

$shapes[0]->position->!zero();
// Error: attempting to modify readonly property Shape::$position

$shapes[0]->!position->!zero();
// OK; an optimised version of:
$shapes[0] = clone $shapes[0] with [
    'position' => (clone $shapes[0]->position with ['x' => 0, 'y' => 0])
];

If ->! is only allowed if the RHS is either a readonly property or a mutating method, then this can be reasoned about statically: it will either error, or cause a CoW separation of $shapes. It also allows classes to mix aspects of “data class” and “normal class” behaviour, which might or might not be a good idea.

This is mostly just a thought experiment, but I am a bit concerned that code like this is going to be confusingly ambiguous:

$item->shape->position->zero!();

What is going to be CoW cloned, and what is going to be modified in place? I can’t actually know without knowing the definition behind both $item and $item->shape. It might even vary depending on input.

Regards,

-- 
Rowan Tommins
[IMSoP]

ilutov · April 6, 2024, 5:56pm

Hi Rowan

On Fri, Apr 5, 2024 at 12:28 AM Rowan Tommins [IMSoP]
<imsop.php@rwec.co.uk> wrote:

On 03/04/2024 00:01, Ilija Tovilo wrote:

Regardless of the implementation, there are a lot of interactions we will want to consider; and we will have to keep considering new ones as we add to the language. For instance, the Property Hooks RFC would probably have needed a section on "Interaction with Data Classes".

That remark was implying that data classes really are just classes
with some additional tweaks. That gives us the ability to handle them
differently when desired. However, they will otherwise behave just
like classes, which makes it not so different from your suggestion.

On a practical note, a few things I've already thought of to consider:

- Can a data class have readonly properties (or be marked "readonly data class")? If so, how will they behave?

Yes. The CoW semantics become irrelevant, given that nothing may
trigger a separation. However, data classes also include value
equality, and hashing in the future. These may still be useful for
immutable data.

- Can you explicitly use the "clone" keyword with an instance of a data class? Does it make any difference?

Manual cloning is not useful, but it's also not harmful. So I'm
leaning towards allowing this. This way, data classes may be handled
generically, along with other non-data classes.

- Tied into that: can you implement __clone(), and when will it be called?

Yes. `__clone` will be called when the object is separated, as you would expect.

- If you implement __set(), will copy-on-write be triggered before it's called?

Yes. Separation happens as part of the property fetching, rather than
the assignment itself. Hence, for `$foo->bar->baz = 'baz';`, once
`Bar::__set('baz', 'baz')` is called, `$foo` and `$foo->bar` will
already have been separated.

- Can you implement __destruct()? Will it ever be called?

Yes. As with any other object, this will be called once the last
reference to the object goes away. There's nothing special going on.

It's worth noting that CoW makes `__clone` and `__destruct` somewhat
nondeterministic, or at least non-obvious.

> Consider this example, which would
> work with the current approach:
>
> $shapes[0]->position->zero!();

I find this concise example confusing, and I think there's a few things to unpack here...

I think you're putting too much focus on CoW. CoW should really be
considered an implementation detail. It's not _fully_ transparent,
given that it is observable through `__clone` and `__destruct` as
mentioned above. But it is _mostly_ transparent.

Conceptually, the copy happens not when the method is called, but when
the variable is assigned. For your example:

$shape = new Shape(new Position(42,42));
$copy = $shape; // Conceptually, a recursive copy happens here.
// $shape is already detached from $copy.
// The ! merely indicates that the value is modified.
$copy->position->zero!();
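
The conceptual model above can be approximated in today's PHP with an eager deep copy. This is only a sketch under stated assumptions: the Shape and Position classes below are hypothetical ordinary classes, the explicit `clone` plus `__clone()` stands in for the implicit copy on assignment, and the plain `zero()` call stands in for `zero!()`. The proposal would defer the copy via CoW instead of performing it up front.

```php
<?php
// Hypothetical classes mirroring the Shape/Position example.
final class Position
{
    public function __construct(public int $x, public int $y) {}

    public function zero(): void
    {
        $this->x = 0;
        $this->y = 0;
    }
}

final class Shape
{
    public function __construct(public Position $position) {}

    public function __clone()
    {
        // Deep-copy the nested value so the clone shares nothing mutable.
        $this->position = clone $this->position;
    }
}

$shape = new Shape(new Position(42, 42));
$copy = clone $shape;    // Eager stand-in for the conceptual copy on assignment.
$copy->position->zero(); // Only the copy is affected.

var_dump($shape->position->x); // int(42)
var_dump($copy->position->x);  // int(0)
```

The observable result matches the value-semantics description: once the copy exists, mutating it never changes the original.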

The array access doesn't need any special marker, because there's no ambiguity.

This is only true if you ignore ArrayAccess. `$foo['bar']` does not
necessarily indicate that `$foo` is an array. If it were a `Vector`,
then we would absolutely need an indication to separate it.

It's true that `$foo->bar` currently indicates that `$foo` is a
reference type. This assumption would break with this RFC, but that's
also kind of the whole point.
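
The ArrayAccess point can be seen in today's PHP. In this sketch the Vector class is a hypothetical, ordinary (by-reference) class, not an existing or proposed API; it only shows that identical `$foo['bar']` syntax can mean either a value-type write or a mutation of a shared object.

```php
<?php
// Hypothetical Vector: an ordinary by-reference class implementing ArrayAccess.
final class Vector implements ArrayAccess
{
    private array $items = [];

    public function offsetExists(mixed $offset): bool
    {
        return isset($this->items[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->items[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($offset === null) {
            $this->items[] = $value; // Supports $vector[] = $value.
        } else {
            $this->items[$offset] = $value;
        }
    }

    public function offsetUnset(mixed $offset): void
    {
        unset($this->items[$offset]);
    }
}

$array = ['bar' => 1];
$arrayCopy = $array;
$arrayCopy['bar'] = 2;  // Arrays are values: $array is unaffected.

$vector = new Vector();
$vector['bar'] = 1;
$vectorCopy = $vector;  // Copies the handle, not the object.
$vectorCopy['bar'] = 2; // Same syntax, but the shared object is mutated.

var_dump($array['bar']);  // int(1)
var_dump($vector['bar']); // int(2)
```

If Vector were a data class, the second write would need to separate `$vectorCopy` first, which is why the array-access syntax alone cannot tell the engine what to do.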

What is going to be CoW cloned, and what is going to be modified in place? I can't actually know without knowing the definition behind both $item and $item->shape. It might even vary depending on input.

For the most part, data classes should consist of other value types,
or immutable reference types (e.g. DateTimeImmutable). This actually
makes the rules quite simple: If you assign a value type, the entire
data structure is copied recursively. The fact that PHP delays this
step for performance is unimportant. The fact that immutable reference
types aren't cloned is also unimportant, given that they don't change.

Ilija