To recap: trying to deduplicate this array fails with all of the techniques we presented (we'll use the Set technique here).
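A minimal sketch (the tags array here is illustrative, not the article's original data) shows why: each object literal is a distinct instance, so the Set technique keeps all of them.

```javascript
// Three structurally identical objects — but three distinct instances,
// so Set's reference-based comparison finds no duplicates.
const tags = [{ name: "travel" }, { name: "travel" }, { name: "travel" }];

const deduped = [...new Set(tags)];
console.log(deduped.length); // 3 — nothing was removed
```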
None of the items were deduplicated, even though they are all the "same".
For comparison, let's see what Ruby would do:
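A sketch of the Ruby counterpart, again with an illustrative array of hashes:

```ruby
# Hashes with the same keys and values compare equal in Ruby,
# so Array#uniq collapses them into one element.
tags = [{ name: "travel" }, { name: "travel" }, { name: "travel" }]

p tags.uniq.length # => 1
```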
So it looks like Ruby does what we wanted, successfully removing the duplicates in the array.
Instead of adding the objects directly inside the array, we first assign the objects into variables and then add the variables into the array.
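Sketched out (the variable names are ours), the repeated references now deduplicate, because every occurrence of a is the very same instance:

```javascript
const a = { name: "travel" };
const b = { name: "food" };

// The same references appear multiple times in the array...
const tags = [a, b, a, a];

// ...so Set recognizes them as duplicates.
const deduped = [...new Set(tags)];
console.log(deduped.length); // 2
```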
A closer examination
What's happening here?
We can ask ourselves the following questions when comparing two objects:
- Are the two objects exactly the same instance (same memory location)?
- Do the two objects represent the same logical concept (same temperature, same user, etc)?
When developers write comparisons, most of the time they are thinking of question 2. That is what you get when comparing primitives, but objects are a different story: question 2 is not easy for the computer to answer because the "logical concept" is not well-defined.
For example, look at the result (in Ruby) if instead of a Hash (or Array) of primitive values, we use a custom class:
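A sketch with a hypothetical Tag class:

```ruby
class Tag
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

tags = [Tag.new("travel"), Tag.new("travel")]

# Without eql?/hash definitions, uniq falls back to object identity.
p tags.uniq.length # => 2
```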
In this case, Ruby's result is an answer to question 1 (they don't have the same memory location) since the logical concept of "having the same name means they are the same object" isn't well-defined.
This is the root of the problem: the vast majority of times we are looking for answers to question 2, but every now and then because of technical or implementation details, the result is an answer to question 1 instead—breaking expectations.
Some languages (like Ruby) know that question 2 is usually what the user wants, and they go the extra mile to implement a specialized comparison that gives the expected result for classes from their standard library (Hash, Array, Set, etc). This, however, is not enough and fails once custom classes are involved.
Back to basics
Conversely, different instantiations of a class result in different objects that reside in different memory locations and have different references, even if they are all instantiated from the same class. This is why our deduplication algorithm works for primitive values but not on objects: the test of equality is so strict that only primitive values can properly satisfy it.
It would be nice if we could make == not only compare primitive values, but also the object's instrumental value: the value that we as developers think of the object as having (e.g. the temperature in degrees Celsius), not what the object actually has (i.e. a pointer to a memory location). This way, we can have e.g. indexOf use the overloaded == operator and get back our preferred test for equality, making it do the "right" thing.
A brief aside on how a rubyist might do deduplication
In Ruby, many operators (like ==) are just methods belonging to an object, and can therefore be redefined. Much of the standard library actually expects this overloading to be present.
For example, Ruby objects can be used as keys to a Hash because internally the method Object#hash is called, and Array#uniq works because internally the method Object#eql? is called. These methods are defined across Ruby's standard library and emit values that make objects behave the way developers expect.
For domain-specific objects, implementing these methods will make any instances of your object behave "correctly" without any changes to the standard library functions, as long as you fulfill the contracts. The Ruby docs spell out what the developer is expected to define when using objects as hash keys:
> Two objects refer to the same hash key when their hash value is identical and the two objects are eql? to each other.
>
> A user-defined class may be used as a hash key if the hash and eql? methods are overridden to provide meaningful behavior. By default, separate instances refer to separate hash keys.
A typical implementation of hash is based on the object's data while eql? is usually aliased to the overridden == method:
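A sketch of what that could look like, using a hypothetical Tag class:

```ruby
class Tag
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Equality is based on the object's data, not its identity.
  def ==(other)
    other.is_a?(Tag) && name == other.name
  end
  alias eql? ==

  # Contract: objects that are eql? must have the same hash value.
  def hash
    name.hash
  end
end

tags = [Tag.new("travel"), Tag.new("travel")]
p tags.uniq.length # => 1
```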
For most objects, this results in the string "[object Object]" being used as the object's key, which is why all of the objects were deduplicated (even if they had different properties).
Objects that have the toString() method redefined (like Date or Array) fare better with this approach, since their stringified value is the object's instrumental value (the value we expect the object to have conceptually).
A more general solution?
Even though we can redefine the toString() method on objects, it can be tedious and error-prone: this can really only work if (like Ruby) the rest of the ecosystem expects that these patterns are present and used throughout; otherwise the custom usage sticks out like a sore thumb.
Additionally, the toString() API is meant to return a textual representation of the object, and is mostly intended for human consumption (what you’d usually see on a console.log() call). This may or may not be the same as what you need to use for equality comparisons.
Reusing the toString() API for something else might work in specific cases, but may cause surprising effects in others. In fact, MDN's documentation on Object.prototype.toString() has a similar discussion on detecting an object's class via the toString() function, and concludes that using it this way is unreliable.
What we want is something akin to Ruby's Object#hash and Object#eql? methods that would inspect an object and emit a specific value unique to that inspection, but does not differentiate between instantiations of objects. In other words, even if we instantiate two objects, if those two objects have the same instrumental value, then it should emit the same "hash" value.
While JSON.stringify() is by no means a robust hash function (e.g. it won't differentiate between Infinity and NaN, since both are serialized as null), it does traverse objects and arrays, making it usable for comparing objects that are deeply nested.
If we do ever have a need for such custom checks, there is always the option to redefine the toJSON() function to include them.
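For instance (the Temperature-like shape here is hypothetical), toJSON() can normalize the serialized form so that two representations of the same value stringify identically:

```javascript
// Both objects normalize to Celsius when serialized, so JSON.stringify()
// produces the same string for the "same" temperature.
const celsius = {
  value: 100,
  unit: "C",
  toJSON() { return { celsius: this.value }; },
};
const fahrenheit = {
  value: 212,
  unit: "F",
  toJSON() { return { celsius: (this.value - 32) * 5 / 9 }; },
};

console.log(JSON.stringify(celsius) === JSON.stringify(fahrenheit)); // true
```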
Although this solution might be generic, it gives off a bad smell due to the overuse of JSON.stringify(). Even worse, since we're dealing with (at worst) O(n^2) complexity, there can be a large performance hit when calling JSON.stringify() with multiple large objects.
Equality on our own terms
There are multiple ways of doing this, but we'll try to show at least two: via functions, and via mixins. We'll be using both techniques inside a filter function to check whether an element has been detected before (and is therefore a duplicate) similar to how we did it in the Using filter() and indexOf() example.
This one is straightforward: we define a function called deepArrayEqual() that checks two (possibly deeply nested) arrays and determines whether they are equal to each other. In our case, we're only interested in the following conditions to decide equality:
- the lengths must be the same
- the order of elements must be the same (also applies to nested arrays)
- equality is checked via strict equality (i.e. ===)
We then use our own implementation of the indexOf(e) === i check, recursing through all elements and using the deepArrayEqual() function to determine equality.
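A sketch of how the function and the filter could fit together (function names and sample data are illustrative):

```javascript
// deepArrayEqual(): same length, same order, strict (===) equality per
// element, recursing into nested arrays.
function deepArrayEqual(a, b) {
  if (!Array.isArray(a) || !Array.isArray(b)) return a === b;
  if (a.length !== b.length) return false;
  return a.every((el, i) => deepArrayEqual(el, b[i]));
}

// indexOf(e) === i, reimplemented with deepArrayEqual() as the equality test.
function dedupe(arrays) {
  return arrays.filter(
    (e, i) => arrays.findIndex((other) => deepArrayEqual(e, other)) === i
  );
}

console.log(dedupe([[1, [2, 3]], [1, [2, 3]], [4]])); // [[1, [2, 3]], [4]]
```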
Another technique would be using mixins. In our case, we don't really need to be very comprehensive in the checks so we can relax our conditions. We consider two objects as equal when:
- the number of keys of both objects are the same
- each of the object's properties are the same
We then use Object.assign() to "mix in" the eql() function to each of the objects we are checking, so we can use eql() in the form of this.eql?.(that) // boolean (the ?. syntax is optional chaining, which returns undefined if the receiver does not have the function defined).
The downside here is that the objects now have the eql() function mixed in, which may or may not be desired; if needed, we can use the delete operator to remove the function from the objects. We don't really mind the objects having that function mixed in, so we leave it there.
We just need to be careful not to clobber any existing eql() function definition on the object; wrapping the Object.assign() call inside a conditional that checks for a predefined eql() function can be a valid way to avoid this kind of shadowing.
As in the "using functions" example, we use our own implementation of indexOf(e) === i to go through the array and check for duplicates.
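The mixin variant might be sketched like this (the relaxed equality rules are the ones listed above; the eqlMixin and dedupe names and sample data are ours):

```javascript
// Relaxed equality via a mixin: same number of keys, and each property
// strictly equal.
const eqlMixin = {
  eql(that) {
    const keys = Object.keys(this);
    return (
      keys.length === Object.keys(that).length &&
      keys.every((k) => this[k] === that[k])
    );
  },
};

function dedupe(objects) {
  // Guard against clobbering an existing eql() definition.
  objects.forEach((o) => {
    if (typeof o.eql !== "function") Object.assign(o, eqlMixin);
  });
  return objects.filter(
    (e, i) => objects.findIndex((other) => e.eql?.(other)) === i
  );
}

const result = dedupe([{ name: "travel" }, { name: "travel" }, { name: "food" }]);
console.log(result.length); // 2
```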
Using a custom object
Instead of trying to be generic, we can work directly on the objects we're using, and bake into them a function that returns a comparable value unique to the object. This functions similarly to how Ruby's Object#hash method works, and we can use it the same way.
While this technique has the downside of only being applicable to our custom objects (i.e. only objects instantiated from this class would be satisfying our conditions for equality) that is also a benefit for us in the sense that we have full control of what the definition of equality means to our application.
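One possible sketch, using a hypothetical Temperature class with a hash()-like method (the normalization to Celsius is our illustrative choice):

```javascript
// A class that bakes in its own hash(): a comparable value derived from
// the data we care about, much like Ruby's Object#hash.
class Temperature {
  constructor(value, unit) {
    this.value = value;
    this.unit = unit;
  }

  // Normalize to Celsius so "the same temperature" hashes identically.
  hash() {
    const celsius = this.unit === "F" ? (this.value - 32) * 5 / 9 : this.value;
    return `Temperature:${celsius}`;
  }
}

const temps = [new Temperature(100, "C"), new Temperature(212, "F")];
const deduped = temps.filter(
  (e, i) => temps.findIndex((t) => t.hash() === e.hash()) === i
);
console.log(deduped.length); // 1
```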
Summary and recommendations
> any decent answer to an interesting question begins, "it depends..."
>
> — Kent Beck 🌻 (@KentBeck), May 6, 2015
If you're reasonably sure you'd only be dealing with flattened, primitive-only (e.g. string or numeric) array elements (like the blog post "tags" in our case), then creating a new Set() and then either spreading the elements or creating a new Array() from the created set is the most straightforward and performant approach. It will net you the results closest to what you would expect a unique() function to return.
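For completeness, both forms sketched with an illustrative tags array:

```javascript
const tags = ["travel", "food", "travel", "coding", "food"];

// Either form works; insertion order is preserved.
const viaSpread = [...new Set(tags)];
const viaArrayFrom = Array.from(new Set(tags));

console.log(viaSpread); // ["travel", "food", "coding"]
```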
We've shown however that this really can be surprising when objects and arrays are involved. This is because while the same primitive value is always equal to itself, different instances of an object are technically different from each other due to the way == works.
In languages like Ruby, you can get around the question of equality by implementing Object#==, Object#eql?, and Object#hash yourself in either the object's eigenclass or the parent Class.
You can perform specific tests in that implementation that can allow you for example, to compare two dates in the Gregorian and Julian calendars and eliminate one of the two if they are in fact the "same" date (just in different calendars), or compare various Temperature objects that are the "same" temperature but expressed in different units such as Celsius, Fahrenheit, and Kelvin.
This makes using operations that depend on equality (like Array#uniq) almost universally "do the right thing" without too much thinking of what equality is and effectively delegating the responsibility of implementation to the object; if you're interested in the details then you just look at the Object#==, Object#hash, and Object#eql? implementations.
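A sketch of such a Temperature class (normalizing to Celsius is our illustrative choice):

```ruby
# A Temperature that compares equal across units by normalizing to Celsius
# in ==, eql?, and hash.
class Temperature
  attr_reader :value, :unit

  def initialize(value, unit)
    @value = value
    @unit = unit
  end

  def to_celsius
    unit == :fahrenheit ? (value - 32) * 5.0 / 9 : value.to_f
  end

  def ==(other)
    other.is_a?(Temperature) && to_celsius == other.to_celsius
  end
  alias eql? ==

  # Contract: objects that are eql? must have the same hash value.
  def hash
    to_celsius.hash
  end
end

temps = [Temperature.new(100, :celsius), Temperature.new(212, :fahrenheit)]
p temps.uniq.size # => 1
```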
We would also recommend staying away from the "magical" techniques that try to use Object or Map keys by virtue of their being unique by design (and ride on their implementation of deduplication). These assume that the key deduplication algorithm behaves as you'd expect, and as you can see with the Object with objects as keys example, you'll also need to override the object's toString() function to get satisfactory results.
For deduplicating non-primitives, we've looked into one such implementation using JSON.stringify() as an example that can work with most objects, even if deeply nested:
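A sketch of that filter (the sample data is ours):

```javascript
// Two elements are considered "the same" when they serialize
// to the same JSON string.
function dedupe(array) {
  return array.filter(
    (e, i) =>
      array.findIndex((x) => JSON.stringify(x) === JSON.stringify(e)) === i
  );
}

const deduped = dedupe([
  { name: "a", tags: ["x", ["y"]] },
  { name: "a", tags: ["x", ["y"]] },
  { name: "b" },
]);
console.log(deduped.length); // 2
```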
The caveat here is that while the filter function seems to be generic enough, you'll still need to handle special cases that your application might require (e.g. handling NaN or functions). Furthermore, using JSON.stringify() on huge collections that contain large, deeply nested objects or arrays can severely hurt performance as this example runs in O(n^2) in the worst case.
This is of course not recommended, as running JSON.stringify() is quite heavy-handed, especially on larger data structures. We can instead go for a filter function that embodies the proper test for equality that your application needs.
In our case, we consider objects that have the same properties and arrays that have the same length and elements to be equal, and only check for those specifically. One such example can be:
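A sketch along those lines (the equals() name and sample data are ours):

```javascript
// Application-specific equality: arrays must have the same length and
// (recursively) equal elements; objects must have the same keys with
// equal values; everything else falls back to ===.
function equals(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((el, i) => equals(el, b[i]));
  }
  if (a && b && typeof a === "object" && typeof b === "object") {
    const keys = Object.keys(a);
    return (
      keys.length === Object.keys(b).length &&
      keys.every((k) => equals(a[k], b[k]))
    );
  }
  return a === b;
}

function dedupe(array) {
  return array.filter((e, i) => array.findIndex((x) => equals(x, e)) === i);
}

console.log(dedupe([{ a: [1, 2] }, { a: [1, 2] }, { a: [1] }]).length); // 2
```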
In other words, depending on your approach you can either
- pursue efficiency and sacrifice "correctness" (e.g. two NaN values should not be equal to each other, but the Set object will equate them anyway; or not being able to deduplicate objects, limiting yourself to primitive-only array elements),
- or pursue a more general solution and sacrifice performance by using JSON.stringify(),
- or (even better) pursue a middle ground by providing your own equality algorithms.
It is our job as software developers to decide which approach best suits our applications.