Engineering

Removing Duplicate Elements from a JavaScript Array (part 2)

Removing Duplicate Elements from a JavaScript Array (part 2)

In part 1 we saw that while deduplicating JavaScript arrays containing primitive values for its elements can be quite straightforward, dealing with objects as elements of an array can be a bit tricky.

To recap, trying to deduplicate this array fails with the various techniques we presented (but we'll use the Set technique here)

const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];

[...new Set(array)];

None of the items were deduplicated, even though they are all the "same".

For comparison, let's see what Ruby would do:

array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture" , title: "Culture"},
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  ]

array.uniq

So it looks like Ruby does what we expected, successfully removing duplicates in the array as expected.

But hold on, let's try a slightly different example in JavaScript:

const scrum = { slug: "scrum", title: "Scrum" };
const remoteWork = { slug: "remote-work", title: "Remote Work" };
const culture = { slug: "culture", title: "Culture" };
const arr = [1, [2, 3], [[4, 5], [6, 7], 8], 9];

const array = [scrum, scrum, scrum, remoteWork, culture, arr, arr];

[...new Set(array)];

Instead of adding the objects directly inside the array, we first assign the objects into variables and then add the variables into the array.

Surprisingly (or not) in this case JavaScript deduplicates the array just like Ruby.

A closer examination

What's happening here?

We can ask ourselves the following questions when comparing two objects:

  1. Are the two objects exactly the same instance (same memory location)?
  2. Do the two objects represent the same logical concept (same temperature, same user, etc)?

When developers write comparisons, most of the time they are thinking of question 2. This is what you get when comparing primitives but objects are a different story; they not easy for the computer to answer because the "logical concept" is not well-defined.

For example, look at the result (in Ruby) if instead of a Hash (or Array) of primitive values, we use a custom class:

class Name
  attr_accessor :name
  def initialize(n)
    @name = n
  end
end

array = [
  Name.new('Adam'),
  Name.new('Adam'),
  Name.new('Adam')
].uniq

=> [
     #<Name:0x0000563133c2d9a8 @name="Wong">,
     #<Name:0x0000563133c2d958 @name="Wong">,
     #<Name:0x0000563133c2d908 @name="Wong">
   ]

In this case, Ruby's result is an answer to question 1 (they don't have the same memory location) since the logical concept of "having the same name means they are the same object" isn't well-defined.

This is the root of the problem: the vast majority of times we are looking for answers to question 2, but every now and then because of technical or implementation details, the result is an answer to question 1 instead—breaking expectations.

Some languages (like Ruby) know that question 2 is usually what the user wants, and they go the extra mile to implement a specialized comparison to give the expected result when dealing with classes from their standard library (Hash, Array, Set, etc). This however is not enough and fails once custom classes are involved.

JavaScript on the other hand always answers question 1 when comparing objects. Despite being consistent, this is unlikely to be what the developer expected and can therefore be surprising.

Back to basics

Primitive values in JavaScript are immutable. This also means that primitive values are always equal to each other, since a value (e.g. the number 42) will always be the same thing.

In JavaScript, a primitive (primitive value, primitive data type) is data that is not an object and has no methods. There are 7 primitive data types: string, number, bigint, boolean, undefined, symbol, and null.

Conversely, different instantiations of a class result in different objects that reside in different memory locations and will have different references, even if they are all instantiated from the same class. This is the reason why our deduplication algorithm works for primitive values but not on objects: the test of equality is too strict such that only primitive values can properly satisfy it.

It would be nice if we can make == not only compare primitive values, but also the object's instrumental value: the value that we as developer think of the object as having (e.g. the temperature in degrees Celsius), not what the object actually has (i.e. a pointer to a memory location). This way, we can have e.g. indexOf use the overloaded == operator and get back our preferred test for equality, making it do the "right" thing.

JavaScript however does not support operator overloading, which means the == operator will always have the same behavior and cannot be redefined. Instead, it has [multiple ways of checking for equality](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Equality_comparisons_and_sameness)..

A brief aside on how a rubyist might do deduplication

In Ruby, many operators (like ==) are just methods belonging to an object, and can therefore redefined. This overloading is actually expected to be present by much of the standard library.

For example, Ruby objects can be used as keys to a Hash because internally the method Object#hash is called, and Array#uniq works because internally the method Object#eql? is called. These methods are defined on all of Ruby's standard library and emits values that make them behave in a way developers would expect.

For domain-specific objects, implementing these methods will result in any instances of your object behaving "correctly" without any changes to the standard library functions, as long as you fulfill the contracts. In the Ruby docs, the developer is expected to define these when trying to use [objects as hash keys](https://docs.ruby-lang.org/en/2.0.0/Hash.html#class-Hash-label-Hash+Keys).

Two objects refer to the same hash key when their hash value is identical and the two objects are eql? to each other.

A user-defined class may be used as a hash key if the hash and eql? methods are overridden to provide meaningful behavior. By default, separate instances refer to separate hash keys.

A typical implementation of hash is based on the object's data while eql? is usually aliased to the overridden == method:

Incidentally, the reason why our Using Object Keys technique work strangely on arrays with non-primitive values is that JavaScript objects can only have strings as keys. That means any other value will have itself coerced into a string via the toString() function:

({}.toString());

For most objects, this results in the string "[object Object]" being used as the object's key, which is why all of the objects were deduplicated (even if they had different properties).

Objects that have toString() method redefined (like Date or Array) fare better with this approach since their stringified value is the object's instrumental value (the value we expect the object to have conceptually).

A more general solution?

Even though we can redefine the toString() method on objects, it can be tedious and error-prone: this can really only work if (like Ruby) the rest of the ecosystem expects that these patterns are present and used throughout; otherwise the custom usage sticks out like a sore thumb.

Additionally, the toString() API is meant to return a textual representation of the object, and is mostly intended for human consumption (what you’d usually see on a console.log() call). This may or may not be the same as what you need to use for equality comparisons.

Reusing the toString() API for something else might work on specific cases, but may cause surprising effects in others. In fact, MDN's docs on Object.toString() has a similar discussion on detecting an object's class via the toString() function and concludes that using it in this way is unreliable.

What we want is something akin to Ruby's Object#hash and Object#eql? methods that would inspect an object and emit a specific value unique to that inspection, but does not differentiate between instantiations of objects. In other words, even if we instantiate two objects, if those two objects have the same instrumental value, then it should emit the same "hash" value.

The closest thing we have in standard JavaScript is to serialize the object using JSON.stringify(). If there is a need for even further customization, you can do that by (re)defining the toJSON() function and then comparing the result with strict equality ===; if an object has the toJSON() function defined, JSON.stringify() will use that function when serializing the object. The toJSON() function also accepts a parameter with varying behaviors which helps in creating a much more specific pseudo-hash function.

While JSON.stringify() is by no means a very robust hash function (e.g. it won't differentiate between Infinity and NaN since those will both be serialized as null) it does traverse objects and arrays, making it usable for comparing objects that are deeply nested.

const arr = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];

arr.filter((e, i, a) => {
  const eJSON = JSON.stringify(e);

  for (let index = 0; index < a.length; index++) {
    const currJSON = JSON.stringify(a[index]);

    if (eJSON === currJSON) {
      if (index === i) return true;
      else break; // we only consider the first match
    }
  }
  return false;
});

If we do ever have a need for such custom checks, there is always the option to redefine the toJSON() function to include them.

Although this solution might be generic, it somewhat gives a bad smell due to the over usage of JSON.stringify(). Even worse, since we're dealing with an (at worst) O(n^2) complexity, there can be a large performance hit when calling JSON.stringify() with multiple large objects.

Equality on our own terms

Given that JavaScript doesn’t have a well-defined API for defining equality, instead of trying to find a generic and catch-all solution that can work on almost anything we throw at it, we can try defining what equality means to us and specifically test for those conditions. This has the added benefit of not only reducing waste (since we only check for what's important for us) but also documenting what equality denotes (at least according to our application).

There are multiple ways of doing this, but we'll try to show at least two: via functions, and via mixins. We'll be using both techniques inside a filter function to check whether an element has been detected before (and is therefore a duplicate) similar to how we did it in the Using filter() and indexOf() example.

Using functions

This one is straightforward: we define a function called deepArrayEqual() that would check a deeply nested array and determine if they are equal to each other. In our case, we're only interested in the following conditions to decide equality:

  • the lengths must be the same
  • the order of elements must be the same (also applies to nested arrays)
  • equality is checked via strict equality (i.e. ===)
const deepArrayEqual = (a, b) => {
  if (a.length !== b.length) return false; // quick check for unbalanced arrays

  // recursively check for equality of elements
  if (Array.isArray(a) && Array.isArray(b))
    return a.every((e, i) => deepArrayEqual(e, b[i])); // recurse if  array
  else return a === b; // terminal condition; check for equality
};

[
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
].filter((e, i, a) => {
  // for loop is our own implementation of the `indexOf()` function
  for (let index = 0; index < a.length; index++) {
    if (deepArrayEqual(e, a[index])) {
      if (index === i) return true;
      else break;
    }
  }
  return false;
});

We then use our own implementation of indexOf(e) === i to go recursively through all elements using the deepArrayEqual() function for equality determination.

Using mixins

Another technique would be using mixins. In our case, we don't really need to be very comprehensive in the checks so we can relax our conditions. We consider two objects as equal when:

  • the number of keys of both objects are the same
  • each of the object's properties are the same
const eqlMixin = {
  eql(that) {
    // quick check for different number of properties
    if (Object.keys(this).length !== Object.keys(that).length) return false;

    // this is eql to that iff all properties === each other
    return Object.keys(that).every((prop) => this[prop] === that[prop]);
  },
};

[
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
].filter((e, i, a) => {
  // for loop is our own implementation of the `indexOf()` function
  for (let index = 0; index < a.length; index++) {
    Object.assign(e, eqlMixin);
    if (e.eql?.(a[index])) {
      if (index === i) return true;
      else break;
    }
  }
  return false;
});

We then use Object.assign() to "mixin" the eql() function to each of the objects we are checking, so we can use eql() in the form of this.eql?.(that) // boolean (the ?. syntax is for optional chaining that will return null if the receiver does not have the function defined).

The downside here is that the objects now have the eql() function mixed in, which may or may not be desired; if needed we can just use the delete operator to remove the function from the objects. We don't really mind them having that function mixed in so we leave them in.

We just need to be careful not clobber any existing eql() function definition on that object; wrapping the Object.assign() call inside a conditional that checks for a predefined eql() function can a valid way to avoid this kind of shadowing.

As in the "using functions" example, we use our own implementation of indexOf(e) === i to go through the array and check for duplicates.

Using a custom object

Instead of trying to be generic, we can just work on the objects that we're using directly, and bake in them a function that returns a comparable value that's unique to the object. This functions similar to how Ruby's Object#hash method would work, and we can use it the same way.

class TagObject {
  constructor(slug, title) {
    this.slug = slug;
    this.title = title;
  }

  // should return this object's "identity"
  objHash() {
    return `${this.slug}:${this.title}`;
  }
}

[
  new TagObject("scrum", "Scrum"),
  new TagObject("scrum", "Scrum"),
  new TagObject("scrum", "Scrum"),
  new TagObject("remote-work", "Remote Work"),
  new TagObject("culture", "Culture"),
].filter((e, i, a) => {
  // for loop is our own implementation of the `indexOf()` function
  for (let index = 0; index < a.length; index++) {
    if (e.objHash() === a[index].objHash()) {
      if (index === i) return true;
      else break;
    }
  }
  return false;
});

While this technique has the downside of only being applicable to our custom objects (i.e. only objects instantiated from this class would be satisfying our conditions for equality) that is also a benefit for us in the sense that we have full control of what the definition of equality means to our application.

Summary and recommendations

So now we go full circle to the original question that sparked all this: what's the best way to deduplicate a JavaScript array?

If you're reasonably sure you'd only be dealing with flattened, primitive-only (e.g. string or numeric) array elements (like the blog post "tags" in our case), then creating a new Set() and then either spreading the elements or creating a new Array() from the created set is the most straightforward and performant. This approach will net you the results closest to what you would expect a unique() function would.

[...new Set(array)];
// or
new Array(new Set(array));

We've shown however that this really can be surprising when objects and arrays are involved. This is because while the same primitive value is always equal to itself, different instances of an object are technically different from each other due to the way == works.

In languages like Ruby, you can get around the question of equality by implementing Object#==, Object#eql?, and Object#hash yourself in either the object's eigenclass or the parent Class.

You can perform specific tests in that implementation that can allow you for example, to compare two dates in the Gregorian and Julian calendars and eliminate one of the two if they are in fact the "same" date (just in different calendars), or compare various Temperature objects that are the "same" temperature but expressed in different units such as Celsius, Fahrenheit, and Kelvin.

This makes using operations that depend on equality (like Array#uniq) almost universally "do the right thing" without too much thinking of what equality is and effectively delegating the responsibility of implementation to the object; if you're interested in the details then you just look at the Object#==, Object#hash, and Object#eql? implementations.

JavaScript however does not have this capability of (re)implementing equality; instead it relies on four equality algorithms that cover most of the general cases and their exceptions. If you need to be even more nuanced you will need implement the equality test as part of local code (e.g. inside the filter() or the reduce() functions).

We would also recommend staying away from the "magical" techniques that try to use Object or Map keys by virtue of their being unique by design (and ride on their implementation of deduplication). These assume that the key deduplication algorithm behaves as you'd expect, and as you can see with the Object with objects as keys example, you'll also need to override the object's toString() function to get satisfactory results.

For deduplicating non-primitives, we've looked into one such implementation using JSON.stringify() as an example that can work with most objects, even if deeply nested:

arr.filter((e, i, a) => {
  const eJSON = JSON.stringify(e);
  for (let index = 0; index < a.length; index++) {
    const currJSON = JSON.stringify(a[index]);
    if (eJSON === currJSON) {
      if (index === i) return true;
      else break;
    }
  }
  return false;
});

The caveat here is that while the filter function seems to be generic enough, you'll still need to handle special cases that your application might require (e.g. handling NaN or functions). Furthermore, using JSON.stringify() on huge collections that contain large, deeply nested objects or arrays can severely hurt performance as this example runs in O(n^2) in the worst case.

This is of course not recommended as running JSON.stringify() is a quite heavy handed, especially on larger data structures. We can instead go for a filter function that embodies the proper test for equality that you need for your application.

In our case, we consider objects that have the same properties and arrays that have the same length and elements to be equal, and only check for those specifically. One such example can be:

const deepArrayEqual = (a, b) => {
  if (a.length !== b.length) return false;
  if (Array.isArray(a) && Array.isArray(b))
    return a.every((e, i) => deepArrayEqual(e, b[i]));
  else return a === b;
};

const eqlMixin = {
  eql(that) {
    if (Object.keys(this).length !== Object.keys(that).length) return false;
    return Object.keys(that).every((prop) => this[prop] === that[prop]);
  },
};

arr.filter((e, i, a) => {
  for (let index = 0; index < a.length; index++) {
    if (
      (Array.isArray(e) && deepArrayEqual(e, a[index])) ||
      (Object.assign(e, eqlMixin) && e.eql?.(a[index]))
    ) {
      if (index === i) return true;
      else break;
    }
  }
  return false;
});

In other words, depending on your approach you can either

  • pursue efficiency and sacrifice "correctness" (e.g. NaN should not be equal to each other but the Set object will equate them anyway, or not being able to deduplicate objects and be limited to only primitive values as array elements),
  • or pursue a more general solution and sacrifice performance by using JSON.stringify(),
  • or (even better) pursue a middle ground by providing your own equality algorithms.

It is our job as software developers to decide which approach best suits our applications.

Resources

Need help building your product?

Reach out to us by filling out the form on our contact page. If you need an NDA, just let us know, and we’ll gladly provide one!

Top software development company Malaysia awards
Loading...