One of the features we wanted to add to the company website was the ability to tag and filter posts so that these posts can then be displayed in their own index page. That meant that we needed to read through all the YAML frontmatter, collate the tags into a collection, then deduplicate the collection so we can create an index page per tag.
For example, Ruby's arrays have a built-in method uniq that does what you'd expect:
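A quick illustration (`uniq` keeps the first occurrence of each element; the sample array is assumed):

```ruby
# deduplicate an array of tag names with Array#uniq
["ruby", "js", "ruby", "css", "js"].uniq # => ["ruby", "js", "css"]
```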
We eventually ended up settling on one of them, but we realized that there might be some value in exploring the variety of solutions we'd collected. We asked ourselves: how do they work, and how do they fare in terms of performance?
So we took a few of those solutions that we felt best represented a certain concept, and ran benchmarks on them.
The benchmarks fluctuate depending on the resources your browser is currently using, but the ranking mostly remains the same. For reference, these are the results that we got:
Let's look at them one by one, in order of benchmark performance.
Using filter() and indexOf()
We start off with the fastest of the bunch, and it's also sweet and succinct (albeit requiring some thought to see how it works).
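A minimal sketch of the approach, with a sample tags array assumed for illustration:

```javascript
const tags = ['ruby', 'js', 'ruby', 'css', 'js']; // sample input (assumed)

// keep an element only if this is its first occurrence in the array
const deduped = tags.filter((e, i, a) => a.indexOf(e) === i);

console.log(deduped); // ['ruby', 'js', 'css']
```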
One uncommon thing here is the usage of all three arguments of the filter callback: the current element, the current index, and the array being iterated over.
The filter function finds the index of e in the array a and checks if the current index i is the same. If so, that means that this is the first time that the current e was encountered because the current index and the location of e in the array are the same.
If they are not the same, this check will evaluate to false because indexOf will return the index of the earliest occurrence of e and the current index i would not be the same as that index.
Since indexOf() scans the array from the start for every element, the filter function effectively makes multiple passes over the array (at worst O(n^2) complexity, when all elements are already unique), but it should still be fast enough for simple use cases.
Using filter() and includes()
While similar to the first one, this one uses some this-chicanery to avoid having to declare a temporary variable to hold the comparison list.
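A sketch of what this looks like (sample input assumed; the comparison list is the empty array passed as the second argument):

```javascript
const tags = ['ruby', 'js', 'ruby', 'css', 'js']; // sample input (assumed)

const deduped = tags.filter(function (e) {
  // `this` is the empty array passed as filter()'s second argument (thisArg);
  // push() returns the new length, which is always truthy
  return !this.includes(e) && this.push(e);
}, []);

console.log(deduped); // ['ruby', 'js', 'css']
```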
In its callback function mode (not using an arrow function), Array.prototype.filter() takes two arguments: the callback function (which can be defined or inline) and something called a thisArg. The thisArg is the value to be used as this when the callback is executed.
That means that within the scope of the function, this refers to whatever was passed as the second parameter to filter().
The callback function then uses includes() to check whether e is already present in that array; if it isn't, the negation evaluates to true, the boolean && continues, and the element gets pushed into the array.
Duplicate elements return true from the includes() test, which is negated to false, short-circuiting the && so the push never executes and filter() rejects the element.
It's only slightly slower than the indexOf() version, and depending on who you ask it may be easier to understand. The only unusual thing here is the usage of thisArg, which can surprise developers who are used to arrow functions and never needing to worry about what this references.
Using the Set Object
We like this one a lot due to its simplicity and elegance. You simply create a new Set object from the array, then use the spread syntax to create a new array from the Set's elements. The MDN docs even provide this exact example of deduplicating an array in their sample usage documentation.
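A sketch, with the same sample input assumed as before:

```javascript
const tags = ['ruby', 'js', 'ruby', 'css', 'js']; // sample input (assumed)

// a Set only stores unique values; spreading it back yields a deduplicated array
const deduped = [...new Set(tags)];

console.log(deduped); // ['ruby', 'js', 'css']
```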
It's almost like having unique elements is its sole reason for living (it is, mostly).
Why would you need to create an array from a Set when a Set should be expected to behave much like an array, you might ask? If you look at the documentation for Set, you'd notice that the array methods you'd most expect to use, like map() or reduce(), aren't present (you can still use forEach(), though).
You also don't get any of the array-like behaviors, like being able to retrieve an element via its index, or being able to have "holes" in the sequence by directly assigning a value to non-sequential indexes.
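A quick demonstration of those differences:

```javascript
const s = new Set(['ruby', 'js', 'css']);

console.log(s[0]);         // undefined (Sets have no index access)
console.log(typeof s.map); // 'undefined' (no map() or reduce())
s.forEach(tag => console.log(tag)); // forEach() is available, though
```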
This was what we ended up finally using in our implementation.
Using Object Keys
There are two tricks being (ab)used here:
- Array.prototype.reduce() takes two arguments: the reducer function, and an optional initial value
- an object's property names are unique, and Object.keys() returns them as an array
The reducer function then creates a property in the accumulator acc and returns the accumulator. As expected in reducer functions, the returned accumulator is made available to the next iteration.
The trick here is that if the property already exists, then no new property is created and the assignment simply happens to a property that is already there.
We then call Object.keys() to get the property names of the accumulator object in order to get back a deduplicated array.
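Put together, that could look something like this (sample input assumed; note that object keys are always strings, which will matter later):

```javascript
const tags = ['ruby', 'js', 'ruby', 'css', 'js']; // sample input (assumed)

const deduped = Object.keys(
  tags.reduce((acc, e) => {
    acc[e] = true; // creates the key on first sight, reassigns on duplicates
    return acc;    // the accumulator is handed to the next iteration
  }, {})
);

console.log(deduped); // ['ruby', 'js', 'css']
```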
So how does the actual filter function work? Let's dissect it:
- we start evaluating the expression right to left, starting with the in operator
- the in operator checks whether e is present in the property list of Object t and returns true or false
- the result (true or false) gets assigned to t[e] but this isn't really important; what we want is to create the key e in Object t if it doesn't exist yet
- what's important though is that this expression evaluates to true or false depending on whether property e is present in Object t
- since the expression returns true if e is in t, we want to negate this because if e is already present in t then we're dealing with a duplicate
- Array.prototype.filter() will reject elements for which the filter function returns false, leaving us with a deduplicated array
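The dissected steps above, as a sketch (t is an external object acting as the property list; sample input assumed):

```javascript
const tags = ['ruby', 'js', 'ruby', 'css', 'js']; // sample input (assumed)
const t = {};

// `e in t` is false on first sight and true afterwards;
// the assignment creates the key e in t as a side effect
const deduped = tags.filter(e => !(t[e] = e in t));

console.log(deduped); // ['ruby', 'js', 'css']
```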
There are a lot of tricks and gotchas here, and we do not recommend using this in production code unless you have specific reasons (like trying to confuse your colleagues so they pay attention during your code review :P). It also came in dead last in the benchmark, taking almost double the time of the second-worst algorithm.
Going Beyond the Primitive Case
Let's check our assumptions for a bit and try to see where our algorithms break down. Right now all we've been dealing with are arrays filled with strings. What would happen if our array happens to contain objects instead? Heck, let's throw in a couple of duplicated arrays in the mix as well.
Here's an example using Array.prototype.filter() and Array.prototype.indexOf()
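Assuming an input along these lines:

```javascript
const items = [
  { name: 'x' },
  { name: 'x' },
  [1, 2],
  [1, 2],
]; // duplicated objects and arrays (sample input, assumed)

const result = items.filter((e, i, a) => a.indexOf(e) === i);

// all four elements survive: indexOf() uses strict equality,
// and each object or array literal is a distinct reference
console.log(result.length); // 4
```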
It seems that we didn't get what we were expecting. Strangely, none of the duplicated elements were filtered out.
How about using the Set object?
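With the same sample input assumed:

```javascript
const items = [{ name: 'x' }, { name: 'x' }, [1, 2], [1, 2]]; // sample input (assumed)

const result = [...new Set(items)];

// a Set also compares objects by reference, so nothing gets deduplicated
console.log(result.length); // 4
```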
This didn't work either. We get the same result as the filter() and indexOf() version.
Maybe the magical ways will fare better:
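Running the Object-keys version over the same assumed sample input:

```javascript
const items = [{ name: 'x' }, { name: 'x' }, [1, 2], [1, 2]]; // sample input (assumed)

const deduped = Object.keys(
  items.reduce((acc, e) => ((acc[e] = true), acc), {})
);

// object keys are always strings, so every element was coerced:
// both objects became '[object Object]' and both arrays became '1,2'
console.log(deduped); // ['[object Object]', '1,2']
```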
Now we get something different, but it's still definitely not what we'd consider the "correct" result. One would assume deduplication is such a simple task that it's quite a surprise not to get the expected answer.
Speaking of assumptions, we've been talking about deduplication while neglecting the elephant in the room: what does it mean for something to be a duplicate of another? In other words: what makes one thing different from another?
We've been taking for granted that the functions we've been using have implicit tests of equality that may not necessarily match what we expect from what "equality" means between array elements.
We'll dive deeper into existential questions of equality and truth in part 2.