abc - Wrap arrays as fast, immutable collections
Primitive arrays are fast and compact in memory. But they are mutable, have a lot of quirks, and do not even have a working equals and hashcode. So they are rarely used directly as data structures in scala.
This library wraps flat, primitive arrays as immutable sequences, sets and maps. Now at first, the idea of having an immutable data structure backed by a single flat array without any sort of tree structure might seem ridiculous. Updating a single element is going to require copying the entire array and is thus going to be an O(n) operation.
But take a look at the typical usage patterns for immutable collections. Often, you transform the entire collection repeatedly with a sequence of map, flatmap and filter operations. So optimizing for single element updates at the expense of things like compact in-memory representation might not be worth it. So the approach in this library is not to make single element updates as fast as possible, but to give you the tools to avoid having to do them in the first place.
Now obviously building a collection by starting with an empty immutable collection and then adding all elements one-by-one would be an O(n2) operation and thus totally unacceptable. See benchmark results below. But intentionally doing this is also very rare in my experience. Usually you have some sort of sequence or iterable which you want to convert into a collection. That can always be done in at most O(n log n) using the sonic reducer.
The benchmarks compare creation, membership test, and bulk operations for
ArraySet[A] is just a wrapper around an ordered array. Lookup for contains etc. is done using a binary search and is therefore O(log n). Elements are sorted, so an
ArraySet[A] is most closely comparable with a
SortedSet[A]. But it will still perform better than a binary search tree, since the data is in a single continuous section of memory. And of course it will work up to very large collections where a binary tree will run out of memory because of its memory overhead.
The best approach to build ArraySets is to use the constructor that takes a sequence of elements. This is referred to as "create bulk" in the benchmarks. The naive approach of building an ArraySet is to add elements one by one. This is referred to as "create elements" in the benchmarks. In the scala collections, bulk creation is internally done by adding elements sequentially, so there is no difference between the two approaches.
As you can see from this benchmark, bulk creation of ArraySet[Int] is consistently much faster than building HashSet[Int] or SortedSet[Int]. The naive approach of adding elements sequentially unsurprisingly gets much slower for high n, since it is an O(n2) operation. So don't do that.
The essential set/element operation for a set is membership test. The two cases are for if the element is contained in the set, and if it is not contained in the set (outside).
For a successful membership test, the performance is better than that of the SortedSet and sometimes even better than that of the HashSet. The performance difference between all three collections is not as high. The performance for a failed membership test is somewhere in the middle between SortedSet (worst) and HashSet (best). Note that both the x scale and the y scale are logarithmic, so ArraySet is doing quite a bit better than SortedSet due to the compact in-memory representation.
For all major set/set operations that are supported by scala collections, ArraySet is faster than SortedSet, often by two orders of magnitude. It is also significantly faster than HashSet for large n. Note the log scale on both the x- and the y-axis.
There are multiple lines because each benchmark is done multiple times for varying overlaps. See the benchmark source.
Set/set operations are implemented using my Minimum-Comparison Merging algorithm.
Memory usage of various Seq[Int] in bytes
For List[Int], memory usage is almost exactly 10 times as large as for an Array[Int] or ArraySeq[Int]. For Vector[Int], memory usage is only about 5 times as high as Array[Int].
Memory usage of various Set[Int] in bytes
For Int sets, the memory usage of SortedSet is about 12 times more than that of ArraySet. The memory usage of HashSet is about 14 times higher than for ArraySet.
Memory usage of various Map[Int, Int] in bytes
For Int sets, the memory usage of SortedSet is about 11 times more than that of ArraySet. The memory usage of HashSet is about 12 times higher than for ArraySet.