HashSets are known from its efficiency, when it comes to checking, if they contain a particular object. They were designed exactly for doing that in a constant time. Contrary, checking objects existence in an ArrayList has a linear computational complexity, which is less efficient in theory. But is it true for small sets as well?
Check if the set contains the object
Recently, I had a case when I needed to implement a method to check, if there is a particular object in a list of elements. For example, here is the input list of numbers:
List<Integer> list = Lists.newArrayList(3, 5, 7, 2, 19);
and the method should verify, if it contains the number 7. A no brainer solution is to use the contains method from ArrayList like this:
list.contains(7)
But we know, that the computer will iterate over the list of elements until it finds an item equal to 7. For a list 10 elements long, there are at most 10 elements to compare to 7. If a list is 1000 elements long, there are at most 1000 elements to compare. The longer the input list is, the expected time needed to find it is greater. The computational complexity is linear O(n).
Another solution, that seems dedicated to this particular problem is to use HashSet. In that approach, we need to build a HashSet object and again use the contains method.
HashSet<Integer> set = new HashSet<>(Lists.newArrayList(3, 5, 7, 2, 19));
set.contains(7);
It's known, that the contains method on HashSet has a constant execution time. The time is independent on a size of the set. No matter how big the set is, checking existence of any object takes a constant time. To be precise, the time is independent on the set size.
Knowing that, choosing HashSet seems to be an obvious choice. But on the other hand, a constant time doesn't mean, that it is a small time. If each execution of set.contains(7) took 1 minute, no matter how big the set is, it would still be a constant access time, even though we would say it is slow for a set of 5 elements. That is exactly why I started thinking about how big the set must be to make sense of using hash set versus an array list.
Performance of contains method HashSet vs ArrayList
Actually, testing this is relatively easy. HashSet and ArrayList does not require any extra libraries or a setup. To make it simple, I used parametrized tests in junit to test various sizes of the collections. Here is a test method that initializes a HashSet object, iteratively adds elements to it. And then iteratively checks if the set contains each elements. The time is measure for the whole - initialization and the verification of containment.
@ParameterizedTest
@ValueSource(ints = {1000, 5, 10, 100, 1000, 10000})
void testArrayListContains(int size) {
long startTime = System.nanoTime();
List<Integer> set = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
set.add(i);
}
int foundCnt = 0;
for (int i = 0; i < size; i++) {
if (set.contains(i)) {
foundCnt++;
}
}
long endTime = System.nanoTime();
long durationMs = (endTime - startTime) / 1000;
System.out.println("Test of ArrayList.contains for " + foundCnt + " numbers took " + durationMs + "ms");
}
As I want to compare performance of both collections, I prepared an alternative method, that uses ArrayList:
@ParameterizedTest
@ValueSource(ints = {1000, 5, 10, 100, 1000, 10000})
void testArrayListContains(int size) {
long startTime = System.nanoTime();
List<Integer> set = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
set.add(i);
}
int foundCnt = 0;
for (int i = 0; i < size; i++) {
if (set.contains(i)) {
foundCnt++;
}
}
long endTime = System.nanoTime();
long durationMs = (endTime - startTime) / 1000;
System.out.println("Test of ArrayList.contains for " + foundCnt + " numbers took " + durationMs + "ms");
}
It does pretty much the same, but using a list, not a set.
Both test methods are parametrized with six different collection sizes. Actually, five that matters. I noticed, that the first execution always takes more time than the others. I don't want you to think, that a collection of five elements performs worse than a collection of ten elements, so I added an initial warm up on 1000 elements to both methods. It is just a test round, so we will not look at that first execution on 1000 elements.
I executed both tests and here is what I received:
Test of HashSet.contains for 5 numbers took 12ms
Test of HashSet.contains for 10 numbers took 14ms
Test of HashSet.contains for 100 numbers took 66ms
Test of HashSet.contains for 1000 numbers took 342ms
Test of HashSet.contains for 10000 numbers took 3319ms
Test of ArrayList.contains for 5 numbers took 4ms
Test of ArrayList.contains for 10 numbers took 8ms
Test of ArrayList.contains for 100 numbers took 206ms
Test of ArrayList.contains for 1000 numbers took 4691ms
Test of ArrayList.contains for 10000 numbers took 79789ms
Surprisingly (or not), ArrayList performs better for small number of elements. A list of 5 and 10, are small enough, that even linearly iterating through the elements to find the desired one is faster than using more advanced hashing tricks in HashSet. HashSet definitely remains a better choice, when there are more items in the set. HashSet for 100 and more seems to be a no brainer.
Performance of remove method HashSet vs ArrayList
The contains method is often used in a context of sets and lists, but the remove method is important as well. I also tested it! Here are both tests of the remove methods.
@ParameterizedTest
@ValueSource(ints = {1000, 5, 10, 100, 1000, 10000})
void testHashSetRemove(int size) {
long startTime = System.nanoTime();
Set<Integer> set = new HashSet<>(size);
for (int i = 0; i < size; i++) {
set.add(i);
}
int foundCnt = 0;
for (int i = 0; i < size; i++) {
if (set.remove(i)) {
foundCnt++;
}
}
long endTime = System.nanoTime();
long durationMs = (endTime - startTime) / 1000;
System.out.println("Test of HashSet.remove for " + foundCnt + " numbers took " + durationMs + "ms");
}
@ParameterizedTest
@ValueSource(ints = {1000, 5, 10, 100, 1000, 10000})
void testArrayListRemove(int size) {
long startTime = System.nanoTime();
List<Integer> set = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
set.add(i);
}
int foundCnt = 0;
for (int i = 0; i < size; i++) {
if (set.remove((Object) i)) {
foundCnt++;
}
}
long endTime = System.nanoTime();
long durationMs = (endTime - startTime) / 1000;
System.out.println("Test of ArrayList.remove for " + foundCnt + " numbers took " + durationMs + "ms");
}
As they are so similar to the ones above, I just put here the results:
Test of HashSet.remove for 5 numbers took 43ms
Test of HashSet.remove for 10 numbers took 31ms
Test of HashSet.remove for 100 numbers took 54ms
Test of HashSet.remove for 1000 numbers took 501ms
Test of HashSet.remove for 10000 numbers took 3433ms
Test of ArrayList.remove for 5 numbers took 33ms
Test of ArrayList.remove for 10 numbers took 10ms
Test of ArrayList.remove for 100 numbers took 67ms
Test of ArrayList.remove for 1000 numbers took 661ms
Test of ArrayList.remove for 10000 numbers took 18880ms
This test has a very similar conclusion. It may make no sense to involve HashSet, if there are only up to 10-99 elements. I am not surprised, that the finding is the same as for the contains method. They both are implemented using the same mechanism: iterating in ArrayList and hashing function in HashSet.
Conclusion
It is clear that, if you have to build a collection just to check containment of a few elements, choose ArrayList if the collection is up to 10-100 elements. Go for HashSet, if there are at least 100 elements. The same applies to removing elements. I haven't spent more time on detailed looking for a sharp boundery of efficiency between both solutions for a simple reason. It probably depends on a hardware used for running the test.
One more remark at the end. The results could be a little different, if I didn't not create the collection from scratch each time, but had it ready as a constant field initialized during the application startup. But that is a story for another time.