Note: hashCode() and equals() must be kept consistent. It is best to override hashCode() whenever you override equals(), so that any two objects that are equal have the same hash code. Conversely: if two hash codes differ, the objects cannot be equal; if two hash codes are equal, the objects may or may not be equal.
This matters because HashMap's get() first compares hash codes and only then calls equals(); a key is considered "the same" only when both the hash codes match and equals() returns true.
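The contract above can be illustrated with a small, self-contained example. The class Point below is hypothetical, written for this article; it overrides both methods so that two logically equal keys hash to the same bucket:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// A hypothetical key class that overrides BOTH equals() and hashCode(),
// so that two logically equal Points land in the same bucket.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override public int hashCode() {
        return Objects.hash(x, y); // equal fields imply equal hash code
    }
}

public class ContractDemo {
    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "first");
        // A distinct but equal Point finds the same entry:
        // hashCode() locates the bucket, equals() confirms the key.
        System.out.println(map.get(new Point(1, 2))); // prints "first"
    }
}
```

Had Point overridden only equals(), the two instances would likely hash to different buckets and the lookup would miss.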
1. HashMap Overview:
HashMap is a non-synchronized, hash-table-based implementation of the Map interface. It provides all optional map operations and permits null values and the null key. It makes no guarantees about the order of the map; in particular, it does not guarantee that the order will remain constant over time.
2. HashMap's data structure:
In the Java programming language there are two basic building blocks: arrays and references (simulated pointers). All data structures can be built from these two, and HashMap is no exception. HashMap is in fact a "hash of linked lists": a combination of an array and linked lists.
As the figure above shows, the underlying structure of HashMap is an array, and each slot of the array is the head of a linked list. When a new HashMap is created, this array is initialized.
The source code is as follows:
/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    final int hash;
    ...
}
As can be seen, Entry is the element type of the array. Each Map.Entry is a key-value pair that also holds a reference to the next element, which is what forms the linked list.
3. HashMap storage and retrieval:
1) Storage:
public V put(K key, V value) {
    // HashMap allows null keys and null values.
    // When the key is null, call putForNullKey to place the
    // value at the first position of the array.
    if (key == null)
        return putForNullKey(value);
    // Recompute the hash value based on the key's hashCode.
    int hash = hash(key.hashCode());
    // Find the index in the table for this hash value.
    int i = indexFor(hash, table.length);
    // If the Entry at index i is not null, keep traversing
    // the chain via e.next.
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    // No matching Entry was found: there is no Entry for this key yet.
    modCount++;
    // Add the key and value at index i.
    addEntry(hash, key, value, i);
    return null;
}
From the source above we can see that when we put an element into a HashMap, we first recompute a hash from the key's hashCode, and that hash determines the element's position (subscript) in the array. If other elements are already stored at that position, they are kept there as a linked list: the newly added element is placed at the head of the chain and earlier elements further down it. If there is no element at that position yet, the new element is simply placed there.
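The return-value semantics of put() described above are easy to observe with the standard HashMap (a minimal sketch):

```java
import java.util.HashMap;
import java.util.Map;

public class PutDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();

        // First insertion: no previous mapping, so put() returns null.
        System.out.println(map.put("a", 1));  // null

        // Same key again: the value is overwritten, the old value returned.
        System.out.println(map.put("a", 2));  // 1
        System.out.println(map.size());       // 1 -- still a single entry

        // A null key is allowed and handled by the putForNullKey path.
        map.put(null, 42);
        System.out.println(map.get(null));    // 42
    }
}
```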
The addEntry(hash, key, value, i) method places the key-value pair at index i of the table array, based on the computed hash. addEntry is a package-private method of HashMap; the code is as follows:
void addEntry(int hash, K key, V value, int bucketIndex) {
    // Get the Entry currently at bucketIndex.
    Entry<K,V> e = table[bucketIndex];
    // Put the newly created Entry at bucketIndex and let the
    // new Entry point to the original one.
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    // If the number of key-value pairs in the Map exceeds the limit,
    // double the length of the table.
    if (size++ >= threshold)
        resize(2 * table.length);
}
When the system stores a key-value pair in a HashMap, the value in the Entry is not considered at all; the storage location of each Entry is computed from the key alone. We can think of the value as an attachment to the key: once the system has determined where the key lives, the value is simply stored alongside it.
The hash(int h) method recomputes a hash from the key's hashCode. This algorithm mixes in the high bits, to prevent the collisions that would occur when hash codes differ only in their high bits while the low bits stay the same.
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
To find an element in a HashMap we must locate its position in the array from the key's hash; how that position is computed is the hash algorithm. As mentioned earlier, HashMap's data structure is a combination of an array and linked lists, so we naturally want the elements to be distributed as evenly as possible across the array, ideally with only one element per slot. Then, when the hash algorithm leads us to a position, we immediately know the element there is the one we want, without traversing a chain, which greatly improves query efficiency.
For any given object, as long as its hashCode() return value is the same, the hash computed by hash(int h) is always the same. The first idea that comes to mind is to take the hash modulo the array length, which distributes elements fairly evenly. However, the modulo operation is relatively expensive, so HashMap does this instead: it calls the indexFor(int h, int length) method to compute the index in the table array at which the object should be stored.
The code of the indexFor(int h, int length) method is as follows:
static int indexFor(int h, int length) {
    return h & (length - 1);
}
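To see why the bit-mixing in hash(int h) matters before this masking step, consider two hash codes that differ only in their high bits (the values 0x10000 and 0x20000 below are arbitrary choices for illustration). Copying the two JDK 6 helper methods verbatim:

```java
public class HashSpreadDemo {
    // The two JDK 6 helper methods, copied from the source above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int a = 0x10000, b = 0x20000; // same low bits, different high bits

        // Without mixing, the mask discards the high bits: both hit bucket 0.
        System.out.println(indexFor(a, 16)); // 0
        System.out.println(indexFor(b, 16)); // 0

        // With mixing, the high bits influence the low bits,
        // and the two keys land in different buckets.
        System.out.println(indexFor(hash(a), 16));
        System.out.println(indexFor(hash(b), 16));
    }
}
```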
This method is very clever: it obtains the object's index via h & (table.length - 1), and the length of HashMap's underlying array is always a power of two. This is one of HashMap's speed optimizations. The HashMap constructor contains the following code:
int capacity = 1;
while (capacity < initialCapacity)
    capacity <<= 1;
This code ensures that the capacity of the HashMap is always a power of two at initialization, i.e., the length of the underlying array is always a power of two.
When length is a power of two, h & (length - 1) is equivalent to h % length for non-negative h (and, unlike %, it also yields a valid index when h is negative), but & is more efficient than %.
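The equivalence holds only because the length is a power of two; a quick brute-force check (a minimal sketch) confirms it:

```java
public class MaskModDemo {
    public static void main(String[] args) {
        int length = 16; // a power of two, as HashMap guarantees
        for (int h = 0; h < 1000; h++) {
            // For power-of-two lengths, masking and modulo agree
            // for non-negative h.
            if ((h & (length - 1)) != (h % length)) {
                throw new AssertionError("mismatch at h=" + h);
            }
        }
        System.out.println("h & (length-1) == h % length for all h tested");
    }
}
```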
This looks simple, but there is more going on than it seems. An example illustrates it:
Assume array lengths of 15 and 16, and mixed hash values of 8 and 9. The results of the & operation are as follows:
h & (table.length - 1):

8 & (15-1):  0100 & 1110 = 0100
9 & (15-1):  0101 & 1110 = 0100
----------------------------------
8 & (16-1):  0100 & 1111 = 0100
9 & (16-1):  0101 & 1111 = 0101
As the example shows, when 8 and 9 are ANDed with 15-1 (1110) they produce the same result, i.e., they are mapped to the same position in the array. That is a collision: 8 and 9 end up at the same slot, forming a linked list, and a query must then traverse that list to reach 8 or 9, which reduces query efficiency. We can also see that when the array length is 15, ANDing with 1110 forces the last bit to 0, so the positions 0001, 0011, 0101, 0111, 1001, 1011 and 1101 can never hold an element; a great deal of space is wasted. Worse, the number of usable positions is then far smaller than the array length, which further raises the chance of collision and slows queries. When the array length is 16, i.e., a power of two, every bit of length - 1 is 1, so the AND preserves all the low bits of the hash. Combined with the hash(int h) method, which mixes the key's hashCode with its high bits, this ensures that only keys whose masked hash bits genuinely coincide end up at the same position forming a linked list.
Therefore, when the array length is a power of two, the probability that different keys compute the same index is lower, so the data is distributed more evenly across the array. That means fewer collisions, and queries rarely need to traverse a linked list at any slot, so query efficiency is higher.
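The waste described above can be quantified: masking with 15-1 (1110) can only ever produce even indices, so at most 8 of 15 slots are reachable, while a 16-slot table can use every slot. A short sketch counting the distinct indices:

```java
import java.util.HashSet;
import java.util.Set;

public class BucketUsageDemo {
    // Count how many distinct indices h & (length-1) can produce.
    static int distinctIndices(int length) {
        Set<Integer> used = new HashSet<>();
        for (int h = 0; h < 10_000; h++) {
            used.add(h & (length - 1)); // the indexFor computation
        }
        return used.size();
    }

    public static void main(String[] args) {
        // length 15: mask 1110 forces the last bit to 0,
        // so only 8 indices are reachable.
        System.out.println(distinctIndices(15)); // 8
        // length 16: mask 1111 keeps all four low bits,
        // so all 16 indices are reachable.
        System.out.println(distinctIndices(16)); // 16
    }
}
```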
Looking again at the source of the put method above: when the program tries to put a key-value pair into a HashMap, it first determines the Entry's storage location from the key's hashCode() return value. If the keys of two Entry objects have the same hashCode() return value, their storage location is the same. If those two keys also return true from an equals comparison, the value of the newly added Entry overwrites the value of the existing Entry, but the key is not replaced. If the two keys return false from equals, the new Entry and the existing Entry form an Entry chain, with the new Entry at the head of the chain (see the description of the addEntry() method above for details).
2) Read:
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}
With the storage hash algorithm above as background, this code is easy to understand: when getting an element from a HashMap, we first compute the hash of the key's hashCode, find the slot at the corresponding position in the array, and then locate the required element in that slot's linked list via the key's equals method.
3) To summarize: at the bottom, HashMap treats each key-value pair as a single Entry object, and uses an Entry[] array to store all of them. When an Entry needs to be stored, its slot in the array is determined by the hash algorithm, and its position within that slot's linked list is resolved by the equals method. When an Entry needs to be retrieved, the hash algorithm again locates the slot, and the equals method picks the right Entry out of the linked list there.
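The whole mechanism can be condensed into a toy sketch. TinyMap below is written for this article and is not the JDK code; it ignores null keys, resizing and the bit-mixing step, keeping only the array-plus-linked-list idea:

```java
// A deliberately simplified HashMap-like structure: a fixed-size bucket
// array whose slots are singly linked Entry chains.
public class TinyMap<K, V> {
    static class Entry<K, V> {
        final K key; V value; Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = new Entry[16]; // power of two

    private int indexFor(K key) {
        return key.hashCode() & (table.length - 1);
    }

    public V put(K key, V value) {
        int i = indexFor(key);
        // Walk the chain: if an equal key exists, overwrite its value
        // and return the old one.
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        // Otherwise insert the new entry at the head of the chain.
        table[i] = new Entry<>(key, value, table[i]);
        return null;
    }

    public V get(K key) {
        for (Entry<K, V> e = table[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

Even keys that collide into the same bucket are kept apart on the chain and told apart by equals(), exactly the resolution strategy described above.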
4. HashMap's resize (rehash):
As more and more elements are added to a HashMap, the chance of hash collision rises, because the length of the array is fixed. Therefore, to keep queries efficient, the HashMap's array must be enlarged. Array expansion also appears in ArrayList; it is a common operation. After the array is enlarged, the most expensive step follows: every entry in the original array must have its position recomputed and be placed into the new array. This is the resize (rehash).
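In outline, the rehash works like this (a simplified sketch written for this article, not the JDK's actual transfer() method; entries carry only their hash here):

```java
// Simplified rehash: every entry's index must be recomputed because
// it depends on the (new) table length.
public class ResizeSketch {
    static class Entry {
        final int hash;
        Entry next;
        Entry(int hash, Entry next) { this.hash = hash; this.next = next; }
    }

    static Entry[] resize(Entry[] oldTable) {
        Entry[] newTable = new Entry[oldTable.length * 2];
        for (Entry head : oldTable) {
            // Unlink each entry from its old chain and re-link it
            // at the head of its new bucket's chain.
            while (head != null) {
                Entry e = head;
                head = head.next;
                int i = e.hash & (newTable.length - 1); // new index
                e.next = newTable[i];
                newTable[i] = e;
            }
        }
        return newTable;
    }
}
```

Every entry is visited and re-linked, which is why resizing is the most performance-sensitive moment in a HashMap's life.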
So when does a HashMap expand? When the number of elements exceeds arraySize * loadFactor, the array is enlarged. The default loadFactor is 0.75, a compromise value. That is, by default, with an array size of 16, once the number of elements exceeds 16 * 0.75 = 12 the array grows to 2 * 16 = 32, double the size, and the position of every element is recomputed. This is a very expensive operation, so if we can predict the number of elements the HashMap will hold, presetting its capacity can noticeably improve performance.
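If the number of entries is known in advance, the rehashing can be avoided by presizing. With the default load factor of 0.75, a requested capacity of expected / 0.75 + 1 ensures the threshold is never crossed. This is a standard rule of thumb, not JDK code; the helper below is written for this article:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    // A capacity large enough that `expected` insertions never trigger
    // a resize under the default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    public static void main(String[] args) {
        int expected = 1000;
        // HashMap rounds the requested capacity up to a power of two
        // internally, so the actual table will be at least this large.
        Map<Integer, Integer> map = new HashMap<>(capacityFor(expected));
        for (int i = 0; i < expected; i++) {
            map.put(i, i); // no resize occurs during these insertions
        }
        System.out.println(map.size()); // 1000
    }
}
```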
5. HashMap performance parameters:
HashMap contains the following constructors:
HashMap(): Creates a HashMap with an initial capacity of 16 and a load factor of 0.75.
HashMap(int initialCapacity): Creates a HashMap with the specified initial capacity and a load factor of 0.75.
HashMap(int initialCapacity, float loadFactor): Creates a HashMap with the specified initial capacity and the specified load factor.
HashMap's base constructor, HashMap(int initialCapacity, float loadFactor), takes two parameters: the initial capacity initialCapacity and the load factor loadFactor.
initialCapacity: the initial capacity of the HashMap, i.e., the length of the underlying array.
loadFactor: the load factor, defined as: (actual number of elements in the hash table, n) / (capacity of the hash table, m).
The load factor measures how fully the hash table's space is used: the larger the load factor, the more heavily loaded the table, and vice versa. For a hash table that resolves collisions by chaining, the average time to find an element is O(1 + a), where a is the load factor. A larger load factor uses space more fully, at the cost of lower lookup efficiency; a load factor that is too small leaves the table sparse and wastes space badly.
In the implementation of HashMap, the maximum number of elements a HashMap can hold before resizing is determined by its threshold field:
The code is as follows:
threshold = (int)(capacity * loadFactor);
From the definition of the load factor, threshold is the maximum number of elements allowed at this loadFactor and capacity; when that number is exceeded, the map is resized, which lowers the actual load factor. The default load factor of 0.75 is a balanced choice between space and time efficiency. When the number of elements exceeds threshold, the HashMap's capacity is doubled:
if (size++ >= threshold) resize(2 * table.length);
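Plugging in the defaults confirms the threshold of 12 used in the resize discussion above (a trivial check):

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;        // default initial capacity
        float loadFactor = 0.75f; // default load factor
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12
    }
}
```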
6. Fail-Fast mechanism:
We know that java.util.HashMap is not thread-safe: if another thread structurally modifies the map while one thread is using an iterator over it, a ConcurrentModificationException is thrown. This is the so-called fail-fast strategy.
This strategy is implemented in the source through the modCount field. modCount, as the name implies, counts modifications: every modification of the HashMap's contents increments it. When an iterator is initialized, this value is copied into the iterator's expectedModCount:
HashIterator() {
    expectedModCount = modCount;
    if (size > 0) { // advance to first entry
        Entry[] t = table;
        while (index < t.length && (next = t[index++]) == null)
            ;
    }
}
During iteration, modCount and expectedModCount are compared; if they are not equal, some other code has modified the map:
Note that in this JDK version modCount is declared volatile, ensuring that modifications are visible across threads.
final Entry<K,V> nextEntry() {
    if (modCount != expectedModCount)
        throw new ConcurrentModificationException();
    ...
}
In HashMap's API documentation, it is stated:
The iterators returned by all of HashMap's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator throws a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.
Note that the fail-fast behavior of an iterator cannot be guaranteed; generally speaking, it is impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. It would therefore be wrong to write a program that depends on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.
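The fail-fast behavior is easy to trigger even from a single thread, since any structural modification that bypasses the iterator desynchronizes modCount from expectedModCount (a minimal sketch):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // Wrong: structural modification behind the iterator's back.
        try {
            for (String key : map.keySet()) {
                map.remove(key); // modCount changes, expectedModCount does not
            }
        } catch (ConcurrentModificationException expected) {
            System.out.println("fail-fast triggered");
        }

        // Right: remove through the iterator, which keeps
        // expectedModCount in sync with modCount.
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        System.out.println(map.size()); // 0
    }
}
```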