Thursday, May 15, 2008

Java Hashing

lets start with hashing in general and how does java implement same and what are the drawbacks ( if any )

hashing is a process of generating an index or address ( which is of smaller length) basing on ur data . A good hash function is the one which generates distinct addresses for distinct file names.

As a simple example of the using of hashing in databases, a group of people could be arranged in a database like this:
Abernathy, Sara Epperdingle, Roscoe Moore, Wilfred Smith, David (and many more sorted into alphabetical order)
Each of these names would be the key in the database for that person's data. A database search mechanism would first have to start looking character-by-character across the name for matches until it found the match (or ruled the other entries out). But if each of the names were hashed, it might be possible (depending on the number of names in the database) to generate a unique four-digit key for each name. For example:
7864 Abernathy, Sara 9802 Epperdingle, Roscoe 1990 Moore, Wilfred 8822 Smith, David (and so forth)
A search for any name would first consist of computing the hash value (using the same hash function used to store the item) and then comparing for a match using that value. It would, in general, be much faster to find a match across four digits, each having only 10 possibilities, than across an unpredictable value length where each character had 26 possibilities.

The hashing algorithm is also called hash function


Uses of hashing
(i) Hashes play a role in security systems where they're used to ensure that transmitted messages have not been tampered with. The sender generates a hash of the message, encrypts it, and sends it with the message itself. The recipient then decrypts both the message and the hash, produces another hash from the received message, and compares the two hashes. If they're the same, there is a very high probability that the message was transmitted intact.

(ii) Hashing is also a common method of accessing data records.

Java and hashing
Every Java object has a hashCode() and an equals() method. While the Java language does not provide direct support for associative arrays -- arrays that can take any object as an index -- the presence of the hashCode() method in the root Object class clearly anticipates the ubiquitous use of HashMap (and its predecessor, Hashtable). Under ideal conditions, hash-based containers offer both efficient insertion and efficient retrieval; supporting hashing directly in the object model facilitates the development and use of hash-based containers.

if two objects are equal according to the equals() method, they must have the same hashCode() value (although the reverse is not generally true).

Why override equals() and hashCode()?

What would happen if a class did not override equals() and hashCode()? Nothing, if we never used an that class as a key in a HashMap or other hash-based collection. However, if we were to use such a classes object for a key in a HashMap, we would not be able to reliably retrieve the associated value, unless we used the exact same classes instance in the get() call as we did in the put() call. This would require ensuring that we only use a single instance of the Integer object corresponding to a particular integer value throughout our program. Needless to say, this approach would be inconvenient and error prone.

There are some restrictions placed on the behavior of equals() and hashCode(), which are enumerated in the documentation for Object. In particular, the equals() method must exhibit the following properties:
· Symmetry: For two references, a and b, a.equals(b) if and only if b.equals(a)
· Reflexivity: For all non-null references, a.equals(a)
· Transitivity: If a.equals(b) and b.equals(c), then a.equals(c)
· Consistency with hashCode(): Two equal objects must have the same hashCode() value
· For any non-null reference value x, x.equals(null) must return false.

public boolean equals(Object o) {
if (o == null)
return false;
...
}

public boolean equals(Object o) {
if (!(o instanceof MyType))
return false; // also return null if o is null .. no spl check required
...
}

Building hashing into the root object class of the Java class library was a very sensible design compromise -- it makes using hash-based containers so much easier and more efficient. However, several criticisms have been made of the approach to and implementation of hashing and equality in the Java class library. The hash-based containers in java.util are very convenient and easy to use, but may not be suitable for applications that require very high performance. While most of these will never be changed, it is worthwhile to keep in mind when you're designing applications that rely heavily on the efficiency of hash-based containers. These criticisms include:

Too small a hash range. Using int, instead of long, for the return type of hashCode() increases the possibility of hash collisions.


Bad distribution of hash values. The hash values for short strings and small integers are themselves small integers, and are close to the hash values of other "nearby" objects. A more well-behaved hash function would distribute the hash values more evenly across the hash range.


No defined hashing operations. While some classes, such as String and List, define a hash algorithm to be used in combining the hash values of its constituent elements into a single hash value, the language specification does not define any approved means of combining the hash values of multiple objects into a new hash value.

Difficulty writing equals() when extending an instantiable class that already overrides equals(). The "obvious" ways to define equals() when extending an instantiable class that already overrides equals() all fail to meet the symmetry or transitivity requirements of the equals() method. This means that you must understand the structure and implementation details of classes you are extending when overriding equals(), and may even need to expose private fields in the base class as protected to do so, which violates principles of good object-oriented design.

Basically what happens is when hashCode returns a number then its searched for in collection and gets a list, and then it uses equals( ) to search in bucket that it got.

Equals( ) defines logical equality

Recipie for a equals( ) method

1. Use the == operator to check if the argument is a reference to this object.
If so, return true. This is just a performance optimization, but one that is worth
doing if the comparison is potentially expensive.
2. Use the instanceof operator to check if the argument is of the correct
type. If not, return false. Typically, the correct type is the class in which the
method occurs. Occasionally, it is some interface implemented by this class.
Use an interface if the class implements an interface that refines the equals
contract to permit comparisons across classes that implement the interface. The
collection interfaces Set, List, Map, and Map.Entry have this property.
3. Cast the argument to the correct type. Because this cast was preceded by an
instanceof test, it is guaranteed to succeed.
4. For each “significant” field in the class, check to see if that field of the argument
matches the corresponding field of this object. If all these tests succeed,
return true; otherwise, return false.
5. When you are finished writing your equals method, ask yourself three
questions: Is it symmetric, is it transitive, and is it consistent?

Here is the contract, copied from the java.lang.Object specification:
n Whenever it is invoked on the same object more than once during an execution
of an application, the hashCode method must consistently return the
same integer, provided no information used in equals comparisons on the
object is modified. This integer need not remain consistent from one execution
of an application to another execution of the same application.
n If two objects are equal according to the equals(Object) method, then calling
the hashCode method on each of the two objects must produce the same
integer result.
n It is not required that if two objects are unequal according to the equals(Object)
method, then calling the hashCode method on each of the two objects
must produce distinct integer results. However, the programmer should be
aware that producing distinct integer results for unequal objects may improve
the performance of hash tables.

References
http://java.sun.com/docs/books/effective/chapters.html
http://www.owasp.org/index.php/Hashing_Java

2 comments:

subin said...

How HashMaps Work?

Suppose a HashMaps size exceededand it started expanding,
will it do rehashing ? why should it if it does>

Lavnish said...

HOW HASHMAP WORK
When you call the get method on a Map (including a HashMap) the Map does two steps:

Firstly it checks to see if it has any keys with a matching hashCode. This is a very quick operation, so it can return null immediately if it hasn't.

Only then does it look up the actual keyed entry that it is to return, for which it needs the equals method.

IF SIZE EXCEEDS ...

When the number of entries in the hashmap exceeds... the product of 'load factor' and 'current capcity', the hashmap is rehashed so as to have twice the number of buckets. A value of 0.75 for load factor is good tradeoff between time and space cost.

hence The load factor and initial capacity effect the performance of the hashmap.

WHY SHOULD IT REHASH
It should rehash ... to allow facility of accessing elements "effeciently" ....else if its not rehashed as and when the hashmap is growing it will become less usable

HTH