High Cardinality in Categorical Variables

When I started my journey as a data science enthusiast, I got stuck while predicting whether a zone was affected by climate change or not. During my intern days I received datasets and started classifying what kind of data each feature was: categorical or continuous, nominal or ordinal, and so on. I got a feature named zip code, which is a nominal categorical data type. For nominal data, one-hot encoding may be a solution, but what about 15k unique values? If we use one-hot encoding here, it will create 15k features, making our feature space fat and leading to a highly sparse feature matrix.
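To see the explosion concretely, here is a minimal sketch with a toy zip-code column (the values are made up for illustration). With only three unique codes we already get three columns; with 15k unique codes we would get 15k:

```python
import pandas as pd

# Toy dataset: a nominal zip-code feature with a few repeated values
df = pd.DataFrame({"zipcode": ["110001", "400001", "560001", "110001"]})

# One-hot encoding creates one new column per unique value
onehot = pd.get_dummies(df["zipcode"], prefix="zip")

n_features = onehot.shape[1]  # 3 here; ~15k with the real data
```

Each row of `onehot` has a single 1 and zeros everywhere else, which is exactly the sparsity problem described above.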

Should I Drop Zip Code?

Zip code has too many unique values, and a model can easily overfit on such a fine-grained feature, leaving us with modest results. At the same time, it is quite possible that our target depends on zip code, because zip code represents geo-location. Dropping the feature could therefore lose information about the target pattern, since the target might be location-sensitive. So dropping is a bad idea until we confirm, using a hypothesis test such as chi-square or ANOVA, that the feature and the target are independent.
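A chi-square independence test for a nominal feature against a binary target can be run with `scipy.stats.chi2_contingency`. The data below is a hypothetical toy sample, not from the actual project:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: is the "affected" target independent of zip code?
df = pd.DataFrame({
    "zipcode":  ["110001", "110001", "400001", "400001", "560001", "560001"],
    "affected": [1, 1, 0, 0, 1, 0],
})

# Build a contingency table (zip code x target) and run the test
table = pd.crosstab(df["zipcode"], df["affected"])
chi2, p_value, dof, expected = chi2_contingency(table)

# A small p-value is evidence that zip code and the target are NOT
# independent -- dropping the feature would then discard real signal.
```

With a large real dataset, a p-value below the chosen significance level (e.g. 0.05) would justify keeping the feature.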


Hashing is a commonly used technique to reduce cardinality. Here we design a hash function f, which takes a zip code and returns a number. There is a catch: f should return only a fixed set of numbers/buckets. The number of buckets is usually an argument of f.
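A minimal sketch of such an f, with the bucket count as an argument. An md5 digest is used here (a deliberate choice, not the only option) because Python's built-in `hash()` is randomized between interpreter runs and would break consistency:

```python
import hashlib

def hash_bucket(zipcode: str, n_buckets: int) -> int:
    """Map a zip code into one of n_buckets fixed buckets.

    A deterministic digest (md5) is used instead of Python's built-in
    hash(), so the same zip code lands in the same bucket on every run.
    """
    digest = hashlib.md5(zipcode.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

bucket = hash_bucket("110001", 50)
```

Whatever the input, the output is always an integer in `[0, n_buckets)`, so the feature's cardinality drops from 15k to 50.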


This is an example of the Indian PIN code system, where every digit represents a specific geo-location at a different level. So we need to design our hash function in such a way that it takes an input zip code and puts it into the bucket the function maps it to. The number of buckets should be chosen reasonably small so the result can be one-hot encoded later. For example, if we choose our hash function to split by zone, it creates only 5 buckets: North, South, East, West, and Center. These can then be one-hot encoded by adding just five feature columns.
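Such a zone-based hash function can be sketched directly from the PIN code's first digit. The digit-to-zone mapping below is illustrative only (the real postal-circle assignments are more granular):

```python
import pandas as pd

# Illustrative mapping from the first PIN digit to a coarse zone;
# NOT the official postal-circle table.
ZONE_BY_FIRST_DIGIT = {
    "1": "North", "2": "North",
    "3": "West",  "4": "West",
    "5": "South", "6": "South",
    "7": "East",  "8": "East",
    "9": "Center",
}

def zone_bucket(pincode: str) -> str:
    """Hash a PIN code into one of a handful of zone buckets."""
    return ZONE_BY_FIRST_DIGIT.get(pincode[0], "Other")

zones = pd.Series(["110001", "400001", "560001"]).map(zone_bucket)

# Only a few columns now, instead of one per unique PIN code
onehot = pd.get_dummies(zones, prefix="zone")
```

The one-hot matrix now has at most one column per zone, regardless of how many distinct PIN codes appear in the data.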

If we do not design our hash function f with the underlying data distribution in mind, it will skew our data distribution and hurt our ML algorithm.

So the goal of our hash function f is not only to map every zip code to a bucket, but also to keep the bucketed distribution as close to the original data distribution as possible.
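One simple sanity check is to count how many training examples land in each bucket and look at the largest bucket's share. The zip codes below are synthetic stand-ins for real training data:

```python
from collections import Counter
import hashlib

def hash_bucket(zipcode: str, n_buckets: int) -> int:
    """Deterministically map a zip code to one of n_buckets buckets."""
    digest = hashlib.md5(zipcode.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Synthetic zip codes; in practice, iterate over the training column
zipcodes = [f"{i:06d}" for i in range(110001, 110201)]

counts = Counter(hash_bucket(z, 10) for z in zipcodes)

# If one bucket holds a dominant share of rows, the encoding has
# skewed the distribution and f should be redesigned.
max_share = max(counts.values()) / len(zipcodes)
```

If `max_share` is far above `1 / n_buckets`, the buckets are unbalanced and the hash (or the bucket count) is worth revisiting.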

Consistency is another important requirement of a hash function: the same zip code should fall into the same bucket, no matter how many times it occurs.

Please share your views on how you handle these cases!

Thank You


Data Science, Cloud Developer, geeky