High Cardinality in Categorical Variables

When I started my journey as a data science enthusiast, I got stuck while trying to predict whether a zone was affected by climate change or not. During my internship I received datasets and started classifying each feature: whether it was categorical or continuous, nominal or ordinal, and so on. One feature, Zip code, was nominal categorical data. For nominal data, one-hot encoding may be a solution, but what about 15k unique values? One-hot encoding here would create 15k features, bloating the feature space and producing a highly sparse feature matrix.

Should I Drop Zip Code?

Hashing

Hashing is a commonly used technique to reduce cardinality. Here we design a hash function f that takes a zip code and returns a number. The catch is that f should return only a fixed set of numbers (buckets). The number of buckets is usually an argument of f.
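As a rough illustration of such an f, here is a minimal sketch in Python. The function name `hash_bucket`, the default of 32 buckets, and the sample codes are my own assumptions for illustration, not something taken from the dataset above. It uses SHA-256 from the standard library rather than Python's built-in `hash()`, because the built-in is salted per process for strings and would not give the same bucket across runs.

```python
import hashlib

def hash_bucket(zip_code: str, n_buckets: int = 32) -> int:
    """Map a zip code to one of n_buckets buckets (0 .. n_buckets - 1)."""
    # A stable digest gives the same result in every process and on every machine.
    digest = hashlib.sha256(zip_code.strip().encode("utf-8")).hexdigest()
    # Fold the large integer into the fixed bucket range.
    return int(digest, 16) % n_buckets

# Hypothetical sample values; the first and last are the same code.
codes = ["110001", "400001", "560001", "110001"]
buckets = [hash_bucket(code) for code in codes]
```

The repeated code lands in the same bucket both times, and every result stays inside the range set by `n_buckets`, so 15k distinct values collapse into at most 32 features once one-hot encoded. The price is collisions: unrelated zip codes can share a bucket.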

Indian PIN CODE SYSTEM

This is an example of the Indian PIN code system, where each digit represents a geographic location at a different level. We need to design our hash function so that it takes a zip code as input and puts it into the bucket the function maps it to. The number of buckets should be chosen reasonably small so that the result can later be one-hot encoded. For example, if we choose a hash function that divides by zone, it creates only 5 buckets, such as North, South, East, West, and Center. These can be one-hot encoded by adding just five more feature attributes.
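A zone-based f of this kind could be sketched as follows. Note that the digit-to-zone table here is a made-up placeholder chosen only to produce the five buckets named above; it is not the official postal assignment and should be replaced with the real mapping for your data. The names `ZONES`, `FIRST_DIGIT_TO_ZONE`, `zone_bucket`, and `one_hot_zone` are likewise hypothetical.

```python
ZONES = ["North", "South", "East", "West", "Center"]

# Placeholder lookup (an assumption for illustration, NOT the official mapping).
FIRST_DIGIT_TO_ZONE = {
    "1": "North", "2": "North",
    "3": "West", "4": "Center",
    "5": "South", "6": "South",
    "7": "East", "8": "East",
}

def zone_bucket(pin_code: str):
    """Return the zone for a PIN code, or None if its first digit is unmapped."""
    pin_code = pin_code.strip()
    if not pin_code:
        return None
    return FIRST_DIGIT_TO_ZONE.get(pin_code[0])

def one_hot_zone(pin_code: str) -> list:
    """Encode the zone as five 0/1 features, in the order given by ZONES."""
    zone = zone_bucket(pin_code)
    # An unmapped code yields an all-zero row instead of raising an error.
    return [1 if zone == z else 0 for z in ZONES]
```

With this placeholder table, `one_hot_zone("110001")` sets only the North feature, and a code whose first digit is not in the table produces a row of zeros. Because the bucket depends only on the leading digit, nearby locations stay together, which a generic hash would not guarantee.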

If we do not design our hash function f with the underlying data distribution in mind, it can introduce skew into our data distribution and hurt our ML algorithm.

So the goal of our hash function f is not only to map every zip code to a bucket but also to keep the original data distribution as close to intact as possible.
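One simple way to check for this kind of skew is to look at what share of the rows each bucket receives. The sketch below is only one possible check, under my own assumptions: the helper names `bucket_shares` and `skew_ratio`, the toy list of codes, and the choice of "largest share divided by smallest share" as the skew measure are all illustrative.

```python
from collections import Counter

def bucket_shares(values, bucket_fn):
    """Return the fraction of rows that fall into each observed bucket."""
    counts = Counter(bucket_fn(v) for v in values)
    total = len(values)
    return {bucket: count / total for bucket, count in counts.items()}

def skew_ratio(shares):
    """Largest bucket share divided by the smallest (observed buckets only)."""
    return max(shares.values()) / min(shares.values())

# Toy sample (hypothetical): half of the rows start with the digit 1.
codes = ["110001", "110002", "110003", "400001", "560001", "700001"]

# Bucketing by first digit puts half of the rows into a single bucket.
shares_by_digit = bucket_shares(codes, lambda code: code[0])
ratio = skew_ratio(shares_by_digit)
```

Here the bucket for the digit 1 holds half of the rows while each other bucket holds one sixth, so the ratio is about 3. Running the same check on the real column, for each candidate f, makes it easier to see which bucketing keeps the shares closest to what you expect from the original data.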

Consistency is another important requirement of a hash function. The same zip code should fall into the same bucket no matter how many times it occurs.

Please share your views on how you handle these cases!

Thank You

https://www.linkedin.com/in/ravi-pandey-0006807183

Data Science , Cloud Developer, geeky
