Data Cleaning - One-Hot Encoding

 

One-Hot Encoding

You have probably faced questions where you must choose from multiple options, often written as 1. Option A, 2. Option B, and 3. Option C. We call these options categorical variables. Since they are variables, we may want to find relationships among them, but that is not as easy as with numeric variables: Option C minus Option B is not Option A, nor is Option A plus Option B equal to Option C. These options simply cannot be treated as numbers. So how do we find the relationships? The answer is one-hot encoding.

One-hot encoding is simple: it transforms a categorical variable into a set of binary indicator variables, one per category, so that we can tell which is which and easily feed the data into a machine learning method such as linear regression. We will show the process with an example.
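Here is a minimal sketch of the idea with pandas (the data is made up for illustration):

```python
import pandas as pd

# A toy example: one categorical column becomes one 0/1 indicator
# column per category
df = pd.DataFrame({"Option": ["A", "B", "C", "A"]})
encoded = pd.get_dummies(df, columns=["Option"], dtype=int)
print(encoded)
#    Option_A  Option_B  Option_C
# 0         1         0         0
# 1         0         1         0
# 2         0         0         1
# 3         1         0         0
```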

We have a housing dataset and want to construct a regression model that predicts SalePrice from the other features. We will 1) clean the data, 2) apply one-hot encoding, and 3) run a linear regression model on it.
1. Clean the data:
First, we want to identify NaN values. I wanted to handle them differently depending on the column type, so I used masks to find the object columns with NaN values and the numeric columns with NaN values separately. For the object columns, I replaced NaN with 'N/A'. Some columns also have a 'None' option, but after a closer look I realized 'None' does not mean the same thing as 'N/A' in every column, so the two stay distinct. For the numeric columns, I could have filled NaN with the column mean, but I wanted to preserve the statistical properties of the data as much as possible. So instead I took each column's min-to-max range and filled the NaN values with values drawn from that range, as sketched below.
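A sketch of this cleaning step; the variable and file names here (`housing`, "train.csv") are illustrative assumptions, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

housing = pd.read_csv("train.csv")  # hypothetical file name

# Masks to split the columns by dtype
obj_cols = housing.columns[housing.dtypes == object]
num_cols = housing.columns[housing.dtypes != object]

# Object columns: replace NaN with the explicit string 'N/A'
# ('None' is a real category in some columns, so it stays distinct)
housing[obj_cols] = housing[obj_cols].fillna("N/A")

# Numeric columns: fill NaN with values drawn from the column's
# observed min-max range rather than collapsing them onto the mean
rng = np.random.default_rng(0)
for col in num_cols:
    missing = housing[col].isna()
    if missing.any():
        lo, hi = housing[col].min(), housing[col].max()
        housing.loc[missing, col] = rng.uniform(lo, hi, size=missing.sum())
```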
2. Apply one-hot encoding:
After cleaning the data, I noticed the 'MSSubClass' column holds numeric values that are really categorical, so we one-hot encode it. It is important to drop one category when you one-hot encode so that the indicators are not collinear with the added constant column; the constant column gives the model the flexibility to make predictions that are not forced through the origin. Next, we encode four other columns that I think impact SalePrice: "HeatingQC", "HouseStyle", "PoolQC", and "SaleType". As before, we drop the first category of each, but we do not add another constant column because we already have one. For convenience, we removed all other categorical features, keeping a copy first for step 4 (see the sketch below).
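Roughly, the encoding step looks like this (again a sketch; the notebook's variable names may differ):

```python
import pandas as pd
import statsmodels.api as sm

# 'MSSubClass' is numeric in the file but categorical in meaning;
# drop_first avoids collinearity with the constant we add next
housing = pd.get_dummies(housing, columns=["MSSubClass"],
                         drop_first=True, dtype=int)
housing = sm.add_constant(housing)

# Keep a copy that still has the other categorical columns (step 4)
housing_full = housing.copy()

# Encode the four chosen features, again dropping the first category;
# no second constant, since we already added one
cats = ["HeatingQC", "HouseStyle", "PoolQC", "SaleType"]
housing = pd.get_dummies(housing, columns=cats, drop_first=True, dtype=int)

# For convenience, drop every remaining categorical feature
housing = housing.drop(columns=housing.select_dtypes(include="object").columns)
```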
3. Run linear regression model:
Using statsmodels' OLS() with 'SalePrice' as the target, we fit the linear regression. We converted the summary into a DataFrame and found the variable with the largest coefficient with respect to 'SalePrice': 'PoolQC_N/A', which was interesting. It has a coefficient of 9.66 and a p-value of 0.000, pointing in a positive direction for 'SalePrice', with an R-squared of 0.852. A sketch of the fit follows.
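A sketch of the fit, assuming the cleaned, encoded frame `housing` from the previous steps:

```python
import statsmodels.api as sm

# Regress SalePrice on everything else
y = housing["SalePrice"]
X = housing.drop(columns=["SalePrice"])
results = sm.OLS(y, X).fit()

# Pull the coefficients into a sortable structure; in the notebook,
# PoolQC_N/A has the largest coefficient and R-squared is about 0.852
print(results.params.sort_values(ascending=False).head())
print(results.rsquared)
```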
4. All-columns linear regression:
What if we keep the other categorical columns instead of dropping them, and run the linear regression again? If you remember, we saved a copy of the housing data before dropping the other categorical columns. With that copy, we run the regression on 'SalePrice' once more (sketched below).
Interestingly, a different column comes out on top: "RoofMatl_Membran", which indicates the roof material is membrane. It has a lower coefficient, 6.74, but still a p-value of 0, and the R-squared rises to 0.934.
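The all-columns variant, sketched under the same assumptions as above (the `housing_full` copy from step 2):

```python
# Same regression on the saved copy, this time one-hot encoding every
# remaining categorical column instead of dropping it
housing_all = pd.get_dummies(housing_full, drop_first=True, dtype=int)

y = housing_all["SalePrice"]
X = housing_all.drop(columns=["SalePrice"])
results_all = sm.OLS(y, X).fit()

# In the notebook, RoofMatl_Membran now tops the coefficient list,
# with R-squared around 0.934
print(results_all.params.sort_values(ascending=False).head())
print(results_all.rsquared)
```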



GitHub: https://github.com/KwakSukyoung/coding/blob/master/ACME/DataCleaning/data_cleaning.ipynb





