Data Cleaning - Data Inspection

 

Data Inspection
Before analyzing or modeling with data, one of the most important tasks is cleaning it. Data is collected in so many different ways, by so many different kinds of people, that it does not always arrive in the form you expect.
Here is an example dataset containing information about roughly 100 movies. We will clean this data today.
1. Data Type:
For the color column, the values should usually be strings, or they could be integers if color is encoded as a number. However, the types shouldn't be mixed. After printing each column's data type, there is no big issue so far.
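A quick way to check for this is dtypes, which reports one type per column; a column that mixes strings and numbers shows up as 'object'. This is a minimal sketch using a small stand-in DataFrame, since the column names of the actual movie dataset are assumptions here:

```python
import pandas as pd

# Stand-in for the movie dataset; column names are hypothetical.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Titanic"],
    "duration": [178, 194],
    "color": ["Color", "Color"],
})

# One dtype per column; mixed-type columns appear as 'object'.
print(df.dtypes)
```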
2. Data Range:
Let's say we collected the heights of high school students. A height of 9 ft would be very suspicious, but we know for sure that 0 feet 0 inches is impossible. We will check whether the data has that kind of issue.
Note that there is a -50 for duration, 202 for movie year, and -7.5 for imdb score, a rating on a scale of 1-10. To see whether any column falls outside its normal range, I found the index of each offending row.
One thing to keep in mind: I won't drop those rows or replace the values with NaN right away, because that isn't always the correct way to fix incorrect data, and there are different ways to fix it. For now, we will just identify the incorrect rows without changing any values.
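The range checks can be sketched with boolean indexing, which flags the offending row indexes without modifying anything. The column names and cutoffs below are assumptions for illustration:

```python
import pandas as pd

# Stand-in data containing the suspicious values from the text.
df = pd.DataFrame({
    "duration": [120, -50, 95],
    "title_year": [1999, 202, 2010],
    "imdb_score": [7.2, -7.5, 8.1],
})

# Flag rows outside sensible ranges; do not change the values yet.
bad_duration = df.index[df["duration"] <= 0]
bad_year = df.index[(df["title_year"] < 1888) | (df["title_year"] > 2025)]
bad_score = df.index[(df["imdb_score"] < 1) | (df["imdb_score"] > 10)]
print(bad_duration.tolist(), bad_year.tolist(), bad_score.tolist())
```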
3. Mandatory Constraints:
Depending on how the data will be used, some columns cannot have missing entries. One of the most common is a name column: the name is the most important indicator for distinguishing records. Since we do not have a specific purpose for this data, we will skip this part.
4. Unique Constraint:
Some columns have to be unique. Here, I used nunique(), which shows the number of unique values in each column. For this dataset we don't have to worry too much about uniqueness, so we will move on to the next part.
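The uniqueness check looks something like the sketch below. Comparing a column's nunique() count to the number of rows tells you whether that column could serve as a unique key (column names are assumptions):

```python
import pandas as pd

# Stand-in data; 'movie_title' repeats, so it is not unique here.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Titanic", "Avatar"],
    "director_name": ["James Cameron", "James Cameron", "James Cameron"],
})

# Number of distinct values per column.
print(df.nunique())

# A column is a valid unique key only if this holds.
print(df["movie_title"].nunique() == len(df))
```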
5. Cross-Field Validation:
Some facts are ordered as cause and effect, or earlier and later. For example, you cannot have a birthday later than your death date or wedding date. Here, we can look for impossible combinations such as a color movie released before 1908. This is how I coded it, and we found a suspicious record.
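The cross-field check combines two conditions across columns. A minimal sketch, assuming columns named 'color' and 'title_year':

```python
import pandas as pd

# Stand-in data; the last row is an impossible combination.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "title_year": [1995, 1920, 1905],
})

# A color film released before 1908 is suspicious: flag, don't fix.
suspicious = df[(df["color"] == "Color") & (df["title_year"] < 1908)]
print(suspicious)
```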
6. Duplicated Data:
It's common to find duplicated rows or columns in a data frame. In this case, we don't want duplicate movies in our data, so I looked for rows with the same director and the same movie title. Although it looks like a lot, there are only 15 duplicated rows.
We will remove the duplicates with a little consideration, since the other columns hold pretty much the same values too. After checking all the duplicates, I decided to keep only the last occurrence of each. Now, we don't see any duplicates.
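The two steps above map onto duplicated() (to inspect all copies) and drop_duplicates(keep="last") (to keep only the last occurrence). A sketch with assumed column names:

```python
import pandas as pd

# Stand-in data with one duplicated director/title pair.
df = pd.DataFrame({
    "director_name": ["A", "A", "B"],
    "movie_title": ["X", "X", "Y"],
    "imdb_score": [7.0, 7.0, 8.0],
})

# keep=False marks every copy, so we can inspect all duplicates first.
dupes = df[df.duplicated(subset=["director_name", "movie_title"], keep=False)]
print(dupes)

# Then keep only the last occurrence of each director/title pair.
df = df.drop_duplicates(subset=["director_name", "movie_title"], keep="last")
print(len(df))
```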
7. Accuracy:
It's good to question how reliable the data is: the data could be someone's opinion rather than the truth. We won't worry too much about it here, but we can check where the data came from, who collected it, and how it was collected.
8. Missing Data:
When you deal with data, you will often see NaN. Unfortunately, not every value is recorded, and you might end up with many NaNs. In fact, if you have too many of them, the data might not be sufficient to accomplish your goals. Thus, it's a good habit to count the NaNs or missing values; one way is to use isna(). Note, however, that missing information can also appear in other forms, such as blanks or placeholder strings.
As you can see, there is the string 'Null' instead of NaN.
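Counting missing values and then converting placeholder strings can be sketched as follows. The 'Null' placeholder comes from the text; the column names are assumptions:

```python
import pandas as pd
import numpy as np

# Stand-in data: one real missing value and one 'Null' placeholder.
df = pd.DataFrame({
    "movie_title": ["Avatar", None, "Titanic"],
    "content_rating": ["PG-13", "Null", "PG-13"],
})

# isna() only catches true NaN/None, so 'Null' is missed here.
print(df.isna().sum())

# Convert the placeholder string so it counts as missing too.
df = df.replace("Null", np.nan)
print(df.isna().sum())
```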
9. Uniformity:
You may have had the confusing experience of not knowing how to convert from one unit of measurement to another. As you know, 0 Celsius is not the same as 0 Fahrenheit. Therefore, it's important to keep data values in a consistent format. For example, I used the unique() function again, and in the 'color' column we have two names for the same value.
So, I changed 'Color' and 'color ' (note the trailing space) to the same word, 'Color', using replace().
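The normalization step can be sketched with unique() to spot the variants and replace() to merge them:

```python
import pandas as pd

# Stand-in 'color' column with two spellings of the same value.
df = pd.DataFrame({"color": ["Color", "color ", "Black and White"]})

# Map the stray variant onto the canonical label.
df["color"] = df["color"].replace({"color ": "Color"})
print(df["color"].unique())
```

For messier columns, chaining str.strip() and str.title() before replace() catches whitespace and casing variants in one pass.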


