Pandas - Grouping
data. For instance, we have this data contains data on the sleep information of different mammals such
as carnivore, herbivore, omnivore, or insectivore.
We can group the data on 'genus', 'vore', or whatever we want to group them on. Let's say we want to
group the data on 'more'. Then all we have to do is put ".groupby(<column name>). Thus, I put "vore"
in the column name.
Since "vores" is a Pandas object, one easy way to see what I get from groupby is "get_group". You just
have to choose the kind of vore you want to look at from the data.
Groupby itself is already great but there are awesome Pandas tools: Cut, QCut, and Pivot Table. We
will learn these operators by solving this problem:
1. Cut:
When you go to a restaurant or an amusement park, you might note that you are categorized by your
age. From the official Disneyland website, these are the categories:
Let's say I have a family of 6 people with ages: [ 1, 3, 7, 15, 45, 50]. We can find separate them based
on the given intervals.
Because the interval is ( , ], I chose the interval from 0 - 2.99, 2.99 - 9.99, 9.99 - 50. Then we know that
there is 3 adults, 2 children, and 1 "No ticket required" person.
Now, here is what the Ohio data look like that the problem offered:
From the first question, we get intervals: 0 - 38, 39 - 42, 43+. So, I separated the column "Educational
Attainment" into the intervals using Pandas.cut and saved the values into a new column "degree".
Since we want to see the most often interval the workers are in the dataset, we will code using
Pandas.value_counts. Then we find out the interval (38,42) is the answer for the first question.
2. QCut: QCut is similar to cut but separate them into qualities instead of intervals. We have an example of 6 data
integers and 2 at the end. From the code, 2 means we separate all the data integers in an equal qualities.
Thus, this is what we get.
If we had 3 instead 2, this is what we get.
Now, the second problem want us to separate "age" column into 6 equally-sized groups using pd.qcut()
and this is what I coded:
3. Pivot Table: Although we successfully divided "age" column, we still have to answer the second question which is,
"Which interval has the highest average Usual Hours Worked? Note that we cannot just do the same
thing as the first question because we need to answer the question with "Usual Hours Worked" data. To
see data with some certain values, we use pivot table.
We set the index which is 'fare' column, set "Usual Hours Worked" as values, and set "mean" for
aggfunc because we want to get the average scale. Then we get the answer for the second question:
Lastly, we will answer the third question, "Using the partitions from the first two parts, what age/degree
combination has the lowest yearly salary on average?" Since we want to group values based on
age/degree, we will put 'degree' column as index and 'fare' column as columns while we set "Yearly
Salary" for values:
*GitHub : https://github.com/KwakSukyoung/coding/blob/master/ACME/Pandas3/pandas3.ipynb















Comments
Post a Comment