Pandas - Grouping

Pandas - Grouping

Pandas is a strong tool to manage and to analyze data. One of the great functions of Pandas is to group

data. For instance, we have this data contains data on the sleep information of different mammals such

as carnivore, herbivore, omnivore, or insectivore.

We can group the data on 'genus', 'vore', or whatever we want to group them on. Let's say we want to

group the data on 'more'. Then all we have to do is put ".groupby(<column name>). Thus, I put "vore"

in the column name.

Since "vores" is a Pandas object, one easy way to see what I get from groupby is "get_group". You just

have to choose the kind of vore you want to look at from the data.

Groupby itself is already great but there are awesome Pandas tools: Cut, QCut, and Pivot Table. We

will learn these operators by solving this problem:

1. Cut:

When you go to a restaurant or an amusement park, you might note that you are categorized by your

age. From the official Disneyland website, these are the categories:

Let's say I have a family of 6 people with ages: [ 1, 3, 7, 15, 45, 50]. We can find separate them based

on the given intervals.

Because the interval is ( , ], I chose the interval from 0 - 2.99, 2.99 - 9.99, 9.99 - 50. Then we know that

there is 3 adults, 2 children, and 1 "No ticket required" person.

Now, here is what the Ohio data look like that the problem offered:

From the first question, we get intervals: 0 - 38, 39 - 42, 43+. So, I separated the column "Educational

Attainment" into the intervals using Pandas.cut and saved the values into a new column "degree".

Since we want to see the most often interval the workers are in the dataset, we will code using

Pandas.value_counts. Then we find out the interval (38,42) is the answer for the first question.

2. QCut:

QCut is similar to cut but separate them into qualities instead of intervals. We have an example of 6 data

integers and 2 at the end. From the code, 2 means we separate all the data integers in an equal qualities.

Thus, this is what we get.

If we had 3 instead 2, this is what we get.

Now, the second problem want us to separate "age" column into 6 equally-sized groups using pd.qcut()

and this is what I coded:

3. Pivot Table:

Although we successfully divided "age" column, we still have to answer the second question which is,

"Which interval has the highest average Usual Hours Worked? Note that we cannot just do the same

thing as the first question because we need to answer the question with "Usual Hours Worked" data. To

see data with some certain values, we use pivot table.

We set the index which is 'fare' column, set "Usual Hours Worked" as values, and set "mean" for

aggfunc because we want to get the average scale. Then we get the answer for the second question:

Lastly, we will answer the third question, "Using the partitions from the first two parts, what age/degree

combination has the lowest yearly salary on average?" Since we want to group values based on

age/degree, we will put 'degree' column as index and 'fare' column as columns while we set "Yearly

Salary" for values:

*GitHub : https://github.com/KwakSukyoung/coding/blob/master/ACME/Pandas3/pandas3.ipynb

Search This Blog

Sukyoung Kwak

Pandas - Grouping

Comments

Post a Comment

Popular posts from this blog

Basic SQL - Earthquakes

Basic SQL - Student Information