Pandas - Grouping

 


Pandas - Grouping
    Pandas is a strong tool to manage and to analyze data. One of the great functions of Pandas is to group 
    data. For instance, we have this data contains data on the sleep information of different mammals such 
    as carnivore, herbivore, omnivore, or insectivore.

    We can group the data on 'genus', 'vore', or whatever we want to group them on. Let's say we want to 
    group the data on 'more'. Then all we have to do is put ".groupby(<column name>). Thus, I put "vore"     
    in the column name. 
    Since "vores" is a Pandas object, one easy way to see what I get from groupby is "get_group". You just 
    have to choose the  kind of vore you want to look at from the data. 

    Groupby itself is already great but there are awesome Pandas tools: Cut, QCut, and Pivot Table. We 
    will learn these operators by solving this problem:
1. Cut:
    When you go to a restaurant or an amusement park, you might note that you are categorized by your 
    age. From the official Disneyland website, these are the categories:
    Let's say I have a family of 6 people with ages: [ 1, 3, 7, 15, 45, 50]. We can find separate them based 
    on the given intervals.
    Because the interval is ( , ], I chose the interval from 0 - 2.99, 2.99 - 9.99, 9.99 - 50. Then we know that 
    there is 3 adults, 2 children, and 1 "No ticket required" person.
    Now, here is what the Ohio data look like that the problem offered:
    From the first question, we get intervals: 0 - 38, 39 - 42, 43+. So, I separated the column "Educational 
    Attainment" into the intervals using Pandas.cut and saved the values into a new column "degree".
    Since we want to see the most often interval the workers are in the dataset, we will code using 
    Pandas.value_counts. Then we find out the interval (38,42) is the answer for the first question.
2. QCut:
    QCut is similar to cut but separate them into qualities instead of intervals. We have an example of 6 data 
    integers and 2 at the end. From the code, 2 means we separate all the data integers in an equal qualities. 
    Thus, this is what we get.
    If we had 3 instead 2, this is what we get.
    Now, the second problem want us to separate "age" column into 6 equally-sized groups using pd.qcut() 
    and this is what I coded:
3. Pivot Table:
    Although we successfully divided "age" column, we still have to answer the second question which is,     
    "Which interval has the highest average Usual Hours Worked? Note that we cannot just do the same 
    thing as the first question because we need to answer the question with "Usual Hours Worked" data. To 
    see data with some certain values, we use pivot table.
    We set the index which is 'fare' column, set "Usual Hours Worked" as values, and set "mean" for 
    aggfunc because we want to get the average scale. Then we get the answer for the second question:
    Lastly, we will answer the third question, "Using the partitions from the first two parts, what age/degree 
    combination has the lowest yearly salary on average?" Since we want to group values based on 
    age/degree, we will put 'degree' column as index and 'fare' column as columns while we set "Yearly 
    Salary" for values:


*GitHub : https://github.com/KwakSukyoung/coding/blob/master/ACME/Pandas3/pandas3.ipynb







Comments