pandas => 데이터 그룹화

기본 그룹화

한 칼럼으로 그룹화

다음 DataFrame 사용

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a', 'b', 'b'], 
                   'B': [2, 8, 1, 4, 3, 8], 
                   'C': [102, 98, 107, 104, 115, 87]})

df
# Output: 
#    A  B    C
# 0  a  2  102
# 1  b  8   98
# 2  c  1  107
# 3  a  4  104
# 4  b  3  115
# 5  b  8   87

열 A로 그룹화하고 다른 열의 평균값을 얻습니다.

df.groupby('A').mean()
# Output: 
#           B    C
# A               
# a  3.000000  103
# b  6.333333  100
# c  1.000000  107

여러 열로 그룹화

df.groupby(['A','B']).mean()
# Output: 
#          C
# A B       
# a 2  102.0
#   4  104.0
# b 3  115.0
#   8   92.5
# c 1  107.0

결과 DataFrame의 각 행을 튜플 또는 MultiIndex (이 경우 열 A와 B의 요소 쌍)로 인덱싱하는 방법에 유의하십시오.

예를 들어 각 그룹의 항목 수를 계산하고 그 평균을 계산하는 것과 같이 여러 집계 메소드를 한 번에 적용하려면 agg 함수를 사용하십시오.

df.groupby(['A','B']).agg(['count', 'mean'])
# Output:
#         C       
#     count   mean
# A B             
# a 2     1  102.0
#   4     1  104.0
# b 3     1  115.0
#   8     2   92.5
# c 1     1  107.0

그룹화 번호

다음 DataFrame의 경우 :

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100), 
                   'Sex': np.random.choice(['Male', 'Female'], 100), 
                   'number_of_foo': np.random.randint(1, 20, 100)})
df.head()
# Output: 

#    Age     Sex  number_of_foo
# 0   64  Female             14
# 1   67  Female             14
# 2   20  Female             12
# 3   23    Male             17
# 4   23  Female             15

그룹 Age 을 세 가지 범주 (또는 저장소)로 나눕니다. 쓰레기통은 다음과 같이 주어질 수있다.

빈의 수를 나타내는 정수 n -이 경우 데이터 프레임의 데이터는 같은 크기의 n 간격으로 나뉘어집니다
예를 들어 bins=[19, 40, 65, np.inf] 세 개의 연령 그룹 (19, 40] , (40, 65] , (65, np.inf] .

팬더는 라벨의 문자열 버전을 자동으로 할당합니다. labels 매개 변수를 문자열 목록으로 정의하여 자체 레이블을 정의 할 수도 있습니다.

pd.cut(df['Age'], bins=4)
# this creates four age groups: (19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]
Name: Age, dtype: category
Categories (4, object): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69]]

pd.cut(df['Age'], bins=[19, 40, 65, np.inf])
# this creates three age groups: (19, 40], (40, 65] and (65, infinity)
Name: Age, dtype: category
Categories (3, object): [(19, 40] < (40, 65] < (65, inf]]

groupby 에서 그것을 사용하여 foo의 평균 수를 구하십시오.

age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])
df.groupby(age_groups)['number_of_foo'].mean()
# Output: 
# Age
# (19, 40]     9.880000
# (40, 65]     9.452381
# (65, inf]    9.250000
# Name: number_of_foo, dtype: float64

연령 그룹 및 성별을 십자 표시 :

pd.crosstab(age_groups, df['Sex'])
# Output: 
# Sex        Female  Male
# Age
# (19, 40]       22    28
# (40, 65]       18    24
# (65, inf]       3     5

그룹의 열 선택

groupby를 수행 할 때 단일 열 또는 열 목록을 선택할 수 있습니다.

In [11]: df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])

In [12]: df
Out[12]:
   A  B  C
0  1  1  2
1  1  2  3
2  2  3  4

In [13]: g = df.groupby("A")

In [14]: g["B"].mean()           # just column B
Out[14]:
A
1    1.5
2    3.0
Name: B, dtype: float64

In [15]: g[["B", "C"]].mean()    # columns B and C
Out[15]:
     B    C
A
1  1.5  2.5
2  3.0  4.0

agg 를 사용하여 수행 할 열과 집계를 지정할 수도 있습니다.

In [16]: g.agg({'B': 'mean', 'C': 'count'})
Out[16]:
   C    B
A        
1  2  1.5
2  1  3.0

개수 대 크기별 집계

size 와 count 의 차이는 다음과 같습니다.

size 는 NaN 값을 count 하지만 count 는 그렇지 않습니다.

df = pd.DataFrame(
        {"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"],
         "City":["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"],
         "Val": [4, 3, 3, np.nan, np.nan, 4]})

df
# Output: 
#        City     Name  Val
# 0   Seattle    Alice  4.0
# 1   Seattle      Bob  3.0
# 2  Portland  Mallory  3.0
# 3   Seattle  Mallory  NaN
# 4   Seattle      Bob  NaN
# 5  Portland  Mallory  4.0


df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size')
# Output: 
#       Name      City  Size
# 0    Alice   Seattle     1
# 1      Bob   Seattle     2
# 2  Mallory  Portland     2
# 3  Mallory   Seattle     1

df.groupby(["Name", "City"])['Val'].count().reset_index(name='Count')
# Output: 
#       Name      City  Count
# 0    Alice   Seattle      1
# 1      Bob   Seattle      1
# 2  Mallory  Portland      2
# 3  Mallory   Seattle      0

그룹 집계

In [1]: import numpy as np   
In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A': list('XYZXYZXYZX'), 'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2], 
                           'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})

In [4]: df.groupby('A')['B'].agg({'mean': np.mean, 'standard deviation': np.std})
Out[4]: 
   standard deviation      mean
A                              
X            0.957427  2.250000
Y            1.000000  2.000000
Z            0.577350  1.333333

여러 열의 경우 :

In [5]: df.groupby('A').agg({'B': [np.mean, np.std], 'C': [np.sum, 'count']})
Out[5]: 
    C               B          
  sum count      mean       std
A                              
X  59     4  2.250000  0.957427
Y  39     3  2.000000  1.000000
Z  35     3  1.333333  0.577350

다른 파일의 그룹 내보내기

groupby() 의해 반환 된 객체를 반복 할 수 있습니다. 이터레이터는 (Category, DataFrame) 튜플을 포함합니다.

# Same example data as in the previous example.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'Age': np.random.randint(20, 70, 100), 
                   'Sex': np.random.choice(['Male', factor'Female'], 100), 
                   'number_of_foo': np.random.randint(1, 20, 100)})

# Export to Male.csv and Female.csv files.
for sex, data in df.groupby('Sex'):
    data.to_csv("{}.csv".format(sex))

원본 데이터 프레임을 보존하면서 변환을 사용하여 그룹 수준의 통계를 얻습니다.

예:

df = pd.DataFrame({'group1' :  ['A', 'A', 'A', 'A',
                               'B', 'B', 'B', 'B'],
                   'group2' :  ['C', 'C', 'C', 'D',
                               'E', 'E', 'F', 'F'],
                   'B'      :  ['one', np.NaN, np.NaN, np.NaN,
                                np.NaN, 'two', np.NaN, np.NaN],
                   'C'      :  [np.NaN, 1, np.NaN, np.NaN,
                               np.NaN, np.NaN, np.NaN, 4]})           

 df
Out[34]: 
     B    C group1 group2
0  one  NaN      A      C
1  NaN  1.0      A      C
2  NaN  NaN      A      C
3  NaN  NaN      A      D
4  NaN  NaN      B      E
5  two  NaN      B      E
6  NaN  NaN      B      F
7  NaN  4.0      B      F

group1 과 group2 의 각 조합에 대해 B의 누락 된 관측치 수를 얻고 싶습니다. groupby.transform 은 정확히 그렇게하는 매우 강력한 기능입니다.

df['count_B']=df.groupby(['group1','group2']).B.transform('count')                        

df
Out[36]: 
     B    C group1 group2  count_B
0  one  NaN      A      C        1
1  NaN  1.0      A      C        1
2  NaN  NaN      A      C        1
3  NaN  NaN      A      D        0
4  NaN  NaN      B      E        1
5  two  NaN      B      E        1
6  NaN  NaN      B      F        0
7  NaN  4.0      B      F        0

Modified text is an extract of the original Stack Overflow Documentation

아래 라이선스 CC BY-SA 3.0

와 제휴하지 않음 Stack Overflow

pandas
데이터 그룹화

수색…