pandas
Hervormen en draaien
Zoeken…
Eenvoudig draaien
Probeer eerst pivot
:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Mary', 'Josh','Jon','Lucy', 'Jane', 'Sue'],
'Age':[34, 37, 29, 40, 29, 31],
'City':['Boston','New York', 'Chicago', 'Los Angeles', 'Chicago', 'Boston'],
'Position':['Manager','Programmer','Manager','Manager','Programmer', 'Programmer']},
columns=['Name','Position','City','Age'])
print (df)
Name Position City Age
0 Mary Manager Boston 34
1 Josh Programmer New York 37
2 Jon Manager Chicago 29
3 Lucy Manager Los Angeles 40
4 Jane Programmer Chicago 29
5 Sue Programmer Boston 31
print (df.pivot(index='Position', columns='City', values='Age'))
City Boston Chicago Los Angeles New York
Position
Manager 34.0 29.0 40.0 NaN
Programmer 31.0 29.0 NaN 37.0
Indien nodig de index opnieuw instellen, kolomnamen verwijderen en NaN-waarden invullen:
#pivoting by numbers - column Age
print (df.pivot(index='Position', columns='City', values='Age')
.reset_index()
.rename_axis(None, axis=1)
.fillna(0))
Position Boston Chicago Los Angeles New York
0 Manager 34.0 29.0 40.0 0.0
1 Programmer 31.0 29.0 0.0 37.0
#pivoting by strings - column Name
print (df.pivot(index='Position', columns='City', values='Name'))
City Boston Chicago Los Angeles New York
Position
Manager Mary Jon Lucy None
Programmer Sue Jane None Josh
Draaien met aggregeren
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Mary', 'Jon','Lucy', 'Jane', 'Sue', 'Mary', 'Lucy'],
'Age':[35, 37, 40, 29, 31, 26, 28],
'City':['Boston', 'Chicago', 'Los Angeles', 'Chicago', 'Boston', 'Boston', 'Chicago'],
'Position':['Manager','Manager','Manager','Programmer', 'Programmer','Manager','Manager'],
'Sex':['Female','Male','Female','Female', 'Female','Female','Female']},
columns=['Name','Position','City','Age','Sex'])
print (df)
Name Position City Age Sex
0 Mary Manager Boston 35 Female
1 Jon Manager Chicago 37 Male
2 Lucy Manager Los Angeles 40 Female
3 Jane Programmer Chicago 29 Female
4 Sue Programmer Boston 31 Female
5 Mary Manager Boston 26 Female
6 Lucy Manager Chicago 28 Female
Als u pivot
, krijgt u een foutmelding:
print (df.pivot(index='Position', columns='City', values='Age'))
ValueError: Index bevat dubbele vermeldingen, kan niet opnieuw worden gevormd
Gebruik pivot_table
met aggregatiefunctie:
#default aggfunc is np.mean
print (df.pivot_table(index='Position', columns='City', values='Age'))
City Boston Chicago Los Angeles
Position
Manager 30.5 32.5 40.0
Programmer 31.0 29.0 NaN
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc=np.mean))
City Boston Chicago Los Angeles
Position
Manager 30.5 32.5 40.0
Programmer 31.0 29.0 NaN
Nog een agg-functies:
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc=sum))
City Boston Chicago Los Angeles
Position
Manager 61.0 65.0 40.0
Programmer 31.0 29.0 NaN
#lost data !!!
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Position
Manager 35.0 37.0 40.0
Programmer 31.0 29.0 NaN
Indien nodig aggregeren door kolommen met string
:
print (df.pivot_table(index='Position', columns='City', values='Name'))
DataError: Geen numerieke typen om te aggregeren
U kunt deze verzwarende functies gebruiken:
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='first'))
City Boston Chicago Los Angeles
Position
Manager Mary Jon Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='last'))
City Boston Chicago Los Angeles
Position
Manager Mary Lucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='sum'))
City Boston Chicago Los Angeles
Position
Manager MaryMary JonLucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc=', '.join))
City Boston Chicago Los Angeles
Position
Manager Mary, Mary Jon, Lucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc=', '.join, fill_value='-')
.reset_index()
.rename_axis(None, axis=1))
Position Boston Chicago Los Angeles
0 Manager Mary, Mary Jon, Lucy Lucy
1 Programmer Sue Jane -
De informatie met betrekking tot het geslacht is nog niet gebruikt. Het kan worden geschakeld door een van de kolommen, of het kan worden toegevoegd als een ander niveau:
print (df.pivot_table(index='Position', columns=['City','Sex'], values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Sex Female Female Male Female
Position
Manager 35.0 28.0 37.0 40.0
Programmer 31.0 29.0 NaN NaN
Meerdere kolommen kunnen worden opgegeven in een van de attributenindex, kolommen en waarden.
print (df.pivot_table(index=['Position','Sex'], columns='City', values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Position Sex
Manager Female 35.0 28.0 40.0
Male NaN 37.0 NaN
Programmer Female 31.0 29.0 NaN
Verschillende aggregatiefuncties toepassen
U kunt eenvoudig meerdere functies toepassen tijdens een enkele pivot:
In [23]: import numpy as np
In [24]: df.pivot_table(index='Position', values='Age', aggfunc=[np.mean, np.std])
Out[24]:
mean std
Position
Manager 34.333333 5.507571
Programmer 32.333333 4.163332
Soms wilt u misschien specifieke functies op specifieke kolommen toepassen:
In [35]: df['Random'] = np.random.random(6)
In [36]: df
Out[36]:
Name Position City Age Random
0 Mary Manager Boston 34 0.678577
1 Josh Programmer New York 37 0.973168
2 Jon Manager Chicago 29 0.146668
3 Lucy Manager Los Angeles 40 0.150120
4 Jane Programmer Chicago 29 0.112769
5 Sue Programmer Boston 31 0.185198
For example, find the mean age, and standard deviation of random by Position:
In [37]: df.pivot_table(index='Position', aggfunc={'Age': np.mean, 'Random': np.std})
Out[37]:
Age Random
Position
Manager 34.333333 0.306106
Programmer 32.333333 0.477219
Men kan een lijst met functies doorgeven die ook op de afzonderlijke kolommen kunnen worden toegepast:
In [38]: df.pivot_table(index='Position', aggfunc={'Age': np.mean, 'Random': [np.mean, np.std]})]
Out[38]:
Age Random
mean mean std
Position
Manager 34.333333 0.325122 0.306106
Programmer 32.333333 0.423712 0.477219
Stapelen en ontstapelen
import pandas as pd
import numpy as np
np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two']]))
idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=['A', 'B'])
print (df)
A B
first second
bar one 1.764052 0.400157
two 0.978738 2.240893
foo one 1.867558 -0.977278
two 0.950088 -0.151357
qux one -0.103219 0.410599
two 0.144044 1.454274
print (df.stack())
first second
bar one A 1.764052
B 0.400157
two A 0.978738
B 2.240893
foo one A 1.867558
B -0.977278
two A 0.950088
B -0.151357
qux one A -0.103219
B 0.410599
two A 0.144044
B 1.454274
dtype: float64
#reset index, rename column name
print (df.stack().reset_index(name='val2').rename(columns={'level_2': 'val1'}))
first second val1 val2
0 bar one A 1.764052
1 bar one B 0.400157
2 bar two A 0.978738
3 bar two B 2.240893
4 foo one A 1.867558
5 foo one B -0.977278
6 foo two A 0.950088
7 foo two B -0.151357
8 qux one A -0.103219
9 qux one B 0.410599
10 qux two A 0.144044
11 qux two B 1.454274
print (df.unstack())
A B
second one two one two
first
bar 1.764052 0.978738 0.400157 2.240893
foo 1.867558 0.950088 -0.977278 -0.151357
qux -0.103219 0.144044 0.410599 1.454274
rename_axis
(nieuw in pandas
0.18.0
):
#reset index, remove columns names
df1 = df.unstack().reset_index().rename_axis((None,None), axis=1)
#reset MultiIndex in columns with list comprehension
df1.columns = ['_'.join(col).strip('_') for col in df1.columns]
print (df1)
first A_one A_two B_one B_two
0 bar 1.764052 0.978738 0.400157 2.240893
1 foo 1.867558 0.950088 -0.977278 -0.151357
2 qux -0.103219 0.144044 0.410599 1.454274
panda's hieronder 0.18.0
#reset index
df1 = df.unstack().reset_index()
#remove columns names
df1.columns.names = (None, None)
#reset MultiIndex in columns with list comprehension
df1.columns = ['_'.join(col).strip('_') for col in df1.columns]
print (df1)
first A_one A_two B_one B_two
0 bar 1.764052 0.978738 0.400157 2.240893
1 foo 1.867558 0.950088 -0.977278 -0.151357
2 qux -0.103219 0.144044 0.410599 1.454274
Kruistabel
import pandas as pd
df = pd.DataFrame({'Sex': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
'Age': [20, 19, 17, 35, 22, 22, 12, 15, 17, 22],
'Heart Disease': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y']})
df
Age Heart Disease Sex
0 20 Y M
1 19 N M
2 17 Y F
3 35 N M
4 22 N F
5 22 Y F
6 12 N M
7 15 Y M
8 17 N F
9 22 Y F
pd.crosstab(df['Sex'], df['Heart Disease'])
Hearth Disease N Y
Sex
F 2 3
M 3 2
Puntnotatie gebruiken:
pd.crosstab(df.Sex, df.Age)
Age 12 15 17 19 20 22 35
Sex
F 0 0 2 0 0 3 0
M 1 1 0 1 1 0 1
Transponeren van DF:
pd.crosstab(df.Sex, df.Age).T
Sex F M
Age
12 0 1
15 0 1
17 2 0
19 0 1
20 0 1
22 3 0
35 0 1
Marges of cumulatieven ophalen:
pd.crosstab(df['Sex'], df['Heart Disease'], margins=True)
Heart Disease N Y All
Sex
F 2 3 5
M 3 2 5
All 5 5 10
Cumuleren van cumulatief:
pd.crosstab(df['Sex'], df['Age'], margins=True).T
Sex F M All
Age
12 0 1 1
15 0 1 1
17 2 0 2
19 0 1 1
20 0 1 1
22 3 0 3
35 0 1 1
All 5 5 10
Percentages verkrijgen:
pd.crosstab(df["Sex"],df['Heart Disease']).apply(lambda r: r/len(df), axis=1)
Heart Disease N Y
Sex
F 0.2 0.3
M 0.3 0.2
Cumulatief worden en vermenigvuldigen met 100:
df2 = pd.crosstab(df["Age"],df['Sex'], margins=True ).apply(lambda r: r/len(df)*100, axis=1)
df2
Sex F M All
Age
12 0.0 10.0 10.0
15 0.0 10.0 10.0
17 20.0 0.0 20.0
19 0.0 10.0 10.0
20 0.0 10.0 10.0
22 30.0 0.0 30.0
35 0.0 10.0 10.0
All 50.0 50.0 100.0
Een kolom uit DF verwijderen (een manier):
df2[["F","M"]]
Sex F M
Age
12 0.0 10.0
15 0.0 10.0
17 20.0 0.0
19 0.0 10.0
20 0.0 10.0
22 30.0 0.0
35 0.0 10.0
All 50.0 50.0
Panda's smelten om van breed naar lang te gaan
>>> df
ID Year Jan_salary Feb_salary Mar_salary
0 1 2016 4500 4200 4700
1 2 2016 3800 3600 4400
2 3 2016 5500 5200 5300
>>> melted_df = pd.melt(df,id_vars=['ID','Year'],
value_vars=['Jan_salary','Feb_salary','Mar_salary'],
var_name='month',value_name='salary')
>>> melted_df
ID Year month salary
0 1 2016 Jan_salary 4500
1 2 2016 Jan_salary 3800
2 3 2016 Jan_salary 5500
3 1 2016 Feb_salary 4200
4 2 2016 Feb_salary 3600
5 3 2016 Feb_salary 5200
6 1 2016 Mar_salary 4700
7 2 2016 Mar_salary 4400
8 3 2016 Mar_salary 5300
>>> melted_['month'] = melted_['month'].str.replace('_salary','')
>>> import calendar
>>> def mapper(month_abbr):
... # from http://stackoverflow.com/a/3418092/42346
... d = {v: str(k).zfill(2) for k,v in enumerate(calendar.month_abbr)}
... return d[month_abbr]
>>> melted_df['month'] = melted_df['month'].apply(mapper)
>>> melted_df
ID Year month salary
0 1 2016 01 4500
1 2 2016 01 3800
2 3 2016 01 5500
3 1 2016 02 4200
4 2 2016 02 3600
5 3 2016 02 5200
6 1 2016 03 4700
7 2 2016 03 4400
8 3 2016 03 5300
CSV-reeksen in kolommen splitsen (opnieuw vormgeven) in meerdere rijen, met één element per rij
import pandas as pd
df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
{'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}])
print(df)
reshaped = \
(df.set_index(df.columns.drop('var1',1).tolist())
.var1.str.split(',', expand=True)
.stack()
.reset_index()
.rename(columns={0:'var1'})
.loc[:, df.columns]
)
print(reshaped)
Output:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ