pandas
Remodelage et pivotement
Recherche…
Pivotement simple
Essayez d'abord d'utiliser pivot
:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Mary', 'Josh','Jon','Lucy', 'Jane', 'Sue'],
'Age':[34, 37, 29, 40, 29, 31],
'City':['Boston','New York', 'Chicago', 'Los Angeles', 'Chicago', 'Boston'],
'Position':['Manager','Programmer','Manager','Manager','Programmer', 'Programmer']},
columns=['Name','Position','City','Age'])
print (df)
Name Position City Age
0 Mary Manager Boston 34
1 Josh Programmer New York 37
2 Jon Manager Chicago 29
3 Lucy Manager Los Angeles 40
4 Jane Programmer Chicago 29
5 Sue Programmer Boston 31
print (df.pivot(index='Position', columns='City', values='Age'))
City Boston Chicago Los Angeles New York
Position
Manager 34.0 29.0 40.0 NaN
Programmer 31.0 29.0 NaN 37.0
Si nécessaire, réinitialisez l'index, supprimez les noms des colonnes et remplissez les valeurs NaN:
#pivoting by numbers - column Age
print (df.pivot(index='Position', columns='City', values='Age')
.reset_index()
.rename_axis(None, axis=1)
.fillna(0))
Position Boston Chicago Los Angeles New York
0 Manager 34.0 29.0 40.0 0.0
1 Programmer 31.0 29.0 0.0 37.0
#pivoting by strings - column Name
print (df.pivot(index='Position', columns='City', values='Name'))
City Boston Chicago Los Angeles New York
Position
Manager Mary Jon Lucy None
Programmer Sue Jane None Josh
Pivoter avec agréger
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Mary', 'Jon','Lucy', 'Jane', 'Sue', 'Mary', 'Lucy'],
'Age':[35, 37, 40, 29, 31, 26, 28],
'City':['Boston', 'Chicago', 'Los Angeles', 'Chicago', 'Boston', 'Boston', 'Chicago'],
'Position':['Manager','Manager','Manager','Programmer', 'Programmer','Manager','Manager'],
'Sex':['Female','Male','Female','Female', 'Female','Female','Female']},
columns=['Name','Position','City','Age','Sex'])
print (df)
Name Position City Age Sex
0 Mary Manager Boston 35 Female
1 Jon Manager Chicago 37 Male
2 Lucy Manager Los Angeles 40 Female
3 Jane Programmer Chicago 29 Female
4 Sue Programmer Boston 31 Female
5 Mary Manager Boston 26 Female
6 Lucy Manager Chicago 28 Female
Si utiliser pivot
, obtenir une erreur:
print (df.pivot(index='Position', columns='City', values='Age'))
ValueError: l'index contient des entrées en double, ne peut pas être remodelé
Utilisez pivot_table
avec la fonction d'agrégation:
#default aggfunc is np.mean
print (df.pivot_table(index='Position', columns='City', values='Age'))
City Boston Chicago Los Angeles
Position
Manager 30.5 32.5 40.0
Programmer 31.0 29.0 NaN
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc=np.mean))
City Boston Chicago Los Angeles
Position
Manager 30.5 32.5 40.0
Programmer 31.0 29.0 NaN
Une autre fonction ag
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc=sum))
City Boston Chicago Los Angeles
Position
Manager 61.0 65.0 40.0
Programmer 31.0 29.0 NaN
#lost data !!!
print (df.pivot_table(index='Position', columns='City', values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Position
Manager 35.0 37.0 40.0
Programmer 31.0 29.0 NaN
Si nécessaire, agréger par colonnes avec string
valeurs de string
:
print (df.pivot_table(index='Position', columns='City', values='Name'))
DataError: aucun type numérique à agréger
Vous pouvez utiliser ces fonctions agrandissantes:
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='first'))
City Boston Chicago Los Angeles
Position
Manager Mary Jon Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='last'))
City Boston Chicago Los Angeles
Position
Manager Mary Lucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc='sum'))
City Boston Chicago Los Angeles
Position
Manager MaryMary JonLucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc=', '.join))
City Boston Chicago Los Angeles
Position
Manager Mary, Mary Jon, Lucy Lucy
Programmer Sue Jane None
print (df.pivot_table(index='Position', columns='City', values='Name', aggfunc=', '.join, fill_value='-')
.reset_index()
.rename_axis(None, axis=1))
Position Boston Chicago Los Angeles
0 Manager Mary, Mary Jon, Lucy Lucy
1 Programmer Sue Jane -
Les informations concernant le sexe n'ont pas encore été utilisées. Il pourrait être commuté par l'une des colonnes ou être ajouté à un autre niveau:
print (df.pivot_table(index='Position', columns=['City','Sex'], values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Sex Female Female Male Female
Position
Manager 35.0 28.0 37.0 40.0
Programmer 31.0 29.0 NaN NaN
Plusieurs colonnes peuvent être spécifiées dans l'un des index, colonnes et valeurs des attributs.
print (df.pivot_table(index=['Position','Sex'], columns='City', values='Age', aggfunc='first'))
City Boston Chicago Los Angeles
Position Sex
Manager Female 35.0 28.0 40.0
Male NaN 37.0 NaN
Programmer Female 31.0 29.0 NaN
Application de plusieurs fonctions d'agrégation
Vous pouvez facilement appliquer plusieurs fonctions pendant un seul pivot:
In [23]: import numpy as np
In [24]: df.pivot_table(index='Position', values='Age', aggfunc=[np.mean, np.std])
Out[24]:
mean std
Position
Manager 34.333333 5.507571
Programmer 32.333333 4.163332
Parfois, vous souhaiterez peut-être appliquer des fonctions spécifiques à des colonnes spécifiques:
In [35]: df['Random'] = np.random.random(6)
In [36]: df
Out[36]:
Name Position City Age Random
0 Mary Manager Boston 34 0.678577
1 Josh Programmer New York 37 0.973168
2 Jon Manager Chicago 29 0.146668
3 Lucy Manager Los Angeles 40 0.150120
4 Jane Programmer Chicago 29 0.112769
5 Sue Programmer Boston 31 0.185198
For example, find the mean age, and standard deviation of random by Position:
In [37]: df.pivot_table(index='Position', aggfunc={'Age': np.mean, 'Random': np.std})
Out[37]:
Age Random
Position
Manager 34.333333 0.306106
Programmer 32.333333 0.477219
On peut aussi passer une liste de fonctions à appliquer aux colonnes individuelles:
In [38]: df.pivot_table(index='Position', aggfunc={'Age': np.mean, 'Random': [np.mean, np.std]})]
Out[38]:
Age Random
mean mean std
Position
Manager 34.333333 0.325122 0.306106
Programmer 32.333333 0.423712 0.477219
Empilage et dépilage
import pandas as pd
import numpy as np
np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two','one', 'two']]))
idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=['A', 'B'])
print (df)
A B
first second
bar one 1.764052 0.400157
two 0.978738 2.240893
foo one 1.867558 -0.977278
two 0.950088 -0.151357
qux one -0.103219 0.410599
two 0.144044 1.454274
print (df.stack())
first second
bar one A 1.764052
B 0.400157
two A 0.978738
B 2.240893
foo one A 1.867558
B -0.977278
two A 0.950088
B -0.151357
qux one A -0.103219
B 0.410599
two A 0.144044
B 1.454274
dtype: float64
#reset index, rename column name
print (df.stack().reset_index(name='val2').rename(columns={'level_2': 'val1'}))
first second val1 val2
0 bar one A 1.764052
1 bar one B 0.400157
2 bar two A 0.978738
3 bar two B 2.240893
4 foo one A 1.867558
5 foo one B -0.977278
6 foo two A 0.950088
7 foo two B -0.151357
8 qux one A -0.103219
9 qux one B 0.410599
10 qux two A 0.144044
11 qux two B 1.454274
print (df.unstack())
A B
second one two one two
first
bar 1.764052 0.978738 0.400157 2.240893
foo 1.867558 0.950088 -0.977278 -0.151357
qux -0.103219 0.144044 0.410599 1.454274
rename_axis
(nouveau dans les pandas
0.18.0
):
#reset index, remove columns names
df1 = df.unstack().reset_index().rename_axis((None,None), axis=1)
#reset MultiIndex in columns with list comprehension
df1.columns = ['_'.join(col).strip('_') for col in df1.columns]
print (df1)
first A_one A_two B_one B_two
0 bar 1.764052 0.978738 0.400157 2.240893
1 foo 1.867558 0.950088 -0.977278 -0.151357
2 qux -0.103219 0.144044 0.410599 1.454274
Pandas ci-dessous 0.18.0
#reset index
df1 = df.unstack().reset_index()
#remove columns names
df1.columns.names = (None, None)
#reset MultiIndex in columns with list comprehension
df1.columns = ['_'.join(col).strip('_') for col in df1.columns]
print (df1)
first A_one A_two B_one B_two
0 bar 1.764052 0.978738 0.400157 2.240893
1 foo 1.867558 0.950088 -0.977278 -0.151357
2 qux -0.103219 0.144044 0.410599 1.454274
Tabulation croisée
import pandas as pd
df = pd.DataFrame({'Sex': ['M', 'M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F'],
'Age': [20, 19, 17, 35, 22, 22, 12, 15, 17, 22],
'Heart Disease': ['Y', 'N', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y']})
df
Age Heart Disease Sex
0 20 Y M
1 19 N M
2 17 Y F
3 35 N M
4 22 N F
5 22 Y F
6 12 N M
7 15 Y M
8 17 N F
9 22 Y F
pd.crosstab(df['Sex'], df['Heart Disease'])
Hearth Disease N Y
Sex
F 2 3
M 3 2
En utilisant la notation par points:
pd.crosstab(df.Sex, df.Age)
Age 12 15 17 19 20 22 35
Sex
F 0 0 2 0 0 3 0
M 1 1 0 1 1 0 1
Obtenir la transposition de DF:
pd.crosstab(df.Sex, df.Age).T
Sex F M
Age
12 0 1
15 0 1
17 2 0
19 0 1
20 0 1
22 3 0
35 0 1
Obtenir des marges ou des cumulatifs:
pd.crosstab(df['Sex'], df['Heart Disease'], margins=True)
Heart Disease N Y All
Sex
F 2 3 5
M 3 2 5
All 5 5 10
Obtenir la transposition du cumulatif:
pd.crosstab(df['Sex'], df['Age'], margins=True).T
Sex F M All
Age
12 0 1 1
15 0 1 1
17 2 0 2
19 0 1 1
20 0 1 1
22 3 0 3
35 0 1 1
All 5 5 10
Obtenir des pourcentages:
pd.crosstab(df["Sex"],df['Heart Disease']).apply(lambda r: r/len(df), axis=1)
Heart Disease N Y
Sex
F 0.2 0.3
M 0.3 0.2
Se cumuler et multiplier par 100:
df2 = pd.crosstab(df["Age"],df['Sex'], margins=True ).apply(lambda r: r/len(df)*100, axis=1)
df2
Sex F M All
Age
12 0.0 10.0 10.0
15 0.0 10.0 10.0
17 20.0 0.0 20.0
19 0.0 10.0 10.0
20 0.0 10.0 10.0
22 30.0 0.0 30.0
35 0.0 10.0 10.0
All 50.0 50.0 100.0
Suppression d'une colonne de DF (one way):
df2[["F","M"]]
Sex F M
Age
12 0.0 10.0
15 0.0 10.0
17 20.0 0.0
19 0.0 10.0
20 0.0 10.0
22 30.0 0.0
35 0.0 10.0
All 50.0 50.0
Les pandas fondent pour passer du long au long
>>> df
ID Year Jan_salary Feb_salary Mar_salary
0 1 2016 4500 4200 4700
1 2 2016 3800 3600 4400
2 3 2016 5500 5200 5300
>>> melted_df = pd.melt(df,id_vars=['ID','Year'],
value_vars=['Jan_salary','Feb_salary','Mar_salary'],
var_name='month',value_name='salary')
>>> melted_df
ID Year month salary
0 1 2016 Jan_salary 4500
1 2 2016 Jan_salary 3800
2 3 2016 Jan_salary 5500
3 1 2016 Feb_salary 4200
4 2 2016 Feb_salary 3600
5 3 2016 Feb_salary 5200
6 1 2016 Mar_salary 4700
7 2 2016 Mar_salary 4400
8 3 2016 Mar_salary 5300
>>> melted_['month'] = melted_['month'].str.replace('_salary','')
>>> import calendar
>>> def mapper(month_abbr):
... # from http://stackoverflow.com/a/3418092/42346
... d = {v: str(k).zfill(2) for k,v in enumerate(calendar.month_abbr)}
... return d[month_abbr]
>>> melted_df['month'] = melted_df['month'].apply(mapper)
>>> melted_df
ID Year month salary
0 1 2016 01 4500
1 2 2016 01 3800
2 3 2016 01 5500
3 1 2016 02 4200
4 2 2016 02 3600
5 3 2016 02 5200
6 1 2016 03 4700
7 2 2016 03 4400
8 3 2016 03 5300
Fractionner (remodeler) les chaînes CSV dans des colonnes en plusieurs lignes, avec un élément par ligne
import pandas as pd
df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
{'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}])
print(df)
reshaped = \
(df.set_index(df.columns.drop('var1',1).tolist())
.var1.str.split(',', expand=True)
.stack()
.reset_index()
.rename(columns={0:'var1'})
.loc[:, df.columns]
)
print(reshaped)
Sortie:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ