pandas => Creating DataFrames

Introduction

DataFrame is one of the data structures provided by the pandas library, alongside Series and Panel (Panel was removed in pandas 0.25). It is two-dimensional and can be thought of as a table of rows and columns.

Each row is identified either by an integer index (0..N) or by a label set explicitly when the DataFrame object is created. Each column can have its own type and is identified by a label.
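As a minimal sketch of these two points, the following builds a DataFrame with explicit row labels and columns of different types. The column names, values, and row labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each column holds a different type.
df = pd.DataFrame(
    {'name': ['Ann', 'Bob'], 'age': [31, 45], 'height': [1.70, 1.82]},
    index=['a', 'b'],  # explicit row labels instead of the default 0..N-1
)

print(df.index.tolist())  # ['a', 'b']
print(df.dtypes)          # one dtype per column
```

Rows can then be addressed by label, for example df.loc['b'].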

This topic covers various ways to construct a DataFrame object, e.g. from NumPy arrays, from a list of tuples, or from a dictionary.

Create a sample DataFrame

import pandas as pd

Create a DataFrame from a dictionary containing two columns, numbers and colors. Each key is a column name, and each value is a sequence of data holding that column's contents:

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})

Display the contents of the DataFrame:

print(df)
# Output: 
#   colors  numbers
# 0    red        1
# 1  white        2
# 2   blue        3

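pandas versions before 0.23 (or any version running on Python earlier than 3.6) sort the columns alphabetically, because dicts are not ordered there, which is why colors appears first in the output above. Newer versions keep the dict's insertion order. To set the order explicitly, use the columns parameter.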

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']}, 
                  columns=['numbers', 'colors'])

print(df)  
# Output:     
#    numbers colors
# 0        1    red
# 1        2  white
# 2        3   blue

Create a sample DataFrame using NumPy

Create a DataFrame of random numbers:

import numpy as np
import pandas as pd

# Set the seed for a reproducible sample
np.random.seed(0)  

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

print(df)
# Output:
#           A         B         C
# 0  1.764052  0.400157  0.978738
# 1  2.240893  1.867558 -0.977278
# 2  0.950088 -0.151357 -0.103219
# 3  0.410599  0.144044  1.454274
# 4  0.761038  0.121675  0.443863
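As an alternative sketch, NumPy's newer Generator interface (np.random.default_rng) can be used instead of the global np.random.seed. Note that it produces different values from the legacy seeding shown above, so no output is listed here; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator for a reproducible sample
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=list('ABC'))

# Re-creating the generator with the same seed reproduces the same values.
df_again = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 3)),
                        columns=list('ABC'))
print(df.equals(df_again))  # True
```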

Create a DataFrame of integers:

df = pd.DataFrame(np.arange(15).reshape(5,3),columns=list('ABC'))

print(df)
# Output:
#     A   B   C
# 0   0   1   2
# 1   3   4   5
# 2   6   7   8
# 3   9  10  11
# 4  12  13  14

Create a DataFrame and then insert missing-value markers (NaT, NaN, 'nan', None) across columns and rows:

df = pd.DataFrame(np.arange(48).reshape(8,6),columns=list('ABCDEF'))

print(df)
# Output: 
#     A   B   C   D   E   F
# 0   0   1   2   3   4   5
# 1   6   7   8   9  10  11
# 2  12  13  14  15  16  17
# 3  18  19  20  21  22  23
# 4  24  25  26  27  28  29
# 5  30  31  32  33  34  35
# 6  36  37  38  39  40  41
# 7  42  43  44  45  46  47

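# .ix was removed in pandas 1.0; label-based .loc is used here instead.
# Recent pandas versions may warn about (or reject) assigning values of a different
# dtype into integer columns; casting first with df = df.astype(object) avoids that.
df.loc[::2, 'A'] = np.nan  # in column A, set rows 0, 2, 4, ... to NaN
df.loc[::4, 'B'] = pd.NaT  # in column B, set rows 0, 4, ... to pd.NaT
df.loc[:3, 'C'] = 'nan'    # in column C, set rows 0 through 3 (inclusive) to the string 'nan'
df.loc[:, 'F'] = None      # in column F, set all elements to None
df.loc[5, :] = None        # in row 5, set all elements to None
df.loc[7, :] = np.nan      # in row 7, set all elements to NaN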

print(df)
# Output:
#     A     B     C   D   E     F
# 0 NaN   NaT   nan   3   4  None
# 1   6     7   nan   9  10  None
# 2 NaN    13   nan  15  16  None
# 3  18    19   nan  21  22  None
# 4 NaN   NaT    26  27  28  None
# 5 NaN  None  None NaN NaN  None
# 6 NaN    37    38  39  40  None
# 7 NaN   NaN   NaN NaN NaN   NaN
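Missing values can also be supplied directly at construction time. The sketch below, with made-up column contents, builds such a frame and uses isna() to show which entries pandas treats as missing. Note that the string 'nan' is ordinary text, not a missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 1.0, 2.0],  # float column containing NaN
    'B': [pd.Timestamp('2015-02-24'), pd.NaT, pd.Timestamp('2015-02-26')],  # datetimes with NaT
    'C': ['x', None, 'z'],    # text column containing None
})

missing = df.isna()  # boolean mask: True where a value is missing
print(missing)

print(pd.isna('nan'))  # False: the string 'nan' is not a missing value
```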

Create a sample DataFrame from multiple collections using a dictionary

import pandas as pd
import numpy as np

np.random.seed(123) 
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
print(df)
# Output:
#           X  Y
# 0 -1.085631  0
# 1  0.997345  1
# 2  0.282978  2
# 3 -1.506295  3

Create a DataFrame from a list of tuples

A DataFrame can be created from a list of simple tuples, and you can even choose which elements of each tuple to use. The example below uses every element of each tuple; since no column names are given, the columns are labeled with integers.

import pandas as pd

data = [
('p1', 't1', 1, 2),
('p1', 't2', 3, 4),
('p2', 't1', 5, 6),
('p2', 't2', 7, 8),
('p2', 't3', 2, 8)
]

df = pd.DataFrame(data)

print(df)
#     0   1  2  3
# 0  p1  t1  1  2
# 1  p1  t2  3  4
# 2  p2  t1  5  6
# 3  p2  t2  7  8
# 4  p2  t3  2  8
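To use only some elements of each tuple, the tuples can be sliced before they are passed to the constructor. The following is a minimal sketch that drops the last element and names the remaining columns; the column names are hypothetical.

```python
import pandas as pd

data = [
    ('p1', 't1', 1, 2),
    ('p1', 't2', 3, 4),
    ('p2', 't1', 5, 6),
]

# Keep every element except the last one of each tuple.
trimmed = [row[:-1] for row in data]

# 'person', 'task' and 'score' are made-up column names.
df = pd.DataFrame(trimmed, columns=['person', 'task', 'score'])
print(df)
```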

Create a DataFrame from a dictionary of lists

Create a DataFrame from multiple lists by passing a dict whose values are lists. The dict keys are used as column labels. The lists can also be ndarrays. All lists/ndarrays must have the same length.

import pandas as pd
    
# Create DF from dict of lists/ndarrays
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [4, 3, 2, 1]})
df
# Output:
#       A  B
#    0  1  4
#    1  2  3
#    2  3  2
#    3  4  1

If the arrays do not all have the same length, an error is raised:

df = pd.DataFrame({'A' : [1, 2, 3, 4], 'B' : [5, 5, 5]}) # a ValueError is raised
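A short sketch of how that error could be caught and inspected. The exact message text depends on the pandas version, so it is printed rather than assumed.

```python
import pandas as pd

raised = False
try:
    pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 5, 5]})
except ValueError as exc:
    raised = True
    print('ValueError:', exc)  # message wording varies between versions
```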

Using ndarrays

import pandas as pd
import numpy as np

np.random.seed(123) 
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
df
# Output:           X  Y
#         0 -1.085631  0
#         1  0.997345  1
#         2  0.282978  2
#         3 -1.506295  3

See more details at: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-ndarrays-lists

Create a sample DataFrame with datetime

import pandas as pd
import numpy as np

np.random.seed(0)
# create an array of 5 timestamps starting at '2015-02-24', one per minute
rng = pd.date_range('2015-02-24', periods=5, freq='min')  # 'min' is equivalent to 'T', which newer pandas deprecates
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) }) 

print (df)
# Output:
#                  Date       Val
# 0 2015-02-24 00:00:00  1.764052
# 1 2015-02-24 00:01:00  0.400157
# 2 2015-02-24 00:02:00  0.978738
# 3 2015-02-24 00:03:00  2.240893
# 4 2015-02-24 00:04:00  1.867558

# create an array of 5 dates starting at '2015-02-24', one per day
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))}) 

print (df)
# Output:
#         Date       Val
# 0 2015-02-24 -0.977278
# 1 2015-02-25  0.950088
# 2 2015-02-26 -0.151357
# 3 2015-02-27 -0.103219
# 4 2015-02-28  0.410599

# create an array of 5 year-end dates on or after '2015-02-24', one every 3 years
# (pandas 2.2 and later spell the year-end alias 'YE' instead of 'A')
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})  

print (df)
# Output:
#         Date       Val
# 0 2015-12-31  0.144044
# 1 2018-12-31  1.454274
# 2 2021-12-31  0.761038
# 3 2024-12-31  0.121675
# 4 2027-12-31  0.443863

DataFrame with a DatetimeIndex:

import pandas as pd
import numpy as np

np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=5, freq='min')
df = pd.DataFrame({ 'Val' : np.random.randn(len(rng)) }, index=rng)  

print (df)
# Output:
#                           Val
# 2015-02-24 00:00:00  1.764052
# 2015-02-24 00:01:00  0.400157
# 2015-02-24 00:02:00  0.978738
# 2015-02-24 00:03:00  2.240893
# 2015-02-24 00:04:00  1.867558

Offset aliases for the freq parameter of date_range (pandas 2.2 and later rename some of them, for example T to min, H to h, M to ME, and A to YE):

Alias     Description
B         business day frequency  
C         custom business day frequency (experimental)  
D         calendar day frequency  
W         weekly frequency  
M         month end frequency  
BM        business month end frequency  
CBM       custom business month end frequency  
MS        month start frequency  
BMS       business month start frequency  
CBMS      custom business month start frequency  
Q         quarter end frequency  
BQ        business quarter end frequency  
QS        quarter start frequency  
BQS       business quarter start frequency  
A         year end frequency  
BA        business year end frequency  
AS        year start frequency  
BAS       business year start frequency  
BH        business hour frequency  
H         hourly frequency  
T, min    minutely frequency  
S         secondly frequency  
L, ms     milliseconds  
U, us     microseconds  
N         nanoseconds
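To illustrate a few of these aliases, the sketch below generates ranges with B, MS, and W, which were chosen because they are spelled the same way across pandas versions. The start date 2015-02-24 is a Tuesday.

```python
import pandas as pd

# Business days: the weekend of 2015-02-28 / 2015-03-01 is skipped.
business_days = pd.date_range('2015-02-24', periods=5, freq='B')

# Month starts: the first value is the first month start on or after the start date.
month_starts = pd.date_range('2015-02-24', periods=3, freq='MS')

# Weekly: anchored on Sundays by default.
weekly = pd.date_range('2015-02-24', periods=3, freq='W')

print(business_days[-1])  # 2015-03-02 00:00:00
print(month_starts[0])    # 2015-03-01 00:00:00
print(weekly[0])          # 2015-03-01 00:00:00
```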

Create a sample DataFrame with MultiIndex

import pandas as pd
import numpy as np

Using from_tuples:

np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                      ['one', 'two', 'one', 'two',
                       'one', 'two', 'one', 'two']]))

idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

Using from_product:

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
                                 names=['first', 'second'])

Then use this MultiIndex:

df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
print(df)
# Output:
#                      A         B
# first second
# bar   one     1.764052  0.400157
#       two     0.978738  2.240893
# baz   one     1.867558 -0.977278
#       two     0.950088 -0.151357
# foo   one    -0.103219  0.410599
#       two     0.144044  1.454274
# qux   one     0.761038  0.121675
#       two     0.443863  0.333674
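As a sketch of how the two constructors relate, the following builds the same index both ways, checks that the results are equal, and selects rows by level. The integer data from np.arange is used only to make the selected values easy to read.

```python
import numpy as np
import pandas as pd

first = ['bar', 'baz', 'foo', 'qux']
second = ['one', 'two']

idx_tuples = pd.MultiIndex.from_tuples(
    [(f, s) for f in first for s in second], names=['first', 'second'])
idx_product = pd.MultiIndex.from_product([first, second], names=['first', 'second'])
print(idx_tuples.equals(idx_product))  # True

df = pd.DataFrame(np.arange(16).reshape(8, 2), index=idx_product, columns=['A', 'B'])
print(df.loc['baz'])           # both rows whose first level is 'baz'
print(df.loc[('baz', 'two')])  # a single row, returned as a Series
```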

Save and load a DataFrame in pickle format (.pkl)

import pandas as pd

file_name = 'dataframe.pkl'  # example path; pickle files usually use the .pkl extension

# Save the DataFrame (df, created earlier) as a pickled pandas object
df.to_pickle(file_name)

# Load the DataFrame back from the pickled file
df = pd.read_pickle(file_name)
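A self-contained round trip might look like the sketch below. It writes to a temporary directory so nothing is left behind; the file name sample.pkl and the sample data are illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'sample.pkl')  # hypothetical file name
    df.to_pickle(file_name)
    restored = pd.read_pickle(file_name)

print(restored.equals(df))  # True
```

Because unpickling can execute arbitrary code, read_pickle should only be used on files from a trusted source.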

Create a DataFrame from a list of dictionaries

A DataFrame can be created from a list of dictionaries. The keys are used as column names.

import pandas as pd
L = [{'Name': 'John', 'Last Name': 'Smith'},
     {'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)
# Output:
#   Last Name  Name
# 0     Smith  John
# 1      Wood  Mary

Missing values are filled with NaN:

L = [{'Name': 'John', 'Last Name': 'Smith', 'Age': 37},
     {'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)
# Output:     Age Last Name  Name
#          0   37     Smith  John
#          1  NaN      Wood  Mary
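The column order can be fixed by passing the columns parameter together with the list of dictionaries. A minimal sketch follows; the variable name records is illustrative.

```python
import pandas as pd

records = [
    {'Name': 'John', 'Last Name': 'Smith', 'Age': 37},
    {'Name': 'Mary', 'Last Name': 'Wood'},  # no 'Age' key
]

# columns= selects and orders the columns explicitly.
df = pd.DataFrame(records, columns=['Name', 'Last Name', 'Age'])
print(df)

# The missing 'Age' entry shows up as NaN.
print(df['Age'].isna().tolist())  # [False, True]
```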

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
