pandas => データフレームの作成

前書き

DataFrameは、 Series ＆ Panelを除き、 pandasライブラリが提供するデータ構造です。それは2次元構造であり、行と列のテーブルと比較することができます。

各行は、整数インデックス（0..N）、またはDataFrameオブジェクトの作成時に明示的に設定されたラベルによって識別できます。各列は異なるタイプのもので、ラベルで識別されます。

このトピックでは、DataFrameオブジェクトを作成/作成するさまざまな方法について説明します。 Ex。 Numpy配列から、タプルのリストから、辞書から。

サンプルのDataFrameを作成する

import pandas as pd

numbersとcolors 2つの列を含む辞書からDataFrameを作成します。各キーは列名を表し、値は一連のデータです。列の内容は次のとおりです。

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})

データフレームの内容を表示する：

print(df)
# Output: 
#   colors  numbers
# 0    red        1
# 1  white        2
# 2   blue        3

dictが注文されていないので、Pandasは列をアルファベット順に並べます。順序を指定するには、 columnsパラメーターを使用します。

df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']}, 
                  columns=['numbers', 'colors'])

print(df)  
# Output:     
#    numbers colors
# 0        1    red
# 1        2  white
# 2        3   blue

Numpyを使用してサンプルDataFrameを作成する

乱数のDataFrameを作成する：

import numpy as np
import pandas as pd

# Set the seed for a reproducible sample
np.random.seed(0)  

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))

print(df)
# Output:
#           A         B         C
# 0  1.764052  0.400157  0.978738
# 1  2.240893  1.867558 -0.977278
# 2  0.950088 -0.151357 -0.103219
# 3  0.410599  0.144044  1.454274
# 4  0.761038  0.121675  0.443863

整数でDataFrameを作成する：

df = pd.DataFrame(np.arange(15).reshape(5,3),columns=list('ABC'))

print(df)
# Output:
#     A   B   C
# 0   0   1   2
# 1   3   4   5
# 2   6   7   8
# 3   9  10  11
# 4  12  13  14

DataFrameを作成し、列と行にnans（ NaT, NaN, 'nan', None ）を含める：

df = pd.DataFrame(np.arange(48).reshape(8,6),columns=list('ABCDEF'))

print(df)
# Output: 
#     A   B   C   D   E   F
# 0   0   1   2   3   4   5
# 1   6   7   8   9  10  11
# 2  12  13  14  15  16  17
# 3  18  19  20  21  22  23
# 4  24  25  26  27  28  29
# 5  30  31  32  33  34  35
# 6  36  37  38  39  40  41
# 7  42  43  44  45  46  47

df.ix[::2,0] = np.nan # in column 0, set elements with indices 0,2,4, ... to NaN 
df.ix[::4,1] = pd.NaT # in column 1, set elements with indices 0,4, ... to np.NaT
df.ix[:3,2] = 'nan'   # in column 2, set elements with index from 0 to 3 to 'nan'
df.ix[:,5] = None     # in column 5, set all elements to None
df.ix[5,:] = None     # in row 5, set all elements to None    
df.ix[7,:] = np.nan   # in row 7, set all elements to NaN

print(df)
# Output:
#     A     B     C   D   E     F
# 0 NaN   NaT   nan   3   4  None
# 1   6     7   nan   9  10  None
# 2 NaN    13   nan  15  16  None
# 3  18    19   nan  21  22  None
# 4 NaN   NaT    26  27  28  None
# 5 NaN  None  None NaN NaN  None
# 6 NaN    37    38  39  40  None
# 7 NaN   NaN   NaN NaN NaN   NaN

Dictionaryを使用して複数のコレクションからサンプルDataFrameを作成する

import pandas as pd
import numpy as np

np.random.seed(123) 
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
>>> df
          X  Y
0 -1.085631  0
1  0.997345  1
2  0.282978  2
3 -1.506295  3

タプルのリストからDataFrameを作成する

単純なタプルのリストからDataFrameを作成し、使用するタプルの特定の要素を選択することもできます。ここでは、最後の要素を除く各タプルのすべてのデータを使用してDataFrameを作成します。

import pandas as pd

data = [
('p1', 't1', 1, 2),
('p1', 't2', 3, 4),
('p2', 't1', 5, 6),
('p2', 't2', 7, 8),
('p2', 't3', 2, 8)
]

df = pd.DataFrame(data)

print(df)
#     0   1  2  3
# 0  p1  t1  1  2
# 1  p1  t2  3  4
# 2  p2  t1  5  6
# 3  p2  t2  7  8
# 4  p2  t3  2  8

リストの辞書からDataFrameを作成する

複数のリストからDataFrameを作成するには、値のリストを持つdictを渡します。辞書のキーは列ラベルとして使用されます。リストはndarrayでもかまいません。 lists / ndarraysはすべて同じ長さでなければなりません。

import pandas as pd
    
# Create DF from dict of lists/ndarrays
df = pd.DataFrame({'A' : [1, 2, 3, 4],
                       'B' : [4, 3, 2, 1]})
df
# Output:
#       A  B
#    0  1  4
#    1  2  3
#    2  3  2
#    3  4  1

配列の長さが同じでない場合は、エラーが発生します

df = pd.DataFrame({'A' : [1, 2, 3, 4], 'B' : [5, 5, 5]}) # a ValueError is raised

ndarraysの使用

import pandas as pd
import numpy as np

np.random.seed(123) 
x = np.random.standard_normal(4)
y = range(4)
df = pd.DataFrame({'X':x, 'Y':y})
df
# Output:           X  Y
#         0 -1.085631  0
#         1  0.997345  1
#         2  0.282978  2
#         3 -1.506295  3

詳細については、 http ： //pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-ndarrays-listsを参照してください。

datetimeでサンプルのDataFrameを作成する

import pandas as pd
import numpy as np

np.random.seed(0)
# create an array of 5 dates starting at '2015-02-24', one per minute
rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Val': np.random.randn(len(rng)) }) 

print (df)
# Output:
#                  Date       Val
# 0 2015-02-24 00:00:00  1.764052
# 1 2015-02-24 00:01:00  0.400157
# 2 2015-02-24 00:02:00  0.978738
# 3 2015-02-24 00:03:00  2.240893
# 4 2015-02-24 00:04:00  1.867558

# create an array of 5 dates starting at '2015-02-24', one per day
rng = pd.date_range('2015-02-24', periods=5, freq='D')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))}) 

print (df)
# Output:
#         Date       Val
# 0 2015-02-24 -0.977278
# 1 2015-02-25  0.950088
# 2 2015-02-26 -0.151357
# 3 2015-02-27 -0.103219
# 4 2015-02-28  0.410599

# create an array of 5 dates starting at '2015-02-24', one every 3 years
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})  

print (df)
# Output:
#         Date       Val
# 0 2015-12-31  0.144044
# 1 2018-12-31  1.454274
# 2 2021-12-31  0.761038
# 3 2024-12-31  0.121675
# 4 2027-12-31  0.443863

DatetimeIndex使用したDataFrame ：

import pandas as pd
import numpy as np

np.random.seed(0)
rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Val' : np.random.randn(len(rng)) }, index=rng)  

print (df)
# Output:
#                           Val
# 2015-02-24 00:00:00  1.764052
# 2015-02-24 00:01:00  0.400157
# 2015-02-24 00:02:00  0.978738
# 2015-02-24 00:03:00  2.240893
# 2015-02-24 00:04:00  1.867558

date_rangeパラメータfreq Offset-aliases ：

Alias     Description
B         business day frequency  
C         custom business day frequency (experimental)  
D         calendar day frequency  
W         weekly frequency  
M         month end frequency  
BM        business month end frequency  
CBM       custom business month end frequency  
MS        month start frequency  
BMS       business month start frequency  
CBMS      custom business month start frequency  
Q         quarter end frequency  
BQ        business quarter endfrequency  
QS        quarter start frequency  
BQS       business quarter start frequency  
A         year end frequency  
BA        business year end frequency  
AS        year start frequency  
BAS       business year start frequency  
BH        business hour frequency  
H         hourly frequency  
T, min    minutely frequency  
S         secondly frequency  
L, ms     milliseconds  
U, us     microseconds  
N         nanoseconds

MultiIndexでサンプルのDataFrameを作成する

import pandas as pd
import numpy as np

from_tuplesを使用from_tuples ：

np.random.seed(0)
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                      ['one', 'two', 'one', 'two',
                       'one', 'two', 'one', 'two']]))

idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

from_product使用：

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'],['one','two']])

次に、このMultiIndexを使用します。

df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=['A', 'B'])
print (df)
                     A         B
first second                    
bar   one     1.764052  0.400157
      two     0.978738  2.240893
baz   one     1.867558 -0.977278
      two     0.950088 -0.151357
foo   one    -0.103219  0.410599
      two     0.144044  1.454274
qux   one     0.761038  0.121675
      two     0.443863  0.333674

データフレームを保存してピクルス（.plk）形式で読み込む

import pandas as pd

# Save dataframe to pickled pandas object
df.to_pickle(file_name) # where to save it usually as a .plk

# Load dataframe from pickled pandas object
df= pd.read_pickle(file_name)

辞書のリストからDataFrameを作成する

DataFrameは辞書のリストから作成できます。キーは列名として使用されます。

import pandas as pd
L = [{'Name': 'John', 'Last Name': 'Smith'}, 
         {'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)
# Output:  Last Name  Name
# 0     Smith  John
# 1      Wood  Mary

欠損値はNaN埋められます

L = [{'Name': 'John', 'Last Name': 'Smith', 'Age': 37},
     {'Name': 'Mary', 'Last Name': 'Wood'}]
pd.DataFrame(L)
# Output:     Age Last Name  Name
#          0   37     Smith  John
#          1  NaN      Wood  Mary

Modified text is an extract of the original Stack Overflow Documentation

ライセンスを受けた CC BY-SA 3.0

所属していない Stack Overflow

pandas
データフレームの作成

サーチ…

前書き