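pandas => Data types

Remarks

dtypes are not native to pandas. They are a result of pandas' close architectural coupling to numpy.

The dtype of a column does not have to correlate in any way with the Python type of the objects the column contains.

Here we have a pd.Series of floats. Its dtype is float64.

Then we use astype to "cast" it to object.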

pd.Series([1.,2.,3.,4.,5.]).astype(object)
0    1
1    2
2    3
3    4
4    5
dtype: object

The dtype is now object, but the elements of the Series are still floats. This is logical once you recall that everything in Python is an object, so any value can be upcast to object.

type(pd.Series([1.,2.,3.,4.,5.]).astype(object)[0])
float

Here we try "casting" the floats to strings.

pd.Series([1.,2.,3.,4.,5.]).astype(str)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: object

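The dtype is again object, but the entries are now of type str. pandas has traditionally stored strings as ordinary Python objects rather than in a dedicated numpy string type, so the column is simply reported as object and the dtype tells you nothing more about its contents.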

type(pd.Series([1.,2.,3.,4.,5.]).astype(str)[0])
str

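Do not trust dtypes blindly; they are an artifact of an architectural flaw in pandas. Specify them where you must, but do not rely on the dtype set on a column to tell you what the column actually holds.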
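As a minimal sketch of why the object dtype alone says little, the snippet below builds a Series with values of several Python types and inspects each element. The names `mixed` and `element_types` are illustrative only and do not come from the original text.

```python
import pandas as pd

# An object column can hold values of several Python types at once.
mixed = pd.Series([1.0, "two", 3])

# The dtype only reports "object"; it says nothing about individual values.
print(mixed.dtype)

# Map each element to its Python type to see what the column really holds.
element_types = mixed.map(type)
print(element_types.tolist())
```

Checking the element types this way is slower than reading the dtype, so it is probably best kept for debugging or validation steps.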

Checking the types of columns

The types of columns can be checked with the .dtypes attribute of a DataFrame.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': [True, False, True]})

In [2]: df
Out[2]:
   A    B      C
0  1  1.0   True
1  2  2.0  False
2  3  3.0   True

In [3]: df.dtypes
Out[3]:
A      int64
B    float64
C       bool
dtype: object

For a single Series, use the .dtype attribute.

In [4]: df['A'].dtype
Out[4]: dtype('int64')
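When code needs to branch on a column's dtype, comparing against names such as 'int64' can be brittle. One possible approach, sketched here under the assumption that the helpers in pandas.api.types suit your case, is to test the dtype family instead. The variable names are illustrative.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': [True, False, True]})

# Each helper accepts a Series (or a dtype) and returns a plain bool.
a_is_int = ptypes.is_integer_dtype(df['A'])
b_is_float = ptypes.is_float_dtype(df['B'])
c_is_bool = ptypes.is_bool_dtype(df['C'])

# The dtype names as strings, handy for logging or reports.
dtype_names = {col: str(dt) for col, dt in df.dtypes.items()}
print(a_is_int, b_is_float, c_is_bool, dtype_names)
```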

Changing dtypes

The astype() method changes the dtype of a Series and returns a new Series.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 
                           'C': ['1.1.2010', '2.1.2011', '3.1.2011'], 
                           'D': ['1 days', '2 days', '3 days'],
                           'E': ['1', '2', '3']})
In [2]: df
Out[2]:
   A    B          C       D  E
0  1  1.0   1.1.2010  1 days  1
1  2  2.0   2.1.2011  2 days  2
2  3  3.0   3.1.2011  3 days  3

In [3]: df.dtypes
Out[3]:
A      int64
B    float64
C     object
D     object
E     object
dtype: object

Change the type of column A to float, and the type of column B to integer:

In [4]: df['A'].astype('float')
Out[4]:
0    1.0
1    2.0
2    3.0
Name: A, dtype: float64

In [5]: df['B'].astype('int')  # 'int' is the platform default integer, so the result may be int32 or int64
Out[5]:
0    1
1    2
2    3
Name: B, dtype: int32

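The astype() method is for conversion to a specific type (for example, you can specify .astype('float64'), .astype('float32'), or .astype('float16')). For general conversion, you can use pd.to_numeric, pd.to_datetime and pd.to_timedelta.

Changing the type to numeric

pd.to_numeric changes the values to a numeric type.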
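Because astype() returns a new object, the original column keeps its dtype unless the result is assigned back. The following sketch illustrates that, and also shows the dict form of DataFrame.astype for converting several columns in one call. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0]})

# astype returns a new Series; df['A'] itself is left unchanged.
a_as_float = df['A'].astype('float64')
print(df['A'].dtype, a_as_float.dtype)

# A dict maps column names to target dtypes and returns a new DataFrame.
converted = df.astype({'A': 'float64', 'B': 'int64'})
print(converted.dtypes)
```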

In [6]: pd.to_numeric(df['E'])
Out[6]:
0    1
1    2
2    3
Name: E, dtype: int64

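If the input cannot be converted to a number, pd.to_numeric raises an error by default. You can change that behavior with the errors parameter.

# Ignore the error, return the original input if it cannot be converted
# (errors='ignore' is deprecated as of pandas 2.2)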
In [7]: pd.to_numeric(pd.Series(['1', '2', 'a']), errors='ignore')
Out[7]:
0    1
1    2
2    a
dtype: object

# Return NaN when the input cannot be converted to a number
In [8]: pd.to_numeric(pd.Series(['1', '2', 'a']), errors='coerce')
Out[8]:
0    1.0
1    2.0
2    NaN
dtype: float64

If you need to find all rows whose input cannot be converted to a number, use boolean indexing with isnull:

In [9]: mixed_df = pd.DataFrame({'A': [1, 'x', 'z'],   # separate name, so df from In [1] stays available
                                 'B': [1.0, 2.0, 3.0],
                                 'C': [True, False, True]})

In [10]: pd.to_numeric(mixed_df.A, errors='coerce').isnull()
Out[10]: 
0    False
1     True
2     True
Name: A, dtype: bool

In [11]: mixed_df[pd.to_numeric(mixed_df.A, errors='coerce').isnull()]
Out[11]: 
   A    B      C
1  x  2.0  False
2  z  3.0   True
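Once the unconvertible rows have been identified, a common next step is to keep only the rows that did convert. The sketch below is one way to do that; `raw`, `numeric_a`, `bad_rows` and `clean` are illustrative names, not part of the original example.

```python
import pandas as pd

raw = pd.DataFrame({'A': [1, 'x', 'z'], 'B': [1.0, 2.0, 3.0]})

# Coerce once and reuse the result: unconvertible values become NaN.
numeric_a = pd.to_numeric(raw['A'], errors='coerce')
is_valid = numeric_a.notnull()

# Rows that failed conversion, kept for inspection.
bad_rows = raw[~is_valid]

# Rows that converted, with column A replaced by its numeric version.
clean = raw[is_valid].assign(A=numeric_a[is_valid])
print(bad_rows)
print(clean.dtypes)
```

Note that this treats values that were already missing in the input the same as values that failed to convert.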

Changing the type to datetime

In [12]: pd.to_datetime(df['C'])
Out[12]:
0   2010-01-01
1   2011-02-01
2   2011-03-01
Name: C, dtype: datetime64[ns]

Note that 2.1.2011 is converted to February 1, 2011. If you want January 2, 2011 instead, you need to use the dayfirst parameter.

In [13]: pd.to_datetime('2.1.2011', dayfirst=True)
Out[13]: Timestamp('2011-01-02 00:00:00')
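When the layout of the date strings is known in advance, passing an explicit format removes the day/month ambiguity altogether. This is a sketch assuming the strings are always day.month.year; the names `dates` and `parsed` are illustrative.

```python
import pandas as pd

dates = pd.Series(['1.1.2010', '2.1.2011', '3.1.2011'])

# %d.%m.%Y states explicitly that the day comes first.
parsed = pd.to_datetime(dates, format='%d.%m.%Y')

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

With an explicit format, a string that does not match raises an error instead of being silently parsed in another order.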

Changing the type to timedelta

In [14]: pd.to_timedelta(df['D'])
Out[14]:
0   1 days
1   2 days
2   3 days
Name: D, dtype: timedelta64[ns]
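After conversion, the values behave as real durations rather than strings. The sketch below shows two things this enables: extracting the day component and adding the durations to a timestamp. The start date and variable names are arbitrary choices for illustration.

```python
import pandas as pd

durations = pd.to_timedelta(pd.Series(['1 days', '2 days', '3 days']))

# The .dt accessor exposes the components of each duration.
day_counts = durations.dt.days

# Durations can be added to a timestamp to get a Series of dates.
start = pd.Timestamp('2011-01-01')
end_dates = start + durations

print(day_counts.tolist())
print(end_dates.dt.day.tolist())
```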

Selecting columns based on dtype

The select_dtypes method can be used to select columns based on dtype.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'], 
                           'D': [True, False, True]})

In [2]: df
Out[2]: 
   A    B  C      D
0  1  1.0  a   True
1  2  2.0  b  False
2  3  3.0  c   True

With the include and exclude parameters you can specify which types you want:

# Select numbers
In [3]: df.select_dtypes(include=['number'])  # You need to use a list
Out[3]:
   A    B
0  1  1.0
1  2  2.0
2  3  3.0    

# Select numbers and booleans
In [4]: df.select_dtypes(include=['number', 'bool'])
Out[4]:
   A    B      D
0  1  1.0   True
1  2  2.0  False
2  3  3.0   True

# Select numbers and booleans but exclude int64
In [5]: df.select_dtypes(include=['number', 'bool'], exclude=['int64'])
Out[5]:
     B      D
0  1.0   True
1  2.0  False
2  3.0   True
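select_dtypes is often used just to obtain column names, which can then drive other operations. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
                   'C': ['a', 'b', 'c'], 'D': [True, False, True]})

# Column names by dtype family; 'number' does not include bool.
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
other_cols = df.select_dtypes(exclude=['number']).columns.tolist()

# Apply a numeric-only operation to just the numeric columns.
numeric_means = df[numeric_cols].mean()

print(numeric_cols, other_cols)
print(numeric_means)
```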

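Summarizing dtypes

The get_dtype_counts method can be used to see a breakdown of dtypes. It was deprecated in pandas 0.25 and removed in pandas 1.0; on current versions, df.dtypes.value_counts() gives the same kind of breakdown.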

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'], 
                           'D': [True, False, True]})

In [2]: df.get_dtype_counts()
Out[2]: 
bool       1
float64    1
int64      1
object     1
dtype: int64
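On pandas versions where get_dtype_counts is no longer available, a comparable summary can be built from the dtypes Series. This is a sketch of that replacement; the exact dtype name reported for the string column may differ between pandas versions, and `counts_by_name` is an illustrative helper name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0],
                   'C': ['a', 'b', 'c'], 'D': [True, False, True]})

# df.dtypes is a Series of dtypes, so value_counts tallies columns per dtype.
dtype_counts = df.dtypes.value_counts()

# Convert to a plain dict keyed by dtype name for easier lookup.
counts_by_name = {str(dtype): int(count) for dtype, count in dtype_counts.items()}
print(counts_by_name)
```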

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
