pandas => Data Types

Remarks

dtypes are not native to pandas. They are a result of pandas' close architectural coupling to numpy.

The dtype of a column does not have to correlate in any way with the Python type of the objects contained in that column.

Here we have a pd.Series of floats. Its dtype is float.

Then we use astype to "cast" it to object.

pd.Series([1.,2.,3.,4.,5.]).astype(object)
0    1
1    2
2    3
3    4
4    5
dtype: object

The dtype is now object, but the objects in the list are still floats. This is logical if you know that in Python everything is an object and can therefore be upcast to object.

type(pd.Series([1.,2.,3.,4.,5.]).astype(object)[0])
float

Here we try "casting" the floats to strings.

pd.Series([1.,2.,3.,4.,5.]).astype(str)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: object

The dtype is again object, but the entries in the list are now strings. This is because numpy does not deal with strings, so it treats them as plain objects of no concern to it.

type(pd.Series([1.,2.,3.,4.,5.]).astype(str)[0])
str

Do not trust dtypes; they are an artifact of an architectural flaw in pandas. Specify them when you must, but do not rely on which dtype is set on a column.
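To see what a column really holds, one option is to inspect the Python type of each element rather than the dtype. The snippet below is a minimal sketch; the mixed-type Series is made up for illustration and is not part of the original examples.

```python
import pandas as pd

# Hypothetical example data: three different Python types in one column.
s = pd.Series([1.0, "two", 3], dtype=object)

# The dtype only says "object"; it does not describe the elements.
print(s.dtype)

# Map every element to its Python type to see what is actually stored.
element_types = s.map(type)
print(element_types.tolist())
```

The dtype stays object no matter which Python types are mixed in the column, which is why the per-element check is the more reliable one here.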

Checking the types of columns

The types of the columns can be checked with the .dtypes attribute of a DataFrame.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': [True, False, True]})

In [2]: df
Out[2]:
   A    B      C
0  1  1.0   True
1  2  2.0  False
2  3  3.0   True

In [3]: df.dtypes
Out[3]:
A      int64
B    float64
C       bool
dtype: object

For a single Series, you can use the .dtype attribute.

In [4]: df['A'].dtype
Out[4]: dtype('int64')
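When the check has to happen in code rather than by eye, the helper functions in pandas.api.types can be used. This is a small sketch, assuming the same DataFrame as above; the variable name kinds is only for illustration.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_float_dtype, is_integer_dtype

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': [True, False, True]})

# Classify each column by testing its dtype.
kinds = {}
for name in df.columns:
    if is_bool_dtype(df[name]):
        kinds[name] = 'bool'
    elif is_integer_dtype(df[name]):
        kinds[name] = 'integer'
    elif is_float_dtype(df[name]):
        kinds[name] = 'float'
    else:
        kinds[name] = 'other'

print(kinds)
```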

Changing dtypes

The astype() method changes the dtype of a Series and returns a new Series.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 
                           'C': ['1.1.2010', '2.1.2011', '3.1.2011'], 
                           'D': ['1 days', '2 days', '3 days'],
                           'E': ['1', '2', '3']})
In [2]: df
Out[2]:
   A    B          C       D  E
0  1  1.0   1.1.2010  1 days  1
1  2  2.0   2.1.2011  2 days  2
2  3  3.0   3.1.2011  3 days  3

In [3]: df.dtypes
Out[3]:
A      int64
B    float64
C     object
D     object
E     object
dtype: object

Change the type of column A to float and the type of column B to integer:

In [4]: df['A'].astype('float')
Out[4]:
0    1.0
1    2.0
2    3.0
Name: A, dtype: float64

In [5]: df['B'].astype('int')
Out[5]:
0    1
1    2
2    3
Name: B, dtype: int32

The astype() method is for conversion to a specific type (for example, .astype('float64'), .astype('float32') or .astype('float16')). For general conversion, you can use pd.to_numeric, pd.to_datetime and pd.to_timedelta.
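Because astype() returns a new object, the result has to be assigned to a name if the change should stick. As a sketch, astype() also accepts a dict mapping column names to dtypes when called on a DataFrame, which converts several columns in one call:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0]})

# Convert A to float and B to integer in a single call.
converted = df.astype({'A': 'float64', 'B': 'int64'})

print(df.dtypes)         # the original frame is unchanged
print(converted.dtypes)  # the new frame carries the requested dtypes
```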

Changing the type to numeric

pd.to_numeric changes the values to a numeric type.

In [6]: pd.to_numeric(df['E'])
Out[6]:
0    1
1    2
2    3
Name: E, dtype: int64

If the input cannot be converted to a number, pd.to_numeric raises an error by default. You can change that behavior with the errors parameter.

# Ignore the error, return the original input if it cannot be converted
# (note: errors='ignore' is deprecated in recent pandas releases)
In [7]: pd.to_numeric(pd.Series(['1', '2', 'a']), errors='ignore')
Out[7]:
0    1
1    2
2    a
dtype: object

# Return NaN when the input cannot be converted to a number
In [8]: pd.to_numeric(pd.Series(['1', '2', 'a']), errors='coerce')
Out[8]:
0    1.0
1    2.0
2    NaN
dtype: float64
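For completeness, the default behavior (errors='raise') can be sketched as follows. The flag name raised is only for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

values = pd.Series(['1', '2', 'a'])

try:
    pd.to_numeric(values)  # errors='raise' is the default
    raised = False
except ValueError as exc:
    # The string 'a' cannot be parsed as a number.
    raised = True
    print(f"conversion failed: {exc}")
```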

If needed, you can find all rows whose input cannot be converted to a number by using boolean indexing with isnull:

In [9]: df = pd.DataFrame({'A': [1, 'x', 'z'],
                           'B': [1.0, 2.0, 3.0],
                           'C': [True, False, True]})

In [10]: pd.to_numeric(df.A, errors='coerce').isnull()
Out[10]: 
0    False
1     True
2     True
Name: A, dtype: bool

In [11]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[11]: 
   A    B      C
1  x  2.0  False
2  z  3.0   True

Changing the type to datetime

In [12]: pd.to_datetime(df['C'])
Out[12]:
0   2010-01-01
1   2011-02-01
2   2011-03-01
Name: C, dtype: datetime64[ns]

Note that 2.1.2011 is converted to February 1, 2011. If you want January 2, 2011 instead, you need to use the dayfirst parameter.

In [13]: pd.to_datetime('2.1.2011', dayfirst=True)
Out[13]: Timestamp('2011-01-02 00:00:00')
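When the layout of the strings is known in advance, an alternative to dayfirst is to pass an explicit format, which removes the ambiguity altogether. A minimal sketch, assuming the values in column C are written as day.month.year:

```python
import pandas as pd

dates = pd.Series(['1.1.2010', '2.1.2011', '3.1.2011'])

# Spell out the layout: day.month.year
parsed = pd.to_datetime(dates, format='%d.%m.%Y')
print(parsed)
```

With this format, every value lands in January, so 2.1.2011 is read as January 2, 2011.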

Changing the type to timedelta

In [14]: pd.to_timedelta(df['D'])
Out[14]:
0   1 days
1   2 days
2   3 days
Name: D, dtype: timedelta64[ns]
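Once the values are real timedeltas, they can be used in arithmetic, which is not possible with the original strings. The following sketch assumes the same 'N days' strings as column D; the start date is an arbitrary value chosen for illustration.

```python
import pandas as pd

durations = pd.to_timedelta(pd.Series(['1 days', '2 days', '3 days']))

# Components are available through the .dt accessor.
print(durations.dt.days.tolist())

# Timedeltas can be added to a timestamp to shift it.
start = pd.Timestamp('2011-01-01')  # arbitrary example date
shifted = start + durations
print(shifted)
```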

Selecting columns based on dtype

The select_dtypes method can be used to select columns based on dtype.

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'], 
                           'D': [True, False, True]})

In [2]: df
Out[2]: 
   A    B  C      D
0  1  1.0  a   True
1  2  2.0  b  False
2  3  3.0  c   True

With the include and exclude parameters you can specify which types you want:

# Select numbers
In [3]: df.select_dtypes(include=['number'])  # You need to use a list
Out[3]:
   A    B
0  1  1.0
1  2  2.0
2  3  3.0    

# Select numbers and booleans
In [4]: df.select_dtypes(include=['number', 'bool'])
Out[4]:
   A    B      D
0  1  1.0   True
1  2  2.0  False
2  3  3.0   True

# Select numbers and booleans but exclude int64
In [5]: df.select_dtypes(include=['number', 'bool'], exclude=['int64'])
Out[5]:
     B      D
0  1.0   True
1  2.0  False
2  3.0   True
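A common follow-up is to take only the column names from the selection and apply an operation to just those columns. This is a sketch under the assumption that the same DataFrame is used; the multiplication by 10 is an arbitrary example operation.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'],
                   'D': [True, False, True]})

# Names of the numeric columns only (booleans are not matched by 'number').
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
print(numeric_cols)

# Apply an operation to just those columns, leaving the rest untouched.
df[numeric_cols] = df[numeric_cols] * 10
print(df)
```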

Summarizing dtypes

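The get_dtype_counts method can be used to see a breakdown of dtypes. Note that it was deprecated in pandas 0.25 and removed in pandas 1.0; on newer versions, df.dtypes.value_counts() gives the same kind of breakdown.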

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'], 
                           'D': [True, False, True]})

In [2]: df.get_dtype_counts()
Out[2]: 
bool       1
float64    1
int64      1
object     1
dtype: int64
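On pandas versions where get_dtype_counts is no longer available, a comparable summary can be sketched with value_counts on the .dtypes Series. The dictionary by_name is only an illustrative convenience, and the dtype reported for the string column C may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.0, 2.0, 3.0], 'C': ['a', 'b', 'c'],
                   'D': [True, False, True]})

# Count how many columns share each dtype.
counts = df.dtypes.value_counts()
print(counts)

# Optional: key the counts by dtype name for easier lookup.
by_name = {str(dtype): int(n) for dtype, n in counts.items()}
print(by_name)
```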

Modified text is an extract of the original Stack Overflow Documentation, licensed under CC BY-SA 3.0. Not affiliated with Stack Overflow.
