Python Language => Unicode and bytes

Syntax

str.encode(encoding, errors='strict')
bytes.decode(encoding, errors='strict')
open(filename, mode, encoding=None)

Parameters

Parameter	Details
encoding	The encoding to use, e.g. `'ascii'`, `'utf8'`, etc.
errors	The error mode, e.g. `'replace'` to replace bad characters with question marks, `'ignore'` to drop bad characters, etc.

Basics

In Python 3, str is the type for Unicode-aware strings, while bytes is the type for sequences of raw bytes.

type("f") == type(u"f")  # True, <class 'str'>
type(b"f")               # <class 'bytes'>

In Python 2, a plain string literal was a sequence of raw bytes by default, and only strings prefixed with "u" were Unicode strings.

type("f") == type(b"f")  # True, <type 'str'>
type(u"f")               # <type 'unicode'>
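To make the difference between the two Python 3 types more concrete, here is a small sketch (the variable names are illustrative only) comparing how str and bytes report length and indexing for the same value:

```python
text = "£13.55"             # str: a sequence of Unicode code points
data = text.encode("utf8")  # bytes: a sequence of integers in range(256)

char_count = len(text)      # counts code points
byte_count = len(data)      # counts bytes; '£' needs two bytes in UTF-8

first_char = text[0]        # indexing a str gives a str of length 1
first_byte = data[0]        # indexing bytes gives an int (here 0xC2)
```

The string has 6 characters, but its UTF-8 form has 7 bytes, which is why the two types should not be treated as interchangeable.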

Unicode to bytes

Unicode strings can be converted to bytes with .encode(encoding).

Python 3

>>> "£13.55".encode('utf8')
b'\xc2\xa313.55'
>>> "£13.55".encode('utf16')
b'\xff\xfe\xa3\x001\x003\x00.\x005\x005\x00'

Python 2

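In Python 2, the default encoding is 'ascii' (sys.getdefaultencoding() == 'ascii'), not 'utf-8' as in Python 3, so the previous example cannot be printed directly. A source file containing non-ASCII characters also needs an encoding declaration.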

>>> print type(u"£13.55".encode('utf8'))
<type 'str'>
>>> print u"£13.55".encode('utf8')
SyntaxError: Non-ASCII character '\xc2' in...

# with encoding set inside a file

# -*- coding: utf-8 -*-
>>> print u"£13.55".encode('utf8')
┬ú13.55

If the encoding cannot represent the string, a `UnicodeEncodeError` is raised:

>>> "£13.55".encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in position 0: ordinal not in range(128)
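Since UnicodeEncodeError is an ordinary exception, it can be caught. The following is a minimal sketch, not a recommended policy: the helper name first_working_encoding and its candidate list are assumptions made for illustration. It tries several encodings in order and returns the first one that can represent the text.

```python
def first_working_encoding(text, candidates=("ascii", "latin-1", "utf8")):
    """Return (encoding_name, encoded_bytes) for the first candidate that works.

    Hypothetical helper: the candidate order is an arbitrary choice.
    """
    for name in candidates:
        try:
            return name, text.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent some character
    raise ValueError("no candidate encoding can represent the text")


enc_plain, bytes_plain = first_working_encoding("13.55")   # pure ASCII
enc_pound, bytes_pound = first_working_encoding("£13.55")  # needs latin-1
enc_euro, bytes_euro = first_working_encoding("€13.55")    # needs utf8
```

In practice it is usually simpler to pick 'utf8' outright, since it can encode every Unicode string that contains no lone surrogates.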

Bytes to Unicode

Bytes can be converted to Unicode strings with .decode(encoding).

A sequence of bytes can only be converted into the intended Unicode string by using the encoding it was written in.

>>> b'\xc2\xa313.55'.decode('utf8')
'£13.55'

If the encoding cannot handle the bytes, a UnicodeDecodeError is raised:

>>> b'\xc2\xa313.55'.decode('utf16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/csaftoiu/csaftoiu-github/yahoo-groups-backup/.virtualenv/bin/../lib/python3.5/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x35 in position 6: truncated data
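A wrong encoding does not always raise an error. The sketch below (variable names are illustrative) decodes the same bytes three ways: correctly, with an encoding that silently produces garbled text, and with one that fails.

```python
data = "£13.55".encode("utf8")   # b'\xc2\xa313.55'

right = data.decode("utf8")      # the original text
wrong = data.decode("latin-1")   # no exception, but every byte becomes one character

try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    bad_position = exc.start     # index of the first byte that could not be decoded
```

Here latin-1 maps each of the two bytes of '£' to a separate character, so the result is 'Â£13.55' rather than an exception. Silent corruption of this kind is often harder to notice than a UnicodeDecodeError.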

Encoding/decoding error handling

.encode and .decode both have error modes.

The default is 'strict', which raises an exception on error. The other modes are more forgiving.

Encoding

>>> "£13.55".encode('ascii', errors='replace')
b'?13.55'
>>> "£13.55".encode('ascii', errors='ignore')
b'13.55'
>>> "£13.55".encode('ascii', errors='namereplace')
b'\\N{POUND SIGN}13.55'
>>> "£13.55".encode('ascii', errors='xmlcharrefreplace')
b'&#163;13.55'
>>> "£13.55".encode('ascii', errors='backslashreplace')
b'\\xa313.55'

Decoding

>>> b = "£13.55".encode('utf8')
>>> b.decode('ascii', errors='replace')
'��13.55'
>>> b.decode('ascii', errors='ignore')
'13.55'
>>> b.decode('ascii', errors='backslashreplace')
'\\xc2\\xa313.55'
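The forgiving modes differ in whether information survives. As a rough illustration (using html.unescape from the standard library to reverse the XML character reference), 'replace' discards the original character while 'xmlcharrefreplace' keeps enough to recover it:

```python
import html

original = "£13.55"

# 'replace' substitutes '?', so the pound sign cannot be recovered.
lossy = original.encode("ascii", errors="replace").decode("ascii")

# 'xmlcharrefreplace' writes a numeric character reference instead.
escaped = original.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# The reference can be turned back into the original character.
recovered = html.unescape(escaped)
```

Which mode is appropriate depends on whether the output only needs to be displayable or must round-trip back to the original text.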

Moral

As the examples above show, it is vital to keep your encodings straight when dealing with Unicode and bytes.

File I/O

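Files opened in a non-binary mode (e.g. 'r' or 'w') deal with strings. The default encoding is platform-dependent (whatever locale.getpreferredencoding(False) returns, often 'utf8'), so pass encoding explicitly when portability matters.

open(fn, mode='r')                    # opens file for reading in the default encoding
open(fn, mode='r', encoding='utf16')  # opens file for reading utf16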

# ERROR: cannot write bytes when a string is expected:
open("foo.txt", "w").write(b"foo")
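A self-contained sketch of text-mode I/O with an explicit encoding follows. The scratch file path is created with tempfile purely for illustration; it also shows that writing bytes to a text-mode file raises TypeError.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# Text mode: write and read str, with the encoding stated explicitly.
with open(path, mode="w", encoding="utf8") as f:
    f.write("£13.55")

with open(path, mode="r", encoding="utf8") as f:
    text = f.read()

# Writing bytes to a text-mode file is rejected.
try:
    with open(path, mode="w", encoding="utf8") as f:
        f.write(b"foo")
    wrote_bytes = True
except TypeError:
    wrote_bytes = False
```

Note that reopening the file with mode "w" truncates it, so the last block leaves an empty file behind.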

Files opened in a binary mode (e.g. 'rb' or 'wb') deal with bytes. No encoding argument can be given, because no encoding is involved.

open(fn, mode='wb')  # open file for writing bytes

# ERROR: cannot write string when bytes is expected:
open(fn, mode='wb').write("hi")
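In binary mode the encoding step is done by hand. The sketch below (the temporary path is again just for illustration) writes encoded bytes, reads them back, decodes them, and checks that passing encoding to a binary-mode open raises ValueError.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "price.bin")

# Binary mode: encode before writing...
with open(path, mode="wb") as f:
    f.write("£13.55".encode("utf16"))

# ...and decode after reading.
with open(path, mode="rb") as f:
    raw = f.read()

text = raw.decode("utf16")

# An encoding argument is not accepted in binary mode.
try:
    open(path, mode="rb", encoding="utf8")
    encoding_allowed = True
except ValueError:
    encoding_allowed = False
```

The reader must know that the file holds UTF-16 data; nothing in binary mode records or checks this.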

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
