Converting between str or bytes data and Unicode characters

Created: November-22, 2018

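The contents of files and network messages may represent encoded characters. They usually need to be decoded to Unicode before they can be displayed properly.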

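In Python 2, you may need to convert str data to Unicode characters. The default string literal ('', "", etc.) is a byte string (str), in which any value outside the ASCII range is displayed as an escape sequence. Unicode string literals are written u'' (or u"", etc.).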

Python 2.x >= 2.3

# You get "© abc" encoded in UTF-8 from a file, network, or other data source

s = '\xc2\xa9 abc'  # s is a byte array, not a string of characters
                    # Doesn't know the original was UTF-8
                    # Default form of string literals in Python 2
s[0]                # '\xc2' - meaningless byte (without context such as an encoding)
type(s)             # str - even though it's not a useful one w/o having a known encoding

u = s.decode('utf-8')  # u'\xa9 abc'
                       # Now we have a Unicode string, which can be printed properly
                       # In Python 2, Unicode string literals need a leading u
                       # str.decode interprets the bytes using the given encoding and returns a unicode object
u[0]                # u'\xa9' - Unicode Character 'COPYRIGHT SIGN' (U+00A9) '©'
type(u)             # unicode

u.encode('utf-8')   # '\xc2\xa9 abc'
                    # unicode.encode returns a byte string (str); its non-ASCII bytes are displayed as escapes

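In Python 3, you may need to convert bytes objects (written as bytes literals) to Unicode strings. The default string literal is now a Unicode string, and byte string literals must be written as b'', b"", etc. A bytes object returns True for isinstance(some_val, bytes), assuming some_val is a string that may have been encoded as bytes.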
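The isinstance check described above can be sketched as a small helper for Python 3. The name to_text and its encoding parameter are hypothetical, chosen only for this illustration, and the UTF-8 default is an assumption about the data source.

```python
def to_text(some_val, encoding="utf-8"):
    """Return some_val as str, decoding it first if it is a bytes object.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(some_val, bytes):
        return some_val.decode(encoding)
    return some_val


is_bytes_literal = isinstance(b"abc", bytes)   # True
is_str_literal = isinstance("abc", bytes)      # False

from_bytes = to_text(b"\xc2\xa9 abc")          # '© abc'
from_str = to_text("\u00a9 abc")               # '© abc' (returned unchanged)
```

A helper like this lets calling code accept either type, but it only works when the encoding of the incoming bytes is actually known.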

Python 3.x >= 3.0

# You get from file or network "© abc" encoded in UTF-8

s = b'\xc2\xa9 abc' # s is a byte array, not characters
                    # In Python 3, the default string literal is Unicode; byte array literals need a leading b
s[0]                # 194 (0xc2) - indexing bytes in Python 3 yields an int, not a length-1 bytes object
s[0:1]              # b'\xc2' - meaningless byte (without context such as an encoding)
type(s)             # bytes - now that byte arrays are explicit, Python can show that.

u = s.decode('utf-8')  # '© abc' on a Unicode terminal
                       # bytes.decode converts a byte array to a string (which will, in Python 3, be Unicode)
u[0]                # '©' (== '\u00a9') - Unicode Character 'COPYRIGHT SIGN' (U+00A9)
type(u)             # str
                    # In Python 3, str is a sequence of Unicode code points, not UTF-8 bytes

u.encode('utf-8')   # b'\xc2\xa9 abc'
                    # str.encode produces a byte array, showing ASCII-range bytes as unescaped characters.
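
The listing above assumes the bytes really are UTF-8. As a minimal sketch of what happens when that assumption fails, the following Python 3 snippet decodes the same bytes with the wrong codec and then uses the standard errors argument of bytes.decode and str.encode. The variable names are illustrative only.

```python
data = b"\xc2\xa9 abc"            # UTF-8 bytes for "© abc"

right = data.decode("utf-8")      # '© abc' (5 characters)
wrong = data.decode("latin-1")    # 'Â© abc' (6 characters): each byte became one character

# Invalid UTF-8 raises UnicodeDecodeError by default.
broken = b"\xa9 abc"              # a lone continuation byte is not valid UTF-8
try:
    broken.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# errors="replace" substitutes U+FFFD (REPLACEMENT CHARACTER) instead of raising.
replaced = broken.decode("utf-8", errors="replace")   # '\ufffd abc'

# Encoding can fail too, when the target codec has no byte for a character.
try:
    fallback = right.encode("ascii")
except UnicodeEncodeError:
    fallback = right.encode("ascii", errors="backslashreplace")  # b'\\xa9 abc'
```

Decoding with the wrong codec often succeeds silently and yields garbled text, so the encoding should come from the data source (a file format, an HTTP header, a protocol specification) rather than from a guess.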