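Python Language => Regular Expressions (Regex)

Introduction

Python makes regular expressions available through the re module.

Regular expressions are combinations of characters that are interpreted as rules for matching substrings. For instance, the expression 'amount\D+\d+' matches any string made up of the word amount and an integer, separated by one or more non-digit characters, such as amount=100, amount is 3, or amount is equal to: 33.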
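As a quick check of the rule described above, the following sketch runs the same pattern against the three sample strings. The fourth string, "amount100", is an extra illustration (not from the original text) of an input that should fail because no non-digit separates the word from the number.

```python
import re

# Pattern from the text: "amount", one or more non-digits, one or more digits.
pattern = r"amount\D+\d+"

# The last sample is a hypothetical counter-example added for illustration.
samples = ["amount=100", "amount is 3", "amount is equal to: 33", "amount100"]

# Map each sample to whether the pattern is found in it.
results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
```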

Syntax

Direct regular expressions
re.match(pattern, string, flags=0) # Out: match of pattern at the beginning of string, or None
re.search(pattern, string, flags=0) # Out: match of pattern anywhere in string, or None
re.findall(pattern, string, flags=0) # Out: list of all matches of pattern in string, or []
re.finditer(pattern, string, flags=0) # Out: same as re.findall, but returns an iterator of match objects
re.sub(pattern, repl, string, count=0, flags=0) # Out: string with repl (a string or a function) in place of pattern
Precompiled regular expressions
precompiled_pattern = re.compile(pattern, flags=0)
precompiled_pattern.match(string) # Out: match at the beginning of string, or None
precompiled_pattern.search(string) # Out: match anywhere in string, or None
precompiled_pattern.findall(string) # Out: list of all matching substrings
precompiled_pattern.sub(repl, string) # Out: string with matches replaced by repl (a string or a function)

Matching the beginning of a string

The first argument of re.match() is the regular expression, the second is the string to match:

import re

pattern = r"123"
string = "123zzb"

re.match(pattern, string)
# Out: <re.Match object; span=(0, 3), match='123'>

match = re.match(pattern, string)

match.group()
# Out: '123'

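You may notice that the pattern variable is a string prefixed with r, which indicates that the string is a raw string literal.

A raw string literal has a slightly different syntax than a normal string literal: a backslash \ in a raw string literal means "just a backslash", so there is no need to double up backslashes to avoid "escape sequences" such as newlines (\n), tabs (\t), backspaces (\b), form feeds (\f), and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.

Hence, r"\n" is a string of 2 characters: \ and n. Regex patterns also use backslashes; for example, \d refers to any digit character. Using raw strings (r"\d") avoids having to double-escape our strings ("\\d").

For instance: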

string = "\\t123zzb" # here the backslash is escaped, so there's no tab, just '\' and 't'
pattern = "\\t123"   # this will match \t (escaping the backslash) followed by 123
re.match(pattern, string)              # Out: None (the pattern expects a tab character)
re.match(pattern, "\t123zzb").group()  # matches '\t123'

pattern = r"\\t123"
re.match(pattern, string).group()   # matches '\\t123'
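The difference between raw and normal literals can be checked directly by looking at string lengths. This is a minimal sketch; the variable names are illustrative only.

```python
import re

raw = r"\n"      # two characters: a backslash and the letter 'n'
cooked = "\n"    # one character: a newline

print(len(raw), len(cooked))

# Both spellings below produce the same pattern text for the regex engine.
assert r"\d+" == "\\d+"

# The raw form is simply easier to read and harder to get wrong.
digits = re.findall(r"\d+", "a1b22")
print(digits)
```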

Matching is done only at the beginning of the string. If you want to match anywhere, use re.search instead:

match = re.match(r"(123)", "a123zzb")

match is None
# Out: True

match = re.search(r"(123)", "a123zzb")

match.group()
# Out: '123'

Searching

pattern = r"(your base)"
sentence = "All your base are belong to us."

match = re.search(pattern, sentence)
match.group(1)
# Out: 'your base'

match = re.search(r"(belong.*)", sentence)
match.group(1)
# Out: 'belong to us.'

Unlike re.match, searching is done anywhere in the string. You can also use re.findall.

You can also search at the beginning of the string (use ^),

match = re.search(r"^123", "123zzb")
match.group(0)
# Out: '123'

match = re.search(r"^123", "a123zzb")
match is None
# Out: True

at the end of the string (use $),

match = re.search(r"123$", "zzb123")
match.group(0)
# Out: '123'

match = re.search(r"123$", "123zzb")
match is None
# Out: True

or both (use both ^ and $):

match = re.search(r"^123$", "123")
match.group(0)
# Out: '123'
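When the whole string has to match, re.fullmatch (available since Python 3.4) is an alternative to anchoring with ^ and $. The sketch below compares the two; one difference worth knowing is that $ also matches just before a trailing newline, while re.fullmatch does not allow one.

```python
import re

# Anchored search and fullmatch agree on an exact match.
anchored = re.search(r"^123$", "123")
full = re.fullmatch(r"123", "123")

# fullmatch rejects strings with extra characters.
partial = re.fullmatch(r"123", "123zzb")

# '$' tolerates one trailing newline; fullmatch does not.
anchored_newline = re.search(r"^123$", "123\n")
full_newline = re.fullmatch(r"123", "123\n")

print(anchored.group(), full.group(), partial)
print(anchored_newline is not None, full_newline is not None)
```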

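Grouping

Grouping is done with parentheses. Calling group() with no arguments returns the entire matched string; the parenthesized subgroups can be fetched individually.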

match.group() # Group without argument returns the entire match found
# Out: '123'
match.group(0) # Specifying 0 gives the same result as specifying no argument
# Out: '123'

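Arguments can also be provided to group() to fetch a particular subgroup.

From the docs:

If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument.

Calling groups(), on the other hand, returns a tuple containing all the subgroups.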

sentence = "This is a phone number 672-123-456-9910"
pattern = r".*(phone).*?([\d-]+)"

match = re.match(pattern, sentence)

match.groups()       # All parenthesized subgroups as a tuple
# Out: ('phone', '672-123-456-9910')

match.group()        # The entire match as a string
# Out: 'This is a phone number 672-123-456-9910'

match.group(0)       # The entire match as a string
# Out: 'This is a phone number 672-123-456-9910'

match.group(1)       # The first parenthesized subgroup.
# Out: 'phone'

match.group(2)       # The second parenthesized subgroup.
# Out: '672-123-456-9910'

match.group(1, 2)    # Multiple arguments give us a tuple.
# Out: ('phone', '672-123-456-9910')
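Because groups() returns a plain tuple, it can be unpacked directly into variables. The following sketch uses a made-up date string to show group(0), groups(), and a multi-argument group() call side by side.

```python
import re

# Hypothetical input: an ISO-style date followed by other text.
match = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2016-07-21 release")

# groups() returns one item per parenthesized subgroup.
year, month, day = match.groups()

whole = match.group(0)      # the entire matched text
pair = match.group(1, 3)    # several subgroups at once, as a tuple

print(year, month, day)
print(whole, pair)
```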

Named groups

match = re.search(r'My name is (?P<name>[A-Za-z ]+)', 'My name is John Smith')
match.group('name')
# Out: 'John Smith'

match.group(1)
# Out: 'John Smith'

The (?P<name>...) syntax creates a capture group that can be referenced by name as well as by index.
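With several named groups, the match object's groupdict() method returns all of them as a dictionary. This is a small sketch; the group names first and last are chosen only for illustration.

```python
import re

match = re.search(
    r"My name is (?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)",
    "My name is John Smith",
)

# All named groups as a dict keyed by group name.
names = match.groupdict()
print(names)

# Named groups remain reachable by position as well.
print(match.group("last"), match.group(2))
```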

Non-capturing groups

Using (?:) creates a group, but the group isn't captured. This means you can use it as a group, but it won't pollute your "group space".

re.match(r'(\d+)(\+(\d+))?', '11+22').groups()
# Out: ('11', '+22', '22')

re.match(r'(\d+)(?:\+(\d+))?', '11+22').groups()
# Out: ('11', '22')

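This pattern accepts 11+22 or 11 in full, but not 11+, because the + sign and the second term are grouped together and are optional only as a unit (re.match against '11+' would match just the leading 11). At the same time, the + sign itself is not captured.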
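The claim about 11, 11+22 and 11+ can be verified with fullmatch, which requires the whole string to match. This is a sketch using a compiled version of the same pattern.

```python
import re

pattern = re.compile(r"(\d+)(?:\+(\d+))?")

# Check which inputs the pattern accepts in their entirety.
results = {s: pattern.fullmatch(s) is not None for s in ["11", "11+22", "11+"]}
print(results)

# When the optional part is absent, its inner capture group is None.
print(pattern.fullmatch("11").groups())

# match() only anchors at the start, so it still finds the leading digits.
print(pattern.match("11+").group())
```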

Escaping special characters

Special characters (like the character class brackets [ and ] below) are not matched literally:

match = re.search(r'[b]', 'a[b]c')
match.group()
# Out: 'b'

By escaping the special characters, they can be matched literally:

match = re.search(r'\[b\]', 'a[b]c')
match.group()
# Out: '[b]'

The re.escape() function can be used to do this for you:

re.escape('a[b]c')
# Out: 'a\\[b\\]c'
match = re.search(re.escape('a[b]c'), 'a[b]c')
match.group()
# Out: 'a[b]c'

The re.escape() function escapes all special characters, so it is useful if you are composing a regular expression based on user input:

username = 'A.C.'  # suppose this came from the user
re.findall(r'Hi {}!'.format(username), 'Hi A.C.! Hi ABCD!')
# Out: ['Hi A.C.!', 'Hi ABCD!']
re.findall(r'Hi {}!'.format(re.escape(username)), 'Hi A.C.! Hi ABCD!')
# Out: ['Hi A.C.!']
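The same idea applies to any literal text that happens to contain regex metacharacters. In this sketch the needle "1+1" and the haystack are invented examples: unescaped, the + is read as a quantifier, so the pattern matches something quite different from the literal text.

```python
import re

needle = "1+1"                    # hypothetical literal text to look for
haystack = "1+1=2, 11=1+1"

# Unescaped: '1+1' means "one or more '1' followed by '1'".
unescaped = re.findall(needle, haystack)

# Escaped: the '+' is matched literally.
escaped = re.findall(re.escape(needle), haystack)

print(unescaped)
print(escaped)
```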

Replacing

Replacements can be made on strings using re.sub.

Replacing strings

re.sub(r"t[0-9][0-9]", "foo", "my name t13 is t44 what t99 ever t44")
# Out: 'my name foo is foo what foo ever foo'

Using group references

Replacements with a small number of groups can be made as follows:

re.sub(r"t([0-9])([0-9])", r"t\2\1", "t13 t19 t81 t25")
# Out: 't31 t91 t18 t52'

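However, a group reference followed directly by a digit is ambiguous: \10 is read as a reference to group 10, not as group 1 followed by a literal 0. To be explicit about which group is meant, use the \g<i> notation: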

re.sub(r"t([0-9])([0-9])", r"t\g<2>\g<1>", "t13 t19 t81 t25")
# Out: 't31 t91 t18 t52'
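The following sketch shows the two situations where \g<...> is most useful: appending a literal digit right after a group reference, and referring to named groups. The group names first and second are illustrative only.

```python
import re

# \g<1>0 means "group 1, then a literal 0".
tens = re.sub(r"(\d)", r"\g<1>0", "1 2 3")
print(tens)

# r"\10" refers to group 10, which this pattern does not have,
# so the re module rejects the replacement template.
try:
    re.sub(r"(\d)", r"\10", "1 2 3")
    error_raised = False
except re.error:
    error_raised = True
print(error_raised)

# Named groups are referenced as \g<name> in the replacement.
swapped = re.sub(
    r"(?P<first>\w+) (?P<second>\w+)",
    r"\g<second> \g<first>",
    "hello world",
)
print(swapped)
```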

Using a replacement function

items = ["zero", "one", "two"]
re.sub(r"a\[([0-2])\]", lambda match: items[int(match.group(1))], "Items: a[0], a[1], something, a[2]")
# Out: 'Items: zero, one, something, two'
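A named function works just as well as a lambda and is easier to read when the replacement logic grows. In this sketch, the hypothetical helper double receives each match object and returns the replacement text.

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group(0)) * 2)

# re.sub calls double() once per match and splices in its return value.
result = re.sub(r"\d+", double, "3 apples and 12 pears")
print(result)
```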

Find all non-overlapping matches

re.findall(r"[0-9]{2,3}", "some 1 text 12 is 945 here 4445588899")
# Out: ['12', '945', '444', '558', '889']

Note that the r before "[0-9]{2,3}" tells Python to interpret the string as-is, as a "raw" string.

You can also use re.finditer(), which works in the same way as re.findall() but returns an iterator of match objects instead of a list of strings:

results = re.finditer(r"([0-9]{2,3})", "some 1 text 12 is 945 here 4445588899")
print(results)
# Out: <callable_iterator object at 0x105245890>
for result in results:
    print(result.group(0))
''' Out:
12
945
444
558
889
'''

Precompiled patterns

import re

precompiled_pattern = re.compile(r"(\d+)")
matches = precompiled_pattern.search("The answer is 41!")
matches.group(1)
# Out: '41'

matches = precompiled_pattern.search("Or was it 42?")
matches.group(1)
# Out: '42'

Compiling a pattern allows it to be reused later in a program. However, note that Python caches recently used expressions (docs, SO answer), so "programs that use only a few regular expressions at a time needn't worry about compiling regular expressions".

import re

precompiled_pattern = re.compile(r"(.*\d+)")
matches = precompiled_pattern.match("The answer is 41!")
print(matches.group(1))
# Out: The answer is 41

matches = precompiled_pattern.match("Or was it 42?")
print(matches.group(1))
# Out: Or was it 42

As the example above shows, a compiled pattern can be used with match() as well as with search().
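A typical reason to precompile is applying one pattern to many inputs. This is a minimal sketch with made-up input lines; it also shows the usual guard against search() returning None.

```python
import re

number = re.compile(r"\d+")

# Hypothetical inputs; the last one contains no digits.
lines = ["The answer is 41!", "Or was it 42?", "No digits here"]

found = []
for line in lines:
    m = number.search(line)
    # search() returns None when nothing matches, so check before using it.
    found.append(m.group() if m else None)

print(found)
```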

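Checking for allowed characters

If you want to check that a string contains only a certain set of characters (in this case a-z, A-Z, 0-9 and the period), you can do so like this: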

import re

def is_allowed(string):
    # Search for any character outside the allowed set
    character_regex = re.compile(r'[^a-zA-Z0-9.]')
    disallowed = character_regex.search(string)
    return not bool(disallowed)

print(is_allowed("abyzABYZ0099"))
# Out: True

print(is_allowed("#*@#$%^"))
# Out: False

You can also adapt the expression from [^a-zA-Z0-9.] to [^a-z0-9.], for example to disallow uppercase letters.
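The same check can be phrased positively: instead of searching for a forbidden character, require that the whole string consists of allowed ones. This sketch uses re.fullmatch; the function name is_allowed_fullmatch is hypothetical and chosen to avoid clashing with the version above.

```python
import re

# Zero or more characters, all drawn from the allowed set.
ALLOWED = re.compile(r"[a-zA-Z0-9.]*")

def is_allowed_fullmatch(string):
    """Return True if every character in string is in the allowed set."""
    return ALLOWED.fullmatch(string) is not None

print(is_allowed_fullmatch("abyzABYZ0099"))
print(is_allowed_fullmatch("#*@#$%^"))
```

Like the negated-class version, this treats the empty string as allowed.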

Partial credit: http://stackoverflow.com/a/1325265/2697955

Splitting a string using regular expressions

You can also use regular expressions to split a string. For example,

import re
data = re.split(r'\s+', 'James 94 Samantha 417 Scarlett 74')
print( data )
# Output: ['James', '94', 'Samantha', '417', 'Scarlett', '74']
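re.split has two further behaviors that are easy to miss: the maxsplit argument limits the number of splits, and capturing groups in the pattern cause the separators to be kept in the result. The inputs in this sketch are invented for illustration.

```python
import re

# Split on commas, ignoring surrounding whitespace.
parts = re.split(r"\s*,\s*", "a , b,c ,  d")
print(parts)

# Stop after the first split.
limited = re.split(r"\s+", "James 94 Samantha 417", maxsplit=1)
print(limited)

# A capturing group keeps the separators in the output list.
with_seps = re.split(r"(\d+)", "a1b22c")
print(with_seps)
```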

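Flags

For some special cases we need to change the behavior of the regular expression. This is done using flags, which can be set in two ways: through the flags keyword or directly in the expression.

Flags keyword

Below is an example for re.search, but it works for most functions in the re module.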

m = re.search("b", "ABC")  
m is None
# Out: True

m = re.search("b", "ABC", flags=re.IGNORECASE)
m.group()
# Out: 'B'

m = re.search("a.b", "A\nBC", flags=re.IGNORECASE) 
m is None
# Out: True

m = re.search("a.b", "A\nBC", flags=re.IGNORECASE|re.DOTALL) 
m.group()
# Out: 'A\nB'

Common flags

Flag	Short description
`re.IGNORECASE`, `re.I`	Makes the pattern ignore case
`re.DOTALL`, `re.S`	Makes `.` match everything, including newlines
`re.MULTILINE`, `re.M`	Makes `^` match the beginning of a line and `$` the end of a line
`re.DEBUG`	Turns on debug information

For the complete list of all available flags, check the docs.
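The examples above cover re.IGNORECASE and re.DOTALL; the sketch below does the same for re.MULTILINE, using a made-up two-line string.

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
without = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline.
with_m = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Likewise, $ matches at the end of every line.
ends = re.findall(r"\w+$", text, flags=re.M)

print(without, with_m, ends)
```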

Inline flags

From the docs:

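(?iLmsux) (One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.)
The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.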
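As a brief illustration, the flags-keyword examples from earlier can be rewritten with inline flags. This is a sketch; the flags are placed at the very start of the pattern, which is where recent Python versions require global inline flags to appear.

```python
import re

# (?i) has the same effect as flags=re.IGNORECASE.
m = re.search(r"(?i)b", "ABC")
print(m.group())

# Several letters can be combined: (?is) is IGNORECASE plus DOTALL.
n = re.search(r"(?is)a.b", "A\nBC")
print(repr(n.group()))
```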

Iterating over matches using `re.finditer`

You can use re.finditer to iterate over all matches in a string. Compared to re.findall, this gives you extra information, such as the location of each match in the string (its indexes):

import re
text = 'You can try to find an ant in this string'
pattern = r'an?\w' # find 'a', an optional 'n', then one word character

for match in re.finditer(pattern, text):
    # Start index of match (integer)
    sStart = match.start()

    # Final index of match (integer)
    sEnd = match.end()

    # Complete match (string)
    sGroup = match.group()

    # Print match
    print('Match "{}" found at: [{},{}]'.format(sGroup, sStart,sEnd))

Result:

Match "an" found at: [5,7]
Match "an" found at: [20,22]
Match "ant" found at: [23,26]
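The start and end indexes are also available together through the match object's span() method, and they can be used to slice the original string. The sketch below collects the same three matches into a list.

```python
import re

text = 'You can try to find an ant in this string'

# Collect (matched text, (start, end)) for every match.
spans = [(m.group(), m.span()) for m in re.finditer(r'an?\w', text)]
print(spans)

# Slicing the text with a span gives back the matched substring.
slices = [text[start:end] for _, (start, end) in spans]
print(slices)
```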

Match an expression only in specific locations

Often you want to match an expression only in specific places (that is, leaving it untouched in others). Consider the following sentence:

An apple a day keeps the doctor away (I eat an apple everyday).

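Here "apple" occurs twice, but only the occurrence outside the parentheses should match. This can be solved with so-called backtracking control verbs, which are supported by the newer regex module. The idea is: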

forget_this | or this | and this as well | (but keep this)

With our apple example, this would be:

import regex as re
string = "An apple a day keeps the doctor away (I eat an apple everyday)."
rx = re.compile(r'''
    \([^()]*\) (*SKIP)(*FAIL)  # match anything in parentheses and "throw it away"
    |                          # or
    apple                      # match an apple
    ''', re.VERBOSE)
apples = rx.findall(string)
print(apples)
# Out: ['apple'] (only the one outside the parentheses)

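This matches "apple" only when it can be found outside of the parentheses.

Here's how it works:

Scanning from left to right, the regex engine first consumes the parenthesized text. (*SKIP) acts as an "always-true assertion", so the engine moves past it and then deliberately fails at (*FAIL), which forces it to backtrack.
While backtracking (that is, moving from right to left), it reaches (*SKIP), beyond which it is forbidden to go any further to the left. Instead, the engine is told to throw away everything to the left and resume at the position where (*SKIP) was invoked.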
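The standard library re module does not support (*SKIP)(*FAIL). A commonly used substitute, sketched here as an alternative technique rather than an equivalent of the verbs, is to match the unwanted parts in an uncaptured alternative and keep only the matches where the capturing group took part.

```python
import re

string = "An apple a day keeps the doctor away (I eat an apple everyday)."

# First alternative: parenthesized text, matched but not captured.
# Second alternative: the word we want, captured as group 1.
rx = re.compile(r"\([^()]*\)|(apple)")

# Keep only the matches where group 1 actually matched.
apples = [m.group(1) for m in rx.finditer(string) if m.group(1)]
print(apples)
```

Because the parenthesized text is consumed by the first alternative, the "apple" inside it is never seen by the second one.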

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
