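Python Language => Web scraping with Python

Introduction

Web scraping is an automated, programmatic process through which data can be continually 'scraped' off web pages. Also known as screen scraping or web harvesting, web scraping can provide instant data from any publicly accessible web page. On some websites, web scraping may be illegal.

Remarks

Useful Python packages for web scraping (alphabetical order)

Making requests and collecting data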

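Caching for requests; caching data is very useful. During development, it means you can avoid hitting a site unnecessarily. While running a real collection, it means that if your scraper crashes for some reason (maybe you didn't handle some unusual content on the site, or maybe the site went down) you can repeat the collection very quickly from where you left off.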

`scrapy`

Useful for building web crawlers, where you need something more powerful than using requests and iterating through pages.

`selenium`

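Python bindings for Selenium WebDriver, for browser automation. Using requests to make HTTP requests directly is often simpler for retrieving web pages. However, Selenium remains a useful tool when the desired behaviour of a site cannot be replicated using requests alone, particularly when JavaScript is required to render elements on a page.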
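
The caching idea described for requests above can be sketched with the standard library alone. This is a minimal illustration, not the API of any caching package: the function name cached_fetch, the fetch callable, and the scrape_cache directory are all hypothetical. It stores each page on disk under a hash of its URL, so a repeated run skips URLs that were already downloaded.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="scrape_cache"):
    """Return the page body for url, calling fetch(url) only on a cache miss.

    fetch is any callable that takes a URL and returns the body as text
    (hypothetical; for example, lambda u: requests.get(u).text).
    """
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Hash the URL so that it is always a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    entry = cache_path / (key + ".html")

    if entry.exists():
        # Cache hit: no network traffic at all.
        return entry.read_text(encoding="utf-8")

    # Cache miss: fetch once and remember the result.
    body = fetch(url)
    entry.write_text(body, encoding="utf-8")
    return body
```

A sketch like this never expires entries, so the cache directory has to be deleted by hand when fresh data is needed.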

HTML parsing

`BeautifulSoup`

Queries HTML and XML documents using a number of different parsers (Python's built-in HTML parser, html5lib, lxml or lxml.html).

`lxml`

Processes HTML and XML. Can be used to query and select content from HTML documents via CSS selectors and XPath.
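
As a point of comparison for the parsers listed above, Python's built-in HTML parser can also be used directly, without any third-party package. The sketch below collects the href of every link in a page; the class name LinkCollector, the helper extract_links, and the sample markup are made up for illustration. It is event-based, so it offers none of the CSS selector or XPath querying that BeautifulSoup and lxml provide.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_source):
    """Return all link targets found in html_source, in document order."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links


sample = '<html><body><a href="/a">A</a> <a name="x">no href</a> <a href="/b">B</a></body></html>'
print(extract_links(sample))  # prints ['/a', '/b']
```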

Basic example of using requests and lxml to scrape some data

# For Python 2 compatibility.
from __future__ import print_function

import lxml.html
import requests


def main():
    r = requests.get("https://httpbin.org")
    html_source = r.text
    root_element = lxml.html.fromstring(html_source)
    # Note root_element.xpath() gives a *list* of results.
    # XPath specifies a path to the element we want.
    page_title = root_element.xpath('/html/head/title/text()')[0]
    print(page_title)

if __name__ == '__main__':
    main()

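Maintaining a web-scraping session with requests

It is a good idea to maintain a web-scraping session so that cookies and other parameters persist between requests. It can also improve performance, because requests.Session reuses the underlying TCP connection to a host.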

import requests

with requests.Session() as session:
    # all requests through session now have User-Agent header set
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

    # set cookies
    session.get('http://httpbin.org/cookies/set?key=value')

    # get cookies
    response = session.get('http://httpbin.org/cookies')
    print(response.text)
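
The session example above relies on requests. Where only the standard library is available, a rough stand-in can be built from http.cookiejar and urllib.request, as sketched below. The helper name make_session and the user agent string are hypothetical. Unlike requests.Session, this opener only persists cookies and headers; it does not reuse TCP connections.

```python
import http.cookiejar
import urllib.request


def make_session(user_agent):
    """Return (opener, jar): an opener that keeps cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Every request made through this opener carries this header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


# Usage (requires network access, so it is left commented out):
# opener, jar = make_session("example-scraper/0.1")
# opener.open("http://httpbin.org/cookies/set?key=value")  # server sets a cookie
# print(opener.open("http://httpbin.org/cookies").read())  # cookie is sent back
```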

Scraping using the Scrapy framework

First you have to set up a new Scrapy project. Enter the directory where you would like to store your code and run:

scrapy startproject projectName

To scrape we need a spider. A spider defines how a certain site will be scraped. Here is the code for a spider that follows the links to the top-voted questions on StackOverflow and scrapes some data from each page (source):

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider has a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes']  # the parsing starts from a specific set of urls

    def parse(self, response):  # for each request this generator yields, its response is sent to parse_question
        for href in response.css('.question-summary h3 a::attr(href)'):  # do some scraping stuff using css selectors to find question urls 
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response): 
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

Save your spider classes in the projectName\spiders directory. In this case, that is projectName\spiders\stackoverflow_spider.py.

Now you can use your spider. For example, try running (in the project's directory):

scrapy crawl stackoverflow

Modifying the Scrapy user agent

Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, uncomment the following line, and edit it to whatever you want.

#USER_AGENT = 'projectName (+http://www.yourdomain.com)'

For example:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'

Scraping using BeautifulSoup4

from bs4 import BeautifulSoup
import requests

# Use the requests module to obtain a page
res = requests.get('https://www.codechef.com/problems/easy')

# Create a BeautifulSoup object
page = BeautifulSoup(res.text, 'lxml')   # the text field contains the source of the page

# Now use a CSS selector in order to get the table containing the list of problems
datatable_tags = page.select('table.dataTable')  # The problems are in the <table> tag,
                                                 # with class "dataTable"
# We extract the first tag from the list, since that's what we desire
datatable = datatable_tags[0]
# Now since we want problem names, they are contained in <b> tags, which are
# directly nested under <a> tags
prob_tags = datatable.select('a > b')
prob_names = [tag.getText().strip() for tag in prob_tags]

print(prob_names)

Scraping using Selenium WebDriver

Some websites don't like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # launch firefox browser

browser.get('http://stackoverflow.com/questions?sort=votes')  # load url

# find_element(By.CSS_SELECTOR, ...) is used here because the older
# find_element_by_css_selector helpers were deprecated and later removed in Selenium 4
title = browser.find_element(By.CSS_SELECTOR, 'h1').text  # page title (first h1 element)

questions = browser.find_elements(By.CSS_SELECTOR, '.question-summary')  # question list

for question in questions:  # iterate over questions
    question_title = question.find_element(By.CSS_SELECTOR, '.summary h3 a').text
    question_excerpt = question.find_element(By.CSS_SELECTOR, '.summary .excerpt').text
    question_vote = question.find_element(By.CSS_SELECTOR, '.stats .vote .votes .vote-count-post').text

    print("%s\n%s\n%s votes\n-----------\n" % (question_title, question_excerpt, question_vote))

browser.quit()  # close the browser when done

Selenium can do much more. It can modify the browser's cookies, fill in forms, simulate mouse clicks, take screenshots of web pages, and run custom JavaScript.

Simple web content download with urllib.request

The standard library module urllib.request can be used to download web content:

from urllib.request import urlopen

response = urlopen('http://stackoverflow.com/questions?sort=votes')    
data = response.read()

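# The received bytes should usually be decoded according to the response's character set.
# get_content_charset() returns None when the server declares no charset,
# so fall back to UTF-8 (an assumption; adjust it for the site being scraped).
encoding = response.info().get_content_charset() or 'utf-8'
html = data.decode(encoding)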

A similar module is also available in Python 2.
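
One common way to support both versions is to try the Python 3 import first and fall back to the Python 2 module urllib2. The sketch below assumes urllib2 is the Python 2 module meant here; the helper name download is hypothetical.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2 (assumed equivalent entry point)


def download(url):
    """Return the raw bytes of the response body for url."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Always release the connection, even if reading fails.
        response.close()
```

The returned bytes still need to be decoded with the response's character set, as in the example above.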

Scraping with curl

Imports:

from subprocess import Popen, PIPE
from lxml import etree
from io import StringIO

Downloading:

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'http://stackoverflow.com'
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

-s: silent download

-A: user agent flag

Parsing:

tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
