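R Language => Web scraping and parsing

Remarks

Scraping refers to using a computer to retrieve the code of a webpage. Once the code is obtained, it must be parsed into a form that is convenient for further use in R.

Base R does not have many of the tools these processes require, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium, httr, curl, RCurl), some for parsing (XML, xml2), and some for both (rvest).

A related process is scraping a web API, which, unlike a webpage, returns data intended to be machine-readable. Many of the same packages are used for both.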
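
As an illustration of why API responses are easier to handle than HTML, here is a minimal sketch that parses a JSON body into a data frame. It assumes the jsonlite package is installed (it is not one of the packages named above), and the response body is a made-up string inlined so that the example runs without network access.

```r
# Assumption: jsonlite is installed; it is not mentioned in the text above.
library(jsonlite)

# Hypothetical API response body, inlined so that no network access is needed.
# With a real API this string would typically come from something like
# httr::content(httr::GET(api_url), as = "text"), where api_url is your endpoint.
response_body <- '[
  {"package": "rvest", "purpose": "both"},
  {"package": "httr",  "purpose": "scraping"},
  {"package": "xml2",  "purpose": "parsing"}
]'

# An array of flat JSON objects is simplified to a data.frame by default
packages <- fromJSON(response_body)
packages
```

Because the structure is already machine-readable, no selectors are needed; the parser maps the fields directly to columns.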

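Legality

Some websites object to being scraped, whether because of increased server load or concerns about data ownership. If a website forbids scraping in its Terms of Use, scraping it violates those terms and may be illegal.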

Basic scraping with rvest

rvest is a package for web scraping and parsing by Hadley Wickham, inspired by Python's Beautiful Soup. It uses the libxml2 bindings of Hadley's xml2 package for HTML parsing.

As part of the tidyverse, rvest is designed to be used with pipes. It uses

xml2::read_html to scrape the HTML of a webpage,
which can then be subset with its html_node and html_nodes functions using CSS or XPath selectors, and
parsed into R objects with functions like html_text and html_table.
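
The three steps above can be sketched on a small, made-up HTML string, so the example runs offline. The page content, the `#title` id, and the variable names are assumptions for illustration only.

```r
library(rvest)

# Hypothetical page, inlined as a string so the example runs offline
page <- read_html('
<html><body>
  <h1 id="title">Releases</h1>
  <table class="wikitable">
    <tr><th>Release</th><th>Date</th></tr>
    <tr><td>1.0</td><td>2000-02-29</td></tr>
    <tr><td>2.0</td><td>2004-10-04</td></tr>
  </table>
</body></html>')

# html_node returns the first match; here selected with a CSS selector
title <- page %>% html_node(css = "#title") %>% html_text()

# html_nodes returns every match; here selected with an XPath expression
cells <- page %>% html_nodes(xpath = "//td") %>% html_text()

# html_table parses a <table> node into a data frame
releases <- page %>% html_node(css = ".wikitable") %>% html_table()
```

In rvest 1.0.0 and later, html_node and html_nodes are also available under the names html_element and html_elements; the older names used here still work.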

To scrape the table of milestones from the Wikipedia page on R, the code would look like

library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>% 
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>% 
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))

##    Release       Date                                                  Description
## 1     0.16            This is the last alpha version developed primarily by Ihaka 
## 2     0.49 1997-04-23 This is the oldest source release which is currently availab
## 3     0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4   0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5      1.0 2000-02-29 Considered by its developers stable enough for production us
## 6      1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7      2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data 
## 8      2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9     2.11 2010-04-22                          Support for Windows 64 bit systems.
## 10    2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11    2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12    2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13     3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy

This returns a data.frame, but as is typical for scraped data, there is still cleaning to be done: here, formatting dates, inserting NAs, and so on.
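
A minimal base R sketch of that cleaning step follows. The three rows are hypothetical stand-ins that mimic the scraped table (dates as text, a missing date as an empty string); the name `milestones` is an assumption for illustration.

```r
# Hypothetical rows mimicking the scraped table: dates arrive as text and
# a missing date arrives as an empty string
milestones <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date    = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# Insert NA where the page had an empty cell
milestones$Date[milestones$Date == ""] <- NA

# Convert the remaining ISO 8601 strings to Date objects; NA stays NA
milestones$Date <- as.Date(milestones$Date, format = "%Y-%m-%d")

str(milestones)
```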

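Note that data in a less consistently rectangular format may require looping or other additional processing to parse successfully. If the website uses jQuery or other means to insert content after the page loads, read_html may not be enough to scrape it, and a more robust scraper such as RSelenium may be necessary.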
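
One way to cope with mildly inconsistent markup, sketched here on a made-up list in which one item lacks a price, is to select the parent nodes first and then call html_node on that set. The HTML, class names, and variable names are assumptions for illustration.

```r
library(rvest)

# Hypothetical listing in which the second item has no <span class="price">
page <- read_html('
<ul>
  <li class="item"><b>apple</b> <span class="price">1.20</span></li>
  <li class="item"><b>pear</b></li>
  <li class="item"><b>plum</b> <span class="price">0.80</span></li>
</ul>')

items <- html_nodes(page, "li.item")

# Calling html_node on a set of nodes returns exactly one result per node,
# with a missing node where nothing matches, so the columns stay aligned
name  <- items %>% html_node("b") %>% html_text()
price <- items %>% html_node("span.price") %>% html_text() %>% as.numeric()

data.frame(name, price)
```

Selecting all prices directly with html_nodes would instead return only two values, with no indication of which item they belong to.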

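Using rvest when login is required

A common problem when scraping the web is how to enter a user ID and password to log in to a website.

This example, which I created to track my answers posted here on Stack Overflow, shows one approach. The overall flow is to log in, go to a web page, collect its information, add it to a data frame, and then move on to the next page.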

library(rvest) 

#Address of the login webpage
login<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f"

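#create a web session with the desired login address
#(rvest >= 1.0.0 renames these functions to session(), html_form_set(),
# session_submit() and session_jump_to(); the older names are deprecated)
pgsession<-html_session(login)
pgform<-html_form(pgsession)[[2]]  #in this case the submit is the 2nd form
filled_form<-set_values(pgform, email="*****", password="*****")
#keep the session returned after submitting the login form
pgsession<-submit_form(pgsession, filled_form)

#initialize the final results data frame (it grows inside the loop)
results<-data.frame()  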

#loop through all of the pages with the desired info
for (i in 1:5)
{
  #base address of the pages to extract information from
  url<-"http://stackoverflow.com/users/**********?tab=answers&sort=activity&page="
  url<-paste0(url, i)
  page<-jump_to(pgsession, url)

  #collect info on the question votes and question title
  summary<-html_nodes(page, "div .answer-summary")
  question<-matrix(html_text(html_nodes(summary, "div"), trim=TRUE), ncol=2, byrow = TRUE)

  #find date answered, hyperlink and whether it was accepted
  dateans<-html_node(summary, "span") %>% html_attr("title")
  hyperlink<-html_node(summary, "div a") %>% html_attr("href")
  accepted<-html_node(summary, "div") %>% html_attr("class")

  #create temp results then bind to final results 
  rtemp<-cbind(question, dateans, accepted, hyperlink)
  results<-rbind(results, rtemp)
}

#Dataframe Clean-up
names(results)<-c("Votes", "Answer", "Date", "Accepted", "HyperLink")
results$Votes<-as.integer(as.character(results$Votes))
results$Accepted<-ifelse(results$Accepted=="answer-votes default", 0, 1)

The loop in this case is limited to 5 pages; change this to fit your application. I replaced the user-specific values with ******. Hopefully this provides some guidance for your own problem.
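
The loop above grows results with rbind on every iteration, which is fine for a handful of pages. For a larger page count, an alternative sketch is to collect one data frame per page in a list and bind them once at the end. Here `scrape_page` is a hypothetical stand-in for the per-page scraping code and makes no network requests.

```r
# Hypothetical stand-in for the scraping step: returns one small data frame
# per page instead of requesting anything over the network
scrape_page <- function(i) {
  data.frame(page = i, answer = paste("answer", 1:2, "on page", i),
             stringsAsFactors = FALSE)
}

n_pages <- 5  # adjust to the number of pages in your application

# Collect one data frame per page, then bind them all in a single step
pages   <- lapply(seq_len(n_pages), scrape_page)
results <- do.call(rbind, pages)

nrow(results)
```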

Modified text is an extract of the original Stack Overflow Documentation

Licensed under CC BY-SA 3.0

Not affiliated with Stack Overflow
