R Language
Web scraping and parsing


Remarks

Scraping refers to using a computer to retrieve the code of a webpage. Once the code is obtained, it must be parsed into a useful form for further use in R.

Base R does not have many of the tools required for these processes, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium, httr, curl, RCurl), some for parsing (XML, xml2), and some for both (rvest).

A related process is scraping a web API, which, unlike a webpage, returns data intended to be machine-readable. Many of the same packages are used for both.
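As a rough illustration of why API responses are easier to work with, the sketch below parses a small JSON string into a data frame. It assumes the jsonlite package is installed (it is not one of the packages named above), and both the JSON payload and the commented-out `api_url` are made up for the example; in practice the text would come from a request made with a package such as httr or curl.

```r
library(jsonlite)

# In a real script the text would come from an API request, for example:
# json_txt <- httr::content(httr::GET(api_url), as = "text")  # api_url is hypothetical

# Hypothetical API response used here so the example runs offline
json_txt <- '[{"release": "1.0", "date": "2000-02-29"},
              {"release": "2.0", "date": "2004-10-04"}]'

# An array of JSON objects is converted to a data frame, one row per object
releases <- fromJSON(json_txt)

# Values arrive as character strings, so types still need converting
releases$date <- as.Date(releases$date, format = "%Y-%m-%d")
```

No HTML selectors are needed here because the response is already structured.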

Legality

Some websites object to being scraped, whether because of increased server load or concerns about data ownership. If a website forbids scraping in its Terms of Use, scraping it violates those terms and may be illegal.

Basic scraping with rvest

rvest is a package for web scraping and parsing by Hadley Wickham, inspired by Python's Beautiful Soup. It leverages the libxml2 bindings of Hadley's xml2 package for HTML parsing.

As part of the tidyverse, rvest is designed to be used with the pipe. It uses

xml2::read_html to scrape the HTML of a webpage,
which can then be subset with its html_node and html_nodes functions using CSS or XPath selectors, and
parsed to R objects with functions like html_text and html_table.
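The three steps can be tried offline on a literal HTML string before pointing them at a live site. The sketch below is only an illustration: the HTML snippet, its `id` and `class` values, and the variable names are invented for the example.

```r
library(rvest)

# A small, made-up HTML document so the example runs without network access
html <- '<html><body>
  <h1 id="title">Releases</h1>
  <ul><li class="ver">1.0</li><li class="ver">2.0</li></ul>
  <table class="wikitable">
    <tr><th>Release</th><th>Date</th></tr>
    <tr><td>1.0</td><td>2000-02-29</td></tr>
  </table>
</body></html>'

# Step 1: read the HTML (a URL would work the same way)
page <- read_html(html)

# Step 2 and 3: subset with a CSS selector, then parse to a character string
title <- page %>% html_node(css = "#title") %>% html_text()

# html_nodes returns every match; here the selector is XPath instead of CSS
versions <- page %>% html_nodes(xpath = "//li[@class='ver']") %>% html_text()

# html_table turns a <table> node into a data frame
tbl <- page %>% html_node(css = ".wikitable") %>% html_table()
```

html_node returns only the first match, while html_nodes returns all of them, which is why the list items are collected with html_nodes.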

To scrape the table of milestones from the Wikipedia page on R, the code would look like

library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>% 
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>% 
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))

##    Release       Date                                                  Description
## 1     0.16            This is the last alpha version developed primarily by Ihaka 
## 2     0.49 1997-04-23 This is the oldest source release which is currently availab
## 3     0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4   0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5      1.0 2000-02-29 Considered by its developers stable enough for production us
## 6      1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7      2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data 
## 8      2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9     2.11 2010-04-22                          Support for Windows 64 bit systems.
## 10    2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11    2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12    2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13     3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy

While this returns a data.frame, note that, as is typical for scraped data, there is still further data cleaning to be done: here, formatting dates, inserting NAs, and so on.
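A minimal sketch of that cleanup, using a hand-typed stand-in for the first rows of the scraped table (the `releases` name is chosen for the example, and the real table may need more steps than shown):

```r
# Stand-in for the first rows of the scraped table; the empty string mirrors
# the missing date of release 0.16 in the output above
releases <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date    = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# Insert NA where the scraped cell was empty
releases$Date[releases$Date == ""] <- NA

# Convert the remaining character dates to Date objects
releases$Date <- as.Date(releases$Date, format = "%Y-%m-%d")
```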

Note that data in a less consistently rectangular format may take looping or other further munging to parse successfully. If the website uses jQuery or other means to insert content, read_html may be insufficient to scrape it, and a more robust scraper like RSelenium may be necessary.
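One way such looping might look is sketched below on a made-up HTML fragment in which the second item has no link. The class names and the `rows` and `result` variables are assumptions for the example, not part of any real page.

```r
library(rvest)

# Made-up fragment: the second item is missing its <a> element
page <- read_html('
  <div class="item"><span class="name">A</span><a href="/a">link</a></div>
  <div class="item"><span class="name">B</span></div>')

items <- html_nodes(page, ".item")

# Loop over the items and build one row per item; html_node on a single item
# yields a missing node when nothing matches, which becomes NA
rows <- lapply(items, function(item) {
  data.frame(
    name = html_text(html_node(item, ".name")),
    link = html_attr(html_node(item, "a"), "href"),
    stringsAsFactors = FALSE
  )
})

# Stack the per-item rows into a single data frame
result <- do.call(rbind, rows)
```

Extracting fields item by item keeps the rows aligned even when some items lack a field, which selecting all links at once would not.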

Using rvest when login is required

A common problem encountered when scraping the web is how to enter a user ID and password to log into a website.

This example, which I created to track my answers posted here on Stack Overflow, shows one approach. The overall flow is to log in, go to a web page, collect information, add it to a data frame, and then move on to the next page.

library(rvest)

# Note: in rvest >= 1.0 these functions still work but are deprecated in favour
# of session(), html_form_set(), session_submit() and session_jump_to().

# Address of the login webpage
login <- "https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f"

# Create a web session with the desired login address
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[2]]  # in this case the login form is the 2nd form
filled_form <- set_values(pgform, email = "*****", password = "*****")
# Keep the session returned by the submission so later requests stay logged in
pgsession <- submit_form(pgsession, filled_form)

# Start with an empty data frame for the final results
results <- data.frame()

# Loop through all of the pages with the desired info
for (i in 1:5)
{
  # Base address of the pages to extract information from
  url <- "http://stackoverflow.com/users/**********?tab=answers&sort=activity&page="
  url <- paste0(url, i)
  page <- jump_to(pgsession, url)

  # Collect info on the question votes and question title
  summary <- html_nodes(page, "div .answer-summary")
  question <- matrix(html_text(html_nodes(summary, "div"), trim = TRUE), ncol = 2, byrow = TRUE)

  # Find date answered, hyperlink and whether it was accepted
  dateans <- html_node(summary, "span") %>% html_attr("title")
  hyperlink <- html_node(summary, "div a") %>% html_attr("href")
  accepted <- html_node(summary, "div") %>% html_attr("class")

  # Create temp results, then bind them to the final results
  rtemp <- cbind(question, dateans, accepted, hyperlink)
  results <- rbind(results, rtemp)
}

# Data frame clean-up
names(results) <- c("Votes", "Answer", "Date", "Accepted", "HyperLink")
results$Votes <- as.integer(as.character(results$Votes))
results$Accepted <- ifelse(results$Accepted == "answer-votes default", 0, 1)

In this case the loop is limited to only 5 pages; this will need to change to fit your application. I replaced the user-specific values with ******; hopefully this will provide some guidance for your problem.
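The form-selection and form-filling steps can be explored offline before attempting a real login. The sketch below is an assumption-laden illustration: the HTML is invented, the form positions are specific to this snippet, and it uses `html_form_set()`, the rvest 1.0 name for `set_values()`, so it assumes rvest 1.0 or later is installed.

```r
library(rvest)

# Made-up page with two forms: a search box first, then the login form
page <- read_html('
  <form action="/search"><input type="text" name="q"></form>
  <form action="/login" method="post">
    <input type="text" name="email">
    <input type="password" name="password">
    <input type="submit" name="submit" value="Log in">
  </form>')

# html_form returns one entry per <form>; inspect them to find the login form
forms <- html_form(page)
n_forms <- length(forms)
login_form <- forms[[2]]  # the login form is second in this snippet

# Fill the fields by their name attributes (placeholder credentials)
filled <- html_form_set(login_form,
                        email = "user@example.com",
                        password = "not-a-real-password")
```

Printing `forms` shows the field names of each form, which is how the index and the argument names for the real site would be worked out.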

Modified text is an extract of the original Stack Overflow Documentation, licensed under CC BY-SA 3.0. Not affiliated with Stack Overflow.