Web Scraping using R

r
webscraping

#1

Can anyone guide me on how to extract data using R from the following site?
https://www.4coffshore.com/windfarms/windfarms.aspx?windfarmId=US12
For example, I want to extract all US projects together with each individual project's data (i.e. if you click a particular project on the website, it shows the details of that project; I need that inner data in a csv file).
After clicking each project, it shows two columns, parameter and information. The parameter column is common to every project, but the information column changes per project, so in my final csv I want just one parameter column and all the different projects' information columns together.
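For example, the final csv could look something like this (made-up parameter names, just to show the shape I want):

    parameter,project 1 information,project 2 information
    Name,(project 1 value),(project 2 value)
    Capacity,(project 1 value),(project 2 value)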


#2

Hi @deva123, check out this article on web scraping using R.


#3

Thanks for the advice. I already tried that, but the website I mentioned above is different and a little bit more complex.
Is there any way to extract the data that sits inside each project? I want to extract all projects per country.


#4

You should probably use a Selenium server for web driving, not only web scraping. Have a look at this page: https://ropensci.org/tutorials/rselenium_tutorial/
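If it helps, here is a minimal sketch of the RSelenium side (it assumes a Selenium server is already running on localhost:4444, set up as in the tutorial; the browser name will depend on your machine):

    library(RSelenium)  # drive a real browser through a Selenium server
    library(rvest)      # parse the rendered HTML afterwards

    # assumes a Selenium server is already listening on localhost:4444
    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
    remDr$open()
    remDr$navigate("https://www.4coffshore.com/windfarms/windfarms.aspx?windfarmId=US12")

    # hand the fully rendered page over to rvest for parsing
    page <- read_html(remDr$getPageSource()[[1]])

    remDr$close()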


#5

Thanks for the advice.


#6

Hi, as @lvalnegri mentioned, you need to use Selenium to solve your problem, because each state has multiple pages of links (i.e. 1, 2, 3).

Here is the code I wrote. I know this is not an elegant solution, but it should give you some idea, and I have commented each line so it is easy to understand.

    library(rvest)   # read_html(), html_nodes(), html_attr(), html_table()
    library(purrr)   # map()

    zz <- 'https://www.4coffshore.com/windfarms/windfarms.aspx?windfarmId=US12' %>%
      # this is your link: the listing page for the country
      read_html() %>%
      # find the link node for every project in the listing table
      html_nodes(xpath = "//table[@class='table table-striped table-condensed']//a[@class='linkWF']") %>%
      # capture each project's relative link
      html_attr("href") %>%
      # paste the relative links onto the website domain
      paste0('https://www.4coffshore.com', .) %>%
      # now read each project page individually
      map(read_html) %>%
      # find the xpath for the tables on each project page
      map(html_nodes, xpath = "//div[@class='table-responsive']//table[@class='tblProject']") %>%
      # convert them into data frames
      map(html_table) %>%
      # each project page holds two separate tables, so bind them into one
      map(~ do.call(rbind, .x))

    # some projects don't have any data, so remove those empty elements
    zz1 <- zz[!unlist(lapply(zz, is.null))]

    # the parameter column (X1) is the same for every project, so keep it
    # only in the first data frame and drop it from the rest
    zz2 <- lapply(zz1[2:length(zz1)], function(x) x[!(names(x) %in% "X1")])

    # combine the first data frame with the remaining information columns
    zz3 <- do.call(cbind, c(zz1[1], zz2))

    # write the final csv
    write.csv(zz3, 'zz3.csv')

You will still need to do some cleaning on your data; one possible step is sketched below.
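For example, you could label each information column with its project name. This is only a sketch: it re-reads the listing page and takes the link text with html_text() instead of the href.

    # possible cleanup (a sketch): name each information column after its project
    project_names <- 'https://www.4coffshore.com/windfarms/windfarms.aspx?windfarmId=US12' %>%
      read_html() %>%
      html_nodes(xpath = "//table[@class='table table-striped table-condensed']//a[@class='linkWF']") %>%
      html_text()

    # keep only the names of projects that actually had data (matching zz1),
    # giving one parameter column plus one information column per project
    names(zz3) <- c("parameter", project_names[!unlist(lapply(zz, is.null))])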

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4 purrr_0.2.5       rvest_0.3.2       xml2_1.2.0       

loaded via a namespace (and not attached):
 [1] httr_1.3.1     compiler_3.4.0 selectr_0.4-1  magrittr_1.5   R6_2.2.2      
 [6] tools_3.4.0    curl_3.2       yaml_2.1.19    Rcpp_0.12.17   stringi_1.1.7 
[11] stringr_1.3.1  rlang_0.2.1   
> 

Sorry for my poor English.


#7

Thanks, this is great! At least I know where to start now …
Is there any drawback to using rvest in this case? Why and how did you decide to use RSelenium over rvest?


#8

Here you need to use both rvest and RSelenium, because the whole data set is not on the active page (i.e. after every 50 projects you need to click the buttons for the 2nd and 3rd pages). With RSelenium you drive the browser through those pages, and with rvest you perform the web scraping (i.e. capture the data on the active page); there is a rough sketch of the combination below.
Basic web scraping in R, with focus on rvest and RSelenium

For more details, please see the URL above.
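To make that concrete, here is a rough sketch (the three pages and the pager link text "2", "3" are assumptions; inspect the page to find the real pager elements):

    library(RSelenium)  # clicks through the pager
    library(rvest)      # scrapes whatever page is currently active

    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
    remDr$open()
    remDr$navigate("https://www.4coffshore.com/windfarms/windfarms.aspx?windfarmId=US12")

    pages <- list()
    for (i in 1:3) {
      # capture the data on the currently active page with rvest
      pages[[i]] <- read_html(remDr$getPageSource()[[1]])
      if (i < 3) {
        # drive the browser to the next page of projects
        remDr$findElement(using = "link text", value = as.character(i + 1))$clickElement()
        Sys.sleep(2)  # give the next page time to load
      }
    }
    remDr$close()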

Happy coding! :slight_smile:


#9

Once again, thanks gopala. Before this I really didn't know much about web scraping and how useful it is in data analytics.
I faced a lot of issues while installing RSelenium. Is that library outdated? Some articles said it was removed from CRAN, so I had to download it from GitHub.
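This is roughly what I ended up running (the repository path is my assumption):

    # install RSelenium from GitHub since it was not on CRAN at the time
    # (repository path ropensci/RSelenium is an assumption)
    install.packages("devtools")
    devtools::install_github("ropensci/RSelenium")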
Is R outdated for web scraping? Some of my friends use Python (BeautifulSoup), but I'm familiar with R, so I prefer R. What do you suggest?
Please let me know if you have any reference articles or books related to this …