Different outputs every time the program is run while scraping the web using Beautiful Soup

data_mining
web_mining
scraping
python

#1

Hello people,
I’m trying to scrape the names of books (Business & Economics) and their other details from the following link:


My objective is to scrape the names of all books from the first 10 pages of the site. Here is the code I used:

import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    # Stub for the detail-page scraper (the full version, which pulls
    # the other book details from item_url, is omitted here)
    pass

def amazon_spider(max_pages):
    page = 1
    i = 1
    while page <= max_pages:
        url = 'https://www.amazon.in/s/ref=sr_pg_3?rh=n%3A976389031%2Cn%3A%21976390031%2Cn%3A1318068031&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')

        # Each result title is an <a> tag carrying this exact class string
        for link in soup.find_all('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
            print(i)
            data = link.get('title')  # the title attribute holds the full book name
            href = link.get('href')
            print(href)
            print(link.string)
            get_single_item_data(href)
            i = i + 1
        page = page + 1

amazon_spider(10)

I should get the names of 160 books when I run this code, but I rarely do. Sometimes I get 64 or 128 names, and sometimes none at all.
Why does the output keep varying? Is the issue with my code or with my internet connection?
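
One check that could help narrow this down is printing the HTTP status code and the number of matching links per page: if a 200 response comes back with zero matches, the page layout changed rather than the connection dropping. A small sketch reusing the same URL and selector as above:

import requests
from bs4 import BeautifulSoup

def count_titles(page):
    # Fetch one results page and report how many title links it contains
    url = 'https://www.amazon.in/s/ref=sr_pg_3?rh=n%3A976389031%2Cn%3A%21976390031%2Cn%3A1318068031&page=' + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'})
    print('page', page, 'status', response.status_code, 'matches', len(links))
    return len(links)

for page in range(1, 11):
    count_titles(page)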

Regards


#2

Hi @B.Rabbit, the code is not the issue. Amazon has put preventive measures in place to discourage web scraping. The reasons (as given here) may be:

Amazon continually tries to keep scrapers from working. They do this by:

  • A/B testing (randomly receive different HTML).
  • Huge numbers of HTML layouts for the same product categories.
  • Changing HTML layouts.
  • Moving content inside iFrames.
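
Because of that A/B testing, any single fetch may return a layout your selector does not match, which is exactly the 64/128/0 pattern you are seeing. One workaround (no guarantee, since Amazon may still rate-limit or block you) is to retry a page a few times until the expected links show up. A minimal sketch, with the retry count and delay picked arbitrarily:

import time
import requests
from bs4 import BeautifulSoup

SELECTOR = {'class': 'a-link-normal s-access-detail-page  a-text-normal'}

def fetch_links_with_retries(url, retries=3, delay=2):
    # Re-request the page until the expected title links appear,
    # since Amazon may randomly serve a variant the selector misses
    for attempt in range(retries):
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        links = soup.find_all('a', SELECTOR)
        if links:
            return links
        time.sleep(delay)  # back off before trying again
    return []  # every attempt returned a layout without the expected links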

I would suggest you try the Amazon API for this instead. (Will it serve your purpose? I don’t know, but at least you will be safe!)
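
If you do go the API route, there are third-party wrappers for Amazon’s Product Advertising API. Here is a rough sketch using the bottlenose wrapper; the credentials are placeholders you get by registering with Amazon, and I have not verified that this particular browse node is queryable through the API:

import bottlenose  # pip install bottlenose
from bs4 import BeautifulSoup

# Placeholder credentials from the Product Advertising API registration
amazon = bottlenose.Amazon('ACCESS_KEY', 'SECRET_KEY', 'ASSOCIATE_TAG', Region='IN')

# ItemSearch returns raw XML; BrowseNode is assumed to be the same
# category id that appears in the scraped URL
response = amazon.ItemSearch(SearchIndex='Books',
                             BrowseNode='1318068031',
                             ResponseGroup='ItemAttributes')

soup = BeautifulSoup(response, 'xml')  # needs lxml installed
for title in soup.find_all('Title'):
    print(title.text)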


#3

Hi @jalFaizy,
Thanks for your thoughts, but what do you mean by ‘safe’?

Regards


#4

within “legal limits”