
Web Scraping Using Python Selenium and Beautiful Soup

I recently got involved in a project that required web scraping for product documentation, pricing, reviews, and more.

Below I describe a method that is widely used for web scraping. The process uses Selenium, a Python module, to retrieve web information from Walmart.

First we need to make sure the Python interpreter is installed.

If on Windows – open CMD, type python, and hit Enter. If Python is not installed, this takes you to the Microsoft Store; just click Get and install it (you can skip this step if Python is already installed).

If on Linux (Debian/Ubuntu), just run sudo apt-get install python3.

Next, create a file named requirements.txt with the contents below.

requests>=2.28.1
selenium>=4.7.2
beautifulsoup4>=4.11.1
lxml>=4.9.2

Then install the required Python modules by running:

pip install -r requirements.txt
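If you want a quick sanity check that everything installed, you can try importing the modules from the command line (this only imports them and prints a message; nothing here is Walmart-specific):

python -c "import requests, selenium, bs4, lxml; print('all modules imported OK')"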

We are almost ready for some action. Let's create our Python web scraper script.

Create a file with the content below.


#!/usr/bin/env python

import json
import requests
from selenium import webdriver
from bs4 import BeautifulSoup

# url = 'https://www.walmart.com/ip/FUBU-Men-s-Zone-Basketball-High-top-Sneakers/439361264/'
url = 'https://www.walmart.com/ip/Hisense-58-Class-4K-UHD-LED-LCD-Roku-Smart-TV-HDR-R6-Series-58R6E3/587182688/'

# User agent for Linux
# 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

# Headers
headers = {
    'authority': 'www.walmart.com',
    'method': 'GET',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'content-type': 'application/json',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}


# Convert Selenium's cookie list into a simple name/value dict usable by requests
def get_cookies(selenium_cookies):
    cookies = {}
    for cookie in selenium_cookies:
        cookies[cookie['name']] = cookie['value']
    return cookies

# Launch Chrome in headless mode
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

driver.get(url)

selenium_cookies = driver.get_cookies()
cookies = get_cookies(selenium_cookies)
# print(cookies)

# driver.close()
driver.quit()

r = requests.request("GET", url, headers=headers, cookies=cookies)

# save the content to a file
# with open('walmart_data.html', 'w') as f:
    # print(r.text, file=f)
# Parse the page and pull the product's JSON-LD (schema.org) metadata
soup = BeautifulSoup(r.text, 'lxml')
data = json.loads(soup.find('script', type='application/ld+json').text)

# print(soup.prettify())
print('Price:', data['offers']['price'])
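The same JSON-LD block usually carries more than the price, such as the product name, description, and aggregate rating. The exact keys depend on Walmart's current schema.org markup, so treat the field names below as assumptions and keep the lookups defensive:

# These keys follow the schema.org/Product convention; adjust them if Walmart's markup differs
print('Name:', data.get('name'))
print('Description:', data.get('description'))

rating = data.get('aggregateRating') or {}
print('Rating:', rating.get('ratingValue'), 'from', rating.get('reviewCount'), 'reviews')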

The Python script uses Selenium with a single URL/item (this can also be randomized if required) to retrieve a workable cookie, which is checked by some web sites (like Walmart). You can then use a proxy service to avoid getting blocked while crawling/scraping; I have listed some of the proxy API vendors below.
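As a rough sketch of what the proxy hop looks like with the requests library (the proxy host, port, and credentials below are placeholders you would replace with values from your vendor):

# Placeholder proxy endpoint - substitute the host/port and credentials from your proxy vendor
proxies = {
    'http': 'http://USER:PASSWORD@proxy.example.com:8080',
    'https': 'http://USER:PASSWORD@proxy.example.com:8080',
}

# Same request as in the script above, just routed through the proxy
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)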

You can also use Scrapy spiders together with the Selenium module. The Scrapy documentation is available here.
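As a rough idea of what the Scrapy side could look like, here is a minimal spider sketch. It only parses the same JSON-LD block the standalone script used; the Selenium cookie step would normally live in a downloader middleware, and the file name walmart_spider.py is just an example.

# walmart_spider.py - minimal Scrapy spider sketch
# Run with: scrapy runspider walmart_spider.py -o prices.json
import json
import scrapy

class WalmartSpider(scrapy.Spider):
    name = 'walmart'
    start_urls = [
        'https://www.walmart.com/ip/Hisense-58-Class-4K-UHD-LED-LCD-Roku-Smart-TV-HDR-R6-Series-58R6E3/587182688/',
    ]

    def parse(self, response):
        # Grab the same application/ld+json block the standalone script used
        raw = response.xpath('//script[@type="application/ld+json"]/text()').get()
        if raw:
            data = json.loads(raw)
            yield {'price': data.get('offers', {}).get('price')}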
 
You can find a list of headless browsers on GitHub here.

Another option I found was using a service like the BlueCart API to retrieve Walmart data, but it might be quite a bit more expensive than the other choices.

Do you need help with this configuration? Just let us know.

Like this article? Please provide feedback or let us know in the comments below.

Use the Contact Form to get in touch with us for a free Consultation.
