How to Extract Product Data from Walmart with Python and BeautifulSoup

Walmart is a leading retailer with both online and physical stores around the world. With a broad product portfolio and $519.93 billion in net sales, Walmart dominates the retail market, and it also provides ample data that can be used to gain insights into product portfolios, customer behavior, and market trends.

In this tutorial, we will extract product data from Walmart and store it in a SQL database. We use Python for scraping the website: BeautifulSoup is the package used for parsing the pages, and we also use Selenium, which lets us drive Google Chrome.

Scrape Walmart Product Data

The initial step is to import all the required libraries. Once the packages are imported, we set up the scraper's flow. To modularize the code, we first investigated the URL structure of Walmart product pages. A URL is the address of a web page that a user refers to, and it can be used to uniquely identify the page.
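As a minimal sketch, the imports used throughout this tutorial are the following (they match the libraries used in the full script at the end of this post):

import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup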

In the given example, we have made a list of page URLs within Walmart's electronics department, along with a list of the names of the corresponding product categories. We will use these names later to label the tables or datasets.

You may add or remove subcategories for any major product category. All you need to do is go to the subcategory page and copy its URL; this address is shared by all the products listed on that page. You can do the same for most product categories, for example Toys or Food.

In addition, we store the URLs in a list because that makes processing them in Python much easier. Once all the lists are ready, let's move on to writing the scraper.
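For reference, a shortened version of the two parallel lists looks like this; they are the same lists used in the full script below, and you can extend or trim them as needed:

# listing-page URLs for subcategories in Walmart's electronics department
url_sets = [
    "https://www.walmart.com/browse/tv-video/all-tvs/3944_1060825_447913",
    "https://www.walmart.com/browse/computers/desktop-computers/3944_3951_132982",
    "https://www.walmart.com/browse/electronics/all-laptop-computers/3944_3951_1089430_132960",
]

# matching category names, used later to label the tables or datasets
categories = ["TVs", "Desktops", "Laptops"]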

We have also built a loop to automate the extraction, although we can run it for just one category and subcategory as well. Suppose we wish to extract data for only one subcategory, such as TVs in the 'Electronics' category. Later on, we will show how to scale the code to all the subcategories.

Here, the variable pg = 0 ensures that we extract data only for the first URL in the array 'url_sets', i.e., only for the first subcategory in the main category. Once that is done, the next step is to decide how many product listing pages you want to open for scraping; for this demo, we extract data from the first 10 pages.
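A sketch of this setup, assuming we only want the TVs subcategory (index 0 in url_sets) and its first 10 listing pages:

pg = 0                           # index of the subcategory to scrape (0 = TVs)
url_category = url_sets[pg]

# listing pages to open within the subcategory
top_n = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]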

Then we loop through the full length of the top_n array, i.e., 10 times, opening the product listing pages and scraping the complete web page structure as HTML code. It is like inspecting the elements of a web page and copying the resulting HTML. We add one limitation: only the part of the HTML structure that lies inside the 'body' tag is scraped and stored as an object, because the relevant product data sits entirely within a page's HTML body.
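A sketch of that loop, assuming a local ChromeDriver at the same path used in the full script below:

final_results = []
for i_1 in range(len(top_n)):
    # build the listing-page URL, e.g. ...?page=3
    url_cat = url_category + "?page=" + top_n[i_1]
    driver = webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')
    driver.get(url_cat)
    # keep only the HTML inside the <body> tag
    body_cat = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
    driver.quit()
    soupBody_cat = BeautifulSoup(body_cat, "html.parser")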

This object can be used to pull relevant data for every product listed on the active page. To do that, we identified that the tag holding the product data is a 'div' tag with the class 'search-result-gridview-item-wrapper'. Therefore, in the next step, we use the find_all function to scrape all occurrences of that class and store the results in a temporary object named 'codelist'.
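A sketch of that step, using the same class name as the full script:

    # still inside the page loop: every product tile carries its identifier in data-id
    for tmp in soupBody_cat.find_all('div', {'class': 'search-result-gridview-item-wrapper'}):
        final_results.append(tmp['data-id'])

# once all pages are processed, de-duplicate the identifiers
codelist = list(set(final_results))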

After that, we build the URL of each individual product. We observed that every product page starts with the base string 'https://walmart.com/ip', and the product's unique identifier is appended after this string. The unique identifier is the same string value scraped from the 'search-result-gridview-item-wrapper' items saved above. So, in the next step, we loop through the temporary object codelist to construct the complete URL of each product page.
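A sketch of how the product-page URLs are assembled from the identifiers collected above; the rest of the product-level scraping happens inside this loop:

url1 = "https://walmart.com/ip"      # base string shared by all product pages

for i in range(len(codelist)):
    item_wlmt = codelist[i]
    url2 = url1 + "/" + item_wlmt    # full URL of one product page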

With this URL, we can scrape product-level data. For this demo, we collect details such as the unique product code, product name, product page URL, product description, the name of the parent category in which the product is listed, the name of the active subcategory in which the product sits on the website (also called the active breadcrumb), product price, star rating, the number of ratings or reviews for the product, and the other products that Walmart recommends as similar or related. You may customize this list according to your convenience.
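Inside the product loop, each attribute can be located by its tag and class. A sketch using the same selectors as the full script, assuming soupBody is the parsed body HTML of a single product page, obtained the same way as the listing pages above (these class names reflect Walmart's page structure at the time of writing and may change):

    h1ProductName = soupBody.find("h1", {"class": "prod-ProductTitle prod-productTitle-buyBox font-bold"})
    divProductDesc = soupBody.find("div", {"class": "about-desc about-product-description xs-margin-top"})
    liProductBreadcrumb_parent = soupBody.find("li", {"data-automation-id": "breadcrumb-item-0"})
    liProductBreadcrumb_active = soupBody.find("li", {"class": "breadcrumb active"})
    spanProductPrice = soupBody.find("span", {"class": "price-group"})
    spanProductRating = soupBody.find("span", {"itemprop": "ratingValue"})
    spanProductRating_count = soupBody.find("span", {"class": "stars-reviews-count-node"})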

The next step is to open each individual product page using the constructed URLs and scrape the product attributes listed above. Once you are satisfied with the list of attributes being pulled, the last step of the scraper is to append all the product data for the subcategory into a single data frame, as sketched below.
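A sketch of that final step: each product's attributes are appended as one row, and the accumulated rows are turned into a data frame once the loop ends. The header row matches the full script; reco_prods is the list of recommended product IDs collected there.

WLMTData = [["Product_code", "Product_name", "Product_description", "Product_URL",
             "Breadcrumb_parent", "Breadcrumb_active", "Product_price",
             "Rating_Value", "Rating_Count", "Recommended_Prods"]]

# inside the product loop: one row per product
# (the full script first replaces any missing attribute with "Not Available" / "NA")
WLMTData.append([item_wlmt, h1ProductName.text, divProductDesc.text, url2,
                 liProductBreadcrumb_parent.text, liProductBreadcrumb_active.text,
                 spanProductPrice.text, spanProductRating.text,
                 spanProductRating_count.text, reco_prods])

# after the loop: the first row becomes the column headers
df = pd.DataFrame(WLMTData)
df.columns = df.iloc[0]
df = df.drop(df.index[0])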

The data frame 'df' will then hold all the data for the products on the first 10 pages of the chosen subcategory. You may either write the data to CSV files or push it to a SQL database. If you want to export it to a MySQL database into a table named 'product_info', you can use code like the snippet below:
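A sketch of the export, assuming a reachable MySQL instance and the mysql-connector-python driver; replace the placeholder credentials with your own:

import sqlalchemy

database_username = 'ENTER USERNAME'
database_password = 'ENTER USERNAME PASSWORD'
database_ip = 'ENTER DATABASE IP'
database_name = 'ENTER DATABASE NAME'

# build the connection engine and push the data frame as a SQL table
database_connection = sqlalchemy.create_engine(
    'mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(
        database_username, database_password, database_ip, database_name))

df.to_sql(con=database_connection, name='product_info', if_exists='replace')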

You need to provide the SQL database credentials; once you do, Python connects your working environment directly to the database and pushes the dataset straight in as a SQL table. In the code above, if a table with that name already exists, it is replaced by the new data. You can always change the script to avoid this: pandas lets you choose between 'fail', 'append', and 'replace' for the if_exists option.

This is the basic code structure, which can be improved with exception handling for missing data or slow-loading pages. If you choose to loop the code over all the subcategories, the complete code looks like this:

import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup
import sqlalchemy

url_sets = ["https://www.walmart.com/browse/tv-video/all-tvs/3944_1060825_447913",
            "https://www.walmart.com/browse/computers/desktop-computers/3944_3951_132982",
            "https://www.walmart.com/browse/electronics/all-laptop-computers/3944_3951_1089430_132960",
            "https://www.walmart.com/browse/prepaid-phones/1105910_4527935_1072335",
            "https://www.walmart.com/browse/electronics/portable-audio/3944_96469",
            "https://www.walmart.com/browse/electronics/gps-navigation/3944_538883/",
            "https://www.walmart.com/browse/electronics/sound-bars/3944_77622_8375901_1230415_1107398",
            "https://www.walmart.com/browse/electronics/digital-slr-cameras/3944_133277_1096663",
            "https://www.walmart.com/browse/electronics/ipad-tablets/3944_1078524"]

categories = ["TVs", "Desktops", "Laptops", "Prepaid_phones", "Audio", "GPS", "soundbars", "cameras", "tablets"]

# database credentials for the export step
database_username = 'ENTER USERNAME'
database_password = 'ENTER USERNAME PASSWORD'
database_ip = 'ENTER DATABASE IP'
database_name = 'ENTER DATABASE NAME'
database_connection = sqlalchemy.create_engine(
    'mysql+mysqlconnector://{0}:{1}@{2}/{3}'.format(
        database_username, database_password, database_ip, database_name))

# scraper
for pg in range(len(url_sets)):
    # number of listing pages to scrape per category
    top_n = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
    url_category = url_sets[pg]
    print("Category:", categories[pg])
    final_results = []

    for i_1 in range(len(top_n)):
        print("Page number within category:", i_1)
        url_cat = url_category + "?page=" + top_n[i_1]
        driver = webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')  # path to your local ChromeDriver
        driver.get(url_cat)
        # keep only the HTML inside the <body> tag
        body_cat = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
        driver.quit()
        soupBody_cat = BeautifulSoup(body_cat, "html.parser")

        # every product tile carries its identifier in the data-id attribute
        for tmp in soupBody_cat.find_all('div', {'class': 'search-result-gridview-item-wrapper'}):
            final_results.append(tmp['data-id'])

    # de-duplicate the identifiers collected across all pages
    codelist = list(set(final_results))
    print("Total number of prods:", len(codelist))

    # base URL for product pages
    url1 = "https://walmart.com/ip"

    # data headers
    WLMTData = [["Product_code", "Product_name", "Product_description", "Product_URL",
                 "Breadcrumb_parent", "Breadcrumb_active", "Product_price",
                 "Rating_Value", "Rating_Count", "Recommended_Prods"]]

    for i in range(len(codelist)):
        print(i)
        item_wlmt = codelist[i]
        url2 = url1 + "/" + item_wlmt

        try:
            driver = webdriver.Chrome(executable_path='C:/Drivers/chromedriver.exe')
            print("Requesting URL: " + url2)

            driver.get(url2)  # open the product page in the browser
            print("Webpage found ...")
            time.sleep(3)
            # find the document body and get its inner HTML for processing in BeautifulSoup
            body = driver.find_element_by_tag_name("body").get_attribute("innerHTML")
            print("Closing Chrome ...")  # no more usage needed
            driver.quit()

            print("Getting data from DOM ...")
            soupBody = BeautifulSoup(body, "html.parser")

            h1ProductName = soupBody.find("h1", {"class": "prod-ProductTitle prod-productTitle-buyBox font-bold"})
            divProductDesc = soupBody.find("div", {"class": "about-desc about-product-description xs-margin-top"})
            liProductBreadcrumb_parent = soupBody.find("li", {"data-automation-id": "breadcrumb-item-0"})
            liProductBreadcrumb_active = soupBody.find("li", {"class": "breadcrumb active"})
            spanProductPrice = soupBody.find("span", {"class": "price-group"})
            spanProductRating = soupBody.find("span", {"itemprop": "ratingValue"})
            spanProductRating_count = soupBody.find("span", {"class": "stars-reviews-count-node"})

            # handle missing attributes
            productName = h1ProductName.text if h1ProductName else "Not Available"
            productDesc = divProductDesc.text if divProductDesc else "Not Available"
            breadcrumb_parent = liProductBreadcrumb_parent.text if liProductBreadcrumb_parent else "Not Available"
            breadcrumb_active = liProductBreadcrumb_active.text if liProductBreadcrumb_active else "Not Available"
            productPrice = spanProductPrice.text if spanProductPrice else "NA"

            if spanProductRating is None or spanProductRating_count is None:
                rating = 0.0
                rating_count = "0 ratings"
            else:
                rating = spanProductRating.text
                rating_count = spanProductRating_count.text

            # recommended products
            reco_prods = []
            for tmp in soupBody.find_all('a', {'class': 'tile-link-overlay u-focusTile'}):
                reco_prods.append(tmp['data-product-id'])
            # join into a single string so the value can be stored in SQL
            reco_prods = ";".join(reco_prods) if reco_prods else "Not available"

            WLMTData.append([codelist[i], productName, productDesc, url2,
                             breadcrumb_parent, breadcrumb_active, productPrice,
                             rating, rating_count, reco_prods])

        except Exception as e:
            print(str(e))

    # save the final result for this category as a data frame
    df = pd.DataFrame(WLMTData)
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])

    # export the data frame to SQL, one table per category
    df.to_sql(con=database_connection, name=categories[pg], if_exists='replace')

You can always add more complexity to this code to customize the scraper further. For example, the scraper above already handles missing data in attributes such as price, description, or reviews. Data can be missing for many reasons: a product may be out of stock or sold out, the data may have been entered improperly, or the product may be too new to have received any ratings yet.

To adapt to different web structures, you will need to keep updating your web scraper so that it stays functional as the web pages get updated. Web scraping services give you a base template for a Python scraper for Walmart.

Want to extract data for your business? Contact iWeb Scraping, your data scraping professional!

 

