Hotel & Travel

How to Scrape Expedia using Python and LXML?

Parth Vataliya

6 min read

October 29, 2021

When done manually, gathering travel data for planes is a massive undertaking. There are various possible combinations of airports, routes, times, and costs, all of which are always changing. Ticket rates fluctuate on a daily (or even hourly) basis, and there are numerous flights available each day. Web scraping is one method for keeping track of this information. In this Blog, we’ll scrape Expedia , a popular vacation booking site, to get flight information. The flight schedules and pricing for a sender and the receiver pair will be extracted by our scraper.

Data Fields that will be extracted:

Arrival Airport
Arrival Time
Departure Airport
Departure Time
Flight name
Flight duration
Ticket Price
No. of stops
Airline

Below shown is the screenshot of the data fields that we will be extracting:

Scraping Code:

1. Create the URL of the search results from Expedia for instance, we will check the available flights listed from New York to Miami:

https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:New%20York,%20NY%20(NYC-All%20Airports),to:Miami,%20Florida,departure:04/01/2017TANYT&passengers=children:0,adults:1,seniors:0,infantinlap:Y&mode=search
2. Using Python Requests, download the HTML of the search result page.

3. Parse the webpage with LXML — LXML uses Xpaths to browse the HTML Tree Structure. The XPaths for the details we require in the code have already been defined.

4. Save the information in a JSON file. You can change this later to write to a database.

Requirements

We’ll need several libraries for obtaining and parsing HTML for this Python 3 web scraping tutorial. The requirements for the package are shown below.

Install Python 3 and Pip

Install Packages
The code is explanatory

You can check the code from the link here.

Executing the Expedia Scraper

Let’s say the script’s name is expedia.py. In a command prompt or terminal, input the script name followed by a -h.

usage: expedia.py [-h] source destination date
positional arguments:
source Source airport code
destination Destination airport code
date MM/DD/YYYY
optional arguments:
-h, --help show this help message and exit

The input and output arguments are the airline codes for the source and destination airports, respectively. The date parameter must be in the form MM/DD/YYYY

For example, to get flights from New York to Miami, we would use the following arguments:

python3 expedia.py nyc mia 04/01/2017

The nyc-mia-flight-results.json file will be created as a result of this. json, which will be saved in the same directory as the script.

This is what the output file will look like:

{
"arrival": "Miami Intl., Miami",
"timings": [
{
"arrival_airport": "Miami, FL (MIA-Miami Intl.)",
"arrival_time": "12:19a",
"departure_airport": "New York, NY (LGA-LaGuardia)",
"departure_time": "9:00p"
}
],
"airline": "American Airlines",
"flight duration": "1 days 3 hours 19 minutes",
"plane code": "738",
"plane": "Boeing 737-800",
"departure": "LaGuardia, New York",
"stops": "Nonstop",
"ticket price": "1144.21"
},
{
"arrival": "Miami Intl., Miami",
"timings": [
{
"arrival_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
"arrival_time": "11:15a",
"departure_airport": "New York, NY (LGA-LaGuardia)",
"departure_time": "9:11a"
},
{
"arrival_airport": "Miami, FL (MIA-Miami Intl.)",
"arrival_time": "8:44p",
"departure_airport": "St. Louis, MO (STL-Lambert-St. Louis Intl.)",
"departure_time": "4:54p"
}
],
"airline": "Republic Airlines As American Eagle",
"flight duration": "0 days 11 hours 33 minutes",
"plane code": "E75",
"plane": "Embraer 175",
"departure": "LaGuardia, New York",
"stops": "1 Stop",
"ticket price": "2028.40"
},

You can download the code at:

import json

import requests

from lxml import html

from collections import OrderedDict

import argparse

def parse(source,destination,date):

for i in range(5):

try:

url = "https://www.expedia.com/Flights-Search?trip=oneway&leg1=from:{0},to:{1},departure:{2}TANYT&passengers=adults:1,children:0,seniors:0,infantinlap:Y&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com".format(source,destination,date)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

response = requests.get(url, headers=headers, verify=False)
parser = html.fromstring(response.text)
json_data_xpath = parser.xpath("//script[@id='cachedResultsJson']//text()")
raw_json =json.loads(json_data_xpath[0] if json_data_xpath else '')
flight_data = json.loads(raw_json["content"])
flight_info  = OrderedDict() 
lists=[]
for i in flight_data['legs'].keys():
total_distance =  flight_data['legs'][i].get("formattedDistance",'')
exact_price = flight_data['legs'][i].get('price',{}).get('totalPriceAsDecimal','')
departure_location_airport = flight_data['legs'][i].get('departureLocation',{}).get('airportLongName','')
departure_location_city = flight_data['legs'][i].get('departureLocation',{}).get('airportCity','')
departure_location_airport_code = flight_data['legs'][i].get('departureLocation',{}).get('airportCode','')
arrival_location_airport = flight_data['legs'][i].get('arrivalLocation',{}).get('airportLongName','')
arrival_location_airport_code = flight_data['legs'][i].get('arrivalLocation',{}).get('airportCode','')
arrival_location_city = flight_data['legs'][i].get('arrivalLocation',{}).get('airportCity','')
airline_name = flight_data['legs'][i].get('carrierSummary',{}).get('airlineName','')
no_of_stops = flight_data['legs'][i].get("stops","")
flight_duration = flight_data['legs'][i].get('duration',{})
flight_hour = flight_duration.get('hours','')
flight_minutes = flight_duration.get('minutes','')
flight_days = flight_duration.get('numOfDays','')
if no_of_stops==0:
    stop = "Nonstop"

else:

    stop = str(no_of_stops)+' Stop'
total_flight_duration = "{0} days {1} hours {2} minutes".format(flight_days,flight_hour,flight_minutes)
departure = departure_location_airport+", "+departure_location_city
arrival = arrival_location_airport+", "+arrival_location_city
carrier = flight_data['legs'][i].get('timeline',[])[0].get('carrier',{})
plane = carrier.get('plane','')
plane_code = carrier.get('planeCode','')
formatted_price = "{0:.2f}".format(exact_price)
if not airline_name:
    airline_name = carrier.get('operatedBy','')

timings = []
for timeline in  flight_data['legs'][i].get('timeline',{}):
    if 'departureAirport' in timeline.keys():
        departure_airport = timeline['departureAirport'].get('longName','')
        departure_time = timeline['departureTime'].get('time','')
        arrival_airport = timeline.get('arrivalAirport',{}).get('longName','')
        arrival_time = timeline.get('arrivalTime',{}).get('time','')

        flight_timing = {

                            'departure_airport':departure_airport,

                            'departure_time':departure_time,

                            'arrival_airport':arrival_airport,

                            'arrival_time':arrival_time

        }

        timings.append(flight_timing)

flight_info={'stops':stop,

    'ticket price':formatted_price,

    'departure':departure,

    'arrival':arrival,

    'flight duration':total_flight_duration,

    'airline':airline_name,

    'plane':plane,

    'timings':timings,

    'plane code':plane_code

}
lists.append(flight_info)

    sortedlist = sorted(lists, key=lambda k: k['ticket price'],reverse=False)
    return sortedlist
except ValueError:
    print ("Rerying...")
 
return {"error":"failed to process the page",}

if __name__=="__main__":
argparser = argparse.ArgumentParser()
argparser.add_argument('source',help = 'Source airport code')
argparser.add_argument('destination',help = 'Destination airport code')
argparser.add_argument('date',help = 'MM/DD/YYYY')
args = argparser.parse_args()
source = args.source
destination = args.destination
date = args.date
print ("Fetching flight details")
scraped_data = parse(source,destination,date)
print ("Writing data to output file")
with open('%s-%s-flight-results.json'%(source,destination),'w') as fp:
json.dump(scraped_data,fp,indent = 4)

Unless the page structure changes dramatically, this scraper should be able to retrieve most of the flight details present on Expedia. This scraper is probably not going to work for you if you want to scrape the details of thousands of pages at very short intervals.

Contact iWeb Scraping for extracting Expedia using Python and LXML or ask for a free quote!

Frequently Asked Questions

The primary advantage is scalability and real-time business intelligence. Manually reading tweets is inefficient. Sentiment analysis tools allow you to instantly analyze thousands of tweets about your brand, products, or campaigns. This provides a scalable way to understand customer feelings, track brand reputation, and gather actionable insights from a massive, unfiltered source of public opinion, as highlighted in the blog’s “Advantages” section.

By analyzing the sentiment behind tweets, businesses can directly understand why customers feel the way they do. It helps identify pain points with certain products, gauge reactions to new launches, and understand the reasons behind positive feedback. This deep insight into the “voice of the customer” allows companies to make data-driven decisions to improve products, address complaints quickly, and enhance overall customer satisfaction, which aligns with the business applications discussed in the blog.

Yes, when using advanced tools, it provides reliable and consistent criteria. As the blog notes, manual analysis can be inconsistent due to human bias. Automated sentiment analysis using Machine Learning and AI (like the technology used by iWeb Scraping) trains models to tag data uniformly. This eliminates human inconsistency, provides results with a high degree of accuracy, and offers a reliable foundation for strategic business decisions.

Businesses can use a range of tools, from code-based libraries to dedicated platforms. As mentioned in the blog, popular options include Python with libraries like Tweepy and TextBlob, or dedicated services like MeaningCloud and iWeb Scraping’s Text Analytics API. The choice depends on your needs: Python offers customization for technical teams, while off-the-shelf APIs from web scraping services provide a turnkey solution for automatically scraping Twitter and extracting brand insights quickly and accurately.

Share this Article :

Build the scraper you want123

We’ll customize your concurrency, speed, and extended trial — for high-volume scraping.

Continue Reading

Business

Why Web Scraping Alone Is No Longer Enough for Modern Businesses?

Web scraping is an effective way to gather data from websites, but businesses are increasingly seeking more advanced methods of …

Parth Vataliya Reading Time: 10 min

E-Commerce

How to Scrape Personal Care & Beauty Product Data from Sephora.com?

Sephora.com hosts over 300 brands and thousands of beauty products. Extracting this data helps businesses analyze pricing trends, track competitor …

Parth Vataliya Reading Time: 13 min

Other

How to Extract AI Overviews for Multiple Queries: A Technical Guide

What Are AI Overviews and Why Should You Extract Them? AI Overviews represent Google’s latest innovation in search technology. These …

Parth Vataliya Reading Time: 10 min

Get in Touch with Us

iWeb Scraping eliminates manual data entry with AI-powered extraction for businesses.

Address

Web scraping is an efficien

Address

Web scraping is an efficien

Address

Web scraping is an efficien

Address

Web scraping is an efficien

Expert Consultation

Discuss your data needs with our specialists for tailored scraping solutions.

Expert Consultation

Discuss your data needs with our specialists for tailored scraping solutions.

Expert Consultation

Discuss your data needs with our specialists for tailored scraping solutions.

Social Media :

Managed Extraction

Managed Extraction

By Use Case

By Industry

Categories

APIs

Web Scraping API

APIs

Web Scraping API

Web Scraping API

Web Scraping API

How to Scrape Expedia using Python and LXML?

Data Fields that will be extracted:

Scraping Code:

Requirements

Install Python 3 and Pip

Executing the Expedia Scraper

Frequently Asked Questions

Table of Contents

Build the scraper you want123

Continue Reading

Why Web Scraping Alone Is No Longer Enough for Modern Businesses?

How to Scrape Personal Care & Beauty Product Data from Sephora.com?

How to Extract AI Overviews for Multiple Queries: A Technical Guide

Get in Touch with Us

Get in Touch with Us

Address

Address

Address

Address

Expert Consultation

Expert Consultation

Expert Consultation

Managed Extraction

Managed Extraction

By Use Case

By Industry

Categories

APIs

Web Scraping API

APIs

Web Scraping API

Web Scraping API

Web Scraping API

How to Scrape Expedia using Python and LXML?

Data Fields that will be extracted:

Scraping Code:

Requirements

Install Python 3 and Pip

Executing the Expedia Scraper

Frequently Asked Questions

What is the main advantage of using Twitter sentiment analysis for business?

How can Twitter sentiment analysis improve customer experience?

Is automated Twitter sentiment analysis reliable for business decisions?

What tools can a business use to perform Twitter sentiment analysis?

Table of Contents

Build the scraper you want123

Continue Reading

Why Web Scraping Alone Is No Longer Enough for Modern Businesses?

How to Scrape Personal Care & Beauty Product Data from Sephora.com?

How to Extract AI Overviews for Multiple Queries: A Technical Guide

Get in Touch with Us

Get in Touch with Us

Address

Address

Address

Address

Expert Consultation

Expert Consultation

Expert Consultation