How Is a Web Scraping API Used to Extract Real Estate Website Data?


In this blog, we will discuss scraping real estate data using Python and creating a dataset as per the requirement. As an example, we will extract real estate information for New York City. The website’s information is not permanently contained in the HTML code but is produced dynamically by issuing a POST request to Realtor.com’s API. The reply to the request then returns the website’s data as JSON-formatted strings. As a result, your goal is to “fake” the POST request in your Python script, posing as the website requesting data from the API.

In the network panel of your browser’s developer tools, click the clear icon in the upper-left corner, beside the red icon, to empty the request log. If you then scroll to the bottom of the webpage and click “Next,” you will see a POST request issued to the website’s API at the top of the log.

When you click on the “Response” tab, you will see that the reply to the request contains the data for the real estate listings shown on the webpage.

To define what data you want to get, you must submit a request payload with the POST request. When you click on the “Payload” tab, you will find the request payload. Click on “view source” if you want the request payload as a string rather than a JSON dictionary. Within this payload, you specify the data you want to retrieve, in this case the results of the second page of the real estate listings for New York.

Now that you know where to send your request and which payload to send with it, you can begin building your Python script.

You will require the following libraries:

import requests 
import json 
import pandas as pd

When you copy the request payload into your script, you must prefix the string with “r” (making it a raw string) because the string contains escape sequences. It is also critical to include the Content-Type in the request headers.
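To see why the raw-string prefix matters, here is a minimal sketch that uses a shortened, made-up payload (an assumption for illustration, not the real request). Without the prefix, Python converts each `\n` into an actual newline character, which is not allowed inside a JSON string, so `json.loads` should reject it.

```python
import json

# Shortened, hypothetical payload used only for illustration.
raw_body = r'{"query":"\n\nquery Demo {\n count\n}","variables":{"limit":42}}'
plain_body = '{"query":"\n\nquery Demo {\n count\n}","variables":{"limit":42}}'

# With the r prefix the backslash-n pairs reach the JSON parser intact,
# and the parser decodes them into newlines inside the query string.
parsed = json.loads(raw_body)
print(parsed["variables"]["limit"])

# Without the prefix Python has already inserted real newline characters,
# which are invalid inside a JSON string literal.
try:
    json.loads(plain_body)
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False
print(plain_ok)
```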

url = "https://www.iwebscraping.com/api/v1/hulk?client_id=rdc-x&schema=vesta"
headers = {"content-type": "application/json"}

body = r'{"query":"\n\nquery ConsumerSearchMainQuery($query: HomeSearchCriteria!, $limit: Int, $offset: Int, $sort: [SearchAPISort], $sort_type: SearchSortType, $client_data: JSON, $geoSupportedSlug: String!, $bucket: SearchAPIBucket, $by_prop_type: [String])\n{\n home_search: home_search(query: $query,\n sort: $sort,\n limit: $limit,\n offset: $offset,\n sort_type: $sort_type,\n client_data: $client_data,\n bucket: $bucket,\n ){\n count\n total\n results {\n property_id\n list_price\n primary_photo (https: true){\n href\n }\n source {\n id\n agents{\n office_name\n }\n type\n spec_id\n plan_id\n }\n community {\n property_id\n description {\n name\n }\n advertisers{\n office{\n hours\n phones {\n type\n number\n }\n }\n builder {\n fulfillment_id\n }\n }\n }\n products {\n brand_name\n products\n }\n listing_id\n matterport\n virtual_tours{\n href\n type\n }\n status\n permalink\n price_reduced_amount\n other_listings{rdc {\n listing_id\n status\n listing_key\n primary\n }}\n description{\n beds\n baths\n baths_full\n baths_half\n baths_1qtr\n baths_3qtr\n garage\n stories\n type\n sub_type\n lot_sqft\n sqft\n year_built\n sold_price\n sold_date\n name\n }\n location{\n street_view_url\n address{\n line\n postal_code\n state\n state_code\n city\n coordinate {\n lat\n lon\n }\n }\n county {\n name\n fips_code\n }\n }\n tax_record {\n public_record_id\n }\n lead_attributes {\n show_contact_an_agent\n opcity_lead_attributes {\n cashback_enabled\n flip_the_market_enabled\n }\n lead_type\n }\n open_houses {\n start_date\n end_date\n description\n methods\n time_zone\n dst\n }\n flags{\n is_coming_soon\n is_pending\n is_foreclosure\n is_contingent\n is_new_construction\n is_new_listing (days: 14)\n is_price_reduced (days: 30)\n is_plan\n is_subdivision\n }\n list_date\n last_update_date\n coming_soon_date\n photos(limit: 2, https: true){\n href\n }\n tags\n branding {\n type\n photo\n name\n }\n }\n }\n geo(slug_id: $geoSupportedSlug) {\n parents {\n geo_type\n slug_id\n name\n }\n geo_statistics(group_by: property_type) {\n housing_market {\n by_prop_type(type: $by_prop_type){\n type\n attributes{\n median_listing_price\n median_lot_size\n median_sold_price\n median_price_per_sqft\n median_days_on_market\n }\n }\n listing_count\n median_listing_price\n median_rent_price\n median_price_per_sqft\n median_days_on_market\n median_sold_price\n month_to_month {\n active_listing_count_percent_change\n median_days_on_market_percent_change\n median_listing_price_percent_change\n median_listing_price_sqft_percent_change\n }\n }\n }\n recommended_cities: recommended(query: {geo_search_type: city, limit: 20}) {\n geos {\n ... on City {\n city\n state_code\n geo_type\n slug_id\n }\n geo_statistics(group_by: property_type) {\n housing_market {\n by_prop_type(type: [\"home\"]) {\n type\n attributes {\n median_listing_price\n }\n }\n median_listing_price\n }\n }\n }\n }\n recommended_neighborhoods: recommended(query: {geo_search_type: neighborhood, limit: 20}) {\n geos {\n ... on Neighborhood {\n neighborhood\n city\n state_code\n geo_type\n slug_id\n }\n geo_statistics(group_by: property_type) {\n housing_market {\n by_prop_type(type: [\"home\"]) {\n type\n attributes {\n median_listing_price\n }\n }\n median_listing_price\n }\n }\n }\n }\n recommended_counties: recommended(query: {geo_search_type: county, limit: 20}) {\n geos {\n ... on HomeCounty {\n county\n state_code\n geo_type\n slug_id\n }\n geo_statistics(group_by: property_type) {\n housing_market {\n by_prop_type(type: [\"home\"]) {\n type\n attributes {\n median_listing_price\n }\n }\n median_listing_price\n }\n }\n }\n }\n recommended_zips: recommended(query: {geo_search_type: postal_code, limit: 20}) {\n geos {\n ... on PostalCode {\n postal_code\n geo_type\n slug_id\n }\n geo_statistics(group_by: property_type) {\n housing_market {\n by_prop_type(type: [\"home\"]) {\n type\n attributes {\n median_listing_price\n }\n }\n median_listing_price\n }\n }\n }\n }\n }\n}","variables":{"query":{"status":["for_sale","ready_to_build"],"primary":true,"state_code":"NY"},"client_data":{"device_data":{"device_type":"web"},"user_data":{"last_view_timestamp":-1}},"limit":42,"offset":42,"zohoQuery":{"silo":"search_result_page","location":"New York","property_status":"for_sale","filters":{},"page_index":"2"},"sort_type":"relevant","geoSupportedSlug":"","by_prop_type":["home"]},"operationName":"ConsumerSearchMainQuery","callfrom":"SRP","nrQueryType":"MAIN_SRP","visitor_id":"eff16470-ceb5-4926-8c0b-6d1779772842","isClient":true,"seoPayload":{"asPath":"/realestateandhomes-search/New-York/pg-2","pageType":{"silo":"search_result_page","status":"for_sale"},"county_needed_for_uniq":false}}'
json_body = json.loads(body)

r = requests.post(url=url, json=json_body, headers=headers)
json_data = r.json()

Our objective now is to retrieve the desired information from the “json_data” variable, which holds a list of real estate listings. We will loop through every item and construct a feature dictionary, which we will then add to a list to create a Pandas DataFrame.
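The loop described above can be sketched as follows. The `sample_json_data` dictionary is a heavily trimmed, made-up stand-in for the API response, so the values are assumptions; only the `data`, `home_search`, `results` nesting and the field names follow the request shown earlier.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for the API response.
sample_json_data = {
    "data": {"home_search": {"results": [
        {"property_id": "1", "list_price": 500000,
         "description": {"beds": 2, "baths": 1, "sqft": 900},
         "tags": ["garage", "view"]},
        {"property_id": "2", "list_price": 750000,
         "description": {"beds": 3, "baths": 2, "sqft": 1400},
         "tags": ["view"]},
    ]}}
}

rows = []
for entry in sample_json_data["data"]["home_search"]["results"]:
    # One flat dictionary per listing; each key becomes a DataFrame column.
    rows.append({
        "id": entry["property_id"],
        "price": entry["list_price"],
        "beds": entry["description"]["beds"],
        "baths": entry["description"]["baths"],
        "sqft": entry["description"]["sqft"],
        "tags": entry["tags"],
    })

df = pd.DataFrame(rows)
print(df.shape)
```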

You now have a DataFrame with 18 columns. One column, however, would be of little use for further analysis in its current form: the “tags” column, which holds a list of strings for each listing.

It is possible to one-hot encode this field so that each unique tag is represented by its own indicator (dummy) column.
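As a small illustration of this encoding, the sketch below applies pandas `explode` and `get_dummies` to a made-up three-row DataFrame (the tag values are assumptions chosen for the example).

```python
import pandas as pd

# Hypothetical listings, each with a list of tags.
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "tags": [["garage", "view"], ["view"], ["garage", "basement"]],
})

# explode() gives one row per tag while repeating the original index,
# get_dummies() creates one indicator column per unique tag, and
# groupby(level=0).sum() collapses the rows back to one per listing.
dummy_df = pd.get_dummies(df["tags"].explode()).groupby(level=0).sum()

# Join the indicator columns back onto the original rows by index.
merged_df = pd.merge(df, dummy_df, left_index=True, right_index=True)
print(sorted(dummy_df.columns))
```

Each listing keeps its original columns and gains one column per tag, holding 1 where the tag is present and 0 otherwise.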

The result so far is just the scraped data from one page; however, there are 206 pages of real estate listings for New York City. Using web scraping services, we aim to crawl each of the 206 pages, and to do so, we must submit a request for each page with a page-specific request payload.

We need to change three dictionary values to meet this requirement. The page number appears in the “page_index” value (nested under “variables” and “zohoQuery”) and in the “asPath” value of “seoPayload”, while the “offset” value is a number that is incremented by 42 for every page.
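The three changes can be sketched on a trimmed copy of the payload. The helper name `payload_for_page` and the `PAGE_SIZE` constant are hypothetical, and starting page 1 at offset 0 is an assumption inferred from the captured page-2 payload, which uses offset 42.

```python
import copy

PAGE_SIZE = 42  # listings per page, taken from "limit" in the captured payload

def payload_for_page(template: dict, page_number: int) -> dict:
    """Return a copy of the payload adjusted for the given 1-based page."""
    payload = copy.deepcopy(template)  # leave the template untouched
    payload["variables"]["offset"] = (page_number - 1) * PAGE_SIZE
    payload["variables"]["zohoQuery"]["page_index"] = str(page_number)
    payload["seoPayload"]["asPath"] = f"/realestateandhomes-search/New-York/pg-{page_number}"
    return payload

# Trimmed template containing only the keys that change between pages.
template = {
    "variables": {"limit": 42, "offset": 42, "zohoQuery": {"page_index": "2"}},
    "seoPayload": {"asPath": "/realestateandhomes-search/New-York/pg-2"},
}

page_3 = payload_for_page(template, 3)
print(page_3["variables"]["offset"])  # 84
```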

def send_request(page_number: int, offset_parameter: int) -> dict:
    url = "https://www.iwebscraping.com/api/v1/hulk?client_id=rdc-x&schema=vesta"
    headers = {"content-type": "application/json"}

    # Reuse the raw request-payload string `body` defined above as a template.
    json_body = json.loads(body)

    # "page_index" is nested under variables -> zohoQuery and is stored as a string.
    json_body["variables"]["zohoQuery"]["page_index"] = str(page_number)
    # "seoPayload" is a dictionary; only its "asPath" entry contains the page number.
    json_body["seoPayload"]["asPath"] = f"/realestateandhomes-search/New-York/pg-{page_number}"
    json_body["variables"]["offset"] = offset_parameter

    r = requests.post(url=url, json=json_body, headers=headers)
    json_data = r.json()
    return json_data

Here, you will loop over every page and append the JSON data to a list.
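A minimal sketch of that loop is shown below. To keep it runnable offline, the request function is passed in as an argument and replaced by a stand-in called `fake_fetch`; both `collect_pages` and `fake_fetch` are hypothetical names, and in the real script the function that sends the POST request would take their place.

```python
from typing import Callable

def collect_pages(fetch_page: Callable[[int, int], dict], last_page: int,
                  page_size: int = 42) -> list:
    """Call fetch_page(page_number, offset) for pages 1..last_page."""
    pages = []
    for page_number in range(1, last_page + 1):
        # Assumes page 1 starts at offset 0, so page 2 gets offset 42, and so on.
        offset = (page_number - 1) * page_size
        pages.append(fetch_page(page_number, offset))
    return pages

# Stand-in for the real request function so the sketch runs offline.
def fake_fetch(page_number: int, offset: int) -> dict:
    return {"data": {"home_search": {"results": [{"page": page_number, "offset": offset}]}}}

json_pages = collect_pages(fake_fetch, last_page=3)
print(len(json_pages))
```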

We’ll develop a function to retrieve data from a specific item.
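Because some nested values, such as the coordinate or the county, can be empty for a listing, one possible way to read them defensively is a small lookup helper. The `dig` function below is a hypothetical convenience, not part of the original script, and the sample entry is made up.

```python
def dig(entry: dict, *keys, default=None):
    """Follow keys through nested dicts; return default if a level is missing or None."""
    current = entry
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current

# Hypothetical listing whose coordinate is missing.
entry = {
    "property_id": "1",
    "location": {
        "address": {"city": "New York", "coordinate": None},
        "county": {"name": "Kings"},
    },
}

print(dig(entry, "location", "address", "city"))
print(dig(entry, "location", "address", "coordinate", "lat"))
```

The second lookup returns `None` instead of raising an error, which later shows up as a missing value in the DataFrame.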

Finally, you will iterate through the list that holds the JSON data of every page and extract the data.

After executing the above scripts, you should obtain a dataset with 8652 rows and 179 columns.

class RealtorScraper:

    def __init__(self, page_numbers: int) -> None:

        self.page_numbers = page_numbers

    def send_request(self, page_number: int, offset_parameter: int) -> dict:

        url = "https://www.iwebscraping.com/api/v1/hulk?client_id=rdc-x&schema=vesta"

        headers = {"content-type": "application/json"}

        # Reuse the raw request-payload string `body` defined earlier as a template.
        json_body = json.loads(body)

        # "page_index" is nested under variables -> zohoQuery and is stored as a string.
        json_body["variables"]["zohoQuery"]["page_index"] = str(page_number)

        # "seoPayload" is a dictionary; only its "asPath" entry contains the page number.
        json_body["seoPayload"]["asPath"] = f"/realestateandhomes-search/New-York/pg-{page_number}"

        json_body["variables"]["offset"] = offset_parameter

        r = requests.post(url=url, json=json_body, headers=headers)

        json_data = r.json()

        return json_data

    def extract_features(self, entry: dict) -> dict:

        feature_dict = {

            "id": entry["property_id"],

            "price": entry["list_price"],

            "beds": entry["description"]["beds"],

            "baths": entry["description"]["baths"],

            "garage": entry["description"]["garage"],

            "stories": entry["description"]["stories"],

            "house_type": entry["description"]["type"],

            "lot_sqft": entry["description"]["lot_sqft"],

            "sqft": entry["description"]["sqft"],

            "year_built": entry["description"]["year_built"],

            "address": entry["location"]["address"]["line"],

            "postal_code": entry["location"]["address"]["postal_code"],

            "state": entry["location"]["address"]["state_code"],

            "city": entry["location"]["address"]["city"],

            "tags": entry["tags"]

        }       

        if entry["location"]["address"]["coordinate"]:

            feature_dict.update({"lat": entry["location"]["address"]["coordinate"]["lat"]})

            feature_dict.update({"lon": entry["location"]["address"]["coordinate"]["lon"]})           

        if entry["location"]["county"]:

            feature_dict.update({"county": entry["location"]["county"]["name"]})        

        return feature_dict    

    def parse_json_data(self) -> list:

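        # Page 1 starts at offset 0; the captured page-2 payload uses offset 42.
        offset_parameter = 0

        feature_dict_list = []

        # range() excludes its end value, so add 1 to include the last page.
        for i in range(1, self.page_numbers + 1):

            json_data = self.send_request(page_number=i, offset_parameter=offset_parameter)

            offset_parameter += 42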

            for entry in json_data["data"]["home_search"]["results"]:

                feature_dict = self.extract_features(entry)

                feature_dict_list.append(feature_dict)                

        return feature_dict_list    

    def create_dataframe(self) -> pd.DataFrame:

        feature_dict_list = self.parse_json_data()       

        df = pd.DataFrame(feature_dict_list)

        dummy_df = pd.get_dummies(df['tags'].explode()).groupby(level=0).sum()

        merged_df = pd.merge(df, dummy_df, left_index=True, right_index=True)

        return merged_df

if __name__ == "__main__":

    r = RealtorScraper(page_numbers=206)

    df = r.create_dataframe()

For further details, contact iWeb Scraping today or request a quote!

