Nordstrom is a leading fashion retailer based in the US with an equally popular e-commerce store that operates worldwide. It's a popular web scraping target because of the rich data it offers and its position in the fashion industry.
In this guide, we'll take a look at web scraping Nordstrom using Python. We'll cover:
- Nordstrom product data scraping.
- Product discovery and search.
For this, we'll be using the popular Python web scraping tools httpx and parsel. To parse the data, we'll be using the hidden web data approach.
Nordstrom is relatively easy to scrape so let's dive in!
Why Scrape Nordstrom?
Nordstrom is a popular fashion retailer with a huge product catalog, which makes it a great target for web scraping. Its popularity and catalog size make its data a useful window into the fashion e-commerce market. This data can be used for business analytics, market analysis and competitive intelligence.
For more on web scraping uses see our web scraping use case hub.
Scrape Preview
In this article, we'll focus on scraping Nordstrom product data and product reviews. Here are some examples of the datasets we'll be collecting:
Scraped Product Dataset
{
"id": 5846438,
"title": "SKIMS Stretch Cotton T-Shirt",
"type": "T-shirt/Tee",
"typeParent": "Tops",
"ageGroups": [
"ADULT"
],
"reviewAverageRating": 4.5,
"numberOfReviews": 652,
"brand": {
"brandName": "SKIMS",
"brandUrl": "/brands/skims--21197?origin=productBrandLink",
"hasBrandPage": false,
"imsBrandId": 74974321
},
"description": "A tried-and-true classic, this fitted T-shirt made from stretch-cotton jersey is from Kim Kardashian's highly sought-out SKIMS.",
"features": [
"21 1/2\" length (size Medium)",
"Crewneck",
"Short sleeves",
"90% cotton, 10% elastane",
"Machine wash, tumble dry",
"Imported",
"Item #6194916"
],
"gender": "Female",
"isAvailable": true,
"media": {
"5847438": {
"id": 5847438,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/e354aaf8-5865-431b-b8d8-3cbccc6a2d83.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847448": {
"id": 5847448,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/df191e8d-4f2c-48f4-9144-e6b9dbede775.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847458": {
"id": 5847458,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/bca96a41-af1b-4736-89e3-e2facb3ec8ed.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847468": {
"id": 5847468,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/1b0051f1-f60e-4b4b-8f79-3fabd077e91d.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847478": {
"id": 5847478,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/86510e70-589b-440a-b66a-98982ce59740.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5847488": {
"id": 5847488,
"colorId": "053",
"name": "LIGHT HEATHER GREY",
"url": "https://n.nordstrommedia.com/id/sr3/d6ae4e0c-3b22-4dff-b528-d428005d8cd8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848438": {
"id": 5848438,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/d64c4a4d-ca98-46af-8ff4-efd7460e3321.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848448": {
"id": 5848448,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/f1d6105b-9e75-49aa-bfdb-39ed6a0cd82a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5848458": {
"id": 5848458,
"colorId": "234",
"name": "SEDONA",
"url": "https://n.nordstrommedia.com/id/sr3/04936587-02d9-41c7-b36f-b7f90144df6e.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849438": {
"id": 5849438,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/85f4e2d8-00de-41f9-b777-2169bb799970.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849448": {
"id": 5849448,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/4e2bffa2-fb87-416c-8438-a922d593423f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5849458": {
"id": 5849458,
"colorId": "242",
"name": "UMBER",
"url": "https://n.nordstrommedia.com/id/sr3/ca5f4ff8-7587-48cc-8914-818ee6320b9c.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850438": {
"id": 5850438,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/0762da9a-4326-46fd-9b84-6db33035c0ea.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850448": {
"id": 5850448,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/9f20433a-3d03-4893-87f9-2fd90f05c2b5.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850458": {
"id": 5850458,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/32d39f3b-88e8-4ee2-bb15-7723bed651c8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850468": {
"id": 5850468,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/da666e38-7c2d-408e-9874-f30f094ccd9e.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850478": {
"id": 5850478,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/97828599-558b-48a5-8e03-35aeec7f6dbe.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5850488": {
"id": 5850488,
"colorId": "251",
"name": "CAMEL",
"url": "https://n.nordstrommedia.com/id/sr3/ef50a5bb-8f20-428d-8d64-0c7f9dd80776.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5851438": {
"id": 5851438,
"colorId": "301",
"name": "DEEP SEA",
"url": "https://n.nordstrommedia.com/id/sr3/8a2ed339-427b-4f93-9a49-762a43145d42.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5851448": {
"id": 5851448,
"colorId": "301",
"name": "DEEP SEA",
"url": "https://n.nordstrommedia.com/id/sr3/406118cc-c17a-42a5-842c-c12a54c19b39.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852438": {
"id": 5852438,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/a6c49b4c-1849-4c9e-895e-2804c4a0d01b.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852448": {
"id": 5852448,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/3c3820b0-0fe3-4869-bfcb-040917a78276.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852458": {
"id": 5852458,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/0544d615-d912-4fed-8e35-95bd9fdf753f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5852468": {
"id": 5852468,
"colorId": "339",
"name": "MINERAL",
"url": "https://n.nordstrommedia.com/id/sr3/df96797a-9a3d-4070-83c5-cd7d94dd1260.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5853438": {
"id": 5853438,
"colorId": "400",
"name": "COBALT",
"url": "https://n.nordstrommedia.com/id/sr3/95c440cd-18ea-47e0-a48f-6f97f1e1c0fc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854438": {
"id": 5854438,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/b0359253-5e23-4619-9123-34dfb35063e6.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854448": {
"id": 5854448,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/81a2918e-5643-4d63-8850-d9d8654b62af.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5854458": {
"id": 5854458,
"colorId": "446",
"name": "KYANITE",
"url": "https://n.nordstrommedia.com/id/sr3/8e2b9d0f-8b8f-4835-9a57-bb197f95631d.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855438": {
"id": 5855438,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/31cdee52-d41a-46a7-8691-3ae1e0c53fb7.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855448": {
"id": 5855448,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/a1843dad-b30c-4031-8d36-42c47934572f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855458": {
"id": 5855458,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/e2543102-670e-40e8-acb6-916ea91f1515.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855468": {
"id": 5855468,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/3daefa94-9c8a-41f6-967e-f85b80ba3ebf.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855478": {
"id": 5855478,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/95c440cd-18ea-47e0-a48f-6f97f1e1c0fc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5855488": {
"id": 5855488,
"colorId": "8",
"name": "525",
"url": "https://n.nordstrommedia.com/id/sr3/ad9d1fcc-a0a0-4856-8345-de54e3b6b54f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856438": {
"id": 5856438,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/aaa6a78e-f7d8-46f3-b51e-533642b5ea02.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856448": {
"id": 5856448,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/3d680521-dc9e-4f07-a634-e02043e78910.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856458": {
"id": 5856458,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/fbc8e722-af04-403f-a8fa-e938d56da1f3.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856468": {
"id": 5856468,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/4f2e699b-125f-484e-8873-09f72a2fa40a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856478": {
"id": 5856478,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/1559d8ec-d03e-416c-9d27-c6d31151012f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5856488": {
"id": 5856488,
"colorId": "603",
"name": "SANGRIA",
"url": "https://n.nordstrommedia.com/id/sr3/44ffdcad-614c-4329-ba6b-65244873e200.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857438": {
"id": 5857438,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/35a9863f-feda-463c-aedf-a988329754c8.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857448": {
"id": 5857448,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/2241bfc4-be0f-4645-a350-7d19aafce7ae.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857458": {
"id": 5857458,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/3c76bafa-dda1-4069-9d55-4deddd58a70f.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857468": {
"id": 5857468,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/a3e5cf6b-0e43-455e-aba4-0b7093e0ac60.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857478": {
"id": 5857478,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/357882e4-c176-4c98-9601-39ee0299452a.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5857488": {
"id": 5857488,
"colorId": "690",
"name": "ROSE CLAY",
"url": "https://n.nordstrommedia.com/id/sr3/b9a9588e-a241-43b8-b907-0fc5d16d959c.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5858438": {
"id": 5858438,
"colorId": "900",
"name": "BONE",
"url": "https://n.nordstrommedia.com/id/sr3/eb5b0ed4-41b9-439b-a56d-a9f549892451.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5859438": {
"id": 5859438,
"colorId": "003",
"name": "SOOT",
"url": "https://n.nordstrommedia.com/id/sr3/2c5c5fd6-3df6-4e30-a5af-893041f219dc.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
},
"5860438": {
"id": 5860438,
"colorId": "203",
"name": "GARNET",
"url": "https://n.nordstrommedia.com/id/sr3/9b140781-5301-4137-b94e-fe10b7a674b4.jpeg?crop=pad&pad_color=FFF&format=jpeg&w=780&h=1196"
}
},
"variants": {
"5871416": {
"id": 5871416,
"sizeId": "xx-small",
"colorId": "339",
"totalQuantityAvailable": 1,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871419": {
"id": 5871419,
"sizeId": "medium",
"colorId": "339",
"totalQuantityAvailable": 9,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871420": {
"id": 5871420,
"sizeId": "large",
"colorId": "339",
"totalQuantityAvailable": 10,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871421": {
"id": 5871421,
"sizeId": "x-large",
"colorId": "339",
"totalQuantityAvailable": 19,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871422": {
"id": 5871422,
"sizeId": "plus-2 x",
"colorId": "339",
"totalQuantityAvailable": 19,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871423": {
"id": 5871423,
"sizeId": "plus-3 x",
"colorId": "339",
"totalQuantityAvailable": 14,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"5871424": {
"id": 5871424,
"sizeId": "plus-4 x",
"colorId": "339",
"totalQuantityAvailable": 23,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "339",
"value": "Mineral",
"sizes": "_s:xx-small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5852438,
5852448,
5852458,
5852468
],
"swatch": "https://n.nordstrommedia.com/id/sr3/8a3eb8e4-e660-42d9-af9e-41d9e85ecb99.jpeg?crop=fit&w=31&h=31"
}
},
"33855448": {
"id": 33855448,
"sizeId": "small",
"colorId": "900",
"totalQuantityAvailable": 319,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855449": {
"id": 33855449,
"sizeId": "medium",
"colorId": "900",
"totalQuantityAvailable": 437,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855450": {
"id": 33855450,
"sizeId": "large",
"colorId": "900",
"totalQuantityAvailable": 626,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855451": {
"id": 33855451,
"sizeId": "x-large",
"colorId": "900",
"totalQuantityAvailable": 273,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855452": {
"id": 33855452,
"sizeId": "plus-2 x",
"colorId": "900",
"totalQuantityAvailable": 105,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855454": {
"id": 33855454,
"sizeId": "xx-small",
"colorId": "900",
"totalQuantityAvailable": 38,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855455": {
"id": 33855455,
"sizeId": "plus-3 x",
"colorId": "900",
"totalQuantityAvailable": 56,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855456": {
"id": 33855456,
"sizeId": "plus-4 x",
"colorId": "900",
"totalQuantityAvailable": 67,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "900",
"value": "Bone",
"sizes": "_s:xx-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5858438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/5835a37b-e5c6-4bb9-9564-02f506ac745c.jpeg?crop=fit&w=31&h=31"
}
},
"33855464": {
"id": 33855464,
"sizeId": "x-small",
"colorId": "003",
"totalQuantityAvailable": 1,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855477": {
"id": 33855477,
"sizeId": "large",
"colorId": "003",
"totalQuantityAvailable": 720,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855478": {
"id": 33855478,
"sizeId": "x-large",
"colorId": "003",
"totalQuantityAvailable": 317,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855479": {
"id": 33855479,
"sizeId": "plus-2 x",
"colorId": "003",
"totalQuantityAvailable": 166,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855480": {
"id": 33855480,
"sizeId": "xx-small",
"colorId": "003",
"totalQuantityAvailable": 22,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855482": {
"id": 33855482,
"sizeId": "plus-3 x",
"colorId": "003",
"totalQuantityAvailable": 11,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"33855483": {
"id": 33855483,
"sizeId": "plus-4 x",
"colorId": "003",
"totalQuantityAvailable": 18,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "003",
"value": "Soot",
"sizes": "_s:xx-small|_s:x-small|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5859438
],
"swatch": "https://n.nordstrommedia.com/id/sr3/51d8a867-3627-4f76-88c2-5f3a6397ad2a.jpeg?crop=fit&w=31&h=31"
}
},
"36450158": {
"id": 36450158,
"sizeId": "medium",
"colorId": "053",
"totalQuantityAvailable": 241,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450160": {
"id": 36450160,
"sizeId": "large",
"colorId": "053",
"totalQuantityAvailable": 137,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450161": {
"id": 36450161,
"sizeId": "x-large",
"colorId": "053",
"totalQuantityAvailable": 69,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450162": {
"id": 36450162,
"sizeId": "plus-2 x",
"colorId": "053",
"totalQuantityAvailable": 40,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450163": {
"id": 36450163,
"sizeId": "xx-small",
"colorId": "053",
"totalQuantityAvailable": 16,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450164": {
"id": 36450164,
"sizeId": "plus-3 x",
"colorId": "053",
"totalQuantityAvailable": 23,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450165": {
"id": 36450165,
"sizeId": "plus-4 x",
"colorId": "053",
"totalQuantityAvailable": 27,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450185": {
"id": 36450185,
"sizeId": "x-small",
"colorId": "053",
"totalQuantityAvailable": 46,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"36450186": {
"id": 36450186,
"sizeId": "small",
"colorId": "053",
"totalQuantityAvailable": 197,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "053",
"value": "Light Heather Grey",
"sizes": "_s:xx-small|_s:x-small|_s:small|_s:medium|_s:large|_s:x-large|_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5847438,
5847448,
5847458,
5847468,
5847478,
5847488
],
"swatch": "https://n.nordstrommedia.com/id/sr3/9c98532a-adb9-4dad-b511-0ac149511a58.jpeg?crop=fit&w=31&h=31"
}
},
"38558224": {
"id": 38558224,
"sizeId": "plus-2 x",
"colorId": "446",
"totalQuantityAvailable": 22,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
},
"38558226": {
"id": 38558226,
"sizeId": "plus-3 x",
"colorId": "446",
"totalQuantityAvailable": 5,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
},
"38558227": {
"id": 38558227,
"sizeId": "plus-4 x",
"colorId": "446",
"totalQuantityAvailable": 7,
"price": {
"currencyCode": "USD",
"units": 48,
"nanos": 0
},
"color": {
"id": "446",
"value": "Kyanite",
"sizes": "_s:plus-2 x|_s:plus-3 x|_s:plus-4 x|",
"mediaIds": [
5854438,
5854448,
5854458
],
"swatch": "https://n.nordstrommedia.com/id/sr3/2f637c12-349c-4506-9021-70e078f2ffe4.jpeg?crop=fit&w=31&h=31"
}
}
}
}
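As a rough illustration of how this dataset could be consumed, the sketch below sums stock per color and turns the units/nanos price into a single decimal. It runs on a trimmed sample shaped like the variants above. The helper names `price_to_decimal` and `stock_by_color` are made up for this example, and treating `nanos` as billionths of a currency unit is an assumption rather than something the dataset states.

```python
from collections import defaultdict
from decimal import Decimal

# trimmed sample shaped like the "variants" map in the dataset above
variants = {
    "5871416": {
        "sizeId": "xx-small",
        "totalQuantityAvailable": 1,
        "price": {"currencyCode": "USD", "units": 48, "nanos": 0},
        "color": {"id": "339", "value": "Mineral"},
    },
    "5871419": {
        "sizeId": "medium",
        "totalQuantityAvailable": 9,
        "price": {"currencyCode": "USD", "units": 48, "nanos": 0},
        "color": {"id": "339", "value": "Mineral"},
    },
    "33855448": {
        "sizeId": "small",
        "totalQuantityAvailable": 319,
        "price": {"currencyCode": "USD", "units": 48, "nanos": 0},
        "color": {"id": "900", "value": "Bone"},
    },
}


def price_to_decimal(price: dict) -> Decimal:
    """Combine whole units and nanos into one decimal amount.

    Assumption: "nanos" holds billionths of a currency unit.
    """
    return Decimal(price["units"]) + Decimal(price["nanos"]) / Decimal(10**9)


def stock_by_color(variants: dict) -> dict:
    """Sum totalQuantityAvailable across all sizes of each color name."""
    totals = defaultdict(int)
    for variant in variants.values():
        totals[variant["color"]["value"]] += variant["totalQuantityAvailable"]
    return dict(totals)


print(stock_by_color(variants))
print(price_to_decimal(variants["5871416"]["price"]))
```

Using `Decimal` instead of `float` avoids rounding surprises when prices are later summed or compared.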
Setup
For this scraper we'll be using the hidden web data scraping approach. We'll be collecting HTML pages and extracting hidden JSON datasets, then parsing them with JSON parsing tools:
- httpx - powerful HTTP client which we'll be using to retrieve the HTML pages.
- parsel - HTML parser which we'll be using to extract hidden JSON datasets.
- nested-lookup - JSON/Dict parser which will help us find specific keys in large JSON datasets.
- jmespath - JSON query engine which we'll be using to reduce JSON datasets to important bits like product prices, images etc. For more see our introduction to parsing JSON with JMESPath.
All of these packages can be installed using Python's pip console command (the http2 extra of httpx is needed because our client enables HTTP/2):
$ pip install "httpx[http2]" parsel jmespath nested-lookup
For Scrapfly users there's also a Scrapfly SDK version of each code example. The SDK can be installed using pip as well:
$ pip install "scrapfly-sdk[all]"
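To confirm the environment is ready before running the scrapers, a small optional check like the one below can help. It is only a sketch: the import names listed are assumptions about how each package is imported, `missing_modules` is a hypothetical helper, and `h2` is included because httpx only supports `http2=True` when that package is present.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# assumed import names for the packages installed above
required = ["httpx", "h2", "parsel", "jmespath", "nested_lookup"]
missing = missing_modules(required)
print("missing:", missing or "none")
```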
Scrape Nordstrom Product Data
Let's start by scraping product data of a single product. For this, let's take a look at an example product page like:
nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/
We could parse the HTML data using CSS selectors or XPath, but since Nordstrom uses the React JavaScript framework to power its website, we can extract the dataset directly from the page source.
If we open the page source and search (ctrl+f) for a unique piece of product text (like the description or title), we can see there's a hidden JSON dataset. In web scraping, this is called hidden web data scraping, so let's take a look at how to do it in Python.
Our scraper process will look something like this:
- Retrieve the HTML page of the product using httpx.
- Find the hidden JSON dataset in the <script> tag using parsel and XPath.
- Load the JSON dataset using json.loads() and find product fields using nested-lookup.
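The extraction step can be tried offline first. The sketch below uses only the standard library, with a regular expression standing in for parsel and XPath, and runs on a made-up, heavily shortened page source. The `window.__INITIAL_CONFIG__ = ...;` assignment form and the sample product values are assumptions for illustration.

```python
import json
import re

# hypothetical, heavily shortened page source; the real page embeds a far larger object
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>window.__INITIAL_CONFIG__ = {"stylesById": {"6665302": {"productTitle": "Example Sweatshirt"}}};</script>
</body></html>
"""


def find_hidden_data(html: str) -> dict:
    """Find the script body that mentions __INITIAL_CONFIG__ and decode its JSON."""
    for body in re.findall(r"<script[^>]*>(.*?)</script>", html, re.DOTALL):
        if "__INITIAL_CONFIG__" in body:
            # keep everything right of the first "=" and drop the trailing ";"
            raw = body.split("=", 1)[-1].strip().strip(";")
            return json.loads(raw)
    raise ValueError("hidden dataset not found")


data = find_hidden_data(SAMPLE_HTML)
print(data["stylesById"]["6665302"]["productTitle"])
```

A regular expression is enough for a controlled sample like this one, while a real HTML parser such as parsel is the sturdier choice for live pages.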
In Python this scraper will look like this:
Python
ScrapFly
import asyncio
import json
import httpx
from parsel import Selector
from nested_lookup import nested_lookup
# setup httpx client with http2 enabled and browser-like headers to avoid being blocked:
client = httpx.AsyncClient(
http2=True,
headers={
"User-Agent": "Mozilla/4.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=-1.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
}
)
def find_hidden_data(html) -> dict:
"""extract hidden web cache from page html"""
# use XPath to find script tag with data
data = Selector(html).xpath("//script[contains(.,' __INITIAL_CONFIG__')]/text()").get()
data = data.split("=", 1)[-1].strip().strip(";")
data = json.loads(data)
return data
async def scrape_product(url: str):
"""scrape Nordstrom.com product page for product data"""
response = await client.get(url)
# find all hidden dataset:
data = find_hidden_data(response.text)
# extract only product data from the dataset
# find first key "stylesById" and take first value (which is the current product)
product = nested_lookup("stylesById", data)
product = list(product[0].values())[0]
return product
# example scrape run:
print(asyncio.run(scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")))
import asyncio
import json

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find script tag with data
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


async def scrape_product(url: str):
    """scrape Nordstrom.com product page for product data"""
    # async_scrape is the awaitable counterpart of client.scrape
    response = await client.async_scrape(ScrapeConfig(
        url=url,
        asp=True,  # enable anti-scraping-protection bypass
        cache=True,  # enable cache while we develop
        debug=True,  # enable debug mode while we develop
    ))
    # find all hidden dataset (the helper expects the whole result object, not its text):
    data = find_hidden_data(response)
    # extract only product data from the dataset
    # find first key "stylesById" and take first value (which is the current product)
    product = nested_lookup("stylesById", data)
    product = list(product[0].values())[0]
    return product


# example scrape run:
print(asyncio.run(scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")))
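To make the parsing step easier to follow, here is a minimal standard-library sketch of what happens to the script text once it has been located. The sample script string and its viewData key are made up for illustration (the real payload is far larger), and find_key is only a simplified stand-in for what nested_lookup does, not the library's implementation:

```python
import json

# Made-up, heavily trimmed script text; "viewData" is a hypothetical wrapper key.
SCRIPT_TEXT = (
    'window.__INITIAL_CONFIG__ = {"viewData": {"stylesById": '
    '{"6665302": {"id": 6665302, "productTitle": "Example Sweatshirt"}}}};'
)


def parse_assignment(script_text: str) -> dict:
    """Keep everything after the first "=", trim whitespace and the trailing ";", decode JSON."""
    payload = script_text.split("=", 1)[-1].strip().strip(";")
    return json.loads(payload)


def find_key(document, key) -> list:
    """Collect every value stored under `key` at any depth of nested dicts and lists."""
    found = []
    if isinstance(document, dict):
        for name, value in document.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(document, list):
        for item in document:
            found.extend(find_key(item, key))
    return found


data = parse_assignment(SCRIPT_TEXT)
styles = find_key(data, "stylesById")
# the first "stylesById" match maps style ids to products; take its first value
product = list(styles[0].values())[0]
print(product["productTitle"])
```

Searching by key name rather than by a fixed path keeps the scraper working even if the dataset is nested differently from page to page.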
In only a few lines of Python code, we got the entire product dataset on Nordstrom! However, this dataset is huge and can be difficult for our data pipeline to ingest if we want to run analytics on it or store it. So next, let's use JMESPath to reduce the dataset to the most important values like pricing, images and variant data.
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
Parsing with JMESPath
JMESPath is a JSON query language, and since Python dictionaries are equivalent to JSON objects, we can use JMESPath in our Nordstrom data parsing.
We'll be using JMESPath's data reshaping feature, which allows specifying a key map to reduce a dataset. For example:
import jmespath

data = {
    "id": "123456",
    "productTitle": "Product Title",
    "type": "sweater",
    "unimportant": "foobar",
    "photos": {
        "desktop": "http://example.com/photo.jpg",
        "mobile": "http://example.com/photo-small.jpg",
    },
}
# jmespath search takes a query string and a data object.
# here we use `{}` remapping feature to rename keys of the original dataset
reduced = jmespath.search(
    """{
    id: id,
    title: productTitle,
    type: type,
    photo: photos.desktop
    }""",
    data,
)
print(reduced)
{"id": "123456", "title": "Product Title", "type": "sweater", "photo": "http://example.com/photo.jpg"}
This powerful tool allows us to easily reshape scraped datasets. So, let's use it to reshape our Nordstrom product dataset we just scraped:
import jmespath


def parse_product(data: dict) -> dict:
    # parse product basic data like id, name, features etc.
    product = jmespath.search(
        """{
        id: id,
        title: productTitle,
        type: productTypeName,
        typeParent: productTypeParentName,
        ageGroups: ageGroups,
        reviewAverageRating: reviewAverageRating,
        numberOfReviews: numberOfReviews,
        brand: brand,
        description: sellingStatement,
        features: features,
        gender: gender,
        isAvailable: isAvailable
        }""",
        data,
    )
    # product variants have their own colors, prices and photos:
    prices_by_sku = data["price"]["bySkuId"]
    colors_by_id = data["filters"]["color"]["byId"]
    product["media"] = {}
    for media_id, media in data["styleMedia"]["byId"].items():
        product["media"][media_id] = jmespath.search(
            """{
            id: id,
            colorId: colorId,
            name: colorName,
            url: imageMediaUri.largeDesktop
            }""",
            media,
        )
    # Each product has SKUs (Stock Keeping Units) which are the actual variants:
    product["variants"] = {}
    for sku, sku_data in data["skus"]["byId"].items():
        # get basic variant data
        parsed = jmespath.search(
            """{
            id: id,
            sizeId: sizeId,
            colorId: colorId,
            totalQuantityAvailable: totalQuantityAvailable
            }""",
            sku_data,
        )
        # get variant price from the per-SKU price dataset
        parsed["price"] = prices_by_sku[sku]["regular"]["price"]
        # get variant color data
        parsed["color"] = jmespath.search(
            """{
            id: id,
            value: value,
            sizes: isAvailableWith,
            mediaIds: styleMediaIds,
            swatch: swatchMedia.desktop
            }""",
            colors_by_id[parsed["colorId"]],
        )
        product["variants"][sku] = parsed
    return product
This might appear complex but all we did is map the original dataset keys to new keys using JMESPath. Now our scraper can scrape nice and tidy product datasets that we can easily ingest into our data pipelines!
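As a rough illustration of that ingestion step, the sketch below flattens the nested variants of a parsed product into one row per SKU and writes them out as CSV. The sample product is made up and only mirrors the shape returned by parse_product; the variants_to_rows helper and the column names are assumptions chosen for this example:

```python
import csv
import io

# Made-up sample that only mirrors the shape returned by parse_product().
sample_product = {
    "id": 6665302,
    "title": "Example Sweatshirt",
    "variants": {
        "sku-1": {
            "id": "sku-1", "sizeId": "small", "colorId": "001",
            "totalQuantityAvailable": 3, "price": 70.0,
            "color": {"id": "001", "value": "Black"},
        },
        "sku-2": {
            "id": "sku-2", "sizeId": "medium", "colorId": "100",
            "totalQuantityAvailable": 0, "price": 70.0,
            "color": {"id": "100", "value": "White"},
        },
    },
}

COLUMNS = ["product_id", "title", "sku", "size", "color", "price", "in_stock"]


def variants_to_rows(product: dict) -> list:
    """Turn the nested variants mapping into one flat dict per SKU."""
    rows = []
    for sku, variant in product["variants"].items():
        rows.append({
            "product_id": product["id"],
            "title": product["title"],
            "sku": sku,
            "size": variant["sizeId"],
            "color": variant["color"]["value"],
            "price": variant["price"],
            "in_stock": variant["totalQuantityAvailable"] > 0,
        })
    return rows


rows = variants_to_rows(sample_product)
# write to an in-memory buffer here; swap in open("variants.csv", "w", newline="") for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Flat rows like these load directly into spreadsheets or SQL tables, which is usually easier than storing the nested dataset as is.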
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
Finding Products
Now that we can scrape individual Nordstrom products, we need to find the product URLs to scrape. We could find the desired products and input their URLs manually, but to scale up our scraper we can scrape product categories or search results instead.
For this, we'll be using the same hidden data scraping approach as each category or search result page contains a hidden dataset with product preview data (like price, title, image, etc.) and product page URLs.
For example, let's take a look at one of Nordstrom search pages:
nordstrom.com/sr?origin=keywordsearch&keyword=indigo
We can see that the results of every search (or category) are split across several pages. So, we need to scrape pagination as well.
To scrape this, we'll use an approach very similar to the one we used for product pages:
- Scrape the first search/category page HTML.
- Find hidden web data using parsel and XPath.
- Extract product preview data and pagination info from the hidden dataset using nested-lookup.
- Calculate the total number of pages and scrape them.
Let's see how this works in Python:
Python
ScrapFly
import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

import httpx
from nested_lookup import nested_lookup
from parsel import Selector

# setup httpx client with http2 enabled and browser-like headers to avoid being blocked:
client = httpx.AsyncClient(
    http2=True,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)


def find_hidden_data(html) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find script tag with data
    data = Selector(html).xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return url[: url.find("?")] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await client.get(url)
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page.text)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [client.get(update_url_parameter(url, page=page)) for page in range(2, total_pages + 1)]
    for response in asyncio.as_completed(_other_pages):
        response = await response
        # skip pages that did not return a successful response
        if response.status_code != 200:
            print(f"!!! scrape page {response.url} got blocked; skipping")
            continue
        data = find_hidden_data(response.text)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


# example scrape run for search of "indigo" keyword with max 2 pages:
print(asyncio.run(scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo", max_pages=2)))
import asyncio
import json
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    # use XPath to find script tag with data
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return url[: url.find("?")] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            country="US",
            asp=True,
            debug=True,
            cache=True,
        )
    )
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [
        ScrapeConfig(
            url=update_url_parameter(url, page=page),
            country="US",
            asp=True,
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(_other_pages):
        data = find_hidden_data(result)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


# example scrape run for search of "indigo" keyword with max 2 pages:
print(asyncio.run(scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo", max_pages=2)))
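One detail worth noting in the pagination step: the update_url_parameter helper cuts the URL at the first "?", so it assumes the URL already has a query string. As a hedged alternative, here is a sketch of a more defensive variant built only on urllib.parse; the name set_url_parameters is hypothetical and not part of the scraper above:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def set_url_parameters(url: str, **params) -> str:
    """Return `url` with the given query parameters added or replaced."""
    parts = urlparse(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    # parse_qs returns lists of values, so wrap the new values the same way
    query.update({key: [str(value)] for key, value in params.items()})
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


base = "https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo"
for page in range(2, 4):
    print(set_url_parameters(base, page=page))
# https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo&page=2
# https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo&page=3
```

Because the URL is rebuilt from its parsed parts, this version also handles URLs without a query string and replaces an existing page value instead of appending a duplicate.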
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
Avoiding Blocking with ScrapFly
Although Nordstrom is relatively easy to scrape, scaling up our scrapers beyond the few requests of this guide can still lead to blocking, so we'll need proxies or other tools to avoid it.
Scrapfly service does the heavy lifting for you!
Scrapfly API is a perfect tool for scaling up web scrapers and avoiding being blocked. It's a drop-in replacement for the tools we used in this guide and comes with scraper power-up features like:
- Millions of Residential Proxies
- Anti Scraping Protection bypass
- JavaScript rendering and headless cloud browsers
- Web dashboard for monitoring and managing scrapers
All these tools can be easily accessed through the Python SDK:
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo",
    # enable scraper blocking service bypass
    asp=True,
    # optional - render javascript using headless browsers:
    render_js=True,
))
print(result.content)
For more on web scraping Nordstrom with ScrapFly check out the Full Scraper Code section.
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
FAQ
To wrap this article up let's take a look at some frequently asked questions about scraping Nordstrom:
Is it legal to scrape Nordstrom?
Yes. Public data on Nordstrom is perfectly legal to scrape. However, attention should be paid to scraping speeds and scraping of user reviews as they might contain copyrighted data like images which might require permission to store depending on the country.
Can Nordstrom be crawled?
Yes. Like many e-commerce websites, Nordstrom lends itself to web crawling as it has many product references throughout the website. Note that crawling is significantly more resource intensive than the direct web scraping we've covered in this tutorial, so it's not recommended. Related: What's the difference between Web Scraping and Crawling?
Summary
In this web scraping guide we've taken a look at how to scrape Nordstrom - a popular fashion e-commerce store.
For this, we used Python with httpx, parsel, nested-lookup and jmespath and the hidden web data scraping approach. We've collected HTML pages and extracted hidden React framework data to find product data fields with just a few lines of Python code.
To avoid blocking, we've taken a look at ScrapFly - a web scraping API that can be used to scale up web scrapers and avoid being blocked. Try it out for free!
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
<!--kg-card-end: markdown--><!--kg-card-begin: markdown-->
Full Scraper Code
Here's the full Nordstrom scraper using Python and Scrapfly Python SDK:
This code should only be used as a reference. To scrape data from Nordstrom at scale you'll need to adjust it to your preferences and environment.
import asyncio
import os
import json
from pathlib import Path
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

from nested_lookup import nested_lookup
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
import jmespath

scrapfly = ScrapflyClient(key=os.environ["SCRAPFLY_KEY"], max_concurrency=10)


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden web cache from page html"""
    data = result.selector.xpath("//script[contains(.,'__INITIAL_CONFIG__')]/text()").get()
    data = data.split("=", 1)[-1].strip().strip(";")
    data = json.loads(data)
    return data


def parse_product(data: dict) -> dict:
    # parse product basic data like id, name, features etc.
    product = jmespath.search(
        """{
        id: id,
        title: productTitle,
        type: productTypeName,
        typeParent: productTypeParentName,
        ageGroups: ageGroups,
        reviewAverageRating: reviewAverageRating,
        numberOfReviews: numberOfReviews,
        brand: brand,
        description: sellingStatement,
        features: features,
        gender: gender,
        isAvailable: isAvailable
        }""",
        data,
    )
    # product variants have their own colors, prices and photos:
    prices_by_sku = data["price"]["bySkuId"]
    colors_by_id = data["filters"]["color"]["byId"]
    product["media"] = {}
    for media_id, media in data["styleMedia"]["byId"].items():
        product["media"][media_id] = jmespath.search(
            """{
            id: id,
            colorId: colorId,
            name: colorName,
            url: imageMediaUri.largeDesktop
            }""",
            media,
        )
    # Each product has SKUs (Stock Keeping Units) which are the actual variants:
    product["variants"] = {}
    for sku, sku_data in data["skus"]["byId"].items():
        # get basic variant data
        parsed = jmespath.search(
            """{
            id: id,
            sizeId: sizeId,
            colorId: colorId,
            totalQuantityAvailable: totalQuantityAvailable
            }""",
            sku_data,
        )
        # get variant price from the per-SKU price dataset
        parsed["price"] = prices_by_sku[sku]["regular"]["price"]
        # get variant color data
        parsed["color"] = jmespath.search(
            """{
            id: id,
            value: value,
            sizes: isAvailableWith,
            mediaIds: styleMediaIds,
            swatch: swatchMedia.desktop
            }""",
            colors_by_id[parsed["colorId"]],
        )
        product["variants"][sku] = parsed
    return product


async def scrape_product(url: str) -> dict:
    """scrape a single Nordstrom product page for product data"""
    result = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            cache=True,
        )
    )
    data = find_hidden_data(result)
    # find first key "stylesById" and take first value (which is the current product)
    product = nested_lookup("stylesById", data)
    product = list(product[0].values())[0]
    return parse_product(product)


def update_url_parameter(url, **params):
    """update url query parameter of an url with new values"""
    current_params = parse_qs(urlparse(url).query)
    updated_query_params = urlencode({**current_params, **params}, doseq=True)
    return url[: url.find("?")] + "?" + updated_query_params


async def scrape_search(url: str, max_pages: int = 10) -> List[Dict]:
    """Scrape Nordstrom search or category url for product preview data"""
    print(f"scraping first search page: {url}")
    first_page = await scrapfly.async_scrape(
        ScrapeConfig(
            url=url,
            asp=True,
            cache=True,
        )
    )
    # parse first page for product search data and total amount of pages:
    data = find_hidden_data(first_page)
    _first_page_results = nested_lookup("productResults", data)[0]
    products = list(_first_page_results["productsById"].values())
    paging_info = _first_page_results["query"]
    total_pages = paging_info["pageCount"]
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # then scrape other pages concurrently:
    print(f"  scraping remaining {total_pages - 1} search pages")
    _other_pages = [
        ScrapeConfig(
            url=update_url_parameter(url, page=page),
            country="US",
            asp=True,
        )
        for page in range(2, total_pages + 1)
    ]
    async for result in scrapfly.concurrent_scrape(_other_pages):
        data = find_hidden_data(result)
        data = nested_lookup("productResults", data)[0]
        products.extend(list(data["productsById"].values()))
    return products


async def example_run():
    """
    this example run will scrape example product and 2 pages of search results and
    save them to ./results/product.json and ./results/search.json respectively
    """
    out_dir = Path(__file__).parent / "results"
    out_dir.mkdir(exist_ok=True)

    product = await scrape_product("https://www.nordstrom.com/s/nike-phoenix-fleece-crewneck-sweatshirt/6665302")
    out_dir.joinpath("product.json").write_text(json.dumps(product, indent=2, ensure_ascii=False))

    search = await scrape_search("https://www.nordstrom.com/sr?origin=keywordsearch&keyword=indigo", max_pages=2)
    out_dir.joinpath("search.json").write_text(json.dumps(search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())
<!--kg-card-end: markdown-->