Scraping web microformats is one of the easiest ways to scrape public data. In this tutorial, we'll take a look at how to use this web scraping technique in Python using the extruct library.
Web microformats are a set of standardized metadata formats that can be embedded in HTML pages to provide structured data about various types of content, such as products, people, organizations, and more.
By scraping microformats, we can easily collect public data in a predictable format, as microformats often follow strict schema definitions such as those defined by schema.org.
Today we'll cover common microformat types and how to scrape them, then walk through an example project that scrapes Etsy.com.
What are Microformats?
Microformats were created to standardize the representation of important web data objects so that they are machine-readable. Most commonly, they are used to provide structured data to search engines, social networks and other communication channels, for example to create preview cards for web pages.
In practice, most people are familiar with microformats through website preview features in social media or communication platforms (like Slack). For example, when you post a link, the hosting server scrapes the page's microformat data to generate a small website preview.
The main downside of microformats is that they usually don't contain the whole dataset available on the page. When web scraping, we might need to supplement the microformat parser with additional HTML parsing, using tools like beautifulsoup or CSS selector and XPath parsers.
Setup
To scrape microformats we'll be using Python with the extruct library, which uses HTML parsing tools to extract microformat data.
It can be installed using the pip install terminal command:
$ pip install extruct
Schema.org
Schema.org is a collaborative initiative between major search engines and other tech industry leaders with the aim to provide standard data types for web content.
Schema.org contains schemas (data object rules and definitions) for popular data objects like People, Websites, Articles, Companies etc. These standard, static schema types simplify web automation.
These schemas were created for microformats though not all microformats have to use schema.org object definitions.
Let's take a look at microformat types by exploring a schema.org/Person object next.
Microformat Types
There are several microformat data type standards used across the web. They are very similar and only differ in markup and use case.
Let's take a look at popular microformat types and how to extract them using extruct and Python.
JSON-LD
JSON-LD is the most popular modern microformat. It uses embedded JSON documents that directly represent schema.org objects.
Here's a JSON-LD markup example and how to parse it using extruct:
html = """
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Person",
"name": "John Doe",
"image": "johndoe.jpg",
"jobTitle": "Software Engineer",
"telephone": "(555) 555-5555",
"email": "john.doe@example.com",
"address": {
"@type": "PostalAddress",
"streetAddress": "123 Main St",
"addressLocality": "Anytown",
"addressRegion": "CA",
"postalCode": "12345"
}
}
</script>
"""
from extruct import JsonLdExtractor
data = JsonLdExtractor().extract(html)
print(data)
[
{
"@context": "http://schema.org",
"@type": "Person",
"name": "John Doe",
"image": "johndoe.jpg",
"jobTitle": "Software Engineer",
"telephone": "(555) 555-5555",
"email": "john.doe@example.com",
"address": {
"@type": "PostalAddress",
"streetAddress": "123 Main St",
"addressLocality": "Anytown",
"addressRegion": "CA",
"postalCode": "12345",
},
}
]
This is an example of a Person object (indicated by the meta field @type) and we can find schema details on schema.org/Person.
JSON-LD is easy to implement and use, but since it's a separate dataset from the visible data on the page, it can differ from what the page actually displays.
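To make the "separate dataset" point concrete, here is a minimal standard-library sketch of what JSON-LD extraction boils down to: locate `<script type="application/ld+json">` elements and decode their contents as JSON. The names `JsonLdScriptParser` and `extract_json_ld` are hypothetical and chosen for illustration; this is not extruct's actual implementation, which likely handles more edge cases.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.documents.append("".join(self._buffer))


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in an HTML string."""
    parser = JsonLdScriptParser()
    parser.feed(html)
    parser.close()
    results = []
    for raw in parser.documents:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # skip malformed blocks instead of failing the whole page
            continue
    return results


# hypothetical page: the visible heading and the JSON-LD name disagree
sample = """
<html><body>
<h1>Jane Doe</h1>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "John Doe"}
</script>
</body></html>
"""
people = extract_json_ld(sample)
print(people)
```

Note that the sample page shows "Jane Doe" in its heading while the JSON-LD block says "John Doe": nothing forces the two to agree, which is exactly the mismatch risk described above.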
Microdata
Microdata is the second most popular microformat and it uses HTML attributes to mark up microformat data fields. This microformat is great for web scraping as it covers visible page data which means we get exactly what we see on the page.
Here's a microdata markup example and how to parse it using extruct:
html = """
<div itemscope itemtype="http://schema.org/Person">
<h1 itemprop="name">John Doe</h1>
<img itemprop="image" src="johndoe.jpg" alt="John Doe">
<p itemprop="jobTitle">Software Engineer</p>
<p itemprop="telephone">(555) 555-5555</p>
<p itemprop="email"><a href="mailto:john.doe@example.com">john.doe@example.com</a></p>
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<p><span itemprop="streetAddress">123 Main St</span>, <span itemprop="addressLocality">Anytown</span>, <span itemprop="addressRegion">CA</span> <span itemprop="postalCode">12345</span></p>
</div>
</div>
"""
from extruct import MicrodataExtractor
data = MicrodataExtractor().extract(html)
print(data)
[
{
"type": "http://schema.org/Person",
"properties": {
"name": "John Doe",
"image": "johndoe.jpg",
"jobTitle": "Software Engineer",
"telephone": "(555) 555-5555",
"email": "john.doe@example.com",
"address": {
"type": "http://schema.org/PostalAddress",
"properties": {
"streetAddress": "123 Main St",
"addressLocality": "Anytown",
"addressRegion": "CA",
"postalCode": "12345",
},
},
},
}
]
Microdata uses the itemprop attribute to specify the field key and the inner HTML data as the value. This format is a bit more complex but it matches the real source more closely as it's the same data as displayed on the web page.
🧙 Microdata is often marked up using schema.org schema but because it's so flexible other markups that do not match schema.org schema are possible as well.
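The nested type/properties structure shown in the output above can be awkward to work with directly. As a rough sketch, a small recursive helper can collapse it into a plain dictionary. The function name `flatten_microdata` is hypothetical, and the helper assumes the output shape shown above; it discards the type field, so keep it if you need to tell objects apart.

```python
def flatten_microdata(item):
    """Turn {"type": ..., "properties": {...}} into a plain dict, recursively."""
    flat = {}
    for key, value in item.get("properties", {}).items():
        if isinstance(value, dict) and "properties" in value:
            # nested item, e.g. a PostalAddress inside a Person
            flat[key] = flatten_microdata(value)
        elif isinstance(value, list):
            # repeated properties arrive as lists; flatten nested items inside
            flat[key] = [
                flatten_microdata(v) if isinstance(v, dict) and "properties" in v else v
                for v in value
            ]
        else:
            flat[key] = value
    return flat


# trimmed copy of the microdata output shown above
microdata_item = {
    "type": "http://schema.org/Person",
    "properties": {
        "name": "John Doe",
        "jobTitle": "Software Engineer",
        "address": {
            "type": "http://schema.org/PostalAddress",
            "properties": {"addressLocality": "Anytown", "postalCode": "12345"},
        },
    },
}
person = flatten_microdata(microdata_item)
print(person)
```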
RDFa
RDFa is similar to Microdata and uses HTML attribute markup to provide additional microformat data. It's almost identical to the Microdata format and shares the same advantages of marking up data visible on the page.
Here's an RDFa markup example and how to parse it using extruct:
html = """
<div vocab="http://schema.org/" typeof="Person">
<h1 property="name">John Doe</h1>
<img property="image" src="johndoe.jpg" alt="John Doe"/>
<p property="jobTitle">Software Engineer</p>
<p property="telephone">(555) 555-5555</p>
<p property="email"><a href="mailto:john.doe@example.com">john.doe@example.com</a></p>
<div property="address" typeof="PostalAddress">
<p><span property="streetAddress">123 Main St</span>, <span property="addressLocality">Anytown</span>, <span property="addressRegion">CA</span> <span property="postalCode">12345</span></p>
</div>
</div>
"""
from extruct import RDFaExtractor
data = RDFaExtractor().extract(html)
print(data)
[
{"@id": "", "http://www.w3.org/ns/rdfa#usesVocabulary": [{"@id": "http://schema.org/"}]},
{
"@id": "_:Naa49dc28a80f47119694913cd98fc5dc",
"@type": ["http://schema.org/Person"],
"http://schema.org/address": [{"@id": "_:Nb8c8aea8ce7d434989a88308e1a12e7e"}],
"http://schema.org/email": [{"@value": "john.doe@example.com"}],
"http://schema.org/image": [{"@id": "johndoe.jpg"}],
"http://schema.org/jobTitle": [{"@value": "Software Engineer"}],
"http://schema.org/name": [{"@value": "John Doe"}],
"http://schema.org/telephone": [{"@value": "(555) 555-5555"}],
},
{
"@id": "_:Nb8c8aea8ce7d434989a88308e1a12e7e",
"@type": ["http://schema.org/PostalAddress"],
"http://schema.org/addressLocality": [{"@value": "Anytown"}],
"http://schema.org/addressRegion": [{"@value": "CA"}],
"http://schema.org/postalCode": [{"@value": "12345"}],
"http://schema.org/streetAddress": [{"@value": "123 Main St"}],
},
]
RDFa structures are very similar to Microdata and are great for web scraping as they mark up the real web source, though extra data cleanup is necessary as the output dataset is a bit convoluted.
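As an illustration of the cleanup involved, the sketch below unwraps the @value wrappers, strips the schema.org prefix from keys and inlines nodes that are referenced by their @id. The function name `simplify_rdfa` is hypothetical, the short blank-node ids in the sample are stand-ins for the generated ones, and the helper assumes the output shape shown above with no circular references between nodes.

```python
def strip_prefix(text, prefix):
    """Remove prefix from text if present."""
    return text[len(prefix):] if text.startswith(prefix) else text


def simplify_rdfa(nodes, prefix="http://schema.org/"):
    """Resolve node references and unwrap values in RDFa-style output."""
    by_id = {node["@id"]: node for node in nodes if node.get("@id")}

    def simplify_value(value):
        if "@value" in value:
            return value["@value"]
        ref = value.get("@id")
        # a reference to another node in the same document: inline it
        if ref in by_id:
            return simplify_node(by_id[ref])
        return ref

    def simplify_node(node):
        result = {}
        for key, values in node.items():
            if key == "@id":
                continue
            if key == "@type":
                result["type"] = [strip_prefix(t, prefix) for t in values]
                continue
            if not key.startswith(prefix):
                continue  # ignore properties outside the chosen vocabulary
            simplified = [simplify_value(v) for v in values]
            result[strip_prefix(key, prefix)] = (
                simplified[0] if len(simplified) == 1 else simplified
            )
        return result

    # ids that some other node points to; those nodes get inlined, not listed
    referenced = {
        v["@id"]
        for node in nodes
        for key, values in node.items()
        if key not in ("@id", "@type")
        for v in values
        if isinstance(v, dict) and "@id" in v
    }
    return [
        simplify_node(node)
        for node in nodes
        if "@type" in node and node.get("@id") not in referenced
    ]


# trimmed copy of the RDFa output above, with shortened placeholder ids
rdfa_nodes = [
    {"@id": "", "http://www.w3.org/ns/rdfa#usesVocabulary": [{"@id": "http://schema.org/"}]},
    {
        "@id": "_:person",
        "@type": ["http://schema.org/Person"],
        "http://schema.org/name": [{"@value": "John Doe"}],
        "http://schema.org/image": [{"@id": "johndoe.jpg"}],
        "http://schema.org/address": [{"@id": "_:address"}],
    },
    {
        "@id": "_:address",
        "@type": ["http://schema.org/PostalAddress"],
        "http://schema.org/addressLocality": [{"@value": "Anytown"}],
    },
]
cleaned = simplify_rdfa(rdfa_nodes)
print(cleaned)
```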
OpenGraph
Facebook's Open Graph is another popular microformat, mostly used to generate preview cards in social media posts. It defines its own vocabulary rather than using schema.org objects, and it's rarely used to mark up anything beyond basic website preview information.
Here's an Open Graph markup example and how to parse it using extruct:
html = """
<head>
<meta property="og:type" content="profile" />
<meta property="og:title" content="John Doe" />
<meta property="og:image" content="johndoe.jpg" />
<meta property="og:description" content="Software Engineer" />
<meta property="og:phone_number" content="(555) 555-5555" />
<meta property="og:email" content="john.doe@example.com" />
<meta property="og:street-address" content="123 Main St" />
<meta property="og:locality" content="Anytown" />
<meta property="og:region" content="CA" />
<meta property="og:postal-code" content="12345" />
<meta property="og:country-name" content="USA" />
</head>
"""
from extruct import OpenGraphExtractor
data = OpenGraphExtractor().extract(html)
print(data)
[
{
"namespace": {"og": "http://ogp.me/ns#"},
"properties": [
("og:type", "profile"),
("og:title", "John Doe"),
("og:image", "johndoe.jpg"),
("og:description", "Software Engineer"),
("og:phone_number", "(555) 555-5555"),
("og:email", "john.doe@example.com"),
("og:street-address", "123 Main St"),
("og:locality", "Anytown"),
("og:region", "CA"),
("og:postal-code", "12345"),
("og:country-name", "USA"),
],
}
]
Open Graph is similar to JSON-LD in that it's not part of the visible page. This means that Open Graph information can differ from the data presented on the page.
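Because the properties arrive as a list of key/value pairs, a dictionary is usually more convenient for lookups. The sketch below assumes the output shape shown above; the function name `opengraph_to_dict` and the second og:image value are hypothetical. Repeated keys are collected into a list, since a page may declare the same property more than once.

```python
def opengraph_to_dict(item):
    """Collapse Open Graph (key, value) pairs into a dict.

    Keys that appear more than once are collected into a list.
    """
    result = {}
    for key, value in item["properties"]:
        if key not in result:
            result[key] = value
        elif isinstance(result[key], list):
            result[key].append(value)
        else:
            result[key] = [result[key], value]
    return result


og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:type", "profile"),
        ("og:title", "John Doe"),
        ("og:image", "johndoe.jpg"),
        ("og:image", "johndoe-large.jpg"),  # hypothetical second image
    ],
}
preview = opengraph_to_dict(og_item)
print(preview["og:title"], preview["og:image"])
```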
Microformat
Microformat is one of the oldest markup formats and predates schema.org objects. Instead of schema.org types, microformats have their own schema definitions for marking up people, organizations, events, locations, blog posts, products, reviews, resumes, recipes etc.
Here's a microformat markup example and how to parse it using extruct:
html = """
<div class="h-card">
<h1 class="fn">John Doe</h1>
<img class="photo" src="johndoe.jpg" alt="John Doe">
<p class="title">Software Engineer</p>
<p class="tel">(555) 555-5555</p>
<a class="email" href="mailto:john.doe@example.com">john.doe@example.com</a>
<div class="adr">
<span class="street-address">123 Main St</span>,
<span class="locality">Anytown</span>,
<span class="region">CA</span>
<span class="postal-code">12345</span>
</div>
</div>
"""
from extruct import MicroformatExtractor
data = MicroformatExtractor().extract(html)
print(data)
[
{
"type": ["h-card"],
"properties": {
"name": ["John Doe"],
"photo": ["johndoe.jpg"],
"job-title": ["Software Engineer"],
"tel": ["(555) 555-5555"],
"email": ["mailto:john.doe@example.com"],
"adr": [
{
"type": ["h-adr"],
"properties": {"name": ["123 Main St, Anytown, CA 12345"]},
"value": "123 Main St, Anytown, CA 12345",
}
],
},
}
]
Example Scraper
Let's take a look at microformat scraping through an example scraper. We'll be scraping a popular website that uses microformats to mark up its data.
For scraping, we'll be using ScrapFly SDK which will help us retrieve HTML pages without being blocked and extruct to parse microformat data.
All of these libraries can be installed using the pip install command:
$ pip install "scrapfly-sdk[all]" extruct
For our first example let's take a look at scraping Etsy.com - a popular e-commerce website that specializes in handmade and vintage items.
For example, let's take this jewelry product etsy.com/listing/1214112656 and see what we can scrape from it using extruct:
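import json
import os
import extruct
from scrapfly import ScrapflyClient, ScrapeConfig
# the API key is read from the SCRAPFLY_KEY environment variable
scrapfly = ScrapflyClient(os.environ["SCRAPFLY_KEY"])
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.etsy.com/listing/1214112656/",
    asp=True,
))
# run all extruct extractors on the page HTML; the result is keyed by format name
micro_data = extruct.extract(result.content)
print(json.dumps(micro_data, indent=2))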
Example Output
{
"microdata": [],
"json-ld": [
{
"@type": "Product",
"@context": "https://schema.org",
"url": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant",
"name": "9K TINY Solid Yellow Gold Coin Pendant, Gold Disk Necklace, 9k Gold Coin Necklace, Solid Gold Rose Necklace, Christmas Gift for Her",
"sku": "1214112656",
"gtin": "n/a",
"description": "----------------🌸🌸🌸Welcome to MissFlorenceJewelry🌸🌸🌸---------------\nDetails:\n· Material: 9K Solid Yellow Gold\n· Measurement: Pendant approx 6.5*6.5mm. Pendant hole 3.5*1.5mm\n· Please not that a chain is NOT included. If you are interested to get one, check out this chain listing: \nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n· All jewelry is custom handcrafted with Love and Care. ❤️\n· All items are custom made to order, about 2 weeks.\n\n\nShipping :\n· It takes 1-2 business days to ship the item to you, and 7-10 days additionally for the USPS to deliver the package.\n· Packing: The item will be presented in a beautiful box. Complimentary gift wrapping and gift tags available.\n\n\nReturns and Exchanges:\n· I gladly accept returns and exchanges, just contact me within 15 days of delivery.\n· Buyers are responsible for return shipping costs. If the item is not returned in its original condition, the buyer is responsible for any loss in value.\n\n---------------🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸-------------------\n\n· If you can't find the information you need, Please feel free to contact us.😊\n· Thank you so much for your visit and hope you have a happy shopping here.❤️",
"image": [
{
"@type": "ImageObject",
"@context": "https://schema.org",
"author": "MissFlorenceJewelry",
"contentURL": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
"description": null,
"thumbnail": "https://i.etsystatic.com/34276015/c/650/516/41/108/il/1fbd0e/3856672332/il_340x270.3856672332_2867.jpg"
},
{
"@type": "ImageObject",
"@context": "https://schema.org",
"author": "MissFlorenceJewelry",
"contentURL": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg",
"description": null,
"thumbnail": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_340x270.3904169515_9c7r.jpg"
}
],
"category": "Jewelry < Necklaces < Pendants",
"brand": {
"@type": "Brand",
"@context": "https://schema.org",
"name": "MissFlorenceJewelry"
},
"logo": "https://i.etsystatic.com/isla/862c6f/58067961/isla_fullxfull.58067961_aiop800d.jpg?version=0",
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.8",
"reviewCount": 25
},
"offers": {
"@type": "AggregateOffer",
"offerCount": 8,
"lowPrice": "62.00",
"highPrice": "167.00",
"priceCurrency": "USD",
"availability": "https://schema.org/InStock"
},
"review": [
{
"@type": "Review",
"reviewRating": {
"@type": "Rating",
"ratingValue": 5,
"bestRating": 5
},
"datePublished": "2022-11-05",
"reviewBody": "Thank you, just perfect although I would have liked the 14k on the backside, but it’s precious. I will wear it everyday with my other pendants.",
"author": {
"@type": "Person",
"name": "Juanita Bell"
}
}
]
},
"..."
],
"opengraph": [
{
"namespace": {
"og": "http://ogp.me/ns#",
"product": "http://ogp.me/ns/product#"
},
"properties": [
[
"og:title",
"9K TINY Solid Yellow Gold Coin Pendant Gold Disk Necklace 9k - Etsy South Korea"
],
[
"og:description",
"This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023"
],
[
"og:type",
"product"
],
[
"og:url",
"https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share"
],
[
"og:image",
"https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_1080xN.3856672332_2867.jpg"
],
[
"product:price:amount",
"62.00"
],
[
"product:price:currency",
"USD"
]
]
},
"..."
],
"microformat": [],
"rdfa": [
{
"@id": "",
"al:android:app_name": [
{
"@value": "Etsy"
}
],
"al:android:package": [
{
"@value": "com.etsy.android"
}
],
"al:android:url": [
{
"@value": "etsy://listing/1214112656?ref=applinks_android"
}
],
"al:ios:app_name": [
{
"@value": "Etsy"
}
],
"al:ios:app_store_id": [
{
"@value": "477128284"
}
],
"al:ios:url": [
{
"@value": "etsy://listing/1214112656?ref=applinks_ios"
}
],
"http://ogp.me/ns#description": [
{
"@value": "This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023"
}
],
"http://ogp.me/ns#image": [
{
"@value": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_1080xN.3856672332_2867.jpg"
}
],
"http://ogp.me/ns#title": [
{
"@value": "9K TINY Solid Yellow Gold Coin Pendant Gold Disk Necklace 9k - Etsy South Korea"
}
],
"http://ogp.me/ns#type": [
{
"@value": "product"
}
],
"http://ogp.me/ns#url": [
{
"@value": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share"
}
],
"https://www.facebook.com/2008/fbmlapp_id": [
{
"@value": "89186614300"
}
],
"product:price:amount": [
{
"@value": "62.00"
}
],
"product:price:currency": [
{
"@value": "USD"
}
]
},
"..."
],
"dublincore": [
{
"namespaces": {},
"elements": [
{
"name": "description",
"content": "This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023",
"URI": "http://purl.org/dc/elements/1.1/description"
}
],
"terms": []
},
"..."
]
}
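Since the combined result is keyed by format name, a quick way to see which formats a page actually uses is to count the items under each key. This is a small sketch with a hypothetical helper name, `summarize_syntaxes`, run against a trimmed stand-in for the result above (the item contents are placeholders).

```python
def summarize_syntaxes(micro_data):
    """Map each format name to the number of items found, skipping empty ones."""
    return {syntax: len(items) for syntax, items in micro_data.items() if items}


# trimmed stand-in for the extruct.extract() result; item contents are placeholders
micro_data = {
    "microdata": [],
    "json-ld": [{"@type": "Product"}, {"@type": "BreadcrumbList"}],
    "opengraph": [{"namespace": {}, "properties": []}],
    "microformat": [],
    "rdfa": [{"@id": ""}],
    "dublincore": [{"elements": []}],
}
found = summarize_syntaxes(micro_data)
print(found)
```

Once you know which formats matter for a site, extruct.extract() also accepts a syntaxes argument (a list of format names) to run only those extractors.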
We can see that Etsy contains many different formats but when it comes to product data json-ld is a clear winner containing most of the product details: sku, name, price, description and even review metadata:
{
"@type": "Product",
"@context": "https://schema.org",
"url": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant",
"name": "9K TINY Solid Yellow Gold Coin Pendant, Gold Disk Necklace, 9k Gold Coin Necklace, Solid Gold Rose Necklace, Christmas Gift for Her",
"sku": "1214112656",
"gtin": "n/a",
"description": "----------------\ud83c\udf38\ud83c\udf38\ud83c\udf38Welcome to MissFlorenceJewelry\ud83c\udf38\ud83c\udf38\ud83c\udf38---------------\nDetails:\n\u00b7 Material: 9K Solid Yellow Gold\n\u00b7 Measurement: Pendant approx 6.5*6.5mm. Pendant hole 3.5*1.5mm\n\u00b7 Please not that a chain is NOT included. If you are interested to get one, check out this chain listing: \nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n\u00b7 All jewelry is custom handcrafted with Love and Care. \u2764\ufe0f\n\u00b7 All items are custom made to order, about 2 weeks.\n\n\nShipping :\n\u00b7 It takes 1-2 business days to ship the item to you, and 7-10 days additionally for the USPS to deliver the package.\n\u00b7 Packing: The item will be presented in a beautiful box. Complimentary gift wrapping and gift tags available.\n\n\nReturns and Exchanges:\n\u00b7 I gladly accept returns and exchanges, just contact me within 15 days of delivery.\n\u00b7 Buyers are responsible for return shipping costs. If the item is not returned in its original condition, the buyer is responsible for any loss in value.\n\n---------------\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38-------------------\n\n\u00b7 If you can't find the information you need, Please feel free to contact us.\ud83d\ude0a\n\u00b7 Thank you so much for your visit and hope you have a happy shopping here.\u2764\ufe0f",
"image": [
{
"@type": "ImageObject",
"@context": "https://schema.org",
"author": "MissFlorenceJewelry",
"contentURL": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
"description": null,
"thumbnail": "https://i.etsystatic.com/34276015/c/650/516/41/108/il/1fbd0e/3856672332/il_340x270.3856672332_2867.jpg"
},
{
"@type": "ImageObject",
"@context": "https://schema.org",
"author": "MissFlorenceJewelry",
"contentURL": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg",
"description": null,
"thumbnail": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_340x270.3904169515_9c7r.jpg"
}
],
"category": "Jewelry < Necklaces < Pendants",
"brand": {
"@type": "Brand",
"@context": "https://schema.org",
"name": "MissFlorenceJewelry"
},
"logo": "https://i.etsystatic.com/isla/862c6f/58067961/isla_fullxfull.58067961_aiop800d.jpg?version=0",
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.8",
"reviewCount": 25
},
"offers": {
"@type": "AggregateOffer",
"offerCount": 8,
"lowPrice": "62.00",
"highPrice": "167.00",
"priceCurrency": "USD",
"availability": "https://schema.org/InStock"
},
"review": [
{
"@type": "Review",
"reviewRating": {
"@type": "Rating",
"ratingValue": 5,
"bestRating": 5
},
"datePublished": "2022-11-05",
"reviewBody": "Thank you, just perfect although I would have liked the 14k on the backside, but it\u2019s precious. I will wear it everyday with my other pendants.",
"author": {
"@type": "Person",
"name": "Juanita Bell"
}
}
]
}
With a bit of data flattening code we can use extruct to extract beautiful datasets in just a few lines of code:
import json
import extruct
from scrapfly import ScrapflyClient, ScrapeConfig
scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(url="https://www.etsy.com/listing/1214112656/"))
micro_data = extruct.extract(result.content)
# pick the first JSON-LD object typed as Product (.get() tolerates items without @type)
product = next(data for data in micro_data['json-ld'] if data.get('@type') == "Product")
parsed = {
# copy basic fields over
"url": product["url"],
"name": product["name"],
"sku": product["sku"],
"description": product["description"],
"category": product["category"],
# flatten complex fields:
"store": product["brand"]["name"],
"review_count": product["aggregateRating"]["reviewCount"],
"review_avg": product["aggregateRating"]["ratingValue"],
"price_min": product["offers"]["lowPrice"],
"price_max": product["offers"]["highPrice"],
"images": [img['contentURL'] for img in product['image']]
}
print(json.dumps(parsed, indent=2))
Which will output:
{
"url": "https://www.etsy.com/es/listing/1214112656/colgante-de-monedas-de-oro-amarillo",
"name": "Colgante de monedas de oro amarillo s\u00f3lido 9K TINY, collar de disco de oro, collar de monedas de oro de 9k, collar de rosas de oro macizo, regalo de Navidad para ella",
"sku": "1214112656",
"description": "----------------\ud83c\udf38\ud83c\udf38\ud83c\udf38Bienvenido a MissFlorenceJewelry\ud83c\udf38\ud83c\udf38\ud83c\udf38---------------\nDetalles:\n\u00b7 Material: oro amarillo macizo de 9 quilates\n\u00b7 Medida: Colgante aprox 6.5 * 6.5mm. Agujero colgante 3.5 * 1.5mm\n\u00b7 Por favor, tenga en cuenta que una cadena NO est\u00e1 incluida. Si est\u00e1 interesado en obtener uno, consulte esta lista de cadenas:\nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n\u00b7 Todas las joyas est\u00e1n hechas a mano con amor y cuidado. \u2764\ufe0f\n\u00b7 Todos los art\u00edculos est\u00e1n hechos a medida a pedido, aproximadamente 2 semanas.\n\n\nNaviero:\n\u00b7 Se tarda de 1 a 2 d\u00edas h\u00e1biles en enviarle el art\u00edculo, y de 7 a 10 d\u00edas adicionales para que el USPS entregue el paquete.\n\u00b7 Embalaje: El art\u00edculo se presentar\u00e1 en una hermosa caja. Envoltura de regalo de cortes\u00eda y etiquetas de regalo disponibles.\n\n\nDevoluciones y cambios:\n\u00b7 Con mucho gusto acepto devoluciones y cambios, solo cont\u00e1ctame dentro de los 15 d\u00edas posteriores a la entrega.\n\u00b7 Los compradores son responsables de los gastos de env\u00edo de devoluci\u00f3n. Si el art\u00edculo no se devuelve en su estado original, el comprador es responsable de cualquier p\u00e9rdida de valor.\n\n---------------\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38-------------------\n\n\u00b7 Si no puede encontrar la informaci\u00f3n que necesita, no dude en contactarnos. \ud83d\ude0a\n\u00b7 Muchas gracias por su visita y espero que tenga una feliz compra aqu\u00ed. \u2764\ufe0f",
"category": "Joyer\u00eda < Collares < Colgantes",
"store": "MissFlorenceJewelry",
"review_count": 25,
"review_avg": "4.8",
"price_min": "62.00",
"price_max": "167.00",
"images": [
"https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
"https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg"
]
}
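The flattening code above assumes every field is present and that @type is a plain string. On other pages, @type may be a list, objects may be wrapped in an @graph container, and optional fields like aggregateRating may be missing. A more defensive variant could look like the sketch below; the helper names `find_by_type` and `dig` and the sample data are hypothetical.

```python
def find_by_type(json_ld_items, wanted):
    """Yield JSON-LD objects whose @type matches, including ones inside @graph."""
    for item in json_ld_items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if wanted in types:
            yield item
        # some pages wrap several objects in a single @graph container
        yield from find_by_type(item.get("@graph", []), wanted)


def dig(obj, *keys, default=None):
    """Follow a chain of dict keys, returning default if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# hypothetical JSON-LD items standing in for micro_data["json-ld"]
json_ld = [
    {"@type": "WebSite", "name": "Example Shop"},
    {"@graph": [
        {"@type": ["Product"], "name": "Gold Pendant", "offers": {"lowPrice": "62.00"}},
    ]},
]
product = next(find_by_type(json_ld, "Product"), None)
parsed = {
    "name": dig(product, "name"),
    "price_min": dig(product, "offers", "lowPrice"),
    # missing in the sample, so this becomes None instead of raising KeyError
    "review_count": dig(product, "aggregateRating", "reviewCount"),
}
print(parsed)
```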
FAQ
Before we wrap up our introduction to web scraping microformat data let's take a look at some frequently asked questions:
Is web scraping microformat data legal?
Yes. Microformats were created for easier web scraping and any public page that contains microformat data is perfectly legal to scrape.
What is the most popular microformat encountered in web scraping?
JSON-LD and microdata microformats are most commonly encountered in web scraping. JSON-LD is the most popular microformat on the web though microdata is preferred in web scraping as it often contains higher-quality data.
Summary
In this quick introduction to web scraping microformat data we've taken a quick look at popular microformat types: json-ld, microdata, rdfa, opengraph and microformat. These formats usually contain schema.org type of data which makes web scraping predictable datasets a breeze.
To illustrate this we wrapped up our tutorial with a quick Etsy.com scraper where we scraped product data using just a few lines of code!
Scraping microformats is one of the easiest ways to scrape public data, and while not all of the page data is always available for extraction, it can be a great starting point for any scraper.