Sequence labelling in Python (part 1)

#python #datascience #nlp

Why?

I was looking for a cool project to practice sequence labelling with Python, and I found one: VuelaX, a Mexican website that lists flight offers. Most of the offers follow a simple pattern: Origin - Destination - Price - Extras. Extracting these fields may look like an easy job for a regular expression, but it is not: the titles come in many variations, and it would be tough to cover them all.
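To make the regular expression problem concrete, here is a minimal sketch. The pattern and the `parse_offer` helper are illustrative assumptions, not something taken from the site or from a finished solution, and the last title is a made-up variation rather than a real offer.

```python
import re

# Hypothetical pattern for the simplest shape of title:
# "¡<origin> a <destination> $<price>! <extras>"
SIMPLE_OFFER = re.compile(
    r"^¡(?P<origin>.+?) a (?P<destination>.+?) \$(?P<price>[\d,]+)!\s*(?P<extras>.*)$"
)


def parse_offer(title):
    """Return the captured fields as a dict, or None if the title does not match."""
    match = SIMPLE_OFFER.match(title)
    return match.groupdict() if match else None


# The simplest title is handled well.
simple = parse_offer("¡CUN a Holanda $8,885! Sin escala EE.UU")

# This title matches too, but the origin comes back as "Todo México",
# with a leading "Todo" that we may not want to keep.
noisy = parse_offer(
    "¡Todo México a Pisa, Toscana Italia $12,915! "
    "Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel)"
)

# A made-up variation without the "¡ ... !" wrapper is not matched at all.
missed = parse_offer("CUN a Holanda por $8,885")
```

Each new variation would need another tweak to the pattern, which is the maintenance problem a learned tagger tries to avoid.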

I know it is not ideal to work in a foreign language, but bear with me, as the same techniques could be applied in your language of choice.

The idea is to create a tagger that can extract this information. However, the first task is to decide which pieces of information we want to extract. Following the pattern described above, I will use these labels:

o: Origin
d: Destination
s: Separator token
p: Price
f: Flag
n: Irrelevant token

Text	o	d	p	n
¡CUN a Holanda $8,885! Sin escala EE.UU	CUN	Holanda	8,885	Sin escala EE.UU
¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!)	CDMX	Noruega	10,061	Y agrega 9 noches de hotel por $7,890!
¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel)	México	Pisa, Toscana Italia	12,915	Sin escala EE.UU (Y por $3,975 agrega 13 noches hotel)
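One way to picture the tagger's output is one label per token. The sketch below hand-labels the first offer using the labels listed above. The tokenisation, the choice of `s` for the word "a", and the `extract_fields` helper are all assumptions made for illustration.

```python
# Hand-tokenised version of the first offer (the tokenisation is an assumption).
tokens = ["¡", "CUN", "a", "Holanda", "$", "8,885", "!", "Sin", "escala", "EE.UU"]

# One label per token: o = origin, s = separator, d = destination,
# p = price, n = irrelevant token.
labels = ["n", "o", "s", "d", "n", "p", "n", "n", "n", "n"]


def extract_fields(tokens, labels, wanted=("o", "d", "p")):
    """Join the tokens that share each wanted label, keeping their order."""
    fields = {label: [] for label in wanted}
    for token, label in zip(tokens, labels):
        if label in fields:
            fields[label].append(token)
    return {label: " ".join(parts) for label, parts in fields.items()}


fields = extract_fields(tokens, labels)
print(fields)
```

With labels like these, getting back the origin, destination, and price is only a matter of grouping tokens; the hard part is predicting the labels.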

CRFs in Python

If you are familiar with data science, you will recognise this as a sequence labelling problem. There are various ways to approach it; in this post, I will show you one that uses a statistical model known as Conditional Random Fields (CRFs). I will not go into the theory, so if you want to learn more about CRFs you will need to look elsewhere. Instead, I will show you a practical way to use them with a Python implementation.
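CRF libraries for Python, such as sklearn-crfsuite, typically expect each token to be described by a dictionary of features. This part does not say which implementation or which features are used, so the following is only a sketch of that general idea: the feature names and both helper functions are hypothetical.

```python
def token_features(tokens, i):
    """Describe the token at position i with a few simple, hand-picked features."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_upper": token.isupper(),  # e.g. airport codes such as CUN
        "is_title": token.istitle(),  # e.g. place names such as Holanda
        "has_digit": any(ch.isdigit() for ch in token),
        "is_punct": not any(ch.isalnum() for ch in token),
    }
    # Neighbouring tokens give the model some context about the sequence.
    features["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    features["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return features


def sequence_features(tokens):
    """Build one feature dictionary per token for a whole title."""
    return [token_features(tokens, i) for i in range(len(tokens))]


feats = sequence_features(["¡", "CUN", "a", "Holanda", "$", "8,885", "!"])
```

A CRF would then learn how these features, together with the neighbouring labels, relate to each label in the sequence.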

Getting some data

To start, I scraped the offer titles from the website mentioned above. I will not detail how I did it, since web scraping tutorials are easy to find. If you don't feel like spending time scraping a website, I collected some data in a CSV file that you can access here.
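If you want to load such a file with the standard library, a minimal sketch could look like this. The layout of the real file is not described here, so the single `title` column, the inline sample, and the `read_titles` helper are assumptions; adjust them to the actual file.

```python
import csv
import io

# Stand-in for the real CSV file; the single "title" column is a hypothetical layout.
# Titles are quoted because prices such as "$8,885" contain commas.
SAMPLE_CSV = '''title
"¡CUN a Holanda $8,885! Sin escala EE.UU"
"¡CDMX a Noruega $10,061! (Y agrega 9 noches de hotel por $7,890!)"
'''


def read_titles(handle, column="title"):
    """Read every offer title from an open CSV file-like object."""
    return [row[column] for row in csv.DictReader(handle)]


titles = read_titles(io.StringIO(SAMPLE_CSV))
print(len(titles), "titles loaded")
```

For a file on disk, the same function can be called with a handle opened using `newline=""` and `encoding="utf-8"`, so that characters such as "¡" and "é" are read correctly.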

This tutorial continues in four more parts.

Hopefully you will follow along. If you have any questions, leave a comment here or contact me on Twitter via @io_exception.

Top comments (1)

Rodrigo Cuéllar Hidalgo • Oct 28 '21

There is a little error in your labeled data example: in the first text, CUN is the origin and Holanda is the destination. This happens in all rows...
