Tokenization and Sequencing in TensorFlow [Tutorial]

Bala Priya C
I read, write, and code
Originally published at towardsai.net

In this blog post, we'll learn how to implement tokenization and sequencing, two important text pre-processing steps, in TensorFlow.

Outline

  • Introduction to Tokenizer
  • Understanding Sequencing

Introduction to Tokenizer

Tokenization is the process of splitting text into smaller units such as sentences, words or subwords. In this section, we shall see how we can pre-process a text corpus by tokenizing text into words in TensorFlow. We shall use the Keras API with the TensorFlow backend; the code snippet below shows the necessary imports.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

And voilà! We have all the modules imported. Let’s initialize a list of sentences that we shall tokenize.

sentences = [
'Life is so beautiful',
'Hope keeps us going',
'Let us celebrate life!'
]
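Before handing these sentences to the Tokenizer, it can help to see what word-level tokenization amounts to. The snippet below is a minimal pure-Python sketch, not the Keras implementation: `simple_tokenize` is a hypothetical helper that lowercases the text, replaces punctuation with spaces and splits on whitespace, which only approximates the Tokenizer's default behaviour.

```python
import string


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    # Hypothetical helper for illustration; Keras uses its own filter list.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


for sentence in ["Life is so beautiful", "Hope keeps us going", "Let us celebrate life!"]:
    print(simple_tokenize(sentence))
```

Note how the trailing exclamation mark disappears and 'Life' and 'life' end up as the same token.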

The next step is to instantiate the Tokenizer and call the fit_on_texts method.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

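When the text corpus is very large, we can pass an additional num_words argument to restrict the vocabulary to the most frequent words. For example, tokenizer = Tokenizer(num_words=100) keeps only the 99 most frequent words (those with indices below 100) when texts are later converted to sequences. Note that the tokenizer still records every word it sees during fitting; the limit is applied only at conversion time.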

To know how these tokens have been created and the indices assigned to words, we can use the word_index attribute.

word_index = tokenizer.word_index
print(word_index)
# Output:
{'life': 1, 'us': 2, 'is': 3, 'so': 4, 'beautiful': 5, 'hope': 6, 'keeps': 7, 'going': 8, 'let': 9, 'celebrate': 10}
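The indices follow word frequency: 'life' and 'us' occur twice, so they get the lowest numbers. Here is a rough pure-Python sketch of that ranking, together with the effect of a num_words cutoff. It is an illustration under simplifying assumptions, not the library's code: `words` is a hypothetical helper, and the cutoff value 3 is chosen only to make the effect visible.

```python
from collections import Counter

sentences = [
    "Life is so beautiful",
    "Hope keeps us going",
    "Let us celebrate life!",
]


def words(text):
    # Simplified cleaning: lowercase and drop '!' (the only punctuation used here).
    return text.lower().replace("!", "").split()


# Count every word, then rank by frequency; ties keep first-seen order.
# Indices start at 1, so 0 is never assigned to a word.
counts = Counter(w for s in sentences for w in words(s))
ranked_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
print(ranked_index)

# Assumed cutoff for illustration: only indices below num_words are kept.
num_words = 3
kept = [[ranked_index[w] for w in words(s) if ranked_index[w] < num_words]
        for s in sentences]
print(kept)  # [[1], [2], [2, 1]]
```

With the cutoff at 3, only 'life' (1) and 'us' (2) survive the conversion, even though the full index still holds all ten words.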

Well, so far so good! But what happens when the test data contains words that we’ve not accounted for in the vocabulary?🤔

test_data = [
'Our life is to celebrate',
'Hoping for the best!',
'Let peace prevail everywhere'
]

We have introduced sentences in test_data which contain words that are not in our earlier vocabulary.

How do we account for words that are not in the vocabulary? We can define an argument oov_token to represent such Out Of Vocabulary (OOV) tokens.

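tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)  # fit again so the new tokenizer learns the vocabulary

After refitting, word_index returns the following output:

{'<OOV>': 1, 'life': 2, 'us': 3, 'is': 4, 'so': 5, 'beautiful': 6, 'hope': 7, 'keeps': 8, 'going': 9, 'let': 10, 'celebrate': 11}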
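The OOV token takes index 1 and every other word shifts up by one. As a hedged illustration of what this buys us, the pure-Python sketch below looks up each word of a test sentence and falls back to the OOV index when the word is unknown. The `lookup` helper is hypothetical and the index is copied from the output above; this is not how Keras is implemented internally.

```python
OOV = "<OOV>"

# Index copied from the tokenizer output: the OOV token sits at 1.
word_index = {
    OOV: 1, "life": 2, "us": 3, "is": 4, "so": 5, "beautiful": 6,
    "hope": 7, "keeps": 8, "going": 9, "let": 10, "celebrate": 11,
}


def lookup(sentence):
    """Map each word to its index, using the OOV index for unseen words."""
    cleaned = sentence.lower().replace("!", "")  # simplified cleaning
    return [word_index.get(w, word_index[OOV]) for w in cleaned.split()]


print(lookup("Our life is to celebrate"))      # [1, 2, 4, 1, 11]
print(lookup("Let peace prevail everywhere"))  # [10, 1, 1, 1]
```

Without an OOV token, unseen words would simply be dropped, and the sentence would lose its original length.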

Understanding Sequencing

In this section, we shall build on the tokenized text, using these generated tokens to convert the text into a sequence.

We can get a sequence by calling the texts_to_sequences method.

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# Output:
[[2, 4, 5, 6], [7, 8, 3, 9], [10, 3, 11, 2]]

Let’s now take a step back. What happens when the sentences are of different lengths? Then we have to convert all of them to the same length.

We shall import the pad_sequences function to pad our sequences and look at the result.

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences)
print("\nPadded Sequences:")
print(padded)
# Output
Padded Sequences:
[[ 2  4  5  6]
 [ 7  8  3  9]
 [10  3 11  2]]

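By default, every sequence is padded to the length of the longest sequence, which is why nothing changed above: all three sequences already have length 4. We can instead fix the length explicitly with the maxlen argument; shorter sequences are padded up to it and longer ones are truncated.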

padded = pad_sequences(sequences, maxlen=5)
print("\nPadded Sequences:")
print(padded)
# Output
Padded Sequences: 
[[ 0  2  4  5  6] 
 [ 0  7  8  3  9] 
 [ 0 10  3 11  2]]
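To make the default behaviour concrete, here is a small pure-Python sketch of left-padding. `pad_pre` is a hypothetical helper written for this post, not part of Keras; it returns plain lists rather than a NumPy array and deliberately ignores truncation.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` up to maxlen (default: the longest one)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    # Assumes no sequence is longer than maxlen; truncation is not handled here.
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]


sequences = [[2, 4, 5, 6], [7, 8, 3, 9], [10, 3, 11, 2]]
print(pad_pre(sequences))            # unchanged: every sequence already has length 4
print(pad_pre(sequences, maxlen=5))  # one leading 0 per row
```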

Now, let’s convert our test data to sequences and pad them.

test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
padded = pad_sequences(test_seq, maxlen=10)
print("\nPadded Test Sequence: ")
print(padded)
And here's our output:

# Output 
Test Sequence =  [[1, 2, 4, 1, 11], [1, 1, 1, 1], [10, 1, 1, 1]]  
Padded Test Sequence:  
[[ 0  0  0  0  0  1  2  4  1 11] 
 [ 0  0  0  0  0  0  1  1  1  1] 
 [ 0  0  0  0  0  0 10  1  1  1]]

We see that all the padded sequences are of length maxlen and are padded with 0s at the beginning. What if we would like trailing zeros instead? We only need to specify padding='post'.

padded = pad_sequences(test_seq, maxlen=10, padding='post')
print("\nPadded Test Sequence: ")
print(padded)
# Output
Padded Test Sequence: 
[[ 1  2  4  1 11  0  0  0  0  0]
 [ 1  1  1  1  0  0  0  0  0  0]
 [10  1  1  1  0  0  0  0  0  0]]

So far, none of the sentences is longer than maxlen, but in practice we may have sentences that are much longer. In that case the sequences have to be truncated: truncating='pre' (the default) drops words from the beginning of the sequence, while truncating='post' drops them from the end.
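The difference between the two truncation modes is easiest to see on a single sequence. The sketch below is pure Python and only mirrors the behaviour described above; `pad_or_truncate` is a hypothetical helper, and it assumes maxlen is at least 1.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Force one sequence to exactly maxlen items (assumes maxlen >= 1)."""
    if len(seq) > maxlen:
        # 'pre' drops items from the front, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad


seq = [1, 2, 4, 1, 11]
print(pad_or_truncate(seq, 3, truncating="pre"))   # [4, 1, 11]
print(pad_or_truncate(seq, 3, truncating="post"))  # [1, 2, 4]
print(pad_or_truncate(seq, 7, padding="post"))     # [1, 2, 4, 1, 11, 0, 0]
```

Which end to cut depends on the data: if the most informative words tend to come last, keeping the tail with 'pre' truncation may be the better choice.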

Happy learning and coding!🎈✨🎉👩🏽‍💻

Reference

Natural Language Processing in TensorFlow on Coursera

Cover Image: Photo by Susan Q Yin on Unsplash
