Aman Gupta

Preparing data for Sentiment analysis with TensorFlow

Sentiment in Text

  • Tokenizer - used to vectorize a sentence into a sequence of numbers; it strips out punctuation and converts everything to lower case. The num_words parameter passed to the initializer sets the maximum vocabulary size (based on frequency): only the num_words - 1 most frequent words are kept when generating sequences.

    from tensorflow.keras.preprocessing.text import Tokenizer
    
    # Define input sentences
    sentences = [
        'i love my dog',
        'I, love my cat'
        ]
    
    # Initialize the Tokenizer class
    tokenizer = Tokenizer(num_words = 100) # Keep only the most frequent words (num_words - 1 of them) when generating sequences
    
    # Generate indices for each word in the corpus
    tokenizer.fit_on_texts(sentences)
    
    # Get the indices and print it
    word_index = tokenizer.word_index
    print(word_index)
    
```python
# Define input sentences
sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

# Initialize the Tokenizer class
tokenizer = Tokenizer(num_words = 1)

# Generate indices for each word in the corpus
tokenizer.fit_on_texts(sentences)

# Get the indices and print it
word_index = tokenizer.word_index
print(word_index)
```
  • The important thing to note is that num_words does not affect how the word_index dictionary is generated. You can pass 1 instead of 100, as in the cell above, and you will arrive at the same word_index; num_words only limits which words are kept later when converting texts to sequences, as the sketch below shows.
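To see what num_words actually changes, here is a small sketch reusing the same toy sentences. It compares the word index with the sequences each tokenizer produces:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!'
]

for n in [1, 100]:
    tokenizer = Tokenizer(num_words=n)
    tokenizer.fit_on_texts(sentences)

    # word_index is identical regardless of num_words ...
    print(f'num_words={n} -> word_index: {tokenizer.word_index}')

    # ... but texts_to_sequences only keeps the (num_words - 1) most frequent words
    print(f'num_words={n} -> sequences: {tokenizer.texts_to_sequences(sentences)}')
```

With num_words=1 every word is filtered out of the sequences, while the word index itself is unchanged.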

  • Padding - before feeding the data into the model, we need to make sure every sequence has a uniform length. We use padding to do so, filling the shorter sequences with zeros. Arguments let us choose whether to pad at the front or the back.

  • We use an "<OOV>" token (set via oov_token) to stand in for out-of-vocabulary words.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    
    # Define your input texts
    sentences = [
        'I love my dog',
        'I love my cat',
        'You love my dog!',
        'Do you think my dog is amazing?'
    ]
    
    # Initialize the Tokenizer class
    tokenizer = Tokenizer(num_words = 100, oov_token="<OOV>")
    
    # Tokenize the input sentences
    tokenizer.fit_on_texts(sentences)
    
    # Get the word index dictionary
    word_index = tokenizer.word_index
    
    # Generate list of token sequences
    sequences = tokenizer.texts_to_sequences(sentences)
    
    # Print the result
    print("\nWord Index = " , word_index)
    print("\nSequences = " , sequences)
    
    # Pad the sequences to a uniform length
    padded = pad_sequences(sequences, maxlen=5) #override the max length you want for the sequence
    
    # Print the result
    print("\nPadded Sequences:")
    print(padded)
    


  • By default, pad_sequences pads at the front (padding='pre'), and when a sequence is longer than maxlen it is also truncated from the front (truncating='pre'). We can change both behaviours through the arguments, as in the line below; a comparison sketch follows it.

    padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
    
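A minimal comparison of the two modes, using a toy list of sequences (the values here are made up purely for illustration):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy sequences of different lengths (made up for illustration)
seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]

# Default behaviour: pad and truncate at the front ('pre')
print(pad_sequences(seqs, maxlen=5))

# Pad and truncate at the back ('post')
print(pad_sequences(seqs, maxlen=5, padding='post', truncating='post'))
```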
  • Using a sarcasm dataset to try this out

    # Download the dataset
    !wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
    
    import json
    
    # Load the JSON file
    with open("./sarcasm.json", 'r') as f:
        datastore = json.load(f)
    
    # Initialize lists
    sentences = [] 
    labels = []
    urls = []
    
    # Append elements in the dictionaries into each list
    for item in datastore:
        sentences.append(item['headline'])
        labels.append(item['is_sarcastic'])
        urls.append(item['article_link'])
    
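A quick sanity check on what was loaded, reusing datastore and the lists from the snippet above:

```python
# Inspect the loaded data (reuses datastore/sentences/labels from the previous snippet)
print(f'number of records: {len(datastore)}')
print(f'fields in each record: {list(datastore[0].keys())}')
print(f'sample headline: {sentences[0]}')
print(f'sample label: {labels[0]}')
```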
  • Processing the dataset: tokenizing the headlines and then padding the resulting sequences.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    
    # Initialize the Tokenizer class
    tokenizer = Tokenizer(oov_token="<OOV>")
    
    # Generate the word index dictionary
    tokenizer.fit_on_texts(sentences)
    
    # Print the length of the word index
    word_index = tokenizer.word_index
    print(f'number of words in word_index: {len(word_index)}')
    
    # Print the word index
    print(f'word_index: {word_index}')
    print()
    
    # Generate and pad the sequences
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences, padding='post')
    
    # Print a sample headline
    index = 2
    print(f'sample headline: {sentences[index]}')
    print(f'padded sequence: {padded[index]}')
    print()
    
    # Print dimensions of padded sequences
    print(f'shape of padded sequences: {padded.shape}')
    
  • Preparing data for NLP algorithms by removing stop words and converting everything to lower case

    def remove_stopwords(sentence):
        # List of stopwords
        stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]
    
        # Convert the sentence to lowercase
        sentence = sentence.lower()

        # Drop every word that appears in the stopword list
        words = sentence.split()
        filtered = [word for word in words if word not in stopwords]
        sentence = " ".join(filtered)
    
        return sentence
    
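A quick usage example for the helper above:

```python
# Stop words are dropped and the text is lowercased
print(remove_stopwords("I love my dog and my dog loves me"))
# -> 'love dog dog loves'
```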
  • Reading data and extracting labels and texts

    
    import csv

    def parse_data_from_file(filename):
        sentences = []
        labels = []
        with open(filename, 'r') as csvfile:
    
            reader = csv.reader(csvfile)
            next(reader) #ignoring the first line, headers
    
            for row in reader:
                label = row[0]
                text = " ".join(row[1:])
                text = remove_stopwords(text)
                labels.append(label)
                sentences.append(text)
    
        return sentences, labels
    
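A minimal sketch of how the function could be exercised; the file name and rows below are made up purely for illustration, assuming a "label,text" CSV layout with a header row:

```python
import csv

# Write a tiny CSV in the assumed "label,text" layout (rows are made up)
with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "text"])
    writer.writerow(["tech", "I love my new phone"])
    writer.writerow(["sport", "You think my team is amazing"])

texts, labels = parse_data_from_file("sample.csv")
print(labels)  # ['tech', 'sport']
print(texts)   # stop words removed and lowercased
```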
  • Tokenizing labels

    def tokenize_labels(labels):
    
        # Instantiate the Tokenizer class
        # No need to pass additional arguments since you will be tokenizing the labels
        label_tokenizer = Tokenizer()
    
        # Fit the tokenizer to the labels
        label_tokenizer.fit_on_texts(labels)
    
        # Save the word index
        label_word_index = label_tokenizer.word_index
    
        # Save the sequences
        label_sequences = label_tokenizer.texts_to_sequences(labels)
    
        return label_sequences, label_word_index
    
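For example, applied to a small, made-up list of labels:

```python
# Example with a made-up list of labels
label_sequences, label_word_index = tokenize_labels(['sport', 'business', 'sport', 'tech'])
print(label_word_index)   # e.g. {'sport': 1, 'business': 2, 'tech': 3}
print(label_sequences)    # e.g. [[1], [2], [1], [3]]
```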
