Jay Codes

Posted on Jan 30

AI and Medicine: How I Figured Out What People Feel about Drugs

#ai #medicine #sentimentanalysis #machinelearning

Sentiment analysis, also known as opinion mining, is a fascinating field in natural language processing (NLP) that revolves around understanding and extracting sentiments or opinions from textual data. In simpler terms, it involves determining whether a piece of text expresses a positive, negative, or neutral sentiment.
In the context of our journey into sentiment analysis, we'll be working with a real-world dataset comprising drug reviews. This dataset provides a valuable glimpse into how people express their opinions and experiences with different medications. The dataset includes information such as drug names, user ratings, and the written reviews themselves.

The Drug Review Dataset

Our dataset consists of drug reviews collected from various sources, offering diverse opinions and sentiments. Each entry in the dataset provides insights into a user's experience with a specific drug, allowing us to explore feelings associated with different medications.
Columns in the dataset:
drugName: the name of the drug being reviewed.
rating: the user's rating for the drug on a scale from 1 to 10.
review: the written review expressing the user's experience with the drug.

Understanding sentiments in drug reviews can be instrumental in healthcare and pharmaceutical decision-making. Whether it's identifying the effectiveness of a medication, addressing potential side effects, or gauging overall patient satisfaction, sentiment analysis proves to be a valuable tool.

What is Sentiment Analysis?

Sentiment analysis is the process of gauging the sentiments or emotions expressed in a piece of text. It involves leveraging natural language processing (NLP) techniques and machine learning algorithms to analyze and interpret subjective information. The primary goal is to determine whether a given text carries a positive, negative, or neutral sentiment.
In AI and Medicine, sentiment analysis can be a game-changer. It allows us to gain valuable insights into how people perceive and feel about different medications. Understanding the sentiments expressed in drug reviews, patient testimonials, or healthcare-related discussions can contribute significantly to medical research, patient care, and pharmaceutical decision-making.
Sentiment analysis finds application in a myriad of real-world scenarios. Consider scenarios where a healthcare provider wants to assess patient experiences with a particular medication or a pharmaceutical company is interested in understanding the market reception of a new drug.
Social Media Monitoring: Analyzing sentiments in social media posts can help monitor public opinions about medications.
Product Reviews: Evaluating sentiments in product reviews aids in understanding user satisfaction and identifying areas for improvement.
Health Forums and Blogs: Extracting sentiments from health-related discussions provides valuable insights into patient experiences and concerns.
Next, let's look at how sentiment analysis works and the principles behind it, before turning to the practical side of implementing it in your own projects.

How Sentiment Analysis Works

Sentiment analysis operates on the premise that the words and expressions used in a text convey the author's emotion. The process involves breaking down the text into smaller units, such as sentences or phrases, and analyzing them to discern the sentiment they carry.
Here's a simplified overview of the basic sentiment analysis process:

Text Input: Begin with a piece of text that you want to analyze. This could be a product review, a social media comment, or any other form of written communication.
Text Preprocessing: Clean the text by removing unnecessary elements such as punctuation, special characters, and numbers. Convert the text to lowercase for consistency.
Tokenization: Break the text into individual words or tokens. This step helps in analyzing the sentiment associated with each word.
Sentiment Labeling: Assign sentiment labels to each token based on predefined criteria. These labels often include 'positive,' 'negative,' or 'neutral.'
Aggregate Sentiments: Summarize the individual sentiments to determine an overall sentiment for the entire text. This could involve counting the number of positive and negative tokens.

Machine learning and natural language processing (NLP) techniques underlie this basic process. Machine learning models are trained on labeled datasets to recognize patterns and associations between words and sentiments.
In the subsequent sections, we will leverage Python and popular libraries like TensorFlow and Pandas to implement sentiment analysis on drug reviews. We'll work with real-world data, preprocess text, and build a machine-learning model to categorize sentiments. So, let's roll up our sleeves and start coding!
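Before moving on to the real model, the five steps above can be sketched with nothing more than the standard library. This is only a toy illustration: the two word sets are a hypothetical mini-lexicon made up for this example, and the whitespace tokenizer is a simplification of what NLTK does later in the post.

```python
import re

# Hypothetical mini-lexicon, for illustration only; real systems use far larger
# lexicons or, as in the rest of this post, a trained model.
POSITIVE_WORDS = {"great", "effective", "relief", "better", "helped"}
NEGATIVE_WORDS = {"nausea", "worse", "headache", "terrible", "pain"}


def label_token(token):
    """Step 4: assign a sentiment label to a single token."""
    if token in POSITIVE_WORDS:
        return "positive"
    if token in NEGATIVE_WORDS:
        return "negative"
    return "neutral"


def analyze(text):
    """Run the steps on one piece of text (step 1) and return an overall label."""
    # Step 2: preprocessing (lowercase, replace anything that is not a letter or space)
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Step 3: tokenization (a whitespace split is enough for this sketch)
    tokens = cleaned.split()
    # Step 4: per-token labels
    labels = [label_token(token) for token in tokens]
    # Step 5: aggregate by comparing positive and negative counts
    score = labels.count("positive") - labels.count("negative")
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(analyze("This drug helped a lot, great relief!"))
print(analyze("Terrible headache and nausea for 3 days."))
print(analyze("I took it twice a day."))
```

Counting words like this ignores negation ("not effective") and context, which is one reason the rest of the post trains a model instead.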

Getting Started with Python

Python has emerged as a powerhouse in the field of data science, offering a rich ecosystem of libraries and tools for various tasks. In sentiment analysis, we leverage Python's simplicity and extensive libraries to efficiently process and analyze text data.
Before we delve into the code, let's ensure you have the necessary libraries installed. We'll be using TensorFlow for building our machine learning model and Pandas for data manipulation.

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

These libraries form the backbone of our sentiment analysis implementation. TensorFlow provides a powerful platform for creating and training machine learning models, while Pandas simplifies data manipulation and analysis. NLTK (Natural Language Toolkit) will be used for text preprocessing.
Now, let's proceed to load and explore the dataset.

# Load the TSV dataset
train_dataset = pd.read_csv("/content/drugsComTrain_raw.tsv", sep="\t")
test_dataset = pd.read_csv("/content/drugsComTest_raw.tsv", sep="\t")

We're loading a dataset in TSV (Tab-Separated Values) format in this example. The dataset contains drug reviews, ratings, and other relevant information. You can replace the file paths with your dataset if needed.
With the data loaded, let's set sentiment labels based on predefined thresholds.

# Define thresholds for sentiment labels
positive_threshold = 7.0
negative_threshold = 4.0
# Create sentiment labels based on thresholds
train_dataset['sentiment'] = train_dataset['rating'].apply(lambda x: 'positive' if x >= positive_threshold else ('negative' if x <= negative_threshold else 'neutral'))
test_dataset['sentiment'] = test_dataset['rating'].apply(lambda x: 'positive' if x >= positive_threshold else ('negative' if x <= negative_threshold else 'neutral'))

Here, we're categorizing reviews as 'positive,' 'negative,' or 'neutral' based on predefined rating thresholds. This step sets the foundation for our sentiment analysis model.
We'll go deeper into text preprocessing and model building in the upcoming sections.
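To see exactly where the boundaries fall, the thresholding rule above can be written as a small named function. This is a sketch that mirrors the lambda; the function name rating_to_sentiment is introduced here for illustration and does not appear in the original code.

```python
def rating_to_sentiment(rating, positive_threshold=7.0, negative_threshold=4.0):
    """Mirror the lambda above: >= positive is positive, <= negative is negative."""
    if rating >= positive_threshold:
        return "positive"
    if rating <= negative_threshold:
        return "negative"
    return "neutral"


# Check the boundary ratings: 4 is still negative, 5 and 6 are neutral, 7 is positive
for rating in (1, 4, 5, 6, 7, 10):
    print(rating, rating_to_sentiment(rating))
```

A named function like this could be passed to train_dataset['rating'].apply(...) in place of the lambda, which keeps the rule in one place for both datasets.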

Loading the Data and Text Preprocessing

Downloading NLTK Resources and Text Preprocessing Functions

Before we dive into text preprocessing, we need to ensure that we have the necessary resources and functions. NLTK (Natural Language Toolkit) provides tools for working with human language data. Let's download the required resources and define functions for text preprocessing.

# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases look for this tokenizer resource
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Tokenization
    words = word_tokenize(text)
    # Removing stopwords and non-alphabetic words
    words = [word for word in words if word.isalpha() and word not in stop_words]
    # Lemmatization (or Stemming)
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a string
    return ' '.join(words)

# Apply preprocessing to the 'review' column
train_dataset['preprocessed_review'] = train_dataset['review'].apply(preprocess_text)
test_dataset['preprocessed_review'] = test_dataset['review'].apply(preprocess_text)

In this snippet, we download the necessary NLTK resources and define a preprocess_text function. This function takes a piece of text, performs tasks such as lowercasing, tokenization, removing stopwords, and lemmatization, and returns the preprocessed text.

Understanding Text Preprocessing

Text preprocessing is critical in any NLP task, including sentiment analysis. It involves transforming raw text into a format that is suitable for analysis. The preprocessing steps enhance the data's quality and contribute to better model performance.
Lowercasing: Convert all text to lowercase to ensure uniformity.
Tokenization: Break the text into individual words or tokens.
Removing Stopwords: Eliminate common words (e.g., "the," "and") that do not contribute much to the sentiment.
Lemmatization: Reduce words to their base or root form for consistency.
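To make the four steps concrete, here is a standalone sketch that prints the intermediate result of each one for a single made-up review. The stopword set and lemma table are tiny stand-ins assumed for this example; NLTK's English stopword list and WordNet lemmatizer cover far more words.

```python
# Toy stand-ins, assumed for illustration only
TOY_STOPWORDS = {"the", "and", "my", "after", "were", "this", "a"}
TOY_LEMMAS = {"headaches": "headache", "pills": "pill", "days": "day"}


def toy_preprocess(review):
    """Return the intermediate results of each preprocessing step."""
    lowered = review.lower()
    # Tokenize on whitespace and strip punctuation from the ends of each piece
    tokens = [piece.strip(".,!?") for piece in lowered.split()]
    # Keep alphabetic tokens that are not stopwords (this also drops "2")
    content = [t for t in tokens if t.isalpha() and t not in TOY_STOPWORDS]
    # Look up a base form, falling back to the word itself
    lemmas = [TOY_LEMMAS.get(t, t) for t in content]
    return tokens, content, lemmas


tokens, content, lemmas = toy_preprocess(
    "My headaches were gone after 2 days, and the pills helped!"
)
print(tokens)
print(content)
print(" ".join(lemmas))
```

Note that "helped" and "gone" pass through unchanged here. NLTK's WordNetLemmatizer behaves similarly when called without a part-of-speech argument, because it treats every word as a noun by default.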

The preprocessed reviews will serve as the input to our sentiment analysis model, enabling it to focus on meaningful content while disregarding noise.
Let's talk about splitting the data, defining model parameters, and building our sentiment analysis model using TensorFlow.

Splitting the Data and Mapping Sentiment Labels

from sklearn.model_selection import train_test_split
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    train_dataset['preprocessed_review'],  # Features (preprocessed text)
    train_dataset['rating'],  # Labels (raw 1-10 ratings, encoded in the next step)
    test_size=0.1,  # Share of rows held out for validation (adjust as needed)
    random_state=42  # Fixed seed so the split is reproducible
)

This section uses the train_test_split function from scikit-learn to split our dataset into training and validation sets. We extract the preprocessed reviews (X_train and X_val) as features and the original ratings (y_train and y_val) as labels.
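For intuition, a hold-out split boils down to shuffling row indices with a fixed seed and setting a share of them aside. The sketch below uses only the standard library; holdout_split is a hypothetical helper written for this illustration and will not reproduce the exact rows scikit-learn picks.

```python
import random


def holdout_split(features, labels, test_size=0.1, seed=42):
    """Shuffle indices with a fixed seed and hold out the first `test_size` share."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(indices) * test_size))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    # Same return order as train_test_split: X_train, X_val, y_train, y_val
    return (pick(features, train_idx), pick(features, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


reviews = [f"review {i}" for i in range(20)]
ratings = [(i % 10) + 1 for i in range(20)]
X_tr, X_va, y_tr, y_va = holdout_split(reviews, ratings)
print(len(X_tr), len(X_va))
```

Because features and labels are selected with the same indices, each review stays paired with its own rating, and reusing the seed gives the same split on every run.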

Mapping Sentiment Labels

# Map sentiment labels to the correct range
label_mapping = {1.0: 0, 2.0: 0, 3.0: 1, 4.0: 1, 5.0: 1, 6.0: 1, 7.0: 2, 8.0: 2, 9.0: 2, 10.0: 2}
# Apply label mapping to the training and validation labels
y_train_encoded = y_train.map(label_mapping)
y_val_encoded = y_val.map(label_mapping)
# Now, the labels should be in the range [0, 2]
print(set(y_train_encoded))
print(set(y_val_encoded))

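Here, we create a mapping dictionary to categorize ratings into sentiment labels. Ratings 1 and 2 are mapped to label 0 (negative), ratings from 3 to 6 to label 1 (neutral), and ratings from 7 to 10 to label 2 (positive). Note that these cut-offs differ from the thresholds defined earlier, where ratings up to 4 counted as negative; the model is trained on this mapping, not on the earlier 'sentiment' column. We then apply this mapping to both the training and validation labels.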
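One way to keep the cut-offs in a single place is to generate the dictionary from two boundary values rather than typing it out. This is a sketch under that assumption; build_label_mapping and its parameter names are hypothetical and not part of the original code.

```python
def build_label_mapping(negative_max, positive_min, ratings=range(1, 11)):
    """Map each rating to 0 (negative), 1 (neutral), or 2 (positive)."""
    mapping = {}
    for rating in ratings:
        if rating <= negative_max:
            label = 0
        elif rating >= positive_min:
            label = 2
        else:
            label = 1
        # Float keys, matching the style of the hand-written dictionary
        mapping[float(rating)] = label
    return mapping


# Reproduces the dictionary used above (1-2 negative, 3-6 neutral, 7-10 positive)
print(build_label_mapping(negative_max=2, positive_min=7))
# Follows the earlier rating thresholds instead (1-4 negative, 5-6 neutral)
print(build_label_mapping(negative_max=4, positive_min=7))
```

Whichever boundaries you choose, using the same generated mapping for the training, validation, and test labels avoids the classes drifting apart between steps.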
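Printing the sets confirms that every label falls in the range 0 to 2, but it says nothing about how many reviews belong to each class. Checking that distribution matters because a heavily imbalanced training set can bias the model toward the majority class and hurt its ability to generalize to unseen data.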
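A quick way to inspect the balance is to count the labels and, if they are skewed, derive inverse-frequency weights. The sketch below uses made-up labels and the common heuristic n_samples / (n_classes * class_count); the class_weights helper is hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Hypothetical encoded labels, skewed toward the positive class (2)
toy_labels = [2] * 6 + [1] * 3 + [0] * 1
print(Counter(toy_labels))
print(class_weights(toy_labels))
```

On the real data, y_train_encoded.value_counts() gives the same counts, and a dictionary like this one could be passed to Keras through the class_weight argument of model.fit if the imbalance turns out to be a problem.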
In the next section, we'll define model parameters, load pre-trained word embeddings, and build our sentiment analysis model using TensorFlow.

Defining Model Parameters and Loading Pre-trained Word Embeddings

# Define parameters
embedding_dim = 128 # Dimensionality of the word embeddings
max_sequence_length = 100 # Maximum length of padded sequences
num_classes = 3 # Number of sentiment classes (negative, neutral, positive)
num_epochs = 10
batch_size = 64

Here, we set parameters that will guide the construction and training of our sentiment analysis model. The embedding_dim represents the dimensionality of the embeddings. max_sequence_length is intended as the maximum length of padded sequences, although the model in this post feeds raw strings to the embedding layer and never pads them, so this value goes unused. num_classes defines the number of sentiment classes (negative, neutral, positive), and num_epochs and batch_size are related to the training process.

Load Pre-trained Word Embeddings

# Load pre-trained word embeddings
embedding_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2", input_shape=[], dtype=tf.string, output_shape=[embedding_dim])

In this snippet, we utilize a pre-trained text embedding model from TensorFlow Hub. Word embeddings capture the semantic meaning of words and are essential for understanding the relationships within text. This particular module takes raw strings as input and returns a single 128-dimensional vector for each review, rather than one vector per word.
As we progress, we'll integrate this embedding layer into our sentiment analysis model, providing it with a solid foundation for understanding the contextual meaning of words in drug reviews.
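To illustrate how per-word vectors can collapse into one fixed-size vector per review, here is a toy pooling sketch. The 3-dimensional vectors are invented for this example, and the sum divided by the square root of the word count is an assumption about the combiner; the Hub module's own documentation is the authority on what it actually does.

```python
import math

# Hypothetical 3-dimensional word vectors; the real module uses 128 dimensions
TOY_VECTORS = {
    "headache": [0.9, 0.1, 0.0],
    "gone": [0.0, 0.8, 0.2],
    "relief": [0.1, 0.9, 0.3],
}


def embed_text(text, dim=3):
    """Pool known word vectors into one vector: sum divided by sqrt(word count)."""
    vectors = [TOY_VECTORS[word] for word in text.split() if word in TOY_VECTORS]
    if not vectors:
        return [0.0] * dim  # No known words: fall back to a zero vector
    scale = math.sqrt(len(vectors))
    return [sum(component) / scale for component in zip(*vectors)]


print(embed_text("headache gone"))
print(len(embed_text("headache gone relief")))  # Same length for any text
```

The key point is that the output size does not depend on the length of the review, which is why no padding is needed before the text reaches the model.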
Great! Let's move on to the next section, where we'll build and compile our sentiment analysis model using TensorFlow.

Building and Compiling the Model

# Build the model
model = tf.keras.Sequential([
    embedding_layer,  # One 128-dimensional vector per review
    tf.keras.layers.Reshape((1, embedding_dim)),  # Add a time axis of length 1 to match LSTM input
    tf.keras.layers.LSTM(128),  # Return only the final output so predictions have shape (batch, num_classes)
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='linear')  # Raw logits, one per sentiment class
])
# Compile the model
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # Expects integer labels and logits
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy']
)
# Print the model summary
model.summary()

Here, we're constructing a sequential model using TensorFlow's Keras API. The model comprises several layers:

Embedding Layer: Utilizes pre-trained word embeddings to represent words in a continuous vector space.
Reshape Layer: Adjusts the output shape to match the input requirements of the LSTM layer.
LSTM Layer: Long Short-Term Memory layer for capturing sequential dependencies in the data.
Dense Layers: Fully connected layers for learning hierarchical representations.
- The first Dense layer uses ReLU activation.
- The final Dense layer produces the output with three units, corresponding to the three sentiment classes.

We compile the model using the sparse categorical crossentropy loss function, Adam optimizer, and accuracy as the metric for evaluation.

Model Summary

The model.summary() provides an overview of the model architecture, including the number of parameters in each layer. Understanding the model summary is crucial for ensuring that the model is constructed as intended.
In the next section, we'll train our sentiment analysis model using the preprocessed data.

Training the Model

# Train the model (5 epochs here; raise this for longer training)
history = model.fit(
    X_train, y_train_encoded,
    epochs=5,
    batch_size=batch_size,
    validation_data=(X_val, y_val_encoded)
)

In this snippet, we use the fit method to train our sentiment analysis model. The training data (X_train and y_train_encoded) are used to teach the model to associate preprocessed reviews with their corresponding sentiment labels. The validation_data parameter allows us to monitor the model's performance on a separate validation set during training.
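The quantity that training minimizes is the sparse categorical crossentropy computed from the model's raw scores (logits). For a single example it is the negative log of the softmax probability assigned to the true class. The sketch below works that out by hand with made-up logits; it is an illustration of the formula, not Keras's implementation.

```python
import math


def sparse_categorical_crossentropy(logits, true_label):
    """Cross-entropy for one example: -log(softmax(logits)[true_label])."""
    # Subtract the max logit for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    return -math.log(probabilities[true_label])


# Hypothetical raw scores for (negative, neutral, positive)
logits = [0.5, 0.2, 2.0]
print(sparse_categorical_crossentropy(logits, 2))  # True class has the top score: small loss
print(sparse_categorical_crossentropy(logits, 0))  # True class scored low: larger loss
```

With three classes and no information (all logits equal), the loss is log(3), roughly 1.10, so a validation loss near that value suggests the model has not learned much yet.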
The epochs parameter determines the number of times the model will iterate over the entire training dataset. Adjusting this parameter allows you to control the duration of training.
As the model trains, it learns to capture the patterns and relationships between words and sentiments, ultimately becoming adept at classifying the sentiment of drug reviews.

Preprocessing Test Data and Model Evaluation

# Preprocess test data
test_dataset['preprocessed_review'] = test_dataset['review'].apply(preprocess_text)
# Map ratings to sentiment labels in the range [0, 2]
test_dataset['encoded_sentiment'] = test_dataset['rating'].map(label_mapping)
# Split test data into features and labels
X_test = test_dataset['preprocessed_review']
y_test_encoded = test_dataset['encoded_sentiment']
# Evaluate the model; evaluate returns [loss, accuracy] because accuracy was set as a metric
test_loss, test_accuracy = model.evaluate(X_test, y_test_encoded)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

In this section, we preprocess the test data using the same preprocess_text function. We then map the sentiment labels based on the previously defined label_mapping and split the test data into features (X_test) and labels (y_test_encoded).
Finally, we evaluate the trained model on the test data using the evaluate method. The test loss provides insights into how well the model generalizes to unseen data.

Interpreting the Results

Analyzing the test loss and other metrics (such as accuracy) gives us an indication of how well our sentiment analysis model performs on new, unseen drug reviews. A lower test loss and high accuracy are desirable outcomes, indicating that the model has successfully learned to predict sentiments.
As you explore the results, consider potential areas for improvement, such as adjusting model parameters, experimenting with different architectures, or increasing the amount of training data.
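Overall accuracy can hide weak performance on a smaller class, so it can help to look at a confusion matrix and per-class recall as well. The sketch below uses invented logits and labels purely for illustration; on the real model, the scores would come from something like model.predict(X_test).

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class for one row of logits."""
    return max(range(len(values)), key=lambda i: values[i])


def confusion_matrix(true_labels, predicted_labels, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def per_class_recall(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical logits for five reviews and their true labels
toy_logits = [[2.0, 0.1, 0.3], [0.2, 0.1, 1.5], [0.3, 0.2, 1.1],
              [0.1, 0.4, 2.2], [1.2, 0.3, 0.9]]
toy_true = [0, 1, 2, 2, 0]

predictions = [argmax(row) for row in toy_logits]
matrix = confusion_matrix(toy_true, predictions)
accuracy = sum(t == p for t, p in zip(toy_true, predictions)) / len(toy_true)

print(predictions)
print(matrix)
print(accuracy, per_class_recall(matrix))
```

In this toy case accuracy is 0.8 even though the single neutral review is misclassified, which is the kind of gap per-class recall makes visible.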

Conclusion

Congratulations! We've built and evaluated a sentiment analysis model for drug reviews using Python, TensorFlow, and Pandas. This model can be a valuable tool for understanding public sentiments towards medications and making informed decisions in the healthcare and pharmaceutical domains.
Feel free to adapt and extend this code for your specific projects, exploring new datasets and applications of sentiment analysis.