Text Mining and Movie Review Classification through RapidMiner.

Daniel Gomez Jaramillo
Microsoft Learn Student Ambassador | C# Corner MVP | Systems Engineer | Dev. Speaker | Dev/Tech Writer | Passionate about technology for the benefit of society. 🕊️

One of the most complex tasks in supervised classification is obtaining a tagged (labeled) data set that can be used first to train a model and then to validate it. Within text mining there are few corpora available for this purpose. The objective here, therefore, is to create a tagged data set and use it to train a classification model that recognizes positive and negative reviews for a set of films.

Data source: https://www.cinemablend.com/reviews/

Project repositories: https://github.com/esdanielgomez/MovieClassificationRMP/

Part 1: Collection of unstructured data.

This process begins by extracting the links to the movie reviews (1,470 in total) from 30 pages of the site.

REPOSITORY: Final_A


In this case, both the review link and its star rating (on a scale from 1.0 to 5.0) are selected.

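As a rough illustration of this step, the stand-alone Python sketch below parses review links and star ratings out of a listing page using only the standard library. The CSS classes (`review-link`, `star-rating`) and the sample markup are hypothetical; the real CinemaBlend pages would need to be inspected first (the post itself does this visually in RapidMiner).

```python
from html.parser import HTMLParser

class ReviewLinkParser(HTMLParser):
    """Collect (url, rating) pairs from one listing page.

    Assumes links carry class "review-link" and ratings sit in a
    "star-rating" span -- hypothetical markup, not CinemaBlend's real one.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []          # (url, rating) pairs
        self._pending_url = None
        self._in_rating = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "review-link" in attrs.get("class", ""):
            self._pending_url = attrs.get("href")
        elif tag == "span" and "star-rating" in attrs.get("class", ""):
            self._in_rating = True

    def handle_data(self, data):
        # Only the text inside a star-rating span is a rating value.
        if self._in_rating and self._pending_url:
            self.reviews.append((self._pending_url, float(data.strip())))
            self._pending_url = None
            self._in_rating = False

page = """
<a class="review-link" href="/reviews/movie-1">Movie 1</a>
<span class="star-rating">4.5</span>
<a class="review-link" href="/reviews/movie-2">Movie 2</a>
<span class="star-rating">2.0</span>
"""

parser = ReviewLinkParser()
parser.feed(page)
print(parser.reviews)
# -> [('/reviews/movie-1', 4.5), ('/reviews/movie-2', 2.0)]
```

In a full crawler this would run over all 30 listing pages, accumulating the 1,470 link/rating pairs.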

REPOSITORY: Final_B


As a next step, the reviews are divided into two groups: positive and negative.


For negative reviews, those with a rating less than or equal to 3.0 are selected.


And for positive reviews, those with a rating greater than 3.0.

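The split rule can be sketched in a few lines of Python (toy data here; the real set has 1,470 reviews):

```python
# Split reviews by the 3.0 threshold: <= 3.0 is negative, > 3.0 positive.
reviews = [
    ("/reviews/movie-1", 4.5),
    ("/reviews/movie-2", 2.0),
    ("/reviews/movie-3", 3.0),
    ("/reviews/movie-4", 5.0),
]

negative = [(url, r) for url, r in reviews if r <= 3.0]
positive = [(url, r) for url, r in reviews if r > 3.0]

print(len(negative), len(positive))
# -> 2 2
```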

Subsequently, the movie information is retrieved; the attribute most relevant to this study is the Comment attribute.


The process for obtaining comments is as follows:


Finally, these comments are stored in two Excel files (datasetComentariosNegativos.xlsx and datasetComentariosPositivos.xlsx).

REPOSITORY: Final_C

Next, 650 movie reviews are selected per class (positive and negative), and two new datasets are prepared: one for training and one for subsequent testing.

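A minimal Python sketch of this balancing-and-splitting step, assuming a 500/150 per-class train/test split (an assumption for illustration; the post states only that separate training and test sets were built):

```python
import random

random.seed(42)

# Stand-in review texts; the real data holds the scraped Comment fields.
positive = [f"pos review {i}" for i in range(700)]
negative = [f"neg review {i}" for i in range(720)]

# Balance the classes: 650 reviews per class, as in the post.
pos_sample = random.sample(positive, 650)
neg_sample = random.sample(negative, 650)

# Hold out part of each class for testing (500/150 is an assumption).
train = pos_sample[:500] + neg_sample[:500]
test = pos_sample[500:] + neg_sample[500:]

print(len(train), len(test))
# -> 1000 300
```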

Part 2: Preparation of the data.

So far, the data is organized in this way:


REPOSITORY: Final_D

Next, pre-processing steps are executed to remove HTML tags, English stopwords, and words from a custom dictionary, among other operations.


Process Documents from Data:

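Outside RapidMiner, the same pre-processing chain might look like this minimal Python sketch; the stopword list and custom dictionary here are tiny illustrative stand-ins:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "it", "of", "and", "to"}
CUSTOM = {"cinemablend"}  # site-specific noise words (an assumption)

def preprocess(raw_html):
    """Strip HTML tags, lowercase, tokenize, and drop unwanted words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)      # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS | CUSTOM]

print(preprocess("<p>The movie is <b>great</b> and moving.</p>"))
# -> ['movie', 'great', 'moving']
```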

Part 3: Identify key characteristics.

REPOSITORY: Final_D (In the same repository as part 2).

The previous process (Part 2) returns a structured table with one row per review extracted from the website and one column per token found in the documents, holding that token's occurrence weight. However, running a learning process over so many variables is very expensive. Therefore, the objective of this step is to filter those columns (tokens) with a feature selection method.


For this, the Weight by Information Gain operator is used, which ranks the features by their relevance to the label attribute according to their information gain.


Also, the Select by Weights operator is used, which keeps only those attributes of the input set whose weights meet the specified criterion with respect to the input weights. In this case, the top k option is used to select the 90 best features of the data set.
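The idea behind this weighting-and-selection step can be sketched in plain Python: score each token by information gain against the label, then keep the top k (k = 90 in the post; k = 2 on this toy corpus):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def information_gain(docs, labels, token):
    """H(label) - H(label | token present/absent)."""
    with_t = [l for d, l in zip(docs, labels) if token in d]
    without = [l for d, l in zip(docs, labels) if token not in d]
    n = len(labels)
    cond = sum(len(part) / n * entropy(part)
               for part in (with_t, without) if part)
    return entropy(labels) - cond

# Toy corpus: each document is a set of tokens after pre-processing.
docs = [{"great", "fun"}, {"great", "plot"},
        {"boring", "plot"}, {"boring", "slow"}]
labels = ["pos", "pos", "neg", "neg"]

vocab = set().union(*docs)
weights = {t: information_gain(docs, labels, t) for t in vocab}
top_k = sorted(weights, key=weights.get, reverse=True)[:2]
print(sorted(top_k))
# -> ['boring', 'great'] -- the two perfectly class-separating tokens
```

RapidMiner's operator computes the same kind of per-attribute weight table; Select by Weights then keeps only the highest-weighted columns.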

Part 4 and 5: Build the training model and apply it.

Two algorithms are used to build the training model, Naive Bayes and SVM, in order to identify which one provides the best accuracy and performance.

Also, within the same process, both trained models are applied to the test data and their accuracy is verified (the test dataset was pre-processed in the previous parts).


Model and validation NB (Naive Bayes)

Model and validation SVM (Support Vector Machine)
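An equivalent comparison can be sketched with scikit-learn (an assumption for illustration; the post does this entirely in RapidMiner), training both models on the same token counts and comparing test accuracy on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus standing in for the pre-processed review datasets.
train_docs = [
    "a great moving film", "great acting and plot",
    "boring and slow film", "a dull slow plot",
]
train_y = ["pos", "pos", "neg", "neg"]
test_docs = ["great plot", "slow boring acting"]
test_y = ["pos", "neg"]

# Token occurrence counts, analogous to Process Documents' word vector.
vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

results = {}
for name, model in [("Naive Bayes", MultinomialNB()),
                    ("SVM", LinearSVC())]:
    model.fit(X_train, train_y)
    results[name] = (model.predict(X_test) == test_y).mean()
    print(f"{name}: accuracy {results[name]:.2f}")
```

On the real 90-feature dataset the two accuracies would differ, which is exactly the comparison the post's two validation processes make.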
