Building an Intelligent Recommendation Engine with Collaborative Filtering

#machinelearning #collaborativefiltering #recommendationsystem

In this post, we will build a recommendation system based on collaborative filtering. We will use patient ratings from a dataset of drugs and medical conditions to generate treatment suggestions.

Let's take a practical scenario in which medical practitioners have treated patients with different medical conditions using the most suitable drugs available. Each patient is diagnosed and then given a treatment plan, and together these cases form our body of experience.

The purpose of the recommendation system is to find patterns in the information patients provide during diagnosis, and then suggest the treatment plan that most closely matches the identified pattern.

By the end of this article, we will have looked more closely at how these recommendations work and how to find one preferred suggestion, plus the next five closest suggestions, for any treatment.

Definitions

A recommendation system predicts a user's behavior by comparing patterns in their past behavior with those of other users.

In simple terms, it is a filtering engine that uses all the available information to pick the most relevant items for a specific user. It is widely used by e-commerce sites such as Amazon and Flipkart, by streaming platforms such as YouTube and Netflix, and by personalized products such as Alexa and Google Home Mini.

For the medical industry, where suggestions must be as accurate as possible, a recommendation system should also take past experience into account. Such applications therefore use every available piece of information about a treatment.

Recommendation systems use information such as the medical conditions treated and their effect on each patient. They compare these patterns against each new treatment to find the closest match.

Concepts and Technology

To design the recommendation system, we need a few concepts, which are listed below.

Concepts: Pattern Recognition, Correlation, Cosine Similarity, Vector Norms (L1, L2, L-Infinity)
Language: Python (libraries: NumPy, pandas, SciPy, scikit-learn)

For prototype development, libraries such as SciPy and scikit-learn implement the algorithms for us. All we need is a little Python to call the library functions.
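To make the concepts listed above concrete, here is a minimal sketch of the three vector norms and of cosine similarity using NumPy. The two rating vectors are invented for illustration, and the helper name cosine_similarity is our own, not a library function.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings given by the same four patients to two drugs.
drug_a = np.array([3.0, 4.0, 0.0, 0.0])
drug_b = np.array([6.0, 8.0, 0.0, 0.0])  # same pattern, twice the scale

diff = drug_a - drug_b
l1 = np.linalg.norm(diff, ord=1)         # sum of absolute differences
l2 = np.linalg.norm(diff, ord=2)         # Euclidean distance
linf = np.linalg.norm(diff, ord=np.inf)  # largest single difference

similarity = cosine_similarity(drug_a, drug_b)
print(l1, l2, linf, similarity)
```

The norm-based distances are all non-zero here, while the cosine similarity is 1.0, because cosine similarity looks only at the direction of the rating pattern and ignores its scale.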

Different Approaches for Recommendation Systems

Below I have listed a few filtering approaches and examples:

Collaborative filtering: It is based on the ratings or responses that users give to items. Here, the suggestion is based on what users with similar rating patterns preferred. E.g., movie or mobile phone suggestions.

Content-based filtering: It is based on the attributes of items and the pattern of each user's own past activity. Here, the suggestion is based on items similar to those the user preferred before. E.g., food suggestions.

Popularity-based filtering: It is based on the pattern of popularity among all users. E.g., YouTube video suggestions.
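Of the three approaches above, popularity-based filtering is the simplest to sketch. The example below ranks items by mean rating with pandas. The data, the column names, and the min_ratings threshold are all assumptions made for illustration.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
ratings = pd.DataFrame({
    "item":   ["A", "A", "A", "B", "B", "C"],
    "rating": [9, 8, 10, 10, 9, 10],
})

def most_popular(ratings, top_n=2, min_ratings=2):
    """Rank items by mean rating, ignoring items with too few ratings."""
    stats = ratings.groupby("item")["rating"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_ratings]
    return stats.sort_values("mean", ascending=False).head(top_n).index.tolist()

popular_items = most_popular(ratings)
print(popular_items)
```

Item C has the highest mean rating but only one rating, so the threshold leaves it out. Without such a threshold, a single enthusiastic rating could dominate the ranking.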

Building on these filtering approaches, there are different types of recommender systems, which are explained below:

Multi-criteria recommender systems: Various criteria such as age, gender, location, likes, and dislikes are used to categorize users, and items are then suggested accordingly. E.g., apparel suggestions based on age and gender.

Risk-aware recommender systems: There is always some uncertainty when users use Internet applications (web or mobile). Any advertisement recommended over the Internet must take this risk into account, and users should be made aware of it. E.g., advertisement display suggestions in Internet applications.

Mobile recommender systems: These are location-based systems that use the user's current or future location to provide suggestions. E.g., travel and tourism suggestions.

Hybrid recommender systems: These combine multiple recommendation approaches. E.g., hotel and restaurant suggestions based on user preferences and travel information.

Collaborative and content recommender systems: These combine the collaborative and content-based approaches. E.g., suggesting the highest-rated movies that match a user's preferences and watch history.

Practical Example with Implementation

In this example, we have a sample dataset of drugs prescribed for various medical conditions, along with the ratings given by patients. For any given medical condition, we want to receive suggestions for the most suitable drugs for treatment.

Sample Dataset:

The sample below comes from the publicly available medical drug dataset used in the Winter 2018 Kaggle University Club Hackathon.

Sample Code:

We will do this in 5 steps:

1. Import the required libraries.
2. Read the drugsComTest_raw.csv file and create a pivot matrix.
3. Create a KNN model using the NearestNeighbors class with metric 'cosine' and algorithm 'brute'. Other possible values for the distance metric include 'cityblock', 'euclidean', 'l1', 'l2', and 'manhattan'. Possible values for the algorithm in scikit-learn are 'auto', 'ball_tree', 'kd_tree', and 'brute'. Note that the tree-based algorithms do not support the 'cosine' metric.
4. Select one medical condition at random, for which we have to suggest 5 drugs for treatment.
5. Find the 6 nearest neighbors of the sample by calling the kneighbors function on the KNN model created in step 3. The first neighbor is the sample medical condition itself, with a distance of 0. The next 5 neighbors are the closest matches, which provide the drug suggestions for the sample medical condition.
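The five steps above can be sketched as follows. This is a minimal illustration, not the exact code behind this article. It assumes the CSV has columns named condition, drugName, and rating, it uses a small invented table in place of the real file so that it runs on its own, and it assumes rows are conditions and missing ratings are filled with 0.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the CSV: column names are assumed, values are invented.
# With the real file: reviews = pd.read_csv("drugsComTest_raw.csv")
reviews = pd.DataFrame({
    "condition": ["Acne", "Acne", "Rosacea", "Rosacea", "Pain",
                  "Pain", "Pain", "Migraine", "Migraine"],
    "drugName":  ["DrugA", "DrugB", "DrugA", "DrugB", "DrugA",
                  "DrugC", "DrugD", "DrugC", "DrugD"],
    "rating":    [9, 2, 8, 3, 3, 7, 9, 6, 10],
})

# Step 2: one row per condition, one column per drug, mean rating as value.
pivot = reviews.pivot_table(index="condition", columns="drugName",
                            values="rating", aggfunc="mean").fillna(0)

# Step 3: brute-force KNN with cosine distance.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(pivot.values)

# Step 4: pick a condition (fixed here instead of random, for repeatability).
query = "Acne"

# Step 5: ask for k + 1 neighbors, because the closest row is the query itself.
k = 2
distances, indices = knn.kneighbors(pivot.loc[[query]].values, n_neighbors=k + 1)
neighbors = [(pivot.index[i], round(float(d), 3))
             for i, d in zip(indices[0], distances[0])]
print(neighbors)
```

With this layout the neighbors are conditions whose drug-rating patterns resemble the query. Transposing the pivot matrix (pivot.T) would instead put drugs on the rows, so that the neighbors of a drug are other drugs rated in a similar way. For the article's setting, k would be 5 and n_neighbors would be 6.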

Explanation:

This is a collaborative filtering recommendation system that uses patients' ratings of drug treatments to find similarities between medical conditions. We match the patterns in the ratings that patients gave to drugs: the system compares all the rating patterns and measures how alike they are using cosine similarity.

Challenges of Recommendation Systems

Any recommendation system needs a sufficient quantity of good-quality data to work with. We must be aware of the following challenges before developing such a system. Acknowledging and handling them improves the accuracy of the recommendations.

Cold Start: Making recommendations for a new user, or for a user without any previous behavior, is a problem. We can recommend the most popular options to them. E.g., YouTube video suggestions for newly registered users.
Not Enough Data: Insufficient data leads to recommendations with less certainty. E.g., hotel or restaurant suggestions will not be accurate if the system is uncertain about users' locations.
Grey Sheep Problem: This problem occurs when a user's inconsistent behavior makes it difficult to find a pattern. E.g., when multiple people use the same account, the recorded activity is very broad and the system has difficulty mapping it to a pattern.
Similar Items: In these cases, there is not enough data to distinguish between similar items. For these situations, we can recommend any of the similar items at random. E.g., apparel suggestions across colors and sizes, where all shirts look alike to the system.
Shilling Attacks: Users intentionally submit misleading ratings or reviews to cause bad or unwanted recommendations. Such behavior is unethical, but we cannot rule out the possibility of these attacks. E.g., manipulated user ratings and reviews on social media platforms.
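As an illustration of the cold start handling mentioned above, the sketch below falls back to the most popular items when a user has no history. The function names, the data, and the stand-in model are all hypothetical.

```python
def recommend(user_id, history, personalized, popular, top_n=3):
    """Return personalized picks when the user has history, else popular items."""
    if not history.get(user_id):
        # Cold start: nothing is known about this user yet.
        return popular[:top_n]
    return personalized(user_id)[:top_n]

def fake_model(user_id):
    """Stand-in for a trained recommender (hypothetical)."""
    return ["item3", "item8"]

# Hypothetical activity log and popularity ranking.
history = {"alice": ["item1", "item4"]}
popular = ["item9", "item2", "item7", "item5"]

new_user_picks = recommend("bob", history, fake_model, popular)
known_user_picks = recommend("alice", history, fake_model, popular)
print(new_user_picks, known_user_picks)
```

Keeping the fallback in one place makes it easy to replace the popularity list with another strategy later, for example suggestions based on the details a user provides at registration.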

Accuracy and Performance Measures

Accuracy evaluation is important because we continually monitor and try to improve our algorithms. The most common ways to evaluate them are user studies, online evaluations, and offline evaluations. Our recommendation models must be ready to learn from users' activity daily, so for online evaluations we have to test the recommendation system regularly.

If we understand the challenges of the recommendation system, we can prepare test datasets that target them and measure accuracy. With these dataset variations, we can improve our user studies and offline evaluations.

Online Evaluations: In online evaluations, prediction models are updated frequently with unmonitored data, which can lead to unexpected changes in accuracy. To verify this, the prediction models are first exposed to unmonitored data with low uncertainty, and the uncertainty of that data is then gradually increased.
Offline Evaluations: In offline evaluations, the prediction models are trained on a sample dataset that covers the expected kinds of uncertainty, together with the expected outcomes. To verify the models, the sample dataset is gradually updated and the predicted outcomes are compared with the actual ones. E.g., creating multiple users with known activity and checking that they receive sensible suggestions.
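One way to compare predicted and actual outcomes in an offline evaluation is precision at k: the share of the top k suggestions that match the expected ones. This metric is our choice for illustration and is not prescribed by the approach above. The drug names and the expected set are invented.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

# Hypothetical held-out case: what the system suggested vs. what was expected.
recommended = ["DrugA", "DrugB", "DrugC", "DrugD", "DrugE"]
relevant = {"DrugA", "DrugC", "DrugX"}

score = precision_at_k(recommended, relevant, k=5)
print(score)
```

Averaging this score over many held-out cases gives a single number that can be tracked as the sample dataset is updated.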

Conclusion

In this article, we learned about the approaches, challenges, and evaluation methods of recommendation systems, and worked through a practical example of a collaborative filtering recommendation system. We also explored various types of recommender systems and filtering approaches with real-world scenarios.

We also ran sample code on a publicly available medical drug dataset with patient ratings. Various options are available for the distance metric and the algorithm used in the NearestNeighbors calculation. Finally, we listed the challenges such a system faces and looked at the accuracy evaluation measures and the factors that affect and improve them.