Written by Manish Saraswat
At mobile.de, we continuously strive to provide our users with a better, faster and unique search experience.
Every day, millions of people visit mobile.de to find their dream car. The user journey typically starts by entering a search query and later refining it based on their requirements. If the user finds a relevant listing, they contact the seller to purchase the vehicle. Our search engine is responsible for matching users with the right sellers.
With over 1 million listings to display, finding the top 100 relevant results within a few milliseconds is an immense challenge. Not only do we need to ensure the listings match the user’s search intent, but we also must honour the exposure guarantees made to our premium dealers in their sales packages.
Identifying the ideal search results from over 1 million listings quickly while optimising for user relevance and business commitments requires an intricate balancing act.
In this post, I would like to share how we are building learning to rank models and deploying them in our infrastructure using a Python microservice.
What motivated us?
Our current Learning to Rank (LTR) system is integrated into our ElasticSearch cluster using the native ranking plugin. This plugin offers a scalable solution to deploy learning to rank models out-of-the-box.
While it has provided a solid foundation over several years, we have encountered some limitations:
- Our DevOps team faced plugin integration issues when upgrading ElasticSearch versions
- There is no automated model deployment, requiring manual pre-deployment checks by our data scientists. This introduces risks of human error.
- Overall system maintenance has become difficult
- Infrastructure bottlenecks prevent our data scientists from testing newer ML models that could improve relevance
Clearly, while the native ElasticSearch ranking plugin gave us an initial working solution, it has become an obstacle for iterating and improving our LTR capabilities. We realised the need to evolve to a more scalable, automated and flexible LTR architecture.
This would empower our data scientists to rapidly experiment with more advanced ranking algorithms while enabling easier system maintenance.
How did we start?
Realising that our outdated search architecture was the primary obstacle to improving relevance, we knew we needed a new approach to overcome this roadblock.
We initiated technical discussions with Site Reliability Engineers, Principal Backend Engineers and Product Managers to assess how revamping search could impact website experience.
Our solution had to balance speed with business metrics. We needed to keep search fast while improving key conversions like unique user conversion rate.
Based on the feedback, we decided to decouple the relevance algorithm into a separate microservice. To empower data scientists and engineers, we chose Python to align development and production environments closely while ensuring scalability.
Implementing Learning to Rank
There are several techniques to implement learning to rank (LTR) models in Python. Up until a few years ago, we were using a pointwise ranking approach, which worked well for us.
Last year, we decided to test a pairwise ranking model (trained using XGBoost) against the pointwise model and it outperformed in the A/B test.
This gave us good confidence to continue with the pairwise ranking approach. The latest XGBoost versions (>=2.0) also provide useful features, such as options for handling position bias during training. And since XGBoost supports custom loss functions, we trained the model with a multi-objective loss function.
In our case, the objectives are listing relevance and dealer exposure. As mentioned above, we try to optimise the balance between showing relevant results and showing listings from our premium/sponsored dealers at top positions.
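To make this concrete, here is a minimal sketch of how such a pairwise ranker can be set up with XGBoost's scikit-learn interface. The feature columns, labels and parameter values are made up for illustration, and our multi-objective custom loss is omitted; the sketch uses the built-in pairwise objective instead.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical training frame: one row per (query, listing) pair,
# sorted by query id as the ranking interface expects.
train = pd.DataFrame({
    "qid":        [0, 0, 0, 1, 1, 1, 1],          # search query / session id
    "ctr_30d":    [0.12, 0.05, 0.30, 0.08, 0.22, 0.01, 0.15],
    "price_rank": [0.2, 0.9, 0.4, 0.1, 0.5, 0.8, 0.3],
    "freshness":  [1.0, 0.3, 0.7, 0.9, 0.2, 0.6, 0.4],
    "label":      [2, 0, 3, 1, 2, 0, 1],          # graded relevance labels
})

features = ["ctr_30d", "price_rank", "freshness"]

ranker = xgb.XGBRanker(
    objective="rank:pairwise",         # pairwise LambdaMART-style objective
    lambdarank_unbiased=True,          # position-bias handling (XGBoost >= 2.0)
    lambdarank_num_pair_per_sample=8,  # pairs sampled per document
    n_estimators=200,
    learning_rate=0.1,
)

# qid tells XGBoost which rows belong to the same query,
# so pairs are only formed within a query.
ranker.fit(train[features], train["label"], qid=train["qid"])

scores = ranker.predict(train[features])  # higher score = higher position
```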
Training the models in a Jupyter notebook is the easy part. We can use all the features we need and build a model. However, as data scientists, we should always ask ourselves: will these features be available in production? Approaching a machine learning (ML) model from a product perspective helps to tackle lots of problems in advance.
Keeping feature feasibility in mind, we decided to test the model with the following raw and derived features (see the sketch after the list):
- Historical performance of the listing
- Historical performance of the seller
- Listing attributes (make, model, price, rating, location, etc.)
- Freshness of the listing
- Age of the listing (based on registration date)
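As a rough illustration of the derived features, the sketch below computes freshness and age from raw listing fields; the column names, dates and formulas are hypothetical stand-ins for our actual pipeline.

```python
import pandas as pd

# Hypothetical raw listing snapshot; real field names differ.
listings = pd.DataFrame({
    "listing_id":        [101, 102],
    "published_at":      pd.to_datetime(["2024-05-01", "2024-03-15"]),
    "registration_date": pd.to_datetime(["2021-06-01", "2018-01-01"]),
    "price_eur":         [18500, 9900],
})

now = pd.Timestamp("2024-05-10")

# Freshness: how recently the listing was published (in days).
listings["freshness_days"] = (now - listings["published_at"]).dt.days

# Age, based on the registration date (in years).
listings["age_years"] = (now - listings["registration_date"]).dt.days / 365.25
```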
When tested offline using the NDCG@k metric, these features gave us a good uplift compared to the existing model. We always aim for an uplift in offline metrics before testing a model online in an A/B test; this helps us iterate faster.
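For such an offline comparison, NDCG@k can be computed per query with scikit-learn, for example; the relevance grades and model scores below are made up, and in practice the metric is averaged over many queries.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query: true relevance grades of the candidate listings
# and the scores produced by the candidate model.
y_true = np.asarray([[3, 2, 0, 1, 0]])              # graded relevance labels
y_score = np.asarray([[0.9, 0.7, 0.6, 0.4, 0.1]])   # model scores

print(ndcg_score(y_true, y_score, k=5))  # NDCG@5 for this query
```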
How did we serve the models?
We learnt that serving a machine learning model has multiple aspects:
- Ensuring the model has access to the feature array at prediction time
- Ensuring the model is trained periodically to learn the latest trends in the business
To tackle these aspects, we used Airflow to schedule the ETL jobs that calculate the features. Thanks to our choice of features, we were able to precompute the feature vectors and store them in a feature store. To summarise, we had to set up the following jobs:
- To fetch the latest information, every update of a listing is pushed to a Kafka stream; we consume this stream with a Python service to update our feature arrays (a minimal sketch of this consumer follows the list below)
- Another task reads these updated feature arrays, generates predictions and stores them in our feature store
- A training job retrains the model once a week with an optimised set of parameters, versions the model and stores it in a GCP bucket
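Here is the minimal consumer sketch referenced above, assuming a kafka-python consumer and an in-memory dictionary as a stand-in for the feature store; the topic name, payload fields and feature derivation are hypothetical.

```python
import json

from kafka import KafkaConsumer  # kafka-python

# Hypothetical topic; real broker addresses and group ids differ.
consumer = KafkaConsumer(
    "listing-updates",
    bootstrap_servers="kafka:9092",
    group_id="ltr-feature-updater",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Stand-in for the real feature store client.
feature_store = {}

def build_feature_vector(listing: dict) -> list[float]:
    # Derive model features from the raw listing payload
    # (historical performance, price, freshness, ...).
    return [
        listing.get("ctr_30d", 0.0),
        listing.get("price_eur", 0.0),
        listing.get("freshness_days", 0.0),
    ]

for message in consumer:
    listing = message.value
    feature_store[listing["listing_id"]] = build_feature_vector(listing)
```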
We created a microservice (API) using FastAPI to serve the models. You might ask: why not Flask? We have been using FastAPI for quite some time now and have yet to hit a bottleneck that would make us consider other frameworks. FastAPI also has solid documentation that shares best practices for building an API.
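To give an idea of what the serving layer can look like, here is a minimal FastAPI sketch; the request schema, endpoint path and model loading are simplified stand-ins rather than our actual service code.

```python
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In the real service the model is loaded from a versioned GCP bucket;
# here we assume a local file for illustration.
model = xgb.XGBRanker()
model.load_model("ltr_model.json")

class RankRequest(BaseModel):
    listing_ids: list[int]
    features: list[list[float]]  # precomputed feature vectors, one per listing

@app.post("/rank")
def rank(request: RankRequest):
    # Score each listing and return the ids sorted by descending score.
    scores = model.predict(np.asarray(request.features, dtype=float))
    ranked = sorted(
        zip(request.listing_ids, scores.tolist()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {"ranking": [listing_id for listing_id, _ in ranked]}
```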
Our service workflow looks like the following:
- The service is developed in Python using FastAPI.
- The code gets pushed to GitHub. CI pipelines run linting, unit tests and integration tests, so every new line of code pushed is tested. The code is then packaged into a Docker image and pushed to a registry.
- The Docker image is deployed on Kubernetes (this part is mainly handled by our site ops team).
- Service health metrics are tracked using Grafana dashboards.
Show me the results
We were eager to see whether our months of hard work would make an impact. At mobile.de, the best part of being a data scientist is that you are involved in the end-to-end process.
After putting all the pieces together, we launched an A/B test for two weeks and recorded significant positive improvements in the business metrics. The new search relevance algorithm delivered these gains without affecting SRP (search result page) performance, with the microservice responding in under 30 milliseconds at p99.
This uplift is special for the team because the baseline we were competing against was already delivering solid results. Given the significant uplifts in our metrics, we believe the team has done a tremendous job in improving search relevance for our users, making it easier for them to find the right vehicle and contact the seller.
End Notes
In this post, I shared our experience building learning to rank models and serving them with a microservice in Python. The idea was to give you a high-level overview of the different aspects we touched on during this project.
All of this would not have been possible without an incredible team. Special thanks to Alex Thurm, Melanya Bidzyan and Stefan Elmlinger for contributing to this project at different stages.
In case you have questions/suggestions, feel free to write them below in the comments section. Stay tuned for more stories :)