In my previous organisation (Choice.AI), our customers (read: e-commerce shops) ranged from small stores with at most 100–1,000 unique visitors per month, to mid-size stores with around 30k–50k unique visitors per month, to large customers with 50k+ regular shoppers on their websites. Each store fell under a different pricing plan based on its traffic size, so our investment in each store also had to be planned around its ROI.
To deliver value faster, and equally, to all types of customers, we built solutions that were independent of store catalog and traffic size. As more stores became our clients, this approach started costing us more, because every store needed nearly the same investment in computing resources and power regardless of its size.
In this series of posts I will talk about the major problems we solved by “knowing” and understanding what our data contains. The solutions helped us allocate resources to stores based on their catalog and traffic size, without compromising the quality, relevance, or features the organisation offered.
Our primary problem was handling aggregated user-event analytics in a document-based data store (MongoDB in our case). Our main goal was to count the number of unique visitors across different dimensions and segments, where a dimension could be a Campaign or an Experiment and a segment could be a Device or a Traffic Source.
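To make the counting approach concrete, here is a minimal, self-contained HyperLogLog sketch in Python. It is an illustrative toy, not our production code: the register count (2^13 = 8192), the SHA-1 hash, and the `visitor-{i}` IDs are all assumptions for the example. Each (dimension, segment) cell would own one such register array.

```python
import hashlib
import math

P = 13               # 2**13 = 8192 registers, one byte each
M = 1 << P

def add(registers: bytearray, visitor_id: str) -> None:
    """Record one visitor in the register array."""
    h = int.from_bytes(hashlib.sha1(visitor_id.encode()).digest()[:8], "big")
    idx = h & (M - 1)                        # low P bits pick a register
    rest = h >> P                            # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def estimate(registers: bytearray) -> float:
    """Estimate the number of distinct visitors seen so far."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:             # small-range (linear counting) fix
        return M * math.log(M / zeros)
    return raw

# one register array per (dimension, segment) cell, e.g. a campaign x a device
regs = bytearray(M)
for i in range(10_000):
    add(regs, f"visitor-{i}")
print(round(estimate(regs)))   # close to 10000 (roughly 1% standard error)
```

The key property for analytics is that the array stays the same size however many visitors are added, and duplicate visitor IDs are absorbed for free by the `max`.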
To count unique visitors over spans of time known only at query time, we stored serialised HyperLogLog (HLL) register data in the data store; at query time the registers are deserialised and merged to produce the required count.
This serialised string was exactly 8192 bytes long: 8 bits per character, 8192 characters in total. Its length is independent of the value it represents; whether the count is as small as 1 or as large as 50k, every string has the same length.
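A toy sketch (with made-up register values) of the two properties just described: merging HLL register arrays is an element-wise max, and the serialised blob has a fixed length no matter how many visitors it represents.

```python
# merging two HLL register arrays is an element-wise max; the result
# estimates the cardinality of the UNION of the two visitor sets
week1 = bytes([3, 0, 5, 1])            # toy 4-register example
week2 = bytes([2, 4, 5, 0])
merged = bytes(max(a, b) for a, b in zip(week1, week2))
assert merged == bytes([3, 4, 5, 1])

# a real blob is always 8192 bytes, whether it has seen 1 visitor or 50k:
# each byte is a register holding a maximum "rank", not a running count
empty_period = bytes(8192)             # zero visitors so far
assert len(empty_period) == 8192
```

This max-merge is what makes arbitrary query-time spans possible: per-day blobs can be combined into any window without re-reading the raw events.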
Our customers came from different areas of e-commerce and offered different, unique products to end consumers. Because of our primary principle of treating every customer's data the same way, the most common issue we faced was interference: problems from one customer's badly structured data would spill into the processing of another customer's cleanly structured hierarchical data.
We allowed customers to import as many product attributes as they wanted, with filtering and aggregation support on all of them. This approach resulted in a key explosion: more fields to index in a heavily indexed document storage system (Elasticsearch in our case), which ultimately occupied more disk space and took more time to index and to retrieve data on filters.
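A minimal illustration of the key explosion, using made-up products: under dynamic mapping, every distinct attribute key across all imported documents becomes its own indexed field, so the field count grows with attribute variety, not with document count.

```python
# three products with ad-hoc attributes, as different stores might import them
products = [
    {"title": "T-shirt", "color": "red", "size": "M"},
    {"title": "Phone", "ram_gb": 8, "screen_inches": 6.1, "has_5g": True},
    {"title": "Sofa", "material": "leather", "seats": 3, "warranty_years": 2},
]

# with dynamic mapping, every distinct key becomes a separately indexed field
mapped_fields = set()
for doc in products:
    mapped_fields.update(doc)

print(sorted(mapped_fields))
print(len(mapped_fields))  # 9 distinct fields for only 3 documents
```

Scale this to thousands of stores, each with its own attribute vocabulary, and the mapping balloons: every field carries index structures on disk whether or not most documents even contain it.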
Next Posts in Series
- Optimising Document Based Storage - Know Your Data (KYD)
- Optimising Highly Indexed Document Storage - Know Your Data (KYD)