Instacart Market Basket Analysis

Which products will an Instacart consumer purchase again?

Abhishek Shaw
14 min read · Mar 3, 2021
Image Courtesy: https://www.slideshare.net/JeremyStanley4/deep-learning-at-instacart

Table of Contents

  1. Business Problem
  2. Business objectives and constraints
  3. Data overview and Data set column analysis
  4. Mapping the real world problem to a Machine Learning Problem
  5. Performance metric
  6. Exploratory Data Analysis
  7. Existing Solutions/Approaches
  8. First cut approach
  9. Feature Engineering
  10. Model building
  11. Inference Pipeline
  12. Summary
  13. Future Work
  14. Profile
  15. References

1. Business/Real-world Problem

One major question retailers have is which products users will purchase and when. Market Basket Analysis is a technique used by retailers to better understand customer behavior, specifically purchase patterns. The “people who bought this item also bought this” section we see on online shopping sites like Amazon is a result of Market Basket Analysis.

Instacart is an American company that operates a same-day grocery delivery service. The Instacart data set contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Our task is to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.

Apart from product recommendations, descriptive analysis of the data can help Instacart better understand which products are purchased together and which products are purchased most often, so that sufficient stock can be maintained and customers never find a product out of stock.

2. Business objectives and constraints

Objectives:

  • We need to predict which previously purchased products will be in a user’s next order
  • Maximize the mean F1 score.

Constraints:

  • Some form of interpretability
  • No strict latency constraints

3. Data overview and Data set column analysis

Data Source:

The data set for this problem is a relational set of files describing customers’ orders over time. The data set is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we are provided between 4 and 100 of their orders, with the sequence of products purchased in each order. We are also provided the week and hour of day the order was placed, and a relative measure of time between orders.

File Name : orders.csv

File Name: aisles.csv

File Name: departments.csv

File Name: order_products__*.csv (* includes prior, train)

The prior file contains each user’s previous purchase history [up to the (N-1)th order]

The train file contains each user’s last order only [the Nth order]

4. Mapping the real world problem to a Machine Learning Problem

Type of Machine Learning Problem:

Given a user and an order number, we need to predict which products will be ordered next. We can pose this as a time-series-flavored binary classification problem, since each (user, product) pair carries a label named reordered that takes the value 0 or 1. We call it time series forecasting because for each user we have the products ordered and the order number, which capture the user’s purchase behavior over time. We can use the prior data set to capture user X product behavior over time and create features from it using aggregation functions.
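To make this framing concrete, here is a minimal sketch of how the binary-classification data set can be assembled from the files described above. The column handling and variable names are assumptions for illustration, not the exact repository code.

```python
import pandas as pd

# File names as described in the data overview.
orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv").merge(orders[["order_id", "user_id"]], on="order_id")
train = pd.read_csv("order_products__train.csv").merge(orders[["order_id", "user_id"]], on="order_id")

# Candidates: every product a user has bought at least once in their prior orders.
candidates = prior[["user_id", "product_id"]].drop_duplicates()

# Target: 1 if that product shows up again in the user's last (train) order, else 0.
in_last_order = train[["user_id", "product_id"]].assign(label=1)
dataset = candidates.merge(in_last_order, on=["user_id", "product_id"], how="left")
dataset["label"] = dataset["label"].fillna(0).astype(int)
```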

5. Performance metric:

  • Mean F1 score (a small computation sketch follows this list)
  • Confusion matrix
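As a rough illustration, the mean F1 score averages a per-order F1 between the predicted and actual product sets. A minimal sketch, assuming plain Python sets rather than the exact Kaggle evaluator:

```python
def order_f1(predicted, actual):
    """F1 between the predicted and actual product sets of one order."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        return 1.0
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def mean_f1(preds_per_order, actuals_per_order):
    """Average the per-order F1 over all orders."""
    scores = [order_f1(p, a) for p, a in zip(preds_per_order, actuals_per_order)]
    return sum(scores) / len(scores)
```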

6. Exploratory Data Analysis

Before starting EDA we need to join the given files based on the relations between them to get the denormalized data. The preprocessing code is available on my GitHub repository.
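A minimal sketch of that denormalizing join is below. It assumes the standard Kaggle file names, including products.csv (part of the same data set, not listed above), which links products to aisles and departments.

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
order_products = pd.read_csv("order_products__prior.csv")  # or order_products__train.csv
products = pd.read_csv("products.csv")        # maps product_id -> aisle_id, department_id
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")

# Chain the joins to get one wide, denormalized table.
df_prior_final = (order_products
                  .merge(orders, on="order_id")
                  .merge(products, on="product_id")
                  .merge(aisles, on="aisle_id")
                  .merge(departments, on="department_id"))
```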

Basic sanity checks on the Data

df_prior_final is obtained by joining order_products_train.csv file with orders.csv, aisles.csv, departments.csv

Observation : Only the days_since_prior_order column contains NaN values. This is likely because those rows correspond to people placing their first order: from the above analysis we can see that for all such NaN rows, order_number is 1. This also makes sense, as people placing their first order don’t have any order history.

df_orders is the DataFrame built from the orders.csv file

Data distribution in different files

Observation : We have lots of data (3.2 million rows) about users’ previous purchase history. We need to give predictions on the test set of 75,000 orders (2.2% of total data points), as mentioned in the problem description.

Total number of Orders/Reorders per day

Observation : People are more likely to place orders on Saturday (day 0) and Sunday (day 1) than on other days. The same trend holds for reorders.

Total number of orders per hour

Observation : Most people buy products between 7 AM and 8 PM. 10 AM is the peak hour, when the most orders are received.

Total number of orders vs days_since_prior_order

Observation : People are clearly likely to reorder products on the 7th, 14th, 21st and 30th day after day 0, as we can see spikes at each of those days relative to their neighboring days. This also makes sense, as people tend to buy products on a weekly or monthly basis. The 30th day has the highest number of orders.

Total orders placed by user vs Order count

Observation : No user has placed fewer than 4 orders, and the maximum number of orders placed by a user is 100.

Bucket size of users vs Order count

Observation : Bucket sizes of 5 and 6 are the most frequent. The largest bucket size is 145, and only one user has bought 145 products in a single order.

Mean_add_to_cart_order vs Reorder

Observation : Products that are not reordered have a slightly higher add_to_cart_order than products that are reordered. People are likely to put reordered products in the cart first and non-reordered products in the cart last.

Word Cloud of Products Ordered

Observation : People are more likely to order organic products and fruits, especially organic fruits. Banana is the most ordered product.

Top 10 department based on orders and their reorder count

Observation : The Produce department has the highest order and reorder counts.

Top 10 aisle based on orders and their reorder count

Observation : Fresh fruits and fresh vegetables have the highest order and reorder counts, which is quite natural as people buy fruits and vegetables on a regular basis.

Contribution of each department and aisle to Total Reorders

Observation : The above tree-map shows the contribution of each department and its aisles to total reorders. Produce has the highest reorders, and within Produce, fresh fruits and fresh vegetables are the top aisles in terms of reorders.

7. Existing Solutions/Approaches

Approach1:

Here also the approach is to create user and user-product features.

  • User features include, for each user: average days between orders, nb_orders, total_items bought, all_products bought, average_basket size, etc.
  • User X Product features include, for each user-product pair: number of orders, last order_id, sum of positions in the cart.
  • Next, for each train order_id it combines the user features with the user-product features and also adds a new set of features. Some of those features are listed below:

○ user_total_orders
○ user_total_items
○ total_distinct_items
○ user_average_days_between_orders
○ user_average_basket
○ order_hour_of_day
○ days_since_prior_order
○ UP_orders [total number of times the user ordered the product]
○ UP_orders_ratio [total number of times the user ordered the product / user_total_orders]
○ etc.

The diagram below describes the process followed

Approach2:

Though the problem statement described in the above link is different from ours, its solution approach can help us solve the current problem. In our case we do not need to deal with cold-start problems, as we are predicting reorders for users who already have an order history.

The diagram below describes the process flow

MF is the matrix factorization technique. Surprise is a Python library for building recommender systems.

Approach3:

  • After merging the data, EDA is done on the data set to understand how reorder frequency changes across days of the week and hours of the day, the reorder interval, the most popular products, and sales per department.
  • For product recommendation the above notebook uses bigram counts to determine recommended products. First it groups the products by order_id. Then it replaces the spaces in each product name with underscores (_), and within each order the products are separated by spaces. For each product X, it considers the bigrams containing X and sorts them by total bigram count across all orders, in descending order (i.e. sum across axis=0). Then it picks the top N products from the bigrams containing X and recommends them. If N is greater than the total number of bigrams of X (say K), it first recommends the K products from those K bigrams and then tries to find products similar to the product in the highest-count bigram with X. A minimal sketch of the bigram counting follows this list.
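Here is a minimal sketch of the bigram-count idea, using toy data and a plain Python Counter rather than the notebook’s exact implementation:

```python
from collections import Counter

# Toy input: one list of product names per order, in add-to-cart order,
# with spaces replaced by underscores as described above.
orders = [
    ["Organic_Banana", "Whole_Milk", "Organic_Strawberries"],
    ["Whole_Milk", "Organic_Banana", "Bread"],
]

# Count consecutive product pairs (bigrams) across all orders.
bigram_counts = Counter()
for products in orders:
    for a, b in zip(products, products[1:]):
        bigram_counts[(a, b)] += 1

def recommend(product, n=5):
    """Top-n products that most often follow `product` within a basket."""
    related = [(pair[1], cnt) for pair, cnt in bigram_counts.items() if pair[0] == product]
    related.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in related[:n]]

print(recommend("Organic_Banana"))
```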

8. First cut approach

Approach:

Our first cut approach: for each user X product in each order, we look at the user’s purchase history in a window in the recent past, i.e. for each user the window contains order numbers less than the current order_number under consideration. We then calculate certain summary statistics within that window as part of feature engineering. For example, consider the last order_id in the above picture (pointed to by the arrow) with a window size of 1. We look back in time to where the order number < current order number (= 3). There are 2 earlier order numbers for that user_id (= 3). Since our window size is 1, we consider only the latest of them, order number 2, for that user_id (= 3). The window is highlighted in the above picture. Window size is a hyper-parameter that needs to be tuned.

Now the problem arises when processing such a large volume of data.

I have used a Linux VM in GCP with 8 CPU and 32 GB RAM.

To get things processed faster I used multiprocessing, dividing the large DataFrame into smaller DataFrames in such a way that each user’s complete purchase history is present in only one of the smaller DataFrames. The diagram below depicts the preprocessing flow; a minimal code sketch of the split follows.
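This is a sketch of that split-and-parallelize pattern. The chunking logic, file name and the placeholder process_chunk are assumptions for illustration, not the repository’s exact code:

```python
import multiprocessing as mp
import numpy as np
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the per-chunk feature_engineering work."""
    return chunk.groupby(["user_id", "product_id"]).size().reset_index(name="uxp_total_bought")

def split_by_user(df: pd.DataFrame, n_chunks: int = 8) -> list:
    """Split so each user's full history lands in exactly one chunk."""
    users = df["user_id"].unique()
    user_groups = np.array_split(users, n_chunks)
    return [df[df["user_id"].isin(group)] for group in user_groups]

if __name__ == "__main__":
    df_prior_final = pd.read_csv("df_prior_final.csv")  # hypothetical intermediate file
    chunks = split_by_user(df_prior_final, n_chunks=8)
    with mp.Pool(processes=8) as pool:
        results = pool.map(process_chunk, chunks)
    features = pd.concat(results, ignore_index=True)
```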

The feature_engineering function is used to extract features from the prior data for each row in the train data; the complete code can be found on my GitHub repository.

The function feature_engineering calculates certain features in a window

Here is a list of some of those features. uxp denotes user-product and u denotes user

  • uxp_reorder_bias
  • u_total_orders
  • u_reordered_ratio
  • u_average_days_between_orders
  • u_dow_most_orders
  • u_hod_most_orders
  • u_total_items_bought
  • u_avg_basket_size
  • uxDeparment_reorder_count
  • is_organic

After preprocessing I fed the data to tree-based ensemble models (Random Forest, GBDT, LightGBM). Though the confusion matrix and F1 score on the train and validation data were very good, my submission score on the test data was low.

It was only 0.23. Somehow this approach was not working for predicting the user’s next purchase basket.

Working Approach:

Below is the process flow of the working approach.

Process Flow

How we arrive at the above-mentioned features is described below.

9. Feature Engineering

User features

  • u_total_orders : Number of orders per User
  • u_reordered_ratio : How frequently a user has reordered products
  • u_average_days_between_orders : average days between user’s purchases
  • u_days_between_orders_std : standard deviation in user’s days_since_prior_order
  • u_dow_most_orders : Day of the week the user orders the most
  • u_hod_most_orders : Hour of the day the user has ordered the most
  • u_total_items_bought : Total products bought per user
  • u_total_unique_prod : Total unique products ordered
  • u_avg_basket_size, basket standard deviation, basket_sum
  • u_date_inscription : max date for each user excluding last order
  • u_tot_active_prod : Number of products that the user has reordered
  • u_reorder_ratio_bool : mean reorder value across products

Product features

  • p_total_purchases : Number of purchases for each product
  • p_reorder_ratio : How frequently a product has been reordered
  • p_avg_cart_position : Mean add to cart for each product
  • p_unique_user_count : user base for each product
  • p_recency_order : mean order_number
  • p_recency_order_rev : mean order_number_reverse
  • p_recency_date : mean date
  • p_freq_days : Mean days before the product is reordered
  • p_freq_order : Mean number of orders before the product is reordered
  • p_tot_active_usr : Number of users who have reordered the product
  • p_reorder_ratio_bool : mean reorder value across users for a particular product
  • p_trend_rt : number of times the product was ordered in the last order / number of times it was ordered in the 2nd-last order, across users
  • p_trend_diff : number of times the product was ordered in the last order minus the number of times it was ordered in the 2nd-last order, across users
  • *Product trend: we take the last 2 orders across users and calculate the product reorder trend

User X product Features

  • uxp_total_bought : number of times a user has bought a product
  • uxp_reorder_ratio : How frequently a user has reordered a particular product
  • uxp_avg_cart_position : user’s mean add_to_cart_order for a particular product
  • uxp_add_to_cart_order_relative_mean : user’s mean add_to_cart_order_relative for a particular product
  • uxp_add_to_cart_order_inverted_mean : user’s mean add_to_cart_order_inverted for a particular product
  • uxp_last_order_number : user’s last order_number for a particular product
  • uxp_first_order_number : user’s first order_number for a particular product
  • up_last_order_date : user’s last date for a particular product
  • up_first_order_date : user’s first date for a particular product
  • uxp_bought_last5 : whether the product was bought by the user in the last five orders. This captures user-product recency
  • uxp_date_strike : 1 / 2 ** (date / 7)
  • uxp_order_strike : 1 / 2 ** (order_number_reverse)

The concept behind the last two features is as follows.

For a user with 5 orders (NA means the product wasn’t bought yet; most recent order on the left):
0 0 0 0 1 = 1/2**5 = 0.03125 : a product bought the first time and never purchased again
1 1 NA NA NA = 1/2**1 + 1/2**2 = 0.75 : a product purchased in the last two orders
1 0 0 1 NA = 1/2**1 + 1/2**4 = 0.5625
This helps to focus on products that the user has recently started reordering.
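A minimal sketch of how uxp_order_strike could be computed, assuming an order_number_reverse column that counts back from the most recent prior order (the date-based uxp_date_strike is analogous):

```python
import pandas as pd

# order_number_reverse = 1 for the user's most recent prior order, 2 for the one before, etc.
df_prior_final["order_number_reverse"] = (
    df_prior_final.groupby("user_id")["order_number"].transform("max")
    - df_prior_final["order_number"] + 1
)

# Each purchase contributes 1 / 2**order_number_reverse; sum the contributions per user-product.
df_prior_final["strike"] = 1.0 / 2 ** df_prior_final["order_number_reverse"]
uxp_order_strike = (df_prior_final.groupby(["user_id", "product_id"])["strike"]
                    .sum().rename("uxp_order_strike").reset_index())
```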

Aisle features

  • a_total_purchases : Number of purchases for each aisle
  • a_reorder_ratio : How frequently an aisle has been reordered
  • a_avg_cart_position : Mean add to cart for each aisle.
  • a_unique_user_count : user base for each aisle
  • a_tot_active_user : Number of users who have reordered from the aisle
  • a_reorder_ratio_bool : mean reorder value across users for a particular aisle
  • uxa_unique_products_ratio : Ratio of products purchased by the user in this aisle

Department

  • d_total_purchases : Number of purchases for each department
  • d_reorder_ratio : How frequently a department has been reordered
  • d_avg_cart_position : Mean add to cart for each department.
  • d_unique_user_count : user base for each department
  • a_tot_active_user : Number of users who have reordered from the department
  • a_reorder_ratio_bool : mean reorder value across users for a particular department
  • uxa_unique_products_ratio : Ratio of products purchased by the user in this department

Word2Vec on products

The objective here is to identify products that are bought together or products that are similar. In each order a user buys a number of products. We use Word2Vec since it focuses on neighboring words that occur in the same context and tries to represent words as vectors. We can treat each order as a sentence and each product as a word.

We used window = longest order size because, with a smaller window, products outside the window would not be considered in the same context even though they were added to the cart in the same order. That is why we set the window size large enough to accommodate all products of an order. A minimal sketch follows.
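This sketch assumes the gensim 4.x API and treats each order as a “sentence” of product_id tokens; the exact parameters in the repository may differ.

```python
from gensim.models import Word2Vec

# One "sentence" per order: the product_ids in add-to-cart order, as string tokens.
sentences = (df_prior_final.sort_values(["order_id", "add_to_cart_order"])
             .groupby("order_id")["product_id"]
             .apply(lambda s: s.astype(str).tolist())
             .tolist())

longest = max(len(s) for s in sentences)  # window large enough to cover the biggest basket
w2v = Word2Vec(sentences, vector_size=100, window=longest, min_count=2, workers=4)

# Products most similar to a given product_id (24852 is Banana in this data set).
print(w2v.wv.most_similar("24852", topn=5))
```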

We used PCA to determine the right number of components to project the data onto.

From the plot above we can see that choosing n_components=40 preserves 80% of the variance.
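A minimal sketch of that check, assuming the product vectors come from the Word2Vec model above:

```python
import numpy as np
from sklearn.decomposition import PCA

# One Word2Vec vector per product, stacked into a matrix.
product_vectors = w2v.wv.vectors

pca = PCA().fit(product_vectors)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.80)) + 1  # first component count reaching 80% variance
print(n_components, cum_var[n_components - 1])

# Project onto the chosen number of components (about 40 in our case).
product_vectors_reduced = PCA(n_components=n_components).fit_transform(product_vectors)
```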

Encoding cyclic features

Here order_dow and order_hour_of_day are both cyclic features. For order_dow the cycle repeats from 0 to 6 and for order_hour_of_day from 0 to 23. Cyclic features cause problems if we don’t encode them properly. For example, the difference between hour 23 and hour 22 is 1, but between hour 23 and hour 0 it is 23, although the real difference in both cases is 1 hour. We need to encode these cyclic features in such a way that hour 23 and hour 0 are close even though their absolute difference is 23.

One common way to encode cyclic features is to use sine and cosine transformations.
We can do that using following transformation.
Xsin = sin(2∗π∗x/max(x))
Xcos = cos(2∗π∗x/max(x))
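A minimal sketch of this encoding, assuming we use the full cycle length (7 for day of week, 24 for hour of day) as the period:

```python
import numpy as np
import pandas as pd

def encode_cyclic(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    """Add sine/cosine encodings so that e.g. hour 23 and hour 0 end up close together."""
    df[col + "_sin"] = np.sin(2 * np.pi * df[col] / period)
    df[col + "_cos"] = np.cos(2 * np.pi * df[col] / period)
    return df

orders = encode_cyclic(orders, "order_dow", period=7)
orders = encode_cyclic(orders, "order_hour_of_day", period=24)
```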

Whether a product is Organic or not

We have seen that people are more likely to reorder organic products, so we are adding a flag that indicates whether a product is organic.
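One simple way to derive this flag (a sketch; the actual rule in the repository may differ) is a case-insensitive substring check on the product name:

```python
# Flag products whose name contains "organic".
products["is_organic"] = (products["product_name"]
                          .str.lower()
                          .str.contains("organic")
                          .astype(int))
```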

10. Model building

Since we have all our data ready after feature engineering, we can start building models.

There are different models to try, but in most cases ensemble-based models work best among classical models. So we start with one of my favorites, LightGBM, as it provides faster training and lower memory usage than XGBoost.

Hyper-parameter tuning LGBM using RandomizedSearchCV

We applied a random search to find the best set of LightGBM parameters. To produce the final predictions it is also important to find the right probability threshold: we tried different values and found that a threshold of 0.22 gave us the maximum F1 score. A rough sketch of this tuning follows.
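This is a hedged sketch of the tuning step; the parameter grid, cross-validation settings and the variables X_train, y_train, X_valid are assumptions for illustration.

```python
from lightgbm import LGBMClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# X_train, y_train: engineered features and labels; X_valid: held-out validation features.
param_dist = {
    "num_leaves": randint(31, 256),
    "learning_rate": uniform(0.01, 0.2),
    "n_estimators": randint(100, 1000),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    LGBMClassifier(objective="binary"),
    param_distributions=param_dist,
    n_iter=20,
    scoring="f1",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Convert predicted probabilities to labels with the tuned threshold (0.22 in our case).
proba = search.best_estimator_.predict_proba(X_valid)[:, 1]
pred = (proba >= 0.22).astype(int)
```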

The picture below depicts the model’s feature importances. The user-product last order date feature has the highest importance.

Feature importance of final Model

I also tried other models such as Random Forest, XGBoost and a Logistic Regression classifier, but their scores were lower compared to LightGBM.

11. Inference Pipeline

12. Summary

As expected, the LightGBM model performed best. The submission score achieved was 0.38. Below is a screenshot of the score obtained for the given problem. My leaderboard score is in the top 10%.

Kaggle Submission Score
Success

13. Future Work

Since the data is time series data, we can try LSTM, GRU, or even Transformer-based models like BERT to capture the user’s purchase pattern over time.

14. Profile

The complete analysis with deployable code can be found on my GitHub repository

15. References

  1. https://netflixtechblog.com/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
  2. https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/34977
  3. https://www.kaggle.com/errolpereira/xgboost-with-minimal-feature-engineering#Creating-user-product-features.
  4. https://www.kaggle.com/c/instacart-market-basket-analysis/discussion/38097
  5. http://blog.davidkaleko.com/feature-engineering-cyclical-features.html


Abhishek Shaw

Software Engineer with interest in Data driven analysis using Machine Learning and Deep Learning techniques.