Udacity Data Science Capstone Project — Starbucks data set
Project Overview
For this project I will be looking at a dataset provided by Starbucks that mimics customer behaviour on their mobile app. The data pertains to offers Starbucks issues to its customers, as well as the transactions those customers make, including any offers they have been sent. I have 3 different datasets related to:
· The offers available
· The users of the application
· Transactions made by the users
Problem Statement
I will be investigating how offers impact customer behaviour and looking to find the offers that trigger the biggest spend increase for specific customer groups. There are 2 ways to categorise the offers: by the type of offer (discount, BOGO or informational), or by the medium through which the offer was delivered (email, web, app, social media). To determine which offer type triggers the biggest spend increase, I will measure the average daily spend of each customer during an offer period and outside of an offer period. I will also measure the average number of transactions per day per customer. This should enable me to see whether there is an increase in the value of individual transactions, or an increase in the number of transactions during an offer, i.e. do offers drive bigger transactions, or more frequent transactions?
I will then perform a Student's t-test comparing transactions during and outside of an offer period to see if there is a statistically significant difference in user behaviour.
Once I have ascertained whether offers have a meaningful impact on customer behaviour, I will investigate which types of offers impact different user groups. I will be looking at user groups based on age and income. One of the problems with the dataset is that we have customers for whom we have no personal information. Whilst it is tempting to discount these ‘null users’ entirely, it makes more sense to consider them as their own user group. This will hopefully enable me to predict the behaviour of future members who also choose not to share personal information.
Metrics
The metrics I will be looking at are the p values when comparing transactions outside of and during an offer period. My hypotheses for these are:
H0 — Transaction amount during offer = Transaction amount outside of an offer
H1 — Transaction amount during offer ≠ Transaction amount outside of an offer
and
H0 — Daily Transactions during offer = Daily transactions outside of an offer
H1 — Daily Transactions during offer ≠ Daily transactions outside of an offer
I will perform a Student's t-test to see if there is a significant difference in both the average transaction value during an offer and the number of transactions during the offer. I will compare all transactions made during an offer with those made outside of an offer. I will consider a transaction to be during an offer if the user has viewed the offer and the transaction was made during the validity period of that offer.
To help discover whether offers impact customer behaviour differently, I will measure the correlation coefficients between a user's age and the difference in their spending habits during an offer, as well as between a user's income and that difference in spending habits.
Data Exploration
Profile Dataset:
This dataset refers to users of the mobile app. It contains information such as a user's age, gender, income and when they became a member. I knew that users who did not supply an age had a generic age of 118 entered. A quick count of null values showed 2175 nulls in both the gender and income columns, and a check of the number of people aged 118 found 2175 of those as well. Although it seemed likely that these 2175 null and generic values all related to the same rows, I wanted to confirm this. I therefore counted the nulls among only the users aged 118, and with 2175 null entries in income and gender within that group I could confirm that users who did not supply an age also did not supply a gender or income, while all other users supplied all information. Rather than discount these users entirely, as they made up a significant proportion of the population (2175 out of 17000 users), I decided to follow the logic of the age=118 placeholder and assign an income value of the maximum income + 10% and a gender of U for unknown for all of these users. This got rid of the null values and meant these users would appear in any visualisations I created, while remaining distinct enough that I would know I was looking at this group of “null users”.
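As a rough sketch of how that replacement might look in pandas (the file name and exact column names here are assumptions based on the description above):

```python
import pandas as pd

# File name assumed; the profile data contains age, gender, income and membership date
profile = pd.read_json("profile.json", orient="records", lines=True)

# Users who declined to share personal details were given the placeholder age of 118
null_users = profile["age"] == 118

# Keep these users visible but clearly distinguishable: gender 'U' for unknown,
# income set to the maximum observed income + 10%
profile.loc[null_users, "gender"] = "U"
profile.loc[null_users, "income"] = profile["income"].max() * 1.1
```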
Portfolio Dataset:
This dataset refers to the offers that are sent to customers throughout the experiment. It only contains 10 rows and has no null values. One of the columns is a list-type object, so creating dummies from it required slightly different handling. I found the solution to that problem here: https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies
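In the spirit of the approach from that link, a minimal sketch of turning the list-type column into dummies might look like this (the file name and the column name 'channels' are assumptions):

```python
import pandas as pd

portfolio = pd.read_json("portfolio.json", orient="records", lines=True)  # file name assumed

# The list-type column holds values like ['email', 'web', 'mobile'], so pd.get_dummies
# can't be applied directly; exploding the lists gives one row per entry, which can
# then be dummied and aggregated back to one row per offer
channel_dummies = (
    portfolio["channels"]       # column name assumed
    .explode()
    .str.get_dummies()
    .groupby(level=0)
    .max()
)
portfolio = portfolio.drop(columns="channels").join(channel_dummies)
```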
Once I had created dummies from that column, I was able to analyse the number of offers sent via each method. I could see that every one of the 10 offers was sent over email, which makes tracking that channel going forward rather irrelevant, as no insight can be gained from it. I could also see that the available offers were a 40/40/20 split between discount, BOGO and informational offers.
Transcript Dataset:
This dataset contains information about user events. Events are broken down into “offer received”, “offer viewed”, “offer completed” and “transaction”. The event details were contained within a column in JSON format, for which I again created dummy columns. There were no null values in this dataset at all. One of the things I needed to check was whether users could receive the same offer more than once, and after a quick query I could see that this was the case. This would make cleaning the data trickier, as it would be harder to link an “offer viewed” event with the corresponding “offer received” event; I could no longer use just the offer ID.
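As a sketch of how that handling might look (the file name and the column names 'value' and 'event' are assumptions based on the description):

```python
import pandas as pd

transcript = pd.read_json("transcript.json", orient="records", lines=True)  # file name assumed

# The details column holds dictionaries such as {'offer id': ...} or {'amount': ...};
# flattening them into their own columns makes the events much easier to query
details = pd.json_normalize(transcript["value"].tolist())   # column name assumed
transcript = pd.concat([transcript.drop(columns="value"), details], axis=1)

# Dummy columns for the four event types
transcript = pd.concat([transcript, pd.get_dummies(transcript["event"])], axis=1)
```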
Data Visualisations
Data Pre-Processing
One of the biggest challenges will be correlating which transactions were part of which offer. I need to consider whether the user viewed the offer prior to the transaction, and whether the transaction falls within the validity period of the offer. It is important to track this, as I don't want to credit an offer with having impacted customer behaviour if the customer never actually viewed that offer.
To overcome this problem, I will need to use the validity period from the offers data to add a new “valid until” field to the transaction dataset. Once I have this column, I will be able to check whether a view fell within the validity period of the offer; if it did, I will mark the customer as aware of the offer. Once I know which offers a customer is aware of, I will be able to determine whether any transactions were made while the customer was or was not aware of an offer, and assign the offer information to those transactions.
I will then be able to calculate how many days a user was aware of an offer by subtracting the day the offer was viewed from the day the offer was valid until. This will allow me to calculate the number of transactions made during offer periods, and the transactions per day of an offer.
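A rough sketch of that pre-processing, assuming the time and duration fields are on the same daily scale and using hypothetical column names (person, offer_id, time, duration):

```python
# Work out the validity window for each 'offer received' event
received = transcript[transcript["offer received"] == 1].copy()
received = received.merge(portfolio[["id", "duration"]],
                          left_on="offer_id", right_on="id", how="left")
received["valid_until"] = received["time"] + received["duration"]

# The customer only counts as 'aware' of an offer if they viewed it inside that window
viewed = transcript.loc[transcript["offer viewed"] == 1,
                        ["person", "offer_id", "time"]]
aware = received.merge(viewed, on=["person", "offer_id"],
                       suffixes=("", "_viewed"))
aware = aware[(aware["time_viewed"] >= aware["time"]) &
              (aware["time_viewed"] <= aware["valid_until"])]
```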
Implementation
The algorithm used to calculate whether an offer was completed, or whether a transaction fell within an offer's validity period, took a very long time to run, due to having to cross-reference different rows of the dataframe. This meant I needed to save the output to a csv after this step had been completed, so that I could continue with my analysis when revisiting the project at different times. The same was true for working out whether the customer was aware of the offer when completing it or making a transaction, and for adding the offer information to the transactions made while the customer was aware of the offer.
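One simple way to avoid re-running that expensive step on every visit is to cache the result to disk, along these lines (the file name is just an example):

```python
import os
import pandas as pd

CACHE_PATH = "transactions_with_offers.csv"  # example file name

def load_or_compute(compute_fn, path=CACHE_PATH):
    """Reload the slow cross-referencing output from disk if it already exists."""
    if os.path.exists(path):
        return pd.read_csv(path)
    result = compute_fn()
    result.to_csv(path, index=False)
    return result
```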
Evaluation and Validation
I could also now see if the transaction amount increased during an offer. I created a boxplot of transaction amounts during and out of offers, which showed a massive number of outliers:
Based on this I decided to remove anything above the 99th percentile, as I felt that transactions of such high amounts would not have been affected by vouchers; people would not be spending $500 on an item to save $2 when they only needed to spend $10. These high-value transactions were more likely to relate to things like catering for significant events in the users' personal lives than to any offers received. Once I had removed these outliers the boxplot proved a lot more useful:
This shows that the transaction amount was slightly higher when a user was aware of an offer, but from the plot the difference doesn't appear particularly large. However, when performing a Student's t-test on these 2 datasets I found a p value of 0.000003, indicating that there was in fact a statistically significant difference between the 2 purchase amounts.
I also wanted to check whether the number of transactions increased during an offer period. Although the difference in transaction amount is statistically significant, the increase is very small, so it is probably not practically significant. However, if we can show that not only the value of each transaction increases but also the frequency of the transactions, that adds more weight to the argument that offers are a valid way of increasing revenue. This test came back with a p value of effectively 0, so we can be very confident that the number of transactions per day increases when a customer is aware of a valid offer.
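A minimal sketch of the outlier trimming and the t-tests described above, assuming a transactions dataframe with hypothetical 'amount' and 'aware' columns:

```python
from scipy import stats

# Drop transaction amounts above the 99th percentile before comparing groups
cutoff = transactions["amount"].quantile(0.99)
trimmed = transactions[transactions["amount"] <= cutoff]

# Two-sided Student's t-test: transaction amounts while aware of an offer vs not
in_offer = trimmed.loc[trimmed["aware"] == 1, "amount"]
out_offer = trimmed.loc[trimmed["aware"] == 0, "amount"]
t_amount, p_amount = stats.ttest_ind(in_offer, out_offer)
print(f"amount: t = {t_amount:.3f}, p = {p_amount:.6f}")

# The same test can then be repeated on daily transaction counts per customer
```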
I now wanted to investigate whether different offer types were taken up by different user groups. I wanted an overall view of spending habits per customer, so I calculated total and daily spend and transaction counts overall, during offers, outside of offers, and for each individual offer type. I then calculated the difference per user between their in-offer and out-of-offer spend. When plotting this on scatter graphs there were no immediate patterns evident:
As the offer type didn't reveal much, I checked against the medium the offer was delivered over, and the visuals were incredibly similar.
This led me to check the number of transactions based on a customer's age. I decided to calculate this as spend per person rather than as a total for each age in the dataset, ensuring that the output accounts for having more people in certain age groups. Again, though, this did not provide any real insight into which ages would respond best to which offer type; no patterns were immediately obvious, other than younger customers tending to make more transactions outside of offer periods.
I then looked at the daily transaction count per person per age to see if this gave any meaningful insight:
Again, there are no obvious patterns present from these visualisations.
Although the visualisations suggest there is no correlation between age or salary and spend during an offer, I wanted to confirm this. To do so I calculated the Spearman's correlation coefficient between each of these 2 parameters and a customer's daily difference in transactions. Age and daily difference came out at 0.010, and income and daily difference at 0.015. Both numbers back up the visuals: there is no significant correlation between a customer's change in behaviour during an offer and their age or income.
There was a flaw in this approach though, in that users who had never even viewed an offer were also included when calculating the coefficients. After removing the users who hadn't viewed an offer at all (those with 0 offer days) I calculated the Spearman's coefficients again, and found that although the results indicated slightly more of a correlation, they were still largely insignificant, with scores of 0.064 and 0.1033 for age and income respectively.
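A sketch of that calculation with scipy, assuming a per-customer summary table with hypothetical 'offer_days' and 'daily_diff' columns:

```python
from scipy import stats

# Only keep customers who actually viewed at least one offer
viewers = customer_stats[customer_stats["offer_days"] > 0]

rho_age, p_age = stats.spearmanr(viewers["age"], viewers["daily_diff"])
rho_income, p_income = stats.spearmanr(viewers["income"], viewers["daily_diff"])
print(f"age: rho = {rho_age:.3f}, income: rho = {rho_income:.3f}")
```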
Refinement
When checking for a statistically significant increase in spend per day during an offer period, I had not considered the cost of the offer. This is important, because if the increase in spend is not as high as the cost of the offer, then although we can prove that offers generate statistically significantly more income, we cannot determine whether they are practically significant. Once the cost of the offer is taken into account, we can confirm whether they are both statistically and practically significant. To do this I will need to include a 'reward per day' and use it when calculating the practical increase in daily transaction amount. Once I have this data, I will be able to perform the Student's t-test again to determine whether there is still a statistically significant difference between spend during and outside of an offer validity period.
This again gave a p value of 0, which means we would reject the null hypothesis that the in-offer and out-of-offer spend per day are the same. However, this is a 2-tailed test, and doesn't actually indicate which is greater. To find that out we need to check whether our statistic is positive or negative (it's positive, 12.003) and then divide our p value by 2. If the statistic is positive, and the p value / 2 is less than alpha, we can reject our null hypothesis that
Daily spend during offer <= Daily spend outside of an offer
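As a sketch, assuming the net daily in-offer spend (spend minus reward per day) and the out-of-offer daily spend are held in two arrays, the one-sided check looks like this:

```python
from scipy import stats

alpha = 0.05
t_stat, p_two_sided = stats.ttest_ind(net_daily_spend_in_offer,
                                      daily_spend_out_of_offer)

# One-sided version of the test: reject H0 (in-offer spend <= out-of-offer spend)
# only when the statistic is positive and the halved p value is below alpha
if t_stat > 0 and p_two_sided / 2 < alpha:
    print("Reject H0: daily spend during an offer is significantly higher")
```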
Justification
We have been able to show that customers' spending behaviour changes significantly when they are aware of an offer. We did this by rejecting the following null hypotheses:
Transaction amount during offer = Transaction amount outside of an offer
Daily Transactions during offer = Daily transactions outside of an offer
Daily spend during offer <= Daily spend outside of an offer
We could reject all of these with greater than 95% confidence thanks to p values of practically 0 (to within 5 decimal places!).
I also wanted to ascertain which user groups responded best to which offers. However, when checking the Spearman's correlation coefficient for both age and salary against offer spend and number of transactions, these came out at 0.010 and 0.015, so it does not appear that there is any correlation between age or income and a change in behaviour when a user is aware of an offer. This was supported by several visualisations, which also suggested there would not be a correlation with gender, or with how long ago a user created their account.
Reflection
This leaves me with the conclusion that all customers react equally to offers when they are aware of them, which leads to the question: which customers are more likely to view an offer? However, that question is beyond the scope of this investigation.
There was a lot of data cleaning involved in this project, to get the required data elements to be able to correlate transactions with offers. I really enjoyed this part of the project as it was fascinating to see just how much data could be inferred from 3 seemingly simple tables. There were a lot of challenges with the cleaning, mainly due to having to cross reference different rows of a dataframe against other rows within it. I could not find a more efficient way of doing this than iterating through all the relevant rows and applying the output of these to others in the dataframe.
The data came through in such a way that creating dummies from a couple of the columns was more complex than simply using the pd.get_dummies method: one was a list-type column and the other was in JSON format. This required a bit of research into how to extract dummy values from these columns.
Whilst it is disappointing not to be able to pinpoint a specific user group that responds best to certain offer types, it is satisfying to know that there is a statistically significant increase in customers' spend per transaction (although, being so small, this isn't particularly practically significant) and an increase in the number of daily transactions for customers who are aware of an offer. If I were to attempt this project again, I would focus on customer view rates of offers, and try to determine the best medium through which to deliver an offer to a specific user to ensure they view it. As we can now show that customers who view offers spend more money, getting more customers to view their offers would be the logical next step in increasing revenue.
Improvement
I would have liked to improve the speed of some of the algorithms used in the pre-processing of the data. Whilst this wouldn’t have an impact on the results themselves, it would have increased the speed with which these results could be computed. To improve this I would need to research how to perform operations on DataFrame rows, based on values contained within other rows.
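One candidate approach, sketched below with hypothetical column names, would be to replace the row-by-row loop with a per-person merge and a single vectorised filter on the validity window:

```python
# Pair every transaction with every offer window seen by the same customer,
# then keep only the pairs where the transaction falls inside the window
pairs = transactions.merge(offer_windows, on="person", how="inner")
in_window = pairs[(pairs["time"] >= pairs["viewed_at"]) &
                  (pairs["time"] <= pairs["valid_until"])]
```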