Bellabeat Case Study

The following analysis was done as a Capstone Project in order to complete the Google Data Analytics Certificate
Data Analysis
SQL
Google Sheets
Author

Margarita Valdés A.

Published

May 9, 2024

How can a Wellness Technology Company Pay it Smart?

The following analysis was performed using BigQuery (SQL) and Connected Sheets (Google Sheets)

1. Summary of business task

Bellabeat was founded in 2014, the company developed one of the first wearables specifically designed for women and has since gone on to create a portfolio of digital products for tracking and improving the health of women. Its focus is on creating innovative health and wellness products for women. The company’s mission is to empower women to take control of their health by providing them with technology-driven solutions that blend design and function.

Business task: Analyze smart device data to gain insight into how consumers are using their smart devices. Focus on one of Bellabeat’s products and give insights about what was discovered in the analysis, to help guide the marketing strategy for the company.

2. Description of the datasets

The data used for this analysis was downloaded from the Kaggle page: FitBit Fitness Tracker Data. The data comes from Fitbit users that responded to a survey via Amazon Mechanical Turk. The data was collected from 03.12.2016 to 05.12.2016.

The downloaded data came in two folders, one for each month of collected data:

  1. Fitabase Data 3.12.16-4.12.16 (11 files)
  • dailyActivity_merged.csv
  • heartrate_seconds_merged.csv
  • hourlyCalories_merged.csv
  • houryIntensities_merged.csv
  • hourlySteps_merged.csv
  • minuteCaloriesNarrow_merged.csv
  • minuteIntensitiesNarrow_merged.csv
  • minuteMETsNarrow_merged.csv
  • minuteSleep_merged.csv
  • minuteStepsNarrow_merged.csv
  • weightLogInfo_merged.csv
  1. Fitabase Data 4.12.16-5.12.16 (18 files)
  • dailyActivity_merged.csv
  • dailyCalories_merged.csv
  • dailyIntensities_merged.csv
  • dailySteps_merged.csv
  • heartrate_seconds_merged.csv
  • hourlyCalories_merged.csv
  • houryIntensities_merged.csv
  • hourlySteps_merged.csv
  • minuteCaloriesNarrow_merged.csv
  • minuteCaloriesWide_merged.csv
  • minuteIntensitiesNarrow_merged.csv
  • minuteIntensitiesWide_merged.csv
  • minuteMETsNarrow_merged.csv
  • minuteSleep_merged.csv
  • minuteStepsNarrow_merged.csv
  • minuteStepsWide_merged.csv
  • sleepDay_merged.csv
  • weightLogInfo_merged.csv

3. Observations about the data

The downloaded data sets were not consistent for both months. There are some extra data sets in the second month that are just the same data in its wide format instead of narrow format. However, there are only data sets with information for daily intensities, daily steps, and sleep day for the second month. After inspection of each dataset I realized that most of the data containing intensities, steps, and calories is already summarized in the dailyActivity_merged.csv data sets. Therefore, I focused on cleaning and analyzing the following datasets:

  • Daily Activity
  • Sleep Day
  • Weight Log Info

To understand the collected data I inspected each dataset in SQL to assess how many unique Ids were available per data set and whether they were shared in the other sets.

4. Cleaning or manipulation of the data

Before I could import the datasets as tables to BigQuery, I cleaned the data using Google Sheets. I searched for duplicates and formatted all the Dates columns since BigQuery did not recognize the date-time format with AM/PM at the end.

I then proceeded to check for unique Ids in all datasets and to compare if they were shared among them using the following queries:

  • Count unique Ids in each table:

SELECT

COUNT(DISTINCT Id) AS Tot_id

FROM `[project_name].Bellabeat.[dataset]`
  • Checked for shared Ids between tables:

SELECT

COUNT(DISTINCT table1.id)

FROM `[project_name].Bellabeat.[dataset1]` AS table1

INNER JOIN `[project_name].Bellabeat.[dataset2]` AS table2

ON table1.id = table2.id

From applying the queries above to all tables, I made this summary of the Ids that were shared between datasets to understand how big the sample was, and how I could join the datasets.

Datasets Unique Ids Shared with daily_activity_3_4 Shared with weightLogInfo_3_4
daily_activity_3_4 35 35 -
daily_activity_4_5 35 33 -
SleepDay_4_5 25 24 -
weightLogInfo_3_4 11 11 -
weightLogInfo_4_5 8 8 6

There were only 34 consistent participants that shared data in both of the daily activities data sets. And from the Sleep Day data, only 24. The number of participants is very low, so I decided to merge the data sets including all participants that appeared in both of the daily activity tables. The field names were the same in both sets, so I proceeded to merge them and saved the table in the same project with the following query under the name ‘daily_activity_merged’.

SELECT *
FROM `[project_name].Bellabeat.daily_activity_3_4`
UNION ALL
SELECT *
FROM `[project_name].Bellabeat.daily_activity_4_5`

I then made a summary table including the total and average values of the daily measurements of steps, total distance, sedentary minutes, and burned calories per unique id with the following query, and saved it as ‘daily_activity_summary’:

SELECT
 id,
 COUNT(ActivityDate) AS active_days,
 SUM(TotalSteps) AS tot_steps,
 AVG(TotalSteps) AS avg_steps,
 SUM(TotalDistance) AS tot_distance,
 AVG(TotalDistance) AS avg_distance,
 SUM(SedentaryMinutes) AS tot_sedentary,
 AVG(SedentaryMinutes) AS avg_sedentary,
 SUM(Calories) AS total_calories,
 AVG(Calories) AS avg_calories
FROM `[project_name].Bellabeat.daily_activity_merged`
GROUP BY id
ORDER BY active_days DESC

For this analysis, I analyzed two tables, the above table with summary data and a general table including all individual measurements included in the daily_activity, the weight_log, and sleep_day data without grouping by id. To obtain the last mentioned table, I first merged both of the weight_logs tables with the following query and saved it as ‘weight_logs_merged’:

SELECT *
FROM `[project_name].Bellabeat.weight_log_3_4`
UNION ALL
SELECT *
FROM `[project_name].Bellabeat.weight_log_4_5`

Then, I joined all three main tables, ‘daily_activity_merged’, ‘weight_log_merged’, and ‘sleep_day’, joining them by Id and ActivityDate with the following query and named it ‘all_data’:

SELECT
 table1.*,
 table2.WeightKg,
 table2.WeightPounds,
 table2.BMI,
 table3.TotalMinutesAsleep,
 table3.TotalTimeInBed
FROM `[project_name].Bellabeat.daily_activity_merged` AS table1
LEFT JOIN `[project_name].Bellabeat.weight_logs_merged` AS table2
ON table1.Id = table2.Id
 AND table1.ActivityDate = table2.Date
LEFT JOIN `[project_name].Bellabeat.sleep_day` AS table3
 ON table1.Id = table3.Id
   AND table1.ActivityDate = table3.Sleep_Day
ORDER BY table1.ActivityDate

I exported both tables, ‘all_data’ and ‘daily_activity_summary’, to Connected Sheets to further explore, analyze and make visualizations.

5. Data Analysis, visualizations and key findings

I first analyzed how the data was collected during those 62 days and how many unique Ids recorded activities using the table ‘all_data’. I found that the number of recorded activities during the first month (march) were very low, and most users started recording until 01.04.2016 and up to 12-05-2016. There was oddly one day with more than 60 recorded activities, but for the most part, that period it looked even.

I then checked for the number of recorded activities for each of the users, or unique Ids. There were around 40 recorded activities on average for the whole sample, and only a few participants recorded significantly less or more activities than the average. It is a small sample to work with, so I decided to leave all participants in, and proceed with the analysis.

From the data above, I classified each user considering the frequency on which they use their trackers. Participants that used their trackers 15 days or less were considered infrequent users. Participants that used their trackers less than 40 of the 62 days were classified as moderate users. And, participants that used their trackers 40 or more days were considered frequent users. The doughnut chart below, shows that most of the analyzed sample is conformed by moderate and frequent users and the frequent users are 65% of the sample.

In the analyzed data set, the most basic measure that allows us to understand individual activity is the number of steps taken daily. To understand the levels of activity in our sample, and how much they are exercising, I calculated the daily average of total steps taken by the whole group conformed by 35 people, which is a little less than 7,000 steps. Then I checked if there were specific days of the week where they might exercise more or less. During Monday to Friday the number of steps is pretty close to the average, and it looks like people exercise a little more on Saturdays and less on Sundays. This information might help to understand when users might be more likely to actively exercise, and when they might prefer rest. Also, from Monday to Friday, users could be prompted to get more passive exercise by just taking more steps between work, commuting, stairs, etc.

Since the analyzed group is small I decided to check for steps taken on different days of the week by each of the individuals. A couple of individuals have recorded steps in only 2-3 days, however those were the ones who only used their trackers for a few days. The rest of the group varied, some showing to record more steps on Fridays, Saturdays, and/ or Sundays. However, differences were not so big when compared to steps taken the rest of the week.

I then proceeded to ask if there is a correlation between the measured steps and other measured variables. First, I wanted to know the relationship between daily steps and daily burned calories. The following scatter chart shows a positive relationship. In March of 2020, the NIH recommended to get at least 8,000 daily steps in order to reduce risk of death. However, the average daily steps in our group is below 7,000 steps.

I classified individuals in the group, depending on the average number of steps they take per day. Individuals that walk less than 4,000 steps were classified as Not Active. If they take between 4,000 to 8,000 steps, they are classified as Moderate individuals. If they take between 8,000 and 10,000 steps they are classified as Active, and if they track more than 10,000 daily steps on average they are considered Very Active individuals. This classification showed that 60% of the group does not take the 8,000 recommended steps. This data is important to incentivise users to get more steps even in casual daily scenarios, in order to get more daily steps.

As expected, there is also a positive relationship between the amount of Very Active Minutes and Burned Calories. This would be important for users interested in losing weight and setting weekly activity goals. Reminders, to get more active time could help for users that are interested in this goal.

Finally, a negative relationship was found between the amount of Sedentary Minutes and the Total Minutes Asleep. Only 24 participants recorded their Sleep activity, and the average amount of Total Minutes Asleep in this analysis is 445 min (7.4 hrs). The CDC recommends that adults sleep >=7 hours a day. This negative relationship indicates that the less active individuals are, the less sleep they get.

When checking for the average hours of sleep during the week, Sunday and Monday are the most restful nights with 8 hours of sleep. From Tuesday to Saturday there is a reduction in the average sleep. Implementing notifications to incentivise movement during the day, notifications to prepare before sleep, and information about the importance of sleep and sleep hygiene could help users to achieve an ideal amount of daily sleep.

6. Top high-level recommendations

Important Note: These suggestions are made based on the above described dataset, which consists of a very small sample of 35 individuals from which we do not have information about their gender or age. Also, the data was only collected during two months. Further focusing on a dataset with a bigger sample, preferably focusing on female individuals (since that is Bellabeat’s target group) and including more months of data would give more accurate information.

  • One of the first observations is that not all users are consistent in using their tracker. I would suggest focusing on the positive characteristics of the Bellabeat Ivy+, such as being lightweight, hypoallergenic, discrete and stylish, waterproof, and its long lasting battery. All which makes it a tracker that can be worn daily without needing to take it off to go to bed or to shower, helping to record your daily measurements in a consistent manner and to get more accurate trends and recommendations. 

  • I would focus on women who might not be primarily focused on getting their exercise stats, but want to achieve a more mindful and balanced life and who like their accessories to reflect their lifestyle and personality.

  • For the Bellabeat app, I would make recommendations to promote ideas for their daily activities. Since most individuals do not get their needed daily 8,000 steps, I would incentivise walking even if it is in more casual scenarios. For people who don’t have time to exercise, promote taking the stairs instead of elevators, walking to nearby places, biking, etc. Since, less sedentarism is related to more (and maybe better) sleep, you can create short info capsules about why being more active during the day can help with having a good night sleep and the importance of sleep and its health benefits. You can also add reminders of when users, according to their preferences, should prepare to go to sleep in order to achieve a good night’s sleep. In general the app should both show individual stats, but also help incentivize the user to be more active giving good suggestions on how to do it.