How to Identify a Trend and Turn It Into Business Opportunity?
Backstory: In this scenario, the stakeholder did not give out clearly defined Business Objective. I am tasked with handling a database, study it to find out any plausible patterns regarding how users use the health tracking device, and present a marketing pitch bashed on my findings. To that end, I need to first re-examine the nature of the product to come up with hypothesis and work along it.
Hypothesis Building
To have a better understanding at the data I am working with, I started first by asking some questions regarding the data source; the users.
Who: People who are interested in or in need of improving their living habit.
What: Exercising and Sleeping pattern
When: Whenever the user is wearing the device
Why: Use collected health data to help correct/discipline user's living habit
After looking at the answers, I was able to narrow my scope further; I would be looking at elements that may contribute to increasing the interest of people or providing assistance to improve their living habit; particularly, their exercising and sleeping pattern. In other words, transforming the smart device from a mere passive tracker to an active agent in building and reinforcing healthy life style.
With this objective in mind, I then performed a more detailed inspection on spreadsheet columns to see if there's anything stands out that I would want to use immediately;
Although I found that a comparison between calories versus activity intensity or steps made in a day alone may be too common to be considered an unique business opportunity. On the bright side however, I was able to identify a very positive correlation between calories expenditure and exercises done at any time from the vast amount of data points.
Now knowing that the data is reliable, from there, I picked out the top 5 and bottom 5 active users. I had decided to study both their exercising habits from above, and their sleeping habits, which we will further elaborate below and attempt to make a connection. This angle has later on proven to be more insightful than I had previously thought, and thus established as the core of my findings.
Business Questions
-
Is there a difference in sleeping quality or habit among the most active and the least active users?
-
If there is any difference, what business opportunities can we build on top the already existing services?
Result
-
"Most active" and "least active" users have different sleeping hours and waking hours.
-
"Most active" users tend to sleep early and wake up early. "Least active" users tend to have a more diverse and irregular sleeping habit.
-
Despite having more record logs from "most active" users, the amount of "restless" period during sleeping hours recorded from the "least active" user group is still quantitatively higher.
Suggestions
-
Market the smart device as a sleeping quality improvement device
-
Provide services that help prepare users to sleep at healthy hours, and then reinforce their habit.
DASHBOARDS (TABLEAU)
-
The "most active" users engage more frequently with the device than the "least active" users.
-
The "most active" users also consistently sleep early and rise early.
-
The "least active" users have a sleeping habit that varies greatly.
-
Restlessness (Yellow line) is the focus of this study
-
"Most Active" users experience a steady sleeping habit
-
"Least Active" users experience a varied quality of sleep
-
Despite having a less amount of sleeping record data, "Least Active" users have a higher sum of "restlessness" recorded. It implies that their sleeping habit contributes very negatively to their sleeping quality.
Data Cleaning Journey
MySQL
One of the biggest headache I had encountered during the data cleaning process was to try to correct the datetime format into one that MySQL can read.
For the sake of practising string manipulation, I opted to do things the long way by utilising substring_index() and concat() to rearrange the database datetime format into “%Y-%m%d %h:%i:%s %p”
Much to my surprise however; that didn’t work. I thought I made a mistake somewhere along the way so I then opted for the more direct approach by using str_to_date() function, which didn’t work either.
I recognised that MySQL in many cases have more than 1 function representing the same datetime format, such as %h and %I, both representing Hour (00 to 12), which had led me to suspect that it could be the reason of errors; that MySQL prefers a certain form of expression over others. Eventually, I found out that in case of a time structure that is the same as “%h:%i:%s %p” (Time in 12 hour AM or PM format (hh:mm:ss AM/PM) ), MySQL would not process it unless it is interpreted as %r.
With all the format and datatype problem solved, next is to have a better look at the tables and decide what other bad apples need to be picked out.
Doing a count(*) and group by ID reveals that not all users submitted a substantial amount of data; some were severely lacking. The database description had suggested that it contained approximately 30 user’s voluntary record over 2 months. However, Of the 33 rows of record, while most ID had 30 to 31 rows of log, a few of them have under half of 30. Those are excluded from the further processing. A closer inspection would further reveal that despite having a complete record, a couple other users also had not had any exercising record at all (TotalSteps and everything else 0). The small sample size and otherwise unhelpful data dictates that those couple records are to be excluded as well.
With the bad apples picked off, the remaining data are largely standardised. After some table joining to simplify table relations, the output was exported to tableau to check for data relevance and prepare for plotting.
​
Up next, Python Project: Cervical Cancer Risk Analysis
​