What is Most Associated With a High Risk of Cervical Cancer?
And when should one be checked, just to be safe?
Plotting and Discovery
Made with the Seaborn library. A half-size heatmap is used to limit the size of the plot.
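If "half-size" here means showing only one triangle of the symmetric correlation matrix, it could be produced roughly as follows. This is a sketch under that assumption, not the exact code behind the plot: the toy table, the colour map, and the helper name plot_half_heatmap are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are random and purely illustrative.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 2, size=(50, 4)),
    columns=["Dx:Cancer", "Dx:HPV", "Smokes", "IUD"],
)

corr = df.corr()

# True marks cells to hide: the diagonal and everything above it,
# since the matrix is symmetric and the upper half repeats the lower half.
mask = np.triu(np.ones(corr.shape, dtype=bool))


def plot_half_heatmap(corr, mask):
    """Hypothetical helper: draw only the lower triangle of the matrix."""
    # Imported here so the rest of the sketch runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr, mask=mask, cmap="coolwarm", vmin=-1, vmax=1, square=True)
    plt.tight_layout()
    return ax
```

Calling plot_half_heatmap(corr, mask) draws the figure; the mask is what halves the number of visible cells.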
​
The attribute to focus on here is Dx:Cancer. It is a Boolean attribute in the dataset, where 1 means positive for cancer.
The heatmap shows that only one attribute has a high direct correlation with it: Dx:HPV.
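The same reading can be taken from the numbers instead of the colours by ranking every attribute by the strength of its correlation with Dx:Cancer. The sketch below uses made-up values, so only the mechanics carry over, not the result.

```python
import pandas as pd

# Made-up values, only to show the mechanics; column names follow the dataset.
df = pd.DataFrame({
    "Dx:Cancer": [0, 0, 1, 1, 0, 0, 1, 0],
    "Dx:HPV":    [0, 0, 1, 1, 0, 0, 1, 0],
    "Smokes":    [1, 0, 1, 0, 0, 1, 0, 1],
    "IUD":       [0, 1, 0, 0, 1, 0, 1, 0],
})

target = "Dx:Cancer"

# Pearson correlation of every column with the target, strongest first.
ranked = (
    df.corr()[target]
    .drop(target)  # the target always correlates perfectly with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations from being buried at the bottom of the list.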
​
HPV stands for human papillomavirus. It is a sexually transmitted infection (spread through skin-to-skin contact). Further research has revealed that HPV can cause infected cells to multiply in an uncontrolled manner. When the growth is not controlled by the body, these cells become precancerous and may become cancerous if left untreated.
​
​
​
So at what age should a woman start doing checkups?
​
The dataset suggests that, to be safe, checkups could start as early as 19 years old, when the rate begins to pick up, or otherwise from 27 years old. At the latest, yearly check-ups should start by 31 years old.
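One way to look for the age at which the rate picks up is to compute the share of positive records per age. The sketch below assumes the rate is based on Dx:Cancer and an Age column, and the records are synthetic, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Synthetic records for illustration only; not taken from the dataset.
df = pd.DataFrame({
    "Age":       [18, 18, 19, 19, 19, 27, 27, 31, 31, 31],
    "Dx:Cancer": [0,  0,  0,  1,  0,  1,  0,  1,  1,  0],
})

# Count the records and the positives at each age, then take their ratio.
by_age = df.groupby("Age")["Dx:Cancer"].agg(records="count", positives="sum")
by_age["rate"] = by_age["positives"] / by_age["records"]
print(by_age)
```

Keeping the record count next to the rate matters with a small dataset, because a high rate built on two or three records is weak evidence.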
​
​
Data Cleaning Journey
This dataset was intended for machine learning training (Source: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29), so it is relatively small compared to my previous projects and practices. However, the topic is interesting, so why not give it a try and see what we can find?
​
​
It is regrettable but not too surprising that there are many null values in this dataset. The nature of the attributes aside, the dataset information also explains that patients may opt not to provide information for privacy reasons.
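A quick way to see how many values are missing per attribute is sketched below. It assumes missing entries are written as "?" in the CSV, and it reads a tiny inline sample with made-up values in place of the downloaded file.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the downloaded CSV; values are made up.
raw = io.StringIO(
    "Age,Smokes,STDs: Time since first diagnosis,Dx:Cancer\n"
    "18,0,?,0\n"
    "25,?,?,0\n"
    "34,1,2,1\n"
)

# Assumption: missing entries are written as "?" in the file.
df = pd.read_csv(raw, na_values="?")

# Number of missing values per column, worst first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)
```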
​
I speculate that, for input convenience, most of the attributes were stored as 'object'/'string'/'VARCHAR' type by default, despite appearing to be numbers on the surface. Later on, I discovered that many of them are, in fact, Boolean data.
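Converting the string columns to numbers and spotting the Boolean ones can be sketched as below. The rows are made up, and the "?" missing marker is an assumption; any text that cannot be parsed as a number becomes NaN.

```python
import pandas as pd

# Made-up rows; every column starts out as strings, as described above.
df = pd.DataFrame({
    "Age":    ["18", "25", "34", "41"],
    "Smokes": ["0", "1", "?", "0"],  # "?" as a missing marker is an assumption
    "Dx:HPV": ["0", "0", "1", "0"],
})

# Parse every column as numbers; unparseable entries become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Treat a column as Boolean when its non-null values are all 0 or 1.
bool_cols = [
    col for col in numeric.columns
    if numeric[col].dropna().isin([0, 1]).all()
]
print(numeric.dtypes)
print(bool_cols)
```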
​
For correlation fairness, I decided to keep only records with no missing values. Because the columns "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" had too few valid records, these two attributes were dropped entirely from the final dataset.
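The order of these two steps matters: dropping the sparse columns first keeps far more rows than filtering for complete records straight away. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up rows; the two "Time since" columns are mostly empty, as in the text.
df = pd.DataFrame({
    "Age": [18, 25, 34, 41],
    "Smokes": [0, np.nan, 1, 0],
    "STDs: Time since first diagnosis": [np.nan, np.nan, 2, np.nan],
    "STDs: Time since last diagnosis": [np.nan, np.nan, 1, np.nan],
    "Dx:Cancer": [0, 0, 1, 0],
})

sparse_cols = [
    "STDs: Time since first diagnosis",
    "STDs: Time since last diagnosis",
]

# Drop the sparse columns first, then keep only fully populated rows.
complete = df.drop(columns=sparse_cols).dropna()

# For comparison: filtering rows without dropping the sparse columns.
rows_if_columns_kept = len(df.dropna())
print(len(complete), rows_if_columns_kept)
```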