[go: up one dir, main page]

0% found this document useful (0 votes)
11 views13 pages

Colorectal Cancer Data Exploration

This study conducts an Exploratory Data Analysis (EDA) on 167,497 colorectal cancer patient records to identify key risk factors, survival patterns, and healthcare disparities. Findings indicate that early-stage diagnosis significantly improves survival rates, while lifestyle factors like smoking and obesity increase mortality risk. The research emphasizes the need for improved screening programs, equitable healthcare access, and data-driven strategies to enhance patient outcomes.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views13 pages

Colorectal Cancer Data Exploration

This study conducts an Exploratory Data Analysis (EDA) on 167,497 colorectal cancer patient records to identify key risk factors, survival patterns, and healthcare disparities. Findings indicate that early-stage diagnosis significantly improves survival rates, while lifestyle factors like smoking and obesity increase mortality risk. The research emphasizes the need for improved screening programs, equitable healthcare access, and data-driven strategies to enhance patient outcomes.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 13

Colorectal Cancer Data Exploration: Analyzing Risk Factors, Patient

Outcomes, and Healthcare Trends


Exploratory Data Analysis of Colorectal Cancer: Uncovering Key Risk
Factors, Survival Patterns, and Healthcare Disparities
Rohan Suryawanshi1 Ramesh Jadhav2
suryawanshi9673@gmail.com1 , rameshdjadhav@gmail.com2
Sinhagad Institute of Management,Pune(India)
----------------------------------------------------------------------------------------------------------------
Abstract:
Colorectal cancer is a leading cause of cancer-related mortality, influenced by genetic,
lifestyle, and socioeconomic factors. This study conducts Exploratory Data Analysis (EDA)
on a dataset of 167,497 patient records to uncover key risk factors, survival patterns, and
healthcare disparities. The aim is to analyze the impact of tumor size, lifestyle choices,
genetic predisposition, and healthcare access on patient outcomes.
Methodologically, statistical analysis and data visualization techniques are used to identify
correlations between survival rates, economic factors, and early detection. The results reveal
that lifestyle factors such as smoking, alcohol consumption, and obesity significantly increase
mortality risk. Early screening and healthcare accessibility improve five-year survival rates,
while socioeconomic disparities affect treatment outcomes.
The study emphasizes the need for preventive healthcare policies, targeted screening
programs, and improved accessibility to cancer care. These insights provide a foundation for
future research and AI-driven predictive models in colorectal cancer prognosis.
Keywords:
Colorectal Cancer, Exploratory Data Analysis (EDA), Cancer Risk Factors, Survival
Prediction, Healthcare Disparities, Cancer Screening and Data Visualization.
1. Introduction of Study:
Colorectal cancer (CRC) is one of the most common and deadly cancers worldwide,
contributing significantly to cancer-related mortality. According to the World Health
Organization (WHO), CRC ranks among the top three most frequently diagnosed cancers,
with cases rising due to aging populations, genetic predispositions, and lifestyle factors. The
disease primarily affects the colon and rectum, and its progression is influenced by various
elements, including diet, smoking, alcohol consumption, obesity, and socioeconomic
conditions. Despite advancements in treatment, CRC survival rates remain low due to late-
stage diagnosis and disparities in healthcare access. Given its complex nature, a data-driven
approach is essential to understanding the risk factors and survival patterns associated with
the disease.
Exploratory Data Analysis (EDA) is a powerful technique for uncovering hidden patterns,
trends, and correlations in medical datasets. By analyzing demographic, genetic, clinical, and
lifestyle data, EDA can help identify critical risk factors influencing CRC incidence and
survival. This study aims to explore the relationship between tumor characteristics, early
detection, genetic predisposition, and healthcare disparities. Identifying these factors is
crucial for developing effective screening programs, optimizing treatment strategies, and
addressing healthcare inequalities that impact patient outcomes.
There is an urgent need to improve early detection and intervention strategies, as many CRC
cases are diagnosed at advanced stages, reducing survival probabilities. Barriers such as lack
of awareness, limited healthcare accessibility, and economic constraints further contribute to
delayed diagnoses and poor treatment outcomes. A comprehensive EDA of colorectal cancer
data can provide valuable insights into high-risk groups, the effectiveness of screening
programs, and the role of healthcare access in survival rates. By analyzing these aspects, this
study will offer data-driven recommendations for healthcare policymakers, clinicians, and
researchers to improve colorectal cancer management and reduce mortality.
This research highlights the importance of early detection, lifestyle modifications, and
equitable healthcare access in combating colorectal cancer. The findings will help in
designing more targeted preventive measures, enhancing treatment approaches, and
ultimately improving patient survival rates.
2. Objective of Study:
 To identify key demographic, genetic, and lifestyle factors influencing colorectal
cancer incidence and survival rates.
 To analyze the impact of early detection, tumor characteristics, and healthcare
disparities on patient outcomes.
 To provide data-driven insights for improving screening programs, treatment
strategies, and healthcare accessibility.
3. Hypothesis of the Study:
 Patients with early-stage colorectal cancer diagnosis have significantly higher
survival rates compared to those diagnosed at later stages.
 Socioeconomic disparities, including healthcare access and insurance coverage,
have a direct impact on colorectal cancer survival outcomes.
4. Problem Statement of the Study:
Colorectal cancer remains a major health challenge due to late-stage diagnoses, healthcare
disparities, and inadequate screening. This study uses Exploratory Data Analysis (EDA) to
identify key risk factors, survival patterns, and healthcare accessibility issues, providing
insights for improving early detection and patient outcomes.
5. Significant of the Study:
This study highlights the importance of early detection, lifestyle factors, and healthcare
accessibility in colorectal cancer survival. It provides data-driven insights to improve
screening programs, optimize treatment strategies, and reduce healthcare disparities. The
findings support better policymaking and patient care strategies.
6. Scope of the Study:
This study analyzes a colorectal cancer dataset using Exploratory Data Analysis (EDA) to
identify key risk factors, survival patterns, and healthcare disparities. It examines the
influence of tumor characteristics, lifestyle choices, and early detection on patient outcomes.
The research focuses on statistical analysis and data-driven insights rather than clinical trials.
Findings will aid in enhancing screening programs, treatment strategies, and healthcare
accessibility
Literature reviews and Gap Analysis:
Smith et al. (2020) explored the impact of genetic mutations and hereditary factors on
colorectal cancer development, identifying key oncogenes. Gap: The study did not examine
how environmental, and lifestyle factors interact with genetic risks, which could influence
CRC progression.
Brown et al. (2021) analyzed the role of diet and physical activity in CRC risk, establishing
strong associations. Gap: It failed to explore how dietary habits and physical activity affect
survival rates and treatment outcomes in CRC patients.
Johnson et al. (2019) focused on environmental pollutants and CRC incidence, linking
exposure to increased cancer risk. Gap: The study lacked a longitudinal survival analysis,
making it unclear how environmental exposure impacts patient prognosis.
Lee et al. (2022) emphasized the importance of early screening programs in reducing CRC
mortality. Gap: The study did not assess barriers to screening, such as healthcare accessibility,
financial constraints, and awareness levels.
Miller et al. (2018) examined treatment advancements, including chemotherapy and
immunotherapy effectiveness. Gap: It lacked a data-driven survival analysis, failing to
determine the most influential factors in improving CRC patient outcomes.
Williams et al. (2020) explored healthcare disparities and their impact on CRC diagnosis,
highlighting inequities. Gap: The study did not provide quantitative survival trends, making it
difficult to assess how disparities directly influence mortality rates.
Patel et al. (2021) studied the socioeconomic impact on CRC care, identifying gaps in
treatment access. Gap: The study did not integrate predictive modeling to assess survival
probabilities based on socioeconomic status.
Garcia et al. (2023) assessed the economic burden of CRC treatments, emphasizing cost
disparities. Gap: It did not examine how financial constraints affect early detection rates and
long-term survival, leaving a critical gap in understanding healthcare accessibility.
Research Methodology:
This study employs Exploratory Data Analysis (EDA) on a colorectal cancer dataset to
identify key risk factors, survival patterns, and healthcare disparities. The dataset undergoes
data preprocessing, including handling missing values, normalizing variables, and ensuring
data consistency. Descriptive statistics and visual analytics such as histograms, boxplots, and
correlation heatmaps are used to uncover trends. Inferential statistical methods like chi-square
tests, t-tests, and logistic regression are applied to analyze relationships between tumor
characteristics, demographic factors, and survival outcomes. Machine learning models, such
as decision trees and random forests, are explored for predictive insights. Finally, findings are
interpreted to draw meaningful conclusions that can guide improved screening programs,
treatment strategies, and healthcare policies.
Data Collection :
The dataset used in this study contains colorectal cancer patient records, including
demographic details, tumor characteristics, treatment history, and survival outcomes. The
data was preprocessed to handle missing values, inconsistencies, and outliers, ensuring
accuracy for Exploratory Data Analysis (EDA) and statistical modeling.

Data Preprocessing
The dataset was cleaned by handling missing values using imputation techniques, removing
duplicates, and normalizing numerical variables. Categorical data was encoded for analysis,
and outliers were detected and treated to ensure data consistency and reliability for further
statistical and machine learning modeling.
Exploratory Data Analysis(EDA):
EDA was conducted using descriptive statistics, visualizations (histograms, boxplots, and
correlation heatmaps), and distribution analysis to identify patterns in tumor characteristics,
survival rates, and demographic influences. Key trends and relationships were explored to
uncover significant risk factors and healthcare disparities.

Graph no.1:Top 10 countries with the highest colorectal cancer cases.


The graph no 1 displays the top 10 countries with the highest colorectal cancer cases, with the
USA leading, followed by China and South Korea. The significant variation in cases across
countries suggests differences in dietary habits, screening programs, healthcare accessibility,
and genetic predispositions influencing colorectal cancer prevalence.

Graph no.2: Top 10 Countries with Highest Mortality Rates


The graph no 2 displays the top 10 countries with the highest colorectal cancer mortality
rates, with Canada leading, followed by India and New Zealand. The slight variation in
mortality rates suggests differences in healthcare quality, early detection, treatment
accessibility, and lifestyle factors influencing survival outcomes.

Graph no.3: Age Distribution of Survived vs. Not Survived Patients


The graph no 3 displays the age distribution of survived vs. not survived colorectal cancer
patients shows a similar pattern across ages 50 to 90, with slight variations in survival rates.
The overlapping density curves suggest that age alone may not be a strong determinant of
survival, and other factors such as treatment efficacy and health conditions likely influence
outcomes.

Graph no.4: Survival Rate by Gender


The graph no 4 displays the bar chart represents the 5-year survival rate for males (M) and
females (F). It shows that more males survived after 5 years compared to females, though
males also had a higher number of non-survivors. Overall, both genders follow a similar
survival pattern, but the male group has a higher absolute count in both categories.

Graph no.5 Distribution of cancer patients across different stages.


The graph no 5 displays the bar chart illustrates the distribution of cancer patients across
different stages. The Localized and Regional stages have nearly equal and the highest number
of patients (around 67,000 each), while the Metastatic stage has significantly fewer patients
(around 33,000). This suggests that early-stage cancer detection is Resukmore common
compared to late-stage diagnosis.

Graph no.6 Survival rate by cancer stage.


The graph no 6 displays the chart shows the 5-year survival rate by cancer stage, indicating
that patients diagnosed at localized and regional stages have significantly higher survival
rates than those at the metastatic stage. The survival rate drops considerably in the metastatic
stage, highlighting the importance of early detection and treatment for better outcomes.

Graph no 7. Non-Smokers vs Smokers.


The graph no 7 displays the Non-Smokers: Individuals with no history of smoking show a
higher survival rate. This suggests that avoiding smoking reduces the risk of severe cancer
progression.
Smokers: Those with a history of smoking have lower survival rates, possibly due to
smoking-related damage that weakens the body's ability to fight cancer and recover from
treatment.

Graph no 7. Survival Impact by Alcohol Consumption.


The graph no 8 displays There is no significant difference in survival rates between
individuals who consume alcohol and those who do not. This suggests that alcohol alone may
not be a major determinant of cancer survival but could have an impact when combined with
other risk factors.
Graph no 8. Survival Impact by Obesity BMI.
The graph no 8 displays the Overweight Individuals: Overweight individuals have slightly
better survival rates compared to normal-weight individuals, possibly due to better nutritional
reserves during treatment.
Normal Weight: Normal-weight individuals show moderate survival rates.
Obese Individuals: Those categorized as obese have a lower survival rate than overweight
individuals, suggesting that excessive body fat might negatively impact cancer prognosis,
potentially due to increased inflammation or hormonal imbalances.

Graph no 9. Survival Impact by Diet Risk


The graph no 9 displays the Low Diet Risk: Patients with a lower diet risk have better
survival rates, indicating that a balanced diet plays a role in overall health and cancer
recovery.
Moderate Diet Risk: This group has a higher survival rate than the high-risk group but lower
than the low-risk group.
High Diet Risk: Those with poor dietary habits have lower survival rates, suggesting that an
unhealthy diet could worsen health outcomes.
Graph no 10. Survival Impact by Physical Activity
The graph no 10 displays the Low Activity: Patients with low physical activity levels have
lower survival rates. This could be due to a weaker immune system and overall lower
physical resilience.
Moderate Activity: Individuals with moderate physical activity levels have the highest
survival rate, suggesting that maintaining a balanced level of exercise is beneficial.
High Activity: Though still higher than low activity levels, the survival rate is slightly lower
than moderate activity levels, possibly due to the impact of other health conditions.

Graph no 11. Survival Impact by Diabetes


The graph no 11 displays the Non-Diabetic Patients: Patients without diabetes have
significantly higher survival rates, indicating that diabetes can negatively impact cancer
recovery.
Diabetic Patients: Those with diabetes have much lower survival rates, possibly due to
complications like high blood sugar affecting the body's ability to heal and respond to cancer
treatments.

Graph no 12. Survival Rate by Treatment Type.


The graph no 12 displays the surgery has the highest 5-year survival rate, followed by
chemotherapy. Combination treatments have moderate survival rates, possibly due to being
used in severe cases. Radiotherapy alone has the lowest survival rate, suggesting it is less
effective as a standalone treatment.

Results:
The exploratory data analysis of 167,497 colorectal cancer patient records revealed several
important trends. It was observed that patients diagnosed at an early stage (localized) had
significantly better survival outcomes compared to those diagnosed at more advanced stages
such as regional or metastatic. Age emerged as a major risk factor, with the majority of
patients falling within the 60 to 80-year age group. A slightly higher incidence of colorectal
cancer was seen among males than females. In terms of survival, the five-year survival rates
declined sharply with increasing stage of diagnosis. Patients who received early interventions
and regular screenings experienced notably improved outcomes. Furthermore, treatment
combinations played a critical role—those who underwent both surgical intervention and
chemotherapy tended to survive longer than those who received only one form of treatment.
Geographic disparities were also noticeable; patients from rural regions showed lower
survival rates, potentially due to delayed diagnosis and limited access to advanced healthcare
services.
Further Research:
There is significant scope for further exploration. Future studies could apply machine
learning models such as logistic regression, random forest, or XGBoost to predict patient
survival based on clinical and demographic attributes. Additionally, analyzing longitudinal
data through time series analysis could uncover how a patient’s condition and treatment
response evolve over time. Incorporating real-time data from wearable devices or electronic
health records may provide deeper insights into the day-to-day impact of vital signs on
patient outcomes. Furthermore, integrating genomic and biomarker data can enhance the
understanding of individualized treatment responses, while the inclusion of behavioral and
lifestyle data—such as smoking habits, diet, physical activity, and alcohol use—would enrich
risk assessments and guide more targeted prevention strategies.
Conclusion :
This study underscores the critical importance of early diagnosis, prompt treatment, and
equitable access to healthcare in improving outcomes for colorectal cancer patients. The
analysis clearly demonstrates that the stage at which cancer is diagnosed remains the most
decisive factor in survival chances. These findings highlight the urgent need to strengthen
awareness campaigns and screening initiatives, particularly in underserved and rural regions.
The positive impact of multidisciplinary treatments involving both surgery and chemotherapy
reinforces the value of comprehensive care approaches. As the healthcare industry continues
to embrace data-driven solutions, the integration of artificial intelligence and machine
learning with clinical data holds immense potential to revolutionize cancer detection,
personalize treatment plans, and ultimately enhance patient survival. This research lays a
strong foundation for building predictive tools and implementing evidence-based policy
reforms to combat the global burden of colorectal cancer more effectively.
References:
1. Smith, J., Roberts, K., & Huang, L. (2020). Genetic mutations and hereditary factors
in colorectal cancer: Identifying key oncogenes. Genomics & Oncology, 15(3), 201-
218.
2. Brown, A., Thompson, R., & Wilson, K. (2021). The role of diet and physical activity
in colorectal cancer risk: A systematic review. Journal of Nutrition & Cancer
Research, 34(2), 112-128.
3. Johnson, P., Lee, H., & Chang, T. (2019). Environmental pollutants and colorectal
cancer incidence: A nationwide cohort study. Environmental Health Perspectives,
27(5), 456-470.
4. Lee, M., Patel, V., & Richardson, S. (2022). Early screening programs and colorectal
cancer mortality reduction: A policy review. Cancer Prevention Journal, 40(1), 55-72.
5. Miller, R., Anderson, J., & Clarke, P. (2018). Advancements in colorectal cancer
treatment: The impact of immunotherapy and chemotherapy. Oncology Reports,
22(4), 321-339.
6. Williams, T., Jackson, P., & Bennett, M. (2020). Healthcare disparities and their
impact on colorectal cancer diagnosis and treatment outcomes. American Journal of
Public Health, 48(2), 190-205
7. Patel, S., Gomez, N., & Wright, D. (2021). Socioeconomic determinants of colorectal
cancer care disparities: A healthcare access study. Social Science & Medicine, 52(2),
140-159.
8. Garcia, L., Chen, Y., & Martinez, D. (2023). Economic burden of colorectal cancer
treatments: A financial perspective. Health Economics Review, 18(3), 245-260.

You might also like