Report
PREDICTION OF BREAST CANCER USING HISTOPATHOLOGY IMAGES
A Project Report submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Team - P601
K V L BHAVYA (HU21CSEN0101141)
CH HARIKA (HU21CSEN0101163)
K PRANATHI (HU21CSEN0100954)
DECLARATION
I hereby declare that the project report entitled “Prediction of Breast Cancer
Using Histopathology Images” is an original work done in the Department of
Computer Science and Engineering, GITAM School of Technology, GITAM
(Deemed to be University) submitted in partial fulfillment of the requirements
for the award of the degree of B.Tech. in Computer Science and Engineering.
The work has not been submitted to any other college or University for the
award of any degree or diploma.
Date:
CERTIFICATE
This is to certify that the project report entitled “Prediction of Breast Cancer
Using Histopathology Images” is a bonafide record of work carried out by
K V L BHAVYA (HU21CSEN0101141), CH HARIKA (HU21CSEN0101163),
K PRANATHI (HU21CSEN0100954), K SURYA TEJA (HU21CSEN101927)
students, submitted in partial fulfillment of the requirements for the award of the degree
of Bachelor of Technology in Computer Science and Engineering.
Date :
Our project would not have been successful without the help of several people. We
would like to thank everyone who contributed to it in numerous ways and who gave us
outstanding support from its inception.
We are very much obliged to Prof. S Mahaboob Basha, Head of the Department
of Computer Science & Engineering, for providing the opportunity to undertake this project
and for the encouragement towards its completion.
We wish to express our deep sense of gratitude to Dr. A B Pradeep Kumar, Project
Coordinator, Department of Computer Science and Engineering, School of Technology, and to
our guide, Dr. Figlu Mohanty, Assistant Professor, Department of Computer Science and
Engineering, School of Technology, for their esteemed guidance, moral support, and invaluable
advice towards the success of this project report.
We are also thankful to all the staff members of the Computer Science and Engineering
department who cooperated in making our project a success. We would like to thank our parents
and friends, who extended their help, encouragement, and moral support, directly or indirectly,
to our project work.
Sincerely,
HU21CSEN0101141 K V L BHAVYA
HU21CSEN0101163 CH HARIKA
HU21CSEN0100954 K PRANATHI
HU21CSEN0101927 K SURYA TEJA
ABSTRACT
Breast cancer is one of the most prevalent and life-threatening diseases affecting women
worldwide. Accurate and efficient classification of histopathology images is crucial for early
diagnosis and effective treatment planning. This project presents a robust breast cancer
classification framework integrating deep learning-based feature extraction with traditional
machine learning classifiers. The study utilizes the BreaKHis dataset, which contains
microscopic biopsy images categorized as benign or malignant.
In the proposed approach, feature extraction is performed using Convolutional Neural
Networks (CNN) and VGG16, two powerful deep learning architectures. To enhance
computational efficiency and reduce redundancy, Principal Component Analysis (PCA) is
applied for dimensionality reduction while retaining 95% of the variance. Additionally, JAYA
optimization is employed to fine-tune feature selection and classifier hyperparameters, further
improving model performance. The extracted and optimized features are then classified using
machine learning algorithms such as Random Forest (RF), K-Nearest Neighbors (KNN), and
Extreme Gradient Boosting (XGBoost).
The performance of different model combinations is evaluated based on metrics such as
accuracy, precision, recall, and F1-score. Experimental results demonstrate that integrating
deep learning-based feature extraction with traditional classifiers yields improved
classification accuracy, with JAYA optimization further refining the results. This study
highlights the effectiveness of hybrid deep learning and machine learning approaches in
breast cancer diagnosis and contributes to the development of efficient, automated diagnostic
tools for histopathology image analysis.
TABLE OF CONTENTS
1. Introduction
2. Literature Review
6. Technologies Used
10. References
LIST OF FIGURES
Figure 4.3: Proposed System
Figure 7.1: Data processing
Figure 7.2: CNN architecture
Figure 7.3: VGG16 architecture
Figure 7.4: Feature Reduction
Figure 7.5: JAYA algorithm for optimization
Figure 7.6: Random Forest Classifier
Figure 7.7: Extreme Gradient Boost Classifier
Figure 7.8: K Nearest Neighbour Classifier
Figure 7.9: Benign image prediction
Figure 7.10: Malignant image prediction
Figure 8.1: CNN Performance
Figure 8.2: VGG16 Performance
Figure 8.3: Accuracies before and after JAYA
LIST OF TABLES
CHAPTER 1: INTRODUCTION
Breast cancer is one of the most prevalent and life-threatening diseases among women worldwide.
Early and accurate detection of breast cancer plays a crucial role in improving survival rates and
facilitating timely treatment. Traditional diagnostic methods, such as histopathological examination
by expert pathologists, are time-consuming, subjective, and prone to human error. With
advancements in artificial intelligence (AI) and deep learning, automated breast cancer
classification systems have emerged as a powerful tool to assist medical professionals in diagnosing
cancer with high accuracy and efficiency.
Several classification systems have been proposed for breast cancer diagnosis, leveraging machine
learning (ML) and deep learning (DL) techniques. Traditional ML models rely on handcrafted
feature extraction methods, such as texture analysis and morphological characteristics, which often
fail to capture complex patterns in histopathology images. Deep learning-based approaches,
particularly Convolutional Neural Networks (CNNs), have demonstrated superior performance in
feature extraction and classification. However, CNNs require extensive computational resources
and large amounts of labeled data for effective training. Furthermore, high-dimensional feature
representations generated by CNNs can introduce redundancy, leading to inefficiencies in the
classification process.
To address these challenges, this study proposes a robust breast cancer classification framework
that integrates deep learning-based feature extraction with traditional machine learning classifiers.
The framework leverages CNN and VGG16 architectures to extract meaningful features from
histopathology images. To mitigate high-dimensionality issues, Principal Component Analysis
(PCA) is applied for feature reduction while retaining the most significant information.
Additionally, classification is performed using Random Forest (RF), K-Nearest Neighbors (KNN),
and XGBoost (XGB) classifiers, which are further optimized using the JAYA optimization
algorithm to enhance classification accuracy.
The justification for the title, "Prediction of Breast Cancer Using Histopathology Images", stems
from the study’s comprehensive approach. The proposed framework combines the strengths of deep
learning for feature extraction and traditional machine learning for classification, ensuring both
accuracy and interpretability. By incorporating PCA for dimensionality reduction and JAYA
optimization for feature selection and hyperparameter tuning, the system achieves a balance
between computational efficiency and classification performance. The study systematically
evaluates different model configurations to identify the most effective approach for breast cancer
classification. This research aims to contribute to the development of an automated, efficient, and
highly accurate breast cancer diagnostic system, assisting pathologists in making more reliable and
consistent diagnoses.
CHAPTER 2: LITERATURE REVIEW
Breast cancer prediction using histopathology images has gained significant attention in recent
years due to advancements in deep learning and machine learning techniques. Researchers have
explored various approaches, including pre-trained convolutional neural networks (CNNs),
traditional classifiers, and optimization algorithms, to improve the accuracy and efficiency of breast
cancer diagnosis. This section reviews recent studies that have contributed to the field, focusing on
methodologies, datasets, results, and limitations.
Recent studies have explored the integration of pre-trained Convolutional Neural Networks (CNNs)
with traditional classifiers for breast cancer classification. Li et al. (2021) [5] proposed a deep
feature extraction method that leveraged different CNN levels to classify benign and malignant
breast cancer cases. Using AlexNet for feature extraction and classifiers such as Support Vector
Machines (SVM), Logistic Regression, and Random Forest, they achieved an accuracy of up to
88.69% on the BreaKHis dataset. The study demonstrated that intermediate and high-level CNN
features, when combined with traditional classifiers, enhanced classification performance.
However, it was limited to binary classification and did not explore subtype differentiation.
Similarly, Gupta and Chawla (2020) [3] employed pre-trained models, including VGG16, VGG19,
Xception, and ResNet50, for feature extraction and classified them using SVM and Logistic
Regression. ResNet50 combined with Logistic Regression achieved the highest accuracy of
93.27%. Despite its success, the study was restricted to binary classification and lacked data
augmentation techniques.
Transfer learning has been widely utilized to enhance breast cancer detection by leveraging
pre-trained models. Rana and Bhushan (2023) [7] evaluated seven transfer learning models,
including LENET, VGG16, DarkNet53, and Xception, for histopathological image classification.
Xception achieved the highest accuracy (83.07%), while DarkNet53 provided the best-balanced
accuracy (87.17%). Additionally, comparison with YOLOv3 demonstrated a high accuracy of
96.50%, highlighting the potential of object detection models in medical imaging.
Bayramoglu et al. (2016) [2] developed a magnification-independent classification model using CNNs.
The study introduced single-task and multi-task CNNs to classify breast cancer irrespective of
magnification levels. The multi-task CNN model attained an accuracy of 82.13% in malignancy
classification and 80.10% in magnification classification. Despite its effectiveness, the study was
constrained by dataset limitations and the need for fine-tuning.
Several studies have focused on optimizing deep learning architectures for improved classification
performance. Spanhol et al. (2016) [8] implemented CNN-based classification using the Caffe
framework, incorporating patch extraction and fusion techniques. The study demonstrated that
CNNs outperformed traditional texture-based models by 6%, with sum-rule fusion achieving the
best results. However, computational cost remained a challenge.
Su et al. (2023) [10] introduced BCR-Net, a deep learning framework that predicts breast cancer
recurrence using Multiple Instance Learning (MIL) and attention-based pooling. The model
achieved an AUC of 0.775 for H&E-stained whole slide images (WSIs) and 0.811 for Ki67-stained
WSIs, emphasizing the effectiveness of MIL and interpretability improvements. However, the
model's clinical applicability remains untested, and further validation on diverse datasets is
required.
Recent advancements have also focused on predicting gene expression and treatment response
using deep learning. Mondol et al. (2023) [6] proposed hist2RNA, a deep learning model that
predicts gene expression from histopathology images. Using EfficientNet, RegNet, and DenseNet
for feature extraction, hist2RNA achieved a Spearman correlation of 0.82 for gene expression
prediction and an AUROC of up to 0.89 for subtype classification. The study demonstrated high
computational efficiency but required further validation on diverse patient cohorts.
Hoang et al. (2024) [4] introduced ENLIGHT-DeepPT, an AI framework that imputes
transcriptomics (mRNA expression) from histopathology images to predict cancer treatment
response. Evaluated on TCGA datasets, the framework achieved an odds ratio of 2.28,
outperforming existing models. However, the study lacked computational efficiency analysis and
feature importance evaluation.
Breast cancer recurrence prediction has been a key area of research, with deep learning models
showing promise. Su et al. (2023) [10] demonstrated that BCR-Net effectively classifies high-risk
patients, with Ki67-stained WSIs achieving a classification accuracy of 79.2%. The study
emphasized the importance of intelligent patch sampling and MIL for recurrence prediction.
However, clinical validation and integration with additional biomarkers remain necessary.
While deep learning has significantly improved breast cancer classification, several challenges
remain. Most studies rely on the BreaKHis dataset, limiting generalization to diverse clinical
settings. Furthermore, computational costs and dataset imbalance continue to pose challenges.
Future research should explore multi-magnification fusion, diverse datasets, and advanced
optimization techniques to enhance model robustness.
The reviewed studies highlight the effectiveness of CNNs, transfer learning, and optimization
techniques in breast cancer classification. Pre-trained models, when combined with traditional
classifiers, improve classification performance. Transfer learning models like Xception and
DarkNet53 demonstrate high accuracy, while architectures such as BCR-Net and
ENLIGHT-DeepPT show promise in recurrence prediction and treatment response analysis.
However, dataset limitations, computational efficiency, and clinical applicability require further
exploration to ensure robust and interpretable AI-driven solutions for breast cancer detection and
prognosis.
CHAPTER 3: PROBLEM IDENTIFICATION AND OBJECTIVES
Breast cancer is one of the leading causes of mortality among women worldwide, and early and
accurate diagnosis is crucial for effective treatment. Histopathology image analysis is a widely used
diagnostic method, but manual examination by pathologists is time-consuming, subjective, and
prone to human error. Traditional machine learning techniques for classification rely on handcrafted
feature extraction, which often fails to capture complex patterns in histopathology images. Deep
learning-based methods, while highly effective, generate high-dimensional feature representations
that can lead to redundancy and increased computational complexity. Additionally, selecting the
most relevant features and optimizing classifier performance remains a challenge. Therefore, a
robust and efficient classification framework is needed to improve the accuracy and reliability of
breast cancer diagnosis.
3.2 OBJECTIVES
CHAPTER 4: EXISTING SYSTEM AND PROPOSED SYSTEM
Traditional breast cancer classification methods rely on handcrafted feature extraction techniques
such as texture analysis (GLCM, LBP), shape descriptors, and statistical measures, followed by
classification using machine learning algorithms like Support Vector Machines (SVM) and Random
Forest (RF). While these methods have been effective to some extent, they suffer from several
limitations:
The proposed system overcomes these limitations by integrating deep learning-based feature
extraction with traditional machine learning classifiers, combined with dimensionality reduction
and optimization techniques. The key improvements are:
● Dimensionality Reduction Using PCA: PCA is applied to eliminate redundant and less
informative features while retaining 95% of variance, reducing computational complexity
and mitigating overfitting.
● Hybrid Classification Approach: Deep features are classified using machine learning
classifiers (RF, KNN, XGB), leveraging the strengths of both deep learning and traditional
classification techniques.
● JAYA Optimization for Feature Selection and Hyperparameter Tuning: The JAYA
optimization algorithm is integrated to refine feature selection and optimize classifier
parameters, further enhancing accuracy.
● Comprehensive Performance Evaluation: The proposed system is evaluated using multiple
classification metrics, including accuracy, precision, recall, and F1-score, ensuring a robust
and reliable classification model.
● Increased Generalization Capability: Data augmentation techniques (rotation, flipping,
rescaling, contrast normalization) are applied to improve model robustness against
variations in histopathology images.
CHAPTER 5: SYSTEM ARCHITECTURE
This study presents a robust breast cancer classification framework by integrating deep
learning-based feature extraction with traditional machine learning classifiers. The methodology
comprises multiple stages:
The study employs the BreaKHis dataset, which consists of histopathology images categorized as
benign or malignant tumors. The dataset is divided into an 80% training set and a 20% testing set,
ensuring both classes are well represented.
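The report does not state how the 80/20 split was drawn. One way to keep both classes represented in each partition is a stratified split, sketched below with NumPy. The function name `stratified_split`, the seed, and the toy class sizes are illustrative assumptions, not values from this study.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class is split at the same ratio."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # random order within the class
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))

# Toy labels: 0 = benign, 1 = malignant (class sizes are illustrative only).
labels = np.array([0] * 30 + [1] * 70)
train_idx, test_idx = stratified_split(labels)
```

With these toy labels the test partition keeps the original 30:70 class ratio, which is the property the split is meant to guarantee.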
To enhance model generalization and prevent overfitting, the following data augmentation
techniques are applied to training images:
These augmentations introduce variability into the dataset, allowing the model to learn robust
feature representations, which improves classification performance.
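As a rough illustration of the augmentation idea, the sketch below applies rescaling, a random horizontal flip, and a random rotation to one image using NumPy only. It is a simplification: the annexure code uses Keras's `ImageDataGenerator` with arbitrary-angle rotations, shifts, shear, and zoom, whereas this sketch restricts rotation to multiples of 90 degrees so that no interpolation is needed. The function name `augment` and the random test image are assumptions made for the example.

```python
import numpy as np

def augment(image, rng):
    """Rescale to [0, 1], then apply a random flip and a random 90-degree rotation."""
    out = image.astype(np.float32) / 255.0       # rescaling
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                    # horizontal flip
    k = int(rng.integers(0, 4))                  # 0, 90, 180, or 270 degrees
    return np.rot90(out, k=k, axes=(0, 1))       # rotate in the image plane

rng = np.random.default_rng(0)
# Stand-in for a 224x224 RGB histopathology image.
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = augment(image, rng)
```

Flips and 90-degree rotations only rearrange pixels, so the tissue content is preserved while its orientation varies between epochs.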
Feature extraction is performed using two different deep learning architectures: a custom CNN
model and VGG16, both of which extract high-dimensional feature vectors from histopathology
images.
5.2.1 Custom CNN for Feature Extraction
A Convolutional Neural Network (CNN) is designed to automatically learn spatial and hierarchical
features from the images. The CNN consists of:
● Multiple convolutional layers with 3×3 kernels to extract spatial patterns.
● Batch Normalization to stabilize learning.
● ReLU activation functions for non-linearity.
● Max-Pooling layers to reduce spatial dimensions.
● Fully Connected Layers, where the penultimate layer's activations are extracted as feature
vectors.
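To make the layer list concrete, the following NumPy sketch runs a single convolution, ReLU, and max-pooling stage on a toy single-channel image and flattens the result into a feature vector. It is a didactic forward pass only: batch normalization, multiple channels, and learned weights are omitted, and the hand-written vertical-edge kernel is an assumption chosen so the output is easy to check by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation of one channel with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 window."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy 8x8 image whose intensity grows by 1 per column and by 8 per row.
image = np.arange(64, dtype=float).reshape(8, 8)
# 3x3 vertical-edge filter (right column minus left column).
kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

feature_map = max_pool_2x2(relu(conv2d_valid(image, kernel)))  # 8x8 -> 6x6 -> 3x3
features = feature_map.ravel()                                  # flattened feature vector
```

Because the toy image has a constant horizontal gradient of 1, every filter response equals 3 x 2 = 6, so the flattened vector holds nine identical values.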
Since deep learning models generate high-dimensional feature vectors, Principal Component
Analysis (PCA) is applied to reduce dimensionality while preserving the most important
information.
The PCA process includes:
● Standardizing the feature vectors to zero mean and unit variance.
● Computing the covariance matrix to identify relationships between features.
● Performing Eigen decomposition to obtain principal components.
● Selecting the top k principal components that retain at least 95% of the variance.
This dimensionality reduction helps:
● Reduce computational complexity.
● Mitigate the risk of overfitting.
● Improve the performance of traditional classifiers.
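The four PCA steps listed above can be sketched directly in NumPy. This is a minimal illustration, not the project's implementation (Chapter 6 lists Scikit-Learn for PCA); the function name `pca_fit_transform` and the synthetic feature matrix are assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(features, variance_to_keep=0.95):
    """Project features onto the fewest components reaching the variance threshold."""
    # 1. Standardize each feature to zero mean and unit variance.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                       # guard against constant columns
    z = (features - mean) / std
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(z, rowvar=False)
    # 3. Eigen decomposition (eigh suits symmetric matrices), sorted descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    # 4. Smallest k whose cumulative explained variance reaches the threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, variance_to_keep)) + 1, len(eigvals))
    return z @ eigvecs[:, :k], k, float(ratio[k - 1])

# Synthetic stand-in for deep features: 20 columns driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
features = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

reduced, k, explained = pca_fit_transform(features)
```

On this synthetic matrix at most three components are needed, which shows how strongly redundant features collapse under the 95% criterion.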
5.4 JAYA OPTIMIZATION
The extracted feature vectors, after PCA, are classified using Random Forest (RF), K-Nearest
Neighbors (KNN), and XGBoost (XGB) classifiers. Random Forest operates by constructing
multiple decision trees and aggregating their predictions, which improves classification robustness
and reduces overfitting. K-Nearest Neighbors classifies new samples based on their proximity to
existing labeled data points, making it a simple yet effective non-parametric classification method.
XGBoost, an optimized gradient boosting algorithm, enhances classification accuracy by iteratively
improving weak learners while handling complex decision boundaries efficiently.
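Of the three classifiers, KNN is simple enough to sketch in a few lines. The NumPy example below classifies query points by majority vote among their k nearest training points under Euclidean distance. The function name `knn_predict`, the choice k=3, and the two-dimensional toy features are assumptions for illustration; the study itself applies library classifiers to PCA-reduced deep features.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Majority vote among the k nearest training points (binary labels 0/1)."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diff = query_x[:, None, :] - train_x[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest points
    votes = train_y[nearest]                    # their labels
    # Predict 1 when more than half of the neighbours are labelled 1.
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy feature vectors: class 0 clusters near (0, 0), class 1 near (1, 1).
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array([0, 0, 0, 1, 1, 1])
query_x = np.array([[0.05, 0.05], [1.0, 0.95]])

pred = knn_predict(train_x, train_y, query_x, k=3)
```

Using an odd k avoids tied votes in the binary benign/malignant setting.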
The CNN and VGG16 feature extraction approaches are evaluated using different classifier
combinations, including CNN + RF, CNN + KNN, CNN + XGB, VGG16 + RF, VGG16 + KNN,
and VGG16 + XGB, both with and without JAYA optimization. This comparative analysis helps
identify the most effective combination of feature extractor and classifier for breast cancer
histopathology image classification.
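For readers unfamiliar with JAYA, the sketch below shows its standard update rule: each candidate moves towards the current best solution and away from the current worst, and the move is kept only if it improves the objective. This is a generic minimizer on a toy function, not the project's tuning code. In a hyperparameter setting the objective would typically be something like one minus validation accuracy, which is an assumption here, as are the names `jaya_minimize` and `sphere`, the population size, and the bounds.

```python
import numpy as np

def jaya_minimize(objective, lower, upper, pop_size=20, iterations=100, seed=0):
    """Minimize objective(x) inside box bounds with the JAYA update rule."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    pop = rng.uniform(lower, upper, size=(pop_size, lower.size))
    cost = np.array([objective(x) for x in pop])
    for _ in range(iterations):
        best, worst = pop[cost.argmin()], pop[cost.argmax()]
        r1 = rng.random(pop.shape)
        r2 = rng.random(pop.shape)
        # Move towards the best solution and away from the worst one.
        cand = pop + r1 * (best - np.abs(pop)) - r2 * (worst - np.abs(pop))
        cand = np.clip(cand, lower, upper)            # stay inside the bounds
        cand_cost = np.array([objective(x) for x in cand])
        improved = cand_cost < cost                   # greedy acceptance
        pop[improved] = cand[improved]
        cost[improved] = cand_cost[improved]
    return pop[cost.argmin()], float(cost.min())

def sphere(x):
    """Toy objective with its minimum at (1, 1)."""
    return float(np.sum((x - 1.0) ** 2))

best_x, best_cost = jaya_minimize(sphere, lower=[-5, -5], upper=[5, 5])
```

JAYA has no algorithm-specific control parameters beyond population size and iteration count, which is one reason it is convenient for tuning classifiers such as RF, KNN, and XGB.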
5.6.1. Accuracy
5.6.2. Precision (Positive Predictive Value)
Indicates how many of the predicted malignant cases are actually malignant.
5.6.4. F1-Score
A harmonic mean of precision and recall, balancing the trade-off between false positives and false
negatives.
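The four metrics can be computed from the confusion counts as in the plain-Python sketch below, using the standard definitions and treating malignant (label 1) as the positive class. The function name `classification_metrics` and the ten toy predictions are illustrative assumptions and are unrelated to the results reported in Chapter 8.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so all four metrics equal 0.8.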
CHAPTER 6: TOOLS/TECHNOLOGIES USED
The implementation of the breast cancer classification framework involves multiple tools and
technologies to ensure efficient data processing, feature extraction, classification, and evaluation.
The key tools and their significance are as follows:
Python is used due to its extensive support for machine learning, deep learning, and scientific
computing. Its libraries simplify data handling, model training, and evaluation.
● TensorFlow: Provides an efficient platform for building and training deep learning models,
particularly CNN and VGG16 for feature extraction.
● Keras: A high-level API of TensorFlow that simplifies model implementation and
experimentation.
● Scikit-Learn: Used for preprocessing, feature extraction, PCA, and implementing machine
learning classifiers like RF and KNN.
● XGBoost: An optimized gradient boosting library for high-performance classification.
6.6. DATA HANDLING & VISUALIZATION: NumPy, Pandas, Matplotlib, and Seaborn
● NumPy & Pandas: Used for efficient handling and manipulation of dataset features.
● Matplotlib & Seaborn: Used for visualizing dataset distributions, model performance, and
comparative analysis.
CHAPTER 7: IMPLEMENTATION AND TESTING
Figure 7.4: Feature Reduction
Figure 7.7: Extreme Gradient Boost Classifier
CHAPTER 8: RESULTS AND DISCUSSION
The CNN model was used for feature extraction, followed by classification using RF, KNN, and
XGB classifiers. Table 8.1 summarizes the performance of these models in terms of accuracy,
precision, recall, and F1-score.
● CNN+XGB achieved the highest accuracy (87.18%), outperforming the other classifiers.
● CNN+RF (86.37%) and CNN+KNN (85.85%) showed competitive performance, with RF
slightly outperforming KNN.
● Precision and recall values varied for benign and malignant classes, with XGB
demonstrating better balance across both classes.
● CNN-extracted features contributed to robust classification performance, proving the
effectiveness of deep learning-based feature extraction.
Feature extraction using the pre-trained VGG16 model was followed by classification with RF,
KNN, and XGB classifiers. The results are summarized in Table 8.2.
Key Observations:
Figure 8.2: VGG16 Performance
To understand the overall model effectiveness, we compared all CNN- and VGG16-based models.
Figure 8.1 illustrates the accuracy comparison.
KEY FINDINGS:
The JAYA optimization algorithm was applied to refine feature selection and hyperparameters,
leading to improved classification performance. Table 8.3 compares the accuracy before and after
applying JAYA.
Table 8.3: Accuracies before and after JAYA
8.5 CHALLENGES AND LIMITATIONS
● CNN-based models perform better than VGG16-based models for this dataset.
● XGB was the most effective classifier, achieving the highest accuracy in both CNN and
VGG16 models.
● JAYA optimization significantly improved model performance, especially for VGG16-based
approaches.
● Deep learning-based feature extraction combined with traditional classifiers provides a
robust solution for breast cancer classification.
CHAPTER 9: CONCLUSION AND FUTURE SCOPE
9.1 CONCLUSION
This project successfully implemented and evaluated deep learning and machine learning
techniques for the classification of breast cancer histopathological images using the BreaKHis
dataset. A hybrid approach combining CNN/VGG16-based feature extraction with machine
learning classifiers such as Random Forest (RF), K-Nearest Neighbors (KNN), and XGBoost
(XGB) was explored. The study also incorporated Principal Component Analysis (PCA) for
dimensionality reduction and the JAYA optimization algorithm to enhance classification
performance.
The results demonstrated that integrating deep learning-based feature extraction with traditional
classifiers improves classification accuracy. Among all models, CNN+XGB with JAYA
optimization achieved the highest test accuracy of 87.18%, highlighting the importance of feature
selection and hyperparameter tuning in improving model performance. The study also confirmed
that JAYA optimization significantly enhances accuracy compared to conventional hyperparameter
tuning methods.
● Hybrid deep learning and machine learning models improve breast cancer classification
accuracy compared to standalone deep learning or traditional classifiers.
● CNN-based feature extraction provides superior results compared to VGG16-based feature
extraction for this dataset.
● PCA reduces feature dimensionality, improving computational efficiency while maintaining
classification performance.
● JAYA optimization effectively selects hyperparameters, improving model accuracy and
generalization.
● The approach developed in this study can assist pathologists in early breast cancer
detection, reducing subjectivity in diagnosis.
While this study achieved promising results, there are several areas for future improvement:
● Expanding the dataset: Training on larger and more diverse datasets can improve model
generalization.
● Exploring advanced deep learning models: Investigating deeper architectures like ResNet,
EfficientNet, and Vision Transformers can enhance feature extraction capabilities.
● Real-world clinical application: Extending this approach for whole-slide histopathology
image analysis can improve its applicability in clinical settings.
REFERENCES
● [1] Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy, C., Polónia, A., & Campilho, A.
(2017). Classification of Breast Cancer Histology Images Using Convolutional Neural Networks.
PLOS ONE, 12(6), e0177544. https://doi.org/10.1371/journal.pone.0177544
● [2] Bayramoglu, N., Kannala, J., & Heikkilä, J. (2016). Deep Learning for Magnification
Independent Breast Cancer Histopathology Image Classification. Proceedings of the 2016 23rd
International Conference on Pattern Recognition (ICPR), 2440-2445.
https://doi.org/10.1109/ICPR.2016.7900001
● [3] Gupta, K., & Chawla, N. (2020). Analysis of Histopathological Images for Prediction of Breast
Cancer Using Traditional Classifiers with Pre-Trained CNN. Procedia Computer Science, 167,
878-889. https://doi.org/10.1016/j.procs.2020.03.427
● [4] Hoang, D.-T., et al. (2024). A Deep-Learning Framework to Predict Cancer Treatment Response
from Histopathology Images through Imputed Transcriptomics. Nature Cancer, 5, 1-13.
https://doi.org/10.1038/s43018-024-00793-2
● [5] Li, X., Li, H., Cui, W., Cai, Z., & Jia, M. (2021). Classification on Digital Pathological Images
of Breast Cancer Based on Deep Features of Different Levels. Mathematical Problems in
Engineering, 2021, 1-13. https://doi.org/10.1155/2021/8403025
● [6] Mondol, R. K., Millar, E. K. A., Graham, P. H., Browne, L., Sowmya, A., & Meijering, E.
(2023). hist2RNA: An Efficient Deep Learning Architecture to Predict Gene Expression from Breast
Cancer Histopathology Images. Cancers, 15(9), 2569. https://doi.org/10.3390/cancers15092569
● [7] Rana, M., & Bhushan, M. (2023). Classifying Breast Cancer Using Transfer Learning Models
Based on Histopathological Images. Neural Computing and Applications.
https://doi.org/10.1007/s00521-023-08484-2
● [8] Spanhol, F. A., Oliveira, L. S., Petitjean, C., & Heutte, L. (2016). Breast Cancer
Histopathological Image Classification Using Convolutional Neural Networks. Proceedings of the
2016 International Joint Conference on Neural Networks (IJCNN).
https://doi.org/10.1109/ijcnn.2016.7727519
● [9] Sudharshan, P. J., Petitjean, C., Spanhol, F., Oliveira, L. E., Heutte, L., & Honeine, P. (2019).
Multiple Instance Learning for Histopathological Breast Cancer Image Classification. Expert
Systems with Applications, 117, 103-111. https://doi.org/10.1016/j.eswa.2018.09.049
● [10] Su, Z., Niazi, M. K. K., Tavolara, T. E., Niu, S., Tozbikian, G. H., Wesolowski, R., & Gurcan,
M. N. (2023). BCR-Net: A Deep Learning Framework to Predict Breast Cancer Recurrence from
Histopathology Images. PLOS ONE, 18(4).
● [11] Dataset: https://www.kaggle.com/datasets/ambarish/breakhis
ANNEXURE 1 (SOURCE CODE)
Python
# Train the custom CNN as a binary (benign vs. malignant) classifier.
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2

# Each directory holds one sub-folder per class.
train_dir = r'K:\PROJECT\ORGANISED DATASET\train'
test_dir = r'K:\PROJECT\ORGANISED DATASET\test'

# Augment the training images; 20% of them are held out for validation.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    validation_split=0.2
)
# Test images are only rescaled.
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    subset='training'
)
# Note: this subset also comes from train_datagen, so validation images are augmented too.
validation_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    subset='validation'
)
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
)

# Three Conv-ReLU-MaxPool stages followed by a dense head, all L2-regularised.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3),
                 kernel_regularizer=l2(0.01)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=l2(0.01)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', kernel_regularizer=l2(0.01)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_regularizer=l2(0.01)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))  # probability of the positive class

model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Stop when validation loss has not improved for 5 epochs and keep the best weights.
early_stopping = EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)
history = model.fit(
    train_generator,
    epochs=13,
    validation_data=validation_generator,
    callbacks=[early_stopping]
)

# Plot training curves: accuracy on the left, loss on the right.
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Save the trained model and report accuracy on the held-out test set.
model.save('cnn_model.h5')
test_loss, test_accuracy = model.evaluate(test_generator)
print(f"Test Accuracy: {test_accuracy*100:.2f}%")
Python
train_dir = r'K:/PROJECT/ORGANISED DATASET/train'
test_dir = r'K:/PROJECT/ORGANISED DATASET/test'
IMG_HEIGHT = 224
IMG_WIDTH = 224
BATCH_SIZE = 32
train_datagen = ImageDataGenerator(
rescale=1./255,
rotation_range=30,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True,
fill_mode='nearest'
)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
train_dir,
target_size=(IMG_HEIGHT, IMG_WIDTH),
batch_size=BATCH_SIZE,
class_mode='binary',
shuffle=True
)
test_generator = test_datagen.flow_from_directory(
test_dir,
target_size=(IMG_HEIGHT, IMG_WIDTH),
batch_size=BATCH_SIZE,
class_mode='binary',
28
shuffle=False
)
# Pre-trained VGG16 convolutional base, frozen so only the new head is trained.
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))
base_model.trainable = False
model = Sequential([
    base_model,
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy',
              metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)
# Note: the test set is also used for validation here, so early stopping
# is guided by the same data that produces the final test accuracy.
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=test_generator,
    callbacks=[early_stopping]
)
model.save("C:/Users/K M SASTRY/Desktop/PROJECT/vgg16_model.h5")
test_loss, test_accuracy = model.evaluate(test_generator)
print(f"Test accuracy: {test_accuracy * 100:.2f}%")
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
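The VGG16 run above passes test_generator as validation_data, so early stopping is guided by the same images that are later used for the final score. One possible alternative, sketched here under assumptions, is to hold out a stratified validation subset of the training files. The file names and the 60/40 class counts below are hypothetical placeholders and not the project's data; train_test_split is scikit-learn's standard helper.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file list standing in for the contents of the training folder.
paths = [f"benign_{i}.png" for i in range(60)] + [f"malignant_{i}.png" for i in range(40)]
labels = [0] * 60 + [1] * 40

# Stratify so both subsets keep the same benign/malignant ratio.
train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_paths), "training files,", len(val_paths), "validation files")
print(sum(val_labels), "malignant files in the validation subset")
```

With a split like this, the test folder would be touched only once, for the final evaluation.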
3. CNN+RF+PCA / CNN+KNN+PCA / CNN+XGB+PCA (Base code)
Python
cnn_model = load_model(r"C:\Users\K M SASTRY\Desktop\CAPSTONE PROJECT\cnn_model.h5")
cnn_model = tf.keras.Model(inputs=cnn_model.input,
                           outputs=cnn_model.get_layer('flatten').output)
for layer in cnn_model.layers:
    layer.trainable = False
train_dir = r"K:\PROJECT\ORGANISED DATASET\train"
test_dir = r"K:\PROJECT\ORGANISED DATASET\test"
train_datagen = ImageDataGenerator(rescale=1.0 / 255.0)
test_datagen = ImageDataGenerator(rescale=1.0 / 255.0)
train_generator = train_datagen.flow_from_directory(
    train_dir, target_size=(224, 224), batch_size=32, class_mode='binary', shuffle=False
)
test_generator = test_datagen.flow_from_directory(
    test_dir, target_size=(224, 224), batch_size=32, class_mode='binary', shuffle=False
)
def extract_features(generator, model):
    features = model.predict(generator, verbose=1)
    labels = generator.classes
    return features, labels
train_features, train_labels = extract_features(train_generator, cnn_model)
test_features, test_labels = extract_features(test_generator, cnn_model)
print(f"Number of features before PCA: {train_features.shape[1]}")
scaler = StandardScaler()
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)
pca = PCA(n_components=0.95, random_state=42)
train_features_pca = pca.fit_transform(train_features_scaled)
test_features_pca = pca.transform(test_features_scaled)
print(f"Number of features after PCA: {train_features_pca.shape[1]}")
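Passing a fraction such as n_components=0.95 asks scikit-learn's PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The following is a minimal sketch of that behaviour on synthetic data; the array sizes and the low-rank construction are assumptions chosen only for illustration and do not represent the CNN features used in this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for CNN features: 200 samples, 50 correlated dimensions.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
features = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(scaled)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Components kept: {pca.n_components_}")
print(f"Variance retained: {cumulative[-1]:.4f}")
```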
4. RF Classifier
Python
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(train_features_pca, train_labels)
test_predictions = rf_classifier.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA: {accuracy:.4f}")
print("Classification Report with PCA:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
5. KNN Classifier
Python
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 15, 21], 'weights': ['uniform', 'distance']}
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(train_features_pca, train_labels)
best_knn = grid_search.best_estimator_
print(f"Best K value: {grid_search.best_params_['n_neighbors']}, "
      f"Weights: {grid_search.best_params_['weights']}")
test_predictions = best_knn.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA: {accuracy:.4f}")
print("Classification Report with PCA:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
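In the grid search above, scaling and PCA are fitted once on the full training features before cross-validation. A possible variant, sketched below, wraps the scaler, PCA and KNN in a scikit-learn Pipeline so that each cross-validation fold fits its own preprocessing. The synthetic data from make_classification and the shortened parameter grid are assumptions used only to keep the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted CNN features and binary labels.
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("knn", KNeighborsClassifier()),
])
# Step-prefixed names route each parameter to the KNN step of the pipeline.
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.4f}")
```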
6. XGB Classifier
Python
xgb_classifier = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,
    random_state=42
)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
grid_search = GridSearchCV(xgb_classifier, param_grid, cv=3, scoring='accuracy',
                           n_jobs=-1, verbose=1)
grid_search.fit(train_features_pca, train_labels)
best_xgb = grid_search.best_estimator_
print(f"Best Parameters: {grid_search.best_params_}")
print("XGBoost model trained using CNN features with PCA.")
test_predictions = best_xgb.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA & XGBoost: {accuracy:.4f}")
print("Classification Report with XGBoost:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
Python
vgg16_model = load_model(r"C:\Users\K M SASTRY\Desktop\PROJECT\vgg16_model.h5")
vgg16_model = tf.keras.Model(inputs=vgg16_model.input,
                             outputs=vgg16_model.get_layer('flatten').output)
for layer in vgg16_model.layers:
    layer.trainable = False
train_dir = r"K:\PROJECT\ORGANISED DATASET\train"
test_dir = r"K:\PROJECT\ORGANISED DATASET\test"
train_datagen = ImageDataGenerator(rescale=1.0 / 255.0)
test_datagen = ImageDataGenerator(rescale=1.0 / 255.0)
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    shuffle=False
)
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='binary',
    shuffle=False
)
def extract_features(generator, model):
    features = model.predict(generator, verbose=1)
    labels = generator.classes
    return features, labels
train_features, train_labels = extract_features(train_generator, vgg16_model)
test_features, test_labels = extract_features(test_generator, vgg16_model)
print(f"Train features shape: {train_features.shape}")
print(f"Test features shape: {test_features.shape}")
print("Before PCA, train features shape:", train_features.shape)
print("Before PCA, test features shape:", test_features.shape)
pca = PCA(n_components=0.95)
train_features_pca = pca.fit_transform(train_features)
test_features_pca = pca.transform(test_features)
print(f"After PCA, train features shape: {train_features_pca.shape}")
print(f"After PCA, test features shape: {test_features_pca.shape}")
8. RF Classifier
Python
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(train_features_pca, train_labels)
print("Random Forest training complete with PCA.")
test_predictions = rf_classifier.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA: {accuracy:.4f}")
print("Classification Report with PCA:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
9. KNN Classifier
Python
knn_classifier = KNeighborsClassifier(n_neighbors=7, weights='distance',
                                      metric='minkowski', p=2)
knn_classifier.fit(train_features_pca, train_labels)
test_predictions = knn_classifier.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA and optimized KNN: {accuracy:.4f}")
print("Classification Report with PCA:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
10. XGB Classifier
Python
xgb_classifier = xgb.XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.05, subsample=0.8,
    colsample_bytree=0.8, reg_lambda=1, objective='binary:logistic',
    use_label_encoder=False, eval_metric='logloss'
)
xgb_classifier.fit(train_features_pca, train_labels)
test_predictions = xgb_classifier.predict(test_features_pca)
accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy with PCA and XGBoost: {accuracy:.4f}")
print("Classification Report with PCA:")
print(classification_report(test_labels, test_predictions,
                            target_names=test_generator.class_indices.keys()))
Python
def jaya_optimization(train_features, train_labels, test_features, test_labels, max_iter=10,
                      population_size=5):
    # Search ranges for the PCA dimensionality and the Random Forest hyperparameters.
    min_components, max_components = 100, 400
    min_estimators, max_estimators = 50, 300
    min_max_depth, max_max_depth = 5, 50
    # Initial population: random parameter sets drawn from the ranges above.
    population = []
    for _ in range(population_size):
        params = {
            'n_components': random.randint(min_components, max_components),
            'n_estimators': random.randint(min_estimators, max_estimators),
            'max_depth': random.randint(min_max_depth, max_max_depth),
        }
        population.append(params)

    def evaluate(params):
        # Fitness: test accuracy of a Random Forest trained on PCA-reduced features.
        pca = PCA(n_components=params['n_components'])
        train_pca = pca.fit_transform(train_features)
        test_pca = pca.transform(test_features)
        rf = RandomForestClassifier(n_estimators=params['n_estimators'],
                                    max_depth=params['max_depth'], random_state=42)
        rf.fit(train_pca, train_labels)
        predictions = rf.predict(test_pca)
        return accuracy_score(test_labels, predictions)

    best_params = None
    best_accuracy = 0
    total_start_time = time.time()
    for iteration in range(max_iter):
        iter_start_time = time.time()
        new_population = []
        for params in population:
            # Each candidate is drawn by uniform random sampling from the search ranges.
            new_params = {
                'n_components': random.randint(min_components, max_components),
                'n_estimators': random.randint(min_estimators, max_estimators),
                'max_depth': random.randint(min_max_depth, max_max_depth),
            }
            old_acc = evaluate(params)
            new_acc = evaluate(new_params)
            # Greedy selection: keep whichever of the two parameter sets scores higher.
            if new_acc > old_acc:
                new_population.append(new_params)
            else:
                new_population.append(params)
            # Track the best candidate seen so far across all iterations.
            if new_acc > best_accuracy:
                best_accuracy = new_acc
                best_params = new_params
        population = new_population
        iter_end_time = time.time()
        iter_time = iter_end_time - iter_start_time
        print(f"Iteration {iteration + 1}/{max_iter}, Best Accuracy: {best_accuracy:.4f}, "
              f"Time Taken: {iter_time:.2f} seconds")
    total_end_time = time.time()
    total_time = total_end_time - total_start_time
    print(f"Total Optimization Time: {total_time:.2f} seconds")
    return best_params
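The function above generates each new candidate by uniform random sampling and then applies greedy selection. For comparison, the Jaya update as it is commonly stated moves every solution towards the current best and away from the current worst: x' = x + r1 * (best - |x|) - r2 * (worst - |x|), with r1 and r2 drawn uniformly from [0, 1]. The following is a minimal sketch of that rule on a toy continuous objective. The names jaya_minimize and sphere, the bounds and the default settings are hypothetical and are not part of the project code.

```python
import random


def jaya_minimize(objective, bounds, population_size=10, max_iter=100, seed=0):
    """Minimise `objective` inside box `bounds` using the basic Jaya update."""
    rng = random.Random(seed)
    # Random initial population inside the bounds.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(population_size)]
    scores = [objective(x) for x in population]

    for _ in range(max_iter):
        # Best and worst solutions at the start of this iteration.
        best = population[scores.index(min(scores))]
        worst = population[scores.index(max(scores))]
        for i, x in enumerate(population):
            candidate = []
            for j, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Move towards the best solution and away from the worst one.
                value = x[j] + r1 * (best[j] - abs(x[j])) - r2 * (worst[j] - abs(x[j]))
                candidate.append(min(max(value, lo), hi))  # clip to the bounds
            candidate_score = objective(candidate)
            # Greedy selection: accept the candidate only if it improves.
            if candidate_score < scores[i]:
                population[i], scores[i] = candidate, candidate_score

    best_index = scores.index(min(scores))
    return population[best_index], scores[best_index]


def sphere(x):
    # Toy objective with its minimum of 0 at the origin.
    return sum(v * v for v in x)


best_x, best_score = jaya_minimize(sphere, bounds=[(-5.0, 5.0)] * 2)
print(best_x, best_score)
```

To apply the same rule to n_components, n_estimators and max_depth, the updated values would need to be rounded to integers and the objective negated, since accuracy is maximised rather than minimised.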
ANNEXURE 2 (OUTPUT SCREENS)
CNN+RF Feature count after JAYA
Malignant image prediction