
Node Classification in Website Fingerprinting using Graph Neural Networks

Suaz Hanif Malgundkar, Dept. of Computer Engineering, Pillai College of Engineering, Navi Mumbai, India, msuaz20ce@[Link]
Rahul Babar Kage, Dept. of Computer Engineering, Pillai College of Engineering, Navi Mumbai, India, rkage20comp@[Link]
Priyanshu Chilkoti, Dept. of Computer Engineering, Pillai College of Engineering, Navi Mumbai, India, pchil20comp@[Link]
Adheesh Sreedharan, Dept. of Computer Engineering, Pillai College of Engineering, Navi Mumbai, India, achakrambil20comp@[Link]
Prof. K.S Suresh Babu, Dept. of Computer Engineering, Pillai College of Engineering, Navi Mumbai, India, sureshbabu@[Link]

Abstract— Website fingerprinting aims to identify the specific webpages in encrypted traffic by observing patterns of traffic traces. Existing systems mainly focus on website homepage fingerprinting; it is more difficult to identify different webpages within the same website because their traffic traces are very similar. In our project, we propose the Graph Attention Pooling Network for fine-grained WF (GAP-WF). We construct the trace graph according to the flow sequence in webpage loading and use a GNN-based model to better learn the features of nodes (intra-flow) and structures (inter-flow). Because different flows can have different effects on classification, we utilize a Graph Attention Network to pay attention to the more useful nodes. The algorithms used are Support Vector Machine (SVM) and Random Forest, with a contribution using K-Nearest Neighbor (KNN). We use four datasets, WEB100, APPLE60, CDC30, and PAGE100, to evaluate the performance of GAP-WF.

I. INTRODUCTION

This report offers a thorough examination of node classification in website fingerprinting, focusing on the challenges posed by the widespread adoption of SSL/TLS encryption. Such encryption complicates the identification of visited websites, prompting the use of techniques like SSL/TLS website fingerprinting, which relies on encrypted traffic analysis. Various methodologies, including machine learning frameworks like TensorFlow and Keras, alongside graph theory libraries such as NetworkX, are employed in these systems. However, ethical concerns, privacy issues, and algorithmic biases present notable challenges. The project aims to implement the Graph Attention Pooling Network (GAP-WF) for SSL/TLS website fingerprinting, evaluating its efficacy in accurately classifying encrypted traffic and assessing its potential superiority over existing techniques. Additionally, the scalability and practicality of deploying GAP-WF in real-world scenarios will be investigated. GAP-WF's applicability extends to scenarios involving encrypted or anonymized traffic, making it potentially valuable for surveillance, security monitoring, and research purposes. However, challenges such as computational demands and accuracy limitations must be addressed. The broader implications of website fingerprinting using GAP-WF underscore the importance of considering both its benefits and risks in any deployment.

II. LITERATURE SURVEY

A. Literature Review
In the feature extraction stage, the system extracts relevant features from the encrypted traffic using a pre-trained deep learning model. In the classification stage, the system uses a feed-forward neural network to classify the extracted features and identify the visited website.

B. Deep Learning Approach
Xue, "Fingerprinting HTTPS Websites: A Deep Learning Approach" (2021)
This paper presents a deep learning-based approach for HTTPS website fingerprinting. The system uses a convolutional neural network (CNN) to extract features from the encrypted traffic and classify the traffic into different websites.

C. Transfer Learning Approach
Garg, "Improving Website Fingerprinting Attacks through Transfer Learning" (2021)
This paper involves training a model on a source domain with a large amount of labeled data and transferring the learned knowledge to a target domain with limited labeled data. The authors propose a novel transfer learning algorithm called TL-WF, which fine-tunes a pre-trained convolutional neural network (CNN) on a small set of labeled data from the target domain.

D. Data Augmentation
Data augmentation is the addition of new data artificially derived from existing training data. Techniques include resizing, flipping, rotating, cropping, and padding. It helps to address issues like overfitting and data scarcity, and it makes the model more robust, with better performance.

E. Few-Shot Technique
Yongjun Wang, "Few-Shot Website Fingerprinting Attack with Data Augmentation" (2021)
This paper proposes a fingerprinting attack method that uses data augmentation techniques to improve the performance of the attack. The method leverages a few labeled samples of a target website and a large amount of unlabeled traffic data to construct a fingerprint of the target website. The authors showed that their method outperforms existing few-shot website fingerprinting attack methods in terms of accuracy.

F. Homology Analysis
Maohua Guo, "Website Fingerprinting Attacks Based on Homology Analysis" (2021)
The surveyed attack consists of two phases: training and testing. In the training phase, the attacker collects a set of website traces and extracts their corresponding features. The features are then clustered using homology analysis to generate a set of representative templates. In the testing phase, the attacker captures the traffic of the target user and extracts the features of each packet. The attacker then compares the extracted features with the representative templates to identify the visited web pages. The attack does not rely on the assumption that the attacker has prior knowledge of the website's content, and it has a high accuracy rate, which makes it effective in identifying the websites visited by a user.

TABLE I. SUMMARY OF TECHNIQUES FOR WEBSITE FINGERPRINTING

1. "Graph Attention Pooling Network for Fine-grained SSL/TLS Website Fingerprinting" by Jie Lu, 2021.
   Advantages: achieves higher accuracy, precision, and recall.
   Disadvantages: requires a large amount of labeled data to train the model.

2. "Website Fingerprinting Attacks Based on Homology Analysis" by Maohua Guo, 2021.
   Advantages: does not rely on the assumption that the attacker has prior knowledge of the website's content.
   Disadvantages: requires the attacker to have access to the user's traffic.

3. "Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning" by Sanjit Bhat, 2019.
   Advantages: requires less training data.
   Disadvantages: vulnerable to adversarial attack.

4. "Few-Shot Website Fingerprinting Attack with Data Augmentation" by Yongjun Wang, 2021.
   Advantages: requires only a few labeled samples of a target website.
   Disadvantages: requires a large amount of unlabeled traffic data to construct a fingerprint.

5. "Improving Website Fingerprinting Attacks through Transfer Learning" by Garg et al., 2021.
   Advantages: can improve the accuracy of website fingerprinting attacks on a target domain with limited labeled data; the approach is robust to traffic obfuscation techniques, as demonstrated on the WFPad dataset.
   Disadvantages: requires a large amount of labeled data from a source domain.

6. "Website Fingerprinting: Attacking Popular Privacy Enhancing Technologies with the Multinomial Naïve-Bayes Classifier" by Dominik Herrmann, Rolf Wendolsky, and Hannes Federrath, 2009.
   Advantages: relatively efficient and can be implemented on a variety of platforms; able to accurately identify websites even when they are accessed through privacy-enhancing technologies.
   Disadvantages: requires a significant amount of computational resources, which may not be available on all platforms; highly dependent on the quality of the traffic data, which can be noisy and difficult to obtain in some cases.
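The data-augmentation idea surveyed in Sections II-D and II-E can be made concrete with a short sketch. Here a trace is represented as a list of signed packet lengths (positive = outgoing, negative = incoming); this representation and all helper names are our own illustrative assumptions, not code from the surveyed papers:

```python
import random

def augment_trace(trace, dup_prob=0.05, drop_prob=0.05, rng=None):
    """Return a perturbed copy of a packet trace.

    Each packet may be duplicated (simulating a retransmission) or
    dropped (simulating loss) with small probability, yielding extra
    synthetic training samples for few-shot settings.
    """
    rng = rng or random.Random()
    out = []
    for pkt in trace:
        if rng.random() < drop_prob:
            continue          # simulate a lost packet
        out.append(pkt)
        if rng.random() < dup_prob:
            out.append(pkt)   # simulate a retransmitted packet
    return out

def make_synthetic_set(trace, n, seed=0):
    """Generate n augmented variants of a single labeled trace."""
    rng = random.Random(seed)
    return [augment_trace(trace, rng=rng) for _ in range(n)]
```

A few-shot attacker would train on the handful of real traces plus the variants returned by `make_synthetic_set`, which is the spirit (though not the exact mechanics) of the augmentation used in the surveyed attack.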
III. WEBSITE FINGERPRINTING SYSTEM

A. Website Fingerprinting Techniques
The objective of the project is to create a software implementation of the Graph Attention Pooling Network (GAP-WF) for SSL/TLS website fingerprinting, to evaluate the effectiveness of GAP-WF in accurately classifying SSL/TLS encrypted traffic and identifying the websites being visited, and to investigate the potential for improved performance compared to existing SSL/TLS fingerprinting techniques.
• Improved Feature Extraction: The proposed system can employ more advanced feature extraction techniques such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract more discriminative and informative features from network traffic data.
• Ensemble Learning: The proposed system can use ensemble learning techniques to combine multiple classifiers and improve the accuracy and robustness of the classification model.
• Online Learning: The proposed system can implement an online learning framework that continuously updates the model parameters with new data, improving the system's efficiency and adaptability to dynamic network traffic patterns.
• Multi-task Learning: The proposed system can utilize multi-task learning to jointly learn website fingerprinting and other related tasks, such as traffic classification or intrusion detection, to leverage the shared information and improve the overall performance of the system.

B. Existing System Architecture
Fig. 1. Existing system architecture used for Website Fingerprinting [2]
Several factors affect the performance of the existing system:
• Dataset size and diversity: The performance of the model is highly dependent on the size and diversity of the dataset used to train it. A larger and more diverse dataset is likely to lead to better performance.
• Graph structure: The structure of the graph used to represent the network traffic can have a significant impact on the performance of the model. Different graph structures capture different aspects of the network traffic, so it is important to choose an appropriate structure.
• Attention mechanism: The attention mechanism used in the GAPN can affect the performance of the model. Different attention mechanisms can focus on different parts of the input data and capture different patterns.
• Network architecture: The architecture of the network can also impact the performance of the model. Different architectures can capture different levels of abstraction and complexity in the data.
• Adversarial attacks: Website fingerprinting using GAPN is vulnerable to adversarial attacks, where an attacker tries to disguise their network traffic to avoid detection. Robustness against such attacks is an important factor to consider when evaluating the effectiveness of the system.

C. Proposed System Architecture
The previous sections discussed the strengths and weaknesses of the existing system. The proposed architecture is shown in Figure 2.
Fig. 2. Proposed system architecture
The proposed system consists of the following stages:
• Data Collection: Collect a dataset of SSL/TLS encrypted website traffic traces from various websites.
• Preprocessing: Preprocess the dataset by filtering out irrelevant traffic and splitting the remaining traffic into packets.
• Graph Construction: Construct a graph representation of the SSL/TLS encrypted website traffic using the packets as nodes and their relationships as edges.
• Feature Extraction: Extract features from the SSL/TLS encrypted website traffic graph using techniques such as subgraph enumeration and edge distance distribution.
• Graph Attention Pooling Network: Train a Graph Attention Pooling Network (GAP-Net) using the SSL/TLS encrypted website traffic graph and the extracted features to perform fine-grained SSL/TLS website fingerprinting.
• Evaluation: Evaluate the performance of the trained GAP-Net using metrics such as accuracy, precision, recall, and F1-score.
• Deployment: Deploy the trained GAP-Net for SSL/TLS website fingerprinting on a system.

D. Implementation Techniques
• Dataset Preparation: The authors collected a dataset of SSL/TLS traffic by monitoring the network traffic of various websites using a passive network sniffer. They then preprocessed the dataset by filtering out non-HTTPS traffic and splitting it into training and testing sets.
• Feature Extraction: Next, the authors extracted features from the SSL/TLS packets using the following techniques:
  Byte frequency distribution (BFD): A histogram of the frequency of each byte value in the packet payload.
  Byte sequence distribution (BSD): A histogram of the frequency of each byte sequence of length k in the packet payload.
  Statistical moments: A set of statistical moments of the packet payload, such as mean, variance, skewness, and kurtosis.
• Graph Construction: The authors constructed a graph representation of the SSL/TLS traffic by considering each packet as a node in the graph and adding edges between nodes based on their inter-arrival time and packet direction. The resulting graph is a directed acyclic graph (DAG) that captures the temporal ordering of the SSL/TLS packets.
• Graph Attention Pooling: To learn a representation of the graph, the authors used a graph attention pooling (GAP) network. The GAP network consists of multiple graph attention layers that learn to assign importance scores to the nodes and edges of the graph based on their features and relations. The outputs of the final attention layer are then pooled to produce a fixed-size representation of the graph.
• Classification: The authors trained a classifier on the graph representations produced by the GAP network to predict the website being visited based on the SSL/TLS traffic. They experimented with several classifiers, including logistic regression, random forest, and support vector machines (SVMs), and evaluated their performance using various metrics such as accuracy, precision, recall, and F1 score.
• Evaluation: The authors evaluated the performance of their approach on several benchmark datasets and compared it with existing SSL/TLS fingerprinting techniques. They also conducted ablation studies to analyze the contribution of each component of their approach and identified areas for future research.
• Results and Analysis: The results of the experiments were analyzed to determine the effectiveness of the GAP-WF network in accurately classifying SSL/TLS traffic and identifying the websites being visited. The analysis also identified any potential limitations or areas for improvement in the network architecture.
• Scalability Analysis: The scalability of the network was analyzed by measuring the computational efficiency and resource usage of the network on larger datasets of SSL/TLS traffic. This involved identifying any potential bottlenecks or limitations that may affect the practical use of the network in real-world applications.

Traffic flow in Tor Browser:
• Data is encrypted by the VPN
• Data is encrypted again by Tor
• Data is sent to the VPN server
• The VPN decrypts the data, but it is still encrypted by Tor
• Data is sent through a series of nodes on the Tor network
• If visiting an onion address on the dark web, data is decrypted upon arriving at its destination
Fig. 3. Traffic flow in Tor

IV. USE CASE

A. Online Fraud Detection
Websites can detect fraudulent user behavior using browser fingerprinting technology. When a website detects fraudulent user activity, it adds extra authentication steps to the login process to prevent unauthorized access and theft of legitimate users' accounts.
Real-world browser fingerprinting example: browser fingerprinting is used to authenticate users when they log in to an online banking system. Wachovia, a financial services company, created unique fingerprints for their customers' devices to verify their identity with a unique identifier and block malicious users.
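The graph construction described in Section III-D (each packet a node, edges between successive packets based on inter-arrival time and direction, yielding a DAG that preserves temporal order) can be sketched in a few lines. The gap threshold and field names below are illustrative assumptions, not values taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    ts: float        # arrival timestamp in seconds
    direction: int   # +1 outgoing, -1 incoming
    length: int      # payload length in bytes

def build_trace_graph(packets, max_gap=0.5):
    """Build a DAG over a packet sequence.

    Nodes are packet indices; a directed edge i -> j (i < j) is added
    when packet j follows packet i within `max_gap` seconds, so the
    edge set captures the temporal ordering of the trace.  Each edge
    is labeled with the inter-arrival time and whether the direction
    changed, mirroring the intra-flow/inter-flow distinction.
    """
    packets = sorted(packets, key=lambda p: p.ts)
    edges = []
    for i in range(len(packets) - 1):
        j = i + 1
        gap = packets[j].ts - packets[i].ts
        if gap <= max_gap:
            edges.append((i, j, {"gap": gap,
                                 "turn": packets[i].direction != packets[j].direction}))
    return list(range(len(packets))), edges
```

Since all edges point from earlier to later indices, the result is acyclic by construction; the same node/edge lists could be loaded into a NetworkX `DiGraph` for the subgraph-enumeration features mentioned earlier.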
B. Tailored Content Recommendations
It is possible to delete cookie history and block cookies in your web browser. Browsers such as Google Chrome, Firefox, Safari, and Microsoft Edge enable users to disable cookies.
Fig 4. Working of Tor
Unlike cookies, browser fingerprints are hard to block. That is why browser fingerprinting is a more effective technique for advertisers to track users' behaviors and activities across the web.
Websites use fingerprinting to track and analyze visitors' activities and behaviors to create personalized experiences. For example, when you request a website to display its content, the website can reveal your geo-location by tracking your IP fingerprint. This enables eCommerce sites to recommend locally relevant content and nearby stores to their online visitors. Websites can access the following information about their visitors using browser fingerprint technology:
• A user's current location.
• The type of device the user uses, such as a desktop, tablet, or mobile phone.
• Traffic source: users can reach a website in different ways, such as direct, social media, referral, and paid traffic.
Fig 5. Attacking Sources

C. Avoidance of Browser Fingerprinting
a) Anonymize your browser: The possibility of being recognized and tracked by websites increases as the level of browser uniqueness increases. It is difficult to avoid browser fingerprinting entirely; however, you can lower your browser's uniqueness. For example, when you connect to a website, your machine sends an HTTP header-encoded connection request to the web server to display the content. These headers contain machine-specific information such as your user agent (browser name, version, etc.), location, language, time zone, browser plugins, and other browser properties. We recommend that you anonymize or lower your browser uniqueness to avoid being easily identified by websites.
Fig 6. Blocking of Malicious content by Tor
b) Leverage proxies: You can use a proxy server or a headless browser to stay anonymous while browsing. Shared proxies, for example, are a highly anonymous option for combating browser fingerprinting technology. Since multiple users share the same IP address simultaneously, being detected by website servers is hard: the proxy blends your machine in with the machines of many other users. Websites will have difficulty targeting your machine because you use a less unique configuration.
Fig 7. Overview of Website Fingerprint
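The header-based uniqueness discussed in Section IV-C can be illustrated with a toy sketch that hashes a set of browser attributes into a stable identifier. This is a simplification for illustration only (real fingerprinting combines many more signals); the attribute names are our own assumptions:

```python
import hashlib

def browser_fingerprint(attrs):
    """Hash a dict of browser attributes into a stable identifier.

    Attributes are sorted so that the same configuration always maps
    to the same fingerprint, while any change to a single attribute
    produces a different one.
    """
    canon = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()[:16]
```

Two machines with identical headers collide on the same identifier, which is exactly why lowering uniqueness (fewer distinguishing attributes, or attributes shared with many users) makes tracking harder.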


c) Hardware and Software Specifications: The experiments are carried out on a computer system with the hardware and software specifications given in Table II and Table III, respectively.

TABLE II. HARDWARE DETAILS
Processor: Intel Core i5
HDD: 500 GB
RAM: 8 GB

TABLE III. SOFTWARE DETAILS
APIs: Keras, TensorFlow, NumPy, Pandas
Programming language: Python

V. RESULT AND DISCUSSION

A. Standard Datasets Used
We collected the following major data sets used across several sections:
• Non-Tor data set: We used Firefox 38.0 to visit web pages. Most privacy technologies (such as VPNs and proxies) do not attempt to defend the packet sequence, so results on this data set would be similar to results on most privacy technologies. We collected 100 instances of Alexa's top 100 pages as the set of monitored pages, and 9000 instances of Alexa's top 10000 pages (1 instance of each page, discarding failed pages) as the set of non-monitored pages. We collected this data in July 2023.
• Tor data set: We collected our data on Tor Browser 4.5a4 with Tor [Link]-alpha. As above, we collected the same number of monitored and non-monitored pages. Tor is especially interesting to us because it has applied defenses against website fingerprinting and continues to improve its defenses. We collected the data in June 2023. In our previous work we collected a data set of sensitive pages banned by several ISPs in various countries. We do not do so in this thesis because we have found that these sites are taken down frequently, which makes it harder to reproduce our results. Thus, we will only use Alexa's top sites. We also collected several other data sets for use in specific sections, and we will describe these data sets in detail in the relevant sections.

B. Evaluation Parameters
To tackle the open-world scenario, we designed Wa-kNN [WCN+14] (2014). Wa-kNN uses the k-Nearest Neighbours (k-NN) classifier, a simple supervised machine learning algorithm. k-NN starts with a set of training points Strain = {P1, P2, . . .}, each point being a packet sequence. Let the class of Pi be c(Pi). Given a testing point Ptest ∈ Stest, the classifier computes the distance d(Ptest, Ptrain) for each Ptrain ∈ Strain. The algorithm then classifies Ptest based on the classes of the k closest training points to Ptest.

k-NN uses a distance metric d to describe how similar two packet sequences are. We want the distance to be accurate on simple encrypted data without extra padding, but also accurate if the client applies defenses that remove features from our available feature set (for example, Tor removes unique packet lengths). We therefore start with a large feature set F = {f1, f2, . . .}. Each feature is a function f which takes in a packet sequence P and computes f(P), a nonnegative number. Conceptually, we select each feature such that packet sequences from the same page are more likely to have similar features than packet sequences from different pages. We list our full feature set of 4226 features in Table IV.

TABLE IV. FULL FEATURE SET OF WA-KNN

Number of features: Description
4: |P|, |P+|, |P−|, and T|P|.
3001: A list M, such that Mi = 1 if there is a packet of length i in P, and Mi = 0 otherwise. M is similar to PU, the unique packet length feature. Given that the MTU is 1500, we have |M| = 3001 for all lengths between −1500 and 1500.
500: Position of the first 500 outgoing packets, i, such that Pi > 0.
500: Difference in position between each of the first 500 outgoing packets and the next outgoing packet, i − j, such that Pi > 0 and Pj > 0.
100: Dividing P into non-overlapping windows of size 30, the number of outgoing packets in each of the first 100 windows.
3: Define a burst as a sequence of packets in the same direction (much like Dy-VNG [DCRS12]), and the burst length as the number of said packets. We extract the longest burst, the number of bursts, and the mean size of each burst.
6: The number of bursts with lengths longer than 2, 5, 10, 15, 20, and 50 respectively.
100: The length of the first 100 bursts.
10: The direction of the first 10 packets.
2: The mean and standard deviation of the interpacket times.

The distance between P and P′ is

d(P, P′) = Σ (i = 1 to |F|) wi · |fi(P) − fi(P′)|

We learn the weights W = {w1, w2, . . . , w|F|} using a new weight learning process we call WLLCC (Weight Learning by Locally Collapsing Classes). With WLLCC, we reduce the weights of uninformative features. As weight learning proceeds, the k-NN distance comes to focus on features that are useful for classification.
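The weighted distance and classification rule described in this section can be sketched as follows. This is our illustrative reconstruction (plain weighted L1 distance over precomputed feature vectors), not the authors' released implementation; the non-monitored fallback when neighbours disagree follows the Wa-kNN rule discussed in the text:

```python
def wknn_distance(f_p, f_q, weights):
    """Weighted L1 distance: d(P, P') = sum_i w_i * |f_i(P) - f_i(P')|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, f_p, f_q))

def wknn_classify(test_feats, train, weights, k=3, non_monitored="unmonitored"):
    """Classify a test point from the classes of its k nearest neighbours.

    `train` is a list of (feature_vector, class_label) pairs.  Following
    the Wa-kNN rule, if the k nearest neighbours do not all agree, the
    point is assigned to the non-monitored class, which lets k act as a
    knob trading off TPR against FPR.
    """
    ranked = sorted(train, key=lambda t: wknn_distance(test_feats, t[0], weights))
    top = [label for _, label in ranked[:k]]
    return top[0] if len(set(top)) == 1 else non_monitored
```

Weight learning (WLLCC) would then adjust `weights` so that uninformative features contribute little to the distance; here the weights are simply given.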
Despite its simplicity, k-NN has a number of advantages over other classifiers:
• Short training and testing time. The training time consists of learning the weights W; we will show in Section 3.4.4 that this is fast. Since the distance comparison between two packet sequences is very simple, the testing time is short as well.
• Multi-modal classification. Wa-kNN is based on the important observation that a class representing a web page is multi-modal. A multi-modal class naturally consists of several subclasses, where points in each subclass are similar to each other, but points in different subclasses are not. Web pages are often multi-modal: for example, a single web page can have different localization versions for different users.
• k-NN is well-suited for multi-modal classes, for which different modes of the class do not need to have any relationship with each other, because only the k closest neighbours are relevant for classification. As long as the size of each mode is larger than k, k-NN will not suffer from the presence of other modes in the same class. In contrast, many other classifiers (SVM, Naive Bayes, etc.) take all points in the class into consideration, including those from other modes.

There are a number of ways to deal with the case when the k neighbours of a testing point differ in class assignment. Most commonly, the k neighbours act as voters in a majority scheme, possibly with their votes weighted by their distance to the testing point. In Wa-kNN, if any of the k neighbours disagree, we assign the testing point to the non-monitored class, even if all of the k neighbours pointed to monitored pages (but different ones). This approach gives us a strategic advantage: we can use k as a parameter to trade off TPR and FPR. A higher k means that we will more often decide that pages are non-monitored. We can also use |C0|, the size of the non-monitored class, to trade off TPR and FPR.

C. Performance Evaluation
In order to evaluate the proposed system, we used it on various popular websites such as Google, YouTube, etc. The first step is data collection, which involves collecting the fingerprints, that is, the traffic data of a particular website visited by several users. We use a script [Link] to capture the traffic on the particular website in order to identify third-party websites trying to gain access. The captured traffic is then saved into a file, and the proposed system creates another file listing the addresses of unknown traffic, i.e., third-party websites trying to access the website and thereby compromising the user's safety.

The accuracy of the proposed technique was better than that of the existing system. We split the data into training and test data. The KNN classifier is used to train the model. The [Link] script is used to load the classifier and identify which web page the unknown traffic originated from. It is worth noting that, using this classifier, the proposed system obtained an accuracy of 95%, which was higher than the existing system.

VI. CONCLUSION AND FUTURE SCOPE

A. Conclusion
Website fingerprinting exposes the difference between security and privacy. We never attempt to break the encryption of the client's communication channel, but the client's page access is compromised nonetheless. Since different clients' communications are largely similar (especially on Tor Browser) when accessing the same page, we can gain significant knowledge of their browsing activities. Website fingerprinting is conceptually analogous to a dictionary attack against an encryption scheme with a small set of possible plaintexts. Without randomly padding the plaintexts (analogous to website fingerprinting defenses), even an otherwise correctly implemented encryption scheme can be broken by precomputing ciphertexts (analogous to the attacker's training set).

Wa-kNN achieves the best of both worlds: it has a very low training and testing time, but it is also nearly as accurate as Wa-OSAD in the closed world. It trumps all other attacks in the open world, as it has tunable parameters that trade off the TPR for the FPR and vice versa effectively, allowing us to tackle the base rate fallacy. We have seen that our new WLLCC algorithm for Wa-kNN has lifted the competitiveness of the simple and fast k-NN classifier to that of SVMs for website fingerprinting. The algorithm may be useful for k-NN classification in general, and it can also be used in clustering.

B. Future Scope
We include significant new work in this thesis beyond the scope of our published papers on the above attacks. We implement 12 website fingerprinting attacks with standardized input and output to allow comparative experiments at a greater scale than previously seen work. We show new results on training and testing time. Our systematization of attacks by comparing their use of distances and features allows the possible development of hybrid attacks that combine approaches from different attacks.

We developed our attacks and performed our experiments with a strong focus on Tor, because it is one of the easiest and most popular tools for clients to achieve the level of privacy required to render website fingerprinting relevant: an encrypted channel across proxies. Tor is actively developing defenses against website fingerprinting attacks, because if website fingerprinting were perfectly successful, it would reduce the privacy of Tor to simply that of an encrypted channel.

Any entry node on Tor may perform website fingerprinting to discover its clients' behaviour. Nevertheless, website fingerprinting attacks have limitations. Some of these limitations are inherent to the problem statement, and overcoming them may be truly difficult. We categorize these limitations, which may make for interesting future work:
• Identifying very rare pages. The low base rate of a very rare page may cause the number of false positives to overwhelm the number of true positives, and the base rate fallacy is much harder to overcome. Our experiments suggest that we can achieve high precision down to a base rate of around 0.01% (1 in 10,000), but rarer pages would require more advanced techniques, and there is no limit to how rare a web page might be.
• Identifying activity. It may be possible to identify whether a client is web browsing, chatting, listening to music, watching a video, and so on. All of these activities can be done through a regular browser, and can therefore be done on Tor. Currently, we do not tackle any activity except web browsing. Identifying other activities would give the attacker more information to work with.
• Identifying web sites (as opposed to individual web pages as in this work). There is some previous work on identifying web sites [CZJJ12]. However, past results are generally limited to identifying only one or two specific web sites. It is not clear whether the techniques used for identifying these specific web sites can be extended to identifying hundreds of web sites at once.
• Understanding the underlying web page. Web pages are essentially request-response pairs, and some responses trigger further requests. Currently, we collect web pages only as packet sequences, ignoring the intermediate request-response structure. If we were able to simulate packet sequences accurately from the request-response structure, with the network conditions as input, we could significantly improve website fingerprinting. Data collection would be much faster, as we would not have to wait for Tor's latency to load web pages. We would also be able to build better classifiers that ignore random noise such as advertisements.

The astute reader may have noted that we collect data sets in this chapter in a simplistic manner. We start the browser, visit a web page, record the packet sequence, and exit the browser. Realistic clients can visit several web pages, sequentially or at once, and can generate noise by engaging in other activities (e.g. listening to streaming music in the background).

ACKNOWLEDGMENT
We would like to express our special thanks to Prof. K.S Suresh Babu, our major project guide, who guided us through the project and helped us apply the knowledge we acquired during the semester and learn new concepts.

We would also like to thank Prof. Sharvari Govilkar, the H.O.D. of our Computer Engineering department, who gave us the opportunity to do this major project, through which we learned new concepts and their applications.

We are also thankful to our mini project coordinator, along with the other faculty, for their encouragement and support.

Finally, we would like to express our special thanks of gratitude to Principal Dr. Sandeep Joshi, who gave us the opportunity and facilities to conduct this major project.

REFERENCES
[1] A. Freier, "GAP-WF: Graph Attention Pooling Network for Fine-grained SSL/TLS Website Fingerprinting," IEEE Conference Publication, IEEE Xplore, pp. 20-24, July 2021.
[2] Z. Li, "Deep Forest with LRRS Feature for Fine-grained Website Fingerprinting with Encrypted SSL/TLS," ACM Publications, pp. 100-135, November 2019.
[3] M. Shen, Y. Liu, L. Zhu, X. Du, and J. Hu, "Fine-Grained Webpage Fingerprinting Using Only Packet Length Information of Encrypted Traffic," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2046-2059, 2021.
[4] A. Panchenko, "Website Fingerprinting at Internet Scale," in Proceedings of ACM CCS, pp. 20-34, December 2016.
[5] S. Bhat, D. Lu, A. Kwon, and S. Devadas, "Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning," Proceedings on Privacy Enhancing Technologies, pp. 292-310, July 2019.
[6] X. He, J. Wang, Y. He, and Y. Shi, "A Deep Learning Approach for Website Fingerprinting Attack," 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 1419-1423, July 2018.