Graph Neural Networks for Web Fingerprinting

Abstract— Website fingerprinting aims to identify the specific webpages in encrypted traffic by observing patterns of traffic traces. Existing systems mainly focus on website homepage fingerprinting, and it is more difficult to identify different webpages within the same website because their traffic traces are very similar. In our project, we propose the Graph Attention Pooling Network for fine-grained WF (GAP-WF). We construct a trace graph according to the flow sequence observed during webpage loading, and use a GNN-based model to better learn the features of nodes (intra-flow) and structures (inter-flow). Because different flows may have different effects on classification, we utilize a Graph Attention Network to pay attention to the more useful nodes. The algorithms used are Support Vector Machine (SVM) and Random Forest, with an additional contribution using K-Nearest Neighbor (KNN). We use four datasets, comprising WEB100, APPLE60, CDC30, and PAGE100, to evaluate the performance of GAP-WF.

I. INTRODUCTION

This report offers a thorough examination of node classification in website fingerprinting, focusing on the challenges posed by the widespread adoption of SSL/TLS encryption. Such encryption complicates the identification of visited websites, prompting the use of techniques like SSL/TLS website fingerprinting, which relies on encrypted traffic analysis. Various methodologies, including machine learning frameworks like TensorFlow and Keras, alongside graph theory libraries such as NetworkX, are employed in these systems. However, ethical concerns, privacy issues, and algorithmic biases present notable challenges. The project aims to implement the Graph Attention Pooling Network (GAP-WF) for SSL/TLS website fingerprinting, evaluating its efficacy in accurately classifying encrypted traffic and assessing its potential superiority over existing techniques. Additionally, the scalability and practicality of deploying GAP-WF in real-world scenarios will be investigated. GAP-WF's applicability extends to scenarios involving encrypted or anonymized traffic, making it potentially valuable for surveillance, security monitoring, and research purposes. However, challenges such as computational demands and accuracy limitations must be addressed. The broader implications of website fingerprinting using GAP-WF underscore the importance of considering both its benefits and risks in any deployment.

II. LITERATURE SURVEY

A. Literature Review

In the feature extraction stage, the system extracts relevant features from the encrypted traffic using a pre-trained deep learning model. In the classification stage, the system uses a feed-forward neural network to classify the extracted features and identify the visited website.

B. Deep Learning Approach

Xue, "Fingerprinting HTTPS Websites: A Deep Learning Approach" (2021)

This paper presents a deep learning-based approach for HTTPS website fingerprinting. The system uses a convolutional neural network (CNN) to extract features from the encrypted traffic and classify the traffic into different websites.

C. Transfer Learning Approach

Garg, "Improving Website Fingerprinting Attacks through Transfer Learning" (2021)

This survey paper involves training a model on a source domain with a large amount of labeled data and transferring the learned knowledge to a target domain with less labeled data.
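The freeze-and-retrain mechanic behind the transfer-learning approach above can be sketched in a few lines. The sketch below is our own illustration, not the architecture from Garg's paper: it uses a fixed random ReLU projection as a stand-in for a pre-trained convolutional base, synthetic vectors as a stand-in for traffic features, and a logistic-regression classification head that is pre-trained on a large source domain and then fine-tuned on a small target domain.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(n, w_true):
    # Synthetic stand-in for traffic feature vectors with linear labels.
    X = rng.normal(size=(n, 20))
    return X, (X @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_head(H, y, w0=None, epochs=300, lr=0.5):
    # Logistic-regression "classification head" on fixed features H.
    w = np.zeros(H.shape[1]) if w0 is None else w0.copy()
    for _ in range(epochs):
        w -= lr * H.T @ (sigmoid(H @ w) - y) / len(y)
    return w

# Frozen feature extractor: a fixed random projection + ReLU stands in
# for the pre-trained base network that transfer learning keeps frozen.
W1 = rng.normal(size=(20, 32)) / np.sqrt(20)
def feats(X):
    return np.maximum(X @ W1, 0.0)

# Source domain: many labels. Target domain: a related task, few labels.
w_src = rng.normal(size=20)
w_tgt = w_src + 0.1 * rng.normal(size=20)
Xs, ys = make_task(2000, w_src)
Xt, yt = make_task(60, w_tgt)

w_head = train_head(feats(Xs), ys)                  # pre-train on source
w_head = train_head(feats(Xt), yt, w0=w_head,       # fine-tune only the head
                    epochs=50)                      # on the small target set
Xe, ye = make_task(1000, w_tgt)
acc = np.mean((sigmoid(feats(Xe) @ w_head) > 0.5) == ye)
print(f"target-domain test accuracy: {acc:.2f}")
```

In a Keras setting the same idea corresponds to setting `trainable = False` on the base layers and retraining only the final dense layers on the target data.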
The experiment setup is carried out on a computer system with the hardware and software specifications given in Table II and Table III respectively.

TABLE II. HARDWARE DETAILS

Processor: Intel Core i5
HDD: 500 GB
RAM: 8 GB

TABLE III. SOFTWARE DETAILS

API: Keras, TensorFlow, NumPy, Pandas
Programming Languages: Python

V. RESULT AND DISCUSSION

A. Standard datasets used

We collected the following major data sets used across several sections:

Non-Tor data set: We used Firefox 38.0 to visit web pages. Most privacy technologies (such as VPNs and proxies) do not attempt to defend the packet sequence, so results on this data set would be similar to results on most privacy technologies. We collected 100 instances of Alexa's top 100 pages as the set of monitored pages, and 9000 instances of Alexa's top 10000 pages (1 instance of each page, discarding failed pages) as the set of non-monitored pages. We collected this data in July 2023.

Tor data set: We collected our data on Tor Browser 4.5a4 with Tor [Link]-alpha. As above, we collected the same number of monitored and non-monitored pages. Tor is especially interesting to us because it has applied defenses against website fingerprinting and continues to improve its defenses. We collected the data in June 2023. In our previous work we collected a data set of sensitive pages banned by several ISPs in various countries. We do not do so in this thesis because we have found that these sites are taken down frequently, which makes it harder to reproduce our results. Thus, we will only use Alexa's top sites. We also collected several other data sets for use in specific sections, and we will describe these data sets in detail in the relevant sections.

B. Evaluation Parameters

To tackle the open-world scenario, we designed Wa-kNN [WCN+14] (2014). Wa-kNN uses the k-Nearest Neighbours (k-NN) classifier, a simple supervised machine learning algorithm. k-NN starts with a set of training points Strain = {P1, P2, . . .}, each point being a packet sequence. Let the class of Pi be c(Pi). Given a testing point Ptest ∈ Stest, the classifier computes the distance d(Ptest, Ptrain) for each Ptrain ∈ Strain. The algorithm then classifies Ptest based on the classes of the k closest training points to Ptest.

k-NN uses a distance metric d to describe how similar two packet sequences are. We want the distance to be accurate on simple encrypted data without extra padding, but also accurate if the client applies defenses that remove features from our available feature set (for example, Tor removes unique packet lengths). We therefore start with a large feature set F = {f1, f2, . . .}. Each feature is a function f which takes in a packet sequence P and computes f(P), a nonnegative number. Conceptually, we select each feature such that packet sequences from the same page are more likely to have similar features than packet sequences from different pages. We list our full feature set of 4226 features in Table **.

Number of features: Description
4: |P|, |P+|, |P−|, T|P|.
3001: A list M, such that Mi = 1 if some packet in P has length i, and Mi = 0 otherwise. M is similar to PU, the unique packet length feature. Given that the MTU is 1500, we have |M| = 3001, covering all lengths between −1500 and 1500.
500: Position of the first 500 outgoing packets, i, such that Pi > 0.
500: Difference in position between each of the first 500 outgoing packets and the next outgoing packet, i − j, such that Pi > 0 and Pj > 0.
100: Dividing P into non-overlapping windows of size 30, the number of outgoing packets in each of the first 100 windows.
3: Define a burst as a sequence of packets in the same direction (much like Dy-VNG [DCRS12]), and the burst length as the number of said packets. We extract the longest burst, the number of bursts, and the mean size of each burst.
6: The number of bursts with lengths longer than 2, 5, 10, 15, 20, and 50 respectively.
100: The length of the first 100 bursts.
10: The direction of the first 10 packets.
2: The mean and standard deviation of the interpacket times.

The distance between P and P′ is

d(P, P′) = Σ (1 ≤ i ≤ |F|) wi |fi(P) − fi(P′)|

We learn the weights W = {w1, w2, . . . , w|F|} using our new weight learning process we call WLLCC (Weight Learning by Locally Collapsing Classes). With WLLCC, we reduce weights for uninformative features. As weight learning proceeds, the k-NN distance comes to focus on features that are useful for classification.
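A handful of the features in the table can be computed directly from a trace represented as a list of signed packet lengths (positive for outgoing, negative for incoming) plus per-packet timestamps. The sketch below covers only an illustrative subset of the feature set, and the helper names are our own:

```python
from statistics import mean, stdev
from itertools import groupby

def extract_features(P, times):
    """P: signed packet lengths (positive = outgoing); times: timestamps.
    Returns a small subset of the Wa-kNN-style features from the table."""
    feats = []
    # General features: total, outgoing, incoming counts, total time.
    outgoing = [p for p in P if p > 0]
    incoming = [p for p in P if p < 0]
    feats += [len(P), len(outgoing), len(incoming), times[-1] - times[0]]
    # Burst features: a burst is a maximal run of same-direction packets.
    bursts = [len(list(g)) for _, g in groupby(P, key=lambda p: p > 0)]
    feats += [max(bursts), len(bursts), mean(bursts)]
    # Number of bursts longer than each threshold.
    feats += [sum(b > t for b in bursts) for t in (2, 5, 10, 15, 20, 50)]
    # Direction of the first 10 packets (zero-padded for short traces).
    feats += [1 if p > 0 else 0 for p in P[:10]] + [0] * max(0, 10 - len(P))
    # Mean and standard deviation of the interpacket times.
    gaps = [b - a for a, b in zip(times, times[1:])]
    feats += [mean(gaps), stdev(gaps)]
    return feats

# Tiny hypothetical trace: 6 packets, alternating directions.
trace = [300, -1500, -1500, 300, 300, -500]
stamps = [0.0, 0.1, 0.2, 0.35, 0.4, 0.9]
fv = extract_features(trace, stamps)  # 25 features for this subset
```

The full Wa-kNN feature set extends the same pattern to all 4226 entries listed in the table.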
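Given feature vectors and learned weights, the weighted distance and the k-NN decision can be sketched as follows. This is a minimal illustration with hypothetical toy vectors and hand-set weights; the WLLCC weight-learning step itself is not shown. It implements the conservative open-world rule used by Wa-kNN: a monitored label is output only when all k nearest neighbours agree.

```python
def wa_distance(fP, fQ, w):
    """Weighted L1 distance: d(P, P') = sum_i w_i * |f_i(P) - f_i(P')|."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, fP, fQ))

def classify(f_test, train, w, k=3):
    """train: list of (feature_vector, label) pairs; label None marks
    non-monitored pages. Output a monitored page only if all k nearest
    neighbours agree on it; otherwise output non-monitored (None)."""
    nearest = sorted(train, key=lambda t: wa_distance(f_test, t[0], w))[:k]
    labels = {lab for _, lab in nearest}
    if len(labels) == 1 and None not in labels:
        return labels.pop()
    return None  # non-monitored

# Toy training set: three instances of monitored page "A", two of "B",
# and one non-monitored trace (label None).
train = [([0.0, 0.0], "A"), ([0.0, 1.0], "A"), ([1.0, 0.0], "A"),
         ([5.0, 5.0], "B"), ([5.0, 6.0], "B"), ([9.0, 9.0], None)]
print(classify([0.0, 0.5], train, [1.0, 1.0]))  # -> 'A' (unanimous)
print(classify([3.0, 3.0], train, [1.0, 1.0]))  # -> None (neighbours split)
```

Raising k makes unanimity harder to reach, which is exactly the TPR/FPR trade-off described below.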
Despite its simplicity, k-NN has a number of advantages over other classifiers:

Short training and testing time. The training time consists of learning the weights W; we will show in Section 3.4.4 that this is fast. Since the distance comparison between two packet sequences is very simple, the testing time is short as well.

Multi-modal classification. Wa-kNN is based on the important observation that a class representing a web page is multi-modal. A multi-modal class is a class that naturally consists of several subclasses, where points in each subclass are similar to each other, but points in different subclasses are not. Web pages are often multi-modal: for example, a single web page can have different localization versions for different users.

k-NN is well-suited for multi-modal classes, for which different modes of the class do not need to have any relationship with each other, because only the k closest neighbours are relevant for classification. As long as the size of each mode is larger than k, k-NN will not suffer from the presence of other modes in the same class. In contrast, many other classifiers (SVM, Naive Bayes, etc.) take all points in the class into consideration, including those from other modes.

There are a number of ways to deal with the case when the k neighbours of a testing point differ in class assignment. Most commonly, the k neighbours act as voters in a majority scheme, possibly with their votes weighted by their distance to the testing point. In Wa-kNN, if any of the k neighbours disagree, we assign the testing point to the non-monitored class, even if all of the k neighbours pointed to monitored pages (but different ones). This approach gives us a strategic advantage: we can use k as a parameter to trade off TPR and FPR. A higher k means that we will more often decide that pages are non-monitored. We can also use |C0|, the size of the non-monitored class, to trade off TPR and FPR.

C. Performance Evaluation

In order to evaluate the proposed system, we used it on various popular websites like Google, YouTube, etc. The first step is data collection. This step involves collecting the fingerprints, i.e. the data of the particular website as visited by several users. We use a script [Link] to capture the traffic on the particular website, in order to use it to identify third-party websites trying to gain access. The captured traffic is then saved into a file, and the proposed system creates another file depicting the addresses of unknown traffic or third-party websites trying to access the website, thus compromising the user's safety.

The accuracy of the proposed technique was better in comparison to the existing system. We split the data into training and test data. The KNN classifier is used to train the model. The [Link] script is used to load the classifier and identify which web page the unknown traffic originated from. It is worth noting that, using this classifier, the proposed system obtained an accuracy of 95%, which was higher than the existing system.

VI. CONCLUSION AND FUTURE SCOPE

A. Conclusion

Website fingerprinting exposes the difference between security and privacy. We never attempt to break the encryption of the client's communication channel, but the client's page access is compromised nonetheless. Since different clients' communications are largely similar (especially on Tor Browser) when accessing the same page, we have gained significant knowledge of their browsing activities. Website fingerprinting is conceptually analogous to a dictionary attack against an encryption scheme with a small set of possible plaintexts. Without randomly padding the plaintexts (analogous to website fingerprinting defenses), even an otherwise correctly implemented encryption scheme can be broken by precomputing ciphertexts (analogous to the attacker's training set).

Wa-kNN achieves the best of both worlds: it has a very low training and testing time, but it is also nearly as accurate as Wa-OSAD in the closed world. It trumps all other attacks in the open world, as it has tunable parameters that trade off the TPR for the FPR and vice versa effectively, allowing us to tackle the base rate fallacy. We have seen that our new WLLCC algorithm for Wa-kNN has lifted the competitiveness of the simple and fast k-NN classifier to that of SVMs for website fingerprinting. The algorithm may be useful for k-NN classification in general, and it can also be used in clustering.

B. Future Scope

We include significant new work in this thesis beyond the scope of our published papers on the above attacks. We implement 12 website fingerprinting attacks with standardized input and output to allow comparative experiments at a greater scale than previously seen work. We show new results on training and testing time. Our systematization of attacks by comparing their use of distances and features allows the possible development of hybrid attacks that combine approaches from different attacks.

We developed our attacks and performed our experiments with a strong focus on Tor, because it is one of the easiest and most popular tools for clients to achieve the level of privacy required to render website fingerprinting relevant: an encrypted channel across proxies. Tor is actively developing defenses against website fingerprinting attacks, because if website fingerprinting were perfectly successful, it would reduce the privacy of Tor to simply that of an encrypted channel.

Any entry node on Tor may perform website fingerprinting to discover its clients' behaviour. Nevertheless, website fingerprinting attacks have limitations. Some of these limitations are inherent to the problem statement, and overcoming them may be truly difficult. We categorize these limitations, which may make for interesting future work:
Identifying very rare pages. The low base rate of a very rare page may cause the number of false positives to overwhelm the number of true positives, and the base rate fallacy is much harder to overcome. Our experiments suggest that we can achieve high precision down to a base rate of around 0.01% (1 in 10,000), but rare pages would require more advanced techniques, and there is no limit to how rare a web page might be.

Identifying activity. It may be possible to identify whether a client is web browsing, chatting, listening to music, watching a video, and so on. All of these activities can be done through a regular browser, and can therefore be done on Tor. Currently, we do not tackle any activity except web browsing. Identifying them would give the attacker more information to work with.

Identifying web sites (as opposed to individual web pages as in this work). There is some previous work on identifying web sites [CZJJ12]. However, past results are generally limited to identifying only one or two specific web sites. It is not clear if techniques used for identifying these specific web sites can be extended to identifying hundreds of web sites at once.

Understanding the underlying web page. Web pages are essentially request-response pairs, and some responses trigger further requests. Currently, we collect web pages only as packet sequences, ignoring the intermediate request-response structure. If we were able to simulate packet sequences accurately from the request-response structure, with the network conditions as input, we could significantly improve website fingerprinting. Data collection would be much faster, as we would not have to wait for Tor's latency to load web pages. We would also be able to build better classifiers that ignore random noise such as advertisements.

The astute reader may have noted that we collect data sets in this chapter in a simplistic manner. We start the browser, visit a web page, record the packet sequence, and exit the browser. Realistic clients can visit several web pages, sequentially or at once, and can generate noise by engaging in other activities (e.g. listening to streaming music in the background).

ACKNOWLEDGMENT

We would like to express our special thanks to Prof. K. S. Suresh Babu, our major project guide, who guided us through the project and helped us apply the knowledge we have acquired during the semester while learning new concepts.

We would like to express our special thanks to Prof. Sharvari Govilkar, the H.O.D. of our Computer Engineering department, who gave us the opportunity to do this major project, through which we learned new concepts and their application.

We are also thankful to our mini project coordinator, along with the other faculty, for their encouragement and support. Finally, we would like to express our special thanks of gratitude to Principal Dr. Sandeep Joshi, who gave us the opportunity and facilities to conduct this major project.

REFERENCES

[1] A. Freier, "GAP-WF: Graph Attention Pooling network for fine-grained SSL/TLS website fingerprinting," IEEE Conference Publication, IEEE Xplore, pp. 20–24, July 2021.
[2] Z. Li, "Deep Forest with LRRS Feature for Fine-grained Website Fingerprinting with Encrypted SSL/TLS," ACM Publications, pp. 100–135, November 2019.
[3] M. Shen, Y. Liu, L. Zhu, X. Du and J. Hu, "Fine-Grained Webpage Fingerprinting Using Only Packet Length Information of Encrypted Traffic," IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2046–2059, 2021.
[4] A. Panchenko, "Website fingerprinting at internet scale," in Proceedings of ACM CCS, pp. 20–34, December 2016.
[5] S. Bhat, D. Lu, A. Kwon, and S. Devadas, "Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning," Proceedings on Privacy Enhancing Technologies, pp. 292–310, July 2019.
[6] X. He, J. Wang, Y. He and Y. Shi, "A Deep Learning Approach for Website Fingerprinting Attack," 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 1419–1423, July 2018.