Chapter 2
Cybersecurity
Understanding Sources of Cybersecurity Data
Vandana P. Janeja
10/2/2022 Data Analytics for Cybersecurity, ©2022 Janeja All rights reserved.
End to End Opportunities for data collection
• Log Data
• Router Connectivity and Log Data
• Firewall Log Data
• Raw payload data
Sources of Cybersecurity Data
• Cyber threats often lead to loss of assets
• A multitude of datasets can be harvested and used to track these losses and the origins of the attack
• This chapter is not about the data lost during cyberattacks but about the data that organizations can scour from their networks to understand threats better, so that they can potentially prevent or even predict future attacks
End to End Opportunities for data collection
Logical View
[Figure: Logical View (a). Elements shown: User, Internet, Business Application; arrows labeled Request and Response]
The logical view of a user requesting access to a business application can appear fairly straightforward.
Within this pipeline, however, there may be several points through which the request and response pass.
Each of these points is an opportunity in the end-to-end process to collect data that helps in understanding when a cyber threat may occur.
Physical View
• A user request on a network follows a complex networking pipeline
  • The user may have a firewall on their own system and the router through which they send out the request
  • This request can be filtered through the internet service provider
  • Lookups can be performed in the domain name system (DNS)
  • The data can be routed through multiple paths of routers, which are linked through the routing table
• The request on the other side may again have to pass through the routers and firewalls at multiple points in the system being accessed by the user
  • There may be multiple intrusion detection systems (IDS) placed throughout the systems to monitor the network flow for malicious activity
• This is just one example scenario; different network layouts will result in different types of intermediate steps in this process of request and response, particularly based on
  • the type of response
  • the type of network being used
  • the type of organization of business applications
  • the cloud infrastructure being used
• However, certain key components are always present that allow for multiple opportunities to glean and scour for data related to potential cyber threats
[Figure: Physical View. Elements shown on the request/response path: User, Firewall, Router, Internet Service Provider, Domain Name System, Routing Table, Internet, Internet Exchange, and, on the enterprise side, Internet Service Provider, Router, Firewall, Switch, IDS, Routing Table, Enterprise Server, Internal Systems, DB, Files]
Common types of cybersecurity data
Browser cache
Sources of Cybersecurity Data and Variations
• Cybersecurity-related data collection will vary across the type of networks, including computer networks, sensor networks, or cyberphysical systems
• The method and level of data collection will also vary based on the application domains for which the networks are being used and the important assets being protected
  • Social media businesses, such as Facebook, are primarily user data driven, where the revenue is based on providing access to user data and monitoring usage data
  • E-commerce businesses, such as Amazon, are usage and product delivery based
  • Portals, such as Yahoo, are again user data driven but more heavily reliant on advertisements, which can target users based on what they see and use most often
  • Cyberphysical systems, such as systems for monitoring and managing power grids, are based on accurate functioning of physical systems and delivery of services to users over these physical infrastructural elements
• The level of monitoring and management of data will vary with the level of prevention, detection, or recovery expected in the domain
• Some domains have a high emphasis on prevention; others may have a high level of emphasis on detection or recovery
• In all such cases, multiple types of datasets can be collected to provide intelligence on the cyber threats, and user behaviors can be evaluated to prevent future threats or even identify an insider propagating the threats
Log Data
Raw Payload Data
• Data over the network contains the header information, which stores data about source and destination among other things, and the actual content being transmitted, referred to as the payload
• There are several privacy concerns in accessing this payload data, since this data is the actual content that is being sent, which may be under strict access controls
• Payload data can be accessed only where legally allowed and where users have provided the permissions to access this data
• This data may be encrypted, so its usefulness as raw data to be mined is limited
• Payload data is accessible through packet sniffers such as Wireshark, where the data dump of the traffic can be retrieved
• Payload data can be massive even for a few minutes of data capture
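The split between header information and payload can be illustrated with a minimal Python sketch. It assumes a raw IPv4 packet is already available as bytes (here a hand-built example rather than a real capture). The function name `split_ipv4_packet` and the sample addresses are hypothetical, and the sketch treats everything after the IP header as payload, whereas in real traffic those bytes begin with a transport-layer header.

```python
import socket
import struct

def split_ipv4_packet(packet: bytes):
    """Split a raw IPv4 packet into selected header fields and the remaining bytes."""
    # The low 4 bits of the first byte give the header length in 32-bit words.
    header_len = (packet[0] & 0x0F) * 4
    # Bytes 12-15 hold the source address, bytes 16-19 the destination address.
    src, dst = struct.unpack("!4s4s", packet[12:20])
    header = {
        "protocol": packet[9],          # byte 9 is the protocol number
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
    return header, packet[header_len:]

# Hand-built 20-byte header (version 4, IHL 5, protocol 6) plus a toy payload.
sample_header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 20 + 5, 0, 0, 64, 6, 0,
    socket.inet_aton("10.0.0.1"), socket.inet_aton("192.168.1.5"),
)
header, payload = split_ipv4_packet(sample_header + b"hello")
print(header, payload)
```

Only the header fields are needed to know who talked to whom; the payload bytes are the part that raises the privacy and volume concerns listed above.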
Uses of raw payload data:
• To discover an individual user's behavior
• To detect the presence of malware in the payloads
• To detect other security threats based on the actual content of the payload
• To identify threats based on signatures of malware that may be present in the payload
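Signature-based detection on payload content can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the payload is available unencrypted as bytes, and a signature is a fixed byte pattern. The function `find_signatures` and the patterns in `SIGNATURES` are hypothetical and are not real malware signatures.

```python
def find_signatures(payload: bytes, signatures: dict) -> list:
    """Return the names of all signatures whose byte pattern occurs in the payload."""
    return sorted(name for name, pattern in signatures.items() if pattern in payload)

# Hypothetical byte patterns used only for illustration.
SIGNATURES = {
    "example-dropper": b"\xde\xad\xbe\xef",
    "example-script": b"<script>evil()",
}

clean = b"GET /index.html HTTP/1.1\r\n\r\n"
suspicious = b"POST /upload HTTP/1.1\r\n\r\n\xde\xad\xbe\xef"
print(find_signatures(clean, SIGNATURES))
print(find_signatures(suspicious, SIGNATURES))
```

Exact byte matching of this kind finds nothing in encrypted traffic, which is one reason the usefulness of raw payload data can be limited.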
Network Topology Data
• A computer network can be represented as a graph in terms of the structure of the network and in terms of the communication taking place over the networks
• A network traffic data dump can be used to generate the communication graphs
• Header data collected from a traffic dump file through Wireshark can be utilized to plot the communication between the source and destination IP addresses, which become the vertices of each edge in the graph
Network Topology Data Preprocessing
Example: extraction of a communication graph from network traffic
Header data collected from a traffic dump file through Wireshark can be utilized to plot the communication between the source and destination IP addresses, which become the vertices of each edge in the graph.
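As a rough illustration of this preprocessing step, the Python sketch below turns source and destination pairs into a directed communication graph. It assumes the pairs have already been exported from packet headers; the function name `build_communication_graph` and the sample addresses are hypothetical.

```python
from collections import defaultdict

def build_communication_graph(pairs):
    """Build a directed graph: each source IP maps to the destinations it contacted,
    with a count of how many packets were seen on that edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in pairs:
        graph[src][dst] += 1
    return graph

# Hypothetical (source, destination) pairs taken from packet headers.
header_pairs = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.1", "10.0.0.2"),
]
graph = build_communication_graph(header_pairs)
for src, neighbours in graph.items():
    print(src, dict(neighbours))

# Out-degree: the number of distinct destinations each source talked to.
out_degree = {src: len(neighbours) for src, neighbours in graph.items()}
print(out_degree)
```

Building one such graph per time window would allow the graphs to be compared over time, which is one way changes in communication patterns could be examined.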
Communications to Graphs
Data Preprocessing
User System Data
• Key features can be extracted to monitor unusual activities at the individual system level
• Example: active process resident memory usage, which is available for all operating systems (OS) and allows for building a profile of the normal memory usage of a process over time
  • An abnormal spike in memory usage can be attributed to processing a large volume of data
  • Useful in detecting a potential insider threat, especially when integrated with other user behavioral data from sensors monitoring user stress levels or with other log datasets
• CPU time utilization can be used for measuring system usage
• Several OS-specific features, such as kernel modules and changes in registry values, can also be monitored
• It is important to use multiple signatures over time from several of the features to eliminate the regular spikes of day-to-day operations
  • This is a key differentiator for a robust analysis, where we do not simply rely on one or two features but on multiple features and their stable signatures (as compared to historical data) to distinguish alerts
• Tools such as OSQuery and Snare can facilitate capture of these features
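The idea of comparing several features against their historical profile, rather than reacting to a single spike, can be sketched as follows. This is a minimal illustration, not a prescribed method: the z-score test, the threshold of 3, the two-feature rule, and all names and values are assumptions chosen for the example.

```python
from statistics import mean, stdev

def z_score(history, current):
    """How many standard deviations the current value lies from the historical mean."""
    spread = stdev(history)
    return 0.0 if spread == 0 else (current - mean(history)) / spread

def deviating_features(history, current, threshold=3.0):
    """Return the features whose current value deviates strongly from their history."""
    return sorted(
        name for name, past in history.items()
        if abs(z_score(past, current[name])) > threshold
    )

# Hypothetical per-process history (resident memory in MB, CPU seconds per interval).
history = {
    "resident_memory_mb": [210, 205, 215, 208, 212, 209],
    "cpu_seconds": [1.1, 0.9, 1.0, 1.2, 1.0, 0.8],
}
current = {"resident_memory_mb": 900, "cpu_seconds": 6.5}

flagged = deviating_features(history, current)
# Raise an alert only when more than one feature deviates, not on a single spike.
alert = len(flagged) >= 2
print(flagged, alert)
```

Requiring agreement between features is a simple way to suppress the regular spikes of day-to-day operations; a real deployment would tune both the threshold and the number of agreeing features.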
Key Features To Monitor Unusual Activities at the Individual System Level
Feature Name | OS Specific
Active process name, Active process filesystem path, Active process ports and sockets, Active process file access, Active process resident memory usage, Active process CPU time utilization, Active process system calls, Active process priority value, Active process owner and group information, Loaded peripherals drivers, Key-store access patterns | All OS
System level sensors (current, voltage in different buses inside the PC, CPU/GPU fan speed, etc.) | Almost all peripherals
Other Datasets
• Access control data: These data can help better understand usage of the assets that need to be protected. Role mining from access control data can help shape and create better and more robust roles
• Eye tracker data: A user's behavior can be judged by the interactions of the user with the system being used. One such mode of input is the screen. Data collected from the user's eye gaze, captured through an eye tracker, can help analyze the user's level of engagement with a system and user preferences, or guide the positioning of important items on the screen
• Vulnerability data: A software vulnerability is a defect in the system (such as a software bug) that allows an attacker to exploit the system and potentially pose a security threat. Vulnerabilities can be investigated, and trends can be discovered in various operating systems to determine levels of strength or defense against cyberattacks
Example: NVD Datasets
National Vulnerability Database from the National Institute of Standards and Technology (NIST)
Trends can be analyzed for several years and across major releases for operating systems to
reinforce knowledge of choices for critical infrastructural or network projects
NVD is built on the concept of Common Vulnerabilities and Exposures (CVE), which is a dictionary of
publicly known vulnerabilities and exposures
CVEs allow the standardization of vulnerabilities across products around the world. NVD scores
every vulnerability using the Common Vulnerability Scoring System (CVSS)
CVSS comprises several submetrics, including (a) base, (b) temporal, and (c) environmental metrics. Each of these metrics quantifies some feature of a vulnerability
Cross-site Scripting vulnerability:
Data regarding the number of vulnerabilities pulled from NVD across 2006 to 2012
Comparing the occurrences of different types of vulnerabilities such as cross-site scripting and buffer overflow
[Figure: Number of vulnerabilities per year pulled from NVD. Series labeled BUFFER (buffer overflow) and XSS (cross-site scripting); y-axis: Number; x-axis: Year (one axis spans 2006 to 2012, the other 2002 to 2012)]
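A trend comparison of this kind reduces to counting records per year and vulnerability type. The Python sketch below assumes the records have already been extracted into dictionaries; the field names, identifiers, and values are hypothetical and do not reproduce the NVD data format or the counts in the figure.

```python
from collections import Counter

def count_by_year_and_type(records):
    """Count vulnerability records per (year, type) pair."""
    return Counter((rec["year"], rec["type"]) for rec in records)

# Hypothetical records; real entries would come from an NVD data feed.
records = [
    {"id": "EXAMPLE-0001", "year": 2006, "type": "XSS"},
    {"id": "EXAMPLE-0002", "year": 2006, "type": "BUFFER"},
    {"id": "EXAMPLE-0003", "year": 2007, "type": "XSS"},
    {"id": "EXAMPLE-0004", "year": 2007, "type": "XSS"},
]
counts = count_by_year_and_type(records)
for (year, vuln_type), n in sorted(counts.items()):
    print(year, vuln_type, n)
```

The resulting per-year counts are what a plot of vulnerability trends by type would be drawn from.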
Integrated Use of Multiple Datasets
• If multiple datasets result in similar types of anomalies, then the
credibility of labeling an anomaly is higher
• Example: to discover anomalies in network traffic data with a temporal,
spatial, and human behavioral perspective
• Studying how network traffic changes over time, which locations are the sources, where it is headed, and how people are generating this traffic is critical in distinguishing the normal from the abnormal in the domain of cybersecurity
• This requires shifting gears to view cybersecurity as a holistic people problem
rather than a hardened defense problem
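The first point above, that agreement across datasets raises the credibility of an anomaly label, can be sketched as a simple voting step. This is an illustrative assumption about how the combination might be done: the function `corroborated_anomalies`, the `min_sources` parameter, and the flagged addresses are all hypothetical.

```python
from collections import defaultdict

def corroborated_anomalies(flags_by_dataset, min_sources=2):
    """Return entities flagged as anomalous by at least `min_sources` datasets."""
    sources = defaultdict(set)
    for dataset, entities in flags_by_dataset.items():
        for entity in entities:
            sources[entity].add(dataset)
    return {e: sorted(s) for e, s in sources.items() if len(s) >= min_sources}

# Hypothetical anomaly flags (by IP address) from three separate analyses.
flags = {
    "temporal": {"10.0.0.7", "10.0.0.9"},
    "spatial": {"10.0.0.7"},
    "behavioral": {"10.0.0.7", "10.0.0.4"},
}
print(corroborated_anomalies(flags))
```

Here only the address flagged from the temporal, spatial, and behavioral perspectives survives; addresses flagged by a single analysis are left for lower-priority review.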
Integrated Use of Multiple Datasets: Consider
Computer networks evolve over time, and communication patterns change over time
• Can we identify the key changes that are deviant from the normal changes in a communication pattern and associate them with anomalies in the network traffic?
• Can key geolocations that are sources of attacks, or key geolocations that are destinations of attacks, be identified?
• Can IP spoofing be mitigated by looking at multiple data sources to supplement the knowledge of a geospatial traffic pattern?
Summary of Sources of Cybersecurity Data
Source of cybersecurity data | Literature study examples | Type of detection it can be used for
Keystroke logging | Heron 2007, Gupta et al. 2016, Cai and Chen 2011, Hussain et al. 2016 | User behavior, malicious use to detect user credentials
Router connectivity and log data | Sklower 1991, Tsuchiya 1988, Geocoding Infosec 2013, Kim Zetter Security 2013, Qiu et al. 2007 | Suspicious rerouting, traffic hijacking, bogus routes
Firewall log data | Golnabi et al. 2006, Abedin et al. 2010 | Generate efficient rule sets, anomaly detection in policy rules
Raw payload data | Wang and Stolfo 2004, Kim et al. 2014, Limmer and Dressler 2010, Parekh et al. 2006, Cheok 2014 | Malware detection, embedded malware, user behavior
Network topology | Massicotte et al. 2003, Nicosia et al. 2013, Namayanja and Janeja 2015 and 2017 | Consistent and inconsistent nodes, time points corresponding to anomalous activity
User system data | Stephens and Maloof 2014, Van Mieghem 2016 | User profiles, user behavior data, insider threats
Access control data | Vaidya et al. 2007, Mitra et al. 2016 | Generate efficient access control roles
Eye tracker data | Darwish and Bataineh 2012 | Browser security indicators, security cues, user behavior
Vulnerability data | Frei et al. 2006 | Vulnerability trend discovery
References
• Heron, Simon. "The rise and rise of the keyloggers." Network Security 2007.6 (2007): 4-6.
• Gupta, Haritabh, et al. "Deciphering Text from Touchscreen Key Taps." IFIP Annual Conference on Data and Applications Security and Privacy. Springer International Publishing,
2016.
• Cai, Liang, and Hao Chen. "TouchLogger: Inferring Keystrokes on Touch Screen from Smartphone Motion." HotSec 11 (2011): 9-9.
• Hussain, Muzammil, et al. "The rise of keyloggers on smartphones: A survey and insight into motion-based tap inference attacks." Pervasive and Mobile Computing 25 (2016): 1-25.
• Deokar, Bhagyashree, and Ambarish Hazarnis. "Intrusion Detection System using log files and reinforcement learning." International Journal of Computer Applications 45.19 (2012):
28-35.
• Vaarandi, Risto, and Kārlis Podiņš. "Network ids alert classification with frequent itemset mining and data clustering." 2010 International Conference on Network and Service
Management. IEEE, 2010.
• Quader, Faisal, Vandana Janeja, and Justin Stauffer. "Persistent threat pattern discovery." Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on. IEEE,
2015.
• Chen Song, Janeja V., Human Perspective to Anomaly Detection for Cybersecurity, Journal of Intelligent Information Systems, February 2014 (Accepted 2013), Volume 42, Issue 1, pp 133-153
• Quader, Faisal; Janeja, Vandana, Computational Models to Capture Human Behavior in Cybersecurity Attacks Academy of Science and Engineering (ASE), USA, ©ASE 2014,
2014-06-16
• Janeja, Vandana P., et al. "B-dids: Mining anomalies in a Big-distributed Intrusion Detection System." Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014.
• Abad, Cristina, et al. "Log correlation for intrusion detection: A proof of concept." Computer Security Applications Conference, 2003. Proceedings. 19th Annual. IEEE, 2003.
• Koike, Hideki, and Kazuhiro Ohno. "SnortView: visualization system of snort logs." Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security.
ACM, 2004.
• Sklower, Keith. "A tree-based packet routing table for Berkeley unix." USENIX Winter. Vol. 1991. 1991.
• Tsuchiya, Paul F. "The Landmark Hierarchy: A new hierarchy for routing in very large networks." ACM SIGCOMM Computer Communication Review. Vol. 18. No. 4. ACM, 1988.
• Qiu, Jian, et al. "Detecting bogus BGP route information: Going beyond prefix hijacking." Security and Privacy in Communications Networks and the Workshops, 2007. SecureComm
2007. Third International Conference on. IEEE, 2007.
• Kim Zetter Security, WIRED, Someone’s Been Siphoning Data Through a Huge Security Hole in the Internet, https://www.wired.com/2013/12/bgp-hijacking-belarus-iceland/ 2013,
Last accessed 12/26/16
• Geocoding-Infosec: Robert Barnes, Infosec Institute, http://resources.infosecinstitute.com/geocoding-router-log-data/#gref, Geocoding Router Log Data, Aug, 8 2013
• Korosh Golnabi, Richard K. Min, Latifur Khan, Ehab Al-Shaer. Analysis of Firewall Policy Rules Using Data Mining Techniques[C]. Network Operations and Management
Symposium,2006. NOMS 2006. 10th IEEE/IFIP
• Abedin, Muhammad, et al. "Analysis of firewall policy rules using traffic mining techniques." International Journal of Internet Protocol Technology 5.1-2 (2010): 3-22.
• Wireshark, https://www.wireshark.org
• Wang K, Stolfo S. J. 2004. Anomalous Payload-based Network Intrusion Detection. In: Symposium on Recent Advances in Intrusion Detection, Sophia Antipolis, France.
• Sun-il Kim, William Edmonds, and Nnamdi Nwanze. 2014. On GPU accelerated tuning for a payload anomaly-based network intrusion detection scheme. In Proceedings of the 9th Annual Cyber and Information Security Research Conference (CISR '14), Robert K. Abercrombie and J. Todd McDonald (Eds.). ACM, New York, NY, USA, 1-4. DOI=http://dx.doi.org/10.1145/2602087.2602093
• Tobias Limmer and Falko Dressler. 2010. Dialog-based payload aggregation for intrusion detection. In Proceedings of the 17th ACM conference on Computer and communications
security (CCS '10). ACM, New York, NY, USA, 708-710. DOI=http://dx.doi.org/10.1145/1866307.1866405
• Janak J. Parekh, Ke Wang, and Salvatore J. Stolfo. 2006. Privacy-preserving payload-based correlation for accurate malicious traffic detection. In Proceedings of the 2006
SIGCOMM workshop on Large-scale attack defense (LSAD '06). ACM, New York, NY, USA, 99-106. DOI=http://dx.doi.org/10.1145/1162666.1162667
• SNORT Rules Infographic https://snort-org-site.s3.amazonaws.com/production/document_files/files/000/000/116/original/Snort_rule_infographic.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIXACIED2SPMSC7GA%2F20210316%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210316T191343Z&X-Amz-Expires=172800&X-Amz-SignedHeaders=host&X-Amz-Signature=bcfc7d75d223ab40badd8bd9e89ded29cc98ed3896f0140f55320ee9bcdf1383, last accessed March 2020
• Cheok, Roy. "Wire shark: A Guide to Color My Packets Detecting Network Reconnaissance to Host Exploitation." GIAC certification paper (2014), SANS Institute Reading Room
• NodeXL: https://www.smrfoundation.org/nodexl/
• Nicosia, Vincenzo, et al. "Graph metrics for temporal networks." Temporal networks. Springer Berlin Heidelberg, 2013. 15-40.
• Namayanja, Josephine M., and Vandana P. Janeja. "Change detection in evolving computer networks: Changes in densification and diameter over time." 2015 IEEE International
Conference on Intelligence and Security Informatics (ISI).
• Namayanja, Josephine M., and Vandana P. Janeja. "Characterization of Evolving Networks for Cybersecurity." Information Fusion for Cyber-Security Analytics. Springer
International Publishing, 2017. 111-127.
• Massicotte, Frédéric, Tara Whalen, and Claude Bilodeau. "Network Mapping Tool for Real-Time Security Analysis." Real Time Intrusion Detection (2003).
• OSQuery. OSquery, last accessed March 2020. WWW: https://osquery.io/, 2016.
• Snare. Snare, last accessed March 2020. WWW: https://www.snaresolutions.com/central-83/.
• Stephens, Gregory D., and Marcus A. Maloof. "Insider threat detection." U.S. Patent No. 8,707,431. 22 Apr. 2014.
• Van Mieghem, Vincent. Masters Thesis Delft University, "Detecting malicious behaviour using system calls.", 2016
• Vaidya, Jaideep, Vijayalakshmi Atluri, and Qi Guo. "The role mining problem: finding a minimal descriptive set of roles." Proceedings of the 12th ACM symposium on Access control
models and technologies. ACM, 2007.
• Mitra, Barsha, et al. "A Survey of Role Mining." ACM Computing Surveys (CSUR) 48.4 (2016): 50.
• Darwish, A.; Bataineh, E., "Eye tracking analysis of browser security indicators," Computer Systems and Industrial Informatics (ICCSII), 2012 International Conference on , vol., no.,
pp.1,6, 18-20 Dec. 2012 doi: 10.1109/ICCSII.2012.6454330
• Frei, S., May, M., Fiedler, U. and Plattner, B., 2006, September. Large-scale vulnerability analysis. In Proceedings of the 2006 SIGCOMM workshop on Large-scale attack defense
(pp. 131-138). ACM.
• NIST. National institute of standards and technology: National vulnerability database. Accessed Sept, 2017, http://nvd.nist.gov/.