Chatbots For OSINT
compared to specialized models trained for those tasks. In binary classification experiments, the commercial ChatGPT-4 model achieved an acceptable F1 score of 0.94, and the open-source GPT4all model achieved an F1 score of 0.90. However, concerning cybersecurity entity recognition, all evaluated chatbots have limitations and are less effective. This study demonstrates the capability of chatbots for OSINT binary classification and shows that they require further improvement in NER to effectively replace specially trained models. Our results shed light on the limitations of LLM chatbots when compared to specialized models, and can help researchers improve chatbot technology with the objective of reducing the effort required to integrate machine learning in OSINT-based CTI tools.
answer the main research question. These questions concern the capability of chatbots to provide clear yes-or-no answers, and the cost (in terms of processing time) required for these chatbots to perform the NLP tasks.

Our contributions can be summarized as follows:

1. We present a state-of-the-art survey on chatbot models and their application to cybersecurity.
2. We investigate the extent to which the inherent flexibility of LLM-based chatbots can be tailored to meet the specific requirements of OSINT-based CTI applications.
3. The study provides a comparative analysis of the practical use and performance of chatbots in specialized CTI tasks, including binary text classification and NER.

The remainder of this paper is organized as follows. Section 2 provides some background on LLMs and chatbots, and reviews related works on LLM chatbots for cybersecurity and the evaluation of chatbots in NLP tasks. Section 3 presents a deep exploration of LLM-based chatbots and highlights their significance and capabilities. In Section 4, we shift our focus to the methodology used to evaluate chatbots, detailing our dataset, methods, and the comparison criteria employed for evaluation. Section 5 explores strategies aimed at optimizing the utilization of these chatbots, including prompt fine-tuning and text length control. Section 6 presents our main experimental results and their discussion. Finally, we conclude our study in Section 7 by summarizing the key contributions and insights derived from this study.

2. Background and Related work

In this section, we briefly present the required background for this paper. We start by analyzing transformer models, the foundational elements facilitating recent progress in NLP. Next, we discuss LLMs by examining their development, abilities, and significant influence on various specific areas. Building on this foundation, we review research on how LLMs can effectively be utilized for cybersecurity-related NLP, exploring in detail the role of OSINT in cybersecurity. We end the section by examining chatbots' evolution and NLP capabilities to address cybersecurity concerns, and the literature gap we aim to address with this paper.

2.1. Transformers

The introduction of transformers [21] revolutionized the field of NLP, as they became the preferred architecture for various NLP tasks due to their ability to effectively capture the extensive dependencies and contextual associations of textual data [22, 23]. This ability enables transformers to overcome the limitations of previous methods such as Recurrent Neural Networks (RNNs) [24], Convolutional Neural Networks (CNNs) [25], and Long Short-Term Memory (LSTM) [26]. Instead of processing inputs one at a time, transformer models handle all tokens or words simultaneously. This makes it easier to model the global interactions [27, 28] and dependencies [23]. This feature makes transformers highly effective for various tasks, including but not limited to machine translation, sentiment analysis, and text generation. Transformers have emerged as the basic foundation for models such as Bidirectional Encoder Representations from Transformers (BERT) [29] and Generative Pre-trained Transformers (GPT) [30]. These models have demonstrated exceptional performance across diverse NLP benchmarks, leading to progress in language understanding, text generation, and question answering [31].
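The parallel, all-pairs token interaction can be made concrete with a short sketch. The following NumPy implementation of scaled dot-product self-attention (a single head, with illustrative shapes of our choosing, not code taken from any cited work) shows how every token attends to every other token in one matrix operation:

```python
# Minimal scaled dot-product self-attention (single head), illustrating how
# a transformer relates every token to every other token in one step.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len): all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 16)
```

Because the score matrix covers all token pairs in a single product, no sequential recurrence is needed, which is what distinguishes this computation from RNN- and LSTM-style processing.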
2.2. Large Language Models

LLMs have emerged as significant innovations that revolutionize human language processing, generation, and understanding. These models are trained on large text datasets, including immense volumes of language data, which enables them to perform tasks that demand contextual comprehension and generate coherent and meaningful responses [32]. In terms of applications, the latest generation of language models has been applied across diverse fields. For instance, the utilization of LLMs within mathematics, physics, and chemistry problem-solving has been evaluated by Arora et al. [33]. Agrawal evaluated the ability of LLMs to perform human-like reasoning tasks [34]. The findings of the study indicate that LLMs have strong abilities in analogical and moral reasoning but face challenges in spatial reasoning tests. Chatbots powered by LLMs have attracted interest as powerful tools for data annotation in NLP domains [35]. This interest arises from chatbots' proficiency in language tasks and the critical role of data annotation in developing NLP systems. An illustrative study [36] compared ChatGPT with human crowd workers in annotation tasks. The research emphasized that ChatGPT surpassed human workers in terms of performance, agreement among annotators, and cost-effectiveness.

There are two widely used types of LLMs: GPT and BERT. GPT-style models utilize an autoregressive transformer architecture to capture contextual dependencies and relationships within text [32], while BERT-style language models utilize a bidirectional transformer architecture and adopt a masked language modelling objective. Both approaches gained popularity for NLP capabilities because of their ability to model contextual dependencies. Besides the difference in the context they consider to make text predictions, they also differ in the number of parameters and training dataset size, with GPT-style models requiring more resources in both cases.

Yang et al. [32] provide a complete guide explaining how to use LLMs from various perspectives, including model selection, data consideration, and task specificity. It thoroughly investigates the details of pre-training and the significance of the training and test data, and offers insights into tasks that require a rich knowledge base, natural language comprehension, generation capabilities, and other emergent features.
In particular, LLMs such as the Generative Pre-trained Transformer 3 (GPT-3) have gained attention due to their remarkable ability to acquire a broad spectrum of knowledge during pre-training and apply it to downstream NLP tasks [35], with significant potential for chatbots [13, 15, 16, 17, 18, 19, 37].

2.3. Leveraging LLMs for cybersecurity NLP

Employing the capabilities of LLMs in ML tasks, specifically within the scope of cybersecurity, offers several possibilities [38]. Two problems that stand out are text classification, which categorizes text according to its relevance to a cybersecurity context, and NER, which recognizes specific cybersecurity entities in the text, for example, if it describes a vulnerability, which products are affected. LLMs have shown outstanding effectiveness in many NLP tasks, including binary classification and NER [6, 31].
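As a concrete illustration of these two tasks, the hypothetical snippet below shows the expected input and outputs for one pre-processed tweet; the BIO-style tags match those used in the dataset described later (Section 4.1 and Table B1):

```python
# Hypothetical input/output for the two CTI tasks on one pre-processed tweet.
tweet = "threatmeter dos microsoft internet explorer 9 mshtml"

# Task 1 - binary classification: is the text cybersecurity-related?
classification = 1  # 1 = relevant to cybersecurity, 0 = not relevant

# Task 2 - NER: label each token with a cybersecurity entity tag (BIO scheme).
ner_tags = ["O", "O", "B-ORG", "B-PRO", "I-PRO", "B-VER", "O"]
assert len(ner_tags) == len(tweet.split())
```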
2.4. OSINT for cybersecurity

In the CTI field, NLP tasks such as text classification and NER, which involve exploring various cybersecurity threats and concepts, are crucial. In a previous study, SYNAPSE [10], an OSINT processing pipeline, was implemented to efficiently identify and concisely present various cybersecurity incidents to security analysts. SYNAPSE was motivated by the results of previous works (e.g., [12, 39]) that analyzed the completeness and timeliness of cybersecurity-related OSINT on Twitter.

SYNAPSE employed Support Vector Machines (SVM) [40] for binary classification, and a novel stream clustering approach to aggregate related tweets. To improve SYNAPSE, Dionísio et al. [11] designed a new tool that employs a CNN to detect cybersecurity-related texts gathered from Twitter and a BiLSTM network for performing NER on the detected tweets to identify the type of threat, affected products, and other information. That work was further extended using a multitask Deep Learning (DL) approach [41] that could simultaneously perform classification and NER.

Although there are many ML techniques for binary classification and NER [8, 9], to the best of our knowledge, the architecture of Dionísio et al. [11, 41] is recognized [42] as the state-of-the-art for OSINT-based CTI extraction on Twitter data, reaching almost 95% for different quality metrics. This work also shows superior performance when compared with classifiers based on SVMs and Multi-Layer Perceptrons (MLPs), and several alternative NER approaches. Due to this, we consider Dionisio et al. [41] as the reference specialized model in our comparative analysis.

2.5. Chatbots for Cybersecurity

Although traditional and DL techniques have paved the way for advancements in cybersecurity text classification and NER, a new wave of NLP tools offers even more refined capabilities. Among these tools, LLMs stand out for their exceptional capabilities in interpreting and generating human-like text.

The specific case of ChatGPT raised discussions and reviews on its potential applications in the cybersecurity field [43, 44, 45], and specific attack mitigation approaches have already been described [46]. Other works designed chatbots specifically to help cybersecurity analysis. SecBot [3] is a cybersecurity-driven conversational chatbot that extracts information from a conversation to support cybersecurity planning and management. It was developed and evaluated on a small dataset utilizing the Rasa framework [47], achieving 100% accuracy in extracting the intent of attacks and associated named entities.

2.6. Evaluation of LLM Chatbots

Given the widespread attention LLM chatbots are attracting, several works tried to evaluate the strengths and limitations of the technology. An analysis of ChatGPT's performance [6, 48] showed that despite its achievements, it frequently fell behind supervised baselines in various NLP tasks. Several factors influence this, including limitations on the number of tokens, a misalignment with specific NLP tasks due to its generative nature, and challenges inherent to LLMs, such as hallucination, which involves making false positive predictions. In their effort to optimize ChatGPT, the authors have proposed solutions, including multiple prompts, task-specific fine-tuning, and strategies to counter hallucination. These methods were thoroughly tested across 21 datasets, covering ten critical NLP tasks, including NER. The testing of the solutions resulted in significant performance improvements for ChatGPT, with instances where it outperformed state-of-the-art models in existing benchmarks [48].

The study by Megahed et al. [49] shows that ChatGPT excels in structured tasks like code translation and explaining established concepts. However, it faces challenges when handling nuanced tasks such as recognizing unfamiliar entities and generating code from scratch.

Recent research findings show that ChatGPT may encounter challenges and limitations in accurately identifying entities, including locations, names, and organizations. Qin et al. [50] present the results of their experiments on the performance of GPT-3.5, ChatGPT, and fine-tuned models on the multi-domain CoNLL dataset for recognizing entities. According to the findings reported in this paper, ChatGPT and GPT-3.5 achieved F1 scores of 53.7% and 53.5%, respectively. Sun et al. [48] investigated the factors contributing to the sub-optimal performance of GPT-based chatbots in NLP tasks, such as NER. They have identified several underlying causes and have proposed a set of generalized modules to mitigate these challenges in different NLP tasks.
Kocoń et al. [6] examined the capabilities of ChatGPT on a diverse set of subjective analytical tasks and objective reasoning tasks, revealing that, when compared to state-of-the-art models, the average quality loss is about 25% in zero-shot and few-shot settings.

As LLM-based chatbots have become widespread, concerns about their own vulnerabilities and their role as tools for cyberattacks have increased [51]. ChatGPT can extend the time-to-conquer or delay attacker timelines, making it valuable for organizations seeking to enhance their cybersecurity posture [52]. Additionally, the performances of the ChatGPT and GPT-3 models were evaluated for vulnerability detection in code [53]. Based on a real-world dataset, this evaluation focuses on binary and multi-label classification tasks related to the common weakness enumeration of vulnerabilities. The findings indicate that ChatGPT does not outperform the baseline classifier in classification tasks for code vulnerability detection [53]. More precisely, the performance of both GPT models was assessed by accuracy, precision, recall, F1 score, and area under the curve. The highest F1 score of 0.67 was achieved using the text-davinci-003 model for binary classification. In contrast, the F1 score of all ChatGPT models remained below 0.53 for multilabel classification.

Sentiment analysis plays a crucial role in extracting user opinions and emotions from textual data to assess threats. Recent research has been directed toward developing sustainable strategies to diminish threats, vulnerabilities, and data manipulation within chatbots, ultimately improving the scope of cybersecurity. To achieve this objective, researchers created an interactive chatbot using the Bot Libre platform (https://www.botlibre.com/) and placed it on social media platforms such as Twitter for the specific purpose of cybersecurity [4]. This study employs a sentiment analysis strategy by deploying chatbots on Twitter and subsequently analyzing Twitter data to anticipate forthcoming threats and cyberattacks.

2.7. The Research Gap

The reviewed papers in this section primarily focused on evaluating and testing the commercial ChatGPT model, with limited attention given to assessing open-source chatbot models in the context of cybersecurity applications. Moreover, to the best of our knowledge, there is a notable absence of comprehensive comparative studies explicitly dedicated to open-source chatbots within the specialized field of OSINT-based CTI. This gap in existing research underscores the need to examine open-source chatbot models, their effectiveness, and their potential contributions to enhancing cyber threat awareness and detection. Although there are specialized state-of-the-art models, including DL models that excel at CTI binary classification and NER tasks [10, 11, 41], no comparative study has been conducted to determine whether LLM-based chatbots can compete with their performance. If they can provide competitive results, CTI tools could be changed to integrate LLM chatbots into the OSINT processing pipeline, decreasing the tools' complexity and maintenance costs. By employing general-purpose LLM chatbots, tasks related to CTI data collection, curation, labelling, and model training and updating would no longer be as necessary as they are for specialized models.

Our study aims to fill this gap by carefully comparing open-source and publicly available paid chatbots in the context of an OSINT-based CTI application, considering two downstream NLP tasks: binary classification and NER.

3. LLM-based chatbots

LLM-based chatbots simulate human-like conversations with users through text or speech interaction. They offer users more intelligent and contextually relevant responses to queries by utilizing LLMs' language comprehension and generation capabilities. This section reviews the eight state-of-the-art commercial and open-source chatbot models available at the time of writing, namely, LLaMA, GPT4all, Dolly 2.0, Stanford Alpaca, Alpaca-LoRA, Vicuna, Falcon, and ChatGPT. We used these models in our experimental evaluation (Section 6) to assess their effectiveness in detecting cybersecurity-related texts and identifying relevant entities.

3.1. LLaMA

LLaMA [37] is a collection of foundation language models ranging from 7 to 65 billion (B) parameters. These were trained on trillions of tokens, demonstrating the possibility of training cutting-edge models using only publicly accessible datasets. Their training method is comparable to that described in prior research [54] and is influenced by the Chinchilla scaling laws [55]. Specifically, LLaMA-13B demonstrated superior performance compared to GPT-3 (175B) across a wide range of benchmarks, whereas LLaMA-65B exhibited similar performance levels to leading models, such as Chinchilla-70B and PaLM-540B. LLaMA models undergo training on substantial textual datasets by employing a conventional optimizer and large-scale transformers [37].

3.2. Vicuna

An open-source chatbot, Vicuna-13B [19], was developed by fine-tuning LLaMA on 70K user-shared conversations gathered from ShareGPT. In Vicuna, gradient checkpointing [56] and flash attention [57] alleviate the memory demand. The Vicuna report states that, like other large language models, it has limitations. For example, it is not adept at tasks requiring logic or mathematics, and may have limitations in correctly identifying itself or assuring the factual accuracy of its outputs [19].

3.3. GPT4all

GPT4all [14] utilizes LLaMA, which operates under a non-commercial license. The data for the assistant come from OpenAI's GPT-3.5-turbo, which has restrictions that prevent the development of models that directly compete with OpenAI in commercial applications.
GPT4all underwent several iterations, with different versions featuring different parameter sizes. While preparing this article, the initial version had 7B parameters; however, the latest iteration used 13B.

3.4. Dolly

Dolly [15] was created by slightly modifying an open-source model with 6B parameters sourced from EleutherAI [58]. These modifications enabled Dolly to possess instruction-following capabilities, such as brainstorming and text generation, which were not initially present in the base model. These modifications are implemented using the data from Alpaca [16]. Subsequently, Dolly-v2-7b emerged as a highly advanced 6.9B parameter causal language model derived from EleutherAI's Pythia-6.9b. Although Dolly-v2-7b may not be considered a state-of-the-art model, it exhibits instruction-following capabilities of remarkably high quality, which are not typically associated with its foundational model. The most recent version available is Dolly-v2-12b [59], a model with 12B parameters developed based on EleutherAI's Pythia-12b. It has been fine-tuned using a dataset called databricks-dolly-15k, which consists of an instruction corpus created by employees of Databricks (https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm).

3.5. Stanford Alpaca

Stanford Alpaca [16] is an instruction-following language model that is fine-tuned from Meta's LLaMA 7B model. It was instructed using 52k self-instruct-style demonstrations [60]. Alpaca displays several shortcomings common to language models, such as hallucination, toxicity, and stereotypes. Specifically, hallucination appears to be a recurring issue in Alpaca [16].

3.6. Alpaca-LoRA

This model [17] reproduces the Stanford Alpaca results using low-rank adaptation (LoRA) [61]. LoRA fine-tuning is a strategy for reducing memory requirements by employing a limited set of trainable parameters, known as adapters, rather than updating all model parameters, which remain constant.
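To illustrate the idea behind LoRA (a conceptual sketch of ours, not the Alpaca-LoRA implementation), the frozen pre-trained weight matrix is augmented with a trainable low-rank update:

```python
# Conceptual LoRA layer: the frozen weight W is augmented with a trainable
# low-rank update B @ A, so only r*(d_in + d_out) parameters are learned.
import numpy as np

class LoRALinear:
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen pre-trained weight (d_out, d_in)
        self.A = rng.normal(0, 0.01, (r, W.shape[1]))   # trainable down-projection, rank r
        self.B = np.zeros((W.shape[0], r))              # trainable up-projection, starts at zero
        self.scale = alpha / r                          # scaling factor used by LoRA

    def forward(self, x):
        # Output = W x + scale * B (A x); since B = 0 at init, behaviour is unchanged.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(np.eye(4))
print(layer.forward(np.ones(4)))                        # identical to W x at initialization
```

Because only A and B are updated during fine-tuning, the optimizer state and gradients are tiny compared to full fine-tuning, which is the memory saving the section describes.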
3.7. ChatGPT

OpenAI's ChatGPT [13] has generated substantial interest and sparked extensive discussion within the NLP community as well as in various other domains. The lack of clarity regarding the training process and architectural specifics of ChatGPT poses a significant obstacle to both research endeavors and the advancement of open-source innovation within this domain. Moreover, the distinction between the ChatGPT API and the web version model should be acknowledged. Recent research shows considerable variability in the performance and behavior of GPT-3.5 and GPT-4 models over time [62]. Another feature of generative AI models such as ChatGPT is that they do not produce repetitive responses to specific prompts [49].

3.8. Falcon

The Falcon family [18] consists of two primary models: Falcon-40B and its smaller counterpart, Falcon-7B. The Falcon-7B and Falcon-40B models underwent training on a corpus of 1.5 trillion and 1 trillion tokens, respectively [18]. A feature of Falcon models is their utilization of multiquery attention [63]. In the vanilla multihead attention scheme, each head is associated with a query, key, and value. However, in the multiquery approach, a single key and value are shared across all heads.
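The saving is easy to quantify. The sketch below (with illustrative dimensions of our choosing, not Falcon's actual configuration) compares the number of key/value projection parameters in the two schemes:

```python
# Parameter counts for K/V projections in multihead vs. multiquery attention
# (illustrative sizes, not Falcon's real configuration).
n_heads, d_model, d_head = 8, 512, 64

# Vanilla multihead attention: every head has its own K and V projection.
mha_kv_params = n_heads * (d_model * d_head) * 2   # 524288

# Multiquery attention: one K and one V shared across all heads; only the
# queries remain per-head, which also shrinks the inference KV cache.
mqa_kv_params = (d_model * d_head) * 2             # 65536

print(mha_kv_params // mqa_kv_params)              # 8x fewer K/V parameters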
4. Chatbots evaluation

We divided this section into three topics relevant to the evaluation: dataset, experimental methodology, and experimental results evaluation criteria. First, we introduce the Twitter dataset, which is the foundational source for generating subsequent prompts. The methodology subsection outlines the techniques and strategies used to assess and compare chatbot performances. Finally, in the evaluation criteria subsection, we explore the metrics and standards by which the chatbots are assessed.

4.1. Dataset

We leveraged a comprehensive Twitter dataset made available by Alves et al. [12] in their work to retrieve Indicators of Compromise (IoCs) from Twitter OSINT. This dataset contains a combined total of 31281 tweets collected during two distinct periods: one from November 21, 2016, to March 27, 2017, and the other from June 1, 2018, to September 1, 2018. After collection, the tweets were filtered using specific keywords and manually labelled as positive or negative, considering their relevance to the cybersecurity of an IT infrastructure, thus creating labelled datasets suitable for supervised learning. Later, Dionísio et al. labeled the dataset [11] for NER and republished it.

The dataset consists of 31281 tweet records, each including the timestamp of the tweet, specific keywords found in the tweet, the original tweet, a pre-processed tweet cleaned from some special characters, a binary label marking the tweet as relevant for cybersecurity or not, and a string identifying the named entities in the pre-processed tweet. Table B1 in Appendix B shows representative examples of dataset records.

In this work, we used the pre-processed tweets, the cybersecurity relevance, and the sequences of named entity tags to create a customized dataset for the binary classification and NER experiments through which the chatbots will be evaluated.
• Find only product version numbers without any product, vulnerability, and company names in the following sentence: 'threatmeter dos microsoft internet explorer 9 mshtml cdisp node::insert sibling node use-after-free ms13-0'. Give the shortest answer, and only use sentence segments in your response.

The second approach, which is called Guide-Line Prompting (GLP), exclusively employed for ChatGPT-4, involves a comprehensive specification of all entities within the 'GUIDELINES_PROMPT' section. This approach attempts to extract seven entities in a single prompt and considers only the ChatGPT model because the GUIDELINES_PROMPT feature is absent in open-source models. In this guideline, we include two examples of tweets from the 11074 NER-tagged tweets in the dataset, each annotated with their respective entities. In addition, we have included the output format as a means to direct ChatGPT's response for subsequent processing. We then sent a dedicated prompt for each of the 11074 tweets and systematically covered all the extracted entities. The prompt guideline is given in Appendix A.

4.3. Evaluation criteria

This section describes our evaluation criteria to assess chatbot models' performance in addressing binary classification and NER tasks. Our evaluation methodology focuses on performance and quality.

4.3.1. Performance

To assess the selected chatbots' performance, we used the F1 score, which is a metric that computes the harmonic mean of Precision and Recall to measure the binary classification performance of a model. It strikes a balance between the proportion of true positive results in all positive predictions and the proportion of true positive results in all actual positives. The F1 score is given by

    F1 = 2 × (Precision × Recall) / (Precision + Recall)    (1)

where Precision is the ratio of correctly predicted positive observations to the total predicted positives, given by

    Precision = TP / (TP + FP),

and Recall measures the ratio of correctly predicted positive observations to all observations in the positive class:

    Recall = TP / (TP + FN).
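For reference, the following snippet computes these three metrics from raw confusion-matrix counts; the sample values are taken from the Dionisio et al. [41] matrix in Appendix C and closely reproduce the corresponding row of Table 1:

```python
# Precision, recall, and F1 from raw counts, following Eq. (1).
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the Dionisio et al. [41] confusion matrix in Appendix C.
print(scores(tp=10290, fp=456, fn=700))  # approximately (0.958, 0.936, 0.947)
```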
4.3.2. Quality

For the classification task, the response quality of the models relates to their ability to provide precise yes or no answers to the prompts they receive. The assessment is based on three distinct response modes that the models eventually produced: no response, precise response, and implicit response.

• No response: instances in which chatbot models failed to respond.
• Precise response: questions for which the chatbot models provided accurate answers precisely aligned with the desired yes or no response.
• Implicit response: answers in which the chatbot's response did not explicitly mention yes or no. However, careful inference from the generated answers revealed that the intended response was either yes or no.

By analyzing these distinct response modes from the output files, we gain valuable insights into the performance and capabilities of the evaluated chatbot models to effectively detect and address cybersecurity-related questions. The experimental results section provides a detailed explanation of the experimental findings, focusing on the quality and accuracy evaluation criteria.

5. Optimal utilization strategies of chatbots

The fundamental principle behind creating prompts is ensuring clear and precise instructions. Prompt engineering [64] is crucial for optimizing the performance of chatbot models by enhancing the clarity and specificity of the given instructions. Thorough prompt design and testing improve the ability of the model to comprehend requests, making it a more effective tool for generating desired outcomes. These approaches enhance the possibility of directing the model to the desired output while decreasing the chances of receiving irrelevant or incorrect responses.

5.1. Prompt engineering approaches

We devoted considerable attention to formulating suitable prompts to maximize accurate and relevant responses from the chatbots during our experiments. The objective was to design prompts that effectively allow models to capture the essence of the CTI task in the information present in tweets. Following the best practices for prompt engineering [58], we progressively refined our prompts using two approaches. For NER, we first considered a prompt guideline template approach, but with poor results (the prompt template is shown in Appendix A). We used a zero-shot methodology for binary classification, considering that LLMs are excellent zero-shot reasoners [65]. Since this approach produced very good results, for the NER tests we transformed the prompt guideline into a sequence of zero-shot questions, one for each entity we aim to extract.
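A minimal sketch of this per-entity decomposition is shown below; the template wording and entity list are illustrative stand-ins, as the exact prompt wording used in the experiments is given in Section 4.2 and Appendix A:

```python
# Build one zero-shot NER question per target entity type for a given tweet.
# Template wording and entity list are illustrative; the deployed prompts are
# documented in the text and in Appendix A.
ENTITIES = ["organization", "product", "version", "vulnerability"]  # subset of the 7

def ner_prompts(tweet):
    return [
        f"Find only the {entity} names in the following sentence: '{tweet}'. "
        "Give the shortest answer, and only use sentence segments in your response."
        for entity in ENTITIES
    ]

for p in ner_prompts("threatmeter dos microsoft internet explorer 9 mshtml"):
    print(p)
```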
To leverage the zero-shot capability of LLMs, we explored the assumption that chatbots perfectly represent the cybersecurity concept. Following this assumption, we designed prompts by starting the question with information on the cybersecurity issue and finishing it with precise instructions on the expected outcome.

The process of selecting the final prompt was iterative and relied on a few trial and error cycles, using the principles described before. Using a sample of dataset entries, we progressively queried the chatbots and evaluated the answers until consistent answers were achieved concerning the instructions given. This refinement cycle resulted in the determination of the final prompt.

By formulating questions such as Is the sentence 'vuln oracle java se cve-2016-5582 remote security vulnerability' related to cybersecurity? Just answer yes or no., we aimed to elicit yes or no responses from the chatbots. Using spaces and adequately employing the apostrophe (') played a significant role in clarifying the prompt.

This methodology allowed us to assess the chatbots' capabilities to detect cybersecurity concepts and their ability to generate meaningful and contextually appropriate responses. Having formulated the desired prompt, we automated the process of sending the entire set of questions to the chatbots.
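The automation itself is conceptually simple. A sketch is shown below (our illustration, with a placeholder ask_chatbot callable standing in for whichever backend is queried, local model or hosted API):

```python
# Sketch of the automation loop: wrap each pre-processed tweet in the final
# prompt and collect one answer per tweet. `ask_chatbot` is a placeholder for
# the backend actually queried (a local model or a hosted API).
import csv

def build_prompt(tweet):
    return (f"Is the sentence '{tweet}' related to cybersecurity? "
            "Just answer yes or no.")

def classify_all(tweets, ask_chatbot):
    answers = []
    for tweet in tweets:
        answers.append((tweet, ask_chatbot(build_prompt(tweet))))
    return answers

def save(answers, path="answers.csv"):
    # Persist (tweet, raw answer) pairs for the later validation step.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(answers)
```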
5.2. Text length control

Chatbots can produce text of varying lengths based on specific tasks. In our experiments, minimizing the number of answer tokens was essential because of the high volume of questions and the time required for the model to answer each question.

The parameter that defines the length of the answer, i.e., the number of generated tokens, in local chatbots is represented as N_predict. It plays a critical role in significantly reducing the execution time for answering questions across all open-source chatbot models. To optimize the execution time, we advise setting the parameter that controls the response length to the smallest suitable value for a specific task. After some initial experimentation, we consistently set the value of N_predict to 15 across all open-source models.

In the context of ChatGPT-3.5-turbo and ChatGPT-4, the max_tokens parameter constrains the length of the responses generated by the model. This is achieved by establishing a predetermined upper limit for the number of tokens, which can be words or characters, within the generated output. Using more extended responses in ChatGPT-4 can increase token consumption, potentially increasing the usage costs. After some initial experimentation with prompts of extreme lengths, the value 70 was assigned to max_tokens in the experimental tests. It is worth noting that the chosen token length of 15 for the open-source chatbots applies exclusively to the generated answers. By contrast, the selected token length of 70 for ChatGPT models encompasses questions and answers.

The careful utilization of the N_predict and max_tokens parameters is of utmost importance, as a low setting may lead to truncation of the response, potentially producing incomplete or nonsensical answers. Balancing the desired response length with the need for completeness and coherence is a crucial factor to consider.
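Expressed as request parameters, the settings described above look roughly as follows. The dictionaries only illustrate where each limit is applied (the paper's N_predict corresponds to the n_predict option of llama.cpp-style local runtimes; max_tokens is the OpenAI API parameter); they are not verbatim client code:

```python
# Illustrative request parameters for the two response-length limits used in
# the experiments (assumed parameter placement; not verbatim client code).
local_request = {
    "prompt": "Is the sentence '...' related to cybersecurity? Just answer yes or no.",
    "n_predict": 15,   # open-source models: cap applied to generated tokens only
}

openai_request = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "..."}],
    "max_tokens": 70,  # ChatGPT models: the value chosen after experimentation
}
```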
are 97.32% and 94.10%, respectively. The GPT4all model
delivered conceptual yes or no responses to 378 questions,
6. Results and Discussion accounting for 1.2%, whereas Dolly produced responses for
In this section, we present the results of the empirical 839 questions, constituting 2.68%. Interestingly, however,
assessment of chatbots’ thorough evaluations of their ca- 3 Upper limit for the range of tokens considered to provide an answer.
pabilities across multiple dimensions. First, we discuss the
9
Interestingly, however, neither of these models explicitly used the words yes or no in these conceptual answers. For example, the conceptual answers were: it is related to cybersecurity, or the sentence is not related to cybersecurity.

In terms of accuracy, we explain the results of the tests conducted on the chatbot models. Table 1 provides an overview of the tests conducted for each chatbot and its version. Moreover, we recorded the execution time of the models' responses for all the questions. The table includes details regarding the number of parameters on which the model was trained, achieved F1 score, precision, recall, and execution time for each model. The confusion matrices corresponding to each model are provided in Appendix C. By analyzing the F1 score values in Test 1, we can assess the effectiveness of the models in accurately responding to the given questions.

Table 1
Accuracy of chatbot models for cybersecurity binary classification

Model Test Number Parameters Precision Recall F1 score Execution Time
ChatGPT-3.5-turbo (16k context) [13] Test 1 175B 0.9570 0.9280 0.9431 11h 23m
ChatGPT-3.5-turbo (16k context) [13] Test 2 175B 0.9700 0.9200 0.9489 11h 23m
ChatGPT-3.5-turbo (16k context) [13] Test 3 175B - - UECH -
ChatGPT-4 (8k context) [13] Test 1 1.7T 0.9580 0.9240 0.9410 11h 50m
ChatGPT-4 (8k context) [13] Test 2 1.7T 0.9590 0.9230 0.9403 11h 43m
ChatGPT-4 (8k context) [13] Test 3 1.7T - - UECH -
GPT4all [14] Test 1 13B 0.9490 0.8630 0.9049 132h 05m
GPT4all Test 2 13B 0.9490 0.8410 0.8927 132h 02m
GPT4all Test 3 13B 0.9470 0.8280 0.8844 136h 05m
Dolly 2.0 [15] Test 1 7B 0.8890 0.8000 0.8470 10h 38m
Dolly 2.0 Test 1 12B 0.9470 0.7900 0.8612 10h 16m
Dolly 2.0 Test 2 12B 0.9480 0.7910 0.8631 10h 00m
Dolly 2.0 Test 3 12B - - - LET
Falcon [18] Test 1 7B 0.8120 0.8500 0.8304 16h 02m
Falcon Test 1 40B 0.8980 0.8200 0.8511 54h 03m
Falcon Test 2 40B 0.8990 0.8080 0.8502 54h 55m
Falcon Test 3 40B 0.8990 0.7880 0.8330 71h 10m
Alpaca-LoRA [17] Test 1 65B 0.8980 0.7940 0.8477 10h 12m
Alpaca-LoRA Test 2 65B 0.8990 0.8000 0.8451 10h 44m
Alpaca-LoRA Test 3 65B 0.8980 0.7610 0.8241 11h 20m
Stanford Alpaca [16] Test 1 7B 0.2260 0.5000 0.3112 13h 03m
Stanford Alpaca Test 1 13B 0.3240 0.6000 0.4209 13h 21m
Stanford Alpaca Test 1 30B 0.6980 0.6050 0.6415 15h 48m
Stanford Alpaca Test 2 30B 0.6990 0.5920 0.6401 15h 04m
Stanford Alpaca Test 3 30B 0.6990 0.5810 0.6395 16h 18m
Vicuna [19] Test 1 13B 0.4390 0.3100 0.3611 11h 23m
Dionisio et al. [41] Test 1 - 0.9570 0.9363 0.9470 00h 43m
* LET: Long Execution Time. * UECH: Uncertainty of Erasing Conversation History.

Based on the results presented in Table 1, it is evident that GPT4all achieved the highest accuracy among the open-source models, as indicated by its F1 score of 0.90. The Dolly model has an accuracy of 0.86, Falcon of 0.85, Alpaca-LoRA of 0.84, and the Stanford Alpaca model has a score of 0.64. Although GPT4all achieved the highest accuracy among open-source chatbots, it is noteworthy that the commercial ChatGPT models (GPT-4 and GPT-3.5-turbo) achieved an F1 score of 0.94. ChatGPT-3.5-turbo with a 16k context window size achieves the same F1 score as ChatGPT-4 with an 8k context window size. These results highlight the better accuracy of the GPT4all and ChatGPT models, emphasizing their effectiveness for this particular task.
Table 2
Comparison of NER task accuracy achieved by the ChatGPT-4 (8k context) model using two different approaches.

Approach Number of Questions Entity F1 score Execution Time
ESP 11074 Organization 0.36 4h 02m
ESP 11074 Version 0.43 4h 23m
GLP 11074 All entities 0.10 3h 09m

Table 3
Illustrative chatbot responses demonstrating NER identification for two distinct prompts.

Model Entity Model's Response
ChatGPT-4 (8k context) Organization Microsoft
ChatGPT-4 (8k context) Version 9
GPT-3.5-turbo Organization The name of organizations in the given sentence is "Microsoft".
GPT-3.5-turbo Version 9, ms13-0
Dolly 2.0 Organization Microsoft Internet Explorer 9
Dolly 2.0 Version ms13-0 is 9.0.8112.16421
GPT4all Organization Microsoft, Mozilla (Firefox), and Google Chrome
GPT4all Version 546

requested questions. This F1 score exceeds the corresponding value for identifying the 'organization name' entity. Our analysis for the ESP approach is based on the ChatGPT version released on July 13th, 2023. The results in the last line of Table 2 indicate that the calculated F1 score for 11074 questions is unexpectedly low, reaching 0.10. The findings demonstrate a significant deviation from the current NER outcomes as presented in the [41] study, which achieved an F1 score of 0.94. Our GLP approach evaluation is based on the ChatGPT released on August 2, 2023.

A representative example of the responses generated by each model is provided in Table 3. In the two NER prompts mentioned in Section 4.2, the annotations in the dataset show 'Microsoft' as an organization entity and '9' as a product version. The phrase 'without any product, vulnerability, and company names' in the version prompt plays a crucial role in constraining the models' interpretation of version numbers. ChatGPT-4 demonstrated precision in identifying 'Microsoft' as B-ORG and '9' as B-VER, accurately matching the annotations in the dataset. By contrast, the other models exhibited varying degrees of accuracy, failing to reach the level of correctness achieved by ChatGPT-4 in these specific instances.

6.3. Challenges and limitations

Chatbots present opportunities in numerous applications but also face challenges and limitations that hinder their effective utilization. In this study, we outline several challenges and limitations encountered during our experiments, which are detailed in this section. Additionally, we faced specific limitations when applying chatbots to the specialized CTI tasks, leading to issues with timeliness that are elaborated upon in Section 6.3.1. A common challenge to binary classification and NER was the generation of effective prompts, which required progressive refinement until satisfactory performance was achieved. This section explores the challenges and limitations encountered in the binary classification and NER tasks.

6.3.1. Binary classification

In Section 4.2.1, we discuss how each chatbot model displayed unique response behaviors when answering questions. This necessitated a cleaning step after collecting responses to ensure answer consistency, particularly because our goal was to obtain a precise binary (yes or no) response. However, some responses implied a 'no' or a 'yes' without explicitly using these words. For instance, a model might answer the question, 'This is not related to cybersecurity'. Consequently, it is crucial to review and validate the answers in the output file. Our evaluation process involved two key steps to ensure reliability and accuracy. First, we implemented an automated validation method for each output file to confirm the presence of 'yes' or 'no' responses at the beginning of the responses. This helped to filter out potentially incorrect answers, such as those lacking an explicit 'no' or 'yes'. Second, we conducted a thorough manual review of each response to confirm its accuracy and alignment with the expected answers. In cases such as the aforementioned example, we manually annotated the response with 'no' to correctly classify it. This comprehensive manual validation adds a crucial layer of scrutiny, bolstering the accuracy and reliability of the results.
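A sketch of such a validation pass is shown below (our illustration; the implicit-answer patterns are examples, not the complete set we handled, and a production pass would need more guards):

```python
# Sketch of the automated validation step: keep answers that start with an
# explicit yes/no, map known implicit phrasings, and route the rest to
# manual review. Patterns shown are illustrative examples only.
def normalize(answer):
    a = answer.strip().lower()
    if a.startswith("yes"):
        return "yes"
    if a.startswith("no"):                     # note: a real pass should guard
        return "no"                            # against words like "note"
    if "not related to cybersecurity" in a:    # implicit negative
        return "no"
    if "is related to cybersecurity" in a:     # implicit positive
        return "yes"
    return None  # neither explicit nor implicit: flag for manual review

assert normalize("No, it is unrelated.") == "no"
assert normalize("This is not related to cybersecurity.") == "no"
```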
While this comprehensive validation process enhances the accuracy of our binary classification, it also introduces another challenge: the need for timely response processing. Timeliness is a critical factor in the use of chatbot models for binary classification tasks, particularly in CTI applications. The demand for real-time processing is essential, as any delay in classifying responses can significantly impact decision-making.

6.3.2. Named entity recognition

When employing chatbots in NER tasks for cybersecurity purposes, we observed various limitations, mainly in providing precise and relevant results. While chatbots powered by pre-trained language models excel at understanding natural language and utilizing general knowledge, they frequently encounter challenges when dealing with domain-specific precise entity recognition [32]. This shortcoming is attributed mainly to the intrinsic complexities of NER, which demand a profound understanding of context, domain knowledge, and syntactic intricacies.

A recurrent issue in chatbot-assisted NER is the generation of unspecific answers that fail to accurately identify precise entities in a given sentence. This often leads to generalized responses that lack the precision essential for obtaining reliable NER results. Consider this prompt as an example: Find the name of organizations in the following sentence: 'senator calls on us government to start killing adobe flash now tripwire'. Give the shortest answer, and only use sentence segments in your response. The ChatGPT-4 response was 'US Government, Adobe, Tripwire'. Such challenges can be traced back to factors such as the limitations inherent in the underlying language models or the absence of dedicated fine-tuning tailored to NER tasks. Another phenomenon, hallucination, as discussed in Section 2.6, also arises from these compounded challenges. As an example of precise hallucination, GPT4all mistakenly extracted '546' as a product version, a number that did not appear at all in the product version prompt mentioned in Section 4.2.2.

6.4. Discussion

The evaluation of chatbot models involves two critical aspects. First, it involves a deep understanding of effective methods for interacting with the models. This includes considering timeliness, which is particularly important when integrating chatbots into real-time systems connected to Twitter, such as SYNAPSE [10]. Second, evaluation requires the skill of writing structured and clear prompts for chatbots, ensuring that they produce precise and relevant responses. The ability to compose clear and concise prompts that guide the chatbot to provide an anticipated response is crucial. The preceding sections explain that exercising control over the response length is also crucial. Long answers not only extend execution times but also impose an additional workload on human resources for response validation. On the other hand, short responses might not contain enough meaningful content.

Automated interaction with chatbots for tasks such as binary classification and NER is another area that imposes specific requirements. This includes the ability to automatically generate prompts that are contextually relevant and precise. The system must also interpret and adapt to varying response formats and manage the complexities of the different types of data inputs. Additionally, the chatbot's ability to handle ambiguity in natural language while maintaining the accuracy of its responses is essential. This requires sophisticated algorithms that are capable of understanding subtle differences in language and context.

Although our evaluation primarily focused on a specific Twitter dataset, it is worth noting that numerous other OSINT resources, such as blog posts and security forums (even on the dark web), remain unexplored. Moreover, it is important to note that a manual review of chatbot responses following the automated checking phase may introduce potential human errors into the assessment process. Further, such verification is time-consuming and thus not feasible in the day-to-day operation of a security operations centre.

7. Conclusion

We assessed the capabilities of open-source and paid LLM-based chatbots to recognize cybersecurity-related tweets and extract pertinent information from them. Both types of models can perform similarly to specialized models trained specifically for the binary classification task of identifying cybersecurity-related tweets, often achieving the same level of performance. On the contrary, the chatbot models' performance is still very poor on named entity recognition to extract security elements from tweets. Even though they are trained on vast datasets, these models did not perform comparably to specialized models on the test data.

Our results highlight the need for further research and refinement in the application of LLM chatbot models to extract threat indicators from open-source intelligence. Although open-source and paid chatbots compared evenly with specialized trained models on cybersecurity binary text classification, they fell below acceptable performance on named entity recognition. Moreover, they cannot compete with specialized models on timeliness and cost.

Based on our study, we identify various possibilities for future work based on the following research questions:

1. How can LLM chatbots be further optimized for cost-effective real-time CTI detection on social media platforms?
2. How can the NER capability of chatbots be improved for the extraction of indicators of compromise?
3. How can cybersecurity specialists' feedback be used to increase the efficiency and cost-effectiveness of open-source chatbots?

Acknowledgement

This work is funded by the European Commission through the SATO Project (H2020/IA/957128) and by FCT through the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020).
[44] M. Gupta, C. Akiri, K. Aryal, E. Parker, L. Praharaj, From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy, IEEE Access (2023).
[45] M. Al-Hawawreh, A. Aljuhani, Y. Jararweh, Chatgpt for cybersecurity: practical applications, challenges, and future directions, Cluster Computing 26 (2023) 3421–3436.
[46] T. McIntosh, T. Liu, T. Susnjak, H. Alavizadeh, A. Ng, R. Nowrozy, P. Watters, Harnessing GPT-4 for generation of cybersecurity GRC policies: A focus on ransomware attack mitigation, Computers & Security 134 (2023) 103424.
[47] Conversational AI Platform | Superior Customer Experiences Start Here. URL: https://rasa.com/, last visited: February 13, 2024.
[48] X. Sun, L. Dong, X. Li, Z. Wan, S. Wang, T. Zhang, J. Li, F. Cheng, L. Lyu, F. Wu, et al., Pushing the limits of chatgpt on nlp tasks, preprint arXiv:2306.09719 (2023).
[49] F. M. Megahed, Y.-J. Chen, J. A. Ferris, S. Knoth, L. A. Jones-Farmer, How generative ai models such as chatgpt can be (mis)used in spc practice, education, and research? an exploratory study, Quality Engineering (2023) 1–29.
[50] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chatgpt a general-purpose natural language processing task solver?, preprint arXiv:2302.06476 (2023).
[51] A. Qammar, H. Wang, J. Ding, A. Naouri, M. Daneshmand, H. Ning, Chatbots to chatgpt in a cybersecurity space: Evolution, vulnerabilities, attacks, challenges, and future recommendations, preprint arXiv:2306.09255 (2023).
[52] F. McKee, D. Noever, Chatbots in a honeypot world, preprint arXiv:2301.03771 (2023).
[53] A. Cheshkov, P. Zadorozhny, R. Levichev, Evaluation of chatgpt model for vulnerability detection, preprint arXiv:2304.07232 (2023).
[54] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[55] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal large language models, preprint arXiv:2203.15556 (2022).
[56] T. Chen, B. Xu, C. Zhang, C. Guestrin, Training deep nets with sublinear memory cost, preprint arXiv:1604.06174 (2016).
[57] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and memory-efficient exact attention with io-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359.
[58] EleutherAI, 2023. URL: https://www.eleuther.ai.
[59] databricks/dolly-v2-12b · Hugging Face, 2023. URL: https://huggingface.co/databricks/dolly-v2-12b.
[60] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-instruct: Aligning language model with self generated instructions, preprint arXiv:2212.10560 (2022).
[61] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, preprint arXiv:2106.09685 (2021).
[62] L. Chen, M. Zaharia, J. Zou, How is chatgpt's behavior changing over time?, preprint arXiv:2307.09009 (2023).
[63] N. Shazeer, Fast transformer decoding: One write-head is all you need, preprint arXiv:1911.02150 (2019).
[64] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35.
[65] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in neural information processing systems 35 (2022) 22199–22213.
[66] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al., Deepseek llm: Scaling open-source language models with longtermism, preprint arXiv:2401.02954 (2024).
Fig. A1: GLP NER approach: employing a ChatGPT-4 guideline prompt template
Table B1
Samples from the 31281 tweet entries in the dataset

Record 1
  timestamp:           2018-07-24 01:00:46+00:00
  keywords:            oracle
  original tweet:      RT Oracle: Learn to use and understand #Oracle's Internet Intelligence Map https://t.co/l06Nyf1FFF Dyn https://t.co/uzozFKwm97
  pre-processed tweet: rt oracle learn to use and understand oracle s internet intelligence map dyn
  relevance:           0
  entities:            -

Record 2
  timestamp:           2016-12-09 19:19:38+00:00
  keywords:            internet explorer
  original tweet:      threatmeter: [dos] - Microsoft Internet Explorer 9 MSHTML - CDisp Node::Insert Sibling Node Use-After-Free (MS13-0... https://t.co/gLvEwpDL9v
  pre-processed tweet: threatmeter dos microsoft internet explorer 9 mshtml cdisp node::insert sibling node use-after-free ms13-0
  relevance:           1
  entities:            O O B-ORG B-PRO I-PRO B-VER O O O O O B-VUL B-ID
Appendix A.
The template for the guideline prompt used in the ChatGPT GLP NER approach is shown in Fig. A1.
Appendix B.
As described in Section 4.1, for each collected tweet a dataset entry is generated, including timestamp, keywords, original
tweet, pre-processed tweet, cybersecurity relevance binary label, and sequence of named entities in the pre-processed tweet.
Table B1 presents two examples of dataset entries. In the relevance column, '1' denotes an entry considered relevant for cybersecurity, while '0' means otherwise. The last column shows the tags used to label the different NER entities.
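As a complement, the small helper below (ours, not from the paper) decodes such a tag string back into typed entity spans, which is how the token-aligned labels translate into the entities discussed in the text:

```python
# Decode a BIO tag sequence (as in Table B1) into (entity_type, text) spans.
def decode_bio(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # a new entity begins
            if current: spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current: # continuation of the entity
            current[1].append(token)
        else:                                  # outside any entity
            if current: spans.append(current)
            current = None
    if current: spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = "threatmeter dos microsoft internet explorer 9 mshtml".split()
tags = ["O", "O", "B-ORG", "B-PRO", "I-PRO", "B-VER", "O"]
print(decode_bio(tokens, tags))
# [('ORG', 'microsoft'), ('PRO', 'internet explorer'), ('VER', '9')]
```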
Appendix C.

The confusion matrices are provided for the test and model combinations considered, excluding the 7B parameter models in Table 1, which achieved the worst results in their respective groups. The rows in each matrix correspond to the actual expected result (0, then 1), whereas the columns show the predicted results (0, then 1); each matrix is written as [[TN, FP], [FN, TP]]. The tests are listed in the same order as in Table 1.

Dolly 2.0 (12B): Test 1: [[19719, 488], [2348, 8726]]; Test 2: [[19726, 481], [2314, 8760]]
Falcon (40B): Test 1: [[19545, 1006], [1871, 8859]]; Test 2: [[19314, 993], [2137, 8837]]; Test 3: [[19451, 956], [2369, 8505]]
Alpaca-LoRA (65B): Test 1: [[19208, 999], [2281, 8793]]; Test 2: [[19425, 982], [2137, 8737]]; Test 3: [[19460, 947], [2535, 8339]]
Stanford Alpaca (30B): Test 1: [[17704, 2803], [4296, 6478]]; Test 2: [[17632, 2775], [4429, 6445]]; Test 3: [[17436, 2771], [4640, 6434]]
Vicuna (13B): Test 1: [[16362, 4245], [7352, 3322]]
Dionisio et al. [41]: Test 1: [[19835, 456], [700, 10290]]