Chatbots For OSINT
compared to specialized models trained for those tasks. In binary classification experiments, the commercial ChatGPT-4 model achieved an acceptable F1 score of 0.94, and the open-source GPT4all model achieved an F1 score of 0.90. However, concerning cybersecurity entity recognition, all evaluated chatbots have limitations and are less effective. This study demonstrates the capability of chatbots for OSINT binary classification and shows that they require further improvement in NER to effectively replace specially trained models. Our results shed light on the limitations of LLM chatbots when compared to specialized models, and can help researchers improve chatbot technology with the objective of reducing the effort required to integrate machine learning in OSINT-based CTI tools.
answer the main research question. These questions concern the capability of chatbots to provide clear yes-or-no answers, and the cost (in terms of processing time) required for these chatbots to perform the NLP tasks.

Our contributions can be summarized as follows:

1. We present a state-of-the-art survey on chatbot models and their application to cybersecurity.
2. We investigate the extent to which the inherent flexibility of LLM-based chatbots can be tailored to meet the specific requirements of OSINT-based CTI applications.
3. The study provides a comparative analysis of the practical use and performance of chatbots in specialized CTI tasks, including binary text classification and NER.

The remainder of this paper is organized as follows. Section 2 provides some background on LLMs and chatbots, and reviews related works on LLM chatbots for cybersecurity and the evaluation of chatbots in NLP tasks. Section 3 presents a deep exploration of LLM-based chatbots and highlights their significance and capabilities. In Section 4, we shift our focus to the methodology used to evaluate chatbots, detailing our dataset, methods, and the comparison criteria employed for evaluation. Section 5 explores strategies aimed at optimizing the utilization of these chatbots, including prompt fine-tuning and text length control. Section 6 presents our main experimental results and their discussion. Finally, we conclude our study in Section 7 by summarizing the key contributions and insights derived from this study.

2. Background and Related work

In this section, we briefly present the required background for this paper. We start by analyzing transformer models, the foundational elements facilitating recent progress in NLP. Next, we discuss LLMs by examining their development, abilities, and significant influence on various specific areas. Building on this foundation, we review research on how LLMs can effectively be utilized for cybersecurity-related NLP, exploring in detail the role of OSINT in cybersecurity. We end the section by examining chatbots' evolution and NLP capabilities to address cybersecurity concerns, and the literature gap we aim to address with this paper.

2.1. Transformers

The introduction of transformers [21] revolutionized the field of NLP, as they became the preferred architecture for various NLP tasks due to their ability to effectively capture the extensive dependencies and contextual associations of textual data [22, 23]. This ability enables transformers to overcome the limitations of previous methods such as Recurrent Neural Networks (RNNs) [24], Convolutional Neural Networks (CNNs) [25], and Long Short-Term Memory (LSTM) [26]. Instead of processing inputs one at a time, transformer models handle all tokens or words simultaneously. This makes it easier to model the global interactions [27, 28] and dependencies [23]. This feature makes transformers highly effective for various tasks, including but not limited to machine translation, sentiment analysis, and text generation. Transformers have emerged as the basic foundation for models such as Bidirectional Encoder Representations from Transformers (BERT) [29] and Generative Pre-trained Transformers (GPT) [30]. These models have demonstrated exceptional performance across diverse NLP benchmarks, leading to progress in language understanding, text generation, and question answering [31].
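The parallel, all-pairs token interaction can be made concrete with a short sketch. The following NumPy implementation of scaled dot-product self-attention (a single head, with illustrative shapes of our choosing, not code taken from any cited work) shows how every token attends to every other token in one matrix operation:

```python
# Minimal scaled dot-product self-attention (single head), illustrating how
# a transformer relates every token to every other token in one step.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len): all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 16)
```

Because the score matrix covers all token pairs in a single product, no sequential recurrence is needed, which is what distinguishes this computation from RNN- and LSTM-style processing.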
2.2. Large Language Models

LLMs have emerged as significant innovations that revolutionize human language processing, generation, and understanding. These models are trained on large text datasets, including immense volumes of language data, which enables them to perform tasks that demand contextual comprehension and generate coherent and meaningful responses [32]. In terms of applications, the latest generation of language models has been applied across diverse fields. For instance, the utilization of LLMs within mathematics, physics, and chemistry problem-solving has been evaluated by Arora et al. [33]. Agrawal evaluated the ability of LLMs to perform human-like reasoning tasks [34]. The findings of the study indicate that LLMs have strong abilities in analogical and moral reasoning but face challenges in spatial reasoning tests. Chatbots powered by LLMs have attracted interest as powerful tools for data annotation in NLP domains [35]. This interest arises from chatbots' proficiency in language tasks and the critical role of data annotation in developing NLP systems. An illustrative study [36] compared ChatGPT with human crowd workers in annotation tasks. The research emphasized that ChatGPT surpassed human workers in terms of performance, agreement among annotators, and cost-effectiveness.

There are two widely used types of LLMs: GPT and BERT. GPT-style models utilize an autoregressive transformer architecture to capture contextual dependencies and relationships within text [32], while BERT-style language models utilize a bidirectional transformer architecture and adopt a masked language modelling objective. Both approaches gained popularity for NLP capabilities because of their ability to model contextual dependencies. Besides the difference in the context they consider to make text predictions, they also differ in the number of parameters and training dataset size, with GPT-style models requiring more resources in both cases.

Yang et al. [32] provide a complete guide explaining how to use LLMs from various perspectives, including model selection, data consideration, and task specificity. It thoroughly investigates the details of pre-training and the significance of the training and test data, and offers insights into tasks that require a rich knowledge base, natural language comprehension, generation capabilities, and other emergent features.
In particular, LLMs such as the Generative Pre-trained Transformer 3 (GPT-3) have gained attention due to their remarkable ability to acquire a broad spectrum of knowledge during pre-training and apply it to downstream NLP tasks [35], with significant potential for chatbots [13, 15, 16, 17, 18, 19, 37].

2.3. Leveraging LLMs for cybersecurity NLP

Employing the capabilities of LLMs in ML tasks, specifically within the scope of cybersecurity, offers several possibilities [38]. Two problems that stand out are text classification, which categorizes text according to its relevance to a cybersecurity context, and NER, which recognizes specific cybersecurity entities in the text, for example, if it describes a vulnerability, which products are affected. LLMs have shown outstanding effectiveness in many NLP tasks, including binary classification and NER [6, 31].
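As a concrete illustration of these two tasks, the hypothetical snippet below shows the expected input and outputs for one pre-processed tweet; the BIO-style tags match those used in the dataset described later (Section 4.1 and Table B1):

```python
# Hypothetical input/output for the two CTI tasks on one pre-processed tweet.
tweet = "threatmeter dos microsoft internet explorer 9 mshtml"

# Task 1 - binary classification: is the text cybersecurity-related?
classification = 1  # 1 = relevant to cybersecurity, 0 = not relevant

# Task 2 - NER: label each token with a cybersecurity entity tag (BIO scheme).
ner_tags = ["O", "O", "B-ORG", "B-PRO", "I-PRO", "B-VER", "O"]
assert len(ner_tags) == len(tweet.split())
```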
2.4. OSINT for cybersecurity

In the CTI field, NLP tasks such as text classification and NER, which involve exploring various cybersecurity threats and concepts, are crucial. In a previous study, SYNAPSE [10], an OSINT processing pipeline, was implemented to efficiently identify and concisely present various cybersecurity incidents to security analysts. SYNAPSE was motivated by the results of previous works (e.g., [12, 39]) that analyzed the completeness and timeliness of cybersecurity-related OSINT on Twitter.

SYNAPSE employed Support Vector Machines (SVM) [40] for binary classification, and a novel stream clustering approach to aggregate related tweets. To improve SYNAPSE, Dionísio et al. [11] designed a new tool that employs a CNN to detect cybersecurity-related texts gathered from Twitter and a BiLSTM network for performing NER on the detected tweets to identify the type of threat, affected products, and other information. That work was further extended using a multitask Deep Learning (DL) approach [41] that could simultaneously perform classification and NER.

Although there are many ML techniques for binary classification and NER [8, 9], to the best of our knowledge, the architecture of Dionísio et al. [11, 41] is recognized [42] as the state-of-the-art for OSINT-based CTI extraction on Twitter data, reaching almost 95% for different quality metrics. This work also shows superior performance when compared with classifiers based on SVMs and Multi-Layer Perceptrons (MLPs), and several alternative NER approaches. Due to this, we consider Dionisio et al. [41] as the reference specialized model in our comparative analysis.

2.5. Chatbots for Cybersecurity

Although traditional and DL techniques have paved the way for advancements in cybersecurity text classification and NER, a new wave of NLP tools offers even more refined capabilities. Among these tools, LLMs stand out for their exceptional capabilities in interpreting and generating human-like text.

The specific case of ChatGPT raised discussions and reviews on its potential applications in the cybersecurity field [43, 44, 45], and specific attack mitigation approaches have already been described [46]. Other works designed chatbots specifically to help cybersecurity analysis. SecBot [3] is a cybersecurity-driven conversational chatbot that extracts information from a conversation to support cybersecurity planning and management. It was developed and evaluated on a small dataset utilizing the Rasa framework [47], achieving 100% accuracy in extracting the intent of attacks and associated named entities.

2.6. Evaluation of LLM Chatbots

Given the widespread attention LLM chatbots are attracting, several works tried to evaluate the strengths and limitations of the technology. An analysis of ChatGPT's performance [6, 48] showed that despite its achievements, it frequently fell behind supervised baselines in various NLP tasks. Several factors influence this, including limitations on the number of tokens, a misalignment with specific NLP tasks due to its generative nature, and challenges inherent to LLMs, such as hallucination, which involves making false positive predictions. In their effort to optimize ChatGPT, the authors have proposed solutions, including multiple prompts, task-specific fine-tuning, and strategies to counter hallucination. These methods were thoroughly tested across 21 datasets, covering ten critical NLP tasks, including NER. The testing of the solutions resulted in significant performance improvements for ChatGPT, with instances where it outperformed state-of-the-art models in existing benchmarks [48].

The study by Megahed et al. [49] shows that ChatGPT excels in structured tasks like code translation and explaining established concepts. However, it faces challenges when handling nuanced tasks such as recognizing unfamiliar entities and generating code from scratch.

Recent research findings show that ChatGPT may encounter challenges and limitations in accurately identifying entities, including locations, names, and organizations. Qin et al. [50] present the results of their experiments on the performance of GPT-3.5, ChatGPT, and fine-tuned models on the multi-domain CoNLL dataset for recognizing entities. According to the findings reported in this paper, ChatGPT and GPT-3.5 achieved F1 scores of 53.7% and 53.5%, respectively. Sun et al. [48] investigated the factors contributing to the sub-optimal performance of GPT-based chatbots in NLP tasks, such as NER. They have identified several underlying causes and have proposed a set of generalized modules to mitigate these challenges in different NLP tasks.
Kocoń et al. [6] examined the capabilities of ChatGPT on a diverse set of subjective analytical tasks and objective reasoning tasks, revealing that, when compared to state-of-the-art models, the average quality loss is about 25% in zero-shot and few-shot settings.

As LLM-based chatbots have become widespread, concerns about their own vulnerabilities and their role as tools for cyberattacks have increased [51]. ChatGPT can extend the time-to-conquer or delay attacker timelines, making it valuable for organizations seeking to enhance their cybersecurity posture [52]. Additionally, the performances of the ChatGPT and GPT-3 models were evaluated for vulnerability detection in code [53]. Based on a real-world dataset, this evaluation focuses on binary and multi-label classification tasks related to the common weakness enumeration of vulnerabilities. The findings indicate that ChatGPT does not outperform the baseline classifier in classification tasks for code vulnerability detection [53]. More precisely, the performance of both GPT models was assessed by accuracy, precision, recall, F1 score, and area under the curve. The highest F1 score of 0.67 was achieved using the text-davinci-003 model for binary classification. In contrast, the F1 score of all ChatGPT models remained below 0.53 for multilabel classification.

Sentiment analysis plays a crucial role in extracting user opinions and emotions from textual data to assess threats. Recent research has been directed toward developing sustainable strategies to diminish threats, vulnerabilities, and data manipulation within chatbots, ultimately improving the scope of cybersecurity. To achieve this objective, researchers created an interactive chatbot using the Bot Libre platform (https://www.botlibre.com/) and placed it on social media platforms such as Twitter for the specific purpose of cybersecurity [4]. This study employs a sentiment analysis strategy by deploying chatbots on Twitter and subsequently analyzing Twitter data to anticipate forthcoming threats and cyberattacks.

2.7. The Research Gap

The reviewed papers in this section primarily focused on evaluating and testing the commercial ChatGPT model, with limited attention given to assessing open-source chatbot models in the context of cybersecurity applications. Moreover, to the best of our knowledge, there is a notable absence of comprehensive comparative studies explicitly dedicated to open-source chatbots within the specialized field of OSINT-based CTI. This gap in existing research underscores the need to examine open-source chatbot models, their effectiveness, and their potential contributions to enhancing cyber threat awareness and detection. Although there are specialized state-of-the-art models, including DL models that excel at CTI binary classification and NER tasks [10, 11, 41], no comparative study has been conducted to determine whether LLM-based chatbots can compete with their performance. If they can provide competitive results, CTI tools could be changed to integrate LLM chatbots into the OSINT processing pipeline, decreasing the tools' complexity and maintenance costs. By employing general-purpose LLM chatbots, tasks related to CTI data collection, curation, labelling, and model training and updating would no longer be as necessary as they are for specialized models.

Our study aims to fill this gap by carefully comparing open-source and publicly available paid chatbots in the context of an OSINT-based CTI application, considering two downstream NLP tasks: binary classification and NER.

3. LLM-based chatbots

LLM-based chatbots simulate human-like conversations with users through text or speech interaction. They offer users more intelligent and contextually relevant responses to queries by utilizing LLMs' language comprehension and generation capabilities. This section reviews the eight state-of-the-art commercial and open-source chatbot models available at the time of writing, namely, LLaMA, GPT4all, Dolly 2.0, Stanford Alpaca, Alpaca-LoRA, Vicuna, Falcon, and ChatGPT. We used these models in our experimental evaluation (Section 6) to assess their effectiveness in detecting cybersecurity-related texts and identifying relevant entities.

3.1. LLaMA

LLaMA [37] is a collection of foundation language models ranging from 7 to 65 billion (B) parameters. These were trained on trillions of tokens, demonstrating the possibility of training cutting-edge models using only publicly accessible datasets. Their training method is comparable to that described in prior research [54] and is influenced by the Chinchilla scaling laws [55]. Specifically, LLaMA-13B demonstrated superior performance compared to GPT-3 (175B) across a wide range of benchmarks, whereas LLaMA-65B exhibited similar performance levels to leading models, such as Chinchilla-70B and PaLM-540B. LLaMA models undergo training on substantial textual datasets by employing a conventional optimizer and large-scale transformers [37].

3.2. Vicuna

An open-source chatbot, Vicuna-13B [19], was developed by fine-tuning LLaMA on 70K user-shared conversations gathered from ShareGPT. In Vicuna, gradient checkpointing [56] and flash attention [57] alleviate the memory demand. The Vicuna report states that, like other large language models, it has limitations. For example, it is not adept at tasks requiring logic or mathematics, and may have limitations in correctly identifying itself or assuring the factual accuracy of its outputs [19].

3.3. GPT4all

GPT4all [14] utilizes LLaMA, which operates under a non-commercial license. The data for the assistant come from OpenAI's GPT-3.5-turbo, which has restrictions that prevent the development of models that directly compete with OpenAI in commercial applications.
GPT4all underwent several iterations, with different versions featuring different parameter sizes. While preparing this article, the initial version had 7B parameters; however, the latest iteration used 13B.

3.4. Dolly

Dolly [15] was created by slightly modifying an open-source model with 6B parameters sourced from EleutherAI [58]. These modifications enabled Dolly to possess instruction-following capabilities, such as brainstorming and text generation, which were not initially present in the base model. These modifications are implemented using the data from Alpaca [16]. Subsequently, Dolly-v2-7b emerged as a highly advanced 6.9B parameter causal language model derived from EleutherAI's Pythia-6.9b. Although Dolly-v2-7b may not be considered a state-of-the-art model, it exhibits instruction-following capabilities of remarkably high quality, which are not typically associated with its foundational model. The most recent version available is Dolly-v2-12b [59], a model with 12B parameters developed based on EleutherAI's Pythia-12b. It has been fine-tuned using a dataset called databricks-dolly-15k, which consists of an instruction corpus created by employees of Databricks (https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm).

3.5. Stanford Alpaca

Stanford Alpaca [16] is an instruction-following language model that is fine-tuned from Meta's LLaMA 7B model. It was instructed using 52k self-instruct-style demonstrations [60]. Alpaca displays several shortcomings common to language models, such as hallucination, toxicity, and stereotypes. Specifically, hallucination appears to be a recurring issue in Alpaca [16].

3.6. Alpaca-LoRA

This model [17] reproduces the Stanford Alpaca results using low-rank adaptation (LoRA) [61]. LoRA fine-tuning is a strategy for reducing memory requirements by employing a limited set of trainable parameters, known as adapters, rather than updating all model parameters, which remain constant.
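To illustrate the idea behind LoRA (a conceptual sketch of ours, not the Alpaca-LoRA implementation), the frozen pre-trained weight matrix is augmented with a trainable low-rank update:

```python
# Conceptual LoRA layer: the frozen weight W is augmented with a trainable
# low-rank update B @ A, so only r*(d_in + d_out) parameters are learned.
import numpy as np

class LoRALinear:
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                      # frozen pre-trained weight (d_out, d_in)
        self.A = rng.normal(0, 0.01, (r, W.shape[1]))   # trainable down-projection, rank r
        self.B = np.zeros((W.shape[0], r))              # trainable up-projection, starts at zero
        self.scale = alpha / r                          # scaling factor used by LoRA

    def forward(self, x):
        # Output = W x + scale * B (A x); since B = 0 at init, behaviour is unchanged.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

layer = LoRALinear(np.eye(4))
print(layer.forward(np.ones(4)))                        # identical to W x at initialization
```

Because only A and B are updated during fine-tuning, the optimizer state and gradients are tiny compared to full fine-tuning, which is the memory saving the section describes.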
3.7. ChatGPT

OpenAI's ChatGPT [13] has generated substantial interest and sparked extensive discussion within the NLP community as well as in various other domains. The lack of clarity regarding the training process and architectural specifics of ChatGPT poses a significant obstacle to both research endeavors and the advancement of open-source innovation within this domain. Moreover, the distinction between the ChatGPT API and the web version model should be acknowledged. Recent research shows considerable variability in the performance and behavior of GPT-3.5 and GPT-4 models over time [62]. Another feature of generative AI models such as ChatGPT is that they do not produce repetitive responses to specific prompts [49].

3.8. Falcon

The Falcon family [18] consists of two primary models: Falcon-40B and its smaller counterpart, Falcon-7B. The Falcon-7B and Falcon-40B models underwent training on a corpus of 1.5 trillion and 1 trillion tokens, respectively [18]. A feature of Falcon models is their utilization of multiquery attention [63]. In the vanilla multihead attention scheme, each head is associated with a query, key, and value. However, in the multiquery approach, a single key and value are shared across all heads.
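The saving is easy to quantify. The sketch below (with illustrative dimensions of our choosing, not Falcon's actual configuration) compares the number of key/value projection parameters in the two schemes:

```python
# Parameter counts for K/V projections in multihead vs. multiquery attention
# (illustrative sizes, not Falcon's real configuration).
n_heads, d_model, d_head = 8, 512, 64

# Vanilla multihead attention: every head has its own K and V projection.
mha_kv_params = n_heads * (d_model * d_head) * 2   # 524288

# Multiquery attention: one K and one V shared across all heads; only the
# queries remain per-head, which also shrinks the inference KV cache.
mqa_kv_params = (d_model * d_head) * 2             # 65536

print(mha_kv_params // mqa_kv_params)              # 8x fewer K/V parameters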
4. Chatbots evaluation

We divided this section into three topics relevant to the evaluation: dataset, experimental methodology, and experimental results evaluation criteria. First, we introduce the Twitter dataset, which is the foundational source for generating subsequent prompts. The methodology subsection outlines the techniques and strategies used to assess and compare chatbot performances. Finally, in the evaluation criteria subsection, we explore the metrics and standards by which the chatbots are assessed.

4.1. Dataset

We leveraged a comprehensive Twitter dataset made available by Alves et al. [12] in their work to retrieve Indicators of Compromise (IoCs) from Twitter OSINT. This dataset contains a combined total of 31281 tweets collected during two distinct periods: one from November 21, 2016, to March 27, 2017, and the other from June 1, 2018, to September 1, 2018. After collection, the tweets were filtered using specific keywords and manually labelled as positive or negative, considering their relevance to the cybersecurity of an IT infrastructure, thus creating labelled datasets suitable for supervised learning. Later, Dionísio et al. labeled the dataset [11] for NER and republished it.

The dataset consists of 31281 tweet records, each including the timestamp of the tweet, specific keywords found in the tweet, the original tweet, a pre-processed tweet cleaned from some special characters, a binary label marking the tweet as relevant for cybersecurity or not, and a string identifying the named entities in the pre-processed tweet. Table B1 in Appendix B shows representative examples of dataset records.

In this work, we used the pre-processed tweets, the cybersecurity relevance, and the sequences of named entity tags to create a customized dataset for the binary classification and NER experiments through which the chatbots will be evaluated.
• Find only product version numbers without any product, vulnerability, and company names in the following sentence: 'threatmeter dos microsoft internet explorer 9 mshtml cdisp node::insert sibling node use-after-free ms13-0'. Give the shortest answer, and only use sentence segments in your response.

The second approach, which is called Guide-Line Prompting (GLP), exclusively employed for ChatGPT-4, involves a comprehensive specification of all entities within the 'GUIDELINES_PROMPT' section. This approach attempts to extract seven entities in a single prompt and considers only the ChatGPT model because the GUIDELINES_PROMPT feature is absent in open-source models. In this guideline, we include two examples of tweets from the 11074 NER-tagged tweets in the dataset, each annotated with their respective entities. In addition, we have included the output format as a means to direct ChatGPT's response for subsequent processing. We then sent a dedicated prompt for each of the 11074 tweets and systematically covered all the extracted entities. The prompt guideline is given in Appendix A.

4.3. Evaluation criteria

This section describes our evaluation criteria to assess chatbot models' performance in addressing binary classification and NER tasks. Our evaluation methodology focuses on performance and quality.

4.3.1. Performance

To assess the selected chatbots' performance, we used the F1 score, which is a metric that computes the harmonic mean of Precision and Recall to measure the binary classification performance of a model. It strikes a balance between the proportion of true positive results in all positive predictions and the proportion of true positive results in all actual positives. The F1 score is given by

    F1 = 2 × (Precision × Recall) / (Precision + Recall)    (1)

where Precision is the ratio of correctly predicted positive observations to the total predicted positives, given by

    Precision = TP / (TP + FP),

and Recall measures the ratio of correctly predicted positive observations to all observations in the positive class:

    Recall = TP / (TP + FN).
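For reference, the following snippet computes these three metrics from raw confusion-matrix counts; the sample values are taken from the Dionisio et al. [41] matrix in Appendix C and closely reproduce the corresponding row of Table 1:

```python
# Precision, recall, and F1 from raw counts, following Eq. (1).
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the Dionisio et al. [41] confusion matrix in Appendix C.
print(scores(tp=10290, fp=456, fn=700))  # approximately (0.958, 0.936, 0.947)
```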
4.3.2. Quality

For the classification task, the response quality of the models relates to their ability to provide precise yes or no answers to the prompts they receive. The assessment is based on three distinct response modes that the models eventually produced: no response, precise response, and implicit response.

• No response: instances in which chatbot models failed to respond.
• Precise response: questions for which the chatbot models provided accurate answers precisely aligned with the desired yes or no response.
• Implicit response: answers in which the chatbot's response did not explicitly mention yes or no. However, careful inference from the generated answers revealed that the intended response was either yes or no.

By analyzing these distinct response modes from the output files, we gain valuable insights into the performance and capabilities of the evaluated chatbot models to effectively detect and address cybersecurity-related questions. The experimental results section provides a detailed explanation of the experimental findings, focusing on the quality and accuracy evaluation criteria.

5. Optimal utilization strategies of chatbots

The fundamental principle behind creating prompts is ensuring clear and precise instructions. Prompt engineering [64] is crucial for optimizing the performance of chatbot models by enhancing the clarity and specificity of the given instructions. Thorough prompt design and testing improve the ability of the model to comprehend requests, making it a more effective tool for generating desired outcomes. These approaches enhance the possibility of directing the model to the desired output while decreasing the chances of receiving irrelevant or incorrect responses.

5.1. Prompt engineering approaches

We devoted considerable attention to formulating suitable prompts to maximize accurate and relevant responses from the chatbots during our experiments. The objective was to design prompts that effectively allow models to capture the essence of the CTI task in the information present in tweets. Following the best practices for prompt engineering [58], we progressively refined our prompts using two approaches. For NER, we first considered a prompt guideline template approach, but with poor results (the prompt template is shown in Appendix A). We used a zero-shot methodology for binary classification, considering that LLMs are excellent zero-shot reasoners [65]. Since this approach produced very good results, for the NER tests we transformed the prompt guideline into a sequence of zero-shot questions, one for each entity we aim to extract.
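A minimal sketch of this per-entity decomposition is shown below; the template wording and entity list are illustrative stand-ins, as the exact prompt wording used in the experiments is given in Section 4.2 and Appendix A:

```python
# Build one zero-shot NER question per target entity type for a given tweet.
# Template wording and entity list are illustrative; the deployed prompts are
# documented in the text and in Appendix A.
ENTITIES = ["organization", "product", "version", "vulnerability"]  # subset of the 7

def ner_prompts(tweet):
    return [
        f"Find only the {entity} names in the following sentence: '{tweet}'. "
        "Give the shortest answer, and only use sentence segments in your response."
        for entity in ENTITIES
    ]

for p in ner_prompts("threatmeter dos microsoft internet explorer 9 mshtml"):
    print(p)
```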
To leverage the zero-shot capability of LLMs, we explored the assumption that chatbots perfectly represent the cybersecurity concept. Following this assumption, we designed prompts by starting the question with information on the cybersecurity issue and finishing it with precise instructions on the expected outcome.

The process of selecting the final prompt was iterative and relied on a few trial and error cycles, using the principles described before. Using a sample of dataset entries, we progressively queried the chatbots and evaluated the answers until consistent answers were achieved concerning the instructions given. This refinement cycle resulted in the determination of the final prompt.

By formulating questions such as Is the sentence 'vuln oracle java se cve-2016-5582 remote security vulnerability' related to cybersecurity? Just answer yes or no., we aimed to elicit yes or no responses from the chatbots. Using spaces and adequately employing the apostrophe (') played a significant role in clarifying the prompt.

This methodology allowed us to assess the chatbots' capabilities to detect cybersecurity concepts and their ability to generate meaningful and contextually appropriate responses. Having formulated the desired prompt, we automated the process of sending the entire set of questions to the chatbots.
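The automation itself is conceptually simple. A sketch is shown below (our illustration, with a placeholder ask_chatbot callable standing in for whichever backend is queried, local model or hosted API):

```python
# Sketch of the automation loop: wrap each pre-processed tweet in the final
# prompt and collect one answer per tweet. `ask_chatbot` is a placeholder for
# the backend actually queried (a local model or a hosted API).
import csv

def build_prompt(tweet):
    return (f"Is the sentence '{tweet}' related to cybersecurity? "
            "Just answer yes or no.")

def classify_all(tweets, ask_chatbot):
    answers = []
    for tweet in tweets:
        answers.append((tweet, ask_chatbot(build_prompt(tweet))))
    return answers

def save(answers, path="answers.csv"):
    # Persist (tweet, raw answer) pairs for the later validation step.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(answers)
```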
5.2. Text length control

Chatbots can produce text of varying lengths based on specific tasks. In our experiments, minimizing the number of answer tokens was essential because of the high volume of questions and the time required for the model to answer each question.

The parameter that defines the length of the answer, i.e., the number of generated tokens, in local chatbots is represented as N_predict. It plays a critical role in significantly reducing the execution time for answering questions across all open-source chatbot models. To optimize the execution time, we advise setting the parameter that controls the response length to the smallest suitable value for a specific task. After some initial experimentation, we consistently set the value of N_predict to 15 across all open-source models.

In the context of ChatGPT-3.5-turbo and ChatGPT-4, the max_tokens parameter constrains the length of the responses generated by the model. This is achieved by establishing a predetermined upper limit for the number of tokens, which can be words or characters, within the generated output. Using more extended responses in ChatGPT-4 can increase token consumption, potentially increasing the usage costs. After some initial experimentation with prompts of extreme lengths, the value 70 was assigned to max_tokens in the experimental tests. It is worth noting that the chosen token length of 15 for the open-source chatbots applies exclusively to the generated answers. By contrast, the selected token length of 70 for ChatGPT models encompasses questions and answers.

The careful utilization of the N_predict and max_tokens parameters is of utmost importance, as a low setting may lead to truncation of the response, potentially producing incomplete or nonsensical answers. Balancing the desired response length with the need for completeness and coherence is a crucial factor to consider.
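Expressed as request parameters, the settings described above look roughly as follows. The dictionaries only illustrate where each limit is applied (the paper's N_predict corresponds to the n_predict option of llama.cpp-style local runtimes; max_tokens is the OpenAI API parameter); they are not verbatim client code:

```python
# Illustrative request parameters for the two response-length limits used in
# the experiments (assumed parameter placement; not verbatim client code).
local_request = {
    "prompt": "Is the sentence '...' related to cybersecurity? Just answer yes or no.",
    "n_predict": 15,   # open-source models: cap applied to generated tokens only
}

openai_request = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "..."}],
    "max_tokens": 70,  # ChatGPT models: the value chosen after experimentation
}
```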
are 97.32% and 94.10%, respectively. The GPT4all model
delivered conceptual yes or no responses to 378 questions,
6. Results and Discussion accounting for 1.2%, whereas Dolly produced responses for
In this section, we present the results of the empirical 839 questions, constituting 2.68%. Interestingly, however,
assessment of chatbots’ thorough evaluations of their ca- 3 Upper limit for the range of tokens considered to provide an answer.
pabilities across multiple dimensions. First, we discuss the
9
Interestingly, however, neither of these models explicitly used the words yes or no in these conceptual answers. For example, the conceptual answers were: it is related to cybersecurity, or the sentence is not related to cybersecurity.

In terms of accuracy, we explain the results of the tests conducted on the chatbot models. Table 1 provides an overview of the tests conducted for each chatbot and its version. Moreover, we recorded the execution time of the models' responses for all the questions. The table includes details regarding the number of parameters on which the model was trained, achieved F1 score, precision, recall, and execution time for each model. The confusion matrices corresponding to each model are provided in Appendix C. By analyzing the F1 score values in Test 1, we can assess the effectiveness of the models in accurately responding to the given questions.

Table 1
Accuracy of chatbot models for cybersecurity binary classification

Model Test Number Parameters Precision Recall F1 score Execution Time
ChatGPT-3.5-turbo (16k context) [13] Test 1 175B 0.9570 0.9280 0.9431 11h 23m
ChatGPT-3.5-turbo (16k context) [13] Test 2 175B 0.9700 0.9200 0.9489 11h 23m
ChatGPT-3.5-turbo (16k context) [13] Test 3 175B - - UECH -
ChatGPT-4 (8k context) [13] Test 1 1.7T 0.9580 0.9240 0.9410 11h 50m
ChatGPT-4 (8k context) [13] Test 2 1.7T 0.9590 0.9230 0.9403 11h 43m
ChatGPT-4 (8k context) [13] Test 3 1.7T - - UECH -
GPT4all [14] Test 1 13B 0.9490 0.8630 0.9049 132h 05m
GPT4all Test 2 13B 0.9490 0.8410 0.8927 132h 02m
GPT4all Test 3 13B 0.9470 0.8280 0.8844 136h 05m
Dolly 2.0 [15] Test 1 7B 0.8890 0.8000 0.8470 10h 38m
Dolly 2.0 Test 1 12B 0.9470 0.7900 0.8612 10h 16m
Dolly 2.0 Test 2 12B 0.9480 0.7910 0.8631 10h 00m
Dolly 2.0 Test 3 12B - - - LET
Falcon [18] Test 1 7B 0.8120 0.8500 0.8304 16h 02m
Falcon Test 1 40B 0.8980 0.8200 0.8511 54h 03m
Falcon Test 2 40B 0.8990 0.8080 0.8502 54h 55m
Falcon Test 3 40B 0.8990 0.7880 0.8330 71h 10m
Alpaca-LoRA [17] Test 1 65B 0.8980 0.7940 0.8477 10h 12m
Alpaca-LoRA Test 2 65B 0.8990 0.8000 0.8451 10h 44m
Alpaca-LoRA Test 3 65B 0.8980 0.7610 0.8241 11h 20m
Stanford Alpaca [16] Test 1 7B 0.2260 0.5000 0.3112 13h 03m
Stanford Alpaca Test 1 13B 0.3240 0.6000 0.4209 13h 21m
Stanford Alpaca Test 1 30B 0.6980 0.6050 0.6415 15h 48m
Stanford Alpaca Test 2 30B 0.6990 0.5920 0.6401 15h 04m
Stanford Alpaca Test 3 30B 0.6990 0.5810 0.6395 16h 18m
Vicuna [19] Test 1 13B 0.4390 0.3100 0.3611 11h 23m
Dionisio et al. [41] Test 1 - 0.9570 0.9363 0.9470 00h 43m
* LET: Long Execution Time. * UECH: Uncertainty of Erasing Conversation History.

Based on the results presented in Table 1, it is evident that GPT4all achieved the highest accuracy among the open-source models, as indicated by its F1 score of 0.90. The Dolly model has an accuracy of 0.86, Falcon of 0.85, Alpaca-LoRA of 0.84, and the Stanford Alpaca model has a score of 0.64. Although GPT4all achieved the highest accuracy among open-source chatbots, it is noteworthy that the commercial ChatGPT models (GPT-4 and GPT-3.5-turbo) achieved an F1 score of 0.94. ChatGPT-3.5-turbo with a 16k context window size achieves the same F1 score as ChatGPT-4 with an 8k context window size. These results highlight the better accuracy of the GPT4all and ChatGPT models, emphasizing their effectiveness for this particular task.
Table 2
Comparison of NER task accuracy achieved by the ChatGPT-4 (8k context) model using two different approaches.

Approach Number of Questions Entity F1 score Execution Time
ESP 11074 Organization 0.36 4h 02m
ESP 11074 Version 0.43 4h 23m
GLP 11074 All entities 0.10 3h 09m

Table 3
Illustrative chatbot responses demonstrating NER identification for two distinct prompts.

Model Entity Model's Response
ChatGPT-4 (8k context) Organization Microsoft
ChatGPT-4 (8k context) Version 9
GPT-3.5-turbo Organization The name of organizations in the given sentence is "Microsoft".
GPT-3.5-turbo Version 9, ms13-0
Dolly 2.0 Organization Microsoft Internet Explorer 9
Dolly 2.0 Version ms13-0 is 9.0.8112.16421
GPT4all Organization Microsoft, Mozilla (Firefox), and Google Chrome
GPT4all Version 546

requested questions. This F1 score exceeds the corresponding value for identifying the 'organization name' entity. Our analysis for the ESP approach is based on the ChatGPT version released on July 13th, 2023. The results in the last line of Table 2 indicate that the calculated F1 score for 11074 questions is unexpectedly low, reaching 0.10. The findings demonstrate a significant deviation from the current NER outcomes as presented in the [41] study, which achieved an F1 score of 0.94. Our GLP approach evaluation is based on the ChatGPT released on August 2, 2023.

A representative example of the responses generated by each model is provided in Table 3. In the two NER prompts mentioned in Section 4.2, the annotations in the dataset show 'Microsoft' as an organization entity and '9' as a product version. The phrase 'without any product, vulnerability, and company names' in the version prompt plays a crucial role in constraining the models' interpretation of version numbers. ChatGPT-4 demonstrated precision in identifying 'Microsoft' as B-ORG and '9' as B-VER, accurately matching the annotations in the dataset. By contrast, the other models exhibited varying degrees of accuracy, failing to reach the level of correctness achieved by ChatGPT-4 in these specific instances.

6.3. Challenges and limitations

Chatbots present opportunities in numerous applications but also face challenges and limitations that hinder their effective utilization. In this study, we outline several challenges and limitations encountered during our experiments, which are detailed in this section. Additionally, we faced specific limitations when applying chatbots to the specialized CTI tasks, leading to issues with timeliness that are elaborated upon in Section 6.3.1. A common challenge to binary classification and NER was the generation of effective prompts, which required progressive refinement until satisfactory performance was achieved. This section explores the challenges and limitations encountered in the binary classification and NER tasks.

6.3.1. Binary classification

In Section 4.2.1, we discuss how each chatbot model displayed unique response behaviors when answering questions. This necessitated a cleaning step after collecting responses to ensure answer consistency, particularly because our goal was to obtain a precise binary (yes or no) response. However, some responses implied a 'no' or a 'yes' without explicitly using these words. For instance, a model might answer the question, 'This is not related to cybersecurity'. Consequently, it is crucial to review and validate the answers in the output file. Our evaluation process involved two key steps to ensure reliability and accuracy. First, we implemented an automated validation method for each output file to confirm the presence of 'yes' or 'no' responses at the beginning of the responses. This helped to filter out potentially incorrect answers, such as those lacking an explicit 'no' or 'yes'. Second, we conducted a thorough manual review of each response to confirm its accuracy and alignment with the expected answers. In cases such as the aforementioned example, we manually annotated the response with 'no' to correctly classify it. This comprehensive manual validation adds a crucial layer of scrutiny, bolstering the accuracy and reliability of the results.
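A sketch of such a validation pass is shown below (our illustration; the implicit-answer patterns are examples, not the complete set we handled, and a production pass would need more guards):

```python
# Sketch of the automated validation step: keep answers that start with an
# explicit yes/no, map known implicit phrasings, and route the rest to
# manual review. Patterns shown are illustrative examples only.
def normalize(answer):
    a = answer.strip().lower()
    if a.startswith("yes"):
        return "yes"
    if a.startswith("no"):                     # note: a real pass should guard
        return "no"                            # against words like "note"
    if "not related to cybersecurity" in a:    # implicit negative
        return "no"
    if "is related to cybersecurity" in a:     # implicit positive
        return "yes"
    return None  # neither explicit nor implicit: flag for manual review

assert normalize("No, it is unrelated.") == "no"
assert normalize("This is not related to cybersecurity.") == "no"
```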
While this comprehensive validation process enhances the accuracy of our binary classification, it also introduces another challenge: the need for timely response processing. Timeliness is a critical factor in the use of chatbot models for binary classification tasks, particularly in CTI applications. The demand for real-time processing is essential, as any delay in classifying responses can significantly impact decision-making.

6.3.2. Named entity recognition

When employing chatbots in NER tasks for cybersecurity purposes, we observed various limitations, mainly in providing precise and relevant results. While chatbots powered by pre-trained language models excel at understanding natural language and utilizing general knowledge, they frequently encounter challenges when dealing with domain-specific precise entity recognition [32]. This shortcoming is attributed mainly to the intrinsic complexities of NER, which demand a profound understanding of context, domain knowledge, and syntactic intricacies.

A recurrent issue in chatbot-assisted NER is the generation of unspecific answers that fail to accurately identify precise entities in a given sentence. This often leads to generalized responses that lack the precision essential for obtaining reliable NER results. Consider this prompt as an example: Find the name of organizations in the following sentence: 'senator calls on us government to start killing adobe flash now tripwire'. Give the shortest answer, and only use sentence segments in your response. The ChatGPT-4 response was 'US Government, Adobe, Tripwire'. Such challenges can be traced back to factors such as the limitations inherent in the underlying language models or the absence of dedicated fine-tuning tailored to NER tasks. Another phenomenon, hallucination, as discussed in Section 2.6, also arises from these compounded challenges. As an example of precise hallucination, GPT4all mistakenly extracted '546' as a product version, a number that did not appear at all in the product version prompt mentioned in Section 4.2.2.

6.4. Discussion

The evaluation of chatbot models involves two critical aspects. First, it involves a deep understanding of effective methods for interacting with the models. This includes considering timeliness, which is particularly important when integrating chatbots into real-time systems connected to Twitter, such as SYNAPSE [10]. Second, evaluation requires the skill of writing structured and clear prompts for chatbots, ensuring that they produce precise and relevant responses. The ability to compose clear and concise prompts that guide the chatbot to provide an anticipated response is crucial. The preceding sections explain that exercising control over the response length is also crucial. Long answers not only extend execution times but also impose an additional workload on human resources for response validation. On the other hand, short responses might not contain enough meaningful content.

Automated interaction with chatbots for tasks such as binary classification and NER is another area that imposes specific requirements. This includes the ability to automatically generate prompts that are contextually relevant and precise. The system must also interpret and adapt to varying response formats and manage the complexities of the different types of data inputs. Additionally, the chatbot's ability to handle ambiguity in natural language while maintaining the accuracy of its responses is essential. This requires sophisticated algorithms that are capable of understanding subtle differences in language and context.

Although our evaluation primarily focused on a specific Twitter dataset, it is worth noting that numerous other OSINT resources, such as blog posts and security forums (even on the dark web), remain unexplored. Moreover, it is important to note that a manual review of chatbot responses following the automated checking phase may introduce potential human errors into the assessment process. Further, such verification is time-consuming and thus not feasible in the day-to-day operation of a security operations centre.

7. Conclusion

We assessed the capabilities of open-source and paid LLM-based chatbots to recognize cybersecurity-related tweets and extract pertinent information from them. Both types of models can perform similarly to specialized models trained specifically for the binary classification task of identifying cybersecurity-related tweets, often achieving the same level of performance. On the contrary, the chatbot models' performance is still very poor on named entity recognition to extract security elements from tweets. Even though they are trained on vast datasets, these models did not perform comparably to specialized models on the test data.

Our results highlight the need for further research and refinement in the application of LLM chatbot models to extract threat indicators from open-source intelligence. Although open-source and paid chatbots compared evenly with specialized trained models on cybersecurity binary text classification, they fell below acceptable performance on named entity recognition. Moreover, they cannot compete with specialized models on timeliness and cost.

Based on our study, we identify various possibilities for future work based on the following research questions:

1. How can LLM chatbots be further optimized for cost-effective real-time CTI detection on social media platforms?
2. How can the NER capability of chatbots be improved for the extraction of indicators of compromise?
3. How can cybersecurity specialists' feedback be used to increase the efficiency and cost-effectiveness of open-source chatbots?

Acknowledgement

This work is funded by the European Commission through the SATO Project (H2020/IA/957128) and by FCT through the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020).
[44] M. Gupta, C. Akiri, K. Aryal, E. Parker, L. Praharaj, From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy, IEEE Access (2023).
[45] M. Al-Hawawreh, A. Aljuhani, Y. Jararweh, Chatgpt for cybersecurity: practical applications, challenges, and future directions, Cluster Computing 26 (2023) 3421–3436.
[46] T. McIntosh, T. Liu, T. Susnjak, H. Alavizadeh, A. Ng, R. Nowrozy, P. Watters, Harnessing GPT-4 for generation of cybersecurity GRC policies: A focus on ransomware attack mitigation, Computers & Security 134 (2023) 103424.
[47] Conversational AI Platform | Superior Customer Experiences Start Here. URL: https://rasa.com/, last visited: February 13, 2024.
[48] X. Sun, L. Dong, X. Li, Z. Wan, S. Wang, T. Zhang, J. Li, F. Cheng, L. Lyu, F. Wu, et al., Pushing the limits of chatgpt on nlp tasks, preprint arXiv:2306.09719 (2023).
[49] F. M. Megahed, Y.-J. Chen, J. A. Ferris, S. Knoth, L. A. Jones-Farmer, How generative ai models such as chatgpt can be (mis)used in spc practice, education, and research? an exploratory study, Quality Engineering (2023) 1–29.
[50] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chatgpt a general-purpose natural language processing task solver?, preprint arXiv:2302.06476 (2023).
[51] A. Qammar, H. Wang, J. Ding, A. Naouri, M. Daneshmand, H. Ning, Chatbots to chatgpt in a cybersecurity space: Evolution, vulnerabilities, attacks, challenges, and future recommendations, preprint arXiv:2306.09255 (2023).
[52] F. McKee, D. Noever, Chatbots in a honeypot world, preprint arXiv:2301.03771 (2023).
[53] A. Cheshkov, P. Zadorozhny, R. Levichev, Evaluation of chatgpt model for vulnerability detection, preprint arXiv:2304.07232 (2023).
[54] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[55] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal large language models, preprint arXiv:2203.15556 (2022).
[56] T. Chen, B. Xu, C. Zhang, C. Guestrin, Training deep nets with sublinear memory cost, preprint arXiv:1604.06174 (2016).
[57] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and memory-efficient exact attention with io-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359.
[58] EleutherAI, 2023. URL: https://www.eleuther.ai.
[59] databricks/dolly-v2-12b · Hugging Face, 2023. URL: https://huggingface.co/databricks/dolly-v2-12b.
[60] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-instruct: Aligning language model with self generated instructions, preprint arXiv:2212.10560 (2022).
[61] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, preprint arXiv:2106.09685 (2021).
[62] L. Chen, M. Zaharia, J. Zou, How is chatgpt's behavior changing over time?, preprint arXiv:2307.09009 (2023).
[63] N. Shazeer, Fast transformer decoding: One write-head is all you need, preprint arXiv:1911.02150 (2019).
[64] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35.
[65] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in neural information processing systems 35 (2022) 22199–22213.
[66] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al., Deepseek llm: Scaling open-source language models with longtermism, preprint arXiv:2401.02954 (2024).
Fig. A1: GLP NER approach: employing a ChatGPT-4 guideline prompt template
Table B1
Samples from the 31281 tweet entries in the dataset

Record 1
  timestamp:           2018-07-24 01:00:46+00:00
  keywords:            oracle
  original tweet:      RT Oracle: Learn to use and understand #Oracle's Internet Intelligence Map https://t.co/l06Nyf1FFF Dyn https://t.co/uzozFKwm97
  pre-processed tweet: rt oracle learn to use and understand oracle s internet intelligence map dyn
  relevance:           0
  entities:            -

Record 2
  timestamp:           2016-12-09 19:19:38+00:00
  keywords:            internet explorer
  original tweet:      threatmeter: [dos] - Microsoft Internet Explorer 9 MSHTML - CDisp Node::Insert Sibling Node Use-After-Free (MS13-0... https://t.co/gLvEwpDL9v
  pre-processed tweet: threatmeter dos microsoft internet explorer 9 mshtml cdisp node::insert sibling node use-after-free ms13-0
  relevance:           1
  entities:            O O B-ORG B-PRO I-PRO B-VER O O O O O B-VUL B-ID
Appendix A.
The template for the guideline prompt used in the ChatGPT GLP NER approach is shown in Fig. A1.
Appendix B.
As described in Section 4.1, for each collected tweet a dataset entry is generated, including timestamp, keywords, original
tweet, pre-processed tweet, cybersecurity relevance binary label, and sequence of named entities in the pre-processed tweet.
Table B1 presents two examples of dataset entries. In the relevance column, '1' denotes an entry considered relevant for cybersecurity, while '0' means otherwise. The last column shows the tags used to label the different NER entities.
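As a complement, the small helper below (ours, not from the paper) decodes such a tag string back into typed entity spans, which is how the token-aligned labels translate into the entities discussed in the text:

```python
# Decode a BIO tag sequence (as in Table B1) into (entity_type, text) spans.
def decode_bio(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):               # a new entity begins
            if current: spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current: # continuation of the entity
            current[1].append(token)
        else:                                  # outside any entity
            if current: spans.append(current)
            current = None
    if current: spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = "threatmeter dos microsoft internet explorer 9 mshtml".split()
tags = ["O", "O", "B-ORG", "B-PRO", "I-PRO", "B-VER", "O"]
print(decode_bio(tokens, tags))
# [('ORG', 'microsoft'), ('PRO', 'internet explorer'), ('VER', '9')]
```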
Appendix C.

The confusion matrices are provided for the test and model combinations considered, excluding the 7B parameter models in Table 1, which achieved the worst results in their respective groups. The rows in each matrix correspond to the actual expected result (0, then 1), whereas the columns show the predicted results (0, then 1); each matrix is written as [[TN, FP], [FN, TP]]. The tests are listed in the same order as in Table 1.

Dolly 2.0 (12B): Test 1: [[19719, 488], [2348, 8726]]; Test 2: [[19726, 481], [2314, 8760]]
Falcon (40B): Test 1: [[19545, 1006], [1871, 8859]]; Test 2: [[19314, 993], [2137, 8837]]; Test 3: [[19451, 956], [2369, 8505]]
Alpaca-LoRA (65B): Test 1: [[19208, 999], [2281, 8793]]; Test 2: [[19425, 982], [2137, 8737]]; Test 3: [[19460, 947], [2535, 8339]]
Stanford Alpaca (30B): Test 1: [[17704, 2803], [4296, 6478]]; Test 2: [[17632, 2775], [4429, 6445]]; Test 3: [[17436, 2771], [4640, 6434]]
Vicuna (13B): Test 1: [[16362, 4245], [7352, 3322]]
Dionisio et al. [41]: Test 1: [[19835, 456], [700, 10290]]