
1 Introduction

Deception detection is a vital research problem studied across fields that involve human interaction. The discrimination between truth and lies has drawn significant attention from fields as diverse as psychology, forensic science, and sociology.

In this paper, we target the detection of deception in interviews from real-world crime interrogations. In cooperation with the National Investigation Bureau, we explore the effect of state-of-the-art natural language processing techniques on deceptive language detection. Specifically, we utilize the transcript files of polygraph tests conducted during interrogation. The polygraph test procedure has three phases:

  1.

    Pre-test phase

    In the pre-test phase, the polygraph is not used. The pre-test starts with the interrogator interviewing the subject. Here, the interrogator asks the subject questions that are directly related to the case (“related questions”) and questions that are not directly related to the case (“control questions”). Examples of a related question and a control question are “In this case, did you attack the defendant first?” and “In your experience, have you ever made any mistake but lied and did not admit it?” (both translated from Chinese). These questions may be open-ended or closed-ended. During the conversation, the interrogator examines the behavior, body language, and speech of the subject. During this phase, the interrogator also decides what questions should be asked when the polygraph is connected. This phase can take from 30 to 90 minutes.

  2.

    Test phase

    The pre-test phase leads to the test phase, during which the polygraph is actually used. The interrogator first explains to the subject how the test will be conducted, introducing how the polygraph works and the closed-ended questions that the subject will be asked later. The subject is then moved to the polygraph room and connected to the polygraph machine. The polygraph machine measures the subject’s respiration, heart rate, blood pressure, and perspiration. Following the hook-up, the interrogator asks the subject a series of closed-ended (i.e., yes/no) questions, for example, “Did you steal the money?”. The subject is expected to answer with either a “yes” or a “no”. The polygraph records the physiological responses during this phase. Once the questioning is over, the subject is disconnected from the polygraph machine.

  3.

    Post-test phase

    The last phase is the post-test phase. During this phase, the interrogator analyzes the results of the pre-test and test phases and makes a decision regarding the truthfulness of the subject.

For methodology, we define our task as a binary classification problem: predicting whether a subject is deceptive or innocent based on his/her interview transcript with the interrogator. With the aid of the CKIP parser [1], fastText pre-trained word vectors, and the BERT pre-trained model [2], we propose four different modeling systems with fully-connected and/or LSTM [3] neural networks to perform prediction based on the encoded transcript data.

CKIP is an open-source library capable of performing natural language processing tasks on Chinese sentences, such as word segmentation and part-of-speech tagging. fastText is a lightweight library for text representation. Its pre-trained model, trained on the Common Crawl and Wikipedia corpora, is able to capture hidden information about a language such as word analogies or semantics. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a state-of-the-art contextual embedding model that can turn a sentence into a corresponding vector representation. LSTM (Long Short-Term Memory) is a special kind of recurrent neural network (RNN) structure that is capable of learning long-term dependencies. It is well suited to processing sequence data such as conversations, since the meaning of a word in a sentence usually depends on the preceding words.

We apply four different methods, namely (1) part-of-speech extraction, (2) one-hot-encoding, (3) mean of word embedding vectors, and (4) the BERT model, to each utterance of the interrogator and the subjects. After that, we use a hierarchical method to aggregate their hidden representations and generate a single prediction label indicating deceptiveness or honesty.

2 Background

Even though the literature indicates that many types of deception can be identified because the liar’s verbal and non-verbal behavior differs considerably from the truth teller’s [4], the reported performance of human lie detectors rarely rises above chance level [5]. Because people find it so difficult to distinguish lies from truths, datasets are usually annotated by the people who produce the statements rather than by those who receive them, which contributes to the lack of real data.

Recent advances in natural language processing motivate attempts to recognize deceptive language automatically. Researchers have explored this possibility in gaming [6, 7], news articles [8], interviews [9], and criminal narratives [10]. However, some of the previous works conducted their experiments in artificial settings or required hand-crafted features, which might introduce human bias. These issues may make the developed models inapplicable to real situations.

In this paper, we use data from real-world crime interrogations together with modern natural language processing techniques to address the task of predicting deceptiveness or honesty from interrogation transcript files.

3 Dataset

In cooperation with the National Investigation Bureau, we have 496 transcript files of polygraph tests conducted during interrogation. Each transcript file consists of a field indicating whether the case was judged as lying or not, together with 220 conversation entries on average. Each conversation entry contains, in addition to the utterance of the interrogator or the subject, two additional numbers. One of them indicates whether the entry is a question or an answer, while the other denotes whether the entry belongs to a related question or a control question. Table 1 illustrates two sample conversation entries in a transcript file.

Table 1. Two sample conversation entries of a transcript file. The original is written in Chinese; we translate it into English for demonstration purposes.

Note that when we say a “transcript file,” it stands for a file that contains “sentences,” each of which consists of “words.” A transcript file corresponds to an interrogation case. The dataset contains 496 transcript files comprising 226 deceptive cases and 270 honest cases. In total, there are about two million characters. To parse the Chinese content, we utilize the CKIP parser to extract the part-of-speech information and segment each sentence into word-level tokens. There are 24,853 unique Chinese and English words, numbers, and punctuation marks in our dataset.
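To make the data layout concrete, the sketch below shows one possible way to represent a conversation entry and a transcript file in Python. The field names are our own illustrative choices and are not taken from the original files.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConversationEntry:
    """One line of dialogue in a transcript; field names are illustrative."""
    text: str           # the utterance, later segmented into words by the CKIP parser
    is_answer: bool     # True if the entry is an answer, False if it is a question
    is_related: bool    # True for a related question, False for a control question

@dataclass
class Transcript:
    """One interrogation case (about 220 entries on average)."""
    entries: List[ConversationEntry]
    deceptive: bool     # label indicating whether the case was judged as lying
```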

The part-of-speech information can be fed into a neural network classifier (detailed in the following “Part-of-speech Extraction” section). With the word segmentation, we can convert each tokenized sentence into a vector representation. These representation vectors can then be fed into our LSTM model to encode the transcript file. We also utilize the BERT pre-trained embedding model to encode transcript files directly. Finally, we use the encoding of each transcript file to predict whether a subject is deceptive or not. Details of how we convert sentences into vectors are elaborated in the next section.

4 System Overview

4.1 Part-of-Speech Extraction

In this method, we hypothesize that if a person lies in a conversation, he/she will use more words that express contrast, e.g., “however,” “but,” “nevertheless.” Furthermore, people who are deceptive are more likely to describe an event from a third-person point of view to distance themselves from it. In short, we believe that if a person lies, there may be patterns in the words he/she uses. We extract part-of-speech tags with the CKIP parser from each transcript file and count the occurrences of each part-of-speech tag. We then take these counts as the input features and feed them into a fully-connected linear binary classifier. Table 2 shows a sample result of part-of-speech extraction.

Table 2. A sample result of part-of-speech extraction. It shows pairs of a Chinese word and its corresponding part-of-speech tag. The Chinese sentence means: “In this case, did you attack the defendant first?”
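A minimal sketch of this feature extraction, assuming the sentences have already been segmented and tagged (e.g., by the CKIP parser). The tag inventory, the example word/tag pairs, and the single-layer classifier are illustrative assumptions rather than the exact configuration used in the paper.

```python
from collections import Counter

import torch
import torch.nn as nn

# Hypothetical tag inventory; the real CKIP tag set is larger.
POS_TAGS = ["Na", "Nb", "Nh", "VA", "VC", "Cbb", "D", "P"]

def pos_count_features(tagged_transcript):
    """tagged_transcript: list of (word, pos_tag) pairs from one transcript file."""
    counts = Counter(tag for _, tag in tagged_transcript)
    return torch.tensor([float(counts[tag]) for tag in POS_TAGS])

# Fully-connected linear binary classifier over the part-of-speech counts.
classifier = nn.Sequential(nn.Linear(len(POS_TAGS), 1), nn.Sigmoid())

# Toy example: three (word, tag) pairs standing in for a parsed transcript.
features = pos_count_features([("我", "Nh"), ("沒有", "D"), ("攻擊", "VC")])
p_deceptive = classifier(features)  # probability that the subject is deceptive
```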

Apart from the “Part-of-speech Extraction” method mentioned above, we generally take the following steps for each of the structures we propose to address the task:

  1.

    Embed each sentence in transcript files into a vector representation,

  2.

    Encode each transcript file into vector format based on the sentence-level encoded vector obtained from the previous step, and

  3.

    Train the binary classifier by leveraging vectors obtained from step two.

For step one mentioned above, we utilize three different methods (detailed in the following sections) to embed hidden information into representation vectors:

  • One-hot-encoding

  • Mean of embedding vectors

  • BERT model

Although these methods encode a transcript file differently, we use the same hierarchical neural-network structure, as depicted in Fig. 1, to perform classification and prediction. Additionally, we take the same steps to train the neural networks. The following sections give the details of how sentences are embedded.

4.2 One-Hot-Encoding

We apply a one-hot-encoding process to encode each sentence of a transcript file into a one-hot vector, as described below:

  1.

    Extract all the words that appear in the transcript files, including questions from the interrogator and answers from the subject, into a vocabulary set.

  2.

    For each sentence, prepare a vector with as many elements as there are words in the vocabulary set. Each element corresponds to a word in the vocabulary set and is initially assigned 0.

  3.

    Set an element to 1 if its corresponding word appears in the sentence.

  4.

    Finally, we take the vector containing zeros and ones as the representation vector of a sentence.

For example, assume we have a vocabulary set containing the words this, is, an, apple, a, pen, and assume each word corresponds to index 0 to 5 of a vector. Then we can encode the sentence “this is an apple” to a vector containing {1, 1, 1, 1, 0, 0}, while the sentence “this is a pen” will be encoded to a vector with the values {1, 1, 0, 0, 1, 1}.
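A minimal sketch of this encoding, using the toy vocabulary from the example above:

```python
VOCAB = ["this", "is", "an", "apple", "a", "pen"]
WORD_TO_INDEX = {word: index for index, word in enumerate(VOCAB)}

def encode_sentence(words):
    """Return a 0/1 vector marking which vocabulary words appear in the sentence."""
    vector = [0] * len(VOCAB)
    for word in words:
        if word in WORD_TO_INDEX:
            vector[WORD_TO_INDEX[word]] = 1
    return vector

print(encode_sentence("this is an apple".split()))   # [1, 1, 1, 1, 0, 0]
print(encode_sentence("this is a pen".split()))      # [1, 1, 0, 0, 1, 1]
```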

After converting all the sentences in the transcript files with the process mentioned above, we feed the resulting vectors into an LSTM network sequentially, taking the last hidden state as the encoding of a transcript file. As for the binary classifier, we use a fully-connected linear neural network followed by a sigmoid activation function. We input the last hidden state of the LSTM network to the classifier and take the output as the prediction of the system. Figure 1 depicts the process.

Fig. 1.

This figure describes how we process data after obtaining vector representations of sentences. The vector representation of each sentence is fed into an LSTM network. The last hidden state of the LSTM network is then forwarded into the binary classifier, which is made up of a fully-connected neural network with a sigmoid activation function.
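The structure in Fig. 1 could be sketched as follows, assuming PyTorch; the hidden size and the choice of one transcript per batch are our own placeholders, since the paper does not report these details.

```python
import torch
import torch.nn as nn

class TranscriptClassifier(nn.Module):
    """Sentence vectors -> LSTM -> last hidden state -> fully-connected layer -> sigmoid."""

    def __init__(self, sentence_dim, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(sentence_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, sentence_vectors):
        # sentence_vectors: (1, num_sentences, sentence_dim), one transcript at a time
        _, (last_hidden, _) = self.lstm(sentence_vectors)
        return self.classifier(last_hidden[-1])   # probability of deception

model = TranscriptClassifier(sentence_dim=300)
transcript = torch.randn(1, 220, 300)   # e.g. 220 sentence vectors of dimension 300
p_deceptive = model(transcript)
```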

4.3 Mean of Embedding Vectors

For each sentence in a transcript, we collect all the words appearing in it and use the fastText pre-trained model to encode them into vectors. Next, we calculate the element-wise mean of these vectors and concatenate the result with two additional numbers, which respectively indicate whether the entry is a question or an answer and whether it belongs to a related question or a control question. We assign 1.0 to the first number if the entry is an answer and 0.0 if it is a question. Likewise, we assign 1.0 to the second number if the entry is a related question and 0.0 if it is a control question. Figure 2 illustrates the concept. After converting each sentence to a vector, we input them into the neural network described in Fig. 1.

Fig. 2.

This figure illustrates how we convert a sentence in a transcript file into vector format, which can then be the input of the following LSTM network, with the “Mean of Embedding Vectors” method. Although the sentences here are in English, the same steps can be applied to any language to obtain the sentence vector, as long as the sentence is parsed into words.
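A rough sketch of this step, assuming the fastText Python package and its pre-trained Chinese Common Crawl/Wikipedia vectors; the file name and the encoding of the two flags follow the description above.

```python
import numpy as np
import fasttext

# Assumes the pre-trained Chinese vectors (cc.zh.300.bin) have been downloaded locally.
ft_model = fasttext.load_model("cc.zh.300.bin")

def sentence_vector(words, is_answer, is_related):
    """Element-wise mean of the word vectors, concatenated with the two entry flags."""
    word_vectors = [ft_model.get_word_vector(word) for word in words]
    mean_vector = np.mean(word_vectors, axis=0)
    flags = np.array([1.0 if is_answer else 0.0, 1.0 if is_related else 0.0])
    return np.concatenate([mean_vector, flags])   # dimension 300 + 2
```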

4.4 BERT Model

BERT is a contextual embedding model. It captures both the meaning of a word and the information of its surrounding context. Unlike the fastText pre-trained model, which addresses the embedding task at the word level, the BERT model can produce sentence-level embeddings. Therefore, we can use the BERT pre-trained model to encode each sentence of a transcript directly.

Next, we follow the same procedure as in the previous structure to concatenate the two additional numbers, obtain the encoding of each transcript, and train the linear classifier. Figure 3 illustrates how we obtain a sentence vector, which is the input of the LSTM network.

Fig. 3.

The figure illustrates how each sentence in a transcript is encoded with the aid of the BERT pre-trained model.
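A minimal sketch of the sentence-encoding step using the Hugging Face transformers library; the bert-base-chinese checkpoint and the use of the [CLS] hidden state as the sentence vector are our assumptions, as the paper does not state which variant or pooling it uses.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the paper does not name the exact pre-trained variant.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def bert_sentence_vector(sentence):
    """Encode one sentence into a fixed-size vector via the [CLS] hidden state."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, 768), the [CLS] representation
```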

5 Experiments

We use pairs of a transcript representation vector and its corresponding label as the ground truth to train the classifier. There are 496 cases in our dataset, consisting of 226 deceptive cases and 270 honest cases.

We split our dataset into a training set, a validation set, and a testing set. To make our experiment more reliable, we use cross-validation with stratification based on class labels (deceptive and honest). We split our dataset into ten splits, taking one of them as the validation set and another as the testing set, while the remaining splits are aggregated to form the training set. With stratified sampling, the training and validation sets contain approximately the same percentage of deceptive/honest cases.
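A sketch of this split using scikit-learn's StratifiedKFold is shown below; the choice of tool and the rotation scheme for the validation split are our own assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([1] * 226 + [0] * 270)       # 226 deceptive and 270 honest cases
dummy_features = np.zeros((len(labels), 1))    # placeholder; only the labels matter here

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
splits = [test_idx for _, test_idx in skf.split(dummy_features, labels)]

for i in range(len(splits)):
    test_idx = splits[i]
    val_idx = splits[(i + 1) % len(splits)]    # a second split held out for validation
    train_idx = np.concatenate(
        [splits[j] for j in range(len(splits)) if j not in (i, (i + 1) % len(splits))]
    )
    # train on train_idx, tune hyper-parameters on val_idx, evaluate on test_idx
```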

In addition, we train our models with transcripts that contain (1) control questions only, (2) related questions only, and (3) both control and related questions, to compare the impact of the various types of conversation entries. Furthermore, we randomly generate embedding vectors as a baseline to perform a sanity check, ensuring that our embedding vectors actually extract hidden information from the conversations in our dataset.

To train the classifier, we use one of the following optimizers: Adadelta [11], Adam [12], RMSprop [13], or SGD with momentum [14], together with binary cross-entropy loss, and apply dropout [15]. We perform a grid search based on the validation error to pick the best hyper-parameters and optimizer. The results are reported on the testing set.
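A condensed sketch of the training setup, again assuming PyTorch; the learning rate, number of epochs, and the placement of dropout are placeholders, as the actual values were chosen by the grid search described above.

```python
import torch
import torch.nn as nn

def train(model, training_pairs, num_epochs=20, learning_rate=1e-3):
    """training_pairs: list of (transcript_tensor, label_tensor), label 1.0 = deceptive, 0.0 = honest.

    Dropout is assumed to be part of `model` itself, e.g. an nn.Dropout layer
    before the final fully-connected layer.
    """
    criterion = nn.BCELoss()
    # Any of Adadelta, Adam, RMSprop, or SGD with momentum could be swapped in here.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for _ in range(num_epochs):
        model.train()
        for transcript_vectors, label in training_pairs:
            optimizer.zero_grad()
            prediction = model(transcript_vectors)
            loss = criterion(prediction.view(-1), label.view(-1))
            loss.backward()
            optimizer.step()
    return model
```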

6 Results

We measure each of the settings described in the experiments section with metrics including precision, recall, and F1 score. The results are shown in Table 3. According to the results, we have the following findings.

  1.

    All of the methods we propose in this paper have a higher F1 score than the randomly initialized vectors setting. This indicates that these methods indeed extract some hidden information from our data and that the classifier has learned some underlying patterns of deceptive language.

  2.

    One-hot-encoding has a higher F1 score

    Much to our surprise, the “One-hot-encoding” method achieves a better F1 score than any other method. In the setting that uses both related and control questions, it is about 33% higher than the average F1 score. We did not expect this result, because we thought that the BERT pre-trained model, which can extract not only the meaning of words but also the contextual information of a sentence, should be more powerful and perform better. In contrast, the one-hot-encoding process can only mark whether a word exists in a transcript file.

  3.

    Using control questions only gets a higher F1 score

    From Table 3, we can see that all methods except “Part-of-speech Extraction” have a higher F1 score in the scenario of using control questions only. On average, the F1 score of using control questions only is about 6% higher than using both types of questions and 23% higher than using related questions only.

Table 3. The metric results of each of the methods we propose, along with their average. The average value is calculated from the results of the four methods listed above, not including the randomly initialized vector setting.

7 Discussion

We are curious about why the one-hot-encoding method has a better F1 score. To further investigate what our model learns in the one-hot-encoding setting, we generate vectors whose elements are all assigned 0 except for one element set to 1 and use them as inputs to the model. Each of these vectors can be thought of as a sentence containing only one word. Among the words probed in this way, our model considers the following to be more indicative of deception: Taipei (a location name), Taichung (a location name), mobile phone, mention, and April. Two of them are location names. On the contrary, the following words are considered more indicative of honesty: monitor, girlfriend, come back home, for example, and touch. However, we cannot conclude that a sentence containing the words above is more likely to be deceptive/honest, due to the complexity of the deep learning model. Computation in a neural network is not linear, and a minor change in the input may lead to a significant change in the output. This analysis only suggests a direction for more in-depth investigation.
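This probing step can be sketched as follows under the one-hot setting, assuming a trained classifier like the one sketched after Fig. 1 whose input dimension equals the vocabulary size; feeding the identity matrix row by row amounts to scoring every one-word “sentence”.

```python
import torch

def probe_single_words(model, vocabulary):
    """Score each vocabulary word as a one-word "sentence" under the one-hot model."""
    scores = {}
    for index, word in enumerate(vocabulary):
        one_hot = torch.zeros(1, 1, len(vocabulary))   # one transcript with one sentence
        one_hot[0, 0, index] = 1.0
        with torch.no_grad():
            scores[word] = model(one_hot).item()        # higher = judged more deceptive
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```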

As for the reason why using control questions only yields a higher F1 score, we conjecture that it is because sentences belonging to control questions contain more words, whereas those belonging to related questions, to which the subject simply responds yes or no, contain fewer. The representation vectors of sentences belonging to control questions therefore hold more hidden information than those belonging to related questions. As for the reason why the setting that uses both related and control questions has a lower F1 score than using control questions only, we suspect that the related questions act as noise due to the short answers, which often contain only one word from the subject.

8 Conclusion

In this paper, we utilize four different methods, including part-of-speech extraction, one-hot-encoding, mean of embedding vectors, and the BERT model, to capture the hidden information in real-world transcript files containing conversations between interrogators and subjects. In addition, we use a hierarchical neural network to detect whether the conversation is deceptive or not. Finally, we compare the metrics of each method and discuss the results.

After training, our system can classify deceptive and honest cases. However, we can still make our system more robust and reliable by collecting more training samples and incorporating deep learning techniques such as transfer learning and multitask learning. Although improvements can be made, we believe that our methods can serve as the basis of more complex neural network structures, which may someday provide additional aid in fields such as psychology, forensic science, and sociology. Moreover, the methods and structures mentioned in this paper are not restricted to Chinese transcripts. They can be applied to any other language and even to other scenarios.