Introduction

As information technology continues to evolve at a rapid pace, online social media has become the most important platform for daily information exchange [1]. Everyone in the current era enjoys the convenience of online social platforms. However, the exponential growth of social media has provided fertile ground for the creation and wanton dissemination of fake news [2, 3]. The unrestricted spread of misinformation on social platforms undermines not only the public opinion environment in cyberspace, but also political stability [4], social order and economic activities in the real world [5]. Effective detection is therefore essential for preventing the propagation of fake news on the internet.

Fig. 1 Three samples from Twitter dataset with label “fake”

Multimodal news has become popular in recent years, especially news with visual information. Compared with text-only news, it attracts more attention from readers on social media. Benefiting from this feature, multimodal fake news gets more clicks and retweets, which expands its reach [6]. Hence, the focus of fake news detection has shifted to the multimodal content of social media. Figure 1 shows three samples from the Twitter dataset with the “fake” label. In Fig. 1a, both the text and the image indicate that this news should not be trusted. In Fig. 1b, the text provides nothing to prove its authenticity, while the image is apparently faked or fabricated. In Fig. 1c, the image seems reasonable, while the text indicates that it is possibly not real. A hypothesis emerging from these examples is that multimodal approaches are more conducive to detecting fake news.

Recently, many works have addressed fake news detection from a multimodal perspective [7, 8]. Benefiting from advances in pre-trained models, they usually adopt pre-trained models to extract features from different modalities. However, some early works naively concatenate multimodal features and ignore the complex interactions among them. Some studies have investigated the learning of joint text and image representations based on adversarial networks [9] and variational autoencoders [10], but they do not treat fake news detection as a single task: it is solved jointly with event classification or original sample reconstruction. Later, some researchers attempted to extract more visual information by using an image description model [11] or a forged image detection algorithm [12], but they made no progress in combining inter-modality features. Wu et al. then fused textual and pictorial features several times, following the real reading habits of human beings [13]. This method focuses on multimodal fusion mechanisms but neglects the importance of multimodal representation. Other studies attempt to implement semantic alignment by using ambiguity learning [14] or entity detection [15]. However, experiments show that the performance of these models is not good enough due to their imbalanced behaviour across different datasets. For a piece of real multimodal news, the text, the image content and the cross-modal relation ought to be flawless. Nevertheless, most existing methods do not consider single-modal feature judgement and cross-modal semantic fusion simultaneously, which leads to the loss of potential information and imbalanced detection performance. MCAN [13], which only considers the cross-modal relationship of news, performs well on the Weibo dataset but poorly on the Twitter dataset, with an accuracy gap of nearly 8%. MPFN [8], proposed in 2023, deeply integrates cross-modal features and balances the accuracy on the Weibo and Twitter datasets, but the corresponding performance is still not good enough. This suggests that new approaches need to be explored that consider both the inter-modal and intra-modal features of news.

Motivated by this, we propose a novel multimodal fake news detection method with Intra-modality Feature aggregation and Inter-modality Semantic fusion (IFIS). Specifically, to improve detection accuracy, we extract entity features from images to reduce noisy and redundant visual features. Next, based on the detected entities, we design an intra-modality attention mechanism for aggregating the complex feature relations. In addition, we utilize a semantic fusion module to capture the inter-modality relationships among features from different modalities. The semantic fusion module adopts two parallel Co-attention blocks to establish accurate and reliable relationships between modalities. The contributions of our work can be summarized as:

  • We propose a novel multimodal fake news detection method that considers both single-modal feature judgment and cross-modal semantic fusion. The method demonstrates excellent and balanced performance on datasets with different attributes.

  • We integrate an object detection block, Faster R-CNN, into the entity feature extraction module and aggregate the relational features through an attention mechanism.

  • We develop a semantic fusion module to capture the inter-modality relationship between news text and corresponding entities from images. The semantic fusion module is made up of two parallel Co-attention blocks to obtain stable connections between modalities.

The rest of this work is organized as follows. Section “Related works” summarizes previous works related to fake news detection, especially frameworks using multimodal data, in detail. Section “Problem statement” provides a detailed definition of the fake news detection task. In Section “Proposed method”, details of the newly proposed multimodal framework are provided. Section “Experiments and analysis” describes the adopted datasets and parameter settings; in addition, extensive experiments are conducted and the corresponding analyses are provided in that section. Finally, Section “Conclusion” gives the conclusion.

Related works

Traditional methods are single-modality recognition ones that focus only on the text and are mainly based on simple classifiers. Text-based detection methods usually use various neural networks to extract and classify text information, including Convolutional Neural Network (CNN) [16], Recurrent Neural Network (RNN) [17], Long Short-Term Memory network (LSTM) [18], and so on. Later, some studies tried to utilize visual information in the news to detect its authenticity [19, 20]. Visual information-based methods usually use image recognition or classification models to classify images. However, the use of insufficient information makes it difficult to achieve satisfactory results with the above single-modal methods.

In recent times, the news has evolved from a pure text format to a multimedia one that consists of visual information such as images [21] or videos [22]. As a result, image classification models [23] and optimization algorithms [24] have been widely used in fake news detection. This allows multimodal detection methods to show superior performance [25].

To explore information beyond text, Jin et al. present the att-RNN model [6] to integrate features from different modalities. They introduce an attention mechanism that combines word embeddings with visual features and social association features. Later, Wang et al. build a multimodal feature encoder; in addition, to distinguish events and detect fake news, they also build Event Adversarial Neural Networks (EANN) [9]. This model is essentially composed of two modules. For text processing, it inputs word embedding vectors into a CNN to derive text representations. For image processing, it uses a pre-trained VGG-19 [26] to extract image representations. The two representations are then combined and fed to two identical neural network classifiers: one discriminates between events, while the other detects fake news. Later, Khattar et al. design a Multimodal Variational AutoEncoder (MVAE) [10] motivated by [9]. In this framework, a bidirectional LSTM and VGG-19 are used to extract text and image representations independently. A new vector is obtained by concatenating the two representations, which is sent to a decoder to reconstruct the original sample; the detection of fake news is a secondary task in this framework [27]. However, the detection classifier has to be learned simultaneously with the reconstruction objective, which is bound to increase the complexity and instability of the model. In addition, the whole framework is sometimes hampered by the unavailability of labeled data for the detection task. The methods mentioned above are all reasonable attempts to detect fake news in a multimodal way. However, the interactive features of multimodality are not well exploited.

To deal with the above-mentioned problem, a number of multimodal frameworks have been presented that attempt to interactively fuse multimodal features [28]. Singhal et al. develop SpotFake [29], a multimodal framework, to address the task of detecting fake news specifically. Bidirectional Encoder Representations from Transformers (BERT) [30] and VGG-19 are applied to learn text features and image features separately. In this framework, the authors only take into account the characteristics of the text and images, and the authenticity of the news item is determined accordingly. The removal of interference from other sub-tasks ensures that the framework is well suited to the detection of fake news. However, this model only concatenates the outputs of two pre-trained models and does not take inter-modality features into account. Zhou et al. also present the Similarity-Aware FakE (SAFE) [11] news detection method, which uses cross-modal information from a different viewpoint. SAFE adopts a description model to convert images into text descriptions. Then, a text-based classifier is trained on news texts and image descriptions. In this process, the authors argue that SAFE takes into account the relationship between the features extracted from different modalities. The inability to effectively fuse multimodal features is a shortcoming of the approaches mentioned above. To better fuse textual and visual features, Wu et al. propose Multimodal Co-Attention Networks (MCAN) [13], inspired by real human reading habits. This model fuses image features from different domains and text features several times via Co-attention layers. However, it does not take into account the role of single-modal features.

Since then, many researchers have explored multimodal fake news detection from a variety of perspectives. Considering image tampering, Xue et al. present a Multimodal Consistency Neural Network (MCNN) [12]. It uses the Error Level Analysis algorithm [31] to detect forged images. However, it does not improve the performance of feature fusion. Wang et al. explored several semantic associations between images and text and proposed an instance-driven multimodal graph fusion method [15] focusing on the implications of multimodal presentation. To address the intrinsic ambiguity between the content of different modalities, which degrades multimodal fake news detection, Chen et al. present a Cross-modal Ambiguity-aware FakE news detection method (CAFE) [14]. It is a meaningful attempt from an information-theoretic perspective. To take advantage of valid information at the shallow level, Jing et al. proposed a Multimodal Progressive Fusion Network (MPFN) [8]. It performs several intermediate fusions, which improves the performance effectively.

Problem statement

In essence, fake news is a distortion of information by those who create it. According to prior work on media bias theory [32], distortion bias is generally modelled as a binary classification challenge. For this reason, fake news detection has typically been formulated as a binary classification problem.

A social media news post is defined as \({\mathcal {N}}\). Depending on whether \({\mathcal {N}}\) is true or false, the label y is denoted by ‘0’ or ‘1’ accordingly. Based on the evaluation of the news \({\mathcal {N}}\), the prediction label \({\hat{y}}\) is classified as ‘0’ or ‘1’ by model \({\mathcal {M}}\).

Fake News Detection: The aim of this task is to assess whether the news article \({\mathcal {N}}\) is forged, given the original post of news \({\mathcal {N}}\) and related information, i.e., \({\mathcal {M}}: {\mathcal {N}} \rightarrow \{1,0\}\) such that,

$$\begin{aligned} {\mathcal {M}}({\mathcal {N}})= {\left\{ \begin{array}{ll} 1, &{} {\mathcal {N}} \text { is fake } \\ 0, &{} \text { otherwise } \end{array}\right. } \end{aligned}$$
(1)

where \({\mathcal {M}}\) represents the assessment method that the researchers are working on.

Proposed method

In this section, we discuss the detailed structure and motivation of the proposed multimodal method.

Model overview

In this manuscript, we present a new multimodal fake news detection approach (IFIS) based on intra-modality feature aggregation and inter-modality semantic fusion. In essence, the proposed approach is made up of four modules, i.e., text embedding, entity feature embedding, feature aggregation and semantic fusion. To eliminate redundant features, IFIS extracts entity features from images. In addition, to comprehensively utilize the intra-modal and inter-modal features, we design the feature aggregation and semantic fusion modules. The feature aggregation module uses the attention mechanism to strengthen single-modal features, while the semantic fusion module adopts co-attention to carry out cross-modal semantic interaction. For illustration, we present an overview of IFIS in Fig. 2. The main construction and functions of the various modules are described later. For visualization, we use different colors to represent features of different modalities. As in Fig. 2, orange and blue represent visual and textual features, respectively.

Fig. 2 The schematic of IFIS. Here, the framework is essentially made up of four sub-processes, i.e., text embedding, entity feature embedding, feature aggregation and semantic fusion. As illustrated, the orange region and the blue one indicate visual and textual features, respectively. The orange-blue gradient region represents the fused features
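To make the overall data flow concrete, the following minimal PyTorch sketch wires the four modules together. The module interfaces, the mean-pooling stand-ins in the smoke test and the final concatenation of the two single-modal vectors with the fused vector are illustrative assumptions, not the exact implementation of IFIS.

```python
import torch
import torch.nn as nn

class IFIS(nn.Module):
    """Structural skeleton: the four sub-modules are injected so that any concrete
    implementation (e.g. those sketched in the following subsections) can be plugged in."""
    def __init__(self, text_encoder, entity_encoder, text_agg, visual_agg, fusion,
                 d: int = 256, n_classes: int = 2):
        super().__init__()
        self.text_encoder = text_encoder      # text embedding (BERT-based)
        self.entity_encoder = entity_encoder  # entity feature embedding (Faster R-CNN + ResNet-50)
        self.text_agg = text_agg              # intra-modality aggregation for text
        self.visual_agg = visual_agg          # intra-modality aggregation for visual entities
        self.fusion = fusion                  # inter-modality semantic fusion (two Co-attention blocks)
        # assumption: the fused vector and the two single-modal vectors are concatenated
        self.classifier = nn.Linear(3 * d, n_classes)

    def forward(self, text_inputs, images):
        z = self.text_encoder(text_inputs)    # token-level text representations, (B, L, d)
        e = self.entity_encoder(images)       # top-k visual entity vectors,      (B, k, d)
        t = self.text_agg(z)                  # aggregated textual feature,       (B, d)
        v = self.visual_agg(e)                # aggregated visual feature,        (B, d)
        f = self.fusion(z, e)                 # cross-modal fused feature,        (B, d)
        return self.classifier(torch.cat([t, v, f], dim=-1))

# Smoke test with mean-pooling stand-ins for the sub-modules.
if __name__ == "__main__":
    d = 256
    pool = lambda x: x.mean(dim=1)
    model = IFIS(nn.Identity(), nn.Identity(), pool, pool,
                 lambda z, e: torch.cat([z, e], dim=1).mean(dim=1), d=d)
    print(model(torch.randn(2, 30, d), torch.randn(2, 5, d)).shape)  # torch.Size([2, 2])
```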

Text embedding

Aiming to obtain a high-quality corpus, we first clean the original news articles. Due to the casual nature of human tweeting, we have to remove strange symbols and emoji signs from the text. The cleaned text is concatenated into a single paragraph as the text input. Text paragraphs are split into different sentences. A sentence s is transformed into a succession of tokens \(\left\{ w_1, w_2, \ldots , w_k\right\} \), where \(w_k\) is the aggregation of the position and token embeddings for the k-th token in the sentence. Inspired by [30], the produced input is passed to a Transformer [33]. To encode the information contained in the input, the Transformer encoder is adopted. The input sequence of tokens \(\left\{ w_1, w_2, \ldots , w_k\right\} \) is then mapped into an abstract continuous representation \(\left\{ z_1, z_2, \ldots , z_k\right\} \).

Then, we adopt BERT, an excellent language model, to process the abstract continuous representation. It views words through the shared conditioning of their immediate context, which makes it deeply bidirectional. The BERT module is pre-trained with Next Sentence Prediction and Masked Language Modelling, two unsupervised predictive tasks. The goal of Next Sentence Prediction is to predict whether a sentence is adjacent to the target sentence, while the goal of Masked Language Modelling is to predict the input tokens of a paragraph that are masked at random with some percentage. In this manuscript, we adopt two datasets to validate the superiority of the presented approach, i.e., Weibo and Twitter, which are in Chinese and English, respectively; thus, two independent versions of BERT are utilized. Although the two versions of BERT are trained on corpora in different languages, they have no structural differences.
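As a concrete illustration, the snippet below shows how the token-level representations can be obtained with the Hugging Face transformers library. The checkpoint names follow the parameter settings reported later (bert-base-chinese for Weibo, bert-base-multilingual-cased for Twitter), while the helper function, the example sentence and the omission of a projection to the hidden dimension d are simplifications of this sketch.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # Weibo; Twitter would use
bert = AutoModel.from_pretrained("bert-base-chinese")            # bert-base-multilingual-cased

def embed_text(cleaned_paragraph: str, max_len: int = 200) -> torch.Tensor:
    """Map a cleaned news paragraph to token-level representations z_1, ..., z_k."""
    inputs = tokenizer(cleaned_paragraph, truncation=True, max_length=max_len,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # (1, max_len, 768); a projection to d = 256 would follow in the full model
    return outputs.last_hidden_state

z = embed_text("这是一条待检测的新闻文本")   # token representations of one post
```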

Entity feature embedding

Inspired by [34], we utilize a pre-trained Faster R-CNN [35] to segment the salient regions containing entities from the images. We abandon the classical methods of image feature extraction for the following two reasons. Firstly, the semantic relationships may not be captured by the embedding representations obtained from CNN or VGG [36], even though they can preserve the spatial information. Secondly, classical approaches divide images equally at the spatial level [37], resulting in unnecessary background fragments and broken entities. Filtering out the necessary fragments takes extra work, and the broken entities degrade the performance of the model.

Fig. 3 The specific process of extracting entity features

As in Fig. 3, we utilize the pre-trained Faster R-CNN to identify image patches containing entity targets. Faster R-CNN is an object recognition framework that consists of two main steps. It uses bounding boxes to identify and localize areas of the image that belong to particular classes. Firstly, a Region Proposal Network forecasts the boundaries and scores of objects. Then, region of interest pooling is adopted to acquire feature maps of each bounding box and to classify the fragments within the proposed regions. For fake news detection, we remove some tiny and dependent entities from the object detection categories, including “eyes", “eyebrow", “necklace", and so on. For each news item, we select the k highest-scoring entities in the image. Next, a pre-trained ResNet-50 [38] is adopted to transform the entity regions into vectors \(e_t^i \in R_t^i\), where i and t represent the dimension and the sequence index of the image vector, respectively. Then, the visual entity vectors are arranged as the entity feature embedding in order of their scores.
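A minimal sketch of this pipeline with torchvision is given below. The COCO-pretrained torchvision detector stands in for the detector used in our experiments, and the class filtering, score handling and omitted ImageNet normalisation are simplifications of the sketch.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.transforms.functional import resize

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
resnet = torchvision.models.resnet50(weights="DEFAULT")
encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()   # drop the fc layer -> 2048-d vectors

@torch.no_grad()
def entity_embeddings(image: torch.Tensor, k: int = 5) -> torch.Tensor:
    """image: float tensor (3, H, W) in [0, 1]; returns up to k entity vectors ordered by score."""
    det = detector([image])[0]                          # boxes are returned sorted by score
    boxes = det["boxes"][:k].round().long()             # keep the k highest-scoring entities
    patches = []
    for x1, y1, x2, y2 in boxes.tolist():
        crop = image[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
        patches.append(resize(crop, [224, 224]))
    if not patches:                                     # fall back to the whole image
        patches = [resize(image, [224, 224])]
    return encoder(torch.stack(patches)).flatten(1)     # entity feature embedding, (n, 2048)
```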

Feature aggregation module

In this section, we highlight how the self-attention mechanism is used to model the intra-modality relationships of image vectors and text tokens. Attention modules have been used extensively in Visual Question Answering (VQA) and Natural Language Processing (NLP) tasks in recent years. The attention mechanism can address the long-distance dependence of text and the averaging of image features. It learns new weight distributions in a targeted manner and applies them to important features. The attention mechanism used in this module can effectively highlight single-modal features to improve the accuracy of the method. Let us review the paradigm of attention. The attention module is a mapping function capable of capturing the global constraints among all the items in a sequence. It receives a variable number of inputs and returns the same number of outputs. Each input consists of three representations: query, key and value, which are packed into matrices Q, K and V independently. They interact and decide where to focus the attention.

Fig. 4 The specific structure of the self-attention block

A self-attention block is incorporated into the feature aggregation module because the image vectors and text embeddings are processed independently. As in Fig. 4, self-attention is a particular type of attention mechanism. Its purpose is to encode the interactions between fragments of images or text. In self-attention, the matrices Q, K and V are the same, i.e., the three input representations make no difference. The multi-head self-attention module first computes (\(d\times 1\))-dimensional queries, keys and values from the input. Then, they are packed into Q, K and V, respectively. The dot product between Q and K describes the attention allocation over V. The attention mechanism is run a number of times in parallel in the multi-head attention submodule; every head pays unique attention to one piece of the sequence. Finally, all the outputs of the attention heads are combined and linearly rescaled to obtain the required projection dimension.

For the i-th head, Q, K, and V are projected as follows:

$$\begin{aligned} Q_i=Q W_i^Q, K_i=K W_i^K, V_i=V W_i^V \end{aligned}$$
(2)

where \(W_i^Q, W_i^K, W_i^V \in {\mathbb {R}}^{1 \times d_h}\) are the projection matrices for the i-th head, \(d_h=d / m\) is the dimensionality of the output features from the heads and m is the total number of heads.

The operation of calculating the multi-head self-attention function can be represented as:

$$\begin{aligned} \begin{aligned}&MultiHead = h W^O\\&h={h}_1 \oplus {h}_2 \oplus \ldots \oplus {h}_{{m}}\\&h_i={\text {softmax}}\left( \frac{Q_i K_i^T}{\sqrt{d_h}}\right) V_i \end{aligned} \end{aligned}$$
(3)

where \(W^O\in {\mathbb {R}}^{m d_h\times 1}\) represents a learnable parameter matrix, and \(\oplus \) denotes the concatenation of vectors.

Next, a fully connected feed-forward network is applied to each fragment independently and the outcome is reshaped linearly:

$$\begin{aligned} FFN(x)=\max \left( 0, {W_1} x+b_1\right) W_2+b_2 \end{aligned}$$
(4)

where \(\left( x, b_1, b_2\right) \in {R}^{d \times 1}\), \(\left( W_1, W_2\right) \in {R}^{d \times d}\), and \(W_1, b_1\) and \(W_2, b_2\) are the learnable parameters of the first and second fully connected layers, respectively. Finally, residual connections and layer normalisation are placed around the two sub-layers to carry the positional information to higher layers.
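For reference, the two sub-layers described by Eqs. (2)–(4) can be transcribed into PyTorch as follows; the dimensions follow the parameter settings (d = 256, m = 8), and any detail not stated in the text (e.g. the absence of bias terms in the projections) is an implementation choice of this sketch.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Eqs. (2)-(3): per-head projections, scaled dot-product attention, output projection W^O."""
    def __init__(self, d: int = 256, m: int = 8):
        super().__init__()
        assert d % m == 0
        self.m, self.d_h = m, d // m
        self.W_q = nn.Linear(d, d, bias=False)   # stacks W_i^Q for all heads
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_o = nn.Linear(d, d, bias=False)   # W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, L, d); Q = K = V = x
        B, L, _ = x.shape
        q = self.W_q(x).view(B, L, self.m, self.d_h).transpose(1, 2)   # (B, m, L, d_h)
        k = self.W_k(x).view(B, L, self.m, self.d_h).transpose(1, 2)
        v = self.W_v(x).view(B, L, self.m, self.d_h).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_h), dim=-1)
        h = (attn @ v).transpose(1, 2).reshape(B, L, -1)               # concat h_1, ..., h_m
        return self.W_o(h)

class FeedForward(nn.Module):
    """Eq. (4): position-wise two-layer network with ReLU."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

Stacking these two modules with the residual connections and layer normalisation described above yields the aggregation applied to the visual and textual branches below.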

For the visual features, the representations obtained from the entity patches, \(Y=\left[ e_1; \ldots ; e_k\right] \in {\mathbb {R}}^{k \times d}\), are fed into a self-attention layer to capture the interactions between visual entities. The result of the multi-head self-attention module is \(O=\left[ o_1; \ldots ; o_k\right] \in {\mathbb {R}}^{k \times d}\), where \(O=L(Y+MultiHead(Y))\) and \(L(\cdot )\) denotes layer normalisation. A set of continuous representations is then obtained by applying the position-wise feed-forward network and layer normalisation, \(X=\left\{ X_i\right\} _{i=1}^k\), where \(X_i=L(o_i+FeedForward(o_i))\). Finally, average pooling followed by L2 normalisation compresses the resulting image vectors into a dense representation.

For the textual features, the representations obtained from the text content, \(Z=\left[ z_1, z_2,\ldots , z_k\right] \), are fed into a 1-dimensional convolutional neural network, which is useful for capturing the hidden sequential features. The convolutional layer maps the input sequences \(\left\{ z_{i:(i+h-1)}\right\} _{i=1}^{k-h+1}\) to the features \(F=\left\{ f_i\right\} _{i=1}^{k-h+1}\). Each input is a set of h contiguous words, which is presented as,

$$\begin{aligned} \begin{aligned}&f_i=ReLU\left( w \cdot z_{i:(i+h-1)}+b\right) \\&z_{i:(i+h-1)}={\text {concat}}\left( z_i, z_{i+1}, \ldots , z_{i+h-1}\right) \end{aligned} \end{aligned}$$
(5)

where \(w, z_{i:(i+h-1)} \in {\mathbb {R}}^{hd}\), \(ReLU(\cdot )\) is the ReLU activation function, b is a bias, and w is the matrix of trainable parameters. Then, a max-pooling operation is applied to the resulting feature map for dimensionality reduction, \({\hat{f}}=max\{f_i\}_{i=1}^{k-h+1}\). The text representation is generated by \(r=W{\hat{f}}+b\). Finally, a fully connected layer and \({l_2}\) normalisation are applied to r to produce the text feature vector.
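A compact sketch of this textual branch is given below; the input dimension (768 for BERT-base), the window size h = 3 and the exact stacking of the projection and the final fully connected layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAggregation(nn.Module):
    """Eq. (5) followed by max-over-time pooling, r = W f_hat + b, a fully connected
    layer and l2 normalisation."""
    def __init__(self, d_in: int = 768, d: int = 256, h: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d, kernel_size=h)   # window of h contiguous tokens
        self.proj = nn.Linear(d, d)                     # r = W f_hat + b
        self.fc = nn.Linear(d, d)                       # final fully connected layer

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, k, d_in) token representations
        f = F.relu(self.conv(z.transpose(1, 2)))          # f_i of Eq. (5), shape (B, d, k - h + 1)
        f_hat = f.max(dim=-1).values                      # max-pooling over positions
        r = self.proj(f_hat)
        return F.normalize(self.fc(r), p=2, dim=-1)       # l2-normalised text feature vector
```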

Semantic fusion module

Semantic feature fusion is implemented through the semantic fusion module, whose detailed architecture is provided in Fig. 5. Firstly, we introduce the Co-Attention block, which is the basic unit of the semantic fusion module.

Fig. 5 The specific structure of the semantic fusion module. The orange represents features of one modality, while the blue represents features of another modality. The orange-blue gradient squares represent the fused features, which are produced by two parallel Co-Attention blocks

The Co-Attention block is a variant of the multi-head self-attention block. As in Fig. 5, in a Co-Attention block, the queries use data from one modality, while the keys and values use data from the other modality. In addition, the query matrices are used as residuals to keep the original semantic features. The remaining architecture is the same as that used in multi-head self-attention. Attention features for one modality conditioned on another modality are produced by the Co-Attention block. For instance, if the matrix Q is taken from the textual features and the matrices K and V are taken from the visual features, the attention matrix calculated from Q and K can be applied to measure the similarity between the text and the image. After that, the attention matrix is used to weight V, i.e., the visual features. Just like the real habit of humans reading news, after looking at the images, we pay more attention to the text sequences that are related to the images. Co-attention can effectively simulate the above process and learn the semantic fusion relationship between features of different modalities.

As in Fig. 5, the fusion module is obtained by connecting two Co-Attention blocks in parallel. The orange squares represent features extracted from one modality, and the blue squares represent features extracted from the other modality. As previously explained, one Co-Attention block weights the attention matrix onto one modality's features, and the other Co-Attention block handles the opposite direction. Then, the results of the two Co-Attention blocks are concatenated and passed through a fully connected layer to produce the final representations. The semantic fusion module models cross-modal interactions by exchanging the modalities of the input features, which simulates the real reading habits of humans.
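The sketch below illustrates one possible realisation of the Co-Attention block and the parallel fusion module in PyTorch; the use of nn.MultiheadAttention, the feed-forward sub-layer and the mean pooling before concatenation are assumptions of the sketch rather than the exact configuration of IFIS.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Cross-attention: queries come from one modality, keys/values from the other;
    the query acts as the residual so the original semantics are preserved."""
    def __init__(self, d: int = 256, m: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, m, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, query_feats, context_feats):
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attended)        # residual from the query modality
        return self.norm2(x + self.ffn(x))

class SemanticFusion(nn.Module):
    """Two parallel Co-Attention blocks (text->image and image->text); their outputs
    are concatenated and passed through a fully connected layer."""
    def __init__(self, d: int = 256, m: int = 8):
        super().__init__()
        self.text_query = CoAttentionBlock(d, m)
        self.image_query = CoAttentionBlock(d, m)
        self.fc = nn.Linear(2 * d, d)

    def forward(self, text_feats, visual_feats):      # (B, L_t, d), (B, k, d)
        a = self.text_query(text_feats, visual_feats).mean(dim=1)    # pooling is an assumption
        b = self.image_query(visual_feats, text_feats).mean(dim=1)
        return self.fc(torch.cat([a, b], dim=-1))
```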

Experiments and analysis

Data descriptions

To reveal the effectiveness of the proposed model, extensive experiments are carried out on two public datasets, Weibo and Twitter. Table 1 shows the detailed statistics of these two datasets.

  • Weibo [6]: This dataset was constructed by Jin et al. All posts were made on Weibo between May 2012 and January 2016.

    All fake news is validated by the official rumour debunking system of Weibo, while all real news is checked by an official news organisation, Xinhua News Agency. This dataset contains 4,779 real posts and 4,749 fake ones. The tweets consist of articles, additional images and social context. As we are primarily interested in identifying news with text and images, we have removed tweets without relevant images.

  • Twitter [39]: This dataset was first released to support fake content detection on Twitter in the “Verifying Multimedia Use" task. The raw dataset includes a development set and a test set, both containing tweets about different events. To ensure fair performance comparison, we retain the news containing both text and images and filter out the rest. It is notable that the number of images is much smaller than the number of samples in this dataset, which means an image might be shared by several samples.

Table 1 Detailed statistics of Weibo and Twitter datasets

Evaluation metrics

To measure the performance of various models, a variety of metrics are adopted, with the corresponding definitions provided as follows:

  1. Precision: It represents the proportion of true positives among the positive examples identified by the model. It is abbreviated as Pre.;

  2. Recall: It indicates the proportion of correctly identified positive cases out of the total number of positive cases. It is abbreviated as Rec.;

  3. \({F_{Score}}\): Precision and Recall are sometimes contradictory, especially when used to measure unbalanced datasets. Therefore, \(F_{Score}\) is proposed as an indicator to measure the comprehensive performance of a model. It is calculated by weighting Precision and Recall, which can be represented as:

    $$\begin{aligned} {\textrm{F}}_{{\textrm{Score}}}=\left( 1+\tau ^{2}\right) \frac{Pre.\times Rec.}{\tau ^{2} \cdot Pre.+Rec.} \end{aligned}$$
    (6)

    where \(\tau \) represents a weighted parameter. When \(\tau \) is assigned to 1, \(F_{Score}\) is written as \(F_1\). In general, a larger \({F_1}\) indicates better performance;

  4. Accuracy: It represents the ability of the classifier to correctly classify all samples in the dataset. It is an ideal metric when the sample proportion of each category in the dataset is balanced. In general, a larger value of \({\textrm{Accuracy}}\) usually indicates better overall performance (a sketch of how these metrics can be computed is given below).
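For completeness, the snippet below shows how these metrics can be computed with scikit-learn under the label convention of the problem statement (1 = fake, 0 = real); the toy label vectors are placeholders.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0]   # ground-truth labels (1 = fake, 0 = real)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

acc = accuracy_score(y_true, y_pred)
pre = precision_score(y_true, y_pred)   # Pre.
rec = recall_score(y_true, y_pred)      # Rec.
f1 = f1_score(y_true, y_pred)           # Eq. (6) with tau = 1
print(f"Acc={acc:.3f}  Pre={pre:.3f}  Rec={rec:.3f}  F1={f1:.3f}")
```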

Parameter settings

During the training phase, the maximum lengths of the text are set to 200 and 30 for Weibo and Twitter, respectively. Both the size of the hidden nodes and the dimensionality d are assigned to 256; the total number of heads m is fixed at 8. We fix the number of visual entities k to 5 (the reason for adopting this value will be discussed later). When training on the Twitter dataset, the parameters of BERT and Faster R-CNN are frozen to avoid overfitting; on the Weibo dataset, we do not do this. For the Twitter and Weibo datasets, the adopted BERT models are BERT-base-multilingual-cased and BERT-base-Chinese, respectively. IFIS is trained for 50 epochs with a learning rate of 5e-5. The dropout rate is fixed at 0.5 and the batch size is set to 128. We use the Adam optimizer and the categorical cross-entropy loss function to optimize the model. The above hyperparameters are determined using grid search, with accuracy as the criterion for parameter selection. Furthermore, the model is implemented in the PyTorch framework and all experiments are run on an Nvidia GeForce RTX 3090Ti graphics card. Besides, the hyperparameters of the baselines are the same as in their respective original papers.
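The hyperparameters above can be collected in a configuration such as the one below; the two-layer network and the random batch are only stand-ins so that the optimizer and loss wiring can be run end to end, and do not reproduce the IFIS architecture.

```python
import torch
import torch.nn as nn

# Hyperparameters transcribed from the settings above.
config = dict(max_len_weibo=200, max_len_twitter=30, hidden_dim=256, num_heads=8,
              num_entities=5, epochs=50, lr=5e-5, dropout=0.5, batch_size=128)

model = nn.Sequential(nn.Linear(3 * config["hidden_dim"], config["hidden_dim"]),
                      nn.ReLU(), nn.Dropout(config["dropout"]),
                      nn.Linear(config["hidden_dim"], 2))     # stand-in classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
criterion = nn.CrossEntropyLoss()                             # categorical cross-entropy

features = torch.randn(config["batch_size"], 3 * config["hidden_dim"])   # dummy batch
labels = torch.randint(0, 2, (config["batch_size"],))
for _ in range(3):            # a few illustrative steps instead of the full 50 epochs
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```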

Baselines

To illustrate the superiority of the proposed model, several SOTA approaches (including single-modal and multimodal ones) are selected for performance comparison. They are listed as follows:

Single-modal methods

Here, the considered single-modal methods contain Textual and Visual.

Textual [10]: Each word is mapped to a 32-dimensional vector, then a bidirectional LSTM is applied to extract the sequential features and produce the prediction results.

Visual [6]: The VGG-Net is utilized to extract 4096 dimensional features from images and a classifier is then trained to infer the labels.

Multimodal methods

As to multimodal methods, they mainly focus on developing approaches to efficiently extract and fuse features. The considered multimodal methods are as follows:

VQA [40]: The aim is to address the Visual Question Answering through the concatenation of textual and visual features. To make fair comparisons, the concatenated features are fed into a single-layer LSTM.

att-RNN [6]: This approach utilizes an LSTM to obtain features. Then, the attention mechanism is adopted for the fusion of textual, visual and social features. For fair comparisons, social features are removed.

EANN [9]: It develops a CNN for the generation of text representations, and utilizes a pre-trained VGG-19 to extract the image representations. After that, the text and image representations are merged and sent to a network classifier.

MVAE [10]: It extracts text and image representations using Bi-LSTMs and VGG-19, respectively. The two representations are then interlinked and sent to the decoder to reconstitute the source samples.

SAFE [11]: It uses a descriptive model to paraphrase images into text. The textual description is then combined with the original text to train a text classifier.

SpotFake+ [7]: SpotFake applies BERT and VGG-19 to extract textual and visual features separately. These features are concatenated for classification training. On the basis of SpotFake, SpotFake+ is presented by replacing BERT with a pre-trained XLNet.

Table 2 Performance evaluation of the incorporated methods and the proposed model

MCAN [13]: It utilises VGG-19, CNNs and BERT to extract visual features from different domains and text features, respectively. Then, it fuses the features several times using Co-attention layers. For fair comparisons, the frequency-domain feature is removed from the whole model.

MCNN [12]: It adopts BERT and ResNet-50 to capture textual and visual features. Meanwhile, ELA algorithm is applied to detect forged images. In addition, it assigns weights to the image and text using the attention mechanism.

DIIF [15]: It applies BERT and Mask R-CNN [41] to derive textual and image features. Then, a unifying graph is constructed to combine the textual and visual instances. It adopts attention mechanism and gating mechanism to generate the contextual representation and gather the semantic interactions.

CAFE [14]: It uses BERT and ResNet-34 to learn textual and image features from news. An ambiguity learning approach is used to extract ambiguity across modalities. Besides, a fusion module is adopted to collect the interactions of modalities.

MPFN [8]: Here, traditional CNN is used to process the sequence of news words and the image patches. Finally, it merges the resulting representations and then classifies them using a soft-max layer.

Result and discussion

In this subsection, we calculate the Precision, Recall and \(F_1\) scores of different news categories obtained for all methods on the considered datasets.

Table 2 illustrates the overall performance comparison between the considered approaches and the proposed model, with experiments conducted on Weibo and Twitter. As revealed, IFIS clearly outperforms previous models, with an accuracy of 0.896 and 0.838 on Weibo and Twitter, respectively. Besides, there are some interesting observations and phenomena worth exploring.

The text in Weibo is much longer than that in Twitter, which is why the maximum text length for Weibo is set to 200 while that for Twitter is set to 30. The images and news are one-to-one in Weibo, which means the visual information is weakly correlated with the authenticity of the sample. In contrast, the number of images in Twitter is much smaller than the number of samples, so the same image is shared by several samples. Even worse, a consistent label is applied to all samples that share the same image. This leads to a strong association between the visual features and the authenticity of the sample, which is contrary to the characteristics of the data in Weibo. However, for both the Weibo and Twitter datasets, the performance of Visual lags behind VQA and att-RNN according to Table 2. This indicates that utilizing multiple modalities in fake news detection is superior to adopting a single modality, and verifies that the performance of single-modal methods is inferior to that of multimodal methods in most cases. Besides, MCAN, MCNN and DIIF generally perform better than SAFE and SpotFake+, which confirms the contribution of cross-modal interaction to improving model performance.

Compared to att-RNN, EANN and MVAE introduce auxiliary tasks and demonstrate significant performance advantages. This verifies that the most significant feature of each modality can be captured by coarse-grained multimodal fusion. Following this idea, SpotFake+ adopts XLNet tokens as a replacement for word embeddings and outperforms the above methods, a demonstration of the tremendous benefits of pre-trained models. After that, the success of MCAN, MCNN and DIIF shows that a reasonable feature fusion module can further improve the detection performance.

Overall, as in Table 2, the presented results confirm the effectiveness of our method. In combination with the above analysis, the advantages are attributed to the following factors. (1) The proposed model makes use of the information in the news content as much as possible: it cleans the data reasonably and integrates textual and visual information. (2) It adopts powerful pre-trained models to extract multimodal features, including ResNet and BERT. (3) It utilizes Faster R-CNN for entity acquisition, which eliminates the interference of redundant features as much as possible during feature extraction. (4) It incorporates intra-modality feature aggregation and inter-modality semantic fusion simultaneously, combining the advantages of single-modal feature judgment and cross-modal interaction.

Finally, our proposed model achieves improved accuracy when addressing detection tasks on both the Weibo and Twitter datasets. For comparison, some state-of-the-art methods are incorporated. As in Table 2, the best method on one dataset usually performs mediocrely on others, while the proposed method improves the accuracy by 0.6%. More importantly, the proposed method has a balanced performance across different datasets, and its accuracy is optimal on both simultaneously. The performance improvement is attributed to better feature aggregation and feature interaction capabilities. Meanwhile, the performance of our proposed method also varies across the considered datasets, which is mainly caused by the differences between samples in different datasets, especially the visual information.

Ablation study

In the following Table 3, we illustrate the performance of some variants of the proposed model, so that the influence of the different components can be discussed explicitly.

  • Entity: To evaluate the impact of entity extraction, we present a variant that no longer adopts Faster R-CNN to derive the visual entities. Here, ResNet-50 is employed to obtain the visual features from evenly segmented regions of the images.

  • Aggregation: To evaluate the impact of feature aggregation, we remove this module from IFIS. The features of images and text are concatenated with the outputs of the semantic fusion module.

  • Fusion: To evaluate the impact of semantic fusion, the corresponding module is removed from IFIS. The features from the two feature aggregation modules are concatenated together to derive the final prediction.

Table 3 Performance of different variants and proposed model
Fig. 6 Results of ablation studies

To present the results intuitively, we further display them in Fig. 6. As revealed, all the components of the proposed model contribute significantly to the results on both datasets. Specifically, compared with Entity, the proposed method adopts entity detection and thus achieves improved detection performance. This suggests that the inclusion of visual entities as input helps to improve performance. Besides, the proposed method outperforms Aggregation, which lacks the intra-modal feature interactions, indicating the effectiveness of intra-modal correlations between visual entity features and textual semantic features. In addition, if the semantic fusion module is removed, i.e., Fusion, the method faces a significant performance degradation, which demonstrates the importance of exploiting the semantic interactions between different modalities.

Furthermore, when any of the components is removed, the performance degradation of the model on Twitter is greater than that on Weibo. Considering the differences between the datasets, this phenomenon is attributed to the balanced distribution of data in Weibo and the robustness of the fine-tuned BERT.

Parameter analysis

As stated above, we utilize the top k visual entities of each image, ranked by detection probability. We now further study the effect of varying k. Here, we conduct experiments with different values of k, with the results provided in Fig. 7.

Fig. 7 Results for scenarios with different numbers of entities

As revealed in Fig. 7, the effect of varying k on the detection performance exhibits a similar trend on both datasets: the accuracy increases first and then declines. This indicates that there exists an optimal k. Overall, we obtain the maximum detection accuracy when k equals 5; hence, the optimal value of k is 5. In addition, the selection of entities effectively reduces the number of features that need to be processed by the proposed framework. As a result, this method does not increase the complexity of fake news detection; the computing power and time required are similar to those of existing methods. Furthermore, for the Twitter dataset, the effect of varying k on the overall performance is much greater than that for the Weibo dataset, as indicated by a larger variation. We suppose that this phenomenon is related to the strong association between the visual features and the authenticity of the sample in the Twitter dataset, whereas this association is extremely weak in the Weibo dataset. Meanwhile, this phenomenon also validates the conclusions derived in the above subsections.

Conclusion

Aiming to detect multimodal fake news effectively, we present an effective multimodal detection method (IFIS) based on intra-modality feature aggregation and inter-modality semantic fusion. Specifically, our proposed method extracts entity features from an image to avoid acquiring too many noisy visual features, and intra-modality features are aggregated based on entity detection and attention mechanisms. In addition, we utilize a semantic fusion module to capture the inter-modality relationships between textual content and visual entities. The semantic fusion module is made up of two parallel Co-attention blocks which establish stable connections between modalities. The simultaneous integration of single-modal features and cross-modal semantics allows the model to determine the authenticity of samples in different datasets.

The superiority of our proposed approach is demonstrated by experimental results on different datasets. Nonetheless, there are some limitations of this work that could be addressed in the future: (1) The modalities considered are not sufficient; features of more modalities, such as video, audio and propagation structure, are not used; (2) The parameter setting process can be improved, for instance, through the adoption of cooperative-competitive neural networks [42], reaction–diffusion neural networks [43] or fault-tolerant iterative learning control [44].

For future work, we will delve into the role of entity features in feature representation and feature interaction. Moreover, we will also consider leveraging prior knowledge and dissemination structures in such tasks.