1. Introduction
Steganography embeds additional data into digital media through slight alterations, achieving covert communication without drawing suspicion [1]. Generally, the original digital media (the cover) are spatial or JPEG images, which can be chosen from standard image sets or downloaded from the Internet [2]. For all steganographic algorithms, the effect of data embedding can be viewed as adding a string of independent pseudo-noise to the cover; the modified image is called a stego image [3]. After data embedding, the steganographic changes are therefore concealed by the cover content. To measure the distortion caused by the embedding operation, each cover element (a pixel for an uncompressed image or a non-zero AC coefficient for a JPEG image) is assigned a distortion value computed by a predefined distortion function, and the total embedding distortion over all cover elements can be nearly minimized with the aid of Syndrome-Trellis Codes (STC), which approach the payload-distortion bound [4,5,6]. To achieve high security, a group of content-adaptive embedding algorithms has been developed, such as wavelet obtained weights (WOW) [7], spatial universal wavelet relative distortion (S-UNIWARD) [8], high-pass low-pass and low-pass (HILL) [9,10], minimizing the power of optimal detector (MiPOD) [11], JPEG universal wavelet relative distortion (J-UNIWARD), uniform embedding distortion (UED), and uniform embedding revisited distortion (UERD) [10]. The underlying design principle of these steganographic algorithms is that the complex regions used for carrying the message are difficult to model with a simple, single statistical model [12,13].
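To make the distortion-minimization framework concrete, the following is a minimal sketch of the payload-limited embedding simulator that STC-based schemes approximate in practice: the change probabilities follow a Gibbs distribution over the per-element distortions, and the Lagrange multiplier is found by bisection so that the total entropy of the changes matches the payload. The function name and the binary-change simplification are illustrative assumptions, not code from the cited works.

```python
import numpy as np

def embedding_simulator(rho, payload_bits, iters=60):
    """Simulate payload-limited optimal binary embedding.

    rho: per-element distortion values (e.g., one per pixel).
    payload_bits: message length in bits.
    Change probabilities follow p_i = exp(-lam*rho_i)/(1 + exp(-lam*rho_i));
    lam is found by bisection so the total entropy matches the payload.
    """
    def probs(lam):
        return np.exp(-lam * rho) / (1.0 + np.exp(-lam * rho))

    def entropy_bits(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p)).sum()

    lo, hi = 1e-6, 1e6                        # bracket for the Lagrange multiplier
    for _ in range(iters):
        mid = np.sqrt(lo * hi)                # geometric bisection
        if entropy_bits(probs(mid)) > payload_bits:
            lo = mid                          # too much entropy -> larger lam
        else:
            hi = mid
    p = probs(lo)
    changes = np.random.rand(*rho.shape) < p  # realized +/-1 change locations
    return changes, p
```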
To detect whether a suspicious image is a cover or a stego one, the typical and effective approach is steganalysis, which consists of two separate parts: feature extraction and ensemble classifier (EC) training [14]. Generally, feature extraction handles the cover/stego objects with a set of diverse linear and non-linear high-pass filters to suppress the image content and expose the minor steganographic changes. The computed residual samples are then represented by one-dimensional or two-dimensional statistical features. Excellent existing features include the spatial rich model (SRM) [15], the JPEG rich model (JRM) [16], and DCT residuals (DCTR) [17]. After training the EC on the extracted features, we obtain an optimal classifier with outstanding detection performance.
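As an illustration of the feature-extraction step, the sketch below computes a simple first-order residual histogram with one SRM-style high-pass kernel (the well-known 5×5 "KV" filter); full rich models such as SRM, JRM, and DCTR aggregate many such filters and higher-order co-occurrences before the ensemble classifier is trained.

```python
import numpy as np
from scipy.signal import convolve2d

# The 5x5 "KV" high-pass kernel used in SRM; rich models use dozens of filters.
KV = np.array([[-1,  2,  -2,  2, -1],
               [ 2, -6,   8, -6,  2],
               [-2,  8, -12,  8, -2],
               [ 2, -6,   8, -6,  2],
               [-1,  2,  -2,  2, -1]]) / 12.0

def residual_histogram(img, T=2, q=1.0):
    """Suppress image content with a high-pass filter, then quantize,
    truncate, and histogram the residuals (a 1-D statistical feature)."""
    r = convolve2d(img.astype(np.float64), KV, mode='valid')
    r = np.clip(np.round(r / q), -T, T).astype(int)
    hist = np.bincount((r + T).ravel(), minlength=2 * T + 1)
    return hist / hist.sum()
```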
From the perspective of content-adaptive steganography [18], even with STC, if the cover itself is not secure, the corresponding stego image is easily detected by the trained classifier. Moreover, if the cover is downloaded from the Internet, the risk that the chosen JPEG image has been repeatedly compressed is high [19]. Although some existing steganographic works focus on the transport channel, the multiple upload and download operations may arouse the vigilance of a third party [20,21]. Therefore, synthesizing a secure cover directly is desirable and of practical significance.
In the past few years, photo-realistic image synthesis has become a hot topic. However, due to the complexity and high dimensionality of natural images, generating high-quality images is a tough task. The modern and effective approach to this problem is the generative adversarial network (GAN), which builds a good generative model of natural images through the competition of a generator and a discriminator [22]. Based on GAN, numerous novel schemes have been proposed. Zhang et al. proposed the stacked GAN (StackGAN), which uses two stages of generator training to generate vivid images containing low- and high-resolution object parts. Combining the Laplacian pyramid framework, Denton et al. described the Laplacian pyramid GAN (LapGAN) architecture to achieve a coarse-to-fine mode of image generation [23]. However, when the target resolution is high, the architecture of StackGAN or LapGAN is deep and the training process is rather slow. To deal with these problems, Karras et al. proposed PGGAN [24]. The key idea of PGGAN is to progressively train the generator and discriminator from a low resolution ($4 \times 4$) to a high resolution ($1024 \times 1024$) with lower training time complexity. Following the progressive growing mechanism, Karras et al. proposed StyleGAN and StyleGAN2, which control the synthesis of high-quality images through disentanglement and style mixing [25]. Moreover, by injecting noise into different layers of the synthesis network, StyleGAN and StyleGAN2 achieve stochastic realizations of the generated image at various scales, especially for the high-level attributes [26]. However, from the perspective of steganography, the generated image may be unsuited for use as a cover in secure communication.
Since the effect of stochastic variation adjustment relies on the noise injected at each layer of the synthesis network, and the injection is a similar adding operation, we hypothesize that this noise-adding operation can be seen as another type of data embedding (steganography). Therefore, if the injected noise is optimized along with the image synthesis, we can obtain images with higher security. Motivated by the abovementioned works, this article proposes a noise-optimization stacked StyleGAN2, named NOStyle, for secure steganographic cover generation. The proposed scheme aims to enhance security while preserving the fidelity of the generated image. As shown in Figure 1, the whole architecture of NOStyle is divided into two stages in a symmetrical mode: the structure of the first stage is the same as StyleGAN2 and is composed of a stage-I generator (SI-G) and a stage-I discriminator (SI-D), while the second stage includes a pair of new stage-II generator (SII-G) and discriminator (SII-D). Figure 1 shows the detailed framework.
The mapping network (MN) first uses a latent code z as the input and outputs an intermediate latent code w. Then, with w and progressive growing, SI-G generates an image. Meanwhile, by comparison with a real image, SI-D judges whether the generated image is vivid. After iteration and parameter optimization, SI-G creates the high-quality benchmark image. In the second stage, based on the basic architecture of StyleGAN2, we design a new generator SII-G composed of a secure noise optimization network (SNON) and a synthesis network (SyN).
Considering disentanglement and progressive growing, SNON controls the image details by adjusting the noise and injecting the optimized noise into the finer layers of SyN. In SII-D, we design a noise loss, comprising an image secure loss and a fidelity loss, and compute the difference between the outputs of SI-G and SII-G. By minimizing the noise loss, SNON outputs multi-scale optimized noise maps which are injected into the corresponding scales of SyN, and we finally obtain a secure and high-quality image (cover). Therefore, after image synthesis, the proposed architecture outputs a vivid image whose security is enhanced.
In summary, the whole training is separated into two stages. In the first stage, with the given dataset, we obtain a typical StyleGAN2 which accomplishes the image generation task and outputs a benchmark high-quality image. Then, applying SNON and the noise loss, we achieve noise optimization and image evaluation. Finally, the proposed architecture generates a vivid image. The contributions of this article are listed as follows:
We make a tight connection between image generation and steganography. Considering the image synthesis process, we hypothesize that noise injection can be seen as another type of steganography. Hence, by optimizing the injected noise maps, the security of the generated image can be enhanced and guaranteed.
We propose the architecture NOStyle, which balances the security and quality of the generated image. To achieve this goal, combining the image secure loss and fidelity loss, we design a noise loss that evaluates the complexity and fidelity of the generated image.
We draw a conclusion about the relationship between the security and fidelity of the generated image. To give a clear explanation, we calculate the Fréchet inception distance (FID) and run various security tests with multiple steganographic algorithms. According to the experimental results, it is clear that, for style-based image synthesis, the security of the generated image is inversely proportional to its fidelity.
The rest of this article is organized as follows. In Section 2, we present the basic notation, the basic theory of GAN, the concept of secure steganography, and a typical steganographic distortion. The detailed architecture and the training process of the proposed scheme are described in Section 3. In Section 4, we give extensive experiments and detailed analysis of the security and quality of the generated images. Finally, Section 5 concludes the paper and provides further discussion.
3. Proposed NOStyle Architecture
3.1. Basic Idea
In the proposed work, based on StyleGAN2, we design a secure cover generation architecture. Specifically, the mapping network takes a latent code z as the input and generates an intermediate latent code w. Then, using z and w, the generator outputs a high-quality image, from low resolution to high resolution, with progressive growing and stochastic variation (noise maps).
Considering the distinguishing characteristics of the typical StyleGAN2 architecture and the demand for security, the main goal of our proposed architecture is to enhance the security of the generated image by optimizing the stochastic variation, while keeping the fidelity of the created image as vivid as possible [27,28,29]. Unlike the previous works StyleGAN and StyleGAN2, the stochastic variation, represented as a noise map, is not random; based on progressive growing and shortcut connections [30,31,32], we design a secure noise optimization network (SNON) which optimizes the noise map. After optimization, we obtain a proper noise map which is involved in the synthesis of the high-resolution and secure image. Apart from the network structure, the convergence of SNON relies on the output of SII-D, in which we combine the predefined steganographic distortion function and the learned perceptual image patch similarity (LPIPS) to construct the noise loss, which evaluates the difference between the outputs of SI-G and SII-G. Here, we hypothesize that, for image generation, security and fidelity are two contradictory goals. Therefore, our proposed scheme is a strategy which makes a tradeoff between security and fidelity.
3.2. Proposed Architecture Overview
According to the methodology described above, we now give a detailed description of the NOStyle architecture, which is shown in Figure 2.
Our proposed architecture mainly has five parts: MN, SNON, stage-I SyN, stage-II SyN, and SII-D. The stage-I NOStyle generator is inherited from the original StyleGAN2 and includes MN and stage-I SyN. In the first stage, we apply StyleGAN2 to generate a high-quality image, which is used as a benchmark image and is fed into the second stage. Taking the SI-G result as input, the stage-II NOStyle optimizes the injected noise and generates a high-quality, secure image. The stage-II NOStyle is composed of SNON, stage-II SyN, and SII-D. The design principle of SNON is motivated by progressive growing and shortcut connections, by which noise maps of different scales are optimized and injected into the finer layers of the stage-II SyN. Employing the optimized noise maps, the random noise z, and the intermediate latent code w, stage-II SyN finally outputs the high-quality and secure image. Here, the architectures of stage-I SyN and stage-II SyN are the same; the differences lie in the inputs of the networks.
Generally, the per-pixel added noise map is sampled from the Gaussian distribution $\mathcal{N}(0, 1)$. Suppose the added noise map is $N$; combining $N$ with the latent code $w$, the stage-I synthesis network SyN outputs a random image $X^{(R)}$. With the same latent code $w$, stage-II SyN generates a secure image $X^{(S)}$. Then, $X^{(R)}$ and $X^{(S)}$ are entered into SII-D. Using the wavelet filter banks and LPIPS, SII-D constructs the noise loss (NL) to evaluate the complexity and fidelity of the generated image. By minimizing NL, we can adjust the injected noise map. The details are illustrated in Section 3.3.
3.3. Structural Design
As discussed above, the proposed architecture mainly consists of MN, SNON, SyN, and SII-D. The individual parts are described as follows.
Mapping network (MN) accepts a non-linear latent code $z$ as the input, whose distribution should follow the corresponding density of the training data. The original $z$ is represented as a combination of many factors of variation. According to the theory of disentanglement, the optimal latent code should be a combination of linear subspaces, each of which controls one factor of variation. Then, after normalization and eight fully connected layers (FC), $z$ is disentangled and we obtain a more linear intermediate latent code $w$.
Synthesis network (SyN) takes the latent code $w$ to generate a vivid image with progressive growing and noise injection. During training, this architecture first creates low-resolution images and then, step by step, outputs higher-resolution images, so that features at different resolutions do not interfere with each other. Meanwhile, without affecting the overall structure, the injected noise adjusts local details to make the image more vivid.
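For reference, noise injection in a style-based synthesis network amounts to adding a per-pixel noise map, scaled by a learned weight, after a convolution. The minimal PyTorch-style module below (the name and the scalar weight are our simplifications) shows the operation; in StyleGAN2 the noise is freshly sampled, whereas in NOStyle it would be supplied by SNON.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Add scaled per-pixel noise after a convolution, StyleGAN2-style."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1))  # learned noise strength

    def forward(self, x, noise=None):
        if noise is None:  # StyleGAN2 default: fresh Gaussian noise
            b, _, h, w = x.shape
            noise = torch.randn(b, 1, h, w, device=x.device)
        return x + self.weight * noise  # NOStyle would pass optimized noise here
```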
Generator network contains two networks: stage-II SyN and SNON. The first (Figure 3, right) is the synthesis network and the other (Figure 3, left and middle) is the noise optimization network. Considering the optimal characteristics of the disentangled latent code $w$, both networks use $w$ as the input.
Our noise optimization network is mainly inspired by PGGAN and ResNet [33,34,35]. Two simple design rules are used to optimize the injected noise. First, we introduce the progressive mechanism to generate noise with the same size as the image at each resolution. Second, the disentangled latent code $w$ is indirectly utilized to form the secure noise through shortcut connections. The progressive model is formed by three blocks, each containing a fully connected layer, two convolution layers, and an average pooling layer with a stride of 1; each block shifts a lower resolution to a higher one. Therefore, after three promotions, the latent code $w$ is changed into a feature map. Since we want to adjust the injected noise at two resolutions, we narrow this feature map with three convolution kernels and obtain three corresponding feature maps, denoted as Rn1, Rn2, and Rn3, respectively.
We deduce that the disentangled latent code $w$ is useful for constructing the secure noise. To fully utilize $w$, we introduce an underlying mapping of $w$ to the required resolutions. Motivated by previous works, we adopt three convolution kernels to achieve this mapping and output three feature maps, Rw1, Rw2, and Rw3. In total, we therefore obtain six feature maps: Rn1, Rn2, Rn3, Rw1, Rw2, and Rw3.
We deem that a merging operation can enhance the effectiveness of the feature maps. Therefore, we merge four of the feature maps into two groups. After applying the activation function (leaky ReLU) to these two groups, two noise maps are created and injected into the corresponding layers of the synthesis network. Moreover, by up-sampling, we double the size of feature map Rw3; combining the up-sampled Rw3 with the same leaky ReLU activation, the third group is turned into the third noise map, which is injected into the corresponding layer of the synthesis network. For the other layers of stage-II SyN, the injected noise maps are kept unchanged. A structural sketch is given below.
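The PyTorch-style sketch below summarizes the SNON topology described above: a progressive branch grows a feature map from w and produces the Rn maps, a shortcut branch maps w directly to each resolution to produce the Rw maps, and the merged maps pass through leaky ReLU to become the injected noise. Channel counts, resolutions, and the exact merging pattern are illustrative assumptions; only the overall structure follows the text and Figure 3.

```python
import torch
import torch.nn as nn

class ProgressiveBlock(nn.Module):
    """Two convs + stride-1 average pooling, then 2x upsampling.
    (The per-block fully connected layer from the text is folded
    into the initial projection for brevity.)"""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.AvgPool2d(3, stride=1, padding=1))
        self.up = nn.Upsample(scale_factor=2)

    def forward(self, x):
        return self.up(self.body(x))

class SNON(nn.Module):
    def __init__(self, w_dim=512, base=4, ch=64):
        super().__init__()
        self.base, self.ch = base, ch
        self.fc = nn.Linear(w_dim, ch * base * base)          # w -> base x base map
        self.blocks = nn.ModuleList(ProgressiveBlock(ch) for _ in range(3))
        self.to_noise = nn.ModuleList(nn.Conv2d(ch, 1, 1) for _ in range(3))
        self.short = nn.ModuleList(                            # shortcut: w -> Rw_i
            nn.Linear(w_dim, (base * 2 ** (i + 1)) ** 2) for i in range(3))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, w):
        x = self.fc(w).view(-1, self.ch, self.base, self.base)
        noises = []
        for i, blk in enumerate(self.blocks):
            x = blk(x)                                         # progressive -> Rn_i
            res = self.base * 2 ** (i + 1)
            rw = self.short[i](w).view(-1, 1, res, res)        # shortcut   -> Rw_i
            noises.append(self.act(self.to_noise[i](x) + rw))  # merge + leaky ReLU
        return noises  # multi-scale noise maps for the finer SyN layers
```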
3.4. Loss Function
SII-D and SII-G are trained according to the noise loss, which is the combination of the image secure loss $\mathcal{L}_S$ and the fidelity loss $\mathcal{L}_F$. The image secure loss $\mathcal{L}_S$ measures the complexity of the image [36]. As discussed in Section 2.4, with the filter banks $K^{(k)}$, three directional residuals $R^{(k)}$ are obtained, where $k = 1, 2, 3$. If we directly used the raw residuals as the image secure loss, the noise loss would be dominated by $\mathcal{L}_S$ and the created image could lack fidelity. In our scheme, we use the “ln” operation to turn the larger residuals into smaller ones. Combining the three converted residuals, $\mathcal{L}_S$
is written as

$$\mathcal{L}_S = \sum_{k=1}^{3} \ln\left( \left\| R^{(k)} \right\|_1 \right) \qquad (8)$$
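A minimal sketch of this loss, assuming the three directional wavelet-based kernels of Section 2.4 are stacked into a filter tensor and the residuals are combined through their log-compressed L1 norms (the exact norm is our assumption):

```python
import torch
import torch.nn.functional as F

def image_secure_loss(x, filters, eps=1e-8):
    """x: generated image (B, 1, H, W); filters: (3, 1, k, k) tensor
    holding the three directional high-pass kernels of Section 2.4."""
    residuals = F.conv2d(x, filters)            # (B, 3, H', W') directional residuals
    norms = residuals.abs().flatten(2).sum(-1)  # per-direction L1 norms
    return torch.log(norms + eps).sum(dim=1).mean()  # "ln"-compressed combination
```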
The fidelity loss $\mathcal{L}_F$ makes the synthetic image more vivid. Inspired by the optimal characteristics of LPIPS, we adopt the LPIPS metric as our feature-level loss to evaluate the quality of the generated image; it averages the normalized features extracted across all stacks of a network. Assume the reference and distorted patches are $x$ and $x_0$, of size $H \times W$. Given a network $\mathcal{F}$ with $L$ layers, we compute the normalized and scaled embeddings $\hat{y}^{l}$ and $\hat{y}_0^{l}$ of layer $l$. Collecting the parameters of all layers, the distance between $x$ and $x_0$ is computed as follows:

$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^{l}_{hw} - \hat{y}^{l}_{0hw} \right) \right\|_2^2$$

where $w_l$ is the scale parameter; setting $w_l = 1$ is equivalent to computing the cosine distance. $\mathcal{L}_F$ is defined as

$$\mathcal{L}_F = d\left( X^{(R)}, X^{(S)} \right)$$
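In practice, the fidelity term can be computed with the reference LPIPS implementation by Zhang et al.; a minimal usage sketch (image size and backbone choice are illustrative):

```python
import torch
import lpips  # pip install lpips (reference implementation of LPIPS)

loss_fn = lpips.LPIPS(net='vgg')             # VGG backbone; inputs in [-1, 1]

x_ref = torch.rand(1, 3, 256, 256) * 2 - 1   # benchmark image X^(R)
x_gen = torch.rand(1, 3, 256, 256) * 2 - 1   # secure image X^(S)
fidelity_loss = loss_fn(x_ref, x_gen)        # L_F: lower = perceptually closer
```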
We hypothesize that security and fidelity are contradictory goals: when the weight on $\mathcal{L}_S$ is higher, the generated image may be more secure but less vivid; when it is lower, the quality of the created image could be higher. Therefore, the final distortion should make a tradeoff between $\mathcal{L}_S$ and $\mathcal{L}_F$. Following this hypothesis, the noise loss $\mathcal{L}_{NL}$ is defined as the weighted sum of the secure loss and the fidelity loss:

$$\mathcal{L}_{NL} = \beta \mathcal{L}_S + \gamma \mathcal{L}_F$$

where $\beta$ and $\gamma$ are the tunable parameters.
To give a clear explanation, the process of the proposed scheme is described in Algorithm 1.
Algorithm 1 Secure Image (Cover) Generation

Input: a pre-trained StyleGAN2 generator SI-G; a latent code w; SNON; stage-II SyN; a discriminator SII-D; a random noise map N; the noise loss $\mathcal{L}_{NL}$.
Output: a secure synthesis image (cover) X(S).
(1) Use w and SI-G to output the benchmark image X(R).
(2) Feed w and N into SNON and stage-II SyN to generate the synthesis image X(S).
(3) Compute the noise loss between X(S) and X(R).
(4) Update the tunable parameters β and γ to minimize $\mathcal{L}_{NL}$.
(5) Use the optimal parameters to output the optimal X(S).
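A PyTorch-style sketch of Algorithm 1 follows; the module interfaces and the choice of optimizing the SNON parameters are our assumptions about how the loop would be realized, not the authors' released code.

```python
import torch

def generate_secure_cover(w, si_g, snon, sii_syn, noise_loss, steps=200):
    """w: intermediate latent code; si_g: frozen StyleGAN2 generator;
    snon / sii_syn: stage-II modules; noise_loss: beta*L_S + gamma*L_F."""
    with torch.no_grad():
        x_ref = si_g(w)                      # step (1): benchmark image X^(R)
    opt = torch.optim.Adam(snon.parameters(), lr=0.1)
    for _ in range(steps):
        noises = snon(w)                     # step (2): optimized noise maps
        x_sec = sii_syn(w, noises)           #           secure image X^(S)
        loss = noise_loss(x_sec, x_ref)      # step (3): noise loss
        opt.zero_grad()
        loss.backward()                      # step (4): minimize the noise loss
        opt.step()
    with torch.no_grad():
        return sii_syn(w, snon(w))           # step (5): final secure cover
```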
4. Experimental Results and Discussion
In this section, we present extensive experimental results for performance evaluation and image quality analysis.
4.1. Settings
4.1.1. Image Sets
Three image sets are used in our experiments. The first is LSUN, which contains around one million labeled images for each of ten scene categories and twenty object categories. We choose LSUN Cat as the training set for the two stages of NOStyle. Here, due to the high demand on GPU resources and energy consumption, we adopt a pre-trained model for the first stage of our architecture. The second image set, named GSI (generated secure images), contains 80,000 grayscale images created by StyleGAN2, NOStyle-SLA, NOStyle-SLB, and NOStyle; NOStyle-SLA and NOStyle-SLB are monolayer versions of NOStyle. The third image set includes 10,000 images, which are the down-sampled version of BOSSbase ver. 1.01 produced by the Matlab "imresize" function [37].
4.1.2. Steganographic Methods
In total, four steganographic methods are used as testing algorithms: the spatial method S-UNIWARD, two JPEG methods J-UNIWARD and UED, and a deep-learning steganographic method SGAN. S-UNIWARD and J-UNIWARD are based on the directional high-pass filter groups discussed in Section 2.4; for these methods, the steganographic distortions rely on the directional residuals computed from the spatial/decompressed JPEG image. Based on the intra-/inter-block neighborhood coefficients, the other typical steganographic method, UED, minimizes the overall statistical changes of DCT coefficients by modifying the non-zero quantized DCT coefficients with equal probability. Apart from the classical methods, many GAN-based and CNN-based schemes exist; among them, SGAN utilizes a GAN-based architecture to achieve better security.
Generally, the amount of embedded data is measured by the payload, the ratio of the embedded message length to the number of available elements (pixels or non-zero JPEG coefficients). According to the format of the cover, the payload is measured in bits per pixel (bpp) or bits per non-zero AC coefficient (bpnzAC). For example, if the capacity of the embedded data is C and the number of available pixels is N, the relative payload is $\alpha = C/N$. Applying STC, the message is embedded into a cover with minimized distortion to achieve undetectability.
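The payload computation itself is a one-liner; for example, a 26,214-bit message in a 512×512 grayscale cover gives roughly 0.1 bpp:

```python
def relative_payload(message_bits: int, n_elements: int) -> float:
    """alpha = C / N, in bpp (pixels) or bpnzAC (non-zero AC coefficients)."""
    return message_bits / n_elements

alpha = relative_payload(26_214, 512 * 512)  # ~0.1 bpp
```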
4.1.3. Steganalyzers
Three effective steganalyzers, DCTR, JRM, and SRMQ1, are employed to evaluate the security performance of the generated images. Depending on the mutual position of two adjacent/non-adjacent coefficients, SRMQ1 and JRM use co-occurrences to capture the correlation and statistical dependency of coefficients. DCTR consists of first-order statistics of quantized noise residuals calculated from the decompressed JPEG image using the 64 kernels of the discrete cosine transform.
4.1.4. Security Evaluation
The security evaluation is carried out on two databases, GSI and BOSSbase ver. 1.01. The chosen classifier is an ensemble classifier in which a series of sub-classifiers, each a Fisher linear discriminant (FLD), is constructed on subspaces of the original feature space, and the final decision is made by fusing the individual decisions of the sub-classifiers.
The whole experimental process is divided into two stages: training and testing. At the training stage, using the designated steganographic algorithm and cover dataset, we construct the corresponding stego images. Then, we randomly choose one half of the cover set and the corresponding stego images, in equal numbers, to create the training set. Finally, based on the statistical differences between the selected cover and stego images, we obtain a trained ensemble classifier which can judge whether an image is a cover or a stego one.
Combining the remaining cover and stego images, we construct the testing set, on which the performance is evaluated. In the testing stage, there are two kinds of errors: a cover judged as a stego image, and a stego image judged as a cover. These two errors stand for the false alarm and the missed detection, abbreviated as $P_{FA}$ and $P_{MD}$, respectively. The classification error is defined as the minimal average error under equal priors:

$$P_E = \min_{P_{FA}} \frac{1}{2}\left( P_{FA} + P_{MD} \right)$$

The security of the generated cover is evaluated by $P_E$; a higher $P_E$ means the cover owns better security.
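Given per-image classifier scores, $P_E$ can be estimated by sweeping the decision threshold, as in the sketch below (score convention: larger means "stego"):

```python
import numpy as np

def pe_from_scores(cover_scores, stego_scores):
    """Minimal average error under equal priors: sweep the threshold
    and minimize (P_FA + P_MD) / 2."""
    thresholds = np.unique(np.concatenate([cover_scores, stego_scores]))
    best = 0.5
    for t in thresholds:
        p_fa = (cover_scores >= t).mean()   # cover judged as stego
        p_md = (stego_scores < t).mean()    # stego judged as cover
        best = min(best, 0.5 * (p_fa + p_md))
    return best
```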
4.2. Key Elements of Proposed Architecture
4.2.1. Image Secure Loss
As discussed in Section 3.4, the image secure loss guarantees the security of the generated image. According to Equation (8), we use the directional residuals to design the image secure loss. If the directional residuals are large and we use them directly to build $\mathcal{L}_S$, the final noise loss could be dominated by $\mathcal{L}_S$ and the effect of $\mathcal{L}_F$ may be ignored. In this case, the fidelity of the generated image cannot be guaranteed. To give a clear illustration, we compare a normally generated image with an abnormal image created without the “ln” operation. As shown in Figure 4, compared with the normal image (left), the abnormal image (right) looks like random noise. Therefore, it is necessary to use the “ln” operation to compress the larger directional residuals into smaller values.
4.2.2. Fidelity Loss
As analyzed in Section 3.3, fidelity is the other key part of the noise loss. Inspired by the optimal characteristics of LPIPS, we use LPIPS to measure the fidelity of the generated image. We find that using only the image secure loss may make the image less vivid; even when the “ln” operation is used, the fidelity of the image cannot be guaranteed. To reveal the effect of the fidelity loss, we present a group of comparison images in Figure 5. The left image is generated by StyleGAN2 and the right one is created by NOStyle without LPIPS; compared with StyleGAN2, the latter image is blurry. Hence, according to the comparative results of Figure 4 and Figure 5, the final noise loss should be the optimal combination of the image secure loss and the fidelity loss.
4.2.3. Hyperparameters
Based on the analysis above, $\beta$ and $\gamma$ are both key to generating a secure and high-quality image, and they are set according to the magnitudes of the residuals and LPIPS. Meanwhile, we use leaky ReLU and the equalized learning rate for all layers. To enhance the quality of the image, we follow some valuable conclusions of previous work and use the truncation trick to capture the area of high density; the truncation parameter is set to 0.5. An Adam optimizer with learning rate 0.1 is used to train our network.
4.3. Ablation Experiment
According to the discussion in Section 3.2, through iteration the proposed architecture generates and adjusts three injected noise map groups to enhance the security of the generated image; each group matches the resolution of the SyN layer it is injected into. To evaluate the effect of each noise map group, we individually inject the first and the second group into the synthesis network to create images, and the resulting generative methods are named NOStyle-SLA and NOStyle-SLB, respectively.
To test the security of the above two methods, we choose two image datasets, BOSSbase and GSI, as the cover sets. First, all spatial images are compressed into JPEG versions with quality factors 75 and 95. Then, we employ the steganographic methods J-UNIWARD and UED to create stego images. After extracting the DCTR feature and applying the ensemble classifier, we obtain the results listed in Table 1, Table 2, Table 3 and Table 4.
We observe that the security of the images generated by NOStyle-SLA and NOStyle-SLB outperforms that of the standard image set BOSSbase and of the image set created by StyleGAN2. Across six relative embedding rates, the average improvements of NOStyle-SLA and NOStyle-SLB over StyleGAN2 are about 0.44% and 1.21%, respectively. Moreover, the results show that, at most relative payloads, NOStyle-SLB is more secure than NOStyle-SLA. Therefore, we conclude that the second noise map group enhances the security of the generated images more effectively than the first. Since both groups raise the security of the generated image, both are employed in our final scheme to create the secure image (cover).
4.4. Quality of Generated Images
4.4.1. Comparison of Macroscopic Architecture
Based on LSUN Cat and the optimal parameters, we have three image synthesis methods: NOStyle-SLA, NOStyle-SLB, and NOStyle. Together with StyleGAN2, we obtain four methods in total. Using the different image generation methods with the same non-linear latent code z, we can create similar images of the same scene. Figure 6 gives a set of comparison examples; each comparison includes two sub-images generated by StyleGAN2 (left) and NOStyle (right). The results show that the overall structure of each image pair is almost the same and the stochastic details of the generated image are precisely represented. Therefore, the quality of the images generated by NOStyle is rather high. However, careful observation reveals some tiny differences distributed in the detailed regions; the corresponding analysis is given in the next subsection.
Besides the visual characteristics, we also examine the feature representation of the generated images. As discussed in [25], the Fréchet inception distance (FID) is an excellent measure of image quality: a lower FID score indicates higher-quality images, and vice versa. FID is defined as

$$\mathrm{FID} = \left\| m - m_w \right\|_2^2 + \mathrm{Tr}\left( C + C_w - 2\left( C C_w \right)^{1/2} \right)$$

where $p(\cdot)$ and $p_w(\cdot)$ represent the distributions of the generated images and the real images, $m$ and $C$ are the mean and covariance of $p(\cdot)$, and $m_w$ and $C_w$ are the mean and covariance of $p_w(\cdot)$. For the 80,000 generated images, we calculate the FID to measure image quality. The corresponding FIDs are listed in Table 5.
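Given the Inception-feature statistics of the generated and real image sets, FID can be computed as in this standard sketch (not the evaluation code used for Table 5):

```python
import numpy as np
from scipy import linalg

def fid(mu_g, cov_g, mu_r, cov_r):
    """||m - m_w||^2 + Tr(C + C_w - 2 (C C_w)^{1/2}) from feature statistics."""
    covmean = linalg.sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):
        covmean = covmean.real              # discard numerical imaginary noise
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```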
The results in Table 5 show that the FID of NOStyle is the highest and, conversely, that of StyleGAN2 is the lowest. Since a lower FID means higher image quality, we conclude that the quality of the images generated by NOStyle is lower than that of the other three generative networks. However, the gap between the four FID values is quite small, so for the given four generative methods the difference in image quality is rather small. Meanwhile, comparing the FIDs of NOStyle-SLA and NOStyle-SLB, the FID value of NOStyle-SLB is slightly higher than that of NOStyle-SLA; hence, the image quality of NOStyle-SLA is higher than that of NOStyle-SLB. Combining the detection results and FID, for unconditional high-quality image synthesis, we conclude that a higher FID value corresponds to higher image security. We therefore suppose that there is a tight connection between FID and image security; further analysis is given in Section 4.6.
4.4.2. Detail Comparison of Various Methods
According to the experimental results in [24] and the analysis above, StyleGAN2 displays excellent performance in generating high-quality images. Compared with StyleGAN2, NOStyle keeps some key components, including MN and SyN; the main differences between the two style-based generative networks lie in SNON and the stage-II discriminator. Therefore, comparing images generated by the different generative models, we assert that the styles corresponding to coarse and middle spatial resolutions are the same, while the details distributed in the complex regions have minor differences.
To show the local differences of the generated images, we focus on the same complex region of four images created by StyleGAN2, NOStyle-SLA, NOStyle-SLB, and NOStyle; the comparison results are given in Figure 7. For the given four methods, the chosen regions look almost the same. However, careful observation reveals some tiny differences distributed in the complex region, because we only adjust the high-resolution noise maps. In fact, these spatial differences bring about changes in security and fidelity.
Figure 8 gives examples of the generated covers, the corresponding stego images, and the modification maps. The stego images are generated by J-UNIWARD at 0.2 bpnzAC for JPEG quality factor 85. Although the four stego images look almost the same, the modification maps show that the embedding changes in the DCT domain are quite different. From the viewpoint of steganography, these embedding differences cause the difference in security, and we conclude that there is a strong connection between image synthesis and security.
4.5. Security Performance
In this part, we compare the security performance of the original image set BOSSbase and the image sets generated by the different generative models, including StyleGAN2, NOStyle-SLA, NOStyle-SLB, and NOStyle. The experiments are carried out in the spatial and JPEG domains. To construct the JPEG image sets, the original grayscale image set GSI is compressed into JPEG images with quality factors 75, 85, and 95. After the compression operation, we obtain 40,000 spatial images and 120,000 JPEG images in total. The experiments are executed on these 160,000 images, and the payloads for each image set are 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5.
For the spatial cover sets, we choose three steganographic schemes, S-UNIWARD, HILL, and SGAN, to generate stego images. For the JPEG cover sets, two JPEG steganographic schemes, J-UNIWARD and UED, are used. The original image sets and the corresponding stego image sets are then divided into two parts of equal size. Finally, with the FLD-based ensemble classifier, we obtain the detection results shown in Table 6, Table 7 and Table 8 and Figure 9, Figure 10, Figure 11 and Figure 12.
According to the above testing results, compared with the other four image sets (BOSSbase, StyleGAN2, NOStyle-SLA, and NOStyle-SLB), NOStyle achieves the best security performance at almost every payload against SRMQ1, for both the typical spatial and the GAN-based steganographic schemes. On the other hand, for the two JPEG steganographic methods UED and J-UNIWARD, NOStyle is more secure than StyleGAN2 against JRM and DCTR; on average, across six payloads, the improvements of NOStyle over StyleGAN2 are 1.19%, 0.94%, 1.32%, 1.02%, 1.28%, and 0.71%, respectively. The experiments indicate that, compared with the typical image generation scheme StyleGAN2, NOStyle optimizes the injected noise maps and enhances the security of the generated images. Comparing the spatial and JPEG detection results, we observe that NOStyle gains a bigger improvement on the JPEG steganographic schemes.
4.6. Connection between Security and Fidelity
In this section, we construct the connection between security and quality. To achieve this goal, we first select 8000 images from the four image sets created by StyleGAN2, NOStyle, NOStyle-SLA, and NOStyle-SLB, with equal numbers from each. For convenience, we refer to the four schemes as SG2, NS, NSA, and NSB. The experiments are carried out on all selected images, and the three relative payloads for each image set are 0.2, 0.3, and 0.4. Meanwhile, we use two JPEG steganographic methods (J-UNIWARD and UED) to generate stego images. Finally, the security testing is carried out with the extracted DCTR and JRM features. With two steganographic methods, three relative payloads, and two quality factors, we obtain many combinations; for example, testing J-UNIWARD with DCTR at embedding rate 0.2 for JPEG quality factor 75 is abbreviated as “D-J-75-2”.
Suppose we fix a testing combination; for the four generative models, we then obtain four detection errors. We define the ratio $P_{SF}$ as the division of each $P_E$ by the maximum of the four detection errors:

$$P_{SF} = \frac{P_E}{\max\left( P_E^{\mathrm{SG2}}, P_E^{\mathrm{NS}}, P_E^{\mathrm{NSA}}, P_E^{\mathrm{NSB}} \right)}$$
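A small sketch of this normalization; the detection errors in the usage line are hypothetical, not values from Table 9:

```python
def psf(pe_values):
    """Divide each model's detection error by the maximum over the four
    models, so the most secure model in a combination gets P_SF = 1."""
    m = max(pe_values.values())
    return {k: v / m for k, v in pe_values.items()}

# hypothetical errors for one combination, e.g., "D-J-75-2":
print(psf({'SG2': 0.31, 'NSA': 0.32, 'NSB': 0.33, 'NS': 0.34}))
```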
Additionally, we apply the same normalization to the corresponding FID values in Table 5. Across all the parameter combinations and the FID, we obtain 13 groups of ratios in total, which are listed in Table 9.
According to the results shown in Table 9, in nearly all cases the $P_{SF}$ of NOStyle is 1, which means that the detection error of NOStyle, and hence the security of its generated images, is the highest. Moreover, the tendency of $P_{SF}$ across the various combinations is consistent with that of the FID ratios. Therefore, we assert that the security of the generated image is inversely proportional to its fidelity: under this generative mechanism, a lower fidelity corresponds to a higher image security, and vice versa for the case of a higher fidelity.
We now describe the relationship between the security of the generated image and its fidelity. As discussed previously, in the classical framework of image generation, the key task is to maintain the fidelity of the generated image through the disentanglement and style mixing mechanisms. Generally, the stochastic control of image details is achieved by injecting stochastic noise at different layers of the synthesis network, which is trained based on the noise loss. From the steganographic viewpoint, combining the security loss and fidelity loss, we redesign the network loss, retrain the synthesis network, and obtain the stochastic map. However, for pure image generation, the resulting stochastic map is not optimal, and during image synthesis the fidelity of the generated image is diminished. Indeed, according to the results of Table 5, the FID of NOStyle is slightly higher than that of StyleGAN2, which means that the fidelity of images generated by NOStyle is slightly worse than that of images created by StyleGAN2. The above analysis therefore confirms that the security of the generated image is inversely proportional to its fidelity. However, the difference in fidelity between the two methods is tiny, while the experiments show that, compared with StyleGAN2, NOStyle makes much bigger progress in image security.
Based on the experimental results and analysis, we see that, by designing the secure noise optimization network and an optimal noise loss, we achieve the optimization of the injected noise, which can be used to generate a secure and high-quality image. The proposed scheme thus makes a tradeoff between security and fidelity.
4.7. Computational Complexity
For model-based image synthesis, computational complexity is a key factor in making the proposed approach applicable. We execute a set of experiments to evaluate the computational complexity of the three methods. To train our model, we choose a subset of the LSUN Cat dataset as the training set for our proposed image synthesis network NOStyle. As discussed previously, our architecture mainly has two stages, stage-I and stage-II: the stage-I NOStyle generator is inherited from the pre-trained StyleGAN2, and the stage-II NOStyle optimizes the injected noise and generates a high-quality/secure image. Therefore, the computational complexity of the proposed scheme mainly depends on that of stage-II. The experiments are run on a server with a 2.2 GHz CPU, 16 GB memory, and a 2080 Ti GPU. The computational complexity, represented as training time (h), is shown in Figure 13, in which NS, NSA, and NSB stand for the three comparative methods.
According to the results in Figure 13, the computational complexity of NS is higher than that of NSA and NSB. Meanwhile, due to the similar mechanisms of NSA and NSB, the training times of these two generative methods are almost the same. However, the differences between NS and the other two methods are small; on average, the difference in training time is about half an hour. As discussed previously, considering the high demand on GPU resources and energy consumption, we directly use the pre-trained model of StyleGAN2, so the training time of StyleGAN2 is not shown in Figure 13; training StyleGAN2 from scratch is clearly more expensive than training NS, NSA, or NSB. With this low computational overhead, the practicality of the proposed approach is high.
4.8. Stochastic Variation
Let us consider how the style-based methods implement the stochastic variation. Given the designed network, the stochastic realizations (noise maps) are achieved by adding per-pixel noise after each convolution of the network. According to the comparative results in Figure 6 and Figure 7 and the related discussion in [25], the noise only affects the stochastic aspects of the generated image, such as hair, fur, or freckles, while the overall composition of the different generated images remains unchanged. For our proposed architecture NOStyle, the injected noise is not totally random; it is adjusted to maintain image fidelity and security. Therefore, the pseudo-random noise indeed affects the security of the generated image, and our architecture optimizes this noise to make an ideal tradeoff between image security and fidelity.