
Researchers Identify a Resilient Trait of Deepfakes That Could Aid Long-Term Detection

Since the earliest deepfake detection solutions began to emerge in 2018, the computer vision and security research sector has been seeking to define an essential characteristic of deepfake videos – a signal that could prove resistant to improvements in popular facial synthesis technologies (such as autoencoder-based deepfake packages like DeepFaceLab and FaceSwap, and the use of Generative Adversarial Networks to recreate, simulate or alter human faces).

Many of the early ‘tells', such as a lack of blinking, were made redundant by improvements in deepfake techniques. Potential digital provenance approaches (such as the Adobe-led Content Authenticity Initiative) – including blockchain methods and the digital watermarking of potential source photos – either require sweeping and expensive changes to the existing body of source images available on the internet, or would need a notable cooperative effort among nations and governments to create systems of invigilation and authentication.

Therefore it would be very useful if a truly fundamental and resilient trait could be discerned in image and video content that features altered, invented, or identity-swapped human faces; a characteristic that could be inferred directly from falsified videos, without large-scale verification, cryptographic asset hashing, context-checking, plausibility evaluation, artifact-centric detection routines, or other burdensome approaches to deepfake detection.

Deepfakes in the Frame

A new research collaboration between China and Australia believes that it has found this ‘holy grail', in the form of regularity disruption.

The authors have devised a method of comparing the spatial integrity and temporal continuity of real videos against those that contain deepfaked content, and have found that any kind of deepfake interference disrupts the regularity of the image, however imperceptibly.

This is partly because the deepfake process breaks the target video down into frames and applies the effect of a trained deepfake model to each (substituted) frame. Popular deepfake distributions act in the same way as animators in this respect, giving more attention to the authenticity of each frame than to each frame's contribution to the overall spatial integrity and temporal continuity of the video.
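As a rough sketch of this frame-by-frame pipeline (illustrative only, and not the authors' code: the swap_face callable below is a hypothetical stand-in for a trained face-swap model), the process can be caricatured in Python with OpenCV:

```python
# Illustrative sketch only: a hypothetical per-frame deepfake pipeline.
# `swap_face` stands in for a trained autoencoder/GAN face-swap model
# (the kind shipped with packages like DeepFaceLab); it is not a real API.
import cv2

def deepfake_video(src_path: str, dst_path: str, swap_face) -> None:
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Each frame is processed in isolation: the model never sees the
        # neighbouring frames, so temporal continuity is left to chance.
        out.write(swap_face(frame))
    cap.release()
    out.release()
```

The point of the sketch is structural: the model receives one frame at a time, and nothing in the loop enforces consistency between consecutive output frames.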

From the paper: A) Differences between the kinds of data. Here we see that p-fake's disturbances change the spatio-temporal quality of the image in the same way as a deepfake does, without substituting identity. B) Noise analysis of the three types of data, showing how p-fake imitates deepfake disruption. C) A temporal visualization of the three types of data, with real data demonstrating greater integrity in fluctuation. D) The t-SNE visualization of extracted features for real, faked, and p-faked video. Source: https://arxiv.org/pdf/2207.10402.pdf

This is not the way that a video codec treats a series of frames when an original recording is being made or processed. In order to save on file-size, or to make a video suitable for streaming, a tremendous amount of information is discarded by the video codec. Even at its highest-quality settings, the codec will allocate key-frames – entire, practically uncompressed images that occur at a preset interval (a variable that can be set by the user) in the video.

The interstitial frames between key-frames are, to an extent, estimated as variants of those key-frames, and will re-use as much information as possible from the adjacent key-frames, rather than being complete frames in their own right.

On the left, a complete key-frame, or ‘i-frame', is stored in the compressed video, at some expense of file-size; on the right, an interstitial ‘delta frame' reuses any applicable part of the more data-rich key-frame. Source: https://blog.video.ibm.com/streaming-video-tips/keyframes-interframe-video-compression/

In this way, the block (containing a number of frames determined by the key-frame settings) is arguably the smallest unit considered in a typical compressed video, rather than any individual frame. Even the key-frame itself, known as an i-frame, forms part of that unit.

In terms of traditional cartoon animation, a codec is performing a species of in-betweening, with the key-frames operating as tent-poles for the interstitial, derived frames, known as delta frames.
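For readers who want to inspect this structure directly, a short Python sketch (assuming ffprobe, part of the FFmpeg suite, is installed and on the path) can count how many frames of a video are full i-frames and how many are derived P/B delta frames:

```python
# Count i-frames vs. delta frames in a video via ffprobe (must be installed).
import subprocess
from collections import Counter

def frame_types(path: str) -> Counter:
    # Ask ffprobe to print the picture type (I, P or B) of every video frame.
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "frame=pict_type", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    types = [line.strip().strip(",") for line in result.stdout.splitlines() if line.strip()]
    return Counter(types)

# Example (hypothetical file): a long GOP yields few 'I' and many 'P'/'B' frames.
print(frame_types("sample.mp4"))
```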

By contrast, deepfake superimposition devotes enormous attention and resources to each individual frame, without considering the frame's wider context, and without making allowance for the way that compression and block-based encoding affect the characteristics of ‘authentic' video.

A closer look at the discontinuity between the temporal quality of an authentic video (left), and the same video when it is disrupted by deepfakes (right).

Though some of the better deepfakers use extensive post-processing, in packages such as After Effects, and though the DeepFaceLab distribution has some native capacity to apply ‘blending' procedures like motion blur, such sleight-of-hand does not affect the mismatch of spatial and temporal quality between authentic and deepfaked videos.
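One informal way to visualize this mismatch (a toy measurement under our own assumptions, not the paper's method) is to chart the mean absolute difference between consecutive frames; authentic footage tends to fluctuate smoothly, while per-frame tampering can leave an irregular, jittery residual:

```python
# A rough, unofficial proxy for temporal regularity: the mean absolute
# difference between consecutive greyscale frames.
import cv2
import numpy as np

def temporal_residuals(path: str) -> list[float]:
    cap = cv2.VideoCapture(path)
    residuals, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            residuals.append(float(np.mean(np.abs(gray - prev))))
        prev = gray
    cap.release()
    return residuals

# Plotting temporal_residuals("real.mp4") against temporal_residuals("fake.mp4")
# side by side is where the kind of discontinuity pictured above shows up.
```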

The new paper is titled Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption, and comes from researchers at Tsinghua University, the Department of Computer Vision Technology (VIS) at Baidu Inc., and the University of Melbourne.

‘Fake' Fake Videos

The researchers behind the paper have incorporated the functionality of the research into a plug-and-play module named Pseudo-fake Generator (P-fake Generator), which transforms real videos into faux-deepfake videos by perturbing them in the same way that the actual deepfake process does, without performing any actual deepfake operations.

Tests indicate that the module can be added to all existing deepfake detection systems at practically zero cost of resources, and that it notably improves their performance.
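The paper's generator is considerably more sophisticated, but its core idea can be caricatured in a few lines of Python: perturb a region of each frame independently, so that the output acquires deepfake-like spatio-temporal irregularity without any identity ever being swapped. The face-box coordinates and perturbation strengths below are arbitrary placeholders:

```python
# Toy sketch only, not the paper's P-fake Generator: per-frame, independent
# degradation of a (placeholder) face region breaks temporal regularity
# without swapping any identity.
import cv2
import numpy as np

rng = np.random.default_rng(0)

def pseudo_fake_frame(frame: np.ndarray, box=(100, 100, 160, 160)) -> np.ndarray:
    x, y, w, h = box                      # stand-in for a detected face box
    region = frame[y:y + h, x:x + w].astype(np.float32)
    # Blur strength and noise are drawn afresh for every frame, so adjacent
    # output frames no longer share a consistent degradation signature.
    k = int(rng.integers(1, 4)) * 2 + 1   # random odd Gaussian kernel size
    region = cv2.GaussianBlur(region, (k, k), 0)
    region += rng.normal(0, 2.0, region.shape).astype(np.float32)
    frame = frame.copy()
    frame[y:y + h, x:x + w] = np.clip(region, 0, 255).astype(np.uint8)
    return frame
```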

The discovery could help to address one of the other stumbling blocks in deepfake detection research: the lack of authentic and up-to-date datasets. Since deepfake generation is an elaborate and time-consuming process, the community has developed a number of deepfake datasets over the last five years, many of which are quite out-of-date.

By isolating regularity disruption as a deepfake-agnostic signal for videos that have been altered after the fact, the new method makes it possible to generate limitless sample and dataset videos that key in on this facet of deepfakes.

Overview of the STE block, where channel-wise temporal convolution is used as a spur to generate spatio-temporally enhanced encodings, resulting in the same signature that even a very convincing deepfake will yield. By this method, ‘fake' fake videos can be generated that bear the same signature characteristics as any altered, deepfake-style video, and which do not hinge upon particular distributions, or upon volatile aspects such as feature behavior or algorithmic artifacts.
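One plausible reading of ‘channel-wise temporal convolution' (a speculative sketch; consult the paper for the authors' exact STE design) is a depthwise 3D convolution that mixes information only along the time axis, per channel, leaving spatial content untouched. In PyTorch:

```python
# Speculative interpretation of a channel-wise temporal convolution block;
# not the authors' STE implementation.
import torch
import torch.nn as nn

class ChannelwiseTemporalConv(nn.Module):
    def __init__(self, channels: int, kernel_t: int = 3):
        super().__init__()
        # groups=channels => each channel gets its own temporal filter, and
        # the (1, 1) spatial kernel means only the time axis is convolved.
        self.conv = nn.Conv3d(
            channels, channels,
            kernel_size=(kernel_t, 1, 1),
            padding=(kernel_t // 2, 0, 0),
            groups=channels,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return x + self.conv(x)  # residual, so spatial features pass through

clip = torch.randn(2, 64, 8, 56, 56)
print(ChannelwiseTemporalConv(64)(clip).shape)  # torch.Size([2, 64, 8, 56, 56])
```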

Tests

The researchers conducted experiments on six noted datasets used in deepfake detection research: FaceForensics++ (FF++); WildDeepFake; Deepfake Detection Challenge preview (DFDCP); Celeb-DF; Deepfake Detection (DFD); and Face Shifter (FSh).

For FF++, the researchers trained their model on the original dataset and tested each of the four subsets separately. Without the use of any deepfake material in training, the new method was able to surpass state-of-the-art results.

The method also took pole position when compared against the FF++ C23 compressed dataset, which provides examples featuring the kind of compression artifacts that are credible in real-world deepfake viewing environments.

The authors comment:

‘Performances within FF++ validate the feasibility of our main idea, while generalizability remains a major problem of existing deepfake detection methods, as the performance is not guaranteed when testing on deepfakes generated by unseen techniques.

‘Consider further the reality of the arms race between detectors and forgers, generalizability is an important criterion to measure the effectiveness of a detection method in the real world.'

Though the researchers conducted a number of sub-tests (see paper for details) around ‘robustness' and the varying of input video types (real, fake, p-faked, etc.), the most interesting results are from the test for cross-dataset performance.

For this, the authors trained their model on the aforementioned ‘real world' C23 version of FF++, and tested it against four datasets, obtaining, the authors state, superior performance across all of them.

Results from the cross-dataset challenge. The paper notes that SBI uses a similar approach to the authors' own, while, the researchers claim, p-fake shows better performance for spatio-temporal regularity disruption.

The paper states:

‘On the most challenging Deepwild, our method surpasses the SOTA method by about 10 percentage points in terms of AUC%. We think this is due to the large diversity of deepfakes in Deepwild, which makes other methods fail to generalize well from seen deepfakes.'

Metrics used for the tests were Accuracy Score (ACC), Area Under the Receiver Operating Characteristic Curve (AUC), and Equal Error Rate (EER).
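For reference, all three metrics can be computed from a detector's raw scores; the sketch below uses scikit-learn and toy data, and is not the authors' evaluation code:

```python
# Generic metric computation for a binary deepfake detector (scikit-learn
# assumed; labels are 1 for fake, 0 for real; data below is toy data).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

labels = np.array([0, 0, 1, 1, 1, 0])               # toy ground truth
scores = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2])   # toy detector outputs

acc = accuracy_score(labels, scores > 0.5)   # ACC at a 0.5 threshold
auc = roc_auc_score(labels, scores)          # AUC
fpr, tpr, _ = roc_curve(labels, scores)
eer = fpr[np.nanargmin(np.abs(fpr - (1 - tpr)))]  # EER: where FPR ≈ FNR

print(f"ACC={acc:.2f} AUC={auc:.2f} EER={eer:.2f}")
```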

Counter-Attacks?

Though the media characterizes the tension between deepfake developers and deepfake detection researchers in terms of a technological war, it's arguable that the former are simply trying to make more convincing output, and that increased deepfake detection difficulty is a circumstantial by-product of these efforts.

Whether developers will try to address this newly-revealed shortcoming depends, perhaps, on whether they feel that regularity disruption can be perceived by the naked eye as a token of inauthenticity in a deepfake video, and that the metric is therefore worth addressing from a purely qualitative point of view.

Though five years have passed since the first deepfakes went online, deepfaking is still a relatively nascent technology, and the community is arguably more obsessed with detail and resolution than with correct context, or with matching the signatures of compressed video – both of which require a certain ‘degradation' of output, the very thing that the entire deepfake community is currently struggling against.

If the general consensus among deepfake developers turns out to be that regularity disruption is a signature that does not affect perceived quality, there may be no effort to compensate for it – even if it can be ‘cancelled out' by some post-processing or in-architecture procedure, which is far from clear.


First published 22nd July 2022.