Dealing with Deepfakes
Written by Marco Fontani   

“SEEING IS BELIEVING.” Or, rather, that’s what we used to say. Since the beginning of time, seeing a fact or a piece of news depicted in an image was far more compelling than reading about it, let alone hearing about it from someone else. This power of visual content probably stemmed from its immediacy: looking at a picture takes less effort and training than reading text, or even listening to words. Then the advent of photography brought an additional flavor of indisputable objectivity. Thanks to photography, pictures could be used as a reliable record of events.

This article appeared in the January-February 2021 issue of Evidence Technology Magazine.

Looking closer, however, it turns out that photographs have been faked since shortly after their invention. One of the most famous historical hoaxes, dating back to the late 1860s, is Abraham Lincoln’s head cleverly spliced onto John Calhoun’s body (Figure 1). (A full description of the hoax is available on hoaxes.org.)


Figure 1.

Politics was indeed an important driver for image manipulation throughout the years, as evidenced by the many fake pictures created to serve leaders of democracies and tyrannies alike. We have photos of the Italian dictator Benito Mussolini proudly sitting on a horse that was being held by an ostler (the ostler promptly erased), photos of Joseph Stalin from which some subjects were removed after they fell into disgrace, and so on. All these pictures were “fake”, in the sense that they were not an accurate representation of what they purported to show.

Of course, creating hoaxes with good, old-fashioned analog pictures was not something everyone could do. It took proper tools, training, and lots of time. Then digital photography arrived, soon followed by digital-image manipulation software and, a few years later, digital-image sharing platforms. With advanced image-editing solutions available at affordable prices—or even for free—the possibilities for creating fake pictures boomed. You still needed suitable training and time to obtain professional results, but this was nothing compared to working with film.

In the last couple of years, we have witnessed yet another revolution in the manipulation of images: “deepfakes”. A deepfake is a fake image or video generated with the aid of a deep artificial neural network. It may involve replacing a person’s face with someone else’s (so-called “face-swaps”), changing what a subject is saying (“lip-sync” fakes), or even driving someone’s words and head movements so that they become a puppet, or guided actor (“re-enactment”). But how is this achieved? What are these “deep artificial neural networks”? How can we fight deepfakes? In this article, we’ll try to address these questions and bring some order to all of this.

Artificial Neural Networks
An artificial neural network (ANN) is a machine-learning algorithm, and it’s not new at all. In fact, psychologist Frank Rosenblatt proposed the first ANN as a way to model the human brain back in 1958. Like the human brain, an ANN comprises many elementary units (neurons). Each neuron is connected to other neurons through input connections and output connections, and each connection is assigned a weight. The weighted contributions coming from input neurons are summed together, and a single output value is computed using an “activation function”. The obtained output is then sent to other neurons through output connections. Neurons are distributed in layers: we have an input layer, an output layer, and an arbitrary number of “hidden” layers in between.
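To make this concrete, below is a minimal sketch of a single neuron in Python (assuming NumPy and a sigmoid activation; the input values and weights are purely illustrative).

```python
# A single artificial neuron: weighted sum of inputs, plus a bias,
# passed through an activation function. Values are illustrative.
import numpy as np

def sigmoid(z):
    # A common activation function, squashing any value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Sum the weighted contributions from the input connections,
    # then compute a single output value with the activation function.
    return sigmoid(np.dot(inputs, weights) + bias)

x = np.array([0.5, -1.2, 3.0])   # values coming from three input neurons
w = np.array([0.8, 0.1, -0.4])   # one weight per input connection
print(neuron(x, w, bias=0.2))    # a single output value between 0 and 1
```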

Like the human brain, an ANN must be trained with data—lots of data, ideally. The idea is that you need a labeled training dataset: you feed one dataset element to the neural network, wait for the output to be produced, measure how wrong the output is, and then “backpropagate” corrections to the connection weights from the output layer back to the input. Thus, training an ANN basically means updating its connection weights until the produced output matches the expected one as closely as possible.
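The loop below sketches this procedure for a tiny network with one hidden layer, assuming the PyTorch library; the toy dataset, layer sizes, and learning rate are invented for illustration, with the framework handling the backpropagation of the error to the connection weights.

```python
# Sketch of a training loop: forward pass, measure the error,
# backpropagate, update the weights. Dataset and sizes are toy values.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 8), nn.Sigmoid(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(100, 3)    # 100 labeled training samples...
targets = torch.randn(100, 1)   # ...with made-up target values

for epoch in range(200):
    optimizer.zero_grad()                 # reset gradients from the previous step
    predictions = model(inputs)           # forward pass through the layers
    loss = loss_fn(predictions, targets)  # how wrong is the output?
    loss.backward()                       # backpropagate the corrections...
    optimizer.step()                      # ...and update the connection weights
```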

As simple as they are, ANNs are extremely powerful. Technically speaking, they are “universal function approximators”, which means they can be used to compute virtually anything, provided a sufficiently complex network of neurons is allowed. And actually, ANNs have been used in many applications: playing video games, recognizing handwritten characters, spam filtering, cancer diagnosis, financial forecasting, image classification, and more.

Using ANNs for Deepfakes
Now, different tasks call for different network architectures. In general, the word “deep” in deepfakes suggests that the neural networks employed have a lot of hidden layers in order to be able to carry out complex processing tasks. As far as deepfakes are concerned, there are two neural network schemes that proved fundamental: auto-encoders and generative adversarial networks (GANs).

An auto-encoder is a particular ANN that has the same number of input and output neurons, but at least one hidden layer with a smaller number of neurons—as in Figure 2, below.


Figure 2.

The network is simply asked to recreate the input data in the output layer. But since there is a hidden layer with fewer neurons (the “bottleneck”), the network cannot simply copy elements from input to output neurons. Instead, the network must compress the information into the bottleneck, then decompress it and map it to the output. In other words, the left part of the network works as an encoder, the bottleneck layer holds the compressed data, and the right part of the network works as a decoder.
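For illustration, a minimal auto-encoder could be sketched as follows (assuming PyTorch; the layer sizes, suited to a 28×28 grayscale image flattened into 784 values, are purely illustrative).

```python
# Encoder -> bottleneck -> decoder: the network is trained to reproduce
# its own input, forcing it to learn a compressed representation.
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Left half of the network: compress the input down to the bottleneck.
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(),
                                     nn.Linear(128, 32))
        # Right half of the network: reconstruct the input from the bottleneck.
        self.decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                                     nn.Linear(128, 784), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)     # compressed data in the bottleneck
        return self.decoder(code)  # attempted reconstruction of the input
```

Training then simply minimizes the difference between the network’s output and its own input.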


How is this related to deepfakes? Well, let’s imagine we have a picture of Tom Cruise’s face and we want to swap it with Jim Carrey’s face. First, we gather many images of both actors’ faces. Then, we train two auto-encoders that share the same weights in the encoding stage (that is, from the input to the bottleneck), but have dedicated decoders—one per actor. In other words, at compression time the network learns how to preserve “common traits” shared by both faces, while at decoding time only the peculiar traits of each actor are reinforced. Now that we have these networks, the trick is to use the “wrong” decoder: we compress Tom Cruise’s face with the shared encoder, but then we deliberately use Jim Carrey’s decoder to carry out the decoding. The result will be a “Jim Carrey-fied” Tom Cruise face. Want an example? A YouTube video that added Robert Downey Jr. and Tom Holland to the Back to the Future cast is quite impressive.
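The shared-encoder, dedicated-decoder arrangement might be sketched as follows; this is only an illustration (again assuming PyTorch), and the modules and helper functions are hypothetical, not an actual face-swapping tool.

```python
# One shared encoder, two dedicated decoders (one per actor). Layer sizes
# match the auto-encoder sketch above and are purely illustrative.
import torch.nn as nn

encoder   = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder_a = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # actor A
decoder_b = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))  # actor B

def reconstructions(face_a, face_b):
    # During training, each actor's faces go through the shared encoder
    # and their own decoder; the losses compare each output to its input.
    return decoder_a(encoder(face_a)), decoder_b(encoder(face_b))

def swap(face_a):
    # The trick: compress actor A's face with the shared encoder,
    # then deliberately decode it with actor B's decoder.
    return decoder_b(encoder(face_a))
```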

Of course, the full face-swap pipeline is larger than just these auto-encoders. Faces must first be isolated in each frame, then warped and aligned to a “standard position”. The swapping happens in this standard position. The faces must then be warped back to their original position and “blended” into the original actor’s head. (Note: Face swaps normally only change the region from the mouth to the eyebrows, and only marginally affect the hair or the jawline.)
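A heavily simplified version of this pipeline could be sketched as follows, assuming OpenCV; real tools rely on facial landmarks for alignment, masking, and blending rather than the plain bounding-box detector and resizing used here, and swap_face_fn stands for the auto-encoder swap described above.

```python
# Detect faces, warp to a standard position, swap, warp back, blend.
# Everything here is simplified for illustration.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def swap_into_frame(frame, swap_face_fn, size=256):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
        face = frame[y:y+h, x:x+w]
        aligned = cv2.resize(face, (size, size))   # warp to a "standard position"
        swapped = swap_face_fn(aligned)            # the auto-encoder swap happens here
        restored = cv2.resize(swapped, (w, h))     # warp back to the original position
        mask = 255 * np.ones(restored.shape, restored.dtype)
        center = (x + w // 2, y + h // 2)
        # Blend the swapped region back into the original actor's head.
        frame = cv2.seamlessClone(restored, frame, mask, center, cv2.NORMAL_CLONE)
    return frame
```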

Now, let’s talk about generative adversarial networks. GANs are the predominant deep-learning technology employed when new content needs to be generated. For example, the popular website thispersondoesnotexist.com generates faces of people that are “hallucinated” by a neural network. These people do not exist! How is this achieved? In the definition of GAN, “generative” indicates that the goal is to generate new content, rather than classify, predict, or compress. “Adversarial” means that a GAN is actually made of two neural networks, a Generator and a Discriminator, pitted one against the other. The Discriminator is trained to distinguish real content—in our case, real faces—from synthetically generated faces. The Generator’s goal, instead, is to fool the Discriminator by producing a sufficiently realistic face starting from random noise. Typically, at the beginning of training, the Generator produces little more than random pixels and the Discriminator easily wins (i.e., it correctly detects that the generated content is not a real face). However, at every iteration the output of the Discriminator is given to the Generator as feedback, so that the Generator can improve again and again (Figure 3).
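A single training iteration of this game can be sketched as follows (assuming PyTorch; generator and discriminator stand for hypothetical networks that map random noise to an image and an image to a real/fake score, respectively).

```python
# One iteration of the adversarial game: the Discriminator learns to tell
# real from generated images, the Generator learns to fool it.
import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, opt_g, opt_d, real_images, noise_dim=100):
    batch = real_images.size(0)
    noise = torch.randn(batch, noise_dim)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # 1) Train the Discriminator: real faces should score "real",
    #    generated faces should score "fake".
    opt_d.zero_grad()
    d_loss = loss_fn(discriminator(real_images), real_label) + \
             loss_fn(discriminator(generator(noise).detach()), fake_label)
    d_loss.backward()
    opt_d.step()

    # 2) Train the Generator, using the Discriminator's output as feedback:
    #    try to make the generated faces score "real".
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(generator(noise)), real_label)
    g_loss.backward()
    opt_g.step()
```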


Figure 3.

If the two networks are properly designed, and enough training material and processing power are provided, the Generator will eventually produce extremely realistic faces, such as the one below, taken from the website mentioned above (Figure 4).


Figure 4.

Fighting Deepfakes
We have mentioned two deep-learning architectures used for deepfake creation, and we have seen how realistic the generated content can be. Let us now move to the other side of the battlefield and see what we can do to detect deepfakes.

We can broadly identify three possible “macro-approaches” to deepfake content detection. One possible route is to treat deepfakes as classical images to be analyzed. In the end, regardless of how well the face has been generated and inserted, most deepfake images or video frames are nothing but a fake face spliced into a real picture. Therefore, classical image and video forensic techniques—based on compression-artifact analysis, noise consistency, and correlation analysis—have a chance of successfully detecting a deepfake (Verdoliva 2020). For example, Amped Authenticate’s ADJPEG filter successfully detects many images generated with a popular face-swapping app, as shown in Figure 5 below.


Figure 5.

The ADJPEG filter works by finding double-compression artifacts in the original part of the image. This means that even if a more sophisticated splicing system is used to substitute the face, it makes little difference: the manipulation only affects the replaced region, while the original part of the image, which carries the double-compression traces, is left untouched.
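To give an idea of the kind of evidence involved, the generic sketch below (not Amped’s actual ADJPEG implementation) histograms one DCT coefficient over all 8×8 blocks of the luminance channel; in a doubly compressed region, such a histogram typically shows a comb-like pattern of peaks and gaps, which is absent from the freshly generated face. NumPy and SciPy are assumed.

```python
# Histogram one blockwise DCT coefficient: double JPEG compression tends to
# leave periodic peaks and gaps in such histograms. Simplified illustration.
import numpy as np
from scipy.fftpack import dct

def block_dct_histogram(luma, coeff=(0, 1), bins=np.arange(-50.5, 51.5)):
    # luma: 2D array of luminance values (e.g. the Y channel of the image).
    values = []
    h, w = luma.shape
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            block = luma[y:y+8, x:x+8].astype(float) - 128.0
            d = dct(dct(block.T, norm="ortho").T, norm="ortho")  # 2D DCT of the block
            values.append(round(d[coeff]))
    hist, _ = np.histogram(values, bins=bins)
    return hist  # a comb-like pattern here suggests double compression
```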

The second possible approach is to use deep learning on the detection side. Researchers have been publishing tons of papers where more and more complex neural networks are employed for deepfake content detection (Verdoliva 2020). The main issue with deep-learning-based detection techniques is that they heavily depend on the training dataset. In other words, as long as you take a large dataset, split it in two, then use one part to train the network and the other to test it, things work nicely. But if you use the trained network on images from a different dataset, performance drops dramatically, and this severely limits the applicability of these data-driven approaches. Another relevant problem is the lack of explainability: it is normally very hard to explain “how” the network reached its final classification, which is of course problematic in a forensic scenario. Finally, if a valid detection network X is made publicly available (as it should be for repeatability purposes, required in forensics), there is a risk that attackers will build a GAN using X as the discriminator, which means they could generate anti-forensic fakes targeted to fool X.

The third macro-approach consists of visual consistency and behavioral analysis methods. Contrary to the first two approaches, these make explicit use of the fact that deepfakes basically involve people’s faces, and there are many subtle things that can go wrong in the generation process. We may find obvious clues of manipulation, such as inconsistencies in eye color or earrings, as in Figure 6 below (although, admittedly, you may well find real people with two different eye colors, or wearing just one earring).


Figure 6.

These kinds of blatantly “strange” defects are becoming less common as neural networks improve. However, there are other kinds of anomalies that are harder for a network to avoid. For example, researchers found that in deepfake videos the tampered face has a much lower eye-blinking rate than a real face (Li 2018). That’s probably because most neural networks have limited time awareness, and they can hardly figure out the right moment for the eyes to blink. Of course, an attacker could work around this issue by simply “copying” the eye-blinking timing from the original face into the tampered face, so even this anomaly may well disappear soon.
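One simple way to quantify blinking, sketched below, uses the “eye aspect ratio” computed from six landmarks around each eye; this is an illustrative approach that assumes a facial-landmark detector provides the points, and it is not the exact method of Li (2018).

```python
# Eye aspect ratio (EAR): the ratio of the eye's vertical to horizontal
# extent drops sharply when the eye closes, which lets us count blinks.
import numpy as np

def eye_aspect_ratio(eye_points):
    # eye_points: six (x, y) landmarks around one eye, ordered as in the
    # common 68-point annotation scheme.
    p = np.asarray(eye_points, dtype=float)
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])
    horizontal = np.linalg.norm(p[0] - p[3])
    return vertical / (2.0 * horizontal)

def blinks_per_minute(ear_per_frame, fps, threshold=0.2):
    # Count open-to-closed transitions and convert to a per-minute rate.
    closed = [ear < threshold for ear in ear_per_frame]
    blinks = sum(1 for prev, cur in zip(closed, closed[1:]) if cur and not prev)
    return blinks * 60.0 * fps / max(len(ear_per_frame), 1)
```

A face whose blinking rate is far below that of natural footage would then warrant closer inspection.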

Very recently, it has been shown that when the deepfaked subject speaks, there are inconsistencies between the phonemes (the elementary units of sound, like those written next to dictionary entries to explain pronunciation) and the visemes (the elementary mouth movements that accompany them) (Agarwal 2020). Of course, designing an automated analysis for this kind of anomaly is not trivial, and carrying out the analysis manually is time consuming.

Finally, it is worth mentioning an analysis method designed to protect world leaders or other very prominent individuals, for whom hundreds of hours of video are normally available (Agarwal 2019). The method creates a personalized “profile” of the peculiar behavioral characteristics of the original subject using training videos (e.g., the way they move their head and eyebrows when speaking, or the way eye wrinkles vary over time). Now, if an attacker uses an actor’s face to “re-enact” the world leader, the faked face will follow the actor’s behavioral characteristics, not those of the original subject. Therefore, extracting these characteristics from the questioned video and comparing them to the individual’s profile can reveal possible inconsistencies.
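In the same spirit, such a profile could be built with a one-class classifier; the sketch below assumes scikit-learn and a hypothetical feature-extraction step that turns each video clip into a vector of behavioral descriptors (the original work relies on facial action units and head-movement measures).

```python
# Fit a one-class model on behavioral descriptors extracted from authentic
# videos; questioned clips that fall outside the learned region are flagged.
import numpy as np
from sklearn.svm import OneClassSVM

def build_profile(authentic_features):
    # authentic_features: (n_clips, n_features) descriptors from real videos
    # of the protected individual.
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    model.fit(authentic_features)
    return model

def consistent_with_profile(model, questioned_features):
    # Returns True for clips whose behavior matches the learned profile.
    return model.predict(np.atleast_2d(questioned_features)) == 1
```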

Deepfakes: Good or Evil?
As often happens with technology in general, there are both constructive and malicious uses for deepfakes. On the constructive side, think about the movie-production industry: they can finally fix the annoying out-of-sync mouth effect in dubbed movies at little cost. Even more, they could animate an avatar of the main character at little expense, compared to having video-editing specialists work for hours on every second of the movie (this is bad news for those professionals, though).

On the evil side, there are sadly several ways to weaponize deepfake technology. Misinformation is one of them: it is getting easier to create a video of a politician saying something they would never say. Forging fake evidence is another possible misuse: someone could create an alibi by swapping their face onto someone else’s so as to pretend they were in a certain place at a certain time. Sadly enough, however, the main misuse of deepfakes is currently related to non-consensual pornography. Women have found their faces realistically spliced over an actress’s face in sexually explicit videos, with an obvious negative impact on their reputation. Recently, even bots (automatic chat responders) have been created that will “undress” any woman: the attacker sends in a picture of the dressed victim, and the bot returns a picture in which the victim appears naked.

Such a variety of potential misuses certainly calls for the development of reliable deepfake detection technologies, but it also suggests that technology cannot be the sole answer. People need to be educated about the existence of deepfakes, and made aware that seeing is no longer believing—in the digital world, at least. In a world where everyone can post news on the internet, the ability to scrutinize an information source and judge its reliability is becoming increasingly important. All in all, it is no surprise that a complex threat such as deepfakes requires a combination of education, intelligence, and technology.


About the Author

Marco Fontani graduated in Computer Engineering (summa cum laude) in 2010 at the University of Florence (Italy) and earned his Ph.D. in Information Engineering in 2014 at the University of Siena under the supervision of Prof. Mauro Barni. He works as an R&D Engineer at Amped Software, where he coordinates research activities. He has participated in several research projects funded by the European Union and by the European Office of Aerospace Research and Development. He is the author or co-author of several journal papers and conference proceedings, and he is a member of the Institute of Electrical and Electronics Engineers (IEEE) Information Forensics and Security Technical Committee. He has delivered training to law enforcement agencies and has provided expert witness testimony in several forensic cases involving digital images and videos.


References

Agarwal, S., H. Farid, Y. Gu, M. He, K. Nagano, and H. Li. 2019. Protecting world leaders against deep fakes. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach, California. 2019:38-45.

Agarwal, S., H. Farid, O. Fried, and M. Agrawala. 2020. Detecting deep-fake videos from phoneme-viseme mismatches. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, Washington. 2020:2814-2822.

Li, Y., M. Chang, and S. Lyu. 2018. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. IEEE International Workshop on Information Forensics and Security (WIFS). Hong Kong. 2018:1-7.

Verdoliva, L. 2020. Media forensics and deepfakes: An overview. arXiv preprint arXiv:2001.06564.

 