NIST Corner: Your Voice Is Evidence
Written by Reva Schwartz   

A call comes into the emergency call center and, in the background of the call, the events of a crime being committed can be heard. While the dispatcher determines the level of help needed, a recording is automatically captured. If needed, a speaker recognition expert can compare the speech in the background of the E911 call to a police interview with a person of interest. The expert would seek to confirm that the person heard during the criminal act captured on the E911 call was in fact the same individual in the interview.

That is a very simple and limited explanation of what takes place within the speaker recognition field. While this profession is not one of the more widely known forensic disciplines, there is an extensive research community focused on supporting practitioners by improving tools, technology, and capabilities.

In this field, the evidence is typically several audio recordings or speech samples, such as E911 calls, undercover recordings, and audio from CCTV video. These samples are also referred to as questioned recordings since the identity of the person is uncertain. The other piece of evidence is the known recording that contains speech from someone whose identity has been established. Forensic speaker comparison evaluates the probability of two hypotheses based on the evidence:

1) that the speakers in the known and questioned recordings are the same; or

2) that the speakers in the known and questioned recordings are different.
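
One widely used way to weigh two competing hypotheses like these is a likelihood ratio: how probable the observed evidence is if the speakers are the same, divided by how probable it is if they are different. The sketch below is only an illustration of that idea, not a method described in this article. It assumes a comparison system has already reduced the two recordings to a single similarity score, and that scores under each hypothesis follow a Gaussian distribution; the function names and all numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(score, same_mean, same_std, diff_mean, diff_std):
    """Ratio of the score's density under 'same speaker' to 'different speakers'.

    The Gaussian score models are an assumption made for illustration only.
    """
    p_same = gaussian_pdf(score, same_mean, same_std)
    p_diff = gaussian_pdf(score, diff_mean, diff_std)
    return p_same / p_diff


# Hypothetical score models: same-speaker scores center on +2,
# different-speaker scores center on -2, both with unit spread.
lr = likelihood_ratio(score=1.0, same_mean=2.0, same_std=1.0,
                      diff_mean=-2.0, diff_std=1.0)
print(f"likelihood ratio: {lr:.1f}")
```

A ratio above 1 supports the same-speaker hypothesis, a ratio below 1 supports the different-speaker hypothesis, and a ratio near 1 means the evidence does little to separate the two.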

Recently, there have been a number of high-profile investigations and court cases in which forensic speaker comparison played a major role, including the George Zimmerman trial and the “Jihadi John” ISIS videos. In the former, the audio from an E911 call consisted mostly of screams. In the ISIS videos, accent and dialect became the focus of the investigation. These cases exemplify the high level of evidence variability commonly encountered by practitioners in this field.

In forensic speaker comparison, a variety of procedures and tools are used to reach a conclusion, although there is currently no standard method for doing so. Any particular technique by itself is not as important as demonstrating that the overall method is suitable for its intended purpose. This is also known as the method’s validity.

Unlike other forms of pattern evidence (such as ballistics, latent prints, or footwear), speech is a behavioral phenomenon. This leads to significant case-to-case variation. Different recording processes, variable levels of noise, and the multitude of audio coding and compression schemes all add to this variability, providing the research community in this field with a long list of challenges.

To reduce error, the research community tries to model as many sources of variability as possible. The largest contributors to variability tend to fall into two categories: extrinsic and intrinsic.

Extrinsic factors are associated with the recording itself (such as noise, coding, or compression) while intrinsic factors are associated with the speaker or speakers heard within the recordings (such as speaker stress, language/dialect characteristics, and whispers or shouting).

Since 1996, the National Institute of Standards and Technology (NIST) has carried out more than a dozen Speaker Recognition Evaluations (SRE). The objectives of these evaluations have been to drive forward tools and technology, measure the state of the art, and find the most promising algorithmic approaches in forensic speaker comparison tasks. The basic task within these evaluations is to determine if the same person is speaking in two different audio samples. To date, the research community has focused almost exclusively on addressing extrinsic variability by studying issues such as length of the recording, transmission channel (telephone, microphone), audio coding and compression, and different types of noise. Concentrating on these issues has paid off well, with significant performance boosts for the underlying technology used by practitioners.
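
To make the basic task concrete, the sketch below scores a small set of trials, each pairing two audio samples, and counts the two kinds of error a system can make: a miss (the same person is speaking but the system says otherwise) and a false alarm (different people are speaking but the system says they are the same). This is a simplified illustration, not the scoring procedure used in the evaluations themselves. The trial scores, the threshold, and the function name are all invented for the example.

```python
def error_rates(trials, threshold):
    """Return (miss_rate, false_alarm_rate) for scored trials.

    trials: list of (score, is_same_speaker) pairs, where a higher score
    means the system considers the two samples more alike.
    """
    same_scores = [s for s, same in trials if same]
    diff_scores = [s for s, same in trials if not same]
    if not same_scores or not diff_scores:
        raise ValueError("need at least one trial of each kind")

    # A miss: a same-speaker trial that scores below the threshold.
    misses = sum(1 for s in same_scores if s < threshold)
    # A false alarm: a different-speaker trial at or above the threshold.
    false_alarms = sum(1 for s in diff_scores if s >= threshold)
    return misses / len(same_scores), false_alarms / len(diff_scores)


# Hypothetical trial scores, for illustration only.
trials = [
    (2.1, True), (0.4, True), (1.7, True), (-0.3, True),
    (-1.9, False), (0.6, False), (-0.8, False), (-2.4, False),
]
miss_rate, false_alarm_rate = error_rates(trials, threshold=0.5)
print(f"miss rate: {miss_rate:.2f}, false-alarm rate: {false_alarm_rate:.2f}")
```

Raising the threshold trades false alarms for misses, and lowering it does the reverse, which is why performance is usually reported across a range of thresholds rather than at a single one.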

Intrinsic variability describes how a speaker communicates. Examples of this include:

  • Differences in language, accent, and dialect between the evidence and known recordings;
  • The type of speech heard in the recordings, such as a conversation, a formal speech or monologue, a voice mail message, a police interview, or courtroom testimony;
  • The relationship between the parties in the conversation;
  • The emotional state or condition of the speaker, such as physical, emotional, or cognitive stress, being under the influence of drugs or alcohol, or speaking while engaging in physical activity;
  • The speaker or speakers’ awareness that they are being recorded.

These issues and their interplay within and across samples are commonly referred to as mismatch, and they continue to vex even the best-performing forensic speaker comparison systems. For example, the questioned material in a case may consist of speech samples from a background talker in an E911 call during a struggle, while the known material may consist of digital video recordings from a formal police interview with a subject of interest in a noisy room. The mismatch in the above scenario entails numerous extrinsic and intrinsic factors.
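
The scenario above can be written down as a simple checklist that compares the conditions of the questioned and known recordings and reports which factors differ, grouped as extrinsic or intrinsic. This is only a sketch of how mismatch might be catalogued for a case. The field names and condition labels are hypothetical and are not drawn from any standard or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingConditions:
    # Extrinsic factors: properties of the recording itself.
    channel: str
    noise: str
    # Intrinsic factors: properties of the speaker's behavior.
    speech_style: str
    vocal_effort: str
    stress: str


EXTRINSIC = ("channel", "noise")
INTRINSIC = ("speech_style", "vocal_effort", "stress")


def find_mismatch(questioned, known):
    """List the factors on which the two recordings differ, by category."""
    def differing(names):
        return [n for n in names if getattr(questioned, n) != getattr(known, n)]

    return {"extrinsic": differing(EXTRINSIC), "intrinsic": differing(INTRINSIC)}


# Hypothetical labels for the E911 call and the police interview.
questioned = RecordingConditions(
    channel="telephone", noise="struggle in background",
    speech_style="spontaneous", vocal_effort="shouting", stress="high",
)
known = RecordingConditions(
    channel="video recorder", noise="noisy room",
    speech_style="interview", vocal_effort="normal", stress="low",
)
mismatch = find_mismatch(questioned, known)
print(mismatch)
```

In this example every listed factor differs between the two recordings, which is the kind of case where finding suitable test data is hardest.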

Mismatch is a significant cause of error for existing algorithms that attempt to identify speakers across all environments. Conventional wisdom suggests that having practitioners in the mix will improve overall system performance for these conditions. However, that premise has not been extensively tested or proven. There is also little insight into how best to combine human expertise and automated technology in forensic science in general.

A typical speech waveform, the type of recorded digital data that might be used in analyzing vocal evidence. Photo: NIST

One point on which the forensic speaker comparison community does agree is that more data is needed for research. Data associated with forensic casework is of varying quality because practitioners have no control over how evidence audio is collected. So, practitioners have to be flexible and adapt their tools to handle these varying conditions. Practitioners also need to have extensive knowledge of intrinsic variability and how it plays out under a variety of circumstances so they can select the proper data for testing and analysis within a case.

With mismatched conditions, it may never be possible to model all the variables, but the goal should be to reduce the influence of as many compromising factors as possible. For example, if the case includes a telephone call and an interview, then the examiner needs sample recordings from those channel types (the phone system and the interview room) to begin to train a system. For full system development, data is needed to test the methodology, validate the system, and calibrate it.

Over the past two decades, dataset development for forensic speaker comparison system testing and evaluation has advanced significantly, with much of it driven by the NIST SRE. These datasets have been designed to primarily study extrinsic variation. Answering the “mismatch question” critical to forensic casework requires data that better reflect intrinsic variability conditions. Until we have collected those datasets, researchers and practitioners will not have access to data to sufficiently address many of their research and case requirements.

Another significant source of error in all forensic science is bias. Speaker comparison practitioners are supposed to focus on every factor in the audio except the story it tells, the aspect referred to as “content.” This is because the content can lead to bias and cause the practitioner to subconsciously lean one way or another when drawing conclusions. Reducing the influence of bias on the practitioner is another important area of study worthy of pursuit.

This is an exciting time to be working in the field of forensic speaker comparison. The research opportunities are numerous, the technology continues to improve at a rapid pace, and the interest in this field has gained momentum. With more and more high-profile cases expecting high-quality, accurate results from examiners, speaker comparison will soon stand alongside the more commonly mentioned forensic science disciplines.

About the Author

Reva Schwartz is currently a Forensic Science Research Project Manager at the National Institute of Standards and Technology (NIST), Special Programs Office. She was previously a forensic examiner specializing in speaker recognition at the United States Secret Service.

