The Collection of Large-Scale Structured Data Systems
Written by Joseph Sremack   

Structured data is a critical component of most fraud investigations, but the collection and verification of this data are not always properly addressed by forensics publications. Structured data is a type of data that has a fixed format and composition that facilitates processing, storage, and retrieval of information. Common forms of structured data include database entries, XML files, and even “tweets” from Twitter.

The most relevant form of structured data for fraud investigations is transactional data, which covers everything from customer and employee data to financial transactions and health-care claims. Most professionals easily recognize the importance of structured data, but the process of verifying and validating a structured data collection to ensure that all records were properly acquired is not as readily understood.

This article defines structured data, explains how its collection process differs from that of unstructured data, and provides a verifiable methodology for collecting structured data.

What is Structured Data?

Structured data is a phrase used in the field of data analysis; it has a shared, assumed meaning among practitioners. Data is said to be “structured” when it has a fixed composition and adheres to rules about what types of values it can contain. Most typically, structured data consists of records, each composed of a series of data elements. The rows and columns of structured data are bounded by certain properties, such as the data type, the length of the value, acceptable values, and excluded values. The most common forms of structured data are:

  • Database records (relational and hierarchical)
  • Spreadsheets (depending on format)
  • Markup files (e.g., HTML and XML)

The focus of this article is structured data that resides in large-scale data repositories; excluded from this article is structured data in smaller file formats, such as markup files and Microsoft Access databases. The main types of large-scale structured data systems include databases, data warehouses, and data marts. These systems can serve a large host of functions, such as running an enterprise resource planning (ERP) application, housing network and website traffic logs, and managing financial transactions. These systems typically contain critical information for investigations.

The table below contains a sample of structured data that highlights the way structured data is composed. The data for each customer is stored as its own record, and each individual data point is contained in its own field. The discrete, atomic nature of the data is what makes it structured (see Table 1).

Table 1—Here is a sample of structured data. Note that data for each customer is stored as its own record—unlike unstructured data that does not have the same rules and composition restrictions.
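The properties that bound structured data (data type, length, acceptable values, and excluded values) can be illustrated with a small sketch. The example below uses Python's built-in sqlite3 module; the customer table, its column names, and its rules are hypothetical and are not taken from Table 1.

```python
import sqlite3

# Hypothetical customer table; column names and rules are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,                       -- data type
        name        TEXT NOT NULL CHECK (length(name) <= 50),  -- length of the value
        state       TEXT CHECK (state IN ('NY', 'CA', 'TX')),  -- acceptable values
        balance     REAL CHECK (balance >= 0)                  -- excluded values
    )
""")

def try_insert(row):
    """Return True if the row satisfies the table's rules, else False."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO customer VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert((1, "Jane Doe", "NY", 120.50))
rejected_state = try_insert((2, "John Roe", "ZZ", 10.00))     # state not in the allowed list
rejected_balance = try_insert((3, "Ann Poe", "CA", -5.00))    # negative balance is excluded
stored_records = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Only the first record is stored, because the other two violate the rules that the table places on its fields.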

Unstructured data does not have the same rules and composition restrictions that structured data has. This lack of rigidity and structure allows data to be stored and presented in varied ways. Some of the most common forms of unstructured data include:

  • E-mail
  • Presentation documents
  • Textual documents (e.g., MS Word documents and PDF files)
  • Graphics files
  • Executable binary files (e.g., Windows .exe files)

Structured Data Collection Versus Unstructured Data Collection

The high-level processes of a digital investigation for both unstructured and structured data follow the same basic steps. An investigation begins with using information management resources to identify key sources of digital information. Those data sources are then preserved and collected before they are processed and loaded into a common review platform from which the data is analyzed. The relevant information from that data is then produced and presented in the form of findings. Figure 1 shows a version of the Electronic Discovery Reference Model, the standard process flow for digital investigations.

Figure 1—The Electronic Discovery Reference Model demonstrates the non-linear nature of the e-discovery process. (Image courtesy EDRM.)

To understand how a structured data collection is unique, an understanding of unstructured collections is required. Most unstructured files are relatively small and can easily be copied from the source system. The collection of unstructured data involves a secure copying process and a means of verification. The secure copying process is performed by one of two means: a copy of every bit on the computer or a copy of all “logical files.” Either can be done by connecting directly to the source computer or via a network connection. The data is typically verified by hashing algorithms (e.g., MD5 and SHA-1) to ensure the contents of the data were not altered during copying or later analysis. Hashing algorithms are fast and reliable methods that compute a checksum that acts as a fingerprint for the data. The hashes are later recomputed against the collected data; if the recomputed values match the originals, the data being analyzed is the same as the original data.
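The hash-based verification described above can be sketched in a few lines. This is a minimal illustration using Python's standard hashlib module, not a forensic tool; the file names and contents are stand-ins created only for the example.

```python
import hashlib
import os
import shutil
import tempfile

def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

# Stand-ins for a source file and its collected copy (hypothetical names).
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.bin")
copy_path = os.path.join(workdir, "collected.bin")
with open(source_path, "wb") as handle:
    handle.write(b"example evidence contents")
shutil.copyfile(source_path, copy_path)

# Hash the source at collection time, then recompute against the copy.
source_hash = file_digest(source_path)
copy_matches = file_digest(copy_path) == source_hash

# Simulate an alteration after collection: one appended byte changes the digest.
with open(copy_path, "ab") as handle:
    handle.write(b"!")
altered_matches = file_digest(copy_path) == source_hash
```

Passing "sha1" as the algorithm argument computes a SHA-1 digest instead of MD5.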

Structured data has unique limitations and requirements for data collection that require a different approach. The copying of structured data often involves large-scale data systems that house production data and cannot be taken offline for collection. For example, a data warehouse’s contents are constantly being altered while online, so the data-warehouse application locks the data to prevent copying. If the application were to allow copying, the copied data could be corrupted, making the collected data unusable. The voluminous nature of the data also makes a bit-by-bit copy of the data unfeasible; copying one petabyte of data cannot be performed in a reasonable amount of time. Also, the volume of data in large-scale structured data systems is often too large for a hashing algorithm to be applied in a practical amount of time. Current estimates state that MD5 typically requires between 4 and 20 seconds to calculate the checksum value per gigabyte of data, depending on the implementation of the algorithm and computing resources. Given that the volume of many large-scale systems that house structured data can exceed several terabytes and even petabytes, employing hashing algorithms to verify the collection is not always viable.
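To put the 4 to 20 seconds per gigabyte figure in perspective, the arithmetic can be worked through directly. The sketch below assumes decimal units (1 TB = 1,000 GB and 1 PB = 1,000,000 GB) and a single sequential hashing pass; both are simplifying assumptions rather than measurements.

```python
SECONDS_PER_DAY = 86_400

def hashing_days(size_gb, seconds_per_gb):
    """Estimated days to hash size_gb gigabytes at the given rate."""
    return size_gb * seconds_per_gb / SECONDS_PER_DAY

ONE_TB_GB = 1_000        # assumes decimal units
ONE_PB_GB = 1_000_000    # assumes decimal units

# (fastest, slowest) estimates using the 4 and 20 seconds-per-gigabyte bounds.
estimates = {
    "1 TB": (hashing_days(ONE_TB_GB, 4), hashing_days(ONE_TB_GB, 20)),
    "1 PB": (hashing_days(ONE_PB_GB, 4), hashing_days(ONE_PB_GB, 20)),
}
for label, (fast, slow) in estimates.items():
    print(f"{label}: {fast:.2f} to {slow:.2f} days")
```

Under these assumptions, one terabyte takes roughly one to six hours to hash, while one petabyte takes roughly 46 to 231 days.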

Collection and Verification Methodology

Several processes exist for collecting and verifying data from large-scale structured data systems, the selection of which depends on the purpose of the collection and the limitations imposed by the system in question. Because a bit-by-bit copy of a database is rarely a viable option, the main approaches are either to collect backups or to extract the relevant data by queries or generated reports. (Note: Queries and reports refer to any type of customized programmatic means to extract data from the structured data system.)

Verifying a backup can be performed either by a hashing algorithm or by verifying the contents of the backup once restored. Verifying the data from queries and reports is done by checking the contents against control totals. Each approach has its own advantages and disadvantages, and the choice of approach depends heavily on the purpose of the investigation, the amount of time available, and the system and amount of data involved.

The overall process for structured data collection and verification involves collecting the data and retaining the logs for the steps taken, computing control totals or checksums, and then validating the process and verifying the control totals or checksums. Figure 2 outlines these steps.

Figure 2—Structured data collection and verification process

The volume and complexity of data, as well as the information that needs to be collected, are the most important factors to consider when determining the appropriate collection method. The IT resources for performing the collection and the amount of time available should also factor into the decision. Collecting backup files of the system is best when a complete set of the system’s contents is required and a backup covering the time period in question is available. Extracting the complete contents of the system via queries is best when the full contents of the system are required and a backup is not available or cannot be created. Extracting only the relevant data via queries is ideal when the full contents of the system are not required and specific filters can be applied during the collection to limit the amount of data. Sampling the data is useful either as a first step in understanding what data exists in the system or when results can be extrapolated from the sample. Additionally, if the system can produce reports and the other methods are not feasible, then collecting reports is an alternative collection method. Table 2 summarizes the advantages and disadvantages of each approach.

Table 2—Collection methods for large-scale structured data systems

Verifying the data collection is a crucial requirement for any investigation. If the findings of the data collection are to be part of a legal process, the court admissibility of the data hinges on whether the data was properly collected. The complete findings of an analysis can be ruled inadmissible if the collection is shown to have been flawed. Likewise, for non-legal purposes, verifying the data is the only way to have any confidence that it was properly collected. Any doubt that exists regarding the data collection can nullify any findings later made in analysis.

The methods for verifying structured data collection involve collecting the verification checksums in conjunction with the source data. Structured data verification requires that checksums of the source data match those of the collected data, much like with unstructured data. The difference is that a hashing algorithm typically cannot be employed. Instead, the checksums consist of record counts, aggregate control totals from numeric fields, comparisons of the collected totals with those of reports generated by the system, and checks of standard language textual values. These checksums must be collected in concert with the data, especially in the case of a live system where the contents are continuously updated.

Verifying a backup can be done in one of two ways. First, if the backup is a manageable size, then this is the one case when an MD5 or SHA-1 hash can be used. The backups are single, static files, so the verification can be performed without the risk of the file’s contents changing during the copy; such a change would prevent the hashing algorithm from ever confirming that the file was copied correctly. Figure 3 demonstrates the high-level process for this type of verification.

Figure 3—Verifying a structured data backup via MD5

The second method of verifying a structured data collection is to collect and compare “control totals,” which are values calculated from the data. Control totals can be gathered either from reports generated by the system or from the system itself during the collection process. These totals are later compared with the same totals computed from the collected data to verify that the collection was performed correctly. The standard control totals include the following:

  • Total number of records
  • Summations of numeric values
  • Total number of populated fields
  • Distribution of standard language values

No precise rules exist for how many or what type of control total should be taken; however, the standard practice is to, at a minimum, compare the total number of records and the numeric summation or standard language value distribution for one field for each set of data collected (e.g., a database table).
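The control totals listed above can be computed with ordinary aggregate queries. The following sketch uses Python's built-in sqlite3 module with two in-memory databases standing in for the source system and the collected data; the claims table, its fields, and its values are hypothetical, and a real collection would run equivalent queries against the actual source system.

```python
import sqlite3

def control_totals(conn, table, numeric_field, text_field):
    """Compute control totals for one table.

    Table and field names are interpolated into the SQL, so they must be
    supplied by the examiner and never taken from untrusted input.
    """
    one = lambda sql: conn.execute(sql).fetchone()[0]
    return {
        "record_count": one(f"SELECT COUNT(*) FROM {table}"),
        "numeric_sum": one(f"SELECT COALESCE(SUM({numeric_field}), 0) FROM {table}"),
        # COUNT(field) counts only non-NULL values, i.e. populated fields.
        "populated": one(
            f"SELECT COUNT({numeric_field}) + COUNT({text_field}) FROM {table}"),
        "distribution": dict(conn.execute(
            f"SELECT {text_field}, COUNT(*) FROM {table} GROUP BY {text_field}"
        ).fetchall()),
    }

def make_db(rows):
    """Build an in-memory stand-in database holding the given claim rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE claims (claim_id INTEGER, amount_cents INTEGER, status TEXT)")
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)
    return conn

# Hypothetical data; amounts are stored in cents to avoid floating-point drift.
source_rows = [(1, 12500, "PAID"), (2, 4000, "DENIED"),
               (3, None, "PAID"), (4, 9950, "PENDING")]
source = make_db(source_rows)
complete = make_db(source_rows)        # a collection that captured everything
incomplete = make_db(source_rows[:3])  # a collection that missed one record

source_totals = control_totals(source, "claims", "amount_cents", "status")
complete_ok = control_totals(complete, "claims", "amount_cents", "status") == source_totals
incomplete_ok = control_totals(incomplete, "claims", "amount_cents", "status") == source_totals
```

The incomplete collection fails the comparison on the record count, the numeric summation, and the status distribution, which is why taking more than one control total per table makes a mismatch easier to detect and diagnose.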

In addition to the backup and query-generated verification methods, validating the collection by means of collecting and reviewing the backup logs or queries used to generate the preserved data is critical. Validating these files can help confirm that the correct sources of data were collected and the appropriate data filters were applied. These files validate that the overall process was correct, which works in concert with the verification methods of control totals and checksums.
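One way to retain the queries used for an extraction is to record each query, its filter parameters, and the number of rows returned at the moment it is run. The sketch below is a simplified illustration using Python's sqlite3 and json modules; the payments table, the date filter, and the log fields are hypothetical choices, not a prescribed log format.

```python
import json
import sqlite3
from datetime import datetime, timezone

def extract_with_log(conn, query, params, log):
    """Run an extraction query and append an entry describing what was run."""
    rows = conn.execute(query, params).fetchall()
    log.append({
        "executed_at_utc": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "parameters": list(params),
        "rows_returned": len(rows),
    })
    return rows

# Hypothetical source table standing in for a production system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id INTEGER, paid_on TEXT, amount_cents INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", [
    (1, "2010-11-30", 5000),
    (2, "2010-12-15", 7500),
    (3, "2011-01-04", 1200),
])

# Extract only the period under review and keep a record of the filter used.
collection_log = []
rows = extract_with_log(
    conn,
    "SELECT * FROM payments WHERE paid_on BETWEEN ? AND ? ORDER BY payment_id",
    ("2010-12-01", "2011-01-31"),
    collection_log,
)
log_text = json.dumps(collection_log, indent=2)  # text that can be preserved with the data
```

Reviewing the logged query and parameters later shows which source was queried and which filters were applied, and the logged row count can be compared with the record-count control total.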


Structured data collections differ from unstructured data collections, but the same principles apply. The keys to a successful collection are to identify the best process based on the needs and limitations of the structured data system, verify the data by means of checksums or control totals, and then validate the process by way of the queries used for extraction or the backup logs. While this article presented only the basic collection and verification techniques, applying these principles will help to ensure a proper data collection and identify any issues in the early stages of an investigation.

About the Author

Joseph Sremack is a Director with FTI Consulting’s Forensic and Litigation Consulting Practice.
