Case Law, Hashing Algorithms, and Child Pornography Cases
“Hashing is a powerful and pervasive technique used in nearly every examination of seized digital media. The concept behind hashing is quite elegant: take a large amount of data, such as a file or all the bits on a hard drive, and use a complex mathematical algorithm to generate a relatively compact numerical identifier (the hash value) unique to that data. Examiners use hash values throughout the forensics process, from acquiring the data, through analysis, and even into legal proceedings. Hash algorithms are used to confirm that when a copy of data is made, the original is unaltered and the copy is identical, bit-for-bit. That is, hashing is employed to confirm that data analysis does not alter the evidence itself. Examiners also use hash values to weed out files that are of no interest in the investigation, such as operating system files, and to identify files of particular interest.” Richard P. Salgado, “Fourth Amendment Search and the Power of the Hash,” 119 Harvard Law Review Forum 38 (2006)
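The verification step described in the quotation can be illustrated with a minimal sketch. It assumes SHA-256 as the algorithm and uses short byte strings as hypothetical stand-ins for an evidence image and its copies; it does not represent any particular forensic tool's procedure.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash value of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for a seized drive image and two copies of it.
original = b"contents of the seized drive image"
forensic_copy = bytes(original)      # bit-for-bit duplicate
altered_copy = original + b"\x00"    # differs by a single extra byte

# Identical data yields identical hash values; any change yields a different one.
copy_matches = sha256_hex(original) == sha256_hex(forensic_copy)
altered_matches = sha256_hex(original) == sha256_hex(altered_copy)

print("copy matches original:   ", copy_matches)
print("altered matches original:", altered_matches)
```

In practice an examiner would hash the full contents of the drive or file rather than a short string, but the comparison is the same: matching hash values indicate that the copy is identical to the original.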
The software, or employees and volunteers, identify potential CSAM either visually or by checking the hash values of uploaded or shared files against a database of hash values of known CSAM. A “hash value” is an alphanumeric string that identifies an individual digital file, serving as a kind of “digital fingerprint.” Although it may be possible for two digital files to have hash values that “collide,” or overlap, it is unlikely that the values of two dissimilar images will do so. United States v. Cartier, 543 F.3d 442, 446 (8th Cir. 2008).
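The database check described above can be sketched as a simple set lookup. This is an illustration under stated assumptions, not a description of any provider's actual system: the names `known_hashes`, `file_hash`, and `is_known` are hypothetical, SHA-256 is assumed as the algorithm, and the byte strings stand in for real files.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hash value of a file's contents as hex."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical database: hash values of files already identified as known.
known_files = [b"known file A", b"known file B"]
known_hashes = {file_hash(contents) for contents in known_files}


def is_known(data: bytes, database: set) -> bool:
    """Report whether the hash value of data appears in the database."""
    return file_hash(data) in database


print(is_known(b"known file A", known_hashes))    # exact copy of a known file
print(is_known(b"known file A!", known_hashes))   # any change alters the hash
print(is_known(b"unrelated file", known_hashes))
```

Note that this kind of lookup flags only exact, bit-for-bit copies of a known file. The file's contents are never viewed; only its hash value is compared.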
Hash values are relatively short, usually 256 bits or fewer. Common hashing algorithms include MD5, which produces a 128-bit value, and SHA-1, which produces a 160-bit value.
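The length of a hash value depends only on the algorithm, not on the size of the input. A short sketch using Python's standard `hashlib` module shows this; SHA-256 is included here only for comparison and is not mentioned in the cases discussed.

```python
import hashlib

# Output length in bits for each algorithm, as reported by hashlib.
digest_bits = {
    name: hashlib.new(name).digest_size * 8
    for name in ("md5", "sha1", "sha256")
}

for name, bits in digest_bits.items():
    print(f"{name}: {bits} bits")

# The hash value has the same length for a tiny input and a large one.
small = hashlib.sha1(b"a").hexdigest()
large = hashlib.sha1(b"a" * 1_000_000).hexdigest()
print(len(small), len(large))
```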
The North Carolina Court of Appeals has recognized the reliability of hashing, even when it relies on outdated algorithms such as SHA-1, which was developed in the 1990s, shown to be vulnerable to collision attacks in 2005, and deprecated by Microsoft in 2020. State v. Gerard, 790 S.E.2d 592 (N.C. Ct. App. 2016).