Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it), so that the document may be distributed to a broader audience. When the intent is secrecy protection, such as in dealing with classified information, sanitization attempts to reduce the document's classification level, possibly yielding an unclassified document. When the intent is privacy protection, it is often called data anonymization. Originally, the term sanitization was applied to printed documents; it has since been extended to apply to computer files and the problem of data remanence.
Redaction in its sanitization sense (as distinguished from its other editing sense) is the blacking out or deletion of text in a document, or the result of doing so. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. Typically the result is a document that is suitable for publication or for dissemination to people other than the intended audience of the original document.
Secure document redaction techniques
Redacting confidential
material from a paper document before its public release involves overwriting
portions of text with a wide black pen and then photocopying the result, because the obscured text may still be
recoverable from the original. Alternatively, opaque, removable adhesive tape ("cover-up tape"
or "redaction tape"), available in various widths, may be applied before
photocopying.
This is a simple process
with only minor security risks. For example, if the black pen or tape is not
wide enough, careful examination of the resulting photocopy may still reveal
partial information about the text, such as the difference between short and
tall letters. The exact length of the removed text also remains recognizable,
which may help in guessing plausible wordings for shorter redacted sections.
Where computer-generated proportional fonts were used, even more information
can leak out of the redacted section in the form of the exact position of
nearby visible characters.
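The length leak described above can be illustrated with a minimal sketch. It assumes a made-up proportional font in which narrow, regular, and wide glyphs take 1, 2, and 3 units; the width table and the names `rendered_width` and `plausible_wordings` are hypothetical and not taken from any real font or tool. Given the measured width of a blacked-out gap, the candidate wordings that could fit are narrowed down considerably.

```python
from __future__ import annotations

# Hypothetical glyph classes for an imaginary proportional font.
NARROW = set("iljtf.,' ")
WIDE = set("mwMW")


def rendered_width(text: str) -> int:
    """Approximate printed width of text in arbitrary units."""
    return sum(1 if ch in NARROW else 3 if ch in WIDE else 2 for ch in text)


def plausible_wordings(candidates: list[str], gap_width: int,
                       tolerance: int = 0) -> list[str]:
    """Keep only the candidates whose width matches the redacted gap."""
    return [c for c in candidates
            if abs(rendered_width(c) - gap_width) <= tolerance]


if __name__ == "__main__":
    names = ["Smith", "Jones", "Miller", "Li"]
    print(plausible_wordings(names, gap_width=9))
```

A real attack would use the actual font metrics and the positions of the visible characters on either side of the gap, but the principle is the same: the wider the tolerance an attacker needs, the less the gap reveals.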
Secure redacting is more
complicated with computer files. Word processing formats may save a revision
history of the edited text that still contains the redacted text. In some file
formats, unused portions of memory are saved that may still contain fragments
of previous versions of the text. Where text is redacted in Portable Document
Format (PDF) or word processor files by overlaying graphical elements (usually
black rectangles) on the text, the original text remains in the file and can be
uncovered by simply deleting the overlaying graphics. Effective redaction of electronic
documents requires the removal of all relevant text and image data from the
document file. Although internally complex, this process can be carried out easily
by a user with the aid of the "redaction" functions in software for
editing PDF or other files.
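The difference between covering text and removing it can be sketched with a toy document model. This is not the PDF format or any real editor's API; the `Page` class and the helpers `cover_with_rectangle` and `redact` are hypothetical names used only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical toy page: stored text runs plus graphical overlays."""
    text_runs: list[str] = field(default_factory=list)
    overlays: set[int] = field(default_factory=set)  # indexes of covered runs

    def visible_text(self) -> str:
        # What a viewer draws: covered runs appear as a row of X marks.
        return "".join("X" * len(run) if i in self.overlays else run
                       for i, run in enumerate(self.text_runs))

    def stored_text(self) -> str:
        # What is actually saved in the file, regardless of overlays.
        return "".join(self.text_runs)


def cover_with_rectangle(page: Page, index: int) -> None:
    """Insecure: hides the run visually but leaves it in the file."""
    page.overlays.add(index)


def redact(page: Page, index: int, placeholder: str = "[REDACTED]") -> None:
    """Replaces the underlying text data with a fixed-length placeholder."""
    page.text_runs[index] = placeholder
    page.overlays.discard(index)


if __name__ == "__main__":
    page = Page(["Agent ", "John Smith", " met the source."])
    cover_with_rectangle(page, 1)
    print(page.visible_text())  # name is hidden on screen
    print(page.stored_text())   # name is still in the stored data
    redact(page, 1)
    print(page.stored_text())   # name is gone from the stored data
```

Using a fixed placeholder rather than a mark of the same length as the removed text also avoids revealing the length of what was removed.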
Data remanence
Data remanence is the residual
representation of digital data that remains even after attempts
have been made to remove or erase the data. This residue may result from data
being left intact by a nominal file deletion operation, by reformatting of
storage media that does not remove data previously written to the media, or
through physical properties of the storage media that
allow previously written data to be recovered. Data remanence may make
inadvertent disclosure of sensitive information possible
should the storage media be released into an uncontrolled environment (e.g.,
thrown in the trash or lost).
Various
techniques have been developed to counter data remanence. These techniques are
classified as clearing, purging/sanitizing,
or destruction. Specific methods include overwriting, degaussing, encryption, and media destruction.
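As an illustration of overwriting, the simplest of these methods, the following sketch replaces a file's bytes with zeros in place. The helper name `overwrite_in_place` is hypothetical. It assumes a storage stack that writes new data to the same physical location; on solid-state drives with wear leveling, or on journaling and copy-on-write file systems, the old data may survive elsewhere, so this is a sketch of the idea rather than a reliable sanitization tool.

```python
import os


def overwrite_in_place(path: str, passes: int = 1,
                       chunk_size: int = 64 * 1024) -> None:
    """Overwrite every byte of an existing file with zeros."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:  # open for update without truncating
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(b"\x00" * n)
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device


if __name__ == "__main__":
    import tempfile

    fd, tmp_path = tempfile.mkstemp()
    os.write(fd, b"sensitive contents")
    os.close(fd)
    overwrite_in_place(tmp_path)
    os.remove(tmp_path)  # delete only after the contents are overwritten
```

The file is opened in update mode rather than truncated and rewritten, since truncation could let the file system allocate different blocks for the new data.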
Effective
application of countermeasures can be complicated by several factors, including
media that are inaccessible, media that cannot effectively be erased, advanced
storage systems that maintain histories of data throughout the data's life
cycle, and persistence of data in memory that is typically considered volatile.
Causes
Many operating systems, file managers, and other software provide a facility
where a file is not immediately deleted when the user requests that action.
Instead, the file is moved to a holding area,
making it easy for the user to undo a mistake. Similarly, many software products
automatically create backup copies of files that are being edited, to allow the
user to restore the original version, or to recover from a possible crash (autosave feature).
Even
when an explicit deleted-file retention facility is not provided or when the
user does not use it, operating systems do not actually remove the contents of
a file when it is deleted unless they are aware that explicit erasure commands
are required, as on a solid-state drive. (In such cases, the operating
system will issue the Serial ATA TRIM command or the SCSI UNMAP
command to let the drive know to no longer maintain the deleted data.) Instead,
they simply remove the file's entry from the file system directory,
because this requires less work and is therefore faster, and the contents of
the file (the actual data) remain on the storage medium.
The data will remain there until the operating system reuses the space for new data.
In some systems, enough filesystem metadata are
also left behind to enable easy undeletion by commonly available utility software. Even when undelete has become
impossible, the data, until it has been overwritten, can be read by software
that reads disk sectors directly. Computer forensics often
employs such software.
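The behaviour described above can be sketched with a toy in-memory disk. The `ToyDisk` class and its methods are hypothetical and do not model any real file system; in particular, reuse of freed space is not modelled. Deleting a file removes only its directory entry, so a direct scan of the raw sectors still finds the contents.

```python
from __future__ import annotations

SECTOR_SIZE = 16


class ToyDisk:
    """Hypothetical in-memory disk: a directory plus raw sectors."""

    def __init__(self, sectors: int = 8) -> None:
        self.raw = bytearray(SECTOR_SIZE * sectors)
        self.directory: dict[str, tuple[int, int]] = {}  # name -> (offset, length)
        self.next_free = 0

    def write_file(self, name: str, data: bytes) -> None:
        offset = self.next_free
        if offset + len(data) > len(self.raw):
            raise ValueError("disk full")
        self.raw[offset:offset + len(data)] = data
        self.directory[name] = (offset, len(data))
        # Advance to the next sector boundary after the file.
        sectors_used = -(-len(data) // SECTOR_SIZE)  # ceiling division
        self.next_free = offset + sectors_used * SECTOR_SIZE

    def read_file(self, name: str) -> bytes:
        offset, length = self.directory[name]
        return bytes(self.raw[offset:offset + length])

    def delete_file(self, name: str) -> None:
        # Only the directory entry is removed; the sectors are untouched.
        del self.directory[name]

    def scan_raw(self, needle: bytes) -> int:
        # Search the sectors directly, ignoring the directory.
        # Returns the byte offset of the match, or -1 if not found.
        return self.raw.find(needle)


if __name__ == "__main__":
    disk = ToyDisk()
    disk.write_file("memo.txt", b"launch code 1234")
    disk.delete_file("memo.txt")
    print("memo.txt" in disk.directory)
    print(disk.scan_raw(b"launch code"))
```

After `delete_file`, the name can no longer be looked up through the directory, yet `scan_raw` still locates the bytes, which is the situation that sector-level recovery software exploits.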
Likewise, reformatting, repartitioning, or reimaging a system is unlikely to write to every
area of the disk, although each will cause the disk to appear to most software
as empty or, in the case of reimaging, empty except for the files present in
the image.
Finally,
even when the storage medium is overwritten, its physical properties
may permit recovery of the previous contents. In most cases, however, this
recovery is not possible by just reading from the storage device in the usual
way, but requires laboratory techniques such as disassembling the device
and reading directly from its components.