3 Ways AI Protects Personally Identifiable Information (PII)

By Maxime Vermeir

Australian businesses must treat the protection of personally identifiable information (PII) as a priority, lest they risk up to A$50 million in penalties. PII is stored by businesses in nearly every sector. PII includes any sensitive information that can be used to identify an individual, such as their name, address, phone number or financial details like credit card and tax file numbers. If there is a data breach and this information is compromised, it could lead to disastrous privacy violations or even identity theft.

Whether it’s on a registration form for a gym membership or an application for a loan, your PII is likely stored somewhere in the recesses of a business’s files, whether those are digital or in a stainless-steel filing cabinet. Record-keeping guidance from AUSTRAC requires Australian organisations to store this information for long periods of time, typically anywhere from 7 to 45 years, which leaves this data vulnerable to leaks and breaches.

Furthermore, the Freedom of Information Act of 1982 introduces greater complexity in the context of government-held documents. If someone requests access to a government document, the PII of external parties must be redacted. For example, if you request a grant application that you submitted, the names of other individuals involved in this correspondence must be redacted.

As you might imagine, this introduces a high degree of variance and special consideration for what information is included and redacted. Many of these records also suffer from low-quality scans, near-illegible handwriting and inconsistent document formats, all of which pose major obstacles to redacting PII.

Nonetheless, businesses have a social and financial responsibility to prevent leakage of PII.

Luckily, recent advancements in artificial intelligence (AI) and optical character recognition (OCR) have paved three new paths to autonomously identify and redact sensitive information from documents.

Document-specific redaction based on field coordinates

AI can autonomously identify information in a pre-determined location on a document.

For example, when redacting the CVV of a credit card, the AI model can extract that data from the coordinates where the CVV field is always located. This is most effective for specific document formats where the locations of fields are in the same place every time.

While this positional approach is highly effective for proprietary formats from specified vendors, it must be adjusted in the event of any variation. This makes it less flexible than other AI-powered redaction methods; however, its straightforward nature makes it consistent, quickly verifiable, and easily deployable.
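As a rough illustration of the positional idea, the Python sketch below paints over a fixed rectangle on a page image. Everything in it is an assumption made for the example: the image is a plain grid of grayscale values rather than a real scan, and the `CARD_FORM_TEMPLATE` coordinates and the `redact_box` helper are hypothetical, not part of any particular product.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]

# Hypothetical template for one fixed document layout (illustrative values only).
CARD_FORM_TEMPLATE: Dict[str, Box] = {"cvv": (6, 1, 9, 2)}


def redact_box(image: List[List[int]], box: Box, fill: int = 0) -> List[List[int]]:
    """Return a copy of a grayscale image with the given box painted over."""
    left, top, right, bottom = box
    redacted = [row[:] for row in image]  # copy so the original scan is untouched
    for y in range(max(top, 0), min(bottom, len(redacted))):
        row = redacted[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = fill
    return redacted


# A tiny 3 x 10 "scan" where 255 stands for white paper.
page = [[255] * 10 for _ in range(3)]
redacted_page = redact_box(page, CARD_FORM_TEMPLATE["cvv"])
```

Because the template is just a lookup of coordinates, a reviewer can verify it by eye, but any shift in the form layout means the coordinates have to be updated.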

Keyword-based redaction

By contrast, keyword-based redaction allows AI to be more flexible in its approach and less rigidly dependent on coordinates and consistent formats. Instead, it scans for specific keywords across the document such as “Card Number” or “Mailing Address.” Once these keywords are detected with OCR, AI can redact the appropriate information, offering broader coverage across a variety of documents.
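A minimal Python sketch of this idea, working on text that OCR has already produced, might look like the following. The keyword list, the `label: value` line format, and the `redact_keyword_values` helper are assumptions for illustration; it also assumes at most one labelled field per line.

```python
import re

# Hypothetical keyword list; a real deployment would tune this per document set.
KEYWORDS = ["Card Number", "Mailing Address"]

# Match "<keyword> : <value>" and capture the label and separator so they can be kept.
_PATTERN = re.compile(
    r"(?P<label>" + "|".join(re.escape(k) for k in KEYWORDS) + r")"
    r"(?P<sep>\s*:\s*)(?P<value>.+)",
    re.IGNORECASE,
)


def redact_keyword_values(ocr_line: str, mask: str = "[REDACTED]") -> str:
    """Replace the value that follows a known keyword, keeping the label visible."""
    return _PATTERN.sub(lambda m: m.group("label") + m.group("sep") + mask, ocr_line)
```

For example, `redact_keyword_values("Card Number: 4111 1111 1111 1111")` keeps the label and masks the value, while a line with no listed keyword passes through unchanged.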

This technique is often paired with the Luhn algorithm, a simple checksum formula used to validate identification numbers, such as credit card numbers, based on the digits they contain. This method may require some fine-tuning depending on the data being redacted, but is ultimately more document-agnostic than redaction based on field coordinates.
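To make the Luhn check concrete, here is a short Python sketch that validates candidate digit runs and masks only those that pass. The checksum itself is the standard Luhn calculation; the candidate pattern (13 to 19 digits with optional single spaces or hyphens) and the function names are assumptions chosen for the example.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# Assumed candidate shape: 13 to 19 digits, optionally separated by single spaces or hyphens.
_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact_card_numbers(text: str, mask: str = "[REDACTED]") -> str:
    """Mask digit runs that look like card numbers and pass the Luhn check."""
    return _CANDIDATE.sub(
        lambda m: mask if luhn_valid(m.group()) else m.group(), text
    )
```

Checking the checksum before masking helps avoid redacting unrelated long numbers, such as order references, that merely resemble card numbers.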

Machine learning for field identification and handwriting

AI’s capabilities for handwriting recognition and field identification have been significantly enhanced by machine learning. Complex neural networks can identify image fragments of handwriting – useful for signatures and other handwritten entries – before determining their contents with OCR and redacting them based on their proximity to redaction criteria.
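The proximity step can be sketched in Python as below. This is only an illustration under stated assumptions: the `Region` records stand in for the output of an upstream detection model and OCR pass, the `"printed"` and `"handwriting"` labels are hypothetical, and distance is measured between region centres in pixels.

```python
from dataclasses import dataclass
from math import hypot
from typing import List


@dataclass
class Region:
    label: str  # assumed to come from an upstream model: "printed" or "handwriting"
    text: str   # recognised text for the region
    x: float    # centre x in pixels
    y: float    # centre y in pixels


def select_for_redaction(
    regions: List[Region], anchor_keywords: List[str], max_distance: float
) -> List[Region]:
    """Pick handwritten regions that sit close to a printed anchor keyword."""
    anchors = [
        r for r in regions
        if r.label == "printed"
        and any(k.lower() in r.text.lower() for k in anchor_keywords)
    ]
    selected = []
    for r in regions:
        if r.label != "handwriting":
            continue
        if any(hypot(r.x - a.x, r.y - a.y) <= max_distance for a in anchors):
            selected.append(r)
    return selected


regions = [
    Region("printed", "Signature", 100, 500),
    Region("handwriting", "J. Citizen", 180, 500),    # 80 px from the anchor
    Region("handwriting", "note to self", 400, 100),  # far from any anchor
]
to_redact = select_for_redaction(regions, ["Signature"], max_distance=120)
```

In this sketch only the handwritten entry next to the "Signature" label is selected; the distant note is left alone.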

This system also allows for continuous training and improvement through user feedback, enabling greater efficiency over time.

Of course, many forms of PII are typically in a printed or typed format, but many organisations store decades-old documents that are predominantly handwritten, not to mention that handwritten signatures are extremely common.

While these distinct approaches to AI-powered PII redaction have their own limitations, applying them to their most effective use cases can drastically reduce the burden of manually identifying and extracting this data at an organisational scale.

Looming over Australian businesses are reforms to the Privacy Act with a tier-based penalty system. It’s expected to be introduced as early as this month, causing innovation leaders to explore every possible avenue to meet regulatory demands for data responsibility and transparency into PII use.

This is not exclusive to any specific industry. Banks and lenders store PII in statements and pay slips; healthcare organisations store high volumes of health and transaction records; educational institutions hold onto student records or even disciplinary notices; the list goes on, extending to government services, insurance, and beyond.

With consideration to regulatory pressures as well as ethical obligation, leveraging AI for scalable PII redaction is a crucial step for businesses to protect themselves from both financial losses and reputational damage, as well as their customers from attacks on their data.

Maxime Vermeir, Senior Director of AI Strategy at ABBYY, is one of the keynote speakers at the ABBYY AI Summit being held in Sydney on September 5, 2024.
