I have often spoken about the coming data privacy inflection point for information management in various blogs, articles, webinars, and podcasts. This inflection point involves the need to manage ALL data within an organization to ensure that new data privacy rights can be addressed.
Now, most organizations currently capture and centrally manage only between 5% and 10% of all data generated and received within the company. This means that 90% to 95% of corporate data is not tracked, indexed, or viewable by a central IT authority or records management personnel; in practice, this unmanaged data is controlled by individual employees and stored locally on their devices, with no visibility for records management or IT.
However, to comply with the new data privacy laws (and the associated data subject rights), companies will need access to all corporate data, including the data held by individual employees. Some have pushed back, asking why not let employees continue to manage their non-records and simply request specific PII from them when needed.
Two points here. First, most employees don't manage their local data, i.e., keep it in a logical file structure in one specific location, attach keywords, and assign retention/disposition policies. This is usually due to a lack of training and of time to spend on records management duties.
And second, the number of data subject access requests (DSARs) from individuals asking about their PII will skyrocket in the next couple of years, potentially reaching thousands of requests per year. That volume will make it nearly impossible to rely on employees to search for specific PII on a case-by-case basis.
Because of this employee information management gap, companies cannot completely search for, report on, or delete PII when requested; the data on employee devices is simply invisible to them.
With this new reality, information governance professionals are facing a formidable task: gaining access to and managing 10 to 20 times more data than they do today. However, in this article, I will discuss the potential of managing a great deal more information (after the access challenge is solved) using AI to categorize data, namely a look at ChatGPT for document auto-categorization.
ChatGPT and Information Management
ChatGPT, a large language model developed by OpenAI, has received a great deal of press in the last several months. Because of the notoriety, I, and I’m sure many others, started to think about how AI could address the needs of this new information management reality.
I started this exercise by asking the ChatGPT AI the following question:
Can ChatGPT auto-categorize documents based on content and context?
The AI produced the following answer:
Yes, ChatGPT can be trained to auto-categorize documents based on their content and context. This is a common application of natural language processing (NLP) and machine learning, where the model is trained on a large dataset of categorized documents to learn patterns in the language and context that are associated with each category.
The process of auto-categorization typically involves several steps. First, the text of each document is preprocessed to extract relevant features, such as word frequencies, sentence structure, and topic keywords. Then, these features are fed into a machine learning algorithm, such as a decision tree or a neural network, that learns to classify the documents into predefined categories based on their features.
To train such a model, a large dataset of categorized documents is required. This dataset is used to train the machine learning algorithm by presenting it with examples of documents and their corresponding categories. The algorithm learns to recognize patterns in the language and context of the documents that are associated with each category, and it uses these patterns to predict the category of new, unseen documents.
Overall, auto-categorization is a powerful tool for organizing and analyzing large volumes of text data, and it has a wide range of applications in fields such as content management, marketing, and customer service.
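The train-on-labeled-examples, predict-on-new-documents loop the answer describes can be sketched in a few lines. The example below is a deliberately simplified bag-of-words classifier in plain Python (no ML library, no neural network); the categories and sample documents are hypothetical, and real systems use far richer features, but the pattern is the same.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Crude feature extraction: lowercase word frequencies.
    return text.lower().split()

def train(labeled_docs):
    # labeled_docs: list of (text, category) pairs.
    word_counts = defaultdict(Counter)   # per-category word frequencies
    doc_counts = Counter()               # per-category document counts
    for text, category in labeled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    return word_counts, doc_counts

def predict(model, text):
    # Naive Bayes-style scoring: log prior + smoothed log likelihood per token.
    word_counts, doc_counts = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokenize(text):
            score += math.log((counts[token] + 1) / (total_words + len(counts)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical training set of pre-categorized documents.
model = train([
    ("invoice payment due net 30 remit total", "finance"),
    ("purchase order invoice amount billing", "finance"),
    ("employee onboarding benefits enrollment form", "hr"),
    ("performance review employee manager feedback", "hr"),
])

print(predict(model, "please remit payment for the attached invoice"))  # finance
```

The model never sees the test document during training; it predicts the category from the word patterns it learned, which is exactly the generalization step the quoted answer describes.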
This answer was eye-opening because it reminded me of my time many years ago at Recommind, an early player and pioneer in machine learning-based predictive coding for reviewing large evidence data sets for eDiscovery.
Early predictive coding involved feeding the application both correct and incorrect examples of case-responsive content. This process, called supervised machine learning, involved many training cycles to push the accuracy rate into the 90% to 95% range so that judges would readily accept the results.
But the payoff for eDiscovery was that responsive content could be very quickly and accurately found in gigantic data sets, dramatically reducing the cost of the eDiscovery review process. This cost savings was realized because attorneys no longer had to rely on groups of paralegals or expensive contract attorneys to read every page of content and determine relevance to the case.
With the eDiscovery/predictive coding example in mind, what if AI could accurately auto-categorize, tag, apply retention/disposition policies, and file vast amounts of data moving around an enterprise?
This would remove the requirement for employees to spend their time trying to manage their local content (which most don’t do anyway) while also making all the local data manageable and visible to central authorities.
For those unfamiliar with the concept of auto-categorization, it refers to the automatic classification, tagging, indexing, and management of documents based on their content and context.
To illustrate what context means for document content, let me offer an example some of you will be familiar with from years ago. How would an application that auto-categorizes based on keywords alone (a standard capability right now) file a document that features the word Penguin? How would it recognize whether the document referred to the black-and-white flightless bird, the publishing house, or the Batman comic book villain? By understanding the context supplied by the rest of the document's content, the AI can tell which Penguin is meant and categorize, tag, and file the document accurately.
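The Penguin example can be made concrete with a toy sense-disambiguation sketch. This is not how an LLM resolves meaning internally; it simply illustrates the core idea that the surrounding words decide the sense. The sense labels and context vocabularies below are hypothetical.

```python
# Hypothetical context vocabularies for each sense of "Penguin".
SENSES = {
    "bird":      {"antarctica", "flightless", "colony", "ice", "wildlife"},
    "publisher": {"paperback", "book", "author", "imprint", "edition"},
    "villain":   {"batman", "gotham", "comic", "dc", "joker"},
}

def disambiguate(text):
    # Score each sense by how many of its context words appear in the document.
    words = set(text.lower().split())
    scores = {sense: len(words & vocab) for sense, vocab in SENSES.items()}
    return max(scores, key=scores.get)

print(disambiguate("Penguin released a new paperback edition for the book"))
# publisher
```

A keyword-only filer sees just "Penguin" and must guess; the context scorer picks the sense whose typical vocabulary overlaps the rest of the document.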
Circling back (sorry for the reference) to the data privacy law inflection point, this capability can be particularly useful for all businesses that deal with, collect, use, and sell PII.
Traditionally (some) employees have manually performed data categorization and management, which can be time-consuming and less-than-accurate. And as I mentioned above, most employees don’t have the time or training to categorize and manage all the documents they encounter daily.
However, with the advent of new AI capabilities, such as ChatGPT and now the more advanced GPT-4, consistent and accurate auto-categorization can be done automatically across massive data sets, even potentially in real time.
One of the primary benefits of using ChatGPT for document auto-categorization is that it is incredibly consistent and accurate. ChatGPT is a machine learning model that has already been trained on vast amounts of data, and it can use this data to predict the correct category for each document.
Because ChatGPT has been trained on a very large dataset, it can recognize patterns and make highly accurate categorization predictions. This means businesses will be able to rely on ChatGPT to correctly and consistently categorize their documents without needing manual/employee intervention.
Another benefit of using ChatGPT for document auto-categorization is that it is incredibly fast. Speed is of the essence when dealing with huge volumes of live documents. This means that businesses can process their documents much more rapidly, improving efficiency and consistency and relieving employees of these non-productive requirements.
Additionally, because ChatGPT can quickly categorize documents, it can be utilized in real-time (live) data flows, which will be particularly useful for businesses that now must “read” and categorize much larger data set flows (live data) due to the data privacy laws.
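Wiring an LLM into a live data flow for categorization mostly comes down to two deterministic pieces: a prompt that constrains the model to a fixed label set, and a parser that normalizes the reply. The sketch below shows that shape; the category list is hypothetical, and the actual model call is left abstract (the `fake_model` stand-in exists only so the flow can be exercised offline).

```python
CATEGORIES = ["contract", "invoice", "hr-record", "marketing", "other"]

def build_prompt(document_text):
    # Constrain the model to a fixed label set so the reply is machine-readable.
    return (
        "Classify the document into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + ". Reply with the category name only.\n\nDocument:\n"
        + document_text
    )

def parse_label(reply):
    # Normalize the model's reply and fall back to 'other' on anything unexpected.
    label = reply.strip().lower()
    return label if label in CATEGORIES else "other"

def categorize(document_text, call_model):
    # call_model is whatever function sends the prompt to the LLM.
    return parse_label(call_model(build_prompt(document_text)))

# Stand-in for a real model call, so the pipeline can be tested without a network.
fake_model = lambda prompt: "Invoice\n"
print(categorize("Total due: $4,200. Remit within 30 days.", fake_model))  # invoice
```

Keeping the prompt and parser separate from the model call is what makes the same pipeline usable in a real-time flow: each document passing through gets one call, one constrained answer, and one tag.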
Using ChatGPT for records auto-categorization will also lead to cost savings for businesses. Traditionally, document categorization has been done manually by employees, which can be inaccurate, time-consuming, and labor-intensive.
However, by using ChatGPT, organizations can free up employees to work on other tasks, raising productivity. Additionally, because ChatGPT can categorize documents quickly and accurately, businesses can avoid the costly errors that arise from inaccurate manual document categorization.
Finally, ChatGPT is a machine-learning model that can learn and improve over time. As businesses continue to use it for document categorization, the model will become more accurate and efficient, and as the technology continues to evolve, businesses can look forward to even greater benefits.
What this means for users and vendors
ChatGPT is quickly being built into many platforms, including Microsoft’s Bing search engine and the Azure Cloud infrastructure.
What does this mean for information/records management applications in the Azure Cloud? Soon, vendors with native Azure applications will be able to build ChatGPT capabilities into their information management applications to provide highly accurate auto-categorization, tagging, litigation hold placement, field-level encryption (of PII), and retention/disposition policy assignment.
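Once a document has a predicted category, the downstream capabilities listed above are largely metadata lookups. The sketch below shows one such step, mapping a category to a retention/disposition policy; the policy table, retention periods, and field names are hypothetical, not any vendor's actual schema.

```python
# Hypothetical retention/disposition policy table keyed by predicted category.
POLICIES = {
    "invoice":   {"retention_years": 7,  "disposition": "destroy"},
    "contract":  {"retention_years": 10, "disposition": "archive"},
    "hr-record": {"retention_years": 6,  "disposition": "destroy"},
}
DEFAULT_POLICY = {"retention_years": 3, "disposition": "review"}

def apply_policy(document, category):
    # Attach the category tag and policy metadata so downstream systems
    # (holds, encryption, disposition jobs) can act on the record.
    policy = POLICIES.get(category, DEFAULT_POLICY)
    return {**document, "category": category, **policy}

record = apply_policy({"id": "doc-001", "source": "shared-drive"}, "invoice")
print(record["retention_years"])  # 7
```

The point of the default entry is that unrecognized categories fall back to a human-review policy rather than silently going unmanaged.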
However, this is only half of the solution I referenced concerning the information management inflection point challenge. The other important requirement all companies will face is gaining access to and managing all corporate data, including that data controlled by individual employees.
The bottom line for information management application vendors is that using ChatGPT for records auto-categorization and related capabilities is a no-brainer because it offers a wide range of benefits for businesses, from improved accuracy to faster processing times, greater employee productivity, and, most importantly, compliance with the new data privacy laws.
Those information management vendors that ignore or are slow to include these new capabilities will lose.