Can ChatGPT Solve Information Management’s Biggest Challenge?

I have often spoken about the coming data privacy inflection point for information management in various blogs, articles, webinars, and podcasts. This inflection point involves the need to manage ALL data within an organization to ensure that new data privacy rights can be addressed.

Most organizations currently capture and centrally manage only between 5% and 10% of all data generated and received within the company. This means that between 90% and 95% of all corporate data is not tracked, indexed, or viewable by a central IT authority or records management personnel. In practice, this unmanaged data is controlled by individual employees and stored locally on employee devices, with no visibility for records management or IT.

However, for companies to comply with the new data privacy laws (and the associated data subject rights), they will need access to all corporate data, including the data held by individual employees. Some have pushed back and asked: why not let employees continue to manage their non-records – we’ll simply ask them for specific PII when we need it.

Two points here. First, most employees don’t manage their local data – i.e., keep it in a logical file structure in one specific location, attach keywords, and assign retention/disposition policies. This is usually due to a lack of training and of time to spend performing records management duties.

And second, the number of end-user data subject access requests (DSAR) for information on their PII will skyrocket in the next couple of years, potentially reaching thousands of requests per year. This fact will make it nearly impossible to rely on employees to search for specific PII on a case-by-case basis.

Because of this current employee information management challenge, companies cannot perform a complete search for, report on, or delete PII when requested due to the lack of data visibility on employee devices.

With this new reality, information governance professionals are facing a formidable task – gaining access to and managing 5 to 10 times more data. However, in this article, I will discuss the potential of managing a great deal more information (after the access challenge is solved) using AI to categorize data, namely a look at ChatGPT for document auto-categorization.

ChatGPT and Information Management

ChatGPT, a large language model developed by OpenAI, has received a great deal of press in the last several months. Because of the notoriety, I, and I’m sure many others, started to think about how AI could address the needs of this new information management reality.

I started this exercise by asking the ChatGPT AI the following question:

Can ChatGPT auto-categorize documents based on content and context?

The AI produced the following answer:

Yes, ChatGPT can be trained to auto-categorize documents based on their content and context. This is a common application of natural language processing (NLP) and machine learning, where the model is trained on a large dataset of categorized documents to learn patterns in the language and context that are associated with each category.

The process of auto-categorization typically involves several steps. First, the text of each document is preprocessed to extract relevant features, such as word frequencies, sentence structure, and topic keywords. Then, these features are fed into a machine learning algorithm, such as a decision tree or a neural network, that learns to classify the documents into predefined categories based on their features.

To train such a model, a large dataset of categorized documents is required. This dataset is used to train the machine learning algorithm by presenting it with examples of documents and their corresponding categories. The algorithm learns to recognize patterns in the language and context of the documents that are associated with each category, and it uses these patterns to predict the category of new, unseen documents.

Overall, auto-categorization is a powerful tool for organizing and analyzing large volumes of text data, and it has a wide range of applications in fields such as content management, marketing, and customer service.
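The pipeline the answer describes – preprocess documents, extract features, train a classifier, then predict categories for unseen documents – can be sketched in a few lines with scikit-learn. The categories and sample documents below are invented for illustration; a real deployment would train on thousands of labeled examples.

```python
# A minimal sketch of the train-then-classify pipeline described above.
# The "finance" / "hr" categories and tiny training set are invented
# for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A (tiny) labeled training set of documents and their categories.
train_docs = [
    "invoice payment due net 30 remittance",
    "purchase order quantity unit price",
    "employee handbook vacation policy benefits",
    "performance review annual goals rating",
]
train_labels = ["finance", "finance", "hr", "hr"]

# Extract word-frequency (TF-IDF) features and fit a classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Predict the category of a new, unseen document.
print(model.predict(["quarterly invoice and remittance advice"])[0])
```

The same fitted model can then be applied to every incoming document, which is what makes the approach attractive at enterprise scale.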

This answer was eye-opening because it reminded me of my time at Recommind many years ago, an early player and pioneer in machine learning-based predictive coding for reviewing large evidence data sets for eDiscovery.

Early predictive coding involved providing the application with both correct and incorrect examples of case-responsive content. This process, called supervised machine learning, involved many training cycles to ensure the accuracy rate was in the 90% to 95% range so that judges would readily accept the results.

But the payoff for eDiscovery was that responsive content could be very quickly and accurately found in gigantic data sets, dramatically reducing the cost of the eDiscovery review process. This cost savings was realized because attorneys no longer had to rely on groups of paralegals or expensive contract attorneys to read every page of content and determine relevance to the case.

With the eDiscovery/predictive coding example in mind, what if AI could accurately auto-categorize, tag, apply retention/disposition policies, and file vast amounts of data moving around an enterprise?

This would remove the requirement for employees to spend their time trying to manage their local content (which most don’t do anyway) while also making all the local data manageable and visible to central authorities.

For those unfamiliar with the concept of auto-categorization, it refers to the automatic classification, tagging, indexing, and management of documents based on their content and context.

To illustrate how context matters, let me offer an example some of you will remember from years ago. How would an application that auto-categorizes based on keywords alone (a standard capability right now) file a document featuring the word Penguin? How would the application recognize whether the document was referring to the black-and-white flightless bird, the publishing house, or the Batman comic book villain? By understanding the context supplied by the rest of the document’s content, the AI can determine which Penguin the document is referring to and categorize, tag, and file it accurately.
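The Penguin example can be made concrete with a toy disambiguation routine: score each candidate sense by how many of its context words co-occur in the document. An LLM does this far more robustly, but the principle is similar. The context-word lists here are illustrative assumptions, not a real knowledge base.

```python
# Toy context-based disambiguation of "Penguin": pick the sense whose
# context vocabulary overlaps the document the most. The sense lists
# are illustrative assumptions only.
import re

SENSES = {
    "bird":      {"antarctic", "flightless", "colony", "krill", "ice"},
    "publisher": {"paperback", "imprint", "author", "edition", "books"},
    "villain":   {"batman", "gotham", "comic", "dc", "joker"},
}

def disambiguate(text: str) -> str:
    """Return the sense of 'Penguin' best supported by the document's words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

doc = "The penguin waddled across the Antarctic ice toward its colony."
print(disambiguate(doc))  # → bird
```

A real system would use embeddings or an LLM rather than hand-built word lists, but the categorization decision still rests on surrounding context rather than the keyword alone.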

Circling back (sorry for the reference) to the data privacy law inflection point, this capability can be particularly useful for all businesses that deal with, collect, use, and sell PII.

Traditionally (some) employees have manually performed data categorization and management, which can be time-consuming and less-than-accurate. And as I mentioned above, most employees don’t have the time or training to categorize and manage all the documents they encounter daily.

However, with the advent of new AI capabilities, such as ChatGPT and now the more advanced GPT-4, consistent and accurate auto-categorization can be done automatically across massive data sets – even potentially in real time.

One of the primary benefits of using ChatGPT for document auto-categorization is that it is incredibly consistent and accurate. ChatGPT is a machine learning model that has already been trained on vast amounts of data, and it can use this data to predict the correct category for each document.

Because ChatGPT has been trained on a very large dataset, it can recognize patterns and make highly accurate categorization predictions. This means businesses will be able to rely on ChatGPT to correctly and consistently categorize their documents without needing manual/employee intervention.

Another benefit of using ChatGPT for document auto-categorization is that it is incredibly fast. Speed is of the essence when dealing with huge volumes of live documents. This means that businesses can process their documents much more rapidly, improving efficiency and consistency and relieving employees of these non-productive requirements.

Additionally, because ChatGPT can quickly categorize documents, it can be utilized in real-time (live) data flows, which will be particularly useful for businesses that now must “read” and categorize much larger data set flows (live data) due to the data privacy laws.

Using ChatGPT for records auto-categorization will also lead to cost savings for businesses. Traditionally, document categorization has been done manually by employees, which can be inaccurate, time-consuming, and labor-intensive.

However, by using ChatGPT, organizations can free up employees to work on other tasks, raising productivity. Additionally, because ChatGPT can categorize documents quickly and accurately, businesses can avoid the costly errors that arise from inaccurate manual document categorization.

Finally, ChatGPT is a machine-learning model that can learn and improve over time. As businesses continue to use ChatGPT for document categorization, the model will become more accurate and efficient, leading to even greater benefits in the long run. As ChatGPT continues to evolve, it will likely become even more sophisticated, which means that businesses can look forward to even more significant benefits in the future.

What this means for users and vendors

ChatGPT is quickly being built into many platforms, including Microsoft’s Bing search engine and the Azure Cloud infrastructure.

What does this mean for information/records management applications in the Azure Cloud? Soon vendors with native Azure applications will be able to design ChatGPT capabilities into their information management applications to provide highly accurate auto-categorization, tagging, litigation hold placement, field-level encryption (of PII), and retention/disposition policy placement.    

However, this is only half of the solution I referenced concerning the information management inflection point challenge. The other important requirement all companies will face is gaining access to and managing all corporate data, including that data controlled by individual employees.

The bottom line for information management application vendors is that using ChatGPT for records auto-categorization and related capabilities is a no-brainer, because it offers a wide range of benefits for businesses: improved accuracy, faster processing times, greater employee productivity, and, most importantly, compliance with the new data privacy laws.

Those information management vendors that ignore or are slow to include these new capabilities will lose.


“Move to Manage” versus “Manage in Place”

Traditional approaches to information management are, generally speaking, no longer suitable for today’s information management needs. The legacy “move-to-manage” premise is expensive, fraught with difficulties, and at odds with modern data repositories that (a) are cloud-based, (b) have built-in governance tools, or (c) contain data that best resides in its native repository.

In reality, traditional records management and ECM systems only manage a small percentage of an organization’s total information. A successful implementation is often considered one that manages just 5% of the information that exists. What about all the information not deemed a “record”?

Traditional archiving systems tend to capture everything and, for the most part, cause organizations to keep their archived information for much longer periods of time, or forever. Corporate data volumes and the data landscape have changed dramatically since archiving systems became widely adopted. Some organizations are discovering the high cost of getting their data out, while others are experiencing end-user productivity issues, incompatible stubs or shortcuts, and a lack of support for the modern interfaces through which users expect to access their information.

The unstructured data problem, along with the emerging reality of the cloud, has brought us to an inflection point: either continue to use decades-old, high-cost, complex approaches to manage huge quantities of information, or proactively govern this information where it naturally resides to more effectively identify and organize it and advance the best possible outcomes for security, compliance, litigation response, and innovation.

Today’s enterprise-ready hardware and storage solutions, as well as scalable business productivity applications featuring built-in governance tools, are both affordable and easily accessible. For forward-thinking organizations, there is no question that in-place information management is the most viable and cost-effective methodology for information management in the 21st century.

An Acaevo white paper on the subject can be downloaded here

Successful Predictive Coding Adoption is Dependent on Effective Information Governance

Predictive coding has been receiving a great deal of press lately (for good reason), especially with the ongoing case Da Silva Moore v. Publicis Groupe, No. 11 Civ. 1279 (ALC) (AJP), 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012). On May 21, the plaintiffs filed Rule 72(a) objections to Magistrate Judge Peck’s May 7, 2012 discovery rulings related to the relevance of certain documents that comprise the seed set of the parties’ ESI protocol.

This Rule 72(a) objection highlights an important point in the adoption of predictive coding technologies: the technology is only as good as the people AND processes supporting it.

To review: predictive coding is a process in which a computer (with the requisite software) does the vast majority of the work of deciding whether data is relevant, responsive, or privileged in a given case.

Beyond simple keyword matching (byte for byte), predictive coding adopts a computer self-learning approach. To accomplish this, attorneys and other legal professionals provide example responsive documents/data in a statistically sufficient quantity, which in turn “trains” the computer as to what relevant documents/content should be flagged and set aside for discovery. This is done iteratively: legally trained professionals fine-tune the seed set over time until it represents a statistically relevant sample that includes examples of all possible relevant content and formats. The same capability can also be used to find and secure privileged documents.

Instead of legally trained people reading every document to determine whether it is relevant to a case, the computer can perform a first pass of this task in a fraction of the time, with much more repeatable results. This technology is exciting because it can dramatically reduce the cost of the discovery/review process – by as much as 80%, according to the RAND Institute for Civil Justice.
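The supervised-learning loop described above can be sketched schematically: legal professionals label a seed set as responsive or non-responsive, a model trains on it, and the model then performs a first-pass review. All the documents and labels below are invented for illustration; real seed sets are far larger and statistically sampled.

```python
# A schematic sketch of predictive coding's supervised-learning loop.
# Documents and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

seed_docs, seed_labels = [], []

def add_seed_examples(docs, labels):
    """Attorneys iteratively expand the seed set with labeled examples."""
    seed_docs.extend(docs)
    seed_labels.extend(labels)

# One training cycle of the iterative process: both correct (responsive)
# and incorrect (non-responsive) examples are supplied.
add_seed_examples(
    ["merger due diligence memo", "contract breach damages claim"],
    ["responsive", "responsive"],
)
add_seed_examples(
    ["friday lunch menu", "office party invitation"],
    ["non-responsive", "non-responsive"],
)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(seed_docs, seed_labels)

# First-pass review: the model flags likely responsive documents.
review_queue = ["draft merger agreement and diligence checklist",
                "friday lunch order"]
print(list(model.predict(review_queue)))
```

In practice, each cycle’s misclassifications are fed back as new seed examples until accuracy on a holdout sample reaches the target range.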

By now you may be asking yourself what this has to do with Information Governance?…

For predictive coding to become fully adopted across the legal spectrum, all sides have to agree that (1) the technology works as advertised, and (2) the legal professionals are providing the system with the proper seed sets for it to learn from. To accomplish the second point, the seed set must include content from all possible sources of information. If the seed-set trainers don’t have access to all potentially responsive content to draw from, then the seed set is in question.

Knowing where all the information resides and having the ability to retrieve it quickly is imperative to an effective discovery process. Records/information management professionals should view this new technology as an opportunity to become an even more essential partner to the legal department and the entire organization by focusing not just on “records” but on information across the entire enterprise. With full-fledged information management programs in place, the legal department will be able to fully embrace this technology and drastically reduce its cost of discovery.

Automatic Deletion…A Good Idea?

In my last blog, I discussed the concept of Defensible Disposal: getting rid of data that has no value, to lower the cost and risk of eDiscovery as well as overall storage costs (IBM has been a leader in Defensible Disposal for several years). Custodians keep data because they might need to reuse some of the content later, or because they might have to produce it later for CYA reasons. I have been guilty of this over the years, and because of it I have a huge amount of old data on external disks that I will probably never, ever look at again. For example, I have over 500 GB of saved data – spreadsheets, presentations, PDFs, .wav files, MP3s, Word docs, URLs, etc. – that I have saved for whatever reason over the years. Have I ever really reused any of that data? Maybe a couple of times, but in reality the files just sit there.

This brings up the subject of the data lifecycle. Fred Moore, founder of Horison Information Strategies, wrote about this concept years ago, referring to the lifecycle of data and the probability that saved data will ever be re-used or even looked at again. Fred created a graphic showing this lifecycle of data.

Figure 1: The Lifecycle of Data – Horison Information Strategies

The chart above shows that as data ages, the probability of reuse drops very quickly, even as the amount of saved data rises. Once data has aged 90 days, its probability of reuse approaches 1%, and after one year it is well under 1%.

You’re probably asking yourself: so what? Storage is cheap – what’s the big deal? And storage is cheap. I have 500 GB of storage available on my new company-supplied laptop. I have share drives available to me. And I have 1 TB of storage in my home office. I can buy 1 TB of external disk for approximately $100, so why not keep everything forever?

For organizations, it’s partly a question of storage, but more importantly it’s a question of legal risk and the cost of eDiscovery. Any existing data could become the subject of litigation and therefore reviewable. You may recall that in my last blog I mentioned a recent report from the RAND Institute for Civil Justice, which discussed the costs of eDiscovery, including the estimate that reviewing records/files accounts for approximately 73% of every eDiscovery dollar spent. Saving everything because you might someday need to reuse or reference it drives the cost of eDiscovery way up.

The key question to ask is: how do you get employees to delete stuff instead of keeping everything? In most organizations, the culture has always been one of “save whatever you want until your hard disk and share drive are full.” This culture is extremely difficult to change quickly. One way is to force new behavior with technology. I know of a couple of companies that only allow files to be saved to a specific folder on the user’s desktop. For higher-level laptop users, as the user syncs to the organization’s infrastructure, all files saved to that specific folder are copied to the user’s share drive, where an information management application applies retention policies to the data on the share drive as well as to the laptop’s data folder.

In my opinion, this extreme process would not work in most organizations due to cultural expectations. So again we’re left with the question: how do you get employees to delete stuff?

Organizational cultures about data handling and retention have to be changed over time. This includes specific guidance during new employee orientation, employee training, and slow technology changes. An example could be reducing the amount of storage available to an employee on the share or home drive.

Another example could be process changes on an employee’s workstation or laptop. Force the default storage target to be the “My Documents” folder. In Phase 1, all files must be saved to the “My Documents” folder, but they can be moved anywhere after that.

Phase 2 could add a 90-day time limit on the “My Documents” folder, so that anything older than 90 days is automatically deleted (with litigation hold safeguards in place). Files not deemed important enough to move would thus be treated as being of little value and “disposable.” Phase 3 could remove the ability to move files out of the “My Documents” folder (while still letting users create subfolders with no time limit), thereby ensuring a single place of discoverable data.

Again, this strategy needs to be a slow progression to minimize the perceived changes for the user population.

The point is that this is an end-user problem, not necessarily an IT problem. End users have to be trained, gently pushed, and eventually forced to get rid of useless data.

Defensible Disposal and Predictive Coding Reduce (?) eDiscovery Costs by 65%

Following Judge Peck’s decision on predictive coding in February of 2012, yet another Judge has gone in the same direction. In Global Aerospace Inc., et al, v. Landow Aviation, L.P. dba Dulles Jet Center, et al (April 23, 2012), Judge Chamblin, a state judge in the 20th Judicial Circuit of Virginia’s Loudoun Circuit Court, wrote:

“Having heard argument with regard to the Motion of Landow Aviation Limited Partnership, Landow Aviation I, Inc., and Landow & Company Builders, Inc., pursuant to Virginia Rules of Supreme Court 4:1 (b) and (c) and 4:15, it is hereby ordered Defendants shall be allowed to proceed with the use of predictive coding for the purposes of the processing and production of electronically stored information.”

This decision came despite the plaintiffs’ objections that the technology is not as effective as purely human review.

This decision comes on top of a new RAND Institute for Civil Justice report which highlights a couple of important points. First, the report estimated that $0.73 of every dollar spent on eDiscovery can be attributed to the “Review” task. RAND also called out a study showing an 80% time savings in attorney review hours when predictive coding was utilized.

This suggests that the use of predictive coding could, optimistically, reduce an organization’s eDiscovery costs by 58.4%.

The barriers to the adoption of predictive coding technology are (still):

  • Outside counsel may be slow to adopt this due to the possibility of losing a large revenue stream
  • Outside and Internal counsel will be hesitant to rely on new technology without a track record of success
  • The need for additional guidance from judges

These barriers will be overcome relatively quickly.

Let’s take this cost-saving projection further. In my last blog, I talked about “Defensible Disposal” – in other words, getting rid of old data not needed by the business. It is estimated that the cost of review can be reduced by 50% simply by utilizing an effective Information Governance program. Applying the Defensible Disposal strategy brings the $0.73 of every eDiscovery review dollar down to $0.365.

Now, if predictive coding can reduce the remaining 50% of the cost of eDiscovery review by 80%, as suggested in the RAND report, then between the two strategies a total eDiscovery savings of approximately 65.7% could be achieved. To review, let’s look at the math.

Start with $0.73 of every eDiscovery dollar attributed to the review process.

A 50% saving from Defensible Disposal brings the cost of review down to $0.365.

Applying the additional 80% review savings from predictive coding:

$0.365 × 0.2 (i.e., 1 − 0.8) = $0.073 (total cost of review after savings from both strategies)

To finish the calculation, we add back the cost not related to review (processing and collection), which is $0.27:

Total cost of eDiscovery = $0.073 + $0.27 = $0.343, for a savings of $1.00 − $0.343 = 0.657, or 65.7%.
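The back-of-the-envelope calculation above can be restated in a few lines of code, which makes it easy to swap in your own assumptions for the review share or the two savings rates.

```python
# The eDiscovery savings calculation above, restated in code. All input
# percentages come from the text; change them to test other scenarios.
review_share = 0.73               # review's share of each eDiscovery dollar (RAND)
other_share = 1 - review_share    # processing and collection: $0.27

after_disposal = review_share * 0.5    # defensible disposal halves review cost
after_pc = after_disposal * (1 - 0.8)  # predictive coding cuts review by 80% more

total_cost = after_pc + other_share
savings = 1 - total_cost
print(f"{savings:.1%}")  # → 65.7%
```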

As with any estimate, your mileage may vary, but this exercise points out the potential cost savings from utilizing just two strategies: Defensible Disposal and Predictive Coding.

Information Management Cost Reduction Strategies for Litigation

In these still-uncertain economic times, most legal departments are looking for ways to reduce, or at least stop the growth of, their legal budgets. One of the most obvious targets for cost reduction in any legal department is the cost of responding to eDiscovery, including the cost of finding all potentially responsive ESI, culling it down, and then having in-house or external attorneys review it for relevance and privilege. Per a CGOC survey, the average GC spends approximately $3 million per discovery to gather and prepare information for opposing counsel in litigation.

Most organizations are looking for ways to reduce these growing costs of eDiscovery. The top four cost reduction strategies legal departments are considering are:

  • Bring more evidence analysis and do more ESI processing internally
  • Keep more of the review of ESI in house rather than utilizing outside law firms
  • Look at off-shore review
  • Pressure external law firms for lower rates

I don’t believe these strategies address the real problem: the huge and growing amount of ESI.

Several eDiscovery experts have told me that the average eDiscovery matter can include between 2 and 3 GB of potentially responsive ESI per employee. To put that in context, 1 GB of data can contain between 10,000 and 75,000 pages of content. Multiply that by 3, and you are potentially looking at between 30,000 and 225,000 pages of content that should be reviewed for relevance and privilege per employee. Now consider that litigation and eDiscovery usually involve more than one employee – anywhere from two to hundreds.
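The reviewer-workload arithmetic above is easy to generalize to a whole matter. The 2–3 GB per custodian and 10,000–75,000 pages per GB figures come from the text; the function itself is just an illustrative calculator.

```python
# Rough review-volume calculator using the figures quoted in the text:
# 10,000-75,000 pages per GB, about 3 GB per custodian.
PAGES_PER_GB = (10_000, 75_000)

def review_pages(custodians: int, gb_each: float = 3.0) -> tuple[int, int]:
    """Return a rough (low, high) range of pages to review for a matter."""
    low, high = PAGES_PER_GB
    return int(low * gb_each * custodians), int(high * gb_each * custodians)

print(review_pages(1))    # a single custodian
print(review_pages(100))  # a matter with a hundred custodians
```

Even at the conservative end, a hundred-custodian matter implies millions of pages, which is why review dominates eDiscovery spend.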

It seems to me that the most straightforward and common-sense way to reduce eDiscovery costs is to proactively manage the information that could be pulled into an eDiscovery matter.

To illustrate this proactive information management strategy for eDiscovery, we can look at the overused but still appropriate DuPont case study from several years ago.

DuPont re-examined nine cases. It determined that it had reviewed a total of 75,450,000 pages of content across those nine cases, of which 11,040,000 turned out to be responsive. DuPont also looked at where those 75 million pages stood in its records management process. It found that approximately 50% of them were beyond their documented retention period, meaning they should have been destroyed and never reviewed for any of the nine cases. DuPont calculated that it spent $11,961,000 reviewing this content. In other words, it spent $11.9 million reviewing documents that should not have existed if its records retention schedule and policy had been followed.
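The DuPont figures can be checked, and an implied per-page review cost derived, with a few lines of arithmetic. The per-page cost is a number I am deriving here for illustration, not one quoted in the study.

```python
# Checking the DuPont figures quoted above. The implied per-page review
# cost is a derived, illustrative number, not one from the study itself.
pages_reviewed = 75_450_000
past_retention_share = 0.50       # share of pages beyond their retention period
spend_on_expired = 11_961_000     # dollars spent reviewing that expired content

expired_pages = pages_reviewed * past_retention_share
cost_per_page = spend_on_expired / expired_pages
print(f"{expired_pages:,.0f} expired pages at about ${cost_per_page:.2f} per page")
```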

An information management program, besides capturing and making ESI available for use, includes the defensible deletion of ESI that has reached the end of its retention period and therefore is valueless to the organization.

Corporate counsel should be the biggest proponents of information governance in their organizations simply due to the fact that it affects their budgets directly.

The ROI of Information Management

Information, data, electronically stored information (ESI), records, documents, hard-copy files, email, stuff – no matter what you call it, it’s all intellectual property that your organization pays individuals to produce, interpret, use, and export to others. After people, it’s a company’s most valuable asset, and it has many CIOs, GCs, and others responsible asking: What’s in that information? Who controls it? And where is it stored?

In the simplest terms, I believe that businesses exist to generate and use information to produce revenue and profit. If you’re willing to go along with me and think of information as a commodity in this way, we must also ask: How much does it cost to generate all that information? And what’s the return on investment (ROI) for all that information?

The vast majority of information in an organization is not managed, not indexed, not backed up and, as you probably know or could guess, is rarely, if ever, accessed. Consider for a minute all the data in your company that is not centrally managed and not easily available. This data includes backup tapes, share drives, employee hard disks, external disks, USB drives, CDs, DVDs, email attachments sent outside the organization, and hard-copy documents hidden away in filing cabinets.

Here’s the bottom line: if your company can’t find information or doesn’t know what it contains, that information is of little value. In fact, it’s valueless.

Now consider the amount of money the average company spends on an annual basis for the production, use and storage of information. These expenditures span:

  • Employee salaries. Most employees are in one way or another hired to produce, digest and act on information.
  • Employee training and day-to-day help-desk support.
  • Computers for each employee
  • Software
  • Email boxes
  • Share drives, storage
  • Backup systems
  • IT employees for data infrastructure support

In one way or another, companies exist to create and utilize information. So… do you know where all your information is and what’s in it? What’s your organization’s true ROI on the production and consumption of information across your entire organization? How much higher could it be if you had complete control of it?

As an example, I have approximately 14.5 GB of Word documents, PDFs, PowerPoint files, spreadsheets, and other types of files in different formats that I’ve either created or received from others. Until recently, I had 3.65 GB of emails in my email box both on the Exchange server and mirrored locally on my hard disk. Now that I have a 480 MB mailbox limit imposed on me, 3.45 GB of those emails are now on my local hard disk only.

How much real, valuable information is contained in the collective 18 GB on my laptop? The average number of pages contained in 1 GB is conservatively 10,000, so 18 GB of files equals approximately 180,000 pages of information for a single employee – none of it easily accessible or searchable by my organization. Now also consider the millions of pages of hard-copy records sitting in file cabinets, microfiche, and long-term storage all around the company.

The main question is this: What could my organization do with quick and intelligent access to all of its employees’ information?

The more efficient your organization is in managing and using information, the higher the revenue and hopefully profit per employee will be.

Organizations need to be able to “walk the fence” between not impeding the free flow of information generation and sharing, and giving the organization as a whole a way to find and use that information. Intelligent access to all information generated by an organization is key to effective information management.

Organizations spend huge sums of money to generate information, so why not get your money’s worth? This future capability is the essence of true information management and of much higher ROI for your organization.