Can ChatGPT Solve Information Management’s Biggest Challenge?


I have often spoken about the coming data privacy inflection point for information management in various blogs, articles, webinars, and podcasts. This inflection point involves the need to manage ALL data within an organization to ensure that new data privacy rights can be addressed.

Most organizations currently capture and centrally manage only between 5% and 10% of all data generated and received within the company. This means that 90% to 95% of all corporate data is not tracked, indexed, or viewable by a central IT authority or records management personnel. In reality, this unmanaged data is controlled by individual employees and stored locally on employee devices, with no visibility for records management or IT.

However, for companies to comply with the new data privacy laws (and the associated data subject rights), they will need access to all corporate data, including the data held by individual employees. Some have pushed back, asking why not let employees continue to manage their non-records and simply ask them for specific PII when the need arises.

Two points here. First, most employees don’t actually manage their local data, i.e., keep it in a logical file structure in one specific location, attach keywords, and assign retention/disposition policies, usually because they lack the training and the time to perform records management duties.

And second, the number of end-user data subject access requests (DSARs) for information on their PII will skyrocket in the next couple of years, potentially reaching thousands of requests per year. This will make it nearly impossible to rely on employees to search for specific PII on a case-by-case basis.

Because of this lack of visibility into data held on employee devices, companies cannot completely search for, report on, or delete PII when requested.

With this new reality, information governance professionals are facing a formidable task: gaining access to and managing 10 to 20 times the data they handle today. In this article, I will discuss the potential of managing that much larger volume of information (once the access challenge is solved) using AI to categorize data, specifically a look at ChatGPT for document auto-categorization.

ChatGPT and Information Management

ChatGPT, a large language model developed by OpenAI, has received a great deal of press in the last several months. Because of the notoriety, I, and I’m sure many others, started to think about how AI could address the needs of this new information management reality.

I started this exercise by asking the ChatGPT AI the following question:

Can ChatGPT auto-categorize documents based on content and context?

The AI produced the following answer:

Yes, ChatGPT can be trained to auto-categorize documents based on their content and context. This is a common application of natural language processing (NLP) and machine learning, where the model is trained on a large dataset of categorized documents to learn patterns in the language and context that are associated with each category.

The process of auto-categorization typically involves several steps. First, the text of each document is preprocessed to extract relevant features, such as word frequencies, sentence structure, and topic keywords. Then, these features are fed into a machine learning algorithm, such as a decision tree or a neural network, that learns to classify the documents into predefined categories based on their features.

To train such a model, a large dataset of categorized documents is required. This dataset is used to train the machine learning algorithm by presenting it with examples of documents and their corresponding categories. The algorithm learns to recognize patterns in the language and context of the documents that are associated with each category, and it uses these patterns to predict the category of new, unseen documents.

Overall, auto-categorization is a powerful tool for organizing and analyzing large volumes of text data, and it has a wide range of applications in fields such as content management, marketing, and customer service.

This answer was eye-opening because it reminded me of my time at Recommind many years ago, an early player and pioneer in machine learning-based predictive coding for reviewing large evidence data sets for eDiscovery.

Early predictive coding involved providing the application with both correct and incorrect examples of case-responsive content. This process, known as supervised machine learning, involved many training cycles to get the accuracy rate into the 90% to 95% range so that judges would readily accept the results.
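
To make the supervised-learning idea concrete, here is a minimal sketch of the kind of text classifier that sits behind predictive coding and auto-categorization, using scikit-learn’s TF-IDF features and logistic regression. The sample documents, labels, and categories are purely illustrative; a real deployment would train on thousands of reviewed examples.

```python
# A minimal sketch of supervised document classification, in the spirit of
# early predictive coding: train on labeled examples, then predict categories
# for unseen documents. Sample texts and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled training examples: document text plus the category a reviewer assigned.
train_docs = [
    "Invoice for Q3 consulting services, payment due in 30 days",
    "Employee handbook update: revised vacation accrual policy",
    "Master services agreement between the company and the vendor",
    "Payroll summary and direct deposit confirmation for March",
]
train_labels = ["finance", "hr", "legal", "finance"]

# TF-IDF turns each document into word-frequency features; the classifier
# learns which patterns of language are associated with each category.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

# Predict the category of a new, unseen document.
new_doc = ["Amendment to the non-disclosure agreement signed last year"]
print(model.predict(new_doc))  # e.g. ['legal']
```

The accuracy that courts came to expect from predictive coding was reached only with much larger labeled sets and the repeated training and validation cycles described above; the same holds for auto-categorization.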

But the payoff for eDiscovery was that responsive content could be very quickly and accurately found in gigantic data sets, dramatically reducing the cost of the eDiscovery review process. This cost savings was realized because attorneys no longer had to rely on groups of paralegals or expensive contract attorneys to read every page of content and determine relevance to the case.

With the eDiscovery/predictive coding example in mind, what if AI could accurately auto-categorize, tag, apply retention/disposition policies, and file vast amounts of data moving around an enterprise?

This would remove the requirement for employees to spend their time trying to manage their local content (which most don’t do anyway) while also making all the local data manageable and visible to central authorities.

For those unfamiliar with the concept of auto-categorization, it refers to the automatic classification, tagging, indexing, and management of documents based on their content and context.

To illustrate what context means here, let me offer an example some of you may remember from years ago. How would an application that auto-categorizes based on keywords (a standard capability today) file a document that features the word Penguin? How would it recognize whether the document was referring to the black-and-white flightless bird, the publishing house, or the Batman comic-book villain? By understanding the context supplied by the rest of the document’s content, the AI can determine which Penguin the document is referring to and categorize, tag, and file it accurately.
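
As a rough illustration of how a large language model can use that surrounding context, the sketch below asks a chat model to pick one category for a document that mentions Penguin. It assumes the OpenAI Python client and an API key in the environment; the model name, category list, and sample document are placeholders, not a prescribed implementation.

```python
# A rough sketch of context-aware categorization with a chat model.
# Assumes the OpenAI Python client (pip install openai) and an API key in
# the OPENAI_API_KEY environment variable; categories are placeholders.
from openai import OpenAI

client = OpenAI()

document = (
    "Penguin announced a new hardcover imprint this spring, signing three "
    "debut novelists and expanding its editorial staff in London."
)

prompt = (
    "Classify the document into exactly one category: "
    "wildlife, publishing, or entertainment.\n\n"
    f"Document: {document}\n\n"
    "Answer with the category name only."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: publishing
```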

Circling back (sorry for the reference) to the data privacy law inflection point, this capability can be particularly useful for all businesses that deal with, collect, use, and sell PII.

Traditionally (some) employees have manually performed data categorization and management, which can be time-consuming and less-than-accurate. And as I mentioned above, most employees don’t have the time or training to categorize and manage all the documents they encounter daily.

However, with the advent of new AI capabilities such as ChatGPT, and now the more advanced GPT-4, consistent and accurate auto-categorization can be done automatically across massive data sets, even potentially in real time.

One of the primary benefits of using ChatGPT for document auto-categorization is that it is incredibly consistent and accurate. ChatGPT is a machine learning model that has already been trained on vast amounts of data, and it can use this data to predict the correct category for each document.

Because ChatGPT has been trained on a very large dataset, it can recognize patterns and make highly accurate categorization predictions. This means businesses will be able to rely on ChatGPT to correctly and consistently categorize their documents without needing manual/employee intervention.

Another benefit of using ChatGPT for document auto-categorization is that it is incredibly fast. Speed is of the essence when dealing with huge volumes of live documents. This means that businesses can process their documents much more rapidly, improving efficiency and consistency and relieving employees of these non-productive requirements.

Additionally, because ChatGPT can quickly categorize documents, it can be utilized in real-time (live) data flows, which will be particularly useful for businesses that now must “read” and categorize much larger data set flows (live data) due to the data privacy laws.

Using ChatGPT for records auto-categorization will also lead to cost savings for businesses. Traditionally, document categorization has been done manually by employees, which can be inaccurate, time-consuming, and labor-intensive.

However, by using ChatGPT, organizations can free up employees to work on other tasks, raising productivity. Additionally, because ChatGPT can categorize documents quickly and accurately, businesses can avoid the costly errors that arise from inaccurate manual document categorization.

Finally, ChatGPT is a machine-learning model that can learn and improve over time. As businesses continue to use ChatGPT for document categorization, the model will become more accurate and efficient, leading to even greater benefits in the long run. As ChatGPT continues to evolve, it will likely become even more sophisticated, which means that businesses can look forward to even more significant benefits in the future.

What this means for users and vendors

ChatGPT is quickly being built into many platforms, including Microsoft’s Bing search engine and the Azure Cloud infrastructure.

What does this mean for information/records management applications in the Azure Cloud? Soon, vendors with native Azure applications will be able to design ChatGPT capabilities into their information management applications to provide highly accurate auto-categorization, tagging, litigation hold placement, field-level encryption (of PII), and retention/disposition policy assignment.

However, this is only half of the solution I referenced concerning the information management inflection point challenge. The other important requirement all companies will face is gaining access to and managing all corporate data, including that data controlled by individual employees.

The bottom line for information management application vendors is that using ChatGPT for records auto-categorization and related capabilities is a no-brainer because it offers a wide range of benefits for businesses, from improved accuracy and faster processing times to greater employee productivity and, most importantly, compliance with the new data privacy laws.

Those information management vendors that ignore or are slow to include these new capabilities will lose.


The Lifecycle of Information – Updated


Organizations habitually over-retain information, especially unstructured electronic information, for all kinds of reasons. Many organizations simply have not addressed what to do with it, so they fall back on relying on individual employees to decide what should be kept, for how long, and what should be disposed of. At the opposite end of the spectrum, a minority of organizations have tried centralized enterprise content management systems, found them difficult to use, and watched employees work around them, keeping huge amounts of data locally on their workstations, on removable media, in cloud accounts, or on rogue SharePoint sites used as “data dumps” with little or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value.

Because of this lack of management, information continues to accumulate. This build-up raises the cost of storage as well as the risk associated with eDiscovery. In reality, as information ages, its probability of re-use, and therefore its value, shrinks quickly. Fred Moore, founder of Horison Information Strategies, wrote about this concept years ago as the Lifecycle of Data. Figure 1 below shows that as data ages, the probability of reuse drops very quickly even as the amount of saved data rises. Once data has aged 10 to 15 days, the probability that it will ever be looked at again approaches 1%, and as it continues to age that probability approaches, but never quite reaches, zero (figure 1, blue shading).


Figure 1: The Lifecycle of Information

Contrast that with the possibility that a large part of any organizational data store has little or no business, legal, or regulatory value. In fact, the Compliance, Governance and Oversight Council (CGOC) conducted a survey in 2012 showing that, on average, 1% of organizational data is subject to litigation hold, 5% is subject to regulatory retention, and 25% has some business value (figure 2, green shading). This means that approximately 69% of an organization’s data store has no legal, regulatory, or business value and could be disposed of without consequence.

The average employee conservatively creates, sends, receives, and stores 20 MB of data per day. At that rate, after 15 days an employee has accumulated roughly 220 MB of new data, after 90 days about 1.26 GB, and after three years 15.12 GB (assuming nothing is deleted). So how much of this accumulated data actually needs to be retained? Referring again to figure 2 below, the red-shaded area represents the information that probably has no legal, regulatory, or business value according to the 2012 CGOC survey. At the end of three years, the amount of retained data from a single employee that could be disposed of without adverse effects to the organization is 10.43 GB. Multiply that by the total number of employees and you are looking at some very large data stores.


Figure 2: The Lifecycle of Information Value
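
The arithmetic behind those figures is easy to reproduce. The short sketch below uses the 20 MB-per-day figure and the CGOC percentages quoted above; the 252 working days per year is my assumption, chosen because it reproduces the 15.12 GB three-year total.

```python
# Back-of-the-envelope reproduction of the accumulation figures above.
# 20 MB/day and the CGOC percentages come from the text; 252 working days
# per year is an assumption that reproduces the 15.12 GB three-year figure.
MB_PER_DAY = 20
WORKING_DAYS_PER_YEAR = 252

# Share of data with legal, regulatory, or business value (2012 CGOC survey).
litigation_hold = 0.01
regulatory_retention = 0.05
business_value = 0.25
disposable_fraction = 1 - (litigation_hold + regulatory_retention + business_value)  # ~0.69

three_year_total_gb = MB_PER_DAY * WORKING_DAYS_PER_YEAR * 3 / 1000  # 15.12 GB
disposable_gb = three_year_total_gb * disposable_fraction            # ~10.43 GB

print(f"Accumulated over three years: {three_year_total_gb:.2f} GB per employee")
print(f"Disposable without consequence: {disposable_gb:.2f} GB per employee")

# Scale to the whole organization (illustrative headcount).
employees = 1000
print(f"Company-wide disposable data: {disposable_gb * employees / 1000:.1f} TB")
```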

The Lifecycle of Information Value graphic above shows that employees really don’t need all of the data they squirrel away (because its probability of re-use drops to about 1% at around 15 days), and based on the CGOC survey, approximately 69% of organizational data is not required for legal or regulatory retention and has no business value. The difficult part of this whole process is how an organization can efficiently determine what data is not needed and dispose of it using automation (because employees probably won’t). As unstructured data volumes continue to grow, automatic categorization of data is quickly becoming the only realistic way to get ahead of the data flood. Without accurate automated categorization, the ability to quickly find the data you need will never be realized. Even better, if data categorization can be based on the value of the content, not just a simple rule or keyword match, highly accurate categorization, and therefore information governance, is achievable.

Are Law Firms the Weakest Link in the Information Security Chain?


Many law firms are unwittingly setting themselves up as prime targets for cyber criminals. But it is not the firm’s own data that hackers are after – it is the huge volume of client data that law firms handle on a daily basis that makes them so appealing to target.

eDiscovery continues to generate huge and ever-growing data sets of ESI for law firms to manage. Those data sets are often passed to the client’s law firm for processing, review, and production. The end result is that law firms are sitting on huge amounts of sensitive client data, which becomes a liability if the firm is not diligent about managing it, securing it, and disposing of it at the conclusion of the case. And absent serious reforms in the Rules of Civil Procedure, these data volumes will only continue to grow.

A 2014 ABA Legal Technology Survey Report found that 14% of law firms experienced a security breach in 2013, which included a lost or stolen computer or smartphone, a cyber-attack, a physical break-in, or a website exploit. That same survey reported that 45% of respondents had experienced a virus-based infection, and boutique firms of 2 to 9 attorneys were the most likely to have experienced one. Law firms of 10 to 49 attorneys were the most likely to suffer security breaches.

A growing number of clients are demanding that their law firms take data security more seriously and are laying down the law: “give us what we want or we will find another law firm that will…” Generally speaking, law firms have never been accused of being technology “early adopters,” and while they still don’t need to be, they do need to take client (and firm) data security and management seriously and adopt technology and processes that satisfy both their clients’ rising expectations and their cyber insurance providers’ best practices.

At the end of the day, law firms should ask themselves a basic question: is my firm prepared and equipped to protect our clients’ data, and if not, what is the best strategy going forward?

For more detail on this topic, download the Paragon white paper on this subject.

InfoGov: Productivity Gains Equal Revenue Gains


A great deal has been written on lost productivity and the benefits of information governance. The theory is that an information governance program will raise employee productivity, thereby saving the organization money. This theory is fairly well accepted, based on the common-sense realization and market data showing that information workers spend many hours per week looking for the information they need to do their jobs. One data point comes from a 2013 Wortzmans e-Discovery Feed blog titled “The Business Case for Information Governance – Reduce Lost Productivity!” which states that employees spend up to nine hours per week (roughly one work week per month, or twelve weeks per year) looking for information. The first question to consider is how much of that search time could be saved with an effective information governance program?

InfoGov Productivity Savings

Three months out of every year spent looking for information seems a little high, so what would a more conservative estimate be? In my travels through the archiving, records management, eDiscovery, and information governance industries, I have spoken with many research analysts and many, many more customers, and I have generally seen numbers in the range of 2 to 4 hours per week. Assuming the four-hours-per-week estimate, the average employee spends 208 hours per year (26 working days, or 5.2 weeks) looking for information. Let’s further assume that an effective information governance program, one that captures, indexes, stores, and manages (including disposal) all ESI per centralized policies, would save 50% of the time employees spend looking for information (not an unrealistic estimate in my humble opinion), or 104 hours per year (13 days, or 2.6 weeks). To bring this number home, let’s dollarize employee time.

Table 1 lays out the assumptions we will use for the productivity calculations including the average annual and hourly salary per employee.

[Table 1: Assumptions for the productivity calculations, including the average annual and hourly salary per employee]

Table 2 below shows the calculations based on the assumptions in table 1 for weekly and annual time periods.

[Table 2: Weekly and annual cost-of-search calculations based on the assumptions in Table 1]

Assuming a workforce of 1,000 employees at this company, the total annual cost of search is $7.5 million. Assuming a 50% improvement in search productivity gives an estimated $3.75 million in recovered employee productivity. In most cases, a $3.75 million annual savings would more than pay for an effective information governance program for a company of 1,000 employees. But that potential savings is only the first of three categories of recoverable dollars.
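
For readers who want to check the arithmetic, the sketch below reproduces the search-cost figures. The hourly rate of roughly $36 is not stated directly in the text but is implied by the stated totals ($7.5 million for 208,000 hours); the headcount, hours, and 50% recovery rate come from the discussion above.

```python
# Reproduces the cost-of-search figures above. The ~$36/hour rate is implied
# by the stated totals ($7.5M / 208,000 hours); other inputs are from the text.
EMPLOYEES = 1000
SEARCH_HOURS_PER_WEEK = 4
WEEKS_PER_YEAR = 52
HOURLY_RATE = 36.06  # implied average hourly salary

annual_search_hours = SEARCH_HOURS_PER_WEEK * WEEKS_PER_YEAR * EMPLOYEES  # 208,000
annual_search_cost = annual_search_hours * HOURLY_RATE                    # ~$7.5M

recovery_rate = 0.50  # share of search time an IG program is assumed to recover
annual_savings = annual_search_cost * recovery_rate                       # ~$3.75M

print(f"Annual hours spent searching: {annual_search_hours:,}")
print(f"Annual cost of search:        ${annual_search_cost:,.0f}")
print(f"Recovered productivity:       ${annual_savings:,.0f}")
```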

Another productivity cost factor is the amount of time spent recreating data that couldn’t be found (but existed) during search. Additional variables to be used for calculations include:

[Table 3: Additional variables for the data re-creation calculations]

Most employees will agree that a certain percentage of their search time is spent looking for information they never find, or find only well after the need has passed. This number is very hard to estimate, but based on my own experience I use 40%. The other important variable is the time spent actually recreating the data that couldn’t be found, which I assume to be 200% of the hours spent searching for information without finding it (table 3).

[Table 4: Hours and cost of recreating data that should have been found]

Table 4 above lays out the calculations, showing a total of 166,400 hours wasted across the entire company recreating data that should have been found, or roughly $6 million. The assumption is that this wasted time would be reduced to zero with an effective information governance program.

So far, the estimated savings from recovered productivity (if this company of 1,000 employees adopted an information governance program) is $3.75 million plus $6 million, or $9.75 million (table 5).

[Table 5: Total estimated productivity savings]

The last (and most controversial) calculation is based on the revenue opportunity cost; in other words, what additional revenue could be generated with the recovered employee hours? For these calculations we need one additional number: the company’s annual revenue. Divide it by the number of employees to get the average revenue per employee, and divide again by annual working hours to get the average revenue per employee per hour (table 6).

[Table 6: Annual revenue, revenue per employee, revenue per employee per hour, and the revenue recovery discount factor]

How Does Productivity Affect Revenue

The last variable that needs an explanation is the “discount factor for revenue recovery” (table 6). This discount factor reflects the assumption that a recovered hour will not translate one-for-one into additional average revenue per employee per hour. Common sense tells us that will not happen, but common sense also tells us that more productive employees generate more revenue. So in this example I will use a revenue recovery discount factor of 60%, which means counting only 40% of the $101.92-per-hour figure above. This is meant to impose a degree of believability on the calculation.

To calculate the total (discounted) recoverable revenue from improved information search, we use the following formula: estimated recoverable productivity hours from wasted search time * the average revenue per employee per hour * (1 – the revenue recovery discount factor), or 104,000 * $101.92 * (1 – 60%), which equals $4,239,872, or about $4.24 million.

Calculating the (discounted) recovered revenue from eliminating the time spent recreating data not found uses the same approach: estimated total hours spent recreating data not found * the average revenue per employee per hour * (1 – the revenue recovery discount factor), or 166,400 * $101.92 * (1 – 60%), which equals approximately $6,784,000.
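
Written out as code, the two revenue-recovery formulas look like this; the hours, the $101.92 revenue-per-employee-per-hour figure, and the 60% discount factor all come from the text above.

```python
# The two (discounted) revenue-recovery calculations described above.
REVENUE_PER_EMPLOYEE_HOUR = 101.92
DISCOUNT_FACTOR = 0.60  # only 40% of each recovered hour is counted as revenue

# 1) Recovered revenue from the 104,000 hours of wasted search time recovered.
recovered_search_hours = 104_000
revenue_from_search = recovered_search_hours * REVENUE_PER_EMPLOYEE_HOUR * (1 - DISCOUNT_FACTOR)
print(f"Recovered revenue from search savings: ${revenue_from_search:,.0f}")  # $4,239,872

# 2) Recovered revenue from the 166,400 hours spent recreating data not found.
recreation_hours = 166_400
revenue_from_recreation = recreation_hours * REVENUE_PER_EMPLOYEE_HOUR * (1 - DISCOUNT_FACTOR)
print(f"Recovered revenue from avoided rework: ${revenue_from_recreation:,.0f}")  # ~$6.78 million
```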

So to wrap up this painful experiment in math, the potential dollar savings and increased revenue from the adoption of an information governance program is:

[Table 7: Total potential savings and recovered revenue]

The point of this discussion was to explore the idea of recovered revenue from productivity gains driven by more effective management of information, i.e., information governance. You may (and probably will) disagree with the specific numbers used, but I think calculating an InfoGov ROI that includes revenue recovered through productivity gains is realistic.

 

Dark (Data) Clouds on the Horizon



There have been many definitions of “dark data” over the last couple of years, most along the lines of: unstructured, unclassified, untagged, unmanaged, and unknown electronic data resident within an organization’s enterprise. Most of these definitions center on unstructured data residing inside the enterprise. But with the advent of BYOD and employees’ use of personal clouds, the definition should be expanded to include any corporate-owned data, no matter where it resides.

Dark data, especially dark data stored outside of the company’s infrastructure (and outside its awareness that it even exists), is an obvious liability for eDiscovery response, regulatory compliance, and corporate IP security.

Is BYOC a good idea?

Much has been written on the dangers of “Bring Your Own Device” (BYOD) but little has been written on the dangers of “Bring Your Own Cloud” (BYOC) otherwise known as personal clouds. Employees now have access to free cloud storage from many vendors that give them access to their content no matter where they are. These same personal clouds also provide automatic syncing of desktop folders and the ability to share specific documents or even entire folders. These personal clouds offer a fantastic use model for individuals to upload their personal content for backup, sharing and remote availability. In the absence of any real guidance from employers, employees have also begun to use these personal clouds for both personal and work purposes.

The problem arises when corporate-owned data is moved to personal clouds without the organization’s approval or awareness. Besides the obvious risk of corporate IP theft, effective eDiscovery and regulatory compliance become impossible. Corporate data residing in personal clouds becomes a “dark cloud” to the organization: corporate data sitting in repositories outside the organization’s infrastructure, management, or knowledge.

Dark Clouds and eDiscovery

Organizations have been trying to figure out what to do with huge amounts of dark data within their infrastructure, particularly when anticipating or responding to litigation. Almost everything is potentially discoverable in litigation if it pertains to the case, and searching for and reviewing GBs or TBs of dark data residing in the enterprise can push the cost of eDiscovery up substantially. But imagine the GBs of corporate dark data residing in employee personal clouds that the organization has zero awareness of… Is the organization still responsible for searching for it, securing it, and producing it? Depending on who you ask, the answer is Yes, No, and “it depends”.

In reality, the correct answer is “it depends”. It will depend on what the organization did to try to stop employee dark clouds from existing. Was a policy prohibiting employee use of personal clouds with corporate data in place; were employees alerted to the policy; did the organization try to audit and enforce the policy; did the organization utilize technology to stop access to personal clouds from within the enterprise; and did the organization use technology to stop the movement of corporate data to personal clouds (content control)?

If the organization can show intent and actions to ensure dark clouds were not available to employees, then the expectation of dark cloud eDiscovery search may not exist. But if dark cloud due diligence was not done and/or documented, all bets are off.

Regulatory Compliance and Dark Clouds

Employee personal clouds can also end up becoming repositories of sensitive data subject to regulatory security and privacy requirements. Personally identifiable information (PII) and protected health information (PHI) under the control of an organization are subject to numerous security and privacy regulations that, if not followed, can trigger costly penalties. Inadvertent exposure can occur as employees move daily work product up to their personal clouds to continue working at home or while traveling. Part of the problem is that many employees are not trained to recognize and handle sensitive information: what it is, what constitutes sensitive information, how it should be secured, and the liabilities to the organization if it is leaked. A related problem is the lack of understanding of how insecure personal clouds, and the devices used to access them, can be. Take, for example, an employee who accesses a personal cloud from a coffee shop over an unsecured Wi-Fi connection. A hacker can gain access to the laptop via the unsecured connection, open the synced personal cloud folder, and browse the personal cloud through that session (no password required, because most users opt to stay signed in to their cloud accounts).

As with the eDiscovery discussion above, if the organization has not taken reasonable steps to ensure sensitive data cannot be leaked (even inadvertently by an employee), it leaves itself open to regulatory fines and more.

Reducing the Risk of Dark Clouds

The only way to stop the risk associated with dark clouds is to stop corporate data from leaving the security of the enterprise in the first place. That outcome is almost impossible to guarantee without adopting draconian measures that most business cultures would rebel against, but there are several measures an organization can employ to at least reduce the risk:

  • First, create a use policy to address what is acceptable and not acceptable behavior when using organization equipment, infrastructure and data.
  • Document all policies and update them regularly.
  • Train employees on all policies – on a regular basis.
  • Regularly audit employee adherence to all policies, and document the audits.
  • Enforce all breaches of the policy.
  • Employ systematic security measures across the enterprise:
    • Don’t allow employee personal devices access to the infrastructure – BYOD
    • Stop employee access to personal clouds – in many cases this can be done systematically via cutting specific port access
    • Employ systematic enterprise access controls
    • Employ enterprise content controls – these are software applications that control access to individual content based on the actual content and the user’s security profile (a toy content-scanning sketch follows this list).
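
As a very small illustration of the content-control idea, the sketch below walks a file share and flags files containing patterns that look like U.S. Social Security or payment card numbers. The path and regular expressions are illustrative only; a real deployment would rely on a commercial DLP or content-control product.

```python
# A toy illustration of content-aware scanning: walk a file share and flag
# files containing patterns that look like sensitive identifiers.
# The path and patterns are illustrative; real deployments use DLP tooling.
import os
import re

PATTERNS = {
    "SSN-like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Card-like": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
}

def scan_share(root: str) -> None:
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    text = handle.read()
            except OSError:
                continue  # skip unreadable files
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    print(f"{label} pattern found in {path}")

scan_share("/mnt/departmental_share")  # hypothetical file-share mount point
```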

Employee dark clouds are a huge liability for organizations and will become more so as attorneys become more educated on how employees create, use, store, and share information. Nowadays, discoverable and sensitive data can migrate out of the control of an enterprise in the blink of an eye. The question is what the organization does to prevent it…

Law Firms, HIPAA and the “Minimum Necessary Standard” Rule


The HIPAA Omnibus Rule became effective on March 26, 2013. Covered entities and Business Associates had until September 23, 2013 to become compliant with the entirety of the law, including the security rule, the privacy rule, and the breach notification rule. Law firms that do business with a HIPAA-regulated organization and receive protected health information (PHI) are considered Business Associates (BAs) and are subject to all of these regulations, including the security, privacy, and breach notification rules. These rules are very prescriptive in nature and can impose additional procedures and additional cost on a law firm.

Under HIPAA, there is a specific rule covering the use of PHI by both covered entities and Business Associates, called the “Minimum Necessary Standard” rule (45 CFR 164.502(b), 164.514(d)). The HIPAA Privacy Rule and the minimum necessary standard are enforced by the U.S. Department of Health and Human Services Office for Civil Rights (OCR). Under this rule, law firms must develop policies and procedures that limit PHI uses, disclosures, and requests to those necessary to carry out the organization’s work, including:

  • Identification of persons or classes of persons in the workforce who need access to PHI to carry out their duties;
  • For each of those, specification of the category or categories of PHI to which access is needed and any conditions appropriate to such access; and
  • Reasonable efforts to limit access accordingly.

The minimum necessary standard is based on the principle that PHI should not be used or disclosed when it is not necessary to accomplish a particular task. It generally requires law firms to take reasonable steps to limit the use or disclosure of PHI to the minimum necessary to represent the healthcare client. The Privacy Rule’s minimum necessary requirements are designed to be flexible enough to accommodate the various circumstances of any covered entity.

The first thing firms should understand is that, as Business Associates subject to HIPAA through their access and use of client data, firms are subject to the Minimum Necessary Standard, which requires that when a HIPAA-covered entity or a business associate (law firm) of a covered entity uses or discloses PHI or when it requests PHI from another covered entity or business associate, the covered entity or business associate must make “reasonable efforts to limit protected health information to the minimum necessary to accomplish the intended purpose of the use, disclosure, or request.”

Law firm information governance professionals need to be aware of this rule and build it into their healthcare client related onboarding processes.

You Don’t Know What You Don’t Know


The Akron Legal News this week published an interesting editorial on information governance. The story, by Richard Weiner, discussed how law firms are dealing with the transition from rooms filled with hard-copy records to electronically stored information (ESI), which includes firm business records as well as huge amounts of client eDiscovery content. The story pointed out that ESI flows into the law firm so quickly, and in such huge quantities, that no one can track it, much less know what it contains. Law firms are now facing an inflection point: change the way all information is managed or suffer client dissatisfaction and client loss.

The story pointed out that “in order to function as a business, somebody is going to have to, at least, track all of your data before it gets even more out of control – Enter information governance.”

There are many definitions of information governance (IG) floating around but the story presented one specifically targeted at law firms: IG is “the rules and framework for managing all of a law firm’s electronic data and documents, including material produced in discovery, as well as legal files and correspondence.” Richard went on to point out that there are four main tasks to accomplish through the IG process. They are:

  • Map where the data is stored;
  • Determine how the data is being managed;
  • Determine data preservation methodology;
  • Create forensically sound data collection methods.

I would add several more to this list:

  • Create a process to account for and classify inbound client data such as eDiscovery and regulatory collections.
  • Determine those areas where client information governance practices differ from firm information governance practices.
  • Reconcile those differences with client(s).

As law firms transition to mostly ESI for both firm business and client data, they will need to adopt IG practices and processes to account for and manage these different requirements. Many believe this transition will eventually lead to the incorporation of machine learning techniques into IG, enabling law firm IG processes to develop a much more granular understanding of the actual meaning of the data, not just whether it is a firm business record or part of a client eDiscovery response. This, in turn, will enable more granular categorization of all firm information.

Iron Mountain has hosted the annual Law Firm Information Governance Symposium, which has directly addressed many of these topics around law firm IG. The symposium has produced “A Proposed Law Firm Information Governance Framework,” a detailed description of the processes to consider as law firms look at adopting an information governance program.

Dark Data Archiving…Say What?



In a recent blog titled “Bring your dark data out of the shadows”, I described what dark data is and why it’s important to manage it. To review, the reasons to manage it were:

  1. It consumes costly storage space
  2. It consumes IT resources
  3. It masks security risks
  4. And it drives up eDiscovery costs

For the clean-up of dark data (remediation), it has been suggested by many, including myself, that the remediation process should include determining what you really have, determining what can be immediately disposed of (obvious stuff like duplicates and expired content), categorizing the rest, and moving the remaining categorized content into information governance systems.

But many “conservative” minded people (like many General Counsel) hesitate at the actual deletion of data, even after they have spent the resources and dollars to identify potentially disposable content. The reasoning usually centers on the fear of destroying information that could be potentially relevant in litigation. A prime example is seen in the Arthur Andersen case where a Partner famously sent an email message to employees working on the Enron account, reminding them to “comply with the firm’s documentation and retention policy”, or in other words – get rid of stuff. Many GCs don’t want to be put in the position of rightfully disposing of information per policy and having to explain later in court why potentially relevant information was disposed of…

For those that don’t want to take the final step of disposing of data, the question becomes “so what do we do with it?” This reminds me of a customer I was dealing with years ago. The GC of this 11,000-person company, a very distinguished-looking man, was asked during a meeting with the company’s senior staff what the company’s information retention policy was. He quickly responded that he had decided that all information (electronic and hard copy) from their North American operations would be kept for 34 years. Quickly calculating the company’s storage requirements over 34 years with 11,000 employees, I asked him if he had any idea what his storage footprint would be at the end of 34 years. He replied no and asked what it would be. I replied that it would be in the petabyte range and asked him if he understood what the cost of storing that amount of data would be and how difficult it would be to find anything in it.

He smiled and replied, “I’m retiring in two years; I don’t care.”

The moral of that actual example is that if you have decided to keep large amounts of electronic data for long periods of time, you have to consider the cost of storage as well as how you will search it for specific content when you actually have to.
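
A quick back-of-the-envelope check shows why that answer should give anyone pause. Assuming roughly 20 MB of new data per employee per working day (the figure used earlier in this collection) and my assumption of 252 working days per year:

```python
# Back-of-the-envelope storage estimate for the 34-year retention decision.
# 20 MB/employee/working day and 252 working days/year are assumptions,
# consistent with the accumulation figures used earlier in this collection.
EMPLOYEES = 11_000
MB_PER_DAY = 20
WORKING_DAYS_PER_YEAR = 252
YEARS = 34

total_mb = EMPLOYEES * MB_PER_DAY * WORKING_DAYS_PER_YEAR * YEARS
total_pb = total_mb / 1_000_000_000  # MB -> PB (decimal units)
print(f"Retained data after {YEARS} years: ~{total_pb:.1f} PB")  # ~1.9 PB
```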

In the example above, the GC was planning on storing the data on spinning disk, which is costly. Others I have spoken to have decided that the most cost-effective way to store large amounts of data for long periods is to keep backup tapes. It’s true that backup tapes are relatively cheap (compared to spinning disk), but it is difficult to get anything off of them, they have a relatively high failure rate (again, compared to spinning disk), and they have to be rewritten every few years because backup tapes slowly lose their data over time.

A potential solution is moving your dark data to long-term hosted archives. These hosted solutions can securely hold your electronically stored information (ESI) at extremely low cost per gigabyte. When needed, you can access your archive remotely, search it, and move or copy data back to your site.

An important factor to look for (for eDiscovery purposes) is that moving, storing, indexing, and recovering data from the hosted archive must not alter the metadata in any way. This is especially important when responding to a discovery request.
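
One simple way to demonstrate that content has not been altered on its way to or from an archive is to compare cryptographic hashes before and after the move. The sketch below is a minimal illustration with a hypothetical file name; note that it verifies file content only, and preserving filesystem metadata such as timestamps requires additional care on the archive side.

```python
# Minimal integrity check: hash a file before it is moved to the archive and
# again after it is retrieved, then confirm the digests match. This verifies
# content only; timestamps and other metadata need separate preservation.
import hashlib

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

before = sha256_of("contract_2014.pdf")          # hypothetical file, pre-archive
after = sha256_of("restored/contract_2014.pdf")  # the same file after retrieval
print("unchanged" if before == after else "ALTERED")
```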

For those of you considering starting a dark data remediation project, consider long term hosted archives as a staging target for that data your GC just won’t allow to be disposed of.

Bring your dark data out of the shadows


Dark data, otherwise known as unstructured, unmanaged, and uncategorized information, is a major problem for many organizations. Many organizations don’t have the will or the systems in place to automatically index and categorize their rapidly growing unstructured dark data, especially in file shares, and instead rely on employees to manually manage their own information. This reliance on employees is a no-win situation because employees have neither the incentive nor the time to actively manage their information.

Organizations find themselves trying to figure out what to do with huge amounts of dark data, particularly when they’re purchasing TBs of new storage annually because they’ve run out.

Issues with dark data:

  • Consumes costly storage space and resources – Most medium to large organizations provide terabytes of file share storage space for employees and departments to use. Employees drag and drop all kinds of work-related files (and personal files like photos, MP3 music files, and personal communications) as well as PSTs and workstation backup files. The vast majority of these files are unmanaged and are never looked at again by the employee or anyone else.
  • Consumes IT resources – Personnel are required to perform nightly backups and DR planning, and IT staff spend time finding or restoring files that employees could not locate.
  • Masks security risks – File shares act as “catch-alls” for employees. Sensitive company information regularly finds its way into these repositories. These file shares are almost never secured, so sensitive information such as personally identifiable information (PII), protected health information (PHI), and intellectual property can be inadvertently leaked.
  • Raises eDiscovery costs – Almost everything is discoverable in litigation if it pertains to the case. The fact that tens or hundreds of terabytes of unindexed content are stored on file shares means those terabytes of files may have to be reviewed to determine whether they are relevant to a given legal case. That can add hundreds of thousands or millions of dollars of cost to a single eDiscovery request.

To bring this dark data under control, IT must take positive steps to address the problem. The first step is to look at your file shares.

Discoverable versus Admissible: aren’t they the same?


This question comes up a lot, especially from non-attorneys. The thought is that if something is discoverable, then it must be admissible, the assumption being that a judge will not allow something to be discovered if it can’t be used in court. The other thought is that everything is discoverable if it pertains to the case, and therefore everything is admissible.

Let’s first address what’s discoverable. For good cause, the court may order discovery of any matter (content) that is not privileged and is relevant to the subject matter involved in the action. In layman’s terms, if it is potentially relevant to the case, you may have to produce it in discovery; in other words, anything and everything is potentially discoverable. All discovery is subject to the limitations imposed by FRCP Rule 26(b)(2)(C).

With that in mind, let’s look at the subject of admissibility.

In Lorraine v. Markel Am. Ins. Co., 241 F.R.D. 534, 538 (D. Md. 2007), the court started with the premise that the admissibility of ESI is determined by a collection of evidence rules “that present themselves like a series of hurdles to be cleared by the proponent of the evidence.” “Failure to clear any of these evidentiary hurdles means that the evidence will not be admissible.” Whenever ESI is offered as evidence, five evidentiary questions need to be considered, namely whether the ESI:

  • is relevant to the case
  • is authentic
  • is not hearsay pursuant to Federal Rule of Evidence 801
  • is an original or duplicate under the original writing rule
  • has probative value that is substantially outweighed by the danger of unfair prejudice or one of the other factors identified by Federal Rule of Evidence 403, such that it should be excluded despite its relevance.

Hearsay is defined as a statement made out of court that is offered in court as evidence to prove the truth of the matter asserted. Hearsay comes in many forms including written or oral statements or even gestures.

It is the judge’s job to determine whether evidence is hearsay or credible testimony. There are three evidentiary rules that help the judge make this determination:

  1. Before being allowed to testify, a witness generally must swear or affirm that his or her testimony will be truthful.
  2. The witness must be personally present at the trial or proceeding in order to allow the judge or jury to observe the testimony firsthand.
  3. The witness is subject to cross-examination at the option of any party who did not call the witness to testify.

The Federal Rules of Evidence hearsay rule prohibits most statements made outside of court from being used as evidence in court. Looking at the three evidentiary rules mentioned above: usually a statement made outside of the courtroom is not made under oath, the person making the statement is not present to be observed by the judge, and the opposing party is not able to cross-examine the statement maker. This is not to say all statements made outside of court are inadmissible; Federal Rule of Evidence 801 does provide several exclusions to the hearsay rule.

All content is discoverable if it is potentially relevant to the case and not deemed privileged, but discovered content may be ruled inadmissible if it is deemed privileged (e.g., doctor/patient communications), unreliable, or hearsay. You may be wondering how an electronic document can be considered hearsay. The hearsay rule refers to “statements,” which can be either written or oral. So, as with paper documents, to determine whether the content of an electronic document is hearsay or fact, the author of the document must testify under oath and submit to cross-examination so the court can determine whether the content is fact and can stand as evidence.

This legal distinction between fact and hearsay does not relieve the responding party of the obligation to find, collect, and produce all content that could be relevant to the case.