Dark Data Archiving…Say What?


Dark door 2

In a recent blog titled “Bring your dark data out of the shadows”, I described what dark data was and why its important to manage it. To review, the reasons to manage were:

  1. It consumes costly storage space
  2. It consumes IT resources
  3. It masks security risks
  4. And it drives up eDiscovery costs

For the clean-up of dark data (remediation) it has been suggested by many, including myself, that the remediation process should include determining what you really have, determine what can be immediately disposed of (obvious stuff like duplicates and any expired content etc.), categorize the rest, and move the remaining categorized content into information governance systems.

But many “conservative” minded people (like many General Counsel) hesitate at the actual deletion of data, even after they have spent the resources and dollars to identify potentially disposable content. The reasoning usually centers on the fear of destroying information that could be potentially relevant in litigation. A prime example is seen in the Arthur Andersen case where a Partner famously sent an email message to employees working on the Enron account, reminding them to “comply with the firm’s documentation and retention policy”, or in other words – get rid of stuff. Many GCs don’t want to be put in the position of rightfully disposing of information per policy and having to explain later in court why potentially relevant information was disposed of…

For those that don’t want to take the final step of disposing of data, the question becomes “so what do we do with it?” This reminds me of a customer I was dealing with years ago. The GC for this 11,000 person company, a very distinguished looking man, was asked during a meeting that included the company’s senior staff, what the company’s information retention policy was. He quickly responded that he had decided that all information (electronic and hardcopy) from their North American operations would be kept for 34 years. Quickly calculating the company’s storage requirements over 34 years with 11,000 employees, I asked him if he had any idea what his storage requirements would be at the end of 34 years. He replied no and asked what the storage requirements would be. I replied it would be in the petabytes range and asked him if he understood what the cost of storing that amount of data would be and how difficult it would be to find anything in it.

He smiled and replied “I’m retiring in two years, I don’t care”

The moral of that actual example is that if you have decided to keep large amounts of electronic data for long periods of time, you have to consider the cost of storage as well as how you will search it for specific content when you actually have to.

In the example above, the GC was planning on storing it on spinning disk which is costly. Others I have spoken to have decided that most cost effective way to store large amounts of data for long periods of time is to keep backup tapes. Its true that backup tapes are relatively cheap (compared to spinning disk) but are difficult to get anything off of, they have a relatively high failure rate (again compared to spinning disk)  and have to be rewritten every so many years because backup tapes slowly lose their data over time.

A potential solution is moving your dark data to long term hosted archives. These hosted solutions can securely hold your electronically stored information (ESI) at extremely low costs per gigabyte. When needed, you can access your archive remotely and search and move/copy data back to your site.

An important factor to look for (for eDiscovery) is that data moved, stored, indexed and recovered from the hosted archive cannot alter the metadata in anyway. This is especially important when responding to a discovery request.

For those of you considering starting a dark data remediation project, consider long term hosted archives as a staging target for that data your GC just won’t allow to be disposed of.

Cloudy, with a chance of eDiscovery


eDiscovery101

In the last year there has numerous articles, blogs, presentations and panels discussing the legal perils of “Bring Your Own Device” or BYOD policies. BYOD refers to the policy of permitting employees to bring personally owned mobile devices (laptops, tablets, and smart phones) to their workplace, and to use those devices to access privileged company information and applications. The problem with BYOD is company access to company data housed on the device. For example, how would you search for potentially relevant content on a smartphone if the employee wasn’t immediately available or refused to give the company access to it?

Many organizations have banned BYOD as a security risk as well as a liability when involved with litigation.

BYOC Equals Underground Archiving?

Organizations are now dealing with another problem, one with even greater liabilities. “Bring your own cloud” or BYOC refers to the availability and use by individuals…

View original post 595 more words

Tolson’s Three Laws of Machine Learning


eDiscovery101

TerminatorMuch has been written in the last several years about Predictive Coding (as well as Technology Assisted Review, Computer Aided Review, and Craig Ball’s hilarious Super Human Information Technology ). This automation technology, now heavily used for eDiscovery, relies heavily on “machine learning”,  a discipline of artificial intelligence (AI) that automates computer processes that learn from data, identify patterns and predict future results with varying degrees of human involvement. This interative machine training/learning approach has catapulted computer automation to unheard-of and scary levels of potential. The question I get a lot (I think only half joking) is “when will they learn enough to determine we and the attorneys they work with are no longer necessary?

Is it time to build in some safeguards to machine learning? Thinking back to the days I read a great deal of Isaac Asimov (last week), I thought about Asimov’s The Three Laws…

View original post 225 more words

Bring your dark data out of the shadows


NosferatuShadowDark data, otherwise known as unstructured, unmanaged, and uncategorized information is a major problem for many organizations. Many organizations don’t have the will or systems in place to automatically index and categorize their rapidly growing unstructured dark data, especially in file shares, and instead rely on employees to manually manage their own information. This reliance on employees is a no-win situation because employees have neither the incentive nor the time to actively manage their information.

Organizations find themselves trying to figure out what to do with huge amounts of dark data, particularly when they’re purchasing TBs of new storage annually because they’ve run out.

Issues with dark data:

  • Consumes costly storage space and resources – Most medium to large organizations provide terabytes of file share storage space for employees and departments to utilize. Employees drag and drop all kinds of work related files (and personal files like personal photos, MP3 music files, and personal communications) as well as PSTs and work station backup files. The vast majority of these files are unmanaged and are never looked at again by the employee or anyone else.
  • Consumes IT resources – Personnel are required to perform nightly backups, DR planning, and IT personnel to find or restore files employees could not find.
  • Masks security risks – File shares act as “catch-alls” for employees. Sensitive company information regularly finds its way to these repositories. These file shares are almost never secure so sensitive information like personally identifiable information (PII), protected health information (PHI, and intellectual property can be inadvertently leaked.
  • Raises eDiscovery costs – Almost everything is discoverable in litigation if it pertains to the case. The fact that tens or hundreds of terabytes of unindexed content is being stored on file shares means that those terabytes of files may have to be reviewed to determine if they are relevant in a given legal case. That can add hundreds of thousands or millions of dollars of additional cost to a single eDiscovery request.

To bring this dark data under control, IT must take positive steps to address the problem and do something about it. The first step is to look to your file shares.

Infobesity in the Healthcare Industry: A Well-Balanced Diet of Predictive Governance is needed


Fat TwitterWith the rapid advances in healthcare technology, the movement to electronic health records, and the relentless accumulation of regulatory requirements, the shift from records management to information governance is increasingly becoming a necessary reality.

In a 2012 CGOC (Compliance, Governance and Oversight Counsel) Summit survey, it was found that on the average 1% of an organization’s data is subject to legal hold, 5% falls under regulatory retention requirements and 25% has business value. This means that 69% of an organization’s ESI is not needed and could be disposed of without impact to the organization. I would argue that for the healthcare industry, especially for covered entities with medical record stewardship, those retention percentages are somewhat higher, especially the regulatory retention requirements.

According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be composed of unstructured information; information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails & attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages. Despite a push for more organization and process in managing unstructured data, healthcare organizations continue to binge on unstructured data with little regard to the overall health of their enterprises.

So how does this info-gluttony, (the unrestricted saving of unstructured data because data storage is cheap and saving everything is just easier), affect the health of the organization? Obviously you’ll look terrible in horizontal stripes, but also finding specific information quickly (or at all) is impossible, you’ll spend more on storage, data breaches will could occur more often, litigation/eDiscovery expenses will rise, and you won’t want to go to your 15th high school reunion…

To combat this unstructured info-gain, we need an intelligent information governance solution – STAT!  And that solution must include a defensible process to systematically dispose of information that’s no longer subject to regulatory requirements, litigation hold requirements or because it no longer has business value.

To enable this information governance/defensible disposal Infobesity cure, healthcare information governance solutions must be able to extract meaning from all of this unstructured content, or in other words understand and differentiate content conceptually. The automated classification/categorization of unstructured content based on content meaning cannot accurately or consistently differentiate the meaning in electronic content by simply relying on simple rules or keywords. To accurately automate the categorization and management of unstructured content, a machine learning capability to “train by example” is a precondition. This ability to systematically derive meaning from unstructured content as well as machine learning to accurately automate information governance is something we call “Predictive Governance”.

A side benefit of Predictive Governance is (you’ll actually look taller) previously lost organizational knowledge and business intelligence can be automatically compiled and made available throughout the organization.

The ROI of Conceptual Search


After people, information is a company’s most valuable asset. But many are asking; “what’s in that information?”, “who controls it?”, “can others access it?”, and “is it a risk to keep?”, “for how long?”. The vast majority of information in any organization is not managed, not indexed, and is rarely–if ever–accessed.

Companies exist to create and utilize information. Do you know where all your organization’s information is, what’s in it, and most importantly, can those that need it find and access it? If your employees can’t find when they need it, then the return on investment (ROI) for that information is zero. How much higher could the ROI be if your employees could actually find and share data effortlessly?

Enterprise search – The mindless regurgitation of keyword matches

Enterprise search is the organized query/retrieval of information from across an organization’s enterprise data systems. Data sources include e-mail servers, application databases, content management systems, file systems, intranet sites and many others. Legacy enterprise search systems provide users the ability to query organizational data repositories utilizing keyword-based inquiries that returns huge results sets that then have to be manually filtered by the user until they find what they were looking for (if they actually find it).

A sizeable drawback to a keyword-based search is that it will return all keyword matches even though they may be conceptually different – false positives.

What is conceptual search?

A conceptual search is used to search electronically stored information for information that is conceptually a match or similar to the information represented in a search query as opposed to a keyword search where only documents with exact keyword matches are returned. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query regardless of shared terms or language.

Cost savings – Concept versus keyword search

Employees are rarely capable of constructing keyword and Boolean searches that return the data they are looking for immediately. Because of this fact, time is wasted in actually finding what they were looking for. IDC has estimated that using a higher quality enterprise search capability can save up to 53.4% of time spent searching for data. Many have argued conceptual search can save even more time because conceptual search more closely models how humans think and therefor will return more meaningful results quicker.

Alan Greenspan, a past Chairman of the Federal Reserve, once stated “You’re entitled to your own opinions, but not to your own facts”. Return on Investment calculations are only as good as the reliability of the variables used to calculate it. To calculate ROI, the benefit (return) of an investment is divided by the cost of the investment – the result is expressed as a percentage.

Enterprise Search ROI calculations require the following data points:

•           The total cost of the current enterprise search process used

•           The total cost of the new enterprise search process after the investment is in place

•           The total cost of the new enterprise search investment

The actual ROI formula looks like this:


Return on investment is an often asked for but little understood financial measure. Many equate cost savings to ROI but cost savings is only a part of the equation. ROI also includes looking at the cost of the solution that produced the savings. ROI lets you compare returns from various investment opportunities to make the best investment decision for your available dollars.

Organizations run on information. If information is easier to find and use, the organization profits from it.

Ask the Magic 8-Ball; “Is Predictive Defensible Disposal Possible?”


The Good Ole Days of Paper Shredding

In my early career, shred days – the scheduled annual activity where the company ordered all employees to wander through all their paper records to determine what should be disposed of, were common place. At the government contractor I worked for, we actually wheeled our boxes out to the parking lot to a very large truck that had huge industrial shredders in the back. Once the boxes of documents were shredded, we were told to walk them over to a second truck, a burn truck, where we, as the records custodian, would actually verify that all of our records were destroyed. These shred days were a way to actually collect, verify and yes physically shred all the paper records that had gone beyond their retention period over the preceding year.

The Magic 8-Ball says Shred Days aren’t Defensible

Nowadays, this type of activity carries some negative connotations with it and is much more risky. Take for example the recent case of Rambus vs SK Hynix. In this case U.S District Judge Ronald Whyte in San Jose reversed his own prior ruling from a 2009 case where he had originally issued a judgment against SK Hynix, awarding Rambus Inc. $397 million in a patent infringement case. In his reversal this year, Judge Whyte ruled that Rambus Inc. had spoliated documents in bad faith when it hosted company-wide “shred days” in 1998, 1999, and 2000. Judge Whyte found that Rambus could have reasonably foreseen litigation against Hynix as early as 1998, and that therefore Rambus engaged in willful spoliation during the three “shred days” (a finding of spoliation can be based on inadvertent destruction of evidence as well). Because of this recent spoliation ruling, the Judge reduced the prior Rambus award from $397 million to $215 million, a cost to Rambus of $182 million.

Another well know example of sudden retention/disposition policy activity that caused unintended consequences is the Arthur Andersen/Enron example. During the Enron case, Enron’s accounting firm sent out the following email to some of its employees:

 

 

This email was a key reason why Arthur Andersen ceased to exist shortly after the case concluded. Arthur Andersen was charged with and found guilty of obstruction of justice for shredding the thousands of documents and deleting emails and company files that tied the firm to its audit of Enron. Less than 1 year after that email was sent, Arthur Andersen surrendered its CPA license on August 31, 2002, and 85,000 employees lost their jobs.

Learning from the Past – Defensible Disposal

These cases highlight the need for a true information governance process including a truly defensible disposal capability. In these instances, an information governance process would have been capturing, indexing, applying retention policies, protecting content on litigation hold and disposing of content beyond the retention schedule and not on legal hold… automatically, based on documented and approved legally defensible policies. A documented and approved process which is consistently followed and has proper safeguards goes a long way with the courts to show good faith intent to manage content and protect that content subject to anticipated litigation.

To successfully automate the disposal of unneeded information in a consistently defensible manner, auto-categorization applications must have the ability to conceptually understand the meaning in unstructured content so that only content meeting your retention policies, regardless of language, is classified as subject to retention.

Taking Defensible Disposal to the Next Level – Predictive Disposition

A defensible disposal solution which incorporates the ability to conceptually understand content meaning, and which incorporates an iterative training process including “train by example,” in a human supervised workflow provides accurate predictive retention and disposition automation.

Moving away from manual, employee-based information governance to automated information retention and disposition with truly accurate (95 to 99%) and consistent meaning-based predictive information governance will provide the defensibility that organizations require today to keep their information repositories up to date.

Predicting the Future of Information Governance


Information Anarchy

Information growth is out of control. The compound average growth rate for digital information is estimated to be 61.7%. According to a 2011 IDC study, 90% of all data created in the next decade will be of the unstructured variety. These facts are making it almost impossible for organizations to actually capture, manage, store, share and dispose of this data in any meaningful way that will benefit the organization.

Successful organizations run on and are dependent on information. But information is valuable to an organization only if you know where it is, what’s in it, and what is shareable or in other words… managed. In the past, organizations have relied on end-users to decide what should be kept, where and for how long. In fact 75% of data today is generated and controlled by individuals. In most cases this practice is ineffective and causes what many refer to as “covert orunderground archiving”, the act of individuals keeping everything in their own unmanaged local archives. These underground archives effectively lock most of the organization’s information away, hidden from everyone else in the organization.

This growing mass of information has brought us to an inflection point; get control of your information to enable innovation, profit and growth, or continue down your current path of information anarchy and choke on your competitor’s dust.

 

img-pred-IG

 

Choosing the Right Path

How does an organization ensure this infection point is navigated correctly? Information Governance. You must get control of all your information by employing the proven processes and technologies to allow you to create, store, find, share and dispose of information in an automated and intelligent manner.

An effective information governance process optimizes overall information value by ensuring the right information is retained and quickly available for business, regulatory, and legal requirements.  This process reduces regulatory and legal risk,  insures needed data can be found quickly and is secured for litigation,  reduces overall eDiscovery costs, and provides structure to unstructured information so that employees can be more productive.

Predicting the Future of Information Governance

Predictive Governance is the bridge across the inflection point. It combines machine-learning technology with human expertise and direction to automate your information governance tasks. Using this proven human-machine iterative training capability,Predictive Governance is able to accurately automate the concept-based categorization, data enrichment and management of all your enterprise data to reduce costs, reduce risks, enable information sharing and mitigate the strain of information overload.

Automating information governance so that all enterprise data is captured, granularity evaluated for legal requirements, regulatory compliance, or business value and stored or disposed of in a defensible manner is the only way for organizations to move to the next level of information governance.

Finding the Cure for the Healthcare Unstructured Data Problem


Healthcare information/ and records continue to grow with the introduction of new devices and expanding regulatory requirements such as The Affordable Care Act, The Health Insurance Portability and Accountability Act (HIPAA), and the Health Information Technology for Economic and Clinical Health Act (HITECH). In the past, healthcare records were made up of mostly paper forms or structured billing data; relatively easy to categorize, store, and manage.  That trend has been changing as new technologies enable faster and more convenient ways to share and consume medical data.

According to an April 9, 2013 article on ZDNet.com, by 2015, 80% of new healthcare information will be composed of unstructured information; information that’s much harder to classify and manage because it doesn’t conform to the “rows & columns” format used in the past. Examples of unstructured information include clinical notes, emails & attachments, scanned lab reports, office work documents, radiology images, SMS, and instant messages.

Who or what is going to actually manage this growing mountain of unstructured information?

To insure regulatory compliance and the confidentiality and security of this unstructured information, the healthcare industry will have to 1) hire a lot more professionals to manually categorize and mange it or 2) acquire technology to do it automatically.

Looking at the first solution; the cost to have people manually categorize and manage unstructured information would be prohibitively expensive not to mention slow. It also exposes private patient data to even more individuals.  That leaves the second solution; information governance technology. Because of the nature of unstructured information, a technology solution would have to:

  1. Recognize and work with hundreds of data formats
  2. Communicate with the most popular healthcare applications and data repositories
  3. Draw conceptual understanding from “free-form” content so that categorization can be accomplished at an extremely high accuracy rate
  4. Enable proper access security levels based on content
  5. Accurately retain information based on regulatory requirements
  6. Securely and permanently dispose of information when required

An exciting emerging information governance technology that can actually address the above requirements uses the same next generation technology the legal industry has adopted…proactive information governance technology based on conceptual understanding of content,  machine learning and iterative “train by example” capabilities

 

The lifecycle of information


Organizations habitually over-retain information, especially unstructured electronic information, for all kinds of reasons. Many organizations simply have not addressed what to do with it so many of them fall back on relying on individual employees to decide what should be kept and for how long and what should be disposed of. On the opposite end of the spectrum a minority of organizations have tried centralized enterprise content management systems and have found them to be difficult to use so employees find ways around them and end up keeping huge amounts of data locally on their workstations, on removable media, in cloud accounts or on rogue SharePoint sites and are used as “data dumps” with or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value. Because of this lack of management, information continues to accumulate. This information build-up raises the cost of storage as well as the risk associated with eDiscovery.

In reality, as information ages, it probability of re-use and therefore its value, shrinks quickly. Fred Moore, Founder of Horison Information Strategies, wrote about this concept years ago.

The figure 1 below shows that as data ages, the probability of reuse goes down…very quickly as the amount of saved data rises. Once data has aged 10 to 15 days, its probability of ever being looked at again approaches 1% and as it continues to age approaches but never quite reaches zero (figure 1 – red shading).

Contrast that with the possibility that a large part of any organizational data store has little of no business, legal or regulatory value. In fact the Compliance, Governance and Oversight Counsel (CGOC) conducted a survey in 2012 that showed that on the average, 1% of organizational data is subject to litigation hold, 5% is subject to regulatory retention and 25% had some business value (figure 1 – green shading). This means that approximately 69% of an organizations data store has no business value and could be disposed of without legal, regulatory or business consequences.

The average employee creates, sends, receives and stores conservatively 20 MB of data per day. This means that at the end of 15 business days, they have accumulated 220 MB of new data, at the end of 90 days, 1.26 GB of data and at the end of three years, 15.12 GB of data. So how much of this accumulated data needs to be retained? Again referring to figure 1 below, the blue shaded area represents the information that probably has no legal, regulatory or business value according to the 2012 CGOC survey. At the end of three years, the amount of retained data from a single employee that could be disposed of without adverse effects to the organization is 10.43 GB. Now multiply that by the total number of employees and you are looking at some very large data stores.

Figure 1: The Lifecycle of data

The above lifecycle of data shows us that employees really don’t need all of the data they squirrel away (because its probability of re-use drops to 1% at around 15 days) and based on the CGOC survey, approximately 69% of organizational data is not required for legal, regulatory retention or has business value. The difficult piece of this whole process is how can an organization efficiently determine what data is not needed and dispose of it automatically…

As unstructured data volumes continue to grow, automatic categorization of data is quickly becoming the only way to get ahead of the data flood. Without accurate automated categorization, the ability to find the data you need, quickly, will never be realized. Even better, if data categorization can be based on the meaning of the content, not just a simple rule or keyword match, highly accurate categorization and therefore information governance is achievable.