Organizations habitually over-retain information, especially unstructured electronic information, for all kinds of reasons. Many organizations simply have not addressed what to do with it so many of them fall back on relying on individual employees to decide what should be kept and for how long and what should be disposed of. On the opposite end of the spectrum a minority of organizations have tried centralized enterprise content management systems and have found them to be difficult to use so employees find ways around them and end up keeping huge amounts of data locally on their workstations, on removable media, in cloud accounts or on rogue SharePoint sites and are used as “data dumps” with or no records management or IT supervision. Much of this information is transitory, expired, or of questionable business value. Because of this lack of management, information continues to accumulate. This information build-up raises the cost of storage as well as the risk associated with eDiscovery.
In reality, as information ages, it probability of re-use and therefore its value, shrinks quickly. Fred Moore, Founder of Horison Information Strategies, wrote about this concept years ago.
The figure 1 below shows that as data ages, the probability of reuse goes down…very quickly as the amount of saved data rises. Once data has aged 10 to 15 days, its probability of ever being looked at again approaches 1% and as it continues to age approaches but never quite reaches zero (figure 1 – red shading).
Contrast that with the possibility that a large part of any organizational data store has little of no business, legal or regulatory value. In fact the Compliance, Governance and Oversight Counsel (CGOC) conducted a survey in 2012 that showed that on the average, 1% of organizational data is subject to litigation hold, 5% is subject to regulatory retention and 25% had some business value (figure 1 – green shading). This means that approximately 69% of an organizations data store has no business value and could be disposed of without legal, regulatory or business consequences.
The average employee creates, sends, receives and stores conservatively 20 MB of data per day. This means that at the end of 15 business days, they have accumulated 220 MB of new data, at the end of 90 days, 1.26 GB of data and at the end of three years, 15.12 GB of data. So how much of this accumulated data needs to be retained? Again referring to figure 1 below, the blue shaded area represents the information that probably has no legal, regulatory or business value according to the 2012 CGOC survey. At the end of three years, the amount of retained data from a single employee that could be disposed of without adverse effects to the organization is 10.43 GB. Now multiply that by the total number of employees and you are looking at some very large data stores.
Figure 1: The Lifecycle of data
The above lifecycle of data shows us that employees really don’t need all of the data they squirrel away (because its probability of re-use drops to 1% at around 15 days) and based on the CGOC survey, approximately 69% of organizational data is not required for legal, regulatory retention or has business value. The difficult piece of this whole process is how can an organization efficiently determine what data is not needed and dispose of it automatically…
As unstructured data volumes continue to grow, automatic categorization of data is quickly becoming the only way to get ahead of the data flood. Without accurate automated categorization, the ability to find the data you need, quickly, will never be realized. Even better, if data categorization can be based on the meaning of the content, not just a simple rule or keyword match, highly accurate categorization and therefore information governance is achievable.