Next Generation Technologies Reduce FOIA Bottlenecks

Federal agencies are under more scrutiny to resolve issues with responding to Freedom of Information Act (FOIA) requests.

The Freedom of Information Act provides for the full disclosure of agency records and information to the public unless that information is exempted under clearly delineated statutory language. In conjunction with FOIA, the Privacy Act serves to safeguard public interest in informational privacy by delineating the duties and responsibilities of federal agencies that collect, store, and disseminate personal information about individuals. The procedures established ensure that the Department of Homeland Security fully satisfies its responsibility to the public to disclose departmental information while simultaneously safeguarding individual privacy.

In February of this year, the House Oversight and Government Reform Committee opened a congressional review of executive branch compliance with the Freedom of Information Act.

The committee sent a six-page letter to Melanie Ann Pustay, Director of Information Policy at the Department of Justice (DOJ). In the letter, the committee questions why, based on a December 2012 survey, 62 of 99 government agencies have not updated their FOIA regulations and processes, an update required by Attorney General Eric Holder in a 2009 memorandum. In fact, the Attorney General’s own agency has not updated its regulations and processes since 2003.

The committee also pointed out that 83,000 FOIA requests were still outstanding as of the writing of the letter.

In fairness to the federal agencies, responding to a FOIA request can be time-consuming and expensive if technology and processes are not keeping up with increasing demands. Electronic content can be anywhere, including email systems, SharePoint servers, file systems, and individual workstations. Because content is spread around and not usually centrally indexed, enterprise-wide searches do not turn up all potentially responsive content. As a result, agencies fall back on a much more manual, time-consuming process to find relevant content.

There must be a better way…

New technology can address the collection problem of searching for relevant content across the many storage locations where electronically stored information (ESI) can reside. For example, an enterprise-wide search capability with “connectors” into every data repository (email, SharePoint, file systems, ECM systems, and records management systems) allows all content to be centrally indexed, so that an enterprise-wide keyword search will find all instances of content containing those keywords. A more powerful capability to look for is the ability to search on concepts, a far more accurate way to search for specific content. Searching for conceptually comparable content can speed up the collection process and drastically reduce the number of false positives in the results set, while finding many more of the records that lack the keywords but are conceptually responsive. In conjunction with concept search, automated classification/categorization of data can reduce search time and raise accuracy.
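To make the central index idea concrete, here is a minimal Python sketch. It assumes each "connector" has already pulled documents out of its repository; the repository names, document IDs, and texts are hypothetical and stand in for real email, SharePoint, and file system content. It is an illustration of the concept, not how any particular product builds its index.

```python
from collections import defaultdict

# Hypothetical repositories: each maps a document ID to its text.
REPOSITORIES = {
    "email": {"msg-1": "Budget review for the border program"},
    "sharepoint": {"doc-7": "Border program staffing plan"},
    "file_share": {"memo-3": "Holiday party schedule"},
}


def build_index(repositories):
    """Build one inverted index: lowercase word -> set of (repository, doc_id)."""
    index = defaultdict(set)
    for repo_name, documents in repositories.items():
        for doc_id, text in documents.items():
            for word in text.lower().split():
                index[word].add((repo_name, doc_id))
    return index


def search(index, keyword):
    """Return every indexed location whose text contains the keyword as a word."""
    return sorted(index.get(keyword.lower(), set()))


index = build_index(REPOSITORIES)
print(search(index, "border"))  # [('email', 'msg-1'), ('sharepoint', 'doc-7')]
```

Because every repository feeds the same index, one keyword query reaches content that would otherwise need a separate search in each system.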

The largest cost in responding to a FOIA request is in the review of all potentially relevant ESI found during collection. Predictive Coding, a technology currently used by attorneys for eDiscovery, can drastically reduce the burden of reviewing thousands, hundreds of thousands, or millions of documents for relevancy and privacy.

Predictive Coding is the process of applying machine learning and iterative supervised learning technology to automate document coding and prioritize review. This functionality dramatically expedites the review process while improving accuracy and reducing the risk of missing key documents. According to a RAND Institute for Civil Justice report published in 2012, document review cost savings of 80% can be expected using Predictive Coding technology.

With the increasing number of FOIA requests swamping agencies, agencies are hard-pressed to catch up on their backlogs. The next generation technologies mentioned above can help agencies reduce their FOIA-related costs while decreasing their response time.


Conceptual Search versus Predictive Coding

In my last blog entry, titled “Successful Predictive Coding Adoption is Dependent on Effective Information Governance”, a question was posted which I thought deserved wider sharing with the group: “What is the difference between predictive coding and conceptual search?” Being an individual not directly associated with either technology but with some interesting background, I believe I can attempt to explain the differences, at least as they pertain to discovery processes.

Conceptual search technologies allow a user to search on concepts…(pretty valuable insight, right?) instead of searching on a keyword such as “dog”. In the case of a keyword search on “dog”, the user would generate a results set of every document/file/record with the three letters D-O-G present in that specific sequence. The results could include returns on “dogs” (the four-legged animals), references to “hot dogs” (frankfurters), references to movies (Dog Day Afternoon), etc., in no particular priority.

True conceptual search capability understands (based on search criteria) that the user is looking for information on the four-legged animals, so it would return references not just to “dogs” but also to “Golden Retrievers”, “Animal Shelters”, “Pet Adoption”, etc. Some conceptual search solutions will also cluster concepts to give the user the ability to quickly fine-tune their search; for example, creating a cluster of all dog (animal) references, a cluster for all food-related references, and so on. Many eDiscovery analytic solutions include this clustering capability.
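The difference between the two kinds of search can be shown with a toy Python sketch. The documents, the hand-written list of related terms, and the excluded phrase are all assumptions made for illustration; a real conceptual search engine would derive these relationships itself rather than rely on a fixed list.

```python
# Hypothetical document collection.
DOCUMENTS = {
    "a": "Our dog needs a walk",
    "b": "Hot dog stand opens downtown",
    "c": "Golden Retriever puppies at the animal shelter",
    "d": "Pet adoption event this weekend",
}

# Hand-written stand-in for the concept "dog (animal)"; an assumption,
# not something a real engine would hard-code.
DOG_ANIMAL_TERMS = {"dog", "golden retriever", "animal shelter", "pet adoption"}
# Phrases that signal a different sense of the keyword.
EXCLUDED_PHRASES = {"hot dog"}


def keyword_search(documents, keyword):
    """Return IDs of documents containing the literal character sequence."""
    return sorted(d for d, text in documents.items() if keyword in text.lower())


def concept_search(documents, concept_terms, excluded_phrases):
    """Return IDs of documents that mention any concept term in the intended sense."""
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(phrase in lowered for phrase in excluded_phrases):
            continue  # wrong sense of the word, e.g. food rather than animal
        if any(term in lowered for term in concept_terms):
            hits.append(doc_id)
    return sorted(hits)


print(keyword_search(DOCUMENTS, "dog"))                              # ['a', 'b']
print(concept_search(DOCUMENTS, DOG_ANIMAL_TERMS, EXCLUDED_PHRASES))  # ['a', 'c', 'd']
```

The keyword search returns the hot dog stand (a false positive) and misses the shelter and adoption documents, while the concept search drops the false positive and picks up the two documents that never use the word “dog”.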

Predictive coding is a process which includes both automation and human interaction to best produce a results set of potentially responsive documents that trained human reviewers can check.

Predictive coding takes the conceptual search and clustering idea much further than just understanding concepts. A predictive coding solution is “trained” in a very specific manner for each case. For example, the legal team, with additional subject matter expertise, manually chooses documents/records/files that it deems responsive for the particular case and inputs them to the predictive coding system as examples of content/format which should be found and coded as responsive to the case. Most predictive coding processes include several iterative cycles to fine-tune the training examples. In an iterative cycle, legal professionals sample and review the records coded as responsive by the solution and determine whether they are truly responsive in the opinion of the human reviewer. If the reviewers find documents that they do not deem responsive, those documents are in turn used to train the solution not to code similar content as responsive. This iterative cycle can be repeated several times until the human professionals agree the system has reached the desired level of capability. By the way, this iterative process can be, and is, also used to sample results sets of documents deemed non-responsive to determine whether the solution is missing potentially responsive content. This measurement is called “elusion”. Elusion counts the misses that a system yielded: it is the proportion of documents coded non-responsive by the solution that are in fact responsive. The elusion sample can also be used in the iterative cycle to further train the system.
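As a rough illustration of the elusion check, the Python sketch below samples the pile of documents the system coded non-responsive and reports what share of the sample a reviewer finds responsive. The function name, the `is_responsive` callback (a stand-in for a human reviewer's judgment), and the made-up document IDs are all assumptions for the example.

```python
import random


def estimate_elusion(discard_pile, is_responsive, sample_size, seed=0):
    """Estimate elusion by sampling documents the system coded non-responsive.

    discard_pile: list of document IDs coded non-responsive by the system.
    is_responsive: callable standing in for a human reviewer's judgment.
    Returns (estimated elusion, list of sampled documents found responsive).
    """
    rng = random.Random(seed)
    sample = rng.sample(discard_pile, min(sample_size, len(discard_pile)))
    if not sample:
        return 0.0, []
    missed = [doc for doc in sample if is_responsive(doc)]
    return len(missed) / len(sample), missed


# Hypothetical discard pile of 1,000 documents in which every 50th document
# is actually responsive (a true elusion of 2%).
discard_pile = list(range(1000))
reviewer = lambda doc_id: doc_id % 50 == 0

elusion, missed = estimate_elusion(discard_pile, reviewer, sample_size=200)
print(f"Estimated elusion: {elusion:.1%} ({len(missed)} misses in the sample)")
```

The documents in `missed` are the ones that could be fed back into the next training cycle.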

The obvious benefit of a predictive coding solution in the eDiscovery process is to dramatically reduce the time spent on legal professionals reading each and every document to determine its responsiveness. A 2012 RAND Institute for Civil Justice report estimated a savings of 80% for the eDiscovery review process (73% of the total cost of eDiscovery) when using a predictive coding solution.

So, to answer the question, conceptual search is an automated information retrieval method which is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Predictive coding is a process (which can include conceptual search) which uses machine learning technologies to categorize (or code) an entire corpus of documents as responsive, non-responsive, or privileged based on human chosen examples used to train the system in an iterative process. These technologies typically rank the documents from most to least likely to be responsive to a specific information request. This ranking can then be used to “cut” or partition the documents into one or more categories, such as potentially responsive or not, in need of further review or not, etc. [1]
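The ranking and “cut” step might look something like the following Python sketch. The scores and the 0.5 cutoff are hypothetical; in practice the scores would come from the trained system and the cutoff would be chosen by the review team.

```python
def rank_and_cut(scores, cutoff):
    """Rank documents by score (highest first) and partition them at the cutoff.

    scores: dict of document ID -> estimated likelihood of responsiveness.
    Returns (ranked_ids, review_pile, set_aside_pile).
    """
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    review_pile = [d for d in ranked if scores[d] >= cutoff]
    set_aside_pile = [d for d in ranked if scores[d] < cutoff]
    return ranked, review_pile, set_aside_pile


# Hypothetical scores produced by a trained model.
SCORES = {"doc-1": 0.92, "doc-2": 0.15, "doc-3": 0.61, "doc-4": 0.40}

ranked, review_pile, set_aside_pile = rank_and_cut(SCORES, cutoff=0.5)
print(ranked)          # ['doc-1', 'doc-3', 'doc-4', 'doc-2']
print(review_pile)     # ['doc-1', 'doc-3']
print(set_aside_pile)  # ['doc-4', 'doc-2']
```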

1 Partial definition from the eDiscovery Daily Blog:

Successful Predictive Coding Adoption is Dependent on Effective Information Governance

Predictive coding has been receiving a great deal of press lately (for good reason), especially with the ongoing case Da Silva Moore v. Publicis Groupe, No. 11 Civ. 1279 (ALC) (AJP), 2012 U.S. Dist. LEXIS 23350 (S.D.N.Y. Feb. 24, 2012). On May 21, the plaintiffs filed Rule 72(a) objections to Magistrate Judge Peck’s May 7, 2012 discovery rulings related to the relevance of certain documents that comprise the seed set of the parties’ ESI protocol.

This Rule 72(a) objection highlights an important point in the adoption of predictive coding technologies: the technology is only as good as the people AND processes supporting it.

To review, predictive coding is a process where a computer (with the requisite software) does the vast majority of the work of deciding whether data is relevant, responsive, or privileged in a given case.

Beyond simply searching for keyword matches (byte for byte), predictive coding adopts a computer self-learning approach. To accomplish this, attorneys and other legal professionals provide example responsive documents/data in a statistically sufficient quantity, which in turn “trains” the computer as to what relevant documents/content should be flagged and set aside for discovery. This is done in an iterative process where legally trained professionals fine-tune the seed set over a period of time, to the point where the seed set is a statistically representative sample that includes examples of all possible relevant content as well as formats. This capability can also be used to find and secure privileged documents. Instead of legally trained people reading every document to determine whether it is relevant to a case, the computer can perform a first pass of this task in a fraction of the time with much more repeatable results. This technology is exciting in that it can dramatically reduce the cost of the discovery/review process, by as much as 80% according to the RAND Institute for Civil Justice.

By now you may be asking yourself what this has to do with Information Governance?…

For predictive coding to become fully adopted across the legal spectrum, all sides have to agree that (1) the technology works as advertised, and (2) the legal professionals are providing the system with the proper seed sets for it to learn from. To accomplish the second point, the seed set must include content from all possible sources of information. If the seed set trainers don’t have access to all potentially responsive content to draw from, then the seed set is in question.

Knowing where all the information resides and having the ability to retrieve it quickly is imperative to an effective discovery process. Records/Information Management professionals should view this new technology as an opportunity to become an even more essential partner to the legal department and the entire organization by focusing not just on “records” but on information across the entire enterprise. With full-fledged information management programs in place, the legal department will be able to fully embrace this technology to drastically reduce its cost of discovery.

Defensible Disposal and Predictive Coding Reduce (?) eDiscovery Costs by 65%

Following Judge Peck’s decision on predictive coding in February of 2012, yet another judge has gone in the same direction. In Global Aerospace Inc., et al, v. Landow Aviation, L.P. dba Dulles Jet Center, et al (April 23, 2012), Judge Chamblin, a state judge in the 20th Judicial Circuit of Virginia’s Loudoun Circuit Court, wrote:

“Having heard argument with regard to the Motion of Landow Aviation Limited Partnership, Landow Aviation I, Inc., and Landow & Company Builders, Inc., pursuant to Virginia Rules of Supreme Court 4:1 (b) and (c) and 4:15, it is hereby ordered Defendants shall be allowed to proceed with the use of predictive coding for the purposes of the processing and production of electronically stored information.”

This decision came despite the plaintiffs’ objection that the technology is not as effective as purely human review.

This decision comes on top of a new RAND Institute for Civil Justice report which highlights a couple of important points. First, the report estimated that $0.73 of every dollar spent on eDiscovery can be attributed to the “Review” task. Second, RAND called out a study showing an 80% time savings in attorney review hours when predictive coding was utilized.

This suggests that the use of predictive coding could, optimistically, reduce an organization’s eDiscovery costs by 58.4% (an 80% saving on the $0.73 review share: 0.73 × 0.8 = 0.584).

The barriers to the adoption of predictive coding technology are (still):

  • Outside counsel may be slow to adopt this due to the possibility of losing a large revenue stream
  • Outside and internal counsel will be hesitant to rely on new technology without a track record of success
  • Additional guidance from judges is still needed

These barriers will be overcome relatively quickly.

Let’s take this cost saving projection further. In my last blog I talked about “Defensible Disposal”, or in other words, getting rid of old data not needed by the business. It is estimated that the cost of review can be reduced by 50% simply by utilizing an effective Information Governance program. Utilizing the Defensible Disposal strategy brings the $0.73 review share of every eDiscovery dollar down to $0.365.

Now, if predictive coding can reduce the remaining 50% of the cost of eDiscovery review by 80%, as was suggested in the RAND report, then between the two strategies a total eDiscovery savings of approximately 65.7% could be achieved. To review, let’s look at the math.

Start with the $0.73 of every eDiscovery dollar that is attributed to the review process.

A 50% saving due to Defensible Disposal brings the cost of review down to $0.365.

Applying the additional 80% review savings from predictive coding, we get:

$0.365 × (1 − 0.8) = $0.365 × 0.2 = $0.073 (total cost of review after savings from both strategies)

To finish the calculation, we need to add back in the cost not related to review (processing and collection), which is $0.27.

Total cost of eDiscovery = $0.073 + $0.27 = $0.343, or a savings of $1.00 − $0.343 = $0.657, or 65.7%.
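For anyone who wants to plug in their own percentages, the same arithmetic can be written as a short Python sketch. The constant and function names are my own, and the 50% and 80% figures are the estimates used above, not guarantees.

```python
REVIEW_SHARE = 0.73      # share of each eDiscovery dollar spent on review
DISPOSAL_SAVINGS = 0.50  # estimated review reduction from Defensible Disposal
CODING_SAVINGS = 0.80    # estimated review reduction from predictive coding


def total_savings(review_share, *review_reductions):
    """Apply each reduction to the review share in turn; other costs stay fixed."""
    remaining_review = review_share
    for reduction in review_reductions:
        remaining_review *= 1 - reduction
    total_cost = remaining_review + (1 - review_share)
    return 1 - total_cost


# Predictive coding alone, then both strategies together.
print(f"{total_savings(REVIEW_SHARE, CODING_SAVINGS):.1%}")                    # 58.4%
print(f"{total_savings(REVIEW_SHARE, DISPOSAL_SAVINGS, CODING_SAVINGS):.1%}")  # 65.7%
```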

As with any estimate, your mileage may vary, but this exercise points out the potential cost savings from utilizing just two strategies: Defensible Disposal and Predictive Coding.

Information Governance and Predictive Coding

Predictive coding, computer assisted coding, and technology assisted review all refer to the use of computers and software applications with machine learning algorithms that enable a computer to learn, from records presented to it (usually by human attorneys), what types of content are potentially relevant to a given legal matter. After the attorneys provide a sufficient number of examples, the technology is given access to the entire potential corpus (records/data) to sort through and find records that, based on its “learning”, are potentially relevant to the case.
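To give a feel for what “learning from examples” means, here is a deliberately tiny Python sketch that trains a Naive Bayes text classifier on a handful of attorney-labeled examples and then codes new documents. Naive Bayes is simply a well-known technique I picked for illustration; commercial predictive coding products may use entirely different algorithms, and the seed set below is made up.

```python
import math
from collections import Counter


def train(examples):
    """Train a tiny Naive Bayes model from (text, label) seed examples."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of examples
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical seed set chosen and labeled by attorneys.
SEED_SET = [
    ("contract pricing dispute with vendor", "responsive"),
    ("vendor invoice pricing terms", "responsive"),
    ("office holiday party schedule", "non-responsive"),
    ("parking garage schedule update", "non-responsive"),
]

model = train(SEED_SET)
print(classify(model, "updated vendor pricing"))  # responsive
print(classify(model, "holiday schedule"))        # non-responsive
```

With only four examples the model is obviously crude, which is exactly why the size and coverage of the training set matter so much.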

This automation can dramatically reduce costs because computers, instead of attorneys, conduct the first-pass culling of potentially millions of records.

Predictive coding has several very predictable dependencies that need to be addressed for it to be accepted as a useful and dependable tool in the eDiscovery process. First, which documents/records are used to “train the system”, and who chooses them? This training selection will almost always be conducted by attorneys involved with the case.

The second dependency revolves around the number of documents used for the training. How many training documents are needed to provide the needed sample size to enable a dependable process?

And most importantly, do the parties have access to all potentially relevant documents in the case to draw the training documents from? Remember, potentially relevant documents can be stored anywhere. For predictive coding, or any other eDiscovery process, to be legally defensible, all existing case-related documents need to be available. This requirement highlights the need for effective information management by all in a given organization.

As the courts adopt, or at least experiment with, predictive coding, as Judge Peck did in Monique Da Silva Moore, et al., v. Publicis Groupe & MSL Group, Civ. No. 11-1279 (ALC)(AJP) (S.D.N.Y. February 24, 2012), an effective information management program will become key to the courts adopting this new technology.