The United States has some of the largest and oldest industrial infrastructure in the world, including natural gas, oil, electrical and communications pipelines and wires, more than 85% of which is controlled by private industry.
Placed into service at the beginning of the modern industrial era, much of this infrastructure is still in use. Its operation falls under multiple regulatory purviews: the Department of Transportation (DOT) governs the safe operation of pipelines, the Department of Homeland Security oversees the secure operation of critical infrastructure, and the Federal Energy Regulatory Commission (FERC) enforces critical infrastructure protection (CIP) standards, in addition to state regulations governing intrastate operation.
A substantial portion of the infrastructure was built before 1970, when the pipes and lines ran through unpopulated areas. Today those areas are cities, towns and suburbs, so an accident is far more likely to destroy property and cost lives.
In order to validate the safety or security of any physical asset, it is necessary to determine if the asset is operating within design specifications. Depending on the type of asset, this generally means locating the “as-built” records, which confirm the materials used and the results of testing performed before turnover by the constructor to an operator. In addition, repairs made over the course of several decades constitute additional sets of records, which need to be found and analyzed.
The weakest link determines the strength of the entire chain: one weak valve or section of line sets the upper limit on operating capacities and pressures for the whole system. Knowing which pieces of the system are weak reduces the potential for mishap or catastrophe caused by operating above safe design limits. A good handle on the as-built and repair records also improves the ability to protect the systems from malicious attack, since preventative measures can be designed and implemented based on solid knowledge of the design.
The failure modes of some types of infrastructure have a domino effect, in which a surge of energy cascades through the system. This happened during the Northeast Blackout of 2003, when untrimmed tree branches brushing against a power line combined with a software bug in an alarm system to trigger a cascading failure that automatically took 265 power plants offline.
In 2010, a high-pressure natural gas pipeline exploded in San Bruno, CA, near San Francisco, destroying dozens of homes and killing eight people. The subsequent investigation found poor welds to be the primary cause of the failure, and final fine amounts are still being discussed. The San Bruno event led to new DOT regulations requiring pipeline operators to verify the test and materials records associated with every mile of pipe.
The big data challenge associated with these initiatives lies in vast amounts of un-indexed or poorly indexed documents dating back decades. Physical records were created at the time of construction and during repairs, and have been collected and preserved in one of many ways:
• Stored in boxes alongside other records, with only a short description of the contents recorded in a database; each box holds an average of 2,500 pages of content.
• Scanned to microfilm, with or without a film index.
• Scanned to an image (usually TIFF or PDF). The common practice is to combine hundreds of documents and records into a single PDF, which is hundreds of pages long, with no index.
• Repair records are commonly separated from as-built records and are among the most difficult to find.
Unfortunately, locating the critical records that validate testing performed years ago and unearthing materials design data requires a tremendous amount of research. The data is generally of poor quality due to its age, and the scans are often low resolution. Because the assets may have been bought and sold many times over several decades, little historical context remains: the companies and people who created the data are long gone, and the data has changed hands multiple times.
Assigning a team to manually review the data is expensive and time consuming. It is also error prone, because the work is mind-numbingly repetitive and tedious, and it is a poor use of precious engineering resources. Rather than hunting for needles in the haystack by hand, the recommended approach is to leverage technology to uncover the data. This method includes:
• Analyzing box and file descriptions: mining the database for key phrases of interest and fingerprinting boxes whose descriptions match a controlled vocabulary of phrases.
• Choosing the boxes most likely to be relevant, based on the analysis, then scanning each box to a single large PDF.
• Splitting giant PDFs up into their original documents, and auto-classifying each document as belonging to a standard document type.
• Working with very poor quality data, using predictive analytics to determine whether a fragment of a word or phrase in a document title matches a target.
• Using machine learning to teach the system how to recognize unique patterns of data belonging to a particular category.
• Using auto-extraction techniques to populate critical data points, such as materials data, into a database.
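The fingerprinting and fuzzy-matching steps above can be sketched in a few lines. The snippet below is a minimal illustration, not the author's actual system: the controlled vocabulary, the `fingerprint_box` helper and the 0.75 similarity threshold are all hypothetical, and Python's standard-library `difflib` stands in for whatever predictive-analytics engine a real deployment would use. It shows how a garbled, OCR-damaged box description can still be matched against a target phrase.

```python
import difflib

# Hypothetical controlled vocabulary of phrases that flag a box as relevant.
VOCABULARY = [
    "weld inspection report",
    "hydrostatic test",
    "pipe material certificate",
]

def fingerprint_box(description, vocabulary, threshold=0.75):
    """Tag a box description with every vocabulary phrase it appears to
    contain, tolerating OCR noise and misspellings via fuzzy matching."""
    words = description.lower().split()
    hits = set()
    for term in vocabulary:
        n = len(term.split())
        # Slide a window of n words across the description so a phrase
        # buried in surrounding text can still be matched.
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n])
            if difflib.SequenceMatcher(None, window, term).ratio() >= threshold:
                hits.add(term)
    return sorted(hits)

# A noisy, OCR-style description ("hydrostatc" is misspelled on purpose).
print(fingerprint_box("Box 412: hydrostatc test records, 1968 mainline", VOCABULARY))
# → ['hydrostatic test']
```

In practice the threshold would be tuned against a sample of known-relevant boxes, and the matched phrases would feed the sampling step that decides which boxes to pull and scan.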
The benefits of using a technology-assisted process include:
• Reducing the cost of locating critical records by at least 30%, due to reducing the amount of labor required.
• Curbing the risk of not finding critical records by using more powerful methods.
• Creating a robust database containing high-quality records and attributes which can be examined relative to the objectives.
• Affirmatively demonstrating to government regulators the preemptive steps taken to comply with prescribed policies and procedures.
• Freeing up engineering resources to work on higher-level tasks.
Author: Brent G. Stanley is the chief operating officer of Haystac LLC and a recognized industry expert in the field of information governance and the use of artificial intelligence technology to streamline and automate the process of interpreting large volumes of unstructured data. Stanley holds degrees in math, physics and mechanical engineering from Case Western Reserve University.