Unstructured textual data from domains such as e-commerce, social media, real-time web ad tailoring, and counter-terrorism analysis is growing exponentially. Huge distributed systems of hundreds or thousands of servers, running management layers such as Hadoop and Spark, are attacking the problem with brute-force server farms. But how much more conventional CPU-only resource can we throw at big data problems as we head past petabytes into exabyte and zettabyte territory? Specialized pattern-analysis accelerators, with vastly lower power and cost per text analysis, can be a vital part of the solution.
The Era of Big Data
The creation of “Big Data” (BD) technologies was sparked by the exponential growth in web data and by key massively distributed computing software inventions at Google. Big data, in its essence, refers to data that is too large, too rapid in its creation, or too diverse in its structure to feasibly store and analyze with traditional relational databases. These BD technologies started at Google but have since rapidly evolved and been extended to finance, e-commerce, sensor data, surveillance, and many other domains. The Big Data hardware/software/services market is projected to exceed $47B by 2017, a 31% compound annual growth rate over the period 2012–2017, with over $7B in computing hardware alone.
The “three V’s” of Big Data are volume, variety and velocity. All three require ever more powerful and more massively distributed compute power, based on BD technologies such as Hadoop, MapReduce, Hive, Pig, Spark, MongoDB and others.
The need for speed in analyzing many petabytes of unstructured, often text-based data is only getting more challenging: exabytes, and even zettabytes, of data are here now. One estimate for 2020 is approximately 40 zettabytes of data on the web, a whopping 43 trillion gigabytes. Worldwide social media users number in the billions, and each one can accumulate thousands to tens of thousands of connections to content. Tens of billions of content pieces are added to Facebook every month, and these can be correlated with vast amounts of shopping and other metadata all over the web.
Of the three “V” aspects, velocity is the most recent and most rapidly growing imperative for BD. Velocity is usually defined as the rate of change in the base data, as well as the rate at which new questions are asked of it.
A vital aspect of velocity that can motivate adding hardware acceleration, such as the Automata Processor (AP), is the need for extremely low latency in delivering answers. Examples of “Real-Time Big Data Analytics” (RTBDA) applications include individualized ad-serving for web visitors, analysis of Twitter sentiment trends or breaking news to drive high-frequency stock trading, and even detecting credit card fraud while the identity thief is still at the checkout. This trend may motivate more hardware acceleration at the leaf level of BD’s distributed architecture with AP technology, rather than relying only on ever-wider distributions of data across servers. The value of acceleration depends on the specific data analysis algorithm and on whether the accelerator hardware is an order of magnitude or more quicker than traditional CPUs alone.
There is quite a difference between traditional relational database analytics and RTBDA. Previously, you had to know most of the kinds of questions you planned to ask before you stored your data. In the worst case, you might have to change the relational schema, rebuild indexes, and collect new data, which could take months. BD technologies provide the scale and flexibility to store data before you know how you are going to analyze it. They work for complex, non-traditional data—the kinds of unstructured or semi-structured data generated by finance, customer relationship management (CRM) systems, social media, mobile communications, telecommunications, customer service records, sensors, and web activity. No longer does data have to be imported, hashed, B-tree’d, and stored neatly in tables. In today’s world of BD analytics, anything goes. Store everything, think of new mash-ups from diverse sources, and analyze when you need to.
So, in today’s typical BD architecture, perhaps the most readily apparent role for the Automata Processor is at the leaf level of processing, after job distribution to the servers, for certain classes of algorithms that run far faster on PCIe AP accelerator cards than on traditional CPUs alone. For example, a common BD task is to scan terabytes of text to compute the frequencies of target keywords. With MapReduce, terabytes or petabytes of text can be split into thousands of subsets.
The output of the pattern-matching engines in the Automata Processor can be connected to on-chip counters, so AP cards in each server could quickly compute a keyword frequency table for each subset. BD technologies then aggregate these tables to produce a final table. Smaller server farms with AP accelerator cards could equal the speed of much larger conventional server farms.
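The map/reduce split described above can be sketched in plain Python. This is only an illustration of the data flow: the `map_keyword_counts` step stands in for the per-subset scan that an AP card would accelerate, and the keyword set is a made-up example.

```python
from collections import Counter
from functools import reduce

# Hypothetical target keywords; in a real job these come from the job config.
KEYWORDS = {"fraud", "alert", "trade"}

def map_keyword_counts(text_subset: str) -> Counter:
    """Leaf-level step: count target keywords in one text subset.
    On an AP-accelerated server, this scan would run on the PCIe card."""
    words = text_subset.lower().split()
    return Counter(w for w in words if w in KEYWORDS)

def reduce_counts(tables) -> Counter:
    """Aggregation step: merge per-subset frequency tables into one final table."""
    return reduce(lambda a, b: a + b, tables, Counter())

subsets = ["Fraud alert issued", "trade volume spikes", "no alert today"]
final_table = reduce_counts(map_keyword_counts(s) for s in subsets)
```

Each mapper produces an independent table, so the per-subset scans can run on as many servers (or AP cards) as the data has been split across.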
Other Potential AP Applications
Pre-processing cleanup: Big Data at ingest time is often very messy and can benefit from cleanup prior to analysis, such as text deletion, trimming, padding, or substitutions. Automata Processors are well-suited to this kind of text processing.
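The kinds of cleanup listed above can be expressed as a short pipeline of pattern operations. The rules below are assumed for illustration only (strip markup, collapse whitespace, trim, pad to a fixed width); each one is the sort of single-pass text transformation an AP could apply at ingest.

```python
import re

def clean_record(raw: str, width: int = 32) -> str:
    """Illustrative ingest-time cleanup with assumed rules:
    deletion, substitution, trimming, and padding."""
    text = re.sub(r"<[^>]+>", "", raw)   # deletion: strip HTML-like tags
    text = re.sub(r"\s+", " ", text)     # substitution: collapse whitespace runs
    text = text.strip()                  # trimming
    return text.ljust(width)             # padding to a fixed-width field

record = clean_record("  <b>Acme\tCorp</b>  earnings  ")
```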
Non-leaf-level BD tasks: As BD technologies evolve, we may discover other valuable roles for APs above the leaf “server-CPU-accelerator” level that exploit their massively parallel symbol-processing acceleration, perhaps in the management layer. One possible example is encoding, compressing, binning, and accessing very large “bitmap indexes” in Hive.
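To make the bitmap-index idea concrete, here is a minimal sketch of the data structure itself: one bit-vector per distinct column value, with row i's bit set in the vector for its value. This is an assumed toy layout, not Hive's actual index format.

```python
def build_bitmap_index(column):
    """Build one integer bit-vector per distinct value in the column."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def query(index, value):
    """Return the matching row ids by reading the set bits."""
    bits = index.get(value, 0)
    return [r for r in range(bits.bit_length()) if (bits >> r) & 1]

idx = build_bitmap_index(["US", "UK", "US", "DE", "US"])
```

Encoding, compressing, and intersecting such bit-vectors is exactly the kind of bulk symbol-stream work a parallel accelerator could speed up.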
Video surveillance metadata: Automata Processors could be ideal for analyzing the voluminous streams of metadata generated by surveillance cameras 24/7/365. Since this involves massive amounts of text/symbol analysis, it aligns well with the AP's strengths.
Intelligence and counter-terrorism: BD has already been extensively used for intelligence analysis, with collection of vast quantities of unstructured textual data. Automata Processors are designed precisely for massive acceleration of text processing.
Apache Mahout: Mahout is a Hadoop/MapReduce-compatible analysis software package. It supports three use cases that could be accelerated by the Automata Processor. “Recommendation mining” takes users’ behavior and from that tries to find items users might like. “Clustering” analyzes text documents and groups them into sets of topically related documents. “Classification” learns from existing well-categorized documents what documents of a specific category look like, and then assigns unlabeled documents to the predicted category.
Text mining: Text mining is the process of deriving meaningful information from text. It can be done within Big Data infrastructure, or outside of it.
Typical methods include detecting text patterns and trends through statistical pattern analysis, as well as structuring the data by adding metadata. Other tasks include: clustering, categorization, concept identification and tagging, sentiment analysis, document summarization, finding relations between words or concepts, and word frequency statistics. Named entity recognition uses dictionaries, glossaries, rules and natural language techniques to identify entities: people, organizations, stock ticker symbols, place names, acronyms, phone numbers, email addresses, etc.
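Dictionary- and rule-based entity recognition of the kind just described reduces to running many patterns over the same text. The sketch below uses Python's `re` module with a few illustrative patterns (the rules and the sample text are assumptions, not production-grade NER); on an AP, all such patterns would be evaluated in parallel in a single pass over the symbol stream.

```python
import re

# Illustrative entity patterns; real rule sets would be far larger.
PATTERNS = {
    "email":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ticker": re.compile(r"\$[A-Z]{1,5}\b"),
    "phone":  re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract_entities(text: str) -> dict:
    """Run every pattern over the text and collect the matches per entity type."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

hits = extract_entities("Contact jo@example.com about $MU at 555-867-5309.")
```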
Sentiment analysis, also used in stock trading, involves discerning subjective content and attitudinal information: opinion, mood, and emotion, while tying the sentiment to an entity, such as a stock or a person’s name. For national security uses, text mining can be used to monitor and analyze intelligence intercepts, blogs, news, emails, Twitter feeds, search logs etc. Text mining is also heavily used in the scientific and medical literature. It can augment market research and customer relations management (CRM) systems, as well as intelligence analysis. Also see the page on Financial Services. The Automata Processor is relevant to virtually any text mining application, due to the immense power of its regular expression capability.
A Functional Example
The Levenshtein distance measures the difference between two sequences, allowing for substitutions, deletions, and insertions. This principle has broad applicability to many Big Data application domains such as bioinformatics and network security. This example scans for any sequence within a distance of three symbols or fewer from a target sequence L symbols long.
If a solution exists, it is always found within L − k ≤ t ≤ L + k, where t is the number of clock cycles, L is the target string length, and k is the maximum distance. This implementation requires 2k(L − 1) + 1 State Transition Elements in the Automata Processor.
Note that this approach trades an O(k × L) number of hardware elements for a linear execution time of O(k + L).
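For reference, the quantity the automaton computes in hardware is the classic edit distance, sketched here in software with the standard dynamic-programming recurrence; the `within_distance` helper mimics the automaton's accept condition (distance ≤ k). A CPU computes this in O(L × k)-ish time per comparison, which is the cost the AP's parallel elements eliminate.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions,
    via the two-row dynamic-programming formulation."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = curr
    return prev[-1]

def within_distance(target: str, candidate: str, k: int = 3) -> bool:
    """Mimics the automaton's accept condition: distance <= k."""
    return levenshtein(target, candidate) <= k
```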