“Left of boom” is a colorful phrase used by intelligence and law enforcement agencies to refer to the precious interval of time one has to detect and then prevent an impending attack or crime by an adversary. Operations and other activities aimed at preventing the event can only commence after detection, so many technical pursuits in these domains are focused on developing new technologies and techniques that will push detection as far left of “boom” as possible.
A technologist, of course, would speculate on whether a faster, bigger, more intelligent hardware/software apparatus might help: something capable of efficiently sifting through fused data on individuals from multiple sources (surveillance, financial transactions, human interactions), looking for signatures of covert operations. These signals could then trigger the appropriate reactions in downstream analysts: further investigation, that “something’s hinky” feeling, and maybe sounding the alarm.
Suppose we have a detective looking for an impending illegal drug deal involving two known dealers who are planning to meet in a particular city that they each have to travel to. He or she would run a query like the one below on the aforementioned dataset:
“Show me any relationships between known dealers who have bought plane, train, or bus tickets to the same destination in cash where they arrive within 1 week of each other.”
A couple of things to note here. First, we do not know a priori what the exact relationship between the dealers is, or how strong or weak it is, and there may not be any direct connection linking the two people. There may be “non-obvious” intermediate relationships we need to find between them. So there is a “fuzzy” aspect to the query: we don’t know precisely what we are looking for. If you just said “Hey, that sounds like a graph”, you’re right on, and we’ll discuss that in a second. Related to this, while this seems like a useful query to run, we will probably have to iterate on it after we see some results returned. Therefore we have a system requirement that the queries must execute and return data on human-interactive time scales, i.e. seconds and minutes, not hours.
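To give a flavor of what writing that query down literally looks like, here is a rough sketch, jumping ahead to SPARQL (the RDF query language I will say more about below) and using a made-up ex: travel vocabulary purely for illustration. A rigid, exact-match version might look something like this:

```sparql
PREFIX ex: <http://example.org/ns#>

SELECT ?dealerA ?dealerB ?city ?arriveA ?arriveB
WHERE {
  ?dealerA a ex:KnownDealer .
  ?dealerB a ex:KnownDealer .
  FILTER (?dealerA != ?dealerB)

  ?ticketA ex:purchasedBy   ?dealerA ;
           ex:paymentMethod ex:Cash ;
           ex:destination   ?city ;
           ex:arrivalDate   ?arriveA .

  ?ticketB ex:purchasedBy   ?dealerB ;
           ex:paymentMethod ex:Cash ;
           ex:destination   ?city ;
           ex:arrivalDate   ?arriveB .

  # "arrive within 1 week of each other": standard SPARQL offers only limited,
  # engine-dependent date arithmetic, so that constraint is left as a comment here
}
```

Notice how much rigidity is already baked in: each dealer must be directly attached to a ticket, the destinations must be literally the same node, and there is no notion of a near miss. Keep that in mind for what follows.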
“Left of boom” problems abound in other fields as well and could benefit from similar capability. In the current “post-omic” era in biology and medicine, researchers sift through the vast amounts of genomic and proteomic data generated by high-throughput, democratized sequencing technologies, attempting to discover novel gene and protein interactions that might lead to new therapeutic agents and to risk prediction far to the left of the “boom” of disease onset. Corporate entities, wanting to remain competitive, perceive the need to integrate and correlate their enterprise customer transaction data with real-time trends propagating within on-line social networks, and to do quantitative behavioral science on the patterns locked away in data continually generated by the ubiquitous use of increasingly sensor-rich mobile devices.
As you may have suspected, all of this turns out to be very hard, or maybe even “non-trivial” as we physicists like to say in our nerd-macho vernacular. There are many technical reasons “anomalous” or “fuzzy” or “best-effort” pattern matching is difficult and there is not a general solution or even a consensus definition of what the problem is exactly. This posting is envisioned as the first in a series exploring the technical issues surrounding how to do “non-obvious” relationship analytics on large, messy data in a way that produces mission-critical, actionable signals.
Here are some technical reasons off the top of my head why non-obvious relationship finding is difficult for current technologies:
- Here we are not just constructing queries that articulate known correlations between instances of structured data in a relational table and impose simple Boolean filters on known factors (e.g. “greater than” some value). Oftentimes you do not know the context for (or maybe even the content of) the data, and the queries are looking for things that are flagged as “interesting” or “anomalous” by statistics rather than by domain knowledge.
- The underlying structure of the data itself is typically what’s important here, i.e. the relationships (edges) between the people, places, and things (nodes) in the data. In other words, the best way to represent the data in these problems is often as a graph. Not only is it the most natural structure in many cases (e.g. a social or protein interaction network), but if you are able to reasonably transform your data into such a representation, a whole host of well-studied applied mathematics and linear algebra becomes available for your analysis needs (for example, powers of the adjacency matrix count walks between nodes, and spectral methods underlie many clustering and ranking techniques).
- OK, if we limit ourselves to problems that we can represent as a graph, we run into these issues:
- Database-oriented hardware/software shortcomings: What technologies exist where you can actually load large real-world data graphs with potentially billions of nodes and run queries over the data that return in seconds and minutes, rather than hours? If one chooses to parallelize using commodity hardware and some distributed computing scheme (a la MapReduce), one is left with having to partition the graph over a “shared-nothing” architecture, which makes practically all graph queries perform poorly because most traversals end up crossing partition boundaries and paying network costs. If one optimizes the query execution strategy or the index structure to be fast on one type of query, other types will suffer. How does one go about this in a sufficiently general way, i.e. not optimized for a particular application?
- Algorithm-oriented hardware/software shortcomings: How does one actually go about executing a query, with current technologies, to get the exact and near matches we want to know about, rank-ordered by how well the instance patterns match the query? As a particular example of this type of activity, let’s consider best-effort pattern matching on large attributed graphs, as shown in (2). The problem is querying a graph of N nodes with a query pattern consisting of nq nodes. Naively, each of the nq query nodes could be mapped to any of the N graph nodes, so the problem has polynomial time complexity, O(N^nq). The really great algorithmic work in (2) is able to reduce this to linear in the size of the graph, but would that still hold on very large graphs? At some size the hardware/software platform has to fall over.
A lot of progress has been made with Semantic Web technologies such as the egalitarian Resource Description Framework (RDF) data model, where everything goes in as a (subject, predicate, object) triple. Ontology statements are additional triples that go in like any other instance data statement, but they can provide flexible, evolvable, and swappable structure over the data with an ease that is hard to come by with rigid schemas. The W3C recommendation query language SPARQL has a lot of useful functionality and realizes some of the great benefits of a graph database. You can perform efficient queries that amount to simple pattern matches in SPARQL; the same queries may need to be expressed as horribly non-performant recursive self-joins in SQL if the same data resided in an RDBMS.
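As a small illustration of that last point, a two-hop “who is connected to someone who is connected to X” question is just a couple of triple patterns in SPARQL (the predicates are again hypothetical), whereas in a relational schema each additional hop typically costs another self-join on the same table:

```sparql
PREFIX ex: <http://example.org/ns#>

SELECT ?person
WHERE {
  # two hops out from a person of interest, ex:PersonX
  ?person    ex:associatesWith ?middleman .
  ?middleman ex:associatesWith ex:PersonX .
}
```

Adding a third hop is one more line here, versus yet another self-join in the SQL version.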
However, at the end of the day, SPARQL is still what it was intended to be: a simple-to-use declarative query language, and not a hardcore quantitative analysis workhorse.
As you can see, there is a lot to explore, and we will. I will finish this post by showing one version of the query I talked about earlier to show some of the issues I will be discussing in future posts.
Here is what the query graph pattern might look like:
Here is what near matches in the instance data might look like. Note that the suspected dealers are not actually associated directly (at least in terms of the underlying data used to build the graph) but are third-degree connections, and that while the destinations are different, they are geographically very close. How do we treat this?
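SPARQL 1.1 property paths can at least relax the “directly connected” part of the pattern (made-up predicates as before), though they say nothing about how good a given match is, and “geographically very close” destinations still need either precomputed nearness edges or explicit numeric filters:

```sparql
PREFIX ex: <http://example.org/ns#>

SELECT ?dealerA ?dealerB
WHERE {
  ?dealerA a ex:KnownDealer .
  ?dealerB a ex:KnownDealer .
  FILTER (?dealerA != ?dealerB)

  # one or more association edges of either flavor between the two dealers,
  # so a third-degree chain like John -> ... -> Alice can still match
  ?dealerA (ex:associatesWith|ex:observedWith)+ ?dealerB .

  # "geographically very close" has no native fuzzy equivalent; you would need
  # something like a precomputed ex:nearTo edge between destinations
}
```

Even so, everything that comes back is a Boolean yes/no match; there is no rank-ordering by how close an instance came to the pattern we actually asked for.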
Finally, there is the temporal nature of the underlying data itself: the graph is built up over time, so the pattern of interest, if it is ever fully satisfied, will only become satisfied over time. (We will get into the thorny temporal data issue in a later post.) At a particular point in time (say Jan 31), John and Alice have not yet been observed together and Will has not purchased his plane ticket yet. However, partial, maybe even disjoint, instantiations may already exist. How do we track these instantiations as they join up to satisfy our pattern of interest over time?
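Part of what makes this awkward in a plain declarative query is that a basic graph pattern is all-or-nothing: on Jan 31 the triples for Will’s ticket simply do not exist yet, so the whole solution, including the parts we have already observed, silently drops out. OPTIONAL blocks (sketched below with the same hypothetical vocabulary) let you keep partial solutions, but composing and scoring many of them over time gets unwieldy fast:

```sparql
PREFIX ex: <http://example.org/ns#>

SELECT ?dealerA ?dealerB ?city ?arriveB
WHERE {
  ?ticketA ex:purchasedBy ?dealerA ;
           ex:destination ?city .

  # keep the partial match even if the second ticket has not appeared yet
  OPTIONAL {
    ?ticketB ex:purchasedBy ?dealerB ;
             ex:destination ?city ;
             ex:arrivalDate ?arriveB .
  }
}
```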
Can we account for all the above cases by doing one or more SPARQL queries? Anyone have any ideas?
(1) Anyanwu, K., Maduko, A., Sheth, A., “SPARQ2L: Towards Support for Subgraph Extraction Queries in RDF Databases”, WWW2007 Track: Semantic Web (2007).
(2) Tong, H., Gallagher, B., Faloutsos, C., Eliassi-Rad, T., “Fast Best-Effort Pattern Matching in Large Attributed Graphs”, ACM KDD07 (2007).