<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>YarcData</title>
	<atom:link href="http://yarcdata.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://yarcdata.com/blog</link>
	<description>A Cray Company</description>
	<lastBuildDate>Wed, 22 May 2013 18:52:38 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>“Fast Onboarding” Pt. 1 – a classic business challenge</title>
		<link>http://yarcdata.com/blog/?p=470</link>
		<comments>http://yarcdata.com/blog/?p=470#comments</comments>
		<pubDate>Wed, 22 May 2013 18:52:38 +0000</pubDate>
		<dc:creator>dean</dc:creator>
				<category><![CDATA[Big Data]]></category>

		<guid isPermaLink="false">http://yarcdata.com/blog/?p=470</guid>
		<description><![CDATA[In this two-part blog series, I’m going to first discuss various frustrations of data management professionals and then the solution they need to extract value from the massive volumes available to them. I want to start by telling three different &#8230; <a href="http://yarcdata.com/blog/?p=470">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img style="border-width: 1px; border-color: #000000; margin-left: 14px; margin-bottom: 14px;" src="http://www.yarcdata.com/images/bloggers/Dean.jpg" alt="Dean" width="100" align="right" border="1" />In this two-part blog series, I’m going to first discuss various frustrations of data management professionals and then the solution they need to extract value from the massive volumes available to them.</p>
<p>I want to start by telling three different stories of information challenges, spanning decades of information processing history. What do they have in common?</p>
<ul>
<li>A credit card company creates a system to detect patterns of card usage that indicate the possibility of fraud. But the fraudsters catch on, and change their tactics. So the company catches on, and changes its fraud check policies. How can the fraud detection system keep up?</li>
<li>A contractor provides information systems to a government office. New legislation is enacted regarding what information may and may not be kept for public records, making the policies of these systems obsolete. How can the system keep up?</li>
<li>A detailed analysis of a market requires sophisticated statistics to react quickly enough to exploit market opportunities. But this analysis raises more detailed questions. Whoever can find these answers quickly enough can gain a further market edge. How can the analysis systems keep up?</li>
</ul>
<p>The oldest story comes from twenty years ago, the most recent from this very year. All three of them point to a problem with using information systems in a dynamic business environment: keeping up with the rapid pace of business change.</p>
<p>This is an old problem, but an ongoing one, and I want to give it a name: I call it fast onboarding. Defined rather broadly, fast onboarding is the ability of a system to bring new rules, patterns, datasets, or other information resources to bear quickly on business problems.</p>
<p><strong>What are the business situations that particularly demand fast onboarding capabilities?</strong> We&#8217;ve seen a few in the stories above. They are typically highly competitive, with stakeholders who are responding to some external pressure that lies outside the view of the system.</p>
<p>We see it in security situations of all sorts, where adverse parties (fraudsters, spies, political enemies) are working against your business goals. We also see it in less adversarial situations, as in the example of the fast-breaking legislation above, where competing political pressures make drastic changes to the information context.</p>
<p><strong>What does this mean for the underlying technology?</strong> First, let’s think about how onboarding is done using conventional software methods today. For diverse datasets, we have data warehousing approaches that let us combine them into a single resource that can drive the new application.</p>
<p>With the help of powerful query languages, we can define patterns in the data that are of business interest: possible cases of fraud, regulatory violations, or new competitive opportunities. These technologies and methods allow us to develop software that onboards new datasets and new patterns as the application landscape evolves.</p>
<p><strong>How well do these approaches work?</strong> In an informal poll of data warehouse professionals, I asked how long, on average, it took to design a warehouse, perform the ETL, design the queries on the new system, and provide business value. The uniform answer was “about six months.”</p>
<p>Suppose then that once the analysis delivers value, the business line has a follow-on question that recognizes a new pattern or integrates a new dataset. How long does it take to build the follow-on system? The surprising answer was again, “about six months.” In all the times I have told this story, the only objection I have ever received was that the estimate of “about six months” was a bit optimistic.</p>
<p><strong>What are the take-aways?</strong> The real lesson is that there is no cumulative value in the system design; onboarding a new dataset or pattern is just as expensive as starting from scratch. We normally think of good technology development as a predictable process, but when the system has to both satisfy the requirements known at design time and adapt to a dynamic business, even a technical success in this sense does not translate into business success.</p>
<p>A fast onboarding system isn&#8217;t measured just by how well it answers some business question, but by how quickly it can answer new questions in new information contexts. In my next blog post, I’ll discuss how best to build a system for fast onboarding so that you can learn how to tackle your massive data problems.</p>
<p><em>Dean Allemang, co-author of the bestselling book, </em>Semantic Web for the Working Ontologist<em>, is a consultant, thought-leader, and entrepreneur focusing on industrial applications of distributed data technology. He served nearly a decade as Chief Scientist at TopQuadrant, the world&#8217;s leading provider of Semantic Web development tools, producing enterprise solutions for a variety of industries. As part of his drive to see the Semantic Web become an industrial success, he is particularly interested in innovations that move forward the state of the art in distributed data technology.  Dean&#8217;s current work is concentrated on the life sciences and finance industries, where he currently sees the most promising industrial interest in this technology.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://yarcdata.com/blog/?feed=rss2&#038;p=470</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Writing Iterative Algorithms in SPARQL: Pt. 2</title>
		<link>http://yarcdata.com/blog/?p=458</link>
		<comments>http://yarcdata.com/blog/?p=458#comments</comments>
		<pubDate>Tue, 30 Apr 2013 19:58:35 +0000</pubDate>
		<dc:creator>Steve Reinhardt</dc:creator>
				<category><![CDATA[RDF/SPARQL]]></category>

		<guid isPermaLink="false">http://yarcdata.com/blog/?p=458</guid>
		<description><![CDATA[In the first post of this series we looked at what iterative algorithms are. In this post we’ll look at the actual SPARQL queries that implement the peer-pressure clustering algorithm. The algorithm starts with a set of vertices and edges &#8230; <a href="http://yarcdata.com/blog/?p=458">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img style="border-width: 1px; border-color: #000000; margin-left: 14px; margin-bottom: 14px;" src="http://www.yarcdata.com/images/bloggers/steve.jpg" alt="Steve" width="100" align="right" border="1" /><a href="http://yarcdata.com/blog/?p=318">In the first post of this series</a> we looked at what iterative algorithms are. In this post we’ll look at the actual SPARQL queries that implement the peer-pressure clustering algorithm.</p>
<p>The algorithm starts with a set of vertices and edges between pairs of them. The code, simply stated, is:<br />
<br clear="all"></p>
<pre>assign each vertex to an initial cluster
do {
	assign each vertex to the most popular cluster of its neighbors
} while (enough vertices changed clusters in this iteration)</pre>
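<p>To make the loop concrete before turning to SPARQL, here is a minimal in-memory sketch of the same pseudocode in plain JavaScript. It is not the code used for this series: the function name <code>peerPressure</code>, the edge-list input, and the two stopping parameters are assumptions made purely for illustration, and ties are broken by keeping the first most popular cluster encountered.</p>

```javascript
// Peer-pressure clustering on an in-memory edge list (illustrative sketch).
// edges: array of [source, target] pairs, mirroring "source hasLink target".
function peerPressure(edges, minChanged = 1, maxIterations = 20) {
  // Initial assignment: every vertex with an outgoing link is its own cluster.
  let clusterOf = new Map();
  for (const [s] of edges) clusterOf.set(s, s);

  let iterations = 0;
  let changed;
  do {
    // Count, per vertex, how many of its neighbors sit in each cluster.
    const votes = new Map();
    for (const [s, o] of edges) {
      const clus = clusterOf.get(o);
      if (clus === undefined) continue; // neighbor has no assignment
      if (!votes.has(s)) votes.set(s, new Map());
      const counts = votes.get(s);
      counts.set(clus, (counts.get(clus) || 0) + 1);
    }

    // Move each vertex to the most popular cluster among its neighbors.
    const next = new Map();
    changed = 0;
    for (const [s, counts] of votes) {
      let best;
      let bestCount = 0;
      for (const [clus, count] of counts) {
        if (count > bestCount) { best = clus; bestCount = count; }
      }
      next.set(s, best);
      if (best !== clusterOf.get(s)) changed += 1;
    }
    clusterOf = next;
    iterations += 1;
  } while (changed >= minChanged && maxIterations > iterations);

  return { clusterOf, iterations };
}
```

<p>The iteration cap matters in a sketch like this because all vertices update at once: two vertices linked only to each other would simply swap clusters on every pass. The rest of this post expresses the same loop as SPARQL queries.</p>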
<p>Each statement results in a query. We chose to put each vertex in an initial cluster named the same as the vertex. The SPARQL query for this looks like:</p>
<pre>[ 1] DROP GRAPH &lt;urn:ga/g/xjz0&gt; ;
[ 2] CREATE GRAPH &lt;urn:ga/g/xjz0&gt; ;
[ 3] INSERT {
[ 4]       GRAPH &lt;urn:ga/g/xjz0&gt; {?s &lt;urn:ga/p/inCluster&gt; ?s }
[ 5] }
[ 6] WHERE {
[ 7]    SELECT DISTINCT ?s WHERE {
[ 8]      ?s &lt;urn:ga/p/hasLink&gt; ?o .
[ 9]    }
[10] }</pre>
<p>The cluster assignments are extra data that we’re adding to the database. To avoid cluttering up the default graph and to make them easy to find, we put them in named graphs chosen for this execution of the algorithm (“xjz” in the examples), so their names do not collide with those of other executions. In lines 1 and 2, we DROP any existing graph of that name and CREATE a new empty graph. The SELECT clause on lines 7-9 finds all vertices in the default graph that are the subject of a hasLink predicate; for each unique such vertex, the INSERT on lines 3-5 adds to the named graph a new triple with the same subject, the inCluster predicate, and the subject name as the cluster name.</p>
<p>We have used URIs starting with the urn: prefix so we can freely make up a namespace with no concern about collisions with other uses of the same strings elsewhere. These URIs will not be dereferenceable, which in this situation is not a problem. In &lt;urn:ga/p/hasLink&gt; and &lt;urn:ga/g/xjz<strong>i</strong>&gt;, the “ga” stands for graph analytics, the “p” for predicate, and the “g” for graph name.</p>
<p>Typically the second query and the third query will execute repeatedly.  The second query calculates new cluster assignments based on current cluster assignments.</p>
<pre>[11] DROP GRAPH &lt;urn:ga/g/xjzi+1&gt; ;
[12] CREATE GRAPH &lt;urn:ga/g/xjzi+1&gt; ;
[13] INSERT
[14] {
[15]   GRAPH &lt;urn:ga/g/xjzi+1&gt;  { ?s &lt;urn:ga/p/inCluster&gt; ?clus3 }
[16] }
[17] WHERE {
[18]   { SELECT ?s (SAMPLE(?clus) AS ?clus3)
[19]   {
[20]     { SELECT ?s (MAX(?clusCt) AS ?maxClusCt)
[21]       {
[22]         SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[23]         WHERE
[24]         {
[25]           ?s &lt;urn:ga/p/hasLink&gt; ?o .
[26]           GRAPH &lt;urn:ga/g/xjzi&gt; { ?o &lt;urn:ga/p/inCluster&gt; ?clus }
[27]         } GROUP BY ?s ?clus
[28]       } GROUP BY ?s
[29]     }
[30]     { SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)
[31]       WHERE
[32]       {
[33]         ?s &lt;urn:ga/p/hasLink&gt; ?o .
[34]         GRAPH &lt;urn:ga/g/xjzi&gt; { ?o &lt;urn:ga/p/inCluster&gt; ?clus }
[35]       } GROUP BY ?s ?clus
[36]     } FILTER (?clusCt = ?maxClusCt)
[37]   } GROUP BY ?s
[38]   }
[39] }</pre>
<p>Lines 22-27 return, for each vertex, the vertex, each cluster that its neighbors belong to, and the count of neighbors in that cluster. Lines 20-29 calculate the maximum of those counts for each vertex. Lines 30-36 recompute the per-cluster counts and keep only the clusters whose count equals that maximum. (SPARQL lacks a construct that returns the maximum value of one intermediate result and the corresponding element of another intermediate result.) Lines 18-38 join the maximum cardinality with the cluster name and, in the case of a tie in maximum cardinality, break the tie by SAMPLEing one cluster assignment for each vertex. Lines 14-16 INSERT the new cluster-assignment triples into the named graph.</p>
<p>Because this query reads and writes to graphs whose names change with the iteration count, it is simple to create the queries in a scripting language (JavaScript in this case) to make these modifications to the query for each iteration.  We have chosen a naming scheme of &lt;graphName&gt;i, where i varies across a range, to simplify debugging by examination of intermediate cluster assignments.  An alternative (that is more memory conserving) would use just two named graphs (e.g., current and new) and MOVE new to current at the end of each iteration.  The algorithm does clean up all but the named graph with the final cluster assignments, so the extra memory use is only transient.</p>
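<p>As a rough illustration of the substitution described above, the sketch below fills the graph names for one iteration into a query template. It is not the JavaScript used for this series; the function names <code>graphIri</code> and <code>queryForIteration</code>, the <code>{{cur}}</code> and <code>{{next}}</code> slot markers, and the abbreviated template body are all assumptions.</p>

```javascript
// Build the graph IRI for iteration i of one run, e.g. run "xjz", i = 3.
// "\u003C" and "\u003E" are the characters '<' and '>', written as escapes
// so the listing can be embedded in HTML unchanged.
function graphIri(runId, i) {
  return "\u003Curn:ga/g/" + runId + i + "\u003E";
}

// A query template with two named slots. The body is abbreviated and only
// stands in for the full cluster-assignment query.
const TEMPLATE = [
  "DROP GRAPH {{next}} ;",
  "CREATE GRAPH {{next}} ;",
  "# ... INSERT into {{next}}, reading cluster assignments from {{cur}} ...",
].join("\n");

// Fill the slots for the step from iteration i to iteration i + 1.
function queryForIteration(template, runId, i) {
  return template
    .split("{{cur}}").join(graphIri(runId, i))
    .split("{{next}}").join(graphIri(runId, i + 1));
}
```

<p>Calling <code>queryForIteration(TEMPLATE, "xjz", 2)</code> yields query text that reads from the xjz2 graph and writes to the xjz3 graph, so every intermediate graph stays available for inspection.</p>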
<pre>[40] SELECT (COUNT(?oNew) as ?vccCt)
[41] WHERE {
[42]   GRAPH &lt;urn:ga/g/xjzi&gt;   {?s &lt;urn:ga/p/inCluster&gt; ?oOld}
[43]   GRAPH &lt;urn:ga/g/xjzi+1&gt; {?s &lt;urn:ga/p/inCluster&gt; ?oNew}
[44]   FILTER (?oOld != ?oNew)
[45] }</pre>
<p>The second query executed in each iteration (lines 40-45) counts the number of vertices that changed cluster assignment in the just-completed iteration.  The JavaScript loop that makes the queries then decides whether the algorithm has converged, either by the absolute number or percentage of vertices that changed or by a maximum iteration count.</p>
<p>I have omitted the text of a trivial query done at initialization time to count the number of vertices to be clustered.</p>
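<p>The stopping decision itself is small enough to sketch. The version below is only an illustration: the name <code>hasConverged</code>, the option names, and the default values are assumptions, and the inputs are the two counts produced by the queries described above.</p>

```javascript
// Decide whether to stop iterating (illustrative; all thresholds are assumptions).
// changedCount: result of the change-counting query for this iteration
// vertexCount:  total number of vertices, counted once at initialization
// iteration:    number of iterations completed so far
function hasConverged(changedCount, vertexCount, iteration, opts) {
  const { maxChanged = 0, maxChangedFraction = 0, maxIterations = 50 } = opts || {};
  if (iteration >= maxIterations) return true;   // safety cap on iterations
  if (maxChanged >= changedCount) return true;   // absolute threshold
  if (vertexCount > 0) {
    // relative threshold: fraction of vertices that changed cluster
    return maxChangedFraction >= changedCount / vertexCount;
  }
  return true; // nothing to cluster
}
```

<p>With the defaults, the loop stops only when no vertex changed or the iteration cap is reached; passing, say, <code>{ maxChangedFraction: 0.05 }</code> stops once at most 5% of the vertices moved.</p>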
<p>Careful readers may note that the inner SELECTs at lines 22-27 and 30-35 are identical. While developing a complex nested query like this, keeping the same code identical in two spots is cumbersome. SPARQL 1.1 contains no good mechanism to define this code once and reuse it, like a function in a procedural language. In SQL, by contrast, the WITH clause gives such a code block a name so that it can be referenced wherever its results are needed.</p>
<p>The second and third queries above both have minor changes from one instance to the next (<em>e.g.</em>, substituting “xjz2”, “xjz3”, … into the graph name). While these are not hard to cope with in the JavaScript code that creates the queries, they do mean that the query is literally different each time that it is executed. Hence, the SPARQL endpoint will have to reinterpret the query each time, which could at some point become time-consuming. SQL’s placeholder capability enables the passing of a value (of a given type) at execution time that is inserted at the placeholder’s position in the query, avoiding reinterpretation.</p>
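<p>One possible workaround, sketched here under stated assumptions, is to keep the repeated subquery in a single place in the code that generates the query text. This does not add anything to SPARQL itself; the helper names <code>iri</code>, <code>clusterCounts</code>, and <code>bestClusterSelect</code> are hypothetical, and the generated text is meant to mirror lines 18-37 of the listing above.</p>

```javascript
// "\u003C" and "\u003E" are '<' and '>' written as escapes (HTML-safe listing).
function iri(path) {
  return "\u003Curn:ga/" + path + "\u003E";
}

// The neighbor-cluster count subquery, written exactly once.
function clusterCounts(curGraph) {
  return [
    "SELECT ?s ?clus (COUNT(?clus) AS ?clusCt)",
    "WHERE {",
    "  ?s " + iri("p/hasLink") + " ?o .",
    "  GRAPH " + curGraph + " { ?o " + iri("p/inCluster") + " ?clus }",
    "} GROUP BY ?s ?clus",
  ].join("\n");
}

// Reuse it in both places where the nested query needs it.
function bestClusterSelect(curGraph) {
  const counts = clusterCounts(curGraph);
  return [
    "SELECT ?s (SAMPLE(?clus) AS ?clus3) {",
    "  { SELECT ?s (MAX(?clusCt) AS ?maxClusCt) { " + counts + " } GROUP BY ?s }",
    "  { " + counts + " }",
    "  FILTER (?clusCt = ?maxClusCt)",
    "} GROUP BY ?s",
  ].join("\n");
}
```

<p>The endpoint still receives the subquery twice, so this only removes the maintenance burden of editing two copies by hand; it does nothing for execution cost.</p>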
<p>In this post we have looked at the SPARQL queries that implement peer-pressure clustering as an iterative algorithm.  Next time we’ll look at the JavaScript code that creates the queries and calls the SPARQL endpoint.</p>
]]></content:encoded>
			<wfw:commentRss>http://yarcdata.com/blog/?feed=rss2&#038;p=458</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Harnessing the Power of Data Discovery: Pt. 2</title>
		<link>http://yarcdata.com/blog/?p=439</link>
		<comments>http://yarcdata.com/blog/?p=439#comments</comments>
		<pubDate>Thu, 25 Apr 2013 17:55:53 +0000</pubDate>
		<dc:creator>adnan</dc:creator>
				<category><![CDATA[Big Data]]></category>

		<guid isPermaLink="false">http://yarcdata.com/blog/?p=439</guid>
		<description><![CDATA[In my last blog entry, I talked about why traditional relational database systems aren’t as flexible for data discovery and had introduced the concept of a graph as a better alternative for analytics. Before diving right into why graph databases &#8230; <a href="http://yarcdata.com/blog/?p=439">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img style="border-width: 1px;border-color: #000000;margin-left: 14px;margin-bottom: 14px" src="http://www.yarcdata.com/images/bloggers/adnan.jpg" alt="Adnan" width="100" align="right" border="1" />In my last blog entry, I talked about why traditional relational database systems aren’t as flexible for data discovery and introduced the concept of a graph as a better alternative for analytics. Before diving right into why graph databases are better suited for big data discovery, I want to take a moment to show why tabular (or even columnar) systems don’t lend themselves to relationship analytics.</p>
<p>Now let’s look at a simple transactional database. For the purpose of this simplified example, let’s assume an online company that sells apparel is looking for better-targeted ads to improve sales. The database stores the details of people, their addresses, what they purchased, as well as other pertinent information about the users and their transactions. The schema is pretty straightforward, as illustrated in Fig 1a.</p>
<p><strong>Fig 1a</strong><br />
<img class="alignnone size-full wp-image-442" title="img1" src="http://yarcdata.com/blog/wp-content/uploads/2013/04/img11.png" alt="" width="413" height="91" /></p>
<p>One could do several types of analytics on this data, like cluster analysis of a geographic region or frequency of purchase. With enough transactions from an individual, you could also do some basic consumer profiling. If you know what you’re looking for, the compact tabular representation is efficient storage-wise. However, this efficiency comes at a price &#8211; inflexibility. Let’s say your marketing manager thinks that adding social networking data might be useful. You could do it, but now you’d have to change your schema to add this extra data, this new “relationship,” as shown below.</p>
<p><strong>Fig 1b</strong><br />
<img class="alignnone size-full wp-image-441" title="img2" src="http://yarcdata.com/blog/wp-content/uploads/2013/04/img2.png" alt="" width="465" height="134" /></p>
<p>The marketing manager is hypothesis-testing: she might use this data for a while and then realize that adding geographic information would be useful too, since a marathon is going to take place in this town. This way, she can figure out which people and their friends live in proximity to each other and have purchased running shoes in the past. Adding this new information to the database would require yet another change to the schema. Anytime you decide to explore any additional “relationships,” you need to modify the schema. Changing a schema on a production system is a big deal; it can take several weeks and is not something that database admins take lightly. There are better ways of analyzing data without modifying your database, but the point I’m trying to illustrate here is that tables are by design compact; while they are great for certain things, discovering patterns is not one of those things…</p>
<p>Hence the need for, and emergence of, an alternative data representation format: graphs. Basically, a graph is made up of nodes that are connected to other nodes via edges. In a graph database, an edge represents the relationship between the data entities represented by the nodes. For simplicity, let’s start with just two nodes, one representing an individual’s name and the other his address. This is represented in text as what is known as a triple. Much like how a sentence in English is constructed with a subject, an object, and a predicate relating the two, a graph is also constructed using sentences (if you will) known as triples, as shown in Fig 2a. Visually, it’s more convenient to display this as shown in Fig 2b.</p>
<p><strong>Name, lives at, Address</strong></p>
<p><strong>Fig 2a</strong><br />
<img src="http://yarcdata.com/blog/wp-content/uploads/2013/04/img31.png" alt="" title="img3" width="179" height="59" class="alignnone size-full wp-image-455" /></p>
<p>Once you’ve grasped the simplicity of the model, it’s easy to get expressive with your data. Let’s look at how we might represent the above tabular data in a graph database.</p>
<p>Adnan, lives at, 123 Main St<br />
Adnan, bought, Running Shoes<br />
Adnan, paid, $120<br />
Adnan, gender, male<br />
And so on…</p>
<p>And graphically:</p>
<p><strong>Fig 2b</strong><br />
<img src="http://yarcdata.com/blog/wp-content/uploads/2013/04/Fig2b.png" alt="" title="Fig2b" width="588" height="166" class="alignnone size-full wp-image-453" /></p>
<p>What does putting data in this format afford us? Now let’s go back to our example of adding social media data as we did in the previous case. Since everything is already expressed as a relationship, adding new relationships is trivial. So adding friend information is as simple as adding that relationship between the nodes:</p>
<p>Adnan, friends with, John</p>
<p>John, friends with, Emma</p>
<p>And again, graphically, this would be represented as in Fig 3, with the new relationships in place:</p>
<p><strong>Fig 3</strong><br />
<img src="http://yarcdata.com/blog/wp-content/uploads/2013/04/Fig3.png" alt="" title="Fig3" width="588" height="199" class="alignnone size-full wp-image-456" /></p>
<p>The solid red lines in Fig 3 are the explicit relationships that were specified, and you can add more relationships just as easily as you need them. Just as tabular representations excel at space efficiency, graph databases excel at adding and exploring new relationships.</p>
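<p>For readers who think in code, the idea can be sketched with nothing more than an array of triples. This is a toy illustration, not how a graph database stores data; the helper names <code>match</code> and <code>friendsOfFriends</code> are made up for the example.</p>

```javascript
// A toy triple store: each fact is a [subject, predicate, object] array.
const triples = [
  ["Adnan", "lives at", "123 Main St"],
  ["Adnan", "bought", "Running Shoes"],
  ["Adnan", "paid", "$120"],
  ["Adnan", "gender", "male"],
];

// Adding a relationship is just appending a triple; there is no schema to change.
triples.push(["Adnan", "friends with", "John"]);
triples.push(["John", "friends with", "Emma"]);

// Pattern match: null acts as a wildcard in any position.
function match(store, s, p, o) {
  return store.filter(function (t) {
    return (s === null || t[0] === s) &&
           (p === null || t[1] === p) &&
           (o === null || t[2] === o);
  });
}

// Follow two "friends with" edges to find friends of a person's friends.
function friendsOfFriends(store, person) {
  const result = [];
  for (const [, , friend] of match(store, person, "friends with", null)) {
    for (const [, , fof] of match(store, friend, "friends with", null)) {
      result.push(fof);
    }
  }
  return result;
}
```

<p>Exploring a new kind of relationship, such as geography, would mean pushing more triples and writing another pattern, with the existing data left untouched.</p>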
<p>I skipped an important step so that I could show you how a graph database works. The nodes don’t have any meaning as such, but what if you could ascribe meaning to each node that distinguishes the nodes of the same class? All <strong>names</strong> would belong to a class of type NAME, addresses could belong to a class of type LOCATION, and so on. Essentially, this is defining the ontology of the data, or in other words, the data dictionary.</p>
<p>The beauty of ontologies is that they allow you to assign properties to certain node classes, and any node belonging to that class automatically inherits that property whether it is explicitly specified or not. For example, you could define an ontology where members of class NAME who buy running shoes are also of class RUNNERS. Members of class RUNNERS buy running apparel, so now you can <strong>infer</strong> that both Adnan and Emma might be interested in running apparel but not John. Inferencing is one more tool that can help enrich your data by introducing additional relationships. In Fig 3 above, even though I specified just two friendship relationships, I added twice as many new edges (shown dotted). From a marketer’s perspective, if Adnan is John’s friend, then the reverse is also true. So with friendships at least, you can infer the reverse relationship, but that need not be the case for every kind of relationship.</p>
<p>Ontologies are a complete topic unto themselves, but once properly defined, they are a very powerful concept and give graph analytics more flexibility, especially when it comes to constructing complex queries. Ontologies can be hierarchical (e.g., the LOCATION type can be expanded to include city, state and country, or to include latitude and longitude). Different ontologies may be needed based on how much you want to drill down into your data, and on what data you represent. However, several ready-made ontologies for fields like biology and financial services have been produced that can be easily modified.</p>
<p>I’d like to add one last item before I wind up this post. Many of you may be familiar with the Structured Query Language, or SQL, for querying relational databases; certain graph databases employ a similar query language named SPARQL<sup>1</sup>. For those of you familiar with SQL, it is simple and elegant. However, writing a very complex SQL query on a very large dataset can be trying, since you have to be familiar with the data layout and other aspects of the database if the performance is to be acceptable. Since data layout is not an issue with graph databases, SPARQL doesn’t suffer from many of the shortcomings of SQL, and offers more expressivity and ease when it comes to writing complex queries.</p>
<p>For the interested reader, a thorough discussion of both ontologies and SPARQL is available in Dean Allemang’s excellent text <em>Semantic Web for the Working Ontologist</em><sup>2</sup>. Dean also happens to be a guest blogger for us, by the way, and we’re really proud to have him sharing his thoughts with our readers.</p>
<p>This has been a more technical post than I’d normally prefer, but understanding some of the fundamental advantages of graph databases is crucial to seeing why graph analytics is a powerful and different platform. Rather than just asking you to take my word for it, I’ve tried to show why this is so.</p>
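<p>The two inferences described above can be sketched as simple rules applied to a list of triples. This is only an illustration of the idea, not how an ontology reasoner works; the function name <code>infer</code>, the predicate spellings, and the sample fact that Emma bought running shoes are assumptions.</p>

```javascript
// Illustrative single-pass rule application over [subject, predicate, object] triples.
// Neither rule depends on facts derived by the other, so one pass is enough here.
function infer(triples) {
  // Keys are joined with "|" and assume the values themselves contain no "|".
  const seen = new Set(triples.map(function (t) { return t.join("|"); }));
  const out = triples.slice();
  function add(s, p, o) {
    const key = [s, p, o].join("|");
    if (!seen.has(key)) { seen.add(key); out.push([s, p, o]); }
  }
  for (const [s, p, o] of triples) {
    // Rule 1: friendship is symmetric.
    if (p === "friends with") add(o, "friends with", s);
    // Rule 2: anyone who bought running shoes belongs to class RUNNERS.
    if (p === "bought" && o === "Running Shoes") add(s, "is a", "RUNNERS");
  }
  return out;
}

// Sample facts (Emma's purchase is assumed for the example).
const facts = [
  ["Adnan", "friends with", "John"],
  ["John", "friends with", "Emma"],
  ["Adnan", "bought", "Running Shoes"],
  ["Emma", "bought", "Running Shoes"],
];
const enriched = infer(facts);
```

<p>Here <code>enriched</code> keeps the four explicit facts and adds the two reverse friendships plus RUNNERS membership for Adnan and Emma, while John gains no such membership because no purchase was recorded for him.</p>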
<p>To reiterate, here are the three takeaways from this post:</p>
<ul>
<li>Relational databases are optimized for a compact representation, and do that very efficiently.</li>
<li>Graph databases, on the other hand, are very flexible when it comes to adding new relationships.</li>
<li>Graph databases are further enhanced by facilities like ontologies, inferencing, and more expressive query languages like SPARQL.</li>
</ul>
<p>This post focused on why a graph database makes it easy to add new relationships. Now if you’re thinking ahead, you’re probably wondering about the possibilities once your data is organized as a graph. It’s one thing to explicitly add new relationships you knew existed, but what about finding new relationships in your data that you weren’t aware of? That’s what we call <em>Discovery</em>, and I’ll say more about it in my next blog post. Until then, keep the comments and feedback coming.</p>
<div>
<hr align="left" size="1" width="33%" />
<div>
<p><sup>1</sup> <a href="http://www.w3.org/TR/rdf-sparql-query/">http://www.w3.org/TR/rdf-sparql-query/</a><br />
<sup>2</sup> <a href="http://www.amazon.com/Semantic-Web-Working-Ontologist-Second/dp/0123859654">http://www.amazon.com/Semantic-Web-Working-Ontologist-Second/dp/0123859654</a></p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://yarcdata.com/blog/?feed=rss2&#038;p=439</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
