In part one of this series we looked at how we could use recursive CTE’s toto find overlapping groups in two columns in a table, in order to put them into a new super group of associated groups.

Since I wrote that post, SQL Server 2019 CTP 3.1 has been released, and with it comes some enhancements to the graph processing functionality, namely the new SHORTEST_PATH() function.

To use this code yourself you will need to be using SQL Server 2019 CTP 3.1 or later. The simplest way to play with the CTP releases is to use Docker. Here is a good guide to getting setup by DBA From The Cold. The images are available on the Docker Hub, just make sure you pull this image.

Demo Data

Once we are setup, we can load some sample data into a table like in the last post. The difference here is we will use the AS NODE syntax to make the table act as the nodes in a graph, and a extra columns $node_id is automatically added.

And here is what the data looks like, coloured to show the chain of links between the rows.

IdSupplier NameTax NumberBank Sort CodeBank Account NumberRequired Output
1AdventureWorks Ltd.1234567802-11-33123456781
2AdventureWorks1234567802-55-44151617181
3ADVENTURE WORKS2334455602-55-44151617181
4DVENTURE WORKS LTD.2334455602-77-66998877661
5AW Bike Co2334455602-88-00119911991
6Big Bike Discounts5555666602-88-00119911991
7Contoso Corp9000900002-99-02123412342

This produces the same simple disconnected graphs as last time. We have one graph with 6 nodes, and another graph with a single node.

The Nodes are numbered using the Id’s of the rows

Now we need to query the table, and use the groups of tax numbers, and the groups of bank details to build a table of the graph edges.

Associating the Groups

Now that we have the data, we can use the new SHORTEST_PATH function to traverse the graph. The documentation examples use the function to find the shortest path from one node to another. In our case, we want to find all the paths between all connected nodes. This would be a bad idea in a huge graph, such as all the users on LinkedIn, but as this scenario deals with the idea of lots of smaller, disconnected graphs which we want to use to groups nodes, the approach should be fine. Here is the ode to navigate the graph.

Note: If we wanted to limit the number of nodes traversed, we can as described in the docs.

The output from this looks like this (abbreviated for readability)

IdGraphPathLevelCount
11:11
11:21
22:11
~~~~~~~~~~~~~~~~
22:3:52
33:2:12
33:5:62
~~~~~~~~~~~~~~~~
66:5:3:23
11:2:3:5:64
66:5:3:2:14

The Id column represents the start node for the graph iteration. The GraphPath column uses the STRING_AGG function to build a colon delimited list of all of the nodes visited as the graph was traversed.

We can take this output, and use the STRING_SPLIT function to unpivot this path data. If we group on the node id returned from the string split, then MIN(Id) will give us the lowest Id of all the nodes in the graph. As nodes are only a member of a single graph, this produces the associative grouping.

The output of this gives the same answers as the recursive CTE method.

IdSupplierNameTax NumberBank Sort CodeBank Account NumberGraph GroupRequired Output
1AdventureWorks Ltd.1234567802-11-331234567811
2AdventureWorks1234567802-55-441516171811
3ADVENTURE WORKS2334455602-55-441516171811
4ADVENTURE WORKS LTD.2334455602-77-669988776611
5AW Bike Co2334455602-88-001199119911
6Big Bike Discounts (AW)5555666602-88-001199119911
7Contoso Corp9000100002-99-021234123422

Performance

So how does this perform compared to the recursive CTE approach? Well for this simple set of data the performance was pretty much identical at around 16ms. But changing the sample data to add a few more nodes in the existing groups increases the edge count quite a bit.

Which gives us a graph that looks like this.

Comparing the performance when processing these two graphs made much more difference. Running in a small Docker container, the CTE approach took almost 13 seconds, where the graph approach took only 42ms.

Next Steps…

In the last post of this series we will look at performance. In particular we will try and gauge how the approaches scale as we increase the number of rows (and therefore disconnected graphs) in the table. We will also consider how performance changes as we make our graphs:

  • Larger (more hops)
  • Cyclic (more nodes connected in circles, blue and green in the above graph)
  • Complex (more interconnections between nodes, yellow in the above graph)

For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.

In the meantime the demo code for this post is available on this Github Gist, and requires SQL Server 2019 CTP 3.1 or higher to run.


1 Comment

Associative Grouping using tSQL (Part 1) – SQL Smarts · July 5, 2019 at 4:20 pm

[…] part 2 we will also look at the new graph DB features added to SQL 2019 CTP3.1, and see how they can help […]

Leave a Reply

Your email address will not be published. Required fields are marked *