This is part of series of posts about associative grouping:
In part one of this series we looked at how we could use recursive CTE’s to find overlapping groups in two columns in a table, in order to put them into a new super group of associated groups.
Since I wrote that post, SQL Server 2019 CTP 3.1 has been released, and with it comes some enhancements to the graph processing functionality, namely the new
To use this code yourself you will need to be using SQL Server 2019 CTP 3.1 or later. The simplest way to play with the CTP releases is to use Docker. Here is a good guide to getting setup by DBA From The Cold. The images are available on the Docker Hub, just make sure you pull this image.
Once we are setup, we can load some sample data into a table like in the last post. The difference here is we will use the
AS NODE syntax to make the table act as the nodes in a graph, and a extra columns
$node_id is automatically added.
-- Nodes DROP TABLE IF EXISTS SuppliersGraph; CREATE TABLE SuppliersGraph ( Id INT IDENTITY(1,1) NOT NULL ,SupplierName VARCHAR(30) NOT NULL ,TaxNumber INT NOT NULL ,BankSortCode CHAR(8) NOT NULL ,BankAccountNumber INT NOT NULL )AS NODE; INSERT INTO SuppliersGraph (SupplierName, TaxNumber, BankSortCode, BankAccountNumber) VALUES ('AdventureWorks Ltd.', 12345678, '02-11-33', 12345678) ,('AdventureWorks', 12345678, '02-55-44', 15161718) ,('ADVENTURE WORKS', 23344556, '02-55-44', 15161718) ,('ADVENTURE WORKS LTD.', 23344556, '02-77-66', 99887766) ,('AW Bike Co', 23344556, '02-88-00', 11991199) ,('Big Bike Discounts (AW)', 55556666, '02-88-00', 11991199) ,('Contoso Corp', 90001000, '02-99-02', 12341234);
And here is what the data looks like, coloured to show the chain of links between the rows.
|Id||Supplier Name||Tax Number||Bank Sort Code||Bank Account Number||Required Output|
|4||DVENTURE WORKS LTD.||23344556||02-77-66||99887766||1|
|5||AW Bike Co||23344556||02-88-00||11991199||1|
|6||Big Bike Discounts||55556666||02-88-00||11991199||1|
This produces the same simple disconnected graphs as last time. We have one graph with 6 nodes, and another graph with a single node.
Now we need to query the table, and use the groups of tax numbers, and the groups of bank details to build a table of the graph edges.
-- Edges DROP TABLE IF EXISTS SupplierEdges; CREATE TABLE SupplierEdges ( LinkType VARCHAR(100) NOT NULL )AS EDGE; WITH CTE_Edges AS (-- Add the tax number links SELECT s1.Id AS FromId ,s2.Id AS ToId ,'TaxNumber' AS LinkType FROM SupplierNodes AS s1 INNER JOIN SupplierNodes AS s2 ON s1.TaxNumber = s2.TaxNumber UNION -- And the bank account links SELECT s1.Id AS FromId ,s2.Id AS ToId ,'BankSortCode, BankAccountNumber' AS LinkType FROM SupplierNodes AS s1 INNER JOIN SupplierNodes AS s2 ON s1.BankSortCode = s2.BankSortCode AND s1.BankAccountNumber = s2.BankAccountNumber ) INSERT INTO SupplierEdges ($from_id, $to_id, LinkType) SELECT (SELECT $node_id FROM SupplierNodes WHERE Id = e.FromId) AS FromId ,(SELECT $node_id FROM SupplierNodes WHERE Id = e.ToId) AS ToId ,STRING_AGG(e.LinkType, ' / ') FROM CTE_Edges AS e GROUP BY e.FromId ,e.ToId; SELECT * FROM SupplierEdges;
Associating the Groups
Now that we have the data, we can use the new SHORTEST_PATH function to traverse the graph. The documentation examples use the function to find the shortest path from one node to another. In our case, we want to find all the paths between all connected nodes. This would be a bad idea in a huge graph, such as all the users on LinkedIn, but as this scenario deals with the idea of lots of smaller, disconnected graphs which we want to use to groups nodes, the approach should be fine. Here is the ode to navigate the graph.
SELECT s1.Id ,CAST(s1.Id AS VARCHAR(10)) + ':' + STRING_AGG(s2.Id, ':') WITHIN GROUP (GRAPH PATH) AS GraphPath ,COUNT(s2.Id) WITHIN GROUP (GRAPH PATH) AS LevelCount FROM SupplierNodes AS s1 ,SupplierEdges FOR PATH AS lnk ,SupplierNodes FOR PATH AS s2 -- The SHORTEST_PATH() function is brand new in SQL Server 2019 CTP3.1 WHERE MATCH (SHORTEST_PATH ( s1(-(lnk)->s2)+) )
Note: If we wanted to limit the number of nodes traversed, we can as described in the docs.
The output from this looks like this (abbreviated for readability)
Id column represents the start node for the graph iteration. The
GraphPath column uses the
STRING_AGG function to build a colon delimited list of all of the nodes visited as the graph was traversed.
We can take this output, and use the
STRING_SPLIT function to unpivot this path data. If we group on the node id returned from the string split, then
MIN(Id) will give us the lowest Id of all the nodes in the graph. As nodes are only a member of a single graph, this produces the associative grouping.
WITH CTE_Links AS (SELECT s1.Id ,CAST(s1.Id AS VARCHAR(10)) + ':' + STRING_AGG(s2.Id, ':') WITHIN GROUP (GRAPH PATH) AS GraphPath ,COUNT(s2.Id) WITHIN GROUP (GRAPH PATH) AS LevelCount FROM SupplierNodes AS s1 ,SupplierEdges FOR PATH AS lnk ,SupplierNodes FOR PATH AS s2 -- The SHORTEST_PATH() function is brand new in SQL Server 2019 CTP3.1 WHERE MATCH (SHORTEST_PATH ( s1(-(lnk)->s2)+) ) ) ,CTE_Groups AS (SELECT lnk.value AS Id ,MIN(lnks.Id) AS GroupId FROM CTE_LINKS AS lnks CROSS APPLY STRING_SPLIT(lnks.GraphPath, ':') AS lnk GROUP BY lnk.value ) SELECT s.* ,DENSE_RANK() OVER (ORDER BY g.GroupId) AS GraphId FROM CTE_Groups AS g INNER JOIN SupplierNodes AS s ON s.Id = g.Id ORDER BY s.Id;
The output of this gives the same answers as the recursive CTE method.
|Id||SupplierName||Tax Number||Bank Sort Code||Bank Account Number||Graph Group||Required Output|
|4||ADVENTURE WORKS LTD.||23344556||02-77-66||99887766||1||1|
|5||AW Bike Co||23344556||02-88-00||11991199||1||1|
|6||Big Bike Discounts (AW)||55556666||02-88-00||11991199||1||1|
So how does this perform compared to the recursive CTE approach? Well for this simple set of data the performance was pretty much identical at around 16ms. But changing the sample data to add a few more nodes in the existing groups increases the edge count quite a bit.
INSERT INTO SupplierNodes (SupplierName, TaxNumber, BankSortCode, BankAccountNumber) VALUES ('AdventureWorks Ltd.', 12345678, '02-11-33', 12345678) ,('AdventureWorks Limited', 12345678, '02-54-44', 32165487) ,('AdventureWorks', 12345678, '02-55-44', 15161718) ,('AW Limited', 98768765, '02-55-44', 15161718) ,('ADVENTURE WORKS', 98765432, '02-55-44', 15161718) ,('AW Bike Co', 98765432, '02-66-00', 11991199) ,('AW Bike Co', 12345678, '02-66-00', 11991199) ,('Big Bike Discounts (AW)', 55556666, '02-66-00', 11991199) ,('Contoso Corp', 90001000, '02-99-02', 12341234);
Which gives us a graph that looks like this.
Comparing the performance when processing these two graphs made much more difference. Running in a small Docker container, the CTE approach took almost 13 seconds, where the graph approach took only 42ms.
As I get time in the future I’d like to look at the performance of these solutions, particularly to try and gauge how the approaches scale as we increase the number of rows (and therefore disconnected graphs) in the table. We will also consider how performance changes as we make our graphs:
- Larger (more hops)
- Cyclic (more nodes connected in circles, blue and green in the above graph)
- Complex (more interconnections between nodes, yellow in the above graph)
It will also be interesting to look at what options there are for for people using Azure with the Data Lakehouse architecture (Databricks & synapse). Recursive CTE’s and tSQL graph processing are not supported in Synapse SQL Pools, but spark has support for graph processing using the GraphFrames library.
For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.
In the meantime the demo code for this post is available on this Github Gist, and requires SQL Server 2019 CTP 3.1 or higher to run.