This is part of series of posts about associative grouping:
Recently I was asked by a friend to have a look at an interesting query problem he had been looking at. He was trying to find overlapping groups in two columns, in order to put them into a new super group of associated groups. The simplest way to describe the problem is with an demo, so off we go…
In our scenario, we have a that company has grown through acquisitions, with each one bringing a new set of suppliers. Some of those supplier lists have overlapped, hence the need to cleanse that data as it is consolidated into a single system. However, as is often the case in this situation, the duplicate suppliers do not match on the natural business keys. Lets dive straight into an example…
Here’s the SQL to create the sample table.
CREATE TABLE Suppliers ( Id INT IDENTITY(1,1) NOT NULL ,SupplierName VARCHAR(30) NOT NULL ,TaxNumber INT NOT NULL ,BankSortCode CHAR(8) NOT NULL ,BankAccountNumber INT NOT NULL ); INSERT INTO Suppliers (SupplierName, TaxNumber, BankSortCode, BankAccountNumber) VALUES ('AdventureWorks Ltd.', 12345678, '02-11-33', 12345678) ,('AdventureWorks', 12345678, '02-55-44', 15161718) ,('ADVENTURE WORKS', 23344556, '02-55-66', 15161718) ,('ADVENTURE WORKS LTD.', 23344556, '02-77-44', 99887766) ,('AW Bike Co', 23344556, '02-88-00', 11991199) ,('Big Bike Discounts', 55556666, '02-88-00', 11991199) ,('Contoso Corp', 90009000, '02-99-02', 12341234);
And here’s what the data looks like. The matching tax numbers and bank account details are highlighted, and the final column is added to show the required result.
|Id||Supplier Name||Tax Number||Bank Sort Code||Bank Account Number||Required Output|
|4||DVENTURE WORKS LTD.||23344556||02-77-66||99887766||1|
|5||AW Bike Co||23344556||02-88-00||11991199||1|
|6||Big Bike Discounts||55556666||02-88-00||11991199||1|
We have a table containing supplier names, tax numbers, and bank details for invoicing. It is desired to group the data such that if any records have the same tax number or bank details they are added to a group. In the example above, rows 1 to 5 need to be in the same group, even though row 1 has completely different tax and bank details to rows 4 and 5! Row 6 is for a completely different supplier, and as such should be in it’s own group.
Preparing the Data
We can write a query to add two new columns to that data to hold group numbers, one for the tax number, and another for the combination of bank sort code and account number using the
DENSE_RANK() window function. This will simplify the logic of our algorithm as we move forward.
SELECT Id ,SupplierName ,TaxNumber ,BankSortCode ,BankAccountNumber ,DENSE_RANK() OVER (ORDER BY TaxNumber) AS TaxGroup ,10 + DENSE_RANK() OVER (ORDER BY BankSortCode, BankAccountNumber) AS BankGroup FROM Suppliers ORDER BY Id;
|Note: Adding 10 to the dense rank is just to give different numbers for the two groups in the examples for clarity.|
The result set looks like this.
|Id||Supplier Name||Tax Number||Bank Sort Code||Bank Account Number||Tax Group||Bank Group|
|4||ADVENTURE WORKS LTD.||23344556||02-77-66||99887766||2||13|
|5||AW Bike Co||23344556||02-99-00||11991199||2||14|
|6||Big Bike Discounts||55556666||02-99-00||11991199||3||14|
Adding the new group columns means we have a single column that represents the multi-column groups of the raw data, and we have a nice small data type to work with instead of strings.
The colour coding of those columns shows how the goops on the two columns act like links of a chain, linking the rows together. This is what a graph database would store – the individual records are nodes, linked together by the group numbers or edges.
Now we can define the problem more clearly. We are looking at a number of disconnected graphs, and we want to assign a unique group number to each of those individual graphs. The associations between the node of the graph are made by the nodes having either the same tax or bank account details. Here is the same data drawn as a graph.
Note: Unfortunately the new graph features added to SQL Server 2017 are quite limited, and are unable to help us solve this particular problem.
Instead we will solve the problem using the tools available in tSQL. We start by using the last query to build a query to to define how a bank account group links two different tax groups together.
WITH CTE_Groups AS (SELECT s.Id ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup FROM Suppliers AS s ) SELECT g1.TaxNumberGroup ,g1.BankAccountGroup ,g2.TaxNumberGroup AS TaxNumberGroupLink ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge FROM CTE_Groups AS g1 LEFT JOIN CTE_Groups AS g2 ON g1.BankAccountGroup = g2.BankAccountGroup AND g1.TaxNumberGroup <> g2.TaxNumberGroup GROUP BY g1.TaxNumberGroup ,g1.BankAccountGroup ,g2.TaxNumberGroup;
And here are the results, colour coded to match the previous tables.
This has abstracted out the links between tax groups and bank account groups from the actual entities that have those attributes. Where we have a pair of rows that represent directed edges in a graph, we have marked a single one as being a duplicate. This is important as we recursively walk the graph, as we don’t want to walk the same path in different directions. We must however include both directions, or we may not be able to walk all the possible routes through the graph.
Associating the Groups
We can build a recursive CTE (common table expression) to walk through this structure, and assign unique group numbers to the distinct graphs.
WITH CTE_Groups AS (SELECT s.Id ,s.SupplierName ,s.TaxNumber ,s.BankSortCode ,s.BankAccountNumber ,DENSE_RANK() OVER (ORDER BY s.TaxNumber) AS TaxNumberGroup ,10 + DENSE_RANK() OVER (ORDER BY s.BankSortCode, s.BankAccountNumber) AS BankAccountGroup FROM Suppliers AS s ) ,CTE_Links AS (SELECT g1.TaxNumberGroup ,g1.BankAccountGroup ,g2.TaxNumberGroup AS TaxNumberGroupLink ,( CASE WHEN g1.TaxNumberGroup > g2.TaxNumberGroup THEN 1 ELSE 0 END ) AS IsDupeEdge ,CAST( ':' + CAST(g1.TaxNumberGroup AS VARCHAR(10)) + ':' + CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS FullPath ,CAST( CAST(g1.BankAccountGroup AS VARCHAR(10)) + ':' + CAST(g2.TaxNumberGroup AS VARCHAR(10)) + ':' AS VARCHAR(MAX) ) AS SubPath FROM CTE_Groups AS g1 LEFT JOIN CTE_Groups AS g2 ON g1.BankAccountGroup = g2.BankAccountGroup AND g1.TaxNumberGroup <> g2.TaxNumberGroup GROUP BY g1.TaxNumberGroup ,g1.BankAccountGroup ,g2.TaxNumberGroup ) ,RCTE_PathWalker AS (SELECT l.TaxNumberGroup ,l.BankAccountGroup ,l.TaxNumberGroupLink ,l.TaxNumberGroup AS GraphGroup -- Reuse as unique per disconnected graph ,1 AS IterationCount ,l.FullPath FROM CTE_Links as l WHERE l.IsDupeEdge = 0 UNION ALL SELECT c.TaxNumberGroup ,c.BankAccountGroup ,c.TaxNumberGroupLink ,p.GraphGroup ,p.IterationCount + 1 ,p.FullPath + c.SubPath FROM RCTE_PathWalker AS p INNER JOIN CTE_Links AS c ON c.TaxNumberGroup = p.TaxNumberGroupLink -- Guard against circular paths through the graph AND p.FullPath NOT LIKE '%' + c.FullPath + '%' -- Another gaurd against circular paths through the graph WHERE p.IterationCount < 100 ) SELECT s.Id ,s.SupplierName ,s.TaxNumber ,s.BankSortCode ,s.BankAccountNumber ,g.GraphGroup FROM CTE_Groups AS s LEFT JOIN (SELECT pw.TaxNumberGroup ,DENSE_RANK() OVER (ORDER BY MIN(pw.GraphGroup)) AS GraphGroup FROM RCTE_PathWalker AS pw GROUP BY pw.TaxNumberGroup )AS g ON g.TaxNumberGroup = s.TaxNumberGroup;
The first two CTE’s are the bits we already covered. In the recursive CTE we use the
IsDupeEdge to filter filter out the rows that would make us search the same path from both directions. We then recursively join to the next link in the chain, checking the path of the child is not already in the full path of the parent to prevent following circular chains.
The output of the recursive part looks like this.
|Tax Number Group||Bank Account Group||Tax Number Group Link||Graph Group||Iteration Count||Full Path|
The final part then simply groups by the tax group number we calculated at the beginning, using a
DENSE_RANK() to normalise the final graph group numbers. This is joined to the result of the first CTE, which gives us the original table, with the GraphGroup, which associates the groups of tax numbers and bank accounts.
|Id||SupplierName||Tax Number||Bank Sort Code||Bank Account Number||Graph Group||Required Output|
|4||ADVENTURE WORKS LTD.||23344556||02-77-66||99887766||1||1|
|5||AW Bike Co||23344556||02-88-00||11991199||1||1|
|6||Big Bike Discounts (AW)||55556666||02-88-00||11991199||1||1|
The final output, with the required output column from the introduction added.
In part 2 we will also look at the new graph DB features added to SQL 2019 CTP3.1, and see how they can help solve the problem, and how the performance compares to the recursive CTE approach.
We have seen how we can associate two different groupings applied to a set of data. However, using a recursive method can sometimes perform quite slow when using larger data sets. In part 3 we will have a look at how the performance scales with more data, and different complexity graphs.
For anyone wanting to learn more about graph databases in general, VisualGo has some great tutorials.