This post describes a simple T-SQL solution for comparing multiple databases each-to-each, combinatorial style and visualizing the results in MS Excel. It comes very useful in projects where the codebase has to be cleaned up or unified for refactoring or a major upgrade.
- Identify groups of similar DBs
- Calculate their similarity score
- Display the results as a diagram
A database schema is comprised of "objects" like user tables, views, stored procedures or functions. They are all listed in
sys.objects table and their source code or definitions are stored in
sys.syscomments. We can use data from those two tables to compare one DB to another.
We'll need 2 tables: one to hold the list of objects and another to hold the numbers on how those objects are distributed across all the DBs.
First, create a table that holds all object names from all databases.
drop table if exists tDbObjectList GO create table tDbObjectList ( DbName nvarchar(255), ObjName nvarchar(255), TypeDesc nvarchar(255), ObjHash bigint )
Populating the list of objects from all DBs can be done with a very handy, but unofficial SP called
sp_MSforeachdb. It is not the most elegant and reliable method, but it works OK most of the time.
exec master.sys.sp_MSforeachdb 'USE [?]; insert into master.dbo.tDbObjectList select DB_NAME(), obj.name, obj.type_desc, sum(ABS(cast(checksum(com.text) as bigint))) from sys.objects obj left outer JOIN sys.syscomments com on obj.object_id = com.id where type_desc in (''VIEW'', ''SQL_TABLE_VALUED_FUNCTION'', ''SQL_STORED_PROCEDURE'', ''USER_TABLE'', ''SQL_SCALAR_FUNCTION'') and DB_NAME() not in (''model'', ''tempdb'', ''msdb'') group by obj.name, obj.type_desc'
The above code retrieves a list of objects of certain types from
sys.objects table and their source T-SQL code from
sys.syscomments table. Some system DBs are excluded.
master DB is included because it is quite common to create objects in it from SSMS by mistake.
We can use two different metrics to calculate the similarity score: object names match and T-SQL code match. The metrics are permanently stored in
drop table if exists tDbSimilarityScore GO create table tDbSimilarityScore ( ID int identity, -- row id in the table DB1 nvarchar(255), -- DB names for cross-comparing DB2 nvarchar(255), -- them in pairs Cnt1 int, -- total count of objects in DB1 Cnt2 int, -- total count of objects in DB2 Same int, -- how many object names they share NotIn2 int, -- how many object names are in DB1 and not in DB2 NotIn1 int, -- how many object names are in DB2 and not in DB1 HashSame int, -- how many shared objects with the same name have the same T-SQL code HashDif int, -- how many shared objects with the same name have different T-SQL code Score decimal(7,5), -- overall similarity score ScoreSqared decimal(7,5), -- the score from above squared ScoreHash decimal(7,5), -- T-SQL code similarity score ScoreHashSquared decimal(7,5) -- T-SQL score squared )
tDbSimilarityScore table is populated with N x N combinations of all DB names from
insert into tDbSimilarityScore (db1, db2) select distinct ol1.DbName, ol2.DbName from tDbObjectList ol1 join tDbObjectList ol2 on (ol1.DbName != ol2.DbName)
There will be some duplicate rows in an N x N matrix because db1/db2 pairs are symmetrical - comparing DB A with DB B is equivalent to comparing DB B with DB A. The following line of code deletes the duplication and reduces the number of records from N^2 to N!:
delete from tDbSimilarityScore from tDbSimilarityScore tss where exists (select 1 from tDbSimilarityScore tss2 where tss2.db1 = tss.db2 and tss2.db2 = tss.db1 and tss2.id < tss.id)
It is an optional step. Do not remove the duplication yet - we will need the full N x N matrix for a pivot table later.
The rest of the fields in
tDbSimilarityScore are populated with a sequence of UPDATE statements. It is a simpler and faster way than a single INSERT statement. The number of records involved is small enough not to worry about the performance.
- Update the total number of objects for DB1 and DB2:
update tDbSimilarityScore set Cnt1 = ta.Cnt from tDbSimilarityScore tss left outer join (select DbName, count(*) as Cnt from master.dbo.tDbObjectList group by DbName) as ta on (tss.DB1 = ta.DbName) update tDbSimilarityScore set Cnt2 = ta.Cnt from tDbSimilarityScore tss left outer join (select DbName, count(*) as Cnt from master.dbo.tDbObjectList group by DbName) as ta on (tss.DB2 = ta.DbName)
- Update the numbers for similarity:
update tDbSimilarityScore set Same = isnull(ts.Same,0) from tDbSimilarityScore tss left outer join (select ol1.DbName as DB1, ol2.DbName as DB2, count(*) as Same from tDbObjectList ol1 join tDbObjectList ol2 on ol1.DbName != ol2.DbName and ol1.ObjName = ol2.ObjName group by ol1.DbName, ol2.DbName) as ts on (tss.DB1 = ts.DB1 and tss.DB2 = ts.DB2) update tDbSimilarityScore set HashSame = isnull(ts.same,0) from tDbSimilarityScore tss left outer join (select ol1.DbName as DB1, ol2.DbName as DB2, count(*) as Same from tDbObjectList ol1 join tDbObjectList ol2 on ol1.DbName != ol2.DbName and ol1.ObjName = ol2.ObjName and ol1.ObjHash = ol2.ObjHash group by ol1.DbName, ol2.DbName) as ts on (tss.DB1 = ts.DB1 and tss.DB2 = ts.DB2)
- Update the numbers for dissimilarity
update tDbSimilarityScore set notin2 = cnt1-same, notin1 = cnt2 - same, hashdif = same - hashsame
- Calculate the overall score
update tDbSimilarityScore set Score = convert(decimal(13,3), (Same / ((convert(float, Cnt1) + convert(float, Cnt2))/2)) + (isnull(convert(float, HashSame)/ nullif(Same,0),0))) / 2.000 update tDbSimilarityScore set ScoreSqared =power( convert(decimal(13,3), (Same / ((convert(float, Cnt1) + convert(float, Cnt2))/2)) + (isnull(convert(float, Hashsame)/ nullif(Same,0),0))) / 2.000, 2) update tDbSimilarityScore set ScoreHash = convert(decimal(13,3), (isnull(convert(float, HashSame)/ nullif(Same,0),0))) update tDbSimilarityScore set ScoreHashSquared =power(convert(decimal(13,3), isnull(convert(float, HashSame)/ nullif(Same,0),0)), 2)
In this example we compared the main reporting database with other reporting and shared DBs.
select * from tDbSimilarityScore where db1 = 'reporting' order by db1, score desc
The correlation between all REPORTING_... DBs was quite high, but so was their code divergence. For example REPORTING and REPORTING_ROx have 19 objects with the same names, but different TSQL code. The other REPORTING_... DBs share approximately 1/2 of their object names with REPORTING, but 20% of them have differences in the source code. Some of the differences are there by design, but after a closer inspection we concluded that most of them can be merged onto a single code base.
The similarity and divergence between DBs can be visualised in an MS Excel matrix. This example uses a pivot table with conditional formatting. It takes only a couple of minutes to make.
- Copy results of
select * from tDbSimilarityScore order by db1, score descinto a blank Excel spreadsheet.
- Choose your comparison metric, e.g. Score, and delete the others. You should have 3 columns left: DB1, DB2, Score
- Create a pivot table in a new worksheet (click top menu Insert, then Pivot table on the far left)
- Drag column names (DB1, DB2, Score) inside the pivot table field panel to arrange them into a matrix.
- You should now see a matrix view similar to the one above, but with no colouring.
- Select all the cells with Score values inside the matrix and add colour, conditional on the cell's value (click on top menu Home, Conditional Formatting, Colour Scales)
In the following example we compared shared databases using different metrics:
- Score: how similar they are in their intent
- ScoreSquared: same as above, but more spread out
- ScoreHash: code divergence between similar DBs
- ScoreHashSquared: same as above, but more spread out
These simple colour graphs are only really useful for a quick visual overview. They helped us grasp the extent of the problem and highlighted areas of interest for further investigation. The rest of the work was done by trawling through the source code, not without help of more scripts. For example, much of the code divergence was due to differences in white space, casing and line endings. So something as simple as normalising all T-SQL code with
LOWER(REPLACE(REPLACE(REPLACE(com.text,' ',''),CHAR(10),''),CHAR(13),'')) before calculating the similarity score may save you hours down the track.
This post is based on my recent experience migrating a real estate management system with hundreds of MS SQL databases from on-prem to Azure SQL. Read my other articles for more learnings from that project.