Mar 6, 2008

New INTERSECT and EXCEPT operators in Sql Server 2005 T-SQL

INTERSECT

Let say you have a products table. I’m going to use the products table from the AdventureWorks database. Now let’s say you imported a list of products into the database into a table called productduplicates. Just for a reason, let’s say you manufacturer sent you a full list of products, and you need to see which products you already have in your database.

Normally, if you didn’t have matching primary keys on both tables, like a productId, you would do an inner join on some columns that uniquely identify the rows, as such:



INNER JOINselect a.[name], a.productnumber
from production.product a
inner join production.productduplicates b
on a.[name] = b.[name]
and a.productnumber = b.productnumber


Execution plan for inner join

Hash Match(Inner Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name], [a].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as [a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as [b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber] as [a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber] as [b].[ProductNumber]))
--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b]))
--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))

Now you can replace this with a simple INTERSECT statement like the following:

INTERSECTselect [name], productnumber from production.product
intersect
select [name], productnumber from production.productduplicates


Execution plan for INTERSECT

Hash Match(Right Semi Join, HASH:([AdventureWorks].[Production].[productduplicates].[Name], [AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name], [AdventureWorks].[Production].[Product].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates]))
--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))

Just like if you were using a UNION for two select statements, the same number of columns must exist in both select statements, and they must be of the same datatype. In both above examples, you get a list of products that exist in both tables, so you know which products from the manufacturer you already have listed.

I haven’t seen any performance difference in the two, but I’ve only played with this on small tables. The execution plan does show that while you are doing a hash match inner join on the first query, the second query performs a hash match right semi join. Both hash matches perform 50% of the work in the query. The other 50% is split evenly between the table scan and index scan in both queries. So what does the fact that a right semi match is performed when using INTERSECT? I tells me its doing the same thing as if we were to use the EXISTS, like so:


EXISTS

select a.[name], a.productnumber
from production.product a
where exists(select b.[name], b.productnumber from production.productduplicates b
where a.[name] = b.[name]
and a.productnumber = b.productnumber)

EXCEPT

Same scenario, if you wanted to know which products you have that do not exist in the duplicates table, you would normally do a NOT EXISTS like so:


NOT EXISTS

select a.[name], a.productnumber
from production.product a
where not exists(select b.[name], b.productnumber from production.productduplicates b where a.[name] = b.[name]
and a.productnumber = b.productnumber)


Execution plan for NOT EXISTS

Hash Match(Right Anti Semi Join, HASH:([b].[Name], [b].[ProductNumber])=([a].[Name], [a].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name] as [a].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] as [b].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber] as [a].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber] as [b].[ProductNumber]))
--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates] AS [b]))
--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID] AS [a]))


And you get back the list of products that you have that aren’t in the duplicates list. You can flip this around to figure out which products are in the duplicates that you don’t have in your products.

Let’s simply the code a little bit by using the EXCEPT operator. This looks identical to the INTERSECTS query:
EXCEPT

select [name], productnumber from production.product
except
select [name], productnumber from production.productduplicates

Execution plan for except

Hash Match(Right Anti Semi Join, HASH:([AdventureWorks].[Production].[productduplicates].[Name], [AdventureWorks].[Production].[productduplicates].[ProductNumber])=([AdventureWorks].[Production].[Product].[Name], [AdventureWorks].[Production].[Product].[ProductNumber]), RESIDUAL:([AdventureWorks].[Production].[Product].[Name]=[AdventureWorks].[Production].[productduplicates].[Name] AND [AdventureWorks].[Production].[Product].[ProductNumber]=[AdventureWorks].[Production].[productduplicates].[ProductNumber]))
--Table Scan(OBJECT:([AdventureWorks].[Production].[productduplicates]))
--Clustered Index Scan(OBJECT:([AdventureWorks].[Production].[Product].[PK_Product_ProductID]))


Again, can’t find a performance difference. In all queries shows, the reads and duration all average out to be virtually identical. You’d have to test this on some large tables (high number of rows) to figure out if there is a performance difference at all.

Looking at the execution plans for everything, it doesn’t seem that there is a any gain from using INTERSECT and EXCEPT over JOINS or EXISTS/NOT EXISTS except for one big thing: Much simpler code!