DEV Community

Samuel Nitsche
Samuel Nitsche

Posted on • Originally published at cleandatabase.wordpress.com on

More Pitfalls of the MINUS comparison

When I did my presentation at APEX Connect, Erik van Roon explained another pitfall of my MINUS approach to compare the content of two views to me – something some people already tried to show me and failed (well, I failed getting the point).

Let’s start with the same situation as in the last example, a small list of Star Wars characters and the movie episodes in which they appear:

ID Character Episodes
1 Darth Vader 3, 4, 5, 6
2 Luke Skywalker 4, 5, 6, 7, 8
3 Rey 7, 8

Now let’s assume we are forced to do overwork by the sithlord in charge, it’s 3 am, we are terribly tired and create a new view but miss the group by statement:

create or replace view all_movie_characters_3
  as
  select
    sw_char.id,
    sw_char.name
  from
    star_wars_characters sw_char
    inner join appearance_in_episode ep
      on sw_char.id = ep.character_fk;
Enter fullscreen mode Exit fullscreen mode

Or course this will not return the expected result:

ID Character
1 Darth Vader
1 Darth Vader
1 Darth Vader
1 Darth Vader
2 Luke Skywalker
2 Luke Skywalker
2 Luke Skywalker
2 Luke Skywalker
3 Rey
3 Rey

But if we apply our MINUS comparison, we will get no results, indicating that both of the views are equal:

(
  select * from all_movie_characters
  minus
  select * from all_movie_characters_3
)
union all
(
  select * from all_movie_characters_3
  minus
  select * from all_movie_characters
);
-- No results
Enter fullscreen mode Exit fullscreen mode

The problem is, that MINUS removes all appearances of equal rows. Therefore, it’s only useful when we can be sure that each row is exactly identifiable by its contents.

You could put it like this: MINUS comparison works if select * from ... of a query is equal to select distinct * from ... – which is not the case in our example:

select 'distinct' type, count(*)
from (
       select distinct *
         from all_movie_characters_3
     )
union all
select 'all', count(*) from all_movie_characters_3;
Enter fullscreen mode Exit fullscreen mode
TYPE COUNT(*)
distinct 3
all 10

The solution Erik suggested, was to add the count over all columns to the comparison:

with old as (
  select id, name, count(*) number_of_equals
    from all_movie_characters
    group by id, name
  ),
  new as (
    select id, name, count(*) number_of_equals
    from all_movie_characters_3
    group by id, name
  )
(
  select * from old
  minus
  select * from new
)
union all
(
  select * from new
  minus
  select * from old
);
Enter fullscreen mode Exit fullscreen mode

The disadvantage here is, that we can’t use * notation anymore, something which I always found to be very convenient for quick comparisons, especially when comparing views with many columns. If you want to be sure, though, there’s no way to avoid investing the additional time.

Alternative approaches

There’s a pretty famous approach to comparing the contents of two queries by Tom Kyte:

select id, name, sum(old_cnt), sum(new_cnt)
from (
  select id, name, 1 old_cnt, 0 new_cnt
    from all_movie_characters source
  union all
  select id, name, 0 old_cnt, 1 new_cnt
    from all_movie_characters_3 target
)
group by id, name
having sum(old_cnt) != sum(new_cnt);
Enter fullscreen mode Exit fullscreen mode
ID NAME SUM(OLD_CNT) SUM(NEW_CNT)
2 Luke Skywalker 1 4
3 Rey 1 2
1 Darth Vader 1 4

Besides being very elegant and pretty performant, one really cool thing about this approach is that you can easily see which rows are different and where the difference lies.

The downside for me is that it’s quite some amount of typing and that what’s happening is not so obvious for the untrained reader (like me). Both of these can be worked around, though.

Stewart Ashton wrote a lot about that topic and even came up with a helper package that makes it much more convenient to compare two query results. You should check out his list of blog posts.

The complete example is available on LiveSQL and github.

So, can the simple MINUS comparison still be useful?

I think it can, for quick comparisons of familiar data. However, it’s good to know the limitations and the alternatives. As a general rule I’d say: Better go with a solution that requires more effort but contains less pitfalls. But there are always exceptions to the rules.

Top comments (0)