Jimmy Yeung

Check if a file is a subset of another file using bash script

Scenario

I need to check in a CI pipeline whether one file is a subset of another, meaning that every line of the first file also appears in the second. I chose a bash script because it is fast and needs no extra dependencies installed in the CI pipeline.

  1. diff
    The first command that came to mind was diff, a powerful command that reports the differences between two files.

    However, it is too powerful for this job. diff works out which lines need to change to make the two files identical, which is unnecessary for my use case.

    E.g. (example from GeeksforGeeks)

    $ cat a.txt
    Gujarat
    Uttar Pradesh
    Kolkata
    Bihar
    Jammu and Kashmir
    
    $ cat b.txt
    Tamil Nadu
    Gujarat
    Andhra Pradesh
    Bihar
    Uttar pradesh
    
    $ diff a.txt b.txt
    0a1
    > Tamil Nadu
    2,3c3
    < Uttar Pradesh
    < Kolkata
    ---
    > Andhra Pradesh
    5c5
    < Jammu and Kashmir
    ---
    > Uttar pradesh
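To illustrate why diff alone does not answer the subset question, here is a minimal sketch. The file names and contents are made up for this example. diff exits with status 1 whenever the two files differ, even when one is a strict subset of the other, so the exit status alone cannot separate the two cases.

```shell
# Hypothetical sample files: small.txt is a strict subset of big.txt.
workdir=$(mktemp -d)
printf 'Bihar\nGujarat\n' > "$workdir/small.txt"
printf 'Bihar\nGujarat\nKolkata\n' > "$workdir/big.txt"

# diff exits 0 when the files are identical and 1 when they differ,
# so a subset and an unrelated file both end up with status 1.
diff_status=0
diff "$workdir/small.txt" "$workdir/big.txt" > /dev/null || diff_status=$?
echo "diff exit status: $diff_status"
```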
    
  2. comm
    Without digging further into diff, I found another command, comm, which is simple and fits my use case.

    comm compares two sorted files line by line and prints three columns:

    • the first column contains lines present only in the 1st file
    • the second column contains lines present only in the 2nd file
    • the third column contains lines common to both files

    E.g. (example from GeeksforGeeks)

    # contents of file1
    $ cat file1.txt
    Apaar
    Ayush Rajput
    Deepak
    Hemant
    
    # contents of file2
    $ cat file2.txt
    Apaar
    Hemant
    Lucky
    Pranjal Thakral
    
    $ comm file1.txt file2.txt
                    Apaar
    Ayush Rajput
    Deepak
                    Hemant
            Lucky
            Pranjal Thakral
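As a rough sketch of how the three columns can be picked apart, the snippet below recreates the two sample files in a temporary directory and captures each column separately. The variable names are only for illustration. Each digit passed to comm suppresses the column with that number, so -23 leaves column 1, -13 leaves column 2, and -12 leaves column 3.

```shell
# Recreate the sample files from the example above (already sorted).
workdir=$(mktemp -d)
printf 'Apaar\nAyush Rajput\nDeepak\nHemant\n' > "$workdir/file1.txt"
printf 'Apaar\nHemant\nLucky\nPranjal Thakral\n' > "$workdir/file2.txt"

# Each digit suppresses the column with that number.
only_in_file1=$(comm -23 "$workdir/file1.txt" "$workdir/file2.txt")
only_in_file2=$(comm -13 "$workdir/file1.txt" "$workdir/file2.txt")
in_both=$(comm -12 "$workdir/file1.txt" "$workdir/file2.txt")

printf 'Only in file1:\n%s\n\n' "$only_in_file1"
printf 'Only in file2:\n%s\n\n' "$only_in_file2"
printf 'In both:\n%s\n' "$in_both"
```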
    

    To check whether one file is a subset of another, we only need the 1st column: if it is empty, every line of the first file also appears in the second. The -23 option suppresses the 2nd and 3rd columns, i.e.

    comm -23 file1.txt file2.txt
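A small sketch of that idea, using made-up files (subset.txt, not_subset.txt and superset.txt are hypothetical names, and all three are already sorted): the output of comm -23 is empty exactly when the first file contributes no lines of its own.

```shell
workdir=$(mktemp -d)
printf 'Apaar\nHemant\n' > "$workdir/subset.txt"
printf 'Apaar\nDeepak\n' > "$workdir/not_subset.txt"
printf 'Apaar\nHemant\nLucky\n' > "$workdir/superset.txt"

# Empty result: every line of subset.txt also appears in superset.txt.
extra_ok=$(comm -23 "$workdir/subset.txt" "$workdir/superset.txt")

# Non-empty result: "Deepak" is missing from superset.txt.
extra_bad=$(comm -23 "$workdir/not_subset.txt" "$workdir/superset.txt")

echo "subset.txt extras: '${extra_ok}'"
echo "not_subset.txt extras: '${extra_bad}'"
```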
    

Conclusion

In the end, I settled on this simple bash script to check the subset condition:

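#!/bin/bash
SUBSET="<subset_file_path>"
SUPERSET="<superset_file_path>"
# First line (if any) that appears in SUBSET but not in SUPERSET.
CHECK=$(comm -23 <(sort "$SUBSET" | uniq) <(sort "$SUPERSET" | uniq) | head -n 1)

# Quote the variables so paths and lines containing spaces stay intact.
if [[ -n "$CHECK" ]]; then
  echo "Detected extra line in $SUBSET and not in $SUPERSET."
  echo "$CHECK"
  exit 1
fi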

The extra sort and uniq commands are there to make sure we are comparing two sorted and deduplicated files, since comm expects sorted input.
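For reuse across several checks, the same logic could be wrapped in a function. The following is only a sketch: is_subset is a hypothetical helper name, and setting LC_ALL=C is my own assumption, added so that sort and comm agree on the same byte-wise ordering. It relies on bash process substitution, like the script above. The sample files are deliberately unsorted and contain a duplicate line to show what sort and uniq take care of.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the script above): succeeds when every
# line of the file given as $1 also appears in the file given as $2.
is_subset() {
  local subset_file=$1 superset_file=$2 extra
  # Assumption: LC_ALL=C keeps sort and comm on the same collation order.
  extra=$(LC_ALL=C comm -23 \
    <(LC_ALL=C sort "$subset_file" | uniq) \
    <(LC_ALL=C sort "$superset_file" | uniq))
  [[ -z $extra ]]
}

# Made-up, unsorted input with a duplicate line.
workdir=$(mktemp -d)
printf 'Hemant\nApaar\nHemant\n' > "$workdir/subset.txt"
printf 'Lucky\nHemant\nApaar\n' > "$workdir/superset.txt"

if is_subset "$workdir/subset.txt" "$workdir/superset.txt"; then
  echo "subset"
else
  echo "not a subset"
fi
```

Because the function only returns a status, it can be used directly in an if statement or combined with || exit 1 in a CI step.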
