Scenario
I need to check whether one file is a subset of another file in a CI pipeline. I chose a bash script because it is fast and doesn't require installing extra dependencies in the CI pipeline.
diff
The first command that came to mind was diff, a really powerful command that shows the differences between two files. However, it is too powerful: diff works out which lines need to be changed to make the two files identical, which is unnecessary for my use case.
E.g. (Example from GeeksforGeeks)
$ cat a.txt
Gujarat
Uttar Pradesh
Kolkata
Bihar
Jammu and Kashmir
$ cat b.txt
Tamil Nadu
Gujarat
Andhra Pradesh
Bihar
Uttar pradesh
$ diff a.txt b.txt
0a1
> Tamil Nadu
2,3c3
< Uttar Pradesh
< Kolkata
---
> Andhra Pradesh
5c5
< Jammu and Kashmir
---
> Uttar pradesh
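To make the mismatch concrete, here is a minimal sketch. The scratch files and their contents are made up for illustration. Even when one file is a strict subset of the other, diff still reports a difference, so its exit status alone cannot answer the subset question.

```bash
#!/bin/bash
# Scratch files for illustration only (hypothetical names and contents).
dir=$(mktemp -d)
printf 'a\nb\n'    > "$dir/subset.txt"
printf 'a\nb\nc\n' > "$dir/superset.txt"

# diff exits with 0 only when the files are identical. Here every line of
# subset.txt is present in superset.txt, yet diff still exits with 1.
diff_status=0
diff "$dir/subset.txt" "$dir/superset.txt" > /dev/null || diff_status=$?
echo "diff exit status: $diff_status"   # prints: diff exit status: 1

rm -rf "$dir"
```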
comm
Without digging further into diff, I found another command, comm, which is simple and fits my use case. comm returns 3 columns:
- the first column contains lines present only in the 1st file
- the second column contains lines present only in the 2nd file
- the third column contains lines common to both files
E.g. (Example from GeeksforGeeks)
// displaying contents of file1 //
$ cat file1.txt
Apaar
Ayush Rajput
Deepak
Hemant
// displaying contents of file2 //
$ cat file2.txt
Apaar
Hemant
Lucky
Pranjal Thakral
$ comm file1.txt file2.txt
		Apaar
Ayush Rajput
Deepak
		Hemant
	Lucky
	Pranjal Thakral
To check whether one file is a subset of another, we only need the 1st column: if it is empty, every line of the 1st file also appears in the 2nd. Passing -23 suppresses the 2nd and 3rd columns, i.e.
comm -23 file1.txt file2.txt
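As a quick check, the sketch below recreates the two sample files in a scratch directory (the directory and the variable names are my own, for illustration) and runs the command against them. Only the lines unique to file1.txt remain, so file1.txt is not a subset of file2.txt.

```bash
#!/bin/bash
# Recreate the two sample files in a scratch directory (illustrative only).
dir=$(mktemp -d)
printf '%s\n' 'Apaar' 'Ayush Rajput' 'Deepak' 'Hemant'    > "$dir/file1.txt"
printf '%s\n' 'Apaar' 'Hemant' 'Lucky' 'Pranjal Thakral'  > "$dir/file2.txt"

# -2 and -3 suppress columns 2 and 3, leaving only lines unique to file1.txt.
only_in_file1=$(comm -23 "$dir/file1.txt" "$dir/file2.txt")
echo "$only_in_file1"
# Prints:
# Ayush Rajput
# Deepak

rm -rf "$dir"
```

An empty result would mean that every line of file1.txt also appears in file2.txt.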
Conclusion
In the end, I ended up with this simple bash script to check the subset condition:
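#!/bin/bash
SUBSET="<subset_file_path>"
SUPERSET="<superset_file_path>"
# -23 keeps only the lines that appear in SUBSET but not in SUPERSET.
CHECK=$(comm -23 <(sort "$SUBSET" | uniq) <(sort "$SUPERSET" | uniq) | head -1)
# A non-empty result means at least one line of SUBSET is missing from SUPERSET.
if [[ -n "$CHECK" ]]; then
  echo "Detected extra line in $SUBSET and not in $SUPERSET."
  echo "$CHECK"
  exit 1
fi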
I added the extra sort and uniq commands to make sure we're comparing two sorted and deduplicated files, since comm expects its inputs to be sorted.
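If the check is needed for more than one pair of files, the same idea could be wrapped in a function that takes the paths as arguments. This is only a sketch under my own assumptions: the name is_subset and the command-line usage are hypothetical, and unlike the script above it prints every missing line instead of only the first.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script: returns 0 when every
# distinct line of file $1 also appears in file $2, and 1 otherwise.
is_subset() {
  local missing
  # Sort and deduplicate both inputs, then keep lines found only in $1.
  missing=$(comm -23 <(sort "$1" | uniq) <(sort "$2" | uniq))
  if [[ -n "$missing" ]]; then
    echo "Lines in $1 but not in $2:" >&2
    echo "$missing" >&2
    return 1
  fi
}

# Assumed usage: ./check_subset.sh <subset_file> <superset_file>
if [[ $# -eq 2 ]]; then
  is_subset "$1" "$2" || exit 1
fi
```

In a CI step, the non-zero exit status is what fails the job, and the missing lines end up on stderr in the job log.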