grep has lots of options. This one's really a combination of three to do something nifty.
In the previous post, I was converting a bunch of iso-8859-1 files to utf-8. Setting aside for the moment the fact that I was using iconv for this (read the fine man page!), you might perhaps be wondering: how did I know which files I wanted to convert?
First off, this is not magic. It presupposes that you know the files are iso-8859-1 and that your locale is set to utf-8. The latter is easy enough: check your LANG environment variable and set it to something suitable if it doesn't already end in '.UTF-8' or '.utf8' (a detailed discussion of locales is not for this article).
The former is your problem. I can't really help you with this - data is a bunch of bytes plus an encoding you may or may not know :D So. Assuming you have a reasonable level of certainty that these files are encoded as iso-8859-1, run this:
grep -axv '.*' mysuspectfile
It will return any lines containing iso-8859-1 characters that are not legal utf-8 - which is to say, pretty much anything with its high bit set.
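To see it in action, here's a minimal demo; the filename and the C.UTF-8 locale are my own choices for illustration (any utf-8 locale will do):

```shell
# Make a file whose second line holds the iso-8859-1 byte for 'é'
# (0xE9), which is not valid utf-8 on its own.
printf 'plain ascii line\ncaf\xe9\n' > mysuspectfile

# In a utf-8 locale, grep reports the second line as not-legal-utf-8;
# the pure-ascii first line passes silently.
LC_ALL=C.UTF-8 grep -axv '.*' mysuspectfile
```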
Sidebar: before anyone jumps on me - yes, there are obscure combinations of high-bit-set iso-8859-1 characters that are also legal utf-8, but they are sufficiently unlikely in any normal text written by people not on weird psychoactive substances for this test to be pretty reliable. Reference here for more details.
Why does this work?
-a says 'treat this file as printable text'
-v says 'invert this match'
-x says 'match the whole line against the pattern'
.* (because of -a and your locale) means 'zero or more legal utf-8 characters'. The -x requires every character in the line to match that pattern, and the -v will then spit out lines for which that is not the case.
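The -a matters more than it looks: without it, GNU grep notices the stray byte, decides the file is binary, and (by default) prints only a "Binary file ... matches" notice instead of the offending line. A quick comparison, with a throwaway filename of my choosing:

```shell
# One line containing the invalid-in-utf-8 byte 0xE9.
printf 'caf\xe9\n' > demo

# Without -a, GNU grep typically reports "Binary file demo matches"
# rather than showing the line; with -a you get the line itself.
LC_ALL=C.UTF-8 grep -xv '.*' demo
LC_ALL=C.UTF-8 grep -axv '.*' demo
```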
To generate just a list of offending files from a directory, then, we can use
-r to recurse down the directory, and
-l to just report matching filenames.
grep -r -l -axv '.*' mydirectory
Bingo. All ready to throw at
xargs -P :D
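Putting it all together, one way the hand-off to xargs could look - noting that 'mydirectory' is a placeholder and the convert-to-temp-file-then-rename dance is my own convenience, not part of the recipe above:

```shell
# Find files with non-utf-8 lines, then convert each one in place
# (via a temp file), running up to 4 iconv jobs in parallel.
grep -rlaxv '.*' mydirectory \
  | xargs -P4 -I{} \
      sh -c 'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {}
```

The sh -c '...' _ {} construction passes each filename as a positional parameter rather than splicing it into the command string, which keeps filenames with spaces or quotes from breaking the pipeline.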