Using Regex to search for a list of words

Jeremy · December 2, 2019, 8:06pm

My wife’s job has a list of terms that can get an .srt file flagged if it contains any word from the list. The list is expansive and not for the faint of heart. I thought, “I bet regular expressions can solve this problem!” But I don’t even know where to start. Should I go back and listen to that episode? Will it contain my answers?

So here’s a family-friendly example.
List of banned words:
Apple
Banana
Orange

Text to search:
This apple is the most disgusting fruit I’ve ever tried to eat. I would rather have an orange or a banana.

stu_w · December 2, 2019, 9:09pm

I’m missing some context here so not quite sure what you’re looking at, however…

Are you looking for something to check a file and warn if there’s bad words or looking for something to do real time substitution?
If you’re just looking for something to check an .srt for ‘naughty words’ before it gets deployed somewhere then you could use grep in the terminal.

Grep can read its list of words to search for from a file. So, for example, put your list of banned words into a file called “badwords.txt” (each word to search for on its own line, no blank lines in the file), then run:

grep -i -f badwords.txt mySample.srt

If there are no matches on bad words then it should return nothing. If there are any instances of naughty words then it should print every whole line that contains one.

I would think you’ll need to use -i so that if you have apple in the badwords.txt file then it’ll find instances of apple, Apple & APPLE in the srt.
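If you only want a yes/no answer per file rather than the matching lines, grep's exit status can be used: it exits 0 when something matched and 1 when nothing did. Here's a rough sketch of that idea. The sample .srt content and the temporary directory are made up for the demonstration, so swap in your real files.

```shell
#!/bin/sh
# Sketch: flag a subtitle file if it contains any banned word.
# The file contents below are invented purely for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One banned word per line, no blank lines.
printf '%s\n' apple banana orange > badwords.txt

# A tiny stand-in for a real .srt file.
printf '%s\n' \
  '1' \
  '00:00:01,000 --> 00:00:03,000' \
  'This Apple is disgusting.' \
  '' \
  '2' \
  '00:00:04,000 --> 00:00:06,000' \
  'I would rather have bread.' > mySample.srt

# -q prints nothing; grep exits 0 on a match and 1 on no match.
if grep -q -i -f badwords.txt mySample.srt; then
  result="flagged"
else
  result="clean"
fi
echo "mySample.srt: $result"
```

Drop the -q if you also want to see the offending lines.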

tjluoma · December 2, 2019, 9:18pm

grep to the rescue!

Let’s reformat your test sentence like this:

This apple is the most disgusting
fruit I've ever tried to eat.
I would rather have an orange
or a banana.

Save that to test.txt

Put all of the bad words in a file named badwords.txt (one word per line). Then run this command:

grep -i -f badwords.txt test.txt

which will output this:

This apple is the most disgusting
I would rather have an orange
or a banana.

If you want to get really fancy, you could use cat -n, which prefixes each line with its line number. That way you can easily tell where the offending words are:

cat -n test.txt | grep -i -f badwords.txt

1 This apple is the most disgusting
3 I would rather have an orange
4 or a banana.

Be sure to use -i with grep so that it will ignore case.
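As an alternative to piping through cat -n, grep has its own -n flag that prefixes each matching line with its line number and a colon. A small sketch using the same test sentence (the temporary directory is just an assumption to keep the demo self-contained):

```shell
#!/bin/sh
# Sketch: let grep number the matching lines itself with -n.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' apple banana orange > badwords.txt
printf '%s\n' \
  'This apple is the most disgusting' \
  "fruit I've ever tried to eat." \
  'I would rather have an orange' \
  'or a banana.' > test.txt

# -n adds "LINENUMBER:" in front of every matching line.
numbered=$(grep -n -i -f badwords.txt test.txt)
printf '%s\n' "$numbered"
```

When you search several files at once, grep also prefixes the file name, so you get file, line number, and text in one go.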

If you just want to see which files contain the “bad words”…

Then you can use:

grep -l -i -f badwords.txt test.txt

(Note that -l is a lowercase L, not the digit 1.)

That way you could put all of your .srt files in a directory and do:

grep -l -i -f badwords.txt *.srt

and you would get a list of all of the files which matched any of the words in badwords.txt

If you wanted to put all of the files that do have “bad words” into a separate folder, you could do this:

mkdir -p HasBadWords

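grep -l -i -f badwords.txt *.srt | while IFS= read -r line
do
  # Double quotes let $line expand to the file name;
  # single quotes would pass the literal text $line to mv.
  mv -vn "$line" HasBadWords/
done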

Then all of the offending .srt files would be in the folder ‘HasBadWords’ and the ones left would be “clean”.

BEWARE!

Make sure that badwords.txt does not have any BLANK lines in it, or else grep will treat the empty line as a pattern that matches every line, and every non-empty file will be reported as having a “bad word”.
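If the word list comes from somewhere you don't control, one way to guard against this is to filter out blank lines before handing the list to grep. This is only a sketch: badwords.clean.txt and clean.srt are hypothetical names chosen for the demo.

```shell
#!/bin/sh
# Sketch: strip blank and whitespace-only lines from the word list.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A word list with a stray blank line and a whitespace-only line.
printf 'apple\n\nbanana\n   \norange\n' > badwords.txt

# Keep only lines that contain at least one non-whitespace character.
grep '[^[:space:]]' badwords.txt > badwords.clean.txt

# A file that contains none of the banned words.
printf 'nothing to see here\n' > clean.srt

# Count matching lines with the raw list and with the cleaned list.
# grep exits 1 when the count is 0, hence the "|| true".
raw_hits=$(grep -c -i -f badwords.txt clean.srt) || true
clean_hits=$(grep -c -i -f badwords.clean.txt clean.srt) || true
echo "raw list: $raw_hits, cleaned list: $clean_hits"
```

With the raw list the harmless file is counted as a hit because of the empty pattern; with the cleaned list it is not.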

Scope

Note that searching for apple will also match crabapple and pineapple and Applebees. I presume that is the desired result in this situation.
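If substring matches ever turn out not to be what you want, grep's -w flag restricts matches to whole words. A quick sketch to compare the two behaviours (the sample sentences are made up):

```shell
#!/bin/sh
# Sketch: substring matching versus whole-word matching with -w.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' apple > badwords.txt
printf '%s\n' \
  'I ate a crabapple.' \
  'We went to Applebees.' \
  'An apple a day.' > test.txt

# Default: "apple" matches anywhere inside a longer word.
substring_hits=$(grep -c -i -f badwords.txt test.txt)

# -w: the match must be bounded by non-word characters (or line edges).
wholeword_hits=$(grep -c -i -w -f badwords.txt test.txt)

echo "substring: $substring_hits, whole word: $wholeword_hits"
```

Whether -w is appropriate depends on how the flagging system itself matches. If it flags substrings, leave -w off so you see everything it would see.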

Jeremy · December 3, 2019, 12:56am

Thanks @stu_w and @tjluoma, I think this will do the trick. The way I understand it, her work system will flag the file as something that needs attention, but with no explanation except that it contains a word from the list. So then a human goes in to check exactly what flagged it (maybe it was ‘assessment’ and not really ‘ass’).

You have both started a rabbit hole, see you in a couple of months lol.

ACautionaryTale · December 3, 2019, 6:39am

No conversation like this is complete without the quote (attributed to Jamie Zawinski), “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Jeremy · December 4, 2019, 12:34am

@tjluoma In your example above you show the words are bolded, is that what is supposed to happen or is it just for sake of demonstration?

Jeremy · December 4, 2019, 1:03am

nvm, figured it out, I was able to copy and paste someone else’s code to show the highlights.

tjluoma · December 4, 2019, 1:15am

Well, in that case I’ll let others know: if you want grep to highlight the matched word(s), then I recommend adding this to your shell init file (such as ~/.bashrc or ~/.zshrc):

export GREP_OPTIONS='--color=auto'
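One caveat: GNU grep has marked GREP_OPTIONS as obsolescent, and newer releases may warn about it or ignore it. An alias is a common substitute that should behave the same way for interactive use. This is a sketch assuming a bash or zsh init file, not something from the posts above:

```shell
# In ~/.bashrc or ~/.zshrc: colour the matches when output goes to a terminal.
alias grep='grep --color=auto'
```

With --color=auto the highlighting is only applied when grep writes to a terminal, so pipes and redirected output stay plain text.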

RosemaryOrchard · December 4, 2019, 2:41pm

Discourse has automatic syntax highlighting, so this is supposed to happen. The content is still plaintext, it just shows the formatting to help convey meaning.