How can I automate this regex task?

zkarj · November 27, 2021, 1:28am

I have developed a series of regex replacements that, when used in BBEdit, successfully perform the processing I want on the contents of a web page from a particular site. But there are many such web pages I wish to process, and I wish to repeat it on all of them repeatedly over time (as they change).

So my question is… how do I automate this task on macOS (Monterey)? However, I want to add some restrictions:

I don’t want to have to install some command line tool (e.g. Node).
Ideally it needs to be standalone, but integrating multiple parts of macOS is fine if it’s automated.

I first tried using a simple shell script with curl and then sed; however, I discovered sed does not support ‘non-greedy’ matches, which are required for the task.

I remembered it was possible to run Javascript as a scripting language on macOS, and have done it inside the likes of Hazel, but cannot fathom how to make it work in a shell script.

Automator does not have the required actions.

I thought Shortcuts might do it but I am getting very confusing results from the very first regex.

My ideal automation will be providing the URL as input and having the resulting processed text output to a file.

Any guidance or suggestions welcomed.

JohnAtl · November 27, 2021, 1:57am

You could use Perl.

perl -pe 's|(http://.*?/).*|\1|'

or a short Ruby script.

drdrang · November 27, 2021, 2:27am

Second vote for Perl. BBEdit uses Perl-compatible regular expressions (PCRE), so whatever regexes you developed there should work unchanged in Perl.

zkarj · November 27, 2021, 5:47am

Oh, cool. Thanks! I will give that a go.

rlivingston · November 27, 2021, 6:12am

Look at Keyboard Maestro

MartinPacker · November 27, 2021, 8:24am

Does the BBEdit Text Factory function help here?

zkarj · November 27, 2021, 9:11pm

OK, I got a very simple curl to perl (rhyme!) command running and then tried plugging in my first actual regex but it’s not performing as I expect.

I double checked my regex in both BBEdit and Patterns in PCRE mode and both correctly match and remove the entire first chunk down to the second <tr>.

When I run this command it does no such thing, leaving the original file completely intact. Is there some kind of mode flag I need along with the -pe given it spans lines? (I have multiline turned off in Patterns).

curl --silent "http://flydw.org.uk/DWZKZ.htm" | perl -pe 's|<html>[\S\s]+<table.+>\s+<tr>[\S\s]+?</tr>\n||' > outreg.txt

Here’s the working BBEdit version.

RosemaryOrchard · November 27, 2021, 9:24pm

This would be my suggestion. Page 145 of the BBEdit Manual (titled Text Factories) is really helpful.

zkarj · November 28, 2021, 2:56am

Text Factories are cool! But I’d still like to be able to script it all easily. I figure I’m missing something in the perl -pe scenario.

I did discover that the final \n was being consumed (by simply echoing the string) so escaped it to \\n but it still does not do the job and I am scratching my head why.

I tried a much simpler example (using a downloaded file this time for brevity):

perl -pe 's|<td>|<bozo>|' < DWZKZ.htm

That does exactly what I would expect, so there is something about my more complex pattern that is tripping it up.

zkarj · November 28, 2021, 5:20am

It’s something to do with the [\S\s] set.

The first line contains only <html>.

This replaces the first line with nothing: 's|<html>||'
So does this: 's|<html>[\S\s]+||'
But this does nothing: 's|<html>[\S\s]+<||'

So it seems the [\S\s] set is not matching newline characters? Which is the whole point of it.

JohnAtl · November 28, 2021, 1:02pm

This should do it:

curl --silent "http://flydw.org.uk/DWZKZ.htm" | perl -p -0777 -e 's@<html>[\s\S]+?</tr>@@pm'

The problem is that the input record separator ($/) is newline by default, which doesn’t allow multiline mode to work as we intuitively think it would. The -0 makes perl set a new record separator, and 777 is a magic number that means read the whole file (called slurp mode in the first ref below).

ref: perl command line arguments

This is a very helpful regex tester, and it will even write some skeletal code for you. Very nice when practicing the Dark Arts of regular expressions.
ref: regex101 interactive regex tester

drdrang · November 28, 2021, 2:58pm

@JohnAtl is correct again. I regret not mentioning that Perl reads and processes input a line at a time by default. This is in keeping with the standard Unix practice seen also in tools like sed and awk, but may seem weird to people not familiar with that tradition.

I will respectfully point out that the p and m modifiers at the end of @JohnAtl’s regex don’t do anything. They don’t do anything bad, but they don’t do anything useful, either. Not for this regex, anyway.

The p has been ignored for the last several versions of Perl, and m changes behavior only if the regex contains ^ or $ anchors. See the perlre man page. In fairness, BBEdit’s grep searching acts as if m were present, so that provides some justification for adding it when moving a regex from BBEdit to Perl. But it doesn’t do anything in this case.

JohnAtl · November 28, 2021, 4:33pm

Thanks for the info!
regex101 added those, the regex worked, and I already had 30-45 min invested in this problem, so I went with it.

zkarj · November 29, 2021, 3:36am

Thank you @drdrang and @JohnAtl so much not just for solving my problem but explaining it in clear terms. I do appreciate that it takes time. I have run the first regex successfully now and will move on to the others and building it all into my script which is already fetching all of the files. And I will need the m option as some of them do use ^.

Thanks also to @RosemaryOrchard for the Text Factory pointer — while still trying to figure out the command line aspect, I have been able to fine tune my regex game to cater for some oddities in the source pages and the ability to quickly process the lot (albeit with a bunch of clicks) has been a boon when it has taken many runs to get perfect.

Last night I successfully got the Text Factory to produce a clean set of CSV files based on the web site data, which is the first step to… well, I’m not sure of the end game yet. I’m hopeful I can come up with something which will allow the curator of this information to modernise from this…

<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">

zkarj · November 29, 2021, 11:47pm

Last night I got the full script going so I can type in one command and get a single CSV file at the end, containing the transformed contents of 51 pages.

There were a few more gotchas along the way, such as realising that using the “Replace All” button in BBEdit means I had to add the g flag to the Perl regex, and some reworking of line start/end matching (helped by stripping all leading and trailing spaces).

I’m kinda surprised I didn’t dream in regex last night, as the last hour before bed was spent getting it all working…