I have developed a series of regex replacements that, when used in BBEdit, successfully perform the processing I want on the contents of a web page from a particular site. But there are many such web pages I wish to process, and I wish to repeat the processing on all of them over time (as they change).
So my question is… how do I automate this task on macOS (Monterey)? However, I want to add some restrictions:
I don’t want to have to install some command line tool (e.g. Node).
Ideally it needs to be standalone, but integrating multiple parts of macOS is fine if it’s automated.
I first tried using a simple shell script with curl and then sed; however, I discovered that sed does not support ‘non-greedy’ matches, which are required for the task.
Automator does not have the required actions.
I thought Shortcuts might do it but I am getting very confusing results from the very first regex.
My ideal automation will be providing the URL as input and having the resulting processed text output to a file.
OK, I got a very simple curl to perl (rhyme!) command running, and then tried plugging in my first actual regex, but it’s not performing as I expect.
I double checked my regex in both BBEdit and Patterns in PCRE mode and both correctly match and remove the entire first chunk down to the second <tr>.
When I run this command it does no such thing, leaving the original file completely intact. Is there some kind of mode flag I need along with the -pe, given the match spans lines? (I have multiline turned off in Patterns.)
The problem is that the input record separator ($/) is newline by default, which doesn’t allow a multiline match to work as we intuitively think it would. The -0 makes perl set a new input record separator, and 777 is a magic value that means read the whole file at once (called slurp mode in the first ref below).
@JohnAtl is correct again. I regret not mentioning that Perl reads and processes input a line at a time by default. This is in keeping with the standard Unix practice seen also in tools like sed and awk, but may seem weird to people not familiar with that tradition.
I will respectfully point out that the p and m modifiers at the end of @JohnAtl’s regex don’t do anything. They don’t do anything bad, but they don’t do anything useful, either. Not for this regex, anyway.
The p has been ignored for the last several versions of Perl, and m changes behavior only if the regex contains ^ or $ anchors. See the perlre man page. In fairness, BBEdit’s grep searching acts as if m were present, so that provides some justification for adding it when moving a regex from BBEdit to Perl. But it doesn’t do anything in this case.
Thank you @drdrang and @JohnAtl so much, not just for solving my problem but for explaining it in clear terms. I do appreciate that it takes time. I have run the first regex successfully now and will move on to the others and building it all into my script, which is already fetching all of the files. And I will need the m option, as some of them do use ^.
Thanks also to @RosemaryOrchard for the Text Factory pointer — while still trying to figure out the command line aspect, I have been able to fine tune my regex game to cater for some oddities in the source pages and the ability to quickly process the lot (albeit with a bunch of clicks) has been a boon when it has taken many runs to get perfect.
Last night I successfully got the Text Factory to produce a clean set of CSV files based on the web site data, which is the first step to… well, I’m not sure of the end game yet. I’m hopeful I can come up with something which will allow the curator of this information to modernise from this…
Last night I got the full script going so I can type in one command and get a single CSV file at the end, containing the transformed contents of 51 pages.
There were a few more gotchas along the way, such as realising that because I had used the “Replace All” button in BBEdit, I had to add the g flag to the Perl regex, and some reworking of line start/end matching (helped by stripping all leading and trailing spaces).
I’m kinda surprised I didn’t dream in regex last night, as the last hour before bed was spent getting it all working…