Automatically Download Podcasts for Archival - not listening

Drewster · September 1, 2020, 8:39am

Back when I had a NAS, it had a Download Manager and I’m pretty certain I could set it to track and download podcast enclosures from the RSS feed.

Now I’m NAS-less, I’m looking for some way to automate the downloading of podcasts, not for listening, but for archival.

Is there an app that can do this? DevonThink can get the RSS feed but it doesn’t download the enclosure.

Or am I going to have to hack together a Hazel script that watches the Apple Podcasts folder, then copies/archives from there? Hmm, maybe that is the best solution…

WayneG · September 1, 2020, 1:34pm

It might take more than Hazel. “This is not designed to be user accessible, and the podcast files do not display the original file names.”

Good luck.

vco1 · September 1, 2020, 2:01pm

These types of tasks are pretty easily done through the CLI. Through Homebrew you could try castget.
If you’re more adventurous you could try Podget, which you have to install yourself. On their GitHub they even mention that it should run on a Mac:

As of version 0.8.5, we added a few options to be sure dependencies are installed when running on Mac OS X.

I did use Podget myself for some time, but on a Linux machine though.

webwalrus · September 1, 2020, 2:29pm

Without digging too much into it, I did something like this with DEVONthink to download the articles associated with RSS feeds (instead of just the feed itself).

Might be able to be adapted for a podcast enclosure workflow.

Here’s the thread:

My completed script was down around post 9. If you’re a little bit of a programmer, and you find futzing around to be fun, you might give it a shot.

Otherwise I heavily second @vco1’s suggestion to do something with the command line. And make sure to put a record in your Mac’s crontab so that your command-line script auto-fires.

If you’re not familiar with the Mac’s crontab, here’s a line that fires up a script on my computer to move some files into my Dropbox so JustCast sees them as new podcasts:

15 * * * * /Users/robertwall/Documents/Scripts/justCast.php &> /dev/null

That runs the script every hour, at the 15-minute mark.

Best of luck!

vco1 · September 1, 2020, 7:37pm

Not that it’s important, but I think the preferred way of accomplishing this nowadays is through launchd. Although crontab is much easier to grasp and leads to the same result in this case.

Drewster · September 2, 2020, 1:15am

Ah, yes. I remember discussions about the deprecation of crontab in favour of launchd.

Thanks for the suggestions @vco1 and @webwalrus. This has given me some more yak shaving to do!

webwalrus · September 2, 2020, 1:27am

I’m a Linux admin, so I just use what I know.

tjluoma · September 2, 2020, 1:41am

Hi, and welcome to “Building a Shell Script”

I’m your host, TJ Luoma.

On today’s episode we’re going to archive all of the mp3s from an RSS feed.

I’m going to use MPU as an example, because we’re on the MPU forum.

The first thing we should do is set a variable with the URL of the podcast feed. This should be the actual RSS feed, not an iTunes or SoundCloud URL.

FEED='https://www.relay.fm/mpu/feed'

Technically we don’t have to do this, but it makes it easier to switch to a different feed URL later.

The second thing we need to decide is where are we going to store the files that we download? I’m going to keep them in ~/Music/Podcasts/ShowNameHere but you can change that if you want.

It’s a good idea to make this a variable so we can:

Easily refer to it by a short name
Change it in one location and know that we changed all instances of it.

	# this is where we define what the folder name will be
DIR="$HOME/Music/Podcasts/MacPowerUsers"

	# if the folder does not already exist, let's create it
[[ ! -d "$DIR" ]] && mkdir -p "$DIR"

	# now we're going to 'change directories' into that directory
	# so everything else we do will happen in there:

cd "$DIR"

Now we need to fetch the feed URL and see what URLs it offers as ‘enclosures’ (which is what podcast RSS feeds use to define where episodes are stored).

curl -sfLS "$FEED" \
| fgrep '<enclosure ' \
| tr '"|\047' '\012' \
| egrep '^http.*\.mp3$'

Fetch the URL with curl.
Use fgrep to filter the results to only the lines containing <enclosure.
Use tr to replace any single or double quotes with a newline. (Note that \047 stands for “single quote” and \012 stands for “newline”. The | in the set is a literal character, not an “or”, so any pipe characters are replaced too.)
Use egrep to filter the results down to just the lines which start with ‘http’ and end with ‘.mp3’ (obviously this will fail if you find a podcast that doesn’t use mp3s).
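To see what each stage of that pipeline does without hitting the network, here is a minimal sketch that runs the same filters over a made-up feed fragment. The sample_feed and extract_enclosures names and the example.com URLs are hypothetical, it uses grep -F and grep -E (the modern spellings of fgrep and egrep), and it drops the | from the tr set on the assumption that only quotes need splitting.

```shell
# sample_feed: hypothetical stand-in for the output of curl "$FEED".
# One enclosure uses double quotes, the other single quotes.
sample_feed() {
cat <<'EOF'
<item>
<title>Episode 1</title>
<enclosure url="https://example.com/audio/ep-001.mp3" length="123" type="audio/mpeg"/>
</item>
<item>
<title>Episode 2</title>
<enclosure url='https://example.com/audio/ep-002.mp3' length="456" type="audio/mpeg"/>
</item>
EOF
}

# extract_enclosures: keep only enclosure lines, split them at every
# quote character, then keep the pieces that look like MP3 URLs.
extract_enclosures() {
  grep -F '<enclosure ' \
  | tr '"\047' '\012\012' \
  | grep -E '^http.*\.mp3$'
}

sample_feed | extract_enclosures
# prints:
# https://example.com/audio/ep-001.mp3
# https://example.com/audio/ep-002.mp3
```

With a real feed you would pipe curl into the same function: curl -sfLS "$FEED" | extract_enclosures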

Now we want to process all of the results of that command, line by line. To do that, we use while read line.

In its simplest form, it would look something like this:

curl -sfLS "$FEED" \
| fgrep '<enclosure ' \
| tr '"|\047' '\012' \
| egrep '^http.*\.mp3$' \
| while read line
do

	echo "$line"

done

In this case, the variable $line would be the URL to the MP3. However, I prefer to use variables that are more descriptive so I remember what they stand for, so I would assign the variable URL to be equal to $line

curl -sfLS "$FEED" \
| fgrep '<enclosure ' \
| tr '"|\047' '\012' \
| egrep '^http.*\.mp3$' \
| while read line
do

	URL="$line"

	echo "$URL"

done

Exact same output, but in 6 months it will be easier to remember what is happening here.

Now, curl can use the filename from the URL. So, if we have this URL:

https://www.podtrac.com/pts/redirect.mp3/traffic.libsyn.com/relaympu/mpu-321.mp3

and ask curl to download it, we will get a file named ‘mpu-321.mp3’.

Good podcasts, like the ones at Relay, will use good filenames. But not all podcasts will. Just something to be aware of. And even good podcasts can occasionally have weird filenames, such as the filename for episode 514:

https://www.podtrac.com/pts/redirect.mp3/traffic.libsyn.com/relaympu/mpu_5fourteen.mp3

Why is it called ‘mpu_5fourteen.mp3’ instead of ‘mpu_514.mp3’? I don’t know. I assume this happens if they don’t want someone trying to guess the filename, or if they have to re-release an episode for some reason and need to give it a new name so podcast apps will realize it is a different file.

So, we can download the URL using curl with this command:

curl --remote-name --silent --location --fail --show-error \
"https://www.podtrac.com/pts/redirect.mp3/traffic.libsyn.com/relaympu/mpu-321.mp3"

or, to use our variable:

curl --remote-name --silent --location --fail --show-error "$URL"

That will download the file, and the --remote-name tells curl to base the local filename off the name of the file on the server, in this case ‘mpu-321.mp3’.

Now, if we were just going to run this script once that would be fine, but if we are going to run this script repeatedly, we need to be smarter. We don’t need or want to download files more than once. That would be wasteful of time and resources.

There are several ways of dealing with this, but to me, the best one is to check to see if we already have a file with the same filename, and only download the URL if we do not have a file that matches the end of that URL.

How do we check for that? Well, zsh has a nifty feature where you can give it a variable but tell it that you only want the tail of the variable. Instead of using $URL we just use $URL:t and that will give us only the tail end of the URL, which is the part after the last /.

Once we have that, all we need to do is check to see if we already have a file with that name. If yes, we don’t download it again. If no, we do.

Also, now that we have this $FILENAME variable, we can tell curl to use that, so instead of using --remote-name we use --output "$FILENAME". It means the same thing to curl but it will be more clear to anyone reading this in the future, including us.

FILENAME="$URL:t"

if [[ -e "$FILENAME" ]]
then

	echo "We already have '$FILENAME'."

else

	curl --output "$FILENAME" --silent --location --fail --show-error "$URL"

fi

This is important because you don’t want to try to re-download 550+ episodes of MPU every time you run the script.
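The $URL:t modifier is specific to zsh. If you ever need the same "tail of the URL" step in bash or plain sh, a rough equivalent is the standard ${var##*/} expansion, which strips everything up to and including the last slash. The url_tail helper name below is hypothetical, and this is only a sketch of the idea, not part of the script.

```shell
# url_tail URL: hypothetical helper that prints the part of a URL
# after the last "/", using only POSIX parameter expansion.
url_tail() {
  printf '%s\n' "${1##*/}"
}

URL='https://www.podtrac.com/pts/redirect.mp3/traffic.libsyn.com/relaympu/mpu-321.mp3'
FILENAME="$(url_tail "$URL")"

echo "$FILENAME"    # prints: mpu-321.mp3
```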

However, there is one more thing we need to do. We need to consider the possibility that curl will try to download the file, but will fail, or will only download part of the file. If that happens, we don’t want to keep the partial file, because then the next time we run the script, we will skip downloading it.

So what we need to do is check to see if curl reported any problems, and if it did, then we want to remove the file (if it exists) so we can try again next time.

curl --output "$FILENAME" --silent --location --fail --show-error "$URL"

EXIT="$?"

if [[ "$EXIT" == "0" ]]
then

	echo "Successfully downloaded '$URL' to '$PWD/$FILENAME'."

else

	echo "$0 failed to download '$URL' to '$PWD/$FILENAME' (\$EXIT = $EXIT)" \
	| tee -a "$HOME/Desktop/$0:t:r.errors.txt"

	mv -vn "$FILENAME" "$HOME/.Trash/$FILENAME.$$.$RANDOM.mp3"

fi

With this new bit of code, we will check curl’s ‘exit code’ to see if it reports an error (generally speaking, anything that isn’t a zero means there was an error).

We will also log the error to a file on the Desktop, where we hope that whoever is running this script will notice it, in case they want to check it out for themselves.

Even if they do not, the script will try to download that file again the next time it is run.
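A different way to get the same protection, which is not what the script does, is to download to a temporary name and only rename the file once curl reports success. That way a file with the final name never exists unless the download finished. Here is a minimal sketch of that idea; the fetch and download_once function names and the ".part" suffix are assumptions made up for illustration.

```shell
# fetch URL DEST: hypothetical wrapper around the same curl call used
# above, kept separate so it is easy to swap out or stub when testing.
fetch() {
  curl --silent --location --fail --show-error --output "$2" "$1"
}

# download_once URL: skip files we already have; otherwise download to
# a ".part" name and rename it only if fetch exits with zero.
download_once() {
  dl_url="$1"
  dl_name="${dl_url##*/}"      # portable equivalent of zsh's $URL:t
  dl_part="$dl_name.part"

  if [ -e "$dl_name" ]; then
    echo "We already have '$dl_name'."
    return 0
  fi

  if fetch "$dl_url" "$dl_part"; then
    mv "$dl_part" "$dl_name"
    echo "Successfully downloaded '$dl_url' to '$PWD/$dl_name'."
  else
    dl_exit="$?"               # exit code of the failed fetch
    rm -f "$dl_part"           # discard any partial download
    echo "Failed to download '$dl_url' (exit code $dl_exit)" >&2
    return 1
  fi
}
```

Inside the while loop you would then call download_once "$URL" in place of the if/else block.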

I have assembled the final version of the script into a GitHub gist:


https://gist.github.com/tjluoma/b4401b65df2505ff70778599fd6b7d27

podcastdownloader.sh

#!/usr/bin/env zsh -f
# Purpose: Download enclosures from a podcast feed
#
# From:	Timothy J. Luoma
# Mail:	luomat at gmail dot com
# Date:	2020-09-01

	# this is a variable we can use to refer to the name of this script
	# for example, if the filename is
	# 		/usr/local/bin/podcastdownloader.sh

(Preview truncated; see the gist linked above for the full script.)

How to install it

Download the file from GitHub and save it somewhere like /usr/local/bin/podcastdownloader.sh
Make sure it is ‘executable’ (which just means ‘it can run’)

chmod 755 /usr/local/bin/podcastdownloader.sh
There is no step three.

How to run it

You could run this via cron, of course, but launchd is the preferred way, and the better way, IMO.

Assuming that the file is at /usr/local/bin/podcastdownloader.sh, this launchd plist will work:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.tjluoma.podcastdownloader</string>
	<key>Program</key>
	<string>/usr/local/bin/podcastdownloader.sh</string>
	<key>RunAtLoad</key>
	<true/>
	<key>StandardErrorPath</key>
	<string>/tmp/com.tjluoma.podcastdownloader.txt</string>
	<key>StandardOutPath</key>
	<string>/tmp/com.tjluoma.podcastdownloader.txt</string>
	<key>StartInterval</key>
	<integer>604800</integer>
</dict>
</plist>

That will run the script every 7 days (604800 seconds) and every time you reboot your Mac or log out/log in.
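If you want a different interval, StartInterval just needs the number of seconds. A quick way to work that out in the shell (the seconds_for_days helper name is made up for this example):

```shell
# seconds_for_days N: print the number of seconds in N days
# (24 hours x 60 minutes x 60 seconds per day).
seconds_for_days() {
  echo $(( $1 * 24 * 60 * 60 ))
}

seconds_for_days 7     # prints 604800, the value used in the plist
seconds_for_days 14    # prints 1209600, for a run every two weeks
```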

You can adjust the frequency of that to your liking, of course, but remember, most podcast RSS feeds include several episodes, so even if you did not run the script for a month, all 4 episodes would still be visible in the feed.

So don’t do anything silly like run this every day. That’s wasteful.

However, say you have a podcast that you know comes out every week on the same day, and you want to download it on the day it comes out.

For example, MPU is usually posted by about 6:30 p.m. in my time zone. If I wanted to make sure that I had the latest MPU, I could tell launchd to run “Every Sunday night at 7:00 p.m.”

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Label</key>
	<string>com.tjluoma.podcastdownloader</string>
	<key>Program</key>
	<string>/usr/local/bin/podcastdownloader.sh</string>
	<key>RunAtLoad</key>
	<false/>
	<key>StandardErrorPath</key>
	<string>/tmp/com.tjluoma.podcastdownloader.txt</string>
	<key>StandardOutPath</key>
	<string>/tmp/com.tjluoma.podcastdownloader.txt</string>
	<key>StartCalendarInterval</key>
	<dict>
		<key>Hour</key>
		<integer>19</integer>
		<key>Minute</key>
		<integer>0</integer>
		<key>Weekday</key>
		<integer>7</integer>
	</dict>
</dict>
</plist>

Now the script will only run at 7:00 p.m. (aka 1900 hours) on Sunday (aka “Weekday number 7” in launchd parlance).

The advantage here is that if you reboot or log in again during the week, the script will not run when it doesn’t need to. It will only run at the scheduled time. Again, even if it missed an episode (like, for example, if the WWDC episode came out on Monday), it will catch it the following week.

How to use the `launchd` plist

Copy and paste the above into a plain text document and save it as something like

"$HOME/Library/LaunchAgents/com.tjluoma.podcastdownloader.plist"

Then, in Terminal, tell launchd to load it:

launchctl load "$HOME/Library/LaunchAgents/com.tjluoma.podcastdownloader.plist"

Note: to disable it, run this command:

launchctl unload "$HOME/Library/LaunchAgents/com.tjluoma.podcastdownloader.plist"

and then delete the file "$HOME/Library/LaunchAgents/com.tjluoma.podcastdownloader.plist"

Drewster · September 2, 2020, 3:26am

I have no words. This is incredible. I’m humbled by both your genius and your generosity.

tjluoma · September 2, 2020, 6:09am

Well, you’re welcome.

I wouldn’t call it genius, just the accumulation of little bits of knowledge over a lot of years.

As far as generosity, well, I had written something like this a few years ago. I couldn’t find it, but it was pretty easy to replicate it.

I also very much enjoy sharing this sort of thing with folks.

So, when you have a chance to check it out, let me know if you have questions.

vco1 · September 2, 2020, 6:55am

It’s definitely a nice script that @tjluoma posted. One question though… Why use this - still fairly limited - script, instead of the more ‘mature’ solutions that are available? I mean, Podget has been around for a long time, and offers way more flexibility and probably checks and balances.
Just asking.

rms · September 2, 2020, 8:49am

I do this. Downcast to download the target podcasts to the local drive, then have a ChronoSync task (could use a scheduled “cp” or “rsync” command) to move the files to my NAS.

–rms

tjluoma · September 2, 2020, 3:40pm

Personally I usually find that services out there tend to not do what I want them to do the way that I want them to do it. Or they go away because it’s no longer a priority. Or they become a subscription service.

If I can do something myself the way that I want it done, I prefer that. Doesn’t happen often, but I take what I can get.

That being said, I think it’s always good to know what the options are and choose among them whichever works best for you.

Drewster · September 3, 2020, 12:05pm

An update on my progress.

Firstly, I tried castget, installed via Homebrew. I got as far as making the castget file, but was struck down by an error. I am thinking it has something to do with the paths relating to Homebrew, but I’m not sure.

The error is: castget.c:32:10: fatal error: 'libxml/parser.h' file not found

Rather than futz too much with that, I moved on to @tjluoma’s script, which I got working easily thanks to the excellent documentation.

This raised another question: is this script capable of handling multiple RSS feeds within the one FEED variable?

vco1 · September 3, 2020, 12:14pm

This might be as easy as

xcode-select --install

As per the homebrew installation requirements.

Drewster · September 3, 2020, 12:27pm

I’ve already got the command line tools installed. Weird. I also tried softwareupdate --all --install --force and no updates were available.

I decided to re-download the complete Xcode application.

Now I can make castget successfully.

tjluoma · September 4, 2020, 12:39am

Not as written but it would be fairly easy to add.
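As a rough sketch of one way it could be added (this is not in the gist): keep a list of "folder name, feed URL" pairs and loop over them, running the single-feed logic once per show in its own folder. Everything named here is hypothetical: the feeds list, the second example.com feed, and the archive_feed placeholder, which stands in for the curl/while loop from the script.

```shell
# feeds: hypothetical list of shows, one "FolderName FeedURL" per line.
feeds() {
cat <<'EOF'
MacPowerUsers https://www.relay.fm/mpu/feed
AnotherShow https://example.com/another/feed
EOF
}

# archive_feed URL: placeholder for the body of the single-feed script
# (fetch the feed, find the enclosures, download the missing files).
archive_feed() {
  echo "Would archive '$1' into '$PWD'"
}

# archive_all BASEDIR: make one folder per show under BASEDIR and run
# archive_feed inside it. The subshell keeps each cd from leaking.
archive_all() {
  base_dir="$1"
  feeds | while read -r show_name feed_url
  do
    show_dir="$base_dir/$show_name"
    mkdir -p "$show_dir"
    ( cd "$show_dir" && archive_feed "$feed_url" )
  done
}
```

You would then run it with something like: archive_all "$HOME/Music/Podcasts"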