Hazel Monitored Folder to Combine PDFs

I want: anytime a new pdf file is placed in a folder, that new pdf is combined with the existing PDF file already in the folder, and the new pdf is then deleted, so that there is always only one PDF file in this folder.

I have created an Automator workflow that combines PDFs and saves the result in the same folder, following this how-to: http://www.documentsnap.com/how-to-combine-pdf-files-in-mac-osx-using-automator-to-make-a-service/ . I then created a Hazel rule that runs this workflow when it sees a PDF file in the monitored folder.

However, when Hazel scans the folder and sees a PDF file, I can’t figure out how to get Hazel and/or Automator to select all the PDFs in the folder, combine them, and then discard the individual PDFs. Any thoughts?

Since no one else has yet responded, here are a few thoughts. I haven’t tested this out.

If I understand you correctly, you want one pdf in a folder that always remains there, and any other pdf added to the folder is automatically appended to that pdf. You have an Automator workflow that can combine PDFs. Below I will call the pdf that everything is appended to the “master”.

What I would do:

Create the Automator workflow so that it accepts one pdf file and appends that file to the master.

In Hazel, create a rule for the folder that matches all of the following criteria:

  • file type is pdf
  • file name is NOT the name of the master pdf

The Hazel action runs the workflow to do the append.

Now, in terms of dealing with the file after it has been appended:

One option is to move it to the trash. I like to make sure the previous steps succeeded before trashing a file. I do not know whether an Automator workflow can return a status that Hazel can use to decide whether to keep processing actions or stop; the Hazel documentation will tell you. If it can, that’s your answer. If not, you can wrap the call to the Automator workflow in a bash script or AppleScript that checks whether the workflow ran successfully and passes that status back to Hazel.
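For example, a minimal sketch of such a wrapper, assuming the combine workflow is saved as a standalone workflow file that accepts files as input (the path and name here are made up), could look like this:

    #!/bin/bash
    # Hazel's "Run shell script" action passes the matched file as $1.
    # Run the Automator workflow on that file via the automator command-line
    # tool and pass the workflow's exit status straight back to Hazel.
    automator -i "$1" "$HOME/Documents/Append to Master.workflow" || exit 1
    exit 0

If Hazel treats a non-zero exit status as a failed action (check the Hazel documentation, as noted above), then later actions in the rule, such as moving the file to the trash, would not run when the append fails.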

If you are not sure how to do this I can help with sample code when I am at my computer to test this out.

Another option is to have the Hazel rule tag the files after the append is done, say with a tag called “added”.

Then add another rule that matches all files with the “added” tag that are older than some number of days (long enough that you will likely have looked at the master pdf and caught any missing appends) and moves those files to the trash.

Hope this helps.


Hey thanks! I’m going to sit down this weekend and work on the automator part of it. I like the idea of tagging with Hazel and then later trashing after a day or so. I’ll let you know if I get stuck.

Great answer!

Hazel can do lots of cool tricks by adding/removing tags. I have Hazel add a “Pending” tag to every document that is added to a particular folder. Then if there is a matching rule for the document, it will remove the Pending tag when it runs. If there isn’t a matching rule, that means I have to rename it manually and remove the Pending tag myself. Either way, when the Pending tag is removed, it triggers another rule to file the document.

I also like to use: If Date added is not in the last xxx minutes/hours/days as a way of auto-deleting files.

Okay I’m stuck. It’s the Automator workflow that I can’t seem to figure out. My Hazel rules work awesome. They go into each folder titled “14 - WNB” and add a green tag to all the PDF files in that folder only. Here is an example of the folder structure that I’m working with:

(New users can only post one image, so let me know if you need to see this)

My Hazel Rules look like this:

(New users can only post one image, so let me know if you need to see this)

Each PDF file is tagged green and I can see the Automator workflow running, but I can’t figure out how to get it to combine the PDFs that are in the folder, let alone append them to the master. Here is my workflow:

(screenshot of the Automator workflow)

Any suggestions for the workflow?

@evanfuchs

I see what it is you are trying to do. Although the full details of your workflow are not visible in the image, it looks like you have basically taken a PDF-combining workflow that is fairly widely available, based on a workflow presented on the MacAutomator website. No harm there, of course - that’s why the website exists in the first place. However, I don’t think this is the best way to approach your task.

The workflow you have used is based on the idea that ALL of the pdfs to be combined are supplied as input to a single run of the workflow. That is not how Hazel works: Hazel hands the workflow a single file at a time. Further, Automator’s Combine PDF Pages action does not allow you to specify a file that was NOT passed in to the workflow; in this case, your base PDF that should be accumulating all of the others. Since you need two files at a minimum (the base PDF and one to append), I do not think this can be done in Automator…except as I will show you below!

I think what you really want to do is establish the base PDF to which all others are added, and then use a Hazel action with a script to add files to it, one at a time. This will be a bit of a kludge, but I think this approach will work.

Assumptions:

  1. The PDF that everything is appended to is called “base.pdf” and is located in a known place - which can be, but does not have to be, the same folder that Hazel is scanning for PDFs to add.
  2. When you are done, you want the new pdf appended to the end of base.pdf.

Here then is what I would do. Note that this could break over system version updates, but has remained stable for quite some time so far.

  1. Have Hazel scan the desired folder for all files of type PDF. You may refine the rule by a) excluding files named base.pdf if that file will reside in the same folder; b) excluding files tagged “added” if you are going to use that scheme from my earlier post to mark files already added; and c) adding any other criteria you need to select which pdfs are to be added.

  2. The Hazel action for each file is to run a script. I would do this script as a shell script (eg in bash), but you can code in python or whatever if you prefer.

  3. This script has the path of base.pdf hard coded into it:
    BASEPDF="/Users/me/MyFolder/base.pdf"
    You will also want to know the name of the folder that file is in, and create some variables that will be useful:
    BASEDIR=$(dirname "$BASEPDF") # pull out the folder name
    OUTPUTFILE="$BASEDIR/output.pdf" # filename for the output

  4. Do the append. For this, we are going to use the following python script:
    /System/Library/Automator/Combine\ PDF\ Pages.action/Contents/Resources/join.py
    Note that this is actually the guts of the Combine PDF Pages Automator action and is what does all of the work. You can look at its source code to learn how to use it.

So, remembering that in a bash script Hazel gives you the matched file’s path as "$1":
/System/Library/Automator/Combine\ PDF\ Pages.action/Contents/Resources/join.py --output="$OUTPUTFILE" "$BASEPDF" "$1"

  5. At this point, the file named in the variable OUTPUTFILE contains base.pdf with the new file appended. You now need to move OUTPUTFILE to BASEPDF. You accomplish this via:
    mv "$OUTPUTFILE" "$BASEPDF"

  6. Since step 5 will overwrite BASEPDF, I would consider first making a copy of BASEPDF before doing the move. You can do this via the cp command, and you could add code to save various versions, etc, because if the append fails and you do the move (mv) anyway, you will lose BASEPDF. There are a variety of ways to do this; one is to copy BASEPDF to a file named like BASEPDF but with the current date/time appended, as sketched below. You will generate a backup each time Hazel adds a file, but you can delete these copies once you are sure base.pdf is intact, via a Hazel rule or other approach.
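A minimal sketch of that backup step (the filename pattern is just one choice, nothing required):

    # Copy base.pdf to e.g. base-20190101-153000.pdf before it gets overwritten
    TIMESTAMP=$(date +%Y%m%d-%H%M%S)
    cp "$BASEPDF" "${BASEPDF%.pdf}-$TIMESTAMP.pdf"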

I hope this is helpful. I have obviously made some assumptions, eg that you have some familiarity with shell scripting. If you don’t, the code here is pretty easy to write, and I can finish off the script for you if you need that help.

Note again that you can easily clobber your base.pdf if you aren’t careful, so I strongly recommend having the script make a date/time-stamped copy before you let Hazel run it, so that you don’t lose base.pdf to an execution error.

It occurred to me that there is one potential additional pitfall to what I have outlined.

Specifically: my workflow assumes Hazel is single threaded, i.e. that it runs the rules for a given matched file to completion before starting on the next matched file.

If that is not the case, i.e. if Hazel runs the rules for several files in parallel, then this workflow will fail badly.

I have done a quick Google search and have not found the answer to this question. You can try an experiment: create a folder with a few pdfs in it, create a Hazel rule that matches all of the pdfs and runs a script that waits for perhaps 30 seconds and then writes a value to a file, let the rule run, see whether the output looks correct, and check the Hazel log for that run.
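A test script along these lines (the log path is just an example) would do:

    #!/bin/bash
    # Log when processing of each matched file starts and finishes.
    # If the start/end pairs overlap in the log, Hazel is running rules in parallel.
    echo "start $(basename "$1") $(date '+%H:%M:%S')" >> /tmp/hazel_concurrency_test.log
    sleep 30
    echo "end   $(basename "$1") $(date '+%H:%M:%S')" >> /tmp/hazel_concurrency_test.log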

I might just do this if I have a few minutes free today.

So my quick check suggests that the files will be processed in sequence and so my suggested workflow should work.

I would again caution to make sure you have copies of all the files in a safe place before running the Hazel rules to guard against data loss until this is tested.

Haha okay man, I need to hire you at this point! I’m back at the echo "Hello World" level of command-line scripting right now. Can you help me out with the script? Here is what my folder structure looks like:
(screenshot of the folder structure)

I have 90 client folders, all with the “14 - WNB” folder in them. No prob with having one .pdf file named base.pdf in each folder to start. Here is what my Hazel rule looks like:
(screenshot of the Hazel rule)
I already have a Hazel rule that descends into subfolders, so this one runs in each “14 - WNB” folder.

This is the script that worked for me:

## Config
## Feel free to change these if needed

# Dir name to search for base & added PDF files:
WORK_DIR_NAME="14 - WNB"

# Base PDF name:
BASE_PDF_NAME="Base.pdf"

# Base PDF template location (set full path here):
BASE_PDF_TEMPLATE_PATH=/Users/hlgdesktop/Documents/Base.pdf

### Main script

HAZELFILE="$1"
BASEPATH=$(dirname "$HAZELFILE")
DIRNAME=$(basename "$BASEPATH")

# Check if current PDF name is the same as base PDF (combined PDF moved back to work folder)
if [ "$(basename "$HAZELFILE")" = "$BASE_PDF_NAME" ]; then

	# Just exiting script without any work
	exit 0

fi


# Check if current PDF file is in correct folder
if [ "$DIRNAME" = "$WORK_DIR_NAME" ]; then

	# Check if base file is not present
	if [ ! -f "$BASEPATH/$BASE_PDF_NAME" ]; then

		# Copy template to base file or exit with error if copy failed for some reason
		cp "$BASE_PDF_TEMPLATE_PATH" "$BASEPATH/$BASE_PDF_NAME" || exit 1

	fi
	
	# Take a lock so no other instance of this script tampers with the same PDF
	# (mkdir without -p is atomic: it fails if the lock directory already exists)
	while ! mkdir /tmp/lock_hazelscript 2>/dev/null; do
		# If the lock holder is no longer running, clear the stale lock; otherwise wait
		if [ -f /tmp/lock_hazelscript/pid ] && ! kill -0 "$(cat /tmp/lock_hazelscript/pid)" 2>/dev/null; then
			rm -rf /tmp/lock_hazelscript
		else
			sleep 1
		fi
	done

	echo $$ > /tmp/lock_hazelscript/pid

	# Release the lock even if the script is interrupted
	trap "rm -rf /tmp/lock_hazelscript" QUIT INT TERM EXIT

	"/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" --output=/tmp/output.pdf "$BASEPATH/$BASE_PDF_NAME" "$HAZELFILE"

	# Moving merged file back to base PDF (overwriting it!!!)
	mv -f /tmp/output.pdf "$BASEPATH/$BASE_PDF_NAME"

	# Delete the PDF file which triggered Hazel
	rm -f "$HAZELFILE"
		
	# Releasing previous lock
	rm -rf /tmp/lock_hazelscript
		
fi
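
One caveat on the script above, in the spirit of the earlier warnings: it overwrites Base.pdf in place and then deletes the file that triggered Hazel, so an execution error could cost you data. If you want the timestamped-backup safeguard described earlier, a line like this (just a sketch, using the script’s own variables; put it right before the mv) would do it:

	cp "$BASEPATH/$BASE_PDF_NAME" "$BASEPATH/${BASE_PDF_NAME%.pdf}-$(date +%Y%m%d-%H%M%S).pdf"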