I’ve been experimenting with this over the last few days for interview transcription. I’m pleased with the 1x results, but it has taken about an hour for every few minutes of dialogue on an M1 Max. I haven’t evaluated the quality of the smaller models yet. It also doesn’t do speaker disambiguation yet, so you have to split the tracks yourself and re-assemble them later (whisper outputs timestamps, so you just have to merge and order them with a script).
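
For anyone curious, the merge step can be sketched roughly like this. It assumes each speaker's track was transcribed separately and that you have, per speaker, a list of segment dicts with `start`, `end`, and `text` keys (the shape of the `segments` list in the result from the Python package's `transcribe`). The speaker labels and the helper names `merge_tracks` and `format_transcript` are made up for illustration, and it only works if all tracks share the same time origin.

```python
from typing import Dict, List, Tuple

# (start_seconds, end_seconds, speaker, text)
Row = Tuple[float, float, str, str]


def merge_tracks(tracks: Dict[str, List[dict]]) -> List[Row]:
    """Flatten per-speaker segments into one list ordered by start time.

    `tracks` maps a speaker label (hypothetical, chosen by you) to that
    speaker's segments, each a dict with "start", "end" and "text".
    """
    rows = [
        (float(seg["start"]), float(seg["end"]), speaker, seg["text"].strip())
        for speaker, segments in tracks.items()
        for seg in segments
    ]
    # Order by start time; ties are broken by end time.
    rows.sort(key=lambda row: (row[0], row[1]))
    return rows


def format_transcript(rows: List[Row]) -> str:
    """Render merged rows as '[MM:SS] speaker: text' lines."""
    lines = []
    for start, _end, speaker, text in rows:
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy data standing in for two separately transcribed tracks.
    tracks = {
        "interviewer": [
            {"start": 0.0, "end": 3.2, "text": " How did you get started?"},
            {"start": 9.5, "end": 11.0, "text": " And after that?"},
        ],
        "guest": [
            {"start": 3.4, "end": 9.1, "text": " Mostly by accident."},
        ],
    }
    print(format_transcript(merge_tracks(tracks)))
```

Overlapping speech just ends up as adjacent lines here. If people talk over each other a lot, you would probably want something smarter than a plain sort.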