One of the things I’ve come to learn in my social media life is that written words are powerful.
As sage and wise as I may appear with this salty beard and Santa Claus-like stature, I don’t mean that appearances of wisdom (or lack thereof) are the source of my words’ power. Rather, I am referring to the power of written (specifically digitally written) words and their profound impact on the way the Internet’s search engines route people to our humble little blog and podcast. Beyond the hashtags we use on Twitter to drive awareness of and connection to a topic, we struggled with how we would drive awareness in the community for our podcast and the topics we cover. That is when SEO (search engine optimization) practices started to bubble up to the front of our minds as we sought to grow our listener base and connect interested parties with the interviews and topics we were covering. So we needed a transcript.
A transcript is basically a written account of all the words spoken, and it is something pretty much every profession has needed at one time or another…think:
- Court reporters capturing a case
- Medical transcribers converting doctors’ dictation
- Reporters writing their stories
And the list goes on…
No surprise, there are a lot of people and companies who have recognized this need in the market and built very profitable businesses delivering this valued service, either human-powered or through their own software, delivered as a service or on-premises (note the “s” in premises there, folks…a premise is an idea, not a place). One analyst firm puts the transcription market at more than $18B by 2023…
But rather than leverage one of these services, which produce a fine product (at a premium, I will say), we sought out a way to use modern technology to reach our transcript objective. If Amazon Echo and Google Home can come into our homes and take our words as commands in the digital universe, why can’t we use that sort of capability to make a transcript out of our multi-track recording rather than paying one of these human-powered services or expensive SaaS platforms for the same thing? Well, after a short search we found that we could.

Before we get to the search, a little background on the multi-track part of this little story is critical to unpack. Any conversation inherently involves two or more parties (aside from those crazy chats I have with myself from time to time…don’t judge me), and an audio capture of a conversation is best done with each party having their own individual track. Seasoned podcasters know this fact, and we learned it rather quickly ourselves. This is pretty easy to achieve when all the parties are in the same room: give each person their own microphone and connect each of those mics to a mixer, USB interface, or digital recorder so each track is recorded separately. Doing this remotely, where all parties are not in the same place, can be a bit more challenging. Sure, VOIP services like Skype or WebEx make it easy to connect with one another; however, the audio quality of their recordings leaves much to be desired. That is not surprising, really: those tools were designed to optimize conversation, not to record high-quality audio, and they have some weird licensing fine print that could land someone in trouble.
There are a number of ways around this. You can:
- Have each person locally record their own audio feed while simply using a VOIP service to interact, commonly called a “multi-ender.”
- Hack a VOIP service together with a digital recorder to grab multi-track audio.
- Suck up dealing with crap audio from a VOIP service, much to your listeners’ dismay.
- Use a purpose-built, record-and-forward platform for remote recording like Zencastr, Ringr, or Cast.
We’ve generally decided that the last option was best for us. The simplicity is brilliant: our guests only need a laptop, a microphone, and a quiet space, versus software to download or a requirement to figure out local recording on their own. However, these services are not without issues…try mixing audio recorded at different sampling rates and experience the excruciating pain known as audio drift. We tried Zencastr and had massive issues with this drift. We are using Cast now and are quite happy!
Now that we’ve captured a high-quality version of the conversation and have every participant on an individual track (a digital file such as MP3, WAV, etc.), we need to get that transcript and make the audio even better by removing background noise so listeners get the best possible experience. This is where our search begins…and quickly ends.

Our search was short mostly because we started by looking at the tools we were already using for our podcast production and quickly realized that one of them had already made the decision for us: Auphonic. Easily one of our most valued tools, Auphonic’s web-based platform has a native integration with Google Cloud Platform’s Speech Recognition API (more on this shortly). In short, Auphonic is an online platform for audio production that, as the German geniuses (or their marketing team) so eloquently put it:
Auphonic develops next generation audio algorithms using a combination of music information retrieval, machine learning, signal processing and big data to create automatic audio post production software for broadcasters, podcasts, radio shows, movies, audio books, lecture recordings, screencasts and more. Users just upload their recorded audio and Auphonic will do the rest: neither complicated parameter settings nor audio expert knowledge is necessary and our algorithms keep learning and adapting to new data every day!
This thing really is pretty amazing, and I could not recommend it enough to anyone who makes audio productions. These folks make it super simple to upload multi-track audio, set the metadata you want included in the production, and get amazingly well-crafted, near-perfect productions (courtesy of their proprietary ML-powered audio enhancement tools) out the other end.
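For the automation-minded, Auphonic also exposes a REST API, so you can push an episode in from a script instead of the web UI. Here is a rough sketch in Python of what that might look like; the preset UUID, file names, and credentials below are placeholders, and you should check Auphonic’s API docs for the exact field names:

```python
import requests

# Rough sketch of kicking off an Auphonic production from a script.
# Credentials, preset UUID, and file names are placeholders for illustration.
AUPHONIC_USER = "your-username"
AUPHONIC_PASS = "your-password"

resp = requests.post(
    "https://auphonic.com/api/simple/productions.json",
    auth=(AUPHONIC_USER, AUPHONIC_PASS),
    data={
        "preset": "your-preset-uuid",           # a preset with speech recognition enabled
        "title": "Big Data Beard - Episode 42",
        "action": "start",                      # start processing right after upload
    },
    files={"input_file": open("episode-42-track1.mp3", "rb")},
)
resp.raise_for_status()
print(resp.json()["data"]["uuid"])  # production UUID, handy for checking status later
```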

But the power is in the API call. An API call out to Google Cloud Platform (GCP) and their Speech Recognition API. The first step for Auphonic users is to create a GCP project and enable the Speech Recognition API. Side note, it is pretty dope that GCP gives you $300 in credit for setting up your first environment, which for this function will get us roughly 200-300 episodes’ worth of transcripts for free (your mileage may vary). Once complete, you register that GCP service with Auphonic, giving them credentials for the API, and voila! You now have an automated transcription service.
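To be clear, Auphonic makes this API call for you once you hand over the credentials; you never have to write a line of code. But for the curious, here is a minimal sketch of what a direct call to GCP’s speech API looks like in Python (the bucket path and audio settings are made up for illustration):

```python
from google.cloud import speech

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at your GCP service account key.
client = speech.SpeechClient()

# The bucket URI and audio settings below are illustrative, not our real setup.
audio = speech.RecognitionAudio(uri="gs://your-bucket/episode-42-track1.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,
    language_code="en-US",
    enable_word_time_offsets=True,  # word-level timestamps, useful for subtitles
)

# Long-running recognition is the right call for anything longer than ~1 minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=900)

for result in response.results:
    print(result.alternatives[0].transcript)
```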

Now, when we create a new multi-track production in Auphonic, we simply check the box for the Speech Recognition Service powered by GCP and then Skynet….I mean the machines get to work. The default output of this transcription service is robust; I generally know how to make use of 3 of the 4 outputs…
- MP3 Mixed – this is the mixed and cleaned audio that Auphonic‘s engine builds, but you can choose to add other formats as well with one click
- VTT for subtitles (no idea what to do with this…another thing to learn later)
- HTML of Transcript – copy and paste this into our show notes blog for each episode…the key SEO step
- JSON of Transcript – we are publishing these to our GitHub (https://github.com/BigDataBeard) site under an open-source license for researchers looking for conversation text for training models…maybe training a model to be a better host of the Big Data Beard than me. @Kyleprins has also threatened to give each of us a list of our most over-used words…for human learning (a rough sketch of how that could work follows below).
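Since we’re publishing those JSON transcripts anyway, here is a quick, hypothetical sketch of the kind of thing @Kyleprins could do with them. The field names are a guess, so adjust them to whatever schema the files in our GitHub repo actually use:

```python
import json
import re
from collections import Counter

# Load one of the published transcript files (path is illustrative).
with open("episode-42-transcript.json") as f:
    transcript = json.load(f)

# NOTE: this structure is an assumption -- adjust the keys to match the
# actual schema of the transcript JSON in our GitHub repo.
text = " ".join(segment["text"] for segment in transcript.get("segments", []))

# Tally up our most over-used words...for human learning.
words = re.findall(r"[a-z']+", text.lower())
print(Counter(words).most_common(20))
```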
Now we have it all for an awesome podcast:
- Crisp Audio – leveled, clean, and perfected by ML-powered enhancements for your listening pleasure
- Cheap Transcription – audio turned into text for SEO enhancement, done by one of the outputs of Google Brain, a speech recognition service that is only getting better with time and more inputs…that is the virtuous cycle of machine learning models retrained on ever more data
This example is exactly what the promise of AI and ML is all about: the use of intelligent algorithms to improve our quality of life.
- The Big Data Beard team’s lives have been enhanced by access to ML-powered tools that lower the cost and pain of achieving our desired outcomes.
- More listeners (assuming their behaviors have told search engines they might be interested) will find our podcast for their listening pleasure…well, at least they’ll find it.
The pleasure part is up to them (you) to decide.
Your bearded friend,
Cory