How to Clean Zoom Transcript Data in a Pinch

Mighty Minh
2 min read · May 10, 2021


Technology is great. During the pandemic, I have been fortunate to get my dissertation interviews done on Zoom using its automatic closed-captioning feature.

BUT…it’s not perfect.

If you are like me and forgot to format the Zoom closed captioning before hitting record and enabling Auto Transcription… then your transcript comes out as messy text! To fix this, you could go line by line and clean it up…

OR…

Here’s a program I wrote with my husband’s help. The Python code removes the timestamp at the start of each line and adds a blank line after each line of the transcript. The program then writes out a cleaner .txt file so it’s easier on the eyes when you start your analysis.
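To make that concrete, here is a tiny sketch of what happens to a single line. The sample line is made up, and I'm assuming each caption line starts with a timestamp as its first word, which is what the program relies on. Your own transcript may look a little different.

```python
# A made-up caption line; the real Zoom format may differ slightly.
raw_line = "13:02:45 Thank you for joining me today."

words = raw_line.split()       # split the line into a list of words
words = words[1:]              # drop the first word (assumed to be the timestamp)
clean_line = " ".join(words)   # glue the remaining words back together

print(clean_line)
```

The full program does exactly this for every line in a file, then adds a blank line after each one.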

Not sure what all of this means? Video coming soon!

Here’s the Python code for you to try out in the meantime!

import os

# Function to clean up one file:
def clean_data_file(file, newfile):
    f = open(file, "r")
    lines = f.readlines()
    f.close()
    # make a list to hold the cleaned lines
    newlines = []
    # loop through each line:
    for l in lines:
        # split breaks the line into a list of words (strings)
        l = l.split()
        # this removes the very first word, which is the timestamp (index [0])
        l = l[1:]
        # this frankensteins the remaining words back together into a new line
        l = " ".join(l)
        # add a new line
        l = l + "\n\n"
        newlines.append(l)
    # Save the file:
    f = open(newfile, "w")
    f.writelines(newlines)
    f.close()

#########
# Use the function on a bunch of files:
# wherever your files are located. make sure it's the right path
raw_folder = "/Users/minhtuyenmai/Desktop/interview_data/raw/"
clean_folder = "/Users/minhtuyenmai/Desktop/interview_data/clean/"
interview_file_list = os.listdir(raw_folder)
for fname in interview_file_list:
    if fname[-4:] == ".txt":
        print("Cleaning " + fname + "... ", end="")
    else:
        print("Skipping " + fname + "...")
        continue

    person_name = fname[:-4]

    newfile = clean_folder + person_name + '_clean.txt'
    clean_data_file(raw_folder + fname, newfile)
    print("[Done]")
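One thing to watch for: the program drops the first word of every line, so a line that doesn't start with a timestamp loses a real word, and blank lines in the raw file turn into extra blank lines in the clean one. Below is a minimal sketch of a more cautious version. The names strip_timestamp and clean_lines are hypothetical helpers, not part of the program above, and the timestamp pattern (digits separated by colons, like 13:02:45) is my assumption. Check it against your own files before relying on it.

```python
import re

# Assumed timestamp shape: digits separated by colons, e.g. 13:02:45 or 02:45.
TIMESTAMP = re.compile(r"^\d{1,2}(:\d{2}){1,2}$")

def strip_timestamp(line):
    """Drop the first word only when it looks like a timestamp."""
    words = line.split()
    if words and TIMESTAMP.match(words[0]):
        words = words[1:]
    return " ".join(words)

def clean_lines(lines):
    """Clean each line and skip any line that ends up empty."""
    cleaned = []
    for line in lines:
        text = strip_timestamp(line)
        if text:
            cleaned.append(text + "\n\n")
    return cleaned
```

To try it, you could swap the loop inside clean_data_file for newlines = clean_lines(lines) and leave the rest as is.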
