Poor Man’s Podcast Summarizer
I am a huge fan of Colossus’ podcasts on investing. Unfortunately, there are a lot of them, and most are extremely long. To save myself from FOMO, I wrote this quick script to summarize the podcasts I’m interested in. OpenAI limits API requests for free-tier users, so I added a quick workaround to throttle requests.
Setup
1. Download Repo
```shell
git clone https://github.com/CtfChan/poor_man_podcast
cd poor_man_podcast
```
2. Download Transcript
Select a podcast from https://www.joincolossus.com/episodes and download its transcript to the `/data` directory. In `summarize.py`, change the PDF filename to point at your PDF.
3. Setup API Key
Create a `.env` file in the repo directory containing:

```
OPENAI_API_KEY = "sk-MY_API_KEY"
```
4. Run the script!
```shell
python -m venv env
source env/bin/activate
pip install -r requirements.txt
python summarize.py
```
Understanding the Script
We’re essentially doing LangChain’s map-reduce summarization, but we issue the map and reduce calls separately so we can throttle them around the rate limits of OpenAI’s free tier.
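In other words, the flow is map (summarize each chunk) then reduce (summarize the summaries). With a stand-in `summarize` function (a hypothetical placeholder for the LLM call, shown here as simple truncation), the shape of the computation is just:

```python
def summarize(text):
    # Hypothetical stand-in for the LLM call; truncation as a fake "summary".
    return text[:60]

def map_reduce_summarize(chunks):
    partial = [summarize(c) for c in chunks]  # map step
    return summarize("\n".join(partial))      # reduce step
```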
Load the PDF and split it into chunks of 5000 characters, with 500 characters of overlap.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader('data/harvesting_alpha.pdf')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=500)
documents = text_splitter.split_documents(documents)
```
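As an aside, `chunk_overlap=500` means each chunk repeats the last 500 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. A toy fixed-width splitter (not LangChain’s separator-aware implementation) shows the windowing idea:

```python
def split_fixed(text, chunk_size=5000, chunk_overlap=500):
    # Slide a window of chunk_size, advancing by chunk_size - chunk_overlap.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```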
Define a summary chain with LangChain
```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")
```
Generate a summary for each chunk. To stay under the free tier’s limit of 3 requests per minute, sleep for 70 seconds after every third request.
```python
import time

summaries = []
for i, doc in enumerate(documents):
    if i != 0 and i % 3 == 0:
        time.sleep(70)  # stay under 3 requests per minute
    summary = chain.run([doc])
    summaries.append(summary)
    print(summary)
```
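The modulo-and-sleep trick above works, but it spaces requests unevenly (three back-to-back, then a long pause). A small limiter that enforces a minimum interval between calls is an alternative sketch; the `clock` and `sleep` parameters are injectable only so the logic is easy to test:

```python
import time

class RateLimiter:
    """Allow at most max_calls per period seconds by sleeping between calls."""

    def __init__(self, max_calls=3, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        # Block until at least min_interval has elapsed since the last call.
        if self.last is not None:
            remaining = self.min_interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Calling `limiter.wait()` before each `chain.run(...)` then keeps the loop under the limit without counting iterations.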
Finally, summarize the chunk summaries to produce the overall summary.
```python
from langchain.schema import Document

summary_docs = [Document(page_content=t) for t in summaries]
final_summary = chain.run(summary_docs)
```