Poor Man’s Podcast Summarizer
I am a huge fan of Colossus’ podcasts on investing. Unfortunately, there are a lot of them, and most are extremely long. To save myself from FOMO, I wrote this quick script to summarize the podcasts I’m interested in. OpenAI limits API requests for free-tier users, so I added a quick workaround to throttle requests.
Setup
1. Download Repo
```shell
git clone https://github.com/CtfChan/poor_man_podcast
cd poor_man_podcast
```
2. Download Transcript
Select a podcast from https://www.joincolossus.com/episodes and download its transcript to the `/data` directory. In `summarize.py`, change the PDF filename to point at your PDF.
3. Setup API Key
Create a `.env` file in the repo directory containing:

```
OPENAI_API_KEY = "sk-MY_API_KEY"
```
4. Run the script!
```shell
python -m venv env
source env/bin/activate
pip install -r requirements.txt
python summarize.py
```
Understanding the Script
We’re essentially doing LangChain’s map-reduce summarization, but we issue the map and reduce calls separately so we can throttle them around the rate limits of OpenAI’s free tier.
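In other words, the flow is map (summarize each chunk) then reduce (summarize the summaries). With a stand-in `summarize` function (a hypothetical placeholder for the LLM call, shown here as simple truncation), the shape of the computation is just:

```python
def summarize(text):
    # Hypothetical stand-in for the LLM call; truncation as a fake "summary".
    return text[:60]

def map_reduce_summarize(chunks):
    partial = [summarize(c) for c in chunks]  # map step
    return summarize("\n".join(partial))      # reduce step
```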
Load the PDF and split it into chunks of 5000 characters, with 500 characters of overlap.
```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader('data/harvesting_alpha.pdf')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=5000, chunk_overlap=500)
documents = text_splitter.split_documents(documents)
```
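As an aside, `chunk_overlap=500` means each chunk repeats the last 500 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. A toy fixed-width splitter (not LangChain’s separator-aware implementation) shows the windowing idea:

```python
def split_fixed(text, chunk_size=5000, chunk_overlap=500):
    # Slide a window of chunk_size, advancing by chunk_size - chunk_overlap.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```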
Define a summary chain with LangChain
```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")
```
Generate a summary for each chunk. To stay under the free tier’s limit of 3 requests per minute, sleep for 70 seconds after every third request.
```python
import time

summaries = []
for i, doc in enumerate(documents):
    if i != 0 and i % 3 == 0:
        time.sleep(70)  # stay under 3 requests per minute
    summary = chain.run([doc])
    summaries.append(summary)
    print(summary)
```
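The modulo-and-sleep trick above works, but it spaces requests unevenly (three back-to-back, then a long pause). A small limiter that enforces a minimum interval between calls is an alternative sketch; the `clock` and `sleep` parameters are injectable only so the logic is easy to test:

```python
import time

class RateLimiter:
    """Allow at most max_calls per period seconds by sleeping between calls."""

    def __init__(self, max_calls=3, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        # Block until at least min_interval has elapsed since the last call.
        if self.last is not None:
            remaining = self.min_interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Calling `limiter.wait()` before each `chain.run(...)` then keeps the loop under the limit without counting iterations.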
Finally, summarize the chunk summaries to produce the overall summary.
```python
from langchain.schema import Document

summary_docs = [Document(page_content=t) for t in summaries]
final_summary = chain.run(summary_docs)
```