Getting Started: Python NLP Environment

Set up your environment and understand the NLP pipeline for financial text analysis

Environment Setup

Install the required NLP libraries. Run these commands first in Google Colab (recommended) or Jupyter Notebook; inside a notebook cell, prefix each command with "!" (for example, !pip install nltk):

Terminal: Install NLP Libraries
# Install required NLP libraries (run once)
pip install nltk textblob scikit-learn pandas numpy matplotlib seaborn

# Download NLTK data (newer NLTK releases look up 'punkt_tab' for tokenization)
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords'); nltk.download('vader_lexicon'); nltk.download('averaged_perceptron_tagger')"
Why These Libraries?

NLTK: Foundational NLP toolkit (tokenization, stopwords, stemming)
TextBlob: Easy sentiment analysis (polarity & subjectivity scoring)
scikit-learn: TF-IDF vectorization, topic modeling (LDA and NMF), classification
pandas: Data manipulation; organize text analysis results into DataFrames
spaCy (optional): Industrial-strength NER and dependency parsing

The NLP Pipeline for Financial Text

Every NLP project follows this general pipeline. Understanding it helps you know what each code block does:

Raw Text → Preprocessing → Tokenization → Vectorization → Analysis → Insights
Step | What It Does | Python Tool
Preprocessing | Lowercase, remove punctuation, remove stopwords | NLTK, regex
Tokenization | Split text into words/sentences | NLTK word_tokenize
Vectorization | Convert text to numbers (TF-IDF, BoW) | scikit-learn TfidfVectorizer
Sentiment | Score positive/negative/neutral tone | TextBlob, VADER
NER | Extract entities (companies, amounts, dates) | spaCy, NLTK
Topic Modeling | Discover latent themes in documents | scikit-learn (LDA/NMF)
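
To make the first stages of the pipeline concrete, here is a minimal sketch using only the Python standard library. The tiny stopword set and the function names (preprocess, tokenize, vectorize) are assumptions made for illustration; the labs below use NLTK and scikit-learn for the same steps.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK's list is much longer).
STOPWORDS = {"the", "a", "of", "to", "and", "by", "our", "we", "in"}

def preprocess(text):
    """Lowercase the text and replace everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def tokenize(text):
    """Split on whitespace and drop stopwords."""
    return [w for w in text.split() if w not in STOPWORDS]

def vectorize(tokens):
    """Bag-of-words: map each token to its count."""
    return Counter(tokens)

raw = "Revenue grew 8.2%, driven by cloud revenue and digital services."
bow = vectorize(tokenize(preprocess(raw)))
print(bow.most_common(3))
```

Each function corresponds to one row of the table: the output of one stage is the input of the next, which is why a change in preprocessing (for example, a longer stopword list) affects every later result.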
1. NLP for Earnings Call Transcripts

Extract key themes and financial phrases from quarterly earnings calls

Why Earnings Calls Matter

Earnings calls are quarterly conference calls where company management discusses financial results, strategy, and forward guidance with analysts. They contain rich qualitative information that numbers alone cannot capture:

  • Management Tone: Confident vs. defensive language signals
  • Forward Guidance: Revenue/earnings projections for future quarters
  • Q&A Insights: Analyst questions reveal market concerns
  • Strategic Direction: New markets, products, or restructuring plans
Real-World Context

Some studies report that NLP-derived sentiment from earnings calls predicts the direction of subsequent stock returns with roughly 60-65% accuracy. Quantitative hedge funds such as Renaissance Technologies and Two Sigma are reported to use text-based signals of this kind, and Indian brokerages such as Motilal Oswal and Edelweiss are said to apply similar techniques.

Lab 1: Earnings Call Text Preprocessing & Key Phrase Extraction

We'll analyze a simulated excerpt from TCS Q3 FY24 Earnings Call:

Python: earnings_nlp_lab1.py
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
import re
from collections import Counter

# ==========================================
# Simulated TCS Q3 FY24 Earnings Call Excerpt
# ==========================================
tcs_call_text = """
Good morning everyone. I am pleased to report that TCS has delivered strong Q3 results
with revenue of 60,583 crore rupees, representing a year-over-year growth of 8.2%.
Our operating margin improved to 24.5%, driven by strong execution and operational
efficiencies. The digital transformation pipeline continues to grow robustly, with
digital revenues now constituting 58.2% of our total revenue.

Our order book remains strong at $8.1 billion in TCV for the quarter. We are seeing
significant traction in cloud migration, cybersecurity, and AI-driven services.
The banking and financial services vertical showed resilience despite global headwinds.

Looking ahead, we are cautiously optimistic about Q4. The demand environment remains
stable though we are monitoring macroeconomic uncertainty in key markets. Our attrition
rate has declined to 13.3%, and we have added 2,667 employees this quarter. We continue
to invest in upskilling our workforce in generative AI and cloud technologies.
"""

# Step 1: Sentence Tokenization
sentences = sent_tokenize(tcs_call_text)
print(f"📊 Total Sentences: {len(sentences)}")
print("=" * 50)

# Step 2: Word Tokenization & Cleaning
stop_words = set(stopwords.words('english'))
# Add financial-specific stopwords
financial_stopwords = {'crore', 'rupees', 'quarter', 'also', 'would', 'shall'}
stop_words.update(financial_stopwords)

words = word_tokenize(tcs_call_text.lower())
clean_words = [
    w for w in words
    if w.isalpha() and w not in stop_words and len(w) > 2
]

# Step 3: Frequency Distribution (most frequent words)
freq_dist = FreqDist(clean_words)
print("🔑 Top 15 Keywords (by frequency):")
print("=" * 50)
for word, count in freq_dist.most_common(15):
    bar = '█' * count
    print(f"  {word:>18} {bar} ({count})")

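# Step 4: Extract Financial Key Phrases
# Collapse line breaks first so that phrases split across lines
# (e.g. "attrition" / "rate") can still match.
flat_text = re.sub(r'\s+', ' ', tcs_call_text.lower())
financial_phrases = []
phrase_patterns = [
    r'revenue of [\d,]+ crore',          # headline revenue figure
    r'growth of [\d.]+%',                # growth rate
    r'margin[\w ]* \d+[\.\d]*%',         # margin level
    r'order book(?: [a-z]+){2}',         # phrase plus the next two words
    r'attrition rate(?: [a-z]+){2}',
    r'digital revenues(?: [a-z]+){2}',
]
for pattern in phrase_patterns:
    matches = re.findall(pattern, flat_text)
    financial_phrases.extend(matches)

print("\n💰 Extracted Financial Key Phrases:")
print("=" * 50)
for phrase in financial_phrases:
    print(f"  → {phrase.strip()}")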
Output:
📊 Total Sentences: 11
==================================================
🔑 Top 15 Keywords (by frequency):
==================================================
              strong ███ (3)
             revenue ██ (2)
             digital ██ (2)
             remains ██ (2)
               cloud ██ (2)
            services ██ (2)
                good █ (1)
             morning █ (1)
            everyone █ (1)
             pleased █ (1)
              report █ (1)
                 tcs █ (1)
           delivered █ (1)
             results █ (1)
        representing █ (1)

💰 Extracted Financial Key Phrases:
==================================================
  → revenue of 60,583 crore
  → growth of 8.2%
  → margin improved to 24.5%
  → order book remains strong
  → attrition rate has declined
  → digital revenues now constituting
Analyst Insight

Notice that "strong" appears three times, while "revenue", "digital", "cloud", and "services" each appear twice: upbeat language about the digital and cloud business dominates the call. Generic tokens such as "good" and "morning" also surface, a reminder to extend the stopword list for call transcripts. The phrase extraction captures hard numbers (₹60,583 crore revenue, 8.2% growth, 24.5% margin) that analysts would otherwise highlight by hand. Automating this across 500+ earnings calls per quarter makes the analysis scalable.

Lab 1B: TF-IDF to Identify Important Terms Across Companies

Compare earnings call language across multiple Indian IT companies using TF-IDF (Term Frequency-Inverse Document Frequency):

Python: tfidf_earnings.py
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Simulated earnings call excerpts for 5 Indian IT companies
earnings_calls = {
    'TCS': "Revenue grew 8.2% driven by digital transformation cloud migration and AI services. Operating margin improved to 24.5%.",
    'Infosys': "We raised FY24 guidance to 4-4.5% growth. Digital and cloud services contributed 62% of revenue. Large deal TCV was $3.2 billion.",
    'Wipro': "Revenue declined 2.1% qoq due to consulting weakness. We are restructuring our leadership team and focusing on AI-first strategy.",
    'HCL Tech': "IT services revenue grew 5.3%. Our Mode 2 and Mode 3 strategy continues to deliver. Cloud native offerings growing at 35%.",
    'Tech Mahindra': "5G and telecom vertical showed strong momentum. AI and automation pipeline grew 40%. Enterprise digital transformation deals accelerating.",
}

# Compute TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=20)
tfidf_matrix = vectorizer.fit_transform(earnings_calls.values())

# Create DataFrame
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    index=earnings_calls.keys(),
    columns=vectorizer.get_feature_names_out()
)

print("📊 TF-IDF Scores - Top Terms per Company:")
print("=" * 55)
for company in tfidf_df.index:
    top3 = tfidf_df.loc[company].nlargest(3)
    terms = ', '.join([f"{t}({s:.2f})" for t, s in top3.items() if s > 0])
    print(f"  {company:>14}: {terms}")
Output (illustrative):
📊 TF-IDF Scores - Top Terms per Company:
=======================================================
             TCS: digital(0.37), cloud(0.26), margin(0.26)
         Infosys: guidance(0.33), raised(0.33), deal(0.27)
           Wipro: declined(0.33), consulting(0.28), leadership(0.28)
        HCL Tech: mode(0.47), services(0.28), native(0.28)
   Tech Mahindra: 5g(0.30), telecom(0.30), momentum(0.26)
What TF-IDF Tells Us

TF-IDF highlights unique themes per company: TCS talks about digital & cloud, Infosys focuses on guidance & deals, Wipro is dealing with decline & restructuring, HCL emphasizes its Mode 2/3 strategy, and Tech Mahindra is positioned around 5G & telecom. This automated thematic analysis is scalable across hundreds of companies.
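
To see where such scores come from, the sketch below computes TF-IDF by hand with the standard library. It assumes raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 that scikit-learn documents as its default, and L2 normalisation of each document vector. The whitespace tokenizer and the three toy documents are simplifications invented for illustration, so the numbers will not match TfidfVectorizer output on real text.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: rare terms get a larger multiplier than common ones.
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalise so that document length does not dominate.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# Toy documents (invented for illustration)
docs = [
    "cloud revenue grew",
    "cloud margin improved",
    "telecom revenue declined",
]
weights = tfidf(docs)
for term, weight in sorted(weights[2].items(), key=lambda kv: -kv[1]):
    print(f"{term:>10} {weight:.3f}")
```

In the third document "telecom" and "declined" occur nowhere else, so they outrank "revenue", which also appears in the first document. That is the mechanism behind company-specific terms rising to the top.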

2. Sentiment Analysis of Management Commentary

Quantify the tone and confidence level of management language

Why Sentiment Matters in Finance

Research on financial text, from Tetlock's (2007) study of news-media tone to later work on earnings calls, finds that tone is a statistically significant predictor of future stock returns. Key findings:

  • Abnormally positive tone → often precedes earnings beats (+2-4% excess returns)
  • Abnormally negative tone → predicts earnings misses and downgrades
  • Evasive language in Q&A → signals management hiding bad news
  • Changes in tone quarter-over-quarter → more predictive than absolute tone

Lab 2: Sentiment Scoring with TextBlob

Analyze management commentary from multiple Indian companies. TextBlob provides two metrics:

  • Polarity: -1.0 (very negative) to +1.0 (very positive)
  • Subjectivity: 0.0 (factual) to 1.0 (opinion-based)
Python: sentiment_lab2.py
from textblob import TextBlob
import pandas as pd

# ============================================
# Management commentary from Indian companies
# ============================================
management_quotes = {
    'TCS (CEO)': "We are pleased with our strong performance this quarter. Revenue growth has been broad-based across all verticals and geographies. Our digital transformation pipeline is at an all-time high and we remain confident about the demand environment.",

    'Infosys (CEO)': "We have raised our FY24 revenue guidance to 4-4.5%. Our largest deal wins this quarter demonstrate the strength of our capabilities. We continue to see robust demand for cloud, data analytics, and AI services.",

    'Wipro (CEO)': "While we faced headwinds in our consulting business this quarter, we are taking decisive steps to restructure and reposition our portfolio. The macro environment remains challenging and we are seeing delayed decision-making from some clients.",

    'Reliance (Chairman)': "Jio has achieved the milestone of 450 million subscribers. Our new commerce business is scaling rapidly. We are committed to creating world-class digital ecosystems. The energy business delivered record earnings despite volatile markets.",

    'HDFC Bank (CEO)': "We have successfully completed the merger integration. Our deposit franchise remains strong. Asset quality is stable with gross NPA at 1.24%. We are well-capitalized and positioned for sustainable growth in the evolving regulatory environment.",
}

# Analyze sentiment for each
results = []
for speaker, text in management_quotes.items():
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity

    # Map polarity to a signal band (checked from most to least positive)
    if polarity > 0.3:
        signal = "🟢 BULLISH"
    elif polarity >= 0.1:
        signal = "🟡 MODERATELY POSITIVE"
    elif polarity >= -0.1:
        signal = "⚪ NEUTRAL"
    else:
        signal = "🔴 BEARISH"

    results.append({
        'Speaker': speaker,
        'Polarity': round(polarity, 3),
        'Subjectivity': round(subjectivity, 3),
        'Signal': signal,
    })

# Display results
df = pd.DataFrame(results)
print("📊 SENTIMENT ANALYSIS - Management Commentary")
print("=" * 65)
print(df.to_string(index=False))

print(f"\n📈 Average Sentiment: {df['Polarity'].mean():.3f}")
print(f"📊 Most Bullish: {df.loc[df['Polarity'].idxmax(), 'Speaker']}")
print(f"📊 Most Bearish: {df.loc[df['Polarity'].idxmin(), 'Speaker']}")
Output (illustrative):
📊 SENTIMENT ANALYSIS - Management Commentary
=================================================================
          Speaker      Polarity  Subjectivity             Signal
       TCS (CEO)        0.450        0.610         🟢 BULLISH
   Infosys (CEO)        0.425        0.555         🟢 BULLISH
     Wipro (CEO)        0.062        0.435         ⚪ NEUTRAL
Reliance (Chairman)     0.380        0.520         🟢 BULLISH
 HDFC Bank (CEO)        0.285        0.480         🟡 MODERATELY POSITIVE

📈 Average Sentiment: 0.320
📊 Most Bullish: TCS (CEO)
📊 Most Bearish: Wipro (CEO)
Interpretation

TCS and Infosys show bullish sentiment: strong words like "pleased", "confident", and "robust" drive high polarity. Wipro's CEO uses cautious language ("headwinds", "challenging", "delayed"), resulting in near-neutral sentiment, which is consistent with the revenue decline in the Wipro excerpt from Lab 1B. Reliance is bullish due to milestone achievements. HDFC Bank is moderately positive but includes cautious regulatory language.
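
The polarity bands that Lab 2 uses to label each speaker can be isolated in a small function, which makes the boundary behaviour explicit and easy to test. This is a sketch: the function name classify_signal and its keyword arguments are introduced here for illustration, and the cut-offs (0.3, 0.1, -0.1) are the lab's own rule of thumb rather than a calibrated standard.

```python
def classify_signal(polarity, bullish=0.3, positive=0.1, negative=-0.1):
    """Map a polarity score in [-1, 1] to one of four signal labels."""
    if polarity > bullish:
        return "BULLISH"
    if polarity >= positive:
        return "MODERATELY POSITIVE"
    if polarity >= negative:
        return "NEUTRAL"
    return "BEARISH"

# Polarity values taken from the Lab 2 results table
for speaker, score in [("TCS", 0.450), ("Wipro", 0.062), ("HDFC Bank", 0.285)]:
    print(speaker, classify_signal(score))
```

Keeping the thresholds as parameters makes it straightforward to tighten or loosen the bands once you have real return data to calibrate against.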

Lab 2B: Sentence-Level Sentiment Breakdown

Drill down to individual sentences to find the most positive/negative statements:

Python: sentence_sentiment.py
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

# Deep analysis of Wipro CEO's commentary
wipro_text = """While we faced headwinds in our consulting business this quarter,
we are taking decisive steps to restructure and reposition our portfolio.
The macro environment remains challenging and we are seeing delayed
decision-making from some clients. However, our AI-first strategy is
gaining traction with enterprise customers. We remain committed to
delivering long-term value for our shareholders despite near-term volatility."""

sentences = sent_tokenize(wipro_text)

print("🔬 WIPRO CEO - Sentence-by-Sentence Sentiment:")
print("=" * 65)

for i, sent in enumerate(sentences, 1):
    blob = TextBlob(sent)
    pol = blob.sentiment.polarity
    icon = '🟢' if pol > 0.1 else ('🔴' if pol < -0.1 else '⚪')
    bar_len = int(abs(pol) * 30)  # bar length scales with |polarity|
    bar = '█' * bar_len
    flat = ' '.join(sent.split())  # collapse line breaks inside the sentence

    print(f"\n{icon} Sentence {i} [Polarity: {pol:+.3f}]")
    print(f"   {bar}")
    print(f"   \"{flat[:80]}...\"")
Output (illustrative):
🔬 WIPRO CEO - Sentence-by-Sentence Sentiment:
=================================================================

🔴 Sentence 1 [Polarity: -0.250]
   ███████
   "While we faced headwinds in our consulting business this quarter, we are ta..."

⚪ Sentence 2 [Polarity: +0.000]
   
   "The macro environment remains challenging and we are seeing delayed decisi..."

🟢 Sentence 3 [Polarity: +0.200]
   ██████
   "However, our AI-first strategy is gaining traction with enterprise custome..."

🟢 Sentence 4 [Polarity: +0.350]
   ██████████
   "We remain committed to delivering long-term value for our shareholders des..."

📊 Net Assessment: Management starts negative, pivots to positive by end.
💡 Strategy: "Sandwich technique": bad news first, then pivot to optimism.
Management Communication Tricks

Notice that Wipro's CEO uses the "negative sandwich" technique: start with bad news, then pivot to a positive future. This is a common IR strategy. Here the last sentence is the most positive ("committed to delivering long-term value"), and closing remarks are what analysts tend to remember. Sophisticated NLP models weight Q&A sentiment more heavily than prepared remarks, because Q&A is less scripted.
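
One simple way to express "weight Q&A more than prepared remarks" is a weighted average of the two sections' mean polarity. The sketch below is an assumption-laden illustration: the function name weighted_tone and the 0.7 Q&A weight are arbitrary choices, the prepared-remarks scores are the sentence polarities shown in Lab 2B, and the Q&A scores are invented placeholders.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def weighted_tone(prepared, qa, qa_weight=0.7):
    """Blend the mean polarity of prepared remarks and Q&A answers.

    qa_weight is the share given to the Q&A section (assumed, not estimated).
    """
    if not prepared or not qa:
        raise ValueError("both sections need at least one sentence score")
    return (1 - qa_weight) * mean(prepared) + qa_weight * mean(qa)

prepared_scores = [-0.25, 0.0, 0.2, 0.35]  # sentence polarities from Lab 2B
qa_scores = [-0.1, -0.3]                   # invented Q&A polarities

blended = weighted_tone(prepared_scores, qa_scores)
print(f"Prepared mean: {mean(prepared_scores):+.3f}")
print(f"Q&A mean:      {mean(qa_scores):+.3f}")
print(f"Blended tone:  {blended:+.3f}")
```

With these inputs the prepared remarks look mildly positive on their own, but the blended score turns negative once the less scripted answers are given more weight. In practice the weight would be tuned against historical returns rather than fixed by hand.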

3. NLP on Annual Reports & MD&A

Mine the Management Discussion & Analysis section for forward-looking statements and risk factors

The MD&A Section: Goldmine of Qualitative Data

The Management Discussion & Analysis (MD&A) is arguably the most important section of an annual report for NLP analysis. It contains:

  • Management's view on business performance and outlook
  • Forward-looking statements about strategy and growth
  • Risk factor disclosures (increasingly detailed post-Satyam)
  • Industry trends and competitive positioning
  • Opportunities and threats analysis

Lab 3: Detect Forward-Looking Statements & Risk Language

Python: annual_report_nlp.py
import re
from textblob import TextBlob
from nltk.tokenize import sent_tokenize

# Simulated MD&A section from an annual report
mda_text = """
Management Discussion and Analysis: FY 2023-24

The company achieved record revenue of ₹85,000 crore during the fiscal year,
representing a growth of 12.3% over the previous year. The digital services
segment was the key growth driver, contributing 62% of total revenue.

Going forward, we expect the demand environment to remain stable. We anticipate
that our AI-powered solutions will drive significant revenue growth in FY25.
We plan to expand our presence in Southeast Asian markets and aim to achieve
$2 billion in cloud revenue by FY26.

However, there are certain risk factors that investors should consider.
Global macroeconomic uncertainty may impact client spending in the near term.
Currency volatility, particularly the rupee-dollar exchange rate, could affect
our margins. The rising competition in the AI services space may pressure pricing.
Geopolitical tensions in certain regions pose operational risks.

We believe our strong balance sheet, robust order pipeline, and continued
investment in talent development position us well for sustainable growth.
The board has recommended a dividend of ₹28 per share, subject to shareholder
approval. We are targeting an operating margin of 24-26% for FY25.
"""

sentences = sent_tokenize(mda_text)

# Classification patterns (keyword stems, so inflected forms such as
# "risks" or "targeting" also match)
forward_patterns = [
    r'\b(?:expect|anticipat|plan|aim|target|believ|project)\w*|\bwill\b|\bgoing forward\b',
]
risk_patterns = [
    r'\b(?:risk|uncertaint|volatil|threat|challeng|concern|pressur)\w*|\bmay impact\b',
]
opportunity_patterns = [
    r'\b(?:growth|opportunit(?:y|ies)|strong|robust|record|expand\w*|sustainable|investments?)\b',
]

results = []
for sent in sentences:
    if len(sent) < 20: continue

    is_forward = any(re.search(p, sent.lower()) for p in forward_patterns)
    is_risk = any(re.search(p, sent.lower()) for p in risk_patterns)
    is_opportunity = any(re.search(p, sent.lower()) for p in opportunity_patterns)

    polarity = TextBlob(sent).sentiment.polarity

    category = []
    if is_forward: category.append('🔮 Forward-Looking')
    if is_risk: category.append('⚠️ Risk')
    if is_opportunity: category.append('💡 Opportunity')
    if not category: category = ['📄 Factual']

    results.append({
        'Sentence': sent[:70] + '...',
        'Category': ' | '.join(category),
        'Polarity': round(polarity, 3),
    })

# Print categorized results
print("📋 MD&A ANALYSIS - Sentence Classification:")
print("=" * 70)
for r in results:
    print(f"\n{r['Category']:>35}  [Polarity: {r['Polarity']:+.3f}]")
    print(f"  → {r['Sentence']}")

# Summary statistics
forward_count = sum(1 for r in results if 'Forward' in r['Category'])
risk_count = sum(1 for r in results if 'Risk' in r['Category'])
opp_count = sum(1 for r in results if 'Opportunity' in r['Category'])

print(f"\n📊 SUMMARY:")
print(f"  Total Sentences: {len(results)}")
print(f"  🔮 Forward-Looking: {forward_count} ({forward_count/len(results)*100:.0f}%)")
print(f"  ⚠️ Risk Sentences: {risk_count} ({risk_count/len(results)*100:.0f}%)")
print(f"  💡 Opportunity: {opp_count} ({opp_count/len(results)*100:.0f}%)")
Output (abridged, illustrative):
📋 MD&A ANALYSIS - Sentence Classification:
======================================================================

    💡 Opportunity  [Polarity: +0.400]
  → The company achieved record revenue of ₹85,000 crore during the fiscal...

     🔮 Forward-Looking | 💡 Opportunity  [Polarity: +0.500]
  → Going forward, we expect the demand environment to remain stable...

     🔮 Forward-Looking | 💡 Opportunity  [Polarity: +0.550]
  → We anticipate that our AI-powered solutions will drive significant...

     🔮 Forward-Looking  [Polarity: +0.350]
  → We plan to expand our presence in Southeast Asian markets...

     ⚠️ Risk  [Polarity: -0.200]
  → However, there are certain risk factors that investors should consider...

     ⚠️ Risk  [Polarity: -0.150]
  → Global macroeconomic uncertainty may impact client spending...

     ⚠️ Risk  [Polarity: -0.100]
  → Currency volatility, particularly the rupee-dollar exchange rate...

     🔮 Forward-Looking | 💡 Opportunity  [Polarity: +0.450]
  → We believe our strong balance sheet, robust order pipeline...

📊 SUMMARY:
  Total Sentences: 10
  🔮 Forward-Looking: 5 (50%)
  ⚠️ Risk Sentences: 3 (30%)
  💡 Opportunity: 5 (50%)
Analyst Insight

In this sample, about half of the sentences are forward-looking and about 30% carry risk language. The levels matter less than how they change: if risk language jumps from 20% to 40% between years, management is clearly more worried, and if forward-looking statements drop, they may be losing confidence. Track these metrics over 5+ years for trend analysis.
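
The year-over-year comparison can be sketched as follows: compute the share of sentences that contain a risk keyword for each year, then take the difference between consecutive years. The keyword stems, the function name risk_share, and the two sets of sentences are all invented for illustration; a real run would use sentence-tokenized MD&A text for each fiscal year.

```python
import re

# Risk keyword stems (an illustrative list, not an established lexicon)
RISK_RE = re.compile(
    r"\b(?:risk|uncertaint|volatil|threat|challeng|concern|pressur)\w*",
    re.IGNORECASE,
)

def risk_share(sentences):
    """Fraction of sentences containing at least one risk keyword."""
    if not sentences:
        return 0.0
    flagged = sum(1 for s in sentences if RISK_RE.search(s))
    return flagged / len(sentences)

# Invented placeholder sentences standing in for two years of MD&A text
reports = {
    "FY22": [
        "Revenue grew across all segments.",
        "We expanded into new markets.",
        "Margins improved on better utilisation.",
        "Currency volatility affected reported earnings.",
        "The board recommended a dividend.",
    ],
    "FY23": [
        "Revenue grew modestly.",
        "Macroeconomic uncertainty delayed client decisions.",
        "Pricing pressure intensified in core services.",
        "We hired selectively.",
        "The board recommended a dividend.",
    ],
}

shares = {year: risk_share(sents) for year, sents in reports.items()}
years = sorted(shares)
changes = {cur: shares[cur] - shares[prev] for prev, cur in zip(years, years[1:])}

for year in years:
    print(f"{year}: {shares[year]:.0%} risk sentences")
for year, delta in changes.items():
    print(f"Change into {year}: {delta:+.0%}")
```

The same pattern works for forward-looking or opportunity language by swapping in a different keyword expression.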

4. Advanced: Named Entity Recognition & Topic Modeling

Extract structured information from unstructured financial text

Lab 4A: Named Entity Recognition (NER)

Extract companies, monetary values, percentages, and dates from financial text using rule-based NER:

Python: ner_extraction.py
import re

# Financial text with entities to extract
financial_text = """
TCS reported revenue of ₹60,583 crore for Q3 FY24, up 8.2% year-over-year.
Infosys raised its FY24 revenue guidance to 4-4.5%. Wipro's revenue declined
2.1% to ₹22,200 crore. Reliance Industries invested $5.7 billion in Jio.
HDFC Bank's gross NPA stood at 1.24% as of December 2023.
Bajaj Finance disbursed 76 lakh loans in Q3 FY24.
"""

# Rule-based entity extraction
entities = {
    '💰 Money': re.findall(r'[₹$][\d,.]+(?:\s*(?:crore|lakh|billion|million))?', financial_text),
    # Allow ranges such as "4-4.5%" as well as single values
    '📈 Percentages': re.findall(r'\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?%', financial_text),
    '📅 Dates/Periods': re.findall(r'(?:Q[1-4]\s*FY\d{2}|FY\d{2}|[A-Z][a-z]+\s+\d{4})', financial_text),
    '🏢 Companies': re.findall(r'(?:TCS|Infosys|Wipro|Reliance Industries?|HDFC Bank|Bajaj Finance)', financial_text),
}

print("🏷️ EXTRACTED ENTITIES:")
print("=" * 50)
for entity_type, values in entities.items():
    if values:
        print(f"\n{entity_type}:")
        # dict.fromkeys removes duplicates while keeping first-seen order
        for v in dict.fromkeys(values):
            print(f"  → {v}")

# Structured records assembled by hand from the entities above
# (a full pipeline would link each value to its company automatically)
company_data = []
company_data.append({'Company': 'TCS', 'Revenue': '₹60,583 crore', 'YoY Growth': '8.2%', 'Period': 'Q3 FY24'})
company_data.append({'Company': 'Infosys', 'Guidance': '4-4.5%', 'Period': 'FY24'})
company_data.append({'Company': 'Wipro', 'Revenue': '₹22,200 crore', 'QoQ Change': '-2.1%'})

print("\n📊 Structured Data from Unstructured Text:")
for d in company_data:
    print(f"  {d}")
Output:
🏷️ EXTRACTED ENTITIES:
==================================================

💰 Money:
  → ₹60,583 crore
  → ₹22,200 crore
  → $5.7 billion

📈 Percentages:
  → 8.2%
  → 4-4.5%
  → 2.1%
  → 1.24%

📅 Dates/Periods:
  → Q3 FY24
  → FY24
  → December 2023

🏢 Companies:
  → TCS
  → Infosys
  → Wipro
  → Reliance Industries
  → HDFC Bank
  → Bajaj Finance

📊 Structured Data from Unstructured Text:
  {'Company': 'TCS', 'Revenue': '₹60,583 crore', 'YoY Growth': '8.2%', 'Period': 'Q3 FY24'}
  {'Company': 'Infosys', 'Guidance': '4-4.5%', 'Period': 'FY24'}
  {'Company': 'Wipro', 'Revenue': '₹22,200 crore', 'QoQ Change': '-2.1%'}
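
Lab 4A builds its structured records by hand. A rough way to automate that step is to split the text into sentences and attach the money and percentage matches to the company named in the same sentence. The sketch below assumes one company per sentence and uses a naive sentence splitter; the function name structure and the record keys are hypothetical. The escape \u20b9 is the rupee sign.

```python
import re

COMPANY_RE = re.compile(r"TCS|Infosys|Wipro|Reliance Industries|HDFC Bank|Bajaj Finance")
MONEY_RE = re.compile(r"[\u20b9$][\d,.]*\d(?:\s*(?:crore|lakh|billion|million))?")
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?(?:-\d+(?:\.\d+)?)?%")

def structure(text):
    """Return one record per sentence that names a known company."""
    records = []
    # Naive split: sentence-ending punctuation, whitespace, then a capital letter.
    for sentence in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip()):
        company = COMPANY_RE.search(sentence)
        if company is None:
            continue
        records.append({
            "company": company.group(),
            "money": MONEY_RE.findall(sentence),
            "percent": PERCENT_RE.findall(sentence),
        })
    return records

text = (
    "TCS reported revenue of \u20b960,583 crore for Q3 FY24, up 8.2% year-over-year. "
    "Infosys raised its FY24 revenue guidance to 4-4.5%. "
    "Wipro's revenue declined 2.1% to \u20b922,200 crore."
)
records = structure(text)
for record in records:
    print(record["company"], len(record["money"]), record["percent"])
```

Sentences that mention two companies, or values that refer to a company named in an earlier sentence, would be mis-attributed by this approach; handling those cases is where dependency parsing or a trained NER model earns its keep.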
Production NER

For production use, spaCy provides pre-trained NER models that can detect ORG (organizations), MONEY (monetary values), DATE, PERCENT, PERSON, and GPE (geopolitical entities). Install with pip install spacy and python -m spacy download en_core_web_sm.

Lab 4B: Topic Modeling with NMF

Discover hidden themes across multiple financial documents using Non-Negative Matrix Factorization (NMF):

Python: topic_modeling.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

# Collection of financial documents
documents = [
    "TCS reported strong revenue growth driven by digital transformation and cloud migration services.",
    "Infosys raised guidance on strong deal wins in cloud computing and data analytics space.",
    "HDFC Bank reported stable asset quality with gross NPA at 1.24% and strong deposit growth.",
    "RBI kept repo rate unchanged at 6.5% citing food inflation concerns and global uncertainty.",
    "Reliance invested heavily in 5G infrastructure and digital commerce platforms this quarter.",
    "Bajaj Finance saw 30% jump in loan disbursements driven by consumer credit demand.",
    "IT sector faces headwinds from banking crisis in US and reduced discretionary spending.",
    "India's GDP growth projected at 7.2% for FY24 driven by government capex and services exports.",
    "Wipro restructured its consulting division and appointed new leadership for AI practice.",
    "SBI reported highest ever quarterly profit driven by treasury gains and lower provisions.",
]

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=50)
tfidf = vectorizer.fit_transform(documents)

# NMF Topic Modeling: extract 3 topics
nmf = NMF(n_components=3, random_state=42)
nmf_features = nmf.fit_transform(tfidf)

# Display top words per topic
feature_names = vectorizer.get_feature_names_out()
print("📚 DISCOVERED TOPICS:")
print("=" * 50)

# Labels are assigned by hand after inspecting each topic's keywords;
# NMF does not guarantee which theme lands at which index.
topic_labels = ["Tech & Digital Services", "Banking & Finance", "Macro & Strategy"]
for idx, topic in enumerate(nmf.components_):
    # argsort is ascending, so reverse to list the heaviest words first
    top_words = [feature_names[i] for i in topic.argsort()[-5:][::-1]]
    print(f"\n  📌 Topic {idx+1}: {topic_labels[idx]}")
    print(f"     Keywords: {', '.join(top_words)}")

# Assign documents to topics
print(f"\n📄 Document-Topic Assignment:")
for i, doc in enumerate(documents):
    topic_idx = nmf_features[i].argmax()
    print(f"  Doc {i+1} → {topic_labels[topic_idx]}: \"{doc[:60]}...\"")
Output (illustrative):
📚 DISCOVERED TOPICS:
==================================================

  📌 Topic 1: Tech & Digital Services
     Keywords: cloud, services, growth, digital, strong

  📌 Topic 2: Banking & Finance
     Keywords: npa, provisions, loan, banking, profit

  📌 Topic 3: Macro & Strategy
     Keywords: uncertainty, government, gdp, projected, crisis

📄 Document-Topic Assignment:
  Doc 1  → Tech & Digital Services: "TCS reported strong revenue growth driven by digital..."
  Doc 2  → Tech & Digital Services: "Infosys raised guidance on strong deal wins..."
  Doc 3  → Banking & Finance: "HDFC Bank reported stable asset quality with gross..."
  Doc 4  → Macro & Strategy: "RBI kept repo rate unchanged at 6.5%..."
  Doc 5  → Tech & Digital Services: "Reliance invested heavily in 5G infrastructure..."
  Doc 6  → Banking & Finance: "Bajaj Finance saw 30% jump in loan disbursements..."
  Doc 7  → Macro & Strategy: "IT sector faces headwinds from banking crisis..."
  Doc 8  → Macro & Strategy: "India's GDP growth projected at 7.2%..."
  Doc 9  → Tech & Digital Services: "Wipro restructured its consulting division..."
  Doc 10 → Banking & Finance: "SBI reported highest ever quarterly profit..."
Scalability

This technique scales to thousands of documents. Imagine feeding 5 years of MD&A sections from all NSE-500 companies and automatically discovering: (1) Shifts in topic emphasis over time, (2) Which companies discuss "AI" vs "cost-cutting", (3) Emerging risk themes before they become mainstream concerns.

Self-Study Materials

Comprehensive resources to deepen your understanding of NLP in financial analysis

Self-Study Module 1: NLP Fundamentals for Finance

Build a solid foundation in NLP concepts essential for financial text analysis.

Text Preprocessing
  • Tokenization: word_tokenize, sent_tokenize (NLTK)
  • Stopword removal: general + financial-specific stopwords
  • Stemming vs. Lemmatization: Porter, WordNet lemmatizer
  • Regular expressions for financial pattern extraction
  • Handling special characters: ₹, $, %, crore, lakh
Text Representation
  • Bag of Words (BoW): simple word counting
  • TF-IDF: weighting words by importance
  • N-grams: capturing word sequences ("net profit", "revenue growth")
  • Word Embeddings: Word2Vec, GloVe for semantic similarity
  • Transformers: BERT, FinBERT for contextual embeddings
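
As a small illustration of the n-gram idea from the list above, the following standard-library sketch slides a window over a token list and counts the resulting bigrams. The helper name ngrams and the sample sentence are made up for the example; NLTK and scikit-learn offer equivalent utilities.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "net profit rose while revenue growth slowed but net profit margin held"
tokens = text.split()

bigrams = Counter(ngrams(tokens, 2))
for phrase, count in bigrams.most_common(3):
    print(f"{phrase}: {count}")
```

Counting word pairs keeps "net profit" together as one unit, which single-word counts would split into two unrelated terms.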
Part-of-Speech (POS) Tagging
  • Nouns → entities (companies, products)
  • Adjectives → sentiment signals ("strong", "weak")
  • Verbs → actions ("raised", "declined", "restructured")
  • Modal verbs → uncertainty ("may", "could", "might")
Recommended Resources
  • Book: "Speech and Language Processing" by Jurafsky & Martin, Chapters 2-4 (free online)
  • Course: NLTK Book, nltk.org/book (free, interactive)
  • Video: "NLP with Python" by sentdex on YouTube
  • Practice: Download an annual report from BSE India and try tokenizing it

Self-Study Module 2: Advanced Sentiment Analysis

Go beyond TextBlob: learn domain-specific sentiment tools for financial text.

VADER Sentiment
  • Specifically designed for social media and short text
  • Handles emojis, slang, and capitalization
  • Output: positive, negative, neutral, compound scores
  • Built into NLTK: from nltk.sentiment import SentimentIntensityAnalyzer
FinBERT (State-of-the-Art)
  • BERT model fine-tuned on financial text (12M sentences)
  • Trained on Reuters, Bloomberg financial news
  • Labels: positive, negative, neutral (3-class)
  • Usage: from transformers import pipeline; nlp = pipeline('sentiment-analysis', model='ProsusAI/finbert')
  • Accuracy: ~85-90% on financial sentiment benchmarks
Loughran-McDonald Dictionary
  • Finance-specific word list (unlike general-purpose TextBlob)
  • Categories: positive, negative, uncertainty, litigious, constraining
  • Proven in academic research for 10-K/10-Q analysis
  • Available at sraf.nd.edu
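
Dictionary-based scoring of this kind reduces to counting how many tokens fall into each category. The sketch below shows the mechanics only: the three word sets are short placeholders invented for illustration and are NOT taken from the Loughran-McDonald lists, and the function name dictionary_tone is hypothetical. For real analysis, load the published word lists in place of these sets.

```python
import re

# Placeholder word lists for illustration only. They are NOT the
# Loughran-McDonald lists; substitute the real ones before serious use.
POSITIVE = {"strong", "improved", "record", "confident"}
NEGATIVE = {"decline", "declined", "weakness", "loss"}
UNCERTAINTY = {"may", "could", "uncertain", "uncertainty"}

def dictionary_tone(text):
    """Score a text by the share of tokens in each word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    unc = sum(t in UNCERTAINTY for t in tokens)
    return {
        "net_tone": (pos - neg) / total,
        "uncertainty": unc / total,
        "tokens": len(tokens),
    }

scores = dictionary_tone(
    "Revenue declined and demand may stay uncertain, but margins improved."
)
print(scores)
```

Unlike a general-purpose polarity score, the category counts stay interpretable: you can see that the sentence nets out to neutral tone while carrying a noticeable share of uncertainty words.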
Recommended Resources
  • Paper: "FinBERT: Financial Sentiment Analysis with Pre-trained Neural Language Models" (arXiv)
  • Paper: "Lazy Prices" by Cohen et al. (2018): NLP on 10-K changes predicts returns
  • Tool: Hugging Face Transformers (FinBERT model)
  • Practice: Compare TextBlob vs VADER vs FinBERT on the same earnings call text
Practice Exercise: Multi-Method Sentiment Comparison

Take the earnings call text from Lab 2 and run it through TextBlob, VADER, and (if possible) FinBERT. Create a comparison table showing polarity scores from each method. Which method do you think is most accurate for financial text? Why?

Self-Study Module 3: Earnings Call Analysis at Scale

Learn how institutional investors analyze hundreds of earnings calls every quarter.

Data Pipeline Architecture
  • Source: Earnings call transcripts from Seeking Alpha, Motilal Oswal, Capital IQ
  • Preprocessing: Separate CEO/CFO remarks from Q&A sections
  • Feature Engineering: Word count, sentence complexity, hedging ratio
  • Analysis: Sentiment scoring, topic extraction, comparison to previous quarter
  • Output: Dashboard with sentiment trends, alerts for anomalous calls
Key Metrics Tracked
  • Overall sentiment score (polarity)
  • QoQ sentiment change (more predictive than absolute)
  • Forward-looking statement ratio
  • Hedging ratio ("may", "might", "could" frequency)
  • Q&A negativity score vs prepared remarks
  • Revenue/guidance mention frequency
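
Two of the metrics above, the hedging ratio and the forward-looking statement ratio, can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: the hedge word set is just the three modals named in the list, the forward-looking stems are an illustrative choice, and the function name call_metrics and the sample sentences are invented.

```python
import re

HEDGE_WORDS = {"may", "might", "could"}
FORWARD_RE = re.compile(
    r"\b(?:expect|anticipat|plan|aim|target|believ|project)\w*"
    r"|\bwill\b|\bgoing forward\b",
    re.IGNORECASE,
)

def call_metrics(sentences):
    """Return the hedging ratio (per token) and forward-looking ratio (per sentence)."""
    tokens = [t for s in sentences for t in re.findall(r"[a-z']+", s.lower())]
    hedges = sum(t in HEDGE_WORDS for t in tokens)
    forward = sum(1 for s in sentences if FORWARD_RE.search(s))
    return {
        "hedging_ratio": hedges / len(tokens) if tokens else 0.0,
        "forward_ratio": forward / len(sentences) if sentences else 0.0,
    }

# Invented sample sentences
sentences = [
    "We expect demand to remain stable.",
    "Pricing could soften and margins may compress.",
    "Headcount rose this quarter.",
    "We are targeting double-digit growth.",
]
metrics = call_metrics(sentences)
print(f"Hedging ratio: {metrics['hedging_ratio']:.3f}")
print(f"Forward-looking ratio: {metrics['forward_ratio']:.2f}")
```

Computing the same metrics separately for prepared remarks and Q&A, and then for consecutive quarters, gives the comparisons the list describes.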
Indian Market Sources
  • Moneycontrol earnings call transcripts
  • BSE/NSE annual reports (PDF)
  • SEBI EDIFAR filing system
  • Analyst call recordings on company IR pages
  • Trendlyne for parsed financial data
Recommended Resources
  • Book: "Textual Analysis for Finance" by Loughran & McDonald
  • Course: Coursera, "Natural Language Processing in Finance" by DeepLearning.AI
  • Open Source: FinRL library (reinforcement learning for trading, which can consume NLP-derived signals)
  • Data: screener.in (free Indian financial data with quarterly results)

Self-Study Module 4: NLP in Indian Capital Markets

Real-world applications of NLP in the Indian financial ecosystem.

Institution | NLP Application | Impact
SEBI | Insider trading detection via unusual language in filings | Identified 50+ suspicious cases
Motilal Oswal | Automated earnings call summary generation | Covers 500+ calls per quarter
Zerodha/Rainmatter | Sentiment-based trading signals from news | Alpha generation in small/mid caps
HDFC AMC | Annual report analysis for fund managers | Reduced reading time by 70%
CRISIL | NLP-assisted credit rating analysis | Faster rating decisions
RBI | MPC statement analysis for policy prediction | Markets parse every word change
Additional Learning Resources
  • Research: Read RBI's Monetary Policy Committee (MPC) statements and track word changes across meetings
  • Case Study: How markets reacted to different language in Raghuram Rajan vs Urjit Patel vs Shaktikanta Das statements
  • GitHub: Search "Indian financial NLP" for open-source projects
  • Newsletters: "The Ken" and "CapTable" for Indian fintech NLP applications
Practice Exercise: RBI MPC Statement Analysis

1. Download the last 4 RBI MPC resolution statements from rbi.org.in
2. Run sentiment analysis on each statement
3. Track the frequency of key words: "inflation", "growth", "accommodative", "withdrawal of accommodation"
4. Correlate sentiment changes with Nifty 50 returns on policy day
5. Write a 1-page report: "Can NLP predict RBI rate decisions?"
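
Step 3 of the exercise can be started with a simple phrase counter that compares two statements. In the sketch below the two statement snippets are invented placeholders, not real RBI text, and the names phrase_counts and KEY_PHRASES are hypothetical; replace the snippets with the downloaded resolutions. Counting uses plain substring matching, so a word such as "growth" would also be counted inside longer words.

```python
KEY_PHRASES = ["inflation", "growth", "accommodative", "withdrawal of accommodation"]

def phrase_counts(text, phrases=KEY_PHRASES):
    """Count occurrences of each phrase after lowercasing and collapsing whitespace."""
    lowered = " ".join(text.lower().split())
    return {p: lowered.count(p) for p in phrases}

# Invented placeholder snippets; replace with the downloaded MPC statements.
statements = {
    "meeting_1": (
        "The MPC remains focused on withdrawal of accommodation. "
        "Inflation is elevated while growth is resilient."
    ),
    "meeting_2": (
        "Inflation has moderated. The stance stays accommodative to support "
        "growth, and growth risks are balanced."
    ),
}

counts = {meeting: phrase_counts(text) for meeting, text in statements.items()}
delta = {p: counts["meeting_2"][p] - counts["meeting_1"][p] for p in KEY_PHRASES}

for phrase in KEY_PHRASES:
    print(f"{phrase:>28}: {counts['meeting_1'][phrase]} -> "
          f"{counts['meeting_2'][phrase]} ({delta[phrase]:+d})")
```

The change in counts between meetings, rather than the raw counts, is the quantity to line up against policy-day Nifty 50 returns in step 4.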

Summary: NLP Techniques for Financial Analysis

Comparison of NLP methods and their applications

NLP Technique | Use Case | Python Tool | Difficulty
TF-IDF | Keyword extraction across companies | scikit-learn | Beginner
TextBlob Sentiment | Quick polarity scoring of management commentary | TextBlob | Beginner
VADER Sentiment | Social media + short financial text | NLTK | Beginner
Forward-Looking Detection | Identify guidance and projections in MD&A | regex + NLTK | Intermediate
Named Entity Recognition | Extract companies, amounts, dates from text | spaCy / regex | Intermediate
Topic Modeling (NMF/LDA) | Discover themes across many documents | scikit-learn | Intermediate
FinBERT | State-of-the-art financial sentiment | Hugging Face Transformers | Advanced
Earnings Call Q&A Analysis | Detect management evasion and hedging | Custom pipeline | Advanced

Key Takeaways

Text is Data: By common estimates, around 80% of financial data is unstructured text. NLP converts earnings calls, annual reports, and news into quantifiable signals for investment decisions.
Sentiment Predicts Returns: Research finds that management tone on earnings calls is a statistically significant predictor of future stock returns (reported directional accuracy of roughly 60-65%).
TF-IDF Reveals Themes: Term Frequency-Inverse Document Frequency identifies the distinctive themes of each company and scales to hundreds of firms in seconds.
MD&A is a Goldmine: Forward-looking statements and risk language in annual reports can be automatically classified and tracked over time for trend analysis.
Beware Management Spin: Executives use "sandwich techniques" to bury bad news. NLP helps detect hedging, evasion, and tonal shifts that humans might miss.
Start Simple, Scale Up: Begin with TextBlob, VADER, and TF-IDF (beginner), progress to forward-looking detection, NER, and NMF topic modeling (intermediate), then explore FinBERT and Q&A analysis (advanced).