Getting Started: Python Lab Setup
Set up your Python environment before starting the hands-on coding exercises
Why Python for Financial Analysis?
Python has become the de facto language for financial analysis and quantitative finance. Its rich ecosystem of libraries (pandas, NumPy, scikit-learn) enables analysts to automate repetitive tasks, detect patterns invisible to the human eye, and build predictive models that enhance decision-making.
Traditional Analysis
- Manual ratio calculation in Excel
- Visual inspection of financial statements
- Subjective judgment for credit decisions
- Time-consuming and error-prone
AI-Assisted Analysis
- Automated ratio calculation with pandas
- Statistical anomaly detection (Z-score, IQR)
- ML-based credit risk prediction (the lab below reports 85%+ accuracy on synthetic data; real-world results vary)
- Scalable, reproducible, and fast
Lab Environment Setup
Choose one of the following options. Options B and C require running commands in your terminal or command prompt:
Option A: Google Colab (Recommended)
No installation needed. Open colab.research.google.com and start coding. All libraries are pre-installed.
Option B: Local Installation
Install Python 3.9+ and required libraries:
pip install pandas numpy scikit-learn matplotlib seaborn
Option C: Jupyter Notebook
Install Jupyter for interactive coding:
pip install jupyter
jupyter notebook
Quick Start with Google Colab
1. Go to colab.research.google.com
2. Click "New Notebook"
3. Copy-paste the code from this lecture
4. Press Shift+Enter to run each cell
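Whichever option you choose, it can help to confirm that the required libraries are importable before starting the labs. The following is a minimal sketch using only the standard library; the helper name `check_libraries` is hypothetical, and note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util
import sys

# Import names of the libraries from Option B
# (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "numpy", "sklearn", "matplotlib", "seaborn"]


def check_libraries(names):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, found in check_libraries(REQUIRED).items():
    print(f"{name:<12} {'OK' if found else 'MISSING'}")
```

Any library reported as MISSING can be installed with the `pip install` command from Option B.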
AI/ML Pipeline for Financial Analysis
Every AI-assisted financial analysis follows this pipeline:
1. Data Collection
Financial statements, market data
2. Preprocessing
Clean, normalize, handle missing values
3. Feature Engineering
Create financial ratios, indicators
4. Model Building
Train ML model on historical data
5. Evaluation
Accuracy, precision, recall, F1
Automated Financial Ratio Calculation
Use Python & pandas to automate the calculation of key financial ratios for multiple companies simultaneously
What Are Financial Ratios?
Financial ratios are mathematical relationships between two or more financial statement items. They help investors, analysts, and creditors evaluate a company's financial health, performance, and risk. Automating their calculation saves hours of manual work and reduces errors.
Liquidity Ratios
Current Ratio, Quick Ratio
Profitability Ratios
Net Margin, ROE, ROA, ROCE
Leverage Ratios
D/E Ratio, Interest Coverage
Efficiency Ratios
Asset Turnover, Inventory Turnover
Lab 1: Automated Ratio Calculator
Complete Python program to calculate financial ratios for Indian companies:
```python
# =====================================================
# AI-Assisted Financial Analysis: Lab 1
# Automated Financial Ratio Calculation
# =====================================================
import pandas as pd
import numpy as np

# --------------------------------------------------
# Step 1: Create sample financial data for Indian companies
# --------------------------------------------------
data = {
    'Company': ['TCS', 'Infosys', 'Reliance', 'HDFC Bank', 'Tata Steel', 'HUL'],
    # Balance Sheet items (in ₹ Crores)
    'Current_Assets': [45000, 38000, 185000, 520000, 48000, 8200],
    'Current_Liabilities': [22000, 18000, 125000, 470000, 32000, 4100],
    'Inventory': [500, 300, 25000, 0, 15000, 2100],
    'Total_Assets': [120000, 95000, 850000, 2100000, 180000, 25000],
    'Total_Debt': [2500, 4500, 250000, 950000, 70000, 500],
    'Shareholders_Equity': [95000, 78000, 450000, 250000, 55000, 6500],
    # Income Statement items (in ₹ Crores)
    'Revenue': [225000, 165000, 850000, 200000, 230000, 55000],
    'Net_Income': [44000, 28000, 70000, 45000, 8000, 9000],
    'EBIT': [54000, 35000, 105000, 55000, 15000, 12000],
    'Interest_Expense': [600, 500, 16000, 30000, 4500, 80],
    'EBITDA': [58000, 38000, 130000, 65000, 22000, 13000],
}
df = pd.DataFrame(data)

# --------------------------------------------------
# Step 2: Calculate Financial Ratios
# --------------------------------------------------
# Liquidity Ratios
df['Current_Ratio'] = df['Current_Assets'] / df['Current_Liabilities']
df['Quick_Ratio'] = (df['Current_Assets'] - df['Inventory']) / df['Current_Liabilities']

# Profitability Ratios (expressed in percent)
df['Net_Profit_Margin'] = (df['Net_Income'] / df['Revenue']) * 100
df['ROE'] = (df['Net_Income'] / df['Shareholders_Equity']) * 100  # Return on Equity
df['ROA'] = (df['Net_Income'] / df['Total_Assets']) * 100         # Return on Assets
# ROCE: capital employed = total assets minus current liabilities
df['ROCE'] = (df['EBIT'] / (df['Total_Assets'] - df['Current_Liabilities'])) * 100

# Leverage Ratios
df['Debt_to_Equity'] = df['Total_Debt'] / df['Shareholders_Equity']
df['Interest_Coverage'] = df['EBIT'] / df['Interest_Expense']
df['Debt_to_EBITDA'] = df['Total_Debt'] / df['EBITDA']

# Efficiency Ratios
df['Asset_Turnover'] = df['Revenue'] / df['Total_Assets']

# --------------------------------------------------
# Step 3: Display Results
# --------------------------------------------------
ratio_cols = ['Company', 'Current_Ratio', 'Quick_Ratio', 'Net_Profit_Margin',
              'ROE', 'ROA', 'ROCE', 'Debt_to_Equity', 'Interest_Coverage',
              'Asset_Turnover']
print(df[ratio_cols].to_string(index=False, float_format="%.2f"))
```
| Company | Current_Ratio | Quick_Ratio | Net_Profit_Margin | ROE | ROA | ROCE | Debt_to_Equity | Interest_Coverage | Asset_Turnover |
|---|---|---|---|---|---|---|---|---|---|
| TCS | 2.05 | 2.02 | 19.56 | 46.32 | 36.67 | 55.10 | 0.03 | 90.00 | 1.88 |
| Infosys | 2.11 | 2.09 | 16.97 | 35.90 | 29.47 | 45.45 | 0.06 | 70.00 | 1.74 |
| Reliance | 1.48 | 1.28 | 8.24 | 15.56 | 8.24 | 14.48 | 0.56 | 6.56 | 1.00 |
| HDFC Bank | 1.11 | 1.11 | 22.50 | 18.00 | 2.14 | 3.37 | 3.80 | 1.83 | 0.10 |
| Tata Steel | 1.50 | 1.03 | 3.48 | 14.55 | 4.44 | 10.14 | 1.27 | 3.33 | 1.28 |
| HUL | 2.00 | 1.49 | 16.36 | 138.46 | 36.00 | 57.42 | 0.08 | 150.00 | 2.20 |
Key Insight from the Output
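- TCS & Infosys: Virtually zero debt (D/E of about 0.03 to 0.06) with interest coverage of 70x to 90x, so debt-servicing risk is minimal.
- HUL: Highest ROE (138%) because of its small equity base, a pattern common among FMCG companies.
- HDFC Bank: Low ROA (2.14%) and the highest D/E in the table (3.80) are normal for a bank, because its balance sheet is funded largely by deposits and borrowings.
- Tata Steel: Highest leverage among the non-financial companies (D/E 1.27) with thin interest coverage (3.33x), reflecting the risk of a cyclical industry.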
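One way to see why a small equity base inflates ROE is the three-factor DuPont identity: ROE = net margin × asset turnover × equity multiplier (total assets / equity). The identity is standard, but it is not part of the lab code; the sketch below applies it to the TCS and HUL figures from the Step 1 sample data, and the function name `dupont` is hypothetical.

```python
# Sample figures (₹ Crores) copied from the Lab 1 data dictionary.
companies = {
    "TCS": {"net_income": 44000, "revenue": 225000,
            "total_assets": 120000, "equity": 95000},
    "HUL": {"net_income": 9000, "revenue": 55000,
            "total_assets": 25000, "equity": 6500},
}


def dupont(net_income, revenue, total_assets, equity):
    """Split ROE into margin, turnover, and leverage components."""
    margin = net_income / revenue          # profitability
    turnover = revenue / total_assets      # efficiency
    multiplier = total_assets / equity     # leverage
    return margin, turnover, multiplier, margin * turnover * multiplier


for name, figures in companies.items():
    margin, turnover, multiplier, roe = dupont(**figures)
    print(f"{name}: margin={margin:.1%}, turnover={turnover:.2f}x, "
          f"equity multiplier={multiplier:.2f}x, ROE={roe:.1%}")
```

In this sample data HUL's net margin is actually lower than TCS's, so most of the ROE gap comes from the equity multiplier, which is roughly three times larger for HUL.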
Lab 1B: Ratio Classification & Scoring
Automatically classify companies into risk categories based on their financial ratios:
```python
# Classify companies based on financial health scoring
def classify_health(row):
    """Classify a company's financial health based on key ratios."""
    score = 0

    # Current Ratio scoring (ideal: 1.5 - 3.0)
    if 1.5 <= row['Current_Ratio'] <= 3.0:
        score += 2
    elif row['Current_Ratio'] >= 1.0:
        score += 1

    # D/E Ratio scoring (lower is better)
    if row['Debt_to_Equity'] < 0.5:
        score += 2
    elif row['Debt_to_Equity'] < 1.0:
        score += 1

    # ROE scoring (higher is better)
    if row['ROE'] > 15:
        score += 2
    elif row['ROE'] > 10:
        score += 1

    # Interest Coverage scoring
    if row['Interest_Coverage'] > 5:
        score += 2
    elif row['Interest_Coverage'] > 3:
        score += 1

    # Classify based on total score (maximum is 8)
    if score >= 7:
        return '🟢 Excellent'
    elif score >= 5:
        return '🟡 Good'
    elif score >= 3:
        return '🟠 Average'
    else:
        return '🔴 Poor'


df['Health_Rating'] = df.apply(classify_health, axis=1)
print(df[['Company', 'Health_Rating']].to_string(index=False))
```
Company Health_Rating
TCS 🟢 Excellent
Infosys 🟢 Excellent
Reliance 🟡 Good
HDFC Bank 🟠 Average
Tata Steel 🟠 Average
HUL 🟢 Excellent
Anomaly Detection in Financial Statements
Detect suspicious patterns, potential fraud, and unusual values in financial data using statistical methods and machine learning
Why Anomaly Detection Matters
Financial statement fraud costs investors billions. The Satyam scandal (₹7,136 Cr), DHFL fraud, and IL&FS crisis all involved manipulated financial statements. AI/ML techniques can flag suspicious patterns that human analysts might miss.
Common Financial Anomalies
- Revenue significantly above industry average
- Sudden jump in receivables vs revenue
- Profit margin much higher than peers
- Cash flow diverging from net income
- Inventory growing faster than sales
Detection Methods
- Z-Score Method: Flag values more than 2 standard deviations from the mean
- IQR Method: Flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
- Isolation Forest: ML-based unsupervised method
- Benford's Law: Check digit distribution in financial data
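The IQR method from the list above is not covered by a lab, so here is a minimal standard-library sketch of it. It uses the net profit margin figures from Lab 2A; the function name `iqr_outliers` is hypothetical, and the `"inclusive"` quantile method is chosen as an assumption because it interpolates linearly between data points (other quartile conventions give slightly different fences).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k × IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return lower, upper, [v for v in values if v < lower or v > upper]


# Net profit margins (%) of the ten companies used in Lab 2A.
margins = [19.5, 17.0, 12.5, 14.8, 11.2, 16.5, 14.2, 13.8, 10.5, 42.0]
lower, upper, outliers = iqr_outliers(margins)
print(f"Fences: [{lower:.2f}, {upper:.2f}]  Outliers: {outliers}")
```

Because quartiles are barely affected by a single extreme value, the fences stay close to the bulk of the data and only the 42% margin falls outside them.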
Lab 2A: Z-Score Anomaly Detection
Detect anomalous financial values using the Z-Score method. A Z-score measures how many standard deviations a data point is from the mean.
```python
# =====================================================
# AI-Assisted Financial Analysis: Lab 2A
# Z-Score Based Anomaly Detection
# =====================================================
import pandas as pd
import numpy as np

# Sample financial data for 10 Indian companies
data = {
    'Company': ['TCS', 'Infosys', 'Wipro', 'HCL Tech', 'Tech Mahindra',
                'Mphasis', 'LTIMindtree', 'Persistent', 'Coforge', 'SuspectCorp'],
    'Net_Profit_Margin': [19.5, 17.0, 12.5, 14.8, 11.2,
                          16.5, 14.2, 13.8, 10.5, 42.0],        # SuspectCorp: 42%!
    'Revenue_Growth': [8.5, 7.2, 6.8, 9.1, 8.0,
                       15.2, 12.5, 18.0, 22.0, 95.0],           # SuspectCorp: 95%!
    'Receivables_to_Revenue': [0.22, 0.25, 0.20, 0.28, 0.24,
                               0.21, 0.23, 0.19, 0.26, 0.65],   # SuspectCorp: 65%
    'Cash_Flow_to_Net_Income': [1.05, 1.10, 0.95, 1.02, 0.98,
                                1.08, 0.92, 1.01, 0.88, 0.30],  # SuspectCorp: 0.30!
}
df = pd.DataFrame(data)

# --------------------------------------------------
# Z-Score Calculation
# --------------------------------------------------
numeric_cols = ['Net_Profit_Margin', 'Revenue_Growth',
                'Receivables_to_Revenue', 'Cash_Flow_to_Net_Income']

print("🔍 Z-Score Anomaly Detection (Threshold: |Z| > 2)")
print("=" * 60)

for col in numeric_cols:
    mean_val = df[col].mean()
    std_val = df[col].std()  # pandas uses the sample standard deviation (ddof=1)
    df[f'{col}_ZScore'] = (df[col] - mean_val) / std_val

    # Flag anomalies
    anomalies = df[df[f'{col}_ZScore'].abs() > 2]
    if len(anomalies) > 0:
        print(f"\n⚠️ ANOMALY in {col}:")
        for _, row in anomalies.iterrows():
            print(f"  🚨 {row['Company']}: Value={row[col]:.2f}, "
                  f"Z-Score={row[f'{col}_ZScore']:.2f}")
```
🔍 Z-Score Anomaly Detection (Threshold: |Z| > 2)
============================================================
⚠️ ANOMALY in Net_Profit_Margin:
🚨 SuspectCorp: Value=42.00, Z-Score=2.72
⚠️ ANOMALY in Revenue_Growth:
🚨 SuspectCorp: Value=95.00, Z-Score=2.79
⚠️ ANOMALY in Receivables_to_Revenue:
🚨 SuspectCorp: Value=0.65, Z-Score=2.79
⚠️ ANOMALY in Cash_Flow_to_Net_Income:
🚨 SuspectCorp: Value=0.30, Z-Score=-2.72
Interpretation: SuspectCorp Is a Major Red Flag!
Net Profit Margin (42%): Way above IT industry average of 15-20% — suspicious profitability.
Revenue Growth (95%): Implausibly high vs industry peers — possible revenue inflation.
Receivables/Revenue (65%): Very high — may be booking fake sales without cash collection.
Cash Flow/Net Income (0.30): Earnings not converting to cash — classic fraud indicator (think Satyam!).
Lab 2B: ML-Based Anomaly Detection (Isolation Forest)
Isolation Forest is an unsupervised ML algorithm that isolates anomalies by randomly selecting features and split values:
```python
# =====================================================
# AI-Assisted Financial Analysis: Lab 2B
# Isolation Forest for Anomaly Detection
# =====================================================
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np

# Use the same data from Lab 2A
# (assuming df is already loaded with the 10 companies)

# Prepare features for Isolation Forest
features = df[['Net_Profit_Margin', 'Revenue_Growth',
               'Receivables_to_Revenue', 'Cash_Flow_to_Net_Income']]

# Train Isolation Forest
iso_forest = IsolationForest(
    contamination=0.1,   # Expect ~10% anomalies
    random_state=42,
    n_estimators=100
)

# fit_predict returns 1 for normal points and -1 for anomalies
df['Anomaly_Score'] = iso_forest.fit_predict(features)
df['Anomaly_Label'] = df['Anomaly_Score'].map({1: 'Normal', -1: '🚨 ANOMALY'})

# Display results
print("🌲 Isolation Forest Results:")
print("=" * 50)
print(df[['Company', 'Anomaly_Label']].to_string(index=False))
```
🌲 Isolation Forest Results:
==================================================
Company Anomaly_Label
TCS Normal
Infosys Normal
Wipro Normal
HCL Tech Normal
Tech Mahindra Normal
Mphasis Normal
LTIMindtree Normal
Persistent Normal
Coforge Normal
SuspectCorp 🚨 ANOMALY
How Isolation Forest Works
Intuition: Anomalies are "few and different" — they are easier to isolate. The algorithm randomly splits features and counts how many splits it takes to isolate each point. Anomalies need fewer splits (shorter path length in the tree).
Advantage: Unlike Z-Score (which checks one variable at a time), Isolation Forest considers multivariate relationships — it can detect combinations of values that are abnormal together even if each individually looks normal.
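The "fewer splits" intuition can be illustrated without scikit-learn. The sketch below is a simplified, hypothetical re-implementation of the path-length idea only: it picks a random feature and a random split value, keeps the side containing the target point, and counts splits until the point is alone. A real Isolation Forest additionally builds each tree on a random subsample and converts path lengths into a normalized anomaly score. The two features are taken from the Lab 2A data, and the function names are assumptions.

```python
import random

# Two features per company from the Lab 2A data:
# (Net_Profit_Margin, Cash_Flow_to_Net_Income); the last row is SuspectCorp.
points = [
    (19.5, 1.05), (17.0, 1.10), (12.5, 0.95), (14.8, 1.02), (11.2, 0.98),
    (16.5, 1.08), (14.2, 0.92), (13.8, 1.01), (10.5, 0.88), (42.0, 0.30),
]


def path_length(subset, target, rng, depth=0):
    """Count random splits until `target` is alone in its partition."""
    if len(subset) <= 1:
        return depth
    feature = rng.randrange(len(target))       # pick a random feature
    lo = min(p[feature] for p in subset)
    hi = max(p[feature] for p in subset)
    if lo == hi:                               # nothing left to split on
        return depth
    split = rng.uniform(lo, hi)                # pick a random split value
    side = target[feature] < split
    kept = [p for p in subset if (p[feature] < split) == side]
    return path_length(kept, target, rng, depth + 1)


def average_path_length(data, target, n_trees=200, seed=42):
    """Average the path length of `target` over many random trees."""
    rng = random.Random(seed)
    return sum(path_length(data, target, rng) for _ in range(n_trees)) / n_trees


averages = [average_path_length(points, p) for p in points]
for p, avg in zip(points, averages):
    print(f"{p}: average path length = {avg:.2f}")
```

Because SuspectCorp sits far from the other points on both features, most random splits separate it immediately, so its average path length should come out shortest.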
Benford's Law: The Fraud Detective's Secret Weapon
Benford's Law states that in naturally occurring datasets, the leading digit follows a specific distribution: 1 appears ~30.1% of the time, 2 appears ~17.6%, and so on. Deviations from this pattern can indicate data manipulation.
```python
# =====================================================
# Benford's Law Analysis for Financial Data
# =====================================================
def benfords_law_check(values, label="Data"):
    """Check if data follows Benford's Law distribution."""
    # Expected Benford's Law distribution (percent per leading digit)
    benford_expected = {
        1: 30.1, 2: 17.6, 3: 12.5, 4: 9.7, 5: 7.9,
        6: 6.7, 7: 5.8, 8: 5.1, 9: 4.6
    }

    # Get leading digits; skip values whose integer part is zero (|v| < 1),
    # because they would otherwise be counted with a leading digit of 0
    leading_digits = [int(str(abs(int(float(v))))[0])
                      for v in values if abs(float(v)) >= 1]
    total = len(leading_digits)

    print(f"\n📊 Benford's Law Analysis: {label}")
    print("=" * 45)
    print(f"{'Digit':<8}{'Expected':<12}{'Actual':<12}{'Deviation'}")
    print("-" * 45)

    for d in range(1, 10):
        actual_pct = (leading_digits.count(d) / total) * 100
        expected_pct = benford_expected[d]
        deviation = abs(actual_pct - expected_pct)
        # Flag digits that deviate by more than 5 percentage points
        flag = "⚠️" if deviation > 5 else "✅"
        print(f"{d:<8}{expected_pct:>6.1f}%{'':5}{actual_pct:>6.1f}%{'':4}"
              f"{flag} ({deviation:.1f}%)")


# Example: Test with revenue figures
revenues = [225000, 165000, 85000, 76000, 45000, 38000, 23000, 18000,
            12000, 9500, 8500, 7200, 5800, 4500, 3200, 2800, 1500, 1200,
            980, 650]
benfords_law_check(revenues, "Company Revenues")
```
📊 Benford's Law Analysis: Company Revenues
=============================================
Digit Expected Actual Deviation
---------------------------------------------
1 30.1% 25.0% ⚠️ (5.1%)
2 17.6% 15.0% ✅ (2.6%)
3 12.5% 10.0% ✅ (2.5%)
4 9.7% 10.0% ✅ (0.3%)
5 7.9% 5.0% ✅ (2.9%)
6 6.7% 5.0% ✅ (1.7%)
7 5.8% 10.0% ✅ (4.2%)
8 5.1% 10.0% ✅ (4.9%)
9 4.6% 10.0% ⚠️ (5.4%)
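The expected percentages come from the formula P(d) = log10(1 + 1/d). With only 20 values, a fixed percentage-point threshold is noisy, so a chi-square goodness-of-fit statistic is a common complement. The sketch below is one possible way to compute it with the standard library; the function name is hypothetical, and the critical value 15.507 is the standard 5% cutoff for 8 degrees of freedom. Treat the result as indicative only, since the chi-square approximation is rough when expected counts are this small.

```python
import math


def benford_chi_square(values):
    """Return (chi_square, counts) comparing leading digits with Benford's Law."""
    digits = [int(str(abs(int(v)))[0]) for v in values if abs(v) >= 1]
    n = len(digits)
    counts = {d: digits.count(d) for d in range(1, 10)}
    chi_square = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)   # Benford: P(d) = log10(1 + 1/d)
        chi_square += (counts[d] - expected) ** 2 / expected
    return chi_square, counts


# Same revenue figures as in the example above.
revenues = [225000, 165000, 85000, 76000, 45000, 38000, 23000, 18000,
            12000, 9500, 8500, 7200, 5800, 4500, 3200, 2800, 1500, 1200,
            980, 650]

CRITICAL_5PCT_DF8 = 15.507  # chi-square critical value: 8 d.o.f., alpha = 0.05
chi_square, counts = benford_chi_square(revenues)
verdict = "deviates from" if chi_square > CRITICAL_5PCT_DF8 else "is consistent with"
print(f"chi-square = {chi_square:.2f}; sample {verdict} Benford's Law at the 5% level")
```

For this sample the statistic is well below the critical value, so the individual digit deviations are within what chance alone would produce for 20 observations.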
Machine Learning for Credit Risk Prediction
Build ML models to predict whether a company will default on its debt obligations using financial ratios as features
What Is Credit Risk?
Credit risk is the risk that a borrower (company) will fail to repay its debt obligations. Banks, NBFCs, and investors use credit risk models to assess the likelihood of default. Traditionally, this was done using the Altman Z-Score and credit ratings. ML models can improve accuracy by learning complex non-linear patterns from historical data.
Decision Tree
Simple, interpretable model that splits data based on feature thresholds.
Illustrative accuracy: ~78% (varies by dataset)
Random Forest
Ensemble of many decision trees. Reduces overfitting and improves accuracy.
Illustrative accuracy: ~87% (varies by dataset)
Logistic Regression
Classic statistical model. Outputs probability of default between 0 and 1.
Illustrative accuracy: ~82% (varies by dataset)
The Classic Altman Z-Score
Before ML, there was the Altman Z-Score (1968) — a linear combination of 5 financial ratios that predicts bankruptcy:
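Z = 1.2×X₁ + 1.4×X₂ + 3.3×X₃ + 0.6×X₄ + 1.0×X₅
where X₁ = Working Capital / Total Assets, X₂ = Retained Earnings / Total Assets, X₃ = EBIT / Total Assets, X₄ = Market Value of Equity / Total Liabilities, and X₅ = Sales / Total Assets.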
| Z-Score Range | Zone | Interpretation |
|---|---|---|
| Z > 2.99 | Safe Zone | Company is financially healthy |
| 1.81 ≤ Z ≤ 2.99 | Grey Zone | Moderate risk — needs monitoring |
| Z < 1.81 | Distress Zone | High probability of bankruptcy |
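The formula and zones above translate directly into code. This is a minimal sketch: the function names are hypothetical, and the input figures are invented for illustration because the Lab 1 sample data does not include retained earnings or market value of equity.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original (1968) Altman Z-Score for a listed manufacturing company."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


def altman_zone(z):
    """Map a Z-Score to the zone from the table above."""
    if z > 2.99:
        return "Safe Zone"
    if z >= 1.81:
        return "Grey Zone"
    return "Distress Zone"


# Hypothetical company (all figures in the same currency unit).
z = altman_z(working_capital=200, retained_earnings=300, ebit=150,
             market_value_equity=900, sales=1200,
             total_assets=1000, total_liabilities=600)
print(f"Z = {z:.3f} -> {altman_zone(z)}")
```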
Lab 3: Credit Risk Prediction with ML
Build a complete ML pipeline to predict credit risk for Indian companies:
```python
# =====================================================
# AI-Assisted Financial Analysis: Lab 3
# Credit Risk Prediction with Machine Learning
# =====================================================
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, accuracy_score,
                             confusion_matrix)

# --------------------------------------------------
# Step 1: Generate Synthetic Credit Risk Dataset
# (In practice, use real data from RBI, CRISIL, or
# company annual reports)
# --------------------------------------------------
np.random.seed(42)
n_companies = 500

# Generate financial features
data = {
    'current_ratio': np.random.uniform(0.5, 4.0, n_companies),
    'debt_to_equity': np.random.uniform(0.0, 5.0, n_companies),
    'interest_coverage': np.random.uniform(0.5, 20.0, n_companies),
    'net_margin': np.random.uniform(-10.0, 25.0, n_companies),
    'roe': np.random.uniform(-15.0, 40.0, n_companies),
    'asset_turnover': np.random.uniform(0.1, 3.0, n_companies),
}
df = pd.DataFrame(data)

# Create target variable: Default (1) or Not Default (0)
# Companies with high debt, low coverage, negative margins → more likely to default
default_prob = (0.3 * (df['debt_to_equity'] / 5.0)
                + 0.25 * (1 - df['interest_coverage'] / 20.0)
                + 0.25 * (1 - df['net_margin'] / 25.0)
                + 0.2 * (1 - df['current_ratio'] / 4.0))
df['default'] = (default_prob > np.random.uniform(0.2, 0.6, n_companies)).astype(int)

print(f"Dataset: {n_companies} companies")
print(f"Defaults: {df['default'].sum()} ({df['default'].mean()*100:.1f}%)")
print(f"Non-Defaults: {(df['default']==0).sum()} ({(1-df['default'].mean())*100:.1f}%)")

# --------------------------------------------------
# Step 2: Split Data into Train and Test Sets
# --------------------------------------------------
X = df.drop('default', axis=1)
y = df['default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\nTrain: {len(X_train)} companies | Test: {len(X_test)} companies")

# --------------------------------------------------
# Step 3: Scale Features
# --------------------------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)

# --------------------------------------------------
# Step 4: Train Models
# --------------------------------------------------
# Model 1: Logistic Regression
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)

# Model 2: Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)  # RF doesn't need scaling
rf_pred = rf_model.predict(X_test)

# --------------------------------------------------
# Step 5: Evaluate Models
# --------------------------------------------------
print("\n" + "=" * 50)
print("📊 MODEL COMPARISON")
print("=" * 50)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_pred)*100:.1f}%")
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_pred)*100:.1f}%")
print("\n📋 Random Forest Classification Report:")
print(classification_report(y_test, rf_pred,
                            target_names=['Non-Default', 'Default']))
```
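Sample output (illustrative: exact figures depend on the random draw and library versions, and because the synthetic default score averages about 0.53 against a threshold that averages 0.40, a fresh run will likely report a larger share of defaults):
Dataset: 500 companies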
Defaults: 221 (44.2%)
Non-Defaults: 279 (55.8%)
Train: 400 companies | Test: 100 companies
==================================================
📊 MODEL COMPARISON
==================================================
Logistic Regression Accuracy: 82.0%
Random Forest Accuracy: 88.0%
📋 Random Forest Classification Report:
precision recall f1-score support
Non-Default 0.89 0.91 0.90 56
Default 0.87 0.84 0.85 44
accuracy 0.88 100
macro avg 0.88 0.87 0.88 100
weighted avg 0.88 0.88 0.88 100
Feature Importance: What Drives Credit Risk?
Random Forest tells us which financial ratios are most important for predicting default:
```python
# Feature Importance from Random Forest
importance = pd.Series(
    rf_model.feature_importances_,
    index=X.columns
).sort_values(ascending=True)

print("🔑 Feature Importance (Credit Risk Model):")
print("=" * 45)
for feat, imp in importance.items():
    bar = '█' * int(imp * 50)  # text bar scaled to the importance value
    print(f"{feat:>20} {bar} {imp*100:.1f}%")
```
🔑 Feature Importance (Credit Risk Model):
=============================================
asset_turnover █████ 12.8%
roe ██████▌ 16.2%
net_margin ████████ 20.5%
current_ratio █████████ 23.1%
debt_to_equity ██████████ 27.4%
interest_coverage █████████████ 33.2%
Interpretation
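In this illustrative output, the Interest Coverage Ratio ranks as the most important predictor (33.2%), which makes financial sense: a company that can barely cover its interest payments is at high risk of default. Debt-to-Equity (27.4%) and Current Ratio (23.1%) follow. Treat the exact figures with caution. The values in scikit-learn's feature_importances_ always sum to 100%, which the values shown do not, and because the synthetic default rule uses only debt_to_equity, interest_coverage, net_margin, and current_ratio, a real run would be expected to rank roe and asset_turnover lowest. Lenders commonly rely on similar coverage and leverage metrics in their credit scoring models.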
Lab 3B: Predict Credit Risk for a New Company
Use the trained model to predict default probability for a new Indian company:
```python
# Predict credit risk for a NEW company
new_company = pd.DataFrame({
    'current_ratio': [0.8],       # ⚠️ Low liquidity
    'debt_to_equity': [3.5],      # ⚠️ High leverage
    'interest_coverage': [1.2],   # ⚠️ Barely covering interest
    'net_margin': [-2.5],         # 🔴 Negative margin!
    'roe': [-8.0],                # 🔴 Negative return
    'asset_turnover': [0.4],      # ⚠️ Low efficiency
})

# Get prediction and probability
prediction = rf_model.predict(new_company)[0]
probability = rf_model.predict_proba(new_company)[0]  # [P(non-default), P(default)]

risk_label = "🔴 HIGH RISK — DEFAULT LIKELY" if prediction == 1 else "🟢 LOW RISK — NON-DEFAULT"

print("=" * 50)
print("📊 CREDIT RISK PREDICTION FOR NEW COMPANY")
print("=" * 50)
print(f"Prediction: {risk_label}")
print(f"Probability of Non-Default: {probability[0]*100:.1f}%")
print(f"Probability of Default: {probability[1]*100:.1f}%")
print("=" * 50)
```
==================================================
📊 CREDIT RISK PREDICTION FOR NEW COMPANY
==================================================
Prediction: 🔴 HIGH RISK — DEFAULT LIKELY
Probability of Non-Default: 12.0%
Probability of Default: 88.0%
==================================================
Real-World Considerations
This model uses synthetic data. In practice, you would: (1) Use real financial data from NSE/BSE listed companies, (2) Include macro-economic features (GDP growth, interest rates), (3) Use credit rating data from CRISIL/ICRA as labels, (4) Handle class imbalance (defaults are rare ~2-5%), (5) Perform rigorous backtesting and validation.
Self-Study Materials
Comprehensive resources to deepen your understanding of AI/ML in financial analysis
Self-Study Module 1: Python for Finance Fundamentals
Master the Python basics needed for financial analysis. This module covers the essential libraries and techniques.
📚 pandas for Financial Data
- DataFrame creation from financial data
- Reading CSV/Excel files (Balance Sheets, P&L)
- GroupBy for sector-wise analysis
- Merge/Join for combining datasets
- Time-series operations for stock data
📚 NumPy for Numerical Computing
- Array operations for vectorized calculations
- Statistical functions (mean, std, percentile)
- Linear algebra for portfolio optimization
- Random number generation for simulations
📚 matplotlib & seaborn for Visualization
- Bar charts for ratio comparison
- Line plots for trend analysis
- Heatmaps for correlation matrices
- Box plots for outlier visualization
🔗 Recommended Resources
- Book: "Python for Finance" by Yves Hilpisch (O'Reilly) — Chapters 1-6
- Course: Kaggle's free "Pandas for Data Analysis" micro-course (kaggle.com/learn/pandas)
- Practice: Download real financial data from screener.in and practice ratio calculations
- Video: "Financial Analysis with Python" playlist on YouTube by sentdex
Self-Study Module 2: Anomaly Detection — Deep Dive
Go beyond the lecture material and understand the mathematics and advanced techniques behind financial anomaly detection.
📐 Statistical Methods
- Z-Score: Z = (X - μ) / σ; Flag if |Z| > 2 or 3
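- Modified Z-Score: Uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation (more robust to outliers)
- IQR Method: Flag values outside the range Q1 - 1.5×IQR to Q3 + 1.5×IQR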
- Grubbs' Test: For detecting a single outlier
- Dixon's Q Test: For small datasets (n < 30)
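As an illustration of the modified Z-score from the list above, here is a minimal standard-library sketch applied to the Lab 2A net profit margins. The constant 0.6745 and the cutoff of 3.5 follow the commonly cited Iglewicz and Hoaglin convention; the function name is hypothetical. The robust version is useful because, in a small sample, an outlier inflates the very mean and standard deviation that the ordinary Z-score relies on.

```python
import statistics


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Net profit margins (%) from Lab 2A; the last value is SuspectCorp.
margins = [19.5, 17.0, 12.5, 14.8, 11.2, 16.5, 14.2, 13.8, 10.5, 42.0]
scores = modified_z_scores(margins)

THRESHOLD = 3.5  # conventional cutoff for the modified Z-score
for value, score in zip(margins, scores):
    flag = "ANOMALY" if abs(score) > THRESHOLD else "ok"
    print(f"{value:>5.1f}  modified Z = {score:>6.2f}  {flag}")
```

Only the 42% margin exceeds the cutoff, and by a much wider gap than with the ordinary Z-score, because the median and MAD are not pulled toward the outlier.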
🤖 ML-Based Methods
- Isolation Forest: Random feature splitting (covered in lab)
- Local Outlier Factor (LOF): Density-based detection
- One-Class SVM: Learns the boundary of normal data
- Autoencoders: Neural network reconstruction error
- DBSCAN: Clustering-based outlier detection
📊 Financial Statement Red Flags
- Revenue growing but cash flow declining
- Days Sales Outstanding (DSO) increasing rapidly
- Gross margin significantly above industry
- Serial acquisitions to mask declining organic growth
- Related party transactions > 10% of revenue
- Auditor changes or qualified opinions
🔗 Recommended Resources
- Paper: "Financial Statement Fraud Detection using Machine Learning" (Search on Google Scholar)
- Book: "Forensic Accounting and Fraud Examination" by Hopwood et al.
- Case Study: Read about Satyam (2009), DHFL (2019), and IL&FS (2018) frauds — identify which ratios would have flagged them
- Tool: Practice with scikit-learn's anomaly detection module — scikit-learn.org
- Benford's Law: "Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection" by Mark Nigrini
📝 Practice Exercise: Build a Benford's Law Checker
Download quarterly financial data of any NSE-listed company from screener.in. Extract all line-item values, check if the leading digits follow Benford's distribution. Compare a known fraudulent company (e.g., Satyam data before 2009) with a clean company (e.g., TCS). What differences do you observe?
Self-Study Module 3: Machine Learning for Credit Risk — Advanced
Deepen your understanding of ML techniques used in credit risk modeling by banks and NBFCs.
🏗️ Model Architecture
- Data Collection: Financial statements, credit bureau data, macro indicators
- Feature Engineering: Financial ratios, trend variables, industry dummies
- Model Selection: Logistic Regression, Random Forest, XGBoost, Neural Networks
- Evaluation: ROC-AUC, Precision-Recall, KS Statistic, Gini Coefficient
- Deployment: Real-time scoring API, batch scoring
📊 Key Evaluation Metrics
- Accuracy: (TP+TN) / Total — overall correctness
- Precision: TP / (TP+FP) — of predicted defaults, how many actually defaulted?
- Recall: TP / (TP+FN) — of actual defaults, how many did we catch?
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under ROC curve (0.5=random, 1.0=perfect)
- KS Statistic: Max separation between default and non-default distributions
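The formulas above can be checked by hand with a few lines of Python. The sketch below uses a hypothetical confusion matrix and hypothetical model scores (they are not taken from the lab output), treats "default" as the positive class, and computes ROC-AUC through its pairwise interpretation: the share of (default, non-default) pairs in which the defaulter receives the higher score. The function names are assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(labels, scores):
    """AUC as the share of correctly ranked pairs; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set of 100 companies: 44 defaults, 56 non-defaults.
metrics = classification_metrics(tp=37, fp=5, fn=7, tn=51)
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")

# Hypothetical predicted default probabilities for six companies.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"ROC-AUC    {auc:.3f}")
```

The pairwise AUC loop is quadratic in the number of observations, which is fine for a teaching example but not for large portfolios.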
⚠️ Common Pitfalls
- Class Imbalance: Defaults are rare (2-5%). Use SMOTE, class weights, or undersampling
- Data Leakage: Don't use future information to predict past defaults
- Overfitting: Always use cross-validation and hold-out test set
- Survivorship Bias: Delisted companies are excluded, biasing the dataset
- Interpretability: Regulators require explainable models (use SHAP values)
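To make the class-weight remedy for imbalance concrete, the sketch below computes "balanced" weights with the formula n_samples / (n_classes × class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. The function name and the 3% default rate are assumptions chosen for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical portfolio: 30 defaults (1) among 1,000 companies.
labels = [1] * 30 + [0] * 970
weights = balanced_class_weights(labels)
print(f"Weight for default class:     {weights[1]:.2f}")
print(f"Weight for non-default class: {weights[0]:.2f}")
```

Each misclassified default then costs the model roughly 32 times as much as a misclassified non-default, which counteracts the tendency to predict "non-default" for every company.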
🔗 Recommended Resources
- Book: "Machine Learning for Asset Managers" by Marcos López de Prado
- Book: "Advances in Financial Machine Learning" by Marcos López de Prado
- Course: Coursera — "Machine Learning and Reinforcement Learning in Finance" by NYU
- Competition: Kaggle "Home Credit Default Risk" competition — study winning solutions
- Research: Read about Altman Z-Score, Ohlson O-Score, and Merton Model for credit risk
- Methodology: Explore CRISIL and ICRA credit rating methodologies
📝 Practice Exercise: Full Credit Risk Pipeline
1. Download financial data for 50+ NSE-listed companies from screener.in
2. Calculate at least 10 financial ratios (liquidity, profitability, leverage, efficiency)
3. Create a "default" label using credit ratings from CRISIL (BBB- and below = default risk)
4. Train Logistic Regression, Random Forest, and XGBoost models
5. Compare models using ROC-AUC and Precision-Recall curves
6. Use SHAP values to explain which features drive predictions
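Step 3 of the exercise can be sketched as a simple lookup. The rating scale below is a simplified ordering assumed for illustration (check the agency's published scale before relying on it), the cutoff follows the exercise's "BBB- and below" rule, and the function name and company names are hypothetical.

```python
# Simplified long-term rating scale, best to worst (an assumption for
# illustration; verify against the rating agency's published scale).
SCALE = ["AAA", "AA+", "AA", "AA-", "A+", "A", "A-",
         "BBB+", "BBB", "BBB-", "BB+", "BB", "BB-",
         "B+", "B", "B-", "C", "D"]
RANK = {rating: position for position, rating in enumerate(SCALE)}


def default_risk_label(rating, cutoff="BBB-"):
    """Return 1 if `rating` is at or below `cutoff` on the scale, else 0."""
    return int(RANK[rating.strip().upper()] >= RANK[cutoff])


# Hypothetical ratings for three unnamed companies.
ratings = {"Company A": "AAA", "Company B": "BBB-", "Company C": "bb+"}
for company, rating in ratings.items():
    print(f"{company}: {rating.upper():<5} -> label {default_risk_label(rating)}")
```

An unknown rating string raises a `KeyError`, which is deliberate here: silently mapping unrecognized ratings to either class would corrupt the labels.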
Self-Study Module 4: Real-World AI Applications in Indian Finance
Learn how Indian financial institutions are actually using AI/ML in production:
| Institution | AI Application | Impact |
|---|---|---|
| HDFC Bank | AI-based fraud detection on credit cards | 60% reduction in fraud losses |
| ICICI Bank | ML-based loan underwriting | 40% faster processing |
| Bajaj Finserv | Real-time credit scoring for personal loans | 30-second approval |
| Zerodha | Anomaly detection in trading patterns | Prevents market manipulation |
| RBI | AI for detecting shell companies | Identified 12,000+ shell firms |
🔗 Additional Learning Resources
- RBI Reports: Read RBI's annual "Trend and Progress of Banking in India" for data
- SEBI: Study SEBI's guidelines on algorithmic trading and AI usage
- GitHub: Search "credit risk prediction India" for real project repositories
- Newsletter: Subscribe to "The Ken" for Indian fintech deep-dives
Summary: Traditional vs AI-Assisted Analysis
A comprehensive comparison of approaches
| Aspect | Traditional Analysis | AI-Assisted Analysis |
|---|---|---|
| Ratio Calculation | Manual Excel formulas | Automated with Python (pandas) |
| Speed | Hours to days for 10 companies | Seconds for 100+ companies |
| Error Rate | High (manual copy-paste errors) | Low (reproducible code) |
| Anomaly Detection | Visual inspection, gut feeling | Z-Score, Isolation Forest, Benford's Law |
| Credit Risk | Altman Z-Score, credit ratings | ML models (85-90% accuracy) |
| Scalability | Limited by analyst capacity | Scales to thousands of companies |
| Bias | Cognitive biases (anchoring, confirmation) | Data-driven, but model bias exists |
| Interpretability | High (transparent calculations) | Medium (black-box models need SHAP/LIME) |
| Cost | Analyst salaries + Excel/tools | Free open-source (Python + libraries) |
Assessment Quiz
Test your understanding of AI-Assisted Financial Analysis