2 Alternative Data Analytics
Alternative Data Components
Modules for harnessing the power of Alternate Data
DataMart
Leverage our AGGREGATOR DATAMART to accelerate data architecture and storage
• Mobile Device
• Telecom
• E-Commerce
• Utility and Payments (POS)
• Social Media
• E-Mail
• Insurance
• Others: Travel, Rent, Web, Tax, Government Records, Psychometrics etc.
• Bank Statement ***
• Payment data
Feature Store
Leverage our FEATURE STORE to accelerate Feature Engineering for building predictive models and decision analytics
• Seamless transformation of raw data to Features, to be used for predictive modelling
ML Models
We use Advanced Machine Learning algorithms to build Explainable predictive models for Financial Institutions
• ML Algorithms
• Model Landscape
• Model Development
• Model Documentation
• Model Validation
• Model Deployment
• Independent Review
• Policy Framework
Use Cases
Leverage our expertise for multiple use cases to get a 360 degree view of a customer relationship
• Customer Profiling and Segmentation
• Credit Scoring
• Income Estimation
• Pricing
• Propensity
• Alternative Lending Products
*** Physical copies of bank statements have long been used for manual underwriting in consumer lending. However, that information typically does not flow as a
feature into a credit scoring engine. In the Digital Lending paradigm, bank statements are being digitized and their information is being used for credit scoring.
Alternative Data Feature Store
Figure: Raw Data → Automated Feature Engineering Layer (Feature Primitives, Feature Synthesis, Feature Classification, Pattern Matching, Expert Judgment) → Feature Store → Predictive Model
Raw data points are transformed into features using Feature Synthesis (applying a library of transformations to raw data) and Feature Mining using NLP (e.g. extraction of
features from text data such as SMS and e-mail), with an overlay of expert judgment.
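Feature Synthesis, as described above, can be pictured as applying every transformation in a library to each customer's raw values. The sketch below is only illustrative: the raw data, the primitive names and the `synthesize_features` helper are assumptions, not part of the product described here.

```python
from statistics import mean

# Hypothetical raw data: transaction amounts observed per customer.
RAW = {
    "cust_1": [1200.0, 300.0, 450.0],
    "cust_2": [80.0, 95.0],
}

# Illustrative library of transformation primitives (names are assumptions).
PRIMITIVES = {
    "sum": sum,
    "mean": mean,
    "max": max,
    "count": len,
}


def synthesize_features(raw, primitives, prefix="txn_amount"):
    """Apply every primitive to each customer's raw values."""
    store = {}
    for customer_id, values in raw.items():
        store[customer_id] = {
            f"{prefix}_{name}": fn(values) for name, fn in primitives.items()
        }
    return store


# One feature row per customer, ready to be written to a feature store.
feature_store = synthesize_features(RAW, PRIMITIVES)
```

Adding a new primitive to the dictionary yields a new feature for every customer without touching the loop, which is the main appeal of a primitive library.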
Illustrative Feature Mining from SMS Data using NLP
Automated Feature Mining
Data
• SMS1, SMS2, SMS3, SMS4, SMS5
SMS Tagging
• SMS classification to standard L1 and L2 categories
• L1 such as Savings, Current, Debit Card, Credit Card, E-Wallets etc.
• L2 such as Savings > Salary, Spend, Balance, Investment, Loan / EMI related, Account Info
• Process: NLP based classification (SMS embeddings using neural networks)
Data Insights
• Rules to extract information from each SMS such as ID, Amount, Transaction Type, Date etc.
• Process: Pattern matching based data extraction rules
Feature Engg.
• Roll-up of individual SMS level data at customer level (Id / pool) to generate features for model training, such as: Monthly Income, Total Loans O/s, Total EMI, Expected Monthly Spend and Savings, Delinquency pattern
• Process: Feature engineering by data science team
Decisioning
• Scoring Engine
• Customer Risk Score: Customer1 0.99, Customer2 0.80, Customer3 0.50, Customer4 0.25
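The Data Insights and Feature Engg. stages above can be sketched as regular-expression extraction followed by a customer-level roll-up. This is a minimal illustration under stated assumptions: the SMS texts, the patterns, the field names and the two rolled-up features are hypothetical, and the L2 tag is taken as already assigned by the upstream tagging stage.

```python
import re
from collections import defaultdict

# Hypothetical SMS messages, already tagged with an L2 category upstream.
SMS = [
    {"customer": "C1", "l2": "Salary",
     "text": "INR 50,000.00 credited to A/c XX1234 on 01-03-2024"},
    {"customer": "C1", "l2": "Loan / EMI",
     "text": "EMI of INR 7,500.00 debited from A/c XX1234 on 05-03-2024"},
    {"customer": "C2", "l2": "Salary",
     "text": "INR 32,000.00 credited to A/c XX9876 on 02-03-2024"},
]

# Illustrative extraction rules (assumed message formats).
AMOUNT_RE = re.compile(r"INR\s*([\d,]+(?:\.\d+)?)")
TYPE_RE = re.compile(r"\b(credited|debited)\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{2}-\d{2}-\d{4})\b")


def extract(sms):
    """Pull amount, transaction type and date out of one SMS."""
    amount = AMOUNT_RE.search(sms["text"])
    txn_type = TYPE_RE.search(sms["text"])
    date = DATE_RE.search(sms["text"])
    return {
        "customer": sms["customer"],
        "l2": sms["l2"],
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
        "type": txn_type.group(1).lower() if txn_type else None,
        "date": date.group(1) if date else None,
    }


def roll_up(records):
    """Aggregate SMS-level records into customer-level features."""
    features = defaultdict(lambda: {"monthly_income": 0.0, "total_emi": 0.0})
    for rec in records:
        if rec["amount"] is None:
            continue  # skip messages the rules could not parse
        if rec["l2"] == "Salary" and rec["type"] == "credited":
            features[rec["customer"]]["monthly_income"] += rec["amount"]
        elif rec["l2"] == "Loan / EMI" and rec["type"] == "debited":
            features[rec["customer"]]["total_emi"] += rec["amount"]
    return dict(features)


records = [extract(s) for s in SMS]
features = roll_up(records)
```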
Feature Mining: Bank Statement with Text Recognition and NLP
Aptivaa’s Bank Statement API supports English and Arabic Bank Statements
Digitization of the input statement
• Usage of Computer Vision and NLP algorithms for scanning & digitization in English and Arabic
• Support for both languages in the same sentence as well
• Easily trainable for specific font types and sizes
• Minimizes data errors through preset validation rules and user validation as well
Pattern Recognition
• Identification of the language available in the statement and translation to English
• Identification of Text Patterns/Classification Rules in a master table (e.g. transaction descriptions containing ‘Salary’/’Payroll’ are of type Salary)
Transaction Classification and Analysis
• Transaction comparison as per Text Patterns/Classification Rules into standard transaction types
• Auto-summary generation using various customizable, user-defined metrics exposed on the user interface, providing full control of analysis to the user*
• Pivoting by different transaction types and other dimensions (such as Time period, Debit/Credit etc.)
• Final report analysis is available in both PDF and smart HTML formats
• Income Estimation
• Spend Analytics
• Fixed Obligations
Feature Generation and AutoScoring
• Peer classification
• Key insights generated around Income pattern, Customer behavior and Psychographic Segmentation and, further, metrics generated for Risk Scoring
• Custom Neural Network Models for Credit Analysis
• Feature generation (for adding to Application Scorecard and creating internal Feature Store)
• Auto Scoring (automated scorecard, provided historical performance data)
Customer Score
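The master table of text patterns mentioned on this slide (e.g. descriptions containing ‘Salary’ or ‘Payroll’ map to type Salary) can be sketched as an ordered list of rules, followed by a pivot by transaction type and Debit/Credit. The rule table, the sample transactions and the function names below are hypothetical; the actual Bank Statement API is not shown in the source.

```python
import re
from collections import defaultdict

# Hypothetical master table: (transaction type, compiled text pattern).
# Word boundaries stop "EMI" from matching inside words such as "PREMIUM".
RULES = [
    ("Salary", re.compile(r"\b(?:salary|payroll)\b", re.IGNORECASE)),
    ("Loan / EMI", re.compile(r"\b(?:emi|loan)\b", re.IGNORECASE)),
    ("Utilities", re.compile(r"\b(?:electricity|water)\b", re.IGNORECASE)),
]


def classify(description):
    """Return the first transaction type whose pattern matches."""
    for txn_type, pattern in RULES:
        if pattern.search(description):
            return txn_type
    return "Unclassified"


def pivot(transactions):
    """Sum amounts by (transaction type, Debit/Credit)."""
    totals = defaultdict(float)
    for description, amount, direction in transactions:
        totals[(classify(description), direction)] += amount
    return dict(totals)


# Illustrative digitized statement lines: (description, amount, direction).
TRANSACTIONS = [
    ("PAYROLL ACME LLC", 9000.0, "Credit"),
    ("EMI AUTO LOAN", 1500.0, "Debit"),
    ("INSURANCE PREMIUM", 200.0, "Debit"),
]

summary = pivot(TRANSACTIONS)
```

Keeping the rules in a table rather than in code means new patterns can be added by an analyst without redeploying the classifier.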
Alternative Data Modelling
Explainable Machine Learning for superior predictive power with full model transparency
Figure: Feature Store (Feature 1 … Feature N) → ML Algorithms (XGBoost, Neural Net) → Important Features (Feature 1 … Feature M) → Features Discretization (Bin 1 to Bin 4 per feature) → Explainable ML → Predictive Model
Non-linear Machine Learning models are used for feature selection. The selected features are discretized and transformed (e.g. with a WoE transformation), then passed as input to a linear
algorithm or XGBoost (with monotonic constraints) to build fully explainable predictive models.
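As an illustration of the WoE (Weight of Evidence) transformation applied to a discretized feature, the sketch below computes one WoE value per bin from good and bad counts. The sign convention (log of the good share over the bad share), the additive smoothing and the example counts are assumptions chosen for the illustration; the source does not specify them.

```python
import math


def weight_of_evidence(bins, smoothing=0.5):
    """Compute WoE per bin from (goods, bads) counts.

    WoE_i = ln(share of all goods in bin i / share of all bads in bin i).
    A small additive smoothing (an assumption here) avoids division by
    zero and log(0) when a bin has no goods or no bads.
    """
    n_bins = len(bins)
    total_good = sum(g for g, _ in bins) + smoothing * n_bins
    total_bad = sum(b for _, b in bins) + smoothing * n_bins
    woe = []
    for goods, bads in bins:
        good_share = (goods + smoothing) / total_good
        bad_share = (bads + smoothing) / total_bad
        woe.append(math.log(good_share / bad_share))
    return woe


# Hypothetical counts for Bin 1 to Bin 4 of one feature: (goods, bads).
BINS = [(100, 40), (200, 30), (300, 20), (400, 10)]
woe_values = weight_of_evidence(BINS)
```

Each raw feature value is then replaced by the WoE of its bin before being passed to the linear model, so a single coefficient per feature remains easy to explain.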
Alternative Data Model Landscape for different customer segments
Illustrative Model Landscape
Approach 1
• Step1: Alternate Data Model for all customers
• Step2: Alternate + Traditional Data Model for some segments
• Some Segments (e.g. Medium Risk Customers) are rescored using a Combined Data Model (for Bureau Hit cases only)
Approach 2
• Step1: Alternate + Traditional Data Model for Bureau Hit Segment
• Step2: Alternate Data Model for No Hit Segment
• Combined Model is used for Hit Segment and Standalone Alternate Data Model is used for No Hit Segment
The final approach is selected on the basis of product (ticket size, loan tenor), data cost (bureau pull, alternate data cost) and the marginal contribution of each data source to predictive power.
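The two approaches differ only in how a customer is routed between models, which the following sketch makes explicit. The stub models, the customer fields and the rescoring band of 0.4 to 0.7 are hypothetical placeholders; real cut-offs would come from the portfolio being modelled.

```python
def score_approach_1(customer, alt_model, combined_model, rescoring_band=(0.4, 0.7)):
    """Score everyone on alternate data, then rescore some Bureau Hit cases."""
    score = alt_model(customer)
    low, high = rescoring_band  # assumed "medium risk" band
    if customer["bureau_hit"] and low <= score < high:
        return combined_model(customer), "combined"
    return score, "alternate"


def score_approach_2(customer, alt_model, combined_model):
    """Route by segment: combined model for Hit, alternate model for No Hit."""
    if customer["bureau_hit"]:
        return combined_model(customer), "combined"
    return alt_model(customer), "alternate"


# Stub models standing in for trained models (assumptions for illustration).
def alt_model(customer):
    return customer["alt_score"]


def combined_model(customer):
    return 0.5 * customer["alt_score"] + 0.5 * customer["bureau_score"]


hit_medium = {"bureau_hit": True, "alt_score": 0.5, "bureau_score": 0.25}
no_hit = {"bureau_hit": False, "alt_score": 0.5}
hit_high = {"bureau_hit": True, "alt_score": 0.9, "bureau_score": 0.5}

result_1 = score_approach_1(hit_medium, alt_model, combined_model)
result_2 = score_approach_2(no_hit, alt_model, combined_model)
```

Approach 1 pays for a bureau pull only for customers inside the rescoring band, whereas Approach 2 pulls bureau data for every Hit customer, which is where the data cost consideration above comes in.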
Combining Alternative Data with Traditional Data
Prevalent methodologies to combine alternative data with traditional data
Approaches to combine Alternative and Traditional Data
Inputs: Traditional Data Features and Alternative Data Features
• Single Model trained on combined dataset, with features from both sources
• Alternative Model Score added as a feature to traditional data for model training
• Traditional Model Score added as a feature to alternative data for model training
• Two independent models are trained, and a matrix of scores from both models is used for decisioning
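The last approach, decisioning on a matrix of scores from two independent models, can be sketched as a lookup on a pair of risk bands. The band cut-offs, the decision labels and the matrix contents are hypothetical, and the scores are assumed to be risk scores where a higher value means higher risk.

```python
from bisect import bisect_right

# Assumed cut-offs splitting a 0-1 risk score into low / medium / high bands.
CUTOFFS = (0.33, 0.66)

# Hypothetical decision matrix: rows = traditional band, columns = alternative band.
DECISION_MATRIX = [
    ["Approve", "Approve", "Refer"],
    ["Approve", "Refer", "Decline"],
    ["Refer", "Decline", "Decline"],
]


def risk_band(score, cutoffs=CUTOFFS):
    """Map a score to a band index: 0 = low, 1 = medium, 2 = high risk."""
    return bisect_right(cutoffs, score)


def decide(traditional_score, alternative_score):
    """Look up the decision for a pair of independent model scores."""
    row = risk_band(traditional_score)
    col = risk_band(alternative_score)
    return DECISION_MATRIX[row][col]


decision = decide(0.2, 0.9)  # low traditional risk, high alternative risk
```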
Illustrative Alternative Data Use Case
Credit Scoring using Telco Data
Data Category: Call Records, Location Data, User Info, Internet Usage, Top-Ups Data, VAS Data, SMS Data, Daily Balance, Postpaid Payment, Mobile Apps, Device Data, Wallet Txn Info
Feature Category: Demographics, Income Related, Spend Related, Usage Duration, Social Network, Employment
Pipeline: Data Category → Feature Category → ML Algorithms → Scoring Engine
Illustrative Alternative Data Use Case
Credit Scoring using Device Data
Data Category: Call Records, Location Data, SMS Data, Contacts Info, Device Info, Apps Info
Feature Category: Demographics, Income Related, Spend Related, Fixed Obligation, Social Network, Assets
ML Algorithms: XGBoost
Pipeline: Data Category → Feature Category → ML Algorithms → Scoring Engine
Business Benefit of Analytics
Improved ROA
Use of predictive models instead of heuristic/rule-based models can significantly improve profitability, business volume and ROA
1. For instance, for a default prediction model, an improvement of Gini coefficient from 40% to 50% would result in Lower Default Rate for same approval rate (reduction to 1.3% DR from 3.0% DR at same score cut-off for the ‘illustrative portfolio’) or Higher Approval Rate for same default rate (improvement in Approval Rate from 72.7% to 89.1% at ~3% DR for the ‘illustrative portfolio’).
2. This would result in either higher business volumes at same delinquency rates; or lower delinquency rates at same business volume. In either case, ROA would improve significantly.
                                            Gini = 40%                                  Gini = 50%
Score Cut-Off Band  Applications  Defaults  DR for Approved Cases  Approval Rate  ROA   DR for Approved Cases  Approval Rate  ROA
1                   10            8         5.7%                   98.2%          0.1%  5.6%                   98.2%          0.2%
2                   20            6         4.8%                   94.5%          0.6%  4.2%                   94.5%          0.9%
3                   30            5         4.1%                   89.1%          1.0%  2.9%                   89.1%          1.6%
4                   40            4         3.6%                   81.8%          1.2%  1.8%                   81.8%          2.1%
5                   50            4         3.0%                   72.7%          1.5%  1.3%                   72.7%          2.4%
6                   60            3         2.6%                   61.8%          1.7%  0.9%                   61.8%          2.6%
7                   70            3         2.2%                   49.1%          1.9%  0.7%                   49.1%          2.6%
8                   80            2         2.1%                   34.5%          1.9%  0.5%                   34.5%          2.7%
9                   90            2         2.0%                   18.2%          2.0%  0.0%                   18.2%          3.0%
10                  100           2         0.0%                   0.0%           0.0%  0.0%                   0.0%           0.0%
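The Approval Rate and DR for Approved Cases columns under Gini = 40% can be reproduced from the Applications and Defaults columns, assuming that a cut-off at band k rejects bands 1 to k and approves the rest. That reading is an inference that matches the published figures, not something the slide states; the ROA column depends on inputs the slide does not give, so it is not recomputed here.

```python
# Applications and Defaults per score band, copied from the table above.
APPLICATIONS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
DEFAULTS = [8, 6, 5, 4, 4, 3, 3, 2, 2, 2]


def cutoff_metrics(applications, defaults, cutoff_band):
    """Return (approval rate, default rate among approved) for a cut-off.

    Assumes bands 1..cutoff_band are rejected and higher bands approved.
    """
    approved_apps = sum(applications[cutoff_band:])
    approved_defaults = sum(defaults[cutoff_band:])
    approval_rate = approved_apps / sum(applications)
    default_rate = approved_defaults / approved_apps if approved_apps else 0.0
    return approval_rate, default_rate


for band in range(1, len(APPLICATIONS) + 1):
    approval, dr = cutoff_metrics(APPLICATIONS, DEFAULTS, band)
    print(f"cut-off {band:>2}: approval {approval:6.1%}, DR {dr:5.1%}")
```

At cut-off band 5, for example, 400 of 550 applications are approved with 12 defaults among them, which gives the 72.7% approval rate and 3.0% DR shown in the table.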
Challenges in using Alternative Data
Not all data is equal
1 Compliance with GDPR guidelines for expats
2 Data sparsity (incomplete datasets)
3 Data Integration challenges (e.g. customers will not have a common ID across data sources)
4 Unstructured formats (e.g. SMS data), not suitable for saving in RDBMS
5 Vendor Risk (e.g. financial strength of third-party data providers)
6 Data Quality and Veracity
7 Commercial Implications (Cost vs. Benefit)
8 Different predictive power for different data sources, so cannot be used without performance assessment