theme: Olive Green, 8
autoscale: true
*Amit Kapoor* @amitkaps
*Bargava Subramanian* @bargava
- Download the Repo: https://github.com/amitkaps/applied-machine-learning
- Finish installation
- Run `jupyter notebook` in the console
- 0900 - 0930: Breakfast
- 0930 - 1115: Session 1 - Conceptual
- 1115 - 1130: Tea Break
- 1130 - 1315: Session 2 - Coding
- 1315 - 1400: Lunch
- 1400 - 1530: Session 3 - Conceptual
- 1530 - 1545: Tea Break
- 1545 - 1700: Session 4 - Coding
"Data is a clue to the End Truth" -- Josh Smith
- A start-up providing loans to consumers
- Running for the last few years
- Now planning to adopt a data-driven lens
What types of questions can you ask?
- What is the trend of loan defaults?
- Do older customers have more loan defaults?
- Which customer is likely to have a loan default?
- Why do customers default on their loan?
- Descriptive
- Inquisitive
- Predictive
- Causal
- Descriptive: Understand Patterns, Trends, Outliers
- Inquisitive: Conduct Hypothesis Testing
- Predictive: Make a prediction
- Causal: Establish a causal link
It’s tough to make predictions, especially about the future. -- Yogi Berra
- Human Learning: Make a Judgement
- Machine Programmed: Create explicit Rules
- Machine Learning: Learn from Data
[Machine learning is the] field of study that gives computers the ability to learn without being explicitly programmed. -- Arthur Samuel
Machine learning is the study of computer algorithms that improve automatically through experience -- Tom Mitchell
- A pattern exists
- It cannot be pinned down mathematically
- We have data on it to learn from
"Use a set of observations (data) to uncover an underlying process"
- Theory
- Paradigms
- Models
- Methods
- Process
- Theory: Understand Key Concepts (Intuition)
- Paradigms: Limit to One (Supervised)
- Models: Use Two Types (Linear, Trees)
- Methods: Apply Key Ones (Validation, Selection)
- Process: Code the Approach (Real Examples)
- What are the types of data on which we are learning?
- Can you give an example, say, of measuring temperature?
- Categorical
- Nominal: Burned, Not Burned
- Ordinal: Hot, Warm, Cold
- Continuous
- Interval: 30 °C, 40 °C, 80 °C
- Ratio: 30 K, 40 K, 50 K
- Categorical
- Nominal: = , !=
- Ordinal: =, !=, >, <
- Continuous
- Interval: =, !=, >, <, +, - (differences are meaningful)
- Ratio: =, !=, >, <, +, -, ×, ÷ (ratios and percentages are meaningful)
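A quick pandas sketch of these operations (the pandas usage is an assumption of this write-up, not part of the deck): an ordered categorical supports the ordinal comparisons, a plain one only equality.

```python
import pandas as pd

# Ordinal: ordered categories allow >, < as well as ==, !=
temps = pd.Series(["Hot", "Warm", "Cold", "Warm"]).astype(
    pd.CategoricalDtype(categories=["Cold", "Warm", "Hot"], ordered=True)
)
print(temps == "Warm")   # equality: valid for nominal and ordinal alike
print(temps > "Cold")    # ordering: valid only because ordered=True
```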
Context: Loan Approval
Customer Application
- age: age of the applicant
- income: annual income of the applicant
- years: no. of years of employment
- ownership: type of house owned
- grade: credit grade for the applicant
Question - How much loan amount to approve?
| age | income | years | ownership | grade | amount |
|-----|--------|-------|-----------|-------|--------|
| 31  | 12252  | 25.0  | RENT      | C     | 2400   |
| 24  | 49200  | 13.0  | RENT      | C     | 10000  |
| 28  | 75000  | 11.0  | OWN       | B     | 12000  |
| 27  | 110000 | 13.0  | MORTGAGE  | A     | 3600   |
| 33  | 24000  | 10.0  | RENT      | B     | 5000   |
- Categorical
- Nominal: home owner [rent, own, mortgage]
- Ordinal: credit grade [A > B > C > D > E]
- Continuous
- Interval: approval date [20/04/16, 19/11/15]
- Ratio: loan amount [3000, 10000]
Features: age, income, years, ownership, grade

Target: amount
Training Data: $$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$$ - historical records
Given a set of features $$\mathbf{x}$$, predict the target $$y$$.

Learning Paradigm: Supervised

- If $$y$$ is continuous - Regression
- If $$y$$ is categorical - Classification
- Features: $$\mathbf{x}$$ (customer application)
- Target: $$y$$ (loan amount)
- Target Function: $$f: \mathcal{X} \to \mathcal{Y}$$ (ideal formula)
- Data: $$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$$ (historical records)
- Final Hypothesis: $$g: \mathcal{X} \to \mathcal{Y}$$ (formula to use)
- Hypothesis Set: $$\mathcal{H}$$ (all possible formulas)
- Learning Algorithm: $$\mathcal{A}$$ (how to learn the formula)
The Learning Model is composed of two elements:

- The Hypothesis Set: $$\mathcal{H} = \{h\}, \qquad g \in \mathcal{H}$$
- The Learning Algorithm: $$\mathcal{A}$$
How do we choose the right hypothesis $$h$$? By asking how well $$h(\mathbf{x})$$ approximates $$f(\mathbf{x})$$.

We will use the squared error $$(h(\mathbf{x}) - f(\mathbf{x}))^2$$. Averaged over the data, this gives the in-sample error $$E_{in}(h) = \frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}_i) - y_i)^2$$.
- The Linear Regression algorithm aims to minimise $$E_{in}(h)$$
- One-Step Learning -> solves in one step to give $$g(\mathbf{x})$$ (see the sketch below)
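A minimal numpy sketch of that one-step solution via the normal equation; the data here is synthetic, standing in for the loan records.

```python
import numpy as np

# One-step learning: w = (X^T X)^{-1} X^T y, solved directly (no iteration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 records, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

Xb = np.hstack([np.ones((100, 1)), X])         # prepend a bias column
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)       # the closed-form g(x)
print(w)                                       # ~[0, 2, -1, 0.5]
```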
- Frame: Problem definition
- Acquire: Data ingestion
- Refine: Data wrangling
- Transform: Feature creation
- Explore: Feature selection
- Model: Model creation & assessment
- Insight: Communication
Variables

- age, income, years, ownership, grade, amount, default and interest

- What are the Features $$\mathbf{x}$$?
- What is the Target $$y$$?
Features: age, income, years, ownership, grade

Target: amount * (1 - default)
- Simple! Just read the data from the `csv` file
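For instance (the file name `loan_data.csv` is a placeholder; use the file shipped with the repo):

```python
import pandas as pd

df = pd.read_csv("loan_data.csv")   # placeholder file name
print(df.head())                    # first few records
print(df.dtypes)                    # how each column was parsed
```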
- REMOVE - NAN rows
- IMPUTATION - Replace them with something?
- Mean
- Median
- Fixed Number - Domain Relevant
- High Number (999) - Issue with modelling
- BINNING - Categorical variable where "Missing" becomes a category
- DOMAIN SPECIFIC - Entry error, pipeline, etc.
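A small pandas sketch of these options on a toy column (the loan notebooks may treat missing values differently):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [12252.0, np.nan, 75000.0, np.nan, 24000.0]})

df.dropna()                                  # REMOVE: drop the NaN rows
df["income"].fillna(df["income"].mean())     # IMPUTATION: mean
df["income"].fillna(df["income"].median())   # IMPUTATION: median
df["income"].fillna(999)                     # fixed high number: beware in modelling
df["income"].isna().astype(int)              # BINNING flavour: "missing" as a flag
```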
- What is an outlier?
- Descriptive Plots
- Histogram
- Box-Plot
- Measuring
- Z-score
- Modified Z-score > 3.5 where modified Z-score = 0.6745 * (x - x_median) / MAD
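A direct translation of that formula into numpy (toy data, for illustration only):

```python
import numpy as np

def modified_z_score(x):
    """Modified Z-score: 0.6745 * (x - median) / MAD, robust to outliers."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))        # median absolute deviation
    return 0.6745 * (x - med) / mad

x = np.array([30, 31, 29, 32, 30, 90])      # 90 looks suspicious
print(np.abs(modified_z_score(x)) > 3.5)    # flags only the last point
```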
- Single Variable Exploration
- Dual Variable Exploration
- Multi Variable Exploration
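A plotting sketch for the three levels, assuming the loan DataFrame from the Acquire step; a tiny stand-in is built here so the snippet runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the loan data loaded earlier
df = pd.DataFrame({"age": [31, 24, 28, 27, 33],
                   "income": [12252, 49200, 75000, 110000, 24000],
                   "amount": [2400, 10000, 12000, 3600, 5000]})

df["amount"].hist()                        # single variable: distribution
df.plot.scatter(x="income", y="amount")    # dual variable: relationship
pd.plotting.scatter_matrix(df)             # multi variable: all pairs
plt.show()
```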
Encodings
- One Hot Encoding
- Label Encoding
Feature Transformation
- Log Transform
- Sqrt Transform
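A sketch of both encodings and both transforms on toy columns (the notebooks may use scikit-learn's encoders instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ownership": ["RENT", "OWN", "MORTGAGE"],
                   "grade": ["C", "B", "A"],
                   "income": [12252, 75000, 110000]})

onehot = pd.get_dummies(df["ownership"], prefix="ownership")  # One Hot Encoding
df["grade_code"] = df["grade"].map({"A": 0, "B": 1, "C": 2})  # Label Encoding (ordinal)

df["log_income"] = np.log(df["income"])    # Log Transform for skewed features
df["sqrt_income"] = np.sqrt(df["income"])  # Sqrt Transform, a gentler option
```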
Parameters
- fit_intercept
- normalize
Error Measure
- mean squared error
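A minimal scikit-learn sketch tying the parameter and the error measure together, on synthetic data. Note the old `normalize` option has been removed from recent scikit-learn releases, so scale inputs separately if needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

model = LinearRegression(fit_intercept=True)     # the key parameter
model.fit(X, y)
print(mean_squared_error(y, model.predict(X)))   # in-sample error E_in
```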
- The "target function"
$$f$$ is not always a function - Not unique target value for same input
- Need to add noise
$$N(0,\sigma)$$
The best model we can create will still have an expected error set by the noise. If the Noise ($$\sigma$$) were zero, the target would reduce to the deterministic function $$f$$.
Learning is defined as achieving $$E_{out}(g) \approx 0$$. That splits into two questions:

(1) Can we make $$E_{in}(g)$$ small enough?
(2) Can we make $$E_{out}(g)$$ close to $$E_{in}(g)$$?
To find the generalisation error, we need to split our data into training and test samples
Given a large enough test sample, the test error is a reliable estimate of $$E_{out}(g)$$.
For Learning, question (1): a Complex Model has a better chance of approximating $$f$$ and driving $$E_{in}$$ down. Let's try increasing the model complexity - more features through interaction effects.

For Learning, question (2): given a large model complexity relative to the data, $$E_{out}$$ can drift far from $$E_{in}$$. The sketch below compares the two errors on a held-out split.
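A sketch of that split with scikit-learn on synthetic data; the gap between the two printed errors is the generalisation gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train error:", mean_squared_error(y_train, model.predict(X_train)))
print("test error :", mean_squared_error(y_test, model.predict(X_test)))
```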
- Bias are the simplifying assumptions made by a model to make the target function easier to learn.
- Variance is the amount that the estimate of the target function will change if different training data was used.
- Simple Target Function
- 5th data point - noisy
- 4th order polynomial fit
Overfitting - fitting the data more than is warranted, and hence fitting the noise (sketch below)
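The same story in a few lines of numpy: five points from a linear target, one of them noisy, and a 4th-order polynomial that fits the noise exactly.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1               # simple (linear) target
y[4] += 3.0                 # the 5th data point is noisy

w4 = np.polyfit(x, y, deg=4)   # E_in = 0: the quartic passes through every point
w1 = np.polyfit(x, y, deg=1)   # small E_in, far better out of sample

print(np.polyval(w4, 5.0))     # ~26: the quartic chases the noise
print(np.polyval(w1, 5.0))     # ~13.4, near the true 2*5 + 1 = 11
```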
- Regularization: not letting the weights grow too large
  - Ridge: add $$||w||^2$$ to the error being minimised
  - Lasso: add $$||w||_1$$ to the error being minimised
- Validation: checking when we reach the bottom point of the validation error curve
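Both penalties are one-liners in scikit-learn; on synthetic data with irrelevant features, Lasso's exact zeros are easy to see.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))                 # only 2 of 10 features matter
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # ||w||^2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, y)   # ||w||_1 penalty: zeroes some weights

print(ridge.coef_)
print(lasso.coef_)                   # exact zeros on the irrelevant features
```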
Validation set: hold out part of the training data to estimate the out-of-sample error of each candidate model.

Rule of Thumb: keep roughly one-fifth of the data aside for validation.

Note: The validation set is used for learning - it guides model choices, unlike the test set.

Cross-validation repeats the process 5 times, so every point is used for both training and validation.
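In scikit-learn the 5-fold procedure is a single call (synthetic data again):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=100)

scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())   # average validation error over the 5 folds
```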
How do we choose between competing models? Choose the function with the lowest validation error.
- Theory: Formulation, Generalisation, Bias-Variance, Overfitting
- Paradigms: Supervised - Regression
- Models: Linear - OLS, Ridge, Lasso
- Methods: Regularisation, Validation
- Process: Frame, Acquire, Refine, Transform, Explore, Model
Context: Loan Default
Customer Application
- age: age of the applicant
- income: annual income of the applicant
- years: no. of years of employment
- ownership: type of house owned
- grade: credit grade for the applicant
- amount: loan amount given
- interest: interest rate of loan
Question - Who is likely to default?
Find the weights $$w_i$$ that best fit:

$$y = 1 \text{ if } \sum_{i=1}^{d} w_i x_i > 0, \qquad y = 0 \text{ otherwise}$$
Follows the logistic function: $$P(y = 1 \mid \mathbf{x}) = \theta(\mathbf{w}^{\mathsf{T}}\mathbf{x})$$

Where $$\theta(s) = \frac{e^{s}}{1 + e^{s}}$$

Minimise the negative log-likelihood.
- The Logistic Regression algorithm aims to minimise $$E_{in}(h)$$
- Iterative Method -> converges to give $$g(\mathbf{x})$$ (see the sketch below)
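A minimal fit-and-predict sketch with scikit-learn, using synthetic default flags in place of the loan data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic default flag

clf = LogisticRegression()        # solved iteratively; no closed form exists
clf.fit(X, y)
print(clf.predict_proba(X[:5]))   # P(no default), P(default) per applicant
```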
Classification Metrics
Recall (TPR) = TP / (TP + FN)
Precision = TP / (TP + FP)
Specificity (TNR) = TN / (TN + FP)
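The three metrics fall straight out of the confusion matrix; a toy check against scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("recall     :", tp / (tp + fn), recall_score(y_true, y_pred))
print("precision  :", tp / (tp + fp), precision_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))
```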
Receiver Operating Characteristic Curve
Plot of TPR vs FPR at different discrimination thresholds
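With predicted probabilities in hand, the curve and its area come from two calls (toy scores shown):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # predicted P(default)

fpr, tpr, thresholds = roc_curve(y_true, scores)     # one point per threshold
print(list(zip(thresholds, fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))
```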
Example: Survival on the Titanic
- Easy to interpret
- Little data preparation
- Scales well with data
- White-box model
- Instability: changing variables or altering their sequence can produce a very different tree
- Overfitting
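A depth-limited tree on synthetic data shows the white-box quality: the learned rules print as plain text (the feature names here are illustrative, not from the deck).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3)   # capping depth curbs overfitting
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # readable rules
```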
- Bagging: also called bootstrap aggregation; reduces variance
- Builds many decision trees and averages their predictions
- Random Forest: combines the bagging idea with random selection of features
- Trees are constructed as usual - but at each split, only a random subset of features is considered (sketch below)
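Both ideas - bootstrap-trained trees and per-split feature subsets - are exposed as parameters of scikit-learn's RandomForestClassifier (synthetic data again):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,     # bagging: 100 bootstrap-trained trees, votes averaged
    max_features="sqrt",  # random subset of features tried at each split
    random_state=0,
).fit(X, y)
print(forest.score(X, y))
```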
If you torture the data enough, it will confess. -- Ronald Coase
- Data Snooping
- Selection Bias
- Survivor Bias
- Omitted Variable Bias
- Black-box model vs White-box model
- Adherence to regulations
- Theory: Formulation, Generalisation, Bias-Variance, Overfitting
- Paradigms: Supervised - Regression & Classification
- Models: Linear Models, Tree Models
- Methods: Regularisation, Validation, Aggregation
- Process: Frame, Acquire, Refine, Transform, Explore, Model