Those who are free of resentful thoughts surely find peace. - Buddha
Posted on 11th May 2019
What is Machine Learning?
Machine learning is about making machines (computers) learn or understand patterns in data without being explicitly programmed. Machine learning uses mathematical and statistical algorithms to make predictions about results or to take decisions about the data.
ML Model = Algorithm(Data), where Data = (X, Y)
X -> set of independent variables or features
Y -> output variable or response
The objective of ML is to estimate the target function (f) that best maps the input variables (X) to the output variable (Y): Y = f(X) + e
Here “e” is the irreducible error: no matter how good we get at estimating the target function (f), we cannot reduce this error.
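To make “e” concrete, here is a minimal sketch using invented toy data and scikit-learn's LinearRegression (the true f and the noise level are assumptions for illustration): even a well-estimated f leaves the irreducible error behind in the residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Simulate Y = f(X) + e with a known f(X) = 3X + 5
X = rng.uniform(0, 10, size=(200, 1))
e = rng.normal(0, 2.0, size=200)      # irreducible error
Y = 3 * X.ravel() + 5 + e

model = LinearRegression().fit(X, Y)  # our estimate of f
residuals = Y - model.predict(X)

print("estimated f: Y = %.2f * X + %.2f" % (model.coef_[0], model.intercept_))
print("residual std ~= std of e (about 2.0):", residuals.std())
```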
The Analytics Life Cycle:
1. Identify/Formulate Problem - Business Manager
2. Data Preparation - Business Analyst
3. Data Exploration - Business Analyst
4. Transform & Select - Data Scientist
5. Build Model - Data Scientist
6. Validate Model - IT Systems
7. Deploy Model - IT Systems
8. Evaluate/Monitor Results - Business Manager
Data Science Project Life Cycle:
CRISP-DM was conceived in late 1996 by three veterans of the young and immature data mining market. CRISP-DM stands for "CRoss-Industry Standard Process for Data Mining."
The process model for data mining provides an overview of the life cycle of a data mining project. It contains all the phases of a project, their respective tasks, and the relationships between these tasks. Relationships could exist between any data mining tasks depending on the goals, the background, and the interest of the user and, most importantly, on the data.
CRISP-DM Phases:
1. Business Understanding - This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
2. Data Understanding - The data understanding phase starts with initial data collection and proceeds with activities that enable you to become familiar with the data. ETL developers, Hadoop developers, data processing engineers, or data scientists usually work in this phase. Connecting to and collecting the data is often done here.
3. Data Preparation - The data preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modelling tools) from the initial raw data. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modelling tools. It requires statistical techniques.
4. Modeling - In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. This phase draws on the full range of machine learning algorithms and techniques.
5. Evaluation - At this stage you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered.
6. Deployment - Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Deployment often involves applying "live" models within an organization's decision-making processes. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
Steps involved in the Data Science Project Life Cycle
1. Business Understanding
> Determine Business Objectives - Background, Business Objectives, Business Success Criteria
> Assess Situation - Inventory of Resources, Requirements, Assumptions, and Constraints, Risks & Contingencies, Terminology, Costs and Benefits
> Determine Data Mining Goals - Data Mining Goals, Data Mining Success Criteria
> Produce Project Plan - Project Plan, Initial Assessment of Tools and Techniques
2. Data Understanding
> Collect Initial Data - Initial Data Collection Report
> Describe Data - Data Description Report
> Explore Data - Data Exploration Report
> Verify Data Quality - Data Quality Report
3. Data Preparation
> Select Data - Rationale for Inclusion/Exclusion
> Clean Data - Data Cleaning Report
> Construct Data - Derived Attributes, Generated Records
> Integrate Data - Merged Data
> Format Data - Reformatted Data
> Dataset - Dataset Description
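These data preparation tasks map naturally onto pandas. A minimal sketch with invented data, file-free so it runs as-is (all column names and values are hypothetical, not from the source):

```python
import pandas as pd

# Hypothetical raw sources
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 51],
    "city": ["Delhi", "Pune", "Pune", "Goa"],
    "notes": ["", "vip", "vip", ""],   # attribute we will not use
})
txns = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [5400.0, 6500.0, 4350.0],
    "n_items": [3, 2, 1],
})

# Select Data: keep only the relevant attributes
customers = customers[["customer_id", "age", "city"]]

# Clean Data: drop duplicate records, impute missing ages
customers = customers.drop_duplicates()
customers["age"] = customers["age"].fillna(customers["age"].median())

# Construct Data: a derived attribute
txns["spend_per_item"] = txns["amount"] / txns["n_items"]

# Integrate Data: merge the sources into one table
df = customers.merge(txns, on="customer_id")

# Format Data: cast types for the modelling tool
df["city"] = df["city"].astype("category")
print(df)
```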
4. Modelling
> Select Modelling Techniques - Modelling Technique, Modelling Assumptions
> Generate Test Design - Test Design
> Build Model - Parameter Settings, Models, Model Descriptions
> Assess Model - Model Assessment, Revised Parameter Settings
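The Generate Test Design / Build Model / Assess Model loop can be sketched with scikit-learn (the dataset and the model choice are illustrative assumptions, not prescribed by CRISP-DM):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate Test Design: hold out 30% of the data for assessment
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Build Model: parameter settings + fitted model
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Assess Model: if accuracy is poor, revise parameter settings and rebuild
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```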
5. Evaluation
> Evaluate Results - Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
> Review Process - Review of Process
> Determine Next Steps - List of Possible Actions, Decisions
6. Deployment
> Plan Deployment - Deployment Plan
> Plan Monitoring and Maintenance - Monitoring and Maintenance Plan
> Produce Final Report - Final Report, Final Presentation
> Review Project - Experience Documentation
For more details: https://www.the-modeling-agency.com/crisp-dm.pdf
A few examples that show applied machine learning techniques:
Regression Analysis – Finding the relationship between a dependent variable and one or more independent variables - Predicting diamond price based on carat, cut & clarity (see the regression sketch after this list)
Classification Analysis – Dividing objects into 2 or more known classes - Distinguishing cancer and normal cells
Outliers Analysis – Finding unusual observations - e.g. fraudulent credit card transactions
Association Analysis – Finding links between items - Shopping cart (market basket) analysis
Cluster Analysis (Segmentation) – Grouping similar objects together - Grouping customers into different clusters based on their previous shopping data/transactions.
Time Series Analysis – Time dependent Data - Stock prediction
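As a minimal sketch of the first technique, here is a regression on invented diamond data (the carat/price numbers are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical diamond data: price as a function of carat
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2, 1.5],
    "price": [450, 900, 1600, 3200, 4500, 7000],
})

model = LinearRegression().fit(df[["carat"]], df["price"])
new = pd.DataFrame({"carat": [0.9]})
print("predicted price for a 0.9-carat diamond:", model.predict(new)[0])
```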
For example, given the data set 1, 2, 2, 3, 4, 7, 9:
Mean - Sum of the values divided by their number: (1+2+2+3+4+7+9)/7 = 4
Median - Middle value separating the greater and lesser halves: 3
Mode - Most frequent value: 2
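The same three statistics, computed with Python's standard library on the data set above:

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 7, 9]
print(mean(data))    # 4
print(median(data))  # 3
print(mode(data))    # 2
```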
Statistics is a branch of Mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.
1. Descriptive Statistics: describes yesterday's (historical) data
How do we get/collect the data related to the problem?
> Financial/Accounting (Payroll, general ledger, and cash management)
> Customer Relationship Management: Sales and Marketing, Commissions, Call centre
> e-Commerce Applications: Online Shopping, Online Ticketing
How is the data processed?
Legacy Data, Operational Data, and Flat Files → Staging Area → Data Warehouse (Raw Data, Summary Data, Meta Data) → Target Information
Now this Target Information is used for fact-based decision making.
Target Information → [Summarize & Visualize (EDA); Predict, Estimate, Forecast (Modelling)] → Fact-Based Decision Making
What are the EDA Techniques?
Visualization & Summary: Used for Reporting and Data Validation.
Example:
https://kite.com/blog/python/data-analysis-visualization-python
https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python
https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9
1. Visual Analytics: It's like Harry Potter's DVD - a quick overview (see the sketch below)
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
2. Summary Analytics: It's like Harry Potter's novel - it contains the details
> Central tendency gives the answer, not the whole story; e.g. scoring 34 marks in an exam means little without knowing how the scores are spread.
> Dispersion is what supports decision making.
> Dispersion should always be low; low dispersion indicates good data quality.
> Dispersion should not cross or exceed the central tendency.
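Here is a minimal sketch of the "quick overview" idea from point 1, using the pandas plotting API linked above (the sales numbers are invented; matplotlib must be installed):

```python
import pandas as pd

# Toy data for a quick visual overview
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"],
                   "sales": [120, 135, 128, 160]})

ax = df.plot(x="month", y="sales", kind="line", title="Monthly sales")
ax.figure.savefig("sales.png")   # one line of plotting, one picture
```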
Summary Techniques:
Numerical Data: all statistics except count and percentage can be applied.
Character Data: only count and percentage should be applied; no other statistic.
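In pandas this split falls out of describe(): numeric columns get the full set of statistics, while object (character) columns get only counts and frequencies. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [5400, 6500, 5430, 5210],         # numerical
    "city": ["Delhi", "Pune", "Delhi", "Goa"],  # character
})

print(df["amount"].describe())  # count, mean, std, min, quartiles, max
print(df["city"].describe())    # count, unique, top, freq only
print(df["city"].value_counts(normalize=True))  # percentages
```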
Case Study:
CC Defaulter | CC Default Amount | Z-score ((X - Mean) / Std Dev)
1 | 5400 | -0.33029
2 | 6500 | -0.27536
3 | 5430 | -0.32879
4 | 65000 | 2.66563
5 | 5430 | -0.32879
6 | 5210 | -0.33978
7 | 4350 | -0.38272
8 | 5433 | -0.32864
9 | 4980 | -0.35126

Central Tendency:
Mean | 12014.8
Median | 5430
Mode | 5430
Geometric Mean | 7022
Harmonic Mean | 5885

Dispersion:
Standard Deviation | 20027.3
Data Quality (Quality) | 49%
Coefficient of Variation (Std/Mean) (Risk) | 167%
Range |
Min |
Max |
Kurtosis |
When to do Data Cleaning?
In the case above, observation 4 (65000) is an outlier: its z-score (2.67) sits far from all the others, so we remove it to clean the data.
After cleaning the data, we have:
CC Defaulter | CC Default Amount | Z-score ((X - Mean) / Std Dev)
1 | 5400 | 0.0112
2 | 6500 | 1.7684
3 | 5430 | 0.0591
5 | 5430 | 0.0591
6 | 5210 | -0.2923
7 | 4350 | -1.6662
8 | 5433 | 0.0639
9 | 4980 | -0.6598

Central Tendency:
Mean | 5393
Median | 5430
Mode | 5430
Geometric Mean | 5362
Harmonic Mean | 5330

Dispersion:
Standard Deviation | 625.986
Data Quality (Quality) | 99%
Coefficient of Variation (Std/Mean) (Risk) | 12%
Range |
Min |
Max |
Kurtosis |
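The whole case-study workflow fits in a few lines of Python. A sketch using the nine amounts above (the results may differ slightly from the spreadsheet figures in the tables, and the |z| < 2 cut-off is an assumption for illustration):

```python
import numpy as np

# The nine credit-card default amounts from the case study
amounts = np.array([5400, 6500, 5430, 65000, 5430, 5210, 4350, 5433, 4980])

def summarize(x):
    mean, std = x.mean(), x.std(ddof=1)             # sample standard deviation
    return mean, std, (x - mean) / std, std / mean  # + z-scores, + CV

mean, std, z, cv = summarize(amounts)
print(f"before: mean={mean:.1f} std={std:.1f} CV={cv:.0%}")

# Data cleaning: drop observations flagged as outliers by their z-score
cleaned = amounts[np.abs(z) < 2]                    # threshold 2 is an assumption
mean, std, z, cv = summarize(cleaned)
print(f"after:  mean={mean:.1f} std={std:.1f} CV={cv:.0%}")
```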
Good, better, best. Never let it rest. Until your good is better and your better is best. - St. Jerome