Machine Learning -II

Posted on 11th May 2019



Machine learning has two objectives. 

1. Statistics - Prediction => Y = a + b1*x1 + b2*x2 + ... + bn*xn + error, for numerical data

2. Mathematics - Classification => 1/(1 + exp(-Y)), for categorical data
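As a minimal sketch of the two formulas above: the linear equation produces a numeric prediction Y, and passing Y through 1/(1 + exp(-Y)) squeezes it into the range 0 to 1 so it can be used for classification. The coefficients, the inputs and the 0.5 threshold below are hypothetical values chosen only for illustration.

```python
import math


def predict(a, b, x):
    """Linear prediction Y = a + b1*x1 + ... + bn*xn (error term omitted)."""
    return a + sum(bi * xi for bi, xi in zip(b, x))


def sigmoid(y):
    """Map a linear score Y to a value between 0 and 1: 1 / (1 + exp(-Y))."""
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical intercept, coefficients and inputs (not from the post).
a = 1.0
b = [0.5, -0.25]
x = [2.0, 4.0]

y = predict(a, b, x)          # 1.0 + 0.5*2.0 - 0.25*4.0
p = sigmoid(y)                # score between 0 and 1
label = 1 if p >= 0.5 else 0  # the 0.5 cut-off is an assumption

print(y, p, label)
```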

 

Statistics is of two types:

1. Descriptive Statistics - Yesterday - EDA methods

2. Inferential Statistics - Tomorrow - Models

 

Problem statement:

iPhone sales data (customer ages) is as follows: 18,18,19,20,20,30,30,30,30,30,31,32,33,34,40,45,46,47,47,50,50,50,59,60,60,61

1. Summarise and visualise the customer age data.

2. Increase sales by 5%, based on what the customer data shows.

How do we approach such a question?

List the approaches and techniques available to you, then do the data validation, data cleaning and data munging.

Example below:

Summary Approach:

1. Continuous/Discrete Data Approach:

Numerical Data Approach - all statistics except count & percentage

Statistic            Value
Mean                 38.30769231
Standard Error       2.70371211
Median               33.5
Mode                 30
Standard Deviation   13.78628081
Sample Variance      190.0615385
Kurtosis             -1.124971234
Skewness             0.189941371
Range                43
Minimum              18
Maximum              61
Sum                  996
Count                26
Geometric Mean       34.90718932
Harmonic Mean        32.28076389
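A summary like the one above can be sketched with Python's standard `statistics` module. This is only an illustration, not necessarily the tool used for the table. Note that the list exactly as printed in the problem statement sums to 990 rather than 996, so the mean and the spread measures computed here differ slightly from the table, while count, median, mode and range agree.

```python
import math
import statistics as st

# Customer ages exactly as printed in the problem statement (26 values).
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

summary = {
    "count": len(ages),
    "sum": sum(ages),
    "mean": st.mean(ages),
    "median": st.median(ages),
    "mode": st.mode(ages),
    "stdev": st.stdev(ages),        # sample standard deviation (n - 1)
    "variance": st.variance(ages),  # sample variance (n - 1)
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
    "geometric_mean": st.geometric_mean(ages),
    "harmonic_mean": st.harmonic_mean(ages),
}
# Standard error of the mean = standard deviation / sqrt(sample size).
summary["std_error"] = summary["stdev"] / math.sqrt(summary["count"])

for name, value in summary.items():
    print(f"{name:>15}: {value}")
```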

 

2. Ordinal/Nominal Data Approach:

Categorical Data Approach - only count & percentage.

To make the decision, use the following techniques.

1. Histogram - for data quality check only

2. Stem-and-Leaf technique: for summarization, visualization & data quality check

https://www.rosettacode.org/wiki/Stem-and-leaf_plot

For the values below:

18 18 19 20 20 21 30 30 30 30 31 32 33 33 34 40 45 46 47 47 50 50 50 59 60 60 61

 

Stem   Leaf                Frequency   Cumulative Frequency   Frequency %   Cumulative Frequency %
1      8 8 9               3           3                      0.111111111   0.111111111
2      0 0 1               3           6                      0.111111111   0.222222222
3      0 0 0 0 1 2 3 3 4   9           15                     0.333333333   0.555555556
4      0 5 6 7 7           5           20                     0.185185185   0.740740741
5      0 0 0 9             4           24                     0.148148148   0.888888889
6      0 0 1               3           27                     0.111111111   1
Total                      27

From the above we can draw the histogram, which is very useful when deciding how to increase sales.
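The stem-and-leaf table can be rebuilt with a few lines of Python. This is a minimal sketch that assumes two-digit values, so the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is just an illustrative choice.

```python
from collections import defaultdict

# The 27 values listed above for the stem-and-leaf example.
values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]


def stem_and_leaf(data):
    """Group values by tens digit (stem); the units digit is the leaf."""
    groups = defaultdict(list)
    for v in sorted(data):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    return dict(groups)


table = stem_and_leaf(values)
total = len(values)

rows = []
cumulative = 0
for stem in sorted(table):
    leaves = table[stem]
    cumulative += len(leaves)
    # stem, leaves, frequency, cumulative frequency, and both as fractions
    rows.append((stem, leaves, len(leaves), cumulative,
                 len(leaves) / total, cumulative / total))

for stem, leaves, freq, cum, pct, cum_pct in rows:
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    print(f"{stem} | {leaf_text:<18} {freq:>2} {cum:>3} {pct:.3f} {cum_pct:.3f}")
```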

Next, let's look at the third technique.

3. Box Plot: For Summarization, Visualization & Data Quality Check

Range = Max Value - Min Value = 61 - 18 = 43

Number of bins = 5 (suppose)

So the width of each bin is 43/5 = 8.6.

Bin   Lower   Upper   Frequency
1     18      26.6    6
2     26.6    35.2    9
3     35.2    43.8    1
4     43.8    52.4    7
5     52.4    61      4
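The bin table can be reproduced as follows. This is a sketch that assumes equal-width bins, with each bin including its lower edge and the maximum value falling into the last bin.

```python
# The 27 values used for the stem-and-leaf example above.
values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

n_bins = 5
low, high = min(values), max(values)
width = (high - low) / n_bins  # range / number of bins

counts = [0] * n_bins
for v in values:
    # Index of the bin containing v; clamp so the maximum lands in the last bin.
    index = min(int((v - low) / width), n_bins - 1)
    counts[index] += 1

for i, count in enumerate(counts):
    lower = low + i * width
    upper = lower + width
    print(f"{i + 1}  {lower:5.1f}  {upper:5.1f}  {count}")
```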

Summary:

Data Types
Continuous
Discrete
Ordinal
Nominal
Interval
Ratio

 

Central Tendency
Mean
Median
Mode
Geometric mean
Harmonic mean
Trimmed mean
95% upper mean = Mean + Zscore * Std Error = Mean + 1.96 * Std Deviation / Sqrt(Sample Size)
95% lower mean = Mean - Zscore * Std Error = Mean - 1.96 * Std Deviation / Sqrt(Sample Size)
weighted average
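The 95% upper and lower bounds of the mean can be sketched as below. This assumes the Z score in those formulas is the two-sided 95% critical value of the standard normal distribution (1.96), and it uses the sample standard deviation; the data is the age list from the problem statement.

```python
import math
import statistics as st

# Customer ages exactly as printed in the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

Z_95 = 1.96  # assumed critical value for a two-sided 95% interval

mean = st.mean(ages)
std_error = st.stdev(ages) / math.sqrt(len(ages))  # std deviation / sqrt(n)

lower = mean - Z_95 * std_error
upper = mean + Z_95 * std_error

print(f"mean = {mean:.2f}, 95% bounds = ({lower:.2f}, {upper:.2f})")
```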

 

Dispersions
Standard Deviation (should not dominate Central Tendency) = Sqrt(1/N * Sum((X - Mean)^2)); use N - 1 instead of N for a sample
Variance = 1/N * Sum((X - Mean)^2)
CV (coefficient of variation; the lower, the better) = Standard Deviation / Arithmetic Mean
Range = Max - Min
Min
Max
Skewness (+- 0.80) ~ mean of (Zscore)^3, see https://en.wikipedia.org/wiki/Skewness
Kurtosis (+- 3.00)
IQR (Inter Quartile Range) = Q3 - Q1
Std Error (< 0.05 indicates a good sample) = Standard Deviation / Sqrt(Sample Size)
DQ (the higher, the better the data) = Harmonic mean / Arithmetic mean
Zscore (beyond +- 1.96 the value is an outlier) = (X (Value) - Xbar (Mean)) / Standard Deviation
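A few of these dispersion measures can be computed as in the sketch below, which uses the sample standard deviation and the age list from the problem statement. The +- 1.96 outlier cut-off is the one stated in the list above.

```python
import statistics as st

# Customer ages exactly as printed in the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

mean = st.mean(ages)
sd = st.stdev(ages)  # sample standard deviation (n - 1)

cv = sd / mean                      # coefficient of variation
dq = st.harmonic_mean(ages) / mean  # harmonic mean / arithmetic mean

# Z score of each value: (X - mean) / standard deviation.
z_scores = [(x - mean) / sd for x in ages]
outliers = [x for x, z in zip(ages, z_scores) if abs(z) > 1.96]

print(f"CV = {cv:.3f}, DQ = {dq:.3f}, outliers = {outliers}")
```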

 

Five number summary
Q0 0% (minimum)
Q1 25%
Q2 50% (median)
Q3 75%
Q4 100% (maximum)
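The five-number summary can be sketched with `statistics.quantiles`. The `inclusive` method is an assumption here: it interpolates between data points treating the minimum and maximum as the 0% and 100% points, and other quartile conventions give slightly different Q1 and Q3 values.

```python
import statistics as st

# Customer ages exactly as printed in the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

# n=4 returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = st.quantiles(ages, n=4, method="inclusive")

five_numbers = [min(ages), q1, q2, q3, max(ages)]  # Q0 .. Q4
iqr = q3 - q1                                      # inter quartile range

print(five_numbers, iqr)
```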

 



© SOFTHINKERS 2013-18 All Rights Reserved. Privacy policy