Those who are free of resentful thoughts surely find peace. - Buddha
Posted on 11th May 2019
Machine learning has two objectives.
1. Statistics - Prediction => Y = a + b1*x1 + b2*x2 + ... + bn*xn + error, for numerical data
2. Mathematics - Classification => 1/(1 + exp(-Y)), for categorical data
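The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `linear_predict` and `logistic`, and the coefficient values, are made up for the example and do not come from any fitted model.

```python
import math

def linear_predict(a, coeffs, xs, error=0.0):
    """Prediction: Y = a + b1*x1 + b2*x2 + ... + bn*xn + error."""
    return a + sum(b * x for b, x in zip(coeffs, xs)) + error

def logistic(y):
    """Classification: squash the linear score Y into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical intercept, coefficients and inputs, chosen only for illustration.
y = linear_predict(a=0.5, coeffs=[2.0, -1.0], xs=[1.0, 3.0])
p = logistic(y)
print("Y =", y, "probability =", round(p, 4))
```

A probability above a chosen cut-off (0.5 is a common default) would be read as one category, and below it as the other.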
Statistics is of two types:
1. Descriptive Statistics - Yesterday - EDA methods
2. Inferential Statistics - Tomorrow - Models
Problem statement:
iPhone sales data is as follows: 18,18,19,20,20,30,30,30,30,30,31,32,33,34,40,45,46,47,47,50,50,50,59,60,60,61
1. Summarise and visualise the customer age data.
2. Increase sales by 5%, based on what the customer data shows.
How to approach such a question?
List the approaches and techniques available to you, then do the data validation, data cleaning & data munging.
Example below:
Summary Approach:
1. Continuous/Discrete Data Approach:
Numerical Data Approach - All statistics except count & percentage
Statistic | Value |
Mean | 38.30769231 |
Standard Error | 2.70371211 |
Median | 33.5 |
Mode | 30 |
Standard Deviation | 13.78628081 |
Sample Variance | 190.0615385 |
Kurtosis | -1.124971234 |
Skewness | 0.189941371 |
Range | 43 |
Minimum | 18 |
Maximum | 61 |
Sum | 996 |
Count | 26 |
Geometric Mean | 34.90718932 |
Harmonic Mean | 32.28076389 |
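A summary table like this one can be reproduced with Python's standard `statistics` module. The sketch below uses the list from the problem statement; the variable names are my own. The printed values may differ slightly from the table, since that list sums to 990 while the table reports a sum of 996, so the table seems to have been built from a slightly different list.

```python
import math
import statistics as st

# Values copied from the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

n = len(ages)
summary = {
    "Mean": st.mean(ages),
    "Median": st.median(ages),
    "Mode": st.mode(ages),
    "Standard Deviation": st.stdev(ages),      # sample version (divides by n - 1)
    "Sample Variance": st.variance(ages),      # sample version (divides by n - 1)
    "Standard Error": st.stdev(ages) / math.sqrt(n),
    "Range": max(ages) - min(ages),
    "Minimum": min(ages),
    "Maximum": max(ages),
    "Sum": sum(ages),
    "Count": n,
    "Geometric Mean": st.geometric_mean(ages),  # needs Python 3.8+
    "Harmonic Mean": st.harmonic_mean(ages),
}

for name, value in summary.items():
    print(f"{name} | {value} |")
```

Kurtosis and skewness are left out because the standard library has no function for them.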
2. Ordinal/ Nominal Data Approach:
Categorical Data Approach - Only Count & percentage.
To make the decision, go with this approach:
1. Histogram - for data quality check only
2. Stem-and-Leaf technique: for summarization, visualization & data quality check
https://www.rosettacode.org/wiki/Stem-and-leaf_plot
For the values below:
18 | 18 | 19 | 20 | 20 | 21 | 30 | 30 | 30 | 30 | 31 | 32 | 33 | 33 | 34 | 40 | 45 | 46 | 47 | 47 | 50 | 50 | 50 | 59 | 60 | 60 | 61 |
Stem | Leaf | Frequency | Cumulative Frequency | Frequency% | Cumulative Frequency % |
1 | 8 8 9 | 3 | 3 | 0.111111111 | 0.111111111 |
2 | 0 0 1 | 3 | 6 | 0.111111111 | 0.222222222 |
3 | 0 0 0 0 1 2 3 3 4 | 9 | 15 | 0.333333333 | 0.555555556 |
4 | 0 5 6 7 7 | 5 | 20 | 0.185185185 | 0.740740741 |
5 | 0 0 0 9 | 4 | 24 | 0.148148148 | 0.888888889 |
6 | 0 0 1 | 3 | 27 | 0.111111111 | 1 |
Total | | 27 | | | |
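The stem-and-leaf table can be built programmatically. Here is a minimal sketch, assuming two-digit values where the tens digit is the stem and the units digit is the leaf; the helper name `stem_and_leaf` is my own.

```python
from collections import defaultdict

# The 27 values listed above the table.
values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

def stem_and_leaf(data):
    """Group sorted values by tens digit (stem); keep the units digits as leaves."""
    stems = defaultdict(list)
    for v in sorted(data):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

table = stem_and_leaf(values)
total = len(values)

rows = []
cumulative = 0
for stem in sorted(table):
    leaves = table[stem]
    freq = len(leaves)
    cumulative += freq
    rows.append((stem, leaves, freq, cumulative, freq / total, cumulative / total))

print("Stem | Leaf | Frequency | Cumulative Frequency | Frequency% | Cumulative Frequency % |")
for stem, leaves, freq, cum, pct, cum_pct in rows:
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    print(f"{stem} | {leaf_text} | {freq} | {cum} | {pct:.9f} | {cum_pct:.9f} |")
```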
From the above we can draw the histogram, which is very useful when deciding how to increase sales.
Next, let's see the third technique.
3. Box Plot: For Summarization, Visualization & Data Quality Check
Range = Max Value - Min Value = 61 - 18 = 43
Number of bins = 5 (suppose)
So, the bin width is 43/5 = 8.6.
Bin | Lower | Upper | Frequency |
1 | 18 | 26.6 | 6 |
2 | 26.6 | 35.2 | 9 |
3 | 35.2 | 43.8 | 1 |
4 | 43.8 | 52.4 | 7 |
5 | 52.4 | 61 | 4 |
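The binning above can be checked with a short script. This is a sketch under the same assumption of 5 equal-width bins; each value goes into the bin whose lower edge it meets or exceeds, and the maximum value is placed in the last bin.

```python
# The same 27 values used for the stem-and-leaf table.
values = [18, 18, 19, 20, 20, 21, 30, 30, 30, 30, 31, 32, 33, 33, 34,
          40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

num_bins = 5                      # assumed, as in the text
low, high = min(values), max(values)
width = (high - low) / num_bins   # range / number of bins

counts = [0] * num_bins
for v in values:
    # Clamp the index so the maximum value falls into the last bin.
    index = min(int((v - low) / width), num_bins - 1)
    counts[index] += 1

print("Bin | Lower | Upper | Frequency |")
for i, count in enumerate(counts):
    lower = low + i * width
    upper = low + (i + 1) * width
    print(f"{i + 1} | {lower:.1f} | {upper:.1f} | {count} |")
```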
Data Types |
Continuous |
Discrete |
Ordinal |
Nominal |
Interval |
Ratio |
Central Tendency |
Mean |
Median |
Mode |
Geometric mean |
Harmonic mean |
Trimmed mean |
95% upper mean = Mean + Zscore * Std Error = Mean + 1.96 * (Std Deviation / Sqrt(Sample Size)) |
95% lower mean = Mean - Zscore * Std Error = Mean - 1.96 * (Std Deviation / Sqrt(Sample Size)) |
weighted average |
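The 95% upper and lower mean can be computed as below. This is a rough sketch that assumes the normal critical value 1.96 and the sample standard deviation, applied to the list from the problem statement.

```python
import math
import statistics as st

# Values copied from the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

z = 1.96                                   # critical value for a 95% interval
mean = st.mean(ages)
std_error = st.stdev(ages) / math.sqrt(len(ages))

lower_mean = mean - z * std_error
upper_mean = mean + z * std_error

print(f"95% lower mean = {lower_mean:.4f}")
print(f"Mean           = {mean:.4f}")
print(f"95% upper mean = {upper_mean:.4f}")
```

For a small sample like this one, a t-distribution critical value would usually be a little wider than 1.96.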
Dispersions |
Standard Deviation (should not dominate Central Tendency) = Sqrt(1/N * Sum((X - Mean)^2)) |
Variance = 1/N * Sum((X - Mean)^2) |
CV (coefficient of variation; the lower the better) = Standard Deviation / Arithmetic Mean |
Range = Max - Min |
Min |
Max |
Skewness (+- 0.80) ~ average of (Zscore)^3 https://en.wikipedia.org/wiki/Skewness |
Kurtosis (+- 3.00) |
IQR (Interquartile Range) = Q3 - Q1 |
Std Error (< 0.05 indicates a good sample) = Standard Deviation / Sqrt(Sample Size) |
DQ (the higher, the better the data) = Harmonic mean / Arithmetic mean |
Zscore (beyond +- 1.96 is an outlier) = (X - Xbar) / Standard Deviation, where X is the value and Xbar is the mean |
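Several of the dispersion measures above can be computed together. The sketch below uses the list from the problem statement and the population (1/N) standard deviation; the variable names are my own, and the +- 1.96 cut-off is the rule of thumb from the list above.

```python
import statistics as st

# Values copied from the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

mean = st.mean(ages)
sd = st.pstdev(ages)                       # population standard deviation (1/N)

cv = sd / mean                             # coefficient of variation
dq = st.harmonic_mean(ages) / mean         # harmonic mean / arithmetic mean

# Z-score of every value, then skewness as the average cubed z-score.
z_scores = [(x - mean) / sd for x in ages]
skewness = sum(z ** 3 for z in z_scores) / len(z_scores)

# Values whose z-score lies beyond +- 1.96 are flagged as outliers.
outliers = [x for x, z in zip(ages, z_scores) if abs(z) > 1.96]

print(f"CV = {cv:.4f}")
print(f"DQ = {dq:.4f}")
print(f"Skewness = {skewness:.4f}")
print("Outliers:", outliers)
```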
Five number summary |
Q0 0% |
Q1 25% |
Q2 50% (median) |
Q3 75% |
Q4 100% |
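The five number summary is what a box plot draws. Here is a minimal sketch using `statistics.quantiles` on the list from the problem statement. The `inclusive` method is one of several quartile conventions, so other tools may give slightly different Q1 and Q3 values.

```python
import statistics as st

# Values copied from the problem statement.
ages = [18, 18, 19, 20, 20, 30, 30, 30, 30, 30, 31, 32, 33, 34,
        40, 45, 46, 47, 47, 50, 50, 50, 59, 60, 60, 61]

# quantiles(n=4) returns the three cut points Q1, Q2, Q3 (Python 3.8+).
q1, q2, q3 = st.quantiles(ages, n=4, method="inclusive")
q0, q4 = min(ages), max(ages)
iqr = q3 - q1

print(f"Q0   0% = {q0}")
print(f"Q1  25% = {q1}")
print(f"Q2  50% = {q2}")
print(f"Q3  75% = {q3}")
print(f"Q4 100% = {q4}")
print(f"IQR = {iqr}")
```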
Good, better, best. Never let it rest. Until your good is better and your better is best. - St. Jerome