# 3: Model Building

## Introduction
In this session we shall try to understand the different types of machine learning, the most commonly used models, python packages used in machine learning and much more!

![Model Building](https://miro.medium.com/max/1400/1*PAqzvCxPjpDN8RC9HQw45w.jpeg)

## Python Packages 

The following are the python packages that are most commonly used in machine learning:
* `Sklearn`
* `numpy`
* `Pandas`
* `Tensorflow`
* `Keras`
* `Pytorch`
* `OpenCV`
* `Pillow`
* `Beautiful Soup`
* `Requests`

Although there are many more, these are commonly used in the industry and that's why they're in the list.


## Types of Machine Learning
Machine Learning can broadly be classified into two types:
- Supervised Learning
- Unsupervised Learning. 

Here's a brief explanantion of what they are.

### Supervised Learning
As the name indicates, supervised learning involves machine learning algorithms that learn under the presence of a supervisor. Supervised machine learning relies on labelled input and output training data.

*Examples:*
- Predicting house prices
- Image classification
- Weather Forecasting
- Text classification


### Unsupervised Learning
In unsupervised learning, even though we do not have any labels for data points, we do have the actual data points. This means we can draw references from observations in the input data. 

*Unsupervised learning processes unlabelled or raw data.*

*Examples*:
- Finding customer segments
- Reducing the complexity of a problem
- Feature selection


**Throughout this session we shall focus mostly on the different types of supervised learning.**

## Commonly Used Machine Learning Algorithms

### Supervised Learning
- Linear Regression
- Multiple Regression
- Naive Bayesian Model - Decision Trees
- Random Forest
- Neural Networks
- Support Vector Machines
- KNN (K Nearest Neighbor)

### Unsupervised Learning
- K-Means Clustering
- Association

**Over the course of this session, we shall understand *Linear regression* and *KNN*.**

### 1. Linear Regression
Linear Regression is usually the first machine learning algorithm that every data scientist comes across. It is a simple model but everyone needs to master it as it lays the foundation for other machine learning algorithms.

It can be used to forecast sales in the coming months by analyzing the sales data for previous months. It can also be used to gain various insights about customer behaviour.

Here are the 5 basic steps when implementing linear regression.
1. Import the packages and classes you need.
2. Provide data to work with and eventually do appropriate transformations.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is satisfactory.
5. Apply the model for predictions.

![Linear regression](https://upload.wikimedia.org/wikipedia/commons/b/be/Normdist_regression.png)

*These steps are more or less general for most of the regression approaches and implementations.*

**We shall try to predict the *grades of the students* using the example given below.**

#### Step 1: Import packages and classes

In [20]:
import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle

The class `sklearn.linear_model.LinearRegression` will be used to perform linear regression and make predictions accordingly.

#### Step 2: Provide data

In [22]:
data = pd.read_csv("student-mat.csv", sep=";")
# Since our data is seperated by semicolons we need to do sep=";"b

In [23]:
data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


Now we need to choose the most suitable coulmns.

In [24]:
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]


#### Step 3: Create a model and fit it

In [25]:
predict = "G3"

X = np.array(data.drop([predict], 1)) # Features
y = np.array(data[predict]) # Labels

In [26]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)

In [27]:
linear = linear_model.LinearRegression()

The function `LinearRegression()` has other paramaters such as:
- `fit_intercept` is a Boolean (True by default) that decides whether to calculate the intercept ùëè‚ÇÄ (True) or consider it equal to zero (False).
- `normalize` is a Boolean (False by default) that decides whether to normalize the input variables (True) or not (False).
- `copy_X` is a Boolean (True by default) that decides whether to copy (True) or overwrite the input variables (False).
- `n_jobs` is an integer or None (default) and represents the number of jobs used in parallel computation. None usually means one job and -1 to use all processors.

In [35]:
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test) # acc stands for accuracy 
print(acc)

0.8567890220083849


In [36]:
print('Coefficient: \n', linear.coef_) # These are each slope value
print('Intercept: \n', linear.intercept_) # This is the intercept

Coefficient: 
 [ 0.16672657  0.96369032 -0.19684191 -0.27743914  0.03254111]
Intercept: 
 -1.438174445680957


#### Step 4: Get results

Once you have your model fitted, you can get the results to check whether the model works satisfactorily and interpret it.


In [50]:
print('Coefficient: \n', linear.coef_) # These are each slope value
print('Intercept: \n', linear.intercept_) # This is the intercept

Coefficient: 
 [ 0.16672657  0.96369032 -0.19684191 -0.27743914  0.03254111]
Intercept: 
 -1.438174445680957


In [51]:
print('coefficient of determination:', r_sq)

coefficient of determination: 0.7158756137479542


When you‚Äôre applying `.score()`, the arguments are also the predictor `x` and regressor `y`, and the return value is `ùëÖ¬≤`.


We can get the intercept and the slope of the line using the `model.intercept_` and `model.coef_` commands.

#### Step 5: Predict response
Once there is a satisfactory model, you can use it for predictions with either existing or new data.


In [33]:
predictions = linear.predict(x_test) # Gets a list of all predictions

for x in range(len(predictions)):
    print(f"Predicted: {predictions[x]}, actual: {y_test[x]}")


Predicted: 3.4696627554374073, actual: 5
Predicted: 5.476917103456871, actual: 7
Predicted: 8.83403139834456, actual: 10
Predicted: 7.648808981949109, actual: 8
Predicted: 9.238906601772648, actual: 10
Predicted: 9.334104148442279, actual: 10
Predicted: 12.331671007054274, actual: 14
Predicted: 7.992925608753779, actual: 8
Predicted: 12.096712937861891, actual: 12
Predicted: 7.621421999133414, actual: 7
Predicted: 14.719412956813049, actual: 15
Predicted: 7.0496018426953455, actual: 8
Predicted: 12.156178819434402, actual: 13
Predicted: 9.370666302230253, actual: 9
Predicted: -0.8616141579668186, actual: 0
Predicted: 5.096238322864227, actual: 8
Predicted: 11.733144452915091, actual: 11
Predicted: 18.545761149340354, actual: 18
Predicted: 11.222301251854967, actual: 11
Predicted: 13.760574175098895, actual: 14
Predicted: 3.709513637268808, actual: 0
Predicted: 8.435496039599911, actual: 10
Predicted: 9.602475086719448, actual: 10
Predicted: 7.211476879588844, actual: 0
Predicted: 8.537

*Now we can see the results predicted by our model and the actual result in the model.*

### 2. `KNN` (K-Nearest Neighbours)
KNN stands for K-Nearest Neighbors. KNN is a machine learning algorithm used for classifying data. Rather than coming up with a numerical prediction such as a students grade or stock price it attempts to classify data into certain categories. 

In the coming sections, we will be using this algorithm to classify cars in 4 categories based upon certain features.


#### How does KNN Work?
In short, K-Nearest Neighbors works by looking at the K closest points to the given data point (the one we want to classify) and picking the class that occurs the most to be the predicted value. This is why this algorithm typically works best when we can identify clusters of points in our data set

![KNN](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2018/07/11/sagemaker-knn-1.gif)

#### Step 1 : Importing Modules

In [54]:
from sklearn import preprocessing
import sklearn
from sklearn.utils import shuffle
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
from sklearn import linear_model, preprocessing

#### Step 2 :  Loading Data

In [55]:
data = pd.read_csv("car.data")
print(data.head())  # To check if our data is loaded correctly

  buying  maint door persons lug_boot safety  class
0  vhigh  vhigh    2       2    small    low  unacc
1  vhigh  vhigh    2       2    small    med  unacc
2  vhigh  vhigh    2       2    small   high  unacc
3  vhigh  vhigh    2       2      med    low  unacc
4  vhigh  vhigh    2       2      med    med  unacc


We now need to convert the string data into some kind of a number so that we can train the our KNN model.

In [57]:
le = preprocessing.LabelEncoder()

The method fit_transform() takes a list (each of our columns) and will return to us an array containing our new values.



In [59]:
buying = le.fit_transform(list(data["buying"]))
maint = le.fit_transform(list(data["maint"]))
door = le.fit_transform(list(data["door"]))
persons = le.fit_transform(list(data["persons"]))
lug_boot = le.fit_transform(list(data["lug_boot"]))
safety = le.fit_transform(list(data["safety"]))
cls = le.fit_transform(list(data["class"]))

Now we need to recombine our data into a feature list and a label list. We can use the zip() function to makes things easier.



In [60]:
X = list(zip(buying, maint, door, persons, lug_boot, safety))  # features
y = list(cls)  # labels

Finally we will split our data into training and testing data using the same process seen previously.



In [62]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)


#### Step 3 : Training our Model

In [65]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=9)

To train our model we follow precisely the same steps as outlined earlier.

In [66]:
model.fit(x_train, y_train)


KNeighborsClassifier(n_neighbors=9)

Now we can find the accuracy of our model.

In [69]:
acc = model.score(x_test, y_test)
print(acc)

0.9248554913294798


#### Step 4 : Testing Our Model

In [75]:
predicted = model.predict(x_test)
names = ["unacc", "acc", "good", "vgood"]

for x in range(len(predicted)):
    print("Predicted: ", names[predicted[x]], "Data: ", x_test[x], "Actual: ", names[y_test[x]])


Predicted:  good Data:  (1, 2, 1, 2, 2, 1) Actual:  good
Predicted:  good Data:  (1, 1, 0, 0, 0, 0) Actual:  good
Predicted:  good Data:  (1, 0, 3, 1, 1, 1) Actual:  good
Predicted:  good Data:  (2, 0, 3, 0, 2, 2) Actual:  good
Predicted:  good Data:  (3, 1, 0, 0, 2, 0) Actual:  good
Predicted:  unacc Data:  (1, 0, 3, 1, 2, 0) Actual:  unacc
Predicted:  good Data:  (1, 1, 1, 1, 0, 1) Actual:  good
Predicted:  good Data:  (3, 3, 0, 0, 0, 0) Actual:  good
Predicted:  good Data:  (3, 1, 3, 1, 2, 1) Actual:  good
Predicted:  good Data:  (0, 0, 1, 1, 0, 1) Actual:  good
Predicted:  good Data:  (3, 3, 0, 2, 1, 1) Actual:  good
Predicted:  unacc Data:  (1, 0, 0, 2, 1, 2) Actual:  unacc
Predicted:  good Data:  (3, 0, 1, 0, 0, 2) Actual:  good
Predicted:  unacc Data:  (1, 2, 2, 1, 2, 0) Actual:  acc
Predicted:  good Data:  (0, 0, 2, 0, 0, 1) Actual:  good
Predicted:  unacc Data:  (3, 2, 2, 1, 2, 0) Actual:  unacc
Predicted:  good Data:  (0, 3, 2, 2, 1, 2) Actual:  good
Predicted:  good Data:  (

To see which rows of data were classified incorrectly we can use the code below:

In [78]:
predicted = model.predict(x_test)
names = ["unacc", "acc", "good", "vgood"]

for x in range(len(predicted)):
    if names[predicted[x]] != names[y_test[x]]:
        print("Predicted: ", names[predicted[x]], "Data: ", x_test[x], "Actual: ", names[y_test[x]])

Predicted:  unacc Data:  (1, 2, 2, 1, 2, 0) Actual:  acc
Predicted:  good Data:  (1, 3, 0, 2, 0, 2) Actual:  unacc
Predicted:  good Data:  (1, 3, 0, 1, 0, 0) Actual:  unacc
Predicted:  good Data:  (0, 0, 1, 2, 1, 2) Actual:  unacc
Predicted:  unacc Data:  (1, 1, 0, 2, 2, 2) Actual:  good
Predicted:  unacc Data:  (1, 2, 2, 1, 0, 2) Actual:  acc
Predicted:  good Data:  (2, 0, 0, 2, 0, 2) Actual:  unacc
Predicted:  unacc Data:  (1, 2, 0, 2, 2, 2) Actual:  good
Predicted:  good Data:  (1, 0, 3, 2, 2, 2) Actual:  unacc
Predicted:  unacc Data:  (1, 2, 2, 2, 1, 2) Actual:  acc
Predicted:  good Data:  (2, 1, 2, 2, 2, 2) Actual:  unacc
Predicted:  unacc Data:  (1, 2, 3, 1, 2, 0) Actual:  acc
Predicted:  good Data:  (2, 1, 3, 2, 2, 2) Actual:  unacc


We can look at the predicted and actual values and compare how accurate they are. The accuracy may always not be the same at all times. 

#### Looking at Neighbors
The KNN model has a unique method that allows for us to see the neighbors of a given data point. We can use this information to plot our data and get a better idea of where our model may lack accuracy. We can use model.neighbors to do this.

Note: the .neighbors method takes 2D as input, this means if we want to pass one data point we need surround it with [] so that it is in the right shape.

Parameters: The parameters for .neighbors are as follows: data(2D array), # of neighbors(int), distance(True or False)

Return: This will return to us an array with the index in our data of each neighbor. If distance=True then it will also return the distance to each neighbor from our data point.

In [80]:
predicted = model.predict(x_test)
names = ["unacc", "acc", "good", "vgood"]

for x in range(len(predicted)):
    print("Predicted: ", names[predicted[x]], "Data: ", x_test[x], "Actual: ", names[y_test[x]])
    # Now we will we see the neighbors of each point in our testing data
    n = model.kneighbors([x_test[x]], 9, True)
    print("N: ", n)

Predicted:  good Data:  (1, 2, 1, 2, 2, 1) Actual:  good
N:  (array([[1., 1., 1., 1., 1., 1., 1., 1., 1.]]), array([[410, 604, 686, 155, 900, 347, 887, 781, 455]]))
Predicted:  good Data:  (1, 1, 0, 0, 0, 0) Actual:  good
N:  (array([[1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.41421356, 1.41421356]]), array([[ 739,  500, 1010,   69, 1041, 1180,  203, 1213,  436]]))
Predicted:  good Data:  (1, 0, 3, 1, 1, 1) Actual:  good
N:  (array([[1., 1., 1., 1., 1., 1., 1., 1., 1.]]), array([[1545,  200,  239,  741,  683,  322,  516, 1517, 1294]]))
Predicted:  good Data:  (2, 0, 3, 0, 2, 2) Actual:  good
N:  (array([[1.        , 1.        , 1.        , 1.        , 1.        ,
        1.41421356, 1.41421356, 1.41421356, 1.41421356]]), array([[ 639,  761, 1528,  465,  787,  317,  163, 1122,    4]]))
Predicted:  good Data:  (3, 1, 0, 0, 2, 0) Actual:  good
N:  (array([[1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.4

That's it for today's session! Next week we'll dive deeper into neural networks and how to implement them.
