Machine Learning

According to Arthur Samuel, machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

Applications of Machine Learning

  • Computer Vision
  • Natural Language Processing
  • Generative AI
  • Deep Learning

The Machine Learning Pipeline

  1. Data: We first need to gather millions of logs of what users watched.
  2. Processing: We must clean that data (removing testing accounts or corrupted logs).
  3. Training: Only then does the model learn patterns (e.g., "people who watch Inception also watch Interstellar").
  4. Deployment: Finally, we have to deliver those predictions to the user’s TV screen in real time (sketched in code below).
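Here is that skeleton in Python. Every function, column, and file name is a hypothetical placeholder for the streaming example, not a real recommendation API:

```python
# Skeleton of the four pipeline stages; all names are illustrative.
import pandas as pd


def load_watch_logs(path: str) -> pd.DataFrame:
    """1. Data: gather raw logs of what users watched."""
    return pd.read_csv(path)


def clean_logs(logs: pd.DataFrame) -> pd.DataFrame:
    """2. Processing: drop testing accounts and corrupted rows."""
    logs = logs[logs["account_type"] != "test"]
    return logs.dropna(subset=["user_id", "title"])


def train_model(logs: pd.DataFrame):
    """3. Training: learn co-watching patterns (model details out of scope here)."""
    ...


def serve_recommendations(model, user_id: str) -> list[str]:
    """4. Deployment: return predictions for one user's screen in real time."""
    ...
```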

Types of Machine Learning

Based on the type of data available and the nature of the learning problem, machine learning can be broadly categorized into three main types:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

There are two main types of supervised machine learning: regression and classification.

  • Regression is the type of supervised learning where the goal is to predict a continuous output variable. For example, when predicting the price of a house, the input might be the features of a house, such as the number of bedrooms and bathrooms, and the output will be the price of the house.
  • Classification is a type of supervised learning that aims to predict a categorical label for each input. For example, in fruit classification, the input is a fruit (this could be an actual image of the fruit or a list of its properties like weight, color, and shape), and the output is a label indicating whether the fruit is an apple or a mango. A minimal code sketch of both tasks follows this list.
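Here is that sketch, assuming scikit-learn is available; the tiny datasets are invented purely for illustration:

```python
# Regression vs. classification in scikit-learn, on made-up toy data.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous house price from [bedrooms, bathrooms].
X_houses = [[2, 1], [3, 2], [4, 3]]
y_prices = [200_000, 320_000, 450_000]
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[3, 1]]))       # output is a continuous dollar amount

# Classification: predict a discrete label from [weight_g, redness].
X_fruits = [[150, 0.9], [180, 0.8], [300, 0.2], [350, 0.1]]
y_labels = ["apple", "apple", "mango", "mango"]
clf = LogisticRegression().fit(X_fruits, y_labels)
print(clf.predict([[160, 0.85]]))  # output is one of the class labels
```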

There are two main types of unsupervised machine learning tasks: clustering and dimensionality reduction.

  • Clustering groups similar data points together to discover distinct segments (for example, separating customers who buy frequently and spend a lot from those who buy rarely and spend little).
  • Dimensionality Reduction simplifies data by reducing the number of input features and removing irrelevant ones (for example, using only a student’s final exam score and total attendance, rather than every quiz, homework, and participation score, when the latter don’t influence the final result). Both tasks are sketched in code below.
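A minimal sketch, again with scikit-learn and invented data:

```python
# Clustering and dimensionality reduction on made-up customer data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Each row is a customer: [purchases_per_month, average_spend].
X = np.array([[20, 90], [18, 85], [19, 88],
              [2, 10], [3, 12], [1, 9]])

# Clustering: discover the two customer segments without any labels.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(segments)            # e.g., [1 1 1 0 0 0] -- two distinct groups

# Dimensionality reduction: compress the two features into one component.
X_reduced = PCA(n_components=1).fit_transform(X)
print(X_reduced.shape)     # (6, 1)
```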

In reinforcement learning, an agent (the learner) learns to take actions in an environment to maximize a reward signal. It’s a distinctive type of learning, characterized by continuous interaction. For example, consider a self-driving car learning to navigate city streets. It decides when to accelerate, brake, or turn, while the environment, including traffic, pedestrians, and road conditions, responds. The car receives rewards for staying safe, following traffic rules, and reaching its destination efficiently. Over time, it learns which actions lead to the best outcomes.
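The full self-driving problem is far beyond a short example, but the interaction loop (state, action, reward, update) can be illustrated with tabular Q-learning, a standard RL algorithm used here purely as a toy sketch:

```python
# Tabular Q-learning on a toy 1-D road: the agent starts at cell 0 and
# learns that moving right (+1) reaches the goal at cell 4 fastest.
import random

n_states, actions = 5, (-1, +1)            # five road cells; move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:               # episode ends at the goal cell
        # Mostly exploit the best known action, sometimes explore randomly.
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == n_states - 1 else -0.1
        # Move the estimate toward reward + discounted best future value.
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

print(max(actions, key=lambda a: Q[(0, a)]))   # learned first action: 1 (go right)
```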

Parametric model

In machine learning, parametric models are functions defined by a fixed set of parameters (often called weights) that are assumed to be able to approximate the underlying pattern in the data. By adjusting the values of the parameters during the training process, these models can learn to fit the data and make accurate predictions on new inputs.

Consider a simple linear function:

$$f_{w_1, w_2, w_0}(x_1, x_2) = w_1 x_1 + w_2 x_2 + w_0$$

This is an instance of a function class (or model class), which is a family of possible functions defined by the same structure. The variables $w_1$, $w_2$, and $w_0$ are the parameters (or weights). Any specific choice of these parameters (e.g., $w_1 = 2$, $w_2 = -3$, $w_0 = 7$) results in a specific function instance, $f(x_1, x_2) = 2x_1 - 3x_2 + 7$, that belongs to this model class.

The goal of training is to find the single best set of parameters ($w_1, w_2, w_0$) within this class that best maps our input features ($x_1, x_2$) to the target ($y$).
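In code, choosing parameter values literally constructs one function from the class. A minimal sketch:

```python
# One function class, many instances: the weights select a specific function.
def make_linear_model(w1: float, w2: float, w0: float):
    """Return the instance f(x1, x2) = w1*x1 + w2*x2 + w0 of the class."""
    def f(x1: float, x2: float) -> float:
        return w1 * x1 + w2 * x2 + w0
    return f

f = make_linear_model(2, -3, 7)   # the instance f(x1, x2) = 2*x1 - 3*x2 + 7
print(f(1, 1))                    # 2*1 - 3*1 + 7 = 6
```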

To better understand how parameters define the shape of a function, let’s look at some common parametric models and the role their parameters play.

  • Linear model: $f_{m,c}(x) = mx + c$. This function class represents all possible straight lines in a 2D plane. The parameters are $m$ (slope) and $c$ (y-intercept). Adjusting these two values changes the line’s position and angle.
  • Weighted average: $f_w(x) = \sum_{i=1}^{n} w_i x_i$. This model represents a weighted average of a set of inputs $x$. The parameters are the weights $w = [w_1, w_2, \ldots, w_n]$.

In general, if $x$ represents the input and $w$ represents the set of parameters, then $f_w(x)$ denotes the parametric model.
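Both example models can be written with their parameters made explicit; a small sketch using NumPy:

```python
# The two example models above, with the parameters w made explicit.
import numpy as np

def linear(x: float, m: float, c: float) -> float:
    """Straight line: the parameters m (slope) and c (intercept) shape it."""
    return m * x + c

def weighted_average(x: np.ndarray, w: np.ndarray) -> float:
    """Weighted combination sum_i w_i * x_i; the weights w are the parameters."""
    return float(np.dot(w, x))

print(linear(2.0, m=3.0, c=1.0))                      # 3*2 + 1 = 7.0
print(weighted_average(np.array([1.0, 2.0, 3.0]),
                       np.array([0.5, 0.3, 0.2])))    # 0.5 + 0.6 + 0.6 = 1.7
```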

Loss function

Given a model class $f_w$ for input $x$ and target $y$, a loss function $l(f_w(x), y)$ gives a measure of the deviation of the prediction $f_w(x)$ from the ground truth $y$ on the data point $(x, y)$. The entire training process is dedicated to finding the set of parameters $w$ that minimizes the total loss across all given data points.

Mean Squared Error (MSE)

A common loss function for regression problems is the Mean Squared Error (MSE). It is widely used because it provides a quantifiable, differentiable measure of the average error, which allows optimization algorithms to efficiently find the minimum loss. The MSE is calculated as the average of the squared differences between the predicted value and the actual target value for a dataset with $N$ data points:

$$\mathrm{MSE}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( f_w(x_i) - y_i \right)^2$$

Note: The MSE formula involves squaring the error $(f_w(x_i) - y_i)^2$ before averaging. This is crucial because it ensures that all errors are positive and penalizes larger errors much more severely than smaller errors.
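The formula translates directly into code; here is a minimal sketch on made-up numbers:

```python
# The MSE formula, computed directly with NumPy.
import numpy as np

def mse(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Average of squared differences between predictions and targets."""
    return float(np.mean((predictions - targets) ** 2))

y_pred = np.array([2.5, 5.0, 8.0])
y_true = np.array([3.0, 5.0, 7.0])
print(mse(y_pred, y_true))   # ((-0.5)**2 + 0**2 + 1**2) / 3 ≈ 0.4167
```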

Hyperparameters

For the training process, all parameters other than the parameters of the model class $f_w$ are called hyperparameters. They are set before training a model, and they are not learned from the data.

The model selection process starts with questions like:

  • Which model class should we use (e.g., a line or a quadratic function)?
  • Which loss function should we use (e.g., MSE or absolute error)?

These choices define the environment for the training process and are therefore hyperparameters.

Examples of hyperparameters:

  • Model selection: Choosing the structure of the model (e.g., the degree of a polynomial).
  • Loss function: Choosing the error metric (e.g., MSE, Cross-Entropy).
  • Learning rate: The size of the steps taken to update the parameters during training.
  • Weight initialization: The starting values for the model’s parameters.
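In code, these choices are often collected into a configuration that is fixed before training starts; all names and values in this sketch are illustrative:

```python
# Hyperparameters are fixed before training and never updated by the data.
config = {
    "model": "polynomial",    # model selection
    "degree": 2,              # structure of the chosen model class
    "loss": "mse",            # loss function
    "learning_rate": 0.01,    # step size for parameter updates
    "weight_init": "zeros",   # starting values for the parameters w
}
# The parameters w, by contrast, change on every training step;
# nothing in `config` does.
```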

Classifier

Every machine takes an input, performs its respective function on that input, and produces an output. When such a machine is configured (trained) to predict a category/class label from a prespecified finite set of categories, it’s called a classifier.

For example, suppose we have a set of inputs, x, and the machine gives an output of either 1 or 0:

  x (Input)    y (Output)
   4            1
  -2            0
  -3            1
   7            0

Here, the possible class labels are 0 and 1, and the classifier’s job is to assign the correct label to each input.
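Viewed as code, a classifier is simply a function whose outputs are restricted to the finite label set. This toy version hard-codes the table above, whereas a trained classifier would learn its decision rule from data:

```python
# A classifier as a function from inputs to the finite label set {0, 1}.
def classify(x: int) -> int:
    lookup = {4: 1, -2: 0, -3: 1, 7: 0}   # the example input-output pairs
    return lookup[x]

print([classify(x) for x in (4, -2, -3, 7)])   # [1, 0, 1, 0]
```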

Prediction confidence

Prediction confidence is the level of certainty that a machine learning model has in its predictions, and it can be expressed through hard or soft predictions.

Hard prediction

Predicting actual class labels (0 or 1) is called hard prediction. It seems to be a desirable property of a classifier, but it’s generally difficult to model. This is because classifiers are mathematical functions, and constraining the output to discrete values makes the function challenging to approximate from data.

Soft prediction

Soft prediction is the prediction of class probabilities (a continuous value between 0 and 1) rather than the actual label values. This probability represents the model’s confidence score.
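As a minimal sketch (with made-up weight values), a logistic (sigmoid) model produces the soft prediction, and applying a threshold converts it into the hard one:

```python
# Soft vs. hard prediction with a logistic (sigmoid) model.
import math

def soft_predict(x: float, w: float = 1.5, b: float = -0.5) -> float:
    """Soft prediction: the model's confidence (0 to 1) that the label is 1."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def hard_predict(x: float, threshold: float = 0.5) -> int:
    """Hard prediction: commit to a discrete label, 0 or 1."""
    return int(soft_predict(x) >= threshold)

print(soft_predict(1.0))   # ≈ 0.73, a confidence score
print(hard_predict(1.0))   # 1, an actual class label
```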