Concepts of Machine Learning

Augusto Gonzalez-Bonorino
Oct 9, 2021

This post is the first of a series of blog posts on Machine Learning. The objective of the series is to complement my studies of the topic as well as to share information about the concepts and techniques used in Machine Learning (ML) in a non-technical manner. I will, nevertheless, include as much mathematics as needed in the Appendix of each post for the curious reader who would like to explore the topics in further detail.

In this particular post I aim to explain the main concepts of Machine Learning. What is ML? What are its goals? What do we mean by "learning"? What techniques and tools are used to teach machines? These are some of the questions we will explore together today. I will start by introducing terminology and notation, and then describe the various sub-fields of ML (with a focus on supervised learning). Let's get started!

So, if you are reading this article, it is reasonable to assume it is not your first time hearing about Machine Learning. But what is it, really? It sounds fancy, but at its core it is nothing more than using applied statistics and linear algebra to design algorithms that can learn a specific task without being explicitly programmed. Formally, ML is defined as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E. ~ Tom Mitchell - “Machine Learning”, 1997.

Don't scratch your head too hard yet; keep that definition in mind as we break it down in the following paragraphs. So, what do we mean by experience? It is simply the term used to refer to our dataset. Such a dataset, defined as D = {⟨x[i], y[i]⟩ : i ∈ ℕ}, can contain numerical data, categorical data, or a mixture of both. Furthermore, a dataset meant to train an ML algorithm is split into two sections: a training sample and a test sample.

  • Training sample: The collection of instances used to fit the model. Each instance is defined by a vector x = {x_1, x_2, … , x_n} of feature values and corresponds to a row of the dataset.
  • Test sample: The collection of instances held out from training and used only to evaluate how well the model performs on previously unseen data.
  • Features: Indicators used to predict our target. They correspond to the columns of the dataset and are thus defined as column vectors.
  • Target: What we intend to predict (aka our dependent variable).
  • Parameters: These are the values, or weights, that define what the model can do. Parameters are updated after each iteration based on a loss function. That “update” is what makes the machine “learn”.
  • Cost/Loss function: A measure of how well (or poorly) the model is doing, designed to be consumed by the machine rather than easily interpreted by a human. It is a mathematical function that quantifies the model's error; an optimization procedure then updates the parameters so as to minimize that loss. The most common approach is Stochastic Gradient Descent (SGD).
  • Metric: A measurement of how good the model is, computed on the test sample and designed to be easy for humans to understand. The most common metric is the error rate, which is simply the fraction of data points that were incorrectly classified (or predicted); its complement, accuracy, is the fraction classified correctly. A short sketch illustrating these terms follows this list.
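To make these terms concrete, here is a minimal sketch in plain NumPy (the numbers and variable names are made up purely for illustration): a tiny dataset of instances and features, a mean-squared-error loss, a single gradient-descent-style parameter update, and the error-rate metric.

```python
import numpy as np

# A tiny made-up dataset D = {<x[i], y[i]>}: each row is an instance, each column a feature.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])           # features
y = np.array([5.0, 4.0, 7.5, 11.0])  # target (what we want to predict)

w = np.zeros(X.shape[1])             # parameters (weights), one per feature

def loss(w, X, y):
    """Mean squared error: the cost the machine tries to minimize."""
    return np.mean((X @ w - y) ** 2)

# One gradient-descent-style update of the parameters: the "learning" step.
learning_rate = 0.01
gradient = 2 * X.T @ (X @ w - y) / len(y)
w = w - learning_rate * gradient
print("loss after one update:", loss(w, X, y))

# A human-friendly metric: error rate (and accuracy) on a toy classification output.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
error_rate = np.mean(y_true != y_pred)  # fraction misclassified
print("error rate:", error_rate, "| accuracy:", 1 - error_rate)
```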

Following the definition above, our algorithm's task could be to predict something (e.g. the future price of a stock), classify something (e.g. sort pictures into cats/dogs), or find hidden patterns in data, among many other alternatives. The nature of the task of interest can help us divide the discipline of ML into smaller sub-fields:

  • Supervised learning: Makes use of labeled data (meaning that the developer knows the correct output for each instance beforehand, or possesses enough information about the origin of the data to generate the labels) to make predictions (Regression, Neural Networks). We assume the data was generated by some unknown function y = f(x) + e, where e stands for "error", and the algorithm estimates a "hypothesis" function ŷ = f̂(x) that approximates it. A small sketch contrasting supervised and unsupervised learning appears after this list.
  • Unsupervised learning: A very helpful family of ML algorithms used to find hidden patterns in unlabeled data (meaning that all you have is a bunch of data and you are not really sure what each instance represents, for example a sample of pictures or news articles). Unsupervised learning is commonly used for tasks such as clustering (K-Means) or dimensionality reduction (PCA).
  • Semi-supervised learning: As you may have imagined, this is simply a mixture of supervised and unsupervised learning. Often, in the real world, we encounter data that is only partially labeled. Thus, we could use a supervised algorithm, say regression, to make a prediction on the labeled subset of the data and then an unsupervised algorithm, say K-Means, to group the rest, or the other way around. For example, Google Photos uses clustering algorithms to group the pictures in which a given person appears; then all it needs is for someone to label one of those pictures and it will automatically label all the others. Pretty cool.
  • Reinforcement learning: Commonly used in robotics, this special type of learning technique gives an agent the freedom to analyze its environment, select and perform actions, and get rewards or penalties in return. By an iterative process the agent learns by itself the best strategy to maximize rewards. Check out this post if you would like to look into this technique in further detail.
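To make the difference between supervised and unsupervised learning concrete, here is a minimal sketch (assuming scikit-learn is installed; the toy numbers are invented for illustration): a linear regression fit on labeled data next to K-Means run on unlabeled data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: labeled data (x, y) -> estimate a hypothesis y_hat = f_hat(x).
X_labeled = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])             # known labels
reg = LinearRegression().fit(X_labeled, y)
print(reg.predict([[5.0]]))                     # prediction for a new instance

# Unsupervised: no labels, just find structure (here, two clusters).
X_unlabeled = np.array([[0.1, 0.2], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
print(km.labels_)                               # cluster assignment per instance
```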

SO MANY OPTIONS!!! I know, trust me, there are dozens of algorithms we could implement. But are there too many? If you have ever taken a class or read a little bit about algorithms, you may have heard that it is crucial to implement the right algorithm, at the right time, for the right task. Interestingly, this need not be the case for Machine Learning algorithms. David Wolpert demonstrated that if you make no assumptions about your data, then there is no reason to prefer one model over any other; the only way to be certain is to evaluate all of them. This is called the No Free Lunch (NFL) theorem and was introduced in this paper. Furthermore, with the rise of Big Data, it seems that even the choice of algorithm is not as important as was once thought. In 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different ML algorithms, even the simplest ones, performed almost identically well on a given task once they were given enough data.

Source: Banko & Brill (2001), https://dl.acm.org/doi/10.3115/1073012.1073017

This was a controversial idea, suggesting that "researchers may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development (i.e. larger datasets)". Although the NFL theorem is a theoretical result and the power of Big Data is an empirical observation, both suggest powerful ideas that we will explore in further detail throughout this series.

To conclude this introductory post, I want to show you what a supervised learning workflow looks like. The process is very similar for the other approaches (unsupervised, semi-supervised), and since supervised algorithms are arguably the most common technique used in the real world, I believe it will be enlightening to explore the nuts and bolts before moving on to implementation. This is how the process looks:

  1. Preprocessing: Collect data and label it. Then use it to create your dataset (or corpus of data). Divide the full dataset into two parts, training and testing; a common split is 70% training data and 30% testing data. (An end-to-end sketch of this workflow follows the list.)
  2. Learning: Feed the training data to your model. Training, or learning, happens by iteratively updating the weights the model assigns to each feature while attempting to minimize a cost function. When the optimal combination of weights (i.e. the one that minimizes the cost function) is found, training stops and you are left with your final model. There are several techniques we can use to pick the best model, among them Cross-Validation and Hyperparameter optimization; more on this in upcoming posts.
  3. Evaluation: Evaluate how well your model generalizes. In other words, evaluate the model's performance on previously unseen data (the test sample) by computing performance metrics such as Mean Squared Error (MSE), accuracy, error rate, entropy, etc. If the performance is not great, we can use that feedback to redesign our training strategy.
  4. Prediction: Use your model to make some cool predictions.
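Putting the four steps together, here is a minimal end-to-end sketch using scikit-learn's built-in Iris dataset (my choice of dataset and model is purely illustrative, not the only way to do it): split the data 70/30, train a classifier, evaluate it on the held-out test sample, and make a prediction.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Preprocessing: build the dataset and split it 70% train / 30% test.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. Learning: fit the model's parameters on the training sample.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 3. Evaluation: measure generalization on the previously unseen test sample.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", accuracy, "| error rate:", 1 - accuracy)

# 4. Prediction: use the trained model on new data.
print("predicted class:", model.predict(X_test[:1]))
```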

This concludes my introductory post on Machine Learning; I hope you found it both instructive and entertaining. Machine learning is often pictured as a black-box discipline in which no one really knows what is going on behind the curtains but, hopefully, this post made you realize that this is not the case. In the next post I will cover fundamental concepts of Linear Algebra and Vector/Matrix Calculus, which we will make extensive use of when developing our ML implementation of Linear Regression.
