Machine Learning Math Class
During fall semester at the University of Edinburgh, a few of the required classes covered subjects that I already felt comfortable with (Python, stats, and applied ML), so I had the option to switch to a more challenging class. Luckily, another student in my program wanted to take Machine Learning and Pattern Recognition with me. It was designed by the university to prepare people for PhD research in machine learning, aka a lot of math.
I wish I had had a greater appreciation for and knowledge of linear algebra when I took MATH313 for my math minor at BYU. Linear algebra is powerful, but it can be abstract and unintuitive at times. I wanted to write about some cool things I learned during the class and outline my notes in a way that could be meaningful and potentially helpful.
Week 1 in MLPR
Feature engineering is important for ML tasks, but one goal of ML is to get rid of feature engineering and replace it with feature learning.
$$\text{ML}=\text{fitting functions to data}$$
The simple function \(y=f(\textbf{x};\textbf{w},b)=\textbf{w}^T\textbf{x}+b\) represents a set of features weighted by \(\textbf{w}\) plus a bias \(b\), and it can model a simple linear relationship between \(\textbf{x}\) and \(y\). For a known set of outputs \(y\), we can solve for the weights \(\textbf{w}\). That is the essence of ML: find the right assumptions about the world (a mathematical function), then find the corresponding weights/biases that match the data we can observe.
The linear function is powerful. We can easily extend it to a whole dataset with a design matrix.
$$X = \begin{bmatrix}\textbf{x}_{(1)}^{T} & 1 \\ \textbf{x}_{(2)}^{T} & 1 \\ \vdots & \vdots \\ \textbf{x}_{(n)}^{T} & 1 \end{bmatrix}$$
The column of ones makes the last entry of the weight vector \(\textbf{w}\) act as the bias term. We can also fit polynomials with a linear matrix function in a similar way, using \(f = \textbf{w}^T\phi(\textbf{x})\), where \(\phi(\textbf{x})\) transforms the input into a polynomial representation. The same \(\phi\) can also be used for other basis functions like Gaussians or sigmoids.
Our ML goal is to find a weight vector \(\textbf{w}\) such that \(\Phi\textbf{w} \approx \textbf{y}\) (in the idealized square, invertible case, \(\textbf{w}=\Phi^{-1}\textbf{y}\); in practice, a least-squares fit). Fits aren't perfect because data is noisy. If we don't have the right assumptions about the model, or there isn't enough complexity (i.e., not enough basis functions), we tend to underfit. We can also overfit if we have enough complexity to memorize our data. We often want to start from a simple baseline that underfits and then expand to something more complex.
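To make this concrete, here is a minimal numpy sketch of fitting a polynomial basis with least squares. The cubic basis, the noisy toy sine data, and the helper name `poly_design_matrix` are my own choices for illustration, not from the course.

```python
import numpy as np

# Toy 1-D data: a noisy curve we want to fit (invented for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.shape)

def poly_design_matrix(x, degree):
    """Columns are x^0 (the column of ones for the bias), x^1, ..., x^degree."""
    return np.stack([x**k for k in range(degree + 1)], axis=1)

# Least-squares fit: solve Phi w ~= y rather than inverting Phi directly.
Phi = poly_design_matrix(x, degree=3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predictions at new inputs reuse the same basis transformation.
x_test = np.linspace(-1, 1, 100)
y_pred = poly_design_matrix(x_test, degree=3) @ w
```

Swapping `degree=3` for `degree=1` gives an underfitting straight line, while a very high degree has enough flexibility to start chasing the noise.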
Gaussians
In any stats class you quickly become familiar with the Gaussian distribution. This distribution is everywhere, especially in ML. There are a lot of normally distributed random variables in life, and ML is all about representing the real world.
We did a lot of work with multivariate Gaussians. I have seen a lot of point clouds, and I have seen them used in many different situations, like speech recognition using HMMs and research about improving phone error rates.
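For a rough picture of where those point-cloud plots come from, here is how one might draw samples from a 2-D Gaussian in numpy. The mean and covariance are made up for the example; the Cholesky-factor trick is the standard way to do it.

```python
import numpy as np

rng = np.random.default_rng(1)
mean = np.array([0.0, 2.0])
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])  # must be positive definite

# Transform standard normal samples with the Cholesky factor so that
# the result has covariance L @ L.T == cov.
L = np.linalg.cholesky(cov)
samples = mean + rng.standard_normal((1000, 2)) @ L.T

print(samples.mean(axis=0))            # close to `mean`
print(np.cov(samples, rowvar=False))   # close to `cov`
```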
Classification tasks
A major use case of ML is classification. I learned what the \(\propto\) symbol means (is proportional to). I also learned that our friend the logarithm is ubiquitous, for several reasons. One, it lets us move between an unconstrained variable and one constrained to the positive domain. Two, multiplication in normal space becomes addition in log space. Three, taking the log of a feature often gives you something closer to a Gaussian distribution, which is easier to work with. Lastly, summing logs avoids the numerical underflow and overflow you get when multiplying many probabilities directly.
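A small sketch of that numerical point, with tiny made-up probabilities chosen to trigger underflow:

```python
import numpy as np

# Multiplying many small probabilities underflows to zero in floating point...
probs = np.full(1000, 1e-5)
print(np.prod(probs))             # 0.0 -- underflow

# ...but summing their logs is perfectly stable.
print(np.sum(np.log(probs)))      # about -11512.9

# When we do need to add probabilities stored in log space,
# the log-sum-exp trick subtracts the max first to stay stable.
def logsumexp(log_p):
    m = np.max(log_p)
    return m + np.log(np.sum(np.exp(log_p - m)))
```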
Another important part of machine learning, and of the class, was understanding and using Bayes' rule to manipulate probabilities. Before school, I watched the Stanford lectures on NLP, and they talked a lot about negative log probabilities, which was a big question mark for me. This class gave me the intuition as to why we use negative log probabilities. The goal of our system is to find the function that yields the highest probability of an output \(y\) given inputs \(\textbf{x}\) and weights \(\textbf{w}\), or
$$\prod^N_{n=1} p(y^{(n)} | \textbf{x}^{(n)},\textbf{w})$$
That is essentially the same thing as minimizing the negative log probability, because minimizing the negative is the same as maximizing the positive, and the log is monotonic, so it doesn't move the location of the maximum. These ideas make so much more sense once you gain the intuition behind them.
$$\underset{\theta}{\mathrm{argmax}}\prod^N_{n=1} p(y^{(n)} | \theta) = \underset{\theta}{\mathrm{argmin}} -\sum^N_{n=1} \log p(y^{(n)} | \theta)$$
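To convince myself of this numerically, here is a hedged toy check that the two criteria pick the same parameter, fitting the mean of a Gaussian over a grid of candidate values (the data and grid are invented):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.0, size=50)
candidates = np.linspace(0, 6, 601)   # candidate values of the Gaussian mean

# Product of likelihoods vs. sum of negative log-likelihoods.
likelihood = np.array([np.prod(norm.pdf(data, loc=m, scale=1.0)) for m in candidates])
neg_log_lik = np.array([-np.sum(norm.logpdf(data, loc=m, scale=1.0)) for m in candidates])

# Both criteria pick (essentially) the same mean, close to the true value 3.
print(candidates[np.argmax(likelihood)], candidates[np.argmin(neg_log_lik)])
```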
Bayes is the Probability King
My course focused a lot on the Bayesian treatment of ML tasks. This means we start with some distribution that we call our prior, and we use Bayes' theorem, with our data as evidence and a calculated likelihood, to update that distribution into what we call the posterior distribution. The main point is that we have a prior belief and we want to update that belief with data.
$$p(\textbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\textbf{w}) p(\textbf{w})}{p(\mathcal{D})}$$
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$
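For the linear-regression setting above, with a Gaussian prior on the weights and Gaussian observation noise, the posterior over \(\textbf{w}\) has a closed form. Here is a numpy sketch of that standard conjugate result; the function name, prior scale, and noise level are my own choices.

```python
import numpy as np

def bayes_linreg_posterior(Phi, y, sigma_w=1.0, sigma_y=0.1):
    """Posterior over weights for y = Phi @ w + noise, assuming a prior
    w ~ N(0, sigma_w^2 I) and Gaussian noise with std sigma_y."""
    D = Phi.shape[1]
    precision = Phi.T @ Phi / sigma_y**2 + np.eye(D) / sigma_w**2
    V_N = np.linalg.inv(precision)        # posterior covariance
    m_N = V_N @ Phi.T @ y / sigma_y**2    # posterior mean
    return m_N, V_N

# Toy check on a noisy line y = 2x + 1.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 40)
Phi = np.stack([x, np.ones_like(x)], axis=1)   # trailing ones give the bias weight
y = 2 * x + 1 + 0.1 * rng.standard_normal(40)
m_N, V_N = bayes_linreg_posterior(Phi, y)
print(m_N)   # roughly [2, 1]
```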
Using a Bayesian treatment doesn't solve the problem that model assumptions can still be bad or too simple. Fitting a Bayesian linear regression still assumes linearity. If our data has a curved shape, the likelihood \(p(\mathcal{D}|\textbf{w})\) will always be very small, so even the best-scoring fit is still a bad fit.
With probabilistic treatments of our data, the sum rule and the product rule are extremely helpful in our calculations. The sum rule lets us introduce a new variable into a probability and integrate it back out. The product rule lets us pull a joint probability apart into the product of two probabilities.
$$\text{Sum Rule: } p(y|\textbf{x},\mathcal{D}) = \int p(y,\textbf{w}|\textbf{x},\mathcal{D})d\textbf{w}$$
$$\text{Product Rule: } \int p(y,\textbf{w}|\textbf{x},\mathcal{D})d\textbf{w} = \int p(y|\textbf{x},\mathcal{D},\textbf{w})p(\textbf{w}|\textbf{x},\mathcal{D})d\textbf{w} = \int p(y|\textbf{x},\textbf{w})p(\textbf{w}|\mathcal{D})d\textbf{w}$$
The second simplification step can happen because our answer \(y\) doesn't depend on \(\mathcal{D}\) once we know the input and the weights. Also, the weights \(\textbf{w}\) don't depend on the test location \(\textbf{x}\), so we drop these from the conditional probabilities.
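One hedged way to see what that integral buys us: draw samples of \(\textbf{w}\) from the posterior (for example, the \(m_N, V_N\) computed in the earlier sketch) and average the resulting predictions. The helper below is my own illustration, not course code.

```python
import numpy as np

def predictive_samples(phi_star, m_N, V_N, n_samples=1000, rng=None):
    """Approximate p(y | x*, D) = ∫ p(y | x*, w) p(w | D) dw by sampling w."""
    if rng is None:
        rng = np.random.default_rng(4)
    W = rng.multivariate_normal(m_N, V_N, size=n_samples)  # draws from p(w | D)
    f_samples = W @ phi_star          # f(x*) under each sampled weight vector
    # Mean prediction and the spread coming from weight uncertainty
    # (observation noise would be added on top of this).
    return f_samples.mean(), f_samples.std()

# e.g. predictive_samples(np.array([0.5, 1.0]), m_N, V_N) with the earlier posterior.
```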
Gaussian Processes
This week we talked about Gaussian processes. This was a difficult concept, but the general idea is that we can compute our uncertainty about a function given data. This uncertainty is powerful because we can focus our data-collection efforts on the locations where we are most uncertain. Gaussian processes have kernel functions that encapsulate beliefs about a function: whether its values change rapidly or slowly, and whether it has many turns or few. Kernel parameters are good hyperparameters to tune a model with, but they again work under the assumption that we are representing the world with multivariate Gaussians. Two steps that come up often in model selection are, first, identify a probabilistic model of the data and, second, minimize the negative log probability.
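A hedged sketch of what a kernel "believes": drawing random functions from a GP prior with a squared-exponential (RBF) kernel, where the lengthscale controls how quickly the sampled functions wiggle. The hyperparameter values are arbitrary.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.3, signal_var=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

# Draw three random functions from the GP prior at a grid of inputs.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))        # jitter for numerical stability
L = np.linalg.cholesky(K)
prior_draws = L @ rng.standard_normal((len(x), 3))  # columns are sample functions
```

Shrinking the lengthscale makes the sampled functions wiggle faster, which is exactly the kind of belief the kernel encodes.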
I had heard a lot about this idea of a softmax before taking this course. What is a softmax? It is a function that takes numerical outputs (scores) and turns them into probabilities. In the context of the ML work we did in this class, the softmax is what let us do multi-class regression/classification.
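The usual numerically stable way to write it (subtracting the max changes nothing mathematically but avoids overflow):

```python
import numpy as np

def softmax(scores):
    """Turn a vector of real-valued scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]
```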
Neural Networks
Neural networks are awesome.
The power of a neural network is in the non-linear activation functions. If you imagine a graph and a classification task, sometimes it is easy to draw a straight line that divides your two classes, but often it isn't. Non-linearities basically let us fold the space so that a straight line in the folded space becomes a boundary that separates the classes in the original space.
Another awesome benefit of neural nets is that we can optimize our weights by gradient descent. We have a bunch of functions that compose to produce an output, and we can differentiate backwards through the network to update our weights. The main process is finding a minimum of our cost function. Each weight gets updated using the direction we need to move to reach a minimum (the derivative of the cost with respect to the weight) and a step size \(\eta\). This is one of the secrets to the incredible learning of ML systems.
$$w \leftarrow w - \eta \frac{\partial C}{\partial w} $$
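That update rule in a few lines, on a toy quadratic cost whose minimum we know is at \(w = 3\) (the cost and step size are just for illustration):

```python
# Toy cost C(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)   # dC/dw

w = 0.0      # initial weight
eta = 0.1    # step size
for _ in range(100):
    w = w - eta * grad(w)    # the update rule above

print(w)   # very close to 3
```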
Backpropagation by hand is terrible, but doable. Thank goodness for automated processes and well-established coding libraries.
Autoencoders and Principal Component Analysis
This week was a super interesting one for me. We learned about autoencoders and Principal Component Analysis (PCA).
First, autoencoders help us with unsupervised learning (when we don't have labels). The goal is to teach a system to take an input, purposefully ruin it, and then learn how to reconstruct the original. Several approaches include reducing dimensionality, adding noise, corrupting the input, and dropping weights.
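As a bare-bones sketch of the "reduce the dimensionality and reconstruct" flavour, here is a linear autoencoder with a 2-D bottleneck trained by plain gradient descent. The toy data, sizes, and learning rate are my own; real autoencoders add non-linearities and use an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy 5-D data that mostly lives in a 2-D subspace, plus a little noise.
latent = rng.standard_normal((500, 2))
X = latent @ rng.standard_normal((2, 5)) + 0.05 * rng.standard_normal((500, 5))

D, k, eta = 5, 2, 0.01
W_enc = 0.1 * rng.standard_normal((D, k))   # encoder: 5-D -> 2-D bottleneck
W_dec = 0.1 * rng.standard_normal((k, D))   # decoder: 2-D -> 5-D reconstruction

for _ in range(2000):
    Z = X @ W_enc                      # encode
    X_hat = Z @ W_dec                  # decode
    G = 2 * (X_hat - X) / len(X)       # gradient of mean squared error w.r.t. X_hat
    grad_dec = Z.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= eta * grad_enc
    W_dec -= eta * grad_dec

# Reconstruction error should end up far below the raw data variance.
print(np.mean((X - (X @ W_enc) @ W_dec) ** 2), np.mean(X**2))
```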
PCA was a super interesting idea for me. It essentially breaks a high-dimensional space down into a lower-dimensional subspace and quantifies the significance of each of those dimensions. This is achieved by identifying a \(D \times K\) matrix \(V\) such that \(\textbf{y}=VV^T \textbf{x}\), where the columns of \(V\) correspond to the most important/significant directions of variance seen in the data. This means that using a \(D \times 2\) matrix, we can find the two most important directions in a dataset and produce a quick 2D visual of complex data. This can be used to visualize groupings of word embeddings when doing NLP.
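A minimal PCA sketch using the eigendecomposition of the data covariance (the random 5-D data is a stand-in, and the helper name is mine):

```python
import numpy as np

def pca(X, k=2):
    """Project the D-dimensional rows of X onto the top-k principal directions."""
    X_centred = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_centred, rowvar=False))
    order = np.argsort(eigvals)[::-1]       # largest variance first
    V = eigvecs[:, order[:k]]               # the D x k matrix from the text
    return X_centred @ V, V                 # low-dimensional coordinates and directions

# Toy 5-D data whose variance is dominated by the first two axes.
rng = np.random.default_rng(6)
X = rng.standard_normal((200, 5)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])
coords_2d, V = pca(X, k=2)   # coords_2d is ready for a 2-D scatter plot
```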
This type of technique is very helpful for analysis but not very helpful for prediction.
I thought about this idea and how it could be applied to the realm of speech production. Speech synthesis sounds really natural right now, but accents are still very stereotypical. What are the principal components that define an accent? If we know what a stereotypical Spanish accent sounds like, can we extrapolate what a light or medium Spanish accent sounds like? I want to explore this for a potential dissertation in the summer.
The rest of the course
The remainder of the course was dedicated to approximations and re-parameterization. We need a way to do calculations on distributions that are not Gaussian. Gaussians are easy to sample from and manipulate, so a lot of the tricks boil down to replacing a hard distribution with a Gaussian one. The Laplace approximation approximates a distribution with a Gaussian whose mean sits at the mode, a turning point of the function we are approximating. Monte Carlo approximation lets us approximate a prediction with an empirical average over samples, which also underlies importance sampling. We can also define a simpler distribution and calculate (and minimize) the divergence between two distributions, like a Gaussian and some posterior.
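As one hedged example, here is the Laplace idea on a Gamma distribution standing in for an awkward posterior: find the mode, measure the curvature of the log density there, and report a Gaussian with that mean and variance. The Gamma example and the finite-difference curvature are my own illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma, norm

a = 5.0   # a Gamma(a, scale=1) density stands in for an awkward posterior

def log_p(t):
    return gamma.logpdf(t, a)

# 1. Find the mode of the log density.
mode = minimize_scalar(lambda t: -log_p(t), bounds=(1e-6, 50), method="bounded").x

# 2. Measure the curvature there with a finite-difference second derivative.
h = 1e-4
curvature = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2

# 3. The Laplace approximation is a Gaussian centred at the mode.
laplace_approx = norm(loc=mode, scale=np.sqrt(-1.0 / curvature))
print(mode, -1.0 / curvature)   # both close to a - 1 = 4 for this example
```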
I really enjoyed taking Machine Learning and Pattern Recognition. It was a true challenge for me, but I learned so much about probabilistic reasoning, ML basics, ML techniques, and most importantly I learned to understand the process and reasoning behind certain ML decisions.
I will hopefully be helping with many machine learning systems over the course of my career. I know that this class gave me a foundational understanding of machine learning.
The whole course website is online and can be found at https://mlpr.inf.ed.ac.uk.