The ocean is complicated, expansive, and powerful. It is also random and mysterious. This analysis tries to fit a model to this randomness using a machine learning technique called a Gaussian Process Model.

Source: Pexels.com

#### ELI5 - What is a Gaussian Process?

You have some data, and there are an infinite number of functions that can fit this data. It could be a straight line, it could be a quadratic curve, it could be a sine wave. Some are better than others. The assumption is that the choice of *function* relative to its *fit* conforms to a Gaussian probability distribution. But we can do better than that - it is actually a conditional Gaussian distribution because it is *conditional* on the known (or training or seen) data points. In other words, the known data gives us some knowledge to construct a joint Gaussian distribution from which to sample.

Okay, maybe ELI18

### Step 1 - The Data

This data set contains ocean flow velocity in U (east-west) and V (north-south) directions spanning an approximately 1500 km x 1500 km grid over the Philippines Archipelago.

The data comes in the form of 200 csv files, each of which contains a 504 x 555 grid of velocity data, and one mask file that differentiates land and water.

There are 100 csv files for each direction, and each file represents a particular unit in time. We assume that the time points are equally spaced, and that measurements are taken at consistent (x, y) coordinates. This data is loaded into two, 3-dimensional NumPy arrays (x, y, t). Velocity on land is 0.

### Step 2 - Streamplots Visualization

It’s hard to get a sense of what we are actually looking at here with this data. Let’s visualize what the ocean flow looks like using Matplotlib’s Streamplots and Animation functionalities. We also weight the “arrows” by the velocity, so we see where the current is strongest and how it changes over time.

### Step 3 - Dipoles & Correlation

The Indian Ocean Dipole is a temperature anomaly in which the eastern and western regions of the Indian Ocean are extremely negatively correlated, which has far-reaching climate impacts that can be felt as far as South America through its impact on El Nino.

Nothing this impactful exists in the data we have, but it is a good exercise to see if there are some long-range correlations in this data. The algorithm uses random sampling instead of calculating a (504x555)^{2} correlation matrix and is explained in more detail in the docstring of the function titled find_dipoles.

### Step 4 - A Plane Crash!

Imagine there is a hypothetical plane crash in the ocean near the Philippines. A search party knows the location of the accident, but some time has elapsed. Where would be the best place to look after time T = t? We can build a simulation of the ocean flows by using the algorithm described in the docstring of plane_crash.

For a plane crash at point [400, 400] with an assumed variance of 10 for x and y directions.

### Step 5 - Gaussian Priors

In the last step, we made a pretty large assumption about the interval of the time steps. What if we only had data every 3 days or longer - we would need a model that can interpolate this noisy data. The typical method that might come to mind is some kind of regression, but it’s obvious after looking at the random, cyclic nature of the data that we need a more flexible method.

Enter Gaussian Process, a non-parametric approach that “defines a prior over functions, which can be converted into a posterior over functions once we have seen some data (K. Murphy)”

Rasmussen & Williams explain

A Gaussian process is a generalization of the Gaussian probability distribution. Whereas a probability distribution describes random variables which are scalars or vectors (for multivariate distributions), a stochastic process governs the properties of functions. Leaving mathematical sophistication aside, one can loosely think of a function as a very long vector, each entry in the vector specifying the function value f(x) at a particular input x. It turns out, that although this idea is a little naïve, it is surprisingly close what we need. Indeed, the question of how we deal computationally with these infinite dimensional objects has the most pleasant resolution imaginable: if you ask only for the properties of the function at a finite number of points, then inference in the Gaussian process will give you the same answer if you ignore the infinitely many other points, as if you would have taken them all into account!

The key idea here is that if two values are deemed to be similar by the kernel covariance matrix (more on that later), then the outputs of those values in the function should also be correlated. In other words, if two instances in time are close together, their resulting flow velocity should be similar. Sounds intuitive, right? We just have to find the joint probability distribution over the functions. Our known, truth, or seen data are the data in the 3-D matrices, and any unknown or unseen data are any values in between (eg - we know the flow at t=4 and t=5, but t=4.5 is “unseen”).

Ocean flow is a great candidate for being modeled by a Gaussian process because the noisy signal requires a more unconstrained approach to modeling. Ocean flow is a function of numerous random elements like temperature, barometric pressure, cloud cover, etc. The sum off all these yields Gaussian distribution (Central Limit Theorem).

The Gaussian prior is sampling data from a Gaussian distribution with mean = 0 and a variance specified by the kernel covariance matrix, which is constructed with two hyperparameters: l^{2} and σ^{2} or length scale and variance. It is the model *before* the joint distribution is created from the training data.

These appear to be just noisy trends, but the hyperparameters chosen clearly have a large impact on the model even without any data. We'll get into this in step 7.

### Step 6 - Gaussian Posterior

The Gaussian posterior are values sampled from the joint probability distribution of the functions of seen and unseen data. Some pretty detailed linear algebra gets us to this point, and if you are curious, you can read more about it in the sources listed at the end of this README. The implementation is in the function plot_crash_coordinates_gauss_posterior.

We are now continuously sampling from a joint distribution that takes into account the ground truth data to get values at a much higher granularity.

Let’s look at the same posterior model using the same hyperparameters as before.

A definite improvement!

However, these are actually pretty terrible models. The first is definitely overfitting the data and would be quite inflexible, and the second one is much too noisy. We need to optimize the hyperparameters to get a best-fit model, but first we need to look more closely at how they impact the covariance matrix.

### Step 7 - Visualizing Kernels as Covariance Matrices

The covariance matrix is the multivariate extension to the variance specified in a typical Gaussian distribution. Kernels are essentially a similarity metric and are therefore typically used as covariance matrices in machine learning (gross over-simplification - a lot of mathematical proof behind this).

The kernel function will input an Nx1 matrix and return a symmetric, semi-definite NxN matrix and with each *(i, j)* entry equal to the covariance between points *(i, j)*. For cross-covariance, the input is an Nx1 and Mx1 matrix that returns a triangular NxM covariance matrix.

In this particular model, I chose to use a radial basis function kernel, which has two hyperparameters. The l^{2}, or length scale term, indicates how fast the covariance decays as a function of the squared distance between the points, and σ^{2} determines the variance at the diagonals. This is visualized below in the grid of heatmaps, which shows a N=10 kernel with varying hyperparameters.

"

RBF Kernel - description of the kernel used in this model

Covariance - General information about covariance

### Step 8 - K-Fold Cross-Validation & Log Likelihood

One method of hyperparameter selection is to perform a K-Fold Cross-Validation, which leaves out a ‘fold’ of the training data and calculates how well the model performs. This code uses SKLearn’s implementation to help with this.

Source: Rebecca Patro - Towards Data Science

But wait, what exactly is the error function here - how do we gauge how well the model performs? Enter maximizing the log-likelihood - this metric tells us the [logarithm of the] likelihood of seeing the posterior function from each iteration of the cross-validation, given the known training data, the training covariance matrix, and the cross-covariance matrix between the training and [artificially] unseen data. It also takes into account both the model-fit and model-complexity. The implementation can be seen in the get_log_likelihood function.

### Step 9 - Hyperparameter Optimization

Now that we have the framework of cross-validation and log likelihood estimation, optimization is straightforward. We grid search over a range of l^{2} and σ^{2}, which is implemented in the hyperparameter_optimization function.

Below is a heatmap of log likelihoods with varying length scales and variances for the U direction. There’s no monotonic increase and many, many local optimums. There are probably quite a few parameters that would give decent models, and what we found is likely not even the global maximum likelihood.

This 3-D view of log likelihood in the V direction as a function of hyperparameters shows a *general* directional improvement with longer length scales.

### Step 10 - Optimized Gaussian Posterior

Finally, we have the parameters that will give us the best fit. Let’s plot the Gaussian posterior from Step 6.

This is it! This is our model for the point [400, 400]

We could fit a model for every single point in the grid and have continuous estimates of the flow of the ocean in this entire 1500 km x 1500 km grid.

###### Sources

Katherine Bailey blog post - If you want a slightly more in-depth version

Gaussian Process for Machine Learning by Rasmussen & Williams - If you want the entire textbook

Credit to MITx - Data Analysis: Statistical Methods & Computation for this data and the skills to perform this analysis.