I have long been interested in AI and Machine Learning (ML), having presented on the topic at Dreamforce a few years ago, written on the topic for Salesforce, and completed a course on the topic through Coursera. I find the subject extremely interesting, partly because of the possibilities of AI/ML systems, partly because it appeals to my mathematical background, and partly because figuring out how these black-box systems actually work is a puzzle in itself.

In 2016, Salesforce announced Einstein as a way of making AI and ML tools available to a wider audience. Whilst I think this is great, it doesn’t scratch my itch of understanding how and why something works. I also know within the Salesforce community there is a healthy level of discussion around what is really AI, what is ML, what is the difference, and when is it just statistics or a set of if statements?

So to help answer this I have decided to write this blog series. The plan here is to start laying the groundwork on some of the maths we will need, then talk through a number of popular AI/ML algorithms, implement them in Apex and see some results. We will get to peek behind the curtain at some common algorithms and walk across the line from statistics into ML.

# Major Caveat Time

You should use Einstein or another purpose-built tool, and not use any of the code in this or subsequent posts in Production. We are going to have to do some fun coding to work within governor limits, and whilst we will learn a lot along the way, a tool built for this purpose like Einstein will be a better option overall.

# Laying the Groundwork - Maths

We will go into more detail on the maths required for each algorithm as we go, but let us begin with some common background in maths and statistics.

## Sampling the World

We will always be working with the concept of a sample of data - in most instances we don’t anticipate having the full set of all data points available to us. In fact, there would be little point running a predictive algorithm over a dataset for which we already know every possible value. The set of data we will be working with is a subset of all the possible data, and we call this subset our *sample*. Typically, the bigger the sample the better, although a bigger sample obviously slows down our processing.

## What Do You Mean?

The *mean* (often called the average) is the expected result if we were to draw from our data sample at random. The result may be a bit nonsensical in itself, but it is a useful property. For example, the expected outcome for a dice roll is 3.5, which we know is not a possible roll.

We calculate the mean by summing (adding up) every value in the sample and then dividing by the number of values in the dataset. So for our dice:

[\frac{1+2+3+4+5+6}{6} = 3.5]

Thus the formula for the mean is:

[\frac{1}{n} {\sum\limits_{i=1}^n x_i}]

where the big “E” looking symbol (capital Greek sigma) means sum (add up) all these items, the i=1 at the bottom and the n at the top mean for each value 1,2,…,n where n is the size of our sample, and x_{i} refers to the i^{th} member of our set. Then we divide that sum by n (shown here as multiplying by 1 divided by n - a maths notation convention), the number of items in the sample. I know for some this syntax will be familiar, but for others it might not be. Note as well that we have a special symbol for the mean, x with a bar across the top:

[\bar{x} = { \frac{1}{n} {\sum\limits_{i=1}^n x_i} }]
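The series will get to Apex implementations next time; since Apex syntax closely mirrors Java, here is a minimal, runnable Java sketch of the mean formula above (the class and method names are my own, not from any library):

```java
public class Stats {
    // Mean of a sample: sum every value, then divide by the sample size n.
    public static double mean(double[] sample) {
        double sum = 0;
        for (double x : sample) {
            sum += x; // the capital-sigma summation
        }
        return sum / sample.length; // the 1/n factor
    }

    public static void main(String[] args) {
        double[] dice = {1, 2, 3, 4, 5, 6};
        System.out.println(Stats.mean(dice)); // prints 3.5
    }
}
```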

## Variance and Deviation

So now we know what our expected value is, we can start to think about the rest of the data and how it is spread out. The measure for the spread of some data is called the variance and the square root of this is called the standard deviation. For the work we are doing, the standard deviation is most important but we calculate it using the variance. The variance can be calculated for a sample as:

[{ \frac{1}{n - 1} } \sum\limits_{i=1}^n (x_i - \bar{x})^2]

Okay, so we have seen all of these symbols before, so there should be nothing too scary for us here. What we are doing is, for every point in the sample, finding out how far away it is from the mean (this could be negative in some cases), squaring it (so it is always positive), then adding up all those values and dividing by 1 less than the size of the sample. We divide by n - 1 because it counters the bias inherent in working with a sample rather than the entire population, and as n gets very large, 1 divided by n and 1 divided by n - 1 become basically the same thing (1/100000 and 1/100001 are for all reasonable purposes the same).

To get the standard deviation we just take the square root of this.
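As with the mean, a hedged Java sketch (Apex syntax is very similar; the names are mine) of the sample variance and standard deviation described above:

```java
public class Spread {
    // Sample variance: average squared distance from the mean,
    // dividing by n - 1 to counter the sampling bias.
    public static double sampleVariance(double[] sample) {
        double mean = 0;
        for (double x : sample) mean += x;
        mean /= sample.length;

        double sumSq = 0;
        for (double x : sample) {
            double diff = x - mean; // may be negative
            sumSq += diff * diff;   // squared, so always positive
        }
        return sumSq / (sample.length - 1);
    }

    // Standard deviation: the square root of the variance.
    public static double stdDev(double[] sample) {
        return Math.sqrt(sampleVariance(sample));
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5}; // mean 3, squared deviations sum to 10
        System.out.println(Spread.sampleVariance(data)); // prints 2.5
    }
}
```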

## Matrices and Entering the Matrix

We are going strong now with the statistics we will need; let’s bring today’s post home with a final piece of algebra: matrices.

Let’s begin by imagining we are representing a point in space with 3 numbers. Look at the room you are in and look at the bottom left hand corner of a wall you can turn to face. Now any point in that room can be described by 3 numbers: one saying how far to the right the point is, one saying how high it is, and one detailing how close it is. We could put those 3 numbers into a list, and that list describes the point exactly. Let’s say it is 1.5 metres to the right, 2 metres above the ground and 4 metres towards you from the wall. We could write that as:

[\begin{bmatrix} 1.5 & 2 & 4 \end{bmatrix}]

This is called a vector. Now let’s say we have 3 points in space with vectors a,b,c we wanted to represent; well, we can write them as 3 vectors together:

[\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix}]

where the subscripts here refer to the components of the vector.
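In code, one natural way to hold a vector is an array, and a matrix is then an array of row arrays. A quick Java sketch (Apex is close to this syntax; the values for the second and third points are made up for illustration):

```java
public class Points {
    // The point from the text: 1.5 m right, 2 m up, 4 m towards you
    static double[] point = {1.5, 2, 4};

    // Three points a, b, c stacked as rows form a 3x3 matrix
    // (rows two and three are invented example points)
    static double[][] matrix = {
        {1.5, 2, 4},
        {0.5, 1, 2},
        {3.0, 0, 1},
    };

    public static void main(String[] args) {
        // matrix[row][column]: second component of the first point
        System.out.println(Points.matrix[0][1]); // prints 2.0
    }
}
```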

Now we can do all sorts of cool things with matrices that we will get to later on, but today I am going to focus on 2 simpler operations - subtraction and pointwise exponents.

### Subtraction

To subtract one matrix from another, they must have the same number of columns and rows (the same dimensions), and you match up entries, subtracting them in what is called a pointwise manner. Formally:

[\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix} - \begin{bmatrix} d_1 & d_2 & d_3 \\ e_1 & e_2 & e_3 \\ f_1 & f_2 & f_3 \end{bmatrix} = \begin{bmatrix} a_1 - d_1 & a_2 - d_2 & a_3 - d_3 \\ b_1 - e_1 & b_2 - e_2 & b_3 - e_3 \\ c_1 - f_1 & c_2 - f_2 & c_3 - f_3 \end{bmatrix}]

So a real example:

[\begin{bmatrix} 6 & 3 & 2 \\ 8 & 4 & 9 \\ 8 & 3 & 1 \end{bmatrix} - \begin{bmatrix} 5 & 6 & 3 \\ 2 & 4 & 3 \\ 9 & 8 & 3 \end{bmatrix} = \begin{bmatrix} 6 - 5 & 3 - 6 & 2 - 3 \\ 8 - 2 & 4 - 4 & 9 - 3 \\ 8 - 9 & 3 - 8 & 1 - 3 \end{bmatrix} = \begin{bmatrix} 1 & -3 & -1 \\ 6 & 0 & 6 \\ -1 & -5 & -2 \end{bmatrix}]

This is going to be useful, because if we have a series of data points, we can store them in a matrix and use that to do our processing more rapidly.
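The subtraction above can be sketched in a few lines of Java (again, Apex will look almost identical; the class and method names are my own):

```java
public class MatrixMath {
    // Pointwise subtraction: both matrices must have the same dimensions.
    public static double[][] subtract(double[][] a, double[][] b) {
        double[][] result = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                result[i][j] = a[i][j] - b[i][j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The worked example from the text
        double[][] a = {{6, 3, 2}, {8, 4, 9}, {8, 3, 1}};
        double[][] b = {{5, 6, 3}, {2, 4, 3}, {9, 8, 3}};
        System.out.println(java.util.Arrays.deepToString(subtract(a, b)));
        // prints [[1.0, -3.0, -1.0], [6.0, 0.0, 6.0], [-1.0, -5.0, -2.0]]
    }
}
```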

### Pointwise Exponent

Okay, so technically this is called the Hadamard power in a lot of literature but I am using Pointwise Exponent because I think it explains the function more readily. We will use the Hadamard notation though. So for a matrix A as follows:

[A = \begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix}]

The Hadamard power is:

[A^{\circ k} = \begin{bmatrix} {a_1}^k & {a_2}^k & {a_3}^k \\ {b_1}^k & {b_2}^k & {b_3}^k \\ {c_1}^k & {c_2}^k & {c_3}^k \end{bmatrix}]

for some power k. So Hadamard squaring a matrix A is then:

[A^{\circ 2} = \begin{bmatrix} {a_1}^2 & {a_2}^2 & {a_3}^2 \\ {b_1}^2 & {b_2}^2 & {b_3}^2 \\ {c_1}^2 & {c_2}^2 & {c_3}^2 \end{bmatrix}]

Again we will use this later when we are working with our algorithms to help make things easier.
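A Java sketch of the Hadamard power, in the same spirit as the subtraction snippet (Apex syntax is close to this; the names are mine):

```java
public class Hadamard {
    // Hadamard (pointwise) power: raise every entry of the matrix to the power k.
    public static double[][] power(double[][] a, double k) {
        double[][] result = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                result[i][j] = Math.pow(a[i][j], k);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        // Hadamard squaring: each entry squared individually
        System.out.println(java.util.Arrays.deepToString(power(a, 2)));
        // prints [[1.0, 4.0], [9.0, 16.0]]
    }
}
```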

# Summary of Part 1

I think that is enough for this post; we have all the mathematical basics we need to start preparing for our first algorithm. Next time we are going to see how we can put this lot into Apex, test it all works, and hopefully get on to actually doing our first algorithm!