Popular and even technical media inundate us with the analogy that artificial neural networks function like synapses in the human brain. Writers and speakers dramatically extend the analogy, referring to the parameters of regression equations as “neurons” that are “activated” by training. Straightforward machine learning methodology is now ubiquitously conflated with artificial intelligence. The language is so commonplace that blogs and even textbooks use it without acknowledging the reality that it is a loose analogy with no functional value for understanding how machine learning methods actually work today. The popular artificial intelligence hype is packed with assumptions and implications about the human brain that were apparently seeded before anyone knew the field would become wildly popular. Now it may be too late to change the terms, no matter how misleading they are.

If anything ever becomes certain about how actual brain physiology produces our ideas, memory, consciousness, and analytical thinking, it is likely to turn ML language on its head. To set up a meaningful contrast, the purpose of this article is to explain the fundamentals of ML using the simple, ordinary language of the mathematics on which it is based.

Above all, ML is math. There is no mystery involved. ML algorithms do exactly what they are designed and explicitly programmed to do. Claiming otherwise may sell ads, but the hype will deflate and we’ll have to deal with reality. So let’s take the plunge and deal with reality now.

Linear Regression

Regression analysis is more than 200 years old; it was first introduced by the mathematician Legendre. Linear regression remains the core of most machine learning applications in use today: once physical data are rendered in numerical form, linear regression is applied to predict future values.

Let’s have a look at a primitive example of how linear regression works and then expand the idea toward current uses in technology, such as face and handwriting recognition. Suppose you gather a collection of data points and plot them in red as in Diagram 1. What is the simplest way to generalize this dataset?

A standard method is to write the equation of the line which best fits the data. Roughly speaking, the red line drawn through the data points is a good approximation of the data. If you input any value for x, the output will lie somewhere on the red line and will serve as an approximation for predicting other values in the neighborhood of these data points. This is the most basic neural network, and it is so simple that readers may find it hard to believe that what is now packaged in impenetrable language and called “machine learning” is really just an increasingly complex extension of this method. To unpack the embellished terminology, let’s start with the simplest mechanics and work toward increasing complexity, expanding our view to actual ML models. The “neural net” in Diagram 1 defines a model with only a single descriptive feature, which is why Andrew Ng calls it the simplest neural network.

In fact, for the red line in the diagram, the coefficient 2 in the equation y = 2x + 3 is the “parameter” which determines the slope of the line. Parameters like this are what programmers fine-tune in more complex ML regression models, and it is this parameter which is popularly referred to as a “neuron” in tech blogs, and even in research and academic presentations. When this “neuron” is adjusted, the slope of the line changes; adjusting it to a value of 1.903 might make the equation fit the data more accurately. It is precisely this procedure of fitting a line to data which is now referred to as learning; in fact, the step size used when adjusting the parameter is called the learning rate. The 3 in our equation determines the location of the line on the graph. Can machine learning actually be this simple?
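To make this concrete, here is a minimal Octave sketch, using invented points rather than the actual data of Diagram 1, of finding the best-fitting slope and intercept by least squares:

x = (0:9)';                          % ten hypothetical input values
y = 2 * x + 3 + 0.5 * randn(10, 1);  % points scattered around the line y = 2x + 3

p = polyfit(x, y, 1);    % least squares fit of a line: p(1) is the slope, p(2) the intercept
yhat = polyval(p, 5.5);  % prediction for a new input in the neighborhood of the data

Run it a few times and the fitted slope lands near 2 and the intercept near 3, exactly the kind of small adjustment (2 becoming 1.903, say) described above.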

As we have seen, a single linear equation is a fair approximation of the dataset in our diagram. But there is an obvious limit to how far such a simplistic model can be refined, and this, as you may have already guessed, is little more than an average of our data values. A model like this is not sophisticated enough to be useful for much beyond predicting the speed of a car or the distance it has traveled.

Suppose we want to predict house prices and our dataset contains only the number of bedrooms. That single feature is not enough to make accurate predictions. Now imagine adding another equation and another variable which uniquely adds meaning to our model, a feature such as the area in square feet. Now we have two equations with two variables, and variables like these in ML models are usually called features. The solution to this “system of equations” gives a more accurate prediction of future house prices, as sketched below. Extend this to 100 equations with 100 feature variables, including data such as proximity to expensive private schools, and we are approaching the complexity of today’s machine learning models. Today’s models, however, contain two more advances in method and technology which distinguish them from Legendre and Gauss.
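For instance, the two-feature case might look like the toy Octave sketch below; the houses, features, and prices are invented purely for illustration:

A = [3 1.6;          % house 1: 3 bedrooms, 1,600 square feet (in thousands)
     4 2.4];         % house 2: 4 bedrooms, 2,400 square feet
prices = [330; 540]; % sale prices in thousands of dollars (made-up numbers)

w = A \ prices;          % solve the 2 x 2 system A * w = prices exactly
estimate = [3 2.0] * w;  % rough estimate for a 3 bedroom, 2,000 square foot house

With two equations and two unknowns the system is solved exactly; with 100 features and thousands of houses it becomes an overdetermined system solved in the least squares sense.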

ML models used today in computer vision algorithms for face recognition, assisted medical diagnosis, and natural language processing are more sophisticated, but they are still fundamentally based on the least squares method devised by the mathematicians Legendre and Gauss 200 years ago. So, why change the name from least squares to neural network? What has changed recently to qualify this model as a “neural network”? Is it the predictive strength?

As we have discovered, ML neural nets are actually systems of linear equations, and these are reduced to matrices containing only the parameters (the “neurons”) so that the equations can be solved using the matrix transpose method, sketched below. In Legendre’s day, calculating the transpose of a 100 x 100 matrix was not realistic. Today it is not only realistic, it is a comparatively small operation. The ability to do such massive computations in fractions of a second is frankly one of the cornerstone dependencies of modern AI techniques. The success of combining GPUs (graphics processing units originally intended for fast video gaming) with this two-hundred-year-old mathematics yields results such as an iPhone recognizing its owner’s face. That success has aroused such wild fervor that no exaggeration seems too far off the charts when proclaiming the merits and mysteries of AI. However, the hype around artificial intelligence is sufficiently disproportionate that perhaps only one of the recent pioneers in the field has a voice loud enough to tame the madding crowd.
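In code, the “matrix transpose method” is usually written as the normal equation. Here is a minimal Octave sketch, again with invented numbers standing in for real listings:

X = [1 3 1.416;
     1 3 1.600;
     1 3 2.104;
     1 2 1.000;
     1 4 3.000];                % a column of ones for the intercept, bedrooms, square feet (thousands)
y = [232; 330; 400; 170; 540];  % sale prices in thousands of dollars (made-up)

theta = pinv(X' * X) * X' * y;  % normal equation: the transpose does the heavy lifting

The product X' * X for 100 features is only a 100 x 100 matrix, a trivial workload for a modern GPU, which is precisely the point.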

Andrew Ng on Neural Nets

In a January 2017 lecture on machine learning at the Future Forum of the Stanford Graduate School of Business, the ML innovator Andrew Ng (Baidu chief scientist, Coursera co-founder, and Stanford professor) delivered a surprising introduction to his example of using neural nets to predict housing prices:

“I want to get slightly technical and tell you what a neural network is, you know, a neural network, loosely inspired by the human brain… tends to make people think we’re building artificial brains, that are just like the human brain. The reality is that today frankly we have almost no idea how the human brain works. So we have even less idea how to build a computer that works just like the human brain. And even though we like to say neural networks work like the human brain, they are so different that we have gone past the point where that analogy is useful.” — Andrew Ng

Ng continues the presentation with the classic house price forecasting neural network model and describes the parameter for house size as a “neuron.” He then asks, “So, what is a neural network?” and answers that it is “taking a bunch of these things and stringing them together.” In other words, taking the equation of a line in Diagram 1 and adding features such as frontage area, school district, and the political boundaries of the neighborhood, all of which affect the price of the house and lead to increasing complexity and accuracy in the forecasting model. A clear example of this kind hopefully helps distinguish machine intelligence from human intelligence.

Increasing Complexity

Ng’s Coursera course on machine learning explores the intricacies of the house price forecasting model used in the Future Forum lecture. To illustrate how the simple linear model grows in complexity, reaching accuracy close to 99% in a variety of applications, let’s touch lightly on a code sample from the actual course. This will extend our own simple linear forecasting model from Diagram 1.

In this example, we have a set of training data, which includes the house size, the number of bedrooms, the lot area, and other features, as well as a target value, which is the actual sale price of each house. For simplicity, let’s assume that all the features are in a two-dimensional array X and the target price values are in a one-dimensional array y.

We first write a cost function, which measures how far the predictions of a linear model like the one in Diagram 1, now carrying the additional parameters, fall from the actual sale prices. This example is coded in the open source Octave language, which is largely compatible with the commercial language MATLAB:

function J = computeCost(X, y, theta)
%COMPUTECOST Compute the cost function for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter vector for linear regression to fit the data points in X and y

m = length(y); % number of training examples

% Vectorized cost: half the mean squared error of the predictions X * theta
J = sum(((X * theta) - y) .^ 2) / (2 * m);

% Equivalent formulation using matrix multiplication instead of sum():
% J = ((X * theta - y)' * (X * theta - y)) / (2 * m);

end
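Written in standard notation, the value this function returns is the familiar least squares criterion (the 1/2 is a convention that simplifies the gradient):

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^{\top} x^{(i)} - y^{(i)} \right)^{2}

Here x^(i) is the feature row for the i-th house and y^(i) its actual sale price.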

In the listing above there are two ways to code the cost function, with only one left uncommented. The trick is to find the parameter values at which the cost function reaches its global minimum, and this leads us to one of the advances in ML which make the method efficient. Octave (and other math-intensive languages) can compute the transpose of a matrix, and the matrix arithmetic built around it, very quickly, which in turn makes minimizing the cost function fast. In the following Octave code, we use the classic gradient descent algorithm as presented by Ng:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta
%   by taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y);                    % number of training examples
J_history = zeros(num_iters, 1);  % cost recorded after each iteration

for iter = 1:num_iters

    h = X * theta;                       % predictions: (m x n) * (n x 1) = (m x 1)
    dif = h - y;                         % prediction errors, dimension m x 1
    theta_chg = (X' * dif) * alpha / m;  % gradient step: (n x m) * (m x 1) = (n x 1), scaled by alpha/m
    theta = theta - theta_chg;           % move the parameters against the gradient

    % The same update written for a single feature j (using dif from the old theta):
    % theta(j) = theta(j) - alpha / m * sum(dif .* X(:, j));

    J_history(iter) = computeCost(X, y, theta);  % track the cost as it decreases

end

end

Here again, an unvectorized alternative is left commented out for comparison, and X' is the transpose of the matrix X, as written in Octave. Generally, when a set of known house prices and features is given, a portion of the dataset is withheld from training and then used to test the accuracy of the model.
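To tie the two functions together, a driver script might look like the minimal sketch below. The feature values, prices, learning rate, and iteration count are all invented for illustration, and the features are normalized so that gradient descent behaves well:

data  = [2.104 3 400; 1.600 3 330; 2.400 3 369; 1.416 2 232; 3.000 4 540];
X_raw = data(:, 1:2);   % square feet (in thousands) and bedrooms
y     = data(:, 3);     % sale prices in thousands of dollars
m     = length(y);

mu    = mean(X_raw);    % per-feature means
sigma = std(X_raw);     % per-feature standard deviations
X     = [ones(m, 1), (X_raw - mu) ./ sigma];  % intercept column plus normalized features

[theta, J_history] = gradientDescent(X, y, zeros(3, 1), 0.1, 400);
price = [1, ([1.650 3] - mu) ./ sigma] * theta;  % forecast for a 1,650 square foot, 3 bedroom house

J_history should decrease steadily from one iteration to the next; if it does not, the learning rate alpha is too large.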

A frequent question is, “Why use a linear model for apparently nonlinear phenomena?” It is a fair question for anyone with a basic knowledge of function analysis. The answer is that, when sufficient features and a rich, valid dataset are available, the linear model reaches accuracy levels greater than 99% for many applications which involve nonlinear relationships. Moreover, the matrix transpose method used above is very fast, but it only applies to models that are linear in their parameters; genuinely nonlinear models call for other optimization techniques. A common compromise, sketched below, is to feed nonlinear transformations of the original features into the same linear machinery.
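One such compromise keeps the least squares machinery intact and simply adds transformed copies of the original feature, as in this small Octave sketch with invented data:

x = (1:10)';                               % a single original feature
y = 0.5 * x .^ 2 + x + 2 + randn(10, 1);   % data following a roughly quadratic trend

X = [ones(10, 1), x, x .^ 2];      % the original feature plus its square as a new feature
theta = pinv(X' * X) * X' * y;     % the same linear least squares solution as before
yhat = [1, 7.5, 7.5 ^ 2] * theta;  % prediction at a new input, x = 7.5

The fitted curve is nonlinear in x but still linear in the parameters theta, which is why the fast linear algebra above still applies.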

Magic or Reality?

If ML is really based on 200-year-old math, why all the hype and repackaging? New things do appear under the sun, and things change, so maybe the new nomenclature is apropos. The video gaming craze spawned GPUs under extraordinary pressure from gamers, and GPUs turn out to be very fast at doing matrix math. The rise of Big Data has rather suddenly contributed mountains of data (most of it spurious and not very useful except for academic purposes). Combining recent hardware, algorithms, and datasets, we have something new, but people are at a loss for what to call it.

Relativity and quantum mechanics received their share of hype, and are still not clearly understood by most. Elucidating relativity may not be realistic for popular journals because it really is multivariate tensor calculus in four dimensions, and the accuracy of analogies and metaphors suffers in proportion to their creativity. Artificial intelligence is now effectively a science fiction which people enjoy believing for the moment. But as the complexity increases the methods of artificial intelligence are likely to follow those of genetics and quantum electrodynamics with regard to how well the public understands what’s really happening under the hood.

 

References

  1. Brain function
  2. Andrew Ng, Future Forum presentation, Stanford Graduate School of Business, January 2017