Brief History of Deep Learning, and the Artificial Neurons (McCulloch Pitt's Neuron)

1. Historical Roots of Deep Learning

Deep Learning has its root in 1940's, when Warren McCulloch and Warren Pitts published the concept of Artificial Neuron, also referred to as MCP Neuron in 1943 (A Logical Calculus of the Ideas Immanent in Nervous Activity by W.S McCulloch and W.Pitts, Bulletin of Mathetmatical Biophysics, 5(4): 115-133, 193) Reference and Link to Paper

They drew on the three sources to come up with the concept of Artificial Neurons:

knowledge of basic psychology and functions of neuron brain
a formal analysis of propositional logic by Russell and Whitehead, and
Turing's Theory of Computation

1.1 McCulloch Pitt's Neuron

The concept of artificial neuron was inspired by the Biological Neurons which are interconnected nerve cells in the brain and are involved in the processing and transmitting of various signals, which is illustrated in figure below:

A single neuron in its simplified explanation consists of following parts:

SOMA: the main part of the neuron which processes signal.
DENDRITES: the branch-like shapes which receives signals from other neurons (i.e. are input to neuron), and
AXON: a single nerve which sends signals to other neurons (i.e., is an output of the neuron)

and the connections between the nerurons are known as SYNAPSES

The idea of the McCulloch Pitts Neuron was to provide the abstractions on how the brain neuron works, and was considered as a simple logic gate which receives

multiple input binary signals (equivalent to dendrites),
binary output i.e. ON or OFF state (i.e. output of neuron, equivalent to Axon),
and the neruron fires if the accumulates signal of all the inputs have enough stimulation in it, or in other words is above certain threshold.

1.1.1 Formal Mathematical Definition of McCulloch Pitt's Neuron

The inputs to MCP Neuron can be considered as a vector: x: [x₁, x₂, x₃, ....... x_n-1, x_n]
Every input would have certain weights (importance) associated with it and let that be represented by the vector of weights w: [w₁, w₂, w₃, ....... w_n-1, w_n], where each w_i has a value of -1, 0 or 1
- input signal with weight 1 are called excitory input, since they contribute towards a positive output signal in the sum
- input signal with weight of -1 are called inhibitory since they repress a positive output signal in the sum
- input signal with weight of 0 do not contribute at all to the neuron
The accumulated signal (sometimes also referred to as net input) can then be calculated as follows:
- z = w₁.x₁ + w₂.x₂ + .... + w_n-1.x_n-1 + w_n.x_n
Then for some threshold value t, an integer, the output signal is determined by the decision function or activation function as follows:
- y = 1, if z >= t or 0 otherwise

The neuron is said to be activated or in ON state (i.e. having value of 1), when the weighted sum is greater than the threshold value

1.1.2 Original Experiments using McCulloch Pitt's Neuron

The original experiments proposed by Warren McCulloch and Warren Pitts to use this model of artificial neurons, and construct different LOGIC GATES by simply specifying what the weights and threshold value should be. It can be easily shown that MCP Neuron can be used to model the AND, OR and NOT logic gates, as well as the composition of these three logic gates

OR Gate
OR Gate is a logic gate which returns true (1) if atleast one of the input signal is true i.e. 1. This can be achieved by setting

Weights w₁ and w₁ to 1, and
Threshold t = 1 and we can easily see that, whenever the aggregated sum is greater then or equal to threshold t, the output is ON (as expected from OR operation)

x₁	x₂	OR	=>	Agg. Sum	is >= Threshold (1)	Output
0	0	0	=>	1(0)+1(0) = 0	No	0
0	1	1	=>	1(0)+1(1) = 1	Yes	1
1	0	1	=>	1(1)+1(0) = 1	Yes	1
1	1	1	=>	1(1)+1(1) = 1	Yes	1

AND Gate
AND Gate is a logic gate which returns true (1) if atleast one of the input signal is true i.e. 1. This can be achieved by setting

Weights w₁ and w₁ to 1, and
Threshold t = 2 and we can easily see that, whenever the aggregated sum is greater then or equal to threshold t, the output is ON (as expected from OR operation)
x₁ x₂ OR => Agg. Sum is >= Threshold (2) Output
0 0 0 => 1(0)+1(0) = 0 No 0
0 1 0 => 1(0)+1(1) = 1 No 0
1 0 0 => 1(1)+1(0) = 1 No 0
1 1 1 => 1(1)+1(1) = 2 Yes 1

NOT Gate
The NOT Gate inverts the signal of its input, and this can be achieved by having following weights and threshold values:

Weights w₁ and w₁ to -1, and
Threshold t = 0

x₁	NOT	=>	Agg. Sum	is >= Threshold (0)	Output
0	1	=>	-1(0) = 0	Yes	1
1	0	=>	-1(1) = -1	No	0

NAND Gate
NAND Gate is a logical composition of the AND GATE followed by NOT Gate. It negates the logic of AND Gate, returning ON(1) when no more than one of its input signal is true, and returns OFF (0) when all input signals are true (1). This can be achieved by having the following weights and threshold values:

Weights w₁ and w₁ to -1, and
Threshold t = -1

x₁	x₂	NAND	=>	Agg. Sum	is >= Threshold (-1)	Output
0	0	1	=>	-1(0)-1(0) = 0	Yes	1
0	1	1	=>	-1(0)-1(1) = -1	Yes	1
1	0	1	=>	-1(1)-1(0) = -1	Yes	1
1	1	0	=>	-1(1)-(1) = -2	No	0

NOR Gate
NOR Gate is a logical composition of the OR Gate followed by the NOT Gate. It negates the logic of the OR Gate, return ON (1) only when none of the inputs are true. This can be achieved by having the following weights and threshold values:

Weights w₁ and w₁ to -1, and
Threshold t = 0

x₁	x₂	NOR	=>	Agg. Sum	is >= Threshold (0)	Output
0	0	1	=>	-1(0)-1(0) = 0	Yes	1
0	1	0	=>	-1(0)-1(1) = -1	No	0
1	0	0	=>	-1(1)-1(0) = -1	No	0
1	1	0	=>	-1(1)-(1) = -2	No	0

1.1.3 Excitement and Challenges of MCP Neuron

The concept of MCP Neuron seems so simple to represent artificial intelligence of any kind, yet it is - and it isn't at the same time. Formal Logic is a fundamental component, and will always remain one of the important ones for the Computational Intelligence, and for a machine to have some sort of intelligence it should be able to comprehend logic gates at minimum. The idea being that logic gates can be stringed together to form logic cicruits, capable of executing any kind of instructions.

What makes the MCP Neuron different is that it was able to achieve this through an approach which was inspired by biological neurons, and this was a promising starting point, and there was a LOT and LOT of excitement because of this aspect.

On the other hand if we take a pause and think it through, the main challenge with this concept of Artificial Neuron alone is that every logic gate that can be modelled (and hence every logic cirucit which is a collection of neurons could model) had to be pre-programmed i.e., we need to somehow identify and lock the values of weights and threshold for the neuron to activate certain way based on various input combinations.

This stands out as a massive contrast to how every brain works, which learns from experience. It was not after about a decade or so when various Learning Algorithms were devised, combining these algoritms with the MCP Neurons, for the very first time we were able to learn (even through rudimentary stuffs) from the training data, without explictly programming.

In the next section of this blog we will discuss some of these very early learning algorithms, gain an intuitive understanding of how this algorithm works (logically speaking), before we dive deep into the nuts and bolts of the deep learning of today.

And yes, we will also see the limitation which was uncovered in the approach of MCP Neuron + Learning Algorithm, which dried up all the funding and let to the era which is referred to as AI WINTER

On a side note, do you reckon based on all the hype around LLM capabilities and the fact that OpenAI going to incur billion of dollar of losses, are we heading into another AI Winter. If it happens it will be sad, because I do believe there is lot of potential in computation intelligence approaches and LLM/GenAI is just a tiny portion of that.

1.1.4 Limitations of MCP Neuron

Now we have already see that MCP Neuron experiments were focused on constructing Logical Gates using the weighted sum approach which resembles (atleast at a very high level) how the neurological brain works.

The following figure summarize the workings of MCP Neuron:

This was actually a simplest of the binary classification, but there are various limitations in the concept MCP Neuron, and they are as follows:

it process boolean inputs only.
it gives equal weights the each inputs
the threshold value theta (or t) must be chosen manually
the learning aspect was missing, i.e. it cannot learn from data, already highlighted by the fact that weights and thresholds have to be chosen beforehand manually

For all of these reasons, various Learning Algorithms were proposed in the coming years, which helped push the area of Neural Networks (or the connectionist approach of AI)

1.2 Learning Algorithms for Aritifical Neurons

It was almost after a decade with the invent of the famour learning algorithms Perceptron, that the neural network were able to learn based on the data. But it was not just perceptron, which made Neurons mainstream, there are other notable discoveries as well and are listed below:

Hebbian Learning Algorithm, from Donald Hebb in 1949
Perceptron Learning Algorithm, from Frank Rossenblatt in 1960, and
Adaline Learning Algorithm, from Bernie Widrow in 1962

1.2.1 Rosenblatt's Perceptron Algorithm (1957)

The Rosenblatt's Perceptron Algorithm was designed to overcome most of the issues of McCulloch-Pitt's neuron:

it can process non-boolean inputs
it included the learning capabilities, where based on the data it can assign the weights of each input individually,
it can compute the threshold automatically.

Let's consider the setup of Neuron for Perceptron Algorithm now, where we will adjust the representation slightly my moving the threshold to the left, and including it in the calculation of weighted aggregation. And with the the equation can be adjusted and made simpler as is highlighted in the diagram below:

As we have moved the arbitrary threshold value (theta) to left and included it in the calculation of weighted sum it becomes learnable (more on the actual algorithm and how the learning happen shortly). Here in this setup w₀ which is equal to -theta, is called bias, and the weight assigned to this is x₀ and is equal to 1.

Perceptron Algorithm was proposed as a Binary Classification algorithm (i.e. predicting values 0 and 1), so we will now step by step build the algorithm, building our intution algebraically as well as geometrically (obviously for a better understanding)

Let's first have a look at a linearly separable (dummy) dataset with two inputs X1 and X2 just for simplicity, and having two class labels 0 and 1.

Let's say the following is our training dataset, based on which we want to train our Artificial Neuron; by the visual inspection of it, we can quickly draw few Linear Classifiers through the dataset pretty quickly (some Binary Linear Classifiers are depicted through various lines in the figure on the right-hand side

CONCEPT 1: Algebric Representation of Line / Parameterizing the Linear Classifier Line:

Now for us to mathematically find the 'a' linear classifier that can divide this dataset into two, the first step is to paramaterize the line, that our Preceptron Algorithm will learn

So as we can see based on the diagram above, the first step of calculation that our Artificial Neuron does represent a line. So by essentially having weight Ws as the parameter associated with each input and threshold value, we have parameterized the line that will represent the Linear Classifier.

CONCEPT 2: Vector Representation of Line and Geometric/Visual Interpretation:

In this dummy dataset we have only considered 2 inputs (features) in set, but in practice we seldom have only 2 features to consider, and as such we will using the Vectors/Matrices to perform the linear transformations, i.e. taking the weighed sum of the inputs features, and it is prudent to understand visually (geomterically) what happens when we do Matrix Multiplcation and Additions/Subtractions etc.

This visual intution of the operations will be very useful in understanding why the Perceptron Algorithm, and for that matter many other algorithms which we will learn down the line works

So in summary:

Algorithmically, the weighted sum of the inputs and weights, can be performed as Matrix Multiplication,
And the Vector corresponding to Weights Ws, actually represent a vector orthogonal to the Linear Classifier
The space on one side (above the line) of the Linear Classifier will represent +ve space i.e. Class/Label 1, and the space on the other side of the line (below the line) i.e. Class/Label 0

With this mathemtical and vector concepts out of the way, intuitively understanding the Perceptron Algorithm and why it works the way it do will be pretty easy

1.2.1.1 Step1: Initialize the Weights and Bias Unit.

In the original Perceptron Algorithm, weights and bias were initialized to 0. In practice though, in actual implementation we can also initialized these to some random number

Let's also consider a simpler dataset, to build our intuition, as per diagram below:

1.2.1.2 Step 2: Iterate through all the Training Example, Perform Calculation

In this step we will now go through each of the training example one by one and perform following operation:

Compute Aggregated Sum $z = w_1.x_1 + w_2.x_2$
Compute Output Value y based on the following threshold logic:
$y = \begin{cases} 1 &\text{if} z \geq 0\\ 0 &\text{if} z < 0 \end{cases}$
Update the Weight and Bias Unit simultaneously:
$w_j = w_j + (y^{(i)} - \^y^{(i)}).x_j^{(i)}$ $b = b + (y^{(i)} - \^y^{(i)})$

We will pick two training example to understand - one of Class 0 and other of Class 1, and just for demonstration purposes we will consider a simpler dataset.

Let's start with Class 0 sample first, and the results of performing the algorithm operations are explained below:

Let's take look at another example but Class 1 this time. The sample datapoint and the result of operations are explained in the figure below:

1.2.1.3 Step 3: Outer Loop until Convergence or Fixed Number of Times

Now as Perceptron moves through the individual data points, it is very much possible that the adjustments made to correctly classify previous picked up points are reverted, and as a result some of the points become incorrectly classified. As such there is a need to create a surrounding loop on the operations we have done in Step 2, and continue that loop until the algorithm converges, or in practice run that for certain number of times (50/100).

1.2.2 Linear Separation and Convergence of the Perceptron Algorithm

As highlighted in the previous section, we need to run the Perceptron Algorithm until it converges. Now, there may be doubt in your minds, and rightly so; what if it never converges and continues to be in the loop? There is theoretical proof that the algorithm converges, and attached here is the one from Prof. Michael Collins of Columbia University: Convergence Proof Here

It's also pretty clear today that Perceptron only works when there is a clear Linear Separation in the data. If we have a dataset that cannot be linearly separated, the Perceptron Algorithm will fail. In the 1960s Marvin Minsky and Seymour Papert highlighted the limitation that Perceptron could not perform similar XOR operations, one of the basic logic gates. As a result, all the excitement around Neural Network crashed, and we went into the AI Winter. Various other factors were causing this AI Winter, but the algorithm's limitations were one of the important ones. XOR Issue, Perceptron Controversy References

1.2.3 General Architecture of Learning Algorithm

In our next blog series, we will implement the Perceptron Algorithm from scratch, but before we do that, I would like to bring your attention to a mental model of how Learning Algorithms work. This mental model will apply not only to the Perceptron Algorithm but also to various other algorithms that we will cover.

In our previous model, we have considered our Artificial Neuron as taking in all the inputs, doing NET INPUT, also referred to as WEIGHTED SUM calculation of all the inputs, and then based on the threshold approach providing the output of 0 and 1. We will break this into individual steps and also generalize these operations as described below:

The First Operation we will refer to as NET-INPUT function (these will map to the Linear Layer of the Deep Learning Models)
The Second Operation, we will refer to as ACTIVATION FUNCTION(these will map to ACTIVATION FUNCTIONS of the Deep Learning Models)

So, our algorithm can be graphically depicted as follows:

With this model of Neurons and Algorithms, different algorithms can be implemented by having different Activation Functions and Prediction Functions.

For the Perceptron Algorithm, our Activation Function is a Step Function that squashes the net input into 0 and 1, and the prediction function equals the output of the Activation Function.

1.2.3 Conclusion

In this article we briefly explored the Biological Neurons, and how this inspired the postulate of Artificial Neuron way back in 1940s. The first concept of Artificial Neuron was published by Warren McCulloch and Warren Pitts, and is thus referred to as McCuolloch Pitts Neuron or in short as MCP Neuron. We then showed how using MCP Neuron, various logical gates can be represented.

We discussed the need for a Learning Algorithm so that instead of manually deciding the values of weights w and bias b, the algorithm should be able to derive these based on the training data. We then discussed that the first such algorithm was Perceptron Algorithmand developed an algebraic and geometric intuition of how the algorithm works. We then also highlighted a crucial limitation of the Perceptron Algorithm, in that it only works in the case of Linearly Separable classification data. We also discussed that when it was identified that Perceptron is not able to simulate XOR, majority of the excitement vanished and was one of the reasons.

Just to conclude, this limitation of Perceptron was later resolved by extending the network and using what is now known as Multi Layer Perceptron, which is a concept of interconnected layers of networks that is very much prevalent in today's DEEP LEARNING MODELS as well.

To continue this journey and to warm up the coding muscle, in our next blog, we will develop the PERCEPTRON ALGORITHM and ADALINE ALGORITHM from scratch. After which, we will jump to PyTorch, a deep learning library we will extensively use in the series of blogs to come.