Archive for the ‘topics in artificial intelligence’ Category
Forward Feed Back Propagation networks (aka Three Layer Forward Feed Networks) have been very successful. Some uses include teaching neural networks to play games, speak and recognize things. If a non-linear activation function is used (usually the sigmoid function) then one hidden layer is sufficient to solve almost any equation with reasonable accuracy.
Back-propagation supervised training for Forward-Feed neural nets uses pairs of input and output patterns.
The weights on all the vectors are set to random values.
Then input is fed to the net and propagates to the output layer
hidden layer = sigmoid( sum ( inputs * weights + bias ) )
output layer = sigmoid( sum ( hidden * weights + bias ) )
The errors are calculated.
Error = Output * ( 1 – Output ) * ( expected value – Output )
Then the error correction is propagated back through the hidden layer
adjust output weights
Wnew = Wold + ( Error * Output ) * learningRate
then to the input layer in the network.
Error = Hidden * ( 1 – Hidden ) * ( OutputError * Weight + OutputError * Weight….. )
Wnew = Wold + ( Error * Output ) * learningRate
There is one input neurode for each number (dimension) in the input vector, there is one output neurode for each dimension in the output vector. So the network maps in-dimensional space to out-dimension space.
The weight adjustment is gradient descent. You are calculating the derivative and finding the quickest route to zero error.
The learningRate is a small number ~0.1 to keep the network from blowing up or over shooting a minimum. If the rate is too small the network will be slow to learn.
There is no set rule for determining the number of hidden layers or the number of neurodes in the hidden layer. However, if too few hidden neurodes are chosen then the network can not learn. If too many are chosen, then the network memorizes the patterns rather than learning to extract relevant information. A rule of thumb for choosing the number of hidden neurodes is to choose log ( 2)X where X is the number of patterns. So if you have 8 distinct patterns to be learned, then log ( 2)8 = 3 and 3 hidden neurodes are probably needed. This is just a rule of thumb, experiment to see what works best for your situation.
The error vector is aimed at zero during training.
Error = ( 1/2 * (sum (desired-actual)^2))
To get the error close to zero, with in a tolerance, we use iteration. Each iteration we move a step downward. We take the gradient, the derivative of a vector, and use the steepest descent to minimize the error. So thenewweight = oldW eight + stepsize (-gradientW (e(W )).
The error may be backpropagated after each training example, or more commonly after a batch of examples.
Perceptrons separate data linearly, if the data is linearly separable.
y = mx + b
m => the weights which give the slope of the line
b => the bias, the place where the line crosses the y axis
x => input data
y => output
Rosenblatt added the learning law to the McCulloch-Pitts neurode to make it Perception, which is the first of the neural net learning models. The perception has one layer of inputs and one layer of outputs, and one group of weights. If data points on a plot are linearly separable (we can draw a straight line separating points that belong in different categories), then we can use this learning method to teach the neural net to properly separate the data points.
The McCulloch-Pitts neurode fires a +1 if the neurode’s total input the sum of each input * its weight is greater than the set threshold. If it is less than the set threshold, or if there is any inhibitory input a 0 is fired. Any logical function can be created using only AND, OR and NOT gates so a neural net can be created with McCulloch-Pitts neurodes to solve any logical function.
We start with a weight vector that is set to zeros or random numbers. During training weights are increased or decreased by adding or subtracting the input to the weights. This is done until all data points are input and the neurode gives the correct output for each point.
The perception fell out of favor since it can only handle linearly separable functions which means simple functions like XOR, or parity can not be computed by them. Minsky and Papert published a book ‘Perceptions’, in the 1980’s, that proved that one and two layer neural nets could not handle many real world problems and research fell off for about twenty years in neural nets.
An additional layer and set of weights can enable the Perception to handle functions that are not linear. A separate layer is needed for each vertex needed to separate the function. A 1950’s paper by A.N. Kilmogorov published a proof that a three layer neural network could perform any mapping exactly between any two sets of numbers.
Multi layered perceptrons were developed than can handle XOR functions. Hidden layers are added and they are trained using backpropagation or a similar training algorithm. Using one layer linearly separable problems can be solved. Using two layers regions can be sorted and with three layers enclosed regions can be sorted.
GitHub source code for a Swift Perceptron network
Synthetic psychology is a field where biological behavior is synthesized rather than analyzed. The father of such behavior Valentino Braitenberg ( home page ) did some interesting work with toy vehicles in the 1986 book “Vehicles: Experiments in Synthetic Psychology“.
The Braitenberg vehicles were simple toy cars with light sensors as headlights. In some cars the wire for each headlight went to the real wheel directly behind it, in some the wires went to the opposite back wheel. The headlight receptors were aimed slightly off to the outside. The more light received by a receptor the faster the wheel wired to that receptor would turn.
Each vehicle exhibited realistic behavior when placed in a plane with several randomly placed light sources. A vehicle wired straight back when placed near a light source will then rush towards the light veering off as it gets closer to the light. As it gets more distant from the light sources the vehicle slows down. The reason is the wheel receiving the most light spins fastest turning the car away from the light source.
The vehicle with the crossed wires will turn and drive towards the brightest light in its field of view. The closer it gets, the faster it goes eventually running into the light.
Pretty interesting behavior from a very simple machine. But what if we add in a neurode to each wire and instead of a plain wire we use a wire that inhibits signals? Neurodes are set to only fire if they receive signals over a certain threshold. In this case zero is to be our threshold. So now our cars send signals to the wheels if there is no light, and do not send a signal if there is a light. Now the vehicle with the wires straight back moves toward a light and slows as it approaches, facing the light. The second vehicle now avoids light sources, speeding off to the darkest corner it can find.
Using thresholds ( see logic version ) you begin to see group behavior, one set of vehicles herds and traps another set.
So what has this all to do with current artificial intelligence? Some of our best stuff right now came from earlier work that was done and stopped. Some of our best mathematical algorithms come from extremely early math. And to remind you ( and me ) that simple rules can create very complex behavior in game characters and artificial life forms.
Four neurode versions of these vehicles have been built and they will exhibit more complex, non-linear behavior. “The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” (Edsger Dijkstra) At what point does the behavior become realist enough to be considered a life form?
ObjC simulation of Braitenberg Vehicles( #2 fear and aggression )
ObjC simulation of Braitenberg Vehicles( #3 love, explore and complex )
ObjC simulation of Braitenberg Vehicles( #4 values and tastes )
ObjC simulation of Braitenberg Vehicles( #5 Logic )
Braitenberg Vehicles ( Java and Lisp simulators here )
If you are interested in network information theory you might want to check out this pdf of combined lecture notes from several graduate classes.
Network information theory deals with the fundamental limits on information flow in networks and optimal coding techniques and protocols that achieve these limits. It extends Shannon’s point-to-point information theory and the Ford–Fulkerson max-flow min-cut theorem to networks with multiple sources and destinations, broadcasting, interference, relaying, distributed compression and computing. Although a complete theory is yet to be developed, several beautiful results and techniques have been developed over the past forty years with potential applications in wireless communication, the Internet, and other networked systems.
This set of lecture notes, which is a much expanded version of lecture notes used in graduate courses over the past eight years at Stanford, UCSD, CUHK, UC Berkeley, and EPFL, aims to provide a broad coverage of key results, techniques, and open problems in network information theory. The lectures are organized in a “top-down” manner into four parts: background, single-hop networks, multi-hop networks, and extensions. The organization attempts to balance the introduction of new techniques and new models. Extensions (if any) to many users and large networks are discussed throughout. The lectures notes provide a unified, simplified, and formalized treatment of achievability using a few basic lemmas. The proofs in the lecture notes use elementary tools and techniques, which should make them accessible to graduate students in EE, CS, Statistics, and related fields as well as to researchers and practitioners in industry.
First improvement to max flow algorithm in 10 years announced by MIT
Sparse distributed memory first appeared in 1998 as a model of long term memory in humans. The main idea is that distances between concepts in our brains can be represented as distances between points in a high dimension world. Since distances between points are far apart in many dimensions, the distance between concepts is large. The large distance between concepts means I only have to come closer to it than another idea for a match to be made.
For example if I give each letter of the alphabet its own dimension then map it by position in a word then ‘aeple’ is closer to ‘apple’ than ‘ample’ and I can guess that the correct word for the mistyped word. Or a four legged creature that is tall and is spotted is closer to a giraffe than a short legged spotted leopard.
This type of storage of data means you can store far fewer examples you need to match allowing you to store more information in a much smaller memory footprint.
All input is represented in binary form in sparse distributed memory. The algorithm works by calculating the ‘Hamming distance’ between data input and existing examples in memory.
Kanerva’s Sparse Distributed Memory An associative memory algorithm well suited to the connection machine. (pdf) ( exellent starting point )
Kanerva’s Sparse Distributed Memory, Multiple Hamming Thresholds ( pdf )