## Archive for the ‘source code’ Category

## Particle Swarms

Particle swarms take several particles and move them around an area looking for a minimum, maximum, or some other condition. As each particle tests the location it lands at, the best score for the group and each particle's personal best are updated. Each particle's velocity is then adjusted to head more in the direction of its own personal best and in the direction of the group best. The velocity updates also contain a random component to keep the particles from swarming to the first good location they find and ending up at a local minimum or maximum.

The algorithm is very simple:

    Initialize particles
    Loop:
        For each particle:
            move
            test location to see if it is better than its previous best
            update velocity vector
        End for each particle
        Test whether any particle has beaten the group best so far and update if so
    End Loop

The velocity vector is calculated:

    Velocity += ConstantForGroup * randomNumberBetweenZeroAndOne * (GroupBestLocation - CurrentLocation) + ConstantForParticle * randomNumberBetweenZeroAndOne * (PersonalBestLocation - CurrentLocation)

The time step is decreased each cycle by multiplying dt by 0.99. Otherwise the particles keep zooming around and fail to home in except by brute force.
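The loop above can be sketched in a few lines of Python. The 2-D test function, search bounds, and constants here are illustrative stand-ins, not values from the original example:

```python
import random

# A minimal particle-swarm sketch: six particles, a hundred loops,
# and a shrinking time step, as described above.
random.seed(1)

def pso(score, n_particles=6, n_iters=100, c_group=1.4, c_personal=1.4):
    dim = 2
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # each particle's personal best
    gbest = max(pbest, key=score)[:]         # the group best
    dt = 1.0
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] += (c_group * random.random() * (gbest[d] - pos[i][d])
                              + c_personal * random.random() * (pbest[i][d] - pos[i][d]))
                pos[i][d] += vel[i][d] * dt  # move
            if score(pos[i]) > score(pbest[i]):
                pbest[i] = pos[i][:]         # new personal best
            if score(pos[i]) > score(gbest):
                gbest = pos[i][:]            # new group best
        dt *= 0.99                           # shrink the time step each cycle
    return gbest

# maximize f(x, y) = -(x - 1)^2 - (y + 2)^2, whose peak is at (1, -2)
best = pso(lambda p: -(p[0] - 1) ** 2 - (p[1] + 2) ** 2)
```

With the shrinking time step the swarm settles near the peak on most runs; the random factors keep the particles spread out early on.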

The example takes six particles through a hundred loops and attempts to find the location of the maximum. The code is heavily commented and should be easy to understand and adjust.

See also:

Particle Swarm Optimization

Swarm Intelligence, Focus on Ant and Particle Swarm Optimization

Ant Algorithms

Particle Swarm Optimization ( PSO ) in Python (source code available)

Swarm Behavior – National Geographic

The swarm is reporting for duty

Swarming birds, plasma, crowds and stock markets

Swarm intelligence reaches a new level

Are swarms chaotic

Swarm approach to photography

## Simulated Annealing

This AI process mimics the physical process of annealing. Annealing is the process of heating and then cooling a metal in a controlled manner; it yields a stronger metal if the cooling is done slowly. A higher temperature means more energy. So if I am trying to settle a marble into the global minimum of a hilly landscape, I may have to shake things up a great deal to bounce the marble out of a local minimum. I start out with lots of energy, hoping to locate the neighborhood of the global minimum, then reduce the energy to home in on it.

Start with a random solution, or your best solution from a previous run. This is our current solution. The energy of the solution is then measured. For the eight queens problem the energy is the number of conflicts between our queens.

We then take a copy of the solution, bounce it around a bit, and see if it has less energy than the previous solution. If so, we use it as our working solution. If not, we test it using the acceptance probability P(dE) = exp(-dE/T). At higher temperatures we accept worse solutions more often than we do when we are close to finding the correct solution.

Simulated annealing is used in traveling-salesman and scheduling problems, and it has also been used for some image clean-up work.

Algorithm:

    Set up initial board
    Copy initial solution to current, working and best solutions
    Loop:
        Bounce the queens around the board
        See if this solution is better than the last one (uses less energy / has fewer conflicts)
            if so, use the new solution
            if not, test again using P(dE) = exp(-dE/T)
                if P(dE) is greater than a random number between 0 and 1, use this solution
        Switch copy boards as needed
        Adjust temperature
        Bail from loop if problem solved (energy < some number)
    End Loop

Java source code 8 queens, Simulated Annealing
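The same loop can be sketched in Python. The starting temperature, cooling rate, and step count here are illustrative choices, not values taken from the Java source:

```python
import math, random

# Simulated-annealing sketch for eight queens.  board[col] = row of the
# queen in that column; energy = number of attacking pairs of queens.
random.seed(2)

def energy(board):
    n = len(board)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if board[i] == board[j] or abs(board[i] - board[j]) == j - i)

def anneal(n=8, t=30.0, cooling=0.98, max_steps=20000):
    current = [random.randrange(n) for _ in range(n)]
    best = current[:]
    for _ in range(max_steps):
        if energy(best) == 0:                    # bail: problem solved
            break
        candidate = current[:]
        candidate[random.randrange(n)] = random.randrange(n)  # bounce a queen
        d_e = energy(candidate) - energy(current)
        # keep improvements; keep worse boards with probability exp(-dE/T)
        if d_e <= 0 or random.random() < math.exp(-d_e / t):
            current = candidate
        if energy(current) < energy(best):
            best = current[:]
        t = max(t * cooling, 0.01)               # adjust temperature
    return best

solution = anneal()
```

At high temperature nearly any bounce is accepted; as t falls the walk becomes greedier and settles into a low-conflict board.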

Simulated annealing was developed by S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi in 1983, and by V. Černý in 1985. It is a Monte Carlo type of solution to a problem.

More information:

Simulated annealing for location area planning in cellular networks

Wiki – Simulated Annealing

Taygeta Scientific – Simulated Annealing Information

## Game Theory

Game theory is the study of how decision makers act and react in various situations, like negotiating business deals. It is used quite a bit in the study of economics and politics. This is the field that recently gained some fame with the ‘A Beautiful Mind’ book and movie about John Nash. Much of game theory is built on the Nash equilibrium, and John von Neumann laid much of the groundwork for the field. Using simplified models, often based on only a few rules, many behaviors of people in various situations can be predicted.

In these game models it is assumed that the players make rational choices at each decision. A rational choice is defined as: ‘the action chosen by a decision maker is at least as good as every other available action’. A player is presented with a set of actions from which she chooses one. She may prefer one action over another or may consider some actions equally preferable. The only restriction on the preferences is that they are consistent: if a player prefers action A over action B, and prefers action B over action C, then she must also prefer action A over action C.

A payoff function is used to describe the benefit of each action from the player’s point of view. For example, I may visit a used car lot and find a car that is worth 1,000 by my valuation, while you may value that car at 500. So the payoff for that car is 1,000 for me and 500 for you. If I see another car I like that I value at 250, it only means I prefer the first car to the second car. It does not mean that I prefer it four times as much. The values only show the ordering of choices.

The Nash equilibrium is the point where neither player can do better by changing his own move, no matter what the other player decides to do.

In the well-known game of ‘Prisoner’s Dilemma’ we have two crooks who worked together on a crime, have each been caught, and are being held separately from each other. Each has a choice of finking or remaining quiet. If one finks and the other stays quiet, the finker walks with 1 year of jail time while the other person gets 4 years. If both fink, each gets 3 years. If neither finks, there is still enough evidence to put each away for 2 years.

| Crook 1 \ Crook 2 | quiet | fink |
| --- | --- | --- |
| quiet | 2, 2 | 0, 3 |
| fink | 3, 0 | 1, 1 |

The best joint outcome is for both crooks to remain quiet, each receiving 2 years of jail time. But if crook 1 finks, he gets a score of 3 (1 year) against an opponent who remains quiet, or a score of 1 (3 years) against an opponent who also finks. So his best bet is to fink, and the same holds for the other crook. The Nash equilibrium is at fink/fink (1,1), since finking is the best move for each player individually.

The payoff function for this game is the same for each player: f(Fink, Quiet) > f(Quiet, Quiet) > f(Fink, Fink) > f(Quiet, Fink). So we could as easily score it 3, 2, 1, 0 rather than counting out years, or 32, 21, 9, 1; the scores only serve to order the choices. A clearer example is a game in which we have 2 players moving pieces on a 3-D game board. Each player can move in the x, y, or z direction.

| | X1 | Y1 | Z1 |
| --- | --- | --- | --- |
| X2 | 2*, 1* | 0, 1* | 0, 0 |
| Y2 | 0, 0 | 2*, 1 | 1*, 2* |
| Z2 | 1, 2* | 1, 0 | 0, 1 |

The * marks the best move for each player given what the other player does. If player 1 moves in the Y direction, player 2’s best move is also in the Y direction (2*,1). The squares where both players have *s are X1/X2 and Z1/Y2. These are both Nash equilibria. Finding the Nash equilibrium this way is called ‘best response’.
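Best-response search over a payoff table is easy to automate. A minimal sketch, assuming each cell holds (row player’s payoff, column player’s payoff) with rows and columns ordered as in the table above:

```python
# Best-response search for pure Nash equilibria in a two-player game.
def pure_nash(payoffs):
    rows, cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for r in range(rows):
        for c in range(cols):
            p_row, p_col = payoffs[r][c]
            # starred for the row player: no other row does better in this column
            best_row = all(p_row >= payoffs[rr][c][0] for rr in range(rows))
            # starred for the column player: no other column does better in this row
            best_col = all(p_col >= payoffs[r][cc][1] for cc in range(cols))
            if best_row and best_col:            # starred for both players
                equilibria.append((r, c))
    return equilibria

# rows X2, Y2, Z2; columns X1, Y1, Z1
board = [[(2, 1), (0, 1), (0, 0)],
         [(0, 0), (2, 1), (1, 2)],
         [(1, 2), (1, 0), (0, 1)]]
print(pure_nash(board))   # [(0, 0), (1, 2)] -> the X2/X1 and Y2/Z1 squares
```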

Suppose instead of set numbers I have a function that describes the payoff for each player, say A(x) = y^2 and B(y) = (1/2)x + xy. To find the Nash equilibrium I take the derivative of each function, set it to zero, solve, and plot. Any and all places where the resulting curves cross on the plot are Nash equilibria.

It is not possible, even for most simple games, to search all the possible routes the game can take; too much time and hardware would be needed. For example, tic-tac-toe is one of the simplest games we all know, yet a game tree that mapped every possible move sequence from start to finish would hold 9!, or 362,880, sequences. The first player has 9 choices of which box to play in, the second player 8 choices since the first player has taken one, the first player’s second move has 7 choices, and so on. Each level in the tree represents a turn in the game: the top level has 9 nodes, the second level has 8 nodes under each of the original 9, and so forth. So you can imagine what chess or other more complicated games have as the number of possible moves. All games and all competitions can be represented by trees.

Pruning is used to take sections off of the search tree that make no difference to play. Heuristic (rule-of-thumb) evaluations allow approximations to save search time. For instance, in the tic-tac-toe tree described above, once the first player chooses a position the other 8 top-layer nodes can be trimmed off, and only the 8 subtrees under the chosen node need to be searched.

Since it is not usually practical to calculate each possible outcome, a cut-off is usually put in place. As an example, look at a chess game. For each board position in play we can estimate the advantage by adding up the point values of the pieces on the board, plus points for position, and see which line gives the program a higher score. The program then need only calculate five or so moves ahead, evaluate the advantage at each node, and choose the best path. Rather than calculating ahead a set number of moves, the program can use an iterative-deepening approach and calculate until time runs out. A quiescence search refines the cut-off: positions that are likely to cause wild swings in the score are searched further, so the evaluation is only applied to quiet positions. The horizon problem occurs when searches cannot look ahead to the end of the game; it remains an unsolved problem in game programming.

The min-max algorithm assumes a ‘zero-sum game’, such as tic-tac-toe, where what is good for one player is bad for the other player. The algorithm assumes that both players play perfectly, each attempting to maximize his own score, and it only generates trees for the nodes that are likely to be played. Max is the computer, min is the opposing player. It is assumed max gets the first turn.

-generate the game tree down to the maximum level to be checked, and give each terminal state a value: high values are most beneficial to max, negative values are most beneficial to min, zero holds no advantage for either player

-go up one level, giving each node the best score from the layer below it (the highest for a max layer, the lowest for a min layer)

-continue up the tree one level at a time until the top is reached

-pick the move leading to the node with the highest score.
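The steps above can be sketched over an abstract game tree; nested lists stand in for generated positions, and the leaf values are made up for illustration:

```python
# Minimal min-max over a game tree expressed as nested lists; integer
# leaves are terminal state values from max's point of view.
def minimax(node, maximizing=True):
    if isinstance(node, int):          # terminal state value
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# max picks the branch whose min-level reply leaves it best off
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree))   # 3: the branch minimums are 3, 2, 2, and max picks 3
```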

The alpha-beta method cuts down the work the min-max algorithm does: it produces the same answer while pruning branches that can never affect the choice at the top. The program proceeds through the nodes of the tree with the min-max algorithm as usual. First we go down one branch of the tree and calculate the score for that node, then we proceed down the next branch. As soon as a leaf shows that this branch cannot beat the score obtained from a previous branch, we don’t finish evaluating the rest of the branch; we move on to the next one. The search can then go deeper in the same time. Further gains in speed can be made by caching branch results in a lookup table, re-ordering moves, extending some searches and shortening others, using probabilities rather than exact values for the cut-offs, or using parallel algorithms.
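Here is the same abstract nested-list tree searched with alpha-beta cut-offs; a branch is abandoned as soon as it can no longer affect the decision at the top:

```python
# Min-max with alpha-beta pruning over a nested-list game tree.
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if isinstance(node, int):          # terminal state value
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:          # min will never allow this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:              # max already has something better
            break
    return value

tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree))   # 3, the same answer as plain min-max with fewer visits
```

In the second branch, for example, the leaf 2 already proves the branch cannot beat the 3 found in the first branch, so the leaves 4 and 6 are never evaluated.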

Ordering may be used to save time as well. In chess, captures would be considered first, followed by forward moves, followed by backward moves. Or the ordering can consider the nodes with the highest values first.

The program must try to find a winning strategy that does not depend on the human opponent’s moves. Humans often form small goals and consider moves that work toward them, i.e. capture the queen; David Wilkins’ PARADISE is the only program so far to do this successfully. Another approach is to use book learning: several boards are loaded into a table in memory, and if the same board comes into play the computer can look up what to do from there. Monte Carlo simulation has been used successfully in games with non-deterministic information, such as Scrabble, dice, and card games.

Temporal-difference learning is derived from Samuel’s machine learning research. Several games are played out and kept in a database. This works well with board games like backgammon, chess and checkers. Neural nets can be trained to play games this way, TD-Gammon being one of the more famous ones.

More information:

Game Theory

Martin J Osborne

Gam3r 7h30ry

Distinguishing the opponents in the prisoner dilemma in well-mixed populations

Towards Network Games with Social Preferences

An introduction to quantum game theory

Games on social networks: On a problem posed by Goyal

Backward Induction

Are mathematics the new artificial intelligence?


## Hopfield Networks

John Hopfield brought us these networks in the early 1980s. They can be generalized, are robust, and can be described mathematically. On the downside, they can only store about 15% as many states as they have neurodes, and the patterns stored must have Hamming distances that are about 50% of the number of neurodes.

Hopfield networks, aka crossbar systems, are networks that recall what is fed into them. This makes them useful for restoring degraded images. The net is fully connected: every node is connected to every other node, but the nodes are not connected to themselves.

Calculating the weight matrix for a Hopfield network is easy. Below is an example with 3 input vectors. You can train the network to match any number of vectors provided that they are orthogonal.

Orthogonal vectors are vectors that give zero when you calculate the dot product.

orthogonal (0, 0, 0, 1) (1, 1, 1, 0) => 0*1 + 0*1 + 0*1 + 1*0 = 0

orthogonal (1, 0, 1, 0) (0, 1, 0, 1) => 1*0 + 0*1 + 1*0 + 0*1 = 0

NOT orthogonal (0, 0, 0, 1) (0, 1, 0, 1) = 0*0 + 0*1 + 0*0 + 1*1 = 1

Orthogonal vectors are perpendicular to each other.

To calculate the weight matrix for the orthogonal vectors (0, 1, 0, 0), (1, 0, 1, 0), (0, 0, 0, 1)

first make sure all the vectors are orthogonal

(0, 1, 0, 0) (1, 0, 1, 0) => 0*1 + 1*0 + 0*1 + 0*0 = 0

(0, 1, 0, 0) (0, 0, 0, 1) => 0*0 + 1*0 + 0*0 + 0*1 = 0

(1, 0, 1, 0) (0, 0, 0, 1) => 1*0 + 1*0 + 1*0 + 0*1 = 0
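The pairwise checks above can be automated with a short dot-product helper:

```python
# Dot-product check that the stored patterns are pairwise orthogonal;
# the three vectors are the ones used in the worked example above.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vectors = [(0, 1, 0, 0), (1, 0, 1, 0), (0, 0, 0, 1)]
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        assert dot(vectors[i], vectors[j]) == 0   # every pair is orthogonal
```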

Change the zeros to negative ones in each vector

(0, 1, 0, 0) => (-1, 1, -1, -1)

(1, 0, 1, 0) => (1, -1, 1, -1)

(0, 0, 0, 1) => (-1, -1, -1, 1)

Multiply each vector, as a column, by itself as a row (an outer product):

    (-1)                    [ 1,-1, 1, 1]
    ( 1)  x (-1, 1,-1,-1) = [-1, 1,-1,-1]
    (-1)                    [ 1,-1, 1, 1]
    (-1)                    [ 1,-1, 1, 1]

    ( 1)                    [ 1,-1, 1,-1]
    (-1)  x ( 1,-1, 1,-1) = [-1, 1,-1, 1]
    ( 1)                    [ 1,-1, 1,-1]
    (-1)                    [-1, 1,-1, 1]

    (-1)                    [ 1, 1, 1,-1]
    (-1)  x (-1,-1,-1, 1) = [ 1, 1, 1,-1]
    (-1)                    [ 1, 1, 1,-1]
    ( 1)                    [-1,-1,-1, 1]

The third step is to put zeros on the main diagonal of each matrix and add the matrices together. (Putting zeros on the main diagonal keeps each node from being connected to itself.)

    [ 0,-1, 1, 1]   [ 0,-1, 1,-1]   [ 0, 1, 1,-1]
    [-1, 0,-1,-1] + [-1, 0,-1, 1] + [ 1, 0, 1,-1]
    [ 1,-1, 0, 1]   [ 1,-1, 0,-1]   [ 1, 1, 0,-1]
    [ 1,-1, 1, 0]   [-1, 1,-1, 0]   [-1,-1,-1, 0]

The resulting matrix is:

    [ 0,-1, 3,-1]
    [-1, 0,-1,-1]
    [ 3,-1, 0,-1]
    [-1,-1,-1, 0]
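The three steps (convert the 0s to -1s, form each outer product, zero the main diagonal, and sum) can be checked with a short function:

```python
# Build a Hopfield weight matrix from binary patterns by summing the
# outer products of their bipolar forms, with a zeroed main diagonal.
def hopfield_weights(patterns):
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        bipolar = [1 if x == 1 else -1 for x in p]   # change 0s to -1s
        for i in range(n):
            for j in range(n):
                if i != j:                           # zero main diagonal
                    w[i][j] += bipolar[i] * bipolar[j]
    return w

w = hopfield_weights([(0, 1, 0, 0), (1, 0, 1, 0), (0, 0, 0, 1)])
for row in w:
    print(row)
# [0, -1, 3, -1]
# [-1, 0, -1, -1]
# [3, -1, 0, -1]
# [-1, -1, -1, 0]
```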

The Hopfield network is fully connected; each node is connected to every other node, and the weight between node i and node j is entry (i, j) of the matrix:

[n1] -> [n2] => weight is -1

[n1] -> [n3] => weight is 3

[n1] -> [n4] => weight is -1

[n2] -> [n1] => weight is -1

[n2] -> [n3] => weight is -1

[n2] -> [n4] => weight is -1

[n3] -> [n1] => weight is 3

[n3] -> [n2] => weight is -1

[n3] -> [n4] => weight is -1

[n4] -> [n1] => weight is -1

[n4] -> [n2] => weight is -1

[n4] -> [n3] => weight is -1

They can also be described as having a potential-energy surface with conical holes representing the data. Holes of similar depth and diameter represent data with similar properties. The input data seeks the lowest potential energy and settles into the closest hole. The energy surfaces of these networks are mathematically equivalent to those of ‘spin glasses’.

Some problems with these neural nets: they are computationally intensive, they use lots of memory, and, although I haven’t seen it mentioned, I would guess race conditions may present a problem, since data is updated continuously at each node with the output of one node becoming the input of another.

BAM, bidirectional associative memory, is an example of a Hopfield network. It consists of two fully connected layers, one for input and one for output; the nodes in each layer have no connections to other nodes in the same layer. The weights are bidirectional, meaning that there is communication in both directions along each weight vector. BAM networks take only -1s and 1s as input and output only -1s and 1s. So if you are working with binary data, you must convert the zeros to -1s. The weights are calculated in the same way as the Hopfield example above.

Hopfield Network Example

## Backpropagation Networks

Forward-feed back-propagation networks (aka three-layer forward-feed networks) have been very successful. Some uses include teaching neural networks to play games, speak, and recognize things. Back-propagation can be used on several network architectures. The networks are all highly interconnected and use non-linear transfer functions. The network must have at minimum three layers, but rarely needs more than three.

Back-propagation supervised training for forward-feed neural nets uses pairs of input and output patterns. The weights on all the vectors are set to random values. Input is fed to the net and propagates to the output layer, where the errors are calculated; the error correction is then propagated back through the hidden layer to the input layer. There is one input neurode for each number (dimension) in the input vector and one output neurode for each dimension in the output vector, so the network maps IN-dimensional space to OUT-dimensional space. There is no set rule for determining the number of hidden layers or the number of neurodes in a hidden layer. However, if too few hidden neurodes are chosen the network cannot learn, and if too many are chosen the network memorizes the patterns rather than learning to extract the relevant information. A rule of thumb is to use log2(X) hidden neurodes, where X is the number of patterns. So if you have 8 distinct patterns to be learned, log2(8) = 3, and 3 hidden neurodes are probably needed. This is just a rule of thumb; experiment to see what works best for your situation.

The error is driven toward zero during training. It is calculated as Error = 1/2 * sum((desired - actual)^2). To get the error close to zero, within a tolerance, we iterate, moving a step downhill each iteration: we take the gradient (the derivative of the error with respect to the weights) and use steepest descent to minimize the error, so newWeight = oldWeight + stepSize * (-gradientW(E(W))).

The derivative of the sigmoid T(x) = 1/(1 + e^-x) is just T(x) * (1 - T(x)), so using the chain rule we arrive at an error-correction term of (desired - actual) * actual * (1 - actual) at each output node. This correction propagates back through the output-layer weights to the hidden-layer weights, and each weight is changed by its share of the error correction as it passes back through the network.
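The derivative identity is easy to verify numerically; a quick check with a central difference (the test point 0.7 is arbitrary):

```python
import math

# Numerical check that the sigmoid T(x) = 1/(1 + e^-x) really has
# derivative T(x) * (1 - T(x)).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
assert abs(numeric - analytic) < 1e-6
```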

To train the net all weights are randomly set to a value between -1.0 and 1.0

To do the calculations going forward through the net:

    Each InputNode value is multiplied by each weight connected to it
    Each HiddenNode sums its incoming values and adds a bias to the total
    This total is used in the sigmoid function as x: 1/(1 + e^-x)
    If this value is greater than the threshold, the HiddenNode fires this value, else it fires zero
    Each HiddenNode output is multiplied by each weight connected to it
    Each OutputNode sums its incoming values and adds a bias to the total
    This total is used in the sigmoid function as x: 1/(1 + e^-x)
    This is the value output by the OutputNode

To calculate the adjustments during training, you figure out the error and propagate it back like this:

    Adjust weights between HiddenNodes and OutputNodes:
        ErrorOut = (OutputNode) * (1 - OutputNode) * (DesiredOutput - OutputNode)
        ErrorHidden = (HiddenNode) * (1 - HiddenNode) * (Sum { ErrorOut * Weight + ErrorOut * Weight ... }) over the weights leaving this HiddenNode
        Adjustment = LearningConstant * ErrorOut * HiddenNode
        (LearningConstant is usually set to something around 0.2)
        Weight = Weight + Adjustment

    Adjust weights between InputNodes and HiddenNodes:
        Adjustment = LearningConstant * ErrorHidden * NodeInput
        Weight = Weight + Adjustment

    Adjust the biases:
        On each OutputNode, Bias = Bias + LearningConstant * ErrorOut
        On each HiddenNode, Bias = Bias + LearningConstant * ErrorHidden

Note that since ErrorOut is defined with (DesiredOutput - OutputNode), the adjustments are added to the weights; subtracting them would drive the error up rather than down.
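A bare-bones version of this training loop in Python. The 2-3-1 layout, the XOR pattern set, the 0.5 learning constant, and the epoch count are all illustrative choices, not from any particular source:

```python
import math, random

# A tiny forward-feed net trained with back-propagation: 2 inputs,
# 3 hidden neurodes, 1 output, sigmoid units with an additive bias.
random.seed(3)
LC = 0.5                                  # learning constant
n_in, n_hid = 2, 3
w_ih = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]
w_ho = [random.uniform(-1, 1) for _ in range(n_hid)]
b_h = [random.uniform(-1, 1) for _ in range(n_hid)]
b_o = random.uniform(-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x):
    hidden = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(n_in)) + b_h[j])
              for j in range(n_hid)]
    out = sigmoid(sum(hidden[j] * w_ho[j] for j in range(n_hid)) + b_o)
    return hidden, out

patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

def total_error():
    return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in patterns)

before = total_error()
for _ in range(5000):
    for x, target in patterns:
        hidden, out = forward(x)
        err_out = out * (1 - out) * (target - out)
        err_hid = [hidden[j] * (1 - hidden[j]) * err_out * w_ho[j]
                   for j in range(n_hid)]
        for j in range(n_hid):
            w_ho[j] += LC * err_out * hidden[j]      # hidden -> output weights
            b_h[j] += LC * err_hid[j]                # hidden biases
            for i in range(n_in):
                w_ih[i][j] += LC * err_hid[j] * x[i] # input -> hidden weights
        b_o += LC * err_out                          # output bias

assert total_error() < before    # the error shrinks toward zero during training
```

Note the hidden-layer errors are computed from the old hidden-to-output weights before those weights are adjusted, matching the backward pass described above.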

If you use a neural net that also handles complex numbers, you can adapt this transfer function so it is not always positive and calculate all four of the derivatives needed.

Numerous iterations are required for a back-propagation network to learn, so it is not practical for neural nets that must learn in real time. It will not always arrive at a correct set of weights: it may get trapped in a local minimum rather than the actual minimum, a known problem with the steepest-descent algorithm. A momentum term that allows the calculation to slide over small bumps is sometimes employed. Back-propagation networks also do not scale well; they are only good for small neural nets.

Winning Dog Track Predictor ( C/PERL)