# Machine learning

A project log for Humanoid robot named Murphy

Created to explore new ways of artificial intelligence

M. Bindhammer 04/13/2016 at 12:49

Here I will briefly discuss the concept behind the learning robot and its environment, based on the variable structure stochastic learning automaton (VSLA). This will enable the robot to learn in a way similar to a child.

First of all, the robot can choose from a finite number of actions (e.g. drive forwards, drive backwards, turn right, turn left). Initially, at time t = n = 1, the robot chooses one of the possible actions α at random with a given probability p. This action is then applied to the random environment in which the robot "lives", and the response β from the environment is observed by the robot's sensor(s).
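The random choice of an action can be sketched as follows. This is a minimal illustration, not the robot's actual firmware: the action names come from the example above, while `choose_action` and the uniform starting probabilities are assumptions made for the sketch.

```python
import random

# Example actions taken from the text; the list itself is illustrative.
ACTIONS = ["drive forwards", "drive backwards", "turn right", "turn left"]

def choose_action(probabilities, rng=random):
    """Return the index i of the chosen action, drawn with probability p_i."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Assumption: at t = n = 1 nothing is known yet, so all actions are equally likely.
p = [1.0 / len(ACTIONS)] * len(ACTIONS)

i = choose_action(p)
print("chosen action:", ACTIONS[i])
```

Any other starting distribution works the same way, as long as the probabilities sum to 1.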

The feedback β from the environment is binary, i.e. it is either favorable or unfavorable for the given task the robot should learn. We define β = 0 as a reward (favorable) and β = 1 as a penalty (unfavorable). Depending on the response, the probability $p_i$ of choosing that action $\alpha_i$ at the next time step t = n + 1 is updated according to the updating rule Τ: it is raised after a favorable response (β = 0) and lowered after an unfavorable one (β = 1).

After that, another action is chosen and the response of the environment observed. When a certain stopping criterion is reached, the algorithm stops and the robot has learned some characteristics of the random environment.

We define furthermore:

$\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_r\}$ is the finite set of r actions/outputs of the robot. The output (action) applied to the environment at time t = n is denoted by α(n).

$\beta = \{\beta_1, \beta_2\}$ is the binary set of inputs/responses from the environment. The input (response) applied to the robot at time t = n is denoted by β(n). In our case, the values for β are chosen to be 0 or 1. β = 0 represents a reward and β = 1 a penalty.

$P = \{p_1(n), p_2(n), \dots, p_r(n)\}$ is the finite set of probabilities that a certain action α(n) is chosen at time t = n, denoted by P(n).

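Τ is the updating function (rule) according to which the elements of the set P are updated at each time t = n. Hence

$$P(n+1) = T[\alpha(n), \beta(n), P(n)],$$

where the i-th element of the set P(n) is

$$p_i(n) = \Pr(\alpha(n) = \alpha_i)$$

with i = 1,2,...,r, and

$$\sum_{i=1}^{r} p_i(n) = 1 \quad \text{for all } n.$$

$c = \{c_1, c_2, \dots, c_r\}$ is the finite set of penalty probabilities, where $c_i$ is the probability that the action $\alpha_i$ results in a penalty input from the random environment. If the penalty probabilities are constant, the environment is called a stationary random environment.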
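A stationary random environment of this kind can be modelled in a few lines. The following is a sketch under assumptions: the class name `StationaryEnvironment`, the method `respond`, and the penalty probabilities are all made up for illustration; on the real robot the response would come from its sensors.

```python
import random

class StationaryEnvironment:
    """Random environment with constant penalty probabilities c_i."""

    def __init__(self, penalty_probs, seed=None):
        self.c = list(penalty_probs)      # c_i = Pr(beta = 1 | action i)
        self.rng = random.Random(seed)    # private generator for repeatable runs

    def respond(self, action_index):
        """Return beta: 1 (penalty) with probability c_i, otherwise 0 (reward)."""
        return 1 if self.rng.random() < self.c[action_index] else 0

# Illustrative penalty probabilities for four actions (assumed values).
env = StationaryEnvironment([0.2, 0.8, 0.5, 0.5], seed=0)

betas = [env.respond(0) for _ in range(1000)]
# The empirical penalty rate of action 0 should lie close to c_0 = 0.2.
print("penalty rate of action 0:", sum(betas) / len(betas))
```

Because the values in `c` never change, the environment is stationary in the sense defined above.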

The updating functions (reinforcement schemes) are categorized based on their linearity. The general linear scheme is given by:

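If α(n) = $\alpha_i$, then for β(n) = 0 (reward)

$$p_i(n+1) = p_i(n) + a\,[1 - p_i(n)], \qquad p_j(n+1) = (1-a)\,p_j(n) \quad \text{for all } j \neq i,$$

and for β(n) = 1 (penalty)

$$p_i(n+1) = (1-b)\,p_i(n), \qquad p_j(n+1) = \frac{b}{r-1} + (1-b)\,p_j(n) \quad \text{for all } j \neq i,$$

where a and b are the learning parameters with 0 < a, b < 1.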
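One update step of the general linear scheme can be sketched as below. It assumes the standard textbook form of the scheme: on a reward the chosen action gains a fraction a of the remaining probability mass, and on a penalty it loses a fraction b of its own mass, which is spread evenly over the other r - 1 actions. The function name `linear_update` is hypothetical.

```python
def linear_update(p, i, beta, a, b):
    """Return P(n+1) given P(n) = p, the chosen action index i and the response beta.

    a is the reward parameter, b the penalty parameter (0 < a, b < 1).
    """
    r = len(p)
    if beta == 0:
        # Reward: raise p_i, shrink all other probabilities by the factor (1 - a).
        return [pj + a * (1.0 - pj) if j == i else (1.0 - a) * pj
                for j, pj in enumerate(p)]
    # Penalty: shrink p_i, share the removed mass equally among the other actions.
    return [(1.0 - b) * pj if j == i else b / (r - 1) + (1.0 - b) * pj
            for j, pj in enumerate(p)]

start = [0.25, 0.25, 0.25, 0.25]
after_reward = linear_update(start, i=0, beta=0, a=0.1, b=0.1)
after_penalty = linear_update(start, i=0, beta=1, a=0.1, b=0.1)
print([round(x, 3) for x in after_reward])   # [0.325, 0.225, 0.225, 0.225]
print([round(x, 3) for x in after_penalty])  # [0.225, 0.258, 0.258, 0.258]
```

In both branches the new probabilities still sum to 1, so the result is again a valid probability vector.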

If a = b, the scheme is called the linear reward-penalty scheme, which is the earliest scheme considered in mathematical psychology.

For simplicity we consider the random environment to be stationary and use the linear reward-penalty scheme. In the long run, the probability $p_i$ of the action with the lowest penalty probability then becomes the largest on average, so the robot learns to prefer the optimal action asymptotically. Note that the limits of a probability $p_i$ for n → ∞ are exactly 0 or 1 only in the limiting case b = 0, where penalties leave the probabilities unchanged. In that case the automaton does not always converge to the correct action, but the probability that it converges to the wrong one can be made arbitrarily small by making the learning parameter a small.
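As a rough empirical check of this learning behaviour, here is a minimal simulation sketch of the automaton in a stationary environment with a = b. Everything specific in it is an assumption for illustration: the function name `simulate`, the two penalty probabilities, the seed, and the use of a fixed number of steps as the stopping criterion.

```python
import random

def simulate(c, a=0.01, steps=20000, seed=1):
    """Run a linear reward-penalty automaton (a = b) against penalty probabilities c."""
    rng = random.Random(seed)
    r = len(c)
    p = [1.0 / r] * r                      # assumed uniform start
    for _ in range(steps):                 # fixed step count as stopping criterion
        i = rng.choices(range(r), weights=p, k=1)[0]   # choose action alpha_i
        beta = 1 if rng.random() < c[i] else 0         # environment response
        if beta == 0:
            # Reward: move probability mass towards the chosen action.
            p = [pj + a * (1.0 - pj) if j == i else (1.0 - a) * pj
                 for j, pj in enumerate(p)]
        else:
            # Penalty: move probability mass away from the chosen action.
            p = [(1.0 - a) * pj if j == i else a / (r - 1) + (1.0 - a) * pj
                 for j, pj in enumerate(p)]
    return p

# Action 0 is penalised rarely, action 1 often (assumed values).
final_p = simulate([0.1, 0.6])
print([round(x, 3) for x in final_p])
```

In this setting the final probability vector favours the action with the lower penalty probability, while the exact values still fluctuate from run to run and with the choice of a.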