Here I will briefly discuss the concept behind the learning robot and its environment, based on the variable structure stochastic learning automaton (VSLA). This will enable the robot to learn much like a child does.

First of all, the robot can choose from a finite number of actions (e.g. *drive forwards*, *drive backwards*, *turn right*, *turn left*). Initially, at time *t* = *n* = 1, one of the possible actions *α* is chosen by the robot at random with a given probability *p*. This action is then applied to the random environment in which the robot "lives", and the response *β* from the environment is observed by the robot's sensor(s).

The feedback *β* from the environment is binary, i.e. it is either favorable or unfavorable for the given task the robot should learn. We define *β* = 0 as a reward (favorable) and *β* = 1 as a penalty (unfavorable). If the response from the environment is favorable (*β* = 0), then the probability *p**i* of choosing that action *α**i* for the next time step *t* = *n* + 1 is updated according to the updating rule *T*; an unfavorable response (*β* = 1) is handled analogously, as shown in the general scheme below.

After that, another action is chosen and the response of the environment is observed. When a certain stopping criterion is reached, the algorithm stops, and the robot has learned some characteristics of the random environment.
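To make the loop concrete, here is a minimal Python sketch of it. This is only an illustration of the idea, not the robot's actual code: the `environment` callable stands in for the sensor feedback, `update_rule` for the rule *T* defined below, and a fixed step count for the stopping criterion.

```python
import random

# Illustrative action set, matching the examples above
ACTIONS = ["forward", "backward", "turn_right", "turn_left"]

def run_automaton(environment, update_rule, n_steps=1000):
    """Minimal VSLA loop: choose an action, observe beta, update p(n).

    environment(action) is assumed to return 0 (reward) or 1 (penalty);
    update_rule(p, i, beta) returns the updated probability vector.
    """
    r = len(ACTIONS)
    p = [1.0 / r] * r                                # uniform initial p(1)
    for n in range(n_steps):
        i = random.choices(range(r), weights=p)[0]   # sample alpha(n) ~ p(n)
        beta = environment(ACTIONS[i])               # observe response beta(n)
        p = update_rule(p, i, beta)                  # apply T to get p(n+1)
    return p
```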

Furthermore, we define:

$$\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_r\}$$

is the finite set of *r* actions/outputs of the robot. The output (action) applied to the environment at time *t* = *n* is denoted by *α*(*n*).

$$\beta = \{0, 1\}$$

is the binary set of inputs/responses from the environment. The input (response) applied to the robot at time *t* = *n* is denoted by *β*(*n*). In our case, the values for *β* are chosen to be 0 or 1: *β* = 0 represents a reward and *β* = 1 a penalty.

$$P(n) = \{p_1(n), p_2(n), \ldots, p_r(n)\}$$

is the finite set of probabilities that a certain action *α*(*n*) is chosen at time *t* = *n*, denoted by *p*(*n*).

*T* is the updating function (rule) according to which the elements of the set *P* are updated at each time *t* = *n*. Hence

$$p(n+1) = T\big(p(n), \alpha(n), \beta(n)\big),$$

where the *i*-th element of the set *P*(*n*) is

$$p_i(n) = \Pr[\alpha(n) = \alpha_i]$$

with *i* = 1, 2, ..., *r*, and

$$\sum_{i=1}^{r} p_i(n) = 1.$$

$$C = \{c_1, c_2, \ldots, c_r\}, \qquad c_i = \Pr[\beta(n) = 1 \mid \alpha(n) = \alpha_i],$$

is the finite set of penalty probabilities that the action *α**i* will result in a penalty input from the random environment. If the penalty probabilities are constant, the environment is called a *stationary random environment*.
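Such a stationary random environment is easy to model in code. The penalty probabilities below are made up purely for illustration:

```python
import random

# Made-up, constant penalty probabilities c_i: a stationary random environment
C = {"forward": 0.1, "backward": 0.7, "turn_right": 0.5, "turn_left": 0.6}

def environment(action):
    """Return beta(n): 1 (penalty) with probability c_i, else 0 (reward)."""
    return 1 if random.random() < C[action] else 0
```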

The updating functions (reinforcement schemes) are categorized based on their linearity. The general linear scheme is given by:

If *α*(*n*) = *α**i*,

$$\beta(n) = 0 \text{ (reward):} \qquad p_i(n+1) = p_i(n) + a\,[1 - p_i(n)], \qquad p_j(n+1) = (1 - a)\,p_j(n) \quad \text{for } j \neq i$$

$$\beta(n) = 1 \text{ (penalty):} \qquad p_i(n+1) = (1 - b)\,p_i(n), \qquad p_j(n+1) = \frac{b}{r-1} + (1 - b)\,p_j(n) \quad \text{for } j \neq i$$

where *a* and *b* are the learning parameters with 0 < *a*, *b* < 1.
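In Python, the general linear scheme can be sketched as follows (a direct transcription of the two cases above; the function name and default parameters are my own):

```python
def linear_update(p, i, beta, a=0.1, b=0.1):
    """General linear scheme; a = b gives the linear reward-penalty scheme.

    p: probability vector p(n); i: index of the applied action alpha_i;
    beta: environment response (0 = reward, 1 = penalty).
    """
    r = len(p)
    q = list(p)
    if beta == 0:                       # reward: shift mass toward alpha_i
        for j in range(r):
            q[j] = (1 - a) * p[j]
        q[i] = p[i] + a * (1 - p[i])
    else:                               # penalty: shift mass away from alpha_i
        for j in range(r):
            q[j] = b / (r - 1) + (1 - b) * p[j]
        q[i] = (1 - b) * p[i]
    return q                            # entries still sum to 1
```

Both branches keep the probabilities normalized: in the reward case, the mass removed from the other actions is added to *α**i*, and in the penalty case, the mass removed from *α**i* is spread evenly over the remaining *r* − 1 actions.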

If *a* = *b*, the scheme is called the *linear reward-penalty scheme*, which is the earliest scheme considered in mathematical psychology.

For simplicity, we consider the random environment to be a *stationary random environment* and use the *linear reward-penalty scheme*. It can be seen immediately that the limits of a probability *p**i* for *n* → *∞* are either 0 or 1; therefore the robot learns to choose the optimal action asymptotically. Note that it does not always converge to the correct action, but the probability of converging to the wrong one can be made arbitrarily small by making the learning parameter *a* small.
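Putting the sketches above together, a short simulation illustrates this behavior; with the made-up penalty probabilities from before, the probability of *forward* (the action with the smallest *c**i*) should typically end up dominating:

```python
# Linear reward-penalty scheme: a = b, chosen small
p = run_automaton(environment,
                  lambda p, i, beta: linear_update(p, i, beta, a=0.05, b=0.05),
                  n_steps=20000)
print(dict(zip(ACTIONS, (round(x, 3) for x in p))))
# "forward" has the lowest penalty probability in this toy environment,
# so its entry in p is typically the largest.
```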
