Class RlTdAgent
-
public final class RlTdAgent

TD learning agent for queueing network routing decisions.
This agent learns an optimal routing policy using average-reward TD(0) learning. When a new job arrives (departure from source), the agent selects a queue using an epsilon-greedy policy derived from the value function. When a job departs from a queue, the queue length is decremented.
The value function is normalized after each update so that V(0,...,0) = 0, ensuring the differential value function interpretation.
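The normalization step can be sketched as follows. This is an illustrative fragment, not the library's actual code: the helper name and the assumption that the all-zeros state maps to linear index 0 (which holds under the column-major layout described below) are mine.

```java
// Sketch: normalize a differential value function so that V(0,...,0) = 0.
// Under column-major flat indexing, the all-zeros state maps to index 0,
// so subtracting v[0] from every entry pins the reference state to zero.
public class ValueNormalization {
    static void normalize(double[] v) {
        double offset = v[0];            // value of the all-zeros state
        for (int i = 0; i < v.length; i++) {
            v[i] -= offset;              // shift so the reference state is 0
        }
    }

    public static void main(String[] args) {
        double[] v = {2.5, 3.0, 4.5};
        normalize(v);
        System.out.println(java.util.Arrays.toString(v)); // [0.0, 0.5, 2.0]
    }
}
```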
-
-
Method Summary
- final DoubleArray getV(): Value function stored as a flat array (N-dimensional table).
- final Unit setV(DoubleArray v): Value function stored as a flat array (N-dimensional table).
- final DoubleArray getQ(): Q-function stored as a flat array (N+1 dimensional table).
- final Unit setQ(DoubleArray q): Q-function stored as a flat array (N+1 dimensional table).
- final IntArray getVSize(): Shape of the value function array (one entry per dimension).
- final Unit setVSize(IntArray vSize): Shape of the value function array (one entry per dimension).
- final IntArray getQSize(): Shape of the Q-function array (one entry per dimension, last is actionSize).
- final Unit setQSize(IntArray qSize): Shape of the Q-function array (one entry per dimension, last is actionSize).
- final Double getLr()
- final Double getEpsilon()
- final Unit setEpsilon(Double epsilon)
- final Double getEpsDecay()
- final Unit reset(RlEnv env): Resets the agent and environment to their initial states.
- final DoubleArray getValueFunction(): Returns the learned value function.
- final DoubleArray getQFunction(): Returns the learned Q-function.
- final Unit solve(RlEnv env): Trains the agent using average-reward TD(0) learning.
- final static DoubleArray createGreedyPolicy(DoubleArray stateQ, Double epsilon, Integer nA): Creates an epsilon-greedy policy from state-action values.
- final static Integer getStateFromLoc(IntArray objSize, IntArray loc): Converts a multi-dimensional location to a linear index (column-major order).
- final static IntArray getStateFromLocs(IntArray objSize, Array<IntArray> locs): Converts multiple multi-dimensional locations to linear indices.
-
Method Detail
-
getV
final DoubleArray getV()
Value function stored as a flat array (N-dimensional table).
-
setV
final Unit setV(DoubleArray v)
Value function stored as a flat array (N-dimensional table).
-
getQ
final DoubleArray getQ()
Q-function stored as a flat array (N+1 dimensional table).
-
setQ
final Unit setQ(DoubleArray q)
Q-function stored as a flat array (N+1 dimensional table).
-
setVSize
final Unit setVSize(IntArray vSize)
Shape of the value function array (one entry per dimension).
-
getQSize
final IntArray getQSize()
Shape of the Q-function array (one entry per dimension, last is actionSize).
-
setQSize
final Unit setQSize(IntArray qSize)
Shape of the Q-function array (one entry per dimension, last is actionSize).
-
getEpsilon
final Double getEpsilon()
-
setEpsilon
final Unit setEpsilon(Double epsilon)
- Parameters:
epsilon - initial exploration rate for the epsilon-greedy policy (0 to 1)
-
getEpsDecay
final Double getEpsDecay()
-
reset
final Unit reset(RlEnv env)
Resets the agent and environment to their initial states.
Clears the value function and Q-function, then resets the environment.
- Parameters:
env - the RL environment
-
getValueFunction
final DoubleArray getValueFunction()
Returns the learned value function.
- Returns:
the value function as a flat array
-
getQFunction
final DoubleArray getQFunction()
Returns the learned Q-function.
- Returns:
the Q-function as a flat array
-
solve
final Unit solve(RlEnv env)
Trains the agent using average-reward TD(0) learning.
Runs the TD learning algorithm for 10,000 episodes (matching the MATLAB default). In each episode:
- An event is sampled from the environment.
- If a new job arrives (source departure), the agent selects a queue using the epsilon-greedy policy (or JSQ, join the shortest queue, if outside the action space).
- If a job completes (queue departure), the queue length is decremented.
- If the state is valid, the value function is updated using the TD(0) rule.
The average cost rate is estimated using exponentially weighted sums of costs and times.
- Parameters:
env - the RL environment to train on
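A single step of the learning loop can be sketched as below. This is a minimal illustration under stated assumptions, not the library's implementation: the field names, the learning rate, and the exponential weighting factor `decay` are all hypothetical, chosen only to show how the average cost rate estimate feeds the differential TD(0) update.

```java
// Sketch of one average-reward TD(0) update (all names are illustrative).
public class TdUpdate {
    double[] v;                        // flat value table
    double lr = 0.01;                  // learning rate (assumed)
    double costSum = 0, timeSum = 0;   // exponentially weighted accumulators
    double decay = 0.999;              // weighting factor (assumed)

    TdUpdate(int nStates) { v = new double[nStates]; }

    void update(int s, int sNext, double cost, double dt) {
        // Estimate the average cost rate from weighted sums of costs and times.
        costSum = decay * costSum + cost;
        timeSum = decay * timeSum + dt;
        double rho = timeSum > 0 ? costSum / timeSum : 0.0;

        // Differential TD(0): move V(s) toward cost - rho*dt + V(s').
        double tdError = cost - rho * dt + v[sNext] - v[s];
        v[s] += lr * tdError;

        // Re-normalize so the reference (empty) state keeps value zero.
        double offset = v[0];
        for (int i = 0; i < v.length; i++) v[i] -= offset;
    }
}
```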
-
createGreedyPolicy
final static DoubleArray createGreedyPolicy(DoubleArray stateQ, Double epsilon, Integer nA)
Creates an epsilon-greedy policy from state-action values.
Each action gets a base probability of epsilon/nA. The remaining probability mass (1-epsilon) is distributed equally among all actions whose value is within FineTol of the minimum value (cost minimization).
- Parameters:
stateQ - array of state-action values (one per action)
epsilon - exploration probability
nA - number of actions
- Returns:
probability distribution over actions
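The construction described above can be sketched as follows. This is an illustrative version, not the library's code: `FINE_TOL` stands in for the `FineTol` tolerance mentioned above, whose actual value is not stated here.

```java
// Sketch of building an epsilon-greedy distribution for cost minimization.
public class GreedyPolicy {
    static final double FINE_TOL = 1e-8;  // assumed tie-breaking tolerance

    static double[] create(double[] stateQ, double epsilon, int nA) {
        double[] p = new double[nA];
        // Every action gets the base exploration mass epsilon / nA.
        for (int a = 0; a < nA; a++) p[a] = epsilon / nA;

        // Find the minimum value and count near-minimal actions.
        double min = Double.POSITIVE_INFINITY;
        for (int a = 0; a < nA; a++) min = Math.min(min, stateQ[a]);
        int nBest = 0;
        for (int a = 0; a < nA; a++) if (stateQ[a] <= min + FINE_TOL) nBest++;

        // Split the remaining (1 - epsilon) mass equally among them.
        for (int a = 0; a < nA; a++) {
            if (stateQ[a] <= min + FINE_TOL) p[a] += (1.0 - epsilon) / nBest;
        }
        return p;
    }
}
```

With `stateQ = {1.0, 2.0, 1.0}` and `epsilon = 0.3`, actions 0 and 2 tie for the minimum, so each receives 0.1 + 0.35 = 0.45 and action 1 keeps 0.1.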
-
getStateFromLoc
final static Integer getStateFromLoc(IntArray objSize, IntArray loc)
Converts a multi-dimensional location to a linear index (column-major order).
This mirrors MATLAB's column-major (Fortran-order) linear indexing: index = (loc1-1) + (loc2-1)*size1 + (loc3-1)*size1*size2 + ...
Note: locations are 1-based (as in MATLAB), converted to 0-based internally.
- Parameters:
objSize - shape of the array (size of each dimension)
loc - multi-dimensional location (1-based indices)
- Returns:
linear index (0-based) into the flat array
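The indexing scheme above can be sketched as follows (an illustrative standalone version, not the library's code):

```java
// Sketch of MATLAB-style column-major linear indexing with 1-based locations.
public class ColMajorIndex {
    static int toLinear(int[] objSize, int[] loc) {
        int index = 0;
        int stride = 1;
        for (int d = 0; d < objSize.length; d++) {
            index += (loc[d] - 1) * stride;  // convert 1-based to 0-based
            stride *= objSize[d];            // first dimension varies fastest
        }
        return index;
    }

    public static void main(String[] args) {
        // For a 3x4 array, location (2,3) maps to (2-1) + (3-1)*3 = 7.
        System.out.println(toLinear(new int[]{3, 4}, new int[]{2, 3})); // 7
    }
}
```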
-
getStateFromLocs
final static IntArray getStateFromLocs(IntArray objSize, Array<IntArray> locs)
Converts multiple multi-dimensional locations to linear indices.
- Parameters:
objSize - shape of the array
locs - array of locations (each row is one multi-dimensional location)
- Returns:
array of linear indices
-