Package jline.api.rl

Class RlTdAgent

    public final class RlTdAgent

    TD learning agent for queueing network routing decisions.

    This agent learns an optimal routing policy using average-reward TD(0) learning. When a new job arrives (departure from source), the agent selects a queue using an epsilon-greedy policy derived from the value function. When a job departs from a queue, the queue length is decremented.

    The value function is normalized after each update so that V(0,...,0) = 0, ensuring the differential value function interpretation.
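    The normalization step can be illustrated with a minimal sketch. It assumes the flat value array uses index 0 for the empty state (0,...,0), consistent with the column-major indexing used by this class; the class and method names below are hypothetical, not part of the library.

```java
// Minimal sketch of differential value-function normalization:
// subtract V(empty state) from every entry so that V(0,...,0) = 0.
// Assumes flat index 0 corresponds to the empty state.
public final class ValueNormalizationSketch {
    public static void normalize(double[] v) {
        double offset = v[0];        // value of the empty state
        for (int i = 0; i < v.length; i++) {
            v[i] -= offset;          // shift every entry by the same constant
        }
    }
}
```

    Subtracting a constant leaves greedy action choices unchanged, which is why the differential interpretation is safe to enforce after every update.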

    • Constructor Detail

      • RlTdAgent

        RlTdAgent(Double lr, Double epsilon, Double epsDecay)
        Parameters:
        lr - learning rate for value function updates
        epsilon - initial exploration rate for the epsilon-greedy policy (0 to 1)
        epsDecay - decay factor applied to epsilon after each episode
    • Method Detail

      • getV

         final DoubleArray getV()

        Value function stored as a flat array (N-dimensional table).

      • setV

         final Unit setV(DoubleArray v)

        Value function stored as a flat array (N-dimensional table).

      • getQ

         final DoubleArray getQ()

        Q-function stored as a flat array ((N+1)-dimensional table).

      • setQ

         final Unit setQ(DoubleArray q)

        Q-function stored as a flat array ((N+1)-dimensional table).

      • getVSize

         final IntArray getVSize()

        Shape of the value function array (one entry per dimension).

      • setVSize

         final Unit setVSize(IntArray vSize)

        Shape of the value function array (one entry per dimension).

      • getQSize

         final IntArray getQSize()

        Shape of the Q-function array (one entry per dimension, last is actionSize).

      • setQSize

         final Unit setQSize(IntArray qSize)

        Shape of the Q-function array (one entry per dimension, last is actionSize).

      • setEpsilon

         final Unit setEpsilon(Double epsilon)
        Parameters:
        epsilon - initial exploration rate for the epsilon-greedy policy (0 to 1)
      • reset

         final Unit reset(RlEnv env)

        Resets the agent and environment to their initial states.

        Clears the value function and Q-function, then resets the environment.

        Parameters:
        env - the RL environment
      • getValueFunction

         final DoubleArray getValueFunction()

        Returns the learned value function.

        Returns:

        the value function as a flat array

      • getQFunction

         final DoubleArray getQFunction()

        Returns the learned Q-function.

        Returns:

        the Q-function as a flat array

      • solve

         final Unit solve(RlEnv env)

        Trains the agent using average-reward TD(0) learning.

        Runs the TD learning algorithm for 10,000 episodes (matching the MATLAB default). In each episode:

        • An event is sampled from the environment.

        • If a new job arrives (source departure), the agent selects a queue using the epsilon-greedy policy (or join-the-shortest-queue if the state lies outside the action space).

        • If a job completes (queue departure), the corresponding queue length is decremented.

        • If the resulting state is valid, the value function is updated using the TD(0) rule.

        The average cost rate is estimated using exponentially weighted sums of costs and times.

        Parameters:
        env - the RL environment to train on
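        The per-event update in the loop above can be sketched as follows. The signature and names (cost, dt, avgCostRate) are illustrative assumptions, not the library's actual API; the final renormalization keeps V(0,...,0) = 0 as described in the class overview.

```java
// Hypothetical sketch of one average-reward TD(0) update; s and sNext are
// flat (column-major) state indices, and flat index 0 is assumed to be the
// empty state used for renormalization.
public final class TdUpdateSketch {
    public static void tdUpdate(double[] v, int s, int sNext,
                                double cost, double dt,
                                double avgCostRate, double lr) {
        // Differential TD error: immediate cost minus the average cost
        // accrued over the elapsed time dt, plus the change in value.
        double tdError = (cost - avgCostRate * dt) + v[sNext] - v[s];
        v[s] += lr * tdError;
        // Renormalize so the empty state keeps value zero.
        double offset = v[0];
        for (int i = 0; i < v.length; i++) v[i] -= offset;
    }
}
```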
      • createGreedyPolicy

         final static DoubleArray createGreedyPolicy(DoubleArray stateQ, Double epsilon, Integer nA)

        Creates an epsilon-greedy policy from state-action values.

        Each action gets a base probability of epsilon/nA. The remaining probability mass (1-epsilon) is distributed equally among all actions whose value is within FineTol of the minimum value (cost minimization).

        Parameters:
        stateQ - array of state-action values (one per action)
        epsilon - exploration probability
        nA - number of actions
        Returns:

        probability distribution over actions
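        The construction described above can be sketched in plain Java; FINE_TOL here is an assumed tolerance constant standing in for the library's FineTol.

```java
// Sketch of the epsilon-greedy policy over state-action costs: every action
// gets a base probability of epsilon/nA, and the remaining (1 - epsilon)
// mass is split equally among actions within FINE_TOL of the minimum
// (cost minimization).
public final class GreedyPolicySketch {
    private static final double FINE_TOL = 1e-8;   // assumed tolerance

    public static double[] greedyPolicy(double[] stateQ, double epsilon, int nA) {
        double min = Double.POSITIVE_INFINITY;
        for (int a = 0; a < nA; a++) min = Math.min(min, stateQ[a]);
        int nBest = 0;
        for (int a = 0; a < nA; a++) if (stateQ[a] <= min + FINE_TOL) nBest++;
        double[] policy = new double[nA];
        for (int a = 0; a < nA; a++) {
            policy[a] = epsilon / nA;                   // exploration mass
            if (stateQ[a] <= min + FINE_TOL) {
                policy[a] += (1.0 - epsilon) / nBest;   // greedy mass over ties
            }
        }
        return policy;
    }
}
```

        For example, with stateQ = {3, 1, 1}, epsilon = 0.3, and nA = 3, each action receives 0.1 of exploration mass and the two tied minimal actions split the remaining 0.7, giving {0.1, 0.45, 0.45}.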

      • getStateFromLoc

         final static Integer getStateFromLoc(IntArray objSize, IntArray loc)

        Converts a multi-dimensional location to a linear index (column-major order).

        This mirrors MATLAB's column-major (Fortran-order) linear indexing: index = (loc0-1) + (loc1-1)*size0 + (loc2-1)*size0*size1 + ...

        Note: locations are 1-based (as in MATLAB), converted to 0-based internally.

        Parameters:
        objSize - shape of the array (size of each dimension)
        loc - multi-dimensional location (1-based indices)
        Returns:

        linear index (0-based) into the flat array
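        The indexing formula can be written out directly; this is an illustrative reimplementation, not the library source.

```java
// Column-major (MATLAB-style) linearization of a 1-based multi-index:
// index = (loc0-1) + (loc1-1)*size0 + (loc2-1)*size0*size1 + ...
public final class ColumnMajorIndexSketch {
    public static int stateFromLoc(int[] objSize, int[] loc) {
        int index = 0;
        int stride = 1;
        for (int d = 0; d < objSize.length; d++) {
            index += (loc[d] - 1) * stride;  // 1-based -> 0-based per dimension
            stride *= objSize[d];
        }
        return index;
    }
}
```

        For example, in a 3x4 table the 1-based location (2, 3) maps to (2-1) + (3-1)*3 = 7, and (1, 1) maps to 0.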

      • getStateFromLocs

         final static IntArray getStateFromLocs(IntArray objSize, Array<IntArray> locs)

        Converts multiple multi-dimensional locations to linear indices.

        Parameters:
        objSize - shape of the array
        locs - array of locations (each row is one multi-dimensional location)
        Returns:

        array of linear indices
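        The batch version amounts to applying the single-location conversion to each row; again an illustrative sketch, not the library source.

```java
// Sketch: convert each 1-based multi-dimensional location (one per row)
// to a 0-based column-major linear index.
public final class BatchIndexSketch {
    public static int[] statesFromLocs(int[] objSize, int[][] locs) {
        int[] out = new int[locs.length];
        for (int r = 0; r < locs.length; r++) {
            int index = 0, stride = 1;
            for (int d = 0; d < objSize.length; d++) {
                index += (locs[r][d] - 1) * stride;  // column-major order
                stride *= objSize[d];
            }
            out[r] = index;
        }
        return out;
    }
}
```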