Package jline.api.rl

Class RlTdAgentGeneral

    public final class RlTdAgentGeneral

    General TD learning agent for queueing network control.

    This agent operates on RlEnvGeneral environments.

    The agent uses average-reward TD(0) updates where the cost at each step is the total number of jobs in the system multiplied by the elapsed time.
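    The update described above can be sketched as follows. This is an illustrative sketch, not the library's internals: the class name, field names, and state encoding are all assumptions; only the cost definition (jobs in system times elapsed time) and the average-reward TD(0) form come from the documentation.

    ```java
    // Illustrative sketch of one average-reward (differential) TD(0) update:
    //   V(s) <- V(s) + lr * (cost - avgCost + V(s') - V(s))
    // where cost = jobsInSystem * elapsedTime, as stated above.
    public final class Td0Sketch {
        static double lr = 0.1;               // learning rate (constructor's `lr`)
        static double avgCost = 0.0;          // running average-cost estimate
        static double[] V = new double[100];  // flat value table, indexed by state

        static void update(int s, int sNext, int jobsInSystem, double elapsedTime) {
            double cost = jobsInSystem * elapsedTime;
            double tdError = cost - avgCost + V[sNext] - V[s];
            V[s] += lr * tdError;
            avgCost += lr * tdError;          // average cost learned alongside V
        }
    }
    ```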

    • Constructor Detail

      • RlTdAgentGeneral

        RlTdAgentGeneral(Double lr, Double epsilon, Double epsDecay)
        Parameters:
        lr - learning rate for value function updates
        epsilon - initial exploration rate (0 to 1)
        epsDecay - decay factor applied to epsilon each episode
    • Method Detail

      • getV

         final DoubleArray getV()

        Value function stored as a flat array (N-dimensional table).

      • setV

         final Unit setV(DoubleArray v)

        Sets the value function, stored as a flat array (N-dimensional table).

      • setEpsilon

         final Unit setEpsilon(Double epsilon)
        Parameters:
        epsilon - the new exploration rate (0 to 1)
      • reset

         final Unit reset(RlEnvGeneral env)

        Resets the agent and environment.

        Parameters:
        env - the general RL environment
      • getValueFunction

         final DoubleArray getValueFunction()

        Returns the learned value function.

        Returns:

        the value function as a flat array

      • solveForFixedPolicy

         final DoubleArray solveForFixedPolicy(RlEnvGeneral env, Integer numEpisodes)

        Evaluates the value function for the current (fixed) routing policy.

        This method runs TD(0) learning without modifying routing decisions. Events are sampled from the environment and the model's existing routing is used. The value function V(s) is updated to reflect the average cost under the current policy.

        This is useful for evaluating heuristic policies (e.g., JSQ, round-robin) before attempting policy improvement.
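        One such heuristic, join-the-shortest-queue (JSQ), can be sketched in a few lines. This is an illustrative stand-alone sketch with assumed names; in practice the routing policy being evaluated lives in the model/environment, not in the agent.

        ```java
        // Join-the-shortest-queue (JSQ): route the job to the queue node
        // currently holding the fewest jobs (ties broken by lowest index).
        public final class JsqSketch {
            static int route(int[] queueLengths) {
                int best = 0;
                for (int i = 1; i < queueLengths.length; i++) {
                    if (queueLengths[i] < queueLengths[best]) best = i;
                }
                return best;
            }
        }
        ```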

        Parameters:
        env - the general RL environment
        numEpisodes - number of episodes to run (typically 10^4)
        Returns:

        the learned value function as a flat array

      • solve

         final DoubleArray solve(RlEnvGeneral env, Integer numEpisodes)

        Learns an optimal routing policy using tabular TD control.

        In each episode, the agent:

        • Samples an event from the environment

        • Processes departures from queue nodes

        • If the departure is from an action node and the state is in the action space, selects a routing action using epsilon-greedy policy based on the value of successor states

        • Processes arrivals at queue nodes

        • Updates the value function using average-reward TD(0)

        The epsilon parameter decays by epsDecay each episode for gradual exploitation.
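        The epsilon-greedy step and the per-episode decay can be sketched as below. The class and method names are assumptions; the sketch only reflects what the documentation states: explore with probability epsilon, otherwise pick the successor state with the best (here, lowest-cost) value, and multiply epsilon by epsDecay each episode.

        ```java
        import java.util.Random;

        // Sketch of epsilon-greedy action selection over successor-state values.
        public final class EpsGreedySketch {
            static final Random rng = new Random(7);

            static int chooseAction(double[] successorValues, double epsilon) {
                if (rng.nextDouble() < epsilon) {
                    return rng.nextInt(successorValues.length);   // explore
                }
                int best = 0;  // exploit: greedy w.r.t. successor-state values
                for (int i = 1; i < successorValues.length; i++) {
                    if (successorValues[i] < successorValues[best]) best = i;
                }
                return best;
            }

            // Per-episode decay, as with the epsDecay constructor parameter.
            static double decay(double epsilon, double epsDecay) {
                return epsilon * epsDecay;
            }
        }
        ```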

        Parameters:
        env - the general RL environment
        numEpisodes - number of episodes to run (typically 10^4)
        Returns:

        the learned value function as a flat array

      • solveByHashmap

         final RlTdAgentGeneral.HashmapResult solveByHashmap(RlEnvGeneral env, Integer numEpisodes)

        Learns a routing policy using a HashMap-based sparse value function.

        Instead of allocating a full N-dimensional table, this method stores value function entries only for states actually visited during learning. States not in the map use an "external" default value.

        This is efficient for large state spaces where only a fraction of states are reachable.
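        The sparse storage scheme can be sketched as a map from state keys to values with a single fallback default. The key encoding and constant name are assumptions for illustration; the documented behavior is only that unvisited states share one "external" default value.

        ```java
        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.Map;

        // Sketch of a sparse value function: only visited states get entries;
        // states absent from the map fall back to the "external" default value.
        public final class SparseVSketch {
            static final double EXTERNAL_DEFAULT = 0.0;  // value for unvisited states
            static final Map<String, Double> V = new HashMap<>();

            static String key(int[] state) { return Arrays.toString(state); }

            static double get(int[] state) {
                return V.getOrDefault(key(state), EXTERNAL_DEFAULT);
            }

            static void set(int[] state, double value) { V.put(key(state), value); }
        }
        ```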

        Parameters:
        env - the general RL environment
        numEpisodes - number of episodes to run
        Returns:

        HashmapResult containing the feature matrix X and value vector Y

      • solveByLinear

         final RlTdAgentGeneral.ApproximationResult solveByLinear(RlEnvGeneral env, Integer numEpisodes)

        Learns a routing policy and fits a linear value function approximator.

        Runs HashMap-based TD control, then fits a linear model: V(q1, q2, ..., qn) = w0 + w1*q1 + w2*q2 + ... + wn*qn

        The regression is performed using ordinary least squares (OLS): coefficients = (X^T X)^{-1} X^T Y
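        The OLS fit can be sketched by forming the normal equations X^T X w = X^T Y and solving them directly. This is an illustrative implementation, not the library's; it assumes X already contains an intercept column of ones and that the feature count is small enough for Gaussian elimination with partial pivoting.

        ```java
        // Sketch of OLS: coefficients = (X^T X)^{-1} X^T Y, computed by
        // solving the normal equations with Gaussian elimination.
        public final class OlsSketch {
            static double[] fit(double[][] X, double[] y) {
                int n = X.length, p = X[0].length;
                double[][] A = new double[p][p + 1];  // augmented [X^T X | X^T Y]
                for (int i = 0; i < p; i++) {
                    for (int j = 0; j < p; j++)
                        for (int k = 0; k < n; k++) A[i][j] += X[k][i] * X[k][j];
                    for (int k = 0; k < n; k++) A[i][p] += X[k][i] * y[k];
                }
                for (int col = 0; col < p; col++) {   // forward elimination
                    int piv = col;                    // partial pivoting
                    for (int r = col + 1; r < p; r++)
                        if (Math.abs(A[r][col]) > Math.abs(A[piv][col])) piv = r;
                    double[] tmp = A[col]; A[col] = A[piv]; A[piv] = tmp;
                    for (int r = col + 1; r < p; r++) {
                        double f = A[r][col] / A[col][col];
                        for (int c = col; c <= p; c++) A[r][c] -= f * A[col][c];
                    }
                }
                double[] w = new double[p];           // back substitution
                for (int i = p - 1; i >= 0; i--) {
                    double s = A[i][p];
                    for (int j = i + 1; j < p; j++) s -= A[i][j] * w[j];
                    w[i] = s / A[i][i];
                }
                return w;
            }
        }
        ```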

        Parameters:
        env - the general RL environment
        numEpisodes - number of episodes to run
        Returns:

        ApproximationResult with feature matrix, values, and regression coefficients

      • solveByQuad

         final RlTdAgentGeneral.ApproximationResult solveByQuad(RlEnvGeneral env, Integer numEpisodes)

        Learns a routing policy and fits a quadratic value function approximator.

        Runs HashMap-based TD control, then fits a quadratic model: V(q1, ..., qn) = sum_{i,j} w_{ij} * q_i * q_j + linear terms + intercept

        The feature matrix is augmented with all pairwise products of the original features (including self-products q_i^2).
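        The augmentation step can be sketched per row: each feature vector [q1, ..., qn] is extended with all products q_i*q_j for i <= j, self-products included. The class and method names are illustrative.

        ```java
        // Sketch of quadratic feature augmentation for one row of the
        // feature matrix: linear terms followed by all pairwise products.
        public final class QuadFeaturesSketch {
            static double[] augment(double[] q) {
                int n = q.length;
                int extra = n * (n + 1) / 2;        // pairwise products, i <= j
                double[] out = new double[n + extra];
                System.arraycopy(q, 0, out, 0, n);  // keep the linear terms
                int k = n;
                for (int i = 0; i < n; i++)
                    for (int j = i; j < n; j++) out[k++] = q[i] * q[j];
                return out;
            }
        }
        ```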

        Parameters:
        env - the general RL environment
        numEpisodes - number of episodes to run
        Returns:

        ApproximationResult with augmented feature matrix, values, and regression coefficients