Class RlTdAgentGeneral
-
public final class RlTdAgentGeneral
General TD learning agent for queueing network control.
This agent operates with RlEnvGeneral environments and supports:
Value function evaluation for a fixed policy (solveForFixedPolicy)
Policy optimization using tabular TD control (solve)
Sparse state space exploration using HashMap-based value functions (solveByHashmap)
Linear and quadratic value function approximation (solveByLinear, solveByQuad)
The agent uses average-reward TD(0) updates where the cost at each step is the total number of jobs in the system multiplied by the elapsed time.
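The update described above can be sketched as follows. This is a minimal illustration in Java, assuming a generic flat state index and hypothetical names (`TdUpdateSketch`, `step`); the agent's actual internals are not shown on this page.

```java
// Illustrative average-reward TD(0) update (hypothetical helper, not the
// library's implementation). The per-step cost is the number of jobs in
// the system times the elapsed time; the value table and the running
// average-cost estimate are updated in place.
public final class TdUpdateSketch {
    double[] v;          // flat value table, one entry per state
    double avgCost;      // running estimate of the average cost rate
    final double lr;     // learning rate

    TdUpdateSketch(int numStates, double lr) {
        this.v = new double[numStates];
        this.lr = lr;
    }

    /** One TD(0) step from state s to successor state sNext. */
    void step(int s, int sNext, int jobsInSystem, double elapsedTime) {
        double cost = jobsInSystem * elapsedTime;
        // Average-reward TD error: incurred cost minus the average-cost
        // baseline over the elapsed time, plus the undiscounted value of
        // the successor state, minus the current value.
        double tdError = cost - avgCost * elapsedTime + v[sNext] - v[s];
        v[s] += lr * tdError;
        avgCost += lr * tdError; // also track the average cost rate
    }
}
```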
-
-
Nested Class Summary
public final class RlTdAgentGeneral.HashmapResult: Result of the HashMap-based TD control solve.
public final class RlTdAgentGeneral.ApproximationResult: Result of value function approximation.
-
Constructor Summary
RlTdAgentGeneral(Double lr, Double epsilon, Double epsDecay)
-
Method Summary
final DoubleArray getV(): Value function stored as a flat array (N-dimensional table).
final Unit setV(DoubleArray v): Value function stored as a flat array (N-dimensional table).
final IntArray getVSize(): Shape of the value function array.
final Unit setVSize(IntArray vSize): Shape of the value function array.
final Double getLr(): Learning rate.
final Double getEpsilon(): Exploration rate.
final Unit setEpsilon(Double epsilon): Exploration rate.
final Double getEpsDecay(): Per-episode epsilon decay factor.
final Unit reset(RlEnvGeneral env): Resets the agent and environment.
final DoubleArray getValueFunction(): Returns the learned value function.
final DoubleArray solveForFixedPolicy(RlEnvGeneral env, Integer numEpisodes): Evaluates the value function for the current (fixed) routing policy.
final DoubleArray solve(RlEnvGeneral env, Integer numEpisodes): Learns an optimal routing policy using tabular TD control.
final RlTdAgentGeneral.HashmapResult solveByHashmap(RlEnvGeneral env, Integer numEpisodes): Learns a routing policy using a HashMap-based sparse value function.
final RlTdAgentGeneral.ApproximationResult solveByLinear(RlEnvGeneral env, Integer numEpisodes): Learns a routing policy and fits a linear value function approximator.
final RlTdAgentGeneral.ApproximationResult solveByQuad(RlEnvGeneral env, Integer numEpisodes): Learns a routing policy and fits a quadratic value function approximator.
-
Method Detail
-
getV
final DoubleArray getV()
Value function stored as a flat array (N-dimensional table).
-
setV
final Unit setV(DoubleArray v)
Value function stored as a flat array (N-dimensional table).
-
getEpsilon
final Double getEpsilon()
-
setEpsilon
final Unit setEpsilon(Double epsilon)
- Parameters:
epsilon - initial exploration rate (0 to 1)
-
getEpsDecay
final Double getEpsDecay()
-
reset
final Unit reset(RlEnvGeneral env)
Resets the agent and environment.
- Parameters:
env - the general RL environment
-
getValueFunction
final DoubleArray getValueFunction()
Returns the learned value function.
- Returns:
the value function as a flat array
-
solveForFixedPolicy
final DoubleArray solveForFixedPolicy(RlEnvGeneral env, Integer numEpisodes)
Evaluates the value function for the current (fixed) routing policy.
This method runs TD(0) learning without modifying routing decisions. Events are sampled from the environment and the model's existing routing is used. The value function V(s) is updated to reflect the average cost under the current policy.
This is useful for evaluating heuristic policies (e.g., JSQ, round-robin) before attempting policy improvement.
- Parameters:
env - the general RL environment
numEpisodes - number of episodes to run (typically 10^4)
- Returns:
the learned value function as a flat array
-
solve
final DoubleArray solve(RlEnvGeneral env, Integer numEpisodes)
Learns an optimal routing policy using tabular TD control.
In each episode, the agent:
Samples an event from the environment
Processes departures from queue nodes
If the departure is from an action node and the state is in the action space, selects a routing action using epsilon-greedy policy based on the value of successor states
Processes arrivals at queue nodes
Updates the value function using average-reward TD(0)
The epsilon parameter decays by epsDecay each episode for gradual exploitation.
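The action-selection and decay steps above can be sketched as follows. This is an illustrative Java fragment with hypothetical names (`EpsilonGreedySketch`, `selectAction`, `endEpisode`), assuming that successor-state values represent costs, so the greedy choice is the minimum.

```java
import java.util.Random;

// Illustrative epsilon-greedy action selection with per-episode decay
// (hypothetical names; the agent's actual internals are not documented here).
public final class EpsilonGreedySketch {
    private double epsilon;
    private final double epsDecay;
    private final Random rng;

    EpsilonGreedySketch(double epsilon, double epsDecay, long seed) {
        this.epsilon = epsilon;
        this.epsDecay = epsDecay;
        this.rng = new Random(seed);
    }

    /** Pick the successor with the lowest value, exploring with probability epsilon. */
    int selectAction(double[] successorValues) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(successorValues.length); // explore: random action
        }
        int best = 0;
        for (int a = 1; a < successorValues.length; a++) {
            if (successorValues[a] < successorValues[best]) best = a; // costs: lower is better
        }
        return best;
    }

    /** Called once per episode: shrink epsilon toward pure exploitation. */
    void endEpisode() {
        epsilon *= epsDecay;
    }

    double epsilon() { return epsilon; }
}
```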
- Parameters:
env - the general RL environment
numEpisodes - number of episodes to run (typically 10^4)
- Returns:
the learned value function as a flat array
-
solveByHashmap
final RlTdAgentGeneral.HashmapResult solveByHashmap(RlEnvGeneral env, Integer numEpisodes)
Learns a routing policy using a HashMap-based sparse value function.
Instead of allocating a full N-dimensional table, this method stores value function entries only for states actually visited during learning. States not in the map use an "external" default value.
This is efficient for large state spaces where only a fraction of states are reachable.
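The sparse-storage idea can be sketched as follows: an illustrative Java helper (hypothetical names `SparseValueFunction`, `externalValue`) that keys states by their queue-length vector and falls back to a single default value for unvisited states.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sparse value function backed by a HashMap (a sketch, not
// the library's HashmapResult machinery). Only states actually visited
// get an entry; everything else reads as the "external" default value.
public final class SparseValueFunction {
    private final Map<List<Integer>, Double> v = new HashMap<>();
    private final double externalValue;

    SparseValueFunction(double externalValue) {
        this.externalValue = externalValue;
    }

    /** Value of a state; unvisited states use the external default. */
    double get(List<Integer> state) {
        return v.getOrDefault(state, externalValue);
    }

    /** TD update; inserts the state on first visit. */
    void update(List<Integer> state, double lr, double tdError) {
        v.put(state, get(state) + lr * tdError);
    }

    int visitedStates() {
        return v.size();
    }
}
```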
- Parameters:
env - the general RL environment
numEpisodes - number of episodes to run
- Returns:
HashmapResult containing the feature matrix X and value vector Y
-
solveByLinear
final RlTdAgentGeneral.ApproximationResult solveByLinear(RlEnvGeneral env, Integer numEpisodes)
Learns a routing policy and fits a linear value function approximator.
Runs HashMap-based TD control, then fits a linear model: V(q1, q2, ..., qn) = w0 + w1*q1 + w2*q2 + ... + wn*qn
The regression is performed using ordinary least squares (OLS): coefficients = (X^T X)^{-1} X^T Y
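The OLS step can be sketched as follows: a self-contained Java fragment (hypothetical helper `OlsSketch.fit`) that forms the normal equations (X^T X) w = X^T Y and solves them by Gauss-Jordan elimination, under the assumption that each row of X is [1, q1, ..., qn].

```java
// Illustrative ordinary least squares fit (a sketch, not the library's
// regression code). Rows of x are feature vectors [1, q1, ..., qn];
// y holds the learned state values.
public final class OlsSketch {
    /** Returns coefficients [w0, w1, ..., wn] solving (X^T X) w = X^T y. */
    static double[] fit(double[][] x, double[] y) {
        int m = x.length, n = x[0].length;
        double[][] a = new double[n][n + 1]; // augmented system [X^T X | X^T y]
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                for (int r = 0; r < m; r++) a[i][j] += x[r][i] * x[r][j];
            for (int r = 0; r < m; r++) a[i][n] += x[r][i] * y[r];
        }
        // Gauss-Jordan elimination with partial pivoting.
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            double[] tmp = a[col]; a[col] = a[piv]; a[piv] = tmp;
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= f * a[col][c];
            }
        }
        double[] w = new double[n];
        for (int i = 0; i < n; i++) w[i] = a[i][n] / a[i][i];
        return w;
    }
}
```

Fitting exactly linear data, e.g. y = 2 + 3q, recovers the coefficients [2, 3].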
- Parameters:
env - the general RL environment
numEpisodes - number of episodes to run
- Returns:
ApproximationResult with feature matrix, values, and regression coefficients
-
solveByQuad
final RlTdAgentGeneral.ApproximationResult solveByQuad(RlEnvGeneral env, Integer numEpisodes)
Learns a routing policy and fits a quadratic value function approximator.
Runs HashMap-based TD control, then fits a quadratic model: V(q1, ..., qn) = sum_{i,j} w_{ij} * q_i * q_j + linear terms + intercept
The feature matrix is augmented with all pairwise products of the original features (including self-products q_i^2).
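The augmentation step can be sketched as follows: an illustrative Java helper (hypothetical name `QuadFeaturesSketch.augment`) that extends a queue-length vector with all pairwise products q_i * q_j for i <= j, including the squares q_i^2. The same OLS fit as in the linear case can then be run on the augmented matrix.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative quadratic feature augmentation (a sketch, not the
// library's feature construction).
public final class QuadFeaturesSketch {
    /** Maps [q1, ..., qn] to [q1, ..., qn, q1*q1, q1*q2, ..., qn*qn]. */
    static double[] augment(double[] q) {
        List<Double> out = new ArrayList<>();
        for (double qi : q) out.add(qi);         // original linear terms
        for (int i = 0; i < q.length; i++)
            for (int j = i; j < q.length; j++)
                out.add(q[i] * q[j]);            // pairwise products, incl. squares
        double[] arr = new double[out.size()];
        for (int k = 0; k < arr.length; k++) arr[k] = out.get(k);
        return arr;
    }
}
```

For example, [2, 3] becomes [2, 3, 4, 6, 9].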
- Parameters:
env - the general RL environment
numEpisodes - number of episodes to run
- Returns:
ApproximationResult with augmented feature matrix, values, and regression coefficients
-