In reinforcement learning (RL) there is no answer key, but the agent still has to decide how to act to perform its task. In the absence of existing training data, the agent learns from experience: it collects training examples (“this action was good, that action was bad”) through trial and error as it attempts the task, with the goal of maximizing long-term reward.
One simple exploration strategy is to take the best known action most of the time (say, 80% of the time), but occasionally explore a new, randomly selected action, even though it may move the agent away from known reward. This is called the epsilon-greedy strategy, where epsilon is the fraction of the time that the agent takes a randomly selected action rather than the action most likely to maximize reward; acting greedily 80% of the time corresponds to epsilon = 0.2.
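As a rough illustration, here is a minimal sketch of epsilon-greedy action selection in Python. The table of estimated action values (q_values) and the list of actions are hypothetical placeholders, not something given in the original article.

import random

def epsilon_greedy(q_values, state, actions, epsilon=0.2):
    # With probability epsilon, explore: try a randomly selected action.
    if random.random() < epsilon:
        return random.choice(actions)
    # Otherwise exploit: take the action with the highest estimated value
    # in the current state (unseen state-action pairs default to 0.0).
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))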
A Markov decision process (MDP) is a mathematical framing of this setup: a set of states and actions with specified transition probabilities from state to state, plus the rewards the agent collects along the way.
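To make this concrete, a small MDP can be written down directly as a table of transition probabilities. The two-state weather example below is entirely hypothetical and only meant to show the shape of the data.

# A hypothetical two-state MDP:
# (state, action) -> list of (probability, next_state, reward) outcomes.
transitions = {
    ("sunny", "go_out"):  [(0.8, "sunny", 1.0), (0.2, "rainy", -1.0)],
    ("sunny", "stay_in"): [(1.0, "sunny", 0.0)],
    ("rainy", "go_out"):  [(0.9, "rainy", -1.0), (0.1, "sunny", 1.0)],
    ("rainy", "stay_in"): [(1.0, "rainy", 0.0)],
}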
Q-learning is a technique that chooses which action to take based on an action-value function Q(s, a), which estimates the value of being in state s and taking action a from there.
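Below is a minimal sketch of the tabular Q-learning update, assuming a dictionary of Q-values keyed by (state, action) as in the earlier sketch; the learning rate alpha and discount factor gamma are conventional but hypothetical choices, not values from the original article.

def q_update(q_values, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    # Standard tabular Q-learning step:
    # Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = max(q_values.get((next_state, a), 0.0) for a in actions)
    old = q_values.get((state, action), 0.0)
    q_values[(state, action)] = old + alpha * (reward + gamma * best_next - old)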
Policy learning is a more straightforward alternative in which we learn a policy function: a direct map from each state to the best corresponding action at that state. Think of it as a behavioral policy: “when I observe state s, the best thing to do is take action a.”
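In the simplest tabular case, such a policy is literally a lookup table from state to action. The states and actions below reuse the hypothetical weather example and are purely illustrative.

# A tabular policy: a direct state -> action map (hypothetical example).
policy = {
    "sunny": "go_out",
    "rainy": "stay_in",
}

def act(policy, state):
    # Return the action the policy prescribes for this state.
    return policy[state]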
Source: Machine Learning for Humans, Part 5: Reinforcement Learning