Relatively Clever - Netflix

Relatively Clever pits families against one another in general knowledge and intelligence tests.


Type: Game Show

Languages: English

Status: Ended

Runtime: 60 minutes

Premiere: 2015-03-27

Relatively Clever - Reinforcement learning - Netflix

Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms.

In the operations research and control literature, reinforcement learning is called approximate dynamic programming or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP, and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs need not be presented, and sub-optimal actions need not be explicitly corrected. Instead the focus is on performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.
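The exploration/exploitation trade-off mentioned above can be sketched with an epsilon-greedy agent on a multi-armed bandit. The arm reward probabilities, the value of epsilon, and the number of steps below are all illustrative assumptions, not from the source:

```python
# Epsilon-greedy action selection on a toy 3-armed bandit.
# With probability epsilon the agent explores a random arm;
# otherwise it exploits the arm with the best current estimate.
import random

random.seed(0)

true_means = [0.2, 0.5, 0.8]   # hidden reward probabilities (assumed)
estimates = [0.0] * 3          # running value estimate per arm
counts = [0] * 3               # pull count per arm
epsilon = 0.1                  # exploration probability (assumed)

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)               # explore
    else:
        arm = estimates.index(max(estimates))   # exploit
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # incremental mean update of the pulled arm's estimate
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

best = estimates.index(max(estimates))
print(best)  # the agent should settle on arm 2, the highest-paying arm
```

With enough steps the estimates approach the true means, so the greedy choice converges on the best arm while occasional exploration keeps the other estimates from going stale.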

Relatively Clever - Temporal difference methods - Netflix

Linear function approximation starts with a mapping $\phi$ that assigns a finite-dimensional vector to each state-action pair. The action values of a state-action pair $(s,a)$ are then obtained by linearly combining the components of $\phi(s,a)$ with some weights $\theta$:

$Q(s,a) = \sum_{i=1}^{d} \theta_{i}\,\phi_{i}(s,a)$.
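As a toy illustration, the linear form of $Q$ is just a dot product between the weights and the feature vector. The feature map and weight values here are entirely made up for the example:

```python
# Q(s, a) = sum_i theta_i * phi_i(s, a) as a plain dot product.
def phi(s, a):
    # hypothetical 3-dimensional feature vector for (s, a)
    return [1.0, float(s), float(s) * a]

theta = [0.5, -0.2, 0.1]  # hypothetical learned weights

def q_value(s, a, theta):
    # linearly combine the feature components with the weights
    return sum(t * f for t, f in zip(theta, phi(s, a)))

print(q_value(2.0, 1, theta))  # 0.5*1 - 0.2*2 + 0.1*2 ≈ 0.3
```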

The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. Methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored. Value iteration can also be used as a starting point, giving rise to the Q-learning algorithm and its many variants. The problem with using action values is that they may need highly precise estimates of the competing action values, which can be hard to obtain when the returns are noisy, though this problem is mitigated to some extent by temporal difference methods. Using the so-called compatible function approximation method compromises generality and efficiency. Another problem specific to TD methods comes from their reliance on the recursive Bellman equation. Most TD methods have a so-called
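A minimal sketch of adjusting the weights rather than table entries, using a semi-gradient Q-learning step with linear features. The feature map, learning rate, and the toy transitions are all assumptions for illustration, not the source's method:

```python
# Semi-gradient Q-learning with linear function approximation:
# the TD error nudges the weight vector theta along phi(s, a),
# instead of updating a per-(s, a) table entry.
N_FEATURES = 4
theta = [0.0] * N_FEATURES
alpha, gamma = 0.1, 0.9  # learning rate and discount (assumed)
ACTIONS = [0, 1]

def phi(s, a):
    # hypothetical one-hot features over (state parity, action)
    f = [0.0] * N_FEATURES
    f[2 * (s % 2) + a] = 1.0
    return f

def q(s, a):
    return sum(t, ) if False else sum(t * f for t, f in zip(theta, phi(s, a)))

def q_learning_step(s, a, r, s_next):
    # TD target bootstraps from the greedy value of the next state
    target = r + gamma * max(q(s_next, b) for b in ACTIONS)
    td_error = target - q(s, a)
    feats = phi(s, a)
    for i in range(N_FEATURES):
        # gradient of the linear Q with respect to theta is phi(s, a)
        theta[i] += alpha * td_error * feats[i]

# a few hand-made transitions: (state, action, reward, next state)
for s, a, r, s2 in [(0, 1, 1.0, 1), (1, 0, 0.0, 0), (0, 1, 1.0, 1)]:
    q_learning_step(s, a, r, s2)

print(round(q(0, 1), 3))  # ≈ 0.191 after the three updates above
```

Because every state-action pair sharing features shares weights, one update generalizes across many pairs, which is what makes this practical for large MDPs.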

Relatively Clever - References - Netflix