My reduction works as follows: first the highest-reward choice is chosen, then its reward is adjusted to $-\infty$, and the process is repeated until a set of size $m$ has been achieved. This is essentially how I've always seen it done in the past (e.g., construct a regressor, and fill positions in sorted order), but for this greedy approach to work the positional dependence cannot be arbitrary: it must be that $$\sum_{i=1}^m \max_{a_i \in A \setminus \{a^*_1, \ldots, a^*_{i-1}\}} r(a_i, i) = \max_{a \in \tilde A^m} \sum_{i=1}^m r(a_i, i).$$ Here $r: A \times [1, m] \to [0, 1]$ are the (conditionally expected) rewards, $A$ are the actions, $[1, m]$ are the positions, $a^*_i$ is the greedy choice for position $i$, and the right hand side maximum is over vectors of actions $\tilde A^m$ that do not have duplicates (i.e., the same action cannot be chosen at multiple positions). This ``greedy works'' condition is actually a much weaker condition than I'm used to assuming. For instance, here's an example set of expected rewards for which greedy works:

    Action   r(a,1)   r(a,2)   r(a,3)
      1        10       5        1
      2         2       4        1
      3         1       2        3

The first action has decreasing reward by position, while the second action prefers a particular position and the third action has increasing reward by position. So I thought I'd have some fun exploring the above ``greedy works'' condition.
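To make the reduction concrete, here's a minimal Python sketch (greedy_fill, brute_force_best, and total are my own names, not from any library) that runs the greedy fill on the example table above and compares it to exhaustive search over duplicate-free assignments:

    from itertools import permutations

    # r[a][i] = (conditionally expected) reward of action a at position i,
    # taken from the example table above.
    r = [
        [10, 5, 1],  # action 1: decreasing by position
        [ 2, 4, 1],  # action 2: prefers position 2
        [ 1, 2, 3],  # action 3: increasing by position
    ]
    m = 3  # number of positions to fill

    def greedy_fill(r, m):
        # Fill positions in order, each time taking the best remaining action
        # (equivalently, setting the chosen action's future rewards to -inf).
        remaining = set(range(len(r)))
        slate = []
        for i in range(m):
            best = max(remaining, key=lambda a: r[a][i])
            slate.append(best)
            remaining.remove(best)
        return slate

    def total(r, slate):
        return sum(r[a][i] for i, a in enumerate(slate))

    def brute_force_best(r, m):
        # Best duplicate-free assignment of actions to positions.
        return max(permutations(range(len(r)), m), key=lambda s: total(r, s))

    print(total(r, greedy_fill(r, m)), total(r, brute_force_best(r, m)))  # 17 17

On this instance greedy and exhaustive search agree (both totals are 17), which is exactly what the condition above asserts.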
Lemma: Position Maximizers
If there is at least one maximizer of $\sum_{j=1}^m r(a_j, j)$, then there is a maximizer $\tilde a^*$ of $\sum_{j=1}^m r(a_j, j)$ which uses each individual position maximizer $a^*_i$ for all $i \in [1, m]$, where $a^*_i$ maximizes $r(a, i)$.
Proof: Let $a^*_i$ be a maximizer of $r(a, i)$, and assume $\tilde a^*$ is a maximizer of $\sum_{j=1}^m r(a_j, j)$ that does not use $a^*_i$ in any position. Define $a'$ as $\tilde a^*$ with position $i$ replaced by $a^*_i$; note $a'$ is still duplicate-free, since $a^*_i$ did not appear anywhere in $\tilde a^*$. Since $a^*_i$ is a maximizer of $r(a, i)$, the resulting total reward cannot decrease, therefore $a'$ would also be a maximizer of $\sum_{j=1}^m r(a_j, j)$. Repeating this argument for each position $i$ eventually uses an individual position maximizer for every position.
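Here's a quick randomized check of the exchange step, for my own reassurance (check_exchange is a made-up name, and the rewards are drawn uniformly rather than coming from any particular problem): take an optimal duplicate-free slate, and whenever some position maximizer $a^*_i$ is absent from it, drop $a^*_i$ into position $i$; the total never decreases.

    import random
    from itertools import permutations

    def check_exchange(n_actions=5, m=3, trials=200):
        for _ in range(trials):
            r = [[random.random() for _ in range(m)] for _ in range(n_actions)]

            def total(slate):
                return sum(r[a][i] for i, a in enumerate(slate))

            best = max(permutations(range(n_actions), m), key=total)
            for i in range(m):
                a_star = max(range(n_actions), key=lambda a: r[a][i])  # position maximizer
                if a_star not in best:
                    swapped = list(best)
                    swapped[i] = a_star  # the exchange step from the proof
                    assert total(swapped) >= total(best) - 1e-12

    check_exchange()
    print("exchange step never decreased the total")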
Sufficient Condition: Swap Supremacy
If $r(a, i) \geq r(a', i) \implies \forall j > i: r(a, i) + r(a', j) \geq r(a, j) + r(a', i)$, then $$\sum_{i=1}^m \max_{a_i \in A \setminus \{a^*_1, \ldots, a^*_{i-1}\}} r(a_i, i) = \max_{a \in \tilde A^m} \sum_{i=1}^m r(a_i, i).$$ Proof: The proof is by induction on $m$. The base case is $m = 1$, in which case greedy always works.
To show $m - 1 \Rightarrow m$, note from the lemma it follows that there is a maximizer of the right hand side that uses $a^*_1$ in some position. If that position is $j \neq 1$, then construct a new maximizer of the right hand side by swapping positions $1$ and $j$: this is guaranteed not to decrease the total reward due to the precondition of the theorem (applied with $a = a^*_1$, using $r(a^*_1, 1) \geq r(a', 1)$ for the action $a'$ currently in position $1$). Since both the left hand and right hand side of the desired result use $a^*_1$ in position $1$, it can be subtracted from both sides, yielding $$\sum_{i'=1}^{m-1} \max_{a_{i'} \in A_1 \setminus \{a'^*_1, \ldots, a'^*_{i'-1}\}} r'(a_{i'}, i') = \max_{a \in \tilde A_1^{m-1}} \sum_{i'=1}^{m-1} r'(a_{i'}, i'),$$ where $A_1 = A \setminus \{a^*_1\}$, $i' = i - 1$, and $r'(a, i') = r(a, i' + 1)$. Since $r'$ restricted to $A_1$ still satisfies the precondition, the induction hypothesis applies, completing the proof.
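As a sanity check on this sufficient condition, here's a hedged sketch (swap_supremacy, greedy_total, and brute_total are my names; the reward ranges and trial counts are arbitrary) that draws random reward tables and asserts that whenever the swap-supremacy precondition holds, the greedy total matches the brute-force optimum. Most random draws won't satisfy the precondition, so the assertion only fires on the ones that do.

    import random
    from itertools import permutations

    def swap_supremacy(r, m):
        # r(a,i) >= r(a',i)  implies  r(a,i) + r(a',j) >= r(a,j) + r(a',i) for all j > i.
        n = len(r)
        for a in range(n):
            for ap in range(n):
                for i in range(m):
                    if r[a][i] >= r[ap][i]:
                        for j in range(i + 1, m):
                            if r[a][i] + r[ap][j] < r[a][j] + r[ap][i]:
                                return False
        return True

    def greedy_total(r, m):
        remaining = set(range(len(r)))
        tot = 0
        for i in range(m):
            best = max(remaining, key=lambda a: r[a][i])
            tot += r[best][i]
            remaining.remove(best)
        return tot

    def brute_total(r, m):
        return max(sum(r[a][i] for i, a in enumerate(slate))
                   for slate in permutations(range(len(r)), m))

    random.seed(0)
    n_actions, m, hits = 4, 3, 0
    for _ in range(2000):
        r = [[random.randint(0, 2) for _ in range(m)] for _ in range(n_actions)]
        if swap_supremacy(r, m):
            hits += 1
            assert greedy_total(r, m) == brute_total(r, m)
    print("swap supremacy held in", hits, "draws; greedy matched the optimum in all of them")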
Now to just make this work.