Recently Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav's solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (c.f., reward shaping).
I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQUAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et. al. on a paragraph completion task which humans perform with high skill and for which all state of the art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems only bear a superficial (albeit economically valuable) resemblance to humans.
This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore iterating this kind of goal-setting-and-goal-solving procedure many many times might eventually lead to something worthy of the moniker Artificial Intelligence.
(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a "quick hit", whereas having a hard task to think about has a lot of "replay value".)
Wednesday, July 26, 2017
Monday, July 17, 2017
Tiered Architectures, Counterfactual Learning, and Sample Complexity
I'm on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates which are passed to an “L2” model which reranks and filters the candidates which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BIDAF model is more challenging. However I think there are potential sample complexity benefits as well.
I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.
So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et. al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.
Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indicies with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons) which mitigates the sample complexity.
There was a workshop called ``coarse-to-fine inference'' at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one, I will attend!
I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.
So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et. al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.
Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indicies with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons) which mitigates the sample complexity.
There was a workshop called ``coarse-to-fine inference'' at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one, I will attend!