I only attended NIPS for the Conversational AI workshop, so my thoughts are limited to that. I really liked the subtitle of the workshop: "today's practice and tomorrow's potential." Since I'm on a product team trying to build chatbots that are actually effective, it struck me as exactly the right tone.
Several presentations were related to the Alexa prize. When reading these papers, keep in mind that contestants were subject to extreme sample complexity constraints. Semifinalists had roughly 500 on-policy dialogs, and finalists had fewer than 10 times as many. This is because 1) the Alexa chat function is not the primary purpose of the device, so not all end users participated, and 2) the chats had to be distributed across all contestants.
The result of sample complexity constraints is a “bias against variance”, as I've discussed before. In the Alexa prize, that meant the winners had the architecture of “learned mixture over mostly hand-specified substrategies.” In other words, the (scarce) on-policy data was used only to adjust the mixture weights. (The MILA team had substrategies that were trained unsupervised on forum data, but it looks like the other substrategies were providing most of the benefit.) Sample complexity constraints are pervasive in dialog, but the conditions of the contest were nonetheless more extreme than what I encounter in practice, so if you find yourself with more on-policy data, consider using it more aggressively.
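To make the "learned mixture" idea concrete, here is a minimal sketch, not any contestant's actual system. It assumes three made-up substrategies (`greet`, `ask_followup`, `echo_topic`), a scalar rating per dialog, and a textbook gradient-bandit update standing in for whatever the teams really used. The point is only that the handful of learned parameters are the mixture preferences, which is about all that a few hundred on-policy dialogs can support.

```python
import math
import random

# Hypothetical hand-written substrategies; each maps a user utterance to a reply.
def greet(utterance):
    return "Hello! What would you like to talk about?"

def ask_followup(utterance):
    return "Interesting. Can you tell me more?"

def echo_topic(utterance):
    return "You mentioned: " + utterance

class StrategyMixture:
    """Gradient-bandit mixture: only the mixture preferences are learned."""

    def __init__(self, strategies, learning_rate=0.1, seed=0):
        self.strategies = strategies
        self.preferences = [0.0] * len(strategies)
        self.learning_rate = learning_rate
        self.baseline = 0.0      # running mean of observed rewards
        self.num_updates = 0
        self.rng = random.Random(seed)

    def probabilities(self):
        # Numerically stable softmax over the preferences.
        top = max(self.preferences)
        exps = [math.exp(h - top) for h in self.preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def respond(self, utterance):
        # Sample a substrategy according to the current mixture weights.
        probs = self.probabilities()
        index = self.rng.choices(range(len(probs)), weights=probs)[0]
        return index, self.strategies[index](utterance)

    def update(self, index, reward):
        # Move preferences along the policy gradient of the expected reward.
        self.num_updates += 1
        self.baseline += (reward - self.baseline) / self.num_updates
        advantage = reward - self.baseline
        probs = self.probabilities()
        for i, p in enumerate(probs):
            indicator = 1.0 if i == index else 0.0
            self.preferences[i] += self.learning_rate * advantage * (indicator - p)

def simulated_rating(index):
    # Assumption for the demo only: users always prefer substrategy 1.
    return 1.0 if index == 1 else 0.0

mixture = StrategyMixture([greet, ask_followup, echo_topic])
for _ in range(500):  # roughly a semifinalist's on-policy budget
    index, reply = mixture.respond("I like hiking")
    mixture.update(index, simulated_rating(index))

final_probs = mixture.probabilities()
print([round(p, 3) for p in final_probs])
```

With real (noisy) ratings, 500 dialogs would separate the substrategies far less cleanly than this deterministic toy reward does, which is exactly why the learned part was kept so small.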
Speaking of sample complexity constraints, we have found that pre-training representations on MT tasks à la CoVe is extremely effective in practice for multiple tasks. We are now playing with ELMo-style pre-training, which uses language modeling as the pre-training task (very promising: no parallel corpus needed!).
Another sample complexity related theme I noticed at the workshop was the use of functional role dynamics. Roughly speaking, this is modeling the structure of the dialog independent of the topic. Once topics are abstracted away, the sample complexity of learning what a reasonably structured conversation looks like seems low. Didericksen et al. combined a purely structural L1 model with a simple topically-sensitive L2 (tf-idf) to build a retrieval-based dialog simulator. Analogously, for their Alexa prize submission, Serban et al. learned a dialog simulator from observational data which utilized only functional role and sentiment information, and then applied Q-learning: this was more effective than off-policy REINFORCE with respect to some metrics.
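As a rough illustration of the two-level retrieval idea (structure first, topic second), here is a small sketch. It is not the method from either paper: the toy corpus, the single-label "functional role" of the previous turn, and the helper names are all assumptions made up for this example, and the tf-idf is a plain hand-rolled version.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration): each entry is
# (functional role of the previous turn, previous utterance, response).
CORPUS = [
    ("question", "what time does the pharmacy open", "it opens at nine"),
    ("question", "where is the nearest train station", "two blocks north of here"),
    ("statement", "the train station was crowded today", "that sounds stressful"),
    ("greeting", "hello there", "hi, how can I help"),
]

def tokenize(text):
    return text.lower().split()

def build_idf(documents):
    # Smoothed inverse document frequency over the previous utterances.
    n = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}

def tfidf_vector(text, idf):
    # Terms never seen in the corpus get weight zero.
    counts = Counter(tokenize(text))
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_response(role, utterance, corpus=CORPUS):
    idf = build_idf([prev for _, prev, _ in corpus])
    # L1 (structural): keep only candidates whose preceding turn has the same role.
    candidates = [entry for entry in corpus if entry[0] == role]
    if not candidates:
        return None
    # L2 (topical): rank the survivors by tf-idf cosine similarity.
    query = tfidf_vector(utterance, idf)
    best = max(candidates, key=lambda e: cosine(query, tfidf_vector(e[1], idf)))
    return best[2]

print(retrieve_response("question", "is the train station far"))
```

Note that the crowded-station statement shares topical words with the query but never competes, because the structural filter removes it before any topic matching happens. That separation is what keeps the part that has to be learned from scarce data small.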
Overall the workshop gave me enough optimism to continue plugging away despite the underwhelming performance of current dialog systems.