Tie-Yan Liu (刘铁岩)

Dr. Liu is a Principal Researcher/Research Manager at Microsoft Research Asia, a visiting professor at Carnegie Mellon University (CMU), an honorary professor at the University of Nottingham, and an adjunct professor/doctoral advisor at the University of Science and Technology of China, Sun Yat-sen University, and Nankai University. His research interests include artificial intelligence, machine learning, information retrieval, and data mining. He is widely regarded as a leading figure in the field of learning to rank: he has published dozens of papers on the topic in top international conferences and journals, which have been cited more than eight thousand times, and he was invited by Springer to write the first academic monograph in the field. His research has won several best paper awards, most-cited paper awards, and research breakthrough awards, and has been widely covered by international media such as CNET, BusinessWeek, and NPR. He has served as program committee chair of WINE 2014, ACML 2015, and SOCINFO 2015; organizing committee chair of ICML 2014; tutorial chair of WWW 2014 and SIGIR 2016; doctoral consortium chair of WSDM 2015; demo/exhibit chair of KDD 2012; and area chair or senior program committee member of top conferences including AAAI, IJCAI, NIPS, KDD, SIGIR, and WWW. He is an associate editor of ACM Transactions on Information Systems and serves on the editorial boards of the international journals Information Retrieval and Foundations and Trends in Information Retrieval. He is a senior member of the IEEE, the ACM, and the China Computer Federation (CCF), and a distinguished speaker of the CCF.

Talk Title: Knowledge-Powered Word Embedding for Verbal IQ Tests

Abstract: In this research, we adopt word embedding technologies to tackle the verbal IQ test. Most existing word embedding technologies take each word as a basic unit and obtain its embedding by statistical learning from co-occurrence information in the sliding windows of a text corpus. However, this learning procedure suffers from a few issues: (i) many words have multiple senses, and a single embedding vector simply cannot reflect their diverse meanings; (ii) not all words have statistically sufficient co-occurrence data from which to learn reliable embeddings; (iii) free text data usually have missing (biased) information and contain noise, so the learned embeddings might not be fully consistent with human knowledge. As a result, direct application of existing word embedding techniques cannot achieve desirable performance on verbal IQ tests, which usually consider multiple senses of a (polysemous) word, focus on the rare senses, and examine complex relations among words (senses). To tackle these challenges, we propose several new technologies, including (i) Mixture of Skip-Grams, which adopts an EM approach to simultaneously learn word embeddings and perform word sense disambiguation; (ii) K-NET, which leverages morphological rules to enhance the embeddings of rare and new words whose co-occurrence information is insufficient; and (iii) ProjectNet, which leverages asymmetric low-rank projections of knowledge graphs to enhance the semantics of word embeddings. Experiments show that these new technologies generate more reliable and meaningful word embeddings than previous work, and that their appropriate use can even beat average human beings (Mechanical Turk workers) in answering the verbal questions in IQ tests.
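The abstract gives no implementation details for Mixture of Skip-Grams, but the EM interplay it describes (learning embeddings while disambiguating senses) can be illustrated with a toy sketch. Everything below, including the corpus, the sense count K, the moving-average prior update, and the negative-sampling loop, is a hypothetical simplification, not the actual model from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus in which "bank" appears in two senses (made-up data).
corpus = ("the bank of the wide river flows past the bank . "
          "the bank lends money and the bank holds deposits .").split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, K, D = len(vocab), 2, 16          # vocabulary size, senses per word, embedding dim
window, lr, neg = 2, 0.1, 3

sense_vecs = 0.01 * rng.standard_normal((V, K, D))  # one input vector per (word, sense)
ctx_vecs = 0.01 * rng.standard_normal((V, D))       # shared output/context vectors
sense_prior = np.full((V, K), 1.0 / K)              # p(sense | word)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(30):
    for t, w in enumerate(corpus):
        wi = w2i[w]
        ctx_ids = [w2i[corpus[j]]
                   for j in range(max(0, t - window), min(len(corpus), t + window + 1))
                   if j != t]
        C = ctx_vecs[ctx_ids]                       # (|ctx|, D)
        # E-step: posterior over the word's senses, scoring each sense against the
        # observed context with the (unnormalised) skip-gram log-likelihood.
        log_post = np.log(sense_prior[wi]) + np.log(sigmoid(sense_vecs[wi] @ C.T)).sum(axis=1)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()                          # responsibilities over senses
        # M-step: responsibility-weighted skip-gram updates with negative sampling.
        for k in range(K):
            v = sense_vecs[wi, k]                   # a view; updated in place below
            for c in ctx_ids:
                u = ctx_vecs[c].copy()
                g = post[k] * (1.0 - sigmoid(v @ u))
                ctx_vecs[c] += lr * g * v
                v += lr * g * u
            for n in rng.integers(0, V, size=neg):  # crude negatives (may hit true pairs)
                u = ctx_vecs[n].copy()
                g = post[k] * (0.0 - sigmoid(v @ u))
                ctx_vecs[n] += lr * g * v
                v += lr * g * u
        # Simplification: track the sense prior with a moving average of the
        # responsibilities instead of a full batch M-step.
        sense_prior[wi] = 0.9 * sense_prior[wi] + 0.1 * post
```

After training, sense_vecs[w2i["bank"], k] holds K candidate vectors for "bank", and sense_prior records how often each sense took responsibility for its occurrences.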
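ProjectNet's exact formulation is likewise not spelled out in the abstract. As a rough illustration of "asymmetric low-rank projections of knowledge graphs", the sketch below scores a (head, relation, tail) triple by projecting the two embeddings through separate low-rank head-side and tail-side matrices, and trains with a margin ranking loss against randomly corrupted tails. The graph, factorization, and loss are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical micro knowledge graph of (head, relation, tail) triples.
entities = ["dog", "cat", "animal", "poodle"]
relations = ["is_a"]
triples = [("dog", "is_a", "animal"), ("cat", "is_a", "animal"), ("poodle", "is_a", "dog")]
e2i = {e: i for i, e in enumerate(entities)}
r2i = {r: i for i, r in enumerate(relations)}
E, R, D, rank = len(entities), len(relations), 16, 4

emb = 0.1 * rng.standard_normal((E, D))  # word embeddings, shared with the text model
# Asymmetric projections per relation: a head-side map A = A1 @ A2 and a distinct
# tail-side map B = B1 @ B2, each factored as (D x rank)(rank x D) to stay low-rank.
A1 = 0.1 * rng.standard_normal((R, D, rank)); A2 = 0.1 * rng.standard_normal((R, rank, D))
B1 = 0.1 * rng.standard_normal((R, D, rank)); B2 = 0.1 * rng.standard_normal((R, rank, D))

def distance(h, r, t):
    """Squared distance between projected head and projected tail (small = plausible)."""
    d = A1[r] @ (A2[r] @ emb[h]) - B1[r] @ (B2[r] @ emb[t])
    return d @ d

margin, lr = 1.0, 0.05
for epoch in range(200):
    for hs, rs, ts in triples:
        h, r, t = e2i[hs], r2i[rs], e2i[ts]
        t_neg = int(rng.integers(E))              # corrupt the tail at random
        if t_neg == t or distance(h, r, t) + margin <= distance(h, r, t_neg):
            continue                              # margin already satisfied
        # SGD on the margin ranking loss: pull the true triple together (sign=+1),
        # push the corrupted one apart (sign=-1). Gradients follow the chain rule.
        for tt, sign in ((t, 1.0), (t_neg, -1.0)):
            h_old, t_old = emb[h].copy(), emb[tt].copy()
            d = A1[r] @ (A2[r] @ h_old) - B1[r] @ (B2[r] @ t_old)
            a1d, b1d = A1[r].T @ d, B1[r].T @ d
            g = 2.0 * sign * lr
            emb[h] -= g * (A2[r].T @ a1d)
            emb[tt] += g * (B2[r].T @ b1d)
            A1[r] -= g * np.outer(d, A2[r] @ h_old)
            A2[r] -= g * np.outer(a1d, h_old)
            B1[r] += g * np.outer(d, B2[r] @ t_old)
            B2[r] += g * np.outer(b1d, t_old)
```

In a joint model, the same emb matrix would also receive the usual text-corpus skip-gram updates, so the knowledge-graph constraints act as a semantic regularizer on the word embeddings.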

Talk Title: Making Super Large-Scale Machine Learning Possible

Abstract: The capability of learning super big models is becoming crucial in this big data era. For example, one may need to learn an LDA model with millions of topics, or a word embedding model with billions of parameters. However, it turns out that training such big models is very challenging: with state-of-the-art machine learning technologies, one has to use a huge number (e.g., thousands) of machines for this purpose, which is clearly beyond the reach of common machine learning practitioners. In this research, we want to answer the question of whether it is possible to train super big machine learning models using just a modest computer cluster. To achieve this goal, we focus on two kinds of innovations. First, we make important modifications to the training procedures of existing machine learning algorithms to make them much more cost-effective. For instance, we propose a new, highly efficient O(1) Metropolis-Hastings sampling algorithm for LDA, whose running cost is (surprisingly) independent of model size and which empirically converges nearly an order of magnitude faster than current state-of-the-art LDA samplers. As another instance, we adopt a new, distribution-based training process for word embedding, which transforms huge training data into a modest-sized histogram and therefore significantly reduces the memory and disk requirements. Second, we develop a new parameter-server-based distributed machine learning framework that specifically targets the efficient training of super big models. By using separate data structures for high- and low-frequency parameters, the framework allows extremely large models to fit in memory while maintaining high access speed. By using a model scheduling scheme, the framework allows each worker machine to pull sub-models as needed from the parameter server, resulting in frugal use of limited memory capacity and network bandwidth. By automatically pipelining model training and network communication, the framework achieves very high training speed under a wide range of computational-resource and network conditions. Our experimental results show that with a modest cluster of just 24 machines, we can train an LDA model with 1 million topics and a 20-million-word vocabulary, or a word embedding model with 1,000 dimensions and a 20-million-word vocabulary, on a Web document collection with 200 billion tokens, a scale not yet reported even with thousands of machines.
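The abstract does not spell out how the O(1) Metropolis-Hastings sampler works, but the central trick can be sketched: draw topic proposals in O(1) from periodically rebuilt (hence slightly stale) alias tables, and correct for the staleness with a Metropolis-Hastings accept/reject step that evaluates the true collapsed-Gibbs conditional at only two points. The sketch below is a single word-proposal cycle on a toy corpus; the actual sampler (including its alternating doc/word proposals) is more involved, and all sizes here are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

def build_alias(weights):
    """Walker's alias method: O(n) construction, O(1) sampling of a fixed discrete dist."""
    n = len(weights)
    prob = np.asarray(weights, dtype=float) * n / np.sum(weights)
    alias = np.zeros(n, dtype=int)
    small = [i for i in range(n) if prob[i] < 1.0]
    large = [i for i in range(n) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for leftovers in (small, large):        # absorb floating-point leftovers
        for i in leftovers:
            prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    i = int(rng.integers(len(prob)))
    return i if rng.random() < prob[i] else int(alias[i])

# Toy LDA state.
K, V, alpha, beta = 8, 50, 0.1, 0.01
docs = [rng.integers(0, V, size=30) for _ in range(10)]
z = [rng.integers(0, K, size=len(d)) for d in docs]
n_dk = np.zeros((len(docs), K)); n_wk = np.zeros((V, K)); n_k = np.zeros(K)
for d, (ws, zs) in enumerate(zip(docs, z)):
    for w, k in zip(ws, zs):
        n_dk[d, k] += 1; n_wk[w, k] += 1; n_k[k] += 1

def refresh_proposals():
    """Snapshot word-topic counts and build one alias table per word: q_w(k) ~ n_wk + beta."""
    stale = n_wk + beta
    return stale, [build_alias(stale[w]) for w in range(V)]

stale, tables = refresh_proposals()
for sweep in range(5):
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for i, w in enumerate(ws):
            s = zs[i]
            n_dk[d, s] -= 1; n_wk[w, s] -= 1; n_k[s] -= 1   # take the token out
            t = alias_draw(*tables[w])                      # O(1) proposal draw
            # True collapsed-Gibbs conditional, evaluated only at s and t.
            p_t = (n_dk[d, t] + alpha) * (n_wk[w, t] + beta) / (n_k[t] + V * beta)
            p_s = (n_dk[d, s] + alpha) * (n_wk[w, s] + beta) / (n_k[s] + V * beta)
            # MH correction against the stale proposal density actually used to draw t.
            if rng.random() < min(1.0, (p_t * stale[w, s]) / (p_s * stale[w, t])):
                s = t
            zs[i] = s
            n_dk[d, s] += 1; n_wk[w, s] += 1; n_k[s] += 1
    stale, tables = refresh_proposals()                     # amortised rebuild
```

Rebuilding the tables once per sweep amortises their O(K) construction over all tokens, which is what makes the per-token cost independent of the number of topics.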
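Similarly, the distribution-based word-embedding training is only named in the abstract. One minimal reading of the idea is sketched below: a single pass over the corpus collapses it into a histogram of (word, context) pair counts, and training then iterates over the histogram with count-weighted updates instead of re-streaming the raw text. The corpus, update rule, and all constants here are illustrative assumptions:

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for a huge corpus; one pass reduces it to a pair histogram.
corpus = "the quick brown fox jumps over the lazy dog".split() * 100
window = 2
pairs = Counter()
for t, w in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            pairs[(w, corpus[j])] += 1
# From here on only the histogram is needed; the raw text can be discarded.

vocab = sorted({w for pair in pairs for w in pair})
w2i = {w: i for i, w in enumerate(vocab)}
V, D, lr, neg = len(vocab), 16, 1e-4, 2
W = 0.01 * rng.standard_normal((V, D))   # input vectors
C = 0.01 * rng.standard_normal((V, D))   # output/context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10):
    for (w, c), n in pairs.items():
        wi, ci = w2i[w], w2i[c]
        # One count-weighted update stands in for the n identical streaming updates
        # that plain skip-gram with negative sampling would have performed.
        v = W[wi].copy()
        g = lr * n * (1.0 - sigmoid(W[wi] @ C[ci]))
        W[wi] += g * C[ci]
        C[ci] += g * v
        for nidx in rng.integers(0, V, size=neg):   # crude negative samples
            u, v = C[nidx].copy(), W[wi].copy()
            g = lr * n * (0.0 - sigmoid(v @ u))
            W[wi] += g * u
            C[nidx] += g * v
```

The memory argument carries over from this toy version: the histogram's size is bounded by the number of distinct pairs, not by the number of tokens in the corpus.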