ﻻ يوجد ملخص باللغة العربية
Deep matching models aim to facilitate search engines retrieving more relevant documents by mapping queries and documents into semantic vectors in the first-stage retrieval. When leveraging BERT as the deep matching model, the attention score across two words are solely built upon local contextualized word embeddings. It lacks prior global knowledge to distinguish the importance of different words, which has been proved to play a critical role in information retrieval tasks. In addition to this, BERT only performs attention across sub-words tokens which weakens whole word attention representation. We propose a novel Global Weighted Self-Attention (GLOW) network for web document search. GLOW fuses global corpus statistics into the deep matching model. By adding prior weights into attention generation from global information, like BM25, GLOW successfully learns weighted attention scores jointly with query matrix $Q$ and key matrix $K$. We also present an efficient whole word weight sharing solution to bring prior whole word knowledge into sub-words level attention. It aids Transformer to learn whole word level attention. To make our models applicable to complicated web search scenarios, we introduce combined fields representation to accommodate documents with multiple fields even with variable number of instances. We demonstrate GLOW is more efficient to capture the topical and semantic representation both in queries and documents. Intrinsic evaluation and experiments conducted on public data sets reveal GLOW to be a general framework for document retrieve task. It significantly outperforms BERT and other competitive baselines by a large margin while retaining the same model complexity with BERT.
We present our solution to the Yandex Personalized Web Search Challenge. The aim of this challenge was to use the historical search logs to personalize top-N document rankings for a set of test users. We used over 100 features extracted from user- an
Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graph. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. Howev
This paper presents GEneric iNtent Encoder (GEN Encoder) which learns a distributed representation space for user intent in search. Leveraging large scale user clicks from Bing search logs as weak supervision of user intent, GEN Encoder learns to map
We analyze access statistics of a hundred and fifty blog entries and news articles, for periods of up to three years. Access rate falls as an inverse power of time passed since publication. The power law holds for periods of up to thousand days. The
Topic models are popular models for analyzing a collection of text documents. The models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection is where documents are ne