
Online Algorithms for Estimating Change Rates of Web Pages

Posted by Konstantin Avrachenkov
Publication date: 2020
Research field: Information engineering
Paper language: English





For providing quick and accurate search results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler for tracking changes across various web pages. It would be ideal if the crawler managed to update the local snapshot as soon as a page changed on the web. However, finite bandwidth availability and server restrictions mean that there is a bound on how frequently the different pages can be crawled. This brings forth the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency being within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All these schemes need only partial information about the page change process, i.e., they only need to know whether or not the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, while the third is an extension of the second and involves an additional momentum term. For all of these schemes, we prove convergence and also provide their convergence rates. As far as we know, the result concerning the third estimator is quite novel; specifically, it is the first convergence-type result for a stochastic approximation algorithm with momentum. Finally, we provide numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with existing ones (e.g., MLE).
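To make the observation model concrete, here is a minimal Python sketch of the first two estimation ideas described in the abstract, under the simplifying assumption that crawl intervals are i.i.d. exponential with a known rate w. The function names, the 1/k^0.75 step-size schedule, and the heavy-ball form of the momentum variant are illustrative choices of ours, not the paper's exact recursions.

```python
import numpy as np

def simulate_crawl_indicators(delta, w, n, rng):
    """Simulate n crawl intervals.

    Assumption: the page changes according to a Poisson process with
    (unknown) rate `delta`, and the crawler accesses it after i.i.d.
    Exp(w) intervals. The only observation per crawl is the indicator
    I = 1 if the page changed at least once since the previous crawl.
    """
    tau = rng.exponential(1.0 / w, size=n)               # crawl intervals
    ind = rng.random(n) < 1.0 - np.exp(-delta * tau)     # P(change in tau)
    return tau, ind.astype(float)

def lln_estimate(ind, w):
    """LLN-style estimate: with Exp(w) intervals, the empirical change
    fraction converges to delta / (delta + w); invert that map."""
    p = np.clip(ind.mean(), 1e-9, 1 - 1e-9)
    return w * p / (1.0 - p)

def sa_estimate(ind, w, x0=1.0, momentum=0.0):
    """Stochastic-approximation estimate: drive the residual
    I_k - x_k / (x_k + w) to zero; `momentum > 0` gives a heavy-ball
    style variant of the same recursion."""
    x, v = x0, 0.0
    for k, i_k in enumerate(ind, start=1):
        step = 1.0 / (k ** 0.75)                  # diminishing step size
        v = momentum * v + (i_k - x / (x + w))    # (averaged) residual
        x = max(x + step * v, 1e-9)               # keep the rate positive
    return x

rng = np.random.default_rng(0)
_, ind = simulate_crawl_indicators(delta=3.0, w=2.0, n=20_000, rng=rng)
print("LLN          :", lln_estimate(ind, w=2.0))
print("SA           :", sa_estimate(ind, w=2.0))
print("SA + momentum:", sa_estimate(ind, w=2.0, momentum=0.9))
```

All three calls should return values close to the true rate 3.0; the SA variants differ from the LLN estimate mainly in how quickly they forget the initial guess x0.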




Read also

Effective resistance is an important metric that measures the similarity of two vertices in a graph. It has found applications in graph clustering, recommendation systems and network reliability, among others. In spite of the importance of effective resistances, we still lack efficient algorithms to exactly compute or approximate them on massive graphs. In this work, we design several \emph{local algorithms} for estimating effective resistances, which are algorithms that only read a small portion of the input while still having provable performance guarantees. To illustrate, our main algorithm approximates the effective resistance between any vertex pair $s,t$ with an arbitrarily small additive error $\varepsilon$ in time $O(\mathrm{poly}(\log n/\varepsilon))$, whenever the underlying graph has bounded mixing time. We perform an extensive empirical study on several benchmark datasets, validating the performance of our algorithms.
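This is not the local algorithm from that paper, but a quick way to see the quantity being estimated: for an unweighted graph, the effective resistance between s and t equals their expected commute time divided by twice the number of edges, so a random-walk Monte Carlo estimator only ever touches the neighbourhoods it walks through. The toy graph, trial count, and function name below are our own illustrative choices.

```python
import random

def commute_time_resistance(adj, s, t, n_trials=2000, rng=None):
    """Monte Carlo sketch: R_eff(s, t) = E[commute time s -> t -> s] / (2m)
    for a connected unweighted graph with m edges. Each trial runs a
    simple random walk from s to t and back."""
    rng = rng or random.Random(0)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2    # number of edges
    total_steps = 0
    for _ in range(n_trials):
        for a, b in ((s, t), (t, s)):                   # s -> t, then t -> s
            node = a
            while node != b:
                node = rng.choice(adj[node])
                total_steps += 1
    return total_steps / n_trials / (2 * m)

# Toy example: a 4-cycle; the exact effective resistance between
# opposite corners is 1.0 (two parallel 2-ohm paths).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(commute_time_resistance(adj, 0, 2))
```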
We argue that relationships between Web pages are functions of the user's intent. We identify a class of Web tasks - information-gathering - that can be facilitated by a search engine that provides links to pages related to the page the user is currently viewing. We define three kinds of intentional relationships that correspond to whether the user is (a) seeking sources of information, (b) reading pages which provide information, or (c) surfing through pages as part of an extended information-gathering process. We show that these three relationships can be productively mined using a combination of textual and link information, and provide three scoring mechanisms that correspond to them: {\em SeekRel}, {\em FactRel} and {\em SurfRel}. These scoring mechanisms incorporate both textual and link information. We build a set of capacitated subnetworks - each corresponding to a particular keyword - that mirror the interconnection structure of the World Wide Web. The scores are computed as flows on these subnetworks. The capacities of the links are derived from the {\em hub} and {\em authority} values of the nodes they connect, following the work of Kleinberg (1998) on assigning authority to pages in hyperlinked environments. We evaluated our scoring mechanisms by running experiments on four data sets taken from the Web. We present user evaluations of the relevance of the top results returned by our scoring mechanisms and compare them to the top results returned by Google's Similar Pages feature and the {\em Companion} algorithm proposed by Dean and Henzinger (1999).
Topic models are popular models for analyzing a collection of text documents. The models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection is one where documents are nested inside a higher-order structure such as stories in a book, articles in a journal, or web pages in a web site. In a single collection of documents, topics are global, or shared across all documents. For web pages nested in web sites, topic frequencies likely vary between web sites. Within a web site, topic frequencies almost certainly vary between web pages. A hierarchical prior for topic frequencies models this hierarchical structure and specifies a global topic distribution. Web site topic distributions vary around the global topic distribution and web page topic distributions vary around the web site topic distribution. In a nested collection of web pages, some topics are likely unique to a single web site. Local topics in a nested collection of web pages are topics unique to one web site. For US local health department web sites, brief inspection of the text shows local geographic and news topics specific to each department that are not present in others. Topic models that ignore the nesting may identify local topics, but do not label topics as local nor do they explicitly identify the web site that owns the local topic. For web pages nested inside web sites, local topic models explicitly label local topics and identify the owning web site. This identification can be used to adjust inferences about global topics. In the US public health web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to identify local topics, adjust inferences about whether web sites cover particular health topics, and study how well health topics are covered.
Reduction in the cost of Network Cameras along with a rise in connectivity enables entities all around the world to deploy vast arrays of camera networks. Network cameras offer real-time visual data that can be used for studying traffic patterns, emergency response, security, and other applications. Although many sources of Network Camera data are available, collecting the data remains difficult due to variations in programming interfaces and website structures. Previous solutions rely on manually parsing the target website, taking many hours to complete. We create a general and automated solution for aggregating Network Camera data spread across thousands of uniquely structured web pages. We analyze heterogeneous web page structures and identify common characteristics among 73 sample Network Camera websites (each website has multiple web pages). These characteristics are then used to build an automated camera discovery module that crawls and aggregates Network Camera data. Our system successfully extracts 57,364 Network Cameras from 237,257 unique web pages.
Vasileios Lampos, 2016
We provide a brief technical description of an online platform for disease monitoring, titled the Flu Detector (fludetector.cs.ucl.ac.uk). Flu Detector, in its current version (v.0.5), uses either Twitter or Google search data in conjunction with statistical Natural Language Processing models to estimate the rate of influenza-like illness in the population of England. Its back-end is a live service that collects online data, utilises modern technologies for large-scale text processing, and finally applies statistical inference models that are trained offline. The front-end visualises the various disease rate estimates. Notably, the models based on Google data achieve a high level of accuracy with respect to the most recent four flu seasons in England (2012/13 to 2015/16). This highlights Flu Detector's great potential to become a complementary source to traditional domestic flu surveillance schemes.

