
Effective Blog Pages Extractor for Better UGC Accessing

Posted by: Kui Zhao
Publication date: 2017
Research field: Informatics Engineering
Paper language: English





Blogs are becoming an increasingly popular medium for information publishing. Besides the main content, most blog pages nowadays also contain noisy information such as advertisements. Removing these unrelated elements not only improves the user experience but also makes it easier to adapt the content to various devices such as mobile phones. Although template-based extractors are highly accurate, they can be expensive, since a large number of templates need to be developed, and they fail once a template is updated. To address these issues, we present a novel template-independent content extractor for blog pages. First, we convert a blog page into a DOM-Tree, where all elements, including the title and body blocks, correspond to subtrees. Then we construct subtree candidate sets for the title and the body blocks respectively, and extract both spatial and content features for the elements contained in each subtree. SVM classifiers for the title and the body blocks are trained using these features. Finally, the classifiers are used to extract the main content from blog pages. We test our extractor on 2,250 blog pages crawled from nine blog sites with clearly different styles and templates. Experimental results verify the effectiveness of our extractor.
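The pipeline above lends itself to a compact illustration. The sketch below assumes lxml for the DOM-Tree and scikit-learn for the SVM classifiers; it enumerates block-level subtrees as candidates and scores them with a trained classifier. The concrete features (text length, link density, tree depth, child count) are illustrative stand-ins, not the paper's exact spatial and content features.

```python
# Minimal sketch of a template-independent block extractor, assuming lxml + scikit-learn.
import numpy as np
from lxml import html
from sklearn.svm import SVC

BLOCK_TAGS = {"div", "article", "section", "h1", "h2"}

def candidate_subtrees(page_source):
    """Enumerate block-level subtrees as title/body candidates."""
    tree = html.fromstring(page_source)
    return [el for el in tree.iter() if el.tag in BLOCK_TAGS]

def subtree_features(el):
    """Simple content and spatial features for one candidate subtree (illustrative only)."""
    text = el.text_content() or ""
    link_text = sum(len(a.text_content() or "") for a in el.iter("a"))
    link_density = link_text / max(len(text), 1)   # content feature
    depth = len(list(el.iterancestors()))          # crude spatial feature
    return [len(text), link_density, depth, len(list(el))]

def train_block_classifier(labeled_candidates):
    """labeled_candidates: iterable of (candidate_element, is_target_block) pairs."""
    X = np.array([subtree_features(el) for el, _ in labeled_candidates])
    y = np.array([int(flag) for _, flag in labeled_candidates])
    return SVC(kernel="rbf").fit(X, y)

def extract_block(page_source, clf):
    """Return the text of the candidate the classifier ranks highest."""
    cands = candidate_subtrees(page_source)
    scores = clf.decision_function(np.array([subtree_features(c) for c in cands]))
    return cands[int(np.argmax(scores))].text_content()
```

In this sketch one classifier is trained per block type (title, body), mirroring the two candidate sets described above; at extraction time the highest-scoring candidate in each set is taken as the corresponding block.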



Read also

For providing quick and accurate search results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler for tracking changes across various web pages. It would be ideal if the crawler managed to update the local snapshot as soon as a page changed on the web. However, finite bandwidth availability and server restrictions mean that there is a bound on how frequently the different pages can be crawled. This brings forth the following optimisation problem: maximise the freshness of the local cache subject to the crawling frequency being within the prescribed bounds. Recently, tractable algorithms have been proposed to solve this optimisation problem under different cost criteria. However, these assume knowledge of exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide three novel schemes for online estimation of page change rates. All these schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since the last crawl instance. Our first scheme is based on the law of large numbers, the second on the theory of stochastic approximation, while the third is an extension of the second and involves an additional momentum term. For all of these schemes, we prove convergence and also provide their convergence rates. As far as we know, the results concerning the third estimator are quite novel; specifically, this is the first convergence-type result for a stochastic approximation algorithm with momentum. Finally, we provide some numerical experiments (on real as well as synthetic data) to compare the performance of our proposed estimators with existing ones (e.g., MLE).
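To make the observation model concrete, here is a minimal sketch assuming pages change as a Poisson process with unknown rate and crawls happen at a fixed interval; the moment-matching update is only a law-of-large-numbers style illustration, not the paper's exact estimators.

```python
# Sketch only: estimate a page's change rate from changed/not-changed bits,
# assuming Poisson changes with rate `lam` and a fixed crawl interval `tau`.
import math
import random

def crawl_indicator(lam, tau, rng):
    """Simulate the observable bit: did the page change within the interval tau?"""
    return rng.random() < 1.0 - math.exp(-lam * tau)

def estimate_rate(indicators, tau):
    """Invert P(change) = 1 - exp(-lam * tau) using the empirical change frequency."""
    p_hat = sum(indicators) / len(indicators)
    p_hat = min(p_hat, 0.999)  # guard against log(0) when every crawl saw a change
    return -math.log(1.0 - p_hat) / tau

rng = random.Random(0)
true_lam, tau, n_crawls = 0.5, 1.0, 10_000
obs = [crawl_indicator(true_lam, tau, rng) for _ in range(n_crawls)]
print(f"estimated rate ~ {estimate_rate(obs, tau):.3f} (true {true_lam})")
```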
We argue that relationships between Web pages are functions of the user's intent. We identify a class of Web tasks - information-gathering - that can be facilitated by a search engine that provides links to pages related to the page the user is currently viewing. We define three kinds of intentional relationships that correspond to whether the user is (a) seeking sources of information, (b) reading pages which provide information, or (c) surfing through pages as part of an extended information-gathering process. We show that these three relationships can be productively mined using a combination of textual and link information and provide three scoring mechanisms that correspond to them: SeekRel, FactRel and SurfRel. These scoring mechanisms incorporate both textual and link information. We build a set of capacitated subnetworks - each corresponding to a particular keyword - that mirror the interconnection structure of the World Wide Web. The scores are computed from flows on these subnetworks. The capacities of the links are derived from the hub and authority values of the nodes they connect, following the work of Kleinberg (1998) on assigning authority to pages in hyperlinked environments. We evaluated our scoring mechanisms by running experiments on four data sets taken from the Web. We present user evaluations of the relevance of the top results returned by our scoring mechanisms and compare those to the top results returned by Google's Similar Pages feature and the Companion algorithm proposed by Dean and Henzinger (1999).
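A rough sketch of the flow-based scoring idea, assuming networkx for HITS and max-flow and a link capacity equal to the product of the source's hub score and the target's authority score; the actual SeekRel/FactRel/SurfRel mechanisms build per-keyword subnetworks and choose sources, sinks and capacities differently.

```python
# Sketch: score relatedness of two pages by max-flow on a hub/authority-capacitated graph.
import networkx as nx

def flow_relatedness(link_graph, page_a, page_b):
    """Score how related page_b is to page_a via max-flow on a capacitated copy."""
    hubs, auths = nx.hits(link_graph, max_iter=200)
    capacitated = nx.DiGraph()
    for u, v in link_graph.edges():
        capacitated.add_edge(u, v, capacity=hubs[u] * auths[v])
    flow_value, _ = nx.maximum_flow(capacitated, page_a, page_b)
    return flow_value

# Toy usage on a tiny hyperlink graph.
g = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(flow_relatedness(g, "a", "d"))
```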
We address the problem of cross-referencing text fragments with Wikipedia pages in a way that resolves synonymy and polysemy issues accurately and efficiently. We take inspiration from a recent flow of work [Cucerzan 2007, Mihalcea and Csomai 2007, Milne and Witten 2008, Chakrabarti et al. 2009] and extend their scenario from the annotation of long documents to the annotation of short texts, such as snippets of search-engine results, tweets, news, blogs, etc. These short and poorly composed texts pose new challenges in terms of efficiency and effectiveness of the annotation process, which we address by designing and engineering TAGME, the first system that performs accurate and on-the-fly annotation of these short textual fragments. A large set of experiments shows that TAGME outperforms state-of-the-art algorithms when they are adapted to work on short texts, and that it is fast and competitive on long texts.
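The anchor-spotting plus disambiguation idea behind such on-the-fly annotators can be sketched as follows; the toy anchor dictionary stands in for Wikipedia anchor-text statistics, and the real system additionally uses a collective relatedness vote among candidate pages.

```python
# Sketch: spot anchors in a short text and link each to its most common Wikipedia page.
from typing import Dict, List, Tuple

# anchor text -> list of (wikipedia_page, commonness prior); toy stand-in data
ANCHORS: Dict[str, List[Tuple[str, float]]] = {
    "jaguar": [("Jaguar_(animal)", 0.6), ("Jaguar_Cars", 0.4)],
    "speed": [("Speed", 1.0)],
}

def annotate(text: str, min_prior: float = 0.3) -> List[Tuple[str, str]]:
    """Link each spotted anchor to the candidate page with the highest prior."""
    annotations = []
    for token in text.lower().split():
        candidates = ANCHORS.get(token.strip(".,!?"), [])
        if candidates:
            page, prior = max(candidates, key=lambda c: c[1])
            if prior >= min_prior:
                annotations.append((token, page))
    return annotations

print(annotate("The jaguar reached top speed."))
```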
Job recommendation is a crucial part of the online job recruitment business. To match the right person with the right job, a good representation of job postings is required. Such representations should ideally recommend jobs with fitting titles, aligned skill sets, and a reasonable commute. To address these aspects, we utilize three information graphs (job-job, skill-skill, job-skill) from historical job data to learn a joint representation for both job titles and skills in a shared latent space. This allows us to represent job postings and resumes using both elements, which can subsequently be combined with location. In this paper, we first present how the representation of each component is obtained, and then we discuss how these different representations are combined into one single space to acquire the final representation. The results of comparing the proposed methodology against different baseline methods show significant improvement in terms of relevancy.
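One way to picture the shared latent space is the sketch below, which stacks the job-job, skill-skill and job-skill affinities into a single block matrix and factorizes it; this only illustrates the joint-embedding idea under assumed inputs, not the paper's actual representation learning method.

```python
# Sketch: embed job titles and skills in one shared space via a block-matrix factorization.
import numpy as np

def joint_embedding(job_job, skill_skill, job_skill, dim=2):
    """job_job: (J,J), skill_skill: (S,S), job_skill: (J,S) affinity matrices."""
    J, S = job_skill.shape
    top = np.hstack([job_job, job_skill])
    bottom = np.hstack([job_skill.T, skill_skill])
    A = np.vstack([top, bottom])                     # (J+S, J+S) block adjacency
    U, s, _ = np.linalg.svd(A, full_matrices=False)  # shared latent factors
    emb = U[:, :dim] * np.sqrt(s[:dim])
    return emb[:J], emb[J:]                          # job vectors, skill vectors

# Toy example: 2 jobs, 3 skills.
jj, ss = np.eye(2), np.eye(3)
js = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
job_vecs, skill_vecs = joint_embedding(jj, ss, js)
print(job_vecs.shape, skill_vecs.shape)
```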
Reduction in the cost of Network Cameras, along with a rise in connectivity, enables entities all around the world to deploy vast arrays of camera networks. Network cameras offer real-time visual data that can be used for studying traffic patterns, emergency response, security, and other applications. Although many sources of Network Camera data are available, collecting the data remains difficult due to variations in programming interfaces and website structures. Previous solutions rely on manually parsing the target website, taking many hours to complete. We create a general and automated solution for aggregating Network Camera data spread across thousands of uniquely structured web pages. We analyze heterogeneous web page structures and identify common characteristics among 73 sample Network Camera websites (each website has multiple web pages). These characteristics are then used to build an automated camera discovery module that crawls and aggregates Network Camera data. Our system successfully extracts 57,364 Network Cameras from 237,257 unique web pages.
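As an illustration only, the sketch below scans a page's HTML for URLs that look like camera snapshots or streams; the heuristic patterns are assumptions for demonstration and are not the common characteristics identified in the paper.

```python
# Sketch: find URLs on a web page that resemble camera image/stream endpoints.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

STREAM_HINTS = re.compile(r"(snapshot|mjpg|mjpeg|axis-cgi|\.m3u8)", re.IGNORECASE)

def discover_camera_urls(page_url: str) -> list:
    """Return absolute URLs on the page that look like camera endpoints (heuristic)."""
    html_text = urlopen(page_url, timeout=10).read().decode("utf-8", errors="ignore")
    candidates = re.findall(r'(?:src|href)=["\']([^"\']+)["\']', html_text)
    return sorted({urljoin(page_url, u) for u in candidates if STREAM_HINTS.search(u)})

# Example with a hypothetical URL:
# print(discover_camera_urls("http://example.org/traffic-cameras"))
```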