Preprocessing: A Prerequisite for Discovering Patterns in WUM Process

172 0 0.0 ( 0 )

Download Cite

Added by Ramya C

Publication date 2011

fields Informatics Engineering

and research's language is English

Authors C. Ramya - K S Shreedhara - G Kavitha

Databases

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Web log data is usually diverse and voluminous. This data must be assembled into a consistent, integrated and comprehensive view, in order to be used for pattern discovery. Without properly cleaning, transforming and structuring the data prior to the analysis, one cannot expect to find meaningful patterns. As in most data mining applications, data preprocessing involves removing and filtering redundant and irrelevant data, removing noise, transforming and resolving any inconsistencies. In this paper, a complete preprocessing methodology having merging, data cleaning, user/session identification and data formatting and summarization activities to improve the quality of data by reducing the quantity of data has been proposed. To validate the efficiency of the proposed preprocessing methodology, several experiments are conducted and the results show that the proposed methodology reduces the size of Web access log files down to 73-82% of the initial size and offers richer logs that are structured for further stages of Web Usage Mining (WUM). So preprocessing of raw data in this WUM process is the central theme of this paper.

rate research

Preprocessing: A Prerequisite for Discovering Patterns in Web Usage Mining Process

108 - C. Ramya , G. Kavitha , Dr. K. S. Shreedhara 2011

Web log data is usually diverse and voluminous. This data must be assembled into a consistent, integrated and comprehensive view, in order to be used for pattern discovery. Without properly cleaning, transforming and structuring the data prior to the analysis, one cannot expect to find meaningful patterns. As in most data mining applications, data preprocessing involves removing and filtering redundant and irrelevant data, removing noise, transforming and resolving any inconsistencies. In this paper, a complete preprocessing methodology having merging, data cleaning, user/session identification and data formatting and summarization activities to improve the quality of data by reducing the quantity of data has been proposed. To validate the efficiency of the proposed preprocessing methodology, several experiments are conducted and the results show that the proposed methodology reduces the size of Web access log files down to 73-82% of the initial size and offers richer logs that are structured for further stages of Web Usage Mining (WUM). So preprocessing of raw data in this WUM process is the central theme of this paper.

Databases

Discovering High Utility-Occupancy Patterns from Uncertain Data

135 - Chien-Ming Chen , Lili Chen , Wensheng Gan 2020

It is widely known that there is a lot of useful information hidden in big data, leading to a new saying that data is money. Thus, it is prevalent for individuals to mine crucial information for utilization in many real-world applications. In the past, studies have considered frequency. Unfortunately, doing so neglects other aspects, such as utility, interest, or risk. Thus, it is sensible to discover high-utility itemsets (HUIs) in transaction databases while utilizing not only the quantity but also the predefined utility. To find patterns that can represent the supporting transaction, a recent study was conducted to mine high utility-occupancy patterns whose contribution to the utility of the entire transaction is greater than a certain value. Moreover, in realistic applications, patterns may not exist in transactions but be connected to an existence probability. In this paper, a novel algorithm, called High-Utility-Occupancy Pattern Mining in Uncertain databases (UHUOPM), is proposed. The patterns found by the algorithm are called Potential High Utility Occupancy Patterns (PHUOPs). This algorithm divides user preferences into three factors, including support, probability, and utility occupancy. To reduce memory cost and time consumption and to prune the search space in the algorithm as mentioned above, probability-utility-occupancy list (PUO-list) and probability-frequency-utility table (PFU-table) are used, which assist in providing the downward closure property. Furthermore, an original tree structure, called support count tree (SC-tree), is constructed as the search space of the algorithm. Finally, substantial experiments were conducted to evaluate the performance of proposed UHUOPM algorithm on both real-life and synthetic datasets, particularly in terms of effectiveness and efficiency.

Databases

An Efficient Preprocessing Methodology for Discovering Patterns and Clustering of Web Users using a Dynamic ART1 Neural Network

259 - C. Ramya , , G. Kavitha 2011

In this paper, a complete preprocessing methodology for discovering patterns in web usage mining process to improve the quality of data by reducing the quantity of data has been proposed. A dynamic ART1 neural network clustering algorithm to group users according to their Web access patterns with its neat architecture is also proposed. Several experiments are conducted and the results show the proposed methodology reduces the size of Web log files down to 73-82% of the initial size and the proposed ART1 algorithm is dynamic and learns relatively stable quality clusters.

Neural and Evolutionary Computing

Discovering Sequential Patterns in a UK General Practice Database

174 - Jenna Reps , Jonathan M. Garibaldi , Uwe Aickelin 2013

The wealth of computerised medical information becoming readily available presents the opportunity to examine patterns of illnesses, therapies and responses. These patterns may be able to predict illnesses that a patient is likely to develop, allowing the implementation of preventative actions. In this paper sequential rule mining is applied to a General Practice database to find rules involving a patients age, gender and medical history. By incorporating these rules into current health-care a patient can be highlighted as susceptible to a future illness based on past or current illnesses, gender and year of birth. This knowledge has the ability to greatly improve health-care and reduce health-care costs.

Machine Learning Computational Engineering Applications

Constant delay enumeration with FPT-preprocessing for conjunctive queries of bounded submodular width

66 - Christoph Berkholz , Nicole Schweikardt 2020

Marx (STOC~2010, J.~ACM 2013) introduced the notion of submodular width of a conjunctive query (CQ) and showed that for any class $Phi$ of Boolean CQs of bounded submodular width, the model-checking problem for $Phi$ on the class of all finite structures is fixed-parameter tractable (FPT). Note that for non-Boolean queries, the size of the query result may be far too large to be computed entirely within FPT time. We investigate the free-connex variant of submodular width and generalise Marxs result to non-Boolean queries as follows: For every class $Phi$ of CQs of bounded free-connex submodular width, within FPT-preprocessing time we can build a data structure that allows to enumerate, without repetition and with constant delay, all tuples of the query result. Our proof builds upon Marxs splitting routine to decompose the query result into a union of results; but we have to tackle the additional technical difficulty to ensure that these can be enumerated efficiently.

Databases Logic in Computer Science