The insights revealed by process mining heavily rely on the quality of event logs. Activities extracted from healthcare information systems, which are often recorded as free text, may lead to inconsistent labels. Such inconsistency then leads to redundant activity labels, i.e., labels that differ in syntax but share the same behaviour. Identifying these labels through data-driven process discovery is difficult and relies heavily on resource-intensive human review. Existing work achieves low accuracy either when redundant activity labels occur with low frequency or when event logs contain numerical data values as attributes. However, these phenomena are commonly observed in healthcare information systems. In this paper, we propose an approach that detects redundant activity labels using control-flow relations and numerical data values from event logs. Natural Language Processing is also integrated into our method to assess the semantic similarity between labels, which provides users with additional insights. We have evaluated our approach on synthetic logs generated from the real-life Sepsis log and in a case study using the MIMIC-III data set. The results demonstrate that our approach can successfully detect redundant activity labels. This approach can add value to the preprocessing step by generating more representative event logs for process mining tasks in the healthcare domain.
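As an illustration of the NLP component, the sketch below scores semantic similarity between activity labels with sentence embeddings. The embedding model, example labels, and threshold are illustrative assumptions rather than the paper's exact configuration, and the combination with control-flow relations and numerical attributes is not shown.

```python
# Hedged sketch: flag pairs of activity labels whose embeddings are highly
# similar, as one ingredient of redundant-label detection. The model name
# and threshold below are assumptions, not the paper's exact setup.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

labels = ["ER Registration", "Emergency Room Registration", "IV Antibiotics"]

model = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model
emb = model.encode(labels, convert_to_tensor=True)

for i, j in combinations(range(len(labels)), 2):
    sim = util.cos_sim(emb[i], emb[j]).item()
    if sim > 0.8:                                  # illustrative threshold
        print(f"possible redundancy: {labels[i]!r} ~ {labels[j]!r} ({sim:.2f})")
```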
Providing appropriate structures around human resources can streamline operations and thus facilitate the competitiveness of an organization. To achieve this goal, modern organizations need an accurate and timely understanding of human resource grouping while facing an ever-changing environment. Process mining offers a promising way to address this need by utilizing event log data stored in information systems. By extracting knowledge about the actual behavior of resources participating in business processes from event logs, organizational models can be constructed, which facilitate the analysis of the de facto grouping of human resources relevant to process execution. Nevertheless, open research gaps remain when applying state-of-the-art process mining techniques to analyze resource grouping. For one, the discovery of organizational models has only limited connections with the context of process execution. For another, a rigorous solution for evaluating organizational models against event log data is yet to be proposed. In this paper, we tackle these research challenges by developing a novel framework built upon a richer definition of organizational models that couples resource grouping with process execution knowledge. By introducing notions of conformance checking for organizational models, the framework allows effective evaluation of organizational models and therefore provides a foundation for analyzing and improving resource grouping based on event logs. We demonstrate the feasibility of the framework by proposing an approach for organizational model discovery underpinned by it, and we conduct experiments on real-life event logs to discover and evaluate organizational models.
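For intuition, the sketch below shows a common baseline for discovering resource groupings from an event log: clustering resources by the activities they execute. The toy log, column names, and number of groups are assumptions for illustration only; the framework described above goes further by coupling the discovered groups with process execution knowledge and by checking conformance against the log.

```python
# Hedged sketch: a simple baseline for organizational model discovery --
# cluster resources by their activity profiles in an event log.
# The toy log and the number of clusters are illustrative assumptions.
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

log = pd.DataFrame({
    "case":     [1, 1, 2, 2, 3, 3],
    "activity": ["register", "triage", "register", "bill", "triage", "bill"],
    "resource": ["alice", "bob", "alice", "carol", "bob", "carol"],
})

# Resource-by-activity frequency profile.
profile = pd.crosstab(log["resource"], log["activity"])

groups = AgglomerativeClustering(n_clusters=2).fit_predict(profile)
for resource, g in zip(profile.index, groups):
    print(resource, "-> group", g)
```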
Generalization is a central problem in Machine Learning. Indeed, most prediction methods require careful calibration of hyperparameters, usually carried out on a hold-out \textit{validation} dataset, to achieve generalization. The main goal of this paper is to introduce a novel approach that achieves generalization without any data splitting, based on a new risk measure which directly quantifies a model's tendency to overfit. To fully convey the intuition and advantages of this new approach, we illustrate it in the simple linear regression model ($Y=X\beta+\xi$), where we develop a new criterion. We highlight how this criterion is a good proxy for the true generalization risk. Next, we derive different procedures which tackle several structures simultaneously (correlation, sparsity, ...). Noticeably, these procedures \textbf{concomitantly} train the model and calibrate the hyperparameters. In addition, these procedures can be implemented via classical gradient descent methods when the criterion is differentiable w.r.t. the hyperparameters. Our numerical experiments reveal that our procedures are computationally feasible and compare favorably to the popular approach (Ridge, LASSO, and Elastic-Net combined with grid-search cross-validation) in terms of generalization. They also outperform the baseline on two additional tasks: estimation and support recovery of $\beta$. Moreover, our procedures do not require any expertise for the calibration of the initial parameters, which remain the same for all the datasets we experimented on.
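To make the mechanics concrete, the sketch below calibrates a ridge penalty by gradient descent on a differentiable, split-free criterion. Classical generalized cross-validation (GCV) is used here purely as a stand-in objective; it is not the paper's new risk measure, and the synthetic data, learning rate, and log-scale parameterization are illustrative assumptions. The regression coefficients enter the criterion through the closed-form ridge fit, so descending the criterion calibrates the hyperparameter and the model together.

```python
# Hedged sketch of the mechanics only: gradient descent on a differentiable,
# split-free criterion (GCV as a stand-in) to calibrate a ridge penalty.
import torch

torch.manual_seed(0)
n, p = 100, 20
X = torch.randn(n, p)
beta_true = torch.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + 0.5 * torch.randn(n)

log_lam = torch.zeros((), requires_grad=True)       # hyperparameter, log-scale
opt = torch.optim.Adam([log_lam], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    lam = torch.exp(log_lam)
    H = X @ torch.linalg.solve(X.T @ X + lam * torch.eye(p), X.T)  # hat matrix
    resid = y - H @ y
    gcv = n * (resid ** 2).sum() / (n - torch.trace(H)) ** 2       # stand-in criterion
    gcv.backward()
    opt.step()

lam = torch.exp(log_lam).detach()
beta_hat = torch.linalg.solve(X.T @ X + lam * torch.eye(p), X.T @ y)
print("calibrated lambda:", lam.item())
```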
In the current climate of economic crisis, cost control is one of the chief concerns for all types of industries, especially for small vendors. Small vendors are expected to minimize their Information Technology budget by reducing the initial investment in hardware and in costly database servers such as ORACLE, SQL Server, and SYBASE for data processing and storage. In other sectors, electronic device manufacturers want to increase demand and reduce manufacturing cost by introducing low-cost technologies. Small devices such as iPods, iPhones, and palmtops are nowadays used as data computation and storage tools. In both cases, instead of opting for costly database servers, which additionally require extra hardware as well as extra expenses for training and administration, the flat file may be considered a candidate because of its ease of handling, fast access, and, of course, zero cost. The main hurdle, however, is that its security is not up to the optimum level. In this paper, we propose a methodology that combines all the merits of the flat file with a novel steganographic technique to maintain a strong security fence. The proposed methodology will be highly beneficial for small vendors as well as for the electronic device manufacturers mentioned above.
The query log of a DBMS is a powerful resource. It enables many practical applications, including query optimization and user experience enhancement. And yet, mining SQL queries is a difficult task. The fundamental problem is that queries are symbolic objects, not vectors of numbers. Therefore, many popular statistical concepts, such as means, regression, or decision trees, do not apply. Most authors limit themselves to ad hoc algorithms or neighborhood-based approaches such as k-Nearest Neighbors. Our project is to challenge this limitation. We introduce methods to manipulate SQL queries as if they were vectors, thereby unlocking the whole statistical toolbox. We present three families of methods: feature maps, kernel methods, and Bayesian models. The first technique directly encodes queries into vectors. The second transforms the queries implicitly. The last exploits probabilistic graphical models as an alternative to vector spaces. We present the benefits and drawbacks of each solution, highlight how they relate to each other, and make the case for future investigation.
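As a toy illustration of the feature-map family, the sketch below encodes SQL strings as bag-of-token vectors so that standard statistical tools can operate on them. The tokenization and example queries are assumptions for illustration and not the encoding used in the paper.

```python
# Hedged sketch of the "feature map" idea: turn SQL strings into sparse
# token-count vectors. This toy bag-of-tokens map is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer

queries = [
    "SELECT name FROM users WHERE age > 30",
    "SELECT name, email FROM users WHERE age > 18 ORDER BY name",
    "SELECT COUNT(*) FROM orders GROUP BY status",
]

vectorizer = CountVectorizer(token_pattern=r"[A-Za-z_*]+", lowercase=True)
vectors = vectorizer.fit_transform(queries)   # queries as numeric vectors

print(vectorizer.get_feature_names_out())
print(vectors.toarray())
```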
Selecting the best items in a dataset is a common task in data exploration. However, the concept of best lies in the eye of the beholder: different users may consider different attributes more important and hence arrive at different rankings. Nevertheless, one can remove dominated items and create a representative subset of the data set comprising the best items in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be almost as big as the full data set. A smaller representative can be found if we relax the requirement to include the best item for every possible user and instead just limit the users' regret. Existing work defines regret as the loss in score incurred by limiting consideration to the representative instead of the full data set, for any chosen ranking function. However, the score is often not a meaningful number and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the data set. In contrast, users do understand the notion of rank ordering. We therefore consider the position of the items in the ranked list to define the regret, and propose the {\em rank-regret representative} as the minimal subset of the data containing at least one of the top-$k$ items of any possible ranking function. This problem is NP-complete. We use the geometric interpretation of items to bound their ranks over ranges of functions and utilize notions from combinatorial geometry to develop effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
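To make the definition concrete, the sketch below brute-force checks whether a candidate subset contains a top-k item for many sampled linear ranking functions. The data, candidate subset, and sampling scheme are illustrative assumptions; the paper's contribution is the efficient geometric approximation algorithms, which are not reproduced here.

```python
# Hedged sketch: brute-force check of the rank-regret definition against
# sampled linear ranking functions. Illustrative only; not the paper's
# geometric algorithms.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 2))          # 200 items with 2 score attributes
subset = {0, 1, 2, 3, 4}             # illustrative candidate representative
k = 10

def covers_all(data, subset, k, n_funcs=1000):
    for _ in range(n_funcs):
        w = rng.random(data.shape[1])           # random linear ranking function
        top_k = np.argsort(data @ w)[-k:]       # indices of the k best items
        if not subset.intersection(top_k.tolist()):
            return False                        # this ranking is "missed"
    return True

print("subset has rank-regret <= k for all sampled rankings:",
      covers_all(data, subset, k))
```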