The Clair library (Clairlib) is intended to simplify a number of generic tasks in Natural Language Processing (NLP), Information Retrieval (IR), and Network Analysis. Its architecture also allows external software to be plugged in with very little effort. Functionality native to Clairlib includes tokenization, summarization, LexRank, biased LexRank, document clustering, document indexing, PageRank, biased PageRank, web graph analysis, network generation, power-law distribution analysis, network analysis (clustering coefficient, degree distribution plotting, average shortest path, diameter, triangles, shortest-path matrices, connected components), cosine similarity, random walks on graphs, statistics (distributions, tests), TF, IDF, and community finding.
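As a concrete illustration of one of the listed capabilities, the Python sketch below implements LexRank in miniature: it builds TF-IDF vectors, links sentences whose cosine similarity clears a threshold, and ranks them with damped power iteration. Clairlib itself is a Perl library, so the function names and parameters here are illustrative assumptions rather than Clairlib's actual API.

import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build simple TF-IDF vectors for a list of tokenised sentences."""
    docs = [Counter(s) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in d)            # document frequency
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence-similarity graph."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * adj[j][i] / max(sum(adj[j]), 1.0)
                      for j in range(n))
                  for i in range(n)]
    return scores

sentences = [["clairlib", "computes", "lexrank", "scores"],
             ["lexrank", "ranks", "sentences", "by", "centrality"],
             ["the", "weather", "is", "nice", "today"]]
print(lexrank(sentences))   # higher scores mark more central sentences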
We present a system that uses a learned autocompletion mechanism to facilitate rapid creation of semi-structured clinical documentation. We dynamically suggest relevant clinical concepts as a doctor drafts a note by leveraging features from both unstructured and structured medical data. By constraining our architecture to shallow neural networks, we are able to make these suggestions in real time. Furthermore, as our algorithm is used to write a note, we can automatically annotate the documentation with clean labels of clinical concepts drawn from medical vocabularies, making notes more structured and readable for physicians, patients, and future algorithms. To our knowledge, this system is the only machine learning-based documentation utility for clinical notes deployed in a live hospital setting, and it reduces the keystroke burden of documenting clinical concepts by 67% in real environments.
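To make the mechanism concrete, here is a minimal sketch of contextual concept suggestion with a single hidden layer, written in Python with NumPy. The feature scheme, concept vocabulary, weights, and layer sizes are assumptions chosen for illustration, not the deployed system's architecture or model.

import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["pneumonia", "hypertension", "metformin", "chest pain"]  # toy concept vocabulary
N_FEATURES = 32   # hashed context tokens plus structured-data flags (assumed)
HIDDEN = 16       # a single hidden layer keeps scoring fast enough for real-time typing

W1 = rng.normal(0, 0.1, (N_FEATURES, HIDDEN))   # untrained weights, for the sketch only
W2 = rng.normal(0, 0.1, (HIDDEN, len(VOCAB)))

def featurize(context_tokens, structured_flags):
    """Hash unstructured note tokens and append structured-record indicators."""
    x = np.zeros(N_FEATURES)
    slots = N_FEATURES - len(structured_flags)
    for tok in context_tokens:
        x[hash(tok) % slots] += 1.0
    x[slots:] = structured_flags
    return x

def suggest(context_tokens, structured_flags, prefix, k=3):
    """Score all concepts with one hidden layer, then filter by the typed prefix."""
    x = featurize(context_tokens, structured_flags)
    h = np.maximum(0.0, x @ W1)      # ReLU hidden layer
    logits = h @ W2
    ranked = np.argsort(-logits)
    hits = [VOCAB[i] for i in ranked if VOCAB[i].startswith(prefix)]
    return hits[:k]

# Example: the doctor has typed "pn" while the note context mentions cough and fever.
print(suggest(["cough", "fever"], [1.0, 0.0], "pn"))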
Quickly moving into a new area of research is painful for researchers because of the vast amount of scientific literature in each field of study. One possible way to overcome this problem is to summarize a scientific topic. In this paper, we propose a model for summarizing a single article, which can be further used to summarize an entire topic. Our model is based on analyzing others' viewpoints of the target article's contributions and on the study of its citation summary network using a clustering approach.
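As a rough sketch of the citation-summary-network idea, the Python snippet below links citation sentences whose cosine similarity exceeds a threshold, treats connected components as clusters, and keeps one central sentence per cluster. The threshold and the components-based clustering are stand-in assumptions, not the exact method proposed in the paper.

import math
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize_citations(sentences, threshold=0.2):
    toks = [s.lower().split() for s in sentences]
    n = len(toks)
    # Edge between citation sentences whose similarity clears the threshold.
    nbrs = {i: {j for j in range(n)
                if i != j and cosine(toks[i], toks[j]) >= threshold}
            for i in range(n)}
    # Cluster = connected component of the similarity graph (a simple stand-in
    # for a full clustering algorithm).
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(nbrs[v] - comp)
        seen |= comp
        clusters.append(comp)
    # Representative sentence per cluster: the one with the most neighbours inside it.
    return [sentences[max(c, key=lambda v: len(nbrs[v] & c))] for c in clusters]

cites = ["the model summarizes a single article using its citations",
         "a single article is summarized using citation sentences",
         "results were evaluated on a separate dataset"]
print(summarize_citations(cites))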
The Semantic Atlas is a mathematical and statistical model for visualising word senses according to relations between words. The model, which has been applied to proximity relations from a corpus, has shown its ability to distinguish word senses as the corpus contributors understand them. We propose to use the model and a specialised corpus to automatically create a specialised dictionary for the corpus domain. A morpho-syntactic analysis performed on the corpus makes it possible to build the dictionary from syntactic relations between lexical units. The resulting semantic resource can be used to navigate semantically - and not only lexically - through the corpus, to create classical dictionaries, or for diachronic studies of the language.
The Clinical E-Science Framework (CLEF) project built a system to extract important information from medical texts for the purposes of clinical research, evidence-based healthcare, and genotype-meets-phenotype informatics. The system is divided into two parts. The first part concerns the identification of relationships between clinically important entities in the text; full parses and domain-specific grammars are used in several approaches to extract these relationships. In the second part of the system, statistical machine learning (ML) approaches are applied to extract the relationships. A corpus of oncology narratives hand-annotated with clinical relationships is used to train and test a system designed and implemented with supervised ML approaches. Many features are extracted from these texts and used by the classifier to build a model, and multiple supervised machine learning algorithms can be applied for relationship extraction. The effects of adding features, changing the size of the corpus, and changing the type of algorithm on relationship extraction are examined. Keywords: text mining; information extraction; NLP; entities; relations.
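For orientation, here is a minimal Python sketch of the supervised relation-extraction setup described above: pair up annotated entities, extract a few simple features, and train a classifier. The feature names, entity types, and tiny hand-made examples are illustrative assumptions, not the CLEF feature set or corpus.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(sentence_tokens, e1, e2):
    """e1/e2 are (start, end, type) spans over the token list."""
    between = sentence_tokens[e1[1]:e2[0]]
    return {
        "e1_type": e1[2],
        "e2_type": e2[2],
        "distance": len(between),
        **{f"between={w.lower()}": 1 for w in between},
    }

# Tiny hand-made training set: (tokens, entity1, entity2, relation label).
train = [
    (["biopsy", "revealed", "carcinoma"], (0, 1, "Investigation"),
     (2, 3, "Condition"), "has_finding"),
    (["tamoxifen", "given", "for", "carcinoma"], (0, 1, "Drug"),
     (3, 4, "Condition"), "has_indication"),
    (["patient", "denies", "pain"], (0, 1, "Person"),
     (2, 3, "Condition"), "none"),
]

vec = DictVectorizer()
X = vec.fit_transform([features(t, a, b) for t, a, b, _ in train])
y = [label for *_, label in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

test = (["mastectomy", "performed", "for", "carcinoma"],
        (0, 1, "Intervention"), (3, 4, "Condition"))
print(clf.predict(vec.transform([features(*test)])))   # predicted relation label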
Timely analysis of cyber-security information necessitates automated information extraction from unstructured text. While state-of-the-art extraction methods produce extremely accurate results, they require ample training data, which is generally unavailable for specialized applications such as detecting security-related entities; moreover, manual annotation of corpora is very costly and often not a viable solution. In response, we develop a very precise method to automatically label text from several data sources by leveraging related, domain-specific, structured data and provide public access to a corpus annotated with cyber-security entities. Next, we implement a Maximum Entropy Model trained with the averaged perceptron on a portion of our corpus ($\sim$750,000 words) and achieve near-perfect precision, recall, and accuracy, with training times under 17 seconds.
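To illustrate the training procedure, here is a minimal averaged-perceptron token tagger in Python. The label set, feature templates, and toy sentences are assumptions chosen for illustration; they are not the paper's features or corpus.

from collections import defaultdict

def token_features(tokens, i):
    w = tokens[i]
    prev = tokens[i - 1].lower() if i else "<s>"
    return {f"w={w.lower()}", f"suffix={w[-3:]}",
            f"shape={'X' if w[:1].isupper() else 'x'}", f"prev={prev}"}

def train(sentences, labels_list, epochs=5):
    """Averaged perceptron: keep running totals of the weights and return their mean."""
    labels = sorted({l for ls in labels_list for l in ls})
    weights = defaultdict(float)   # (feature, label) -> current weight
    totals = defaultdict(float)    # accumulated weights for averaging
    t = 0
    for _ in range(epochs):
        for tokens, gold in zip(sentences, labels_list):
            for i, g in enumerate(gold):
                feats = token_features(tokens, i)
                pred = max(labels, key=lambda l: sum(weights[(f, l)] for f in feats))
                if pred != g:                       # perceptron update on mistakes
                    for f in feats:
                        weights[(f, g)] += 1.0
                        weights[(f, pred)] -= 1.0
                for k, v in weights.items():        # accumulate for the average
                    totals[k] += v
                t += 1
    return labels, {k: v / t for k, v in totals.items()}

# Toy corpus: tokens labelled vendor / vulnerability / O (outside any entity).
sents = [["Adobe", "patched", "CVE-2012-0767"], ["The", "server", "crashed"]]
tags = [["vendor", "O", "vulnerability"], ["O", "O", "O"]]
labels, model = train(sents, tags)
tagged = [max(labels, key=lambda l: sum(model.get((f, l), 0.0)
                                        for f in token_features(sents[0], i)))
          for i in range(len(sents[0]))]
print(tagged)   # predicted tags for the first toy sentence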