Feature engineering is an important step in classical NLP pipelines, but machine learning engineers may not be aware of the signals to look for when processing foreign-language text. The Russian Feature Extraction Toolkit (RFET) is a collection of feature extraction libraries bundled for ease of use by engineers who do not speak Russian. RFET's current feature set includes features applicable to social media genres of text and to computational social science tasks. We demonstrate the effectiveness of the tool by using it in a personality trait identification task. We compare the performance of Support Vector Machines (SVMs) trained with and without the features provided by RFET; we also compare it to an SVM with neural embedding features generated by Sentence-BERT.
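A minimal sketch of the embedding baseline described above, assuming the sentence-transformers and scikit-learn packages are available; the model name, toy texts, and labels are illustrative placeholders, not the authors' actual setup.

```python
# Sketch: SVM trained on Sentence-BERT embeddings (illustrative baseline).
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

texts = ["пример текста", "другой пример"]   # toy Russian posts (placeholders)
labels = [0, 1]                              # toy personality-trait labels

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
X = encoder.encode(texts)                    # one dense vector per text
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(encoder.encode(["новый текст"])))
```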
In this paper, we present our systems submitted to SemEval-2021 Task 1 on lexical complexity prediction. The aim of this shared task was to create systems able to predict the lexical complexity of word tokens and bigram multiword expressions within a given sentence context, a continuous value indicating the difficulty of understanding the respective utterance. Our approach relies on gradient boosted regression tree ensembles fitted on a heterogeneous feature set combining linguistic features, static and contextualized word embeddings, psycholinguistic norm lexica, WordNet, word- and character-bigram frequencies, and inclusion in wordlists to build a model able to assign a word or multiword expression a context-dependent complexity score. We show that contextualized string embeddings in particular help with predicting lexical complexity.
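A minimal sketch of the kind of pipeline the abstract describes, assuming scikit-learn and SciPy; the random feature matrix stands in for the heterogeneous feature set, and the hyperparameters are illustrative, not the authors' configuration.

```python
# Sketch: gradient boosted regression trees over a heterogeneous feature matrix.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Placeholder features: e.g. word length, log frequency, embedding dims, norms.
X_train, y_train = rng.random((500, 20)), rng.random(500)
X_test, y_test = rng.random((100, 20)), rng.random(100)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)
print("Pearson r:", pearsonr(model.predict(X_test), y_test)[0])
```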
The present work aims at assigning a complexity score between 0 and 1 to a target word or phrase in a given sentence. For each Single Word Target, a Random Forest regressor is trained on a feature set consisting of lexical, semantic, and syntactic information about the target. For each Multiword Target, the individual word features are combined with the single-word complexities in the feature space. The system yielded Pearson correlations of 0.7402 and 0.8244 on the test set for the Single Word and Multiword Targets, respectively.
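A minimal sketch of the multiword setup described above, assuming scikit-learn; the random arrays are placeholders for the lexical, semantic, and syntactic features, and the exact feature composition is an assumption.

```python
# Sketch: multiword targets reuse predicted single-word complexities as features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_single, y_single = rng.random((300, 10)), rng.random(300)
single_rf = RandomForestRegressor(n_estimators=100).fit(X_single, y_single)

# For each bigram, concatenate both words' features with their complexities.
word1_feats, word2_feats = rng.random((80, 10)), rng.random((80, 10))
c1 = single_rf.predict(word1_feats).reshape(-1, 1)
c2 = single_rf.predict(word2_feats).reshape(-1, 1)
X_multi = np.hstack([word1_feats, word2_feats, c1, c2])
multi_rf = RandomForestRegressor(n_estimators=100).fit(X_multi, rng.random(80))
```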
Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He ∅ running". In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We instead train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.
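A minimal sketch of learning a single dialect-feature recognizer from minimal pairs, assuming the transformers and scikit-learn packages; the four-example "corpus", the linear probe, and the choice of bert-base-uncased are illustrative stand-ins, not the paper's multitask architecture.

```python
# Sketch: linear probe over transformer embeddings, trained on minimal pairs.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

# Minimal pairs: the same sentence with and without copula deletion.
pairs = [("He running.", 1), ("He is running.", 0),
         ("She busy today.", 1), ("She is busy today.", 0)]
texts, labels = zip(*pairs)

with torch.no_grad():
    batch = tok(list(texts), padding=True, return_tensors="pt")
    X = enc(**batch).last_hidden_state[:, 0].numpy()  # [CLS] vectors

probe = LogisticRegression().fit(X, list(labels))
print(probe.predict(X))  # feature present (1) vs. absent (0)
```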
In the next decade, we will see a considerable need for NLP models in situated settings, where the diversity of situations and of modalities, including eye movements, must be taken into account in order to grasp the intention of the user. However, language comprehension in situated settings cannot be handled in isolation, since different multimodal cues are inherently present and form essential parts of the situations. In this research proposal, we aim to quantify the influence of each modality in interaction with various referential complexities. We propose to encode the referential complexity of the situated settings in the embeddings during pre-training to implicitly guide the model to the most plausible situation-specific deviations. We summarize the challenges of intention extraction and propose a methodological approach to investigate situation-specific feature adaptation to improve crossmodal mapping and meaning recovery in noisy communication settings.
This research aims to present a detailed survey of the use of convolutional neural networks (CNNs) for extracting features from images. The research defines what image features are and their importance in image processing applications. It also introduces convolutional neural networks (CNNs), their architecture, how they work, and the types of approaches and methodologies used to train them to extract features from images.
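A minimal sketch of the core idea the survey covers, extracting a feature vector from an image with a pretrained CNN, assuming torch and torchvision; the choice of ResNet-18 and the random tensor standing in for a preprocessed image are illustrative assumptions.

```python
# Sketch: using a pretrained CNN as an image feature extractor.
import torch
from torchvision import models

cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()        # drop the classifier head, keep features
cnn.eval()

image = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed image
with torch.no_grad():
    features = cnn(image)           # 512-dimensional feature vector
print(features.shape)               # torch.Size([1, 512])
```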
Student dropout is a serious problem in education; many factors can influence dropout, so it is not an easy issue to resolve. The scope of this research is to examine the accuracy of ensemble techniques for predicting student dropout, particularly for primary school students in the Syrian Arab Republic. The new classifier is designed based on the stacking ensemble technique combined with feature selection techniques, since the database suffers from class imbalance. This new classifier has been compared with individual classifiers using cross-validation; the study concluded that the proposed classifier is the best among those compared for predicting student dropout.
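A minimal sketch of a stacking classifier with feature selection evaluated by cross-validation, assuming scikit-learn; the base learners, the selected-feature count, and the synthetic imbalanced data are placeholders, not the study's exact configuration.

```python
# Sketch: feature selection + stacking ensemble, scored with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the dropout database.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())
pipe = make_pipeline(SelectKBest(f_classif, k=10), stack)
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```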
This paper introduces a new approach to extract palm print features and select the best ones. The paper also studies the effect of the selection process on the speed and performance of the system.
This research suggests a new mechanism that aims to increase the effectiveness of surveillance systems by extracting the moving objects captured by a surveillance camera in order to identify them, and proposes a new mechanism for indexing and storing them in a database, classifying them according to their basic characteristics and strong indicators, and retrieving them when needed in the least possible time. The basic idea lies in combining the basic characteristics of the target (color, edges, and texture), which ensures the best performance in extracting the basic target features that serve as indexes. Nonlinear transforms are then applied to the edges of the target in order to obtain an image bearing the minutest details, and the inverse transforms are applied to the edges of the target during retrieval from the database. Finally, we propose a new mechanism for indexing all images in the database so they can be retrieved with the best accuracy in the least time, and a program has been implemented to realize this idea.
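A minimal sketch of combining color, edge, and texture cues into a single index vector, assuming OpenCV and NumPy; the random frame and the specific descriptors (an RGB histogram, Canny edge density, and intensity standard deviation as a crude texture proxy) are illustrative assumptions, not the paper's method.

```python
# Sketch: combining color, edge, and texture descriptors as an image index key.
import cv2
import numpy as np

img = np.random.randint(0, 256, (240, 320, 3), np.uint8)  # stand-in frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

color = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                     [0, 256, 0, 256, 0, 256]).flatten()   # 512-bin histogram
edges = cv2.Canny(gray, 100, 200).mean()                   # edge density
texture = gray.std()                                       # texture contrast

index_vector = np.concatenate([color / color.sum(), [edges, texture]])
```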
Personal identification based on handprint has been gaining more attention with the increasing need for high levels of security. In this study, a novel approach for human recognition based on handprint is proposed. The wavelet transform was used to extract features present in the palm image based on the wavelet zero-crossing method. First, the wavelet transform of the whole palm image at the fourth level was computed, which results in four matrices: three detail matrices (horizontal, vertical, and diagonal) and one approximation matrix. Throughout this study, only the detail matrices were used, because the required information (i.e., hand lines and curves) is contained in those matrices. Sixteen features were extracted from each detail matrix and then arranged in one vector; consequently, for each palm sample a feature vector of 48 input features for the neural network was obtained. For this purpose, a database of 400 palm images belonging to 40 people, at 10 images per person, was built. Practical tests showed that the designed system successfully identified 91.36% of the tested images.
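A minimal sketch of the feature-extraction step described above, assuming the PyWavelets package; the 16 per-matrix features shown here are simple block-wise zero-crossing counts (the paper's exact zero-crossing features are not specified), and the random array stands in for a palm image.

```python
# Sketch: 16 zero-crossing counts per level-4 detail matrix -> 48-d palm vector.
import numpy as np
import pywt

palm = np.random.rand(256, 256)                 # stand-in for a palm image
coeffs = pywt.wavedec2(palm, "db1", level=4)
cH, cV, cD = coeffs[1]                          # level-4 detail matrices

def block_features(mat, blocks=16):
    """Count sign changes (zero-crossings) in each of `blocks` row bands."""
    bands = np.array_split(mat, blocks, axis=0)
    return [int((np.diff(np.sign(b), axis=1) != 0).sum()) for b in bands]

vector = np.array(block_features(cH) + block_features(cV) + block_features(cD))
print(vector.shape)                             # (48,) inputs for the classifier
```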