
More than Word Frequencies: Authorship Attribution via Natural Frequency Zoned Word Distribution Analysis


Published by Dr. Zhili Chen

Publication date: 2012

Research field: Informatics Engineering

Paper language: English

Authors: Zhili Chen, Liusheng Huang, Wei Yang

Computation and Language


Abstract

With the increasing popularity and availability of digital text data, the authorship of digital texts cannot be taken for granted, given the ease of copying and parsing. This paper presents a new text style analysis called natural frequency zoned word distribution analysis (NFZ-WDA), followed by a basic authorship attribution scheme and an open authorship attribution scheme for digital texts based on that analysis. NFZ-WDA rests on the observation that every author leaves distinct, intrinsic word usage traces on the texts they write, and that these intrinsic styles can be identified and used to analyze authorship. The intrinsic word usage styles can be estimated by analyzing the word distribution within a text, which goes beyond ordinary word frequency analysis and asks three questions: which groups of words are used in the text, how frequently each group occurs, and how the occurrences of each group are distributed across the text. The basic and open authorship attribution schemes then provide solutions for both the closed and the open authorship attribution problems. Through analysis and extensive experimental studies, this paper demonstrates the efficiency of the proposed method for authorship attribution.
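The three questions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' NFZ-WDA procedure: the abstract does not say how the frequency zones are defined, so the power-of-two zoning rule, the equal-size text segments, and all function names below (`tokenize`, `zone_of`, `zoned_distribution`) are assumptions made purely for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def zone_of(count):
    """Assumed zoning rule (not from the paper): zone k holds words whose
    in-text count c satisfies 2**k <= c < 2**(k + 1)."""
    return count.bit_length() - 1


def zoned_distribution(text, n_segments=4):
    """Return (groups, zone_freq, segment_counts) for one text.

    groups:         zone -> set of words in that zone (which groups are used)
    zone_freq:      zone -> share of all tokens in that zone (how frequent)
    segment_counts: zone -> occurrences per equal-size segment of the text
                    (how the occurrences are distributed)
    """
    tokens = tokenize(text)
    total = len(tokens)
    word_counts = Counter(tokens)
    zones = {word: zone_of(c) for word, c in word_counts.items()}

    groups = {}
    for word, zone in zones.items():
        groups.setdefault(zone, set()).add(word)

    zone_totals = Counter(zones[t] for t in tokens)
    zone_freq = {zone: n / total for zone, n in zone_totals.items()}

    segment_counts = {zone: [0] * n_segments for zone in groups}
    for i, token in enumerate(tokens):
        # Map token position i to one of n_segments contiguous segments.
        segment = min(i * n_segments // total, n_segments - 1)
        segment_counts[zones[token]][segment] += 1

    return groups, zone_freq, segment_counts


if __name__ == "__main__":
    sample = "the cat and the dog and the bird saw the cat"
    groups, zone_freq, segment_counts = zoned_distribution(sample, n_segments=2)
    for zone in sorted(groups):
        print(zone, sorted(groups[zone]), round(zone_freq[zone], 3), segment_counts[zone])
```

Under these assumptions, the per-zone segment counts form a style vector that could be compared across texts; how the paper actually turns such statistics into an attribution decision is not described in the abstract.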


Read also

Word Familiarity and Frequency

Kumiko Tanaka-Ishii, Hiroshi Terada, 2018

Word frequency is assumed to correlate with word familiarity, but the strength of this correlation has not been thoroughly investigated. In this paper, we report on our analysis of the correlation between a word familiarity rating list obtained through a psycholinguistic experiment and the log-frequency obtained from various corpora of different kinds and sizes (up to the terabyte scale) for English and Japanese. Major findings are threefold: First, for a given corpus, familiarity is necessary for a word to achieve high frequency, but familiar words are not necessarily frequent. Second, correlation increases with the corpus data size. Third, a corpus of spoken language correlates better than one of written language. These findings suggest that cognitive familiarity ratings are correlated with frequency, but more highly with that of spoken rather than written language.

Computation and Language

Distinct word length frequencies: distributions and symbol entropies

Reginald D. Smith, 2012

The distribution of frequency counts of distinct words by length in a language's vocabulary will be analyzed using two methods. The first will look at the empirical distributions of several languages and derive a distribution that reasonably explains the number of distinct words as a function of length. We will be able to derive the frequency count, mean word length, and variance of word length based on the marginal probability of letters and spaces. The second, based on information theory, will demonstrate that the conditional entropies can also be used to estimate the frequency of distinct words of a given length in a language. In addition, it will be shown how these techniques can also be applied to estimate higher order entropies using vocabulary word length.
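The empirical quantity this abstract starts from, the number of distinct words as a function of length, can be computed with a short Python sketch. It covers only that empirical count and the resulting mean and variance of word length over the vocabulary; it does not reproduce the paper's derived distribution or its entropy-based estimate. The function names and the lowercasing rule are assumptions for illustration.

```python
from collections import Counter


def distinct_length_counts(words):
    """Count distinct (case-folded) words by their length in characters."""
    vocabulary = {w.lower() for w in words}
    return Counter(len(w) for w in vocabulary)


def mean_and_variance(length_counts):
    """Mean and population variance of word length over the vocabulary."""
    n = sum(length_counts.values())
    mean = sum(length * c for length, c in length_counts.items()) / n
    variance = sum(c * (length - mean) ** 2 for length, c in length_counts.items()) / n
    return mean, variance


if __name__ == "__main__":
    tokens = ["a", "an", "the", "The", "cat", "to"]
    counts = distinct_length_counts(tokens)
    print(sorted(counts.items()))
    print(mean_and_variance(counts))
```

Note that each distinct word is counted once regardless of how often it occurs, which matches the abstract's focus on vocabulary rather than running text.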

Computation and Language; Data Analysis, Statistics and Probability

Towards Robustness Against Natural Language Word Substitutions

Xinshuai Dong, Anh Tuan Luu, Rongrong Ji, 2021

Robustness against word substitutions has a well-defined and widely accepted form, i.e., using semantically similar words as substitutions, and is therefore considered a fundamental stepping stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space using either an $l_2$-ball or a hyper-rectangle, which results in perturbation sets that are not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverage a regularization term to enforce perturbation towards an actual substitution, thus aligning our modeling better with the discrete textual space. Based on the ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. Experiments show that ASCC-defense outperforms the current state of the art in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, concerning several attacks across multiple model architectures. Besides, we also envision a new class of defense towards robustness in NLP, where our robustly trained word vectors can be plugged into a normally trained model and enforce its robustness without applying any other defense techniques.
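The idea of modeling the substitution space as a convex hull can be sketched in a few lines of Python. This is an illustrative sketch, not the ASCC implementation: the softmax parameterization of the weights and the entropy penalty used to favor near one-hot weights are assumptions, since the abstract only says that a regularization term pushes the perturbation towards an actual substitution. The toy 2-D vectors are hypothetical.

```python
import math


def softmax(logits):
    """Map unconstrained logits to non-negative weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combination(vectors, weights):
    """Weighted sum of substitution vectors; with softmax weights the
    result always lies inside their convex hull."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def entropy(weights):
    """Shannon entropy of the weights. Low entropy means the weights are
    close to one-hot, i.e. close to a single actual substitution.
    Using entropy as the sparsity regularizer is an assumption here."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of a word and two of its substitutions.
    substitutions = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
    for logits in ([0.0, 0.0, 0.0], [50.0, 0.0, 0.0]):
        weights = softmax(logits)
        print(convex_combination(substitutions, weights), entropy(weights))
```

With equal logits the point sits at the centroid of the hull and the entropy is at its maximum; with one dominant logit the point collapses onto a single substitution vector and the entropy approaches zero, which is the behavior a sparsity-style regularizer would reward.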

Computation and Language

Discrete Word Embedding for Logical Natural Language Understanding

Masataro Asai, Zilu Tang, 2020

We propose an unsupervised neural model for learning a discrete embedding of words. Unlike existing discrete embeddings, our binary embedding supports vector arithmetic operations similar to continuous embeddings. Our embedding represents each word as a set of propositional statements describing a transition rule in classical/STRIPS planning formalism. This makes the embedding directly compatible with symbolic, state-of-the-art classical planning solvers.

Computation and Language; Artificial Intelligence

Word Embedding Visualization Via Dictionary Learning

Juexiao Zhang, Yubei Chen, Brian Cheung, 2019

Co-occurrence statistics based word embedding techniques have proved to be very useful in extracting the semantic and syntactic representation of words as low dimensional continuous vectors. In this work, we discovered that dictionary learning can open up these word vectors as a linear combination of more elementary word factors. We demonstrate that many of the learned factors have surprisingly strong semantic or syntactic meaning corresponding to the factors previously identified manually by human inspection. Thus dictionary learning provides a powerful visualization tool for understanding word embedding representations. Furthermore, we show that the word factors can help in identifying key semantic and syntactic differences in word analogy tasks and improve upon the state-of-the-art word embedding techniques in these tasks by a large margin.

Computation and Language; Machine Learning
