Text Preprocessing and its Implications in a Digital Humanities Project

Added by Association for Computational Linguistics (article)

Publication date: 2021

Field: Artificial Intelligence

Research language: English

Abstract

This paper focuses on data cleaning as part of a preprocessing procedure applied to text data retrieved from the web. Although the importance of this early stage in a project using NLP methods is often highlighted by researchers, the details, general principles and techniques are usually left out due to consideration of space. At best, they are dismissed with a comment: "The usual data cleaning and preprocessing procedures were applied". More coverage is usually given to automatic text annotation such as lemmatisation, part-of-speech tagging and parsing, which is often included in preprocessing. In the literature, the term "preprocessing" is used to refer to a wide range of procedures, from filtering and cleaning to data transformation such as stemming and numeric representation, which might create confusion. We argue that text preprocessing might skew original data distribution with regard to the metadata, such as types, locations and times of registered datapoints. In this paper we describe a systematic approach to cleaning text data mined by a data-providing company for a Digital Humanities (DH) project focused on cultural analytics. We reveal the types and amount of noise in the data coming from various web sources and estimate the changes in the size of the data associated with preprocessing. We also compare the results of a text classification experiment run on the raw and preprocessed data. We hope that our experience and approaches will help the DH community to diagnose the quality of textual data collected from the web and prepare it for further natural language processing.
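
The abstract's concern that cleaning can shift how data is distributed across metadata can be illustrated with a minimal sketch. This is not the authors' pipeline: the cleaning rules (tag stripping, entity decoding, URL removal, a length threshold, exact-duplicate removal), the `source` and `text` field names, the `min_chars` value and the sample records are all assumptions chosen for illustration.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # naive HTML tag pattern (assumption)
URL_RE = re.compile(r"https?://\S+")  # bare URLs
WS_RE = re.compile(r"\s+")            # runs of whitespace


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, drop URLs, normalise whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def clean_corpus(records, min_chars=20):
    """Clean every record, drop texts that are too short or exact duplicates,
    and return the kept records with per-source counts before and after."""
    before = Counter(r["source"] for r in records)
    seen = set()
    kept = []
    for r in records:
        text = clean_text(r["text"])
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append({**r, "text": text})
    after = Counter(r["source"] for r in kept)
    return kept, before, after


def source_shares(counts):
    """Convert per-source counts into proportions of the whole corpus."""
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}


# Hypothetical sample records, invented for this sketch.
records = [
    {"source": "forum", "text": "<p>Great   exhibition at the museum!</p>"},
    {"source": "forum", "text": "Great exhibition at the museum!"},
    {"source": "forum", "text": "ok"},
    {"source": "news",
     "text": "The festival opens on Friday &amp; runs all week. https://example.com/x"},
]

kept, before, after = clean_corpus(records)
print("share before:", source_shares(before))
print("share after: ", source_shares(after))
```

In this toy sample the forum source loses two of its three records while the news source loses none, so the sources' relative shares change even though every individual cleaning rule looks neutral. Reporting such per-metadata counts alongside the cleaned corpus is one way to make that kind of shift visible.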

References used

https://aclanthology.org/

Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications

Association for Computational Linguistics, 2021, article

The success of language models based on the Transformer architecture appears to be inconsistent with observed anisotropic properties of representations learned by such models. We resolve this by showing, contrary to previous studies, that the representations do not occupy a narrow cone, but rather drift in common directions. At any training step, all of the embeddings except for the ground-truth target embedding are updated with gradient in the same direction. Compounded over the training set, the embeddings drift and share common components, manifested in their shape in all the models we have empirically tested. Our experiments show that isotropy can be restored using a simple transformation.

Gender Bias in Text: Origin, Taxonomy, and Implications

Association for Computational Linguistics, 2021, article

Gender inequality represents a considerable loss of human potential and perpetuates a culture of violence, higher gender wage gaps, and a lack of representation of women in higher and leadership positions. Applications powered by Artificial Intelligence (AI) are increasingly being used in the real world to provide critical decisions about who is going to be hired, granted a loan, admitted to college, etc. However, the main pillars of AI, Natural Language Processing (NLP) and Machine Learning (ML), have been shown to reflect and even amplify gender biases and stereotypes, which are mainly inherited from historical training data. In an effort to facilitate the identification and mitigation of gender bias in English text, we develop a comprehensive taxonomy that relies on the following gender bias types: Generic Pronouns, Sexism, Occupational Bias, Exclusionary Bias, and Semantics. We also provide a bottom-up overview of gender bias, from its societal origin to its spillover onto language. Finally, we link the societal implications of gender bias to their corresponding type(s) in the proposed taxonomy. The underlying motivation of our work is to help enable the technical community to identify and mitigate relevant biases from training corpora for improved fairness in NLP systems.

Total Quality Management at al-Baath University from the Viewpoint of its Academic Staff: A Field Study on the Faculty of Arts and Humanities

al-Baath University, 2016, research paper

This study aimed at highlighting the reality of the Total Quality management of overall at the Faculty of Arts and Humanities at al-Baath University as well as the relation between the demographic and functional changes with respect to the members of the specimens and the principles of overall Total Quality management. This study used the field survey methodology, and the researcher prepared a questionnaire consisting of six fields: continual improvement, staff participation, training and teaching, work teams, beneficiary satisfactions.

Keywords: faculty members; Total Quality Management; al-Baath University

On Randomized Classification Layers and Their Implications in Natural Language Generation

Association for Computational Linguistics, 2021, article

In natural language generation tasks, a neural language model is used for generating a sequence of words forming a sentence. The topmost weight matrix of the language model, known as the classification layer, can be viewed as a set of vectors, each representing a target word from the target dictionary. The target word vectors, along with the rest of the model parameters, are learned and updated during training. In this paper, we analyze the properties encoded in the target vectors and question the necessity of learning these vectors. We suggest to randomly draw the target vectors and set them as fixed so that no weights updates are being made during training. We show that by excluding the vectors from the optimization, the number of parameters drastically decreases with a marginal effect on the performance. We demonstrate the effectiveness of our method in image-captioning and machine-translation.

The American-Zionist Project and its Implications for the Arab World

Damascus University, 2015, research paper

This project is a Zionist-American update of those projects which have been developed and planned by the Colonial and Zionist departments, which were designed to separate the bright Arab world for western parts, by planting the Zionist entity in the heart of the Arab world, and after a series of conventions and treaties that paved the way for carrying out such as the Convention Sykes-Picot colonial in 1916 and the Balfour Declaration in 1917. Thus, what is happening in the Arab world cannot be separated from the scheme of the US-Zionist-Western European targets to penetrate the Arab region as a whole, in order to synthesize weak statelets which would be easy to control, and thus plunder the riches and capabilities of the Arabs, and to ensure the security "of Israel." In addition, it also reached the objectives of those countries working towards the fragmentation and occupation of the Arab world, as well as the elimination of governments and nationalist parties, thus ending the national project and the Arab system. One of the tools, or the colonial scenarios, posed to achieve this by the owners of the American-Zionist project is to hit the kind of gender in the Arab region, whether sectarian, ethnic or national. And thus ignite sectarian wars and civil between the components of the Arab community until you return the peoples of the region to the pre-national state, which leads to spread chaos and unrest and insecurity, leading to serious implications and repercussions of the disastrous walks of life, different cultural, political, social, economic and others. What progress allows the right climate for the division and fragmentation of the Arab states towards the character of a sectarian and nationalist and sectarian and ethnic, and thus draw a new map for the Arab region to serve the interests of the colonial powers? 
This climate of chaos gives justifications and arguments of the States project owners a Zionist-American intervention in the affairs of Arab countries, and the violation of their sovereignty and control over their own resources, whether oil or gas, or take advantage of its strategic location in ways to control global trade. ...

Keywords: The American-Zionist Project; the Arab World; Implications
