
Text Preprocessing and its Implications in a Digital Humanities Project


Publication date: 2021
Language: English
 Created by Shamra Editor





This paper focuses on data cleaning as part of a preprocessing procedure applied to text data retrieved from the web. Although researchers often highlight the importance of this early stage in a project using NLP methods, the details, general principles and techniques are usually left out due to considerations of space. At best, they are dismissed with a comment such as "The usual data cleaning and preprocessing procedures were applied". More coverage is usually given to automatic text annotation such as lemmatisation, part-of-speech tagging and parsing, which is often included in preprocessing. In the literature, the term "preprocessing" is used to refer to a wide range of procedures, from filtering and cleaning to data transformation such as stemming and numeric representation, which might create confusion. We argue that text preprocessing might skew the original data distribution with regard to the metadata, such as types, locations and times of registered datapoints. In this paper we describe a systematic approach to cleaning text data mined by a data-providing company for a Digital Humanities (DH) project focused on cultural analytics. We reveal the types and amount of noise in the data coming from various web sources and estimate the changes in the size of the data associated with preprocessing. We also compare the results of a text classification experiment run on the raw and preprocessed data. We hope that our experience and approaches will help the DH community to diagnose the quality of textual data collected from the web and prepare it for further natural language processing.
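The abstract does not spell out the cleaning steps, so the following is only a minimal sketch of the kind of check it describes: clean web text, then measure how much data survives per metadata value to see whether preprocessing skews the distribution. The record layout (`source`, `text`), the helper names (`clean_text`, `preprocess`, `retention_by`), the `min_chars` threshold and the sample records are all assumptions made for illustration, not the authors' pipeline.

```python
import html
import re
from collections import Counter

# Hypothetical patterns: a crude tag stripper and a whitespace collapser.
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return WS_RE.sub(" ", unescaped).strip()


def preprocess(records, min_chars=20):
    """Clean each record; drop too-short texts and exact duplicates.

    min_chars is an assumed threshold, chosen only for this example.
    """
    seen = set()
    kept = []
    for rec in records:
        text = clean_text(rec["text"])
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append({**rec, "text": text})
    return kept


def retention_by(records_before, records_after, key="source"):
    """Share of records that survive preprocessing, per metadata value."""
    before = Counter(r[key] for r in records_before)
    after = Counter(r[key] for r in records_after)
    return {k: after[k] / n for k, n in before.items()}


# Invented sample data, for illustration only.
raw_records = [
    {"source": "blog", "text": "<p>Museums reopen after a long closure &amp; crowds return.</p>"},
    {"source": "blog", "text": "<div>Museums reopen after a long closure &amp; crowds return.</div>"},
    {"source": "forum", "text": "<b>ok</b>"},
    {"source": "news", "text": "The festival programme was announced on Monday morning."},
]

cleaned = preprocess(raw_records)
shares = retention_by(raw_records, cleaned)
print(shares)
```

Uneven retention shares across sources would be one sign that cleaning has changed the composition of the corpus, which is the kind of skew the abstract warns about. The same function can be pointed at any other metadata field, such as location or date, by changing `key`.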



References used
https://aclanthology.org/

Read More

The success of language models based on the Transformer architecture appears to be inconsistent with observed anisotropic properties of representations learned by such models. We resolve this by showing, contrary to previous studies, that the representations do not occupy a narrow cone, but rather drift in common directions. At any training step, all of the embeddings except for the ground-truth target embedding are updated with gradient in the same direction. Compounded over the training set, the embeddings drift and share common components, manifested in their shape in all the models we have empirically tested. Our experiments show that isotropy can be restored using a simple transformation.
Gender inequality represents a considerable loss of human potential and perpetuates a culture of violence, higher gender wage gaps, and a lack of representation of women in higher and leadership positions. Applications powered by Artificial Intelligence (AI) are increasingly being used in the real world to provide critical decisions about who is going to be hired, granted a loan, admitted to college, etc. However, the main pillars of AI, Natural Language Processing (NLP) and Machine Learning (ML), have been shown to reflect and even amplify gender biases and stereotypes, which are mainly inherited from historical training data. In an effort to facilitate the identification and mitigation of gender bias in English text, we develop a comprehensive taxonomy that relies on the following gender bias types: Generic Pronouns, Sexism, Occupational Bias, Exclusionary Bias, and Semantics. We also provide a bottom-up overview of gender bias, from its societal origin to its spillover onto language. Finally, we link the societal implications of gender bias to their corresponding type(s) in the proposed taxonomy. The underlying motivation of our work is to help enable the technical community to identify and mitigate relevant biases from training corpora for improved fairness in NLP systems.
This study aimed to highlight the state of Total Quality Management at the Faculty of Arts and Humanities at Al-Baath University, as well as the relation between the demographic and functional variables of the sample members and the principles of Total Quality Management. The study used the field survey methodology, and the researcher prepared a questionnaire consisting of six fields: continual improvement, staff participation, training and teaching, work teams, beneficiary satisfaction.
In natural language generation tasks, a neural language model is used for generating a sequence of words forming a sentence. The topmost weight matrix of the language model, known as the classification layer, can be viewed as a set of vectors, each representing a target word from the target dictionary. The target word vectors, along with the rest of the model parameters, are learned and updated during training. In this paper, we analyze the properties encoded in the target vectors and question the necessity of learning these vectors. We suggest drawing the target vectors randomly and keeping them fixed, so that no weight updates are made to them during training. We show that by excluding the vectors from the optimization, the number of parameters drastically decreases with a marginal effect on the performance. We demonstrate the effectiveness of our method in image-captioning and machine-translation.
This project is a Zionist-American update of earlier projects developed and planned by colonial and Zionist circles, which were designed to separate the eastern Arab world from its western part by planting the Zionist entity in the heart of the Arab world, following a series of conventions and treaties that paved the way for this, such as the colonial Sykes-Picot Agreement of 1916 and the Balfour Declaration of 1917. Thus, what is happening in the Arab world cannot be separated from the US-Zionist-Western European scheme to penetrate the Arab region as a whole, in order to create weak statelets that would be easy to control, to plunder the riches and capabilities of the Arabs, and to ensure the security of "Israel." It also fulfils the objectives of those countries working towards the fragmentation and occupation of the Arab world and the elimination of nationalist governments and parties, thus ending the national project and the Arab system. One of the tools, or colonial scenarios, put forward by the proponents of the American-Zionist project to achieve this is to strike at the different affiliations in the Arab region, whether sectarian, ethnic or national. This would ignite sectarian and civil wars between the components of Arab society until the peoples of the region are returned to a pre-national state, spreading chaos, unrest and insecurity, with serious and disastrous repercussions for the various walks of life: cultural, political, social, economic and others. All of the above creates the right climate for dividing and fragmenting the Arab states along sectarian, nationalist and ethnic lines, and thus for drawing a new map of the Arab region that serves the interests of the colonial powers.
This climate of chaos gives the states behind the Zionist-American project justifications and arguments for intervening in the affairs of Arab countries, violating their sovereignty, and controlling their resources, whether oil or gas, or taking advantage of their strategic location to control global trade. ...
