Applications based on scholarly data are of ever-increasing importance. This puts areas where high-quality data and compatible systems are unavailable, such as non-English publications, at a disadvantage. To help mitigate this imbalance, we use Cyrillic-script publications from the CORE collection to create a high-quality data set for metadata extraction. We use our data to train and evaluate sequence labeling models that extract title and author information. Retraining GROBID on our data, we observe significant improvements in precision and recall, and we achieve even better results with a self-developed model. We make our data set, which covers over 15,000 publications, and our source code freely available.
References used
https://aclanthology.org/