Extracting the most important parts of legislation documents has great business value because the texts are usually very long and hard to understand. The aim of this article is to evaluate different text summarization algorithms on EU legislation documents, whose content contains domain-specific vocabulary. We collected a text summarization dataset of 1563 EU legal documents, with a mean summary length of 424 words. Experiments with several algorithms were conducted on the new dataset. A simple extractive algorithm was selected as a baseline. Advanced extractive algorithms that use encoders show better results than the baseline. The best result, as measured by ROUGE scores, was achieved by a fine-tuned abstractive T5 model adapted to work with long texts.
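To make the evaluation metric concrete, the sketch below shows a simplified re-implementation of ROUGE-N (unigram overlap by default) between a candidate summary and a reference summary. This is an illustration only; the scores reported in the article would be computed with a standard ROUGE package, and the two example sentences are hypothetical.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall, and F1 between two texts.

    Simplified for illustration: lowercase whitespace tokenization,
    no stemming or stopword handling.
    """
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate/reference pair in the style of legal text.
scores = rouge_n("the regulation applies to all member states",
                 "the regulation shall apply to member states")
print(scores)
```

ROUGE recall rewards covering the reference summary's content, which matters here because the gold summaries average 424 words; F1 additionally penalizes padding the candidate with irrelevant text.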