Existing supervised models for text clustering find it difficult to directly optimize for clustering results. This is because clustering is a discrete process, and it is hard to estimate meaningful gradients for a discrete function that could drive gradient-based optimization algorithms. Consequently, existing supervised clustering algorithms instead optimize some continuous function that approximates the clustering process. We propose a scalable training strategy that directly optimizes a discrete clustering metric. We train a BERT-based embedding model using our method and evaluate it on two publicly available datasets. We show that our method outperforms another BERT-based embedding model employing triplet loss, as well as other unsupervised baselines. This suggests that optimizing directly for the clustering outcome indeed yields representations better suited for clustering.
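To make the non-differentiability concrete, here is a minimal sketch of the kind of embed-cluster-score pipeline the abstract describes, assuming sentence-transformers and scikit-learn; the model name, texts, and gold labels are illustrative placeholders, not the paper's actual setup or method.

```python
# Minimal sketch: embed texts with a BERT-based encoder, cluster the
# embeddings with k-means, and score the discrete cluster assignment
# against gold labels with a discrete clustering metric (NMI).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sentence_transformers import SentenceTransformer

texts = ["cheap flights to paris", "book a hotel in rome", "nba playoff scores"]
gold_labels = [0, 0, 1]  # hypothetical gold cluster labels

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-based encoder
embeddings = encoder.encode(texts)                 # shape: (n_texts, dim)

# k-means yields hard, discrete assignments (an argmin over centroids),
# so the assignment step has zero gradient almost everywhere.
assignments = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)

# NMI is a function of label partitions, not of continuous scores, so it
# provides no useful gradient w.r.t. the encoder's parameters.
print(normalized_mutual_info_score(gold_labels, assignments))
```

Both the argmin in the assignment step and the partition-valued metric break gradient flow back to the encoder, which is why prior supervised approaches fall back on continuous surrogates such as triplet loss.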