Abstractive summarization quality has improved substantially with recent language-model pretraining techniques. However, there is currently a lack of datasets for the growing needs of conversation summarization applications. We therefore collected ForumSum, a diverse, high-quality conversation summarization dataset with human-written summaries. The conversations in ForumSum are collected from a wide variety of internet forums. To make the dataset easily expandable, we also release the dataset creation process. Our experiments show that models trained on ForumSum have better zero-shot and few-shot transferability to other datasets than the existing large chat summarization dataset, SAMSum. We also show that using a conversational corpus for pre-training improves the quality of the chat summarization model.
Personas are useful for dialogue response prediction. However, the personas used in current studies are pre-defined and hard to obtain before a conversation. To tackle this issue, we study a new task, named Speaker Persona Detection (SPD), which aims to detect speaker personas from plain conversational text. In this task, the best-matched persona is retrieved from a set of candidates given the conversational text. This is a many-to-many semantic matching task because both contexts and personas in SPD are composed of multiple sentences. The long-term dependency and the dynamic redundancy among these sentences increase the difficulty of this task. We build a dataset for SPD, dubbed Persona Match on Persona-Chat (PMPC). Furthermore, we evaluate several baseline models and propose utterance-to-profile (U2P) matching networks for this task. The U2P models operate at a fine granularity, treating both contexts and personas as sets of multiple sequences. Each sequence pair is then scored, and an interpretable overall score is obtained for a context-persona pair through aggregation. Evaluation results show that the U2P models significantly outperform their baseline counterparts.
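To illustrate the fine-grained matching-and-aggregation idea described in this abstract, the following is a minimal sketch, not the U2P architecture from the paper: the toy sentence encoder, the max-over-profiles / mean-over-utterances aggregation, and all names (`encode`, `u2p_score`) are illustrative assumptions.

```python
import numpy as np

def encode(sentence: str) -> np.ndarray:
    """Placeholder sentence encoder (assumption): in practice this would be
    a trained neural encoder producing a fixed-size embedding."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)

def u2p_score(context_utterances, persona_profiles) -> float:
    """Score a (context, persona) pair at utterance-to-profile granularity.

    Each context utterance is matched against every persona profile sentence;
    the per-utterance score is its best-matching profile (max), and the overall
    score averages over utterances, so each pair's contribution stays
    interpretable.  Max/mean is one plausible aggregation, not necessarily
    the one used in the paper.
    """
    U = np.stack([encode(u) for u in context_utterances])   # (n_u, d)
    P = np.stack([encode(p) for p in persona_profiles])     # (n_p, d)
    pair_scores = U @ P.T                                    # cosine similarities
    per_utterance = pair_scores.max(axis=1)                  # best profile per utterance
    return float(per_utterance.mean())

# Ranking candidate personas for one conversation context:
context = ["I ran 10k this morning.", "Training for a marathon in May."]
candidates = {
    "runner": ["I love long-distance running.", "I race every spring."],
    "gamer":  ["I stream video games at night.", "My favorite genre is RPGs."],
}
best = max(candidates, key=lambda k: u2p_score(context, candidates[k]))
print(best)
```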
In this paper, we focus on improving the quality of summaries generated by neural abstractive dialogue summarization systems. Even though pre-trained language models generate well-constructed and promising results, it is still challenging to summarize a conversation among multiple participants, since the summary should describe the overall situation and the actions of each speaker. This paper proposes self-supervised strategies for speaker-focused post-correction in abstractive dialogue summarization. Specifically, our model first discriminates which type of speaker correction is required in a draft summary and then generates a revised summary according to the required type. Experimental results show that our proposed method adequately corrects the draft summaries, and the revised summaries are significantly improved in both quantitative and qualitative evaluations.
In this paper, we use domain generalization to improve the performance of a cross-device speaker verification system. Starting from a trainable speaker verification system, we use domain generalization algorithms to fine-tune the model parameters. First, we use the VoxCeleb2 dataset to train an ECAPA-TDNN baseline model. We then fine-tune it on the CHT-TDSV dataset with the following domain generalization algorithms: DANN, CDNN, and Deep CORAL. Our proposed system is tested on 10 different scenarios in the NSYSU-TDSV dataset, covering single-device and multiple-device settings. In the multiple-device scenario, the best equal error rate decreased from 18.39% for the baseline to 8.84%, successfully achieving cross-device speaker verification.
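The Deep CORAL algorithm mentioned above aligns source- and target-domain feature statistics. Below is a minimal PyTorch sketch of the standard CORAL loss (squared Frobenius distance between batch covariance matrices, scaled by 1/(4d^2)); the embedding dimension, batch size, and the weighting `lambda_coral` are assumptions, and this is not the exact fine-tuning recipe used in the paper.

```python
import torch

def coral_loss(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Standard Deep CORAL loss: squared Frobenius distance between the
    feature covariance matrices of a source batch and a target batch,
    scaled by 1 / (4 d^2).  Shapes: (n_s, d) and (n_t, d)."""
    d = source.size(1)

    def covariance(x: torch.Tensor) -> torch.Tensor:
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)

    diff = covariance(source) - covariance(target)
    return (diff * diff).sum() / (4 * d * d)

# During fine-tuning, this term would be added to the usual speaker loss, e.g.
#   loss = speaker_loss + lambda_coral * coral_loss(fs, ft)
# where fs/ft are embeddings from the source (VoxCeleb2) and target (CHT-TDSV)
# batches and lambda_coral is a tunable weight (an assumption here).
fs = torch.randn(32, 192)   # e.g. ECAPA-TDNN embeddings; dimension assumed
ft = torch.randn(32, 192)
print(coral_loss(fs, ft).item())
```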
For children, a system trained on a large corpus of adult speakers performed worse than a system trained on a much smaller corpus of children's speech. This is due to the acoustic mismatch between training and testing data. To capture more acoustic variability, we trained a shared system with mixed data from adults and children. The shared system yields the best EER for children with no degradation for adults. Thus, a single system trained with mixed data is applicable for speaker verification for both adults and children.
In this research, some audio signal properties have been studied in relation to the speaker's vocal tract shape. A database of audio files was recorded. These files belong to 57 men aged between 35 and 45. All speakers came from the same academic and social background, and none of them suffers from any hearing or speech problems. The vowel database was created under ideal recording conditions. The recording process took about five minutes per speaker, each of whom said the Arabic word " سألتمُونِيهَا " three times. That word is very rich in vowel letters: it contains all of the Arabic long vowels.

Based on the analysis of the recorded audio signals, the relationship between the formant frequencies and the length of the speaker's vocal tract has been studied. The results show an inverse proportion for the first three formant frequencies F1, F2, and F3, and no clear relationship for the two higher formants F4 and F5.
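The inverse proportion reported for the lower formants is consistent with the textbook uniform-tube (quarter-wave resonator) approximation of the vocal tract, in which resonance frequencies vary with the reciprocal of vocal-tract length. The sketch below states that standard model with a worked example; the numbers (c = 350 m/s, L = 17.5 cm) are illustrative assumptions, not measurements from this study.

```latex
% Uniform-tube (quarter-wave resonator) approximation of the vocal tract:
% resonance frequencies are inversely proportional to vocal-tract length L.
F_n \approx \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \dots
% Worked example (assumed values): c = 350\,\mathrm{m/s},\ L = 17.5\,\mathrm{cm}
% \Rightarrow F_1 \approx 500\,\mathrm{Hz},\ F_2 \approx 1500\,\mathrm{Hz},\ F_3 \approx 2500\,\mathrm{Hz}.
```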
Voice recognition comprises two basic parts: speech recognition and speaker recognition. These recognition processes are considered among the most important in modern technology, and many systems have been developed that differ in the methods used to extract features and in the classification approaches used to support such recognition systems.

In this research, a system was designed to recognize the speaker and his voice commands, relying on several complementary algorithms. We conducted an analytical study of the MFCC algorithm used for feature extraction, examining two parameters: the number of filters in the filter bank and the number of features taken from each frame. We studied the impact of these two parameters on the recognition rate and their relationship to each other. Feed-forward neural networks trained with back-propagation were used for classification, and we analyzed the network's performance to identify the best features and configuration for achieving recognition. We also studied the endpoint algorithm used to remove periods of silence and its impact on the voice recognition rate.
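As a concrete illustration of the two MFCC parameters studied (filter-bank size and features per frame) and of energy-based silence removal, here is a minimal sketch using librosa; the specific values (26 filters, 13 coefficients, a 30 dB silence threshold) and the file name are assumptions, not the settings or data from this study.

```python
import numpy as np
import librosa

def extract_features(path, n_mels=26, n_mfcc=13, top_db=30):
    """Load a recording, strip silence with a simple energy-based endpoint
    rule, and compute per-frame MFCC features.

    n_mels  -- number of filters in the Mel filter bank (first studied parameter)
    n_mfcc  -- number of coefficients kept per frame    (second studied parameter)
    top_db  -- silence threshold for endpoint detection, in dB below the peak
    """
    y, sr = librosa.load(path, sr=None)

    # Endpoint detection: keep only segments whose energy is within
    # `top_db` of the maximum, i.e. drop leading/trailing/internal silence.
    intervals = librosa.effects.split(y, top_db=top_db)
    y_voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y

    # MFCCs: one n_mfcc-dimensional feature vector per frame.
    mfcc = librosa.feature.mfcc(y=y_voiced, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels)
    return mfcc.T   # shape: (num_frames, n_mfcc)

# Sweeping the two parameters to observe their effect on the recognition rate
# would then just re-run feature extraction and retrain the classifier:
# for n_mels in (20, 26, 40):
#     for n_mfcc in (12, 13, 20):
#         feats = extract_features("speaker01_take1.wav", n_mels, n_mfcc)
```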