DIGANTA SAHA
Articles written in Sadhana
Volume 44 Issue 7 July 2019 Article ID 0168
Word Sense Disambiguation in Bengali language using unsupervised methodology with modifications
In this work, Word Sense Disambiguation (WSD) in the Bengali language is implemented using an unsupervised methodology. In the first phase of the experiment, sentence clustering is performed using the Maximum Entropy method, and the clusters are labelled with their innate senses by manual intervention, so that these sense-tagged clusters can be used as sense inventories for further experiments. In the next phase, when a test instance is to be disambiguated, the Cosine Similarity Measure is used to find the closeness of that test instance to the initially sense-tagged clusters. The test instance is assigned the sense of the sense-tagged cluster to which it is closest. This strategy is considered the baseline strategy, which produces 35% accuracy in the WSD task. Next, two extensions are adopted over this baseline strategy: (a) Principal Component Analysis (PCA) over the feature vector, which produces 52% accuracy in the WSD task, and (b) Context Expansion of the sentences using the Bengali WordNet coupled with PCA, which produces 61% accuracy in the WSD task. The data sets used in this work are obtained from the Bengali corpus developed under the Technology Development for Indian Languages (TDIL) project of the Government of India, and the lexical knowledge base (i.e., the Bengali WordNet) used in the work is developed at the Indian Statistical Institute, Kolkata, under the Indradhanush Project of the DeitY, Government of India. The challenges and pitfalls of this work are also described in detail in the pre-conclusion section.
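The cluster-assignment step described in this abstract can be illustrated with a minimal sketch. It assumes bag-of-words count vectors and represents each sense-tagged cluster by its summed counts; the abstract does not specify the feature representation, so these choices, the function names, and the English placeholder tokens are all hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def cluster_vector(sentences):
    """Sum token counts over all sentences of a cluster.

    Cosine similarity ignores vector length, so summed counts
    behave like a centroid for this purpose.
    """
    total = Counter()
    for tokens in sentences:
        total.update(tokens)
    return total


def assign_sense(test_tokens, sense_clusters):
    """Return the sense label of the cluster most similar to the test sentence."""
    test_vec = Counter(test_tokens)
    vectors = {sense: cluster_vector(s) for sense, s in sense_clusters.items()}
    return max(vectors, key=lambda sense: cosine_similarity(test_vec, vectors[sense]))


# Hypothetical sense-tagged clusters (placeholder tokens, not corpus data).
clusters = {
    "river_bank": [["water", "river", "flow"], ["river", "boat"]],
    "money_bank": [["money", "loan", "account"], ["account", "deposit"]],
}
predicted = assign_sense(["loan", "account", "open"], clusters)
```

Here the test sentence shares tokens only with the second cluster, so `predicted` is `"money_bank"`. Choosing the highest cosine similarity is equivalent to choosing the minimum cosine distance.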
Volume 44 Issue 8 August 2019 Article ID 0181
A novel approach to word sense disambiguation in Bengali language using supervised methodology
ALOK RANJAN PAL DIGANTA SAHA NILADRI SEKHAR DASH SUDIP KUMAR NASKAR ANTARA PAL
An attempt is made in this paper to report how a supervised methodology has been adopted for the task of Word Sense Disambiguation (WSD) in Bengali with necessary modifications. At the initial stage, four commonly used supervised methods, Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Network (ANN) and Naïve Bayes (NB), are implemented as baselines. These algorithms are applied individually to a data set of the 13 most frequently used ambiguous Bengali words. On an experimental basis, the baseline strategy is modified with two extensions: (a) inclusion of a lemmatization process in the system and (b) bootstrapping of the operational process. As a result, the accuracy of the baseline methods is slightly improved, which is a positive signal for the whole process of disambiguation, as it opens scope for further modification of the existing methods for better results. In this experiment, the data sets are prepared from the Bengali corpus developed in the Technology Development for Indian Languages (TDIL) project of the Government of India and from the Bengali WordNet, which is developed at the Indian Statistical Institute, Kolkata. The paper reports the challenges and pitfalls of the work that were closely observed during the experiment.
Volume 47 Published: 1 January 2022 Article ID 0002
An improvement of Bengali factoid question answering system using unsupervised statistical methods
ARIJIT DAS JAYDEEP MANDAL ZARGHAM DANIAL ALOK RANJAN PAL DIGANTA SAHA
Virtual Assistants (VA) and Chatbots have boosted the pace of research on Question Answering (QA) systems. QA systems are supposed to return answers to questions by processing a backend repository. All the questions and the text in the repositories are in natural language only. A substantial number of projects have been executed for building QA systems in high-resource languages. In the case of low-resource languages, progress is still at an early stage. In this work, we have designed, developed and evaluated the performance of a factoid QA system in a low-resource language, Bengali. The system takes questions from a human user and then retrieves all the prospective answers from a multi-domain repository. Based on six parameters, the answers are ranked and returned. The performance of the system is then evaluated and compared with earlier systems using standard metrics. The algorithm is tested on two repositories. The first is the TDIL corpus, containing a large collection of famous Bengali literature, which was developed in the Technology Development for Indian Languages (TDIL) project. The second is the translated SQuAD, which is the Bengali translation of the Stanford Question Answering Dataset. The accurate answer is ranked 1st by the system in 88.23% of cases. Accuracy and F1 score are calculated as 97.64% and 98.5%, respectively, for the TDIL corpus and 97.16% and 98.51% for the translated SQuAD, based on performance evaluation by confusion matrix.
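The metrics named in this abstract (accuracy and F1 score from a confusion matrix, and the share of questions whose correct answer is ranked first) can be computed as in the sketch below. It uses the standard textbook definitions; the abstract does not state how the confusion-matrix cells were defined for the QA task, so the function names and the example counts are hypothetical and do not reproduce the reported figures.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 score from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


def top1_rate(ranks):
    """Fraction of questions whose correct answer was ranked first.

    `ranks` holds the 1-based rank of the correct answer for each question.
    """
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r == 1) / len(ranks)


# Hypothetical counts and ranks, for illustration only.
acc, f1 = accuracy_and_f1(tp=8, fp=1, fn=1, tn=0)
rate = top1_rate([1, 1, 2, 1])
```

With these illustrative inputs, `acc` is 0.8, `f1` is 8/9 and `rate` is 0.75. F1 can exceed accuracy when true negatives are rare, which is the same pattern as in the reported figures.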
© 2022-2023 Indian Academy of Sciences, Bengaluru.