Word Sense Disambiguation in Bengali language using unsupervised methodology with modifications
In this work, Word Sense Disambiguation (WSD) in the Bengali language is implemented using an unsupervised methodology. In the first phase of the experiment, sentence clustering is performed using the Maximum Entropy method, and the clusters are manually labelled with their innate senses, so that these sense-tagged clusters can serve as sense inventories for the subsequent experiments. In the next phase, when a test instance is to be disambiguated, the Cosine Similarity Measure is used to find how close that instance is to each of the initially sense-tagged clusters. The test instance is assigned to the sense-tagged cluster from which its distance is minimum, and it receives the sense of that cluster. This strategy is treated as the baseline, and it produces 35% accuracy in the WSD task. Next, two extensions are adopted over this baseline strategy: (a) Principal Component Analysis (PCA) over the feature vector, which produces 52% accuracy in the WSD task, and (b) Context Expansion of the sentences using the Bengali WordNet coupled with PCA, which produces 61% accuracy in the WSD task. The data sets used in this work are obtained from the Bengali corpus developed under the Technology Development for the Indian Languages (TDIL) project of the Government of India, and the lexical knowledge base used in the work (i.e., the Bengali WordNet) is developed at the Indian Statistical Institute, Kolkata, under the Indradhanush Project of the DeitY, Government of India. The challenges and the pitfalls of this work are also described in detail in the pre-conclusion section.
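The assignment step of the baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the feature representation, so bag-of-words count vectors and a summed cluster centroid are assumptions made here, and the function names (`cosine_similarity`, `assign_sense`) and the toy English tokens are hypothetical. The sketch only shows how a test sentence could inherit the sense of the sense-tagged cluster it is most similar to, which is equivalent to choosing the cluster at minimum cosine distance.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def assign_sense(test_tokens, sense_clusters):
    """Return (sense, similarity) of the sense-tagged cluster closest to the test sentence.

    sense_clusters maps a manually assigned sense label to a list of
    tokenised sentences belonging to that cluster (assumed layout).
    """
    test_vec = Counter(test_tokens)
    best_sense, best_sim = None, -1.0
    for sense, sentences in sense_clusters.items():
        # Summed counts point in the same direction as the mean vector,
        # and cosine similarity ignores vector length.
        centroid = Counter()
        for sentence in sentences:
            centroid.update(sentence)
        sim = cosine_similarity(test_vec, centroid)
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense, best_sim


# Hypothetical toy clusters for an ambiguous word ("bank").
clusters = {
    "finance": [["money", "bank", "loan"], ["bank", "account", "money"]],
    "river": [["river", "bank", "water"], ["bank", "river", "flood"]],
}
sense, similarity = assign_sense(["loan", "money", "bank", "interest"], clusters)
print(sense, round(similarity, 3))
```

Under the same assumptions, the PCA extension would project the feature vectors onto their leading principal components before this comparison, and the context expansion would add WordNet-related words to each token list first.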
ALOK RANJAN PAL, DIGANTA SAHA
Volume 48, 2023
Continuous Article Publishing mode