• CALAM: model-based compilation and linguistic statistical analysis of Urdu corpus

    • Fulltext

      Permanent link:
      https://www.ias.ac.in/article/fulltext/sadh/045/0020

    • Keywords

      Corpus statistical analysis; Zipf’s rule; quantitative analysis; linguistic evaluation; corpus; NLP.

    • Abstract

      In this paper, we introduce an efficient framework for compiling an Urdu corpus together with its ground truth and transcription in Unicode format. The corpus, CALAM, uses a novel annotation scheme based on four-level XML. In addition to compilation and benchmarking tests, the framework generates the word frequency distribution by category, which is useful for linguistic evaluation. This paper presents a statistical analysis of the corpus data based on the transcribed text and the frequency of occurrences. The analysis uses key statistics such as the rank of words, the frequency of words, ligature length (the number of ligatures formed from combinations of two to seven characters), and the entropy and perplexity of the corpus. Beyond these basic statistics, some additional statistical features are also evaluated, such as Zipf’s linguistic rule and measures of dispersion in the corpus information. The experimental results obtained from these statistical observations are presented to assert the viability and usability of the corpus data as a standard platform for linguistic research on the Urdu language.
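
      The statistics named in the abstract (word rank and frequency, entropy, perplexity, and Zipf’s rule) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `corpus_statistics` and `zipf_slope` are hypothetical, the input is assumed to be an already tokenized transcript, entropy is assumed to be the unigram Shannon entropy in bits, and perplexity is assumed to be 2 raised to that entropy. The paper's own definitions may differ.

```python
import math
from collections import Counter


def corpus_statistics(tokens):
    """Return (ranked word counts, unigram entropy in bits, perplexity).

    Assumption: `tokens` is a list of words taken from the Unicode
    transcription; how the transcript is tokenized is left to the caller.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("tokens must not be empty")
    # most_common() sorts by descending frequency, so index + 1 is the rank.
    ranked = counts.most_common()
    # Shannon entropy of the unigram distribution, in bits per word.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Unigram perplexity, assumed here to be 2 ** entropy.
    perplexity = 2 ** entropy
    return ranked, entropy, perplexity


def zipf_slope(ranked):
    """Least-squares slope of log(frequency) against log(rank).

    Zipf's rule predicts a slope close to -1 for natural-language text.
    """
    if len(ranked) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for _, count in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

      For example, calling `corpus_statistics(text.split())` on a transcript string `text` gives the ranked frequency list, and passing that list to `zipf_slope` gives a rough check of how closely the rank-frequency curve follows Zipf’s rule. Whitespace splitting is only a placeholder for a proper Urdu tokenizer.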

    • Author Affiliations

      PRAKASH CHOUDHARY¹, NEETA NAIN²

      1. Department of Computer Science and Engineering, National Institute of Technology Hamirpur, Hamirpur 177005, Himachal Pradesh, India
      2. Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, Jaipur, India
© 2021-2022 Indian Academy of Sciences, Bengaluru.