Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script

Charanjiv Singh Saroa, Kawaljeet Singh

Abstract

A robust corpus is crucial for developing linguistic resources. For Punjabi written in the Gurmukhi script, the scarcity of such a resource hinders the validation of natural language processing (NLP) techniques. This paper addresses that gap by presenting the creation of a comprehensive Punjabi corpus in Gurmukhi. The corpus, comprising approximately 23 million words drawn from diverse published materials, serves as a foundation for NLP research. The paper also describes a dedicated corpus processing tool designed for Punjabi. The tool employs a novel method for constructing the word, bigram, and trigram levels of the corpus, and the method is applicable to building such resources for any script. As a demonstration, we present a generated dataset of approximately 15.5 million Punjabi words and 50 million characters.
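To make the word, bigram, and trigram levels concrete, here is a minimal Python sketch of n-gram counting over Gurmukhi text. It is not the paper's tool or its novel method, which the abstract does not specify. The tokenization rule (runs of characters from the Gurmukhi Unicode block), the sentence delimiters, and the names `tokenize` and `ngram_counts` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of characters from the Gurmukhi
# Unicode block (U+0A00 to U+0A7F). The paper's tokenizer may differ.
GURMUKHI_TOKEN = re.compile(r"[\u0A00-\u0A7F]+")

# Assumption: sentences end at a danda (U+0964), double danda (U+0965),
# question mark, or exclamation mark.
SENTENCE_END = re.compile(r"[\u0964\u0965?!]")


def tokenize(text):
    """Return the Gurmukhi word tokens found in text, in order."""
    return GURMUKHI_TOKEN.findall(text)


def ngram_counts(text, n):
    """Count n-grams of words, never spanning a sentence boundary."""
    counts = Counter()
    for sentence in SENTENCE_END.split(text):
        tokens = tokenize(sentence)
        # zip over n shifted views of the token list yields each n-gram
        # as a tuple; sentences shorter than n contribute nothing.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


sample = "ਮੈਂ ਪੰਜਾਬੀ ਪੜ੍ਹਦਾ ਹਾਂ। ਮੈਂ ਪੰਜਾਬੀ ਲਿਖਦਾ ਹਾਂ।"

for n, label in ((1, "words"), (2, "bigrams"), (3, "trigrams")):
    counts = ngram_counts(sample, n)
    print(label, sum(counts.values()), "total,", len(counts), "distinct")
```

Counting within sentences is one possible design choice: it avoids n-grams that join the last word of one sentence to the first word of the next. A corpus-scale tool would also need to stream its input rather than hold the whole text in memory.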

Article Details

How to Cite
Saroa, C. S., & Singh, K. (2023). Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script. International Journal on Recent and Innovation Trends in Computing and Communication, 11(10), 2680–2687. Retrieved from https://ijritcc.org/index.php/ijritcc/article/view/10248