Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script

Charanjiv Singh Saroa, Kawaljeet Singh

Published: Dec 31, 2023

Keywords:

NLP, Regional Languages, N-grams, UNICODE

Abstract

The availability of a robust corpus is crucial for developing linguistic resources. For Punjabi written in the Gurmukhi script, the scarcity of such a resource hinders the validation of natural language processing (NLP) techniques. This paper addresses the gap by presenting a comprehensive Punjabi corpus in Gurmukhi. The corpus contains approximately 23 million words drawn from diverse published materials and serves as a foundation for NLP research. The paper also describes a corpus processing tool designed specifically for Punjabi. The tool employs a novel method for constructing the word, bigram, and trigram levels of the corpus, and the method is applicable to building such resources for any script. As a demonstration, we present a generated dataset of approximately 15.5 million Punjabi words and 50 million characters.
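The word, bigram, and trigram levels mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tool or method, which the abstract does not detail. It assumes that a word is a maximal run of code points in the Gurmukhi Unicode block (U+0A00 to U+0A7F) and that n-grams are counted within each input line. The names `tokenize`, `ngram_counts`, and `build_ngram_tables` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of characters from the Gurmukhi
# Unicode block (U+0A00 to U+0A7F). Gurmukhi digits fall in this block and
# are therefore kept; the danda (U+0964) and ASCII punctuation are not.
GURMUKHI_WORD = re.compile(r"[\u0A00-\u0A7F]+")


def tokenize(text):
    """Return the Gurmukhi words found in text, in order of appearance."""
    return GURMUKHI_WORD.findall(text)


def ngram_counts(tokens, n):
    """Count all contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def build_ngram_tables(lines, max_n=3):
    """Build frequency tables for 1-grams up to max_n-grams.

    N-grams are counted within each line, so no n-gram spans a line break.
    """
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in tables:
            tables[n].update(ngram_counts(tokens, n))
    return tables


if __name__ == "__main__":
    sample = ["ਮੈਂ ਪੰਜਾਬੀ ਪੜ੍ਹਦਾ ਹਾਂ।", "ਮੈਂ ਪੰਜਾਬੀ ਲਿਖਦਾ ਹਾਂ।"]
    tables = build_ngram_tables(sample)
    for n, table in tables.items():
        print(n, table.most_common(3))
```

Matching on the Unicode block rather than splitting on whitespace keeps punctuation and Latin-script fragments out of the word list. A production tool would likely also need Unicode normalization and sentence segmentation, neither of which is attempted here.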

How to Cite

Saroa, C. S., & Singh, K. (2023). Towards Constructing Corpus of Punjabi N-grams Written in Gurmukhi Script. International Journal on Recent and Innovation Trends in Computing and Communication, 11(10), 2680–2687. Retrieved from https://ijritcc.org/index.php/ijritcc/article/view/10248

Issue

Vol. 11 No. 10 (2023)

Section

Articles
