Dr. Bornini Lahiri

Indian Institute of Technology Kharagpur, Humanities and Social Sciences, Faculty Member

Central Institute of Indian Languages (CIIL), Scheme for Protection and Preservation of Endangered Languages, Junior Resource Person

Jadavpur University, Kolkata, India, School of Languages and Linguistics, Research Associate

Central Institute of Hindi, Agra, Linguistics, Faculty Member

Linguist

Papers by Dr. Bornini Lahiri

Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

1st Workshop on Speech for Social Good (S4SG)

Aggression in Hindi and English Speech: Acoustic Correlates and Automatic Identification

ArXiv, 2022

In the present paper, we will present the results of an acoustic analysis of political discourse in Hindi and discuss some of the conventionalised acoustic features of aggressive speech regularly employed by the speakers of Hindi and English. The study is based on a corpus of slightly over 10 hours of political discourse and includes debates on news channels and political speeches. Using this study, we develop two automatic classification systems for identifying aggression in English and Hindi speech, based solely on an acoustic model. The Hindi classifier, trained using 50 hours of annotated speech, and the English classifier, trained using 40 hours of annotated speech, achieve a respectable accuracy of over 73% and 66% respectively. In this paper, we discuss the development of this annotated dataset, the experiments for developing the classifiers, and the errors that they make.

Bihari Hindi as a Mixed Language (co-authored with Bornini Lahiri and Deepak Alok)

Proceedings of Language Contact in India: Historical, Typological and Sociolinguistic Perspectives, 2013

Language Resources and Technologies for Non-Scheduled and Endangered Indian Languages

ArXiv, 2022

In the present paper, we will present a survey of the language resources and technologies available for the non-scheduled and endangered languages of India. While there have been different estimates from different sources about the number of languages in India, it could be assumed that there are more than 1,000 languages currently being spoken in India. However, barring some of the 22 languages included in the 8th Schedule of the Indian Constitution (called the scheduled languages), there is hardly any substantial resource or technology available for the rest of the languages. Nonetheless, there have been some individual attempts at developing resources and technologies for the different languages across the country. Of late, some financial support has also become available for the endangered languages. In this paper, we give a summary of the resources and technologies for those Indian languages which are not included in the 8th Schedule of the Indian Constitution and/or which are e...

Instrumental case

A Questionnaire Developed for Conducting Fieldwork on Endangered and Indigenous Languages in India

Jadavpur Journal of Languages and Linguistics

English Spelling Errors in Hindi Speaking Children

International Journal of Language and Applied Linguistics

The present study reports an investigation of English spelling errors made by students who are native Hindi speakers. The study was conducted on grade five students of an English medium school in India. Students with similar socioeconomic backgrounds, studying in the same school, were given various tasks to test their English spelling skills. The errors were grouped into five categories. It was noticed that most of the wrong spellings were phonologically similar to the correct spelling. The students wrote what they heard. They knew the letters of the English alphabet and related these letters to some sounds. This seems to be the influence of L1 (Hindi), where there is a one-to-one correspondence between sound and orthographic symbol. But it is not so in English. Moreover, many English words have silent letters in their spellings, which also created problems for the students. Lack of knowledge of the correct pronunciation of English words added to the spelling errors. This study is important because as far a...

Objective and benefactive cases

The Case System of Eastern Indo-Aryan Languages

Aggressive and Offensive Language Identification in Hindi, Bangla, and English: A Comparative Study

SN Computer Science

ComMA@FIRE 2020: Exploring Multilingual Joint Training across different Classification Tasks

In this paper, we give a description of the systems submitted to the three tracks of FIRE 2020 Ha... more In this paper, we give a description of the systems submitted to the three tracks of FIRE 2020 Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), Sentiment Analysis of Dravidian Languages in Code-Mixed Text and Event Detection from News in Indian Languages (EDNIL). While the first two tasks were binary and multi-class text classification problems, EDNIL was a sequence classification problem. For all the three tracks, we jointly fine-tuned mBERT, DistilBERT, RoBERTa and XLM-R using the dataset from all the languages for the given task.

Some island cases

Developing a POS tagger for Magahi: A Comparative Study

In this paper, we present a comparative study of four state-of-the-art sequential taggers applied to Magahi data for part-of-speech (POS) annotation. Magahi is one of the smaller Indo-Aryan languages, spoken in the eastern state of Bihar in India. It is an extremely resource-poor language and this is the first attempt to develop some kind of Natural Language Processing (NLP) resource for the language. The four taggers that we test are: Support Vector Machines (SVM) based SVMTool, Hidden Markov Model (HMM) based TnT tagger, Maximum Entropy based MxPost tagger and Memory based MBT tagger. All these taggers are trained on a minuscule dataset of around 50,000 words using 33 tags from the BIS-tagset for Indian languages and tested on around 13,000 words. The performance of all these taggers is tested against a frequency-based baseline tagger. While all these taggers perform worse than on the English data, the best performance is given by the Maximum Entropy tagger after tuning of certain...

Developing a Multilingual Annotated Corpus of Misogyny and Aggression

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.

Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) – and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

Automatic Identification of Closely-related Indian Languages: Resources and Experiments

ArXiv, 2018

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.

Descriptive Study of Eastern Hindi: A mixed language

In this paper, we give a description of one of the varieties of Eastern Hindi spoken in the central, Magahi-speaking parts of Bihar (the variety spoken in and around the capital city of Patna) and present the case for it being a mixed language. Based on extensive empirical evidence, we conclude that Eastern Hindi is a conventionalised/plain mixed language (following the classification given in Bakker (2000) and Matras and Bakker (2003)) which has come into being because of contact between the official Hindi and Magahi spoken in the region.

Semantics of classifiers in some Indian Languages

In the present paper, we present a detailed description of the classifier systems of five Indian languages: Mizo, Galo, Tagin (all belonging to the Tibeto-Burman family), Assamese (Indo-Aryan) and Malto (Dravidian). Classifiers are a predominant feature of the Tibeto-Burman family, and we observe an extensive classifier system in these languages. There is no equivalent classifier system in other language families. However, in the languages belonging to Eastern India, irrespective of the family, there is some sort of classifier system. Thus classifiers seem to be an areal feature in most of Eastern and the whole of North-Eastern India. The purpose of the paper is to study if there is some semantic similarity among the classifier systems across language families in this area and thus to see if it is indeed an areal feature. This is a preliminary description of ongoing research in which we intend to study many more languages and include languages from the Austro...

mScrabble: A Multilingual Scrabble and Lexicon Collection Generator

In the situation of language endangerment, especially because of various kinds of pressure from surrounding majority languages and a low language prestige among the community members, language games of various kinds could prove to be an effective tool enhancing the prestige, providing an additional domain of language use to the community members and also for the researchers working with the communities for language documentation and possibly revitalisation. Keeping these in mind, we have developed a word game - mScrabble, a substantially changed and adapted version of the popular game of Scrabble for a large number of languages as a mobile app. In this paper, we present this game for two endangered Indian languages - Koda and Mahali - and discuss its features and rules, its technical specifications and its initial reception in the community. We also present a generic generator of this game which, given a word list and a few translations (of the items on the interface), could generat...

Dhimal: A Struggle for Existence

Dhimal is a Tibeto-Burman language spoken in India and Nepal. The variety spoken in India is considered endangered by the Government of India. On the one hand, Dhimal speakers are shifting towards Bengali; on the other hand, they want to assert their identity as Dhimal. The paper explores the reasons for the existence of these opposing attitudes in the community.

English Spelling Errors in Hindi Speaking Children

by Dr. Bornini Lahiri and International Journal of Language & Applied Linguistics IJLAL

The present study reports an investigation of English spelling errors made by students who are native Hindi speakers. The study was conducted on grade five students of an English medium school in India. Students with similar socio-economic backgrounds, studying in the same school, were given various tasks to test their English spelling skills. The errors were grouped into five categories. It was noticed that most of the wrong spellings were phonologically similar to the correct spelling. The students wrote what they heard. They knew the letters of the English alphabet and related these letters to some sounds. This seems to be the influence of L1 (Hindi), where there is a one-to-one correspondence between sound and orthographic symbol. But it is not so in English. Moreover, many English words have silent letters in their spellings, which also created problems for the students. Lack of knowledge of the correct pronunciation of English words added to the spelling errors. This study is important because, to the best of my knowledge, no such study on the effect of Hindi on English spelling errors has been done earlier.

Non-Canonical Cases in Bihari Languages

Lincom Europa, 2019

Hindi-Bengali Code-Mixing in Bengali Films

Hindi is used generously in Bengali films, mostly in the form of code-mixing. The present p...

The Semantics of classifiers in some Indian Languages (co-authored with Bornini Lahiri, Atanu Saha and Sudhanshu Shekhar)

by Ritesh Kumar and Dr. Bornini Lahiri

Proceedings of the 3rd Students' Conference of Linguistics in India (SCONLI-3), 2011

Developing LRs for Indian Languages A case of Magahi (co-authored with Bornini Lahiri and Deepak Alok)

by Ritesh Kumar, Dr. Bornini Lahiri, and Deepak Alok

Human Language Technology, 2014

Reduplicative classifier: a typological survey (co-authored with Bornini Lahiri and Atanu Saha)

by Ritesh Kumar and Dr. Bornini Lahiri

Bihari Hindi as a Mixed Language (co-authored with Bornini Lahiri and Deepak Alok)

by Ritesh Kumar, Deepak Alok, and Dr. Bornini Lahiri

Proceedings of Language Contact in India: Historical, Typological and Sociolinguistic Perspectives, 2013

Developing a POS tagger for Magahi: A Comparative Study (co-authored with Bornini Lahiri and Deepak Alok)

by Ritesh Kumar, Dr. Bornini Lahiri, and Deepak Alok

Proceedings of 10th Workshop on Asian Language Resources, 24th International Conference on Computational Linguistics (COLING-24), 2012

Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi (co-authored with Bornini Lahiri and Deepak Alok)

by Ritesh Kumar, Deepak Alok, and Dr. Bornini Lahiri

Proceedings of 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, 2011

A Questionnaire Developed for Conducting Fieldwork on Endangered and Indigenous Languages

by Dr. Bornini Lahiri and Arup Majumder

Jadavpur Journal of Languages and Linguistics, 2018