Papers by Dr. Bornini Lahiri
1st Workshop on Speech for Social Good (S4SG)
ArXiv, 2022
In the present paper, we present the results of an acoustic analysis of political discourse in Hindi and discuss some of the conventionalised acoustic features of aggressive speech regularly employed by speakers of Hindi and English. The study is based on a corpus of slightly over 10 hours of political discourse and includes debates on news channels and political speeches. Using this study, we develop two automatic classification systems for identifying aggression in English and Hindi speech, based solely on an acoustic model. The Hindi classifier, trained using 50 hours of annotated speech, and the English classifier, trained using 40 hours of annotated speech, achieve a respectable accuracy of over 73% and 66% respectively. In this paper, we discuss the development of this annotated dataset and the experiments for developing the classifiers, and discuss the errors that they make.
Proceedings of Language Contact in India: Historical, Typological and Sociolinguistic Perspectives, 2013
ArXiv, 2022
In the present paper, we present a survey of the language resources and technologies available for the non-scheduled and endangered languages of India. While there have been different estimates from different sources about the number of languages in India, it can be assumed that more than 1,000 languages are currently spoken in India. However, barring some of the 22 languages included in the 8th Schedule of the Indian Constitution (called the scheduled languages), there is hardly any substantial resource or technology available for the rest of the languages. Nonetheless, there have been some individual attempts at developing resources and technologies for different languages across the country. Of late, some financial support has also become available for the endangered languages. In this paper, we give a summary of the resources and technologies for those Indian languages which are not included in the 8th Schedule of the Indian Constitution and/or which are e...
The Case System of Eastern Indo-Aryan Languages
SN Computer Science
In this paper, we give a description of the systems submitted to the three tracks of FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC), Sentiment Analysis of Dravidian Languages in Code-Mixed Text, and Event Detection from News in Indian Languages (EDNIL). While the first two tasks were binary and multi-class text classification problems, EDNIL was a sequence classification problem. For all the three tracks, we jointly fine-tuned mBERT, DistilBERT, RoBERTa and XLM-R using the dataset from all the languages for the given task.
In this paper, we present a comparative study of four state-of-the-art sequential taggers applied to Magahi data for part-of-speech (POS) annotation. Magahi is one of the smaller Indo-Aryan languages, spoken in the eastern state of Bihar in India. It is an extremely resource-poor language, and this is the first attempt to develop some kind of Natural Language Processing (NLP) resource for the language. The four taggers that we test are the Support Vector Machine (SVM) based SVMTool, the Hidden Markov Model (HMM) based TnT tagger, the Maximum Entropy based MxPost tagger and the Memory based MBT tagger. All these taggers are trained on a minuscule dataset of around 50,000 words using 33 tags from the BIS tagset for Indian languages and tested on around 13,000 words. The performance of all these taggers is tested against a frequency-based baseline tagger. While all these taggers perform worse than on English data, the best performance is given by the Maximum Entropy tagger after tuning of certain...
In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels: aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and the issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, co-located with COLING’2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.
ArXiv, 2018
In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.
In this paper, we give a description of one of the varieties of Eastern Hindi spoken in the central, Magahi-speaking parts of Bihar (the variety spoken in and around the capital city of Patna) and present the case for it being a mixed language. Based on extensive empirical evidence, we conclude that Eastern Hindi is a conventionalised/plain mixed language (following the classification given in Bakker (2000) and Matras and Bakker (2003)) which has come into being because of contact between the official Hindi and Magahi spoken in the region.
In the present paper, we present a detailed description of the classifier systems of five Indian languages: Mizo, Galo, Tagin (all belonging to the Tibeto-Burman family), Assamese (Indo-Aryan) and Malto (Dravidian). Classifiers are a predominant feature of the Tibeto-Burman family, and we observe an extensive classifier system in these languages. There is no equivalent classifier system in other language families. However, in the languages of Eastern India, irrespective of family, there is some sort of classifier system. Thus classifiers seem to be an areal feature in most of Eastern and the whole of North-Eastern India. The purpose of the paper is to study whether there is some semantic similarity among the classifier systems across language families in this area, and thus to see whether it is indeed an areal feature. This is a preliminary description of ongoing research in which we intend to study many more languages and include languages from the Austro...
In situations of language endangerment, especially those arising from pressure from surrounding majority languages and low language prestige among community members, language games of various kinds could prove to be an effective tool for enhancing prestige and providing an additional domain of language use to community members, and could also help researchers working with the communities on language documentation and possibly revitalisation. Keeping this in mind, we have developed a word game, mScrabble, a substantially changed and adapted version of the popular game of Scrabble, as a mobile app for a large number of languages. In this paper, we present this game for two endangered Indian languages, Koda and Mahali, and discuss its features and rules, its technical specifications and its initial reception in the community. We also present a generic generator of this game which, given a word list and a few translations (of the items on the interface), could generat...
The present study reports an investigation of English spelling errors made by students who are native speakers of Hindi. The study was conducted on grade five students of an English-medium school in India. Students with similar socio-economic backgrounds, studying in the same school, were given various tasks to test their English spelling skills. The errors were grouped into five categories. It was noticed that most of the wrong spellings were phonologically similar to the correct spelling. The students wrote what they heard. They knew the letters of the English alphabet and related these letters to certain sounds. This seems to be the influence of L1 (Hindi), where there is a one-to-one correspondence between sound and orthographic symbol. This is not so in English. Moreover, many English words have silent letters in their spellings, which also created problems for the students. Lack of knowledge of the correct pronunciation of English words added to the spelling errors. This study is important because, to the best of my knowledge, no such study on the effect of Hindi on English spelling errors has been done earlier.