With advances in deep learning and computer vision applications, the Visual Font Recognition (VFR) field has evolved rapidly. From browser extensions to mobile and web apps, several efficient systems now exist for identifying fonts from images. However, progress in languages other than English has been limited, largely due to insufficient data availability. To address this obstacle, we have created a synthetic image dataset for VFR encompassing four languages: Bangla, Hindi, Russian, and Spanish. Each language is represented by a dedicated folder with 10 subfolders containing 5,000 images each, for a total of 200,000 images. We also provide the Python generator script used to create this dataset, which can be used to generate synthetic VFR image data for additional languages and so support VFR research in languages with limited resources.
- The dataset has been published in IEEE Data Descriptions and can be accessed from IEEE Xplore
- Access the dataset directly from Mendeley Data
- The MVFR dataset is a comprehensive synthetic dataset that can be a valuable resource for researchers working in the Visual Font Recognition domain. Researchers and developers can use it to explore deep learning architectures for recognizing visual font styles.
- The dataset contains a total of 200,000 images across 4 languages. For each language, we used 10 popular fonts and generated images of 5,000 distinct common words per font, so each language has 50,000 images. This large collection can be used for benchmarking and comparing the performance of models ranging from traditional methods to advanced deep learning architectures.
- This is the first-ever large open-source VFR dataset for these languages. Researchers can use it to develop tools and applications for visual font recognition in each language, e.g., browser extensions and mobile apps.
- Beyond VFR, this dataset can also be used in other computer vision applications. For instance, researchers can experiment with it for optical character recognition (OCR) across diverse font styles.
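
The folder layout described above (one folder per language, each holding 10 subfolders of 5,000 images) lends itself to a simple labeled index. The following is a minimal sketch, not the authors' loader. It assumes that each subfolder corresponds to one font and that its name can serve as the font label, and that images use one of a few common extensions; the function names `index_dataset` and `count_per_language` are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the actual file format in the dataset may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def index_dataset(root):
    """Collect (image_path, language, font) triples.

    Assumes a <root>/<language>/<font>/<image> tree, where the
    subfolder name is treated as the font label (an assumption).
    """
    samples = []
    for language_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for font_dir in sorted(p for p in language_dir.iterdir() if p.is_dir()):
            for image_path in sorted(font_dir.iterdir()):
                # Skip anything that is not a recognized image file.
                if image_path.is_file() and image_path.suffix.lower() in IMAGE_EXTENSIONS:
                    samples.append((image_path, language_dir.name, font_dir.name))
    return samples


def count_per_language(samples):
    """Count indexed images per language."""
    return Counter(language for _, language, _ in samples)
```

Calling `index_dataset("MVFR")` (where `"MVFR"` stands in for wherever the dataset was extracted) and passing the result to `count_per_language` gives a quick sanity check: if the layout matches the description, each of the four languages should report 50,000 images.
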
@ARTICLE{10680521,
author={Tonmoy, Moshiur Rahman and Adnan, Akhtaruzzaman and Saha, Aloke Kumar and Mridha, M. F. and Dey, Nilanjan},
journal={IEEE Data Descriptions},
title={Descriptor: Multilingual Visual Font Recognition Dataset (MVFR)},
year={2024},
volume={1},
number={},
pages={1--6},
doi={10.1109/IEEEDATA.2024.3460768}}