The rapid rise of social media has unfortunately led to increased online abuse, with memes, which combine images with short, provocative texts, becoming a common vehicle for hateful or derogatory content. Detecting such abusive memes is especially challenging in low-resource languages like Bangla, Hindi, Gujarati, and Bodo, where annotated datasets are limited. To address this gap, we develop a multilingual abusive meme dataset for these four Indic languages, annotated with five labels: sentiment, sarcasm, vulgarity, abuse, and target. In the associated shared task, 20 unique teams submitted over 306 system runs. Performance was evaluated using Macro F1, with the best scores reaching 0.6275 (Bangla), 0.6570 (Hindi), 0.6750 (Gujarati), and 0.6312 (Bodo). This article provides a brief overview of the task, dataset construction, system results, and key methodological approaches.
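Macro F1 computes the F1 score of each class independently and then takes the unweighted mean, so minority classes count as much as majority ones. A minimal sketch with scikit-learn, using illustrative labels rather than shared-task data:

    from sklearn.metrics import f1_score

    y_true = [0, 0, 1, 1, 1, 2]  # gold labels (illustrative three-class example)
    y_pred = [0, 1, 1, 1, 0, 2]  # system predictions
    print(f1_score(y_true, y_pred, average="macro"))  # per-class F1, then unweighted mean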
@article{ghosh2025overview,
title = {{Overview of the HASOC Track at FIRE 2025: Abusive Meme Identification — Shadows Behind the Laughter}},
author = {Ghosh, Koyel and Das, Mithun and Patel, Sumukh and Bhandary, Nilotpal and Das, Alloy and Mukherjee, Animesh and Modha, Sandip and Ganguly, Debasis and Garain, Utpal and Jaki, Sylvia and Mandl, Thomas},
year = {2025},
pages = {28--31},
doi = {10.1145/3777867.3778259},
}

@article{mazumder2025docgraphformer,
title = {{Doc2GraphFormer: Bridging Structured Graph Learning with Transformer Attention for Efficient Document Understanding}},
author = {Mazumder, Souparni and Biswas, Sanket and Pal, Aniket and Das, Alloy and Pal, Umapada and Lladós, Josep},
year = {2025},
journal = {{Lecture Notes in Computer Science}},
pages = {506--522},
doi = {10.1007/978-3-032-04627-7_29},
}

@article{pal2025icdar,
title = {{ICDAR 2025 Handwritten Notes Understanding Challenge}},
author = {Pal, Aniket and Biswas, Sanket and Das, Alloy and Lodh, Ayush and Banerjee, Priyanka and Chattopadhyay, Soumitri and Mondal, Ajoy and Karatzas, Dimosthenis and Lladós, Josep and Jawahar, C. V.},
year = {2025},
journal = {{Lecture Notes in Computer Science}},
pages = {553--567},
doi = {10.1007/978-3-032-04630-7_32},
}

Mammalian hair serves as a critical biological marker, aiding species identification essential for wildlife conservation and crime control. This study introduces the first extensive benchmark for classifying microscopic images of mammal hair from species prioritized for conservation. Our goal is to develop standardized methods, metrics, and best practices for utilizing advanced computer vision techniques, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Swin Transformers, to classify hair samples across the Order, Family, Genus, and Species taxonomic levels. We present a novel dataset of 76 species, including critically endangered and endangered species, curated specifically for this classification challenge. The methodology integrates automated feature extraction of cuticle patterns and medulla structures, enabling high-precision species differentiation. Our findings demonstrate that Swin Transformer-based models outperform traditional CNNs and ViTs across taxonomic levels, with techniques like image cropping further improving classification accuracy by diversifying the training set. The proposed Tricho-Vision framework offers significant applications in biodiversity monitoring and wildlife crime investigation, facilitating accurate species identification from forensic hair samples. Additionally, we introduce an interactive tool for real-time taxonomic classification, showcasing the practical utility of our research and fostering broader interdisciplinary engagement in conservation science and forensic applications.

• Curated dataset with 76 species for research in hair classification.
• Standardized suite for evaluating Tricho-Taxonomy models.
• Exhaustive tests ensure framework performance accuracy.
• Real-time demo highlights practical conservation applications.
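For readers who want to reproduce the classification backbone, a minimal fine-tuning sketch follows. It assumes the timm library and a standard Swin checkpoint; the model variant, hyperparameters, and cropping augmentation are assumptions, not the released Tricho-Vision code.

    import timm
    import torch
    from torchvision import transforms

    NUM_SPECIES = 76  # species-level head; coarser levels (Order/Family/Genus) use smaller heads

    model = timm.create_model("swin_base_patch4_window7_224",
                              pretrained=True, num_classes=NUM_SPECIES)

    # Random cropping diversifies the training set, as the paper reports.
    train_tf = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
        transforms.ToTensor(),
    ])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()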
@article{das2025trichovision,
title = {{Tricho-Vision: The use of computer vision in trichotaxonomy for enhancing wildlife conservation of priority species}},
author = {Das, Alloy and Banerjee, Priyanka and Biswas, Sanket and Kamalakannan, Manokaran and Chattopadhyay, Joydev and Banerjee, Dhriti and Mukherjee, Tanoy},
year = {2025},
journal = {{Ecological Informatics}},
volume = {90},
pages = {103161--103161},
doi = {10.1016/j.ecoinf.2025.103161},
}

Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-Bank demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
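As a rough illustration of the retrieval side of the evaluation, the sketch below hand-rolls two of the listed metrics, MRR and Recall@K, over ranked document lists; it is not the official NoTeS-Bank scoring code.

    def mrr(ranked_lists, relevant):
        # Mean reciprocal rank of the first relevant document in each ranking.
        total = 0.0
        for ranking, gold in zip(ranked_lists, relevant):
            for rank, doc in enumerate(ranking, start=1):
                if doc in gold:
                    total += 1.0 / rank
                    break
        return total / len(ranked_lists)

    def recall_at_k(ranked_lists, relevant, k=5):
        # Fraction of relevant documents retrieved in the top k, averaged over queries.
        scores = [len(set(r[:k]) & g) / len(g) for r, g in zip(ranked_lists, relevant)]
        return sum(scores) / len(scores)

    print(mrr([["d3", "d1"]], [{"d1"}]))          # 0.5: first hit at rank 2
    print(recall_at_k([["d3", "d1"]], [{"d1"}]))  # 1.0: d1 retrieved within top 5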
@misc{pal2025notesbank,
title = {{NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding}},
author = {Pal, Aniket and Biswas, Sanket and Das, Alloy and Lodh, Ayush and Banerjee, Priyanka and Chattopadhyay, Soumitri and Karatzas, Dimosthenis and Lladós, Josep and Jawahar, C. V.},
year = {2025},
doi = {10.48550/arxiv.2504.09249},
}

Scene Text Editing (STE) is a challenging research problem that primarily aims at modifying existing texts in an image while preserving the background and the font style of the original text. Despite its utility in numerous real-world applications, existing style-transfer-based approaches have shown sub-par editing performance due to (1) complex image backgrounds, (2) diverse font attributes, and (3) varying word lengths within the text. To address these limitations, in this paper we propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations while preserving a natural and realistic appearance and structure. A combined fusion of target mask generation and style transfer units, with a cascaded self-attention mechanism, is proposed to focus on multi-level text region edits and handle varying word lengths. Extensive evaluation on a real-world database, with a further subjective human evaluation study, indicates the superiority of FASTER in both scene text editing and rendering tasks, in terms of model performance and efficiency. The code and pre-trained models have been released in our GitHub repo.
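The cascaded self-attention idea can be pictured as a stack of residual self-attention layers over text-region features, as in the PyTorch sketch below; dimensions and layer count are assumptions, not the released FASTER code.

    import torch
    import torch.nn as nn

    class CascadedSelfAttention(nn.Module):
        def __init__(self, dim=256, heads=8, levels=3):
            super().__init__()
            self.levels = nn.ModuleList([
                nn.MultiheadAttention(dim, heads, batch_first=True)
                for _ in range(levels)
            ])

        def forward(self, x):                # x: (batch, tokens, dim) text-region features
            for attn in self.levels:
                out, _ = attn(x, x, x)       # self-attention at this level
                x = x + out                  # residual feeds the next level of the cascade
            return x

    feats = torch.randn(2, 49, 256)              # dummy feature tokens
    print(CascadedSelfAttention()(feats).shape)  # torch.Size([2, 49, 256])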
@article{das2025faster,
title = {{FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework}},
author = {Das, Alloy and Biswas, Sanket and Roy, Prasun and Ghosh, Subhankar and Pal, Umapada and Blumenstein, Michael and Lladós, Josep and Bhattacharya, Saumik},
year = {2025},
journal = {{Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}},
pages = {1944--1954},
doi = {10.1109/wacv61041.2025.00196},
}

@article{mazumder2025docgraphx,
title = {{Doc2Graph-X: A Multilingual Graph-Based Framework for Form Understanding}},
author = {Mazumder, Souparni and Biswas, Sanket and Das, Alloy and Lladós, Josep},
year = {2025},
journal = {{Lecture Notes in Computer Science}},
pages = {257--266},
doi = {10.1007/978-3-031-94139-9_24},
}

@article{das2024fasttextspotter,
title = {{FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting}},
author = {Das, Alloy and Biswas, Sanket and Pal, Umapada and Lladós, Josep and Bhattacharya, Saumik},
year = {2024},
journal = {{Lecture Notes in Computer Science}},
pages = {135--150},
doi = {10.1007/978-3-031-78498-9_10},
}

The presence of unpredictable occlusions in natural scene text is a significant challenge, exacerbating the difficulties already posed to text detection and recognition by the variability of such images. Addressing the need for a robust, consistently performing approach to these challenges, this paper presents a new Soft Set-based end-to-end system for text detection, recognition and prediction in occluded natural scene images. This is the first approach to integrate text detection, recognition and prediction, unlike existing systems developed for end-to-end text spotting (text detection and recognition) only. For candidate text component detection, the proposed combination of Soft Sets with Maximally Stable Extremal Regions (SS-MSER) improves text detection and spotting in natural scene images, irrespective of the presence of arbitrarily oriented and shaped text, complex backgrounds and occlusion. Furthermore, a Graph Recurrent Neural Network is proposed for grouping candidate text components into text lines and for fitting accurate bounding boxes to each word. Finally, a Convolutional Recurrent Neural Network (CRNN) is proposed for recognizing text and for predicting characters missing due to occlusion. Experimental results on a new occluded scene text dataset (OSTD) and on the most relevant benchmark natural scene text datasets demonstrate that the proposed system outperforms the state-of-the-art in text detection, recognition and prediction. The code and dataset are available at https://github.com/alloydas/Softset-MSER-Based-Occluded-Scene-Text-Spotting/blob/master/Soft_set_MSER.ipynb
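The MSER detector that the SS-MSER stage builds on is available directly in OpenCV; a minimal candidate-detection sketch follows. The soft-set scoring itself is not shown, and the input path is hypothetical.

    import cv2

    img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(img)            # candidate text components

    for x, y, w, h in bboxes:                            # draw each candidate box
        cv2.rectangle(img, (x, y), (x + w, y + h), 255, 1)
    cv2.imwrite("candidates.jpg", img)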
@article{das2024soft,
title = {{Soft set-based MSER end-to-end system for occluded scene text detection, recognition and prediction}},
author = {Das, Alloy and Shivakumara, Palaiahnakote and Banerjee, Ayan and Antonacopoulos, Apostolos and Pal, Umapada},
year = {2024},
journal = {{Knowledge-Based Systems}},
volume = {305},
pages = {112593--112593},
doi = {10.1016/j.knosys.2024.112593},
}

@article{pradhan2024swinsight,
title = {{SwinSight: a hierarchical vision transformer using shifted windows to leverage aerial image classification}},
author = {Pradhan, Praveen Kumar and Das, Alloy and Kumar, Amish and Baruah, Udayan and Sen, Biswaraj and Ghosal, Palash},
year = {2024},
journal = {{Multimedia Tools and Applications}},
volume = {83},
number = {39},
pages = {86457--86478},
doi = {10.1007/s11042-024-19615-9},
}

When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present to the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes, to establish an important case study. Moreover, we design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter, which achieves comparable or superior performance over existing text spotting architectures on both regular and arbitrary-shaped scene text spotting benchmarks, in terms of both accuracy and model efficiency. The dataset, code and pre-trained models have been released on our GitHub.
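Multi-domain source training of this kind can be approximated by drawing batches from a concatenated pool of per-domain datasets, as sketched below with PyTorch. Random tensors stand in for real per-domain data; this is not the released DA-TextSpotter pipeline.

    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

    # Stand-ins for per-domain datasets (natural scenes, synthetic, underwater).
    natural = TensorDataset(torch.randn(100, 3, 64, 64))
    synthetic = TensorDataset(torch.randn(100, 3, 64, 64))
    underwater = TensorDataset(torch.randn(100, 3, 64, 64))

    mixed = ConcatDataset([natural, synthetic, underwater])
    loader = DataLoader(mixed, batch_size=16, shuffle=True)  # each batch mixes domains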
@article{das2024diving,
title = {{Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes}},
author = {Das, Alloy and Biswas, Sanket and Pal, Umapada and Lladós, Josep},
year = {2024},
journal = {{Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)}},
pages = {410--417},
doi = {10.1109/icra57147.2024.10611120},
}

The capability to adapt to a wide range of domains is crucial for scene text spotting models deployed in real-world conditions. However, existing state-of-the-art approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR to solve scene text spotting for both regular and arbitrary-shaped text, along with an exhaustive evaluation. The results demonstrate the potential of intermediate representations to gain significant performance on text spotting benchmarks across multiple domains (e.g., language, synth-to-real, and documents), both in terms of accuracy and efficiency.
@article{das2024harnessing,
title = {{Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance}},
author = {Das, Alloy and Biswas, Sanket and Banerjee, Ayan and Lladós, Josep and Pal, Umapada and Bhattacharya, Saumik},
year = {2024},
journal = {{Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}},
doi = {10.1109/wacv57701.2024.00077},
}

The infectious disease caused by the novel coronavirus (2019-nCoV) has spread widely since its emergence and has shaken the entire world. It has had an unprecedented effect on daily life, the global economy and public health. Hence, detection of this disease is of life-saving importance for both patients and doctors. Due to limited test kits, it is also a daunting task to test every patient with severe respiratory problems using conventional techniques (RT-PCR). Implementing an automatic diagnosis system is therefore urgently required to overcome the scarcity of COVID-19 test kits in hospitals and health care systems. Diagnostic approaches fall into two main categories: laboratory-based and chest radiography. In this paper, a novel approach for computerized coronavirus (2019-nCoV) detection from lung x-ray images is presented. We propose deep learning models to show the effectiveness of such diagnostic systems. In the experiments, we evaluate the proposed models on a publicly available dataset; they exhibit satisfactory performance and promising results compared with previous existing methods.
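A common way to build such a diagnostic model is transfer learning on a pretrained backbone; a minimal sketch follows, where the ResNet-18 backbone and two-class head are assumptions rather than the paper's exact models.

    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet-pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, 2)                     # COVID-19 vs. normal head

    # Freeze the backbone and train only the new head on chest x-ray images.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc")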
@article{das2022automatic,
title = {{Automatic detection of COVID-19 from chest x-ray images using deep learning model}},
author = {Das, Alloy and Agarwal, Rohit and Singh, Rituparna and Chowdhury, Arindam and Nandi, Debashis},
year = {2022},
journal = {{AIP conference proceedings}},
volume = {2424},
pages = {040003--040003},
doi = {10.1063/5.0076882},
}

Document age estimation using handwritten text line images is useful for several pattern recognition and artificial intelligence applications, such as forged signature verification, writer identification, gender identification, personality trait identification, and fraudulent document identification. This paper presents a novel method for document age classification at the text line level. For segmenting text lines from handwritten document images, wavelet decomposition is used in a novel way: we explore multiple levels of decomposition, which introduce blur as the number of levels increases, to detect word components. The detected components are then used in a direction-guided growing approach with linearity and nonlinearity criteria for segmenting text lines. For classifying text line images of different ages, inspired by the observation that the quality of a document image degrades as the document ages, the proposed method extracts structural, contrast, and spatial features to study degradations at different wavelet decomposition levels. The specific advantages of DenseNet, namely strong feature propagation, mitigation of the vanishing gradient problem, feature reuse, and a reduced number of parameters, motivated us to use DenseNet121 along with a Multi-Layer Perceptron (MLP) for classifying text lines of different ages, feeding both the extracted features and the original image as input. To demonstrate the efficacy of the proposed model, experiments were conducted on our own as well as standard datasets for both text line segmentation and document age classification. The results show that the proposed method outperforms existing methods for text line segmentation in terms of precision, recall, and F-measure, and for document age classification in terms of average classification rate.
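The blur that accumulates over wavelet levels can be observed with PyWavelets: each level keeps only the approximation coefficients, halving the resolution. The wavelet choice and level count below are assumptions, not the paper's configuration.

    import numpy as np
    import pywt

    img = np.random.rand(256, 256)                 # stand-in for a handwritten document image
    approx = img
    for level in range(3):                         # deeper levels give a blurrier approximation
        approx, _details = pywt.dwt2(approx, "haar")  # keep approximation, drop detail bands
        print(level + 1, approx.shape)             # resolution halves at each level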
@article{shivakumara2022new,
title = {{New Deep Spatio-Structural Features of Handwritten Text Lines for Document Age Classification}},
author = {Shivakumara, Palaiahnakote and Das, Alloy and Raghunandan, K. S. and Pal, Umapada and Blumenstein, Michael},
year = {2022},
journal = {{International Journal of Pattern Recognition and Artificial Intelligence}},
volume = {36},
number = {09},
doi = {10.1142/s0218001422520139},
}

@article{chowdhury2021unet,
title = {{U-Net Based Optic Cup and Disk Segmentation from Retinal Fundus Images via Entropy Sampling}},
author = {Chowdhury, Arindam and Agarwal, Rohit and Das, Alloy and Nandi, Debashis},
year = {2021},
journal = {{Advances in Intelligent Systems and Computing}},
pages = {479--489},
doi = {10.1007/978-981-16-4369-9_47},
}