Alloy Das (অলয়)

Ames, Iowa, USA

I am a PhD student in the Department of Mechanical Engineering at Iowa State University, working in the SCSLab under the supervision of Prof. Soumik Sarkar.

My research lies at the intersection of computer vision, multi-modal representation learning, and agricultural AI. I am currently working on:

EmbodiedMAE — a multi-modal masked autoencoder for 3D plant reconstruction from RGB images, depth maps, and point clouds, targeting Sorghum phenotyping.
Lighting-robust instance segmentation — extending SAM with a custom Lighting Convolutional Attention (LCA) module for robust segmentation under challenging illumination conditions.

Previously, I was a Research Assistant at the Computer Vision and Pattern Recognition Unit (CVPRU), Indian Statistical Institute, Kolkata, supervised by Prof. Umapada Pal. My work there focused on scene text spotting, recognition, and editing — resulting in publications at WACV 2024, WACV 2025, ICRA 2024, and ICPR 2024.

I am a peer reviewer for The Visual Computer and Scientific Reports journals.

News

May 25, 2026	🎉 NoTeS-Bank has been accepted at ECML PKDD 2026! [arXiv]
May 22, 2026	🚀 Our paper Lighting-aware Unified Model for Instance Segmentation is now live on arXiv!
Aug 01, 2025	🎓 Started my PhD at Iowa State University, advised by Prof. Soumik Sarkar.
Feb 01, 2025	📄 FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework accepted at WACV 2025!
Dec 01, 2024	📄 FastTextSpotter accepted at ICPR 2024!

Research at a Glance

Publication Timeline

{
  "tooltip": { "trigger": "axis" },
  "grid": { "left": "5%", "right": "5%", "bottom": "10%", "containLabel": true },
  "xAxis": {
    "type": "category",
    "data": ["2021", "2022", "2024", "2025", "2026"],
    "axisLabel": { "color": "#666" }
  },
  "yAxis": {
    "type": "value",
    "name": "Papers",
    "minInterval": 1,
    "axisLabel": { "color": "#666" }
  },
  "series": [
    {
      "name": "Publications",
      "type": "bar",
      "barMaxWidth": 40,
      "data": [1, 2, 5, 7, 1],
      "itemStyle": {
        "color": {
          "type": "linear",
          "x": 0, "y": 0, "x2": 0, "y2": 1,
          "colorStops": [
            { "offset": 0, "color": "#4f8ef7" },
            { "offset": 1, "color": "#7fcfe8" }
          ]
        },
        "borderRadius": [4, 4, 0, 0]
      },
      "label": { "show": true, "position": "top" }
    }
  ]
}
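
The `xAxis.data` and `series[0].data` arrays in the configuration above are matched by position, so a year and its count must sit at the same index. A minimal Python sketch, assuming the per-year counts are kept in a plain dictionary, shows one way to derive both arrays from a single source so they cannot drift apart. The names `counts_by_year` and `timeline_axes` are hypothetical and not part of this site.

```python
import json

# Per-year paper counts, copied from the chart configuration above.
counts_by_year = {"2021": 1, "2022": 2, "2024": 5, "2025": 7, "2026": 1}


def timeline_axes(counts):
    """Return (categories, values) sorted by year and aligned by position."""
    years = sorted(counts)  # four-digit year strings sort chronologically
    return years, [counts[year] for year in years]


categories, values = timeline_axes(counts_by_year)

# Only the two data-bearing fields are rebuilt here; styling keys are omitted.
option_fragment = {
    "xAxis": {"type": "category", "data": categories},
    "series": [{"name": "Publications", "type": "bar", "data": values}],
}

print(json.dumps(option_fragment, indent=2))
```

The counts sum to 16, the same total as the slices of the Publication Venues chart.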

Research Skills

{
  "tooltip": {},
  "radar": {
    "indicator": [
      { "name": "Computer Vision", "max": 10 },
      { "name": "Deep Learning", "max": 10 },
      { "name": "Multi-modal Learning", "max": 10 },
      { "name": "Scene Text Spotting", "max": 10 },
      { "name": "Agricultural AI", "max": 10 },
      { "name": "3D Reconstruction", "max": 10 }
    ],
    "radius": "65%"
  },
  "series": [
    {
      "type": "radar",
      "data": [
        {
          "value": [9, 9, 8, 9, 7, 6],
          "name": "Expertise",
          "areaStyle": { "opacity": 0.3 },
          "lineStyle": { "color": "#4f8ef7", "width": 2 },
          "itemStyle": { "color": "#4f8ef7" }
        }
      ]
    }
  ]
}

Publication Venues

{
  "tooltip": { "trigger": "item", "formatter": "{b}: {c} papers ({d}%)" },
  "legend": {
    "orient": "vertical",
    "right": "5%",
    "top": "center"
  },
  "series": [
    {
      "type": "pie",
      "radius": ["35%", "60%"],
      "center": ["38%", "50%"],
      "avoidLabelOverlap": true,
      "itemStyle": { "borderRadius": 6, "borderColor": "#fff", "borderWidth": 2 },
      "label": { "show": false },
      "emphasis": {
        "label": { "show": true, "fontSize": 13, "fontWeight": "bold" }
      },
      "data": [
        { "value": 4, "name": "WACV / ICRA / ICPR", "itemStyle": { "color": "#4f8ef7" } },
        { "value": 3, "name": "Journals (KBS / MTA / Eco. Inf.)", "itemStyle": { "color": "#7fcfe8" } },
        { "value": 4, "name": "ICDAR / LNCS", "itemStyle": { "color": "#5cc88a" } },
        { "value": 3, "name": "Preprints / Workshops", "itemStyle": { "color": "#f7a64f" } },
        { "value": 2, "name": "AIP / IJPRAI", "itemStyle": { "color": "#e87c7c" } }
      ]
    }
  ]
}

Selected Publications

Tricho-Vision: The use of computer vision in trichotaxonomy for enhancing wildlife conservation of priority species

Alloy Das, Priyanka Banerjee, Sanket Biswas, Manokaran Kamalakannan, Joydev Chattopadhyay, Dhriti Banerjee, and Tanoy Mukherjee

Ecological Informatics, 2025

Mammalian hair serves as a critical biological marker, aiding species identification essential for wildlife conservation and crime control. This study introduces the first extensive benchmark for classifying microscopic images of mammal hair from species prioritized for conservation. Our goal is to develop standardized methods, metrics, and best practices for utilizing advanced computer vision techniques, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Swin Transformers, to classify hair samples across Order, Family, Genus and Species taxonomic levels. We present a novel dataset of 76 species, including critically endangered and endangered species, curated specifically for this classification challenge. The methodology integrates automated feature extraction of cuticle patterns and medulla structures, enabling high-precision species differentiation. Our findings demonstrate that Swin Transformer-based models outperform traditional CNNs and ViTs across taxonomic levels, with techniques like image cropping further improving classification accuracy by diversifying the training set. The proposed Tricho-Vision framework offers significant applications in biodiversity monitoring and wildlife crime investigation, facilitating accurate species identification from forensic hair samples. Additionally, we introduce an interactive tool for real-time taxonomic classification, showcasing the practical utility of our research and fostering broader interdisciplinary engagement in conservation science and forensic applications.

• Curated dataset with 76 species for research in hair classification.
• Standardized suite for evaluating Tricho-Taxonomy models.
• Exhaustive tests ensure framework performance accuracy.
• Real-time demo highlights practical conservation applications.

@article{das2025trichovision,
  title = {{Tricho-Vision: The use of computer vision in trichotaxonomy for enhancing wildlife conservation of priority species}},
  author = {Das, Alloy and Banerjee, Priyanka and Biswas, Sanket and Kamalakannan, Manokaran and Chattopadhyay, Joydev and Banerjee, Dhriti and Mukherjee, Tanoy},
  year = {2025},
  journal = {{Ecological Informatics}},
  volume = {90},
  pages = {103161--103161},
  doi = {10.1016/j.ecoinf.2025.103161},
}

FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework

Alloy Das, Sanket Biswas, Prasun Roy, Subhankar Ghosh, Umapada Pal, Michael Blumenstein, Josep Lladós, and Saumik Bhattacharya

WACV, 2025

Scene Text Editing (STE) is a challenging research problem, that primarily aims towards modifying existing texts in an image while preserving the background and the font style of the original text. Despite its utility in numerous real-world applications, existing style-transfer-based approaches have shown sub-par editing performance due to (1) complex image backgrounds, (2) diverse font attributes, and (3) varying word lengths within the text. To address such limitations, in this paper, we propose a novel font-agnostic scene text editing and rendering framework, named FASTER, for simultaneously generating text in arbitrary styles and locations while preserving a natural and realistic appearance and structure. A combined fusion of target mask generation and style transfer units, with a cascaded self-attention mechanism has been proposed to focus on multi-level text region edits to handle varying word lengths. Extensive evaluation on a real-world database with further subjective human evaluation study indicates the superiority of FASTER in both scene text editing and rendering tasks, in terms of model performance and efficiency. The code and pre-trained models have been released in our GitHub repo.
@article{das2025faster,
  title = {{FASTER: A Font-Agnostic Scene Text Editing and Rendering Framework}},
  author = {Das, Alloy and Biswas, Sanket and Roy, Prasun and Ghosh, Subhankar and Pal, Umapada and Blumenstein, Michael and Lladós, Josep and Bhattacharya, Saumik},
  year = {2025},
  journal = {{}},
  pages = {1944--1954},
  doi = {10.1109/wacv61041.2025.00196},
  arxiv = {2308.02905},
}

FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós, and Saumik Bhattacharya

Lecture Notes in Computer Science, 2024

@article{das2024fasttextspotter,
  title = {{FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting}},
  author = {Das, Alloy and Biswas, Sanket and Pal, Umapada and Lladós, Josep and Bhattacharya, Saumik},
  year = {2024},
  journal = {{Lecture Notes in Computer Science}},
  pages = {135--150},
  doi = {10.1007/978-3-031-78498-9_10},
  arxiv = {2408.14998},
}

Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

Alloy Das, Sanket Biswas, Umapada Pal, and Josep Lladós

ICRA, 2024

When used in a real-world noisy environment, the capacity to generalize to multiple domains is essential for any autonomous scene text spotting system. However, existing state-of-the-art methods employ pretraining and fine-tuning strategies on natural scene datasets, which do not exploit the feature interaction across other complex domains. In this work, we explore and investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data such that it can directly generalize to target domains rather than being specialized for a specific domain or scenario. In this regard, we present to the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes to establish an important case study. Moreover, we also design an efficient super-resolution based end-to-end transformer baseline called DA-TextSpotter which achieves comparable or superior performance over existing text spotting architectures for both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models have been released in our GitHub.
@article{das2024diving,
  title = {{Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes}},
  author = {Das, Alloy and Biswas, Sanket and Pal, Umapada and Lladós, Josep},
  year = {2024},
  journal = {{}},
  pages = {410--417},
  doi = {10.1109/icra57147.2024.10611120},
  arxiv = {2310.00558},
}

Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance

Alloy Das, Sanket Biswas, Ayan Banerjee, Josep Lladós, Umapada Pal, and Saumik Bhattacharya

WACV, 2024

The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions. However, existing SOTA approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR to focus on solving scene-text spotting for both regular and arbitrary-shaped text along with an exhaustive evaluation. The results demonstrate the potential of intermediate representations to gain significant performance on text spotting benchmarks across multiple domains (e.g. language, synth-to-real, and documents), both in terms of accuracy and efficiency.
@article{das2024harnessing,
  title = {{Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance}},
  author = {Das, Alloy and Biswas, Sanket and Banerjee, Ayan and Lladós, Josep and Pal, Umapada and Bhattacharya, Saumik},
  year = {2024},
  journal = {{}},
  pages = {707--717},
  doi = {10.1109/wacv57701.2024.00077},
  arxiv = {2310.00917},
}