During my time as an undergraduate researcher at IIT Delhi, I became fascinated by how machines "see" the world. While Transformer-based models were dominating NLP, they still struggled with classic computer vision tasks like pedestrian detection. My team and I wanted to understand why.
The "Aha!" Moment
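We realized that the standard way these models were trained, called one-to-one matching, was holding them back. Each ground-truth person is paired with exactly one prediction, and every other prediction is treated as background, no matter how close it is. Imagine trying to learn how to identify a person, but you only get one "correct" signal for every dozen attempts. Learning that way is slow and frustrating.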
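To make the problem concrete, here is a minimal sketch of one-to-one matching on a toy example. The cost values are made up for illustration, and the brute-force search is only a stand-in: real detectors typically solve this assignment with the Hungarian algorithm.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Pick one distinct query per ground-truth box, minimizing total cost.

    cost[q][g] is the (hypothetical) matching cost between query q and
    ground-truth box g. Returns a dict {ground_truth_index: query_index}.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best = float("inf"), None
    # Brute force: try every way of choosing a distinct query for each box.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best = total, queries
    return {g: q for g, q in enumerate(best)}


# Toy costs: 4 queries (rows) x 2 ground-truth pedestrians (columns).
# Queries 0 and 1 both fit pedestrian 0 well; query 2 fits pedestrian 1.
cost = [
    [0.10, 0.90],
    [0.15, 0.85],
    [0.80, 0.20],
    [0.95, 0.90],
]

match = one_to_one_match(cost)
positives = set(match.values())
print(match)      # pedestrian index -> the single query trained as its positive
print(positives)  # every query outside this set is trained as background
```

In this toy case query 1 is almost as good a match for pedestrian 0 as query 0, yet it receives no positive signal at all.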
Our Approach: Many-to-One Matching
We thought, "What if we allow the model to learn from multiple good predictions at once?" We developed a matching algorithm based on min-cost flow that does exactly that: several good predictions can be matched to the same ground-truth person. By giving the model a much richer signal during training, we saw a substantial improvement in detection performance.
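The idea can be sketched as a small flow network. This is an illustrative sketch only, not the exact formulation from our paper: the graph layout, the `max_per_gt` cap, the `background_cost` value, and the toy costs are all assumptions made for the example. Each query sends one unit of flow either to a ground-truth box (which accepts up to `max_per_gt` queries) or to a background node, and a successive-shortest-path solver finds the cheapest routing.

```python
from collections import deque


class MinCostFlow:
    """Successive shortest paths with a queue-based Bellman-Ford search."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # edge = [to, cap, cost, rev_index]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t, max_flow):
        flow, total_cost = 0, 0.0
        while flow < max_flow:
            dist = [float("inf")] * self.n
            prev = [None] * self.n  # (previous node, edge index) on the path
            in_queue = [False] * self.n
            dist[s] = 0.0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                in_queue[u] = False
                for i, (v, cap, cost, _) in enumerate(self.graph[u]):
                    # Small epsilon guards against float round-off loops.
                    if cap > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if prev[t] is None:
                break  # no augmenting path left
            # Find the bottleneck capacity along the path, then push flow.
            push, v = max_flow - flow, t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:
                u, i = prev[v]
                edge = self.graph[u][i]
                edge[1] -= push
                self.graph[v][edge[3]][1] += push
                v = u
            flow += push
            total_cost += push * dist[t]
        return flow, total_cost


def many_to_one_match(cost, max_per_gt=2, background_cost=0.5):
    """Return {ground_truth_index: [query indices matched to it]}."""
    num_q, num_gt = len(cost), len(cost[0])
    source, gt_offset = 0, 1 + num_q  # query q is node 1 + q
    background = gt_offset + num_gt
    sink = background + 1
    mcf = MinCostFlow(sink + 1)
    for q in range(num_q):
        mcf.add_edge(source, 1 + q, 1, 0.0)
        mcf.add_edge(1 + q, background, 1, background_cost)
        for g in range(num_gt):
            mcf.add_edge(1 + q, gt_offset + g, 1, cost[q][g])
    for g in range(num_gt):
        mcf.add_edge(gt_offset + g, sink, max_per_gt, 0.0)
    mcf.add_edge(background, sink, num_q, 0.0)
    mcf.solve(source, sink, num_q)

    matches = {g: [] for g in range(num_gt)}
    for q in range(num_q):
        for v, cap, _, _ in mcf.graph[1 + q]:
            # A saturated query -> ground-truth edge means q was matched to it.
            if gt_offset <= v < background and cap == 0:
                matches[v - gt_offset].append(q)
    return matches


# Toy costs: 4 queries (rows) x 2 ground-truth pedestrians (columns).
cost = [
    [0.10, 0.90],
    [0.15, 0.85],
    [0.80, 0.20],
    [0.95, 0.90],
]
matches = many_to_one_match(cost)
print(matches)
```

On these toy costs, both queries that sit close to the first pedestrian become positives for it, while the query that fits nothing well is routed to background. The per-box cap keeps any single pedestrian from absorbing every nearby prediction.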
It was incredibly rewarding to see this approach work so well on challenging datasets like CrowdHuman. It didn't just make the models faster to train; it also made them much better at detecting pedestrians in crowded, complex environments.
Sharing the Work
We presented this work at WACV 2024. We have always believed in open research, so we've made all our code and matchers available for anyone to experiment with.
If you're interested in the technical nitty-gritty, you can find the full paper and code links in my Publications section.