Job Description
Research Intern – Audio-Visual VoiceAI (Open Source)
We’re looking for a Research Intern to join WhissleAI and help advance our open-source work at the intersection of speech, vision, and structured understanding — inspired by projects like
- advanced speech recognition (asr.whissle.ai)
- recent multi-modal alignment research (example: -main.845.pdf)
You’ll work on developing audio-visual foundation models that connect voice, context, and environment — enabling systems that can listen, see, and act coherently in real time. Most of this work is open-source and contributes directly to the broader research community.
Ideal candidate
- Undergrad, Master’s, or PhD student in CS, AI, or a related field
- Prior research experience (conference/workshop publications a plus)
- Strong background in one or more of: multimodal learning, audio-visual representation learning, speech modeling, or self-supervised methods
- Experience with PyTorch, Hugging Face, or similar frameworks

What you’ll do
- Prototype and evaluate audio-visual alignment models
- Extend our open-source ASR and meta-speech pipelines
- Collaborate on papers, demos, and real-time VoiceAI applications

Location: Remote
Type: Paid internship / research collaboration