Understanding speech and scene with ears and eyes
Acronym
USSEE
Description of the granted funding
One of the biggest challenges in AI is to develop computational abilities for understanding speech and visual scenes as effectively as humans do. This project aims to develop multimodal techniques for understanding and interpreting aural and visual inputs. These novel machine-learning techniques will first learn representations of visual stimuli and human speech at various levels of abstraction, and then learn cross-modal correlations between these representations. This can be achieved by devising new network structures and by using diverse unimodal and multimodal datasets to train the parts of the model first separately and then jointly. As a result, we believe, the accuracy of speech recognition as well as of visual description and interpretation will improve.
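The separate-then-joint training of per-modality encoders linked by cross-modal correlations can be illustrated with a minimal sketch. The dimensions, the linear "encoders", and the symmetric contrastive (InfoNCE-style) objective below are illustrative assumptions, not details taken from the project description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoders": linear projections mapping each modality into a shared
# embedding space. Real encoders would be deep networks trained first on
# unimodal data; sizes here are arbitrary placeholders.
d_audio, d_visual, d_shared = 40, 64, 16
W_audio = rng.normal(size=(d_audio, d_shared))
W_visual = rng.normal(size=(d_visual, d_shared))

def encode(x, W):
    """Project inputs into the shared space and L2-normalise."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_loss(za, zv, temperature=0.07):
    """Symmetric InfoNCE: matched audio-visual pairs should align,
    mismatched pairs within the batch should not."""
    logits = za @ zv.T / temperature
    idx = np.arange(len(za))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()  # diagonal = matched pairs
    return 0.5 * (xent(logits) + xent(logits.T))

# A batch of paired speech/scene feature vectors (random placeholders).
batch = 8
audio = rng.normal(size=(batch, d_audio))
visual = rng.normal(size=(batch, d_visual))
loss = contrastive_loss(encode(audio, W_audio), encode(visual, W_visual))
print(loss)
```

Minimising such a loss jointly over both encoders pulls matched audio and visual representations together in the shared space, which is one common way to learn the cross-modal correlations the description refers to.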
Starting year
2022
End year
2024
Granted funding
Funder
Research Council of Finland
Funding instrument
Targeted Academy projects
Other information
Funding decision number
345790
Fields of science
Computer and information sciences
Research fields
Computational data analysis
Identified topics
languages, linguistics, speech