
Multimodal Semantic Integration
Everyday interaction is to a great extent multimodal and multisensory. However, how does one create multisensory messages? How do we manage to integrate verbal messages from our interacting partners with specific spatiotemporal settings and goals, and how do we form appropriate replies? Beyond live interaction, we have become recipients and creators of more and more multimodal messages: TV series, illustrated prints (newspapers, books, blogs, encyclopedias), captioned photo albums (published through social media platforms or even archived in governmental services, as in the case of crime scene investigation archives), amateur videos, surveillance videos, multimodal learning material, multimodal cultural heritage promotion material, memes, and motion- or speech-controlled video games, to name just a few. Some of these multimodal messages are layman syntheses capturing everyday multimodal interaction, while others are professional (and usually artistic) creations. In all cases, we employ our anticipatory brain to semantically integrate different modalities so that we can understand, predict, interpret or synthesize messages.
However, how does one integrate what one perceives with what others say and do (and vice versa)? Semantic integration of language, perception (vision, audition, olfaction etc.) and motion is a fundamental cognitive mechanism that enables, among other things, multimodal interaction and learning. Understanding and mastering this mechanism enables (a) the development of critical thinking skills in multimodal message analysis and formation and (b) computational modeling of multimodal human-machine/robot interaction. The latter is one of the ultimate objectives for Artificial General Intelligence agents, which are expected to be able to perform multimodal semantic integration when interacting with humans.
An important step in this direction is COSMOROE (CMR), a theoretical framework for describing cross-media semantics, with wide coverage that allows computational modeling. Verbal, visual, motoric (or other) representations of entities, agents, actions, gestures, events and abstract concepts participate in a semantic interplay which ranges from simple equivalence relations to forced (figurative) equivalence (cases of multimodal metonymy and metaphor), contradiction and complementarity (e.g., cases of multimodal complements, compulsory exophora, multimodal apposition etc.). COSMOROE has been employed for the analysis of diverse multimodal genres and everyday interaction, including multimodal dialogues in TV travel series, Hollywood movies, TV ads, and newspaper caricatures, and for developing multimodal applications, such as ones for automatic movie summarisation and for automatic indexing and retrieval of TV programmes and of crime scene investigation documents. For details and examples, visit the CMR Data Series (3 datasets) and the CMR search engine, all of which are open to the public.
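To make the idea of computational modeling concrete, a cross-media relation of the kind described above could be represented as a typed link between two time-stamped segments from different modalities. The following is only a minimal sketch under stated assumptions: the class names (`Relation`, `Segment`, `CrossMediaLink`), the fields, and the example data are hypothetical illustrations, not the COSMOROE annotation scheme or any released CMR tool, and the relation list covers only the families named in the text rather than the full inventory.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # Hypothetical labels: only the relation families named in the text,
    # not the complete COSMOROE inventory.
    EQUIVALENCE = "equivalence"
    METONYMY = "metonymy"            # forced (figurative) equivalence
    METAPHOR = "metaphor"            # forced (figurative) equivalence
    CONTRADICTION = "contradiction"
    COMPLEMENTARITY = "complementarity"


# Relations the text describes as forced (figurative) equivalence.
FIGURATIVE = frozenset({Relation.METONYMY, Relation.METAPHOR})


@dataclass(frozen=True)
class Segment:
    """A stretch of one modality, located in time (seconds)."""
    modality: str   # e.g. "speech", "image", "gesture"
    content: str    # free-text description of what is said or shown
    start: float
    end: float


@dataclass(frozen=True)
class CrossMediaLink:
    """A semantic relation holding between two segments."""
    source: Segment
    target: Segment
    relation: Relation

    def is_figurative(self) -> bool:
        return self.relation in FIGURATIVE

    def overlaps_in_time(self) -> bool:
        # True when the two segments share at least one instant.
        return (self.source.start < self.target.end
                and self.target.start < self.source.end)


# Invented example data, for illustration only.
speech = Segment("speech", "the crown has decided", 12.0, 14.5)
image = Segment("image", "a monarch signing a document", 13.0, 16.0)
link = CrossMediaLink(speech, image, Relation.METONYMY)
```

With such a structure, an annotated corpus becomes a list of `CrossMediaLink` objects that can be filtered, for instance, by relation type or by temporal overlap between the linked segments.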