
Summary

This thesis explores multimodal machine learning models that combine geographic, temporal, textual, and visual data to overcome the limitations of traditional unimodal approaches. By better reflecting the multi-sensory nature of human perception, such models improve real-world applications and help tackle challenges such as urban event detection, object detection, and demand forecasting.

Key research questions include how multimodal methods can enhance applications that rely on diverse data types, how modalities are best integrated, and what role contextual information plays. We examine the contributions of advances in natural language processing, forecasting, and computer vision, alongside the challenges of multimodal fusion in real-world settings.
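
To make the integration question concrete: a common baseline is late fusion, in which each modality is embedded separately and the embeddings are concatenated before a shared classifier. The sketch below is an illustrative example only, assuming PyTorch and placeholder embedding dimensions; it is not the architecture developed in the thesis.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Project each modality to a shared size, concatenate, then classify (late fusion)."""

    def __init__(self, text_dim=768, image_dim=512, geo_dim=2, time_dim=8,
                 hidden_dim=256, num_classes=10):
        super().__init__()
        # Hypothetical per-modality projections; dimensions are placeholders.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.context_proj = nn.Linear(geo_dim + time_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, text_emb, image_emb, geo_feat, time_feat):
        # Concatenate the projected modalities and classify the fused vector.
        fused = torch.cat([
            self.text_proj(text_emb),
            self.image_proj(image_emb),
            self.context_proj(torch.cat([geo_feat, time_feat], dim=-1)),
        ], dim=-1)
        return self.classifier(fused)

# Example with random placeholder features for a batch of 4 samples.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 512),
               torch.randn(4, 2), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 10])
```

More sophisticated schemes (early fusion, cross-modal attention) differ mainly in where and how the modalities interact; the thesis investigates which choices work best in practice.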

The thesis is based on four publications. The first demonstrates how combining visual and textual data improves urban micro-event classification. The second introduces a system for collecting and analyzing street-level imagery to detect urban objects. The third presents the GIGO dataset for classifying urban garbage, highlighting the need for multimodal approaches. The fourth presents a multimodal product demand forecasting system, showing how the Multimodal Temporal Fusion Transformer improves predictions and helps reduce food waste.
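
As a rough illustration of the forecasting setting, the toy sketch below encodes a sales history with a GRU and fuses it with static image and text embeddings of a product before predicting the next step. It stands in for, but does not reproduce, the Multimodal Temporal Fusion Transformer described in the thesis; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class MultimodalDemandForecaster(nn.Module):
    """Toy forecaster: fuse static product embeddings (image + text) with a
    sales history encoded by a GRU, then predict next-step demand."""

    def __init__(self, series_dim=1, static_dim=512 + 768, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(series_dim, hidden_dim, batch_first=True)
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, sales_history, image_emb, text_emb):
        # sales_history: (batch, timesteps, 1); embeddings: (batch, dim)
        _, h = self.encoder(sales_history)
        static = self.static_proj(torch.cat([image_emb, text_emb], dim=-1))
        # Combine the temporal summary with the static multimodal context.
        return self.head(torch.cat([h[-1], static], dim=-1))

# Placeholder batch: 30 days of sales plus product image/text embeddings.
model = MultimodalDemandForecaster()
pred = model(torch.randn(4, 30, 1), torch.randn(4, 512), torch.randn(4, 768))
print(pred.shape)  # torch.Size([4, 1])
```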

Our findings underscore the importance of contextual information and of techniques such as representation learning for solving real-world problems with multimodal machine learning. While integrating diverse data types remains challenging, addressing these challenges highlights the field's potential to advance future real-world applications.