ZOD Project

ZOD Project – Zero-shot Object Discovery and Description for Visually-Impaired Aid

Realization
  • HES-SO Fribourg
  • Prof. Jean Hennebert
  • Dr. Oussama Zayene
Keywords
  • Multimodal Foundation Models
  • Visual Question Answering
  • Prompt Engineering
Funding
Hasler Stiftung
Schedule
01.01.2023 – 31.12.2023

Blind and partially sighted people struggle daily in non-codified situations where objects are new or in unexpected locations. For example, blind people often need to be accompanied when shopping for new clothes, or simply to match their everyday clothes. As a result, their autonomy in new or changing environments is greatly reduced.

This 12-month study aims to put the latest AI findings at the service of these people so that they can carry out their daily tasks with less dependence on others. More specifically, our goal is to develop an interactive object discovery and description proof of concept (PoC), based on recent deep learning (DL) technologies, namely multimodal foundation models and the Zero-Shot Learning (ZSL) paradigm. The following figure illustrates the functioning of the targeted system.


To better understand this pipeline, let’s imagine a situation where a user wants to find a specific tool in the garage (e.g. a screwdriver) or a missing black cat in the house. The user starts by opening the ZOD mobile app, then asks a natural-language question such as “Where is my black cat sitting?”. The algorithms behind the app convert the spoken question into a text prompt (using the Speech-to-Text and Prompt Engineering modules), then map the textual query to the visual content streamed from the mobile camera. Once the object is identified by the vision-language (VL) model, an appropriate answer or a location clue is generated to guide the user to the targeted object.
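
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the prompt template, the `Detection` structure, the one-third thresholds and the function names are all hypothetical, and the Speech-to-Text and VL models are passed in as plain callables so that any real model could be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Detection:
    """Hypothetical VL-model output: a label and the bounding-box centre,
    in normalized image coordinates (0.0 = left/top, 1.0 = right/bottom)."""
    label: str
    cx: float
    cy: float


def build_prompt(question: str) -> str:
    """Prompt Engineering step (illustrative template, not the project's)."""
    return f"Locate the object the user asks about. Question: {question.strip()}"


def location_clue(det: Optional[Detection]) -> str:
    """Turn a detection into a short clue; the 1/3 thresholds are assumptions."""
    if det is None:
        return "I could not find it in view. Try moving the camera."
    horizontal = "left" if det.cx < 1 / 3 else "right" if det.cx > 2 / 3 else "centre"
    vertical = "top" if det.cy < 1 / 3 else "bottom" if det.cy > 2 / 3 else "middle"
    return f"The {det.label} is at the {vertical} {horizontal} of the view."


def answer(
    audio: bytes,
    frame: Any,
    speech_to_text: Callable[[bytes], str],
    vl_model: Callable[[str, Any], Optional[Detection]],
) -> str:
    """Chain the stages: speech -> text -> prompt -> VL model -> location clue."""
    question = speech_to_text(audio)
    prompt = build_prompt(question)
    detection = vl_model(prompt, frame)
    return location_clue(detection)


# Stand-ins for the real models, used only to exercise the pipeline.
def fake_stt(audio: bytes) -> str:
    return "Where is my black cat sitting?"


def fake_vl(prompt: str, frame: Any) -> Optional[Detection]:
    return Detection(label="black cat", cx=0.8, cy=0.9)


print(answer(b"", None, fake_stt, fake_vl))
# prints: The black cat is at the bottom right of the view.
```

Passing the models in as callables keeps each stage replaceable, so a stub can stand in for a model during testing.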

As in most R&D projects, our study adopts the classic “Work Packages” (WPs) approach. Three WPs have been identified:

  1. Data collection (including the study/choice of existing datasets and the collection/annotation of new data).
  2. Information extraction (including R&D of all the required models).
  3. System integration and validation (including encapsulation of DL algorithms in micro-services and their integration to build a complete processing pipeline).