An audio- and image-based on-demand content annotation framework for augmenting the video viewing experience on mobile devices


The availability of annotated multimedia content is a crucial requirement for a number of applications. In the context of education, it could support the automatic summarization of recorded lessons or the retrieval of learning material. In the field of entertainment, it could serve to recommend audio and video resources based on users' attitudes. In this work, a framework is presented that supports the augmentation of the video viewing experience on mobile devices by means of image- and text-based annotations extracted on demand from Wikipedia. Speech recognition is exploited to periodically obtain text snippets from the audio track of the video currently displayed on the mobile device, while query-by-image is used to generate a text summary of extracted video frames. The keywords obtained are processed by semantic techniques to find named entities associated with the multimedia content, which are then superimposed on the video and displayed to the user in a synchronized way. Promising results obtained with a prototype implementation showed the feasibility of the proposed solution, which could possibly be combined with other systems, e.g., those providing information about the user's location, preferences, etc., to build more sophisticated context-aware applications.
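The annotation pipeline summarized above (periodic transcript snippets, entity extraction, Wikipedia lookup, time-synchronized overlay) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity recognizer is a naive placeholder, and the `KNOWLEDGE_BASE` dictionary is a hypothetical stand-in for the on-demand Wikipedia queries described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: float    # seconds into the video where the entity was mentioned
    keyword: str    # named entity found in the transcript snippet
    summary: str    # text to superimpose on the video, synchronized at `start`

# Hypothetical stand-in for the on-demand Wikipedia lookup; a real system
# would query Wikipedia for each candidate entity instead.
KNOWLEDGE_BASE = {
    "Colosseum": "Amphitheatre in the centre of Rome, Italy.",
    "Rome": "Capital city of Italy.",
}

def extract_entities(text):
    """Naive placeholder for the semantic/named-entity step:
    keep capitalized tokens that resolve in the knowledge base."""
    tokens = [t.strip(".,;:!?") for t in text.split()]
    return [t for t in tokens if t[:1].isupper() and t in KNOWLEDGE_BASE]

def annotate(transcript_snaps):
    """transcript_snaps: list of (timestamp_seconds, text) pairs, as produced
    by periodic speech recognition on the video's audio track."""
    annotations = []
    for ts, text in transcript_snaps:
        for entity in extract_entities(text):
            annotations.append(Annotation(ts, entity, KNOWLEDGE_BASE[entity]))
    return annotations

if __name__ == "__main__":
    snaps = [(12.0, "The Colosseum was built in Rome."),
             (45.0, "Gladiators fought here for centuries.")]
    for a in annotate(snaps):
        print(f"{a.start:>5.1f}s  {a.keyword}: {a.summary}")
```

The same `Annotation` records could equally be produced from the query-by-image path, so audio- and frame-derived annotations share one synchronized overlay queue.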