Multi Spatial Relation Detection in Images

Predicted multi spatial relation output alongside human-selected prepositions.

Abstract

Detecting spatial relationships between objects depicted in an image is an important sub-task in vision and language understanding. Its practical use lies in visual discourse, when objects are referred to by their relationship to others in context, and it finds application in higher-level tasks such as visual question answering and image description generation. At first glance, selecting spatial prepositions grounded in an image may seem straightforward. In general, however, human beings do not always agree with one another, nor are they always consistent, when choosing spatial prepositions. This could be due to various reasons, such as near synonyms, overlapping terms and different frames of reference. For these reasons, the automatic detection of spatial relations is a non-trivial multi-label problem. This paper addresses the automatic multi-selection of prepositions. The study is based on the development of a number of machine learning models, namely Nearest Neighbor (NN), k-Means Clustering (kM-C), Agglomerative Hierarchical Clustering (A-HC) and Multi-label Neural Network (ML-NN). Model performance is compared quantitatively using multi-label metrics as well as human evaluations that are independent of the ground truth labels. Additionally, the classification results are used as a basis for an error and qualitative analysis that sheds light on how each model deals with synonymous and overlapping relations, and that groups common errors to inform future directions. Furthermore, to gain insight into the merits of multi-label models, a single-label Random Forest (RF) classifier is developed and its results are included in the analysis. Of all multi-label models, the ML-NN exhibits the best overall performance when evaluated on both the dataset ground truth and the independent human evaluations. It does, however, under-generate prepositions, while the rest of the models often generate more prepositions at the expense of precision. The clustering-based methods are also not entirely consistent, although they do better than the other models in less frequent spatial configurations. The results from the single-label RF classifier highlight the usefulness of having a multi-label model. Finally, the error analysis indicates that most errors are due to a lack of features that give cues on object position and orientation (object pose), the fixed frame of reference, and the failure to resolve depth in perspective view.
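
The trade-off described above, between under-generating prepositions and generating more at the expense of precision, can be illustrated with a minimal sketch. The abstract does not say which multi-label metrics the paper uses, so the example-based precision, recall and F1 below are an assumption chosen for illustration. The function name `example_based_scores` and all preposition sets are hypothetical and are not taken from the paper's data.

```python
def example_based_scores(predicted, gold):
    """Average per-example precision, recall and F1 over label sets.

    predicted and gold are equal-length lists of sets of prepositions,
    one set per object pair. This is an illustrative metric choice,
    not necessarily the one used in the paper.
    """
    precisions, recalls, f1s = [], [], []
    for pred, true in zip(predicted, gold):
        overlap = len(pred & true)
        p = overlap / len(pred) if pred else 0.0
        r = overlap / len(true) if true else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(gold)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical human-selected prepositions for two object pairs.
gold = [{"on", "above"}, {"next to", "near"}]

# A model that under-generates: every output is correct, but labels are missed.
under = [{"on"}, {"near"}]

# A model that over-generates: all gold labels are found, plus extra ones.
over = [{"on", "above", "over", "near"},
        {"next to", "near", "beside", "behind"}]

under_p, under_r, under_f1 = example_based_scores(under, gold)
over_p, over_r, over_f1 = example_based_scores(over, gold)

print(f"under-generating: P={under_p:.2f} R={under_r:.2f} F1={under_f1:.2f}")
print(f"over-generating:  P={over_p:.2f} R={over_r:.2f} F1={over_f1:.2f}")
```

In this toy case the under-generating model has perfect precision but half the recall, and the over-generating model shows the reverse, which is the pattern the abstract attributes to the ML-NN and to the remaining models respectively.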

Publication
Spatial Cognition & Computation