Brandon Birmingham is a Computer Scientist who recently completed his PhD research in Vision and Language at the University of Malta. His work contributed to Spatial Relation Detection in images and introduced novel models for automatic Image Caption Generation.
He graduated with a BSc (Hons) and MSc in Computer Science from the University of Malta in 2015 and 2016, respectively. In parallel with his academic studies and duties, he worked in the industry as a Software and Business Intelligence Developer.
He is passionate about creative problem solving and engineering innovative technological solutions. Brandon is also intrigued by psychology and neuroscience.
Download CV.
PhD in Artificial Intelligence, 2022
University of Malta
MSc in Computer Science, 2016
University of Malta
BSc (Hons) in Computing Science, 2015
University of Malta
Python, Java, C#, C, C++, Assembly, Web development
TensorFlow, PyTorch, pandas, NumPy, scikit-learn, NLTK, Matplotlib
SQL, SSIS, SSRS, SSAS
Detecting spatial relationships between objects depicted in an image is an important sub-task in vision and language understanding. Its practical use lies in visual discourse, when referring to objects by their relationship in the context of others, and it finds application in higher-level tasks such as visual question answering and image description generation. One might presume that the selection of spatial prepositions grounded in an image is straightforward. In general, however, human beings do not always agree, or are not consistent, when choosing spatial prepositions. This can be due to various reasons, such as near-synonyms, overlapping terms and different frames of reference. For these reasons, the automatic detection of spatial relations is a non-trivial multi-label problem. This paper addresses the automatic multi-selection of prepositions. The study is based on the development of a number of machine learning models, namely Nearest Neighbor (NN), k-Means Clustering (kM-C), Agglomerative Hierarchical Clustering (A-HC) and a Multi-label Neural Network (ML-NN). Model performance is compared quantitatively using multi-label metrics, as well as human evaluations that are independent of the ground-truth labels. Additionally, the classification results are used as the basis for an error and qualitative analysis that sheds light on how each model deals with synonymous and overlapping relations, and that groups common errors to inform future directions. Furthermore, to gain insight into the merits of multi-label models, a single-label Random Forest (RF) classifier is developed and its results are included in the analysis. Of all the multi-label models, the ML-NN exhibits the best overall performance when evaluated on both the dataset ground truth and the independent human evaluations. It suffers, however, from under-generating prepositions, while the rest of the models often generate more prepositions at the expense of precision.
The clustering-based methods are also not entirely consistent, although they outperform the other models on less frequent spatial configurations. The results from the single-label RF classifier highlight the usefulness of a multi-label model. Finally, the error analysis indicates that the majority of errors are due to the lack of features that give cues about object position and orientation (object pose), the fixed frame of reference, and the failure to resolve depth in perspective views.
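As a rough illustration of the multi-label setting described above, the sketch below trains a small multi-label neural network on toy features and scores it with micro-averaged F1, one of the standard multi-label metrics. The feature values and the three "preposition" labels are placeholders, not the paper's actual data or feature set.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy stand-in for geometric object-pair features (e.g. relative
# position and overlap); purely illustrative.
X = rng.random((200, 6))
# Multi-label targets: each column stands for one spatial preposition
# (e.g. "above", "on", "near"); an object pair may carry several at once.
Y = (X[:, :3] > 0.5).astype(int)

# scikit-learn's MLPClassifier handles multi-label targets natively
# when Y is a binary indicator matrix.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X[:150], Y[:150])

pred = clf.predict(X[150:])
score = f1_score(Y[150:], pred, average="micro")
print("micro-F1:", round(score, 3))
```

Unlike a single-label classifier, the network can emit several prepositions per object pair, which is exactly what inconsistent human labelling calls for.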
Detection of spatial relations between objects in images is currently a popular subject in image description research. A range of different language and geometric object features have been used in this context, but methods so far have not used explicit information about the third dimension (depth), except when it is manually added to annotations. The lack of such information hampers the detection of spatial relations that are inherently 3D. In this paper, we use a fully automatic method for creating a depth map of an image and derive several different object-level depth features from it, which we add to an existing feature set to test the effect on spatial relation detection. We show that adding depth features improves performance in all scenarios tested.
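As a loose sketch of what object-level depth features might look like, the snippet below derives simple depth statistics for an object's bounding box from a depth map. The helper function, the toy depth map and the feature names are illustrative assumptions, not the exact feature set used in the paper.

```python
import numpy as np

def object_depth_features(depth_map, box):
    """Object-level depth statistics from a per-pixel depth map.

    `box` is (x0, y0, x1, y1) in pixels. The statistics are
    illustrative examples of object-level depth features.
    """
    x0, y0, x1, y1 = box
    region = depth_map[y0:y1, x0:x1]
    return {
        "mean_depth": float(region.mean()),
        "min_depth": float(region.min()),
        "max_depth": float(region.max()),
        "depth_range": float(region.max() - region.min()),
    }

# Toy depth map: values increase with distance from the camera.
depth = np.linspace(0.0, 10.0, 100 * 100).reshape(100, 100)
feats = object_depth_features(depth, (10, 10, 30, 30))

# A pairwise cue could then compare two objects' mean depths:
# the sign of the difference says which object is closer.
```

Features like these can be concatenated onto an existing 2D geometric feature vector, letting a classifier pick up relations such as "in front of" or "behind" that 2D geometry alone cannot resolve.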