TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features
| Authors | Khan Inayatullah, Saffari Amir, Bischof Horst |
|---|---|
| Appeared in | Proc. 33rd Workshop of the Austrian Association for Pattern Recognition, AAPR / ÖAGM 2009 |
| Pages | 213-224 |
| Date | 2009 |
| Abstract | The Internet offers a vast amount of multi-modal, heterogeneous information, mainly in the form of textual and visual data. Most current web-based visual object classification methods utilize only one of these data streams. As we show in this paper, combining these modalities properly often yields better results than are attainable by relying on only one of them. However, to the best of our knowledge, there is no publicly available dataset for benchmarking algorithms that use textual and visual data simultaneously. Therefore, in this work, we present an annotated multi-modal dataset, named TVGraz, which currently contains 10 visual object categories. The visual appearance of the objects in the dataset is challenging and offers a less biased benchmark. To facilitate the use of this dataset in the vision community, we additionally provide preprocessed text data obtained with the VIPS (VIsion-based Page Segmentation) method. We use a Multiple Kernel Learning (MKL) method to combine the textual and visual features and show improved classification and ranking results compared to using only one of the data streams. |
