Using Ancillary Text to Index Web-Based Multimedia Objects

Lyne Da Sylva lyne.da.sylva@umontreal.ca EBSI, Université de Montréal
James Turner james.turner@umontreal.ca EBSI, Université de Montréal

PériCulture is the name of a research project at the Université de Montréal which is part of a larger project based at the Université de Sherbrooke. The parent project aimed to form a research network for managing Canadian digital cultural content. The project was financed by Canadian Heritage and was conducted during the fiscal year 2003-2004. PériCulture takes its name from péritexte and culture, péritexte being one of a number of terms used (in French, our working language) to mean ancillary text associated with images and sound. It is a sister project to DigiCulture, another part of the same larger research project, which studied user behaviours in interactions with Canadian digital cultural content.

The general research objective of PériCulture was to study indexing methods for Web-based nontextual cultural content, specifically still images, video, and sound. Specific objectives included:

1. identifying properties of ancillary text useful for indexing;
2. comparing various combinations of these properties in terms of performance in retrieval;
3. contributing to the development of bilingual and multilingual searching environments;
4. developing retrieval strategies using ancillary text and synonyms of useful terms found therein.

In computer science, research into indexing images and sound focuses on the low-level approach, performing statistical manipulations on primitives in order to identify semantic content. This approach is also referred to as the content-based approach (e.g. Gupta and Jain; Lew). In information science, research into indexing images and sound focuses on associating textual information with the nontextual elements, and this often involves manipulating ancillary text. This approach is referred to as the high-level or concept-based approach (e.g.
Rasmussen; O'Connor, O'Connor, and Abbas). A number of factors militate in favour of automating the high-level approach as much as possible. These include the very large volume of Web-based materials available, the disparity among cataloguing and indexing methods from one collection to another, and the high cost and relative inconsistency of human indexing.

Our work in this project focuses on text associated with Web-based still images, and builds on previous work in this area of information science (e.g. Goodrum and Spink; Jörgensen; Jörgensen et al.; Turner and Hudon). We identified a number of Web sites that met our criteria, i.e., that contained multimedia objects, that had text associated with these objects that was broader than file names and captions, that were bilingual (English and French), and that housed Canadian digital cultural content. We identified keywords that were useful in indexing and studied their proximity to the object described. We looked at indexing information contained in Meta tags and Alt attributes, and at whether other tags contained useful indexing terms. We studied whether standards such as the Dublin Core were used. We identified Web-based resources for gathering synonyms for the keywords.

Our study found that a large number of useful indexing terms are available in the ancillary text of many Web sites with cultural content. We evaluated various types of ancillary text as to their usefulness in retrieval. Our results suggest that these terms can be manipulated in a number of ways in automated retrieval systems to improve search results. Cross-language comparison of the results reinforces our previous research results, which suggest that indexing in other languages can be generated automatically from a single language using Web-based tools.
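
As an illustration of how the sources of ancillary text examined in this study (Meta tags, Alt text, and file names) might be gathered automatically, the following is a minimal sketch in Python using only the standard library. It is not the procedure used in the project. The class name `AncillaryTextParser`, the function `extract_ancillary_text`, and the sample page are hypothetical and serve only as an example.

```python
import re
from html.parser import HTMLParser
from posixpath import basename, splitext
from urllib.parse import urlparse


class AncillaryTextParser(HTMLParser):
    """Collect candidate indexing text for the images on one Web page."""

    def __init__(self):
        super().__init__()
        self.meta = {}    # page-level metadata, e.g. keywords or DC.subject
        self.images = []  # one record per <img> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Meta names are lowercased so "DC.subject" and "dc.subject" match.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("src"):
            # Split the file name (without extension) into lowercase terms.
            stem = splitext(basename(urlparse(attrs["src"]).path))[0]
            terms = [t.lower() for t in re.findall(r"[^\W_]+", stem)]
            self.images.append({
                "src": attrs["src"],
                "file_name_terms": terms,
                "alt": attrs.get("alt") or "",
            })


def extract_ancillary_text(html):
    """Return the Meta content and per-image ancillary text found in html."""
    parser = AncillaryTextParser()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "images": parser.images}


# Hypothetical sample page, invented for this example.
SAMPLE = """
<html><head>
<meta name="keywords" content="totem pole, Haida, carving">
<meta name="DC.subject" content="Haida art">
</head><body>
<img src="/images/haida_totem-pole.jpg" alt="Haida totem pole">
</body></html>
"""

if __name__ == "__main__":
    print(extract_ancillary_text(SAMPLE))
```

Captions and surrounding text are left out of this sketch because locating them depends on the layout of each site. A fuller version would also need some measure of proximity between a term and the object it describes.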

Rich information that can be used for retrieval is available in many places on Web sites with cultural content, from the file name to explicit information in captions to descriptive information in surrounding text to the contents of various HTML tags. Algorithms need to be developed to exploit this information in order to improve retrieval.

Finally, we feel that our work is useful because of the synergy created by the approaches we use. We are both interested in image indexing, but come from different fields. Lyne Da Sylva's expertise is in linguistics and James Turner's in information science. By working together, we are able to pool our knowledge and develop richer methods than would otherwise be available to either of us for approaching the question of automating indexing for images and other multimedia objects.

Bibliography

Goodrum, A., and A. Spink. Image searching on the Excite web search engine. Information Processing and Management 27.2 (2001): 295-312.
Gupta, A., and Ramesh C. Jain. Visual information retrieval. Communications of the ACM 40.5: 71-79.
Jörgensen, Corinne. Image attributes: an investigation. PhD thesis, Syracuse University, 1995.
Jörgensen, Corinne. Image attributes in describing tasks: an investigation. Information Processing and Management 34.2/3 (1998): 161-174.
Jörgensen, Corinne, Alejandro Jaimes, Ana B. Benitez, and Shih-Fu Chang. A conceptual framework and empirical research for classifying visual descriptors. Journal of the American Society for Information Science and Technology (JASIST) 52.11 (2001): 938-947.
Lew, Michael S. Principles of visual information retrieval. New York: Springer, 2001.
O'Connor, Brian C., Mary K. O'Connor, and June M. Abbas. User reactions as access mechanism: an exploration based upon captions for images. Journal of the American Society for Information Science 50.8 (1999): 681-697.
Rasmussen, Edie M. Indexing images. Annual Review of Information Science and Technology 32. Ed. Martha E. Williams. 2004. 169-196.
Turner, James M., and Michèle Hudon. Multilingual metadata for moving image databases: preliminary results. L'avancement du savoir : élargir les horizons des sciences de l'information, Travaux du 30e congrès annuel de l'Association canadienne des sciences de l'information. Ed. Lynne C. Howarth, Christopher Cronin, and Anna T. Slawek. Toronto, 2002. 34-45.