Content-Based Access to Multimedia Information



S.K. Chang
Department of Computer Science
University of Pittsburgh
Pittsburgh - PA 15260, USA
Phone +1-412-624-8423
E-mail chang@cs.pitt.edu







Abstract: The key to content-based access of multimedia information is to discover, encode and maintain the associations among media objects. Media-enhanced interaction metaphors employ multiple paradigms to facilitate the discovery, encoding and maintenance of these associations. Although the metaphors themselves may be conceptual, their effective incorporation into the user interface for multimedia information access often requires the use of visual querying mechanisms. To access multimedia information by content, various query mechanisms need to be combined, and the user interface should be visual as much as possible and also enable visual relevance feedback, user-guided navigation and user-controlled discovery of new associations. In this paper we first review the technology issues in the visual representation of the information space and strategies for visual reasoning. We then describe three applications, a digital library, medical information fusion and an intelligent web searcher, to illustrate different scenarios of multiparadigmatic interaction for accessing multimedia information. For integrated technology comparison, we give a taxonomy of visual querying paradigms and compare various interaction techniques. Several open research challenges are then discussed, with emphasis on the modeling, integration and evaluation of the Hypermapped Virtual World interaction metaphor.

Keywords: content-based information retrieval, multimedia information systems, visual querying systems

1. Introduction

Recent advances in storage technologies have made the creation of multimedia databases both feasible and cost-effective. Wideband communications also greatly facilitate the distribution of multimedia information across communication networks. Parallel computers lead to faster voice, image and video processing systems. High-resolution graphics and dedicated co-processors enable the presentation of visual information with superior image quality. Multimedia information systems have found their way into many application areas, including geographical information systems (GIS) [LANG92], office automation (OA) [RAU96], distance learning [LITTLE95], health care [REIS92], computer-aided design (CAD) [ANUPAM94], computer-aided engineering (CAE) [KAWATA96], and scientific database (SDB) applications [AHMED94].

Wider applications also lead to more numerous and more sophisticated end users. Multimedia information systems, like other types of information systems, have increasingly become knowledge-based systems, with capabilities to perform many sophisticated tasks by accessing and manipulating domain knowledge. The above-mentioned technological advances demand a better methodology for designing knowledge-based, user-specific multimedia information systems. The design methodology, taking into consideration the diversified application requirements and users' needs, should provide a unified framework for multimedia representation, querying, indexing and spatial/temporal reasoning.

Multimedia databases, when compared to traditional databases, have the following special requirements [FALOUTSOUS94], [FOX91]:

1. The size of the data items may be very large. The management of multimedia information therefore requires the accessing and manipulation of very large data items.

2. The storage and delivery of video data require guaranteed, synchronized delivery.

3. Various query mechanisms need to be combined, and the user interface should be highly visual and should also enable visual relevance feedback and user-guided navigation.

4. The on-line, real-time processing of large volumes of data may be required for some types of multimedia databases.

The focus of this paper is on the third requirement mentioned above. For multimedia databases, there are not only different media types, but also different ways to query the databases. The query mechanisms may include free text search, SQL-like querying, icon-based techniques, querying based upon the entity-relationship (ER) diagram, content-based querying, sound-based querying, as well as virtual-reality (VR) techniques. Some of these query mechanisms are based upon traditional approaches, such as free text search ("retrieve all books authored by Einstein") and the SQL query language ("SELECT title FROM books WHERE author = Einstein"). Some were developed in response to the special needs of multimedia databases, such as content-based querying for image/video databases ("find all books containing the picture of Einstein") [FALOUTSOUS93], and sound-based queries that are spoken rather than written or drawn [TABUCHI91]. Some are dictated by new software/hardware technologies, such as icon-based queries that use icons to denote query targets ("a book") and objects ("a sketch of Einstein"), and virtual reality queries where the query targets and objects can be directly manipulated in a virtual reality environment (the book shelves in a Virtual Library). Except for the traditional approaches and those relying on sound, these techniques share the common characteristic of being highly visual. Therefore, we will concern ourselves mainly with multiparadigmatic visual interfaces for accessing multimedia documents.

A visual interface to multimedia databases, in general, must support some type of visual querying language. Visual query languages (VQLs) are query languages based on the use of visual representations to depict the domain of interest and express the related requests. Systems implementing a visual query language are called Visual Query Systems (VQSs) [BATINI91], [CATARCI95]. They include both a language to express the queries in a pictorial form and a variety of functionalities to facilitate human-computer interaction. As such, they are oriented to a wide spectrum of users, ranging from people with limited technical skills to highly sophisticated specialists. In recent years, many VQSs have been proposed, adopting a range of different visual representations and interaction strategies. These interaction paradigms will be discussed in Section 5.1, but most existing VQSs restrict human-computer interaction to only one kind of interaction paradigm. However, the presence of several paradigms, each one with different characteristics and advantages, will help both naive and experienced users in interacting with the system. For instance, icons may well evoke the objects present in the database, while relationships among them may be better expressed through the edges of a graph, and collections of instances may be easily clustered into a form. Moreover, the user is not required to adapt his perception of the reality of interest to the different views presented by the various data models and interfaces.

The way in which the query is expressed depends on the visual representations as well. In fact, in the existing VQSs, icons are typically combined following a spatial syntax [CHANG95b], while queries on diagrammatic representations are mainly expressed by following links and forms are often filled with prototypical values. Moreover, the same interface can offer to the user different interaction mechanisms for expressing a query, depending on both the experience of the user and the kind of the query itself [CATARCI96].

To effectively and efficiently access information from multimedia databases, we can identify the following design criteria for the user interface: (1) various query mechanisms need to be combined seamlessly, (2) the user interface should be visual as much as possible, (3) the user interface should enable visual relevance feedback, (4) the user interface should support user-guided navigation, and (5) the user interface should facilitate the user-controlled discovery of new associations among media objects.

Addressing these issues, this paper explores the design of multiparadigmatic visual interfaces to multimedia databases. In Section 2, we discuss the visual representation of the information space. Strategies for visual reasoning are surveyed in Section 3. In Section 4 we describe three application examples to illustrate the concept of multiparadigmatic visual interface for multimedia databases. Section 5.1 gives a taxonomy of visual querying paradigms, and Section 5.2 deals with multimedia database interaction techniques. Open research challenges and some concluding remarks are given in Section 6.



2. Representation of the Information Space

In a visual interface for multimedia databases, the information stored in the multimedia database needs to be visualized in an information space. This visualization can either be carried out by the user in the user's mind, in which case it is essentially the user's conceptualization of the database; or the visualization could be accomplished by the system, in which case the visualization is generated on the display screen. In this Section, we describe the different representations of the information space.

Database objects, in general, are abstracted from real-life objects in the real world. Therefore, we can distinguish the logical information space and the physical information space. In the logical space, the abstract database objects are represented. In the physical space, the abstract database objects are materialized and represented as physical objects such as images, animation, video, voice, etc. The physical objects either mimic real-life objects such as objects in a virtual reality, or reflect real-life objects such as diagrams, icons and sketches.

The real world, from which the database objects are abstracted, is the environment that the database objects must relate to. The real world is also often abstracted in the information space. Only in the virtual reality information space is the real world represented in a direct way (see below).

The logical information space is a multi-dimensional space, where each point represents an object (a record, a tuple, etc.) from the database. A database object $e_j$, or an example, is a point in this space. Conceptually, the entire information space then corresponds to all the database objects in a database. The logical information space is thus a unified view of the database, i.e. a universal relation.

Each attribute of a database object represents one dimension in this multi-dimensional space. Therefore, in the logical information space, different dimensions actually have different characteristics: continuous, numerical, discrete, or logical.

A query $q_i$ is an arbitrary region in this information space. A clue $x_k$ is also an arbitrary region in the logical information space, but it may contain additional directional information to indicate visual momentum, such as the direction of browsing. Therefore, an example $e_j$ is a clue, and so is a visual query $q_i$. A hypermap (Section 6), when used as a metaphor, is also a clue.

The information retrieval problem is to construct the "most desirable" query $q_i$ with respect to the examples $e_j$ and the clues $x_k$ presented by the user. The "most desirable" query is one which will retrieve the largest number of relevant database objects and whose "size" in the information space is relatively small. The process of visual reasoning, which will be discussed in Section 3, may help the user find the most desirable query from examples and clues.
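
The trade-off between coverage and query size can be stated more formally. The following formalization, including the weighting constant $\lambda$, is ours and is only meant to make the informal criterion precise:

$$ q^{*} \;=\; \arg\max_{q} \Big( \big|\{\, e_j \in q : e_j \ \text{is relevant} \,\}\big| \;-\; \lambda \,\mathrm{size}(q) \Big), \qquad \lambda > 0, $$

where $\mathrm{size}(q)$ measures the extent of the region $q$ in the logical information space.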

The logical information space may be further structured into a logical information hyperspace, where the clues become hyperlinks that provide directional information, and the information space can be navigated by the user by following these directional clues. Information is "chunked", and each chunk is illustrated by an example (the hypernode).

The physical information space consists of the materialized database objects. The simplest example is as follows. Each object is materialized as an icon, and the physical information space consists of a collection of icons. These icons can be arranged spatially, so that the spatial locations approximately reflect the relations among database objects. More recently, intelligent visualization systems have been developed to provide task-specific visualization assistance [IGNATIUS94]. Such systems can offer assistance in deriving perceptually effective materializations of database objects.

In the physical information space, the objects reflect real-world objects, but the world is still an abstraction of the real world. One further step is to present information in a virtual reality information space. Virtual reality places users in a 3D environment that they can directly manipulate, and 3D features can be used to present query results in a VR setting. For example, the physical locations of medical records can be indicated by blinking icons in a (simplified) 3D presentation of a Virtual Medical Laboratory. If the database refers to the books of a library, we can represent a Virtual Library in which the physical locations of books are indicated by blinking icons in a 3D presentation of the book stacks of the library. What the user sees on the screen is the same (after simplifications) as what can be experienced in the real world. VR settings such as the Virtual Library or the Virtual Medical Laboratory can thus become a new query paradigm. For example, the user can select a book by picking it from the shelf, as in the real world.

It is worth noting that we are talking about "nonimmersive" VR [ROBERTSON93b], i.e. the user is placed in a 3D environment he or she can directly manipulate without wearing head-mounted stereo displays or special gloves, acting only with the mouse, keyboard, and monitor of a conventional graphics workstation. This is an alternative form of VR that is being explored in several research laboratories. The use of 3D modeling and rendering is the same as in immersive VR, because the scene is displayed with the same depth cues: perspective view, hidden-surface elimination, color, texture, lighting, shading. Researchers working in nonimmersive VR report that the user is drawn into the 3D world, since mental and emotional immersion takes place in spite of the lack of visual or perceptual immersion. Moreover, mouse/keyboard-controlled interaction techniques are easy to learn and use, and are often faster than DataGlove interaction techniques. Significant advantages therefore come from using such familiar and inexpensive tools, which lower startup costs. Indeed, immersive VR technology still has many limitations and problems (producing and synchronizing stereo images, handling immersive input devices, etc.), so that researchers spend much more time on the devices than on applications and interaction techniques. As a further advantage, nonimmersive VR does not force office workers to wear special equipment that isolates them from their usual environment, minimizing the psychological and physical stress that most users will not tolerate [BEAUDOUIN92]. On the other hand, new approaches to immersive VR, such as the interactive Worlds-in-Miniature (WIM) [STOAKLEY95], may pave the way for more applications of immersive VR.

Nonimmersive VR is a valuable interaction paradigm that will be fruitful in multimedia database applications, as well as in general business applications. When displays and input devices that are easily manageable and unobtrusive become available, immersive VR will become acceptable as well.

The above categorization can be summarized by the following table:





Information space        Database objects                             Real world
Logical                  abstract objects (points)                    abstracted
Logical hyperspace       abstract objects chunked into hypernodes     abstracted
Physical                 materialized objects (real objects*)         abstracted
Virtual reality          materialized objects mimicking real life     represented directly

Table 1. Summary of information spaces.
*Real objects in the physical information space reflect real-life objects, rather than mimic real-life objects.



3. Strategies for Visual Reasoning

Visual reasoning is the process of reasoning and making inferences, based upon visually presented clues. As mentioned in Section 2, visual reasoning may help the user find the most desirable query from examples and clues. In this Section, we survey strategies for visual reasoning.

Visual reasoning is widely used in human-to-human communication. For example, a teacher draws a diagram on the blackboard. Although the diagram is incomplete and imprecise, the students are able to make inferences to fill in the details, and gain an understanding of the concepts presented. Such diagram understanding relies on visual reasoning so that concepts can be communicated. Humans also use gestures [HANNE92] to communicate. Again, gestures are imprecise visual clues that the receiving person must interpret.

In human-to-computer communication, a recent trend is for the human to communicate with the computer using visual expressions. Typically, the human draws a picture, a structured diagram, or a visual example, and the computer interprets the visual expression to understand the user's intention. This has been called visual coaching, programming by example [MYERS86], or programming by rehearsal [GOULD84], [HUANG90] by various researchers.

Visual reasoning is related to spatial reasoning, example-based programming, and approximate/vague retrieval. Spatial reasoning is the process of reasoning and making inferences, about problems dealing with objects occupying space [DUTTA89]. These objects can be either physical objects (e.g., books, chairs, cars, etc.) or abstract objects visualized in space (e.g. database objects). Physical objects are tangible and occupy physical space in some measurable sense. Abstract objects are intangible but nevertheless can be associated with a certain space in some coordinate system. Therefore, visual reasoning can be defined as spatial reasoning on abstract objects visualized in space.

Example-Based Programming refers to systems that allow the programmer to use examples of input and output data during the programming process [MYERS86]. There are two types of Example-Based Programming: Programming by Example and Programming with Example. "Programming by Example" refers to systems that try to guess or infer the program from examples of input and output or sample traces of execution. This is often called "automated programming" and has been an area of AI research. "Programming with Example" systems require the user to specify everything about the program (there is no inferencing involved), but the programmer can work out the program on a specific example. The system executes the programmer's commands normally, but remembers them for later re-use. Halbert [HALBERT84] characterizes Programming with Examples as "Do What I Did", whereas inferential Programming by Example might be called "Do What I Mean". Many recently developed visual programming systems utilize the example-based programming approach [GOULD84], [MYERS88], [SMITH77]. The approach described in Section 4.1 combines the presentation of visual clues (programming by example) with query augmentation techniques (programming with example).

We now discuss visual reasoning approaches for databases. Most research in database systems is based on the assumptions of precision and specificity of both the data stored in the database, and the requests to retrieve data. In reality, however, both may be imprecise or vague. Motro characterizes three categories of imprecision and/or vagueness: (1) the data stored in the database is imprecise; (2) the retrieval request is imprecise; and (3) the user does not have a precise notion of the contents of the database [MOTRO88].

Imprecision in stored data can be dealt with by applying fuzzy set theory to provide a linguistic description of the stored imprecise data. Fuzzy queries also allow the user to pose imprecise retrieval requests. Such techniques are generally applicable when the source of imprecision is quantifiable into numbers, for example, "the age of a person is somewhere between 40 and 45" (imprecision in stored data), or "retrieve all middle-aged employees" (imprecision in queries). However, when the source of imprecision is not easily quantifiable, for example, "find persons with faces similar to Einstein's face", the above techniques are less suitable. Recent research in content-based retrieval may lead to techniques to address such problems.

Imprecision in the user's model may be classified as follows [MOTRO86]: incomplete knowledge of the data model, imprecise information on the database schema and/or its instance, vagueness of user goals, and incomplete knowledge about the interaction tools.

To deal with imprecision in the user's model, several approaches have been investigated: (i) browsing techniques to provide different views of the database [MOTRO88]; (ii) heuristic interpretation of the user's query, transforming it by a connective approach [D'ATRI89a], [WALD84], [CHANG79]; (iii) example-based techniques to generalize from selected examples [ZLOOF77], or to modify the original query if the answer is not considered satisfactory [WILLIAMS84], [MOTRO88]. The modification is done either interactively or automatically.

Browsing is generally effective and widely used, but may be very wasteful of the user's time. Heuristic interpretation of the user's query can lead to "false drops" or "false hits". Example-based techniques work well for some applications but are hard to generalize. In addition, two common limitations of these approaches [D'ATRI89b] are worth mentioning here: (1) the browsing environment and the querying environment are usually distinct, thus separating the learning and the querying activities; (2) knowledge about the user must be gathered to build the user profile (user model). The approach described in Section 4.1 integrates the querying environment (using the visual query) and the browsing environment.



4. Applications and Techniques

4.1. Application Example: Digital Library

In this and the next two sections, we look at three application examples. The first example is the digital library. To support multimedia information access we need a multiparadigmatic visual interface that supports progressive queries: the Visual Query and Result Hypercube (VQRH) [CHANG94].

We have experimented with information retrieval using VQRH in two application domains: (a) medical databases and (b) library databases. The subjects were students with no previous experience in using VQRH. We will not describe VQRH in detail here; we just recall its basic features, namely: (1) the screen is divided into two main windows; the user formulates a query in the left window, and the results are shown in the right window; (2) both the query and the results can be visualized in any of the available paradigms for query and data representation; (3) the queries formulated during an interaction session are stored, with the corresponding results, as successive slices of the Hypercube, so that each slice can be easily recalled.

The preliminary experiments indicate that users have little difficulty in learning VQRH, and can formulate queries after half an hour of interaction. They generally like the idea of progressive querying, and find it useful to be able to recall any past query-and-result slice. From such experiments, it is already clear that the visualization of the retrieved result is very important for the success of this approach. While in the initial design of VQRH only physical information spaces were used for presenting the data, in a second version of the prototype a VR information space was added. VR is also established as a query paradigm; that is, the user selects with the mouse the items of interest in this 3D space.

When performing a query, the admissibility conditions for switching between a logical paradigm (our previous paradigms are all logical paradigms) and a VR paradigm (such as the Virtual Library) can be defined as follows. For a logical paradigm, a VR-admissible query is an admissible query whose retrieval target object is also an object in VR. For example, the VR for the Virtual Library contains stacks of books, and a VR-admissible query could be any admissible query about books, because the result of that query can be indicated by blinking book icons in the Virtual Library. Conversely, for a VR paradigm, an LQ-admissible query is a VR scene in which there is a single marked VR object that is also a database object, where the marking is achieved by an operation icon such as similar_to (find objects similar to this object), near (find objects near this object), above (find objects above this object), below (find objects below this object), and other spatial operators. For example, in the VR for the Virtual Library, a book marked by the operation icon similar_to is LQ-admissible and can be translated into the following query: "find all books similar to this book." An example of a VR-admissible logical query is illustrated in Figure 1. The query is to find books on bicycles. It is performed with the iconic paradigm. The result is presented as marked objects in a Virtual Library. The user can then navigate in this Virtual Library, and switch to the VR query paradigm. Figure 2 illustrates an LQ-admissible query. The query is to find books similar to a specific book about bicycles that has been marked by the user. The result is again rendered as marked objects in a Virtual Library. If we switch to a form-based representation, the result could also be rendered as items in a form. This example illustrates that progressive querying can be accomplished with greater flexibility by combining the logical paradigms and the VR paradigms. The experimental VQRH system supports VR paradigms, but the similarity function must be supplied for the problem domain.
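
The two admissibility tests just described can be sketched in a few lines of Python. This is our illustration only: the class and function names (VRObject, lq_admissible_object, etc.) are invented, and the actual VQRH system is not implemented this way.

from dataclasses import dataclass

# Operation icons that can mark a VR object (from the list above).
SPATIAL_OPERATORS = {"similar_to", "near", "above", "below"}

@dataclass
class VRObject:
    obj_type: str          # e.g. "book"
    in_database: bool      # True if the VR object is also a database object
    marker: str = ""       # operation icon attached by the user, if any

def is_vr_admissible(target_type, vr_object_types):
    """A logical query is VR-admissible if its retrieval target type
    is also an object type present in the VR scene."""
    return target_type in vr_object_types

def lq_admissible_object(scene):
    """A VR scene is LQ-admissible if exactly one database object in it
    is marked with a spatial operator; return that object, else None."""
    marked = [o for o in scene
              if o.marker in SPATIAL_OPERATORS and o.in_database]
    return marked[0] if len(marked) == 1 else None

def translate_to_logical(obj):
    """Translate the marked VR object into a logical query string."""
    op = obj.marker.replace("_", " ")
    return f"find all {obj.obj_type}s {op} this {obj.obj_type}"

# A book marked similar_to in the Virtual Library (cf. Figure 2).
scene = [VRObject("book", True, "similar_to"), VRObject("shelf", False)]
print(is_vr_admissible("book", {o.obj_type for o in scene}))  # True
obj = lq_admissible_object(scene)
if obj is not None:
    print(translate_to_logical(obj))  # find all books similar to this book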

Figure 1. A VR-admissible logical query.




Figure 2. An LQ-admissible query.


The experiment was useful for understanding the limitations of the screen design and their impact on the system's usability. Some interesting characteristics of the VR paradigm emerged, which led to a revised screen design. Indeed, the distinction between query space and result space does not make sense in VR, since a query is performed by acting, with either the mouse or another pointing device, in the environment the user is in. The result of the query usually determines some modification of that environment, and this new situation is the one on which a subsequent request is performed. As a consequence, when working with the VR paradigm, the user gets confused by the separation of the query window and the result window. This is easily understandable by looking at Figures 1 and 2. The situation depicted in Figure 1 does not create any problem for the user, who is formulating a query in the iconic paradigm but wants to see the result in VR, since VR actually provides a visual indication of where the requested books can be found. Moreover, showing the results in separate windows gives the user the possibility of viewing them in a different representation, providing the full advantage of viewing the data in different ways [BATINI91]. For example, while VR gives an immediate indication of the physical location of a book, a form-based representation can provide more details at once, such as the title of the book, the authors, the exact number of pages, etc. Therefore, in a situation involving a change of paradigm between query and result representations, the user is perfectly comfortable with the two windows shown on the screen.

However, the user gets confused when working in VR, in the situation depicted in Figure 2. The two windows both show the book shelf, but the left-side window shows the VR query and the right-side window shows the VR result. When the user is visualizing the VR result, it is unnatural to go to a different window to modify the query. There should be only one window, showing both the VR result and the VR query. Therefore, in the new version of the VQRH prototype, the computer screen displays only one view of the Virtual Library at a time. The user first navigates in the Virtual Library and clicks on a bookshelf. The user then proceeds to click on individual books, and uses operators such as "near", "similar_to", "above", etc. to retrieve other books.

Extending the Virtual Library metaphor, we can consider the map as a metaphor for navigation in an information space. Hypermaps, short for cartographic hyperdocuments, fill the gap between hyperdocuments and spatial information, so that knowledge pertaining to several kinds of applications can be organized in a very elegant and efficient way by means of anchors linking words to spatial zones and vice versa, or by linking literal information to coordinates [LAURINI90]. Applications include urban and environmental planning, architectural and mechanical design, building maintenance, archeology, tourism and geographic information systems. Hypermaps can be used advantageously as a metaphor for the representation of all the multimedia hyperbase elements [CAPORAL97]. In GeoAnchor, a map can be built dynamically as a view of the multimedia hyperbase [CAPORAL97]. As shown in Figure 3, each displayed geometry is an anchor to either a geographic node or a related node. Hence, the map on the screen acts both as an index to the nodes and as a view of the multimedia hyperbase. With this metaphor, semantic filtering can be accomplished as illustrated by the example of Figure 4, where user behavior determines the semantic weight of both the nodes and the links of the road network. If the access frequency of a secondary road such as 'D751' is much lower than that of a major road such as 'A10', the secondary road will not appear on the display, for readability.
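
A minimal sketch of this kind of frequency-based semantic filtering follows. The threshold policy and all names are our own illustration; the actual weighting scheme of [CAPORAL97] may differ.

def filter_map_features(features, access_counts, keep_ratio=0.2):
    """Keep a feature only if its access frequency is at least
    keep_ratio times that of the most frequently accessed feature."""
    if not features:
        return []
    max_count = max(access_counts.get(f, 0) for f in features)
    threshold = keep_ratio * max_count
    return [f for f in features if access_counts.get(f, 0) >= threshold]

# The heavily used major road stays on the display; the rarely
# accessed secondary road is filtered out for readability.
counts = {"A10": 500, "D751": 12}
print(filter_map_features(["A10", "D751"], counts))  # ['A10']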

In a Virtual Library a hypermap can also be used as a metaphor to link the most frequently accessed items such as reading rooms, book shelves, etc. to present different views to the end user. This combined metaphor of Hypermapped Virtual Library (which is a combination of the VR information space and the logical information hyperspace) may lead to efficient access of multimedia information from a digital library.



Figure 3. A hypermap example (from [CAPORAL97]).




Figure 4. An example of semantic filtering (from [CAPORAL97]).




4.2. Application Example: Medical Information Fusion

The framework for human- and system-directed medical information retrieval, discovery and fusion [JUNGERT97] is best illustrated by Figure 5. As shown in Figure 5, we envision a three-level model for information: data, abstracted information, and fused knowledge. Information sources such as cameras, sensors or computers usually provide continuous streams of data, which are collected and stored in medical databases. Such data need to be abstracted into various forms, so that the retrieval, processing, consistency analysis and combination of abstracted information become possible. Finally, the abstracted information needs to be integrated and transformed into fused knowledge. These three levels of information form a hierarchy, but at any given moment there is a continuous transformation of data into abstracted information and then into fused knowledge.

Figure 5 illustrates the relationships among data sources, data, abstracted information and fused knowledge, with emphasis on diversity of data sources and multiplicity of abstracted representations. For example, a video camera is a data source that generates video data. Such video data can be transformed into various forms of abstracted representations:

o text (video-to-text abstraction by human agent or computer)
o keyword (video-to-keyword abstraction by human agent or computer)
o assertions (logical representation of abstracted facts)
o qualitative spatial description (abstraction such as the symbolic projection [CHANG96b])
o time sequences of frames (abstraction where both spatial and temporal relations are preserved)

In Figure 5, a potentially viable transformation from data to an abstracted representation is indicated by a small circle. Thus, video can be transformed into a qualitative spatial description or a time sequence of frames. A supported transformation is indicated by a large circle in Figure 5. Thus, image data can be transformed into keywords, assertions (facts) and qualitative spatial descriptions. It should be emphasized that there are more types of abstracted representations than are shown in Figure 5. Conversely, certain information systems may only support text, keywords and assertions as the three allowable types of abstraction. The information sources in Figure 5 may include hard real-time sources (such as the signals captured by sensors), soft real-time sources (such as pre-stored video), and non-real-time sources (such as text, images and graphics from a medical database or a web site).
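
The distinction between viable and supported transformations can be captured in a small lookup structure. The sketch below is ours; the entries follow only the two examples just given, and the rest of Figure 5 is left out.

# Which abstracted representations each data type can be transformed
# into: "viable" = potentially viable (small circle in Figure 5),
# "supported" = supported by the system (large circle in Figure 5).
TRANSFORMATIONS = {
    "video": {"viable": {"qualitative_spatial", "time_sequence"},
              "supported": set()},
    "image": {"viable": set(),
              "supported": {"keyword", "assertion", "qualitative_spatial"}},
}

def shared_supported(type_a, type_b):
    """Representations into which both data types can actually be
    transformed; a non-empty result makes fusion of the two feasible
    (see Section 4.2.2 on horizontal reasoning)."""
    return (TRANSFORMATIONS[type_a]["supported"]
            & TRANSFORMATIONS[type_b]["supported"])

print(shared_supported("video", "image"))  # set(): no shared representation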

The transformation from data to information and then to knowledge is effected by the coordinated efforts of the User, the Active Index System and the Decision Network. As shown in Figure 6, the user interacts with the Active Index System and the Decision Network to obtain information and create fused knowledge. The user can request the Active Index System to collect information from the sources. Since the active index can perform actions in response to the user's requests [CHANG95a], the user is capable of controlling the sources to influence the type of data being collected. For example, the user may turn the video camera on or off, or manually control the positioning of the camera. Moreover, the user can also provide missing information and evaluate the diagnosis produced by the Decision Network.

The Active Index System receives input data as messages, processes them and sends abstracted information as its output to the user or the Decision Network. Data are transformed into abstracted information through the active index cells which also serve as filters to weed out unwanted data. Some index cells can also perform spatial/temporal reasoning [CHANG96a] to generate spatially/temporally abstracted information in the form of assertions. An active index contains index cells that can be attached to sources, while a conventional index is for data already stored in the database. For example, index cells on sensors, web sites or web pages can be created so that an Active Index System can obtain information from selected sources and send it to the user or the Decision Network.
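
The behavior of an index cell, filtering incoming data and forwarding abstracted information, can be sketched as follows. This is our simplification for illustration; the actual index cell model of [CHANG95a], with cell states and cell types, is considerably richer.

class IndexCell:
    """A minimal active index cell: attached to a source, it filters
    incoming messages and forwards abstracted information."""

    def __init__(self, predicate, abstract, downstream):
        self.predicate = predicate    # which input data to accept
        self.abstract = abstract      # data -> abstracted information
        self.downstream = downstream  # consumers (user, Decision Network, ...)

    def receive(self, message):
        if not self.predicate(message):   # weed out unwanted data
            return
        info = self.abstract(message)     # e.g. produce an assertion
        for consumer in self.downstream:
            consumer(info)

# Example: a cell on a sensor stream forwards fever readings as
# assertions to a consumer standing in for the Decision Network.
assertions = []
cell = IndexCell(predicate=lambda m: m["temp"] > 38.0,
                 abstract=lambda m: f"patient {m['id']} has fever",
                 downstream=[assertions.append])
cell.receive({"id": 7, "temp": 39.2})
cell.receive({"id": 8, "temp": 36.8})
print(assertions)  # ['patient 7 has fever']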

The Decision Network is a neural net called LAMSTAR [GRAUPE96] capable of storing knowledge, fusing knowledge and posing requests to the Active Index System to collect more information from the sources. The Decision Network can send messages to the Active Index System to activate index cells which then take appropriate actions to generate abstracted information. The Decision Network can also interact with the user. It can, for example, solicit the user's evaluation of its diagnosis to reorganize its internal knowledge base.

Figure 5. A framework for information retrieval, discovery and fusion from multiple sources.


Figure 6. Relationships among the user, the Active Index System, the Decision Network and the sources.




4.2.1. A Formal Definition of Semantic Consistency

A prototype of the experimental system AMIS2 [CHANG98] is available at http://www.cs.pitt.edu/~jung/AMIS2. It can be used to check for consistency in information retrieval, discovery and fusion. To do so, a more formal definition of consistency is necessary. Our definition of consistency is based upon the transformational approach illustrated by the framework of Figure 5. It differs from the usual definitions of consistency in database theory or in AI, because we believe the problem of consistency for information discovery and fusion must first be addressed at the level of characteristic patterns detected in medical objects. This is where the active medical information system can make the most impact, by drastically reducing the amount of medical information that ultimately must be handled by human operators.

We define consistency functions to check the consistency among media objects of the same media type, by concentrating on their characteristic patterns. For example, two assertions "there is a tumor in the left lung" and "there is no tumor in the left lung" can be checked for consistency, and two images of the same left lung can also be checked for consistency. These consistency functions are media-specific and domain-specific. For example, to check whether two medical images are consistent, the consistency function will verify whether the two images contain similar characteristic patterns such as arteries, bone structures, tissues. For different application domains, different consistency functions are needed.

To check whether media objects of different media types are consistent, they need to be transformed into media objects of the same media type, so that a media-specific, domain-specific consistency function can be applied. Our viewpoint is that each object is characterized by some characteristic patterns that can be transformed into characteristic patterns in a different media type. For example, a characteristic pattern may be a tumor in the image medium, which is transformed into the word "tumor" in the keyword medium. The consistency function can then be applied to the characteristic patterns of objects of the same media type.

Let $o_{i,j}$ be the $j$th object of media type $M_i$. Let $c_{i,k}$ be the $k$th characteristic pattern detected in an object $o_{i,j}$ of media type $M_i$. Let $C_i$ denote the set of all such characteristic patterns of media type $M_i$. Let $\phi_{1,2}$ be the transformation that maps characteristic patterns detected in objects of media type $M_1$ to characteristic patterns of media type $M_2$.

For each media type $M_i$ there is a consistency function $K_i$, which is a mapping from $2^{C_i}$ (the space of all subsets of characteristic patterns of media type $M_i$) to $\{T, F\}$. In other words, it verifies whether a set of characteristic patterns of media type $M_i$ is consistent.

A characteristic pattern $c_{1,k}$ of media type $M_1$ is consistent with respect to media type $M_2$ if the transformed characteristic pattern $\phi_{1,2}(c_{1,k})$ is consistent with the set $C_2$ of all characteristic patterns of media type $M_2$, i.e. $K_2(\{\phi_{1,2}(c_{1,k})\} \cup C_2) = T$. A characteristic pattern $c_{i,k}$ is consistent if it is consistent with respect to all media types $M_j$.

Finally, a multimedia information space is consistent at time $t$ if every characteristic pattern of every media type is consistent at time $t$, and a multimedia information space is temporally consistent if it is consistent at all times.

As an example, an image of media type $M_1$ is examined and a possible tumor-like object is detected. This is a characteristic pattern $c_{1,1}$. The keywords describing the findings of the medical doctor are of media type $M_2$. The transformation $\phi_{1,2}$ maps the characteristic pattern $c_{1,1}$ to $\phi_{1,2}(c_{1,1})$, which could be the keyword "tumor". If the consistency function $K_2$ verifies that the finding "tumor" is consistent with the other findings, then the characteristic pattern $c_{1,1}$ is consistent with respect to media type $M_2$. If we can also verify that $c_{1,1}$ is consistent with the other patterns detected in media type $M_1$, and if the information space contains only objects of these two media types, then we have verified that $c_{1,1}$ is consistent.
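
These definitions can be made concrete with a short sketch. Here $\phi_{1,2}$ is simple labeling, and the keyword consistency function is a toy rule that rejects a finding set containing both a keyword and its negation; the rule and all names are our assumptions for illustration.

def phi_image_to_keyword(pattern):
    """Transformation phi_{1,2}: label an image pattern with a keyword."""
    return pattern["label"]               # e.g. {"label": "tumor"} -> "tumor"

def K_keyword(findings):
    """Toy consistency function K_2 for the keyword media type: a set
    of findings is inconsistent if it contains both 'x' and 'no x'."""
    return not any(("no " + f) in findings for f in findings)

def consistent_wrt_keywords(image_pattern, keyword_findings):
    """Check K_2({phi_{1,2}(c)} union C_2) = T for an image pattern c."""
    transformed = phi_image_to_keyword(image_pattern)
    return K_keyword(keyword_findings | {transformed})

c_11 = {"label": "tumor"}                 # pattern detected in the image
print(consistent_wrt_keywords(c_11, {"tumor", "lesion"}))  # True
print(consistent_wrt_keywords(c_11, {"no tumor"}))         # False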

The information space is temporally consistent if all such findings are consistent at all times. This can be verified only after we run the entire diagnostic procedure. For example, if the "tumor" characteristic pattern is detected at time $t_1$, absent at time $t_2$ and detected again at time $t_3$, where $t_1 < t_2 < t_3$, then there may be a temporal inconsistency. (This temporal inconsistency may lead to the discovery of an important event.)

In this example the transformation function is simply the labeling of characteristic patterns. The "tumor" characteristic pattern is the pattern detected by a pattern recognizer, and there are image processing algorithms which will produce such characteristic patterns. As for the consistency function, we can use similarity functions which accept as inputs the characteristic patterns in some media space (the simplest being keywords) and produce an output that verifies whether the inputs are consistent [SANTINI96]. In other words, we can use similarity functions to determine whether the inputs are all within a certain distance of one another. As will be explained in the next section, we can also use a neural network for consistency checking. For different media, we need to find the most suitable consistency functions.



4.2.2. Consistency Checking by Horizontal/Vertical Reasoning

As illustrated in Figure 5, information fusion is feasible when information from different sources can be converted into similar representations, indicated in Figure 5 by several large circles in the same horizontal row. For example, the system may support the transformation of image, text and web pages into assertions (facts), so that consistency checking among assertions is feasible. We call such reasoning horizontal reasoning because it combines information abstracted from different media encoded in the same uniform representation.

Another type of reasoning is applicable to data from similar media with different abstracted representations so that they can be combined and checked for consistency. We call such reasoning vertical reasoning because it combines information having different representations at different levels of abstraction.

Horizontal reasoning can be accomplished with the help of an artificial neural network due to its ability to combine information abstracted from different media and adequately encoded in the same uniform representation. Once a horizontally uniform representation is obtained, an artificial neural network can check for consistency. If the neural network's reliability $R$ is less than a predefined threshold, then the inputs are regarded as inconsistent. In other words, the consistency function $K$ is derived from $R$.
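
In code, deriving $K$ from $R$ is a one-line thresholding step (the threshold value and all names below are illustrative assumptions):

THRESHOLD = 0.8   # predefined reliability threshold (assumed value)

def K(inputs, reliability):
    """Inputs are consistent iff the network's reliability R(inputs)
    reaches the threshold; K is derived from R as described above."""
    return reliability(inputs) >= THRESHOLD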

The active index can be used in vertical reasoning due to its ability to obtain information from different sources and actively connect them by dynamic linking (using index cells). For example, we can link an image to a keyword and to an assertion (fact), and then domain-specific algorithms can be applied to check their consistency. Vertical reasoning is associative and combines information in different representations. An artificial neural network with fixed connections is not as appropriate for this as an active index with flexible connections.

We now use AMIS2 to illustrate information fusion by horizontal/vertical reasoning. Patient information is abstracted from different media sources, including imaging devices, signal generators, instruments, etc. (vertical reasoning). Once abstracted and uniformly represented, the neural network is invoked to make a tentative diagnosis (horizontal reasoning). Using the active index, similar patient records are found by the Recursive Searcher (vertical reasoning). A retrieved patient record is compared with the target patient record (horizontal reasoning). If similar records lead to a similar diagnosis, then the results are consistent and the patient record (with diagnosis) is accepted and integrated into the knowledge base. If the diagnosis is different, then the results are inconsistent, and the negative feedback can also help the decision network learn.
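
The alternating loop just described can be summarized in pseudocode. Every function below is a placeholder standing in for an AMIS2 component; none of this is actual AMIS2 code.

def fuse_patient(patient, knowledge_base, abstract, diagnose,
                 find_similar, same_diagnosis, learn_from_error):
    record = abstract(patient)              # vertical: media -> uniform form
    diagnosis = diagnose(record)            # horizontal: neural network
    for similar in find_similar(record):    # vertical: active index search
        if not same_diagnosis(similar, diagnosis):  # horizontal: compare
            learn_from_error(record, diagnosis, similar)  # negative feedback
            return None                     # inconsistent: reject for now
    knowledge_base.append((record, diagnosis))      # consistent: accept
    return diagnosis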

In the vertical reasoning phase, in addition to comparing patient data, we can also compare images to determine whether we have found similar patient records. Therefore, content-based image similarity retrieval becomes a part of the vertical reasoning. Depending upon the application domain, image similarity can be based upon shape, color, volume or other attributes of an object, spatial relationship among objects, and so on.

This example illustrates the alternating application of horizontal reasoning (using the neural network for making predictions) and vertical reasoning (using dynamically created active index for making associations). Combined, we have an active information system for medical information fusion and consistency checking.

The research challenge is to find the appropriate visual querying mechanism to express horizontal/vertical reasoning in medical information fusion. One possibility is the Hypermapped Virtual Clinics metaphor, where virtual clinical laboratories are linked using horizontal reasoners, and lab test results are then linked to clinical databases using vertical reasoners.



4.3. Application Example: Intelligent Web Searcher

The issue of providing the user with a powerful and friendly query mechanism for accessing information on the Web has recently been widely investigated. In particular, one of the most critical problems is to find effective ways to build models of the information of interest, and to design systems capable of integrating different and heterogeneous information sources into a common domain model. Popular keyword-based search engines can be regarded as the first generation of such systems; they use feature-based representations (or keyword representations), modeling documents through feature vectors. Such representations make it easy to automatically classify documents, but offer limited capabilities for retrieving the information of interest, still burying the user under a heap of nonhomogeneous information.

In order to overcome such limitations, more sophisticated methods for representing information sources have been proposed by both the database and the artificial intelligence communities. Such methods can be roughly classified as being based on database or knowledge representation techniques. The difference between the two approaches lies mainly in defining either materialized or virtual views of the data extracted from the web sites, and in whether automatic (or semi-automatic) mechanisms for identifying such data are provided.

Typically, in a database approach (see, e.g., [ARANEUS96], [CHAWATHE94]) a model of the information sources has to be explicitly specified by the user and there is no automatic translation from the site information to the data in the corresponding database. However, relying on well-established database techniques carries a number of advantages in the ease and effectiveness of access once the data is stored in the database.

On the other hand, in a knowledge-based approach the idea is that the system handles an explicit representation of the information sources (which again has to be provided to the system), but the information requested by the user is retrieved at query time, by exploiting planning techniques which introduce a certain degree of flexibility in exploring the information sources and extracting information from them (see, e.g., [ARENS96], [LEVY96]), and in some cases even deal with incomplete information [ETZIONI94]. A serious drawback of such approaches is obviously the response time.

The problems still existing in both approaches led us to propose an integrated solution in the WAG system [CATARCI97], which relies on a conceptual modeling language equipped with a powerful visual environment, and on a knowledge representation tool which is meant to provide a simpler representation of the information together with the ability to reason about it. Differing from the other database approaches, WAG attempts to semi-automatically classify the information gathered from various sites based on a conceptual model of the domain of interest (instead of requiring an explicit description of the sources). However, the result of such a classification is materialized and dealt with by using effective database techniques. To illustrate this approach, we first describe an idealized scenario for intelligent information retrieval from the Web:



In the above scenario, the intelligent information retrieval system will structure the information in the web sites and present the relevant information to the user or store the relevant information in a database whose conceptual view is constructed by the system. To accomplish the objective suggested by this scenario, we propose the WAG (Web-At-a-Glance) system which can assist the user in creating a customized database by gleaning the relevant information from a web site containing multimedia data. WAG performs this task by first interacting with the user to construct a customized conceptual view of the web pages pertinent to the user's interests, and then populating the database with information extracted from the web pages.

The WAG system can be realized as an active index which contains multiple index cells (ic's) to perform various tasks. The most important components (realized as index cells) of WAG are the Searcher ic, the Page Classifier ic and the Conceptualizer ic. Each ic will in turn activate other ic's to cooperatively accomplish the objective. WAG can be used by different users. For each of them the active index builds a "personal environment" containing the user's profile, the conceptual views of the domains and the corresponding knowledge bases. However, a user who has just started interacting with WAG is allowed to import, and possibly merge, the personal environments of previous users (unless they are marked as reserved), so as to take advantage of the information those users have already discovered. The new personal environment resulting from the importing operations can be modified, extended, or even rejected by its owner. A further possibility is to ask web sites for permission to be marked as visited by a WAG user, and to add to them links to the conceptual bases of the corresponding WAG users (in this case, too, the users' authorization is mandatory). This leads to the concept of BBCs and LBCs, described below.

The prototyping of WAG components as web pages enhanced by active index cells has the advantage that the user can easily enter information to experiment with the prototype WAG system. Moreover, whenever the user accesses an HTML page, an associated ic can be instantiated to collect information to be forwarded to the WAG ic's, so that flexible on-line interaction can be supported for the Searcher, the Page Classifier and the Conceptualizer. In the current version of the light-weight WAG, the Conceptualizer is not included. Thus the user will pose an information retrieval request to search the web sites, observe the results produced by the Page Classifier, formulate another request to search the web sites, and so on.

The experimental light-weight WAG can be accessed by anyone with a browser. The home page is at http://www.cs.pitt.edu/~jung/WAG. The following scenario describes how to use it:

Step 1: Initialize the WAG system

If this is the first time the user is using the WAG, the user must initialize it. If the user has previously initialized the WAG system, the user need not reinitialize it; the user can create more BBCs and LBCs, or move on to perform the search without creating more index cells. If the user wants to delete all previously created index cells, or if the system seems to have some problems, the user can reinitialize and create a new WAG index cell.

Step 2: Create Big Brothers BBCs

Big Brother index cells (BBCs) are search engines to search web sites on a global scale. Generally speaking, these are commercially available search engines, which will return URLs as the results of keyword searches. The user can create "yahoo" and "lycos" or the user's own BBCs. The user can enter the name of a search engine and click the button below to create a new BBC index cell.

Step 3: Create Little Brothers LBCs

Little Brother index cells (LBCs) monitor individual pages located at any web site. To create a little brother to monitor an individual page, the user can input the page's URL, followed by a positive integer, as shown in the following example:

http://www.cs.pitt.edu/~chang/index.html 3

where the page's full URL must be given explicitly. The LBC will be created, with the three most frequently found keywords assigned to it. The user can also assign his or her own keywords to a page: if the user inputs the URL followed by a list of keywords, then the LBC will be assigned these keywords.

Step 4: Perform the Search

The Searcher accepts the URL of the target page, or a keyword, and performs the search by sending messages to all the LBCs and BBCs. For the LBCs that monitor individual pages, those whose pages are similar to the target page or have a matched keyword will respond, and the URLs of the corresponding pages are returned. For the BBCs that are commercially available search engines such as Yahoo and Lycos, the returned page is processed to yield a list of URLs, and only those URLs whose pages are similar to the target page or have a matched keyword will be retained. To calculate similarity, the keyword occurrence frequencies are first computed. Then the similarity measures between each page and the target page are calculated. Currently we compute three statistical similarity measures: Jaccard, Cosine and Dice. The three similarity measures are averaged, and the resulting value (the average) is used as the similarity between the page and the target page. The thresholds for similarity measures can be set by the user. If the user does not set the various thresholds for similarity retrieval, or the number of keywords to be matched in a page, default values are used in the search.
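
A sketch of this similarity computation over keyword-frequency vectors follows. The averaging of Jaccard, Cosine and Dice follows the description above; the exact variants (e.g. the weighted Jaccard used here) are our assumptions, and WAG's implementation may differ.

import math

def jaccard(a, b):
    """Weighted Jaccard over keyword frequency dictionaries."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    union = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return inter / union if union else 0.0

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def dice(a, b):
    inter = sum(min(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b))
    total = sum(a.values()) + sum(b.values())
    return 2 * inter / total if total else 0.0

def page_similarity(page, target):
    """Average of the three measures, as described above."""
    return (jaccard(page, target) + cosine(page, target)
            + dice(page, target)) / 3

page = {"multimedia": 4, "database": 2}
target = {"multimedia": 3, "index": 1}
print(round(page_similarity(page, target), 3))  # compare to user threshold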

Step 5: Classify Pages for Recursive Search

The user can classify the retrieved pages using the Page Classifier to decide whether to follow a page recursively. First the user displays the results; then the user chooses the pages to be followed. The user can give a search width and a search depth. The search width is the maximum number of pages similar to a target page to be retrieved in one search step. The search depth is the number of search steps. The Searcher will then perform the recursive search. The recursive search is very powerful and may yield a large number of URLs, which can be regarded as a virtual site to be classified by the Page Classifier.
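
The width/depth-bounded recursive search amounts to a bounded breadth-first traversal. In this sketch, fetch_similar(url, width) is a placeholder standing in for one Searcher step that returns up to width similar pages:

from collections import deque

def recursive_search(target_url, fetch_similar, width=5, depth=2):
    """Bounded BFS: follow up to `width` similar pages per step,
    for at most `depth` steps; the result is the "virtual site"."""
    visited = {target_url}
    frontier = deque([(target_url, 0)])
    results = []
    while frontier:
        url, level = frontier.popleft()
        if level == depth:           # search depth: number of search steps
            continue
        for hit in fetch_similar(url, width):   # search width per step
            if hit not in visited:
                visited.add(hit)
                results.append(hit)
                frontier.append((hit, level + 1))
    return results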

The ideal user interface for the WAG may allow the user to visualize the BBCs and LBCs in a physical information space. The user interface may also provide a hypermap (a logical information hyperspace) so that the BBCs and LBCs can be associated with the target documents. The combined metaphor is thus a Hypermapped Virtual World of Big Brothers and Little Brothers.



5. Integrated Technology Comparison

5.1. Taxonomy of Visual Querying Paradigms

As discussed in Section 2, the information stored in a multimedia database is organized in a logical information space. Such logical information needs to be materialized in the physical information space in order to allow the user to view it. We are particularly interested in materializations performed by using visual techniques. Therefore, visual query systems, as defined in Section 1, are needed. A survey of VQSs proposed in recent years is presented in [BATINI91]. In that paper the VQSs are also compared along three taxonomy criteria: (1) the visual representation adopted to present the reality of interest and the applicable language operators; (2) the expressive power, which indicates what can be done by using the query language; (3) the interaction strategies available for performing the queries.

The query paradigm, which settles the way the query is performed and represented, depends very much on the way the data in the database (the query operands) are visualized. The basic types of visual representations analyzed in [BATINI91] are form-based, diagrammatic, and iconic, according to the visual formalism primarily employed, namely forms, diagrams, and icons. A fourth type is the hybrid representation, which uses two or more visual formalisms.

A form can be seen as a rectangular grid having components that may be any combination of cells or groups of cells (subforms). A form is intended to be a generalization of a table. It helps users by exploiting the usual tendency of people to use regular structures for information processing. Moreover, computer forms are abstracted from the conventional paper forms familiar to people in their daily activities. Form-based representations were the first attempt to provide users with friendly interfaces for data manipulation, taking advantage of the two-dimensionality of the computer screen. QBE was a pioneering form-based query language [ZLOOF77]. Queries are formulated by filling appropriate fields of prototypical tables that are visualized on the screen.

Representations based on diagrams are widely adopted in existing VQSs. We use the word diagram with a very broad meaning, referring to any graphics that encodes information using the position and magnitude of geometrical objects and/or shows the relationships among components. Referring to the different types of visual representations analyzed in [LOHSE94], our broad definition of diagram includes graphs (such as bar charts, pie charts, histograms, scatterplots, etc.), graphic tables, network charts, structure diagrams and process diagrams. An important and useful characteristic of a diagram is that, if we modify its expression by following certain rules, its content can show new relationships [ECO75]. Often, a diagram uses visual elements that are in one-to-one association with specific concept types. Diagrammatic representations adopt as typical query operators the selection of elements, the traversal of adjacent elements and the creation of a bridge among disconnected elements.

The iconic representation uses sets of icons to denote both the objects of the database and the operations to be performed on them. In an icon we distinguish the pictorial part, i.e. the image shown on the screen, and the semantic part, i.e. the meaning that such an image conveys. The simplest way to associate a meaning with an icon is to exploit its similarity with the referred object. If we have to represent an abstract concept, or an action, that has no natural visual counterpart, we must consider different correlation modalities between the pictorial and the semantic part [BATINI91]. In iconic VQSs, a query is expressed primarily by combining icons. For example, icons may be vertically combined to denote conjunction (logical AND) and horizontally combined to denote disjunction (logical OR).
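
The following toy sketch illustrates this AND/OR convention; the grid encoding and the icon predicates are our own illustration, not the syntax of any particular iconic system:

```python
# Toy illustration of the iconic AND/OR convention described above:
# each column of icons is a conjunction; side-by-side columns are ORed.

predicates = {                      # hypothetical icon semantics
    "house": lambda obj: obj["type"] == "house",
    "red":   lambda obj: obj["color"] == "red",
    "barn":  lambda obj: obj["type"] == "barn",
}

def matches(obj, icon_columns):
    """icon_columns is a list of columns; a column is a list of icon names.
    An object matches if it satisfies ALL icons of AT LEAST ONE column."""
    return any(all(predicates[icon](obj) for icon in column)
               for column in icon_columns)

query = [["house", "red"], ["barn"]]   # (house AND red) OR barn
print(matches({"type": "house", "color": "red"}, query))   # True
print(matches({"type": "house", "color": "blue"}, query))  # False
```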

All the above representations have complementary advantages and disadvantages. In existing systems, only one type of representation is usually available, which significantly restricts the set of database users who can benefit from the system. An effective database interface should supply multiple representations, providing different interaction paradigms, each with its own characteristics, so that each user, whether novice or expert, can choose the paradigm most appropriate for interacting with the system. Such a multiparadigmatic interface for databases has been proposed in [CATARCI96], where the appropriate interaction paradigm is selected with reference to a user model that describes the user's interests and skills. Another interesting query paradigm is introduced in [CHANG94]; it is based on the idea that a Virtual Reality representation of the database application domain is available. An example was presented in Section 4.1.

The research on multiparadigmatic visual interfaces is conceptually similar to the research on multimodal interfaces for multimedia databases [BLATTNER92]. Multimodal interfaces support multiple input/output channels for human-computer interaction. The rationale for providing different input and output mechanisms is to accommodate user diversity: humans, by their very nature, have unpredictable behavior, different skills, and a wide range of interests. Since we cannot obtain a priori information on how each user wishes to interact with the computer system, we need to create customizable human-computer interfaces, so that the users themselves can choose the best way to interact with the system, possibly by exploiting multiple input and output media.

Effective user interfaces are difficult to build, and multimodal and multimedia user interfaces are even more so, since they have further requirements that must be fully satisfied. The qualities desirable in multimodal and multimedia interfaces have been studied in [HILL92], where the authors identified the following for consideration by the interface designer: (1) blended modalities; (2) appropriate resolvable ambiguity and tolerable probabilistic input capability, such as whole-sentence speech and gesture recognition; (3) distributed control of interaction among interface modules using protocols of cooperation; (4) real-time as well as after-the-fact access to interaction history; and (5) a highly modular architecture. In our view, one such quality, "blended modalities", requires special emphasis. Blending of modes means that at any point a user can continue input in a new, more pragmatically appropriate mode. The requirement "at any point" is not easy to achieve. In the multiparadigmatic interface described in [CATARCI96], the conditions for allowing a paradigm switch during query formulation are carefully specified. The problem needs to be investigated both from the system's and from the cognitive viewpoint. Besides any model that can help predict user behavior, extensive experimentation with users is needed to make sure that the presence of several modes does not create mental overload.

Another issue of great importance to multiparadigmatic interface design is that the expressive power achievable in the different modes, i.e. the kinds of database operations that can be performed, may not be the same. Among the visual paradigms analyzed above, form-based and diagrammatic paradigms often provide the same expressive power as the relational algebra [BATINI91], whereas VR only allows the selection of objects and the retrieval of objects for which similarity functions have been specified. This is even more evident when we consider interaction through different media. A database expert will be very comfortable performing queries with SQL; with current technology, the same expressive power cannot be achieved using speech or stylus-drawn gestures, and such modes have the further disadvantage of providing ambiguous or probabilistic input. Until now interface design has avoided such inputs because their ambiguity is unmanageable. Next-generation interfaces should include such input modes, when appropriate to the task the interface serves, and provide means to resolve specific ambiguities. One possibility is to change the interaction mode, so that in the new mode a given operation is no longer ambiguous.



5.2. Media Interaction Techniques

Computer technology is giving everybody the possibility of directly exploring information resources. On the one hand, this is extremely useful and exciting. On the other hand, the ever-growing amount of available information generates cognitive overload and even anxiety, especially in novice or occasional users. Current user interfaces are usually too difficult for novice users and/or inadequate for experts, who need tools with many options; in both cases the actual power of the computer is limited.

We recognize three different needs of people exploring information: 1) to understand the content of the database, 2) to extract the information of interest, and 3) to browse the retrieved information in order to verify that it matches what they wanted. To satisfy such needs, user-interface designers are challenged to invent more powerful search techniques, simpler query facilities, and more effective presentation methods. When creating new techniques, we must keep in mind the variability of the user population, ranging from first-time or occasional to frequent users, from task-domain novices to experts, and from naive users (requesting very basic information) to sophisticated ones (interested in very detailed and specific information). Since no single technique can satisfy the needs of all these classes of users, the proposed techniques should be conceived as having a basic set of features, with additional features that can be requested as users gain experience with the system.

A user interacting with an information system for the first time should be allowed to navigate it easily in order to get a better idea of the kind of data that can be accessed. Since information systems grow larger and larger while each user is generally interested in only a small portion of the data, one of the primary goals of a designer is to develop filters that reduce the set of data to be taken into account. In recent years a group of researchers at Xerox has developed several information visualization techniques with the aim of helping users understand and process the information stored in the system [ROBERTSON93a]. They have created "information workspaces", i.e. computer environments in which the information is moved from its original source, such as networked databases, and in which several tools are at the user's disposal for browsing and manipulating the information. One of the main characteristics of such workspaces is that they offer graphical representations of information that facilitate rapid perception of overall patterns. Moreover, they use 3D and/or distortion techniques to show some portion of the information at a greater level of detail while keeping it within a larger context. These are usually called fisheye techniques, but it is clearer to call them "focus + context", which better conveys the idea of showing the area of interest (the focus) quite large and in detail, while the other areas are shown successively smaller and in less detail. Such an approach is very effective when applied to documents, and also to graphs, since it achieves a smooth integration of local detail and global context. It has advantages over other approaches to filtering information, such as zooming or the use of two or more views (one of the entire structure and one of a zoomed portion): the former shows local details but loses the overall structure, while the latter requires extra screen space and forces the viewer to mentally integrate the views. In the "focus + context" approach, it is effective to provide animated transitions when changing the focus, so that the user remains oriented across dynamic changes of the display, avoiding unnecessary cognitive load.
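
One simple way to formalize a "focus + context" display is a degree-of-interest score that combines an item's a-priori importance with its distance from the current focus, rendering in detail only the items whose score exceeds a threshold. The sketch below uses hypothetical items, weights and a one-dimensional layout:

```python
# Sketch of a "focus + context" filter: each item gets a degree-of-interest
# score = a-priori importance - distance from the current focus, and only
# items above a threshold are shown in detail. All values are hypothetical.

items = {                     # item -> (a-priori importance, position)
    "title":     (5, 0),
    "chapter-1": (3, 1),
    "detail-1a": (2, 1),
    "chapter-2": (3, 4),
    "detail-2a": (2, 4),
}

def focus_context_view(focus_pos, threshold=2):
    view = {}
    for name, (importance, pos) in items.items():
        doi = importance - abs(pos - focus_pos)  # interest falls off with distance
        view[name] = "detail" if doi >= threshold else "compressed"
    return view

# With the focus on position 1, nearby items are detailed while the distant
# chapter is compressed; the highly important title stays visible in detail.
print(focus_context_view(focus_pos=1))
```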

Shneiderman points out that the perfect search paradigm, which retrieves all and only the desired items, is unattainable [SHNEIDERMAN92]. Still, he suggests some ways of achieving flexible search. A first possibility for searches within documents is to allow "rainbow search": since most word processors support several features (different fonts, sizes, styles, etc.) and text attributes (footnotes, references, etc.), it could be useful to allow a search of all words in italics, or a search through footnotes only. Another new technique is "search expansion": when the user looks for documents using some term, the system can also suggest more general (or more specific) terms, synonyms, or related terms from a thesaurus or a data dictionary in order to perform a more complete search.
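
Both techniques are easy to state operationally. The following sketch, with an invented word-level document representation and a tiny thesaurus, shows a rainbow search restricted to one text attribute and a thesaurus-based search expansion:

```python
# Sketch of the two techniques above; the document format and the
# thesaurus contents are hypothetical.

document = [
    {"text": "multimedia", "style": "italic"},
    {"text": "databases",  "style": "plain"},
    {"text": "search",     "style": "plain"},
    {"text": "retrieval",  "style": "footnote"},
]

thesaurus = {"retrieval": ["search", "access"]}

def rainbow_search(doc, word, style):
    """Search only the words carrying a given attribute, e.g. footnotes."""
    return [w for w in doc if w["style"] == style and w["text"] == word]

def expanded_search(doc, word):
    """Search for the word together with its thesaurus neighbours."""
    terms = {word, *thesaurus.get(word, [])}
    return [w for w in doc if w["text"] in terms]

print(rainbow_search(document, "retrieval", "footnote"))  # one footnote hit
print(expanded_search(document, "retrieval"))             # also finds "search"
```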

Search techniques applicable to multimedia data are very interesting. For instance, sound is among the data types of multimedia databases, and it can serve both as output (a response of the system) and as input (a query). Some existing electronic dictionaries already provide both the meaning of words and their pronunciation, thus offering full information on every requested word. In [MADHYASTHA95] the authors present "sonification", i.e. the mapping of data to sound parameters, as a rich but still largely unexplored technique for understanding complex data. Current technology has favored the development of the graphical dimension of user interfaces while limiting the use of the auditory dimension, partly because the properties of aural cues are not yet as well understood as those of visual signals. Moreover, sound alone cannot convey accurate information without a visual context. The tool described in [MADHYASTHA95] uses sound to complement visualization, thus enhancing the presentation of complex data. Sound can be useful in some situations, for instance as an alarm that reminds the user to do something at a certain time while working with the computer. The opposite is also true, i.e. visualization can help in analyzing sound: for example, an expert performing a detailed analysis of a certain sound may find it useful to look at a plot of its amplitude over a given time interval.
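
As a minimal sketch of sonification, a data value can be mapped linearly onto a pitch; the frequency range and the mapping below are our own illustrative choices, not those of the tool in [MADHYASTHA95]:

```python
# Sketch of sonification: data values mapped to a sound parameter (pitch).
# The linear mapping and the frequency range are illustrative assumptions.

def to_frequency(value, lo, hi, f_min=220.0, f_max=880.0):
    """Map a data value in [lo, hi] linearly onto a pitch in [f_min, f_max] Hz."""
    t = (value - lo) / (hi - lo)
    return f_min + t * (f_max - f_min)

samples = [3.0, 7.5, 9.8]  # e.g. successive readings of some metric
print([round(to_frequency(v, 0.0, 10.0)) for v in samples])  # rising pitches
```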

We can think of a sound search in a music database: the user hums some notes and the system provides the list of symphonies that contain that string of notes. This is not difficult to achieve, provided that the user inputs the notes in an unambiguous way (for example by entering the notes on a staff connected to the computer) and the search is performed on the score sheets of the symphonies stored with the music.
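
Once the hummed query is transcribed into unambiguous note names, such a search reduces to substring matching, as in the following sketch with invented titles and scores:

```python
# Sketch of the note-string search: symphony scores stored as note
# sequences (titles and notes are invented); the hummed query is matched
# as a substring of each stored score.

scores = {
    "Symphony A": "C D E F G G A G F E D C",
    "Symphony B": "G G A G C B G G A G D C",
}

def find_symphonies(hummed_notes):
    query = " ".join(hummed_notes)
    return [title for title, score in scores.items() if query in score]

print(find_symphonies(["G", "G", "A", "G"]))  # both scores contain the phrase
```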

A system called Hyperbook uses sounds imitating bird calls (either in melody or in tone) to retrieve specific bird families within an electronic book on birds [TABUCHI91]. The user can also retrieve a bird by drawing its silhouette. The descriptions provided by both techniques are incomplete, since it is difficult for the user to give an exact specification. Hyperbook solves such queries on the basis of a data model, called the metric spatial object data model, which represents objects in the real world as points in a metric space. In order to select the candidate objects, the system evaluates distances, enabling the user to choose those objects (birds) which have minimal distance from the query in the metric space.
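
A minimal sketch of this kind of metric-space retrieval follows; the two-dimensional feature space and the bird data are our own simplification, not Hyperbook's actual model:

```python
# Sketch of retrieval in a metric space: objects and (incomplete) queries
# are points, and the candidates are the objects nearest to the query.
# The (pitch, duration) feature space and the data are invented.

import math

birds = {                      # bird family -> point in feature space
    "warbler": (8.0, 0.5),
    "owl":     (2.0, 2.0),
    "sparrow": (6.0, 0.8),
}

def nearest(query, k=2):
    """Rank objects by Euclidean distance from the query point."""
    return sorted(birds, key=lambda b: math.dist(birds[b], query))[:k]

print(nearest((7.0, 0.6)))     # ['warbler', 'sparrow']: the closest candidates
```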

Interesting and useful techniques can be exploited for searching images in a database on the basis of their pictorial content. Given a sketch of a house, for example, the user may want to find all pictures that contain that house. With the visual query system called Pictorial Query-By-Example (PQBE), Papadias and Sellis propose an approach to the problem of content-based querying of geographic and image databases [PAPADIAS95]. PQBE exploits the spatial nature of spatial relations in the formulation of a query, which should allow users to formulate queries in a language close to their way of thinking. As in the case of the well-known Query-By-Example, PQBE generalizes from the example given by the user, but instead of skeleton relational tables there are skeleton images.
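
The following toy sketch conveys the flavor of querying by spatial relations; the relation vocabulary and the image encoding are hypothetical and much simpler than PQBE's actual skeleton images:

```python
# Toy sketch of querying by spatial relations: the query states relations
# between example objects and is matched against stored images. The
# encoding (object -> coordinates) and the relation are invented.

images = {
    "img1": {"house": (2, 1), "tree": (5, 1)},
    "img2": {"house": (4, 1), "tree": (1, 1)},
}

def east_of(a, b):
    return a[0] > b[0]

def query(skeleton):
    """skeleton: list of (object, relation, object) constraints."""
    return [name for name, objs in images.items()
            if all(rel(objs[a], objs[b]) for a, rel, b in skeleton
                   if a in objs and b in objs)]

# "Find the images in which the tree lies east of the house":
print(query([("tree", east_of, "house")]))   # ['img1']
```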

Several researchers are proposing interaction environments that exploit techniques other than visual ones. In [RICH94] an interactive multimedia prototype is described that allows users seated in front of a terminal to experience a virtual reality environment. The system integrates a number of key technologies, and the purpose of the prototype is to experiment with such new interaction possibilities. The users communicate with each other and with artificial agents through speech. The prototype also includes audio rendering, hand gesture recognition, and body-position-sensing technology. The authors admit that their system is limited by current technology, but they are confident that in a couple of years what is today expensive or impossible will be commonplace.

Traditional languages such as SQL allow the user to specify exact queries that indicate matches on specific field values. Non-expert and/or occasional users of the database are generally not able to directly formulate a query whose result fully satisfies their needs, at least on their first attempts. Therefore, the users may prefer to formulate a complex query as a succession of simple queries, i.e. step by step, first asking general questions, obtaining preliminary results, and then revisiting those outcomes to further refine the query and extract the result of interest. Since the results obtained up to a certain point may not converge to the expected data, a nonmonotone query progression should be allowed. During this process of progressive querying, an appropriate visualization of the preliminary results gives significant feedback to the user and provides hints about the right way to proceed toward the most appropriate final query; otherwise, the user can immediately backtrack and try an alternative path. Often, even when satisfied with the result, the user is challenged to investigate the database further, and as a result may acquire more information from it.

The advantages of performing a progressive query through visual interaction, displaying the partial results in a suitable representation, have led to the Visual Querying and Result Hypercube (VQRH), a tool that provides a multiparadigmatic approach to progressive querying and result visualization in database interaction [CHANG94]. Using the VQRH tool, the user interacts with the database by means of a sequence of partial queries, each displayed, together with the corresponding result, as one slice of the VQR Hypercube. Successive slices of the Hypercube store partial queries performed at successive times. The query history is thus presented in a 3D perspective, and a particular partial query on a slice may be brought to the front of the Hypercube for further refinement with a simple mouse click.
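
The following sketch (with a hypothetical data set and predicate representation) captures the essence of such a query history: each step stores a (query, result) slice, and any earlier slice can be brought back and refined, so the progression need not be monotone:

```python
# Sketch of progressive querying with a VQRH-like history of slices.

data = [{"city": "Rome", "price": p} for p in (100, 200, 300, 400)]

history = []                                    # the "hypercube" of slices

def run(label, predicate):
    result = [d for d in data if predicate(d)]
    history.append((label, predicate, result))  # a new front slice
    return result

run("price < 400", lambda d: d["price"] < 400)  # a general first question
run("price < 200", lambda d: d["price"] < 200)  # a refinement, too narrow

# Backtrack nonmonotonically: bring the first slice back, refine it anew.
label, pred, _ = history[0]
run(label + " and price > 150", lambda d: pred(d) and d["price"] > 150)

for label, _, result in history:
    print(label, "->", [d["price"] for d in result])
```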

Another powerful technique for querying a database is the "dynamic query", which allows range searches on multi-key data sets. The query is formulated through direct manipulation of graphical widgets such as buttons and sliders, with one widget for every key. The result of the query is displayed graphically and quickly on the screen. It is important that the results fit on a single screen and are displayed quickly, since users should be able to perform tens of queries in a few seconds and immediately see the results. Given a query, a new query is easily formulated by moving the position of a slider with the mouse. This gives the user a sense of power and even fun, challenging him or her to try other queries and see how the result changes. As in the case of progressive queries, the user can ask general queries, see the results, and then refine the query. An application of dynamic queries to a real-estate database is shown in [SHNEIDERMAN92]: there are sliders for location, number of bedrooms, and price of homes in the Washington, D.C. area, and the user moves these sliders to find appropriate homes. Selected homes are indicated by bright points of light on a map of Washington, D.C. shown on the screen.
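
Operationally, a dynamic query is a conjunctive range filter that is re-evaluated on every slider movement, as in the following sketch with an invented real-estate listing:

```python
# Sketch of a dynamic query over a multi-key data set: each "slider" is a
# (low, high) range on one key, and moving a slider simply re-runs the
# filter, so the displayed results update immediately. Data are invented.

homes = [
    {"bedrooms": 2, "price": 150_000},
    {"bedrooms": 3, "price": 240_000},
    {"bedrooms": 4, "price": 410_000},
]

def dynamic_query(records, sliders):
    """sliders maps key -> (low, high); a record must fall in every range."""
    return [r for r in records
            if all(lo <= r[k] <= hi for k, (lo, hi) in sliders.items())]

sliders = {"bedrooms": (2, 3), "price": (0, 300_000)}
print(dynamic_query(homes, sliders))     # two matching homes

sliders["price"] = (200_000, 500_000)    # "moving" the price slider
print(dynamic_query(homes, sliders))     # the result updates at once
```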



6. Open Research Challenges

As mentioned earlier, the key to content-based access to multimedia is the efficient and effective creation, encoding and maintenance of associations among media objects. The interaction metaphors employ multiple paradigms to facilitate making these associations. The virtual library, hypermap, vertical/horizontal reasoners, and big/little brothers are some of the useful interaction metaphors for accomplishing this goal. Although the metaphors themselves are conceptual, their effective incorporation into the user interface for multimedia often dictates the use of visual querying mechanisms. As discussed in Section 2, the appropriate interaction metaphor may involve a combination of several information spaces. To support visual querying, a challenge is to integrate the user's view of the information spaces with an underlying semantic model such as InfoSleuth's ontology model. Indeed, the Hypermapped Virtual World metaphor seems quite compatible with the ontology model, so that an integrated model with four or five layers may do the job.

The use of visual interfaces combining different query mechanisms represents a step toward a truly effective utilization of multimedia information systems by large classes of users. The query mechanisms include speech, sound, gestures, etc., but the user interface should be highly visual, enabling the user to gradually grasp the database contents, the navigation technique, the visual reasoning strategy, and the querying process. As pointed out in [CATARCI95], in the last ten years the research on visual query systems has moved from merely hypothetical and vague ideas to the building of real systems. Yet much more needs to be done, from both the theoretical and the application-oriented point of view. Since the success of a complex application depends largely on how well it matches users' expectations, skills and learning ability, more effort should be devoted to experimenting with and validating the proposed interfaces, in order to provide an accurate evaluation of their usability, which is a crucial factor in the practical utilization of such interfaces for multimedia information systems.

Multimedia interface design involves various issues [BLATTNER92]. In determining the criteria for evaluating visual and multimedia user interfaces, we should also take into consideration similar criteria for evaluating visual programming languages [KIPER97]. For content-based access to multimedia, we must also ask: Can relevant associations be easily discovered? Can new associations be easily created? How are the associations encoded and maintained? These are some of the critical issues to be evaluated.

Although visual interfaces are supposedly "universal", the meanings of visual symbols (icons) do shift across cultural boundaries, and there is no internationally accepted standard of visual symbols. One approach to this problem is to provide multilingual interfaces, so that the user can switch to his or her native language during an interactive session. AMIS2, for example, allows the user to switch between English and Chinese at any time during the interaction. This could be further augmented by speech feedback to enhance the user's understanding. Again, it may be possible to integrate all these multilingual, multimodal, multiparadigmatic aspects in the unified information model.

References:

[AHMED94] Ahmed, Z., L. Wanger, and P. Kochevar, "An Intelligent Visualization System for Earth Science Data Analysis," Journal of Visual Languages and Computing, vol. 5, no. 4, pp. 307-320, December 1994.
[ANUPAM94] Anupam, V. and C. L. Bajaj, "Shastra: Multimedia Collaborative Design Environment," IEEE Multimedia, pp. 39-49, 1994.
[ARANEUS96] The Araneus Project, http://poincare.inf.uniroma3.it:8080/Araneus/araneus.html, 1996.
[ARENS96] Arens, Y., C. A. Knoblock, and W. Shen, "Query Reformulation for Dynamic Information Integration," Journal of Intelligent Information Systems, 1996.
[BATINI91] Batini, C., T. Catarci, M. F. Costabile, and S. Levialdi, Visual Query Systems, Technical Report N. 04.91, Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Italy, 1991 (revised in 1993).
[BEAUDOUIN92] Beaudouin-Lafon, M., "Beyond the Workstation: Mediaspaces and Augmented Reality," pp. 9-18 in Blattner, M. M. and R. B. Dannenberg (Eds.), Multimedia Interface Design, Addison-Wesley, 1992.
[BLATTNER92] Blattner, M. M. and R. B. Dannenberg (Eds.), Multimedia Interface Design, Addison-Wesley, 1992.
[CAPORAL97] Caporal, J. and Y. Viemont, "Maps as a Metaphor in a Geographical Hypermedia System," Journal of Visual Languages and Computing, vol. 8, no. 1, pp. 3-25, February 1997.
[CATARCI95] Catarci, T. and M. F. Costabile (Eds.), "Special Issue on Visual Query Systems," Journal of Visual Languages and Computing, vol. 6, no. 1, 1995.
[CATARCI96] Catarci, T., S. K. Chang, M. F. Costabile, S. Levialdi, and G. Santucci, "A Graph-based Framework for Multiparadigmatic Visual Access to Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 3, pp. 455-475, 1996.
[CATARCI97] Catarci, T., S. K. Chang, L. B. Dong, and G. Santucci, "A Prototype Web-At-a-Glance System for Intelligent Information Retrieval," Proc. of SEKE'97, pp. 440-449, Madrid, Spain, June 18-20, 1997.
[CHANGH95a] Chang, H., T. Hou, A. Hsu, and S. K. Chang, "The Management and Applications of Tele-Action Objects," ACM Journal of Multimedia Systems, vol. 3, no. 5-6, pp. 204-216, Springer Verlag, 1995.
[CHANGH95b] Chang, H., T. Hou, A. Hsu, and S. K. Chang, "Tele-Action Objects for an Active Multimedia System," Proceedings of Second Int'l IEEE Conf. on Multimedia Computing and Systems, pp. 106-113, Washington, D.C., May 15-18, 1995.
[CHANG79] Chang, S. K. and J. S. Ke, "Translation of Fuzzy Queries for Relational Database System," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 3, pp. 281-294, July 1979.
[CHANG90] Chang, S. K., "Visual Reasoning for Information Retrieval from Very Large Databases," Journal of Visual Languages and Computing, vol. 1, no. 1, pp. 41-58, 1990.
[CHANG94] Chang, S. K., M. F. Costabile, and S. Levialdi, "Reality Bites: Progressive Querying and Result Visualization in Logical and VR Spaces," Proc. of IEEE Symposium on Visual Languages, pp. 100-109, St. Louis, October 1994.
[CHANG95a] Chang, S. K., "Toward a Theory of Active Index," Journal of Visual Languages and Computing, vol. 5, pp. 101-118, 1995.
[CHANG95b] Chang, S. K., G. Costagliola, G. Pacini, M. Tucci, G. Tortora, B. Yu, and J. S. Yu, "Visual Language System for User Interfaces," IEEE Software, pp. 33-44, March 1995.
[CHANG96a] Chang, S. K., "Active Index for Content-Based Medical Image Retrieval," Journal of Computerized Medical Imaging and Graphics, Special Issue on Medical Image Databases (S. Wong and H. K. Huang, Eds.), pp. 219-229, Elsevier Science Ltd., 1996.
[CHANG96b] Chang, S. K. and E. Jungert, Symbolic Projection for Image Information Retrieval and Spatial Reasoning, Academic Press, London, 1996.
[CHANG96c] Chang, S. K., "Extending Visual Languages for Multimedia," IEEE Multimedia Magazine, vol. 3, no. 3, pp. 18-26, Fall 1996.
[CHANG98] Chang, S. K., D. Graupe, K. Hasegawa, and H. Kordylewski, "An Active Multimedia Information System for Information Retrieval, Discovery and Fusion," International Journal of Software Engineering and Knowledge Engineering, vol. 8, no. 1, World Scientific Pub. Co., March 1998.
[CHAWATHE94] Chawathe, S. et al., "The TSIMMIS Project: Integration of Heterogeneous Information Sources," Proc. of IPSJ Conference, pp. 7-18, 1994.
[CHEN96] Chen, P. W., G. Barry, and S. K. Chang, "A Smart WWW Page Model and its Application to On-Line Information Retrieval in Hyperspace," Proc. of Pacific Workshop on Distributed Multimedia Systems DMS'96, pp. 220-227, Hong Kong, June 27-28, 1996.
[D'ATRI89a] D'Atri, A., P. Di Felice, and M. Moscarini, "Dynamic Query Interpretation in Relational Databases," Information Sciences, vol. 14, no. 3, 1989.
[D'ATRI89b] D'Atri, A. and L. Tarantino, "From Browsing to Querying," Data Engineering, vol. 12, no. 2, pp. 46-53, June 1989.
[DUTTA89] Dutta, S., "Qualitative Spatial Reasoning: A Semi-Quantitative Approach Using Fuzzy Logic," Conference Proceedings on Very Large Spatial Databases, pp. 345-364, Santa Barbara, July 17-19, 1989.
[ECO75] Eco, U., A Theory of Semiotics, Indiana University Press, 1975.
[ETZIONI94] Etzioni, O. and D. Weld, "A Softbot-Based Interface to the Internet," CACM, vol. 37, no. 7, 1994.
[FALOUTSOUS93] Faloutsos, C. et al., "Efficient and Effective Querying by Image Content," IBM Research Division, Almaden Research Center, Technical Report RJ9543 (83074), August 1993.
[FALOUTSOUS94] Faloutsos, C., R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic, and W. Equitz, "Efficient and Effective Querying by Image Content," Journal of Intelligent Information Systems, vol. 3, pp. 231-262, 1994.
[FOX91] Fox, E. A., "Advances in Interactive Digital Multimedia Systems," IEEE Computer, vol. 24, no. 10, pp. 9-21, October 1991.
[GOULD84] Gould, L. and W. Finzer, "Programming by Rehearsal," Byte, pp. 187-210, June 1984.
[GRAUPE96] Graupe, D. and H. Kordylewski, "A Large-Memory Storage and Retrieval Neural Network for Browsing and Medical Diagnosis," Proc. ANNIE Conf., St. Louis, Missouri, 1996.
[HALBERT84] Halbert, D. C., "Programming by Example," Xerox Office Systems Division, TR OSD-T8402, December 1984.
[HANNE92] Hanne, K. and H. Bullinger, "Multimodal Communication: Integrating Text and Gestures," in Multimedia Interface Design, pp. 127-138, Addison-Wesley, 1992.
[HILL92] Hill, W., D. Wroblewski, T. McCandless, and R. Cohen, "Architectural Qualities and Principles for Multimodal and Multimedia Interfaces," in Multimedia Interface Design, pp. 311-318, Addison-Wesley, 1992.
[HUANG90] Huang, K. T., "Visual Interface Design Systems," in Principles of Visual Programming Systems, Prentice-Hall, 1990.
[IGNATIUS94] Ignatius, E., H. Senay, and J. Favre, "An Intelligent System for Task-Specific Visualization Assistance," Journal of Visual Languages and Computing, vol. 5, no. 4, pp. 321-338, December 1994.
[JUNGERT97] Jungert, E. and S. K. Chang, "Human- and System-Directed Fusion of Multimedia and Multimodal Information Using the Sigma Tree Data Model," Proc. of VISual'97: Second International Conference on Visual Information Systems, San Diego, California, December 15-17, 1997.
[KAWATA96] Kawata, Y., A. Kawasaki, W. Udomkitwanit, A. Yabu, H. Kobayashi, P. Wijayarathna, and M. Maekawa, "EVE: A Visual Specification Environment with Support for Formal Descriptions of Physical Properties," Proc. of First Int'l Conf. on Visual Information Systems, pp. 518-529, Melbourne, Australia, February 5-6, 1996.
[KIPER97] Kiper, J. D., E. Howard, and C. Ames, "Criteria for Evaluation of Visual Programming Languages," Journal of Visual Languages and Computing, vol. 8, no. 2, pp. 175-192, 1997.
[KOVACEVIC97] Kovacevic, S., "A Compositional Model of Human-Computer Dialogs," in Multimedia Interface Design, pp. 373-404, Addison-Wesley, 1992.
[LANG92] Lang, L., "GIS Comes to Life," Computer Graphics World, pp. 27-36, October 1992.
[LAURINI90] Laurini, R. and F. Milleret-Raffort, "Principles of Geomatic Hypermaps," Proceedings of the 4th International Symposium on Spatial Data Handling, pp. 642-651, Zurich, Switzerland, June 23-27, 1990.
[LEVY96] Levy, A. Y., A. Rajaraman, and J. J. Ordille, "Query-Answering Algorithms for Information Agents," Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996.
[LITTLE95] Little, T. D. C. and D. Venkatesh, "The Use of Multimedia Technology in Distance Learning," Proceedings of IEEE Int'l Conference on Multimedia Networking, pp. 3-17, Aizu, Japan, September 1995.
[LOHSE94] Lohse, G. L., K. A. Biolsi, N. Walker, and H. H. Rueter, "A Classification of Visual Representations," Communications of the ACM, vol. 37, no. 12, pp. 36-49, 1994.
[MADHYASTHA95] Madhyastha, T. M. and D. A. Reed, "Data Sonification: Do You See What I Hear?," IEEE Software, vol. 12, no. 2, pp. 45-56, March 1995.
[MOTRO86] Motro, A., "BAROQUE: An Exploratory Interface to Relational Databases," ACM Trans. on Office Information Systems, vol. 4, no. 2, pp. 164-181, April 1986.
[MOTRO88] Motro, A., "VAGUE: A User Interface to Relational Databases that Permits Vague Queries," ACM Trans. on Office Information Systems, vol. 6, no. 3, pp. 187-214, July 1988.
[MYERS86] Myers, B. A., "Visual Programming, Programming by Example, and Program Visualization: A Taxonomy," Proceedings of SIGCHI'86, pp. 59-66, Boston, MA, April 13-17, 1986.
[MYERS88] Myers, B. A., Creating User Interfaces by Demonstration, Academic Press, Boston, 1988.
[NIBLACK93] Niblack, W. and M. Flickner, "Find Me the Pictures that Look Like This: IBM's Image Query Project," Advanced Imaging, April 1993.
[PAPADIAS95] Papadias, D. and T. Sellis, "Pictorial Query-By-Example," Journal of Visual Languages and Computing, vol. 6, no. 1, pp. 53-72, 1995.
[RAU96] Rau, H. and S. Skiena, "Dialing for Documents: An Experiment in Information Theory," Journal of Visual Languages and Computing, vol. 7, no. 1, March 1996.
[REIS92] Reis, H., D. Brenner, and J. Robinson, "Multimedia Communications in Health Care," New York Academy of Sciences Conference on Extended Clinical Consulting by Hospital Computer Networks, March 1992.
[RICH94] Rich, C., R. C. Waters, C. Strohecker, Y. Schabes, W. T. Freeman, M. C. Torrance, A. R. Golding, and M. Roth, "Demonstration of an Interactive Multimedia Environment," IEEE Computer, vol. 27, no. 12, pp. 15-22, 1994.
[ROBERTSON93a] Robertson, G. G., S. K. Card, and J. D. Mackinlay, "Information Visualization Using 3D Interactive Animation," Communications of the ACM, vol. 36, no. 4, pp. 57-71, 1993.
[ROBERTSON93b] Robertson, G. G., S. K. Card, and J. D. Mackinlay, "Nonimmersive Virtual Reality," IEEE Computer, vol. 26, no. 2, pp. 81-83, 1993.
[SANTINI96] Santini, S. and R. Jain, "The Graphical Specification of Similarity Queries," Journal of Visual Languages and Computing, vol. 7, no. 4, pp. 403-421, December 1996.
[SHNEIDERMAN92] Shneiderman, B., Designing the User Interface, Addison-Wesley Publishing Company, 1992.
[SMITH77] Smith, D. C., Pygmalion: A Computer Program to Model and Stimulate Creative Thought, Birkhäuser, Stuttgart, 1977.
[STOAKLEY95] Stoakley, R., M. J. Conway, and R. Pausch, "Virtual Reality on a WIM: Interactive Worlds in Miniature," Proc. of CHI-95, pp. 265-272, Denver, Colorado, May 7-11, 1995.
[TABUCHI91] Tabuchi, "Hyperbook," Proc. of International Conference on Multimedia Information Systems, Singapore, 1991.
[WALD84] Wald, J. A. and P. G. Sorenson, "Resolving the Query Inference Problem Using Steiner Trees," ACM Trans. on Database Systems, vol. 9, no. 3, pp. 348-368, 1984.
[WILLIAMS84] Williams, M. D., "What Makes RABBIT Run?," International Journal of Man-Machine Studies, vol. 21, no. 4, pp. 333-352, October 1984.
[ZLOOF77] Zloof, M. M., "Query by Example," IBM Systems Journal, vol. 16, no. 4, pp. 324-343, 1977.