Adlet: Active Document for Adaptive Information Integration

Shi-Kuo Chang and Taieb Znati
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
{chang, znati}@cs.pitt.edu



ABSTRACT: We describe a new approach to resource discovery based on the concept of active documents advertising on the Internet, whereby a document dynamically builds metadata, joins a group of documents of the same interest and advertises the collected metadata to the members of the group. This abstraction of metadata is called an adlet, which is the core of our approach. This approach supports the visual specification and visualization of adlets, the creation and composition of adlets, the negotiation protocols and communications with the active network. Two important features make this approach applicable to applications in information fusion, information retrieval, data mining, geographic information systems, medical information systems and plant genome information gathering: a) any document, including web page, database record, video file, audio file, image and even paper documents, can be enhanced by an adlet and become an active document; b) any node in a nonactive network can be enhanced by adlet-savvy software and the adlet-enhanced node can co-exist with other non-enhanced nodes. An experimental prototype provides a testbed for feasibility studies in a hybrid active network environment.

Keywords: Active document, information fusion, active index system

A Presentation on Adlet System

An Experimental Plant Genome Adlet System



1. Introduction and Motivation

The Internet has become an indispensable means of communication and collaboration among people at different levels and in various capacities. It provides access to a bewilderingly large number and variety of resources, including text, audio and video files, scientific data, retail products, network services, and transcripts of conversations. Because of the scale and decentralized nature of this environment, the Internet has evolved into a chaotic repository of all types of information, making it difficult to locate resources of interest. If it is to continue to grow and thrive as new means of communication, new tools are needed to organize and support the Internet's resource discovery in a fashion that keeps pace with its exponential growth in size and diversity.

Over the past several years, a number of Internet information discovery tools and search engines have been introduced, including Lycos, Alta Vista, Infoseek, Yahoo and Harvest [AltaVista, Bowman94, Hardy95]. These automated tools bear most of the responsibility for organizing information on the Internet by automatically classifying and indexing collections of digital data. They periodically dispatch indexing engines, frequently referred to as "Web Crawlers", to download and examine documents in order to extract information that describes these documents [Cheong96]. The extracted information is stored in the search engine's database, along with the uniform resource locator of the site where the document resides. The stored information is later used to produce a list of resources in response to requests submitted by Internet users to query the search engine's database [Demsey96, Hardy95, Lagoze96].

These tools have become quite popular and are helping to redefine how people think about wide area network applications. In theory they can address the fundamental problem of resource indexing and discovery on the Internet. Consequently they have the potential to address the inability of human users to cope with the Internet's rapid growth and enormous data volume. Yet it has become clear that they are not well suited to supporting the ever evolving information infrastructure, characterized by enormous data volume, rapid growth in the user base, and burgeoning data diversity. This is mostly due to the fact that these tools are not capable of identifying basic characteristics of a document, such as its theme or genre, let alone its underlying meaning or context. They can only provide uniform and equal access to all resources in the Internet, for they cannot reliably extract routine information that a human indexer can find through a simple cursory inspection. As a result, queries issued by Internet users often produce an overwhelming number of responses, which frequently contain references to irrelevant material while leaving out more relevant ones.

This paper addresses some of the problems involved with resource discovery in the Internet. Our approach is based on the concept of active documents advertising, whereby a document dynamically builds metadata, joins a group of documents of the same interest and advertises the collected metadata to the members of the group. This abstraction of metadata is called an adlet, which is the core of our approach.

The characterization of metadata may range from a title or type, to key features of the document or of the author. The metadata is derived from a) the user specified detailed annotations of the document, b) the advertisement strategy and c) the recruiting strategy. The annotations provide in-depth characterization of a document and is used as basis for indexing, document retrieval and information fusion. The advertisement strategy defines the policies used by the active document in advertising itself to the rest of the Internet. The recruiting strategy determines the type of documents allowed to become associated and/or fused with this active document in its evolution toward an enhanced and richer document.

The paper is organized as follows. Section 2 presents the system architecture. The basic definitions for adlet are introduced in Section 3. The visual specification of adlets by annotation, the visualization of adlets and real/virtual sites, and the merging of adlets are discussed in Section 4. Section 5 describes our approach for adlets group management and negotiations. The construction and update of virtual graph, which together with the adlets form the knowledge base, is described in Section 6. An experimental prototype based upon the active index system is being built, as described in Section 7. In Section 8 we compare our approach with ongoing research. Concluding remarks about further research are given in Section 9.



2. System Architecture

The system architecture supporting active document advertising should address the issue of how to organize the resource space flexibly and dynamically. Rather than imposing a hierarchical structure, our approach allows the active document structure to evolve in accordance with usage patterns, based on the advertising and recruiting strategies specified by the user. More specifically the active documents dynamically organize and search the resource space by constructing links among themselves based on the metadata that describe their contents, types, context, the advertising and recruiting strategies specified by the user, and the semantics of the type of active document being sought. The links form a virtual graph, with a flexible set of hierarchies embedded within the graph to provide efficient searching. The virtual graph's edges are associated with weights that reflect the likelihood that the adlet attached at the other end is interested in recruiting this document or is itself representing a document that can be of interest to the current active document. The virtual graph's structure evolves over time through the use of virtual graph updating algorithms in accordance with user specified metadata. This approach provides the basis for the development of a scalable, customizable architecture for gathering, indexing, caching, replicating, accessing and fusing Internet information.

Adlets use the virtual graph to locate an adequate number of other adlets and their corresponding documents situated in their sites. Active index cells [Chang95] are associated with adlets representing abstracted active documents, and through these index cells advertisements can be sent and received in accordance with negotiation protocols.

The active index system [Chang96b] collects indexing information received from other adlets, suppresses duplicate information and filters out undesired information using the user-specific profile. The active index system then summarizes adlet information in various type-specific ways to generate structured indexing information that can be used by the Adlet Visualizer and search entities [Abit97]. The active index is designed to support multiple "views" in a way such that the gathered information about adlets can be easily extracted and correlated and integrated with local adlet information and information received from remote sites [Kim95, Sheth90]. The fusion of information will reflect the profile of the associated user and will facilitate the visualization of different views of the related active documents.



Figure 1. System architecture to support active document advertising.


Our approach can be summarized in plain language as follows: Adlet tells you what to look for and what to avoid, virtual graph tells you where to go, and negotiation protocol tells you when to go and how to go. Figure 1 shows the system architecture. The Adlet Visualizer supports visual specification and visualization of adlets. The Adlet Manager supports creation and composition of adlets. The Adlet Negotiator handles negotiation protocols and communications with the network. The Active Index System maintains the virtual graph and handles all messages to/from adlets and active documents.



3. Adlets

In this paper we concern ourselves mainly with the advertising, negotiation and exchange of active documents. The documents are active in the sense that each document can perform actions with autonomy. An adlet is the means by which an active document makes itself known to other active documents. An active document itself may post an adlet, or a user may post an adlet for an active document. An adlet has a target, i.e., it is meant only for active documents capable of negotiating and exchanging certain classes of documents. Usually the target can be defined based upon the notion of document hierarchy, i.e., documents satisfying certain conceptual relations are regarded as the target. Finally documents to be avoided constitute the non-target for this adlet.

The semantic aspects of the advertisement associated with an adlet must be defined in a controlled and uniform manner. The user/application must be able to request a style of advertisement, i.e., a certain advertising strategy.

Upon its creation, an adlet attempts to join a group of adlets, possibly creating a new one if such a group does not yet exist. Other adlets may join the group, leave the group, advertise to the group and receive advertisement from other members of the group.

Definition 1: An adlet is defined as adlet = (doc, profile, target, non-target, ad-strategy, prop) where doc uniquely identifies the document to be advertised, profile is a set of conceptual relations from a concept space characterizing this document, target is a set of conceptual relations characterizing documents to be recruited, non-target is a set of conceptual relations characterizing documents to be avoided, ad-strategy is the advertising strategy, and prop are other properties (see Section 4.1).

The profile is specified, in the simplest case, by keywords associated with the document. In general the profile is a set of conceptual relations, or a conceptual graph, derived from user's annotations. (We regard a concept as a special case of a conceptual relation with a null second part.) This information is used by the adlet to develop the virtual graph and maintain it as it evolves dynamically.

The target specifies the scope and the focus of the advertisement. The scope of an advertisement may be global, restricted or local. Within a target the focus of the advertisement may be all adlets or a selected subset of adlets. The non-target specifies what is out of the scope.

The ad-strategy determines the advertising strategy used for document advertisement. The strategy could be aggressive or reactive. An aggressive strategy initiates advertisement immediately after its creation and continues advertising in a periodic fashion. A reactive strategy on the other hand engages in advertisement only in response to advertisement received from other active documents.

Example 1: A document can be a WWW page and therefore doc = URL. The document is characterized by a set of keywords and therefore profile = {keyword1, ...., keywordn}. This WWW page is in the class of browsable documents and the intended target is also a selected collection of browsable documents, i.e., target = {'Browser'}, where 'Browser' characterizes a document class in the class hierarchy. The non-target is unspecified and therefore the default is the empty set. The advertising strategy is reactive.

Example 2: A document can be a database record. The record_id is its doc. The document is characterized by a conceptual schema describing the conceptual structure in which this record instance is situated, and therefore its profile = { schema_name }. The target is { 'financial record' }, denoting the collection of financial records of an enterprise. The non-target is { 'scientific' }. The advertising strategy is aggressive.

Example 3: An active document may post an adlet by itself. In this case, ad = (doc, profile, {'all'}, {'none'}, ad-strategy, prop). In other words, this adlet is posted by one active document for 'all', with 'none' to be avoided. Such an adlet needs to be refined so that it can aim at more specific targets.

Definition 2: Two adlets ad1 and ad2 can be partially ordered as follows, ad1 < ad2 if and only if the following holds: a) doc1 = doc2, b) profile1 = profile2, c) target1 is contained in target2, d) non-target1 contains non-target2, e) ad-strategy1 = ad-strategy2, and f) prop1 = prop2.

In other words ad1 is intended for a smaller target group, so ad1 is more refined than ad2.

Definition 3: The negotiation protocol is a sequence e-proc= ((d11,d12), (d21,d22), .., (dn1, dn2)), where each (di1,di2) is an exchanged pair of documents and the final pair (dn1,dn2) ia s pair of goal documents.

Example 4: Let the first entry be the consumer's and the second the producer's, then an information exchange negotiation protocol typically may look like: (('Mars',-), (-,'Heaven'), ('Sojourner', -), (-, doc-13), ('Thanks', -)).

Therefore, the adlets should induce a sequence of information exchanges e-seq between the consumer and the producer, such that e-seq contains e-proc as a sub-sequence. If this is feasible, then we say the negotiation protocol is supportable.

Definition 4: A negotiation protocol is supportable if the adlets can induce a sequence of information exchanges e-seq such that e-seq contains e-proc as a subsequence and the final pair (dn1,dn2) of e-seq is a pair of goal documents.

Additional constraints may be imposed on the supportable negotiation protocol, such as the progressive refinement of the adlets, leading to a more precise search criterion and consequently a smaller set of matching documents.



4. A Scenario

Adlets can be generated by active documents to interact with other adlets or active documents to accomplish the objectives of an advertisement plan. In this section we describe a scenario of active documents advertising using adlets.

4.1. Visual Specification of Adlets by Annotation

When a document type is defined, the user can visually specify the types and properties of its adlets. The design of this visual specification language is based upon the theory of visual languages. The mode of interaction is direct manipulation. The user annotates the objects on the screen by pointing, clicking and entering keywords.

The user first defines a concept space where the concepts are color-coded. Each adlet type has a dominant color which is the concept that characterizes it most closely, and other less dominant colors corresponding to the concepts that characterize it with varying degree of closeness. The darkness or lightness of a color indicates the closeness of this characterization. Each adlet type also has a dominant color which is the concept that attracts it most, and other less dominant colors for less attractive concepts. Finally, concepts that repulse the adlet are also defined. These color codes are shown in different areas on the adlet or dynamically shown in an alternating fashion.

Figure 2 illustrates the visual specification of an adlet which carries the following information: "I represent doc, I am profile, I love target, I hate non-target, I stay put (reactive) or travel (aggressive) and I behave like prop".



Figure 2. Visual specification of an adlet.


The attraction/repulsion property of an adlet is important in determining the motion behavior of an adlet. An attraction/repulsion force of an adlet to a color-coded concept can be one of the following, as specified by the user (perhaps by clicking on a sliding scale):

A1. Extreme attraction: the adlet must merge with an adlet/site of said color.
A2. Strong attraction: the adlet attempts to merge with an adlet/site of said color until precondition becomes 'false'.
A3. Weak attraction: the adlet makes a best effort attempt to merge with an adlet/site of said color.
A4. Neutral: no attraction/repulsion.
A5. Weak repulsion: the adlet makes a best effort attempt to avoid an adlet/site of said color.
A6. Strong repulsion: the adlet attempts to avoid an adlet/site of said color until precondition becomes 'false'.
A7. Extreme repulsion: the adlet avoids the adlet/site with said color, even if this means its own destruction.

The ad-strategy of an adlet type can be specified by clicking on either "reactive" or "aggressive". Other properties of an adlet type to be specified by the user include the following:

B. Destructability: Adlet can be destructive or nondestructive. Destructive adlet can be destroyed when a condition becomes 'true'. Nondestructive adlet cannot be destroyed under ordinary circumstances.

C. Regenerability: Adlet can be regenerating or nonregenerating. Regenerating adlet is able to regenerate instances of its identical type. Nonregenerating adlet cannot regenerate.

D. Migration: Adlet can be migrating or nonmigrating. A migrating adlet will seek out site that attracts it and migrate to that site. Nonmigrating adlet will not migrate by itself. (An aggressive/reactive ad-strategy leads to migrating/nonmigrating adlets.)

E. Temporal Sensitivity: Adlet can be temporally sensitive or nontemporally sensitive, depending on whether its predicates and actions involve time as parameters.

F. Location Sensitivity: Adlet can be location sensitive or nonlocation sensitive, depending on whether its predicates and actions involve location as parameters.

G. Context Sensitivity: Adlet can be context sensitive or noncontext sensitive, depending on whether its predicates and actions involve contextual variables as parameters.

After an initial session to perform detailed annotation, the user creates an adlet type which can be reused in the future when similar document type is encountered.



4.2. Visualization of Adlets and Sites

Adlets move from sites to sites to seek out new adlets or active documents, announce its own presence and collect information. In its movement adlets behave like magnets. An adlet or a collection of adlets can attract other adlets and pull them together. Conversely an adlet or a collection of adlets can repulse other adlets and push them apart. A collection of adlets is called a virtual site or simply a site. A virtual graph is the set of links and conceptual relations that characterizes a (virtual) site and its relations with other sites.

As described in the previous section, adlets and sites both have attraction/repulsion properties. Adlets are also directional. The direction of an adlet is a derived property and generally indicates the direction of greatest attraction/repulsion exerted by another adlet or another site. Site, on the other hand, does not have a directional property.



Figure 3. Visualization of adlets and sites.


Figure 3 illustrates the visualization of adlets and sites, where an adlet is indicated by a wedge-shaped object and the direction of an adlet is indicated by the direction of the wedge's longer axis. The sites are indicated by round-shaped objects. For real sites, the sites' locations are their geographical locations. For virtual sites, the sites' locations are not geographical locations, but their logical locations in some virtual space. The distance between two virtual sites is proportional to the force of mutual attraction. One virtual site may span several real sites, and one real site may contain multiple virtual sites.

As shown in Figure 3, an adlet A1 will not visit all the sites on the Web, which will be far too numerous. Rather, only sites mentioned in the virtual graph of the site containing the active document that generates adlet A1 will be visited by this adlet. Since this virtual graph is dynamically updated when adlets are merged, an adlet's effective sphere of motion will change over time.



4.3. Merging of Adlets

When adlets collide, i.e., when their mutual attraction force exceeds a certain threshold and they are in the same site, the composition rules dictate whether two adlets should be merged to form new adlets, with the possible side effect of the updating of the virtual graph. Therefore an active document can be linked to another active document through the interaction among adlets, leading to the discovery of new relationships and information fusion. The four most important adlet composition operators are:

(1) Concatenation: Adlet 1 and adlet 2 are combined into a train of adlets, but this train is regarded as a single adlet. The visual operator is a 'train'. (adlet1 adlet2 => adlet3)

(2) Equi-Join: Adlet 1 and adlet 2 are joined into a single new adlet, with properties inherited from both parents. The visual operator is a 'heart'. (adlet1 adlet2 => adlet3)

(3) BW-Join: Adlet 1 destroys adlet 2 but takes over some of the latter's properties. The visual operator is a 'spider' -- the Black Widow. (adlet1 adlet2 => adlet1)

(4) Split: Adlet 1 is split into adlets 2 and 3. The unary visual operator is represented by a 'pair of scissors'. ( adlet1 => adlet2 + adlet3)

These operators can be combined to specify more complex operators. For example two adlets can first be merged by BW-join and the new adlet is then split into two. Thus we can specify adlets that are: consumed, forever circulating, procreative, and so on.

The application of any of the above operators may also lead to the update of the virtual graph, because when the adlets are merged, the documents they refer to need also to be related. The dynamic links are added to the active documents and the directed arcs are added to the underlying virtual graph. Thus the effect of the interaction among adlets is a continuous, dynamic update of the virtual graph. A site's virtual graph and its adlets together form its knowledge base. Therefore an important issue is when and how to update the virtual graph of a site.



5. Adlets Group Management and Negotiations

As explained above, when a user creates/updates a document the side effect could be the creation and communication of adlets. Since adlets are light-weight and inexpensive, they are exchanged in the negotiation among active documents prior to the transmission of actual documents. Therefore actual documents are not transmitted unless both participants express interest through the negotiation using adlets. When two documents negotiate on the prospect of joining a group, what they are negotiating is to find the admissible common subgraphs of the conceptual graphs (the profiles and targets) for the two documents.

Active documents can be viewed as a distributed group of entities cooperating toward establishing a relationship to advertise, discover and mutually enhance their contents. The establishment of such a relationship requires mechanisms to support updating of the virtual graph and group membership, and efficient algorithms to establish minimum cost probabilistic multicast trees.

The mechanisms to support updating of the virtual graph should allow the creation of new virtual links, while maintaining existing ones, mutual discovery of active documents, the establishments of adlet groups dynamically and the recruitment of new adlet group members. These mechanisms should be generic, extensible and ensure co-existence of different execution environments and network node architectures. Our approach is to provide a generic capability that would allow the encapsulation of the packet carrying an adlet, referred capsule, so that it integrates in its payload a program to be executed at the receiving end of a targeted active document. In addition to the executable program, the capsule may carry various options, including but not limited to authentication, confidentiality and integrity. The capsule uses its original specification parameters in terms of type, profile, advertising and recruiting strategies, to perform the required actions to determine the likelihood of a "common interest" relationship between the active document where the adlets originated and the targeted active documents. In case of a match the targeted document is invited to join the group adlets.

The generic mechanisms for adlet discovery and group management require efficient multicast tree design algorithms to support point-to-multipoint communication among active documents. This tree is derived from the virtual graph by finding the dynamic links among active documents with similar content, interest and semantics. The multicast tree is used to send capsules to recruit new active documents or advertise the profile of the originating document. From the routing point of view, an efficient multicast algorithm only replicates capsules when necessary, namely at the branching points of the tree. Furthermore the algorithm allows for dynamic membership as new members may join the multicast tree while other members may leave. These changes may impact optimality of the multicast tree supporting the traffic. Consequently, the multicast algorithm should deal with on-line changes in an efficient manner so that the actual delay observed by existing and new members during the exchange of information is bound, while the overall cost of the tree is minimized. This requires the development of efficient and simple multicast algorithms that guarantee an acceptable level of quality of service, while managing efficiently the network resources [Waxman88, Imase91].

Several attempts have been made toward the development of an efficient solutions for the multicast problem [Kompella93, Sun95]. Very little research work has addressed this problem with the context of active networks supporting adlet communication and negotiation. An intuitive solution to this problem is to rebuild the tree using a static algorithm whenever there is a change [Dalal78, Hakimi71, Winter87]. However, such a solution may have repercussions for members who remain in the group since there may be a disturbance in the communication. Furthermore, such a change may cause packets to arrive out of order. Another solution is to permit local or partial reconfiguration when modification to the membership occurs [Estrin97, Pusateri97, Zhang93, Deering91]. Yet another approach is to start with an optimal tree and make minimal changes as group membership changes without causing disruption to the members who remain in the group. This approach, however, may lead to large inefficiency [Alrabiah97]. While several issues related to static multicast algorithms have been the focus of research, supporting dynamic trees in WDM networks has not been extensively investigated [Waxman88, Imase91].





6. Virtual Graph Construction and Update

Active documents in our approach evolve under constraints so that the system can improve over time. This evolution is made possible due to the interaction among adlets, the recruiting and establishments of groups, the dynamic linking of documents and the updating of virtual graphs. Since the knowledge base consists of the adlets and the virtual graph, an important issue in making the constrained evolution possible is when and how to update the virtual graph of a site. Every time a new adlet is created, either because a new document is created or because two adlets are merged, the system needs to decide whether to update the virtual graph of the active document responsible for generating this adlet.

A subgraph of the virtual graph represents a conceptual view of the collection of information in a virtual site. Adlets may carry time-stamps to facilitate conceptual view integration through adlet merging and virtual graph updating. A virtual graph consists of embedded hierarchies (trees) of adlets to facilitate graph processing and searching. An example is illustrated in Figure 4.

Figure 4. A virtual graph with embedded hierarchies of adlets.


A virtual graph consists of nodes N = {n1, ...., nk} and arcs A = {a1, ..., an}. Each node ni = (Si, Ti) is defined by a site Si and an adlet tree Ti. An arc (ni, nj) connects one node ni and another node nj if the two sites Si and Sj are conceptually related. Initially the virtual graph contains only a single node ni for the local site Si. Suppose this site has documents doc1, .., dock, the corresponding adlet tree Ti contains adlet1, ....., adletk, organized into a tree. Notice we assume the documents in a site, and thus the corresponding adlets, can be organized into a conceptual hierarchy.

In the virtual graph, each node has its own information such as how many and what kind of adlets reside. In addition to its local adlet information, a node has information about adlets which resides in other nodes. This information has distance information and this distance information will be used before an adlet starts traveling. The adlet will decide whether it should go for advertising or not depending on its scope of advertisement.

Since a node has distance information on all the adlets in a VG, as the adlet obtains richer information for itself by traveling through the network, this distance information can be used for retrieval of related documents. For example, suppose one adlet has information about its related documents. When a user poses a query for a document which the adlet represents, conditions with it would be that how (far) related documents should be retrieved. Since the adlet has information on related documents, the search engine can use this information to decide whether the related documents should be retrieved according to the distance information and depending on the given condition.

This distance information for the adlets that reside in other nodes can be shared among local adlets. Sharing the distance information may lead to low memory usage and effective computation for retrieval of related documents, because the search engine would be able to narrow down related documents that satisfy users' conditions before starting actual retrieving.
 

As explained above, hierarchies of adlets in a node can be dynamically computed as a new adlet is added to the node or an existing adlet is deleted from the node. Also, distances for conceptually related nodes can be expressed in some way such as using typical documents distances (center document distances in document clusters) between two nodes or average-like distances of related documents between two nodes. Therefore, we can focus on exchanging nodes' information in a VG.

6.1. VG Update Algorithm

In the following VG update algorithm, when a change, such as addition of an adlet and deletion of an adlet, occurs in a node, that node notifies the other nodes that a change has occurred.

The arcs connecting nodes represent the conceptual relations between nodes. Each node knows only directly connected nodes, so the node in which a change has occurred sends its updating information only to the connected nodes.

The node that received the updating information will check the received information referring to its current node-information-table, then update its node-information-table. If this node has a connected node, this node will send the updated information to its connected node. In this manner, the updated information will propagate through the VG.
 

Figure 5. An example of virtual graph updating.

6.2 Node Updating Algorithm

As illustrated by the example in Figure 5, the VG consists of nodes N = {n1, n2, n3, n4, n5}. For each connection of nodes, the distance is labeled on it. In the figure, the distance (arc) between nodes n1 and n2 can be expressed as (n1, n2) = 5.

Initially, each node makes a table, which looks like a routing table. The initial table for node n2 in this figure would be:

Table for n2

nodename n2
connectednode n1 5
adletname adlet6 0 null doc6 profile6 t6 nont6 aggressive
adletname adlet7 0 null doc7 profile7 t7 nont7 aggressive
adletname adlet8 0 null doc8 profile8 t8 nont8 aggressive
adletname adlet9 0 null doc9 profile9 t9 nont9 aggressive
adletname adlet10 0 null doc10 profile10 t9 nont9 progressive
The first row of the initial table for node n2:
            nodename n2

This line tells us the name of this node, which is "n2."

The second row:
            connectednode n1 5
This row tells us where this node, n2, is connected to and the distances to the node connected. From the row, n2 is connected to nodes n1 and the distances is 5.

The fourth through eighth rows from the top of this table, these rows tells us about the adlets that reside in this node.
Each value in the columns shows the adlet's name and its attributes.

Each row follows the format below:
    adletname name distance route doc_id profile
target non-target ad_strategy

The attributes for an adlet such as distance, profile, target, and non-target can be a pointer to another data structure or database systems.

Since node n2 is connected to node n1, when this node is added to the VG, node n2 sends its node's table to node n1. Node n1 is an existing node and it already has information about all the nodes except node n2's in the VG. After receiving table information from node n2, node n1 adds new connected node name "n2" in its table. Then, after adding new adlets' information to its table, node n1 sends back its information to node n2 since node n2 is a new node and doesn't have any information about other adlets in the VG. Then, node n2 regenerate its table according to the received information from node n1.

The node updating algorithm is summarized below.

Step 1: If a node (sender) has a change in its node-information-table, send new node-information-table to connected nodes (receivers).

Step 2: The receiver updates its table comparing with the table received from the sender.

Step 3: If the receiver has connected nodes, send the new table to them (the receiver becomes the sender), and go back to Step 1. If the receiver doesn't have connected nodes besides the sender, finish updating.

These steps are iterated until all the nodes have been updated. To implement the update algorithm described above, an update function vg_update( ) is called recursively. To avoid sending information back and forth between two nodes infinitely, a timestamp is used. Therefore, the sender will not send its update information to the node that has the same timestamp.

6.3. Migration of Adlets

When a new document is created, if the corresponding adlet is reactive, the adlet stays in the local site and nothing more happens. If this adlet is aggressive, this adlet will travel to other sites in a dynamically computed multicast set according to the following policy: If (ni, nj) is an arc in the virtual graph, we compute the conceptual distance between Si and Sj, and if the distance is less than a threshold then Sj is added to the multicast set according to increasing order of this conceptual distance. In other words Sj with the smallest conceptual distance is added to the multicast set first, followed by the one with the next smallest conceptual distance and so on, until the preset maximal size is reached. The adlet can then travel to the nodes in the multicast set according to the negotiation protocol.



7. An Experimental Prototype

In our implementation of the experimental prototype we employ the active index system that forms the basic layer to implement adlets and active documents. Our approach is that an adlet is instantiated through the instantiation of active index cells. Using this approach the Active Index System will manage all adlets (index cells) and handles adlet migration through distribution of the index cells.

We are currently building an experimental prototype based upon the active index system. CORBA is used as the middleware for index cell management. In what follows we give an example of adlets in action. The behavior of the active index system is described in parentheses.

1. Initial State: As shown in Figure 4, the virtual graph VG has three nodes n1, n2 and n3. Node n1 is for site S1 with adlets adlet1, adlet2 and adlet3. Node n2 is for site S2 with adlets adlet4, adlet5, adlet6, adlet7 and adlet8. Node n3 is for site S3 with adlets adlet9, adlet10, adlet11, adlet12 and adlet13. There are arcs a12 and a23, indicating S1 and S2 are conceptually related, and S2 and S3 are conceptually related.

2. Adlet Creation: Suppose document3 has just been added to site S1, and the corresponding adlet3 is added to the virtual graph. (In active index system, this means an index cell ic3 is added to site S1.)

3. Adlet Migration: If adlet3 is aggressive, it will travel to conceptually related nodes. Since there is an arc a12, adlet3 can travel to site S2. (In active index system, this means index cell ic3 will send a message to IC_Manager of site S2 to clone itself! But the original ic3 will stay at site S1. In other words, the original adlet3 will never leave its home site. It is its clone that will materialize in other sites!)

4. Determination of Matched Adlets: Adlet3, once at site S2, will try to merge with adlets there. This is done by finding whether its target matches another adlet's profile, and its non-target does not match another adlet's profile. Let us say adlets adlet5 and adlet7 are thus identified. (In active index system, the cloned ic3 will broadcast messages to all ic's at site S2 and ic5 and adlet7 will respond.)

5. Adlets Merging by BW-Join: Adlet3 and adlet5 are merged. The adlet composition operator is the BW operator, symbolized by the Spider. Adlet3 absorbs adlet5. This means document3 should absorb some of the information contained in document5. What happens is that updated adlet3 will travel back to site S1 to update original adlet3. (In active index system, the message sent by ic5 to ic3, will cause ic3 to send a message back to the original ic3 at site S1, causing it to be updated.)

6. Updating of Virtual Graph: Using the algorithm described in the previous section, the virtual graph is updated. (In active index system, updating of virtual graphs may be an action performed by a special ic, or even the IC_Manager itself.)

7. Adlets Merging by Equi-Join: Adlet3 and adlet7 are merged. The adlet composition operator is the Equi-Join operator, symbolized by the Heart. Adlet3 and adlet7 are merged. Again, updated adlet3 will travel back to site S1 to update original adlet3, and the virtual graphs at various sites need to be updated. It is possible, that document3 and document7 now form a group with special links between them.

8. Retrieval of Documents: When a user wants to retrieve document3, the related documents such as document7 will be retrieved. (The retrieval_ic is activated, which activates ic3 and in turn ic7, to retrieve the actual documents document3 and document7.)



The experimental adlet prototype will be applied to the NSF Plant Genome Research Program which has an objective to develop shared resources and research tools that will enable plant genome research to advance efficiently, rapidly and in a cost-effective manner. This plant genome adlet system will be accessible through the Internet using a browser. The users - scientists, educators and students - can place annotated active tags (i.e. adlets) on important genome data items. These adlets are active and can travel from node to node, collecting additional information. The adlets can be organized hierarchically so that only adlets in a certain class are active during a user's search. For example an expert may want to retrieve information annotated by fellow scientists only, while a student may want to retrieve information annotated by teacher, fellow students and expert scientists. The system will be tested by researchers and graduate students of the University of Pittsburgh.



8. Comparison with Other Ongoing Research

One of the recent development to overcome the limitations of Internet browsers and improve Web-browser technology aimed at adding capabilities to enable communication between HTTP servers and clients and provide interactive construction and delivery of images in response to user input. The basic idea is based on the observation that browsers have always been driven by user input. Two complimentary mechanisms, server push and client pull, were proposed to provide the server with the added capability to push new data down to the browser and the client to either reload the current data or obtain new data. A related research effort aimed at developing active objects which make it possible for a browser to download a program, execute it and display the program user's interface in a web page. The paradigm of active objects was pioneered by Sun's Hotjava browser with Java Aplets. Java, however, does not have language-level support for distributed programming. The language does contain an application program interface for remote method invocation, but does not provide support for distributed network objects. To address this shortcoming the paradigm of active objects needs to be further extended in the development of distributed active objects that can communicate with other active objects located on different machines across the Internet.

Automated programs such as Web crawlers or indexing robots generally do not have the capability to identify the main characteristics of a document, for the Web documents lack the structure necessary to reliably extract the routine information that may result from a simple cursory inspection performed by a human user on a library index. Several approaches have attempted to address this problem and develop more efficient automated classification methods. The most common approaches seek to attach metadata to files to facilitate the indexing and classification of Internet documents based on the collected metadata [Weibel95]. The most significant of these efforts is the Dublin Core Metadata program and the affiliated endeavor, the Warwick Framework [Demsey96, Lagoze96]. Other research efforts such as WebSEEk [Beigi98] suggest possibilities for the indexing of visual information and demonstrate tools to combine key-word indexing with image analysis. However these research efforts have not adequately addressed the problem that a Web crawler may overload the server when indexing information is sought. This problem is partially addressed by Harvest [Bowman94, Bowman95], which provides an efficient architecture based on information gatherers to reduce the amount of overhead caused by information exchange between crawlers and visited sites.

A different approach to provide better support for structuring and management of Internet information aims at developing a set of tools and languages based on a data model which is used to describe the scheme of a Web hypertext in the spirit of databases [Abit97]. Using the data model, tools are developed to support database views of the Web. These views can then be analyzed and integrated using common database techniques to generate hypertextual views of the Web. A class of these research efforts view the Web as a large graph of unstructured documents and provide support for queries based on the structure of the graph. Several query systems for unstructured data have been proposed. W3QS [Kono95] uses common information retrieval techniques and the organizational properties of the hypertext to allow for both structure specifying queries and content queries, and provide management tools for Web forms. WebSQL [Mend96], WebLog [Laks96], Lorel [Abit96] and unQL [Bune96] are yet other efforts toward the formalization of languages, to support query locality, restructuring, querying heterogeneous and semi-structured information, and the provision of proper database techniques and constructs to analyze HTML documents and extract their structure.

Other papers, such as TSIMMIS [Chawa94], aim at integrating Internet information from the Web by extracting data from heterogeneous sources and correlating these data to generate an integrated database representation of the information. Information Manifold [Levy96] provides a specific support for querying on the basis of declarative descriptions of the contents of database records accessible through a fill-in form-based interface. These approaches demonstrate the feasibility of manipulating and integrating unstructured Internet information but do not address the important aspect of how these data are discovered and retrieved. These systems and the solutions they provide are very valuable in achieving the ultimate goal of Internet browsing and information retrieval. The exponentially increasing volume of the networked information, however, require "guidance" on where the limited time of an Internet user is to be spend in order to locate the most relevant documents for a given purpose. New frameworks are required to address these issues and ease the burden of feeding the crawlers that repeatedly scan the Web sites with information to index, thereby causing enormous amount of computing time to be wasted.

The adlet framework takes a different approach than most of the research efforts described above. The basic tenet of the adlet framework is that in many cases it is almost impossible for a user browsing the Internet to specify accurately the type of information sought or prescribe the best action to obtain the desired documents. This is due to the fact that (i) Web crawlers, spiders or indexing robots have difficulty identifying characteristics of a document such as its overall theme or its genre, and (ii) that Web-browsers offer little or no support to a structured view of the information on the Internet.

The adlet framework provides a new approach to resource discovery based on the concept of active document advertising. In this paradigm, a document builds metadata using a high-level specification of user preferences and joins a group of documents of the same interest, thereby enhancing its own content and advertising the metadata to other active documents for their respective enhancements. This aspect of "adaptive document reinforcement" enables the discovery and indexing of related documents and provides the basis of the fusion and better structuring of the Internet information for a faster and easier extraction, retrieval and manipulation of relevant fused information. Furthermore, the concept of advertising among adlets avoids the necessity to export the entire contents of a given site across the network to make up an index. The capability of the adlets to join other adlets provides the basis for the development of a dynamic and distributed graph of related documents that can be retrieved on demand.



9. Discussion

Two important features make the above described approach applicable to applications in multimedia information fusion, information retrieval, data mining, geographic information systems, medical information systems and disaster management: a) any document, including web page, database record, video file, audio file, image and even paper documents, can be enhanced by an adlet and become an active document; b) any node in a nonactive network can be enhanced by adlet-savvy software and the adlet-enhanced node can co-exist with other non-enhanced nodes.

Active networks allow individual user, or groups of users, to inject customized programs into the nodes of the network. Active architectures enable a massive increase in the complexity and customization of the computation that is performed within the network, e.g., that is interposed between the communicating end points. We will deal with hybrid active networks where some or even all nonlocal nodes are unable to handle adlets. The approach is to provide an adlet extractor to extract adlets from documents. Thus at an adlet-enhanced node, incoming documents are converted into active documents by the extraction of adlets. The ad-strategy for such adlets will be reactive, i.e., these adlets that are extracted from documents will remain in their sites and do not travel. This way, adlet interaction is dealt with by the local node equipped with the adlet-savvy software. But as more and more nodes are enhanced with adlet-savvy software, the network becomes more active, and the behavior of adlets more sophisticated. The experimental system will provide a testbed for feasibility studies in such an active or hybrid network environment.

In Section 4 we described an approach for the visual specification of adlet types by annotation, and the visual operators for merging adlets. The semantics of this visual language [Chang96a] needs to be carefully investigated. Concerning visualization, we need to investigate how best to determine the direction of adlet movement. Since a site is characterized by a virtual graph, the attraction/repulsion must be defined in terms of this virtual graph. A further research topics is the following. Given an advertisement plan, we need to investigate how to express the advertisement visually by a visual sentence and translate the visual sentence into an executable plan based upon adlets.

We need to design the negotiation protocols, i.e., the e-proc's, so that given an advertisement plan the goal can be reached and the exchange of adlets and other network primitives leads to the goal of completed exchanges. In adlet interaction, who visits what site? Adlets migration may happen before or during or after adlets interaction. An important topic is how to apply multicast protocols in support of adlets negotiation protocols and routing. A large number of multicast routing algorithms were designed for either low end-to-end delay or efficient management of network resources. However, very few algorithms take both objectives into considerations. We need to develop new class of heuristics that will ensure acceptable average delays, based on the conceptual distance between sites, while minimizing the resources required to meet these delays. Based on these heuristics we can investigate the mechanisms and communication primitives to allow the dynamic encapsulation of these cooperating documents into a logical group. These primitives should allow the creation of a group of active documents, the addition of new members to the group and the exclusion of members from the group. Furthermore, these primitives must allow for the specification of the required level of quality of services in terms of conceptual distances between sites, the average and peak rate bandwidth requirements, and the acceptable delay variability when documents are actually exchanged.

As mentioned before the ad-strategy can be reactive or aggressive. A third type of ad-strategy can be introduced -- a progressive ad-strategy can start as a reactive strategy and become aggressive due to adlets merging or some other conditions becoming 'true'. We need to further explore different ad-strategies and their relationships to negotiation protocols.

Another related topic is the following. Given adlets, how to compose adlets (i.e. conceptual view integration) so that the initial conditions of negotiations are met. In the simplest case we need to verify the existence of a pair of profiles (profile1, profile2) consistent with (doc1, doc2). Adlet composition operators may be used in more complicated cases for information fusion negotiation protocols. These theoretical investigations will lead to a deeper understanding of document abstraction and advertising.



Acknowledgements: This research was supported in part by the National Science Foundation under grant IRI-9224563, "An Active Image Information System for Biomedical Image Databases". The virtual graph updating algorithm was developed by Miyawaki Kosuke. The Adlet prototype system is being developed by Ping-Wen Chen, Guanlin Shen and Longjiang Yang.



References:

[Abit96] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wienner, "The Local Query Language for Semi-Structured Data", http://www-db.standford.edu, 1996.

[Abit97] S. Abiteboul, "Querying Semi-Structured Data", In Sixth International Conference on Data Base Theory", Lecture Notes in Computer Sciences, 1997.

[Alrabiah97] T. Alrabiah and T. Znati, "A Simulation Framework for the Analysis of Multicast Tree Algorithms", 30th Annual Simulation Symposium, April 1997, 196-205.

[AltaVista] AltaVista Technology Home Page, http://www.altavista.com

[Beigi98] Mandis Beigi, Ana Benitez, and S.-F. Chang, "MetaSEEk: A Content-Based Meta Search Engine for Images," SPIE Conference on Storage and Retrieval for Image and Video Database, San Jose, Feb. 1998.

[BernesLee94] T. Berners-Lee, R. Cailliau, A. Lautonen, H. F. Neileson, and A. Secret, "The World Wide Web", Communications of the ACM, Vol. 37 no. 8, August 1994, 76-82.

[Bowman94] C. Bowman, P. B. Danzig, Udi Manber, and M. F. Schwartz,"Scalable Internet Resource Discovery: Research Problems and Approaches", Communications of the ACM, Vol. 37, No. 8, August 1994, 98-107.

[Bowman95] C. Bowman, P. B. Danzig, Udi Manber, and M. F. Schwartz, "The Harvest Information Discovery and Access System", Computer Networks and ISDN, Vol. 28, Nos. 1-2, Dec 1995, 119-125.

[Bune96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu, "A Query Language and Optimization Techniques for Unstructured Data", In ACM SIGMOD, International Conference on Management of Data, SIGMOD'96, Montreal, Canada, pp. 505-516, 1996.

[Chang95] S. K. Chang, "Towards a Theory of Active Index", Journal of Visual Languages and Computing, Vol. 6, No. 1, March 1995, 101-118.

[Chang96a] S. K. Chang, "Extending Visual Languages for Multimedia", IEEE Multimedia Magazine, Fall 1996, Vol. 3, No. 3, 18-26.

[Chang96b] S. K. Chang, "Active Index for Content-Based Medical Image Retrieval", Journal of Computerized Medical Imaging and Graphics, Special Issue on Medical Image Databases (S. Wong and H. K. Huang, eds.), Elsevier Science Ltd., 1996, 219-229.

[Chawa94] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom, "The TSIMMIS Project: Integration of Heterogeneous Information Souces", In IPSJ Conference, Tokyo, 1994.

[Chang98] S. K. Chang, D. Graupe, K. Hasegawa and H. Kordylewski, "An Active Multimedia Information System for Information Retrieval, Discovery and Fusion", International Journal of Software Engineering and Knowledge Engineering, March 1998, Vol. 8, No. 1, 139-160.

[Cheong96] F. C. Cheong, "Internet Agents, Spiders, Wanderers, Brokers and Bots", New Riders Publishing, 1996.

[Dalal78] Y. K. Dalal and R. M. Metcalfe, "Reverse Path Forwarding of Broadcast Packets" Communications of the ACM, Vol. 21 No. 12, December 1978, 1040-1048.

[Deering91] S. Deering, "Multicast Routing in a Datagram Internetwork", PhD Thesis, Stanford University, 1991.

[Demsey96] L. Dempsey and S. L. Weibel, "The Warwick Metadata Workshop: A Framework For the Deployment of Resource Description", D-lib Magazine, July-August 1996.

[Hakimi71] S. L. Hakimi, "Steiner's Problem in Graphs and its Implications" Networks, Vol. 1, November 1994, 113-133.

[Hardy95] D. R. Hardy, M. F. Schwartz, D. Wessels, "Harvest User's Manual, Version 1.2", Technical Report CU-CS-743-94, November 1994.

[Imase91] M. Imase and Bernard Waxman, "Dynamic steiner tree problem", SIAM Journal of Discrete Math, Vol. 4 No. 3, August 1991, 369-384.

[Kim95] W. Kim, Editor, "Modern Database Systems: The Object Model, Interoperability, and Beyond", ACM Press and Addison Wesley, 1995.

[Kompella93] V. Kompella, J. Pasquale, and G. Polyzos, "Multicast Routing for Multimedia Communication", IEEE/ACM Transactions on Networking, Vol. 1 No. 3, June 193, 286-292.

[Kono95] D. Konopnicki and O. Shmueli, "W3QS: A query system for the World-Wide Web", In Intern. Conf. on Very Large Data Bases (VLDB'95), pp- 54-65, 1995, Zurich.

[Lagoze96] Lagoze, Carl and Lynch, Clifford and Daniel, Ron, Jr. June, 1996. "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata.", Cornell Computer Science Technical Report TR96-1593. [Laks96] L. Lakshmann, F. Sadri, and I.N. Subramanian, "A Declarative Language for Querying and Restructuring the Web", In 6th Intern. Workshop on Research Issues in Data Engineering: Interoperability of Nontraditional Database Systems (RIDE-ND'96), 1996.

[Levy96] A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous Information Sources Using Source Description", In Intern. Conf. On Very Large Data Base, VLDB'96, Mumbai, Bombay, 1996.

[Mend96] A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World Wide Web", In the First Int. Conf. on Parallel and Distributed Information Systems (PDIS'96), 1996. ftp:/db.torento.edu/pub/papers/websql.ps

[Pusateri97] T. Pusateri, "Distance Vector Multicast Routing Protocol" Internet Draft, February 1997.

[Seth90] A. P. Seth, and J. A. Larson, "Federal Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases" ACM Computing Surveys, Vol. 2, No. 3, September 1990, 183-236.

[Sun95] Q. Sun and H. Langendoerfer, "Efficient Multicast Routing for Delay Sensitive Applications", Proceedings of Second Workshop Protocols Multimedia Systems, April 1995, 452-458.

[Waxman88] B. Waxman, "Routing of Multipoint Connections", IEEE Journal on Selected Areas in Communications, Vol. 6 no. 9, December 1988, 1611-1622.

[Weibel95] Weibel, Stuart, "Metadata: The Foundations of Resource Description", D-lib Magazine, July, 1995.

[Winter87] P. Winter, "Steiner Problem in Networks: A Survey", Networks, Vol. 17, 1987, 129-167.

[Zhang93] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, "RSVP: A new Resource ReSerVation Protocol, IEEE Network, 1993, 8-18.