Adlet: Active Document for Adaptive Information Integration
Shi-Kuo Chang and Taieb Znati
Department of Computer Science
University of Pittsburgh
Pittsburgh, PA 15260
{chang, znati}@cs.pitt.edu
ABSTRACT:
We describe a new approach to resource discovery
based on the concept of active documents advertising on the Internet, whereby
a document dynamically builds metadata, joins a group of documents
of the same interest and advertises the collected metadata
to the members of the group.
This abstraction of metadata is called an adlet, which is the core of our approach.
This approach supports the visual specification and visualization of adlets,
the creation and composition of adlets,
the negotiation protocols and communications
with the active network.
Two important features make this approach applicable to
applications in information fusion, information retrieval, data mining,
geographic information systems, medical information systems and plant genome information gathering:
a) any document, including web page, database record, video file, audio file, image and
even paper documents, can be enhanced by an adlet and become an active document;
b) any node in a nonactive network can be enhanced by adlet-savvy software and
the adlet-enhanced node can co-exist with other non-enhanced nodes.
An experimental prototype provides a testbed for feasibility studies in a hybrid active network environment.
Keywords: Active document, information fusion, active index system
1. Introduction and Motivation
The Internet has become an indispensable means of
communication and collaboration among people at different
levels and in various capacities. It provides access to a
bewilderingly large number and variety of resources, including
text, audio and video files, scientific data, retail
products, network services, and transcripts of conversations.
Because of the scale and decentralized nature of this environment,
the Internet has evolved into a chaotic repository of all types of
information, making it difficult to locate resources of interest.
If it is to continue to grow and thrive as new means of
communication, new tools are needed to organize and support the Internet's
resource discovery in a fashion that keeps pace with its exponential growth in
size and diversity.
Over the past several years, a number of Internet information discovery tools
and search engines have been introduced, including Lycos,
Alta Vista, Infoseek, Yahoo and Harvest [AltaVista, Bowman94, Hardy95].
These automated tools bear most of the responsibility for
organizing information on the Internet by automatically
classifying and indexing collections of digital data.
They periodically dispatch indexing engines, frequently referred
to as "Web Crawlers", to download and examine documents in order
to extract information that describes these documents
[Cheong96]. The extracted
information is stored in the search engine's database, along with the
uniform resource locator of the site where the document resides.
The stored information is later used to produce a list of
resources in response to requests submitted by Internet users
to query the search engine's database [Demsey96, Hardy95, Lagoze96].
These tools have become quite popular and are
helping to redefine how people think about wide area network applications.
In theory they can address the fundamental problem of resource
indexing and discovery on the Internet. Consequently they have the potential
to address the inability of human users
to cope with the Internet's rapid growth and enormous data volume.
Yet it has become clear that they are not well suited to supporting the
ever evolving information infrastructure, characterized by enormous data
volume, rapid growth in the user base, and burgeoning data diversity.
This is mostly due to the fact that these tools are not capable of
identifying basic characteristics of a document, such as its theme or genre,
let alone its underlying meaning or context. They can only
provide uniform and equal access to all resources in the Internet, for
they cannot reliably extract routine
information that a human indexer can find through a simple cursory
inspection. As a result, queries issued by Internet users often produce
an overwhelming number of responses, which frequently contain
references to irrelevant material while leaving out more relevant ones.
This paper addresses some of the
problems involved with resource discovery in the Internet.
Our approach is based on the concept of active documents advertising, whereby
a document dynamically builds metadata, joins a group of documents
of the same interest and advertises the collected metadata
to the members of the group.
This abstraction of metadata is called an adlet, which is the core of our approach.
The characterization of metadata may range
from a title or type, to key features of the document or of the author. The
metadata is derived from a) the user specified
detailed annotations of the document,
b) the advertisement strategy and c) the recruiting
strategy. The annotations provide in-depth characterization of a document
and is used as basis for indexing, document retrieval and information fusion. The advertisement
strategy defines the policies used by the active document in advertising
itself to the rest of the Internet.
The recruiting strategy determines the type
of documents allowed to become associated and/or fused with this active document in
its evolution toward an enhanced and richer document.
The paper is organized as follows. Section 2 presents the system architecture.
The basic definitions for adlet are introduced in Section 3. The visual
specification of adlets by annotation, the visualization of adlets and
real/virtual sites, and the merging of adlets are discussed in Section 4.
Section 5 describes our approach for adlets group management and negotiations.
The construction and update of virtual graph, which together with the adlets
form the knowledge base, is described in Section 6. An experimental
prototype based upon the active index system is being built, as described
in Section 7. In Section 8 we
compare our approach with ongoing research. Concluding remarks about
further research are given in Section 9.
2. System Architecture
The system architecture supporting
active document advertising should address the issue of how to organize the resource space flexibly
and dynamically. Rather than
imposing a hierarchical structure, our approach allows the active
document structure to evolve in accordance with usage patterns,
based on the advertising and recruiting strategies specified by
the user. More specifically the active documents dynamically
organize and search the resource space by constructing links among
themselves based on the metadata that describe their contents, types,
context, the advertising and recruiting strategies
specified by the user, and the semantics of the type of active document
being sought. The links form a virtual graph, with a flexible set of
hierarchies embedded within the graph to provide efficient searching.
The virtual graph's edges are associated with weights that reflect
the likelihood that the adlet attached at the other end is interested
in recruiting this document or is itself representing a document that can be
of interest to the current active document. The virtual graph's
structure evolves over time through the use of virtual graph updating algorithms
in accordance with user specified metadata. This approach
provides the basis for the development of a scalable, customizable
architecture for gathering, indexing, caching, replicating, accessing and fusing
Internet information.
Adlets use the virtual graph to locate an adequate number
of other adlets and their corresponding documents situated in their sites.
Active index cells [Chang95] are associated with adlets representing abstracted
active documents, and through these index cells advertisements can be sent and received in accordance with negotiation protocols.
The active index system [Chang96b] collects indexing information received
from other adlets, suppresses duplicate information and filters
out undesired information using the user-specific profile. The active
index system then summarizes adlet information in various type-specific
ways to generate structured indexing information that can be
used by the Adlet Visualizer and search entities [Abit97].
The active index is designed to support multiple "views"
in a way such that the gathered information about adlets
can be easily extracted and correlated and integrated with
local adlet information and information
received from remote sites [Kim95, Sheth90]. The fusion
of information will reflect the profile of the
associated user and will facilitate the visualization of
different views of the related active documents.
Figure 1. System architecture to support active document advertising.
Our approach can be summarized
in plain language as follows:
Adlet tells you what to look for and what to avoid,
virtual graph tells you where to go, and
negotiation protocol tells you when to go and how to go.
Figure 1 shows the system architecture.
The Adlet Visualizer supports visual specification and visualization of adlets.
The Adlet Manager supports creation and composition of adlets.
The Adlet Negotiator handles negotiation protocols and communications
with the network.
The Active Index System maintains the virtual graph and handles all
messages to/from adlets and active documents.
3. Adlets
In this paper we concern ourselves mainly with the advertising,
negotiation and exchange of active documents. The documents are active in
the sense that each document can perform actions with autonomy.
An adlet is the means by which an active document makes itself known
to other active documents. An active document itself may post an adlet,
or a user may post an adlet for an active document. An adlet has a
target, i.e., it is meant only for active documents capable of
negotiating and exchanging certain classes of documents. Usually the
target can be defined based upon the notion of document hierarchy,
i.e., documents satisfying certain conceptual relations are regarded
as the target. Finally documents to be avoided constitute the non-target for
this adlet.
The semantic aspects of the advertisement
associated with an adlet must be defined in a controlled and
uniform manner. The user/application must be able to request a style
of advertisement, i.e., a certain advertising strategy.
Upon its creation, an adlet attempts to join a group of adlets,
possibly creating a new one if such a group does not yet exist.
Other adlets may join the group, leave the group, advertise
to the group and receive advertisement from other members of the group.
Definition 1: An adlet is defined as
adlet = (doc, profile, target, non-target, ad-strategy, prop) where
doc uniquely identifies the document to be advertised,
profile is a set of conceptual relations from a concept space characterizing this document,
target is a set of conceptual relations characterizing documents to be recruited,
non-target is a set of conceptual relations characterizing documents to be avoided,
ad-strategy is the advertising strategy, and
prop are other properties (see Section 4.1).
The profile is specified, in the simplest case, by keywords associated
with the document. In general the profile is a set of conceptual relations, or a conceptual graph, derived from user's annotations. (We regard a concept as a special case of a conceptual relation with a null second part.) This information is used by the adlet to develop
the virtual graph and maintain it as it evolves dynamically.
The target specifies the scope and the focus of the advertisement.
The scope of an advertisement may be global, restricted or local.
Within a target the focus of the advertisement may be all adlets or
a selected subset of adlets.
The non-target specifies what is out of the scope.
The ad-strategy determines the advertising strategy used for document advertisement.
The strategy could be aggressive or reactive. An aggressive
strategy initiates advertisement immediately after its creation
and continues advertising in a periodic fashion. A reactive strategy
on the other hand engages in advertisement only in response to advertisement
received from other active documents.
Example 1: A document can be a WWW page and therefore doc = URL.
The document is characterized by a set of keywords and therefore profile = {keyword1, ...., keywordn}.
This WWW page is in
the class of browsable documents and
the intended target is also a selected collection of browsable documents, i.e., target = {'Browser'},
where 'Browser' characterizes a document class in the
class hierarchy. The non-target is unspecified and therefore the default is the empty set.
The advertising strategy is reactive.
Example 2: A document can be a database record. The record_id is its doc.
The document is characterized by a conceptual schema
describing the conceptual structure in which this record instance is situated,
and therefore its profile = { schema_name }. The target
is { 'financial record' }, denoting the collection of financial records of an enterprise.
The non-target is { 'scientific' }. The advertising strategy is aggressive.
Example 3: An active document may post an adlet by itself. In this case,
ad = (doc, profile, {'all'}, {'none'}, ad-strategy, prop). In other words, this adlet is posted by one active
document for 'all', with 'none' to be avoided. Such an adlet
needs to be refined so that it can aim at more specific targets.
Definition 2: Two adlets ad1 and ad2 can be partially ordered as follows,
ad1 < ad2 if and only if the following holds:
a) doc1 = doc2,
b) profile1 = profile2,
c) target1 is contained in target2,
d) non-target1 contains non-target2,
e) ad-strategy1 = ad-strategy2, and
f) prop1 = prop2.
In other words ad1 is intended for a smaller target group,
so ad1 is more refined than ad2.
Definition 3: The negotiation protocol
is a sequence e-proc= ((d11,d12), (d21,d22), ..,
(dn1, dn2)), where each (di1,di2) is an exchanged pair of documents
and the final pair (dn1,dn2) ia s pair of goal documents.
Example 4:
Let the first entry be the consumer's and the second the producer's,
then an information exchange negotiation protocol typically may look like:
(('Mars',-), (-,'Heaven'), ('Sojourner', -), (-, doc-13), ('Thanks', -)).
Therefore, the adlets should induce a sequence of information exchanges e-seq
between the consumer and the producer, such that e-seq
contains e-proc as a sub-sequence.
If this is feasible, then we say the negotiation protocol is supportable.
Definition 4: A negotiation protocol is supportable if the adlets can
induce a sequence of information exchanges e-seq such that e-seq contains
e-proc as a subsequence and the final pair (dn1,dn2) of e-seq is
a pair of goal documents.
Additional constraints may be imposed on the supportable negotiation protocol,
such as the progressive refinement of the adlets, leading to a more precise
search criterion and consequently a smaller set of matching documents.
4. A Scenario
Adlets can be generated by active documents to interact with other
adlets or active documents to accomplish the objectives of an advertisement plan.
In this section we describe a scenario of active documents advertising using adlets.
4.1. Visual Specification of Adlets by Annotation
When a document type is defined,
the user can visually specify the types and properties of its adlets.
The design of this visual
specification language is based upon the theory of visual languages.
The mode of interaction is direct manipulation. The user annotates
the objects on the screen by pointing, clicking and entering keywords.
The user first defines a concept space where the concepts are color-coded.
Each adlet type has a dominant color which is the concept that characterizes
it most closely, and other less dominant colors corresponding to the concepts that
characterize it with varying degree of closeness.
The darkness or lightness of a color indicates the closeness of
this characterization. Each adlet type also has a dominant color which is
the concept that attracts it most, and other less dominant colors for less attractive concepts.
Finally, concepts that repulse the adlet are also defined.
These color codes are shown in different areas on the adlet or
dynamically shown in an alternating fashion.
Figure 2 illustrates the visual specification of an adlet
which carries the following information:
"I represent doc,
I am profile,
I love target,
I hate non-target,
I stay put (reactive) or travel (aggressive) and
I behave like prop".
Figure 2. Visual specification of an adlet.
The attraction/repulsion property of an adlet is important in determining
the motion behavior of an adlet.
An attraction/repulsion force of an adlet to a color-coded concept can
be one of the following, as specified by the user (perhaps by clicking on a sliding scale):
A1. Extreme attraction: the adlet must merge with an adlet/site of said color.
A2. Strong attraction: the adlet attempts to merge with an adlet/site of said
color until precondition becomes 'false'.
A3. Weak attraction: the adlet makes a best effort attempt to merge with an
adlet/site of said color.
A4. Neutral: no attraction/repulsion.
A5. Weak repulsion: the adlet makes a best effort attempt to avoid an
adlet/site of said color.
A6. Strong repulsion: the adlet attempts to avoid an adlet/site of said color
until precondition becomes 'false'.
A7. Extreme repulsion: the adlet avoids the adlet/site with said color, even if
this means its own destruction.
The ad-strategy of an adlet type can be specified by clicking on either
"reactive" or "aggressive".
Other properties of an adlet type to be specified by the user include the following:
B. Destructability: Adlet can be destructive or nondestructive. Destructive adlet can be
destroyed when a condition becomes 'true'. Nondestructive adlet cannot
be destroyed under ordinary circumstances.
C. Regenerability: Adlet can be regenerating or nonregenerating. Regenerating adlet is
able to regenerate instances of its identical type. Nonregenerating adlet
cannot regenerate.
D. Migration: Adlet can be migrating or nonmigrating. A migrating adlet will seek
out site that attracts it and migrate to that site. Nonmigrating adlet
will not migrate by itself.
(An aggressive/reactive ad-strategy leads to migrating/nonmigrating adlets.)
E. Temporal Sensitivity: Adlet can be temporally sensitive or nontemporally sensitive, depending on whether
its predicates and actions involve time as parameters.
F. Location Sensitivity: Adlet can be location sensitive or nonlocation sensitive, depending on whether
its predicates and actions involve location as parameters.
G. Context Sensitivity: Adlet can be context sensitive or noncontext sensitive, depending on whether
its predicates and actions involve contextual variables as parameters.
After an initial session to perform detailed annotation, the user creates an adlet type
which can be reused in the future when similar document type is encountered.
4.2. Visualization of Adlets and Sites
Adlets move from sites to sites to seek out new
adlets or active documents, announce its own presence and collect information. In its movement
adlets behave like magnets. An adlet or a collection of adlets can attract
other adlets and pull them together. Conversely an adlet or a collection of
adlets can repulse other adlets and push them apart. A collection of adlets
is called a virtual site or simply a site.
A virtual graph is the set of links and conceptual relations that
characterizes a (virtual) site and its relations with other sites.
As described in the previous section, adlets and sites both
have attraction/repulsion properties. Adlets are also directional.
The direction of an adlet is a derived property and generally indicates the direction of greatest
attraction/repulsion exerted by another adlet or another site. Site, on the
other hand, does not have a directional property.
Figure 3. Visualization of adlets and sites.
Figure 3 illustrates the visualization of adlets and sites, where an adlet
is indicated by a wedge-shaped object and the direction of an adlet is
indicated by the direction of the wedge's longer axis. The sites are indicated
by round-shaped objects. For real sites, the sites' locations are their
geographical locations. For virtual sites, the sites' locations are not
geographical locations, but their logical locations in some virtual space.
The distance between two virtual sites is proportional to the force
of mutual attraction.
One virtual site may span several real sites, and one real site may
contain multiple virtual sites.
As shown in Figure 3, an adlet A1 will not visit all the sites on the Web,
which will be far too numerous. Rather, only sites mentioned in the
virtual graph of the site containing the active document that generates adlet A1 will be
visited by this adlet. Since this virtual graph is dynamically updated
when adlets are merged, an adlet's effective sphere of motion will
change over time.
4.3. Merging of Adlets
When adlets collide, i.e., when their mutual attraction force exceeds a certain threshold
and they are in the same site,
the composition rules dictate whether two adlets should
be merged to form new adlets, with the possible
side effect of the updating of the virtual graph.
Therefore an active document can be linked to another
active document through the interaction among adlets, leading to the discovery of
new relationships and information fusion. The four most important adlet composition operators are:
(1) Concatenation: Adlet 1 and adlet 2 are combined into a train of adlets, but this
train is regarded as a single adlet. The visual operator is a 'train'.
(adlet1 adlet2 => adlet3)
(2) Equi-Join: Adlet 1 and adlet 2 are joined into a single new adlet, with properties
inherited from both parents. The visual operator is a 'heart'.
(adlet1 adlet2 => adlet3)
(3) BW-Join: Adlet 1 destroys adlet 2 but takes over some of the
latter's properties. The visual operator is a 'spider' -- the Black Widow.
(adlet1 adlet2 => adlet1)
(4) Split: Adlet 1 is split into adlets 2 and 3.
The unary visual operator is represented by a 'pair of scissors'.
( adlet1 => adlet2 + adlet3)
These operators can be combined to specify more complex operators. For
example two adlets can first be merged by BW-join and the new adlet is
then split into two.
Thus we can specify adlets that are: consumed, forever circulating, procreative, and so on.
The application of any of the above operators may also lead to the
update of the virtual graph, because when the adlets are merged, the
documents they refer to need also to be related. The dynamic links are
added to the active documents and the directed arcs are added to the
underlying virtual graph. Thus the effect of the interaction among
adlets is a continuous, dynamic update of the
virtual graph.
A site's virtual graph and its adlets together form its knowledge base.
Therefore an important issue is when and how to update the virtual graph of a site.
5. Adlets Group Management and Negotiations
As explained above, when a user creates/updates
a document the side effect could be the creation and communication of adlets.
Since adlets are light-weight and inexpensive, they are exchanged in the negotiation among
active documents prior to the transmission of actual documents. Therefore
actual documents are not transmitted unless both participants express interest
through the negotiation using adlets.
When two documents negotiate on the prospect of joining a group, what
they are negotiating is to find the admissible common subgraphs
of the conceptual graphs (the profiles and targets) for the two documents.
Active documents can be viewed as a distributed group
of entities cooperating toward establishing a relationship
to advertise, discover and mutually enhance their contents.
The establishment of such a relationship requires mechanisms
to support updating of the virtual graph and group membership,
and efficient algorithms to establish minimum cost probabilistic
multicast trees.
The mechanisms to support updating of the virtual graph should allow
the creation of new virtual links, while maintaining existing ones,
mutual discovery of active documents, the establishments of adlet groups
dynamically and the recruitment of new adlet group members.
These mechanisms should be generic, extensible and ensure co-existence
of different execution environments and network node architectures.
Our approach is to provide
a generic capability that would allow the encapsulation
of the packet carrying an adlet, referred capsule, so that it
integrates in its payload a program to be executed at the receiving end
of a targeted active document. In addition to the executable program, the
capsule may carry various options, including but not limited to authentication,
confidentiality and integrity. The capsule uses its original
specification parameters in terms of type,
profile, advertising and recruiting strategies, to perform the required actions to determine the likelihood of a "common interest" relationship
between the active document where the adlets originated and the targeted
active documents. In case of a match the targeted document is invited
to join the group adlets.
The generic mechanisms for adlet discovery and group management
require efficient multicast tree design algorithms
to support point-to-multipoint
communication among active documents. This tree is derived
from the virtual graph by finding the dynamic links
among active documents with similar content, interest and semantics.
The multicast tree is used to send capsules to recruit new
active documents or advertise the profile of the originating document.
From the routing point of view, an efficient multicast algorithm only
replicates capsules when necessary, namely at the branching points of
the tree. Furthermore the algorithm allows for dynamic
membership as new
members may join the multicast tree while other members may leave.
These changes may impact optimality of the multicast tree
supporting the traffic. Consequently, the multicast
algorithm should deal with on-line changes in an efficient manner so that
the actual delay observed by existing and new members during
the exchange of information is bound, while the overall cost of
the tree is minimized. This requires the development of efficient and simple
multicast algorithms that guarantee an acceptable level of
quality of service, while managing efficiently the network resources
[Waxman88, Imase91].
Several attempts have been made toward the development
of an efficient solutions for the multicast
problem [Kompella93, Sun95]. Very little research work
has addressed this problem with the context of active
networks supporting adlet communication and negotiation.
An intuitive solution to this problem is to rebuild the tree using a
static algorithm whenever there is a change
[Dalal78, Hakimi71, Winter87]. However, such a solution
may have repercussions for members who remain in the group since there
may be a disturbance in the communication. Furthermore, such a change
may cause packets to arrive out of order. Another solution is to
permit local or partial reconfiguration when modification to the
membership occurs [Estrin97, Pusateri97, Zhang93, Deering91].
Yet another approach is to start with an optimal tree
and make minimal changes as group membership changes without causing
disruption to the members who remain in the group. This approach,
however, may lead to large inefficiency [Alrabiah97].
While several issues related to static multicast algorithms
have been the focus of research,
supporting dynamic trees in WDM networks has not been
extensively investigated [Waxman88, Imase91].
6. Virtual Graph Construction and Update
Active documents in our approach evolve under constraints so that the system can improve
over time. This evolution is made possible due to the interaction among
adlets, the recruiting and establishments of groups, the dynamic linking of documents
and the updating of virtual graphs.
Since the knowledge base consists of the adlets and the virtual graph,
an important issue in making the constrained evolution possible is when and how to update the virtual graph of a site.
Every time a new adlet is created, either because a new document is created
or because two adlets are merged, the system needs to decide whether to
update the virtual graph of the active document responsible for generating this adlet.
A subgraph of the virtual graph represents a conceptual view of the
collection of information in a virtual site.
Adlets may carry time-stamps to facilitate conceptual view integration through adlet merging and virtual graph updating.
A virtual graph consists of embedded hierarchies (trees) of adlets to
facilitate graph processing and searching.
An example is illustrated in Figure 4.
Figure 4. A virtual graph with embedded hierarchies of adlets.
A virtual graph consists of nodes N = {n1, ...., nk} and arcs
A = {a1, ..., an}. Each node ni = (Si, Ti) is defined by a site Si and an
adlet tree Ti. An arc (ni, nj) connects one node ni
and another node nj if the two sites Si and Sj are conceptually related.
Initially the virtual graph contains only a single node ni for the local
site Si. Suppose this site has documents doc1, .., dock, the corresponding
adlet tree Ti contains adlet1, ....., adletk, organized into a
tree. Notice we assume the documents in a site, and thus the
corresponding adlets, can be organized into a conceptual hierarchy.
In the virtual graph,
each node has its own information such as how many and what kind of adlets
reside. In addition to its local adlet information, a node has information
about adlets which resides in other nodes. This information has distance
information and this distance information will be used before an adlet
starts traveling. The adlet will decide whether it should go for advertising
or not depending on its scope of advertisement.
Since a node has distance information on all the adlets in a VG, as
the adlet obtains richer information for itself by traveling through the
network, this distance information can be used for retrieval of related documents.
For example, suppose one adlet has information about its related documents.
When a user poses a query for a document which the adlet represents, conditions
with it would be that how (far) related documents should be retrieved.
Since the adlet has information on related documents, the search
engine can use this information to decide whether the related documents
should be retrieved according to the distance information and depending on
the given condition.
This distance information for the adlets that reside in other nodes
can be shared among local adlets. Sharing the distance information may
lead to low memory usage and effective computation for retrieval of related
documents, because the search engine would be able to narrow down related
documents that satisfy users' conditions before starting actual retrieving.
As explained above, hierarchies of adlets in a node can be dynamically computed
as a new adlet is added to the node or an existing adlet is deleted from
the node. Also, distances for conceptually related nodes can be expressed
in some way such as using typical documents distances (center document
distances in document clusters) between two nodes or average-like distances
of related documents between two nodes. Therefore, we can focus on exchanging
nodes' information in a VG.
6.1. VG Update Algorithm
In the following VG update algorithm, when a change, such as addition of an adlet and deletion
of an adlet, occurs in a node, that node notifies the other nodes that
a change has occurred.
The arcs connecting nodes represent the conceptual relations between
nodes. Each node knows only directly connected nodes, so the node in which
a change has occurred sends its updating information only to the connected nodes.
The node that received the updating information will check the received
information referring to its current node-information-table, then update its
node-information-table. If this node has a connected node, this node will send the
updated information to its connected node.
In this manner, the updated information will propagate through the VG.
Figure 5. An example of virtual graph updating.
6.2 Node Updating Algorithm
As illustrated by the example in Figure 5, the VG consists of nodes N = {n1, n2,
n3, n4, n5}. For each connection of nodes, the distance is labeled on it.
In the figure, the distance (arc) between nodes n1 and n2 can be expressed
as (n1, n2) = 5.
Initially, each node makes a table, which looks like a routing table.
The initial table for node n2 in this figure would be:
Table for n2
nodename n2
connectednode n1 5
adletname adlet6 0 null doc6
profile6 t6 nont6 aggressive
adletname adlet7 0 null doc7
profile7 t7 nont7 aggressive
adletname adlet8 0 null doc8
profile8 t8 nont8 aggressive
adletname adlet9 0 null doc9
profile9 t9 nont9 aggressive
adletname adlet10 0 null doc10
profile10 t9 nont9 progressive
The first row of the initial table for node n2:
nodename n2
This line tells us the name of this node, which is "n2."
The second row:
connectednode n1 5
This row tells us where this node, n2, is connected to and the distances
to the node connected. From the row, n2 is connected to nodes n1 and the
distances is 5.
The fourth through eighth rows from the top of this table, these rows
tells us about the adlets that reside in this node.
Each value in the columns shows the adlet's name and its attributes.
Each row follows the format below:
adletname name distance
route doc_id profile
target non-target ad_strategy
adletname
: tag string for the program to distinguish
adlet information from other information
name
: name
of adlet
distance
: distance to
reach this adlet from this node. If this adlet resides in this node, the
distance is 0 (zero).
route
: route
to be taken from this node when reaching this adlet. If this adlet resides
in this node the value is
null
doc_id
: document id this adlet represents. This
uniquely identifies the document to be advertised
profile
: profile
of the document this adlet represents. This is a set of conceptual relations
from a concept space characterizing this document
target
: a
set of conceptual relations characterizing documents to be recruited
no-target
: a
set of conceptual relations characterizing documents to be avoided
ad_strategy :
advertising strategy for this adlet
The attributes for an adlet such as distance, profile, target,
and non-target can be a pointer to another data structure or database
systems.
Since node n2 is connected to node n1, when this node is added to the
VG, node n2 sends its node's table to node n1. Node n1 is an existing node
and it already has information about all the nodes except node n2's in
the VG. After receiving table information from node n2, node n1 adds new
connected node name "n2" in its table. Then, after adding new adlets' information
to its table, node n1 sends back its information to node n2 since node
n2 is a new node and doesn't have any information about other adlets in
the VG. Then, node n2 regenerate its table according to the received information
from node n1.
The node updating algorithm is summarized below.
Step 1: If a node (sender) has a change in its node-information-table, send
new node-information-table to connected nodes (receivers).
Step 2: The receiver updates its table comparing with the table received
from the sender.
-
If the sender is a new node for the receiver, add an entry to the receiver's
table with distance set to that of the sender's node (this distance is in the sender's
table).
-
If a new adlet is in the receiver's table, add the name of adlet and distance
to the table. The distance is calculated by adding the distance between
the sender and receiver to the distance of adlets. The route for
this adlet is the name of the sender.
-
If the sender is not new, this means the sender should maintain all the
adlet information in the VG, and if a known adlet is in the receiver's
table but not in the sender's table, take off this entry
Step 3: If the receiver has connected nodes, send the new table to them
(the receiver becomes the sender), and go back to Step 1. If the receiver
doesn't have connected nodes besides the sender, finish updating.
These steps are iterated until all the nodes have been updated.
To implement the update algorithm described above, an update function vg_update(
) is called recursively. To avoid sending information back and
forth between two nodes infinitely, a timestamp is
used. Therefore, the sender will not send its update information to the
node that has the same timestamp.
6.3. Migration of Adlets
When a new document is created, if the corresponding adlet
is reactive, the adlet stays in the local site
and nothing more happens. If this adlet is aggressive, this adlet will
travel to other sites in a dynamically computed multicast set according
to the following policy:
If (ni, nj) is an arc in the virtual graph, we compute the conceptual distance between Si
and Sj, and if the distance is less than a threshold then Sj is added to the
multicast set according to increasing order of this conceptual distance. In other
words Sj with the smallest conceptual distance is added to the multicast
set first, followed by the one with the next smallest conceptual distance
and so on, until the preset maximal size is reached.
The adlet can then travel to the nodes in the multicast set according to
the negotiation protocol.
7. An Experimental Prototype
In our implementation of the experimental prototype we employ
the active index system that forms the basic layer to implement adlets and
active documents.
Our approach is that an adlet is instantiated through the
instantiation of active index cells. Using this approach the Active
Index System will manage all adlets (index cells) and handles adlet
migration through distribution of the index cells.
We are currently building an experimental prototype based upon the
active index system. CORBA is used as the middleware for index cell
management.
In what follows we give an example of adlets in action. The behavior
of the active index system is described in parentheses.
1. Initial State:
As shown in Figure 4, the virtual graph VG has three nodes n1, n2 and n3.
Node n1 is for site S1 with adlets adlet1, adlet2 and adlet3.
Node n2 is for site S2 with adlets adlet4, adlet5, adlet6, adlet7 and adlet8.
Node n3 is for site S3 with adlets adlet9, adlet10, adlet11, adlet12 and adlet13.
There are arcs a12 and a23, indicating S1 and S2 are conceptually related,
and S2 and S3 are conceptually related.
2. Adlet Creation:
Suppose document3 has just been added to site S1, and the corresponding
adlet3 is added to the virtual graph.
(In active index system, this means an index cell ic3 is added to site S1.)
3. Adlet Migration:
If adlet3 is aggressive, it will travel to conceptually related nodes.
Since there is an arc a12, adlet3 can travel to site S2.
(In active index system, this means index cell ic3 will send a message to
IC_Manager of site S2 to clone itself! But the original ic3 will stay
at site S1. In other words, the original adlet3 will never leave its
home site. It is its clone that will materialize in other sites!)
4. Determination of Matched Adlets:
Adlet3, once at site S2, will try to merge with adlets there. This is done
by finding whether its target matches another adlet's profile, and its
non-target does not match another adlet's profile. Let us say adlets
adlet5 and adlet7 are thus identified.
(In active index system, the cloned ic3 will broadcast messages to all ic's at
site S2 and ic5 and adlet7 will respond.)
5. Adlets Merging by BW-Join:
Adlet3 and adlet5 are merged. The adlet composition operator is the BW operator,
symbolized by the Spider. Adlet3 absorbs adlet5. This means document3 should
absorb some of the information contained in document5. What happens is that
updated adlet3 will travel back to site S1 to update original adlet3.
(In active index system, the message sent by ic5 to ic3, will cause ic3 to send
a message back to the original ic3 at site S1, causing it to be updated.)
6. Updating of Virtual Graph:
Using the algorithm described in the previous section, the virtual
graph is updated.
(In active index system, updating of virtual graphs may be an action performed
by a special ic, or even the IC_Manager itself.)
7. Adlets Merging by Equi-Join:
Adlet3 and adlet7 are merged. The adlet composition operator is the Equi-Join
operator, symbolized by the Heart. Adlet3 and adlet7 are merged. Again,
updated adlet3 will travel back to site S1 to update original adlet3, and
the virtual graphs at various sites need to be updated. It is possible,
that document3 and document7 now form a group with special links between
them.
8. Retrieval of Documents:
When a user wants to retrieve document3, the related documents
such as document7 will be retrieved.
(The retrieval_ic is activated, which activates ic3 and in turn ic7,
to retrieve the actual documents document3 and document7.)
The experimental adlet prototype will be applied to the NSF Plant
Genome Research Program which has an objective to develop
shared resources and research tools that will enable plant genome research to
advance efficiently, rapidly and in a cost-effective manner.
This plant genome adlet system will be accessible through the Internet
using a browser. The users - scientists, educators and students - can
place annotated active tags (i.e. adlets) on important genome data items.
These adlets are active and can
travel from node to node, collecting additional information. The adlets
can be organized hierarchically so that only adlets
in a certain class are active during a user's search. For example
an expert may want to retrieve information annotated by fellow
scientists only, while a student may want to retrieve information annotated
by teacher, fellow students and expert scientists.
The system will be tested by researchers and graduate students
of the University
of Pittsburgh.
8. Comparison with Other Ongoing Research
One of the recent development to overcome the limitations of Internet browsers
and improve Web-browser technology aimed at
adding capabilities to enable communication between
HTTP servers and clients and provide
interactive construction and delivery of images in response to user input.
The basic idea is based on the observation that
browsers have always been driven by user input. Two complimentary
mechanisms, server push and client pull, were proposed
to provide the server with the added capability to push
new data down to the browser and the client to either reload
the current data or obtain new data.
A related research effort aimed at developing
active objects
which make it possible for a browser to download a program, execute
it and display the program user's interface in a web page.
The paradigm
of active objects was pioneered by Sun's Hotjava browser with
Java Aplets. Java, however, does not have
language-level support for distributed
programming. The language does contain an application
program interface for remote method invocation, but
does not provide support for distributed network objects.
To address this shortcoming the paradigm of active objects
needs to be further extended in the
development of distributed active objects that can
communicate with other active objects located on different
machines across the Internet.
Automated programs such as
Web crawlers or indexing robots generally do not have the
capability to identify the main characteristics
of a document, for the Web documents lack the structure
necessary to reliably extract the routine information
that may result from a simple cursory inspection
performed by a human user on a library index.
Several approaches have attempted to address this
problem and develop more efficient automated classification
methods. The most common approaches seek
to attach metadata to files to facilitate the indexing
and classification of Internet documents based on the collected
metadata [Weibel95]. The most significant of these efforts is the Dublin
Core Metadata program and the affiliated endeavor, the Warwick
Framework [Demsey96, Lagoze96].
Other research efforts such as WebSEEk [Beigi98]
suggest possibilities
for the indexing of visual information and demonstrate
tools to combine key-word indexing with image analysis.
However these research
efforts have not adequately addressed the problem that a Web crawler
may overload the server when indexing
information is sought. This problem is partially
addressed by Harvest [Bowman94, Bowman95],
which provides an
efficient architecture based on information gatherers
to reduce the amount of overhead caused by information
exchange between crawlers and visited sites.
A different approach to provide better support for
structuring and management of Internet information
aims at developing a set of tools and languages
based on a data model which is used to describe
the scheme of a Web hypertext in the spirit
of databases [Abit97]. Using the data model, tools are
developed to support database views of the Web.
These views can then be analyzed and integrated
using common database techniques to generate
hypertextual views of the Web. A class of these
research efforts view the Web as a large
graph of unstructured documents and provide
support for queries based on the structure
of the graph. Several query systems
for unstructured data have been proposed. W3QS [Kono95]
uses common information retrieval techniques and
the organizational properties of the hypertext
to allow for both structure specifying queries
and content queries, and provide management tools
for Web forms. WebSQL [Mend96], WebLog [Laks96], Lorel [Abit96] and unQL [Bune96] are yet
other efforts toward the formalization of languages,
to support query locality, restructuring, querying heterogeneous
and semi-structured information, and the provision
of proper database techniques and constructs to analyze
HTML documents and extract their structure.
Other papers, such as TSIMMIS [Chawa94], aim at integrating Internet information
from the Web by extracting data from heterogeneous sources
and correlating these data to generate an integrated
database representation of the information. Information Manifold [Levy96]
provides a specific support for querying on the
basis of declarative descriptions of the contents
of database records accessible through a fill-in form-based
interface. These approaches demonstrate the feasibility
of manipulating and integrating unstructured
Internet information but do not address the important
aspect of how these data are discovered and retrieved.
These systems and the solutions they provide are
very valuable in achieving the ultimate goal
of Internet browsing and information retrieval.
The exponentially increasing volume of the networked
information, however, require "guidance" on where
the limited time of an Internet user is to be spend
in order to locate the most relevant documents for
a given purpose. New frameworks are required
to address these issues and ease the burden of
feeding the crawlers that repeatedly scan the Web sites
with information to index, thereby causing enormous amount of computing
time to be wasted.
The adlet framework takes a different approach than most
of the research efforts described above. The basic tenet
of the adlet framework is that in many cases it is almost
impossible for a user browsing the Internet
to specify accurately the type of information sought
or prescribe the best action to obtain the desired documents.
This is due to the fact that (i) Web crawlers, spiders
or indexing robots have difficulty identifying characteristics
of a document such as its overall theme or its genre, and (ii)
that Web-browsers offer little or no support to a structured
view of the information on the Internet.
The adlet
framework provides a new approach to resource discovery based
on the concept of active document advertising.
In this paradigm, a document
builds metadata using a high-level specification of user preferences
and joins a group of documents of the same interest, thereby
enhancing its own content and advertising the metadata
to other active documents for their respective enhancements. This
aspect of "adaptive document reinforcement" enables the discovery
and indexing of related documents and provides the basis
of the fusion and better structuring of the Internet information for a faster
and easier extraction, retrieval and manipulation of relevant fused information.
Furthermore, the concept of advertising among adlets
avoids the necessity to export the entire contents
of a given site across the network to make up an index. The capability
of the adlets to join other adlets provides the basis for the development
of a dynamic and distributed graph of related documents
that can be retrieved on demand.
9. Discussion
Two important features make the above described approach applicable to
applications in multimedia information fusion, information retrieval, data mining,
geographic information systems, medical information systems and disaster management:
a) any document, including web page, database record, video file, audio file, image and
even paper documents, can be enhanced by an adlet and become an active document;
b) any node in a nonactive network can be enhanced by adlet-savvy software and
the adlet-enhanced node can co-exist with other non-enhanced nodes.
Active networks allow individual user, or groups of
users, to inject customized programs into the nodes of the
network. Active architectures enable a massive increase in the
complexity and customization of the computation that is performed
within the network, e.g., that is interposed between the
communicating end points.
We will deal with hybrid active networks where some or even all nonlocal nodes are unable to handle
adlets. The approach is to provide an adlet extractor to extract adlets from documents.
Thus at an adlet-enhanced node, incoming documents are converted into
active documents by the extraction of adlets. The ad-strategy for such adlets
will be reactive, i.e., these adlets that are extracted from documents will remain
in their sites and do not travel. This way, adlet interaction is dealt with
by the local node equipped with the adlet-savvy software.
But as more and more nodes are enhanced with adlet-savvy software, the
network becomes more active, and the behavior of adlets more sophisticated.
The experimental system will provide a testbed for feasibility studies in such an active or hybrid network environment.
In Section 4 we described an approach for the visual specification of adlet types by
annotation, and the visual operators for merging
adlets. The semantics of this visual language [Chang96a] needs to be carefully investigated. Concerning visualization,
we need to investigate how best to determine the direction of adlet movement.
Since a site is characterized by a virtual graph, the attraction/repulsion must be
defined in terms of this virtual graph.
A further research topics is the following.
Given an advertisement plan, we need to investigate how to express the advertisement visually by
a visual sentence and translate the visual sentence into an executable
plan based upon adlets.
We need to design the negotiation protocols, i.e., the e-proc's, so that given an advertisement
plan the goal can be reached and the exchange of adlets and other network primitives
leads to the goal of completed exchanges.
In adlet interaction, who visits what site? Adlets migration may happen before or during or after
adlets interaction.
An important topic is how to apply multicast protocols in support of adlets negotiation protocols and routing.
A large number of multicast routing algorithms were designed for either low
end-to-end delay or efficient management of network resources.
However, very few algorithms take both objectives into considerations.
We need to develop new class of heuristics that
will ensure acceptable average delays, based on the conceptual
distance between sites, while minimizing the resources required to meet
these delays. Based on these heuristics we can investigate
the mechanisms and communication primitives to allow the dynamic encapsulation
of these cooperating documents into a logical group.
These primitives should allow the creation of a group
of active documents, the addition of new members to the group and the
exclusion of members from the group. Furthermore,
these primitives must allow for
the specification of the required level of quality of services
in terms of conceptual distances between sites,
the average and peak rate bandwidth requirements, and the
acceptable delay variability when documents are actually
exchanged.
As mentioned before the ad-strategy can be reactive or aggressive. A third type of ad-strategy
can be introduced -- a progressive ad-strategy can start as a reactive
strategy and become aggressive due to adlets merging or some
other conditions becoming 'true'. We need to further explore different
ad-strategies and their relationships to negotiation protocols.
Another related topic is the following.
Given adlets, how to compose adlets (i.e. conceptual view integration) so
that the initial conditions of negotiations are met. In the simplest case we need to verify the
existence of a pair of profiles (profile1, profile2) consistent with (doc1, doc2).
Adlet composition operators may be used in more complicated cases for
information fusion negotiation protocols.
These theoretical investigations will lead to a deeper understanding of
document abstraction and advertising.
Acknowledgements:
This research was supported in part by the National Science Foundation under grant IRI-9224563, "An Active Image Information System for Biomedical Image Databases".
The virtual graph updating algorithm was developed by Miyawaki Kosuke.
The Adlet prototype system is being developed by Ping-Wen Chen,
Guanlin Shen and Longjiang Yang.
References:
[Abit96] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wienner,
"The Local Query Language for Semi-Structured Data",
http://www-db.standford.edu, 1996.
[Abit97] S. Abiteboul, "Querying Semi-Structured Data", In Sixth International Conference on Data Base Theory", Lecture Notes in Computer Sciences, 1997.
[Alrabiah97] T. Alrabiah and T. Znati, "A Simulation Framework for the Analysis of Multicast Tree Algorithms", 30th Annual Simulation Symposium, April 1997, 196-205.
[AltaVista] AltaVista Technology Home Page, http://www.altavista.com
[Beigi98] Mandis Beigi, Ana Benitez, and S.-F. Chang, "MetaSEEk:
A Content-Based Meta Search Engine for Images," SPIE Conference on
Storage and Retrieval for Image and Video Database, San Jose, Feb. 1998.
[BernesLee94] T. Berners-Lee, R. Cailliau, A. Lautonen, H. F. Neileson, and A. Secret, "The World Wide Web", Communications of the ACM, Vol. 37 no. 8, August 1994, 76-82.
[Bowman94] C. Bowman, P. B. Danzig, Udi Manber, and M. F. Schwartz,"Scalable Internet Resource Discovery: Research Problems and Approaches", Communications of the ACM, Vol. 37, No. 8, August 1994, 98-107.
[Bowman95] C. Bowman, P. B. Danzig, Udi Manber, and M. F. Schwartz, "The Harvest Information Discovery and Access System", Computer Networks and ISDN,
Vol. 28, Nos. 1-2, Dec 1995, 119-125.
[Bune96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu,
"A Query Language and Optimization Techniques for Unstructured
Data", In ACM SIGMOD, International Conference on Management
of Data, SIGMOD'96, Montreal, Canada, pp. 505-516, 1996.
[Chang95] S. K. Chang, "Towards a Theory of Active Index", Journal of Visual Languages and Computing, Vol. 6, No. 1, March 1995, 101-118.
[Chang96a] S. K. Chang, "Extending Visual Languages for Multimedia", IEEE Multimedia Magazine, Fall 1996, Vol. 3, No. 3, 18-26.
[Chang96b] S. K. Chang, "Active Index for Content-Based Medical Image Retrieval", Journal of Computerized Medical Imaging and Graphics, Special Issue on Medical Image Databases (S. Wong and H. K. Huang, eds.), Elsevier Science Ltd., 1996, 219-229.
[Chawa94] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland,
Y. Papakonstantinou, J. D. Ullman, and J. Widom, "The TSIMMIS Project:
Integration of Heterogeneous Information Souces", In IPSJ Conference,
Tokyo, 1994.
[Chang98] S. K. Chang, D. Graupe, K. Hasegawa and H. Kordylewski, "An Active Multimedia Information System for Information Retrieval, Discovery and Fusion", International Journal of Software Engineering and Knowledge Engineering, March 1998, Vol. 8, No. 1, 139-160.
[Cheong96] F. C. Cheong, "Internet Agents, Spiders, Wanderers, Brokers
and Bots", New Riders Publishing, 1996.
[Dalal78] Y. K. Dalal and R. M. Metcalfe, "Reverse Path Forwarding of Broadcast Packets" Communications of the ACM, Vol. 21 No. 12, December 1978, 1040-1048.
[Deering91] S. Deering, "Multicast Routing in a Datagram Internetwork", PhD Thesis, Stanford University, 1991.
[Demsey96] L. Dempsey and S. L. Weibel, "The Warwick Metadata Workshop: A Framework For the Deployment of Resource Description", D-lib Magazine,
July-August 1996.
[Hakimi71] S. L. Hakimi, "Steiner's Problem in Graphs and its Implications" Networks, Vol. 1, November 1994, 113-133.
[Hardy95] D. R. Hardy, M. F. Schwartz, D. Wessels, "Harvest User's Manual, Version 1.2", Technical Report CU-CS-743-94, November 1994.
[Imase91] M. Imase and Bernard Waxman, "Dynamic steiner tree problem", SIAM Journal of Discrete Math, Vol. 4 No. 3, August 1991, 369-384.
[Kim95] W. Kim, Editor, "Modern Database Systems: The Object Model, Interoperability, and Beyond", ACM Press and Addison Wesley, 1995.
[Kompella93] V. Kompella, J. Pasquale, and G. Polyzos, "Multicast Routing for Multimedia Communication", IEEE/ACM Transactions on Networking, Vol. 1 No. 3, June 193, 286-292.
[Kono95] D. Konopnicki and O. Shmueli, "W3QS: A query system
for the World-Wide Web", In Intern. Conf. on Very Large Data Bases
(VLDB'95), pp- 54-65, 1995, Zurich.
[Lagoze96] Lagoze, Carl and Lynch, Clifford and Daniel, Ron, Jr. June, 1996.
"The Warwick Framework: A Container Architecture for Aggregating Sets
of Metadata.", Cornell Computer Science Technical Report TR96-1593.
[Laks96] L. Lakshmann, F. Sadri, and I.N. Subramanian, "A Declarative
Language for Querying and Restructuring the Web", In 6th Intern. Workshop
on Research Issues in Data Engineering: Interoperability of Nontraditional
Database Systems (RIDE-ND'96), 1996.
[Levy96] A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous
Information Sources Using Source Description", In Intern. Conf. On
Very Large Data Base, VLDB'96, Mumbai, Bombay, 1996.
[Mend96] A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World
Wide Web", In the First Int. Conf. on Parallel and Distributed
Information Systems (PDIS'96), 1996.
ftp:/db.torento.edu/pub/papers/websql.ps
[Pusateri97] T. Pusateri, "Distance Vector Multicast Routing Protocol" Internet Draft, February 1997.
[Seth90] A. P. Seth, and J. A. Larson, "Federal Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases" ACM Computing Surveys, Vol. 2, No. 3, September 1990, 183-236.
[Sun95] Q. Sun and H. Langendoerfer, "Efficient Multicast Routing for Delay Sensitive Applications", Proceedings of Second Workshop Protocols Multimedia Systems, April 1995, 452-458.
[Waxman88] B. Waxman, "Routing of Multipoint Connections", IEEE Journal on Selected Areas in Communications, Vol. 6 no. 9, December 1988, 1611-1622.
[Weibel95] Weibel, Stuart, "Metadata: The Foundations of Resource
Description", D-lib Magazine, July, 1995.
[Winter87] P. Winter, "Steiner Problem in Networks: A Survey", Networks, Vol. 17, 1987, 129-167.
[Zhang93] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, "RSVP: A new Resource ReSerVation Protocol, IEEE Network, 1993, 8-18.