## Entry Vocabulary Modules and Agents

### Barbara Norgard

June 28, 1998, Revised July 14, 1998

Abstract: The design of an agent that creates Entry Vocabulary Modules (EVMs) for remote repositories with unfamiliar metadata vocabularies is described. The linking provided by EVMs can serve as both an indexing and a search aid.

Acknowledgement: The work reported here was supported by Defense Advanced Research Projects Agency through DARPA Contract N66001-97-C-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies.

### Introduction

This report is concerned with the design of an intelligent agent capable of unintrusively creating Entry Vocabulary Modules for remote repositories with no more access than is necessary for conducting a search. We discuss here the design of agents that create EVMs (Task B, Year one).

### What are Entry Vocabulary Modules?

The functionality of Entry Vocabulary Modules (EVMs) is based on association dictionaries that facilitate use of unfamiliar metadata vocabularies such as the classification schemes, thesauri, and vocabularies that are used to index bibliographic and other types of databases (e.g., LCSH, ABI/Inform, and PsycINFO). EVMs address the problem of linking ordinary language terms to controlled vocabulary terms. This linking can serve as both an indexing and a search aid.

Association dictionaries are created by linking ordinary language terms (words and noun phrases) to controlled vocabulary terms which are then ranked by co-occurrence frequencies. We use ordinary language terms that occur in titles and abstracts from bibliographic and other types of records. These records represent a particular domain of discourse. (How this domain is defined is discussed elsewhere (Kim 1998).) The controlled vocabulary terms we use are the indexing terms used by any of a number of MELVYL databases that cover general topic areas (e.g., BIOSIS and INSPEC) or other publicly available databases like the U.S. Patents databases. (MELVYL is the University of California's online library system. The MELVYL system includes a library catalog database, a periodicals database, article citation databases, and other files.)

### Entry Vocabulary Agents

Entry Vocabulary Agents (EVAs) are functional entities that facilitate the scoping of subdomains and the building of association dictionaries. We see the EVAs as having the following functional tasks: selecting a domain of discourse, representing a domain, pre-processing the data set, building an association dictionary, and adding dictionaries to desktops. Each step is explained in the sections that follow.

### Selecting a domain of discourse

Most of our data sets contain four elements: article titles, journal titles, abstracts, and controlled vocabulary terms. They may also contain other elements, but we ignore them. Initial attempts to define domains of discourse have been driven partly by a desire to avoid biasing the data elements we use to build the dictionaries (i.e., titles and abstracts) with the search terms we use to define a domain in the form of a query. This lets us evaluate the dictionaries having controlled for the occurrence of the query term in the data used to build them. Therefore we avoid collecting data based on terms that occur in the title and abstract fields and instead conduct searches against the journal title field.

However, there is nothing to prevent a casual user from defining their domain in other ways and we do not see any need to restrict them. So, we can imagine using the topic term to search on the title, abstract, or subject heading fields to form a data set.

#### Searching on journal title

By searching terms that are commonly used in journal titles of particular disciplines, we gather data sets that have a strong likelihood of reflecting the discourse of a research community. For example, the query "find journal bio#" retrieves citations to articles that were published in journals that have the stem "bio" in their journal titles. In this way we are able to identify articles in the domain of biotechnology as well as other domains that involve biology.

Our strategy for selecting journal titles is currently a topic of investigation. We are making use of the Science Citation Index (SCI) and the Social Science Citation Index (SSCI) to identify the most frequently cited (and therefore frequently used) journals in a particular domain. (This is discussed in another report on the work of this project (Kim 1998).)

#### Considering the size of data sets

Another consideration that has driven our selection of a domain has been the size of the record set judged roughly by number of records retrieved by a query on the journal title field. We have found record sets of as small as 4,000 records to be effective (Plaunt 1998). Sets as large as 13,000 have been used. This concern about the size of the data sets could be handled in other ways. For instance, a random sample could be extracted from a large set. Set size could also be controlled by filtering on various attributes such as date or publication type.
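The two size-control options mentioned above (random sampling and filtering on an attribute) can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the `year` field, and the 4,000-record cap used in the usage note are assumptions chosen to match the figures quoted in this section.

```python
import random


def cap_record_set(records, max_size, seed=None):
    """Return at most max_size records, sampled uniformly without replacement.

    `records` is a list of record objects (hypothetical structure).
    A seed can be passed to make the sample reproducible.
    """
    if len(records) <= max_size:
        return list(records)
    rng = random.Random(seed)
    return rng.sample(records, max_size)


def filter_by_year(records, first_year, last_year):
    """Keep only records whose (assumed) 'year' field is in the given range."""
    return [r for r in records if first_year <= r["year"] <= last_year]
```

For example, a 13,000-record set could be reduced with `cap_record_set(records, 4000)`, or narrowed first with `filter_by_year(records, 1995, 1998)` and then sampled if it is still too large.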

Theoretically, large sets can be handled effectively, but a rapid response time is crucial in a dynamic environment. When EVMs are created in real time, the time it takes to download a set and process it becomes a significant factor in determining optimal set size. A user can be expected to allow some time for a dictionary to be custom built for his or her needs, but even the patience of an understanding user has limits. At this point, we can only guess how long someone is willing to wait. Determining the minimal size necessary to adequately cover the sublanguage of a domain is a question yet to be answered with any confidence. Since this prototype is envisioned serving as a desktop utility, methods of achieving quick responses without sacrificing comprehensive coverage should be given high priority.

### Representing a domain

Representing a domain is accomplished by forming a query appropriate to retrieve a record set of reasonable size to build a useful association dictionary. By useful, we mean that enough language must be covered to account for the majority of queries in a domain. Just what constitutes a domain is the subject of another discussion paper (Norgard 1997).

#### Subdomain agents: Assisting a user in defining a domain

Helping a user define a domain is a challenge. The domain should represent a fairly narrow topic and be represented by enough data. Our current approach is to have a subdomain agent maintain lists of the top journals for various disciplinary topics and present these topic areas as subdomains to the user. The user selects a subdomain from the list. The subdomain agent then uses the top ten or so journal titles to collect a set of records of an appropriate size to build an EVM.

The data we use to build association dictionaries is typically acquired by downloading sets of MELVYL records retrieved from a query specifying a topic of discussion in a domain. (An exception to this approach has been the U.S. Patents records which require the application of another set of procedures.) These record sets are tagged. In MELVYL, the required tags are: AN, AT (or MT), AB, and SU (or TH). Some databases have special tags unique to them (e.g., BIOSIS uses CC and BC tags). Including other non-required tags is not a problem, but they are ignored by our scripts.
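How a script might keep the required tags and ignore the rest can be sketched as below. The on-disk layout is an assumption made only for illustration: each field is taken to occupy one line beginning with its two-letter tag and a space, and records are taken to be separated by blank lines. The actual MELVYL display format may differ, and `parse_records` is a hypothetical name.

```python
# Tags named in the text; MT and TH are treated as alternates of AT and SU.
REQUIRED = {"AN", "AT", "AB", "SU"}
ALIASES = {"MT": "AT", "TH": "SU"}


def parse_records(text):
    """Split tagged text into records, keeping only the required tags.

    Assumed layout (not confirmed by the source): 'TAG value' per line,
    blank line between records. Repeated tags accumulate in a list.
    """
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        tag, _, value = line.partition(" ")
        tag = ALIASES.get(tag, tag)
        if tag in REQUIRED:  # non-required tags such as CC or BC are ignored
            current.setdefault(tag, []).append(value.strip())
    if current:
        records.append(current)
    return records
```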

Currently, we conduct these record downloads by starting script at the Unix prompt. (The script program is a common Unix utility used for recording interactions with a computer.) Then a telnet session is initiated with MELVYL and a database is selected (e.g., INSPEC or BIOSIS). A set of records is identified that represents a recognized domain of discourse. This record set must satisfy our size criteria for a "good set" and represent the discourse generated by some shared activity that can be defined by a query on the journal title field.

Then, with the script still running, we issue a request to MELVYL for a continuous display of the records with the required tags. When the record display ends, the script process is terminated thereby capturing all the records in a single file (named typescript). This file is then ready to be processed and transformed into an association dictionary.

This is a point at which agent functionality in the form of communicating with other systems comes into play. We call this the data retrieval agent.

#### Communicating with other internet systems

Parts of this data set gathering process can be fully automated with an Expect-like function that establishes communication with MELVYL, issues the query, and collects the records in a file. For other aspects of the OASIS project, we have used a Perl version of Expect (called Comm.pl) to query MELVYL and download records with satisfactory results.

Expect is written in Tcl (a scripting language). Tcl has proven unsatisfactory for flexibly processing large amounts of data. For instance, large arrays overwhelm Tcl, while Perl handles them with ease. There are workarounds, of course, but Perl presents no such barriers. Therefore, we do not use the Tcl version of Expect. The Perl version has presented no problems handling large data sets.

#### Why not Z39.50?

We have considered using Z39.50 which is perhaps theoretically more elegant, but it has some drawbacks, one of which is the limit on the number of records returned per transaction. This state of affairs can be dealt with, but we have not pursued the adaptation necessary to incorporate this approach.

Another significant drawback to relying on Z39.50 is the paucity of sites that both comply with the standard and seem suited to our research interests. Until the advantages of Z39.50 become more compelling, we will focus on other issues.

### Prepping agent: pre-processing data sets

The records in the downloaded set must undergo preparation for processing. We build two kinds of association dictionaries: a word-based dictionary and a phrase-based dictionary. And we treat them somewhat differently before making the associations between ordinary language terms and controlled vocabulary terms.

See Kim and Norgard (1998) for a more detailed discussion of this process using natural language processing techniques.

#### Extractor agent: extracting terms

Terms (nouns or noun-phrases) are extracted from the PoS-tagged titles and abstracts. These are called A-terms. The A-terms are then associated with the controlled vocabulary terms assigned to each record, which we call B-terms.

### Builder agent: building an association dictionary

At this stage, the procedure for building a word-based dictionary or phrase-based dictionary is the same. For each record, each word or noun phrase (A-term) from the training text records (currently we are using titles and abstracts) is paired with each controlled vocabulary term assigned to that record (B-term). The result of this process is a list of A-term $\rightarrow$ B-term pairs for each record.
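The pairing step for a single record can be sketched as a Cartesian product of its A-terms and B-terms. This is a minimal illustration; `pair_terms` is a hypothetical name, and counting each distinct pair at most once per record is an assumption rather than something the text specifies.

```python
from itertools import product


def pair_terms(a_terms, b_terms):
    """Pair every A-term in a record with every B-term assigned to it.

    Duplicates are collapsed so a pair occurs at most once per record
    (an assumption), and the result is sorted for stable output.
    """
    return sorted(product(set(a_terms), set(b_terms)))


pairs = pair_terms(["gene", "maize", "gene"], ["Genetics", "Botany"])
for a_term, b_term in pairs:
    print(f"{a_term} -> {b_term}")
```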

#### Collecting frequency counts

To create the contingency tables needed for computing the degree of association between A-terms and B-terms the following counts are collected:
1. Frequency counts for each unique pair.
2. Frequency counts for A-terms (words or noun phrases from titles and abstracts).
3. Frequency counts for B-terms (controlled vocabulary terms).
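The three counts listed above can be gathered in a single pass over the records. The sketch below is illustrative only: `collect_counts` is a hypothetical name, each record is assumed to arrive as a pair of term lists, and counts are taken per record (a term or pair is counted once for each record in which it appears).

```python
from collections import Counter


def collect_counts(records):
    """Count pairs, A-terms, and B-terms across records.

    `records` is an iterable of (a_terms, b_terms) tuples, one per record.
    Returns (pair_counts, a_counts, b_counts, total_records).
    """
    pair_counts, a_counts, b_counts = Counter(), Counter(), Counter()
    total = 0
    for a_terms, b_terms in records:
        total += 1
        a_set, b_set = set(a_terms), set(b_terms)
        a_counts.update(a_set)
        b_counts.update(b_set)
        pair_counts.update((a, b) for a in a_set for b in b_set)
    return pair_counts, a_counts, b_counts, total
```

The total number of records is returned alongside the three counts because it is needed later to derive how often neither term occurs.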

#### Creating contingency tables

Frequencies for four combinations of A-term and B-term are calculated: (1) the A-term and the B-term; (2) the A-term but not the B-term; (3) the B-term but not the A-term; and (4) neither the A-term nor the B-term.
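Given the pair, A-term, and B-term frequencies and the total number of records, the four cells follow by subtraction. A minimal sketch, with `contingency_table` as a hypothetical name and all counts assumed to be per-record counts:

```python
def contingency_table(pair_count, a_count, b_count, total_records):
    """Return the four cells of the 2x2 table for one A-term/B-term pair.

    Order: (A and B, A but not B, B but not A, neither A nor B).
    """
    both = pair_count
    a_only = a_count - pair_count
    b_only = b_count - pair_count
    neither = total_records - a_count - b_count + pair_count
    return both, a_only, b_only, neither


# An A-term in 3 of 10 records, a B-term in 2, co-occurring in 1.
print(contingency_table(1, 3, 2, 10))  # (1, 2, 1, 6)
```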

#### Computing degrees of association

Computing the degree of association between A-terms and B-terms completes the steps for creating a dictionary of ordinary language terms associated with controlled vocabulary terms. This computation uses the values from the contingency tables created in the previous step. These associations are ranked by a log frequency weight. The $-2\log\lambda$ statistic used is discussed elsewhere (Plaunt 1998).
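As an illustration of how such a weight can be computed from a contingency table, the sketch below uses the standard log-likelihood ratio for a 2x2 table, $2\sum O \ln(O/E)$, where $O$ is an observed cell and $E$ its expected value under independence. That this is the exact formulation used in the project is an assumption; the cited paper is the authoritative description. The function name is hypothetical.

```python
import math


def neg2_log_lambda(both, a_only, b_only, neither):
    """Log-likelihood ratio statistic for a 2x2 contingency table.

    Computes 2 * sum(O * ln(O / E)) over the four cells, where E is the
    expected count if the A-term and B-term occurred independently.
    Cells with an observed count of zero contribute nothing.
    """
    n = both + a_only + b_only + neither
    rows = (both + a_only, b_only + neither)   # A present, A absent
    cols = (both + b_only, a_only + neither)   # B present, B absent
    observed = ((both, a_only), (b_only, neither))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total
```

A table in which the two terms are independent scores zero, and the score grows as co-occurrence departs from what independence would predict, so sorting pairs by this value in descending order yields the ranking.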

### Directory agent: Adding dictionaries to desktops

When the sequence of steps just described is implemented as an automatic process, the resulting association dictionary will be added to some sort of desktop catalog that the user can access and use for searching.

This could be as simple as automatically adding a link to a web page for a new association dictionary. Following that link would trigger an interactive search session against the association dictionary named by the link.

### Using an association dictionary

We currently maintain a simple web interface to a small sample of association dictionaries for a limited number of domains. Ordinary language queries can be submitted to a dictionary and a ranked list of controlled vocabulary terms is returned.
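A lookup of this kind can be sketched as follows. The dictionary layout (A-term mapped to B-terms with association scores), the function name, and the choice to sum scores across the words of a query are all assumptions made for illustration; they do not describe the actual web interface.

```python
from collections import defaultdict


def rank_b_terms(query, dictionary, top_n=5):
    """Return the top_n controlled vocabulary terms for a plain-text query.

    `dictionary` maps each A-term to a {B-term: score} mapping (assumed).
    Scores for the same B-term are summed across query words, and ties
    are broken alphabetically.
    """
    totals = defaultdict(float)
    for word in query.lower().split():
        for b_term, score in dictionary.get(word, {}).items():
            totals[b_term] += score
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


sample = {
    "gene": {"Genetics": 12.5, "Botany": 1.0},
    "maize": {"Botany": 8.0, "Agriculture": 3.0},
}
for b_term, score in rank_b_terms("maize gene", sample):
    print(b_term, score)
```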

Association dictionaries can be searched through a web-browser forms interface with ordinary language queries. The searcher is presented with a ranked list of the most likely controlled vocabulary terms to retrieve information related to a given query. Within certain limits, those controlled vocabulary terms can be used to search the appropriate database. Searching the MELVYL databases other than the library catalog is limited to users associated with the UC system and is not open to the general public, but we provide access to other information resources, such as the U.S. Patents database, to anyone.

### Future work

We define an intelligent EVM agent as software that supports people's use of EVMs by prompting users for responses and by executing the several decisions and procedures involved.

Many parts of this dictionary building process are not yet automated, but could be. The parts that are automated require more integration so that this process may proceed without intervention except when user input is necessary or desired. The user should have control over as much of the process as he or she desires. This suggests that there should be levels of control. Some users will want more control and others less.

We prefer to make this process accessible through a web interface so that it will not be limited by platform idiosyncrasies. The user will come to this application with a wish to know more about a certain topic, the language of which is unfamiliar. We provide methods of specifying a topic area. In our initial design, this will be a list of topic areas selected from SCI and SSCI. The agent would go off and determine whether or not a data set of reasonable size can be gathered for a topic. There are other ways in which this could also proceed. Say, for instance, I want to know what the general trends in computational linguistics are these days. Searching the INSPEC database on the thesaurus term "computational linguistics" retrieves 2,740 records from the 1993-1998 database. The 1990-1992 database contains 1,025 records. The 1985-1989 database has 1,145 records. The 1980-1984 database returns 320 records and the 1969-1979 database finds us 313 records. Putting these five sets of records together (5,543 records in all) would produce an adequate data set for building an association dictionary on the topic of "computational linguistics". This would not be difficult for the agents managed by an EVM to handle.

### References

[Kim 1998] Kim, Y. (1998). Sensitivity of entry vocabulary modules to subdomains. Technical report.

[Norgard 1997] Norgard, B. and Y. Kim (1997). Domains and sublanguages. Technical report.

[Plaunt forthcoming] Plaunt, C. and B. A. Norgard (forthcoming). An association based method for automatic indexing with a controlled vocabulary. Journal of the American Society for Information Science.