Search Support for Unfamiliar Metadata Vocabularies

 

DARPA contract. Proposal as submitted.

The University of California, Berkeley, School of Information Management & Systems has been awarded a $954,180 research contract. See News release and Project web homepage.
Principal Investigator: Michael Buckland,   buckland@sims.berkeley.edu
Co-P.I.s: Fred Gey   gey@ucdata.berkeley.edu  and Ray Larson,   ray@sims.berkeley.edu
The contract was awarded by the Defense Advanced Research Projects Agency (DARPA Contract N66001-97-C-8541; AO# F477; $954,184.) and will run from July 1997 through June 2000.

Proposal to DARPA BAA 97-09: Excerpts relating to work to be done.

SECTION II. Detailed Proposal Information
    II. A. INNOVATIVE CLAIMS FOR THE PROPOSED RESEARCH.
    II. B. DELIVERABLES.
    II. C. STATEMENT OF WORK.
    II. D. RESULTS.
    II. F. COST, SCHEDULE, MILESTONES.
    II. G. TECHNICAL RATIONALE, TECHNICAL APPROACH AND ... PLAN
    II. I. KEY PERSONNEL, QUALIFICATIONS, EFFORT
    II. J. PROPOSER'S PREVIOUS ACCOMPLISHMENTS
    SECTION III. ADDITIONAL INFORMATION. Bibliography.

SECTION II. Detailed Proposal Information

II. A. INNOVATIVE CLAIMS FOR THE PROPOSED RESEARCH.

An analyst needs to select information from an increasing population of heterogeneous repositories with quite diverse metadata vocabularies (categorization, classification, indexing semantics). Necessarily, the number -- and percentage -- of metadata vocabularies that are unfamiliar to any given analyst is increasing steeply. When analysts encounter an unfamiliar metadata vocabulary, how are they to know which codes or terms will lead them to what they want? Problems of intellectual access arise at three levels: (1) selecting which repositories to search; (2) adaptively coping with alien metadata vocabulary (assigned indexing terms or classification/concept codes, automatically-derived categorization) in each remote repository; and (3) aggregating data from heterogeneous repositories.

The proposed research is focussed on #2, enabling an analyst to use unfamiliar metadata effectively and efficiently. This problem has two levels:

(i) Selecting suitable vocabulary for formulating a search; and

(ii) Understanding relationships between terms/codes within the metadata system of the remote repository, e.g. navigating within a thesaurus. Metadata vocabularies ordinarily contain complex internal relationships between related terms.

Even when metadata schemes have been developed and implemented in the rapidly expanding environment of network accessible, but heterogeneous repositories, each analyst is faced by an explosive increase in unfamiliar descriptive metadata systems. One cannot expect analysts to have expert familiarity with more than a few -- a diminishing percentage! -- of available repositories.

For decades a massive investment has been made, and continues to be made, in the development and application of indexing and categorizing schemes: "manual", automatic, and computer-aided. The special innovative claim of the proposed research is that it is revolutionary because it differs from, complements, and adds value to all that massive investment by DARPA and others in categorization and indexing. It will demonstrate technology that enables analysts to use their own vocabulary to enter and make expert use of unfamiliar metadata vocabularies. We call this technology "Entry Vocabulary Modules".

The scientific and technical merit of the proposed research is that (i) it goes substantially beyond the existing state-of-the-art, which ordinarily depends on the expensive human crafting of links within and between vocabularies (e.g. UMLS); (ii) it is based on searching of fragments existing within the metadata and its database; (iii) it uses advanced probabilistic techniques; (iv) it allows the analyst to use familiar search terms from a familiar vocabulary; (v) it is designed for use in a networked environment; and (vi) it allows for rapid deployment with unfamiliar metadata schemes without requiring the consent, knowledge or participation of the remote database administrator.

II. B. DELIVERABLES.

Consider a simple, straightforward search. An analyst seeks material on coastal pollution and submits a search to a very large library catalog to identify books and to the principal health sciences bibliography, MEDLINE, to find journal articles. A subject search on the topical phrase "coastal pollution" retrieves nothing on the MELVYL library catalog, the largest in the US, since in the Library of Congress Subject Headings (LCSH) it is not established either as a heading or as a cross-reference. Nor does a subject keyword search on the terms "coastal" and "pollution" retrieve anything. Relevant material is in there, but it is categorized under "Marine pollution", "Water quality", "Coastal zone management", "Estuarine pollution" and other headings.

The same coastal pollution query addressed to the MEDLINE file also yields no retrievals, either as a subject phrase or as a subject keyword search. Again there is, in fact, relevant material, but it is under "Seawater", "Water pollutants", "Water microbiology", "Bathing beaches" and so on.

On inspection, the individual headings can be seen to be more or less plausible, but who would have had the imagination to have thought of more than one or two of these? And note that the vocabulary actually used in these two systems is quite different. There is not even any overlap in the metadata vocabulary used in the two systems for the same topic in this example, so even a familiarity with the actual vocabulary used in one of them would not be much help when searching the other.

LC Subject Headings and the Medical Subject Headings (MeSH) of MEDLINE are relatively familiar and straightforward examples of metadata vocabulary. More specialized repositories are much more opaque. Suppose an analyst wanted data on exports and imports of automobiles and looked in the Census Bureau's U.S. Imports and Exports numeric datasets on CD-ROM. A search on the term "automobile" will find nothing. A search on "cars" will lead to "Railway or Tramway Stock". The data are there under "Passenger Motor Vehicles, Spark Ignition Engine". And anyone who doubts the assertion that metadata vocabularies tend to be stylized and non-obvious "dialects" should try some searches using the US Patent Office classification scheme.

The problem is doubly exacerbated when dealing with foreign language materials. The ordinary problems of translation are compounded by specialized domain usage. A recent study of military aerial reconnaissance using photography from tethered balloons illustrates this well. The investigation would have been more effective and less costly if the analyst had known to search for tethered balloons under the term "aerostat". The search could have been even more cost-effective if the analyst had known, when searching the German technical literature, that in the First World War, when this technology was in heaviest use, the German term used for a tethered observation balloon was "Drachen", even though in standard German and in dictionaries "Der Drachen" means a kite, not a balloon.

Experienced analysts, like experienced reference librarians, know that familiarity with the source (database, reference work) is critical for effective, reliable searching -- each has its own quirks and personality -- and that this familiarity comes from experience, from frequent use. Furthermore, the more that has been invested in the enhancement of the source, i.e. the richer the metadata, the more important this personal experience and familiarity becomes.

But the rapid increase in network accessible repositories increases the number -- and proportion -- that are unfamiliar. Buckland (1992a) has argued that the most cost-effective single investment for improving effectiveness in the searching of repositories would be technology to assist the searcher in coping with unfamiliar metadata vocabularies. Federal funding from the U.S. Department of Education (HEA II) and adaptation of the "classification clustering" technique developed by Professor Ray Larson have enabled the development of three different "proof-of-concept" prototypes and a clear path for the development and deployment of technology to support more cost-effective use of unfamiliar metadata schemes. We call this technology "Entry Vocabulary Modules" because they provide the analyst with intellectual and semantic "entry" into the range of codes or terms ("vocabularies") of diverse metadata schemes.

Meanwhile, the U.S. Department of Education funding source has been abolished. We therefore request DARPA funding to support the research and development necessary for the following deliverables:

Task A: A set of prototype Entry Vocabulary Modules for a challenging range of examples, with demonstrated deployment in three modes: as an amenity provided by a repository; as a network-accessible amenity; and embedded in a solo or collaborative work environment.

Task B: An intelligent agent capable of unintrusively creating Entry Vocabulary Modules for remote repositories with no more than the access necessary for conducting a search.

Task C: A Report on the sensitivity of Entry Vocabulary Modules to variations between subdomains within repositories. Should they be designed for subdomains as well as for entire repositories?

Task D: Addition of natural language processing techniques to enhance the statistical term co-occurrence approach of the initial "proof-of-concept" prototypes already developed.

Task E: Recommendations for the improvement of "Codebook" metadata documentation for numeric databases.

The software developed will be made freely available, subject to acknowledgment. It is our experience, however, that in this area technology transfer comes from the demonstration of convincing functionality, subsequently implemented in system developers' own code, rather than from direct transfer of the software itself.



II. C. STATEMENT OF WORK.

The work is designed to develop Entry Vocabulary Module technology to provide a cost-effective remedy for the difficulties that arise for analysts when searching network accessible repositories given the following combination:

1. A massive, increasing investment world-wide in making repositories accessible over networks;

2. A massive, increasing investment world-wide in providing indexing, categorizing, and other metadata; and

3. Increasing difficulty for analysts when searching because the number of unfamiliar repositories is increasing (1. above) and the amount and proportion of unfamiliar metadata vocabularies is also increasing. Decreasing search effectiveness is the predictable result.

Previous Federally-funded research and development by the Proposers has resulted in three "proof-of-concept" prototypes of new technology for providing detailed support for analysts when searching unfamiliar databases. These are openly available on the World Wide Web for examination.

The Work to be performed includes the development of a suite of prototype Entry Vocabulary Modules in a diverse selection of challenging metadata environments:

a. From uncontrolled English language to a structured thesaurus.

b. From uncontrolled English language to a complex concept categorization.

c. From uncontrolled English language to a library classification scheme.

d. From uncontrolled English language to the U.S. Patent Office Classification scheme.

e. From an uncontrolled foreign language to a U.S. metadata scheme as an example of translingual access.

f. Cross-classification between two dissimilar metadata schemes.

g. From uncontrolled English language to a Federal numerical database of socio-economic data.

h. Between different natural language vocabularies, e.g. English, French and German.

Implementation contexts: "Entry vocabulary" functionality has very extensive implementation potential and can be usefully positioned in several ways: as an amenity on the analyst's workstation; as an amenity on a repository server; and as an amenity on a work-centered computing environment.

An Entry Vocabulary Module can also be used for computer-assisted categorization.

The work will include four Associated Research Tasks:

B. An Intelligent EVM Agent to develop Entry Vocabulary Modules unintrusively.

C. An analysis of how sensitive metadata vocabulary is to the specific subdomains of analysts' interest.

D. The incorporation of Natural Language techniques to augment the statistical term frequency technique currently used.

E. Recommendations for improved metadata "codebooks" for numeric databases.

The work will be disseminated through:

- publicly available, no-cost, web-based demonstration of prototypes;

- descriptive reports at the conferences and in the technical journals most likely to reach prospective implementers and adopters; and

- articles on the theoretical and technical issues that arise.

The software developed will be made freely available, subject to acknowledgment. It is our experience that in this area technology transfer comes from the demonstration of convincing functionality, subsequently implemented in system developers' own code, rather than from direct transfer of the software itself.

There are no specific contractor requirements beyond the financial support in the Budget.



II. D. RESULTS.

The proposed research will develop, document, and make functioning operational prototype Entry Vocabulary Modules freely available through open web-based access. The prototypes will enable analysts to make expert, efficient use of unfamiliar metadata schemes.

The deliverables can be viewed in terms of a matrix of Metadata Access Challenges and Implementation Contexts:

A. Metadata Vocabulary Challenges. A range of examples designed to test and demonstrate the versatility of Entry Vocabulary Module technology against differing challenges:

a. From uncontrolled English language to a structured thesaurus: A preliminary prototype providing access to a subdomain of the INSPEC thesaurus can be viewed at:

URL http://briet.sims.berkeley.edu/oasis/inswater.html

b. From uncontrolled English language to a complex concept categorization: A preliminary prototype providing access to a subdomain of BIOSIS (Biological Abstracts) can be viewed at:

http://briet.sims.berkeley.edu/oasis/biosis.html See the example of the result of a search for BIOSIS concept codes relating to "water pollution" at the end of Section III.

c. From uncontrolled English language to a library classification scheme: Co-Principal Investigator Ray Larson has implemented an Entry Vocabulary Module for a subdomain of the Library of Congress Classification that includes Information Management, Bibliography, and Library and Information Science. Broader coverage has recently been demonstrated in the campus' Physical Sciences libraries, including Astronomy, Geology, Math and Physics.

d. From uncontrolled English language to the U.S. Patent Office Classification scheme, which is a very good example of a metadata vocabulary that is important for technology transfer, but difficult for non-specialists: A limited, preliminary prototype can be accessed at

URL http://briet.sims.berkeley.edu/oasis/bigpatents.html See the example of the result of a search for Patent Office Classification codes relating to "needle" at the end of Section III.

e. From an uncontrolled foreign language to a U.S. metadata scheme as an example of translingual access: The tentative initial choice is from French to the Library of Congress Classification.

f. Cross-classification between two dissimilar metadata schemes: Between the U.S. Patent Office Classification and the International Patent Classification.

g. From uncontrolled English language to a Federal numerical database of socio-economic data using the code-book documentation as metadata.

h. Between different natural languages, as related in their common application in a specific domain: Tentatively trilingually between English, French and German.

Resources permitting, additional research and demonstration prototypes may also be undertaken in relation to GIS metadata, sound data, and/or image databases.

Implementation contexts: "Entry vocabulary" functionality has very extensive implementation potential and can be usefully positioned in several ways. We propose the following:

a. As an amenity on the analyst's client to provide assistance when accessing an unfamiliar remote repository;

b. As an amenity on (or invoked from) a repository server to aid remote searchers unfamiliar with the local metadata scheme of that repository;

c. As an amenity on a work-centered computing environment. Specifically the Berkeley NSF / DARPA / NASA Digital Libraries Initiative project has developed a Multi-Valent Document architecture with "Magic Lens" (and similar) features to allow additional operations on selected fragments of text. The proposed research would add an "extended search" to the Multi-Valent Document architecture so that any fragment of any document could become the basis for a search query and a point of departure for a search of related material in remote databases.

Note that to the extent that an Entry Vocabulary Module can provide a ranked list of probably relevant metadata terms for any fragment of text, it can also be used for computer-assisted categorization. Larson (1992b) has demonstrated that our proposed methodology is rather reliable in proposing Library of Congress Classification numbers for book titles. For example, we expect that an Entry Vocabulary Module for the Patent Office classification would be useful support for the assignment of classification numbers as well as for patent searching.

Associated Research Tasks. Within this matrix of deliverables, we are especially interested in the following research aspects:

B. Intelligent EVM Agent. To reduce the amount of human expertise involved in developing Entry Vocabulary Modules, an Intelligent Agent will be developed. The Agent, once pointed at a target repository, will obtain a retrieved set sufficient to serve as the training set and will then create an Entry Vocabulary Module and add it to those available in the assigned work environment.

C. Sub-domain sensitivity. Ordinarily metadata vocabularies are studied as a whole, but in practice analysts are rarely interested in the whole. They are usually interested in some specific subdomain reflecting their particular interest. Hence we have concentrated on Entry Vocabulary Modules of topical, work-related subdomains. For example, we have developed two subdomain Entry Vocabulary Modules in the INSPEC thesaurus, for "Information Science" and for "Water." The sensitivity of search results using subdomain Entry Vocabulary Modules needs to be tested.

D. Natural Language techniques. Development of Entry Vocabulary Modules has so far been derived from term frequency occurrences. It is intended to experiment with the effects of also drawing on natural language processing techniques and, where applicable, the syntactical relationships sometimes found in classification and categorization schemes.

E. Numeric database "codebook" recommendations. The experience derived from the research on computer-aided access to numeric databases will yield recommendations for good metadata standards and practice for Federal and other numeric databases.

The research would be disseminated through:

- publicly available, no-cost, web-based demonstration of prototypes;

- descriptive reports at the conferences and in the technical journals most likely to reach prospective implementers and adopters; and

- articles on the theoretical and technical issues that arise.

The software developed will be made freely available, subject to acknowledgment. It is our experience that in this area technology transfer comes from the demonstration of convincing functionality, subsequently implemented in system developers' own code, rather than from direct transfer of the software itself.

II. F. COST, SCHEDULE, MILESTONES.

The research program is designed as a whole, with the deliverables described spaced out evenly over a three-year program. A period of July 1, 1997 through June 30, 2000 is assumed.

Task A: Development of Entry Vocabulary Modules.
A.1. English to thesaurus (INSPEC)
Year 1: Prototype. Year 2: All environments.

A.2. English to concept categories (BIOSIS).
Year 1: Prototype. Year 2: All environments.

A.3. English to library classification (LCC).
Year 1: Prototype. Year 2: All environments.

A.4. English to non-library classification (Pat Off.).
Year 1: Prototype. Year 2: All environments.

A.5. French to library classification (LCC).
Year 2: Prototype. Year 3: All environments.

A.6. English to numeric database (US Foreign Trade database).
Year 2: Prototype. Year 3: All environments.

A.7. Metadata to metadata (US Patent to/fro Internat. Pat.).
Year 2: Prototype. Year 3: All environments.

A.8. Multilingual within domain (English, French, German).
Year 2: Prototype. Year 3: All environments.

B. Intelligent agent to "crack" alien metadata vocabularies.
Year 1: Design. Year 2: Demonstration. Year 3: Prototype.

C. Sensitivity of search results to subdomain.
Year 1: Design. Year 2: Analysis. Year 3: Results.

D. Addition of natural language processing techniques.
Year 1: Design. Year 2: Analysis. Year 3: Results.

E. Recommendations and Guidelines for metadata code-book.
Year 2: Draft. Year 3: Results.



II. G. TECHNICAL RATIONALE, TECHNICAL APPROACH AND CONSTRUCTIVE PLAN FOR ACCOMPLISHMENT.

Technical Rationale.

The rationale is two-fold:

i. Many of the most information-rich repositories, especially bibliographical and textual databases, have some form of classification, coding, or indexing. But these are more or less stylized and cannot be used effectively or efficiently except by searchers who are familiar with them. The explosive increase in heterogeneity assures that intimate familiarity is decreasing. Entry Vocabulary Modules provide the kind of expert prompting that an expert search intermediary would. Unfortunately, a sufficient population of human expert search intermediaries is unaffordable.

ii. An increasing number of techniques are being developed for automatic categorization of repositories for which human indexing is unavailable. But as Melvil Dewey found when he developed his famous Decimal Classification in 1876, one needs a natural language index to a categorization scheme for it to be conveniently usable. The purpose of an Entry Vocabulary Module is to provide guidance in the transition from familiar vocabulary, whether ordinary English or from a familiar categorization scheme, to the unfamiliar, providing the kind of expert prompting that an expert search intermediary would.

In both cases, the Entry Vocabulary Module provides a value-added enhancement to the return on the original investment in generating metadata.

Technical Approach

The specific technique to create a ranked list of probably relevant terms in the target metadata vocabulary from any given searcher input is the "Classification clustering" technique developed for this purpose by Ray Larson, using a probabilistic interpretation of vector-space retrieval, extended by William Cooper's Staged Logistic Regression method. (Larson, R. R. 1991. "Classification clustering, probabilistic retrieval, and the online catalog." Library Quarterly v. 61, no. 2 (April 1991): 133-173.)
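
As a rough illustration of the idea behind classification clustering, the following Python sketch pools the text of training records under each metadata term they were assigned and then ranks the terms for a searcher's phrase by cosine similarity. It is a simplified stand-in under stated assumptions: plain TF-IDF weighting is used in place of the probabilistic weighting and Staged Logistic Regression described in Larson (1991), and the training titles, the "Railroads" heading, and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_term_vectors(records):
    """Pool the words of every record assigned a metadata term into one
    pseudo-document per term, then weight the words by TF-IDF."""
    pooled = defaultdict(Counter)
    for text, terms in records:
        for term in terms:
            pooled[term].update(tokenize(text))
    n = len(pooled)
    # Document frequency: in how many pooled pseudo-documents a word occurs.
    df = Counter(w for counts in pooled.values() for w in counts)
    return {
        term: {w: tf * math.log(1 + n / df[w]) for w, tf in counts.items()}
        for term, counts in pooled.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_terms(query, vectors, top=5):
    """Return the metadata terms most similar to the searcher's own words."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, vec), term) for term, vec in vectors.items()]
    scored = [(s, t) for s, t in scored if s > 0]
    return [t for s, t in sorted(scored, key=lambda x: (-x[0], x[1]))][:top]


# Hypothetical training records: (title text, assigned subject headings).
RECORDS = [
    ("Oil spills and coastal pollution along the Pacific shore", ["Marine pollution"]),
    ("Coastal pollution from sewage outfalls", ["Marine pollution", "Water quality"]),
    ("Planning and zoning of the coastal shoreline", ["Coastal zone management"]),
    ("Monitoring drinking water standards in reservoirs", ["Water quality"]),
    ("Railway rolling stock maintenance", ["Railroads"]),
]

VECTORS = build_term_vectors(RECORDS)
print(rank_terms("coastal pollution", VECTORS))
```

With these toy records the phrase "coastal pollution" ranks "Marine pollution" first, followed by "Water quality" and "Coastal zone management", although none of those headings matches the phrase literally.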

Commentary on Numeric Databases.

An information system which utilizes a specialized vocabulary for classification and search purposes is the Census Bureau's U.S. Imports and Exports numeric datasets on CD-ROM. The following table shows the amount of U.S. automobile imports for the past several years:

                     General Imports           Imports for Consumption
Year      Quantity     Customs Value        Quantity     Customs Value

PASS MTR VEH, SPARK IGN ENG, NOT OV 1,000 CC
Unit of Quantity -- Number
1991       173,597      $783,208,626         173,097      $779,772,191
1992       166,951      $736,087,145         171,134      $738,847,548
1993       200,043      $904,605,255         204,215      $907,734,708
1994       178,562      $753,516,749         178,562      $753,516,749
1995       153,871      $593,016,125         153,871      $593,016,125

PASS MTR VEH,SPARK IGN ENG, >1000CC BUT =< 1500CC
Unit of Quantity -- Number
1991       438,192    $2,847,528,668         522,682    $2,906,408,570
1992       395,639    $2,830,742,368         483,106    $2,904,641,625
1993       370,294    $2,911,080,560         467,137    $3,007,590,997
1994       367,507    $3,078,543,610         445,939    $3,147,853,412
1995       351,433    $3,203,539,647         410,945    $3,264,222,883

Clearly the figures shown in this database have a critical effect on the U.S. balance of trade, and searching the data will be an important activity for those engaged in strategic assessment of the future of the United States economy. Yet if one does a commodity search using the word "automobiles" on the commonly used WWW database (http://govinfo.kerr.orst.edu/impexp.html) one finds no results. Moreover, if one does the search using the word "cars," one obtains the misleading result "Railway or Tramway Stock, etc." A searcher interested in this database must be aware that the general classification heading for this commodity group is "Tractors, Vehicles for Pass, Goods, Special Purposes" and the particular classification for cars is "Passenger Motor Vehicles, Spark Ignition Engine" as above.

Another example of strategic importance concerns military armament exports. In this area a search in the exports database using the word "Rockets" yields the misleading commodity category "Bearings, Transmission, Gaskets, Misc". Restricting the search to the singular form "rocket" yields an additional three categories:

  • Photographic or Cinematographic Goods
  • Engines, Parts, Etc
  • Arms and Ammunition, Parts and Accessories Thereof

the last of which specifically concerns military weapons exports from the United States. Indeed the specific term "rocket" is found only in the following category:

MISSILE & ROCKET LAUNCHERS AND SIMILAR PROJECTORS (9301009050)
Unit of Quantity-Number
U.S. Exports Of Merchandise For The Year-to-Date Through November, 1996
                                   Domestic Exports    Foreign Exports
Exports by St. Unit of Quantity              25,904                838
F.A.S. Export Value                    $205,701,914        $24,718,267

and completely misses a larger export category:

GUIDED MISSILES (9306900020)
Unit of Quantity-Number
U.S. Exports Of Merchandise For The Year Through November, 1996
                                   Domestic Exports    Foreign Exports
Exports by St. Unit of Quantity               2,811                  4
F.A.S. Export Value                    $447,601,613           $375,000

while the general heading category for this section is: BOMBS, GRENADES, ETC (9306).

Clearly researchers who wish to mine this database for information pertinent to the strategic economic position of the United States have a need for a tool which will bridge the gap between common terminology and the highly specialized classification scheme which has evolved for categorizing these data.

Geographic identification and aggregation issues.

The export database also includes information on point-to-point shipments of various commodities. For example, a certain quantity of guided missiles has been shipped to the State of Bahrain, east of Saudi Arabia:

GUIDED MISSILES
Unit of Quantity -- Number
Year    Quantity         Value

TO: Bahrain THRU: WILM NC
1991          15      $403,305
1992         712    $6,188,016
1993           8      $527,190
1994           1      $246,440
1995          39    $1,092,686

TO: Bahrain THRU: SAN FRN
1991          24    $5,143,786
1992           0             0
1993           0             0
1994           0             0
1995           0             0

If one wishes to obtain aggregate shipments for the "Middle East", it is unlikely that the inexperienced searcher would be able to identify Bahrain as a component Middle East country which would be a significant export customer of the United States. In order to search the geographic dimension of these data effectively, bridges must be constructed between a common language geographic specification and the particular geographic name which resides in the geographic metadata in the database.
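
A minimal sketch of such a geographic bridge, in Python, is given here for illustration. It assumes a hypothetical gazetteer that maps a common-language region name to the country names actually used in the database, and it aggregates the Bahrain shipment figures quoted above by year. The gazetteer, the function name, and the record layout are assumptions for this example, not part of the export database.

```python
from collections import defaultdict

# Hypothetical gazetteer: common-language region -> country names as they
# appear in the database. Only Bahrain is listed because it is the only
# destination quoted in the text; a real gazetteer would list many more.
REGIONS = {"Middle East": {"Bahrain"}}

# (destination, port, year, quantity, value in dollars), taken from the
# guided-missile shipment figures quoted in the text.
SHIPMENTS = [
    ("Bahrain", "WILM NC", 1991, 15, 403_305),
    ("Bahrain", "WILM NC", 1992, 712, 6_188_016),
    ("Bahrain", "WILM NC", 1993, 8, 527_190),
    ("Bahrain", "WILM NC", 1994, 1, 246_440),
    ("Bahrain", "WILM NC", 1995, 39, 1_092_686),
    ("Bahrain", "SAN FRN", 1991, 24, 5_143_786),
    ("Bahrain", "SAN FRN", 1992, 0, 0),
    ("Bahrain", "SAN FRN", 1993, 0, 0),
    ("Bahrain", "SAN FRN", 1994, 0, 0),
    ("Bahrain", "SAN FRN", 1995, 0, 0),
]


def aggregate_by_region(region, shipments, regions=REGIONS):
    """Sum quantity and value per year over every destination in the region."""
    members = regions.get(region, set())
    totals = defaultdict(lambda: [0, 0])
    for destination, _port, year, quantity, value in shipments:
        if destination in members:
            totals[year][0] += quantity
            totals[year][1] += value
    return {year: tuple(pair) for year, pair in sorted(totals.items())}


for year, (quantity, value) in aggregate_by_region("Middle East", SHIPMENTS).items():
    print(year, quantity, f"${value:,}")
```

The point of the design is that the searcher supplies only "Middle East"; the mapping to "Bahrain" and the summing across ports of exit happen behind the scenes.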

Clearly researchers who wish to mine this database for information pertinent to the strategic economic position of the United States have a need for tools which will bridge the gap between common terminology and the highly specialized classification scheme which has evolved for categorizing these data.

The International Harmonized Commodity Classification Code scheme consists of a hierarchical 10-digit code with associated description. As of December 1995 there were 16,746 possible codes which might be associated with commodities being either exported or imported. The Census Bureau has released a CD-ROM "U.S. Exports Commodity Classification" which has 21,635 lines of description of the codes. This description contains some additional common English words which annotate the hierarchical coding scheme, but not nearly enough training data to accurately bridge the gap between the two forms. We also propose to utilize the ARPA sponsored WordNet lexical database (http://www.ito.darpa.mil/Summaries95/B370--Princeton.html, http://www.cogsci.princeton.edu/~wn/w3wn.old.html) which has entries such as the following:

5 senses of car
  ---
Sense 1: car, auto, automobile, machine, motorcar -- 4-wheeled; usually propelled by an internal combustion engine; "he needed a car to get to work".
=> motor vehicle, automotive vehicle -- a self-propelled wheeled vehicle that does not run on rails.

The WordNet database will be utilized as the starting point for probabilistic similarity searches for common word synonyms for commodity classification descriptions. We will consider WordNet sense descriptions as documents which provide evidence for the commodity concept in accordance with the methodologies described in (Larson, 1991).
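
To illustrate the idea of treating a sense description as a document, the Python sketch below scores commodity descriptions by the share of their words that also occur in the sense text for the searcher's word. It is an assumption-laden simplification: a small hand-copied dictionary stands in for WordNet (no WordNet library is called), plain word overlap replaces the probabilistic similarity of Larson (1991), and the plural-stripping rule and all names are hypothetical.

```python
import re

# Stand-in for WordNet, hand-copied from the entry quoted above:
# (synonym set, gloss plus hypernym text). Not a real WordNet API.
SYNSETS = [
    (
        {"car", "auto", "automobile", "machine", "motorcar"},
        "4-wheeled; usually propelled by an internal combustion engine. "
        "motor vehicle, automotive vehicle -- a self-propelled wheeled "
        "vehicle that does not run on rails",
    ),
]

# Commodity descriptions quoted elsewhere in the text.
COMMODITIES = [
    "Passenger Motor Vehicles, Spark Ignition Engine",
    "Railway or Tramway Stock",
    "Photographic or Cinematographic Goods",
]


def word_set(text):
    """Lowercase words with a crude plural strip (illustrative only)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def sense_document(word):
    """Concatenate synonyms and gloss text of every synset containing word."""
    parts = [word]
    for synonyms, gloss in SYNSETS:
        if word in synonyms:
            parts.extend(sorted(synonyms))
            parts.append(gloss)
    return " ".join(parts)


def rank_commodities(word, commodities=COMMODITIES):
    """Rank commodity descriptions by word overlap with the sense document."""
    evidence = word_set(sense_document(word))
    scored = []
    for description in commodities:
        words = word_set(description)
        scored.append((len(evidence & words) / len(words), description))
    return [d for score, d in sorted(scored, key=lambda x: -x[0]) if score > 0]


print(rank_commodities("automobile"))
```

In this toy setting the word "automobile", which matches no commodity description literally, reaches "Passenger Motor Vehicles, Spark Ignition Engine" through the words "motor", "vehicle" and "engine" in the sense text.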

Relationship to the Multi-Valent Document architecture.

The Multi-Valent Document architecture is a noteworthy product of the Berkeley Digital Libraries Initiative project. In it, each document is composed of multiple "layers", each with a set of "behaviors" that implement particular functions using one or more layers. Typically the layers include the image of a scanned document page, an underlying layer of full text in ASCII based on OCR conversion of the page image, and a layer that provides the positional information relating the image to the underlying text. Typical behaviors include such things as the ability to search for words in the ASCII text and highlight matching words on the page image.

Summary

The proposed research would add an "external search" behavior to the Multi-Valent Document architecture so that when a user feels the need for additional information (such as citations to supporting or relevant documents or the documents themselves) the user simply selects the text and invokes the "external search" behavior to seek out and retrieve such information. The external search behavior might query the current document, a local database and remote repositories to find such documents or data, and then retrieve the document itself, or an annotation (such as citation and abstract) for viewing by the user.

The external search behavior would use the appropriate Entry Vocabulary Modules to construct a search that would then be processed using the remote search features of the Cheshire system (implemented using the Z39.50 information retrieval protocol) as well as external Web search engines.

Constructive Plan

The development of Entry Vocabulary Modules has been in progress as part of the OASIS research program developing prototypes of enhancements to online library catalogs and other databases (Buckland, Butler, Norgard & Plaunt 1993). It is now proposed to migrate this functionality also to an online public access bibliographic searcher, CHESHIRE, a "next generation" library catalog being made available in the Berkeley campus science libraries; as part of the Berkeley Digital Library project; and at (or invocable from) a repository server. In addition an implementation as computer-aided indexing will be undertaken at a site to be determined. These varied deployments will provide the range of implementations described in B.2 above.



II. I. KEY PERSONNEL, QUALIFICATIONS, EFFORT.

The Proposed Principal Investigator is Professor Michael K. Buckland. The Co-Principal Investigators are Dr. Fredric Gey and Professor Ray R. Larson.

The Proposed Principal Investigator, Michael K. Buckland, is Professor in the School of Information Management and Systems at the University of California and has published extensively on the theory, practice, and historical development of Information Management. As Assistant Vice President for Library Plans and Policies in the University of California system-wide administration, 1983-87, he was responsible for the development of the MELVYL system and a pioneering intercampus packet-switched network. He was formerly Dean of the School of Library and Information Studies at Berkeley and is currently President-Elect of the American Society for Information Science, an association of 4,000 professionals, founded in 1937 as the American Documentation Institute and dedicated to improving access to recorded information.

Professor Buckland has extensive experience as a professional librarian, and in library research and library administration in the U.K. and in the U.S. Continuously for six years Professor Buckland has team-taught with Dr Clifford Lynch a seminar on "Extended Retrieval" addressing problems in retrieval from networked heterogeneous repositories.

Co-Principal Investigator Fredric Gey has a Ph.D. in Information Science from the University of California, Berkeley (1993), focussed on probabilistic algorithms for text and document retrieval. He is Data Archivist and Assistant Director at the UC Data Archive & Technical Assistance, an organized research unit with $2.5 million annual extramural funding, a staff of ten professionals, thirty-five graduate student research assistants, and five undergraduate computer science student assistants. The unit serves as the data library for social science and health statistics numeric databases for the Berkeley campus. He manages a collection of 2,000+ datasets on 4,000 computer tapes on an IBM-3090 under VM/CMS, with a total archive size of 800 gigabytes.

Dr. Gey has twenty years' experience in technical management as a project leader and/or principal investigator for multi-year, multi-million dollar government software research and development projects. He has thirty years' experience working with computers, from mainframes to micros.

He is General Chair designate for SIGIR99, the 22nd Annual Conference on Research and Development in Information Retrieval, Summer 1999. He has been a participant in the ANSI Committee X3H4 for Information Resource Dictionary Standards (IRDS).

Ray R. Larson is Associate Professor in the School of Information Management and Systems, University of California at Berkeley. His Ph.D. dissertation in Library and Information Studies was on Workload characteristics and computer system utilization in online catalogs (1986). Prof. Larson has been engaged in the design and development of software for information retrieval systems since 1978. He received the American Society for Information Science Journal Best Paper Award in 1993 and has been Chair of the ASIS special interest group on Human-Computer Interaction. He is currently Associate Editor of ACM Transactions on Information Systems.

This School is fortunate in having doctoral students who are exceptionally well-prepared for this kind of research. One of those who would be engaged in the proposed research is Barbara A. Norgard, who has a background in information retrieval and in linguistics, six years of experience in developing exactly this kind of software, including the proof-of-concept prototypes noted in this proposal, and several papers in refereed publications.



II. J. PROPOSER'S PREVIOUS ACCOMPLISHMENTS

The proposed research builds directly on prior achievements in the development and demonstration of innovative enhancements to online retrieval systems. A significant characteristic of our research approach, reflected in our past work and publications, is that it brings an approach and emphasis that differs from and complements both operational repository services and the techniques and technical infrastructure being developed by Computer Scientists. This proposal comes from the School of Information Management and Systems, which is dedicated to exploring the human aspects of access to operational information services as well as the technical and technological aspects. The Research team has substantial personal experience in the provision of operational information services; in the development and operation of heavily-used production online services for demanding users; and in teaching the theory and practice of indexing, cataloging, and classification to information professionals. The interest is in "bridging technology" that enables information storage and retrieval theory, innovative software approaches, and managers of operational production systems to be brought together.

Principal Investigator Michael Buckland and Co-Principal Investigator Ray Larson have both been Faculty Investigators in a series of DARPA-related projects:

- The SEQUOIA 2000 project, concerned with decentralized access to a large-scale environmental data repository;

- The Computer Science Technical Reports project; and

- The Berkeley NSF / ARPA / NASA Digital Libraries project. The proposed research will complement, extend, and be carefully coordinated with the work in the DLI project.

The proposed work draws heavily on the CHESHIRE prototype system developed by Professor Ray Larson, which included an Entry Vocabulary Module providing access from uncontrolled English language to a formal classification scheme: a segment of the Library of Congress Classification that includes Information Management, Bibliography, and Library and Information Science.

The proposed research extends the long-term OASIS research program led by Professor Buckland and funded 1990 - 1996 by the U.S. Dept of Education under the Higher Education Act Title II, a funding source that has now been discontinued.

Co-principal investigator Dr. Fredric Gey brings a special expertise to the project. He has over twenty years experience in working with economic and social science numeric databases and numerous publications in this area. His work as assistant director of the UC Data Archive led him to jointly develop the UC CD-ROM Information System (Merrill, Parker, Gey, Stuber 1995) whose software has been adopted by the Census Bureau as their primary access mechanism for Census data on the world wide web. In addition Dr Gey is an expert on probabilistic document retrieval and leads the UC Berkeley team participation in the annual ARPA-NIST sponsored TREC (Text REtrieval Conference) which demonstrates advanced text retrieval algorithms against the TIPSTER document collections of more than one million documents. Dr. Gey is the General Chair-designate of the ACM SIGIR99 Conference, the 22nd Annual Conference on Research and Development in Information Retrieval, to be held at UC Berkeley during the summer of 1999. Dr. Gey expects to apply probabilistic search techniques to bridge between ordinary search vocabulary and the specialized classification metadata of the U.S. Imports/Exports commodity datasets issued on CD-ROM by the Census Bureau.

Dr. Gey's previous accomplishments include a digital library of 135 gigabytes of Federal statistical information (primarily 1990 Census data) from 270 CD-ROM disks on 45 CD-ROM jukeboxes. This archive is arguably the largest single digital library on the internet.

He has also been Project Director of the UC Berkeley Text Retrieval Research Group, directing research in probabilistic text retrieval and multi-lingual (Chinese and Spanish) document retrieval as part of the TREC-5 conference of the National Institute of Standards and Technology. He was awarded a 3-year NSF grant for "Probabilistic Retrieval from Full-Text Document Collections using Logistic Regression," beginning August 15, 1996.

He directed the UC DATA development of Internet access to Federal statistical information on CD-ROM jukebox servers, and initiated a project for online browsing of statistical database codebooks (structured document technical documentation). Other projects include the development of a California Latina/Latino Demographic Data Book and the California Welfare Research Project (1992-1997), a multi-million dollar, 5-year longitudinal study combining county administrative welfare case files, and numerous other projects involving metadata schema and numeric databases.

Professor Larson's principal prior accomplishments include further development and testing of an experimental online catalog system (called CHESHIRE II) for experimental evaluation of information retrieval and hypertext techniques to improve online catalog search performance and usability. He is also engaged in experimental investigation of an automatic method of classification clustering for improving topical access in online catalog systems, and extension of automatic classification methods to automatic indexing of networked information resources like the World Wide Web.

Other recent work includes the development of sub-element access methods for database systems to support text retrieval and other specialized retrieval functions and the development of full-text indexing and retrieval methods for Berkeley's NSF/NASA/ARPA Digital Library Project. Past work includes being Faculty Investigator in charge of development of IR methods in the "electronic repository" for the Sequoia 2000 project, a large-scale storage and retrieval system to support researchers investigating global change, and being Co-Investigator on the NRI/ARPA Electronic Libraries project, developing a network-accessible electronic library of computer science technical reports in cooperation with Stanford, MIT, Cornell, and Carnegie-Mellon. The Berkeley PI was Robert Wilensky; other co-investigators were Michael Stonebraker, Michael Buckland, and Clifford Lynch.

He is now Principal Investigator of the Cheshire Demonstration and Evaluation Project, sponsored by a U.S. Dept. of Education (HEAIIA) grant. This project will deploy a next-generation online catalog system in a working library environment and evaluate its use and acceptance by local library patrons and remote network users.

The extensive prior experience of the Proposers in the kind of research proposed provides a credible basis for performance.

The proposed research extends the Proposers' contribution to the Berkeley Digital Libraries Initiative project. It extends long-term collaboration with the Berkeley campus UCData, a major numeric data archive managed by Dr Fred Gey, and with the University of California systemwide MELVYL system directed by Dr Clifford Lynch.

SECTION III. ADDITIONAL INFORMATION.

Bibliography of relevant technical papers and research notes.

Annotated items are examples of work that demonstrate the competence of the proposed Principal Investigators to undertake the work proposed.

Buckland, M. K. 1991a. Information and Information Systems. New York: Greenwood Press, 1991. (Also paperback ed., New York: Praeger, 1991; Chinese ed., 1994). Systematic introduction to retrieval-based information services and their social environment.

Buckland, M. K. 1991b. Information retrieval of more than text. Journal of the American Society for Information Science 42, no 8 (September 1991): 586-588. Historical background to the extension of information retrieval beyond textual documents.

Buckland, M. K. 1992a. Agenda for online catalog designers. Information Technology and Libraries 11, no. 2 (June 1992):157-163. On the strategic significance of providing support for searching unfamiliar metadata vocabularies.

Buckland, M. K. 1992b. Redesigning Library Services. Chicago: American Library Association, 1992. (Japanese ed., Tokyo: Keisoshobou, 1994. Korean ed. in preparation.) A best-selling explanation of the issues underlying the development of digital libraries.

Buckland, M. K. 1994a. From catalog to selecting aid. ALCTS Newsletter 5, no. 2 (1994): Insert A-D. (From catalog to gateway: Briefings from the CFFC, 2). On the transformation of library catalogs into network search support systems.

Buckland, M. K. 1994b. The potential offered by "Extended retrieval". In: Expanding Access to Science and Technology: The Role of Information Technologies. Proceedings of the Second International Symposium on the Frontiers of Science and Technology, Kyoto, 12-14 May 1992. Ed. by I. Wesley-Tanaskovic, J. Tocatlian, and K. H. Roberts. Tokyo: United Nations University, 1994. Pp. 133-143. Problems and solutions in searching across multiple, unfamiliar, heterogeneous repositories.

Buckland, M. K. 1995. Searching multiple digital libraries: A design analysis. Technical report. 1995. URL http://info.sims.berkeley.edu/research/oasis/multisrch.html Specifications for intelligent adaptive search of networked repositories.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1992. OASIS: A front-end for prototyping catalog enhancements. Library Hi Tech, Issue 40 (1992):7-22. Summary of prototyped information search aids.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1993a. OASIS: Prototyping graphical interfaces to networked information. In: American Society for Information Science. Proceedings of the 56th Annual Meeting, Columbus, October 24-28, 1993. Medford, NJ: Learned Information, 1993. Summary of OASIS research.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1993b. Prototyping enhanced online search capability. In: National Online Meeting, 14th, 1993. Proceedings. Ed. by M. Williams, 51-56. Medford, NJ: Learned Information, 1993. Summary of research.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1994. Union records and dossiers: Extended bibliographic information objects. In: Navigating the Networks. Proceedings of the ASIS Mid-year Meeting, Portland, 1994. Medford, NJ: Learned Information, 1994. Pp. 43-57. On the integration of information objects from heterogeneous repositories.

Buckland, M. K. & D. Florian. 1991. Expertise, task complexity, and the role of intelligent information systems. Journal of the American Society for Information Science 42, no. 9 (October 1991):635-643. An analysis of human-computer partnering in information selection.

Buckland, M. K. & F. Gey. 1994. The relationship between recall and precision. Journal of the American Society for Information Science 45, no. 1 (Jan. 1994):12-19. Clarification and theoretical explanation of the trade-off between Recall and Precision found in retrieval evaluation studies. Circumstances in which both can be improved.

Buckland, M. K., B. A. Norgard & C. Plaunt. 1993. Filing, filtering, and the first few found. Information Technology and Libraries 12, no. 3 (Sept 1993): 311-319. On the pervasiveness of ordering and ranking mechanisms in information storage and retrieval.

Buckland, M. K. & C. Plaunt. 1994. On the construction of selection systems. Library Hi Tech 48 (1994):15-28. Analysis of the anatomy of information management systems leads to a simple, recursive model with three components (collections, transformers, and partitioners) that can represent the functionality of complex operational filtering and retrieval systems. Design implications.

Evans, David A. & Chengxiang Zhai. 1996. Noun-phrase analysis in unrestricted text for information retrieval. Proceedings of the 34th Annual Meeting of the ACL. Santa Cruz, CA. pp. 17-24.

Gey, F. 1990. Accessing City-County Data Book via DBASE-III: Census CD-ROMs from the Ground Up. Presented to the 1990 Conference of the International Association of Social Science Information Service and Technology (IASSIST90), Poughkeepsie, New York, May 30 to June 2, 1990; published in IASSIST Quarterly, V. 14, No. 3/4, Fall/Winter 1990, pp. 35-47. Description of basic access mechanisms for Census Bureau data bases distributed on CD-ROM.

Gey, F. 1994. Inferring Probability of Relevance Using the Method of Logistic Regression. In: Proceedings of SIGIR94, the 17th annual ACM conference on Research and Development in Information Retrieval, Dublin, Ireland, July 4-6, 1994, pp. 222-231. Demonstration of the fundamental algorithms of probabilistic text retrieval using logistic regression as the machine learning technique.

Gey, F. & A. Chen. 1996. Intelligent Boolean Filtering for Routing Retrieval. UC DATA Information Systems Technical Report IS-96-1, January 1996. Experiments with combining boolean and probabilistic queries.

Gey, F., A. Chen, J. He, J. Meggs and L. Xu. 1996. Term Importance, Boolean Conjunct Training, Negative Terms, and Foreign Language Retrieval, Probabilistic Algorithms at TREC-5. In: Proceedings of the Fifth Text REtrieval Conference (TREC-5), National Institute for Standards and Technology, Washington, DC (November 20-22, 1996). Experiments with different approaches to query construction from natural language information need descriptions.

Gey, F., A. Chen, J. He, J. Meggs and L. Xu. 1997. "Term Importance in Routing Retrieval," submitted to SIGIR97, the 20th annual ACM conference on Research and Development in Information Retrieval, Philadelphia, PA, July 26-31,1997. An innovative method for finding and stressing important query terms for recurrent information needs.

He, J. & F. Gey. 1996. Online Codebook Browsing and Conversational Survey Analysis. Social Science Computer Review, Vol. 14, No. 2, Summer 1996, pp. 181-186. Representation method for interactive subject browsing of data dictionaries for social science datasets coupled with direct exploratory analysis.

Larson, R. R. 1989. An Automatic Method of Enhancing Topical Searching for Online Catalogs based on Classification Clustering. In The Annual Review of OCLC... Research, July 1988 - June 1989. Dublin, Ohio: OCLC Online Computer Library Center, Inc., 1989.

Larson, R. R. 1991. Classification Clustering, Probabilistic Information Retrieval and the Online Catalog. Library Quarterly, vol. 61, no. 2 (April), 1991, pp. 133-173.

Larson, R. R. 1992a. Evaluation of Advanced Retrieval Techniques in an Experimental Online Catalog. Journal of the American Society for Information Science, v. 43 no. 1 (January 1992), pp. 34-53.

Larson, R. R. 1992b. Experiments in Automatic Library of Congress Classification. Journal of the American Society for Information Science, v. 43 no. 2 (March 1992), pp. 130-148.

Larson, R. R. 1996. Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. In: Hardin, Steve (Ed), ASIS '96 Proceedings of the 59th ASIS Annual Meeting, Baltimore, MD, Oct 21-24, 1996. Medford, NJ: Information Today.

Larson, R. R., J. McDonough, L. Kuntz, P. O'Leary, & R. Moon. 1996. Cheshire II: Designing a Next-Generation Online Catalog. Journal of the American Society for Information Science, 47(7) (July 1996), pp. 555-567.

Merrill, D., N. Parker, F. Gey & C. Stuber. 1995. The University of California CD-ROM Information System. Special issue on Digital Libraries of the Communications of the ACM, v. 38, no. 4, April 1995, pp. 51-52. Description of the largest social science data archive (175 gigabytes) available on the world-wide-web.

Norgard, B. A., M.G. Berger, M. K. Buckland, & C. Plaunt. 1993. The online catalog: From technical services to access service. Advances in Librarianship 17 (1993):111-148. Extensive, evaluative review of literature on problems and innovation on online bibliographic retrieval systems.

Schatz, B. et al. 1996. Interactive term suggestion for users of digital libraries: Using subject thesauri and co-occurrence for information retrieval. In: Proceedings of the First ACM International Conference Digital Libraries. New York: ACM Press.