Proposal to DARPA BAA 97-09: Excerpts relating to work to be done.
SECTION II. Detailed Proposal Information
II. A. INNOVATIVE CLAIMS FOR THE PROPOSED RESEARCH.
An analyst needs to select information from an increasing population
of heterogeneous repositories with quite diverse metadata vocabularies
(categorization, classification, indexing semantics). Necessarily, the
number -- and percentage -- of metadata vocabularies that are unfamiliar
to any given analyst is increasing steeply. When they encounter an unfamiliar
metadata vocabulary, how are they to know which codes or terms will lead
them to what they want? Problems of intellectual access arise at three
levels: (1) Selecting which repositories to search. (2) Adaptively coping
with alien metadata vocabulary (assigned indexing terms or classification/concept
codes, automatically-derived categorization) in each remote repository;
and (3) Aggregating data from heterogeneous repositories. The proposed research is focussed on #2, enabling an analyst to use unfamiliar metadata effectively and efficiently. This problem has two levels: (i) Selecting suitable vocabulary for formulating a search; and (ii) Understanding relationships between terms/codes within the metadata
system of the remote repository, e.g. navigating within a thesaurus. Metadata
vocabularies ordinarily contain complex internal relationships between
related terms. Even when metadata schemes have been developed and implemented in the
rapidly expanding environment of network accessible, but heterogeneous
repositories, each analyst is faced by an explosive increase in unfamiliar
descriptive metadata systems. One cannot expect analysts to have expert
familiarity with more than a few -- a diminishing percentage! -- of available
repositories. For decades a massive investment has been made, and continues
to be made, in the development and application of indexing and categorizing
schemes, "manual", automatic, and computer-aided. The special innovative
claim of the proposed research is that it is revolutionary because it
differs from, complements, and adds value to all that massive investment
by DARPA and others in categorization and indexing. It will demonstrate
technology that enables analysts to use their own vocabulary to enter and
make expert use of unfamiliar metadata vocabularies. We call this technology
"Entry Vocabulary Modules". The scientific and technical merit of the proposed research is that
(i) it goes substantially beyond the existing state-of-the-art, which ordinarily
depends on the expensive human crafting of links within and between vocabularies
(e.g. UMLS); (ii) it is based on searching of fragments existing within the
metadata and its database; (iii) it uses advanced probabilistic techniques;
(iv) it allows the analyst to use familiar search terms from a familiar
vocabulary; (v) it is designed for use in a networked environment; (vi)
it allows for rapid deployment with unfamiliar metadata schemes without
requiring the consent, knowledge or participation of the remote database
administrator. Consider a simple straightforward search. An analyst seeks material on coastal
pollution and submits a search to a very large library catalog to
identify books and to the principal health sciences bibliography, MEDLINE,
to find journal articles. A subject search on the topical phrase "coastal
pollution" retrieves nothing on the MELVYL library catalog, the largest
in the US, since in the Library of Congress Subject Headings
(LCSH) it is not established either as a heading or a cross-reference.
Nor does a subject keyword search on the terms "coastal" and "pollution"
retrieve anything. Relevant material is in there, but it is categorized
under "Marine pollution", "Water quality", "Coastal zone management", "Estuarine
pollution" and other headings. The same coastal pollution query addressed to MEDLINE also
yields no retrievals, either as a subject phrase or as a subject keyword
search. Again there is, in fact, relevant material, but it is under "Seawater",
"Water pollutants", "Water microbiology", "Bathing beaches" and so on.
On inspection, the individual headings can be seen to be more
or less plausible, but who would have had the imagination to have thought
of more than one or two of these? And note that the vocabulary actually
used in these two systems is quite different. There is not even any overlap
in metadata vocabulary used in the two systems for the same topic in this
example, so even a familiarity with the actual vocabulary used in one
of them would not be much help when searching the other. LC Subject Headings and the Medical Subject Headings (MeSH)
of MEDLINE are relatively familiar and straightforward examples
of metadata vocabulary. More specialized repositories are much more opaque.
Suppose an analyst wanted data on exports and imports of automobiles
and looked in the Census Bureau's U.S. Imports and Exports numeric
datasets on CD-ROM. A search on the term "automobile" will find nothing.
A search on "cars" will lead to "Railway or Tramway Stock". The data are
there under "Passenger Motor Vehicles, Spark Ignition Engine". And
anyone who doubts the assertion that metadata vocabularies tend to be
stylized and non-obvious "dialects" should try some searches using the
US Patent Office classification scheme. The problem is doubly exacerbated when dealing with foreign language
materials. The ordinary problems of translation are compounded by specialized
domain usage. A recent study of military aerial reconnaissance using photography
from tethered balloons illustrates this well. The investigation would
have been more effective and less costly if the analyst had known to search
for tethered balloons under the term "aerostat". The search could have
been even more cost-effective if the analyst had known, when searching
the German technical literature, that in the First World War, when this
technology was in heaviest use, the German term used for a tethered observation
balloon was "Drachen", even though in standard German and in dictionaries
"Der Drachen" means a kite, not a balloon. Experienced analysts, like experienced reference librarians, know that
familiarity with the source (database, reference work) is critical for
effective, reliable searching -- each has its own quirks and personality
-- and that this familiarity comes from experience, from frequent use.
Furthermore, the more that has been invested in the enhancement of the
source, i.e. the richer the metadata, the more important this personal
experience and familiarity becomes. But the rapid increase in network accessible repositories increases the number -- and proportion -- that are unfamiliar. Buckland (1992a) has argued that the most cost-effective single investment for improving effectiveness in the searching of repositories would be technology to assist the searcher in coping with unfamiliar metadata vocabularies. Federal funding from the U.S. Department of Education (HEA II) and adaptation of the "classification clustering" technique developed by Professor Ray Larson have enabled the development of three different "proof-of-concept" prototypes and a clear path for the development and deployment of technology to support more cost-effective use of unfamiliar metadata schemes. We call this technology "Entry Vocabulary Modules" because they provide the analyst with intellectual and semantic "entry" into the range of codes or terms ("vocabularies") of diverse metadata schemes. Meanwhile, the U.S. Department of Education funding source has
been abolished. We therefore request DARPA funding to support the research
and development necessary for the following deliverables: Task A: A set of prototype Entry Vocabulary Modules for a challenging
range of examples, with demonstrated deployment in three modes: As an
amenity provided by a repository; as a network-accessible amenity; and
embedded in a solo or collaborative work environment. Task B: An intelligent agent capable of unintrusively creating Entry
Vocabulary Modules for remote repositories with no more than the access
necessary for conducting a search. Task C: A Report on the sensitivity of Entry Vocabulary Modules to variations
between subdomains within repositories. Should they be designed for subdomains
as well as for entire repositories? Task D: Addition of natural language processing techniques to enhance
the statistical term co-occurrence approach of the initial "proof-of-concept"
prototypes already developed. Task E: Recommendations for the improvement of "Codebook" metadata documentation
for numeric databases. The software developed will be made freely available, subject to acknowledgment.
It is our experience, however, that in this area technology transfer comes
from the demonstration of convincing functionality, subsequently implemented
in system developers' own code, rather than from direct transfer of the
software itself. The work is designed to develop Entry Vocabulary Module technology to
provide a cost-effective remedy for the difficulties that arise for analysts
when searching network accessible repositories given the following combination:
1. A massive, increasing investment world-wide in making repositories
accessible over networks; 2. A massive, increasing investment world-wide in providing
indexing, categorizing, and other metadata; and 3. Increasing difficulty for analysts when searching because
the number of unfamiliar repositories is increasing (1. above)
and the amount and proportion of unfamiliar metadata vocabularies
is also increasing. Decreasing search effectiveness is the predictable
result. Previous Federally-funded research and development by the Proposers has
resulted in three "proof-of-concept" prototypes of new technology for
providing detailed support for analysts when searching unfamiliar databases.
These are openly available on the World Wide Web for examination. The work to be performed includes the development of a suite of prototype
Entry Vocabulary Modules in a diverse selection of challenging metadata
environments:
a. From uncontrolled English language to a structured thesaurus.
b. From uncontrolled English language to a complex concept categorization.
c. From uncontrolled English language to a library classification scheme.
d. From uncontrolled English language to the U.S. Patent Office Classification scheme.
e. From an uncontrolled foreign language to a U.S. metadata scheme as an example of translingual access.
f. Cross-classification between two dissimilar metadata schemes.
g. From uncontrolled English language to a Federal numerical database of socio-economic data.
h. Between different natural language vocabularies, e.g. English, French and German.
Implementation contexts: "Entry vocabulary" functionality has
very extensive implementation potential and can be usefully positioned
in several ways: As an amenity on the analyst's workstation; As an amenity
on a repository server; and as an amenity on a work-centered computing
environment. An Entry Vocabulary Module can also be used for computer-assisted categorization.
The work will include four Associated Research Tasks:
B. An Intelligent EVM Agent to develop Entry Vocabulary Modules unintrusively.
C. An analysis of how sensitive metadata vocabulary is to the specific subdomains of analysts' interest.
D. The incorporation of Natural Language techniques to augment the statistical term frequency technique currently used.
E. Recommendations for improved metadata "codebooks" for numeric databases.
The work will be disseminated through: - publicly available, no-cost, web-based demonstration of prototypes; - descriptive reports at the conferences and in the technical journals most likely to reach prospective implementers and adopters; and - articles on the theoretical and technical issues that arise. The software developed will be made freely available, subject to acknowledgment.
It is our experience that in this area technology transfer comes from
the demonstration of convincing functionality, subsequently implemented
in system developers' own code, rather than from direct transfer of the
software itself. There are no specific contractor requirements beyond the financial support
in the Budget. The proposed research will develop, document, and make functioning operational
prototype Entry Vocabulary Modules freely available through open web-based
access. The prototypes will enable analysts to make expert, efficient
use of unfamiliar metadata schemes. The deliverables can be viewed in terms of a matrix of Metadata Access
Challenges and Implementation Contexts: A. Metadata Vocabulary Challenges. A range of examples
designed to test and demonstrate the versatility of Entry Vocabulary Module
technology against differing challenges:
a. From uncontrolled English language to a structured thesaurus: A preliminary prototype providing access to a subdomain of the INSPEC thesaurus can be viewed at: http://briet.sims.berkeley.edu/oasis/inswater.html
b. From uncontrolled English language to a complex concept categorization: A preliminary prototype providing access to a subdomain of BIOSIS (Biological Abstracts) can be viewed at: http://briet.sims.berkeley.edu/oasis/biosis.html See the example of the
result of a search for BIOSIS concept codes relating to "water
pollution" at the end of Section III. c. From uncontrolled English language to a library classification scheme:
Co-Principal Investigator Ray Larson has implemented an Entry Vocabulary
Module for a subdomain of the Library of Congress Classification that
includes Information Management, Bibliography, and Library and Information
Science. Broader coverage has recently been demonstrated in the campus'
Physical Sciences libraries, including Astronomy, Geology, Math and Physics.
d. From uncontrolled English language to the U.S. Patent Office Classification scheme, which is a very good example of a metadata vocabulary that is important for technology transfer, but difficult for non-specialists: A limited, preliminary prototype can be accessed at URL http://briet.sims.berkeley.edu/oasis/bigpatents.html See the example
of the result of a search for Patent Office Classification codes relating
to "needle" at the end of Section III. e. From an uncontrolled foreign language to a U.S. metadata scheme as
an example of translingual access: The tentative initial choice is from
French to the Library of Congress Classification. f. Cross-classification between two dissimilar metadata schemes: Between
the U.S. Patent Office Classification and the International Patent Classification.
g. From uncontrolled English language to a Federal numerical database
of socio-economic data using the code-book documentation as metadata.
h. Between different natural languages, as related in their common application
in a specific domain: Tentatively trilingually between English, French
and German. Resources permitting, additional research and demonstration prototypes
may also be undertaken in relation to GIS metadata, sound data, and/or
image databases. Implementation contexts: "Entry vocabulary" functionality
has very extensive implementation potential and can be usefully positioned
in several ways. We propose the following: a. As an amenity on the analyst's client to provide assistance when accessing
an unfamiliar remote repository; b. As an amenity on (or invoked from) a repository server to aid remote
searchers unfamiliar with the local metadata scheme of that repository;
c. As an amenity on a work-centered computing environment. Specifically
the Berkeley NSF / DARPA / NASA Digital Libraries Initiative project has
developed a Multi-Valent Document architecture with "Magic Lens" (and
similar) features to allow additional operations on selected fragments
of text. The proposed research would add an "extended search" to the Multi-Valent
Document architecture so that any fragment of any document could become the
basis for a search query and a point of departure for a search of related
material in remote data bases. Note that to the extent that an Entry Vocabulary Module can provide a
ranked list of probably relevant metadata terms for any fragment of text,
it can also be used for computer-assisted categorization. Larson (1992b)
has demonstrated that our proposed methodology is rather reliable in proposing
Library of Congress Classification numbers to book titles. For example,
we expect that an Entry Vocabulary Module for the Patent Office classification
would be useful support for the assignment of classification numbers as
well as for patent searching. Associated Research Tasks. Within this matrix of deliverables,
we are especially interested in the following research aspects: B. Intelligent EVM Agent. To reduce the amount of human
expertise involved in developing Entry Vocabulary Modules, an Intelligent
Agent will be developed. The Agent, once pointed at a target repository,
will obtain a retrieved set sufficient to serve as the training set
and will then create an Entry Vocabulary Module, and add it to those available
in the assigned work environment. C. Sub-domain sensitivity. Ordinarily metadata vocabularies
are studied as a whole, but in practice analysts are rarely interested
in the whole. They are usually interested in some specific subdomain reflecting
their particular interest. Hence we have concentrated on Entry Vocabulary
Modules of topical, work-related subdomains. For example we have developed
two subdomain Entry Vocabulary Modules in the INSPEC thesaurus - for "Information
Science" and for "Water." The sensitivity of search results using subdomain
Entry Vocabulary Modules needs to be tested. D. Natural Language techniques. Development of Entry
Vocabulary Modules has so far been derived from term frequency occurrences.
It is intended to experiment with the effects of also drawing on natural
language processing techniques and, where applicable, the syntactical
relationships sometimes found in classification and categorization schemes.
E. Numeric database "codebook" recommendations. The
experience derived from the research on computer aided access to numeric
databases will yield recommendations for good metadata standards and practice
for Federal and other numeric data bases. The research would be disseminated through: - publicly available, no-cost, web-based demonstration of prototypes; - descriptive reports at the conferences and in the technical journals most likely to reach prospective implementers and adopters; and - articles on the theoretical and technical issues that arise. The software developed will be made freely available, subject to acknowledgment. It is our experience that in this area technology transfer comes from the demonstration of convincing functionality, subsequently implemented in system developers' own code, rather than from direct transfer of the software itself. II. F. COST, SCHEDULE, MILESTONES.
II. G. TECHNICAL RATIONALE, TECHNICAL
APPROACH AND CONSTRUCTIVE PLAN FOR ACCOMPLISHMENT. i. Many of the most information-rich repositories, especially bibliographical
and textual databases, have some form of classification, coding, or indexing.
But these are more or less stylized and cannot be used effectively or
efficiently except by searchers who are familiar with them. The explosive
increase in heterogeneity ensures that intimate familiarity becomes increasingly
rare. Entry Vocabulary Modules provide the kind of expert prompting
that an expert search intermediary would. Unfortunately, a sufficient
population of human expert search intermediaries is unaffordable. ii. An increasing number of techniques are being developed for automatic
categorization of repositories for which human indexing is unavailable.
But as Melvil Dewey found when he developed his famous Decimal Classification
in 1876, one needs a natural language index to a categorization scheme
for it to be conveniently usable. The purpose of an Entry Vocabulary Module
is to provide guidance in the transition from familiar vocabulary, whether
ordinary English or from a familiar categorization scheme, to the unfamiliar,
providing the kind of expert prompting that an expert search intermediary
would. In both cases, the Entry Vocabulary Module provides a value-added enhancement
to the return on the original investment in generating metadata. The specific technique to create a ranked list of probably relevant terms
in the target metadata vocabulary from any given searcher input is the
"Classification clustering" technique developed for this purpose by Ray
Larson, using a probabilistic interpretation of vector-space retrieval,
extended by William Cooper's Staged Logistic Regression method. (Larson,
R. R. 1991. "Classification clustering, probabilistic retrieval, and the
online catalog." Library Quarterly v. 61, no. 2 (April 1991):
133-173.)
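As a rough illustration of the idea behind classification clustering, the sketch below pools the words of all records that share a metadata code into one cluster and then ranks the clusters against a searcher's phrase. It is a minimal sketch under stated assumptions, not the method as implemented: it substitutes plain TF-IDF cosine scoring for the probabilistic weighting and Staged Logistic Regression named above, and the function names and the four training records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def build_clusters(records):
    """Pool the words of every record that carries the same metadata code."""
    clusters = defaultdict(Counter)
    for text, code in records:
        clusters[code].update(tokenize(text))
    return clusters


def rank_codes(query, clusters):
    """Return metadata codes ordered by TF-IDF cosine similarity to the query.

    Assumption: TF-IDF stands in for the probabilistic weights of the
    proposal. The query norm is constant across clusters and is omitted.
    """
    n = len(clusters)
    df = Counter(term for counts in clusters.values() for term in counts)
    idf = {term: math.log(1 + n / d) for term, d in df.items()}
    q = Counter(t for t in tokenize(query) if t in idf)
    scored = []
    for code, counts in clusters.items():
        dot = sum(q[t] * idf[t] * counts[t] * idf[t] for t in q)
        if dot > 0:
            norm = math.sqrt(sum((c * idf[t]) ** 2 for t, c in counts.items()))
            scored.append((dot / norm, code))
    return [code for _, code in sorted(scored, reverse=True)]


# Hypothetical training records: (title words, assigned subject heading).
records = [
    ("oil spills in coastal waters", "Marine pollution"),
    ("pollution of the sea by ships", "Marine pollution"),
    ("managing the coastal zone", "Coastal zone management"),
    ("railway rolling stock design", "Railroads"),
]
clusters = build_clusters(records)
ranked = rank_codes("coastal pollution", clusters)
print(ranked)
```

The searcher's own words never need to match a heading directly; they only need to co-occur with records that already carry that heading.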
Clearly the figures shown in this database have a critical effect on the U.S. balance of trade, and searching the data will be an important activity for those engaged in strategic assessment of the future of the United States economy. Yet if one does a commodity search using the word "automobiles" on the commonly used WWW database (http://govinfo.kerr.orst.edu/impexp.html) one finds no results. Moreover, if one does the search using the word "cars," one obtains the misleading result "Railway or Tramway Stock, etc." A searcher interested in this database must be aware that the general classification heading for this commodity group is "Tractors, Vehicles for Pass, Goods, Special Purposes" and the particular classification for cars is "Passenger Motor Vehicles, Spark Ignition Engine" as above.
Another example of strategic importance concerns military armament exports.
In this area a search in the exports database using the word "Rockets"
yields the misleading commodity category "Bearings, Transmission, Gaskets,
Misc". Restricting the search to the singular form "rocket" yields an
additional three categories:
the last of which specifically concerns military weapons exports from the United States. Indeed the specific term "rocket" is found only in the following category:
MISSILE & ROCKET LAUNCHERS AND SIMILAR PROJECTORS (9301009050)
and completely misses a larger export category:
GUIDED MISSILES (9306900020), Unit of Quantity: Number, U.S. Exports Of Merchandise For The Year Through November, 1996
while the general heading category for this section is: BOMBS, GRENADES, ETC (9306). Clearly researchers who wish to mine this database for information pertinent
to the strategic economic position of the United States have a need for
a tool which will bridge the gap between common terminology and the highly
specialized classification scheme which has evolved for categorizing
these data. The export database also includes information on point-to-point shipments
of various commodities. For example, a certain quantity of guided missiles
has been shipped to the State of Bahrain, east of Saudi Arabia:
If one wishes to obtain aggregate shipments for the "Middle East", it is unlikely that the inexperienced searcher would be able to identify Bahrain as a component Middle East country which would be a significant export customer of the United States. In order to search the geographic dimension of these data effectively, bridges must be constructed between a common language geographic specification and the particular special geographic name which resides in the geographic metadata in the database. Clearly researchers who wish to mine this database for information pertinent to the strategic economic position of the United States have a need for tools which will bridge the gap between common terminology and the highly specialized classification scheme which has evolved for categorizing these data. The International Harmonized Commodity Classification Code scheme consists
of a hierarchical 10 digit code with associated description. As of December
1995 there were 16,746 possible codes which might be associated with commodities
being either exported or imported. The Census Bureau has released a CD-ROM
"U.S. Exports Commodity Classification" which has 21,635 lines of description
of the codes. This description contains some additional common English
words which annotate the hierarchical coding scheme, but not nearly enough
training data to accurately bridge the gap between the two forms. We also
propose to utilize the ARPA sponsored WordNet lexical database (http://www.ito.darpa.mil/Summaries95/
[WordNet example output omitted: 5 senses of car]
The WordNet database will be utilized as the starting point for probabilistic
similarity searches for common word synonyms for commodity classification
descriptions. We will consider WordNet sense descriptions as documents
which provide evidence for the commodity concept in accordance with the
methodologies described in (Larson, 1991).
Relationship to the Multi-Valent Document architecture. The Multi-Valent Document architecture is a noteworthy product of the Berkeley Digital Libraries Initiative project. In it, each document is composed of multiple "layers", each with a set of "behaviors" that implement particular functions using one or more layers. Typically the layers include the image of a scanned document page, an underlying layer of full text in ASCII based on OCR conversion of the page image, and a layer that provides the positional information relating the image to the underlying text. Typical behaviors include such things as the ability to search for words in the ASCII text and highlight matching words on the page image.
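The layer-and-behavior arrangement described above can be pictured with a small sketch. It is illustrative only: the class, the layer names, the file name and the bounding boxes are hypothetical and are not taken from the Multi-Valent Document software itself. It shows one behavior that searches the OCR text layer and uses the positional layer to return the regions of the page image to highlight.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height on the page image


@dataclass
class LayeredDocument:
    """A document as named layers plus named behaviors (hypothetical model)."""
    layers: Dict[str, object] = field(default_factory=dict)
    behaviors: Dict[str, Callable] = field(default_factory=dict)

    def invoke(self, name: str, *args):
        """Run a behavior by name, giving it access to all layers."""
        return self.behaviors[name](self, *args)


def highlight_matches(doc: LayeredDocument, term: str) -> List[Box]:
    """Behavior: find a word in the text layer, return its image regions."""
    words = doc.layers["text"]        # OCR words in reading order
    boxes = doc.layers["positions"]   # one bounding box per OCR word
    return [box for word, box in zip(words, boxes)
            if word.lower() == term.lower()]


page = LayeredDocument(
    layers={
        "image": "page-001.tiff",  # placeholder for the scanned page image
        "text": ["Coastal", "pollution", "and", "coastal", "zones"],
        "positions": [(10, 10, 60, 12), (75, 10, 70, 12), (150, 10, 25, 12),
                      (10, 30, 55, 12), (70, 30, 45, 12)],
    },
    behaviors={"highlight": highlight_matches},
)
regions = page.invoke("highlight", "coastal")
print(regions)
```

Because behaviors are registered by name rather than built into the document, a further behavior can be added without changing the existing layers.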
Summary
The proposed research would add an "external search" behavior to the Multi-Valent Document architecture so that when a user feels the need for additional information (such as citations to supporting or relevant documents or the documents themselves) the user simply selects the text and invokes the "external search" behavior to seek out and retrieve such information. The external search behavior might query the current document, a local database and remote repositories to find such documents or data, and then retrieve the document itself, or an annotation (such as citation and abstract) for viewing by the user.
The external search behavior would use the appropriate entry vocabulary modules to construct a search that would then be processed using the remote search features of the Cheshire system (implemented using the Z39.50 information retrieval protocol) as well as external Web search engines.
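The flow just described, from selected text through an entry vocabulary module to a repository-specific query, might be sketched as follows. Everything here is an assumption made for illustration: the function names, the stand-in modules and the `subject="..."` query string are hypothetical, and no Cheshire or Z39.50 call is made. The stand-in modules simply return headings quoted earlier in this proposal.

```python
from typing import Callable, Dict, List

EntryVocabularyModule = Callable[[str], List[str]]


def external_search(selected_text: str,
                    modules: Dict[str, EntryVocabularyModule],
                    top_n: int = 3) -> Dict[str, str]:
    """Build one query per repository from a selected fragment of text.

    `modules` maps a repository name to a function that returns metadata
    terms for that repository, best first. The query syntax is illustrative.
    """
    queries = {}
    for repository, suggest_terms in modules.items():
        terms = suggest_terms(selected_text)[:top_n]
        if terms:
            queries[repository] = " OR ".join(
                f'subject="{term}"' for term in terms)
    return queries


# Stand-in modules; a real module would rank terms from training data.
def catalog_module(text: str) -> List[str]:
    if "coastal" in text.lower():
        return ["Marine pollution", "Coastal zone management"]
    return []


def medline_module(text: str) -> List[str]:
    if "pollution" in text.lower():
        return ["Seawater", "Water pollutants"]
    return []


modules = {"catalog": catalog_module, "MEDLINE": medline_module}
queries = external_search("coastal pollution", modules)
for repository, query in queries.items():
    print(repository, "->", query)
```

Keeping one module per repository means the same selected text yields a different, locally appropriate query for each target.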
Constructive Plan
The development of Entry Vocabulary Modules has been in progress as part
of the OASIS research program developing prototypes of enhancements to
online library catalogs and other databases (Buckland, Butler, Norgard
& Plaunt 1993). It is now proposed to migrate this functionality also
to an online public access bibliographic searcher, CHESHIRE, a "next generation"
library catalog being made available in the Berkeley campus science libraries;
as part of the Berkeley Digital Library project; and at (or invocable
from) a repository server. In addition an implementation as computer-aided
indexing will be undertaken at a site to be determined. These varied deployments
will provide the range of implementations described in B.2 above. II. I. KEY PERSONNEL, QUALIFICATIONS,
EFFORT. The Proposed Principal Investigator is Professor Michael K. Buckland.
The Co-Principal Investigators are Dr. Fredric Gey and Professor Ray R.
Larson. The Proposed Principal Investigator, Michael K. Buckland, is Professor
in the School of Information Management and Systems at the University
of California and has published extensively on the theory, practice, and
historical development of Information Management. As Assistant Vice President
for Library Plans and Policies in the University of California system-wide
administration, 1983-87, he was responsible for the development of the
MELVYL system and a pioneering intercampus packet-switched network. He
was formerly Dean of the School of Library and Information Studies at
Berkeley and is currently President-Elect of the American Society for
Information Science, an association of 4,000 professionals, founded in
1937 as the American Documentation Institute and dedicated to improving
access to recorded information. Professor Buckland has extensive experience as a professional librarian,
and in library research and library administration in the U.K. and in
the U.S. Continuously for six years Professor Buckland has team-taught
with Dr Clifford Lynch a seminar on "Extended Retrieval" addressing problems
in retrieval from networked heterogeneous repositories. Co-Principal Investigator Fredric Gey has a Ph.D. in Information Science, University of California, Berkeley (1993) focussed on probabilistic algorithms for text and document retrieval. He is Data Archivist and Assistant Director at the UC Data Archive & Technical Assistance, an organized research unit with $2.5 million annual extramural funding, a staff of ten professionals, thirty-five graduate student research assistants, and five undergraduate computer science student assistants. The unit serves as the data library for social science and health statistics numeric databases for the Berkeley campus. He manages a collection of 2,000+ datasets on 4,000 computer tapes on IBM-3090 under VM/CMS, with a total archive size of 800 gigabytes.
Dr. Gey has twenty years' experience in technical management as a project leader and/or principal investigator for multi-year, multi-million dollar government software research and development projects, and thirty years' experience working with computers, from mainframes to micros. He is General Chair designate for SIGIR99, the 22nd Annual Conference on Research and Development in Information Retrieval, Summer 1999. He has been a participant in the ANSI Committee X3H4 for Information Resource Dictionary Standards (IRDS).
Ray R. Larson is Associate Professor in the School of Information Management
and Systems, University of California at Berkeley. His Ph.D. dissertation
in Library and Information Studies was on Workload characteristics
and computer system utilization in online catalogs (1986). Prof. Larson
has been engaged in the design and development of software for information
retrieval systems since 1978. He received the American Society for Information
Science Journal Best Paper Award in 1993 and has been Chair of the ASIS
special interest group on Human-Computer Interaction. He is currently
Associate Editor of ACM Transactions on Information Systems. This School is fortunate in having doctoral students that are exceptionally
well-prepared for this kind of research. One of those who would be engaged
in the proposed research is Barbara A. Norgard, who has a background in
information retrieval and in linguistics, six years of experience in developing
exactly this kind of software, including the proof-of-concept prototypes
noted in this proposal, and several refereed publications.
II. J. PROPOSER'S PREVIOUS ACCOMPLISHMENTS
The proposed research builds directly on prior achievements in the development
and demonstration of innovative enhancements to online retrieval systems.
A significant characteristic of our research approach, reflected in our
past work and publications, is that it brings an approach and emphasis
that differs from and complements both operational repository services
and the techniques and technical infrastructure being developed by
Computer Scientists. This proposal comes from the School of Information
Management and Systems dedicated to exploring the human aspects of access
to operational information services as well as the technical and technological
aspects. The Research team has substantial personal experience in the
provision of operational information services; in the development and
operation of heavily-used production online services for demanding users;
and in teaching the theory and practice of indexing, cataloging, and classification
to information professionals. Our interest is in "bridging technology"
that brings together information storage and retrieval theory, innovative
software approaches, and the managers of operational production systems.

Principal Investigator Michael Buckland and Co-Principal Investigator
Ray Larson have both been Faculty Investigators in a series of DARPA-related
projects:
- The SEQUOIA 2000 project, concerned with decentralized access to a large-scale
environmental data repository;
- The Computer Science Technical Reports project; and
- The Berkeley NSF / ARPA / NASA Digital Libraries project.

The proposed research will complement, extend, and be carefully coordinated with the work in the DLI project.
The proposed work draws heavily on the CHESHIRE prototype system developed
by Professor Ray Larson, which included an Entry Vocabulary Module providing
access from uncontrolled English language to a formal classification scheme,
a segment of the Library of Congress Classification that includes Information
Management, Bibliography and Library and Information Science.

The proposed research extends the long-term OASIS research program led
by Professor Buckland and funded 1990 - 1996 by the U.S. Dept of Education
under the Higher Education Act Title II, a funding source that has now
been discontinued.

Co-principal investigator Dr. Fredric Gey brings special expertise
to the project. He has over twenty years' experience in working with economic
and social science numeric databases and numerous publications in this
area. His work as assistant director of the UC Data Archive led him to
jointly develop the UC CD-ROM Information System (Merrill, Parker, Gey,
Stuber 1995), whose software has been adopted by the Census Bureau as its
primary access mechanism for Census data on the World Wide Web. In addition,
Dr. Gey is an expert on probabilistic document retrieval and leads the
UC Berkeley team's participation in the annual ARPA-NIST sponsored TREC
(Text REtrieval Conference), which demonstrates advanced text retrieval
algorithms against the TIPSTER document collections of more than one million
documents. Dr. Gey is the General Chair-designate of the ACM SIGIR99 Conference,
the 22nd Annual Conference on Research and Development in Information
Retrieval, to be held at UC Berkeley during the summer of 1999. Dr. Gey
expects to apply probabilistic search techniques to bridge between ordinary
search vocabulary and the specialized classification metadata of the U.S.
Imports/Exports commodity datasets issued on CD-ROM by the Census Bureau.
Dr. Gey's previous accomplishments include a digital library of 135 gigabytes of Federal statistical information (primarily 1990 Census data) from 270 CD-ROM disks on 45 CD-ROM jukeboxes. This archive is arguably the largest single digital library on the Internet. He has also been Project Director of the UC Berkeley Text Retrieval Research Group, directing research in probabilistic text retrieval and multi-lingual (Chinese and Spanish) document retrieval as part of the TREC-5 conference of the National Institute of Standards and Technology. He was awarded a 3-year NSF grant for "Probabilistic Retrieval from Full-Text Document Collections using Logistic Regression," beginning August 15, 1996. He directed the UC DATA development of Internet access to Federal statistical
information on CD-ROM jukebox servers and initiated a project for online browsing
of statistical database codebooks (structured document technical documentation).
Other projects include the development of a California Latina/Latino Demographic
Data Book and the California Welfare Research Project (1992-1997), a multi-million
dollar, 5-year longitudinal study combining county administrative welfare
case files, as well as numerous other projects involving metadata schema and
numeric databases.

Professor Larson's principal prior accomplishments include further development
and testing of an experimental online catalog system (called CHESHIRE
II) for experimental evaluation of information retrieval and hypertext
techniques to improve online catalog search performance and usability.
He is also engaged in experimental investigation of an automatic method
of classification clustering for improving topical access in online catalog
systems, and extension of automatic classification methods to automatic
indexing of networked information resources like the World Wide Web.

Other recent work includes the development of sub-element access methods
for database systems to support text retrieval and other specialized retrieval
functions and the development of full-text indexing and retrieval methods
for Berkeley's NSF/NASA/ARPA Digital Library Project. Past work includes
serving as Faculty Investigator in charge of development of IR methods in the
"electronic repository" for the Sequoia 2000 project, a large-scale storage
and retrieval system to support researchers investigating global change,
and as Co-Investigator on the NRI/ARPA Electronic Libraries project,
developing a network-accessible electronic library of computer science
technical reports in cooperation with Stanford, MIT, Cornell, and Carnegie-Mellon.
(The Berkeley PI was Robert Wilensky; the other co-investigators were Michael Stonebraker,
Michael Buckland, and Clifford Lynch.)

He is now Principal Investigator of the Cheshire Demonstration and Evaluation
Project, sponsored by a U.S. Dept. of Education (HEAIIA) grant. This project
will deploy a next-generation online catalog system in a working library
environment and evaluate its use and acceptance by local library patrons
and remote network users.

The extensive prior experience of the Proposers in the kind of research
proposed provides a credible basis for performance.

The proposed research extends the Proposers' contribution to the Berkeley
Digital Libraries Initiative project. It extends long-term collaboration
with the Berkeley campus UCData, a major numeric data archive managed
by Dr. Fred Gey, and with the University of California systemwide MELVYL system
directed by Dr. Clifford Lynch.

SECTION III. ADDITIONAL INFORMATION.
Bibliography of relevant technical papers and research notes.

Annotated items are examples of work that demonstrate the competence
of the proposed Principal Investigators to undertake the work proposed.

Buckland, M. K. 1991a. Information and Information Systems. New York: Greenwood Press, 1991. (Also paperback ed., New York: Praeger, 1991; Chinese ed., 1994). Systematic introduction to retrieval-based information services and their social environment.

Buckland, M. K. 1991b. Information retrieval of more than text. Journal of the American Society for Information Science 42, no. 8 (September 1991): 586-588. Historical background to the extension of information retrieval beyond textual documents.

Buckland, M. K. 1992a. Agenda for online catalog designers. Information Technology and Libraries 11, no. 2 (June 1992): 157-163. On the strategic significance of providing support for searching unfamiliar metadata vocabularies.

Buckland, M. K. 1992b. Redesigning Library Services. Chicago: American Library Association, 1992. (Japanese ed., Tokyo: Keisoshobou, 1994. Korean ed. in preparation.) A best-selling explanation of the issues underlying the development of digital libraries.

Buckland, M. K. 1994a. From catalog to selecting aid. ALCTS Newsletter 5, no. 2 (1994): Insert A-D. (From catalog to gateway: Briefings from the CFFC, 2). On the transformation of library catalogs into network search support systems.

Buckland, M. K. 1994b. The potential offered by "Extended retrieval". In: Expanding Access to Science and Technology: The Role of Information Technologies. Proceedings of the Second International Symposium on the Frontiers of Science and Technology, Kyoto, 12-14 May 1992. Ed. by I. Wesley-Tanaskovic, J. Tocatlian, and K. H. Roberts. Tokyo: United Nations University, 1994. Pp. 133-143. Problems and solutions in searching across multiple, unfamiliar, heterogeneous repositories.

Buckland, M. K. 1995. Searching multiple digital libraries: A design analysis. Technical report. 1995. URL http://info.sims.berkeley.edu/research/oasis/multisrch.html Specifications for intelligent adaptive search of networked repositories.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1992. OASIS: A front-end for prototyping catalog enhancements. Library Hi Tech, Issue 40 (1992): 7-22. Summary of prototyped information search aids.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1993a. OASIS: Prototyping graphical interfaces to networked information. In: American Society for Information Science. Proceedings of the 56th Annual Meeting, Columbus, October 24-28, 1993. Medford, NJ: Learned Information, 1993. Summary of OASIS research.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1993b. Prototyping enhanced online search capability. In: National Online Meeting, 14th, 1993. Proceedings. Ed. by M. Williams, 51-56. Medford, NJ: Learned Information, 1993. Summary of research.

Buckland, M. K., M. H. Butler, B. A. Norgard & C. Plaunt. 1994. Union records and dossiers: Extended bibliographic information objects. In: Navigating the Networks. Proceedings of the ASIS Mid-year Meeting, Portland, 1994. Medford, NJ: Learned Information, 1994. Pp. 43-57. On the integration of information objects from heterogeneous repositories.

Buckland, M. K. & D. Florian. 1991. Expertise, task complexity, and the role of intelligent information systems. Journal of the American Society for Information Science 42, no. 9 (October 1991): 635-643. An analysis of human-computer partnering in information selection.

Buckland, M. K. & F. Gey. 1994. The relationship between recall and precision. Journal of the American Society for Information Science 45, no. 1 (Jan. 1994): 12-19. Clarification and theoretical explanation of the trade-off between Recall and Precision found in retrieval evaluation studies. Circumstances in which both can be improved.

Buckland, M. K., B. A. Norgard & C. Plaunt. 1993. Filing, filtering, and the first few found. Information Technology and Libraries 12, no. 3 (Sept 1993): 311-319. On the pervasiveness of ordering and ranking mechanisms in information storage and retrieval.

Buckland, M. K. & C. Plaunt. 1994. On the construction of selection systems. Library Hi Tech 48 (1994): 15-28. Analysis of the anatomy of information management systems leads to a simple, recursive model with three components (collections, transformers, and partitioners) that can represent the functionality of complex operational filtering and retrieval systems. Design implications.

Evans, David A. & Chengxiang Zhai. 1996. Noun-phrase analysis in unrestricted text for information retrieval. Proceedings of the 34th Annual Meeting of the ACL. Santa Cruz, CA. pp. 17-24.

Gey, F. 1990. Accessing City-County Data Book via DBASE-III: Census CD-ROMs from the Ground Up. Presented to the 1990 Conference of the International Association of Social Science Information Service and Technology (IASSIST90), Poughkeepsie, New York, May 30 to June 2, 1990, published in IASSIST Quarterly, V. 14, No. 3/4, Fall/Winter 1990, pp. 35-47. Description of basic access mechanisms for Census Bureau data bases distributed on CD-ROM.

Gey, F. 1994. Inferring Probability of Relevance Using the Method of Logistic Regression. In: Proceedings of SIGIR94, the 17th annual ACM conference on Research and Development in Information Retrieval, Dublin, Ireland, July 4-6, 1994, pp. 222-231. Demonstration of the fundamental algorithms of probabilistic text retrieval using logistic regression as the machine learning technique.

Gey, F. & A. Chen. Intelligent Boolean Filtering for Routing Retrieval. UC DATA Information Systems Technical Report IS-96-1, January 1996. Experiments with combining boolean and probabilistic queries.

Gey, F., A. Chen, J. He, J. Meggs and L. Xu. 1996. Term Importance, Boolean Conjunct Training, Negative Terms, and Foreign Language Retrieval, Probabilistic Algorithms at TREC-5. In: Proceedings of the Fifth Text REtrieval Conference (TREC-5), National Institute of Standards and Technology, Washington, DC (November 20-22, 1996). Experiments with different approaches to query construction from natural language information need descriptions.

Gey, F., A. Chen, J. He, J. Meggs and L. Xu. 1997. "Term Importance in Routing Retrieval," submitted to SIGIR97, the 20th annual ACM conference on Research and Development in Information Retrieval, Philadelphia, PA, July 26-31, 1997. An innovative method for finding and stressing important query terms for recurrent information needs.

He, Jianzhang, and Fredric Gey, "Online Codebook Browsing and Conversational Survey Analysis," Social Science Computer Review, Vol. 14, No. 2, Summer 1996, pp. 181-186. Representation method for interactive subject browsing of data dictionaries for social science datasets coupled with direct exploratory analysis.

Larson, R. R. 1989. An Automatic Method of Enhancing Topical Searching for Online Catalogs based on Classification Clustering. In The Annual Review of OCLC... Research, July 1988 - June 1989. Dublin, Ohio: OCLC Online Computer Library Center, Inc., 1989.

Larson, R. R. 1991. Classification Clustering, Probabilistic Information Retrieval and the Online Catalog. Library Quarterly, vol. 61, no. 2 (April), 1991, pp. 133-173.

Larson, R. R. 1992a. Evaluation of Advanced Retrieval Techniques in an Experimental Online Catalog. Journal of the American Society for Information Science, v. 43 no. 1 (January 1992), pp. 34-53.

Larson, R. R. 1992b. Experiments in Automatic Library of Congress Classification. Journal of the American Society for Information Science, v. 43 no. 2 (March 1992), pp. 130-148.

Larson, R. R. 1996. Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. In: Hardin, Steve (Ed), ASIS '96 Proceedings of the 59th ASIS Annual Meeting, Baltimore, MD, Oct 21-24, 1996. Medford, NJ: Information Today.

Larson, R. R., J. McDonough, L. Kuntz, P. O'Leary, & R. Moon. 1996. Cheshire II: Designing a Next-Generation Online Catalog. Journal of the American Society for Information Science, 47(7) (July 1996), pp. 555-567.

Merrill, Deane, Nathan Parker, Fredric Gey, and Chris Stuber, "The University of California CD-ROM Information System," Special issue on Digital Libraries of the Communications of the ACM, v. 38, no. 4, April 1995, pp. 51-52. Description of largest social science data archive (175 gigabytes) available on the World Wide Web.

Norgard, B. A., M. G. Berger, M. K. Buckland, & C. Plaunt. 1993. The online catalog: From technical services to access service. Advances in Librarianship 17 (1993): 111-148. Extensive, evaluative review of literature on problems and innovation in online bibliographic retrieval systems.

Schatz, B. et al., 1996. Interactive term suggestion for users of digital libraries: Using subject thesauri and co-occurrence for information retrieval. In: Proceedings of the First ACM International Conference on Digital Libraries. New York: ACM Press.