Mark Steinberger
mark@csc.albany.edu
The data involved in scientific communication is highly complex, especially in the mathematical sciences, where sgml and its relatives are currently inadequate to convey the required notation.
Two streams of documents need to be considered: preprints and journals. On the preprint side, much information is available in individuals' home pages. This material is intrinsically anarchic, and is subject to change and dislocation without notice. On the other hand, the offerings can be much richer than what is available from preprint servers: Authors can place supporting documents, computer programming code, or interactive web programs alongside their preprints, or dynamically linked to them. The site can be as complex as the author can make it.
Nevertheless, since the data is transitory, it would be difficult to systematically catalogue it without the authors' frequent cooperation in providing metadata in standardized form. And, at least in the United States, it would be surprising to see most authors providing metadata in any form. The use of a web crawler specific to mathematics, such as Jim Richardson's MathSearch indexer (which, unfortunately, restricts its indexing to English language sites), may be the best solution for accessing information on mathematicians' home pages for some time to come. In other words, treat the material as a blob, and make an undifferentiated index of it all.
Preprint servers serve a narrower purpose. It is an important one, and one that would permit excellent indexing. Before we get into the details of this particular data source, let us consider the forms of data associated with a mathematical research article.
Html is currently inadequate for conveying mathematics. The current options are generally derived from Donald Knuth's TeX typesetting system. TeX can be used to produce dvi files, postscript files, HyperTeX dvi files, and pdf files.
Pdf files and HyperTeX dvis have the advantage of allowing both internal and external hypertext links. And these may connect the article to a web of other articles, interactive web programs, etc. In other words, HyperTeX or pdf gives the article the same facility of web use available with html, and also provides real typesetting.
The internal links are especially important for on-line reading, as it is difficult to hunt for a particular location in an article without them. Readers of paper articles can leaf quickly through the pages to find the statement of a theorem when it is used in the argument for a subsequent result. This is more difficult when reading electronically, and the links work to excellent advantage. In fact, reading on-line with good internal links is more efficient than leafing through a paper document.
Pdf has a number of other advantages over the other formats:
For these and other reasons (see [S1]), pdf is likely to win out over the other formats for math journals. Springer, Project Muse, Kluwer, Elsevier, Academic Press, and many others have been using pdf on their web sites for mathematical journals.
Authoring in the mathematical community is done in TeX. Authors submit TeX files to most preprint archives, and, increasingly, to journals. Most journals do not display TeX source files, but most preprint servers do.
This makes it potentially very easy to access information from the extant preprints: At least on an individual site, it is possible to make a keyword index of the full text of all the preprints on file.
It would be even more useful if all the preprint archives were to merge. This proposal is currently on the table. Paul Ginsparg has agreed to open his preprint server at Los Alamos National Labs to all of mathematics, creating as many new subject categories as needed. His archive does work from TeX source files, and produces dvi, postscript, and pdf files automatically. There is an initiative within the mathematical community to take advantage of his offer.
A common archive also permits collection of structured data on the papers. For instance, the submission forms for the LANL archive include a number of data fields (including abstract), which can then be searched separately for keywords. And more complex data queries could be implemented in the future. The site is an excellent testbed for metadata techniques.
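As a rough illustration of searching one data field separately from the others, here is a minimal Python sketch. The record layout, field names, and sample entries are hypothetical and are not taken from the LANL submission forms.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One submission record; the field names are illustrative assumptions."""
    identifier: str
    title: str
    authors: str
    abstract: str


def search_field(records, field, keyword):
    """Return the records whose named field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in records if needle in getattr(r, field).lower()]


# Hypothetical sample data, for illustration only.
records = [
    Record("example/0001", "Spectra of Laplacians", "A. Author",
           "We study eigenvalue asymptotics."),
    Record("example/0002", "Stable homotopy", "B. Author",
           "We construct a ring spectrum."),
]

abstract_hits = search_field(records, "abstract", "spectrum")
title_hits = search_field(records, "title", "spectr")
```

Restricting the query to a single field is what distinguishes this from a full text search: the same keyword can be looked up in abstracts only, or in titles only.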
The LANL archive has an advantage from the point of view of data management in that while updates of preprints are permitted, the old copies are preserved on line, and all copies are date stamped.
Journals display the final form of a scholarly work, and form an important filtering mechanism for the reviewing system. It is the journal version of an article that will be referenced, if possible, and that will receive links from Zentralblatt and Math Reviews. Journals should also provide the most stable links to the article. See [S2] for further discussion of these issues.
Thus, it is important to be able to conduct data searches restricted to journals alone.
Journals also are potentially much more complex from a data standpoint than preprint archives.
First, journals can archive the same kind of supporting materials, interactive web programs, etc., that authors can place on their own personal web sites, as well as commentary by other mathematicians. The New York Journal of Mathematics and the Electronic Journal of Combinatorics are already starting to do this.
This raises the question of who is to catalogue these materials, and how.
Second, most journals will not distribute TeX source, and will only distribute graphical formats such as pdf, dvi, etc. Indexing pdf files is apparently possible, but would be cumbersome, due to the size of the files, even if it were otherwise efficient. I do not know of a way to index postscript or dvi files.
Thus, a practical full text index should probably be made from the TeX source, which is generally not accessible off-site. It is quite possible to index the TeX source, and then deflect the link produced by the search engine to some publicly accessible format of the paper. At the New York Journal of Mathematics, the TeX source is indexed by glimpse. If the TeX source for a paper contains a match, the search engine provides a link to the html abstract page for the paper. The abstract page in turn contains links to all four graphical formats of the paper, as well as to other pages in the journal.
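The idea of indexing the private TeX source while handing the reader a link to a public page can be sketched in a few lines of Python. This is not how glimpse works internally. It is a simple inverted index, and the paper identifiers, TeX snippets, and URLs are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical papers: private TeX source plus a publicly accessible abstract page.
PAPERS = {
    "paper1": {
        "tex": r"\begin{theorem} The spectrum of $A$ is compact. \end{theorem}",
        "abstract_url": "https://journal.example/paper1.html",
    },
    "paper2": {
        "tex": r"\section{Cobordism} We compute the cobordism ring.",
        "abstract_url": "https://journal.example/paper2.html",
    },
}

WORD = re.compile(r"[A-Za-z]+")


def build_index(papers):
    """Map each lowercased word in the TeX source to the ids of papers containing it.

    TeX control words such as "begin" are indexed too; a fuller indexer
    would probably filter them out.
    """
    index = defaultdict(set)
    for paper_id, paper in papers.items():
        for word in WORD.findall(paper["tex"]):
            index[word.lower()].add(paper_id)
    return index


def search(index, papers, keyword):
    """Look the keyword up in the TeX index, but return abstract page URLs."""
    ids = sorted(index.get(keyword.lower(), ()))
    return [papers[paper_id]["abstract_url"] for paper_id in ids]


index = build_index(PAPERS)
hits = search(index, PAPERS, "spectrum")
```

The search never exposes the TeX file itself: the match is found in the source, and only the address of the abstract page is returned.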
Glimpse is particularly useful here, as its search output includes the lines of text that contain the matches. This enables the reader to screen out matches in which the keywords are used with a different meaning or in a different context. For instance, the word "spectrum" has several distinct meanings in different branches of mathematics, and the lines displayed for a search on "spectrum" let the reader discard matches foreign to the intended context.
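The value of showing the matching lines can be illustrated with a small Python sketch. It imitates only the general shape of such output, not glimpse itself, and the sample text is invented for the example.

```python
def matching_lines(text, keyword):
    """Return (line number, line) pairs for the lines that contain the keyword."""
    needle = keyword.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


# Hypothetical excerpt from a paper's source, for illustration only.
SAMPLE_TEX = """We recall the definition of a ring spectrum.
Smash products are discussed in Section 2.
The homotopy groups of the spectrum are finitely generated."""

context = matching_lines(SAMPLE_TEX, "spectrum")
```

A reader looking for the spectrum of an operator would see from these two lines that the paper concerns spectra in the sense of homotopy theory, and could skip it without opening the article.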
This makes it possible for a particular journal to provide very good data to the public, but at some cost in a lack of uniformity between journals: Different journals have different indexers and search engines (sometimes indexing different forms of data, as well). Some types of indices permit simultaneous searches of distinct remote sites and others do not.
A common full text search of all the math journals would be very useful. So would a common standard for the forms of data and types of queries. Neither of these is likely any time soon. But there is an initiative to create a common search capability among the independent electronic math journals.
On the other hand, it is quite feasible to create a common index of more succinct data, similar to the indices of Dublin Core metadata to be found, e.g., in the Electronic Library at the University of Osnabrueck. But doing so will require the cooperation of the journal publishers.
Many publishers are already providing similar data, including abstracts, in html abstract pages for their journals, such as those used by the New York Journal of Mathematics and the Pacific Journal of Mathematics.
A database of such abstract pages could be assembled quickly at minimal cost. This should be done regardless of other initiatives, as retrospective conversion to meet Dublin Core standards is unlikely in general.
For the future, it is important to obtain standardized metadata from publishers. Dublin Core is presumably the best standard to adopt, given its prevalence in European data-sharing initiatives and its standing in the library community. Nevertheless, with a bit of perl programming, it should be possible to integrate other forms of metadata into a common index, provided each publisher is willing to provide a consistent format.
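The kind of small program meant here might look like the following sketch, written in Python rather than perl. The publisher names and their field labels are hypothetical. Only the target names (title, creator, description, identifier) are actual Dublin Core elements.

```python
# Hypothetical per-publisher field labels mapped onto Dublin Core element names.
FIELD_MAPS = {
    "publisher_a": {
        "Title": "title",
        "Author": "creator",
        "Abstract": "description",
        "URL": "identifier",
    },
    "publisher_b": {
        "ti": "title",
        "au": "creator",
        "ab": "description",
        "link": "identifier",
    },
}


def to_dublin_core(publisher, record):
    """Rename a publisher's fields to Dublin Core names, dropping unmapped fields."""
    mapping = FIELD_MAPS[publisher]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Two records in different (invented) house formats end up with the same keys.
common_index = [
    to_dublin_core("publisher_a", {"Title": "On spectra", "Author": "A. Author"}),
    to_dublin_core("publisher_b", {"ti": "On cobordism", "au": "B. Author", "vol": "3"}),
]
```

This works only under the condition stated above: each publisher must keep its own format consistent, so that one mapping table per publisher suffices.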
[S1] M. Steinberger, Making Optimal Use of the Electronic Environment, Proceedings of the Conference on Electronic Communications in Mathematics, The Geometry Center, Minneapolis, MN, http://www.geom.umn.edu/docs/cecm/.
[S2] M. Steinberger, Electronic Mathematics Journals (http://www.ams.org/notices/199601/steinberger.html), Notices of the American Mathematical Society 43 (1996), 13-16.