Conference paper or proceedings

Comparing the access to and legibility of Japanese language texts in Massive Digital Libraries

In previous studies, Weiss and James examined the impact of Massive Digital Libraries (MDLs) on the development of libraries in terms of copyright, metadata, accessibility and diversity. This paper continues these investigations by presenting the results of a study conducted in 2013-2014 that examines the coverage and accessibility of Japanese-language books in two MDLs, Google Books and HathiTrust. A random sample of 5000 Japanese-language books with publication dates prior to 1943 was extracted from the OCLC WorldCat database; of these, 800 were randomly selected, and 400 titles were examined. The titles were queried in both Google Books and HathiTrust. The texts were then examined for their level of typical user access, the accuracy of their metadata and the quality of their scans. Despite their likely public domain status in Japan and the United States, only 0.2% (N=1) of the sampled texts were visible in Google Books as full texts, while 12.5% (N=50) of the sample were visible in HathiTrust. Within the full-view texts, errors in scanning and metadata were identified, including problems with legibility ("moji tsubure") in 68% of visible texts; distorted content (including slanted and upside-down pages) in 90%; motion or blur from turning pages captured by digital cameras in 48%; extra-textual objects (3-D items not part of the text, e.g. fingers, hands and book holders) in 94%; and use of heavily defaced, dirty or fragile source material in 28%. The most common metadata errors were missing bibliographic information, especially missing page numbers (in 18% of texts) and incomplete tables of contents (in 22%), and problems associated with poor OCR, especially unusable keywords and common phrases (in 50% of texts) that appear to be random words, articles and unpronounceable symbols.