Saturday, November 21, 2009

Week 11 reading notes

Even though I was signed in to the Pitt Library website, I couldn’t access the articles by David Hawking without being prompted to pay for each article, so I wasn’t able to read them.

Shreeves, S. L., Habing, T. O., Hagedorn, K., & Young, J. A. (2005). Current developments and future trends for the OAI protocol for metadata harvesting. Library Trends, 53(4), 576-589.

“The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001. Initially developed as a means to federate access to diverse e-print archives through metadata harvesting and aggregation, the protocol has demonstrated its potential usefulness to a broad range of communities.”

“The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001. Initially developed as a means to federate access to diverse e-print archives through metadata harvesting (Lagoze & Van de Sompel, 2003), the protocol has demonstrated its potential usefulness to a broad range of communities. According to the Experimental OAI Registry at the University of Illinois Library at Urbana–Champaign (UIUC) (Experimental OAI Registry at UIUC, n.d.), there are currently over 300 active data providers using the production version (2.0) of the protocol from a wide variety of domains and institution types. Developers of both open source and commercial content management systems (such as D-Space and CONTENTdm) are including OAI data provider services as part of their products.”

“The OAI world is divided into data providers or repositories, which traditionally make their metadata available through the protocol, and service providers or harvesters, who completely or selectively harvest metadata from data providers, again through the use of the protocol (Lagoze & Van de Sompel, 2001).”

“As the OAI community has matured, and especially as the number of OAI repositories and the number of data sets served by those repositories has grown, it has become increasingly difficult for service providers to discover and effectively utilize the myriad repositories. In order to address this difficulty the OAI research group at UIUC has developed a comprehensive, searchable registry of OAI repositories (Experimental OAI Registry at UIUC, n.d.).”

MICHAEL K. BERGMAN, “The Deep Web: Surfacing Hidden Value” http://www.press.umich.edu/jep/07-01/bergman.html

“Traditional search engines can not "see" or retrieve content in the deep Web — those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search engine crawlers can not probe beneath the surface, the deep Web has heretofore been hidden.
The deep Web is qualitatively different from the surface Web. Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. But a direct query is a "one at a time" laborious way to search. BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.”

•“Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
•The deep Web contains 7,500 terabytes of information compared to nineteen terabytes of information in the surface Web.
•The deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.
•More than 200,000 deep Web sites presently exist.
•Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information — sufficient by themselves to exceed the size of the surface Web forty times.
•On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
•The deep Web is the largest growing category of new information on the Internet.
•Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
•Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
•Deep Web content is highly relevant to every information need, market, and domain.
•More than half of the deep Web content resides in topic-specific databases.
•A full ninety-five per cent of the deep Web is publicly accessible information — not subject to fees or subscriptions.”

“It has been said that what cannot be seen cannot be defined, and what is not defined cannot be understood. Such has been the case with the importance of databases to the information content of the Web. And such has been the case with a lack of appreciation for how the older model of crawling static Web pages — today's paradigm for conventional search engines — no longer applies to the information content of the Internet.”

“The sixty known, largest deep Web sites contain data of about 750 terabytes (HTML-included basis) or roughly forty times the size of the known surface Web. These sites appear in a broad array of domains from science to law to images and commerce. We estimate the total number of records or documents within this group to be about eighty-five billion.

Roughly two-thirds of these sites are public ones, representing about 90% of the content available within this group of sixty. The absolutely massive size of the largest sites shown also illustrates the universal power function distribution of sites within the deep Web, not dissimilar to Web site popularity or surface Web sites. One implication of this type of distribution is that there is no real upper size boundary to which sites may grow.”

“Directed query technology is the only means to integrate deep and surface Web information. The information retrieval answer has to involve both "mega" searching of appropriate deep Web sites and "meta" searching of surface Web search engines to overcome their coverage problem. Client-side tools are not universally acceptable because of the need to download the tool and issue effective queries to it. Pre-assembled storehouses for selected content are also possible, but will not be satisfactory for all information requests and needs. Specific vertical market services are already evolving to partially address these challenges. These will likely need to be supplemented with a persistent query system customizable by the user that would set the queries, search sites, filters, and schedules for repeated queries.”

No comments:

Post a Comment