Next Generation Cybertools: Very Large Semi-Structured Datasets for Social Science Research



Next Generation Cybertools

As part of the NSF's Cyberinfrastructure initiative, Cornell University has a major grant to carry out social science research on very large semi-structured datasets.

The starting point for this research is the Web. The flood of available on-line information – from Web pages to chat logs – has the potential to open up new frontiers in social science research on the collective behavior of individuals. However, there are significant obstacles to realizing this opportunity. The project team, composed of experts from the social sciences and computer science, was drawn together by the enormous promise of a unique and largely untapped dataset: the Internet Archive's collection of 40 billion Web pages. These snapshots of the Web have been captured and archived about every two months for nearly ten years. Large portions of the data are being moved to Cornell’s supercomputing center.

The research program has two major components:

The Web Lab supports two groups: researchers in computer science, the social sciences, and the humanities whose interests lie in the information on the Web, and computer scientists who carry out research on the Web as an information structure. Although based at Cornell, the collection is designed for use by researchers from other universities and research centers.

Further information

Acknowledgments

This is an NSF Next Generation Cybertools project, grant number SES-0537606. Additional support for the Web Library comes from NSF grants CNS-0403340 and DUE-0127308, Unisys, Microsoft and Dell, and from Cornell University.

This work would not be possible without the forethought and longstanding commitment of the Internet Archive to capture and preserve the content of the Web for future generations.