Tuesday, April 24, 2012

Harvard Library releases "big data" metadata

OK, this sounds cool, though it is a little hard to say what the quality control was like. Still, it is a big data set.

“This is Big Data for books,” said David Weinberger, co-director of Harvard’s Library Lab. “There might be 100 different attributes for a single object.” At a one-day test run with 15 hackers working with information on 600,000 items, he said, people created things like visual timelines of when ideas became broadly published, maps showing locations of different items, and a “virtual stack” of related volumes garnered from various locations.


Robin said...

Just to clarify, the release is of over 12 million records, not 600,000.

Unknown said...

Thanks, I corrected the title of the post, although I think the article is not really clear about what was released in the actual data set, as I mentioned in discussions on Twitter.
...but is it 12 million RECORDS, or metadata about a total of 12 million items?

The actual quote is "Harvard is making public the information on more than 12 million books, videos, audio recordings, images, manuscripts, maps, and more things inside its 73 libraries."
Each of those could have multiple records, or in some cases a record might be shared among items. It also depends on the type of record. For example, in my library, each item could have 3 records (or 1 big record with 2 subrecords), if you want to get to that level of detail.
I'm not trying to be confrontational, but records are not the same thing as items.
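
To make the records-versus-items distinction concrete, here is a rough Python sketch. The three-level layout (a bibliographic record, a holdings record, an item record) and every class and field name in it are illustrative assumptions, not the structure of Harvard's data set or of any particular library system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ItemRecord:
    barcode: str  # hypothetical field: one physical copy or volume


@dataclass
class HoldingsRecord:
    location: str  # hypothetical field: where the copies are shelved
    items: List[ItemRecord] = field(default_factory=list)


@dataclass
class BibRecord:
    title: str  # hypothetical field: the description of the work
    holdings: List[HoldingsRecord] = field(default_factory=list)


def count_records_and_items(bibs: List[BibRecord]) -> Tuple[int, int]:
    """Return (records of all three kinds, number of physical items)."""
    n_bib = len(bibs)
    n_holdings = sum(len(b.holdings) for b in bibs)
    n_items = sum(len(h.items) for b in bibs for h in b.holdings)
    return n_bib + n_holdings + n_items, n_items


catalog = [
    # One item described by three records (bib + holdings + item).
    BibRecord("Title A", [HoldingsRecord("Main", [ItemRecord("0001")])]),
    # Two items sharing a single bib record and a single holdings record.
    BibRecord("Title B", [HoldingsRecord("Main", [ItemRecord("0002"),
                                                  ItemRecord("0003")])]),
]

records, items = count_records_and_items(catalog)
print(records, items)
```

In this toy catalog the record count and the item count come out different, which is why "12 million records" and "12 million items" are not interchangeable claims.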

Robin said...

12.3 million MARC records. See for the policy in general, linking to the page on this dataset, plus more documentation. This dataset is bibs only, no holdings or item information. As for quality control, it's most of a research library catalog -- no mystery there. It'll be a mixed bag ranging from glorious rare book records to skimpy recon records keyed from cards and everything in between.

Unknown said...

Thanks so much for the info! I really do appreciate it. I realize it was just a short blog post, but it would have been useful to include the URL to the actual project page, as well as info about what type of records were used. My assumption would be MARC, but it could have been XML, DC, RDFa, or, well, almost any metadata schema. ;-) As for quality control, I was wondering about internal quality control of the project. I will take a look at the documentation. Thanks. MARC cleanup is my almost daily life... LOL

Robin said...

I agree about the link and other info! However, the NYT author wrote what he chose to write; Harvard didn't write it. The question was how best to get the word out that the metadata release was happening, and the NYT certainly did that.