
Tuesday, January 7, 2014

Thoughts on cataloging, RDA, and metadata in Netflix

I have so many thoughts on this Netflix article, but they can all be summed up as humans and machines working together to organize, describe, and provide relevance (the best of both worlds!): semantic cataloging. Of course, libraries have been organizing, categorizing, and describing materials from the beginning, but RDA is a big step forward. With the end of print card catalogs and record limits (for the most part), the amount of data within a library catalog record can be much more expansive. Other library databases, like repositories and digital libraries, generally have not faced record limits, nor have they been tied to MARC (which has its own pros and cons). Of course, quantity doesn't always equal quality, but under RDA we can provide as much description as we would like.

Another aspect of RDA is breaking data up into smaller bits. Information that might have appeared only in a free-text note field, or been omitted from a library catalog record altogether, may now be included, in some cases as part of a controlled vocabulary such as relator codes. These codes describe the relationship of a particular person to a resource (author, illustrator, translator, and so on) and can be used to build different kinds of linking, relevance, and all sorts of things! Libraries could create mechanisms so that users and others can more easily use the data to dynamically build lists or collections that are relevant to them (there's the semantic aspect!). Of course, in order to use the data to make new things, it has to be open.
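To make the relator code idea a little more concrete, here is a minimal sketch in Python of how coded relationships could drive a dynamically built list. Everything in it is an assumption for illustration: the records, names, and titles are invented, and the function names `index_by_role` and `works_for` are hypothetical. Only the three-letter codes (`aut`, `ill`, `trl`) come from the MARC relator code list.

```python
from collections import defaultdict

# Invented sample records (not real catalog data). Each contributor is a
# (name, relator code) pair: aut = author, ill = illustrator, trl = translator.
RECORDS = [
    {"title": "Book A", "contributors": [("Smith, Jane", "aut"), ("Lee, Sam", "ill")]},
    {"title": "Book B", "contributors": [("Lee, Sam", "aut")]},
    {"title": "Book C", "contributors": [("Smith, Jane", "trl"), ("Lee, Sam", "ill")]},
]


def index_by_role(records):
    """Build a lookup from (name, relator code) to the titles that match."""
    index = defaultdict(list)
    for record in records:
        for name, code in record["contributors"]:
            index[(name, code)].append(record["title"])
    return index


def works_for(index, name, code):
    """Return the titles where this person has this role (empty list if none)."""
    return list(index.get((name, code), []))


index = index_by_role(RECORDS)
print(works_for(index, "Lee, Sam", "ill"))  # ['Book A', 'Book C']
```

Because the role is a code rather than a phrase buried in a note, "everything Sam Lee illustrated" and "everything Sam Lee wrote" become two different, reliable lists from the same data.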

Netflix has had a similar evolution in metadata. Thinking about what our next-gen library catalog systems could be like, let's look at what Netflix has done (and what a few folks have done with its data, which could only happen with at least some of the data being open).

It starts with people creating data and machine data collection:

"They [workers] capture dozens of different movie attributes. They even rate the moral status of characters. When these tags are combined with millions of users viewing habits, they become Netflix's competitive advantage. "

Much like traditional cataloging work, tagging is only as good as the tagger. The advantage that libraries have had is that the staff who do this sort of work (cataloging) most likely have some sort of training or relevant education.

In most popular social media (Facebook, Twitter, etc.) and media-sharing sites (Flickr, YouTube, etc.), sub-tags, if any, are limited. The most common are geographic (GIS, frequently from phone or camera GPS coordinates in the EXIF metadata), subjects (topics as input by the uploader or tagger), names (the user who uploaded the item or who tags other users in it), dates (when the item was uploaded), access (public/private/select user group), system file information (file format, name, etc.), and rights (copyright, permissions, etc.). For some image sites, EXIF data will automatically be loaded in, most frequently date, type of camera, file information, and general image specs (size, resolution, etc.); other information, such as rights (copyright), is less likely to be picked up. Facebook's support of metadata is marginal* (EXIF metadata is stripped out), and while Flickr does support the most metadata for images*, it relies primarily on the user to fill out the forms correctly to describe and assign the metadata. (See for more information about EXIF and social media).

In terms of search, crowdsourced metadata can be a challenge. It is only as good (and complete!) as the user who creates it. If you have ever searched for hashtags in Twitter or tags in Flickr, you will see they are used every way imaginable. Hashtags are used as statements (#fail #thisisstupid #greatread), duplicated (#ala, where multiple things share the same keyword), or misspelled (#teh for "the"), with little in the way of quality control placed on them.

However, there is some structure in place, which facilitates searching by hashtag/tag vs. date.

While libraries have had better systems in that the metadata was created by experts and experienced staff, much of the data in a traditional MARC record is unstructured. Funny, no? We think of MARC as being so structured, and it is in terms of field order and use and the fixed fields (character placement is essential there), but it is not so structured within some fields, like the 5XX fields or even the 245 (title/statement of responsibility) field. As long as the indicators are correct and the subfields are input correctly, the content within that field is really a type of free text, albeit with some rules for inputting. For example, while the 245 was and remains under RDA a transcription field (key it as you see it), there are still "shortcuts" (i.e., ways to minimize data recorded) under RDA (See: a nice overview of changes between AACR2 and RDA). So, while it's transcription, it's not always exactly word for word (albeit more so with RDA).
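A small sketch in Python shows where the structure in a MARC field stops. It takes a display-style field string (the 650 heading quoted further down in this post), pulls out the tag and the subfield codes, and leaves each subfield's content as an uninterpreted string. The function name `parse_field` is hypothetical, the "$" delimiter convention is just how the field is written out here, and indicators are deliberately skipped; this is not how any particular library system parses MARC.

```python
import re


def parse_field(line):
    """Split a display-style MARC field into its tag and (code, value) subfields.

    Simplifying assumptions: the tag is the first three characters, subfields
    are introduced by "$" plus a one-character code, and indicators are ignored.
    """
    tag = line[:3]
    pairs = re.findall(r"\$(\w)\s*([^$]*)", line)
    subfields = [(code, value.strip()) for code, value in pairs]
    return tag, subfields


heading = ("650 0  $a Beach erosion $z Florida $z Pensacola Beach "
           "$x History $y 20th century $v Bibliography.")
tag, subfields = parse_field(heading)
print(tag)  # 650
for code, value in subfields:
    # The code is machine-readable; the value is just text.
    print(code, "->", value)
```

The tag and subfield codes can be read by a program with no trouble. What sits inside each subfield is still text keyed by a person, which is why its usefulness depends on the rules (and the cataloger) behind it.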

The third major component is that the data is open, or at least partially open. With siloed data, this experiment would not have been possible. Siloed data is also much harder for others to use.

So, how was Netflix able to make this successful from a metadata standpoint?

  • a defined (controlled) vocabulary (subject headings, authorities): "The same adjectives appeared over and over. Countries of origin also showed up, as did a larger-than-expected number of noun descriptions like Westerns and Slasher..."
  • a structure (for catalogers, similar to how subject headings are formatted in a traditional library catalog); in Netflix:
    • Region, Awards named first (at least for Oscars)
    • Adjectives (Keywords, subject headings)
    • Dates and places named last (akin to a geographic subdivision)
"If a movie was both romantic and Oscar-winning, Oscar-winning always went to the left: Oscar-winning Romantic Dramas. Time periods always went at the end of the genre: Oscar-winning Romantic Dramas from the 1950s....
In fact, there was a hierarchy for each category of descriptor. Generally speaking, a genre would be formed out of a subset of these components:
Region + Adjectives + Noun Genre + Based On... + Set In... + From the... + About... + For Age X to Y"
Akin to traditional subject headings:
651 0  $a Sardinia (Italy) $v Maps $v Early works to 1800.
650 0  $a Beach erosion $z Florida $z Pensacola Beach $x History $y 20th century $v Bibliography.

  •  data bits that can be repackaged: "little "packets of energy" that compose each movie.... "microtag."" (the smaller the data bits, the more they can be repackaged in different ways) 
 "Netflix's engineers took the microtags and created a syntax for the genres..... "

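The genre syntax quoted above can be sketched in a few lines of Python. This is only an illustration of the idea of a fixed slot order, not Netflix's actual system: the slot names, the `PREFIXES` wording, and the function `build_genre` are all assumptions made up for this example. The one thing taken from the article is the order of the components.

```python
# Slot order taken from the quoted hierarchy:
# Region + Adjectives + Noun Genre + Based On + Set In + From the + About + For Age
SLOT_ORDER = [
    "region", "adjectives", "noun_genre",
    "based_on", "set_in", "from_the", "about", "for_age",
]

# Assumed connecting words for the trailing slots (hypothetical wording).
PREFIXES = {
    "based_on": "Based on ",
    "set_in": "Set in ",
    "from_the": "from the ",
    "about": "About ",
    "for_age": "For Ages ",
}


def build_genre(microtags):
    """Assemble a genre label from a dict of microtags, always in SLOT_ORDER."""
    parts = []
    for slot in SLOT_ORDER:
        value = microtags.get(slot)
        if not value:
            continue  # empty or missing slots are simply left out
        if isinstance(value, (list, tuple)):
            # Several adjectives: the caller supplies them in display order.
            value = " ".join(value)
        parts.append(PREFIXES.get(slot, "") + value)
    return " ".join(parts)


# The tags are given in a jumbled order on purpose; the syntax sorts them out.
print(build_genre({
    "from_the": "1950s",
    "noun_genre": "Dramas",
    "adjectives": ["Oscar-winning", "Romantic"],
}))
# Oscar-winning Romantic Dramas from the 1950s
```

The point for catalogers is that the small pieces carry no order of their own. The order lives in one place (the syntax), so the same pieces can be recombined into thousands of labels, much like subdivisions in a subject heading string.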
Thinking back to next-gen systems: RDA provides a fairly good foundation for going beyond the traditional catalog. When done right (more vs. less, quality AND quantity), cataloging will net structured data bits that can be repackaged, plus relationship information that can provide links between previously unrelated items (at least within the catalog), provided the data is open to be used and mechanisms are built so that users can create their own catalog experience. In that world, cataloging truly becomes semantic.

Open Bibliographic Data:
AACR2 compared to RDA, field by field:
How Netflix Reverse Engineered Hollywood:

*Disclaimer: I have no idea what the backend systems of sites do with metadata; my thoughts are based upon the user experience. 
