
Friday, May 3, 2013

Big Data Symposium @ UGA - my notes



NSF-funded projects supporting data-enabled science & engineering
Predictive Analytics Center of Excellence (PACE)


UCSD Library on digital curation projects


Data-intensive vs. computational: the focus is on memory and bandwidth


Traditional system (re: archival data):
Archival data was difficult to access and difficult for multiple users to share; write-once-read-never
Few incentives for users to retain only high-value data (!)
Didn’t want to retain the role of keeping stale data; moving away from a tape-based archive


New system:
Data Oasis (parallel file system)


Project storage: addresses top challenges and risks for PIs; supports replication as a backup mechanism; operates under a business model


SDSC Cloud (uses OpenStack and Rackspace; object-based storage): easy to share via URLs, portals, etc.; always there; simple to set permissions; can log transactions (e.g., how many people downloaded that data?); scalable; used for curated data collections, site hosting, APIs, serving images/videos, and backup service (most used)
Challenges: user interface not as useful; needed tools for large data management; seen as too expensive by some (cost based); data sharing is a new idea; possibly seen as redundant by researchers using Data Oasis


Looking for economy of scale (managing 3 supercomputers; running 4 separate file structures to keep stability, seen as the best way to manage)
No tape archive, but allocated space is kept for temporary storage (e.g., 3 months); data then moves back to home institutions
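
As a rough illustration of the temporary-storage idea (my own sketch, not something shown at the talk), a retention check could look like this in Python. The 90-day window is an assumed stand-in for "3 months", and the function names and project names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "3 months" approximated as 90 days; the actual policy was not stated.
RETENTION = timedelta(days=90)


def is_past_retention(stored_on: date, today: date,
                      retention: timedelta = RETENTION) -> bool:
    """Return True when data has been in temporary storage longer than the window."""
    return today - stored_on > retention


def due_for_return(allocations: dict[str, date], today: date) -> list[str]:
    """List the (hypothetical) project names whose data should move back home."""
    return sorted(name for name, stored_on in allocations.items()
                  if is_past_retention(stored_on, today))


# Example with made-up projects and dates.
allocations = {"genomics": date(2013, 1, 15), "climate": date(2013, 4, 1)}
print(due_for_return(allocations, date(2013, 5, 3)))  # prints ['genomics']
```

A real policy would also need to decide who gets notified and what happens to data that is never claimed; the notes do not say.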


What are relevant benchmarks for big data?


Blueprint for the Digital University: rci.ucsd.edu/_files/RCIDTReportFinal2009.pdf
  • How do lab personnel work with librarians to curate their data? How do you make data shareable for those who did not create it?
  • How much work is required to curate and what are the options?
  • Sustainable (business model)
  • Embedded librarians with lab teams (metadata specialists, domain scientists); a lot of it was about metadata; free curation
  • Thoughts after 16 months: reuse existing tools; use the existing digital asset management system; helped researchers better understand data organization and description in order to make data shareable; put data in multiple places, since where it is housed matters less than that it is accessible
  • Data lifecycle? From creation to curation
  • Curation is based on human skill and judgment; social resistance to sharing data outside the research group
Findings
  • Researchers see the data lifecycle as a single workflow
  • No standard definition of a dataset (inconsistent metadata)
  • stewardship
The campus should create a Data Lifecycle Advisory Council that includes campus administrators, researchers, and librarians. The council should be tasked with advising the campus on 1) what data the university should steward for the long term, 2) who should pay at each stage of the data lifecycle, and 3) how intellectual property rights should be determined


Thought: A librarian would work with a researcher or group for a finite amount of time to set up a data management plan; after that period, consultants could be hired to do the work if needed


Chronopolis: data repository project


Shared data center: colocation (host IT equipment in an energy-efficient, manned data center); users pay $2,500/rack/year
Reliance on pay-to-play: survey; need to stay within price points to get adoption


Survey findings - What they need
high speed networking
reliable storage
data durability - backups/copies/tiered storage
Compliant environment (health related, etc)


What they use for storage: 1) network 2) USB
Backups: a copy in the NAS; local drive; USB; email/Dropbox


Metadata annotation requirements: 25% say yes
What risk is associated with campus-sponsored infrastructure: whether the program would exist over the long term (longevity) and overall cost (bait and switch)


Need to get people on board in order to build something sustainable
Adoption takes time; individual touch; trusted colleagues; marketing and managing up

