Big Data Infrastructure Colloquium
NSF-funded projects supporting data-enabled science & engineering
Predictive Analytics Center of Excellence (PACE)
UCSD Library digital curation projects (Library)
Data-intensive vs. computational - focus on memory and bandwidth
Traditional system re: archival data
archival data difficult to access; difficult for multiple users to share; write-once-read-never
Few incentives for users to retain only high-value data (!)
Didn’t want to keep retaining stale data - move away from tape-based archive
New system:
Data Oasis (parallel file system)
Project storage: addresses top challenges and risks for PIs; supports replication as a backup mechanism; operates under a business model
SDSC Cloud (uses OpenStack and Rackspace; object-based storage) - easy to share via URLs, portals, etc.; always there; simple to set permissions; can log transactions (e.g., how many people downloaded that data?); scalable; used for curated data collections, site hosting, APIs, serving images/videos, and as a backup service (most used); a sharing sketch follows the challenges below
Challenges:
user interface not as useful; needed tools for large-data management; seen as too expensive by some (cost-based); data sharing is a new idea; possibly seen as redundant by researchers already using Data Oasis
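The notes describe SDSC Cloud as OpenStack-based object storage where data is shared via URLs with simple permissions. Below is a minimal sketch of what that workflow might look like using the python-swiftclient library; the auth endpoint, account, container, and file names are all placeholders, not SDSC's actual configuration.

```python
# Minimal sketch: publishing a dataset through OpenStack Swift object storage,
# the kind of workflow described for SDSC Cloud. All endpoints, credentials,
# and names below are placeholders, not SDSC's actual configuration.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://cloud.example.edu/auth/v1.0",  # placeholder auth endpoint
    user="researcher:lab01",                        # placeholder account
    key="SECRET_API_KEY",
)

container = "curated-collection"

# Create the container and upload one object.
conn.put_container(container)
with open("survey_results.csv", "rb") as f:
    conn.put_object(container, "survey_results.csv", contents=f,
                    content_type="text/csv")

# Make the container world-readable so the object can be shared by URL;
# '.r:*' is Swift's public-read ACL.
conn.post_container(container, headers={"X-Container-Read": ".r:*"})

# The shareable URL is then <storage-url>/<container>/<object>.
storage_url, _token = conn.get_auth()
print(f"{storage_url}/{container}/survey_results.csv")
```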
Looking for economies of scale (managing 3 supercomputers; running 4 separate file systems to maintain stability; what is the best way to manage them?)
No tape archive, but allocated space is kept for temporary storage (e.g., 3 months); data is then moved back to home institutions
What are relevant benchmarks for big data?
Blueprint for the Digital University: rci.ucsd.edu/_files/RCIDTReportFinal2009.pdf
- How do lab personnel work with librarians to curate their data? How do you make data shareable for those who did not create it?
- How much work is required to curate and what are the options?
- sustainable (business model)
- Embedded librarians with lab teams - metadata specialists, domain scientists - a lot of it was about metadata - free curation
- Thoughts after 16 months: reuse existing tools; use an existing digital asset management system; helped researchers better understand data organization and description in order to make data shareable (a minimal metadata sketch follows this list); put data in multiple places - it is not important where it is housed, only that it is accessible
- Data lifecycle: from creation to curation
- Curation is human and skill-based, requiring judgment; social resistance to sharing data outside the research group
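A recurring theme above is metadata: making a dataset understandable to people who did not create it. As an illustration only (the notes do not prescribe a schema), a lab might keep a small descriptive record alongside the files, loosely modeled on Dublin Core fields; every field name and value below is hypothetical.

```python
# Illustrative only: a minimal descriptive metadata record kept next to the
# data files. Field names are assumptions loosely based on Dublin Core.
import json

dataset_record = {
    "title": "Coastal sensor readings, 2012",            # placeholder
    "creator": "Example Lab, UC San Diego",              # placeholder
    "description": "Hourly temperature and salinity readings.",
    "date_created": "2012-06-01",
    "format": "text/csv",
    "rights": "CC-BY-4.0",
    "related_files": ["readings_2012.csv", "instrument_calibration.pdf"],
    "methods": "See instrument_calibration.pdf for sensor setup.",
}

# Store the record alongside the data so the description travels with the
# files, regardless of which repository eventually houses them.
with open("dataset_record.json", "w") as f:
    json.dump(dataset_record, f, indent=2)
```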
Findings
- Researchers see the data lifecycle as a single workflow
- No standard definition of a dataset (inconsistent metadata)
- stewardship
Campus should create a Data Lifecycle Advisory Council that includes campus administrators, researchers, and librarians. The council should be tasked with advising the campus on 1) what data the university should steward for the long term, 2) who should pay at each stage of the data lifecycle, and 3) how intellectual property rights should be determined.
Thought: a librarian would work with a researcher or group for a finite amount of time to set up a data management plan; after that period, consultants could be hired to do the work if needed.
Chronopolis data repository project
Shared data center: colocation (host IT equipment in an energy-efficient, staffed data center); users pay $2,500/rack/year
Reliance on pay-to-play: survey; need to stay within price points to get adoption
Survey findings - What they need
high-speed networking
reliable storage
data durability - backups/copies/tiered storage (see the checksum sketch at the end of these notes)
compliant environment (health-related data, etc.)
What they use for storage: 1) network, 2) USB
Backups: a copy on the NAS; local drive; USB; email/Dropbox
Metadata annotation requirements - 25% say yes
What risk is associated with campus-sponsored infrastructure? Whether the program will exist over the long term (longevity) and overall cost (bait and switch)
Need to get people on board in order to build something sustainable
Adoption takes time; individual touch; trusted colleagues; marketing and managing up
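The durability and backup findings above (copies on the NAS, local drives, USB) raise the question of verifying that copies remain intact. Below is a minimal sketch of checksum-based fixity checking across replicas, in the spirit of preservation services like Chronopolis; the paths are placeholders and this is not Chronopolis's actual audit tooling.

```python
# A minimal sketch of fixity checking across backup copies: compute a SHA-256
# checksum for each replica of a file and flag mismatches. Paths are
# placeholders and the layout is an assumption for illustration.
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Stream the file in chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder replica locations (e.g., primary storage, NAS copy, offsite copy).
replicas = [
    Path("/data/primary/survey_results.csv"),
    Path("/mnt/nas/backup/survey_results.csv"),
    Path("/mnt/offsite/survey_results.csv"),
]

digests = {str(p): sha256sum(p) for p in replicas if p.exists()}
if len(set(digests.values())) > 1:
    print("Fixity mismatch across replicas:", digests)
else:
    print("All replicas match.")
```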