Big Data Infrastructure Colloquium
NSF-funded projects supporting data-enabled science & engineering
Predictive Analytics Center of Excellence (PACE)
UCSD Library digital curation projects (Library)
Data-intensive vs. computational - focus on memory and bandwidth
Traditional system re: archival data
archival data difficult to access; difficult for multiple users to share; write-once-read-never
Few incentives for users to retain only high-value data (!)
Didn’t want to keep retaining stale data - move away from tape-based archive
New system:
Data Oasis (parallel file system)
Project storage: addresses top challenges and risks for PIs; supports replication as a backup mechanism; operates under a business model
SDSC Cloud (uses OpenStack and Rackspace; object-based storage) - easy to share via URLs, portals, etc.; always there; simple to set permissions; can log transactions (e.g., how many people downloaded that data?); scalable; used for curated data collections, site hosting, APIs, serving images/videos, and as a backup service (most used); a sharing sketch follows the challenges below
Challenges:
user interface not as useful; needed tools for large-data management; seen as too expensive by some (cost-based); data sharing is a new idea; possibly seen as redundant by researchers already using Data Oasis
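The notes describe SDSC Cloud as OpenStack-based object storage where data is shared via URLs with simple permissions. Below is a minimal sketch of what that workflow might look like using the python-swiftclient library; the auth endpoint, account, container, and file names are all placeholders, not SDSC's actual configuration.

```python
# Minimal sketch: publishing a dataset through OpenStack Swift object storage,
# the kind of workflow described for SDSC Cloud. All endpoints, credentials,
# and names below are placeholders, not SDSC's actual configuration.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://cloud.example.edu/auth/v1.0",  # placeholder auth endpoint
    user="researcher:lab01",                        # placeholder account
    key="SECRET_API_KEY",
)

container = "curated-collection"

# Create the container and upload one object.
conn.put_container(container)
with open("survey_results.csv", "rb") as f:
    conn.put_object(container, "survey_results.csv", contents=f,
                    content_type="text/csv")

# Make the container world-readable so the object can be shared by URL;
# '.r:*' is Swift's public-read ACL.
conn.post_container(container, headers={"X-Container-Read": ".r:*"})

# The shareable URL is then <storage-url>/<container>/<object>.
storage_url, _token = conn.get_auth()
print(f"{storage_url}/{container}/survey_results.csv")
```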
Looking for economies of scale (managing 3 supercomputers; running 4 separate file systems to maintain stability; what is the best way to manage them?)
No tape archive, but allocated space is kept for temporary storage (e.g., 3 months); data is then moved back to home institutions
What are relevant benchmarks for big data?
Blueprint for the Digital University: rci.ucsd.edu/_files/RCIDTReportFinal2009.pdf
- How do lab personnel work with librarians to curate their data? How do you make data shareable for those who did not create it?
- How much work is required to curate and what are the options?
- sustainable (business model)
- Embedded librarians with lab teams - metadata specialists, domain scientists - a lot of it was about metadata - free curation
- Thoughts after 16 months: reuse existing tools; use an existing digital asset management system; helped researchers better understand data organization and description in order to make data shareable (a minimal metadata sketch follows this list); put data in multiple places - it is not important where it is housed, only that it is accessible
- Data lifecycle: from creation to curation
- Curation is human and skill-based, requiring judgment; social resistance to sharing data outside the research group
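A recurring theme above is metadata: making a dataset understandable to people who did not create it. As an illustration only (the notes do not prescribe a schema), a lab might keep a small descriptive record alongside the files, loosely modeled on Dublin Core fields; every field name and value below is hypothetical.

```python
# Illustrative only: a minimal descriptive metadata record kept next to the
# data files. Field names are assumptions loosely based on Dublin Core.
import json

dataset_record = {
    "title": "Coastal sensor readings, 2012",            # placeholder
    "creator": "Example Lab, UC San Diego",              # placeholder
    "description": "Hourly temperature and salinity readings.",
    "date_created": "2012-06-01",
    "format": "text/csv",
    "rights": "CC-BY-4.0",
    "related_files": ["readings_2012.csv", "instrument_calibration.pdf"],
    "methods": "See instrument_calibration.pdf for sensor setup.",
}

# Store the record alongside the data so the description travels with the
# files, regardless of which repository eventually houses them.
with open("dataset_record.json", "w") as f:
    json.dump(dataset_record, f, indent=2)
```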
Findings
- Researchers see the data lifecycle as a single workflow
- No standard definition of a dataset (inconsistent metadata)
- stewardship
Campus should create a Data Lifecycle Advisory Council that includes campus administrators, researchers, and librarians. The council should be tasked with advising the campus on 1) what data the university should steward for the long term, 2) who should pay at each stage of the data lifecycle, and 3) how intellectual property rights should be determined.
Thought: a librarian would work with a researcher or group for a finite amount of time to set up a data management plan; after that period, consultants could be hired to do the work if needed.
Chronopolis data repository project
Shared data center: colocation (host IT equipment in an energy-efficient, staffed data center); users pay $2,500/rack/year
Reliance on pay-to-play: survey; need to stay within price points to get adoption
Survey findings - What they need
high-speed networking
reliable storage
data durability - backups/copies/tiered storage (see the checksum sketch at the end of these notes)
compliant environment (health-related data, etc.)
What they use for storage: 1) network, 2) USB
Backups: a copy on the NAS; local drive; USB; email/Dropbox
Metadata annotation requirements - 25% say yes
What risk is associated with campus-sponsored infrastructure? Whether the program will exist over the long term (longevity) and overall cost (bait and switch)
Need to get people on board in order to build something sustainable
Adoption takes time; individual touch; trusted colleagues; marketing and managing up
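The durability and backup findings above (copies on the NAS, local drives, USB) raise the question of verifying that copies remain intact. Below is a minimal sketch of checksum-based fixity checking across replicas, in the spirit of preservation services like Chronopolis; the paths are placeholders and this is not Chronopolis's actual audit tooling.

```python
# A minimal sketch of fixity checking across backup copies: compute a SHA-256
# checksum for each replica of a file and flag mismatches. Paths are
# placeholders and the layout is an assumption for illustration.
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Stream the file in chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder replica locations (e.g., primary storage, NAS copy, offsite copy).
replicas = [
    Path("/data/primary/survey_results.csv"),
    Path("/mnt/nas/backup/survey_results.csv"),
    Path("/mnt/offsite/survey_results.csv"),
]

digests = {str(p): sha256sum(p) for p in replicas if p.exists()}
if len(set(digests.values())) > 1:
    print("Fixity mismatch across replicas:", digests)
else:
    print("All replicas match.")
```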