iRODS research data storage

Intro

This project aims to provide a data storage solution for all research data across its entire life span (Data Life Cycle, DLC). All services are supported by a joint effort of UU-ICT and ICT-Beta. We are currently running a pilot implementation with the Cell Biology group to store research data using services provided by Research-IT and cloud storage. The pilot uses an iRODS server that provides access to cloud storage and analysis of metadata.

Access

Log in with your email address as the username (lower case only) and your Solis password:
  • get Cyberduck and use it to open https://science.data.uu.nl
  • configure Cyberduck for the iRODS protocol when transferring large files or many files:
  1. download the Irods.cyberduckprofile
  2. open this profile in Cyberduck (use "Open with" and point it to Cyberduck)
  3. fill in the username as PAM:j.doe@uu.nl (email address in lower case only)
  4. close the screen and then open the science.data.uu.nl - Irods preset
  5. enter your solis password
  6. navigate up one directory by pressing the triangle icon in the upper right corner
General information on using Cyberduck: http://irods.org/2015/09/howtocyberduck/
  • alternative (not recommended): mount a WebDAV drive in the file manager of your OS (Windows/Mac), see: Personal_storage
  • mounting the storage under Linux (Ubuntu 16):
  1. Connect from your file manager using the "connect to" option: davs://science.data.uu.nl
  2. Make a mount:
   sudo apt-get install davfs2
   sudo mkdir  /mnt/irods
   sudo mount.davfs 'https://science.data.uu.nl' /mnt/irods
For automatic mounting and storing your username/password, see: https://wiki.archlinux.org/index.php/Davfs
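
As a rough illustration of storing the username and password for automatic mounting: davfs2 reads credentials from a secrets file (~/.davfs2/secrets per user, or /etc/davfs2/secrets system-wide) with one line per share holding the URL, the username and the password. The sketch below is not an official setup procedure. The helper name add_davfs_secret is made up for this example, and j.doe@uu.nl is the placeholder account used earlier on this page.

```shell
# Hypothetical helper: append a davfs2 credentials line to a secrets file
# and restrict the file to its owner (davfs2 expects mode 600).
# Usage: add_davfs_secret SECRETS_FILE URL USERNAME PASSWORD
add_davfs_secret() {
    mkdir -p "$(dirname "$1")"
    printf '%s %s %s\n' "$2" "$3" "$4" >> "$1"
    chmod 600 "$1"
}

# Example call (placeholder account and password):
#   add_davfs_secret ~/.davfs2/secrets https://science.data.uu.nl j.doe@uu.nl your-solis-password
#
# A matching /etc/fstab line, so that "mount /mnt/irods" works for a normal user:
#   https://science.data.uu.nl /mnt/irods davfs user,noauto 0 0
```

On Ubuntu, mounting as a normal user typically also requires membership of the davfs2 group and allowing non-root mounts (sudo dpkg-reconfigure davfs2). Passwords containing spaces or special characters need quoting in the secrets file, so check the davfs2 manual before relying on this sketch.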

Metadata

Adding metadata

Metadata is used to classify research data independently of the directory structure. This enables you to find data later on in a search-engine-like way. iRODS can be configured to extract metadata from the files themselves if the syntax is known (e.g. TIFF files), but this is not standard. Below you will find the naming scheme that can be used to add metadata to any directory. Adding a metadata.txt file to a folder adds this metadata to all data contained in the folder, including subfolders. For classifications that vary further down the tree, add another metadata.txt file containing only the changes. You can use a metadata_SUBTITLE.txt file naming scheme for your own purposes (replace SUBTITLE with an appropriate name).
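
The inheritance rule described above (a metadata.txt applies to its folder and all subfolders, and a deeper metadata.txt contains only the changes) can be sketched locally as follows. This is only an illustration of the rule, not how the iRODS server processes the files. It assumes that every line in metadata.txt has the form "name: value", which this page does not specify, and the function name effective_metadata is made up for the example.

```shell
# Assumption: every metadata.txt line looks like "name: value".
# Print the effective metadata for directory $2, which must lie inside data root $1.
# Files are read from the root downwards, so a deeper metadata.txt overrides
# only the names it repeats and inherits everything else.
effective_metadata() {
    root=${1%/}
    rel=${2#"$root"}
    rel=${rel#/}
    dir=$root
    {
        if [ -f "$dir/metadata.txt" ]; then cat "$dir/metadata.txt"; echo; fi
        # Walk down one path component at a time.
        while [ -n "$rel" ]; do
            dir=$dir/${rel%%/*}
            case $rel in
                */*) rel=${rel#*/} ;;
                *)   rel= ;;
            esac
            if [ -f "$dir/metadata.txt" ]; then cat "$dir/metadata.txt"; echo; fi
        done
    } | awk '
        index($0, ":") > 0 {
            key = substr($0, 1, index($0, ":") - 1)
            value = substr($0, index($0, ":") + 1)
            sub(/^ */, "", value)
            # Remember the order in which names first appear.
            if (!(key in seen)) { seen[key] = 1; order[++n] = key }
            current[key] = value
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " current[order[i]] }
    '
}
```

For example, if the data root sets dataType to raw and a subfolder's metadata.txt contains only "dataType: analysis", calling effective_metadata with the root and that subfolder prints all names from the root with dataType replaced by analysis.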

Naming scheme general metadata

name | type | possible values | description
ownerID | alphaNum | solisID/groupID | conform to existing ID's as much as possible
deviceID | alphaNum | deviceID | enter the ID of the device used to obtain the data
projectName | alphaNum | projectName | official name of the project
localName | alphaNum | localName | make up your own naming scheme for local purposes
startDate | date | dd/mm/yyyy | start date of project
endDate | date | dd/mm/yyyy | planned end date of project
storageReferenceID | alphaNum | create syntax for storagereferenceID | this ID can be used to refer to existing physical storage
dataType | alpha | raw / analysis / archive / re-use | classification describes the lifecycle stage of the data
deleteAfter | alphaNum | 1 / 3 / 5 / 10 / never / delete | number of years data should be kept after endDate
confidentiality | alpha | basic / sensitive / critical / public | following BIV-classification this determines privacy and legal status
description | alphaNum, 250 chars | free text | any additional data you might want to add

remarks:

  • default values are bold
  • the naming scheme can be extended to suit your own dataset, but extensions should always be documented for proper use by the entire research group
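
Before uploading, a metadata.txt file can be checked against the naming scheme above. The following is a minimal sketch under the assumption that each line has the form "name: value" (the actual file syntax is not specified on this page). The function name check_metadata is made up for the example, and the date check only tests the dd/mm/yyyy pattern, not whether the date exists.

```shell
# Assumption: every metadata.txt line looks like "name: value".
# Print one line per problem found in the metadata file $1; print nothing if all checks pass.
check_metadata() {
    awk '
        function report(k, v, why) {
            print k ": " why " (got \"" v "\")"
        }
        index($0, ":") == 0 { next }    # skip lines without a name
        {
            key = substr($0, 1, index($0, ":") - 1)
            value = substr($0, index($0, ":") + 1)
            sub(/^ */, "", value)
        }
        (key == "startDate" || key == "endDate") && value !~ /^[0-3][0-9]\/[01][0-9]\/[0-9][0-9][0-9][0-9]$/ {
            report(key, value, "expected dd/mm/yyyy")
        }
        key == "dataType" && value !~ /^(raw|analysis|archive|re-use)$/ {
            report(key, value, "expected raw, analysis, archive or re-use")
        }
        key == "deleteAfter" && value !~ /^(1|3|5|10|never|delete)$/ {
            report(key, value, "expected 1, 3, 5, 10, never or delete")
        }
        key == "confidentiality" && value !~ /^(basic|sensitive|critical|public)$/ {
            report(key, value, "expected basic, sensitive, critical or public")
        }
        key == "description" && length(value) > 250 {
            report(key, value, "longer than 250 characters")
        }
    ' "$1"
}
```

Names that are not part of the general scheme are ignored by this sketch, so group-specific extensions pass unchecked.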

I also need this!

ICT-Beta can guide you through the process of arriving at a research data storage solution.

You can start by answering the following questions:

  • how would you describe your current research data storage setup/situation? (good/not good enough/bad)
If not good: what needs to be done to improve this?
  • what kind of data do you want to store?
Use the Data Life Cycle to classify the kind of data:
gathering data / processing data: measurements / processing data: analysis / archiving data (consolidate) / publicizing data (results) / re-using data (follow-up)
  • how much storage do you need (by DLC category, in TB), and what is the expected growth rate? (default: +20% per year)
remark: the quality and speed of storage are derived from the DLC category
  • what is your first priority in properly storing research data? E.g.: current research (raw data or data analysis, used as a back-up or as live data), archive of previous research, publicizing data
  • when would you like to start? How much data would you be able to classify and move per month? Also see:
  • appoint a data-steward to guide the transition process of your data
  • what kind of specific metadata do you use/need in order to classify your research data? (think about data retrieval later on)
  • how long do you want to store the data for? (e.g. 1/3/5/10 years or indefinitely)
  • also think about confidentiality: basic / sensitive / critical / none
  • what is your current financial investment in research data storage? Do you receive any specific budget for storage of research data?
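
To get a feeling for the storage question above, note that the default growth rate of +20% per year compounds. The sketch below projects the required capacity per year. The function name project_storage and the starting size in the example are illustrative assumptions, not figures from the pilot.

```shell
# Print the projected storage need for each year, assuming compound growth.
# Usage: project_storage START_TB GROWTH_PERCENT YEARS
project_storage() {
    awk -v start="$1" -v growth="$2" -v years="$3" 'BEGIN {
        size = start
        for (y = 0; y <= years; y++) {
            printf "year %d: %.2f TB\n", y, size
            size *= 1 + growth / 100
        }
    }'
}

# Example: 10 TB today, growing 20% per year, over 5 years.
project_storage 10 20 5
```

With these example numbers the need grows from 10 TB to about 24.88 TB after five years, roughly two and a half times the starting size.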