Data Management and Retention

The Principal Investigator is the custodian of research data and is responsible for its collection, management, and retention. Research data must be retained in sufficient detail and for an adequate period of time to enable appropriate responses to questions about accuracy, authenticity, primacy and compliance with laws and regulations governing the conduct of the research.

The importance of data management and sharing cannot be overstated. An appropriately developed data management plan ensures:

  • Integrity of research - Research data, including detailed experimental protocols, all primary data, and procedures of reduction and analysis, are the essential components of scientific progress. Integrity requires meticulous attention to the acquisition and maintenance of research data. Questions about the integrity of the research are often answered by inspecting and reanalyzing the primary data. Planning ahead for data sharing provides for verification of results and potential extension of the research. Planning ahead for data dissemination allows distribution of the data to a broader audience.

  • Intellectual property protection - Research data are legal documents for the purpose of establishing patent rights. Legal challenges to inventorship often require producing the original data with recorded dates. Proprietary issues can also drive data access and sharing practices.

  • Ensuring confidentiality - Sponsors and/or the university may want data to be kept confidential for proprietary or security reasons. Regulations to protect human subjects may require data to be kept confidential. Confidentiality concerns will dictate how data is collected, retained and shared.

  • Compliance with sponsor requirements - Requirements can include how long data should be kept, with whom data can be shared, and who has rights to the data.

Research Repositories and Stored Data

  • A key component of research data management plans is the archiving of data, samples and other research products, and the preservation of access to them. UTech Research Computing and Infrastructure can help Principal Investigators identify the requirements that must be satisfied by any data storage solution for data created as part of the data management and data sharing plans.

Research Data Repository Requirements

  • University Policy on Collection and Retention of Research Data -- Faculty Handbook, pp. 83-85.

  • External funding agencies will each have different requirements regarding storage, retention, and availability of research data. Information regarding each agency should be gathered from the agencies themselves. Visit our Funding Resources page for common agency links.

HIPAA and Repository Security

  • The Health Insurance Portability and Accountability Act (HIPAA) Security Rule requires that everyone with access to electronic protected health information (ePHI) implement safeguards to protect against inappropriate and unauthorized access to patient health data. As stipulated in HIPAA rules and regulations:

    • Protect all ePHI created, received, maintained, or transmitted.

    • Ensure that patient data is safeguarded against potential hacking and unauthorized access.

    • Partner with HIPAA experts at CWRU to ensure compliance with the requirements of the HIPAA Security Rule.

    • Any data sharing solution involving ePHI would fall under the HIPAA Security Rule.

Research Data Repository Security

  • As creator of the data, the researcher owns the copyright to it. The copyright holder determines access and reuse of the data.

  • It is a best practice to create a rights statement explaining what use others may make of your data.

  • Consider a repository that allows an embargo of your data.

    • For a set period of time, only metadata about your data will appear.

    • An embargo note will indicate that the data is not currently available for re-use.

    • An embargo still permits awareness of your data, so that others will not duplicate your work.

    • It also allows peers to contact you about your data and its availability.

Research Data Repository Options

  • Principal Investigators at CWRU can use UTech data storage solutions such as Box for data created under their data management and data sharing plans.

  • Use an existing discipline-specific database, data repository, data enclave, or archive to store and disseminate the data. See the Data Repositories List below for additional options.

  • Use the open-access services of professional societies and journals that publish the results of research.

    • The American Institute of Physics (AIP) has a depository for material that is supplementary to papers appearing in journals published by or through AIP. Appropriate items for deposit include data tables.

  • Use existing institutional solutions or “cloud” storage.

  • Within the CWRU infrastructure, storage can be arranged on Digital Case, CWRU Google Docs, CWRU Box, CWRU OSF, or as an add-on to existing accounts already using network storage. Note: There may be some administrative/account maintenance fees in addition to physical storage fees. Costs should be properly accounted for in the grant proposal budget.

How can I get information about Data Repositories and Data Management?

  • Contact RCCI staff with any questions regarding data repositories and data management and their potential uses in faculty research. We will assist to the best of our ability.

Where can I find more information about UTech Research Computing and the HPC?

In addition to the Additional HPC Information section below, please view our Research Computing Brochure.

Data Repositories List

The following sites maintain lists of many repositories that accept research data:

The following is a selected list of data repositories available through other institutions. If you know of any other data repositories that should be included, please send the details to the UTech Service Desk. CWRU is not responsible for any of the content of the sites listed here.

  • Long Term Ecological Research Network - The Long Term Ecological Research (LTER) Network is a collaborative effort involving more than 1800 scientists and students investigating ecological processes over long temporal and broad spatial scales. The Network promotes synthesis and comparative research across sites and ecosystems and among other related national and international research programs.

  • American Mineralogist Crystal Structure Database - This site is an interface to a crystal structure database that includes every structure published in the American Mineralogist, The Canadian Mineralogist, European Journal of Mineralogy and Physics and Chemistry of Minerals, as well as selected datasets from other journals.

  • Crystallography Open Database - Once finalized, the COD will be a keyword-searchable web server of crystal structure atomic coordinates, preserving published as well as unpublished data.

  • Digital Library for Earth System Education - DLESE is a distributed community effort involving educators, students, and scientists working together to improve the quality, quantity, and efficiency of teaching and learning about the Earth system. In pursuing this mission DLESE provides access to Earth data sets and imagery, including the tools and interfaces that enable their effective use in educational settings.

  • e-Depot Nederlandse Archeologie - An archive of digital data on archaeological research from the Netherlands.

  • e-Crystals - Crystal Structure Report Archive - eCrystals - Southampton is the archive for Crystal Structures generated by the Southampton Chemical Crystallography Group and the EPSRC UK National Crystallography Service.

  • Oak Ridge National Laboratory Distributed Active Archive Center - The Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) is a NASA-sponsored source for biogeochemical and ecological data and models useful in environmental research. All of our data sets and model products are free of any costs to you (including shipping).

  • Inter-University Consortium for Political and Social Research - The Inter-university Consortium for Political and Social Research is an organization of member institutions working together to acquire and preserve social science data, provide open and equitable access to these data, and promote effective data use.

  • International Food Policy Research Institute - IFPRI provides the following types of agriculture and socio-economic datasets: Geospatial Data, Household and Community-level Surveys, Institution-level Surveys, Regional Data, and Social Accounting Matrices.

  • National Geoscience Data Repository System - The NGDRS is a system of geoscience data repositories, providing information about their respective holdings accessible through a web-based super catalog.

  • University Corporation for Atmospheric Research - Climate and atmospheric data from UCAR and other participating institutions.

  • Publishing Network for Geoscientific & Environmental Data - PANGAEA is a public digital library for science aimed at archiving, publishing and distributing geo-referenced data with special emphasis on environmental, marine and geological basic research.

  • Reciprocal Net - The Reciprocal Net is a distributed database used by research crystallographers to store information about molecular structures; much of the data is available to the general public. The Reciprocal Net project is still under development.

  • RRUFF Project - The RRUFF Project is an integrated database of Raman spectra, X-ray diffraction and chemistry data for minerals, with the goal of creating a complete set of high quality spectral data from well characterized minerals.

  • Scripps Institution of Oceanography Explorer - Data, documents and images from 822 expeditions by the Scripps Institution of Oceanography (SIO) since 1903.

  • Strasbourg Astronomical Data Center - The CDS is a data center dedicated to the collection and worldwide distribution of astronomical data and related information.

  • Data Archiving and Networked Services - DANS is responsible for providing permanent access to research material from the humanities and social sciences. The present DANS collection contains the datasets of the Netherlands Historical Data Archive (NHDA), the Steinmetz Archive and the Scientific Statistical Agency (WSA).

  • British Atmospheric Data Centre - The BADC is the Natural Environment Research Council's (NERC) Designated Data Centre for the Atmospheric Sciences.

  • NERC Earth Observation Data Centre - The NEODC is tasked with the acquisition, archiving and provision of access to remotely sensed data of the surface of the Earth acquired by satellite and airborne sensors.

  • British Oceanographic Data Centre - BODC holds a wealth of publicly accessible marine data collected using a variety of instruments and samplers and collated from many sources. It handles biological, chemical, physical and geophysical data, and its databanks contain measurements of nearly 10,000 different oceanographic variables.

  • Antarctic Environmental Data Centre - The AEDC coordinates the management of data collected by UK funded scientists in Antarctica and the Southern Ocean.

  • United Kingdom Data Archive - The UK Data Archive (UKDA) is a centre of expertise in data acquisition, preservation, dissemination and promotion and is curator of the largest collection of digital data in the social sciences and humanities in the UK.

  • Centre for Ecology & Hydrology - CEH is a major custodian of environmental data for the UK. We have significant capabilities in data collation and management, and information systems development. We use these skills, together with our data archives, to support large-scale, long-term environmental research.

  • Ensembl - Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.

  • United States National Virtual Observatory - NVO's objective is to enable new science by greatly enhancing access to data and computing resources. NVO makes it easy to locate, retrieve, and analyze data from archives and catalogs worldwide.

  • RCSB Protein Data Bank - The Protein Data Bank (PDB) is the single worldwide depository of information about the three-dimensional structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, and mice, and in healthy as well as diseased humans.

  • National Center for Biotechnology Information - Established as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.

  • European Molecular Biology Laboratory - European Bioinformatics Institute - The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data including nucleic acid, protein sequences and macromolecular structures.

  • Maize Genetics and Genomics Database - MaizeGDB is the community database for biological information about the crop plant Zea mays ssp. mays. Genetic, genomic, sequence, gene product, functional characterization, literature reference, and person/organization contact information are among the data types accessible through this site.

  • Scholars Digital Library of Analytics - The Scholars Digital Library of Analytics prides itself as an intact repository of data sets for use in research, education, and reference. Included with each set of data is a description of what the data was initially used for, its subject area, and its number of rows and columns.

  • National Nuclear Data Center - The NNDC collects, evaluates, and disseminates nuclear physics data for basic nuclear research and for applied nuclear technologies. The NNDC is a worldwide resource for nuclear data.

  • Geosciences Network - The GEON project is a collaboration among a dozen PI institutions and a number of other partner projects, institutions, and agencies to develop cyberinfrastructure in support of an environment for integrative geoscience research.

  • Incorporated Research Institutions for Seismology - IRIS is a university research consortium dedicated to exploring the Earth's interior through the collection and distribution of seismographic data. Its collection includes waveform data, channel response data, and event (earthquake) catalogs.

  • Southern California Earthquake Center - The SCEC's mission is to gather data on earthquakes in Southern California and elsewhere, integrate that information into a comprehensive and physics-based understanding of earthquake phenomena, and communicate that understanding to society at large as useful knowledge for reducing earthquake risk.

  • UNAVCO - The UNAVCO Facility exists to support research investigators in their use of Global Positioning System technology for Earth sciences research. The Facility performs this task in part by archiving GPS data and data products for current and future applications.

  • Biomedical Informatics Research Network Data Repository - To further promote a collaborative research environment, the BIRN has undertaken the development of the public BIRN Data Repository (BDR) for the biomedical research community. The BDR will provide researchers with a venue to share and exchange their data with the broader biomedical research community, providing for the means to capture, curate, store, query, view, and download imaging and related data.

  • National Center for Atmospheric Research - Data sets include information collected from research facilities and tools, as well as information from climate and weather models created and compiled by NCAR scientists and those in our science community.

  • Encyclopedia of DNA Elements - The National Human Genome Research Institute launched ENCODE to carry out a project to identify all functional elements in the human genome sequence. The project is being conducted in three phases: a pilot project phase, a technology development phase and a planned production phase.

  • The Arabidopsis Information Resource - The Arabidopsis Information Resource collects information and maintains a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant.

  • Alaska Satellite Facility - Synthetic Aperture Radar Distributed Active Archive Center - The Alaska Satellite Facility downlinks, processes, archives, and distributes SAR data from the European Space Agency's ERS-1 and ERS-2 satellites, NASDA's JERS-1 satellite, and the Canadian Space Agency's RADARSAT-1 satellite.

  • Goddard Earth Sciences Data and Information Services Center - The GES DISC is the home (archive) of precipitation and atmospheric chemistry and dynamics data and information. It is one of eight NASA Science Mission Directorate DAACs that offer Earth science data, information, and services to research scientists, applications scientists, applications users, and students.

  • Global Hydrology Resource Center - The GHRC provides both historical and current Earth science data, information, and products from satellite, airborne, and surface-based instruments. The GHRC acquires basic data streams and produces derived products from many instruments spread across a variety of instrument platforms.

  • National Oceanographic Data Center - NODC maintains and updates a national ocean archive with environmental data acquired from domestic and foreign activities and produces products and research from these data which help monitor global environmental changes. These data include physical, biological and chemical measurements derived from in situ oceanographic observations, satellite remote sensing of the oceans, and ocean model simulations.

  • Universal Protein Resource - The UniProt consortium aims to support biological research by maintaining a high quality database that serves as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community.

  • Atmospheric Radiation Measurement Climate Research Facility Data Archive - The ARM Archive supports the scientific field experiments of the Atmospheric Radiation Measurement (ARM) Program by storing and distributing the large quantities of data collected from these experiments. These data are used to research atmospheric radiation balance and cloud feedback processes, which are critical to the understanding of global climate change.

  • National Space Science Data Center - The National Space Science Data Center serves as the permanent archive for NASA space science mission data. "Space science" means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science.

  • Harvard-MIT Data Center - HMDC is the principal distributor of quantitative social science data from major international data consortia for Harvard and MIT.

  • Purdue Ionomics Information Management System - PiiMS provides integrated workflow control, data storage, and analysis to facilitate high-throughput data acquisition, along with integrated tools for data search, retrieval, and visualization for hypothesis development. PiiMS currently contains data on shoot concentrations of P, Ca, K, Mg, Cu, Fe, Zn, Mn, Co, Ni, B, Se, Mo, Na, As, and Cd in over 60,000 shoot tissue samples of Arabidopsis (Arabidopsis thaliana), including ethyl methanesulfonate, fast-neutron and defined T-DNA mutants, and natural accession and populations of recombinant inbred lines from over 800 separate experiments, representing over 1,000,000 fully quantitative elemental concentrations.

  • National Snow and Ice Data Center - NSIDC supports research into our world's frozen realms: the snow, ice, glaciers, frozen ground, and climate interactions that make up Earth's cryosphere. Scientific data, whether taken in the field or relayed from satellites orbiting Earth, form the foundation for the scientific research that informs the world about our planet and our climate systems.

  • Dryad - Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences, including ecology, biology, and medicine. It comes from the National Evolutionary Synthesis Center (NESCent) and the University of North Carolina Metadata Research Center, in coordination with a large group of journals and societies.

Questions regarding the management of research data should be addressed to the Associate Vice President for Research.