Data management basics

From Larkum Lab

Principles

  1. Back up as much as possible.
  2. Separate code and data: there are several Template repositories for GIN.
  3. Keep track of everything you do, as far as possible: I recommend using a version-control system, such as GIN.
  4. Cite your sources by URL: this is related to Publishing your data.

More info may be found in Organizing your code and data.

Terminology

Metadata

"Metadata" refers to:

  1. the parameters you controlled during the experiments to obtain your raw data, as well as
  2. the uncontrolled parameters during the experiments that you think may have affected your raw data.

Raw data

"Raw data" is what you get through observation / acquisition after controlling various conditions.

Derived data

"Derived data" is any other type of data that you obtain by processing the raw data and the metadata.

Notes

The distinction between raw and derived data may correlate with the file format, but it is NOT determined by the file format. For example, your electrophysiology acquisition may always produce a set of ".ibw" files, but the ".ibw" files that you generate by processing them (e.g. the I-V curves) are considered derived data.
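As a minimal sketch of how the metadata defined above might be kept next to a raw data file, the example below stores the controlled and uncontrolled parameters in a JSON file. All file names, field names and values are hypothetical placeholders, not a lab convention.

```python
import json
from pathlib import Path

# Hypothetical metadata record for one raw data file.
# Every key and value here is an illustrative assumption.
metadata = {
    "raw_data_file": "cell01_trial03.ibw",
    "controlled": {  # parameters you set during the experiment
        "stimulus_step_pA": 50,
        "bath_temperature_C": 34,
    },
    "uncontrolled": {  # parameters you only observed
        "room_temperature_C": 22.5,
        "time_since_slicing_h": 3.0,
    },
}


def save_metadata(record: dict, path: Path) -> None:
    """Write a metadata record as human-readable JSON."""
    text = json.dumps(record, indent=2, sort_keys=True)
    path.write_text(text, encoding="utf-8")


def load_metadata(path: Path) -> dict:
    """Read a metadata record back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling save_metadata(metadata, Path("cell01_trial03.json")) would write the record next to the raw file, so that any derived data can point back to both.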

Which file format can I use to store my data?

"Ideal" options

Most open-data advocates would encourage you to use an "open" format (normally meaning one that is openly maintained by a community, rather than maintained as a proprietary asset by a company). These formats include:

  • ".txt" files (raw text)
  • ".md", ".adoc", ".rst", or ".htm" files (raw text-based, formatted documents; Markdown, AsciiDoc, reStructuredText or HTML)
  • ".csv" or ".tsv" files (raw text-based tables; comma-separated (CSV) or tab-separated (TSV))
  • ".png", ".jpg", or ".tif" files (open image formats)
  • ".xml", ".json", or ".yaml" files (structured data formats)
  • ".h5" files (HDF5-archived data)
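To illustrate why the text-based formats above are easy to work with, here is a minimal sketch that writes a small numeric table to a ".csv" file and reads it back using only the Python standard library. The function names, column names and values are made up for illustration.

```python
import csv


def write_table(path, header, rows):
    """Write a header line and numeric rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_table(path):
    """Read a CSV file written by write_table; returns (header, rows)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [[float(value) for value in row] for row in reader]
    return header, rows
```

For example, write_table("iv_curve.csv", ["current_pA", "voltage_mV"], [(-50.0, -72.1), (0.0, -65.0)]) produces a file that any spreadsheet program or text editor can open.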

Practical options

What if your data is not in one of the formats listed above (which is most likely)? Don't worry. In most cases, there are already community efforts to read those types of data on different platforms without paying for expensive software. As long as a file can be read (i) free of charge and (ii) without being tied to a particular OS (e.g. Windows or UNIX), I would say it is practically "open".

For example:

  • Using free libraries in Python, you can open:
    • Matlab data files (with ".mat" extension) using scipy.
    • Igor Pro wave files (with ".ibw" extension), Spike2 binary files (with ".smr" extension) and many more (Axon, Blackrock, Plexon, Intan etc.) using neo.
  • Fiji or ImageJ can open image files of various formats, including those specific to certain microscope manufacturers (Zeiss, Olympus etc.).
  • VLC can open video files of various formats.
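The Python options above can be sketched roughly as follows. This assumes that scipy and neo (with its optional Igor reader dependency) are installed; the helper names load_mat_variables and load_igor_wave are my own and not part of either library. Note that scipy.io.loadmat does not read MATLAB v7.3 files, which are HDF5-based.

```python
def load_mat_variables(path):
    """Return the user variables stored in a MATLAB .mat file as a dict."""
    from scipy.io import loadmat  # imported lazily; requires SciPy

    contents = loadmat(path)
    # loadmat also returns bookkeeping entries such as "__header__";
    # keep only the variables that were actually saved.
    return {name: value for name, value in contents.items()
            if not name.startswith("__")}


def load_igor_wave(path):
    """Read an Igor Pro .ibw file into a Neo Block object."""
    from neo.io import IgorIO  # imported lazily; requires neo

    return IgorIO(filename=path).read_block()
```

Other formats supported by neo follow the same pattern with a different IO class (for example neo.io.Spike2IO for ".smr" files).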

If your format is not listed above, you can always Google it.

Unrecommended options

Unfortunately, there are always some formats that cannot be opened without paying for software, or that require a specific platform (often Windows). For example:

  • The ".fig" and compiled binary files of Matlab
  • The ".smrx" compressed binary files of Spike2
  • Videos encoded using the Matrox Imaging Library (MIL)

It is difficult to say what to do with these files, because in many cases there is no other option. How to define the raw data is also always a matter of discussion (for example, can we still consider it "raw data" if the original data has been converted into another format?). But my recommendation is:

  • Backing them up and storing them safely is better than doing nothing with them.
  • If possible, also export your data to another, more "open" format:
    • For Spike2 data files, exporting to ".smr" files makes them much more portable.
    • For MIL video files, you have the option of exporting them to H.264 video files (e.g. by using ZR View).
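If you back up files that you cannot open elsewhere, it may be worth checking that the copy is byte-for-byte identical to the original. The sketch below compares SHA-256 checksums using the Python standard library; it is one possible approach rather than a lab requirement, and the function names are my own.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read chunk by chunk so that large recordings fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original, backup):
    """True if the two files have identical contents."""
    return sha256_of(original) == sha256_of(backup)
```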

Use of version-control systems

I encourage you to use a version-control system (VCS) to manage your project because it can help you throughout the process of data management:

  1. A VCS helps you keep track of your project while keeping your desktop clean.
  2. At the same time, a VCS can serve as a cloud back-up storage.
  3. A VCS (and its cloud/online counterpart) helps you share your data and results with others (in many cases, it also comes with the option of data publication).

I recommend the use of GIN, with some precautions.

Problems that a VCS can solve

TODO

Core concepts

TODO