Contributing Data

This guideline is relevant to researchers and collaborators who have worked with the Hakai Institute Juvenile Salmon Program (JSP), or collaborators who would like to contribute data to this growing database. This data will be incorporated into this R package, the Hakai Ecological Information Management System (EIMS), and receive a digital object identifier (DOI). The goal of this guide is to provide a framework for collaborators to organize their raw data into a tidy data set to facilitate this process.

Collaborators, including graduate students, should make their data available following the publication of their research papers, thesis, or dissertations. When submitting data, you should provide a tidy data set (Wickham 2014), as well as a metadata code book describing each variable in the tidy data set. We recommend you follow these guidelines as soon as possible after the start of your project, to make laying out your tidy data table framework as easy as possible.

The Tidy Data Table

The purpose of having tidy data is to enable seamless integration with other associated JSP data in the Hakai EIMS. It is compiled from raw data obtained from sample processing, which has been cleaned and organized into a standardized format. The raw data should have already undergone quality assurance/quality control; however, beyond that, data values should not have undergone any other form of modification (e.g. summarization, removals, or omissions). The general structure of a tidy data table is as follows:

  1. Each variable measured should be in a column
  2. Each unique observation of variables should be in a row
  3. Different observational units should be stored in a separate table, ie. survey data stored separately from fish data.
  4. Each table should have only one ‘type’ of variables (eg. length measurements stored separately from weather), each table should include a common column that allows them to be joined. This collection of tidy data tables forms a tidy data set with a relational database structure.

Organizing data

For each data table, what constitutes an ‘observation’ and therefore a row, may be different. It is better to store data that are at different levels in a relational hierarchy like this, in separate tables, so that each table is tidy. For example, it is common, but not ideal, to enter all your data from a survey on one row like this:

survey_id date sea_state cloud_cover temp_0m temp_1m
D123 2017-07-01 1 25 13.4 12.2
D124 2017-07-02 2 100 15.5 14.5

In the above example, the unit of observation is a survey. Including the temperature data in the survey table, as above, makes the temperature data table not tidy because there is more than one observation of the temperature variable in the same row. Therefore, variables which have multiple observations per survey, must be stored in a separate data table, as in the example below. The unit of observation in this table is a temperature measurement. Dividing up data into different tables, depending on what you classify as a unit of observation is the foundation of creating tidy data tables.

survey_id depth temp
D123 0 13.4
D123 1 12.2
D124 0 15.5
D124 1 14.5

Lastly, each data table must have one column in common—a primary key— that allows you to join data tables (survey_id in this case). Aside from having a column to relate each observation to a primary key, avoid having more than one common variable between tables. This to minimize unnecessary data replication and more importantly errors when joining tables with the same column name that aren’t the primary key.

Cleaning and formatting data

If an observation does not have a recorded variable (i.e., a blank cell), that cell should be filled with NA. Dates and times should follow the ISO format. Dates should be expressed as yyyy-mm-dd and times should be in 24-hour as hh:mm:ss. The top row of each column should be a header, and follow an unambiguous naming scheme, e.g. age_class instead of AC. Do not use spaces in column header names (underscores are preferred), and avoid capitalization at all costs.

File formatting

If you’re using excel, each tidy data table should be in a separate .CSV file (not multiple worksheets in an .xlsx workbook). The files should not have any macros applied to the data, and no columns/cells should be highlighted, emboldened, italicized, colour coded, etc.

Metadata Code Book

Included along with the submission of the tidy data set should be a code book, which contains information about each variable:

  • The full name of the variable
  • Unit of measurement
  • Full description of collection method (equipment, protocol, methods). You may want to cite papers that have full method description that you followed, or point to a manufacturer’s website for the piece of equipment you used.
  • Notes: Can be anything you think is relevant. For example, how the variable should or should not be used during analysis.

The code book should be prepared in a spreadsheet file and should be broken up so that it’s clear what data table each variable resides in. For example:

Metadata codebook

Table Variable name Units or Categories Description

Literature Cited:

Wickham, H. (2014). Tidy Data. Journal of Statistical Software. Vol. 89.10:1–23.