vignettes/data_contributing_guidelines.Rmd
This guideline is for researchers and collaborators who have worked with the Hakai Institute Juvenile Salmon Program (JSP), and for collaborators who would like to contribute data to this growing database. Contributed data will be incorporated into this R package and the Hakai Ecological Information Management System (EIMS), and will receive a digital object identifier (DOI). The goal of this guide is to provide a framework for collaborators to organize their raw data into a tidy data set to facilitate this process.
Collaborators, including graduate students, should make their data available following the publication of their research papers, theses, or dissertations. When submitting data, you should provide a tidy data set (Wickham 2014), as well as a metadata code book describing each variable in the tidy data set. We recommend you follow these guidelines as early in your project as possible, so that laying out your tidy data tables is as easy as possible.
The purpose of having tidy data is to enable seamless integration with other associated JSP data in the Hakai EIMS. A tidy data set is compiled from raw data obtained from sample processing that have been cleaned and organized into a standardized format. The raw data should already have undergone quality assurance/quality control; beyond that, data values should not have undergone any other form of modification (e.g. summarization, removals, or omissions). The general structure of a tidy data table is as follows:

- Each variable is stored in its own column.
- Each observation is stored in its own row.
- Each type of observational unit is stored in its own table.

What constitutes an ‘observation’, and therefore a row, may differ between data tables. It is better to store data that sit at different levels of a relational hierarchy in separate tables, so that each table is tidy. For example, it is common, but not ideal, to enter all your data from a survey on one row like this:
survey_id | date | sea_state | cloud_cover | temp_0m | temp_1m |
---|---|---|---|---|---|
D123 | 2017-07-01 | 1 | 25 | 13.4 | 12.2 |
D124 | 2017-07-02 | 2 | 100 | 15.5 | 14.5 |
In the above example, the unit of observation is a survey. Including the temperature data in the survey table, as above, makes the table untidy because there is more than one observation of the temperature variable in the same row. Therefore, variables that have multiple observations per survey must be stored in a separate data table, as in the example below, where the unit of observation is a temperature measurement. Dividing data into different tables according to the unit of observation is the foundation of creating tidy data tables.
survey_id | depth | temp |
---|---|---|
D123 | 0 | 13.4 |
D123 | 1 | 12.2 |
D124 | 0 | 15.5 |
D124 | 1 | 14.5 |
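
The split shown in the two tables above can be sketched in a few lines of Python. This is only an illustration using the example values; the function name `split_survey_table` is hypothetical, and it assumes temperature columns are named `temp_<depth>m` as in the example.

```python
# Wide survey rows from the example above: one row per survey,
# with temperature spread across the temp_0m and temp_1m columns.
wide_rows = [
    {"survey_id": "D123", "date": "2017-07-01", "sea_state": 1,
     "cloud_cover": 25, "temp_0m": 13.4, "temp_1m": 12.2},
    {"survey_id": "D124", "date": "2017-07-02", "sea_state": 2,
     "cloud_cover": 100, "temp_0m": 15.5, "temp_1m": 14.5},
]


def split_survey_table(rows):
    """Return (surveys, temperatures).

    surveys has one row per survey; temperatures has one row per
    temperature measurement, linked back by survey_id.
    Assumes temperature columns are named temp_<depth>m.
    """
    surveys, temperatures = [], []
    for row in rows:
        survey = {}
        for column, value in row.items():
            if column.startswith("temp_") and column.endswith("m"):
                depth = int(column[len("temp_"):-1])  # "temp_1m" -> 1
                temperatures.append(
                    {"survey_id": row["survey_id"], "depth": depth, "temp": value}
                )
            else:
                survey[column] = value
        surveys.append(survey)
    return surveys, temperatures


surveys, temperatures = split_survey_table(wide_rows)
```

After the split, `surveys` holds the survey-level variables and `temperatures` holds one row per depth, matching the second table above.
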
Lastly, each data table must have one column in common, a primary key, that allows you to join data tables (`survey_id` in this case). Aside from this key column, avoid having any other variable in common between tables. This minimizes unnecessary data replication and, more importantly, prevents errors when joining tables that share a column name other than the primary key.
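
To make the role of the key concrete, here is a minimal sketch of a join on `survey_id` using the example values. The helper `join_on_key` is hypothetical, not part of this package, and it deliberately refuses to join tables that share any column other than the key.

```python
surveys = [
    {"survey_id": "D123", "date": "2017-07-01", "sea_state": 1, "cloud_cover": 25},
    {"survey_id": "D124", "date": "2017-07-02", "sea_state": 2, "cloud_cover": 100},
]
temperatures = [
    {"survey_id": "D123", "depth": 0, "temp": 13.4},
    {"survey_id": "D123", "depth": 1, "temp": 12.2},
    {"survey_id": "D124", "depth": 0, "temp": 15.5},
    {"survey_id": "D124", "depth": 1, "temp": 14.5},
]


def join_on_key(left_rows, right_rows, key):
    """Attach the matching left row to every right row.

    Raises ValueError if the tables share a column other than the key,
    and KeyError if a right row has no matching left row.
    """
    shared = (set(left_rows[0]) & set(right_rows[0])) - {key}
    if shared:
        raise ValueError(f"tables share non-key columns: {sorted(shared)}")
    lookup = {row[key]: row for row in left_rows}
    return [{**lookup[row[key]], **row} for row in right_rows]


joined = join_on_key(surveys, temperatures, "survey_id")
```

Each row of `joined` is a temperature measurement carrying its survey's date, sea state, and cloud cover. If both tables also had, say, a `date` column, the check would raise instead of silently letting one column overwrite the other.
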
If an observation does not have a recorded variable (i.e., a blank cell), that cell should be filled with `NA`. Dates and times should follow the ISO 8601 format: dates should be expressed as `yyyy-mm-dd` and times should be in 24-hour format as `hh:mm:ss`. The top row of each column should be a header that follows an unambiguous naming scheme, e.g. `age_class` instead of `AC`. Do not use spaces in column header names (underscores are preferred), and avoid capital letters.
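
Some of these formatting rules can be checked automatically before submission. The following is a rough sketch, not an official validator: the function names are hypothetical, and the header pattern (lowercase letters, digits, and underscores, starting with a letter) is one assumed reading of the naming rules above. It cannot tell whether a name is ambiguous.

```python
import re
from datetime import datetime

# Assumed header rule: lowercase letters, digits, underscores; starts with a letter.
HEADER_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def header_problems(headers):
    """Return the headers that contain spaces, capitals, or other disallowed characters."""
    return [h for h in headers if not HEADER_PATTERN.match(h)]


def is_iso_date(value):
    """True if value is a real calendar date written as yyyy-mm-dd."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def is_24h_time(value):
    """True if value is a valid 24-hour time written as hh:mm:ss."""
    if not re.fullmatch(r"\d{2}:\d{2}:\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%H:%M:%S")
    except ValueError:
        return False
    return True


def fill_blanks(row):
    """Replace empty or missing cells with the string NA."""
    return {
        column: "NA" if value is None or str(value).strip() == "" else value
        for column, value in row.items()
    }
```

For instance, `header_problems(["survey_id", "Sea State", "AC"])` flags the last two headers, and `is_iso_date("01/07/2017")` returns `False`.
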
Submit a code book along with the tidy data set. The code book should contain information about each variable:
The code book should be prepared in a spreadsheet file and should be broken up so that it’s clear what data table each variable resides in. For example:
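
As a rough illustration, a code book organized by data table can be checked against the tables themselves to confirm that every column is documented. The code book field names used here (`table`, `variable`, `description`) and the function `missing_from_code_book` are assumptions for this sketch only, not a required layout.

```python
# Column headers of each tidy data table in the submission.
tables = {
    "survey": ["survey_id", "date", "sea_state", "cloud_cover"],
    "temperature": ["survey_id", "depth", "temp"],
}

# Hypothetical code book rows: one entry per variable per table.
# Descriptions are placeholders, not real metadata.
code_book = [
    {"table": "survey", "variable": "survey_id", "description": "placeholder"},
    {"table": "survey", "variable": "date", "description": "placeholder"},
    {"table": "survey", "variable": "sea_state", "description": "placeholder"},
    {"table": "temperature", "variable": "survey_id", "description": "placeholder"},
    {"table": "temperature", "variable": "depth", "description": "placeholder"},
    {"table": "temperature", "variable": "temp", "description": "placeholder"},
]


def missing_from_code_book(tables, code_book):
    """Return (table, variable) pairs that appear in a table but not in the code book."""
    documented = {(entry["table"], entry["variable"]) for entry in code_book}
    return [
        (table_name, header)
        for table_name, headers in tables.items()
        for header in headers
        if (table_name, header) not in documented
    ]


missing = missing_from_code_book(tables, code_book)
```

Here `missing` contains `("survey", "cloud_cover")`, because that column has no code book entry.
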