With our new data server (Moonrise) up and ready to store our sequences, it is time to start being more careful about where and when we move our most important data, and how we keep track of it. I’ve devised a system for storing our data for those of you who will be accessing it via the servers. Only Chris, Sariel, Frances and I will have write access to the directories where data is stored. If you would like your data stored, you will have to fill out a form which I’ve created which will give us all the information we need to store it in its right place. Here is the form.
This is inserting a little bureaucracy into our system, and it’s going to be a pain, but in the long run it will make things much easier. We currently have data which we had a very difficult time finding because the owner is no longer in the lab. With a system like the new one, that will not happen.
We will store our WGS, RNASeq, and GBS data in separate folders. This will make finding your data easier in most cases.
Here are the directory structures for the three types of data:
WGS -> Species -> Cultivar -> Library -> Experiment -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt
RNASeq -> Experiment (if unnecessary, the following directories can be omitted) -> Species -> Cultivar/Population -> Library -> file_name_including_library_size_and_depth_coverage.fq METADATA.txt
GBS is a little more complex, and we will separate things in two ways.
GBS -> Cut Site/Enzyme (data with different cut sites might be incompatible) -> Data type (population genetics, mapping data) -> From here on things diverge
Pop -> Group* -> files_with_descriptive_names.fq METADATA.txt
Map -> Experiment -> Species -> files_with_descriptive_names.fq METADATA.txt
*groups are based loosely on clades, and on how much data for each species we have (annuus_wild, annuus_cult, argophyllus, bolanderi_exilus, petiolaris_clade, hybrids, perennials, tuberosus_cult)
Generally, file names should include data not encoded in the directory structure, but important enough to be seen when perusing the data. Things like depth of coverage, library size, etc. seem appropriate for all three data types, but for types with which I’m not as familar (GBS), suggestions would be appreciated.