Last updated: 2015-04-11
Code version: 3275cbc6932e4def9f737a536299aa2022c5c7a1
The first of four flow cells was sequenced at the Functional Genomics Facility (FGF). Here is the message from Pieter Faber:
We finished one Flowcell of your Illumina sequencing experiment (4 flowcells SR100). No technical problems were encountered. I attached several QC files in support. Declining sequencing quality at higher bp-length is (in part) caused by the presence of adaptor-only sequences (up to 40% at bp 100 according to FastQC).
I have uploaded the data in fastq format to the Genomics Core FTP data server (gilad Lab folder, folder = /NGS/150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1). For access, a username and password have been provided to the PI previously (not your CNET ID!!).
Let me know if other files and / or explanations are needed.
IMPORTANT: While the Core has looked at your data and found no technical irregularities we urge you to perform data analysis/QC in a timely manner and contact us with any questions/comments you may have within 3 months after the data were made available to you!!
Nick recently downloaded data from the FGF FTP site. In order to download the data from the command line, you have to modify the path that Pieter sends. There does not appear to be a "Gilad Lab folder" (I checked with FileZilla). Instead, you need to preface /NGS with /Genomics_Data.
I am storing the data in internal_supp so that it is easier for PoYuan and me to collaborate.
cd /mnt/gluster/data/internal_supp/singleCellSeq/
mkdir raw
cd raw
nohup wget --user=gilad --password='<password>' -r ftp://fgfftp.uchicago.edu/Genomics_Data/NGS/150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/ &
mv nohup.out 150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1.log
This recursively downloads all the files. Unfortunately, it also includes all the directories above the final listed directory, which is annoying because the data is so deeply nested. I tried running wget with the option -np ("no parent"), but this did not make a difference. The download took ~10 hours: it started at 12:05:18 and ended at 21:54:33. However, wget reported that it only took ~5 hours, so I'm not sure how that is calculated. Here's the final line of output:
Downloaded: 1375 files, 129G in 4h 59m 1s (7.37 MB/s)
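In hindsight, the nesting could probably have been avoided at download time: -np only stops wget from ascending to parent directories on the server, but it does not stop it from recreating the parent directories locally, which is what -nH and --cut-dirs control. Here is an untested sketch of that alternative (the command actually run is the one above):
# Untested alternative: flatten the remote directory structure during the
# download instead of moving files afterwards.
#   -nH          do not create the fgfftp.uchicago.edu/ host directory
#   --cut-dirs=2 drop the leading Genomics_Data/NGS/ components
#   -np          do not ascend to the parent directories
nohup wget --user=gilad --password='<password>' -r -np -nH --cut-dirs=2 ftp://fgfftp.uchicago.edu/Genomics_Data/NGS/150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/ &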
To remove the unnecessary directories that mirror the FGF FTP site hierarchy, I moved the files.
mv fgfftp.uchicago.edu/Genomics_Data/NGS/150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/ 150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/
rmdir -p fgfftp.uchicago.edu/Genomics_Data/NGS/
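As a quick sanity check (not one of the original commands), one can confirm how many fastq files actually arrived in the relocated run directory:
# Hypothetical check, run from inside raw/: count the fastq files in the
# relocated run directory.
find 150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/ -type f -name '*.fastq.gz' | wc -l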
Next, I removed the extraneous CASAVA directories and added the flow cell name to each filename.
cd -
mkdir fastq
I did this with the following Python code:
import glob
import shutil

# The fastq files are nested inside the per-sample directories created by CASAVA.
files = glob.glob('raw/150320_700819F_0303_AC6WYKACXX-YG-SR100-FC-1/FastQ/Project_YG-SR100-1/Sample*/*fastq.gz')
target_dir = 'fastq/'
log = open('rearrange_C6WYKACXX.log', 'w')
log.write('original\tnew\n')
for f in files:
    # Remove the .fastq.gz extension and split the path into its components.
    path = f[:-len('.fastq.gz')].split('/')
    # Extract the flow cell name, e.g. C6WYKACXX, from the run directory name.
    flow_cell = path[1].split('_')[-1][1:10]
    # Drop the trailing set number (e.g. 001) from the filename, then join the
    # remaining pieces and the flow cell name with periods.
    file_parts = path[-1].split('_')[:-1]
    new_name = target_dir + '.'.join(file_parts + [flow_cell]) + '.fastq.gz'
    log.write(f + '\t' + new_name + '\n')
    shutil.move(f, new_name)
log.close()
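To verify the rearrangement (a hypothetical check, not part of the original notes), the number of files now in fastq/ can be compared with the number of entries written to the log:
# Hypothetical check: both counts should equal the number of fastq files
# matched by the glob above (tail -n +2 skips the log's header line).
ls fastq/*.fastq.gz | wc -l
tail -n +2 rearrange_C6WYKACXX.log | wc -l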