Last updated: 2015-04-22
Code version: beec580194ceb295a5bfbc87750bb1f59f4ff273
The third of four flow cells was sequenced at the Functional Genomics Facility (FGF). Here is the message from Pieter Faber:
We finished Flowcell #4 of your Illumina sequencing experiment (4 flowcells SR100). No technical problem were encountered. I attached several QC files in support.
I have uploaded the data in fastq format to the Genomics Core ftp server data server (gilad Lab folder, folder = /NGS/150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4)
To download, the path /NGS must be prefixed with /Genomics_Data.
cd /mnt/gluster/data/internal_supp/singleCellSeq/raw
echo "wget --user=gilad --password='<password>' -r ftp://fgfftp.uchicago.edu/Genomics_Data/NGS/150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4/" \
| qsub -l h_vmem=2g -N fc4 -cwd -V -j y -o 150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4.log
The download took ~9 hours of wall-clock time: it started at 15:04:37 and ended at 00:16:17 the next day. However, wget reported that it took only ~5 hours. Here’s the final line of its output:
Downloaded: 1184 files, 96G in 4h 54m 6s (5.54 MB/s)
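The gap between the two figures can be checked with a little arithmetic. This is a minimal sketch that uses only the timestamps and the wget summary quoted above. The calendar dates are placeholders (only the times of day are recorded here), and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Times of day from the job log. The dates are arbitrary placeholders, chosen
# only so that the end time falls on the day after the start time.
start = datetime(2000, 1, 1, 15, 4, 37)
end = datetime(2000, 1, 2, 0, 16, 17)
wall_clock = end - start

# Transfer time taken from wget's summary line (4h 54m 6s).
wget_time = timedelta(hours=4, minutes=54, seconds=6)

print("wall clock:", wall_clock)
print("wget:      ", wget_time)
print("difference:", wall_clock - wget_time)
```

The wall-clock time comes to 9 h 11 min 40 s, about 4 h 17 min more than wget reported. The log does not say where the extra time went. One possibility, not verified here, is that wget's figure counts only the time spent transferring data.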
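wget recreated the directory hierarchy of the FGF FTP site locally. To get rid of these unnecessary directories, I moved the flow cell directory up to the current directory and removed the now-empty parent directories.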
mv fgfftp.uchicago.edu/Genomics_Data/NGS/150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4 150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4
rmdir -p fgfftp.uchicago.edu/Genomics_Data/NGS/
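After the move, a quick sanity check is to count the files under the flow cell directory and compare the result with the number in wget's summary line. The sketch below is one way to do that. The helper name `count_files` is hypothetical, and whether the count should equal 1184 exactly is an assumption, since it depends on what wget included in its total.

```python
import os


def count_files(root):
    """Count the non-directory entries under root, including subdirectories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == "__main__":
    # Directory name taken from the mv command above.
    flow_cell_dir = "150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4"
    print(count_files(flow_cell_dir))
```

Run it from the raw directory, where the flow cell directory now sits.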
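Next I removed the extraneous CASAVA directories and added the flow cell name to each filename. The code below uses paths relative to the directory that contains raw/ and fastq/, so I first returned to the previous working directory:
cd -
Then I ran the following Python code: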
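import glob
import shutil

# Collect every gzipped FASTQ file from the CASAVA output of this flow cell.
files = glob.glob('raw/150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4/FastQ/Project_YG-SR100-4/Sample*/*fastq.gz')
target_dir = 'fastq/'
suffix = '.fastq.gz'
# Log every rename so that the original locations can be recovered.
with open('rearrange_C72JMACXX.log', 'w') as log:
    log.write('original\tnew\n')
    for f in files:
        # Slice off the suffix; str.strip() removes a set of characters, not a suffix.
        path = f[:-len(suffix)].split('/')
        # Flow cell ID: C72JMACXX, taken from BC72JMACXX-YG-SR100-FC-4.
        flow_cell = path[1].split('_')[-1][1:10]
        # Drop the last underscore-separated field of the file name.
        file_parts = path[-1].split('_')[:-1]
        new_name = target_dir + '.'.join(file_parts + [flow_cell]) + suffix
        log.write(f + '\t' + new_name + '\n')
        shutil.move(f, new_name)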
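To make the renaming rule concrete, the sketch below wraps the same string manipulation in a function that only computes the new name and moves nothing. The function name `new_fastq_name` and the example path are hypothetical: the sample name and barcode are made up for illustration and are not taken from the actual data. The sketch removes the `.fastq.gz` suffix by slicing rather than with `str.strip`.

```python
def new_fastq_name(old_path, target_dir="fastq/", suffix=".fastq.gz"):
    """Return the flattened name for one CASAVA FASTQ path (no files are touched)."""
    if not old_path.endswith(suffix):
        raise ValueError("unexpected file name: " + old_path)
    path = old_path[:-len(suffix)].split("/")
    # The second path component is the run directory; its last underscore-separated
    # field starts with one extra letter followed by the 9-character flow cell ID.
    flow_cell = path[1].split("_")[-1][1:10]
    # Drop the last underscore-separated field of the file name.
    file_parts = path[-1].split("_")[:-1]
    return target_dir + ".".join(file_parts + [flow_cell]) + suffix


# Hypothetical input: sample name "X" and barcode "ACGTACGT" are invented.
example = ("raw/150402_700819F_0306_BC72JMACXX-YG-SR100-FC-4/FastQ/"
           "Project_YG-SR100-4/Sample_X/X_ACGTACGT_L001_R1_001.fastq.gz")
print(new_fastq_name(example))
```

For this made-up input the function returns `fastq/X.ACGTACGT.L001.R1.C72JMACXX.fastq.gz`: the nested CASAVA directories are gone and the flow cell ID is part of the file name.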