
Getting ready

While NGS is all about big data, there is a limit to how much data I can ask you to download for this book; I believe that 2 to 20 GB for a tutorial is asking too much. The 1000 Genomes Project's VCF files with realistic annotations are in this order of magnitude, so we will work with much less data here. Fortunately, the bioinformatics community has developed tools that allow for the partial download of data. As part of the SAMtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. On the command line, perform the following:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz

tabix -p vcf genotypes.vcf.gz

If the preceding link does not work, be sure to check the dataset page at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for an update.

The first command downloads only part of the chromosome 22 VCF file of the 1000 Genomes Project (the 22:1-17000000 region, that is, up to 17 Mbp) and pipes it to bgzip, which compresses it into genotypes.vcf.gz.
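As a quick sanity check after the download, a minimal Python sketch such as the following can count the header and record lines in the result. It assumes the output file is named genotypes.vcf.gz, as in the preceding command; the count_vcf_lines helper is a hypothetical name used only for illustration and is not taken from the book's Notebook. It relies on bgzip output being readable as ordinary gzip data.

```python
import gzip
import os


def count_vcf_lines(path):
    """Return (header_lines, record_lines) for a gzip- or bgzip-compressed VCF."""
    n_header = 0
    n_records = 0
    # bgzip writes a series of gzip blocks, which the gzip module reads transparently
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                n_header += 1  # meta-information lines and the column header
            elif line.strip():
                n_records += 1  # one variant record per non-empty line
    return n_header, n_records


if __name__ == "__main__":
    vcf_path = "genotypes.vcf.gz"  # assumed output name from the tabix command
    if os.path.exists(vcf_path):
        header, records = count_vcf_lines(vcf_path)
        print(f"{header} header lines, {records} variant records")
    else:
        print(f"{vcf_path} not found; run the tabix command first")
```

A record count of zero would suggest that the region request returned nothing, for example because the remote file moved.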

The second command creates a tabix index, which we will need for direct access to a section of the genome. As usual, the code to do this is in a Notebook (the Chapter02/Working_with_VCF.ipynb file).
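To illustrate what direct access means in practice, here is a minimal sketch that asks tabix for a small region of the indexed file from Python. The helper names (tabix_region_command, fetch_region) and the example coordinates are assumptions chosen for illustration; the only tabix behavior relied on is the standard `tabix FILE REGION` query form.

```python
import os
import shutil
import subprocess


def tabix_region_command(vcf_path, chrom, start, end):
    """Build the argument list for a tabix region query (1-based, inclusive)."""
    return ["tabix", vcf_path, f"{chrom}:{start}-{end}"]


def fetch_region(vcf_path, chrom, start, end):
    """Run tabix and return the matching VCF records as a list of lines."""
    cmd = tabix_region_command(vcf_path, chrom, start, end)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    vcf_path = "genotypes.vcf.gz"  # assumed file name from the commands above
    # Illustrative coordinates only; any interval inside 22:1-17000000 works
    if shutil.which("tabix") and os.path.exists(vcf_path):
        records = fetch_region(vcf_path, 22, 16050000, 16060000)
        print(f"{len(records)} records in the requested region")
    else:
        print("tabix or the indexed VCF file is not available")
```

Because the index lets tabix seek straight to the compressed blocks covering the region, the query does not need to scan the whole file.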
