官术网_书友最值得收藏!

Getting ready

In the best-case scenario, you have a VCF file with proper filters applied. If this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this might not mean that all of the proper filters were applied. You have to be sure that your data is properly filtered.

In the second case, which is one of the most common, your file will have unfiltered data, but you'll have enough annotations and can apply hard filters (there is no need for programmatic filtering). If you have a GATK annotated file, refer to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set.

In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, "if read depth > 20, accept if mapping quality > 30; otherwise, accept if mapping quality > 40").

In the fourth case, your VCF file does not have all the necessary annotations and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you can find and create a new VCF file with the required annotations. Some genotype callers (such as GATK) allow you to specify which annotations you want; you may also want to use extra programs to provide more annotations. For example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, are they coding or non-coding?).

It's impossible to provide a clear-cut recipe. It will vary with your type of sequencing data, your species of study, and your tolerance to errors, among other variables. What we can do is provide a set of typical analysis that is done for high-quality filtering.

In this recipe, we will not use data from the Human 1,000 Genomes Project. We want dirty unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles gambiae 1,000 Genomes Project (Anopheles is the mosquito vector involved in the transmission of the parasite that causes malaria), which makes filtered and unfiltered data available. You can find more information on this project at http://www.malariagen.net/projects/vector/ag1000g.

We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, which is followed by a part somewhere in the middle of this chromosome (and index both):

tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:1-200000 |bgzip -c > centro.vcf.gz
tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC.phase1.AR1.vcf.gz 3L:21000001-21200000 |bgzip -c > standard.vcf.gz

tabix -p vcf centro.vcf.gz

tabix -p vcf standard.vcf.gz

If the links do not work, be sure to check https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition/blob/master/Datasets.ipynb for updates. As usual, the code for downloading this data is available on the Chapter02/Filtering_SNPs.ipynb Notebook.

Finally, a word of warning on this recipe: the level of Python here will be slightly more complicated than usual. The more general code we will write will be easier to reuse for your specific case. We will use functional programming techniques (lambda functions) and the partial function application extensively.

主站蜘蛛池模板: 武陟县| 涞水县| 大洼县| 留坝县| 晋中市| 蛟河市| 广元市| 芦山县| 南汇区| 晋宁县| 黄冈市| 广州市| 金乡县| 庄河市| 临清市| 崇明县| 泽普县| 自贡市| 宿州市| 商洛市| 舞钢市| 禄劝| 蒲江县| 文安县| 凤凰县| 皋兰县| 高清| 古田县| 彝良县| 涿州市| 柞水县| 九龙县| 镇沅| 邯郸县| 台安县| 阿勒泰市| 沧州市| 林口县| 扬中市| 濉溪县| 来凤县|