- Bioinformatics with Python Cookbook
- Tiago Antao
- 482字
- 2021-06-10 19:01:48
How to do it...
Take a look at the following steps:
- Let's start by inspecting the information that we can get per record:
import vcf
v = vcf.Reader(filename='genotypes.vcf.gz')
print('Variant Level information')
infos = v.infos
for info in infos:
print(info)
print('Sample Level information')
fmts = v.formats
for fmt in fmts:
print(fmt)
We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as SNP, CNV, INDELs, and so on, and the state of that variant per sample). At the variant (record) level, we find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth, among others. There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).
At the sample level, there are only two annotations in this file: GT—genotype, and DP—the per sample read depth. You have the per variant (total) read depth and the per sample read depth; be sure not to confuse both.
- Now that we know which information is available, let's inspect a single VCF record:
v = vcf.Reader(filename='genotypes.vcf.gz')
rec = next(v)
print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER)
print(rec.INFO)
print(rec.FORMAT)
samples = rec.samples
print(len(samples))
sample = samples[0]
print(sample.called, sample.gt_alleles, sample.is_het, sample.phased)
print(int(sample['DP']))
We will start by retrieving the standard information: the chromosome, position, identifier, reference base, (typically just one), alternative bases (you can have more than one, but it's not uncommon as a first filtering approach to only accept a single ALT, for example, only accept biallelic SNPs), quality (as you might expect, Phred-scaled), and filter status. Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).
We then print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample to check whether it was called for this variant. Also, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common) are included.
- Let's check the type of variant and the number of nonbiallelic SNPs in a single pass:
from collections import defaultdict
f = vcf.Reader(filename='genotypes.vcf.gz')
my_type = defaultdict(int)
num_alts = defaultdict(int)
for rec in f:
my_type[rec.var_type, rec.var_subtype] += 1
if rec.is_snp:
num_alts[len(rec.ALT)] += 1
print(num_alts)
print(my_type)
We will now use the now common Python default dictionary. We find that this dataset has INDELs (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds being transitions with one-third transversions). There is a residual number (79) of tri-allelic SNPs.
- 數據結構和算法基礎(Java語言實現)
- Apache Spark 2.x Machine Learning Cookbook
- 零基礎玩轉區塊鏈
- KnockoutJS Starter
- 數據結構習題解析與實驗指導
- Building Machine Learning Systems with Python(Second Edition)
- Python項目實戰從入門到精通
- Couchbase Essentials
- FFmpeg開發實戰:從零基礎到短視頻上線
- OpenCV with Python Blueprints
- Simulation for Data Science with R
- Learning Ionic(Second Edition)
- PHP程序設計經典300例
- 精通Rust(第2版)
- Mastering Android Application Development