官术网_书友最值得收藏!

How to do it...

Take a look at the following steps:

  1. Let's start by inspecting the information that we can get per record:
import vcf
v = vcf.Reader(filename='genotypes.vcf.gz')
print('Variant Level information')
infos = v.infos
for info in infos:
print(info)
print('Sample Level information')
fmts = v.formats
for fmt in fmts:
print(fmt)

We start by inspecting the annotations that are available for each record (remember that each record encodes a variant, such as SNP, CNV, INDELs, and so on, and the state of that variant per sample). At the variant (record) level, we find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth, among others. There are others, but they are mostly specific to the 1,000 Genomes Project (here, we will try to be as general as possible). Your own dataset may have more annotations (or none of these).

At the sample level, there are only two annotations in this file: GT—genotype, and DP—the per sample read depth. You have the per variant (total) read depth and the per sample read depth; be sure not to confuse both.

  1. Now that we know which information is available, let's inspect a single VCF record:
v = vcf.Reader(filename='genotypes.vcf.gz')
rec = next(v)
print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER)
print(rec.INFO)
print(rec.FORMAT)
samples = rec.samples
print(len(samples))
sample = samples[0]
print(sample.called, sample.gt_alleles, sample.is_het, sample.phased)
print(int(sample['DP']))

We will start by retrieving the standard information: the chromosome, position, identifier, reference base, (typically just one), alternative bases (you can have more than one, but it's not uncommon as a first filtering approach to only accept a single ALT, for example, only accept biallelic SNPs), quality (as you might expect, Phred-scaled), and filter status. Regarding the filter status, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).

We then print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we count the number of samples and inspect a single sample to check whether it was called for this variant. Also, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common) are included.

  1. Let's check the type of variant and the number of nonbiallelic SNPs in a single pass:
from collections import defaultdict
f = vcf.Reader(filename='genotypes.vcf.gz')
my_type = defaultdict(int)
num_alts = defaultdict(int)
for rec in f:
my_type[rec.var_type, rec.var_subtype] += 1
if rec.is_snp:
num_alts[len(rec.ALT)] += 1
print(num_alts)
print(my_type)

We will now use the now common Python default dictionary. We find that this dataset has INDELs (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-thirds being transitions with one-third transversions). There is a residual number (79) of tri-allelic SNPs.

主站蜘蛛池模板: 潢川县| 山阳县| 桦南县| 山东| 盐边县| 永泰县| 镇坪县| 剑河县| 扎赉特旗| 常德市| 从江县| 江孜县| 鲁甸县| 华安县| 辽源市| 阿克| 淅川县| 甘德县| 石嘴山市| 南安市| 武鸣县| 南雄市| 延津县| 凤阳县| 永年县| 平泉县| 柘城县| 平安县| 三台县| 广西| 当阳市| 基隆市| 鄂温| 乳源| 西藏| 武陟县| 济宁市| 昌图县| 临颍县| 阳曲县| 兴山县|