Read
A simple example of how Variant reads a file and how its output can be handled.
[1] -
from os import getcwd
from os.path import dirname
from openvariant import Annotation, Variant
dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'
annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'
Annotation object generated from the annotation file. Parameters:
annotation_path - Path of the annotation file.
Variant object to iterate through the parsed file. Parameters:
path - Path of the input file.
annotation - Annotation object used to parse the input.
skip_files - Skip unreadable files and directories.
One of the main functions of Variant is read. It returns an iterator over the rows of the parsed file.
read function parameters:
where - Filter expression.
group_key - Key to group rows.
This example prints the first 10 rows of the input file, parsed according to the annotation file.
[2] -
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)
for n_line, line in enumerate(result.read()):
    print(f'Line {n_line}: {line}')
    if n_line == 9:
        break
Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 3: {'POSITION': '139058', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 4: {'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 5: {'POSITION': '187146', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 6: {'POSITION': '187153', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 7: {'POSITION': '187264', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 8: {'POSITION': '187323', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 9: {'POSITION': '187363', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
As the output shows, each row is a dict whose keys are the fields of the parsed result and whose values are the contents of the corresponding cells.
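Because read returns an iterator, the counter-and-break pattern can also be written with itertools.islice. The sketch below does not call OpenVariant: fake_read is a hypothetical stand-in generator that yields dicts shaped like the rows shown above.

```python
from itertools import islice


def fake_read():
    """Hypothetical stand-in for result.read(): yields one dict per parsed row."""
    for position in ('16963', '17691', '98933', '139058'):
        yield {'POSITION': position, 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1'}


# Take at most the first three rows without an explicit counter and break.
first_rows = list(islice(fake_read(), 3))

for n_line, line in enumerate(first_rows):
    # Field values are strings, so convert them before doing arithmetic.
    print(f"Line {n_line}: position={int(line['POSITION'])}")
```

With a real Variant object, the same idea would presumably apply by passing result.read() to islice in place of fake_read().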
Variant has several attributes that we can explore:
[3] -
print('Headers: ', result.header)
print('Input file: ', result.path)
Headers: ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']
Input file: /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz
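The header list gives the column order, which is convenient when writing rows out with the standard csv module. This is a minimal sketch using stand-in data copied from the outputs above rather than a live Variant object; the tab delimiter is an arbitrary choice for illustration.

```python
import csv
import io

# Stand-ins for result.header and for the rows yielded by result.read().
header = ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1',
     'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1',
     'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, delimiter='\t',
                        lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```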
We can also inspect the Annotation used to parse the input file. Its attributes are:
Annotation file path - path
Format - format
Annotations - annotations
Columns - columns
Delimiter - delimiter
Excludes - excludes
Patterns - patterns
Structure - structure
[4] -
print(result.annotation.annotations)
{'PLATFORM': ('STATIC', 'WGS'), 'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3b20>, nan), 'DATASET': ('FILENAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3940>, re.compile('(.*)')), 'SAMPLE': ('DIRNAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3520>, re.compile('(.*)')), 'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3310>, nan), 'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'})}
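One way to read the dictionary above: POSITION is taken from the first candidate column found in the input, and STRAND_REF is derived from STRAND through a lookup table. The sketch below mimics that reading in plain Python. It is an interpretation of the printed output, not OpenVariant's implementation, and pick_column and raw_row are hypothetical names.

```python
# Candidate column names and mapping copied from the printed annotations.
POSITION_CANDIDATES = ['Position', 'Start', 'Start_Position', 'Pos',
                       'Chromosome_Start', 'POS']
STRAND_CANDIDATES = ['Strand', 'Chromosome_Strand', '']
STRAND_MAPPING = {'+': 'POS', '-': 'NEG'}


def pick_column(raw_row, candidates):
    """Return the value of the first candidate column present in raw_row, else None."""
    for name in candidates:
        if name in raw_row:
            return raw_row[name]
    return None


# Hypothetical raw input row, before any annotation is applied.
raw_row = {'Start_Position': '16963', 'Strand': '+'}

parsed = {
    'POSITION': pick_column(raw_row, POSITION_CANDIDATES),
    # Translate the raw strand symbol through the mapping table.
    'STRAND_REF': STRAND_MAPPING.get(pick_column(raw_row, STRAND_CANDIDATES)),
}
print(parsed)
```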
One of the parameters of the read function is where, which applies a conditional filter. The supported operators are:
== - Equal.
!= - Not equal.
<= - Less than or equal to.
< - Less than.
>= - Greater than or equal to.
> - Greater than.
The following example uses this parameter:
[5] -
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)
for n_line, line in enumerate(result.read(where="POSITION == 186112")):
    print(f'{line}')
{'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
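To make the semantics of such an expression concrete, here is a small plain-Python sketch of a 'FIELD OP VALUE' filter over rows of the same shape. It is not how OpenVariant parses where. The matches helper is hypothetical, and comparing values numerically is an assumption made for this illustration.

```python
import operator

# The six comparison operators listed above, mapped to Python functions.
OPERATORS = {
    '==': operator.eq, '!=': operator.ne,
    '<=': operator.le, '<': operator.lt,
    '>=': operator.ge, '>': operator.gt,
}


def matches(row, where):
    """Evaluate a 'FIELD OP VALUE' expression against one row (numeric comparison)."""
    field, op, value = where.split()
    return OPERATORS[op](float(row[field]), float(value))


rows = [{'POSITION': p} for p in ('139058', '186112', '187146')]

# Keep only the rows whose POSITION is at least 186112.
selected = [row for row in rows if matches(row, 'POSITION >= 186112')]
print(selected)
```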
read also accepts group_key as a parameter, which groups rows by the value of the given key.
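This page does not show what read yields when group_key is set, so the sketch below only illustrates the general idea of grouping rows by the value of a key, using plain Python and made-up rows in the shape seen earlier. It should not be taken as the structure OpenVariant returns.

```python
from collections import defaultdict

# Stand-in rows; in practice they would come from result.read().
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f'},
    {'POSITION': '65872', 'DATASET': '5a3a743'},
    {'POSITION': '17691', 'DATASET': '22f5b2f'},
]

# Collect rows under the value of the chosen key (here DATASET).
grouped = defaultdict(list)
for row in rows:
    grouped[row['DATASET']].append(row)

for key, members in grouped.items():
    print(key, [member['POSITION'] for member in members])
```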
Variant can be combined with findfiles, as the following example shows. It prints the first 3 rows of each input file.
[6] -
from os.path import basename
from openvariant import findfiles
dataset_folder = f'{dirname(getcwd())}/datasets/sample1'
# findfiles yields each input file together with the annotation that applies to it
for file_path, annotation in findfiles(base_path=dataset_folder):
    result = Variant(path=file_path, annotation=annotation)
    print('File: ', basename(file_path), '\n')
    # Print only the first three rows of each file
    for n_line, line in enumerate(result.read()):
        print(f'Line {n_line}: {line}')
        if n_line == 2:
            print("\n")
            break
File: 5a3a743.wxs.maf.gz
Line 0: {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '131628', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '183697', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
File: 22f5b2f.wxs.maf.gz
Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
File: 345c90e.raw_somatic_mutation.vcf.gz
Line 0: {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 1: {'POSITION': '10273', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10321', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:C_T'}
File: de46011.raw_somatic_mutation.vcf.gz
Line 0: {'POSITION': '10105', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10381', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10438', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_T'}
File: 3a70e22.raw_somatic_mutation.vcf.gz
Line 0: {'POSITION': '10033', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10075', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 2: {'POSITION': '10087', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
File: 4c0b87e.raw_somatic_mutation.vcf.gz
Line 0: {'POSITION': '10105', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10241', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10267', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
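Note that the MAF and VCF rows above do not share the same keys: the VCF rows carry INFO but have no SAMPLE or STRAND_REF. When rows from several files are combined, one option is to pad them to a common set of columns. The following is a minimal sketch in plain Python with two rows copied from the output; filling missing fields with None is an arbitrary choice.

```python
# One MAF-style row and one VCF-style row, copied from the output above.
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1',
     'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS',
     'INFO': 'WGS:T_C'},
]

# Union of all column names, preserving first-seen order.
columns = list(dict.fromkeys(key for row in rows for key in row))

# Fill absent fields with None so that every row has the same keys.
normalized = [{column: row.get(column) for column in columns} for row in rows]

for row in normalized:
    print(row)
```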