Read

A simple example of how Variant reads a file and how the result can be handled.

[1] - 
from os import getcwd
from os.path import dirname
from openvariant import Annotation, Variant

dataset_file = f'{dirname(getcwd())}/datasets/sample1/22f5b2f.wxs.maf.gz'
annotation_file = f'{dirname(getcwd())}/datasets/sample1/annotation_maf.yaml'

The Annotation object is generated from the annotation file. Parameters:

  • annotation_path - Path of annotation file.

The Variant object iterates through the parsed file. Parameters:

  • path - Path of input file.

  • annotation - Annotation object used to parse the input.

  • skip_files - Skip unreadable files and directories.

One of the main functions of Variant is read. It generates an iterator that scans the parsed file.

read function parameters:

  • where - Filter expression.

  • group_key - Key to group rows.

This example gets the first 10 lines of the input file, parsed according to the annotation file.

[2] - 
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)

for n_line, line in enumerate(result.read()):
    print(f'Line {n_line}: {line}')
    if n_line == 9:
        break
Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 3: {'POSITION': '139058', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 4: {'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 5: {'POSITION': '187146', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 6: {'POSITION': '187153', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 7: {'POSITION': '187264', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 8: {'POSITION': '187323', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 9: {'POSITION': '187363', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}

As we can see in the output, each line is a dict where the key is a field of the parsed result and the value is the content of that cell.

Variant has several attributes that we can explore:
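
Since read returns an iterator of dicts, the standard library can be used to limit and inspect rows. The following is a minimal sketch, not part of the original example: sample_rows is a hypothetical stand-in generator that yields rows shaped like the output above, so the snippet runs without the dataset. With a real Variant object, the assumption is that result.read() could be passed to islice in the same way.

```python
from itertools import islice


def sample_rows():
    """Hypothetical stand-in that yields dict rows shaped like the output above."""
    positions = ['16963', '17691', '98933', '139058', '186112']
    for position in positions:
        yield {'POSITION': position, 'DATASET': '22f5b2f', 'PLATFORM': 'WGS'}


# Take the first 3 rows without a manual counter and break.
first_rows = list(islice(sample_rows(), 3))

# Rows are plain dicts, so each field is read by its key.
first_positions = [row['POSITION'] for row in first_rows]
print(first_positions)
```

islice stops consuming the iterator once it has enough rows, so the remainder of the input is never generated.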

[3] - 
print('Headers: ', result.header)
print('Input file: ', result.path)
Headers:  ['POSITION', 'DATASET', 'SAMPLE', 'STRAND_REF', 'PLATFORM']
Input file:  /home/dmartinez/openvariant/examples/datasets/sample1/22f5b2f.wxs.maf.gz

We can also inspect the Annotation that was used to parse the input file. It exposes the following attributes:

  • Annotation file path - path

  • Format - format

  • Annotations - annotations

  • Columns - columns

  • Delimiter - delimiter

  • Excludes - excludes

  • Patterns - patterns

  • Structure - structure

[4] - 
print(result.annotation.annotations)
{'PLATFORM': ('STATIC', 'WGS'), 'POSITION': ('INTERNAL', ['Position', 'Start', 'Start_Position', 'Pos', 'Chromosome_Start', 'POS'], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3b20>, nan), 'DATASET': ('FILENAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3940>, re.compile('(.*)')), 'SAMPLE': ('DIRNAME', <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3520>, re.compile('(.*)')), 'STRAND': ('INTERNAL', ['Strand', 'Chromosome_Strand', ''], <openvariant.annotation.builder.Builder object at 0x7fa8bc0b3310>, nan), 'STRAND_REF': ('MAPPING', ['STRAND'], {'+': 'POS', '-': 'NEG'})}
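
In the output above, STRAND_REF is a MAPPING annotation that derives its value from STRAND through the dict {'+': 'POS', '-': 'NEG'}. The sketch below only illustrates what such a lookup amounts to in plain Python. map_strand is a hypothetical helper, and returning None for an unmatched value is an assumption, since the output does not show how OpenVariant treats that case.

```python
# Mapping copied from the STRAND_REF entry printed above.
strand_mapping = {'+': 'POS', '-': 'NEG'}


def map_strand(strand, mapping=strand_mapping, default=None):
    """Hypothetical helper: translate a raw STRAND value through the mapping."""
    # dict.get returns `default` when the raw value has no entry (an assumption here).
    return mapping.get(strand, default)


print(map_strand('+'))
print(map_strand('-'))
```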

One of the parameters of the read function is where, which applies a conditional filter to the rows. The supported operators are:

  • == - Equal.

  • != - Not equal.

  • <= - Less than or equal to.

  • < - Less than.

  • >= - Greater than or equal to.

  • > - Greater than.

One example of this parameter is the following one:

[5] - 
annotation = Annotation(annotation_path=annotation_file)
result = Variant(path=dataset_file, annotation=annotation)

for n_line, line in enumerate(result.read(where="POSITION == 186112")):
    print(f'{line}')
{'POSITION': '186112', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
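
Note that POSITION is a string in the output while the filter uses a bare number. The following sketch shows one way an expression of the form FIELD OPERATOR VALUE could be evaluated against dict rows. It is an illustration under stated assumptions, not OpenVariant's implementation: matches and to_number are hypothetical helpers, and comparing numerically whenever both sides parse as numbers is an assumption.

```python
import operator

# Operators from the list above; two-character symbols come first
# so that '<=' is not mistaken for '<'.
OPERATORS = [
    ('==', operator.eq), ('!=', operator.ne),
    ('<=', operator.le), ('>=', operator.ge),
    ('<', operator.lt), ('>', operator.gt),
]


def to_number(value):
    """Return the value as a float when it parses as one, else the stripped string."""
    try:
        return float(value)
    except ValueError:
        return value.strip()


def matches(row, expression):
    """Hypothetical helper: test one dict row against 'FIELD OPERATOR VALUE'."""
    for symbol, compare in OPERATORS:
        if symbol in expression:
            field, _, expected = expression.partition(symbol)
            raw_left = str(row[field.strip()])
            raw_right = expected.strip().strip('"\'')
            left, right = to_number(raw_left), to_number(raw_right)
            # Fall back to string comparison when only one side is numeric.
            if isinstance(left, float) != isinstance(right, float):
                left, right = raw_left, raw_right
            return compare(left, right)
    raise ValueError(f'No supported operator in: {expression!r}')


rows = [{'POSITION': '139058'}, {'POSITION': '186112'}, {'POSITION': '187146'}]
selected = [row for row in rows if matches(row, 'POSITION == 186112')]
print(selected)
```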

read also accepts group_key as a parameter, which groups rows according to the value of that key.

Variant can be combined with findfiles, as the following example shows. It prints the first 3 lines of each input file.
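
To make the idea of grouping by a key concrete, here is a minimal plain-Python sketch. It does not call read and makes no claim about the order or shape of what read yields when group_key is set. group_rows is a hypothetical helper and the rows are hand-written samples shaped like the parsed output.

```python
from collections import defaultdict

# Hand-written sample rows shaped like the parsed output.
rows = [
    {'POSITION': '16963', 'DATASET': '22f5b2f'},
    {'POSITION': '65872', 'DATASET': '5a3a743'},
    {'POSITION': '17691', 'DATASET': '22f5b2f'},
]


def group_rows(rows, key):
    """Hypothetical helper: collect rows that share the same value for `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


grouped = group_rows(rows, 'DATASET')
for value, members in grouped.items():
    print(value, [row['POSITION'] for row in members])
```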

[6] - 
from os.path import basename
from openvariant import findfiles

dataset_folder = f'{dirname(getcwd())}/datasets/sample1'

for file_path, annotation in findfiles(base_path=dataset_folder):
    result = Variant(path=file_path, annotation=annotation)

    print('File: ', basename(file_path), '\n')
    for n_line, line in enumerate(result.read()):
        print(f'Line {n_line}: {line}')
        if n_line == 2:
            print("\n")
            break
File:  5a3a743.wxs.maf.gz

Line 0: {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '131628', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '183697', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  22f5b2f.wxs.maf.gz

Line 0: {'POSITION': '16963', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 1: {'POSITION': '17691', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}
Line 2: {'POSITION': '98933', 'DATASET': '22f5b2f', 'SAMPLE': 'SAMPLE1', 'STRAND_REF': 'POS', 'PLATFORM': 'WGS'}


File:  345c90e.raw_somatic_mutation.vcf.gz

Line 0: {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 1: {'POSITION': '10273', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10321', 'DATASET': '345c90e', 'PLATFORM': 'WGS', 'INFO': 'WGS:C_T'}


File:  de46011.raw_somatic_mutation.vcf.gz

Line 0: {'POSITION': '10105', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10381', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10438', 'DATASET': 'de46011', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_T'}


File:  3a70e22.raw_somatic_mutation.vcf.gz

Line 0: {'POSITION': '10033', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10075', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 2: {'POSITION': '10087', 'DATASET': '3a70e22', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}


File:  4c0b87e.raw_somatic_mutation.vcf.gz

Line 0: {'POSITION': '10105', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:A_C'}
Line 1: {'POSITION': '10241', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
Line 2: {'POSITION': '10267', 'DATASET': '4c0b87e', 'PLATFORM': 'WGS', 'INFO': 'WGS:T_C'}
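
The output above shows that rows from the MAF files carry SAMPLE and STRAND_REF while rows from the VCF files carry INFO instead. When rows from several files are combined, one option is to align them on the union of their fields. The sketch below uses two rows copied from the output and fills absent fields with None, which is a choice made for this illustration rather than OpenVariant behavior.

```python
# Two rows copied from the output above: one from a MAF file, one from a VCF file.
rows = [
    {'POSITION': '65872', 'DATASET': '5a3a743', 'SAMPLE': 'SAMPLE1',
     'STRAND_REF': 'POS', 'PLATFORM': 'WGS'},
    {'POSITION': '10267', 'DATASET': '345c90e', 'PLATFORM': 'WGS',
     'INFO': 'WGS:T_C'},
]

# Union of field names, kept in first-seen order.
fields = list(dict.fromkeys(key for row in rows for key in row))

# Give every row the same keys; fields a row lacks become None.
uniform_rows = [{field: row.get(field) for field in fields} for row in rows]

for row in uniform_rows:
    print(row)
```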