Group by

A simple example showing how the group by task works. This task is also available from the command line.

[1] - 
from os.path import dirname
from os import getcwd
from openvariant import group_by

dataset_folder = f'{dirname(getcwd())}/datasets/sample2'
annotation_path = f'{dirname(getcwd())}/datasets/sample2/annotation.yaml'

The group_by task groups the parsed rows by the value of an output field. It takes the following parameters:

  • base_path - Input path to explore and parse.

  • annotation_path - Path of the annotation file.

  • script - Command line to execute on the result of the parsing.

  • key_by - Key to group rows.

  • where - Filter expression.

  • cores - Maximum processes to run in parallel.

  • quite - Do not show progress while the parsing is running.

  • header - Show header on the result.

  • skip_files - Skip unreadable files and directories.

The following example shows a general case of the group by task:

[2] - 
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None, key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")
Group: MESO
ACAP3   1p36.33 MESO
ACTRT2  1p36.32 MESO
AGRN    1p36.33 MESO
ANKRD65 1p36.33 MESO
ATAD3A  1p36.33 MESO
ATAD3B  1p36.33 MESO
ATAD3C  1p36.33 MESO
AURKAIP1        1p36.33 MESO
B3GALT6 1p36.33 MESO


Group: ACC
ACAP3   1p36.33 ACC
ACTRT2  1p36.32 ACC
AGRN    1p36.33 ACC
ANKRD65 1p36.33 ACC
ATAD3A  1p36.33 ACC
ATAD3B  1p36.33 ACC
ATAD3C  1p36.33 ACC
AURKAIP1        1p36.33 ACC
B3GALT6 1p36.33 ACC
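
To make the grouping step concrete, here is a minimal pure-Python sketch of what grouping rows by a key field amounts to. It is not OpenVariant's implementation; the helper `group_rows` is hypothetical and the column name `CYTOBAND` is an assumption (only `SYMBOL` and `CANCER` appear in this section).

```python
from collections import defaultdict

# Rows shaped like the output above. The header names are partly assumed:
# SYMBOL and CANCER appear in this section, CYTOBAND is a guess.
header = ("SYMBOL", "CYTOBAND", "CANCER")
rows = [
    ("ACAP3", "1p36.33", "MESO"),
    ("ACAP3", "1p36.33", "ACC"),
    ("ACTRT2", "1p36.32", "MESO"),
    ("ACTRT2", "1p36.32", "ACC"),
]


def group_rows(rows, header, key_by):
    """Collect rows into a dict keyed by the value of the key_by column."""
    key_index = header.index(key_by)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


groups = group_rows(rows, header, "CANCER")
for group, values in groups.items():
    print(f"Group: {group}")
    for row in values:
        print("\t".join(row))
```

Each distinct value of the key field becomes one group, and the rows keep their original order within that group.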


One of the parameters of the group by task is where, which applies a conditional filter to the rows. The supported operators are:

  • == - Equal.

  • != - Not equal.

  • <= - Less than or equal to.

  • < - Less than.

  • >= - Greater than or equal to.

  • > - Greater than.

The following example uses this parameter:

[3] - 
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script=None, where="SYMBOL == 'ATAD3C'", key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")
Group: MESO
ATAD3C  1p36.33 MESO


Group: ACC
ATAD3C  1p36.33 ACC
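
As an illustration of how a filter expression with the six operators listed above could be evaluated, the sketch below maps each operator onto Python's standard `operator` module. The helpers `parse_where` and `matches` are hypothetical and only handle a single `FIELD OP VALUE` expression; how OpenVariant actually parses the where parameter may differ.

```python
import operator

# Map each comparison operator to its Python equivalent.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}


def parse_where(expression):
    """Split 'FIELD OP VALUE' into field, comparison function and value."""
    field, op, value = expression.split(maxsplit=2)
    if op not in OPERATORS:
        raise ValueError(f"Unsupported operator: {op}")
    # Strip quotes around string literals; otherwise try to read a number.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    else:
        try:
            value = float(value)
        except ValueError:
            pass
    return field, OPERATORS[op], value


def matches(row, expression):
    """Return True if the row (a dict) satisfies the expression."""
    field, compare, value = parse_where(expression)
    return compare(row[field], value)


rows = [
    {"SYMBOL": "ATAD3B", "CANCER": "MESO"},
    {"SYMBOL": "ATAD3C", "CANCER": "MESO"},
    {"SYMBOL": "ATAD3C", "CANCER": "ACC"},
]
kept = [row for row in rows if matches(row, "SYMBOL == 'ATAD3C'")]
print(kept)
```

Only the two `ATAD3C` rows survive the filter, one per cancer type, which mirrors the grouped output shown above.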


The group by task also has a script parameter, which lets the user run a shell command on the parsed result of each group. The following example counts the characters in each group of the parsed output:

[4] - 
for group, values, script_used in group_by(base_path=dataset_folder, annotation_path=annotation_path, script="wc -m", key_by="CANCER", quite=True):
    print(f'Group: {group}')
    for row in values:
        print(row)
    print("\n")
Group: MESO
181


Group: ACC
172
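
The counts above can be reproduced by hand, which helps show what the script receives. The sketch below assumes each row is passed as tab-separated fields ending in a newline, and that the text reaches the command on standard input; both are inferences from the output, not confirmed details of OpenVariant. The helpers `to_text` and `run_script` are hypothetical, and `wc` must be available on the system.

```python
import subprocess

# The MESO rows from the first example in this section.
meso_rows = [
    ("ACAP3", "1p36.33", "MESO"),
    ("ACTRT2", "1p36.32", "MESO"),
    ("AGRN", "1p36.33", "MESO"),
    ("ANKRD65", "1p36.33", "MESO"),
    ("ATAD3A", "1p36.33", "MESO"),
    ("ATAD3B", "1p36.33", "MESO"),
    ("ATAD3C", "1p36.33", "MESO"),
    ("AURKAIP1", "1p36.33", "MESO"),
    ("B3GALT6", "1p36.33", "MESO"),
]


def to_text(rows):
    """Join fields with tabs and end every row with a newline."""
    return "".join("\t".join(row) + "\n" for row in rows)


def run_script(rows, script):
    """Feed the rows to a shell command on stdin and return its stdout."""
    result = subprocess.run(
        script,
        shell=True,
        input=to_text(rows),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(len(to_text(meso_rows)))         # 181, the MESO count shown above
print(run_script(meso_rows, "wc -m"))
```

The ACC group has the same nine rows with a label one character shorter, which accounts for its count of 172.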