Graph module

Provides the main class, helper classes and functions for handling pangenome graphs and related data structures.

Imports

Path conversion utility functions

These are utility functions, but because they are directly related to graph structures, they are placed in this module rather than in the Util module.

calcNodeLengths

 calcNodeLengths (graph)

Simple function that calculates node lengths (in visual columns).

For a nucleotide graph, it calculates the number of nucleotides in each node; for a non-nucleotide graph, it returns 1 for each node.
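
As a rough illustration of this rule, the sketch below works on a plain list of node sequences rather than on a GenomeGraph instance. The function name node_lengths_sketch and its arguments are hypothetical and are not part of this module.

```python
def node_lengths_sketch(node_sequences, is_nucleotide=True):
    """Return one length per node (hypothetical helper, not the module's API)."""
    if is_nucleotide:
        # Nucleotide graph: a node occupies one column per nucleotide.
        return [len(seq) for seq in node_sequences]
    # Gene/block graph: every node occupies a single column.
    return [1 for _ in node_sequences]


print(node_lengths_sketch(["ACGT", "A", "GGC"]))
print(node_lengths_sketch(["geneA", "geneB"], is_nucleotide=False))
```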

initialPathAnalysis

 initialPathAnalysis (graph, nodeLengths)

This function creates auxiliary data structures to make it easier to work with paths and their relationships with nodes.

getNodesStructurePathNodeInversionRate

 getNodesStructurePathNodeInversionRate (pathNodeArray, pathDirArray,
                                         pathLengths,
                                         inversionThreshold=0.5)

Generate a dict of dicts which stores information about inversion rate of each node for each path.
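
One possible reading of "inversion rate" is the fraction of passes of a node within a path that are in the inverted direction. The sketch below computes that on paths written as lists of strings such as "3-" (node number followed by direction). Both this interpretation and the input format are assumptions; the real function works on the arrays produced by initialPathAnalysis, and the name inversion_rates_sketch is hypothetical.

```python
from collections import defaultdict


def inversion_rates_sketch(paths):
    """Map path index -> {node number: fraction of inverted passes}."""
    rates = {}
    for path_idx, path in enumerate(paths):
        total = defaultdict(int)
        inverted = defaultdict(int)
        for step in path:
            node, direction = int(step[:-1]), step[-1]
            total[node] += 1
            if direction == "-":
                inverted[node] += 1
        # One rate per node that the path actually passes.
        rates[path_idx] = {node: inverted[node] / total[node] for node in total}
    return rates


rates = inversion_rates_sketch([["1+", "2-", "2+", "3-"], ["1-", "2-"]])
print(rates)
```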

convertPathsToGraph

 convertPathsToGraph (fullPath, doSorting=False, v2=False)

getPathNodeInversionRate

 getPathNodeInversionRate (pathNodeArray, pathDirArray, pathLengths,
                           inversionThreshold=0.5)

Deprecated. Generates a dict of dicts which stores information about the inversion rate of each node for each path.

pathNodeDirToCombinedArray

 pathNodeDirToCombinedArray (pathNodeArray, pathDirArray)

Combines the path node and direction arrays (returned by the graph.initialPathAnalysis function as pathNodeArray and pathDirArray, output indices 1 and 3) into a single array.

getNextNodePath

 getNextNodePath (pathNodeArray, pathLengths)

When following a path to find the next node in graph order after node k, the next node is not necessarily k+1; it can be k+p if nodes k+1, …, k+p−1 are not passed by the path.

First, create a list of all unique node numbers in each path (pathUnique), using either:
- set(path)
- *** np.unique(path)

The preferred option depends on which option is chosen for the next step.

Then do one of the following:
- Leave the list as it is and, in a loop, check whether k+p (with p = 1, 2, …) is in the path (k+p in pathUnique). Probably slow.
- Sort pathUnique and, for every node whose successor is needed, find its index and take the next element: pathUnique[pathUnique.index(k)+1].
- Create a dict entry for each node (except the last one) with k as key and k+p as value. At a break in the path at position k, the next node is dict[k].
- *** Compute np.diff(np.sort(pathUnique)) and create a dict entry for each node after which diff != 1, with k as key and k+p as value. At a break in the path at position k, check whether k is in the dict: if so, the next node is dict[k], otherwise it is k+1.

This function generates the dict as described in the triple-starred (***) options, which are the ones currently in use.
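
The triple-starred approach can be sketched for a single path as follows. To keep the example free of dependencies, np.unique and np.diff are replaced here by sorted(set(...)) and pairwise differences, which give the same result for integer node numbers. The names next_node_dict_sketch and next_in_order are hypothetical; the real function takes pathNodeArray and pathLengths and handles all paths.

```python
def next_node_dict_sketch(path):
    """Map k -> k+p for every node k of the path followed by a gap in graph order."""
    path_unique = sorted(set(path))      # same values as np.unique(path)
    jumps = {}
    for k, nxt in zip(path_unique, path_unique[1:]):
        if nxt - k != 1:                 # same test as np.diff(...) != 1
            jumps[k] = nxt
    return jumps


def next_in_order(k, jumps):
    # If k is followed by a gap, use the stored jump; otherwise the next node is k+1.
    return jumps[k] if k in jumps else k + 1


jumps = next_node_dict_sketch([1, 2, 5, 6, 2, 9])
print(jumps)
print(next_in_order(1, jumps), next_in_order(2, jumps), next_in_order(6, jumps))
```

The dict stays small because only nodes followed by a gap get an entry. The last node of the path has no successor, so the caller has to handle it separately.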

Helper class implementing the link access mechanism.

LinkGetter

 LinkGetter (nodes, links)

This auxiliary class creates a virtual subscriptable structure inside the main GenomeGraph class so that links can be accessed as an iterable and subscriptable object. Full links (with directionality) are available in the main class. This class provides a simplified view of the links without directionality (only the fact that a link goes from node A to node B).
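
The idea of a direction-free, subscriptable view can be sketched with a small class that implements __getitem__, __iter__ and __len__. The class LinkViewSketch is hypothetical and is not the actual LinkGetter; it assumes that links are stored as a nested dict of the form {from node: {from direction: [(to node, to direction), ...]}}.

```python
class LinkViewSketch:
    """Read-only, direction-free view of links (hypothetical, not LinkGetter)."""

    def __init__(self, nodes, links):
        self._nodes = nodes
        self._links = links  # {from_number: {from_dir: [(to_number, to_dir), ...]}}

    def __getitem__(self, node_number):
        # Collect target node numbers regardless of direction, without duplicates.
        targets = []
        for to_list in self._links.get(node_number, {}).values():
            for to_number, _to_dir in to_list:
                if to_number not in targets:
                    targets.append(to_number)
        return targets

    def __iter__(self):
        # Iterate over the nodes that have outgoing links.
        return iter(self._links)

    def __len__(self):
        return len(self._links)


links = {1: {"+": [(2, "+"), (3, "-")], "-": [(2, "-")]}, 2: {"+": [(3, "+")]}}
view = LinkViewSketch(["n1", "n2", "n3"], links)
print(view[1], view[2], view[3], list(view))
```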

Main class holding all data and main methods operating the graph

Graph definition, constructor and some utils for constructor

GenomeGraph

 GenomeGraph (gfaPath=None, doOverlapCleaning=True, paths=None,
              nodes=None, nodesData=None, links=None, pathsDict=None,
              sequenceFilesDict=None, annotationFiles=None,
              pangenomeFiles=None, doBack=False, verbose=True, **kwargs)

This is the constructor of the GenomeGraph class. The class can hold not only a vanilla genome graph but also a lot of extra information, and it can manipulate graphs in multiple ways, including sorting graphs, adding and deleting nodes and links, and manipulating metadata.

At the moment, there are four ways to create an instance, depending on which parameters are passed to the constructor. If parameters for more than one option are passed, the constructor follows a priority order. Each option and its priority are described below.

Priority 1: If gfaPath is a path to a GFA file, the file is loaded as is. In this case, the following options are available:

accessionsToRemove: list or None (default). If not None, a list of strings; any path whose name contains one of these strings is ignored while loading.

isGFASeq: boolean (default: True). Whether the graph should be considered as a sequence graph (True) or as a gene/block graph (False).

Priority 2: If nodes, links and paths are provided (not None), they should be as follows:

nodes: list[str]: a list of strings with node IDs (unique)

links: dict{int:dict{str:list[tuple(int,str)]}}: a dict whose keys are integer 1-based node numbers (in the order of self.nodes) of the nodes from which links start. Each value is a dict keyed by the directionality of the from node (‘+’ for normal direction or ‘-’ for inverse), whose value is a list of two-element tuples: the first element is the integer 1-based number of the node to which the link goes, and the second is the string ‘+’ or ‘-’ giving the direction in which that node is represented in this link.

paths: list[list[str]]: a list containing one list per accession/path; each path is a list of strings of the format ‘{1-based node number}{directionality}’, where {1-based node number} is the integer 1-based number of the node in the order of nodes, and {directionality} is either ‘+’ for normal direction or ‘-’ for inverted.
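
To make the three structures concrete, the sketch below writes out nodes and paths for a tiny three-node graph and derives a links dict of the described shape from the paths. The helper name links_from_paths_sketch is hypothetical, and deriving links from consecutive path steps is an assumption made for illustration, not necessarily how the class builds them.

```python
def links_from_paths_sketch(paths):
    """Build the nested links dict from consecutive steps of each path."""
    links = {}
    for path in paths:
        for src, dst in zip(path, path[1:]):
            from_node, from_dir = int(src[:-1]), src[-1]
            to_node, to_dir = int(dst[:-1]), dst[-1]
            targets = links.setdefault(from_node, {}).setdefault(from_dir, [])
            if (to_node, to_dir) not in targets:
                targets.append((to_node, to_dir))
    return links


nodes = ["n1", "n2", "n3"]       # unique node IDs
paths = [
    ["1+", "2+", "3+"],          # first accession passes all three nodes
    ["1+", "3-"],                # second accession skips node 2 and inverts node 3
]
links = links_from_paths_sketch(paths)
print(links)
```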

Priority 3: If pathsDict is provided, the graph is created from the paths of multiple accessions. pathsDict is a dict{int:list[str]}; keys are names of accessions, and values are lists of strings of the format ‘{node name}{directionality}’, where {node name} is a unique name identifying the node and {directionality} is either ‘+’ for normal direction or ‘-’ for inverted. Note that this can create only a no-sequence graph (e.g. a gene graph). Sequences can be added later by adding them to the GenomeGraph.nodesData list.

An extra optional parameter is:

nodeNameLengths: list[int] or None, a list of alternative node lengths. By default, each node is represented as a single cell/column; if this list is provided, nodes can have variable lengths.
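
The relation between named-node paths and the numbered paths of Priority 2 can be sketched as below. The function number_paths_sketch is hypothetical, the accession names are written as strings for readability, and numbering nodes in order of first appearance (with accessions visited in sorted order) is an assumption, not the documented behaviour of the constructor.

```python
def number_paths_sketch(paths_dict):
    """Turn named-node paths into a node list and 1-based numbered paths."""
    nodes, index, paths = [], {}, []
    for accession in sorted(paths_dict):
        numbered = []
        for step in paths_dict[accession]:
            name, direction = step[:-1], step[-1]
            if name not in index:
                nodes.append(name)
                index[name] = len(nodes)   # 1-based node number
            numbered.append(f"{index[name]}{direction}")
        paths.append(numbered)
    return nodes, paths


paths_dict = {
    "acc1": ["geneA+", "geneB+", "geneC+"],
    "acc2": ["geneA+", "geneC-", "geneD+"],
}
nodes, paths = number_paths_sketch(paths_dict)
print(nodes)
print(paths)
```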

Priority 4: If annotationFiles is a list of paths to annotation (GFF3) files, then the following extra options are available:

sequenceFilesDict: a dict{str:str}, where keys are IDs of accessions used in annotation files and values are paths to FASTA files (relative to the GFF files). FASTA sequence names are assumed to be the same as GFF3 sequence names.

pangenomeFiles: list[str], a list of GFF3 files for the same intervals as in annotationFiles; ID fields in the GFF3 metadata should be the same.

accOrder: list or None (default), Order of accessions in the graph. If None, accessions will be sorted in alphabetical order.

chromosome: str or None, if None, create one graph for all chromosomes (not fully implemented, see manual), otherwise, create only one graph for given chromosome.

doUS: boolean (default: False) Add unrelated sequence blocks between annotated genes/blocks.

refAnnotationFile: str. If given, it must be a path to a GFF3 file with the reference annotation, using reference notation for gene names. In the main annotations, reference gene names should be identified either in “gene” records under the “AT” key (prioritised) or in “mRNA” records under the “Name” key (fallback). If ATMap is provided, then

refSequenceFile: str or None (default). If provided with path, then it will be used to obtain sequences of each block/gene.

transmapFile: a tab-delimited file with a column “Orthogroup”, which contains similarity IDs for genes, and a column named by transmapFileRefCol, which contains reference annotation gene names.

transmapFileRefCol: str or None, a name of column for reference gene names in transmapFile

refAccession: str or None (default). Accession ID for the reference annotation (should be provided if refAnnotationFile is provided).

Import graph from file

From GFA

From Paths

From annotations

Import or create annotation

From annotation files

GenomeGraph.loadAnnotations

 GenomeGraph.loadAnnotations (annotationPath, seqSuffix)

This function is intended to add interval metadata (annotations) to a sequence (nucleotide) graph. It has never been properly tested.

Create annotation from nodes (artificial annotation)

GenomeGraph.updateAnnotationFromNodes

 GenomeGraph.updateAnnotationFromNodes (isSeq=True)

This function is used only for primitive block graphs (e.g. gene and chain graphs) when no proper annotation is available (e.g. the graph was created from paths and some extra information about nodes is needed).

It takes the “name” of each node either from graph.nodes (if isSeq is False) or from graph.nodesData (if isSeq is True).

Parameters

isSeq: boolean (default: True). If True, node names are taken from graph.nodesData; if False, they are taken from graph.nodes.

Graph sorting

GenomeGraph.generateTremauxTree

 GenomeGraph.generateTremauxTree (byPath=True)

This function generates a Tremaux tree for the graph. It is not a simple Tremaux tree: it requires an adjustment process, which happens inside the TremauxTree class constructor.
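
For orientation only: a Tremaux tree of an undirected graph is a depth-first-search spanning tree. The sketch below builds such a tree with a plain iterative DFS over an adjacency dict. It does not reproduce the adjustment process or the byPath option, and the function dfs_tree_sketch and its input format are hypothetical.

```python
def dfs_tree_sketch(adjacency, root):
    """Return {node: parent} for a depth-first spanning tree rooted at root."""
    parent = {root: None}
    stack = [(root, iter(adjacency.get(root, [])))]
    while stack:
        node, neighbours = stack[-1]
        for nxt in neighbours:
            if nxt not in parent:
                # First unvisited neighbour: descend into it.
                parent[nxt] = node
                stack.append((nxt, iter(adjacency.get(nxt, []))))
                break
        else:
            # All neighbours of this node are explored: backtrack.
            stack.pop()
    return parent


adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(dfs_tree_sketch(adjacency, 1))
```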

GenomeGraph.treeSort

 GenomeGraph.treeSort (byPath=True, bubblePriorityThreshold=0.5)

This is the main function for sorting the graph. It requires some further work, but works relatively well as is.

Export/Save (to GFA)

GenomeGraph.toGFA

 GenomeGraph.toGFA (gfaFile, doSeq=True)

Writes the existing graph structures to a GFA v1 file, plus some JSON and joblib files to preserve extra metadata.
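
The GFA v1 part of the output can be illustrated with a minimal sketch that turns nodes, links and paths into header (H), segment (S), link (L) and path (P) records. The function gfa_lines_sketch and its arguments are hypothetical; it writes a zero-length overlap (0M) for every link, and it does not produce the extra JSON and joblib files.

```python
def gfa_lines_sketch(nodes, sequences, links, paths, path_names):
    """Return GFA v1 records as a list of tab-separated strings."""
    lines = ["H\tVN:Z:1.0"]
    for node_id, seq in zip(nodes, sequences):
        lines.append(f"S\t{node_id}\t{seq}")
    for from_number, by_dir in links.items():
        for from_dir, targets in by_dir.items():
            for to_number, to_dir in targets:
                # Node numbers are 1-based, list indices are 0-based.
                from_id, to_id = nodes[from_number - 1], nodes[to_number - 1]
                lines.append(f"L\t{from_id}\t{from_dir}\t{to_id}\t{to_dir}\t0M")
    for name, path in zip(path_names, paths):
        steps = ",".join(f"{nodes[int(s[:-1]) - 1]}{s[-1]}" for s in path)
        lines.append(f"P\t{name}\t{steps}\t*")
    return lines


gfa = gfa_lines_sketch(
    nodes=["n1", "n2"],
    sequences=["ACG", "T"],
    links={1: {"+": [(2, "-")]}},
    paths=[["1+", "2-"]],
    path_names=["acc1"],
)
print("\n".join(gfa))
```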

Elements operations (for nodes, links, annotations, etc.)

Add node, link and accession (not properly implemented or not tested)

Another function is needed here to add an accession with its relevant private nodes and links. The functions below would then possibly be used as private functions rather than as external API.

GenomeGraph.addAccessionAnnotation

 GenomeGraph.addAccessionAnnotation (annotationFile, sequenceFile=None)

Ideally, this function should be able to add one accession to an existing graph. When implemented, _graphFromAnnotation should use this function.

GenomeGraph.addLink

 GenomeGraph.addLink (fromNode, fromStrand, toNode, toStrand)

Needs testing. It is not clear whether this actually makes sense, as links not present in any of the paths do not play any role.

GenomeGraph.addNode

 GenomeGraph.addNode (nodeID, data=None)

Needs testing. Again, there is no point in adding a node to a graph if this node is not present in any of the paths.

Node Inversion

GenomeGraph.invertNodes

 GenomeGraph.invertNodes ()

This function looks at the inversion/strand of each node in each path. If more than half of the paths pass a node in the “inverted” direction, the inversion is switched over (currently inverted passes become normal and normal passes become inverted). It is done every time a graph is loaded. It should possibly be made optional, as it takes a lot of time for larger graphs.
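
The majority rule can be sketched on paths written as lists of strings such as "2-". This sketch counts individual passes rather than paths, which is an assumption, and it rewrites only the path strings, leaving links and node data aside. The function name invert_majority_sketch is hypothetical.

```python
from collections import Counter


def invert_majority_sketch(paths):
    """Flip every node that is inverted in more than half of its passes."""
    total, inverted = Counter(), Counter()
    for path in paths:
        for step in path:
            node = int(step[:-1])
            total[node] += 1
            inverted[node] += step[-1] == "-"
    to_flip = {node for node in total if inverted[node] > total[node] / 2}
    flip = {"+": "-", "-": "+"}
    new_paths = [
        [
            f"{step[:-1]}{flip[step[-1]]}" if int(step[:-1]) in to_flip else step
            for step in path
        ]
        for path in paths
    ]
    return new_paths, to_flip


new_paths, flipped = invert_majority_sketch([["1+", "2-"], ["1+", "2-"], ["1-", "2+"]])
print(new_paths, flipped)
```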

Node removal methods

GenomeGraph.removeNodes

 GenomeGraph.removeNodes (nodeIDsToRemove)

This function (and those related to it) removes nodes from a graph. Normally this should not be done to a graph, as it makes it invalid in most cases, but it is very important functionality for the removal of overlaps (see below).

Node Substitution (not implemented)

Node overlap removal

GenomeGraph.removeOverlaps

 GenomeGraph.removeOverlaps ()

When a graph (nucleotide only) is loaded, overlaps are allowed and can be provided using CIGAR strings (they are not checked). Such overlaps can appear, for instance, when a compacted de Bruijn graph is used (e.g. generated by CUTTLEFISH). They should be removed so that the graph is not artificially overcomplicated.

Unfortunately, the current implementation does not work properly and needs to be looked at in detail. It is most probably overcomplicated and overthought; the task should be relatively easy to do.

Utility methods

Link conversion methods

There are two different types of data structure for links: based on sets and based on dicts. Normally the dict type is used, but some specific operations need the set type. The following two functions convert between the two.
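
A round trip between the two representations might look like the sketch below. The set form used here, a set of (from node, from direction, to node, to direction) tuples, is an assumption, as are the function names links_dict_to_set and links_set_to_dict; the actual methods of the class may use a different layout.

```python
def links_dict_to_set(links):
    """Flatten the nested links dict into a set of 4-tuples."""
    return {
        (from_node, from_dir, to_node, to_dir)
        for from_node, by_dir in links.items()
        for from_dir, targets in by_dir.items()
        for to_node, to_dir in targets
    }


def links_set_to_dict(link_set):
    """Rebuild the nested links dict from a set of 4-tuples."""
    links = {}
    # Sorting makes the order of the rebuilt lists deterministic.
    for from_node, from_dir, to_node, to_dir in sorted(link_set):
        links.setdefault(from_node, {}).setdefault(from_dir, []).append((to_node, to_dir))
    return links


links = {1: {"+": [(2, "+"), (3, "-")]}, 2: {"+": [(3, "+")]}}
link_set = links_dict_to_set(links)
print(link_set)
print(links_set_to_dict(link_set))
```

The set form makes membership tests and set operations such as union and difference straightforward, while the dict form gives direct access to the links leaving a given node.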