Export module

Provides functionality to export a graph to Pantograph data storage.

Imports and templates

warnings.filterwarnings("ignore")

Functions intro

Notation and terminology

In documentation, we refer to graph nucleotides, columns and components. Components contain columns and columns contain nucleotides.

In the code, variable names and comments use slightly different notation. Columns in the documentation are called bins in the code and comments, whereas graph nucleotides in the documentation are called columns in the code and comments. This is for legacy reasons: originally there were no nucleotide numbers (columns) in the visualised graph structure, and components were split into bins (literally, equal-sized bins). This is no longer true, but the old terminology remains.

Ideally, all variable names and comments should be changed in line with the documentation notation, but I have no idea when this can happen.

For various operational or legacy reasons, some of the data structures (usually lists/arrays) use 0-based indexing, whereas others (usually dicts) can be 0-based or 1-based. Here are the main structures with numerical indexing and their index bases:

  • components: keys: 0-based, values: occupants: 0-based, binNumbers: 0-based
  • componentToNode: keys: 0-based, values: 1-based
  • nodeToComponent: keys: 0-based, values: 1-based
  • newToOldInd and oldToNewInd: both index and values are 0-based numbers of components in previous and current zoomlayer.
  • fromLinks: top level keys (from nodes): 1-based, bottom level keys (to nodes): 1-based, values (list of participants): 0-based
  • toLinks: top level keys (to nodes): 1-based, bottom level keys (from nodes): 1-based, values (list of participants): 0-based
  • fromComponentLinks: top level keys (from components): 1-based, bottom level keys (to components): 1-based, values (set of participants): 0-based
  • toComponentLinks: top level keys (to components): 1-based, bottom level keys (from components): 1-based, values (set of participants): 0-based
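
The mixed index bases are easy to get wrong when moving between structures. Below is a minimal sketch on invented toy data. It assumes that a 1-based component number k used as a link key refers to position k - 1 in components; the helper component_by_link_key is hypothetical and not part of the module.

```python
# Toy data only: the shapes follow the list above, the contents are invented.
components = ["compA", "compB", "compC"]  # 0-based positions: 0, 1, 2

# Top-level keys (from components) and bottom-level keys (to components)
# are 1-based; the participant sets hold 0-based indices.
fromComponentLinks = {1: {3: {0, 2}}}


def component_by_link_key(components, key):
    """Hypothetical helper: map a 1-based link key to a 0-based position."""
    if key < 1 or key > len(components):
        raise IndexError(f"1-based key {key} is out of range")
    return components[key - 1]


# Resolve every link to the pair of components it connects.
link_summary = []
for from_key, targets in fromComponentLinks.items():
    for to_key, participants in targets.items():
        link_summary.append(
            (
                component_by_link_key(components, from_key),
                component_by_link_key(components, to_key),
                sorted(participants),
            )
        )
```

Keeping the conversion in a single helper makes the off-by-one step explicit instead of scattering "- 1" through the code.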

Generating base layer

This set of functions generates the data structures for the initial, lowest zoom level (nucleotide or minimum unit resolution). The main orchestration function is baseLayerZoom.

Functions


outLeftRight

 outLeftRight (nodeInversionInPath, leftFarLink, rightFarLink, reason,
               debug=False, inversionThreshold=0.5)

checkForBreak

 checkForBreak (nodeIdx, nodeLen, nodePathsIdx, nodeSeqInPath,
                uniqueNodePathsIDs, pathNodeCount, pathLengths,
                pathNodeArray, pathDirArray, occupancy, inversion,
                fromLinks, toLinks, nBins, maxLengthComponent, blockEdges,
                inversionThreshold=0.5, debug=False)

Function to check whether the component should be broken before (left) and/or after (right) it.

Type Default Details
nodeIdx
nodeLen
nodePathsIdx
nodeSeqInPath
uniqueNodePathsIDs
pathNodeCount
pathLengths
pathNodeArray
pathDirArray
occupancy
inversion
fromLinks
toLinks
nBins
maxLengthComponent
blockEdges
inversionThreshold float 0.5
debug bool False
Returns leftFarLink: bool. Shows whether there is a far link on the left that will require component break.

nodeStat

 nodeStat (nodeIdx, pathNodeArray, nodeLengths)

Calculates information about a node as part of the overall graph.


finaliseComponentBase

 finaliseComponentBase (component, components, componentNucleotides,
                        matrix, occupants, nBins, componentLengths,
                        nucleotides, zoomLevel, accessions,
                        inversionThreshold=0.5)

processAnnotationInterval

 processAnnotationInterval (posStart, posEnd, annotation, res)

combineAnnotation

 combineAnnotation (combAnnotation)

updateEdges

 updateEdges (accEdge, edgeAccessions, compNum)

Fills either accStarts or accEnds (the component on which each accession starts or ends, respectively). compNum is assumed to be 1-based.

Wrapper

The ‘positions’ key in the metadata now contains either one position (chr:posStart..posEnd) or two comma-separated positions, where one is the genomic position and the other is the pangenomic position.
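
The position format described above can be read with a few string splits. This is only an illustrative sketch: parse_positions is a hypothetical helper, and the names Chr1 and Pan1 are invented. The text does not say which of the two positions comes first, so the sketch returns them in the order given without labelling them.

```python
def parse_positions(value):
    """Split a 'positions' string into (chromosome, start, end) tuples."""
    parsed = []
    for item in value.split(","):
        # Split on the last colon so the coordinates are always isolated.
        chrom, _, span = item.strip().rpartition(":")
        start, sep, end = span.partition("..")
        if not chrom or not sep:
            raise ValueError(f"Unexpected position format: {item!r}")
        parsed.append((chrom, int(start), int(end)))
    return parsed


# One position, then two comma-separated positions (invented values).
single = parse_positions("Chr1:100..250")
double = parse_positions("Chr1:100..250,Pan1:1200..1350")
```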


baseLayerZoom

 baseLayerZoom (graph, outputPath, outputName, pathNodeArray,
                pathDirArray, pathLengths, nodeLengths,
                pathNodeLengthsCum, maxLengthComponent, blockEdges,
                CPUS=32, inversionThreshold=0.5, isSeq=True, debug=False,
                debugTime=False)

Generating zoom layer

This set of functions (with nextLayerZoom as the main orchestration function) generates the next zoom level by collapsing columns and then components together, after smaller non-linear links have been removed (by a different set of functions).

Finalising bin and component


getOccInv

 getOccInv (binColLengths, binBlockLength, binOcc, binInv,
            inversionThreshold=0.5)

combineIntervals

 combineIntervals (posPath)

recordBinZoom

 recordBinZoom (occ, inv, binPosArray, nBins, nCols, binBlockLength,
                binBlockLengths, binColLengths, binColStart, binColStarts,
                binColEnd, binColEnds, matrix, inversionThreshold=0.5)

getAverageInv

 getAverageInv (binBlockLengths, matrixPathArray)

finaliseComponentZoom

 finaliseComponentZoom (component, components, componentLengths, nBins,
                        nCols, occupants, binBlockLengths, binColStarts,
                        binColEnds, matrix, starts, ends, forwardPaths,
                        invertedPaths, compInvNum, compInvDen,
                        inversionThreshold=0.5)
Type Default Details
component
components
componentLengths componentNucleotides,
nBins
nCols
occupants
binBlockLengths
binColStarts
binColEnds
matrix
starts
ends
forwardPaths
invertedPaths
compInvNum
compInvDen
inversionThreshold float 0.5

finaliseBinZoom

 finaliseBinZoom (compNum, binOcc, binInv, binPosArray, nBins, nCols,
                  binBlockLength, binBlockLengths, binColLengths,
                  binColStart, binColStarts, binColEnd, binColEnds,
                  matrix, newComponent, newComponents,
                  newComponentLengths, newFromComponentLinks,
                  newToComponentLinks, occupants, linkLengths, starts,
                  ends, forwardPaths, invertedPaths, pathsToInversion,
                  newToOldInd, oldToNewInd, inversionThreshold=0.5)
Type Default Details
compNum
binOcc
binInv
binPosArray
nBins
nCols
binBlockLength
binBlockLengths
binColLengths
binColStart
binColStarts
binColEnd
binColEnds
matrix
newComponent
newComponents
newComponentLengths compAccDir,#newComponentNucleotides,
newFromComponentLinks
newToComponentLinks
occupants
linkLengths
starts
ends
forwardPaths
invertedPaths
pathsToInversion
newToOldInd
oldToNewInd
inversionThreshold float 0.5

Break component?


getMatrixPathElement

 getMatrixPathElement (matrix, pathID)

checkChange

 checkChange (compNum, components, zoomLevel, blockEdges)

joinComponents

 joinComponents (leftComp, rightComp, maxLengthComponent,
                 inversionThreshold=0.5)

!!! ⚠️ Currently not used

If the joining was successful, the function will return a joined component.

If the joining was not successful and was aborted, the function will return a list of the original components. The joining is aborted for any of the following reasons:

  • In one of the paths, the inversion is lower than the threshold in one component and higher in the other.
  • The left component contains at least one end.
  • The right component contains at least one start.

The function does not check for links arriving at or leaving from the right side of the left component or the left side of the right component. It simply takes the left links from the left component and the right links from the right component and assigns them to the new component.
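
The abort conditions can be expressed as a small predicate. The sketch below is not the joinComponents implementation: it assumes a hypothetical component layout (a dict with a per-path "inversion" value and "starts"/"ends" collections), treats a value strictly above the threshold as inverted, and ignores maxLengthComponent.

```python
def can_join(left, right, inversionThreshold=0.5):
    """Return True if none of the three abort conditions applies."""
    if left["ends"]:  # left component contains at least one end
        return False
    if right["starts"]:  # right component contains at least one start
        return False
    # Compare inversion only for paths present in both components.
    shared = left["inversion"].keys() & right["inversion"].keys()
    for path in shared:
        left_inverted = left["inversion"][path] > inversionThreshold
        right_inverted = right["inversion"][path] > inversionThreshold
        if left_inverted != right_inverted:
            return False
    return True


# Invented example components in the assumed layout.
compA = {"inversion": {0: 0.1, 1: 0.9}, "starts": [], "ends": []}
compB = {"inversion": {0: 0.2, 1: 0.8}, "starts": [], "ends": []}
compC = {"inversion": {0: 0.7, 1: 0.8}, "starts": [], "ends": []}
compD = {"inversion": {0: 0.2}, "starts": [0], "ends": []}
```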


checkLinksZoom

 checkLinksZoom (compNum, fromComponentLinks, toComponentLinks)

checkForBreaksZoom

 checkForBreaksZoom (zoomLevel, compNum, components, fromComponentLinks,
                     toComponentLinks, blockEdges)

splitPositiveNegative

 splitPositiveNegative (compID, accs, components)

This function simply pulls all accessions present in the component and splits them into forward and inverted.

Type Details
compID
accs
components
Returns posAcc: list[int]. IDs of accessions which have forward direction in the given component.

intersectAccLists

 intersectAccLists (accList, dirDict)

Main layer generation function + assistant function


isStartEnd

 isStartEnd (compNum, components)

nextLayerZoom

 nextLayerZoom (zoomLevel, components, componentLengths,
                fromComponentLinks, toComponentLinks, graph, accStarts,
                accEnds, maxLengthComponent, linkLengths, pairedLinks,
                interconnectedLinks, blockEdges, inversionThreshold=0.5,
                debug=False, debugTime=False)
Type Default Details
zoomLevel
components
componentLengths componentNucleotides,
fromComponentLinks
toComponentLinks
graph
accStarts
accEnds
maxLengthComponent
linkLengths
pairedLinks
interconnectedLinks
blockEdges
inversionThreshold float 0.5
debug bool False
debugTime bool False

Clear elements too small to show

This set of functions (with clearInvisible as the orchestrating function) looks at the previously identified associations between non-linear links and their sizes (numbers of nucleotides). If the next zoom level is larger than some of these sizes, the corresponding links are removed (and some linear links are reinstated instead).

After that, isolated blocks are identified and removed. An isolated block is a contiguous block of components (columns) that are connected only to each other and not to any component outside the block.
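
The definition of an isolated block can be checked directly against the link dictionaries. The following is a sketch of the definition only, not of identifyIsolatedBlocks: the helper is_isolated_block is hypothetical, the toy links are invented, and only links recorded in the two dictionaries are considered.

```python
def is_isolated_block(blockStart, blockEnd, fromComponentLinks, toComponentLinks):
    """True if no recorded link joins a component in the block to one outside it."""
    for links in (fromComponentLinks, toComponentLinks):
        for comp in range(blockStart, blockEnd + 1):
            # Bottom-level keys are the components on the other end of a link.
            for other in links.get(comp, {}):
                if not (blockStart <= other <= blockEnd):
                    return False
    return True


# Toy links (1-based component numbers): 2 -> 3, 3 -> 2 and 4 -> 6.
fromComponentLinks = {2: {3: {0}}, 3: {2: {1}}, 4: {6: {0}}}
toComponentLinks = {3: {2: {0}}, 2: {3: {1}}, 6: {4: {0}}}

block_2_3 = is_isolated_block(2, 3, fromComponentLinks, toComponentLinks)
block_4_5 = is_isolated_block(4, 5, fromComponentLinks, toComponentLinks)
```

Both dictionaries are scanned so that links leaving the block and links entering it are caught.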

processCollapsibleBlocks

 processCollapsibleBlocks (zoomLevel, linkLengths, pairedLinks,
                           interconnectedLinks, fromComponentLinks,
                           toComponentLinks)

clearRearrangementBlocks

 clearRearrangementBlocks (zoomLevel, blockEdges)

Find isolated blocks

Identify empty edges


testStartEnd

 testStartEnd (compNum, isLeft, components, accStarts, accEnds)

findEmptyEdges

 findEmptyEdges (fromComponentLinks, toComponentLinks, accStarts, accEnds,
                 components)

Identify all empty edges by simply finding components that do not appear either in toComponentLinks (left empty) or fromComponentLinks (right empty)
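
The rule above amounts to two membership tests per component. This sketch is a simplification under stated assumptions: find_empty_edges_sketch is a hypothetical name, component numbers are taken to be 1-based, a key with no targets is treated as absent, and the accStarts, accEnds and components arguments of the real function are ignored.

```python
def find_empty_edges_sketch(nComponents, fromComponentLinks, toComponentLinks):
    """Return (leftEmpty, rightEmpty) lists of 1-based component numbers."""
    compNums = range(1, nComponents + 1)
    # No incoming links recorded: the left edge is empty.
    leftEmpty = [c for c in compNums if not toComponentLinks.get(c)]
    # No outgoing links recorded: the right edge is empty.
    rightEmpty = [c for c in compNums if not fromComponentLinks.get(c)]
    return leftEmpty, rightEmpty


# Toy links: 1 -> 2 and 2 -> 4; component 3 has no links at all.
fromLinksToy = {1: {2: {0}}, 2: {4: {0}}}
toLinksToy = {2: {1: {0}}, 4: {2: {0}}}
leftEmpty, rightEmpty = find_empty_edges_sketch(4, fromLinksToy, toLinksToy)
```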

Identify isolated blocks


createNewBoundaries

 createNewBoundaries (blockStart, blockEnd, externalLinksComps,
                      leftEmptyList, rightEmptyList)
# Test for `createNewBoundaries`
import numpy as np

st = [2,5,6,8]
end = [2,3,4,6,8,9,10,11]

blocks = [[2,11],[2,3],[5,11],[8,11],[8,9],[8,11]]
blockSplits = [[[2,3],[5,11]],[[2,2]],[[6,6],[8,11]],[[8,9]],[[8,8]],[]]
externals = [[4],[3],[5,7],[10],[9],[8,9,10,11]]

for bl,blSpl,ext in zip(blocks,blockSplits,externals):
    blSplTT = createNewBoundaries(*bl,ext,st,end)
    assert blSpl == blSplTT,f'Expected {blSpl}, but got {blSplTT}'
# Another test for `createNewBoundaries`
leftEmptyList = [2056, 3080, 3081, 2092, 2099, 1593, 3643, 2627, 1116, 2653, 2655, 3168, 2658, 613, 1637, 1638, 106, 1654, 2695, 2192, 1169, 1686, 2714, 3757, 2233, 3781, 723, 1240, 224, 1761, 1762, 1766, 3323, 1804, 786, 2331, 802, 2850, 807, 811, 1839, 1841, 3396, 3397, 1863, 3400, 843, 3423, 1898, 1899, 882, 884, 3463, 402, 2451, 3478, 408, 3482, 934, 426, 1962, 3504, 3516, 3519, 3520, 451, 1994, 1995, 972, 2506, 463, 3024, 1493, 1494, 3542, 1525]
rightEmptyList = [402, 2451, 3478, 407, 280, 3482, 2848, 802, 934, 807, 426, 811, 2091, 2092, 3757, 1839, 3504, 1841, 3516, 3519, 3405, 463, 722, 1240, 1761, 1762, 1766, 1899]
blockStart = 3396
blockEnd = 3405
externalLinksComps = [3396, 3397, 3398, 3399, 3400, 3401, 3402, 3403, 3404, 3405]

createNewBoundaries(blockStart,blockEnd,externalLinksComps,leftEmptyList,rightEmptyList)
[]

identifyIsolatedBlocks

 identifyIsolatedBlocks (leftEmptyList, rightEmptyList,
                         fromComponentLinks, toComponentLinks, components)

Removing Isolated Blocks


updateLinksRemoveComp

 updateLinksRemoveComp (oldToNewInd, fromComponentLinks, toComponentLinks,
                        linkLengths, pairedLinks, interconnectedLinks,
                        blockEdges, accStarts, accEnds)

removeIsolatedBlocks

 removeIsolatedBlocks (isolatedBlockList, components, componentLengths,
                       fromComponentLinks, toComponentLinks, accStarts,
                       accEnds, linkLengths, pairedLinks,
                       interconnectedLinks, blockEdges)

Clearing small element wrapping function


clearInvisible

 clearInvisible (zoomLevel, linkLengths, pairedLinks, interconnectedLinks,
                 blockEdges, fromComponentLinks, toComponentLinks,
                 accStarts, accEnds, components, componentLengths)

Exporting layer

These functions, with exportLayer as the main one, export a prepared zoom level (cleaned and collapsed by other functions) into the Pantograph Visualisation tool data structures (JSON chunk files).


createZoomLevelDir

 createZoomLevelDir (outputPath, outputName, zoomLevel)

Creates a directory for zoom level chunks. The function takes care of using the correct directory separator.
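
A portable way to build such a directory is to let the standard library choose the separator. The sketch below assumes a layout of outputPath/outputName/zoomLevel, which may differ from the actual case directory naming; create_zoom_level_dir_sketch is a hypothetical stand-in, not the module's function.

```python
import os
import tempfile


def create_zoom_level_dir_sketch(outputPath, outputName, zoomLevel):
    """Create <outputPath>/<outputName>/<zoomLevel> and return its path."""
    # os.path.join inserts the separator appropriate for the current OS.
    zoomDir = os.path.join(outputPath, outputName, str(zoomLevel))
    # exist_ok=True makes repeated calls for the same level harmless.
    os.makedirs(zoomDir, exist_ok=True)
    return zoomDir


# Try it in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_zoom_level_dir_sketch(tmp, "case1", 4)
    exists = os.path.isdir(created)
    relative = os.path.relpath(created, tmp)
```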


finaliseChunk

 finaliseChunk (rootStruct, zoomLevel, chunk, nucleotides, nBins,
                chunkNum, curCompCols, prevTotalCols, outputPath,
                outputName)

addLinksToComp

 addLinksToComp (compNum, components, fromComponentLinks,
                 toComponentLinks)

searchIndicesPosRecord

 searchIndicesPosRecord (redisConn, redisCaseID, zoomLevel, accessions,
                         posMapping)

exportLayer

 exportLayer (zoomLevel, components, componentNucleotides,
              fromComponentLinks, toComponentLinks, rootStruct,
              outputPath, outputName, maxLengthComponent, maxLengthChunk,
              inversionThreshold=0.5, redisConn=None, redisCaseID=None,
              accessions=None, debug=False)

Main exporter wrapper with its helper functions

This is the main orchestrating function, which exports a single graph to the Pantograph Visualisation tool, together with a couple of auxiliary functions.


recordZoomLevelForDebug

 recordZoomLevelForDebug (zoomNodeToComponent, zoomComponentToNodes,
                          zoomComponents, nodeToComponent,
                          componentToNodes, components, zoomLevel)

A function which records the result of segmentation into dictionaries that hold the results for all zoom levels. It is currently used only for debugging purposes; in normal operation, the all-zoom-level dictionaries are neither created nor used.

Type Details
zoomNodeToComponent
zoomComponentToNodes
zoomComponents
nodeToComponent
componentToNodes
components
zoomLevel
Returns Returns modified dictionaries with zoom in the beginning of the names. Theoretically,

searchIndicesGeneRecord

 searchIndicesGeneRecord (redisConn, redisCaseID, geneMapping,
                          genPosMapping, altChrGenPosMapping,
                          genPosSearchMapping, pangenPosSearchMapping)

Records the prepared metadata structures into the Redis DB.


exportToPantograph

 exportToPantograph (graph=None, inputPath=None, GenomeGraphParams={},
                     outputPath=None, outputName=None, outputSuffix=None,
                     isSeq=True, nodeLengths=None, redisConn=None,
                     zoomLevels=[1], fillZoomLevels=True,
                     maxLengthComponent=100, maxLengthChunk=20,
                     inversionThreshold=0.5, debug=False,
                     returnDebugData=False)

This function is used by exportProject function and should not normally be used independently now.

Project generation


exportProject

 exportProject (projectID, projectName, caseDict, pathToIndex,
                pathToGraphs, redisHost=None, redisPort=6379, redisDB=0,
                suffix='', maxLengthComponent=100, maxLengthChunk=6,
                inversionThreshold=0.5, isSeq=True, zoomLevels=[1],
                fillZoomLevel=True)

This is the only function that should normally be used to export a set of graphs (e.g. a graph per chromosome) to Pantograph Visualisation tool as a project (or interconnected structure).

Exporting of each graph creates a case directory _

with a bin2file.json file that describes the case overall and each zoom level. Each zoom level is stored in multiple chunk JSON files; zoom level n is in the directory n inside the case directory. Each JSON chunk file contains all the information required to visualise up to maxLengthChunk components at a given zoom level.
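
The chunk size limit can be pictured as plain list slicing. This sketch only illustrates the "up to maxLengthChunk components per chunk" rule on invented data; split_into_chunks is a hypothetical helper, and the real exportLayer may apply further criteria when closing a chunk.

```python
def split_into_chunks(components, maxLengthChunk):
    """Group components into consecutive chunks of at most maxLengthChunk items."""
    if maxLengthChunk < 1:
        raise ValueError("maxLengthChunk must be at least 1")
    return [
        components[i:i + maxLengthChunk]
        for i in range(0, len(components), maxLengthChunk)
    ]


# 14 placeholder components with a limit of 6 per chunk.
chunks = split_into_chunks(list(range(14)), 6)
chunk_sizes = [len(chunk) for chunk in chunks]
```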

All case directories are in the project directory together with <projectID>_project.json, which simply provides the association between case names and the corresponding directory names.

Finally, information about the project will be recorded to Pantograph Visualisation tool data index to make it discoverable by the tool.

In addition, no metadata is recorded in these files, as it inflates them very quickly. Instead, a very simple (optional) API works alongside the main Pantograph Visualisation tool; it provides various metadata on request if the API is available and does nothing if it is not. This API uses a Redis DB with a special DB schema.

When graphs are exported, some metadata (annotations, genome and pangenome positions) can be recorded to the Redis DB. If the Redis DB is not available or recording of metadata is not needed, the redisHost parameter should be omitted. Otherwise, if the Redis DB is available and metadata should be recorded, redisHost should be set to the hostname (or IP address) of the Redis DB server.