Export module

Provides functionality to export a graph to Pantograph data storage.

Imports and templates

import warnings
warnings.filterwarnings("ignore")

Functions intro

Notation and terminology

In documentation, we refer to graph nucleotides, columns and components. Components contain columns and columns contain nucleotides.

In the code, variable names and comments use slightly different notation. Columns in the documentation are called bins in the code and comments, whereas graph nucleotides in the documentation are called columns in the code and comments. This is for legacy reasons: originally there were no nucleotide numbers (columns) in the visualised graph structure, and components were split into bins (literally, equal-sized bins). This is no longer true, but the old terminology remains.

Ideally, all variable names and comments should be changed in line with the documentation notation, but I have no idea when this can happen.

For various operational or legacy reasons, some of the data structures (usually lists/arrays) use 0-based indexing, whereas others (usually dicts) can be 0-based or 1-based. Here are the main structures with numerical indexing and their index bases:

components: keys: 0-based, values: occupants: 0-based, binNumbers: 0-based
componentToNode: keys: 0-based, values: 1-based
nodeToComponent: keys: 0-based, values: 1-based
newToOldInd and oldToNewInd: both index and values are 0-based numbers of components in previous and current zoomlayer.
fromLinks: top level keys (from nodes): 1-based, bottom level keys (to nodes): 1-based, values (list of participants): 0-based
toLinks: top level keys (to nodes): 1-based, bottom level keys (from nodes): 1-based, values (list of participants): 0-based
fromComponentLinks: top level keys (from components): 1-based, bottom level keys (to components): 1-based, values (set of participants): 0-based
toComponentLinks: top level keys (to components): 1-based, bottom level keys (from components): 1-based, values (set of participants): 0-based
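
The mixed index bases listed above are easy to get wrong, so here is a minimal sketch of converting between them. It uses toy data only, and it assumes that the values of componentToNode and nodeToComponent are lists of ids; that shape and the helper name are assumptions for illustration, not taken from the module.

```python
# Toy data only: two components covering three nodes.
# componentToNode: 0-based component index -> list of 1-based node ids (assumed shape).
componentToNode = {0: [1, 2], 1: [3]}


def invert_component_to_node(componentToNode):
    """Hypothetical helper: build a nodeToComponent-like dict.

    Keys are 0-based node indices, values are lists of 1-based component ids.
    """
    nodeToComponent = {}
    for comp0, nodes1 in componentToNode.items():
        for node1 in nodes1:
            # node1 is 1-based, so subtract 1 for the key;
            # comp0 is 0-based, so add 1 for the value.
            nodeToComponent.setdefault(node1 - 1, []).append(comp0 + 1)
    return nodeToComponent


nodeToComponent = invert_component_to_node(componentToNode)
print(nodeToComponent)
```

Keeping every conversion in one explicit place like this makes off-by-one errors easier to spot.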

Generating base layer

This set of functions generates the data structures for the initial, lowest-level zoom (nucleotide or minimum unit resolution). The main orchestration function is baseLayerZoom.

Functions

outLeftRight

 outLeftRight (nodeInversionInPath, leftFarLink, rightFarLink, reason,
               debug=False, inversionThreshold=0.5)

recordLinks

 recordLinks (nodeIdx, nextNode, pathID, step, nodeInversionInPath,
              nonLinearCond, pathNodeArray, fromLinks, toLinks,
              debug=False, inversionThreshold=0.5)

checkForBreak

 checkForBreak (nodeIdx, nodeLen, nodePathsIdx, nodeSeqInPath,
                uniqueNodePathsIDs, pathNodeCount, pathLengths,
                pathNodeArray, pathDirArray, occupancy, inversion,
                fromLinks, toLinks, nBins, maxLengthComponent, blockEdges,
                inversionThreshold=0.5, debug=False)

Function to check whether the component should be broken before (left) and/or after (right) it.

	Type	Default
nodeIdx
nodeLen
nodePathsIdx
nodeSeqInPath
uniqueNodePathsIDs
pathNodeCount
pathLengths
pathNodeArray
pathDirArray
occupancy
inversion
fromLinks
toLinks
nBins
maxLengthComponent
blockEdges
inversionThreshold	float	0.5
debug	bool	False
Returns	`leftFarLink`: bool. Shows whether there is a far link on the left that will require component break.

nodeStat

 nodeStat (nodeIdx, pathNodeArray, nodeLengths)

Calculates information about a node as part of the overall graph.

finaliseComponentBase

 finaliseComponentBase (component, components, componentNucleotides,
                        matrix, occupants, nBins, componentLengths,
                        nucleotides, zoomLevel, accessions,
                        inversionThreshold=0.5)

processAnnotationInterval

 processAnnotationInterval (posStart, posEnd, annotation, res)

combineAnnotation

 combineAnnotation (combAnnotation)

updateEdges

 updateEdges (accEdge, edgeAccessions, compNum)

Fills in either accStarts or accEnds (the component on which each accession starts or ends). compNum is assumed to be 1-based.

Wrapper

The ‘positions’ key in metadata now contains either one position (chr:posStart..posEnd) or two comma-separated positions, where one is the genomic position and the other is the pangenomic position.
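
A minimal sketch of reading such a value, assuming only the chr:posStart..posEnd format described above. The function name is hypothetical, and the text does not say which of two positions comes first (genomic or pangenomic), so the sketch returns them in their original order.

```python
def parse_positions(value):
    """Hypothetical helper: parse 'chr:start..end' or two comma-separated positions.

    Returns a list of (name, start, end) tuples in the order they appear.
    """
    intervals = []
    for item in value.split(","):
        # Split on the last colon so that names containing ':' still work.
        name, span = item.strip().rsplit(":", 1)
        start, end = span.split("..")
        intervals.append((name, int(start), int(end)))
    return intervals


single = parse_positions("Chr1:100..200")
double = parse_positions("Chr1:100..200,Chr1:1500..1650")
print(single, double)
```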

baseLayerZoom

 baseLayerZoom (graph, outputPath, outputName, pathNodeArray,
                pathDirArray, pathLengths, nodeLengths,
                pathNodeLengthsCum, maxLengthComponent, blockEdges,
                CPUS=32, inversionThreshold=0.5, isSeq=True, debug=False,
                debugTime=False)

Transfer from nodes to components (links and other structures)

This is one of the first processes that happens when exporting a graph. While the graph operates with nodes (which can be linearly connected with each other in all paths), exporting works with components. In almost all cases, components have at least some non-linear links with other components on both sides. The only exception is when a component is too large and is split into several. In this case, the two components will be connected by 100% linear links. Also, the graph operates with paths along with nodes, whereas exporting works with components and accession-specific links between them.

These functions (with nodeToComponentLinks as the main orchestrating one) convert nodes and paths to components and links.

splitforwardInversedNodeComp

 splitforwardInversedNodeComp (pathList, component, isInverse)

fillLinksBase

 fillLinksBase (nodeInComp, nodeToComponent, fromLinks, toLinks,
                fromComponentLinks, toComponentLinks, compNum, components,
                doLeft=True, doRight=True)

convertLink

 convertLink (linkFrom, linkTo, translateDict, forwardLinks, isZoom)

recordUpdatedPairedLink

 recordUpdatedPairedLink (firstLinkSet, secondLinkSet, firstLink,
                          secondLink, substituteLink, pairedLinksConv)

convertRemovableComponents

 convertRemovableComponents (translateDict, linkLengths, pairedLinks,
                             interconnectedLinks, blockEdges,
                             forwardLinks, isZoom=True)

translateDict should be a dict in the format {<old node/component id 0-based>:<new component id 1-based>}. pathNodeInv should be a dict of dicts with the following structure: {:{<nodeId 1-based>:}}

This is done through fromLinks and toLinks and through the associated directions of available accessions. For this, we need to loop through strands and do it separately for each strand.

For paired links, it is possible that a single node link gives several component links. In this case, the cross product of all first and second links is added to the converted paired links.
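
The cross product mentioned here can be sketched with itertools.product. Representing links as (from, to) tuples and the helper name are assumptions for illustration; the real structures in the module may differ.

```python
from itertools import product


def cross_paired_links(firstLinks, secondLinks):
    """Hypothetical helper: pair every converted first link with every converted second link."""
    return list(product(firstLinks, secondLinks))


# One node link converted into two component links, paired with a single second link.
firstLinks = [(1, 2), (1, 3)]
secondLinks = [(5, 6)]
pairs = cross_paired_links(firstLinks, secondLinks)
print(pairs)
```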

❗The substitute links should be added only to the paths that contained both first and second links in the first place. This should be controlled in link removal routine.

nodeToComponentLinks

 nodeToComponentLinks (components, componentToNode, nodeToComponent,
                       fromLinks, toLinks, graph, fromComponentLinks,
                       toComponentLinks, linkLengths=None,
                       pairedLinks=None, interconnectedLinks=None,
                       blockEdges=None, debug=False)

Identifying collapsible links and rearrangement blocks (works incorrectly, left now for compatibility).

In order to generate multiple zoom levels of the graph view, non-linear links describing small rearrangements (too small to show at the given zoom level) should disappear, whereas links describing larger blocks should persist. This makes larger rearrangements clearly visible at higher zoom levels.

To do this, each link should be associated with some size (or rearrangement), so that when each zoom level is generated, the link can be removed if the rearrangement cannot be shown at that zoom level.

Some links are also associated with each other, and when they are removed, new links (usually linear ones) should be reinstated to make larger rearrangements clearer.

At the moment, the process of identifying these sizes does not work well: it leaves too many non-linear links up to the very top level, where suddenly all non-linear links disappear and the whole graph jumps from over-complicated to pretty much trivial, without any rearrangements. To use a digital map analogy, most country roads persist while you zoom out on the map until almost the whole Earth is in view, and then at some point the view becomes just a blue/green ball with very rough boundaries of continents and oceans.

At the moment, all associated links go into a pool of so-called interconnected links. If one link gets associated with a specific size, then all links in the pool get the same association, and then the maximum size is selected. But that means that if one link describes one small rearrangement and is also on the edge of a large one, and another link is only associated with the large rearrangement, then the latter link will also be associated with the size of the large rearrangement and will stay until the zoom level where the large rearrangement is too small to show. That is incorrect.

I think each link should get its own associations with sizes (taking the maximum), and clearing of each link should happen individually. Yet, if one link with a smaller size and one with a larger size are paired, the reinstated link should appear after the smaller link is removed.

Another alternative is just to get contiguous blocks in each path and treat each link pair (describing the start and end of each block) as a pair of links that needs to be cleared in association with the size of this block. This needs control of repeats in these blocks. If that happens, then a single link can describe a whole rearrangement. In addition, an extra control for inversion is also needed. In particular, if outside the block the numbers do not create a range to fit the inverted node (e.g. 1+,4-,3+ or 3+,2-,5+), then it should be ignored for this step. It means there is a smaller rearrangement within a larger one.

Another alternative (described in TODO) is to convert paths of nodes to paths of edges and operate with them. I guess, it is not far away from the previous paragraph.

Identifying path breaks

findBreaksInPath

 findBreaksInPath (combinedArray, nextNodeDict)

identifyPathBreaks

 identifyPathBreaks (combinedNodeDirArray, pathLengths, pathNextNode)

Block processing

interweaveArrays

 interweaveArrays (a, b)

extractGapsBlocks

 extractGapsBlocks (block, path, nodeLengths, getComplex=False)

By default, this function splits the block by gaps (e.g. block [1,2,4,5,6,8] will yield [1,2],[4,5,6],[8]).

If getComplex is set to True, gaps are first filtered to remove nodes that are not passed by the path. After that, edges are identified, and nodes not passed by the path are filtered out of them as well. Then the longest edge is found, combined with all gaps, and the shortest of these is selected. That shortest one is returned.

E.g. block [1,2,4,5,8] will give edges [1,2],[4,5],[8] and gaps [3],[6,7].

If path does not contain 6, then edges will be the same, but gaps will be [3],[7]

If path does not contain 3, then edges will be [1,2,4,5],[8] and gaps [6,7]

The exact block which will be returned depends on sizes of each node.
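
The basic splitting into edges and gaps can be sketched as follows. This is a simplified illustration with hypothetical helper names: it ignores the path filtering and the node-length comparison that getComplex adds, and it assumes the block is a sorted list of integer node ids.

```python
def split_by_gaps(block):
    """Split a sorted list of node ids into runs of consecutive ids (the 'edges')."""
    runs = []
    for node in block:
        if runs and node == runs[-1][-1] + 1:
            runs[-1].append(node)  # continues the current run
        else:
            runs.append([node])  # a gap was found, start a new run
    return runs


def gaps_between(runs):
    """List the node ids missing between each pair of neighbouring runs."""
    return [list(range(left[-1] + 1, right[0])) for left, right in zip(runs, runs[1:])]


simple = split_by_gaps([1, 2, 4, 5, 6, 8])
edges = split_by_gaps([1, 2, 4, 5, 8])
gaps = gaps_between(edges)
print(simple, edges, gaps)
```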

checkSplitBlock

 checkSplitBlock (block, gapList=None)

Not used at the moment

Function checks if the block has any gaps and splits it into a list of blocks between gaps (alternatively, it could fill gaps or leave things as they are). At the moment, a gapped block is converted to a list of blocks between gaps.

blockListToLengths

 blockListToLengths (blockList, nodeLengths)

convertBlocksToLengths

 convertBlocksToLengths (linksBlocks, nodeLengths)

Converts blocks associated with each link to lengths and then selects the longest one (?)
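
A rough sketch of that idea, under the assumption that a block is a list of node ids and that nodeLengths can be looked up by node id (a dict here; the real structure and its index base may differ). The helper names are hypothetical.

```python
def block_list_to_lengths_sketch(blockList, nodeLengths):
    """Sum node lengths within each block (hypothetical simplified version)."""
    return [sum(nodeLengths[node] for node in block) for block in blockList]


def longest_block_length(blockList, nodeLengths):
    """Return the largest block length, or 0 if there are no blocks."""
    return max(block_list_to_lengths_sketch(blockList, nodeLengths), default=0)


# Toy node lengths keyed by node id (assumed layout).
nodeLengths = {1: 10, 2: 5, 4: 7, 5: 1, 6: 2, 8: 30}
blocks = [[1, 2], [4, 5, 6], [8]]
lengths = block_list_to_lengths_sketch(blocks, nodeLengths)
longest = longest_block_length(blocks, nodeLengths)
print(lengths, longest)
```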

Link processing

addToLinkPool

 addToLinkPool (link1, link2, interconnectedLinks)

blockFromSingleLink

 blockFromSingleLink (pathID, link, pathNodeInversionRate, pathNextNode)

Identify the block from a single link. It is the block that the link bounds, i.e.: if the link is forward, then it is the inside of the link plus any side that is inverted; if the link is backward, then it is the inside plus any side that is in normal direction.

checkIndividualLink

 checkIndividualLink (link, pathID, usedSecondInPairLink)

Function checks if this link is already the second in a pair. If it is, then it is not considered separately (return True?). Otherwise, it should be considered, and a block is generated (using blockFromSingleLink) and associated with this link.

processDoublePairedLinks

 processDoublePairedLinks (leftLink, rightLink, pathID, doublePairedLinks,
                           pairedLinks, interconnectedLinks, linksBlocks,
                           pathNextNode)

processIndividualLink

 processIndividualLink (link, pathID, pathNodeInversionRate, pathNextNode,
                        usedSecondInPairLink)

recordLinkBlockAssociation

 recordLinkBlockAssociation (link, blockList, linksBlocks)

findNextNode

 findNextNode (node, combinedArray)

processPseudoPair

 processPseudoPair (breakPos, returnPos, pathID, pathNodeArray,
                    combinedNodeDirArray, pathNextNode, nodeLengths,
                    usedSecondInPairPath, pairedLinks, linksBlocks)

processStartsEnds

 processStartsEnds (mainLink, linkStarts, linkEnds, interconnectedLinks,
                    forwardLinks)

Currently not in use.

TODO!!! Need to add checks for whether one link is intersecting the other or one is fully inside.

postprocessLinksBlocks

 postprocessLinksBlocks (linksBlocks, interconnectedLinks)

processPathBreaks

 processPathBreaks (pathBreakCoordPairs, pathNodeArray, pathNextNode,
                    combinedNodeDirArray, pathNodeInversionRate,
                    pathLengths, nodeLengths, forwardLinks)

Rearrangement blocks

addBlockEdge

 addBlockEdge (edge, size, blockEdges)

identifyRearrangementBlocks

 identifyRearrangementBlocks (nodesStructure, nodeLengths)

blockEdges is a dict with a structure: : pointing to the node before (!) the break. In other words, if it is the start of the block, it will point to the node just before the block, and if it is the end of the block, it will point to the last node of the block.

Wrapper

getRemovableStructures

 getRemovableStructures (graph=None, nodeLengths=None, pathLengths=None,
                         pathNodeArray=None, pathDirArray=None,
                         pathNextNode=None, forwardLinks=None,
                         inversionThreshold=0.5)

getBlockEdges

 getBlockEdges (graph=None, nodeLengths=None, pathLengths=None,
                pathNodeArray=None, pathDirArray=None, pathNextNode=None,
                forwardLinks=None, inversionThreshold=0.5)

Generating zoom layer

This set of functions (with nextLayerZoom as the main orchestration function) generates the next zoom level by collapsing columns and then components together after smaller non-linear links are removed (by a different set of functions).

Finalising bin and component

addLink

 addLink (fromComp, fromStrand, toComp, toStrand, pathList,
          fromComponentLinks, toComponentLinks)

import numpy as np

def getOccInvChange(binColLengths,binBlockLength,binOcc,binInv,prevOcc,prevInv,inversionThreshold=0.5):
    occChanged = False
    invChanged = False
    occ = {}
    inv = {}
    
    for pathID in binOcc:
        
        # Length-weighted average occupancy of the path across the columns of the bin
        occ[pathID] = sum(bl*bo for bl,bo in zip(binColLengths,binOcc[pathID]))/binBlockLength
        # Compare rounded values (floor(x+0.5)); a change counts only if both values exceed 0.5
        prevOccPath = prevOcc.get(pathID,occ[pathID])
        if np.abs(np.floor(occ[pathID]+0.5)-np.floor(prevOccPath+0.5))>0 \
            and occ[pathID]>0.5 and prevOccPath>0.5:
            occChanged = True
        prevOcc[pathID] = occ[pathID]
        
        # Length- and occupancy-weighted average inversion.
        # Note: this divides by occ[pathID], so it assumes non-zero occupancy.
        inv[pathID] = sum(bl*bo*bi for bl,bo,bi in zip(binColLengths,binOcc[pathID],binInv[pathID]))/(binBlockLength*occ[pathID])
        prevInvPath = prevInv.get(pathID,inv[pathID])
        invProduct = (inv[pathID]-inversionThreshold)*(prevInvPath-inversionThreshold)
        if invProduct<0 or \
            (invProduct==0 and inv[pathID]*prevInvPath>inversionThreshold*inversionThreshold):
            # The second condition after `or` covers the case where one value is equal to
            # inversionThreshold and the other is more than inversionThreshold.
            invChanged = True
        prevInv[pathID] = inv[pathID]
        
    return occChanged,invChanged,occ,inv,prevOcc,prevInv

getOccInv

 getOccInv (binColLengths, binBlockLength, binOcc, binInv,
            inversionThreshold=0.5)

combineIntervals

 combineIntervals (posPath)

recordBinZoom

 recordBinZoom (occ, inv, binPosArray, nBins, nCols, binBlockLength,
                binBlockLengths, binColLengths, binColStart, binColStarts,
                binColEnd, binColEnds, matrix, inversionThreshold=0.5)

getAverageInv

 getAverageInv (binBlockLengths, matrixPathArray)

finaliseComponentZoom

 finaliseComponentZoom (component, components, componentLengths, nBins,
                        nCols, occupants, binBlockLengths, binColStarts,
                        binColEnds, matrix, starts, ends, forwardPaths,
                        invertedPaths, compInvNum, compInvDen,
                        inversionThreshold=0.5)

	Type	Default	Details
component
components
componentLengths			componentNucleotides,
nBins
nCols
occupants
binBlockLengths
binColStarts
binColEnds
matrix
starts
ends
forwardPaths
invertedPaths
compInvNum
compInvDen
inversionThreshold	float	0.5

finaliseBinZoom

 finaliseBinZoom (compNum, binOcc, binInv, binPosArray, nBins, nCols,
                  binBlockLength, binBlockLengths, binColLengths,
                  binColStart, binColStarts, binColEnd, binColEnds,
                  matrix, newComponent, newComponents,
                  newComponentLengths, newFromComponentLinks,
                  newToComponentLinks, occupants, linkLengths, starts,
                  ends, forwardPaths, invertedPaths, pathsToInversion,
                  newToOldInd, oldToNewInd, inversionThreshold=0.5)

	Type	Default	Details
compNum
binOcc
binInv
binPosArray
nBins
nCols
binBlockLength
binBlockLengths
binColLengths
binColStart
binColStarts
binColEnd
binColEnds
matrix
newComponent
newComponents
newComponentLengths			compAccDir,#newComponentNucleotides,
newFromComponentLinks
newToComponentLinks
occupants
linkLengths
starts
ends
forwardPaths
invertedPaths
pathsToInversion
newToOldInd
oldToNewInd
inversionThreshold	float	0.5

Break component?

getMatrixPathElement

 getMatrixPathElement (matrix, pathID)

checkChange

 checkChange (compNum, components, zoomLevel, blockEdges)

joinComponents

 joinComponents (leftComp, rightComp, maxLengthComponent,
                 inversionThreshold=0.5)

!!! ⚠️ Currently not used

If the joining was successful, the function will return a joined component.

If the joining was not successful and was aborted, it will return a list of the original components. The joining can be aborted for the following reasons:
- In one of the paths, the inversion is lower than the threshold in one component and higher in the other.
- The left component contains at least one end.
- The right component contains at least one start.

The function does not check for links coming to or going from the right side of the left component or the left side of the right component. It simply takes the left links from the left component and the right links from the right component and assigns them to the new component.

checkLinksZoom

 checkLinksZoom (compNum, fromComponentLinks, toComponentLinks)

checkForBreaksZoom

 checkForBreaksZoom (zoomLevel, compNum, components, fromComponentLinks,
                     toComponentLinks, blockEdges)

Update links

splitPositiveNegative

 splitPositiveNegative (compID, accs, components)

This function simply pulls all accessions present in the component and splits them into forward and inverted.

	Type	Details
compID
accs
components
Returns	`posAcc`: list[int]. IDs of accessions which have forward direction in the given component.

intersectAccLists

 intersectAccLists (accList, dirDict)

updateLinks

 updateLinks (newToOldInd, oldToNewInd, fromComponentLinks,
              toComponentLinks, linkLengths, pairedLinks,
              interconnectedLinks, blockEdges, accStarts, accEnds,
              components, compAccDir, newFromComponentLinks={},
              newToComponentLinks={})

newToOldInd and oldToNewInd: both index and values are 0-based numbers of components in previous and current zoomlayer.

Main layer generation function + assistant function

isStartEnd

 isStartEnd (compNum, components)

nextLayerZoom

 nextLayerZoom (zoomLevel, components, componentLengths,
                fromComponentLinks, toComponentLinks, graph, accStarts,
                accEnds, maxLengthComponent, linkLengths, pairedLinks,
                interconnectedLinks, blockEdges, inversionThreshold=0.5,
                debug=False, debugTime=False)

	Type	Default	Details
zoomLevel
components
componentLengths			componentNucleotides,
fromComponentLinks
toComponentLinks
graph
accStarts
accEnds
maxLengthComponent
linkLengths
pairedLinks
interconnectedLinks
blockEdges
inversionThreshold	float	0.5
debug	bool	False
debugTime	bool	False

Clear elements too small to show

This set of functions (with clearInvisible as the orchestrating function) looks at the previously identified associations between non-linear links and sizes (numbers of nucleotides). If the next zoom level is larger than some sizes, the corresponding links are removed (and some linear links are reinstated instead).

After that, isolated blocks are identified and removed. An isolated block is a contiguous block of components (columns) that are connected only to each other and not to any components outside the block.
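
The definition of an isolated block can be illustrated with a small check. This is only a sketch: it assumes fromComponentLinks is a nested dict keyed by 1-based component numbers ({from: {to: participants}}), and it ignores strands, accession starts/ends and every other detail the real functions handle. The function name is hypothetical.

```python
def is_isolated_block_sketch(blockStart, blockEnd, fromComponentLinks):
    """Return True if no link crosses the boundary of components blockStart..blockEnd."""
    inside = set(range(blockStart, blockEnd + 1))
    for fromComp, targets in fromComponentLinks.items():
        for toComp in targets:
            # A link crosses the boundary if exactly one of its ends is inside the block.
            if (fromComp in inside) != (toComp in inside):
                return False
    return True


# Toy links: 1 -> 2 -> 3 form a chain; 4 and 5 are linked only to each other.
toyLinks = {1: {2: {0}}, 2: {3: {0}}, 4: {5: {0}}, 5: {4: {0}}}
isolated_45 = is_isolated_block_sketch(4, 5, toyLinks)
isolated_23 = is_isolated_block_sketch(2, 3, toyLinks)
print(isolated_45, isolated_23)
```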

Removing links and rearrangement blocks associated with blocks that are too small

removeLink

 removeLink (fromComponentLinks, toComponentLinks, linkList, remLinks,
             link, pairedLink=None, subLink=None, subLinks=None,
             remLinkAccessions=None)

This function removes the main link.

If paired and substitute links are provided, the paired link is checked, and if it has not been removed and is not already in the queue to be removed, it is added to the queue.

After that, the common accessions on the same strand (for each strand separately) for the start of the main link and the end of the paired link are found, and the substitute link is established for all such accessions.

If the substitute link is not (k,k+1), but (k,k+p), then in componentLinks all links (k,k+1),(k+1,k+2),…,(k+p-1,k+p) are established.
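
The chain of consecutive links that stands in for a substitute link (k,k+p) can be sketched like this. The helper name is hypothetical, and links are shown as plain (from, to) tuples rather than the module's nested dicts.

```python
def expand_substitute_link(k, p):
    """Return the consecutive links (k,k+1), ..., (k+p-1,k+p) covering the span (k,k+p)."""
    return [(i, i + 1) for i in range(k, k + p)]


chain = expand_substitute_link(3, 3)
single = expand_substitute_link(7, 1)
print(chain, single)
```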

processCollapsibleBlocks

 processCollapsibleBlocks (zoomLevel, linkLengths, pairedLinks,
                           interconnectedLinks, fromComponentLinks,
                           toComponentLinks)

clearRearrangementBlocks

 clearRearrangementBlocks (zoomLevel, blockEdges)

Find isolated blocks

Identify empty edges

testStartEnd

 testStartEnd (compNum, isLeft, components, accStarts, accEnds)

findEmptyEdges

 findEmptyEdges (fromComponentLinks, toComponentLinks, accStarts, accEnds,
                 components)

Identifies all empty edges by simply finding components that do not appear in toComponentLinks (left empty) or in fromComponentLinks (right empty).
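
A simplified sketch of this lookup, assuming the top-level keys of both link dicts are plain 1-based component numbers. The real function also takes accStarts, accEnds and components, which this hypothetical helper ignores.

```python
def find_empty_edges_sketch(nComponents, fromComponentLinks, toComponentLinks):
    """Return (leftEmpty, rightEmpty) lists of 1-based component numbers."""
    allComps = range(1, nComponents + 1)
    # No incoming links: the component never appears as a 'to' component.
    leftEmpty = [c for c in allComps if c not in toComponentLinks]
    # No outgoing links: the component never appears as a 'from' component.
    rightEmpty = [c for c in allComps if c not in fromComponentLinks]
    return leftEmpty, rightEmpty


# Toy chain 1 -> 2 -> 3, with component 4 unconnected.
fromLinksToy = {1: {2: {0}}, 2: {3: {0}}}
toLinksToy = {2: {1: {0}}, 3: {2: {0}}}
leftEmpty, rightEmpty = find_empty_edges_sketch(4, fromLinksToy, toLinksToy)
print(leftEmpty, rightEmpty)
```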

Identify isolated blocks

checkExternalLinks

 checkExternalLinks (blockStart, blockEnd, fromComponentLinks,
                     toComponentLinks, components)

createNewBoundaries

 createNewBoundaries (blockStart, blockEnd, externalLinksComps,
                      leftEmptyList, rightEmptyList)

# Test for `createNewBoundaries`
import numpy as np

st = [2,5,6,8]
end = [2,3,4,6,8,9,10,11]

blocks = [[2,11],[2,3],[5,11],[8,11],[8,9],[8,11]]
blockSplits = [[[2,3],[5,11]],[[2,2]],[[6,6],[8,11]],[[8,9]],[[8,8]],[]]
externals = [[4],[3],[5,7],[10],[9],[8,9,10,11]]

for bl,blSpl,ext in zip(blocks,blockSplits,externals):
    blSplTT = createNewBoundaries(*bl,ext,st,end)
    assert blSpl == blSplTT,f'Expected {blSpl}, but got {blSplTT}'

# Another test for `createNewBoundaries`
leftEmptyList = [2056, 3080, 3081, 2092, 2099, 1593, 3643, 2627, 1116, 2653, 2655, 3168, 2658, 613, 1637, 1638, 106, 1654, 2695, 2192, 1169, 1686, 2714, 3757, 2233, 3781, 723, 1240, 224, 1761, 1762, 1766, 3323, 1804, 786, 2331, 802, 2850, 807, 811, 1839, 1841, 3396, 3397, 1863, 3400, 843, 3423, 1898, 1899, 882, 884, 3463, 402, 2451, 3478, 408, 3482, 934, 426, 1962, 3504, 3516, 3519, 3520, 451, 1994, 1995, 972, 2506, 463, 3024, 1493, 1494, 3542, 1525]
rightEmptyList = [402, 2451, 3478, 407, 280, 3482, 2848, 802, 934, 807, 426, 811, 2091, 2092, 3757, 1839, 3504, 1841, 3516, 3519, 3405, 463, 722, 1240, 1761, 1762, 1766, 1899]
blockStart = 3396
blockEnd = 3405
externalLinksComps = [3396, 3397, 3398, 3399, 3400, 3401, 3402, 3403, 3404, 3405]

createNewBoundaries(blockStart,blockEnd,externalLinksComps,leftEmptyList,rightEmptyList)

[]

identifyIsolatedBlocks

 identifyIsolatedBlocks (leftEmptyList, rightEmptyList,
                         fromComponentLinks, toComponentLinks, components)

Removing Isolated Blocks

updateLinksRemoveComp

 updateLinksRemoveComp (oldToNewInd, fromComponentLinks, toComponentLinks,
                        linkLengths, pairedLinks, interconnectedLinks,
                        blockEdges, accStarts, accEnds)

removeIsolatedBlocks

 removeIsolatedBlocks (isolatedBlockList, components, componentLengths,
                       fromComponentLinks, toComponentLinks, accStarts,
                       accEnds, linkLengths, pairedLinks,
                       interconnectedLinks, blockEdges)

Clearing small element wrapping function

clearInvisible

 clearInvisible (zoomLevel, linkLengths, pairedLinks, interconnectedLinks,
                 blockEdges, fromComponentLinks, toComponentLinks,
                 accStarts, accEnds, components, componentLengths)

Exporting layer

These functions, with exportLayer as the main one, export a prepared zoom level (cleaned and collapsed by other functions) into Pantograph Visualisation tool data structures (JSON chunk files).

createZoomLevelDir

 createZoomLevelDir (outputPath, outputName, zoomLevel)

Creates a directory for zoom level chunks. The function takes care of the correct directory separator.
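
One portable way to do this is os.path.join together with os.makedirs. The sketch below assumes the layout <outputPath>/<outputName>/<zoomLevel>; that layout and the function name are assumptions for illustration, and the real function may build the path differently.

```python
import os
import tempfile


def create_zoom_level_dir_sketch(outputPath, outputName, zoomLevel):
    """Create the (assumed) zoom level directory and return its path."""
    # os.path.join inserts the separator that is correct for the current OS.
    zoomDir = os.path.join(outputPath, outputName, str(zoomLevel))
    os.makedirs(zoomDir, exist_ok=True)  # no error if the directory already exists
    return zoomDir


# Demonstration inside a temporary directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_zoom_level_dir_sketch(tmp, "case1", 4)
    exists = os.path.isdir(created)
    relative = os.path.relpath(created, tmp)
print(relative)
```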

finaliseChunk

 finaliseChunk (rootStruct, zoomLevel, chunk, nucleotides, nBins,
                chunkNum, curCompCols, prevTotalCols, outputPath,
                outputName)

addLinksToComp

 addLinksToComp (compNum, components, fromComponentLinks,
                 toComponentLinks)

checkLinks

 checkLinks (leftComp, rightComp)

searchIndicesPosRecord

 searchIndicesPosRecord (redisConn, redisCaseID, zoomLevel, accessions,
                         posMapping)

exportLayer

 exportLayer (zoomLevel, components, componentNucleotides,
              fromComponentLinks, toComponentLinks, rootStruct,
              outputPath, outputName, maxLengthComponent, maxLengthChunk,
              inversionThreshold=0.5, redisConn=None, redisCaseID=None,
              accessions=None, debug=False)

Main exporter wrapper with its helper functions

This is the main orchestrating function that exports a single graph to the Pantograph Visualisation tool, together with a couple of auxiliary functions.

compLinksToAccCompLinks

 compLinksToAccCompLinks (compLinks, doCompDir=False)

recordZoomLevelForDebug

 recordZoomLevelForDebug (zoomNodeToComponent, zoomComponentToNodes,
                          zoomComponents, nodeToComponent,
                          componentToNodes, components, zoomLevel)

A function which records the result of segmentation to dictionaries that hold results for all zoom levels. It is currently used only for debugging purposes; in normal operation, the all-zoom-level dictionaries are not created or used.

	Type	Details
zoomNodeToComponent
zoomComponentToNodes
zoomComponents
nodeToComponent
componentToNodes
components
zoomLevel
Returns	Returns modified dictionaries with `zoom` in the beginning of the names. Theoretically,

searchIndicesGeneRecord

 searchIndicesGeneRecord (redisConn, redisCaseID, geneMapping,
                          genPosMapping, altChrGenPosMapping,
                          genPosSearchMapping, pangenPosSearchMapping)

Recording prepared metadata structures into Redis DB

exportToPantograph

 exportToPantograph (graph=None, inputPath=None, GenomeGraphParams={},
                     outputPath=None, outputName=None, outputSuffix=None,
                     isSeq=True, nodeLengths=None, redisConn=None,
                     zoomLevels=[1], fillZoomLevels=True,
                     maxLengthComponent=100, maxLengthChunk=20,
                     inversionThreshold=0.5, debug=False,
                     returnDebugData=False)

This function is used by exportProject function and should not normally be used independently now.

Project generation

exportProject

 exportProject (projectID, projectName, caseDict, pathToIndex,
                pathToGraphs, redisHost=None, redisPort=6379, redisDB=0,
                suffix='', maxLengthComponent=100, maxLengthChunk=6,
                inversionThreshold=0.5, isSeq=True, zoomLevels=[1],
                fillZoomLevel=True)

This is the only function that should normally be used to export a set of graphs (e.g. a graph per chromosome) to Pantograph Visualisation tool as a project (or interconnected structure).

Exporting of each graph creates a case directory _

with a bin2file.json file, which describes the case overall and each zoom level. Each zoom level is contained in multiple chunk JSON files; zoom level n is in the directory n inside the case directory. Each JSON chunk file contains all the information required to visualise up to maxLengthChunk components at a given zoom level.

All case directories are in the project directory together with <projectID>_project.json, which simply provides the association between case names and the corresponding directory names.
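
The layout described above can be sketched with pathlib. The case directory name is passed in as a hypothetical parameter because its naming scheme is not spelled out here, and chunk file names are left out for the same reason; only bin2file.json, the per-zoom-level directory and <projectID>_project.json come from the description.

```python
from pathlib import Path


def project_layout_sketch(projectDir, projectID, caseDirName, zoomLevel):
    """Return the main paths of the described layout (caseDirName is a hypothetical input)."""
    projectDir = Path(projectDir)
    caseDir = projectDir / caseDirName
    return {
        "projectJson": projectDir / f"{projectID}_project.json",
        "caseIndex": caseDir / "bin2file.json",
        "zoomDir": caseDir / str(zoomLevel),  # chunk JSON files live in here
    }


layout = project_layout_sketch("data", "demo", "chr1_case", 2)
for name, path in layout.items():
    print(name, path)
```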

Finally, information about the project will be recorded to Pantograph Visualisation tool data index to make it discoverable by the tool.

In addition, no metadata is recorded into these files, as it inflates them very quickly. Instead, a very simple (optional) API works alongside the main Pantograph Visualisation tool and provides various metadata on request if the API is available, or does nothing if it is not. This API uses Redis DB with a special DB schema.

When graphs are exported, some metadata (annotations, genome and pangenome positions) can be recorded to Redis DB. If Redis DB is not available or recording of metadata is not needed, the redisHost parameter should be omitted. Otherwise, if Redis DB is available and metadata should be recorded, redisHost should be set to the hostname (or IP address) of the Redis DB server.