
Data Formats

Hoodini-Viz supports two file formats:

  • Parquet (recommended) — Binary columnar format, 3-10x faster loading
  • TSV — Plain text, human-readable

If you’re using Hoodini, all data files are generated automatically in the correct format!


Tree (Newick)

File: tree.nwk

A phylogenetic tree in standard Newick format. Leaf names must match the seqid values in the genes and hoods files.

((genome1:0.1,genome2:0.2):0.05,(genome3:0.15,genome4:0.12):0.08);
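Mismatched leaf names can be caught before loading by comparing them against the seqid values. The following is a minimal sketch, not part of Hoodini-Viz: leaf_names and missing_leaves are hypothetical helpers, and the regular expression assumes unquoted leaf labels like those in the example above.

```python
import re

def leaf_names(newick: str) -> list[str]:
    """Return leaf labels from a Newick string (unquoted labels only)."""
    # A leaf label directly follows "(" or ","; internal node labels follow ")".
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

def missing_leaves(newick: str, seqids: set[str]) -> set[str]:
    """Leaf labels that have no matching seqid in the genes/hoods tables."""
    return set(leaf_names(newick)) - seqids

tree = "((genome1:0.1,genome2:0.2):0.05,(genome3:0.15,genome4:0.12):0.08);"
# Hypothetical set of seqid values collected from genes/hoods
print(missing_leaves(tree, {"genome1", "genome2", "genome3"}))  # → {'genome4'}
```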

Genes (GFF3-style)

Files: genes.parquet or genes.tsv

Column       Type        Required  Description
seqid        string                Genome/contig identifier (matches tree leaf)
start        int                   Start position (1-based)
end          int                   End position
strand       string                + or -
ID           string                Unique gene identifier
Name         string                Display name
product      string                Gene product description
cluster      string/int            Homology cluster ID (for coloring)
locus_tag    string                Locus tag
protein_id   string                Protein accession

Additional columns become available for coloring/filtering.

TSV Example:

seqid	start	end	strand	ID	Name	cluster	product
genome1	1000	1500	+	gene_001	dnaA	1	chromosomal replication initiator
genome1	1600	2400	+	gene_002	dnaN	2	DNA polymerase III subunit beta
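A hand-written genes table can be sanity-checked with the standard library before it is loaded. This is a sketch under stated assumptions: check_genes is a hypothetical helper, and the rule that start must not exceed end is an assumption rather than something this page specifies. The strand and unique-ID checks follow the column descriptions above.

```python
import csv
import io

GENES_TSV = (
    "seqid\tstart\tend\tstrand\tID\tName\tcluster\tproduct\n"
    "genome1\t1000\t1500\t+\tgene_001\tdnaA\t1\tchromosomal replication initiator\n"
    "genome1\t1600\t2400\t+\tgene_002\tdnaN\t2\tDNA polymerase III subunit beta\n"
)

def check_genes(handle) -> list[str]:
    """Return a list of human-readable problems found in a genes TSV."""
    problems, seen_ids = [], set()
    # Data rows start on line 2 because line 1 is the header.
    for n, row in enumerate(csv.DictReader(handle, delimiter="\t"), start=2):
        if int(row["start"]) > int(row["end"]):  # assumed rule, see lead-in
            problems.append(f"line {n}: start > end")
        if row["strand"] not in {"+", "-"}:
            problems.append(f"line {n}: strand must be + or -")
        if row["ID"] in seen_ids:
            problems.append(f"line {n}: duplicate ID {row['ID']}")
        seen_ids.add(row["ID"])
    return problems

print(check_genes(io.StringIO(GENES_TSV)))  # → []
```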

Hoods (Genomic Windows)

Files: hoods.parquet or hoods.tsv

Defines which genomic regions to display for each genome.

Column       Type        Required  Description
hood_id      int                   Unique hood identifier
seqid        string                Genome/contig (matches tree leaf)
start        int                   Window start position
end          int                   Window end position
align_gene   string                Gene ID to use for alignment
label        string                Display label (defaults to seqid)

TSV Example:

hood_id	seqid	start	end	align_gene
1	genome1	0	15000	gene_005
2	genome2	50000	65000	gene_105
3	genome3	120000	135000	gene_205
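To preview which genes a hood window would cover, the window can be intersected with the genes table. The sketch below is illustrative only: Gene, Hood and genes_in_hood are hypothetical names, and treating any overlap with the window as "inside" is an assumption about how the viewer selects genes.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    seqid: str
    start: int
    end: int
    gene_id: str

class Hood(NamedTuple):
    hood_id: int
    seqid: str
    start: int
    end: int

def genes_in_hood(hood: Hood, genes: list[Gene]) -> list[Gene]:
    """Genes on the hood's seqid that overlap its [start, end] window."""
    return [
        g for g in genes
        if g.seqid == hood.seqid and g.start <= hood.end and g.end >= hood.start
    ]

genes = [
    Gene("genome1", 1000, 1500, "gene_001"),
    Gene("genome1", 1600, 2400, "gene_002"),
    Gene("genome1", 20000, 21000, "gene_020"),  # hypothetical gene outside the window
]
hood = Hood(1, "genome1", 0, 15000)
print([g.gene_id for g in genes_in_hood(hood, genes)])  # → ['gene_001', 'gene_002']
```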

Protein Links

Files: links.parquet or links.tsv

Homology relationships between proteins (shown as curved connections).

Column       Type        Required  Description
gene1        string                Source gene ID
gene2        string                Target gene ID
identity     float                 Sequence identity (0-1)
evalue       float                 E-value
bitscore     float                 Bit score

TSV Example:

gene1	gene2	identity	evalue
gene_001	gene_101	0.95	1e-150
gene_002	gene_102	0.87	1e-120
gene_003	gene_203	0.72	1e-80
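Because identity is a fraction between 0 and 1, a common mistake is to supply percentages straight from an aligner's output. The sketch below shows one way to catch that while filtering links. strong_links and its min_identity threshold are hypothetical and not part of Hoodini-Viz.

```python
import csv
import io

LINKS_TSV = (
    "gene1\tgene2\tidentity\tevalue\n"
    "gene_001\tgene_101\t0.95\t1e-150\n"
    "gene_002\tgene_102\t0.87\t1e-120\n"
    "gene_003\tgene_203\t0.72\t1e-80\n"
)

def strong_links(handle, min_identity: float = 0.8) -> list[tuple[str, str]]:
    """Return (gene1, gene2) pairs whose identity is at least min_identity."""
    pairs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        identity = float(row["identity"])
        if not 0.0 <= identity <= 1.0:
            # Values such as 95.0 suggest a percentage was written by mistake.
            raise ValueError(f"identity {identity} is outside 0-1")
        if identity >= min_identity:
            pairs.append((row["gene1"], row["gene2"]))
    return pairs

print(strong_links(io.StringIO(LINKS_TSV)))
# → [('gene_001', 'gene_101'), ('gene_002', 'gene_102')]
```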

Domains (Optional)

Files: domains.parquet or domains.tsv

Protein domain annotations (Pfam, InterPro, etc.).

Column       Type        Required  Description
gene_id      string                Gene ID
domain_name  string                Domain name/accession
start        int                   Domain start (amino acid position)
end          int                   Domain end
source       string                Source database (pfam, interpro, etc.)
evalue       float                 Domain E-value
description  string                Domain description

TSV Example:

gene_id	domain_name	start	end	source	evalue	description
gene_001	PF00001	10	150	pfam	1e-50	7 transmembrane receptor
gene_001	PF00002	200	350	pfam	1e-40	G-protein coupled receptor
gene_002	IPR000001	5	180	interpro	1e-60	Kinase domain
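Domain coordinates are amino acid positions within the protein, not genomic positions. To make the distinction concrete, the sketch below maps a domain onto its gene's nucleotide coordinates. It assumes 1-based inclusive positions and an unspliced coding sequence whose first codon starts at the gene's 5' end. domain_to_genomic is a hypothetical helper, and whether Hoodini-Viz performs this mapping in the same way is not stated here.

```python
def domain_to_genomic(gene_start: int, gene_end: int, strand: str,
                      aa_start: int, aa_end: int) -> tuple[int, int]:
    """Map 1-based, inclusive amino acid positions to genomic (start, end)."""
    if strand == "+":
        # Codon k occupies nucleotides gene_start + 3*(k-1) .. gene_start + 3*k - 1.
        return gene_start + 3 * (aa_start - 1), gene_start + 3 * aa_end - 1
    if strand == "-":
        # On the minus strand codon 1 sits at the gene's highest coordinate.
        return gene_end - 3 * aa_end + 1, gene_end - 3 * (aa_start - 1)
    raise ValueError("strand must be + or -")

print(domain_to_genomic(1000, 1500, "+", 10, 150))  # → (1027, 1449)
```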

Nucleotide Links

Files: nucleotide_links.parquet or nucleotide_links.tsv

Synteny blocks between genomic regions (shown as polygonal overlays).

Column        Type        Required  Description
source_seqid  string                Source genome
source_start  int                   Source start position
source_end    int                   Source end position
target_seqid  string                Target genome
target_start  int                   Target start position
target_end    int                   Target end position
identity      float                 Average identity
strand        string                + (same) or - (inverted)
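The strand column determines whether a synteny block connects the two regions in parallel or crossed over. One way to picture this is to compute the polygon's corner points. The sketch below is purely illustrative: synteny_polygon and the y_src/y_tgt track positions are hypothetical, and the actual drawing code in Hoodini-Viz may differ.

```python
def synteny_polygon(src_start: int, src_end: int, tgt_start: int, tgt_end: int,
                    strand: str, y_src: float = 0.0, y_tgt: float = 1.0):
    """Corner points (x, y) of a synteny block drawn between two tracks."""
    if strand == "-":
        # Inverted block: swap the target edges so the polygon's sides cross.
        tgt_start, tgt_end = tgt_end, tgt_start
    return [(src_start, y_src), (src_end, y_src), (tgt_end, y_tgt), (tgt_start, y_tgt)]

print(synteny_polygon(100, 200, 300, 400, "+"))
# → [(100, 0.0), (200, 0.0), (400, 1.0), (300, 1.0)]
print(synteny_polygon(100, 200, 300, 400, "-"))
# → [(100, 0.0), (200, 0.0), (300, 1.0), (400, 1.0)]
```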

ncRNA Metadata (Optional)

Files: ncrna_metadata.parquet or ncrna_metadata.txt

Non-coding RNA annotations with sequence and secondary structure information.

In the GFF file, ncRNA features must have a type that contains “ncRNA” (e.g., ncRNA, ncRNA_gene). Use the ncrna_type attribute to specify the subtype (tRNA, rRNA, sRNA, tmRNA, etc.).

Column       Type        Required  Description
seqid        string                Genome/contig identifier
start        int                   Start position (1-based)
end          int                   End position
type         string                ncRNA type (tRNA, rRNA, tmRNA, etc.)
sequence     string                Nucleotide sequence
structure    string                Secondary structure in dot-bracket notation
name         string                Display name
product      string                Description

The sequence and structure fields enable the secondary structure viewer in the sidebar panel when clicking on ncRNA features.

GFF Example (type must contain “ncRNA”):

##gff-version 3
genome1	Infernal	ncRNA	5000	5075	.	+	.	ID=tRNA_Ala_1;Name=tRNA-Ala;ncrna_type=tRNA
genome1	Infernal	ncRNA	12000	12120	.	+	.	ID=tmRNA_1;Name=ssrA;ncrna_type=tmRNA
genome1	Infernal	ncRNA	8500	10000	.	+	.	ID=16S_rRNA;Name=16S rRNA;ncrna_type=rRNA
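The "type contains ncRNA" rule can be checked on a GFF file with a few lines of standard-library Python. This is a sketch, not the parser Hoodini-Viz uses: parse_attributes and ncrna_features are hypothetical helpers, the substring match is assumed to be case-sensitive, and the CDS row is made up to show a feature that is skipped.

```python
def parse_attributes(column9: str) -> dict[str, str]:
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    return dict(item.split("=", 1) for item in column9.strip().split(";") if item)

def ncrna_features(gff_text: str) -> list[dict[str, str]]:
    """Return features whose type column contains the substring 'ncRNA'."""
    features = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete nine-column feature line
        if "ncRNA" in cols[2]:  # column 3 is the feature type
            features.append({"seqid": cols[0], "type": cols[2], **parse_attributes(cols[8])})
    return features

GFF = (
    "##gff-version 3\n"
    "genome1\tInfernal\tncRNA\t5000\t5075\t.\t+\t.\tID=tRNA_Ala_1;Name=tRNA-Ala;ncrna_type=tRNA\n"
    "genome1\tProdigal\tCDS\t1000\t1500\t.\t+\t0\tID=gene_001\n"  # hypothetical non-ncRNA row
)
print([f["ncrna_type"] for f in ncrna_features(GFF)])  # → ['tRNA']
```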

TSV Metadata Example:

seqid	start	end	type	sequence	structure	name
genome1	5000	5075	tRNA	GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGAC...	(((((((..((((........)))).(((((.......	tRNA-Ala
genome1	12000	12120	tmRNA	GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGA...	(((((((((....)))....((((......))))...	tmRNA
genome2	8500	10000	rRNA	AUUGAACGCUGGCGGCAGGCCUAACACAUGCA...	...((((((....))))))...((((...))))...	16S rRNA
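A dot-bracket string is only meaningful if it has one character per nucleotide and every opening bracket is closed. The sketch below checks both properties before the metadata file is written. check_structure is a hypothetical helper that assumes plain dot-bracket notation (only ".", "(" and ")"), so pseudoknot brackets such as "[" would be reported as unexpected.

```python
def check_structure(sequence: str, structure: str) -> list[str]:
    """Return problems found in a sequence/dot-bracket pair (empty if none)."""
    problems = []
    if len(sequence) != len(structure):
        problems.append("sequence and structure differ in length")
    depth = 0  # number of currently open base pairs
    for i, char in enumerate(structure, start=1):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at position {i}")
                depth = 0
        elif char != ".":
            problems.append(f"unexpected character {char!r} at position {i}")
    if depth > 0:
        problems.append(f"{depth} unmatched '('")
    return problems

# Made-up 12-nt hairpin used only to exercise the check
print(check_structure("GCGCAAAAGCGC", "((((....))))"))  # → []
```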

Regions (Optional)

Regions are genomic features displayed as rectangular overlays around genes. They are defined in the GFF file as features whose type column is region, and the region_type attribute specifies what kind of region each one represents.

How to Define Regions

All region features must have type set to region in the GFF. The actual region type (CRISPR, prophage, etc.) is specified in the region_type attribute.

region_type Value  Description              Example Source
operon             Operon boundaries        Custom annotation
CRISPR             CRISPR repeat arrays     CCTyper
prophage           Prophage regions         geNomad, PHASTER
genomic_island     Genomic islands          IslandViewer
mobile_element     Mobile genetic elements  geNomad
defense            Defense system regions   PADLOC, DefenseFinder

GFF Region Example

Regions are included directly in the GFF file, with region in the type column (column 3):

genome1	CCTyper	region	15000	16500	.	+	.	ID=crispr_1;Name=CRISPR-I-E;region_type=CRISPR
genome1	geNomad	region	45000	68000	.	+	.	ID=prophage_1;Name=Prophage_region_1;region_type=prophage
genome1	PADLOC	region	22000	25000	.	+	.	ID=defense_1;Name=RM_Type_I;region_type=defense

The GFF type column must be region for Hoodini-Viz to recognize the feature. Use region_type in the attributes column to specify CRISPR, prophage, etc.
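When region calls come from a tool that does not emit GFF, the lines can be generated programmatically. The sketch below formats one region feature in the layout shown above. region_line is a hypothetical helper, and treating the listed region_type values as a closed set is an assumption, since this page does not say whether other values are accepted.

```python
# region_type values listed in the table above (assumed here to be exhaustive)
ALLOWED_REGION_TYPES = {
    "operon", "CRISPR", "prophage", "genomic_island", "mobile_element", "defense",
}

def region_line(seqid: str, source: str, start: int, end: int, strand: str,
                feature_id: str, name: str, region_type: str) -> str:
    """Format one GFF3 line for a region feature (nine tab-separated columns)."""
    if region_type not in ALLOWED_REGION_TYPES:
        raise ValueError(f"unknown region_type: {region_type}")
    attributes = f"ID={feature_id};Name={name};region_type={region_type}"
    # Column 3 (type) is always "region"; score and phase are left as ".".
    return "\t".join(
        [seqid, source, "region", str(start), str(end), ".", strand, ".", attributes]
    )

print(region_line("genome1", "CCTyper", 15000, 16500, "+", "crispr_1", "CRISPR-I-E", "CRISPR"))
```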


Converting to Parquet

Use the provided Python script:

python scripts/convert_to_parquet.py input.tsv output.parquet

Or with pandas:

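import pandas as pd

# Read the tab-separated table, then write it as Parquet
# (to_parquet requires the pyarrow or fastparquet package).
df = pd.read_csv('genes.tsv', sep='\t')
df.to_parquet('genes.parquet', index=False)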