Data Formats
Hoodini-Viz supports two file formats:
- Parquet (recommended) — Binary columnar format, 3-10x faster loading
- TSV — Plain text, human-readable
If you’re using Hoodini, all data files are generated automatically in the correct format!
Tree (Newick)
File: tree.nwk
Standard Newick format phylogenetic tree. Leaf names must match seqid values in genes/hoods.
((genome1:0.1,genome2:0.2):0.05,(genome3:0.15,genome4:0.12):0.08);
Genes (GFF3-style)
Files: genes.parquet or genes.tsv
| Column | Type | Required | Description |
|---|---|---|---|
| seqid | string | ✓ | Genome/contig identifier (matches tree leaf) |
| start | int | ✓ | Start position (1-based) |
| end | int | ✓ | End position |
| strand | string | ✓ | + or - |
| ID | string | ✓ | Unique gene identifier |
| Name | string | | Display name |
| product | string | | Gene product description |
| cluster | string/int | | Homology cluster ID (for coloring) |
| locus_tag | string | | Locus tag |
| protein_id | string | | Protein accession |
Additional columns become available for coloring/filtering.
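The required columns and the rule that seqid values match tree leaf names can be checked before loading. The following is a minimal sketch using only the standard library, not part of Hoodini-Viz: the function names are hypothetical, the leaf extraction is a simplified regex that ignores quoted Newick labels, and the start <= end check is an assumption borrowed from GFF3.

```python
import csv
import io
import re

# Columns marked as required in the genes table.
REQUIRED_GENE_COLUMNS = {"seqid", "start", "end", "strand", "ID"}


def newick_leaf_names(newick):
    """Collect leaf labels: names that directly follow '(' or ','.

    Simplification: quoted labels and bracketed comments are not handled.
    """
    return {name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)}


def check_genes(tsv_text, newick):
    """Return a list of human-readable problems (empty if none were found)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = REQUIRED_GENE_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {sorted(missing)}"]

    leaves = newick_leaf_names(newick)
    seen_ids = set()
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["seqid"] not in leaves:
            problems.append(f"line {line_no}: seqid {row['seqid']!r} is not a tree leaf")
        if row["strand"] not in {"+", "-"}:
            problems.append(f"line {line_no}: strand must be + or -")
        if row["ID"] in seen_ids:
            problems.append(f"line {line_no}: duplicate ID {row['ID']!r}")
        seen_ids.add(row["ID"])
        # Assumption (GFF3 convention): start must not exceed end.
        if int(row["start"]) > int(row["end"]):
            problems.append(f"line {line_no}: start is greater than end")
    return problems
```

Calling check_genes with the contents of genes.tsv and tree.nwk returns an empty list when nothing is flagged. Parquet input would need to be read with pandas first.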
TSV Example:
seqid start end strand ID Name cluster product
genome1 1000 1500 + gene_001 dnaA 1 chromosomal replication initiator
genome1 1600 2400 + gene_002 dnaN 2 DNA polymerase III subunit beta
Hoods (Genomic Windows)
Files: hoods.parquet or hoods.tsv
Defines which genomic regions to display for each genome.
| Column | Type | Required | Description |
|---|---|---|---|
| hood_id | int | ✓ | Unique hood identifier |
| seqid | string | ✓ | Genome/contig (matches tree leaf) |
| start | int | ✓ | Window start position |
| end | int | ✓ | Window end position |
| align_gene | string | | Gene ID to use for alignment |
| label | string | | Display label (defaults to seqid) |
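To see how a hood relates to the genes table, the sketch below selects the genes that fall in a window. It is illustrative only: the Gene and Hood types and the genes_in_hood function are hypothetical, coordinates are treated as inclusive, and a gene counts as inside when it overlaps the window at all. How Hoodini-Viz itself handles genes that cross a window boundary is not stated here.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    seqid: str
    start: int
    end: int
    gene_id: str


class Hood(NamedTuple):
    hood_id: int
    seqid: str
    start: int
    end: int


def genes_in_hood(hood, genes):
    """Return genes on the hood's seqid whose interval overlaps the window.

    Assumption: both intervals are inclusive, and partial overlap counts.
    """
    return [
        g
        for g in genes
        if g.seqid == hood.seqid and g.start <= hood.end and g.end >= hood.start
    ]
```

For a hood covering 0 to 15000 on genome1, a gene at 1000 to 1500 on genome1 is selected, while a gene with the same coordinates on genome2 is not.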
TSV Example:
hood_id seqid start end align_gene
1 genome1 0 15000 gene_005
2 genome2 50000 65000 gene_105
3 genome3 120000 135000 gene_205
Protein Links
Files: links.parquet or links.tsv
Homology relationships between proteins (shown as curved connections).
| Column | Type | Required | Description |
|---|---|---|---|
| gene1 | string | ✓ | Source gene ID |
| gene2 | string | ✓ | Target gene ID |
| identity | float | | Sequence identity (0-1) |
| evalue | float | | E-value |
| bitscore | float | | Bit score |
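Two easy mistakes in a links file are gene IDs that do not appear in the genes table and identity values written as percentages (95 instead of 0.95). The sketch below flags both. It is a hypothetical helper, not part of Hoodini-Viz, and it assumes that gene1 and gene2 are meant to match the ID column of the genes file.

```python
import csv
import io


def check_links(links_tsv, gene_ids):
    """Return a list of problems found in a links TSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(links_tsv), delimiter="\t")
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in ("gene1", "gene2"):
            if row[column] not in gene_ids:
                problems.append(f"line {line_no}: unknown {column} {row[column]!r}")
        # identity is optional; only check it when a value is present.
        identity = row.get("identity")
        if identity and not 0.0 <= float(identity) <= 1.0:
            problems.append(f"line {line_no}: identity {identity} is outside 0-1")
    return problems
```

The gene_ids argument can be any set of strings, for example the ID column collected from genes.tsv.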
TSV Example:
gene1 gene2 identity evalue
gene_001 gene_101 0.95 1e-150
gene_002 gene_102 0.87 1e-120
gene_003 gene_203 0.72 1e-80
Domains (Optional)
Files: domains.parquet or domains.tsv
Protein domain annotations (Pfam, InterPro, etc.).
| Column | Type | Required | Description |
|---|---|---|---|
| gene_id | string | ✓ | Gene ID |
| domain_name | string | ✓ | Domain name/accession |
| start | int | ✓ | Domain start (amino acid position) |
| end | int | ✓ | Domain end |
| source | string | | Source database (pfam, interpro, etc.) |
| evalue | float | | Domain E-value |
| description | string | | Domain description |
TSV Example:
gene_id domain_name start end source evalue description
gene_001 PF00001 10 150 pfam 1e-50 7 transmembrane receptor
gene_001 PF00002 200 350 pfam 1e-40 G-protein coupled receptor
gene_002 IPR000001 5 180 interpro 1e-60 Kinase domain
Nucleotide Links (Optional)
Files: nucleotide_links.parquet or nucleotide_links.tsv
Synteny blocks between genomic regions (shown as polygonal overlays).
| Column | Type | Required | Description |
|---|---|---|---|
| source_seqid | string | ✓ | Source genome |
| source_start | int | ✓ | Source start position |
| source_end | int | ✓ | Source end position |
| target_seqid | string | ✓ | Target genome |
| target_start | int | ✓ | Target start position |
| target_end | int | ✓ | Target end position |
| identity | float | | Average identity |
| strand | string | | + (same) or - (inverted) |
ncRNA Metadata (Optional)
Files: ncrna_metadata.parquet or ncrna_metadata.txt
Non-coding RNA annotations with sequence and secondary structure information.
In the GFF file, ncRNA features must have a type that contains “ncRNA” (e.g., ncRNA, ncRNA_gene).
Use the ncrna_type attribute to specify the subtype (tRNA, rRNA, sRNA, tmRNA, etc.).
| Column | Type | Required | Description |
|---|---|---|---|
| seqid | string | ✓ | Genome/contig identifier |
| start | int | ✓ | Start position (1-based) |
| end | int | ✓ | End position |
| type | string | | ncRNA type (tRNA, rRNA, tmRNA, etc.) |
| sequence | string | | Nucleotide sequence |
| structure | string | | Secondary structure in dot-bracket notation |
| name | string | | Display name |
| product | string | | Description |
The sequence and structure fields enable the secondary structure viewer in the sidebar panel when clicking on ncRNA features.
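A dot-bracket string is only meaningful if it has one symbol per nucleotide and its parentheses are balanced. The sketch below checks both conditions. It is a hypothetical helper, and it accepts only '(', ')' and '.'. Whether the viewer supports extended pseudoknot symbols such as '[' and ']' is not stated here, so this sketch rejects them.

```python
def is_valid_dot_bracket(sequence, structure):
    """Check that structure matches sequence in length and is balanced."""
    if len(sequence) != len(structure):
        return False
    depth = 0  # number of currently open base pairs
    for symbol in structure:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with no matching opener
                return False
        elif symbol != ".":
            return False  # symbol outside the basic dot-bracket alphabet
    return depth == 0
```

Running this over the sequence and structure columns before export catches truncated structures, which would otherwise only show up when the feature is opened in the sidebar.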
GFF Example (type must contain “ncRNA”):
##gff-version 3
genome1 Infernal ncRNA 5000 5075 . + . ID=tRNA_Ala_1;Name=tRNA-Ala;ncrna_type=tRNA
genome1 Infernal ncRNA 12000 12120 . + . ID=tmRNA_1;Name=ssrA;ncrna_type=tmRNA
genome1 Infernal ncRNA 8500 10000 . + . ID=16S_rRNA;Name=16S rRNA;ncrna_type=rRNA
TSV Metadata Example:
seqid start end type sequence structure name
genome1 5000 5075 tRNA GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGAC... (((((((..((((........)))).(((((....... tRNA-Ala
genome1 12000 12120 tmRNA GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGA... (((((((((....)))....((((......))))... tmRNA
genome2 8500 10000 rRNA AUUGAACGCUGGCGGCAGGCCUAACACAUGCA... ...((((((....))))))...((((...))))... 16S rRNA
Regions (Optional)
Regions are genomic features displayed as rectangular overlays around genes. They are defined in the GFF file with type: 'region' and use the region_type attribute to specify what kind of region they represent.
How to Define Regions
All region features must have type set to region in the GFF. The actual region type (CRISPR, prophage, etc.) is specified in the region_type attribute.
| region_type Value | Description | Example Source |
|---|---|---|
| operon | Operon boundaries | Custom annotation |
| CRISPR | CRISPR repeat arrays | CCTyper |
| prophage | Prophage regions | geNomad, PHASTER |
| genomic_island | Genomic islands | IslandViewer |
| mobile_element | Mobile genetic elements | geNomad |
| defense | Defense system regions | PADLOC, DefenseFinder |
GFF Region Example
Regions are included directly in the GFF file with type=region:
genome1 CCTyper region 15000 16500 . + . ID=crispr_1;Name=CRISPR-I-E;region_type=CRISPR
genome1 geNomad region 45000 68000 . + . ID=prophage_1;Name=Prophage_region_1;region_type=prophage
genome1 PADLOC region 22000 25000 . + . ID=defense_1;Name=RM_Type_I;region_type=defense
The GFF type column must be region for Hoodini-Viz to recognize the feature. Use region_type in the attributes to specify CRISPR, prophage, etc.
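To confirm that a GFF file exposes its regions the way this section describes, the sketch below lists every feature whose type column is region, together with its region_type attribute. The function names are hypothetical and this is not the parser Hoodini-Viz uses. It assumes standard GFF3 layout: nine tab-separated columns, with percent-encoded attribute values.

```python
from urllib.parse import unquote


def parse_gff_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in values.
            attributes[key.strip()] = unquote(value.strip())
    return attributes


def region_features(gff_text):
    """Return one dict per feature whose type column is exactly 'region'."""
    regions = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "region":
            continue
        attributes = parse_gff_attributes(columns[8])
        regions.append(
            {
                "seqid": columns[0],
                "start": int(columns[3]),
                "end": int(columns[4]),
                "name": attributes.get("Name"),
                "region_type": attributes.get("region_type"),
            }
        )
    return regions
```

A feature that is missing from the result, or that comes back with region_type set to None, points to a wrong type column or a missing attribute.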
Converting to Parquet
Use the provided Python script:
python scripts/convert_to_parquet.py input.tsv output.parquet
Or with pandas:
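import pandas as pd
# Read the tab-separated file; to_parquet needs a Parquet engine (pyarrow or fastparquet) installed.
df = pd.read_csv('genes.tsv', sep='\t')
# index=False keeps the pandas row index out of the Parquet file.
df.to_parquet('genes.parquet', index=False)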