Data Formats

Hoodini-Viz supports two file formats:

Parquet (recommended) — Binary columnar format, 3-10x faster loading
TSV — Plain text, human-readable

If you’re using Hoodini, all data files are generated automatically in the correct format!

Tree (Newick)

File: tree.nwk

A phylogenetic tree in standard Newick format. Leaf names must match the seqid values used in the genes and hoods files.


((genome1:0.1,genome2:0.2):0.05,(genome3:0.15,genome4:0.12):0.08);
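The leaf-name requirement can be checked before loading. The sketch below is not part of Hoodini-Viz: it assumes a simple Newick string with unquoted labels and more than one leaf, and the `seqids` set is a hypothetical stand-in for the values in your genes/hoods files.

```python
import re

def newick_leaf_names(newick: str) -> list[str]:
    """Return leaf labels from a simple, unquoted Newick string."""
    # A leaf label directly follows "(" or ","; internal-node labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

newick = "((genome1:0.1,genome2:0.2):0.05,(genome3:0.15,genome4:0.12):0.08);"
leaves = newick_leaf_names(newick)

# Hypothetical seqid values, standing in for those found in genes/hoods.
seqids = {"genome1", "genome2", "genome3"}

missing = sorted(set(leaves) - seqids)  # tree leaves with no matching seqid
unknown = sorted(seqids - set(leaves))  # seqids that are absent from the tree
print("leaves without data:", missing)
print("seqids not in tree:", unknown)
```

Quoted labels or labels containing parentheses would need a real Newick parser rather than this regular expression.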

Genes (GFF3-style)

Files: genes.parquet or genes.tsv

Column	Type	Required	Description
`seqid`	string	✓	Genome/contig identifier (matches tree leaf)
`start`	int	✓	Start position (1-based)
`end`	int	✓	End position
`strand`	string	✓	`+` or `-`
`ID`	string	✓	Unique gene identifier
`Name`	string		Display name
`product`	string		Gene product description
`cluster`	string/int		Homology cluster ID (for coloring)
`locus_tag`	string		Locus tag
`protein_id`	string		Protein accession

Any additional columns are available for coloring and filtering.

TSV Example:


seqid	start	end	strand	ID	Name	cluster	product
genome1	1000	1500	+	gene_001	dnaA	1	chromosomal replication initiator
genome1	1600	2400	+	gene_002	dnaN	2	DNA polymerase III subunit beta
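If you build genes.tsv by hand, a quick check of the required columns can catch problems early. This is a minimal sketch using only the standard library, not the validation Hoodini-Viz itself performs. The function name `validate_genes` is hypothetical, and the `start <= end` rule is an assumption borrowed from GFF3 conventions.

```python
import csv
import io

REQUIRED = ("seqid", "start", "end", "strand", "ID")

def validate_genes(tsv_text: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    absent = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if absent:
        return [f"missing required column(s): {', '.join(absent)}"]

    problems, seen_ids = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            start, end = int(row["start"]), int(row["end"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: start/end must be integers")
            continue
        if start < 1 or end < start:  # assumed rule: 1 <= start <= end
            problems.append(f"line {line_no}: expected 1 <= start <= end")
        if row["strand"] not in ("+", "-"):
            problems.append(f"line {line_no}: strand must be '+' or '-'")
        if row["ID"] in seen_ids:
            problems.append(f"line {line_no}: duplicate ID {row['ID']!r}")
        seen_ids.add(row["ID"])
    return problems

genes_tsv = (
    "seqid\tstart\tend\tstrand\tID\tName\tcluster\tproduct\n"
    "genome1\t1000\t1500\t+\tgene_001\tdnaA\t1\tchromosomal replication initiator\n"
    "genome1\t1600\t2400\t+\tgene_002\tdnaN\t2\tDNA polymerase III subunit beta\n"
)
print(validate_genes(genes_tsv))
```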

Hoods (Genomic Windows)

Files: hoods.parquet or hoods.tsv

Defines which genomic regions to display for each genome.

Column	Type	Required	Description
`hood_id`	int	✓	Unique hood identifier
`seqid`	string	✓	Genome/contig (matches tree leaf)
`start`	int	✓	Window start position
`end`	int	✓	Window end position
`align_gene`	string		Gene ID to use for alignment
`label`	string		Display label (defaults to seqid)

TSV Example:


hood_id	seqid	start	end	align_gene
1	genome1	0	15000	gene_005
2	genome2	50000	65000	gene_105
3	genome3	120000	135000	gene_205
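To preview which genes a window would cover, and whether align_gene points at one of them, something like the following can help. It is a sketch under assumptions: the `Gene` and `Hood` types are hypothetical, the gene_005 and gene_105 coordinates are invented for illustration, and "covered" is taken to mean any overlap with the window, which may differ from what the viewer actually draws.

```python
from typing import NamedTuple, Optional

class Gene(NamedTuple):
    seqid: str
    start: int
    end: int
    gene_id: str

class Hood(NamedTuple):
    hood_id: int
    seqid: str
    start: int
    end: int
    align_gene: Optional[str] = None

def genes_in_hood(hood: Hood, genes: list[Gene]) -> list[Gene]:
    """Genes on the hood's seqid that overlap the window [start, end]."""
    return [g for g in genes
            if g.seqid == hood.seqid and g.start <= hood.end and g.end >= hood.start]

def align_gene_ok(hood: Hood, genes: list[Gene]) -> bool:
    """True if align_gene is unset or names a gene overlapping the window."""
    if hood.align_gene is None:
        return True
    return any(g.gene_id == hood.align_gene for g in genes_in_hood(hood, genes))

genes = [
    Gene("genome1", 1000, 1500, "gene_001"),
    Gene("genome1", 1600, 2400, "gene_002"),
    Gene("genome1", 14800, 15200, "gene_005"),  # invented coordinates
    Gene("genome2", 51000, 52000, "gene_105"),  # invented coordinates
]
hood = Hood(1, "genome1", 0, 15000, "gene_005")
print([g.gene_id for g in genes_in_hood(hood, genes)])
print(align_gene_ok(hood, genes))
```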

Protein Links

Files: links.parquet or links.tsv

Homology relationships between proteins (shown as curved connections).

Column	Type	Required	Description
`gene1`	string	✓	Source gene ID
`gene2`	string	✓	Target gene ID
`identity`	float		Sequence identity (0-1)
`evalue`	float		E-value
`bitscore`	float		Bit score

TSV Example:


gene1	gene2	identity	evalue
gene_001	gene_101	0.95	1e-150
gene_002	gene_102	0.87	1e-120
gene_003	gene_203	0.72	1e-80
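One way to produce a links file is to convert protein search output. The sketch below assumes the input is BLAST tabular output (`-outfmt 6`) with its default twelve columns, where `pident` is a percentage. It therefore divides by 100 to get the 0-1 identity described above. The function name and the bit score in the sample hit are made up for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore")

def blast6_to_links(blast_text: str) -> str:
    """Convert BLAST outfmt 6 text into links.tsv text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene1", "gene2", "identity", "evalue", "bitscore"])
    for fields in csv.reader(io.StringIO(blast_text), delimiter="\t"):
        if len(fields) < len(BLAST6_COLUMNS) or fields[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        hit = dict(zip(BLAST6_COLUMNS, fields))
        identity = float(hit["pident"]) / 100.0  # percent -> fraction in 0-1
        writer.writerow([hit["qseqid"], hit["sseqid"], f"{identity:.4f}",
                         hit["evalue"], hit["bitscore"]])
    return out.getvalue()

sample_hit = "gene_001\tgene_101\t95.00\t300\t15\t0\t1\t300\t1\t300\t1e-150\t540\n"
links_tsv = blast6_to_links(sample_hit)
print(links_tsv)
```

The query and subject identifiers must already be the gene IDs used in the genes file; if your search was run on protein accessions, map them to gene IDs first.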

Domains (Optional)

Files: domains.parquet or domains.tsv

Protein domain annotations (Pfam, InterPro, etc.).

Column	Type	Required	Description
`gene_id`	string	✓	Gene ID
`domain_name`	string	✓	Domain name/accession
`start`	int	✓	Domain start (amino acid position)
`end`	int	✓	Domain end
`source`	string		Source database (pfam, interpro, etc.)
`evalue`	float		Domain E-value
`description`	string		Domain description

TSV Example:


gene_id	domain_name	start	end	source	evalue	description
gene_001	PF00001	10	150	pfam	1e-50	7 transmembrane receptor
gene_001	PF00002	200	350	pfam	1e-40	G-protein coupled receptor
gene_002	IPR000001	5	180	interpro	1e-60	Kinase domain
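Because domain coordinates are in amino acids while gene coordinates are in nucleotides, a common mistake is to supply one where the other is expected. The following sketch flags domains that cannot fit inside the protein. It assumes each gene is a single uninterrupted coding sequence, so the protein has at most one residue per three nucleotides. The function name, gene_010, and its coordinates are hypothetical.

```python
def domains_outside_protein(domains, gene_coords):
    """Return (gene_id, domain_name) pairs whose coordinates cannot fit the protein.

    domains: iterable of (gene_id, domain_name, start, end) in amino acids.
    gene_coords: dict mapping gene_id -> (start, end) in nucleotides, 1-based inclusive.
    """
    flagged = []
    for gene_id, name, d_start, d_end in domains:
        g_start, g_end = gene_coords[gene_id]
        max_aa = (g_end - g_start + 1) // 3  # upper bound on protein length
        if not (1 <= d_start <= d_end <= max_aa):
            flagged.append((gene_id, name))
    return flagged

gene_coords = {"gene_010": (1000, 2499)}  # 1500 nt, so at most 500 residues
domains = [
    ("gene_010", "PF00001", 10, 150),   # fits
    ("gene_010", "PF00002", 480, 520),  # runs past residue 500
]
print(domains_outside_protein(domains, gene_coords))
```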

Nucleotide Links (Optional)

Files: nucleotide_links.parquet or nucleotide_links.tsv

Synteny blocks between genomic regions (shown as polygonal overlays).

Column	Type	Required	Description
`source_seqid`	string	✓	Source genome
`source_start`	int	✓	Source start position
`source_end`	int	✓	Source end position
`target_seqid`	string	✓	Target genome
`target_start`	int	✓	Target start position
`target_end`	int	✓	Target end position
`identity`	float		Average identity
`strand`	string		`+` (same) or `-` (inverted)
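As an illustration of the layout, the sketch below writes a small nucleotide_links table with the columns in the order listed above. All coordinates and identity values are invented, and the helper name is hypothetical. Optional fields that are absent are written as empty cells.

```python
import csv
import io

COLUMNS = ["source_seqid", "source_start", "source_end",
           "target_seqid", "target_start", "target_end",
           "identity", "strand"]

# Invented synteny blocks, for illustration only.
blocks = [
    {"source_seqid": "genome1", "source_start": 1000, "source_end": 5000,
     "target_seqid": "genome2", "target_start": 51000, "target_end": 55000,
     "identity": 0.92, "strand": "+"},
    {"source_seqid": "genome1", "source_start": 7000, "source_end": 9000,
     "target_seqid": "genome3", "target_start": 126000, "target_end": 128000,
     "strand": "-"},  # identity omitted: it is optional
]

def write_nucleotide_links(rows, handle):
    """Write rows as tab-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_nucleotide_links(blocks, buf)
print(buf.getvalue())
```

To write a file instead, pass a handle opened with `open("nucleotide_links.tsv", "w", newline="")`.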

ncRNA Metadata (Optional)

Files: ncrna_metadata.parquet or ncrna_metadata.txt

Non-coding RNA annotations with sequence and secondary structure information.

In the GFF file, ncRNA features must have a type that contains “ncRNA” (e.g., ncRNA, ncRNA_gene). Use the ncrna_type attribute to specify the subtype (tRNA, rRNA, sRNA, tmRNA, etc.).

Column	Type	Required	Description
`seqid`	string	✓	Genome/contig identifier
`start`	int	✓	Start position (1-based)
`end`	int	✓	End position
`type`	string		ncRNA type (tRNA, rRNA, tmRNA, etc.)
`sequence`	string		Nucleotide sequence
`structure`	string		Secondary structure in dot-bracket notation
`name`	string		Display name
`product`	string		Description

The sequence and structure fields enable the secondary structure viewer in the sidebar panel when clicking on ncRNA features.

GFF Example (type must contain “ncRNA”):


##gff-version 3
genome1	Infernal	ncRNA	5000	5075	.	+	.	ID=tRNA_Ala_1;Name=tRNA-Ala;ncrna_type=tRNA
genome1	Infernal	ncRNA	12000	12120	.	+	.	ID=tmRNA_1;Name=ssrA;ncrna_type=tmRNA
genome1	Infernal	ncRNA	8500	10000	.	+	.	ID=16S_rRNA;Name=16S rRNA;ncrna_type=rRNA

TSV Metadata Example:


seqid	start	end	type	sequence	structure	name
genome1	5000	5075	tRNA	GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGAC...	(((((((..((((........)))).(((((.......	tRNA-Ala
genome1	12000	12120	tmRNA	GGGGCUGAUUCUGGAUUCGACGGGAUUUGCGA...	(((((((((....)))....((((......))))...	tmRNA
genome1	8500	10000	rRNA	AUUGAACGCUGGCGGCAGGCCUAACACAUGCA...	...((((((....))))))...((((...))))...	16S rRNA
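Since the structure viewer relies on the sequence and structure fields together, it can be worth checking that each pair is consistent before loading. This is a minimal sketch with a hypothetical function name. It assumes plain dot-bracket notation using only `(`, `)` and `.`, so pseudoknot brackets such as `[ ]` are reported as unexpected. The rows in the example above are truncated with `...` and would not pass as written.

```python
def dot_bracket_problem(sequence: str, structure: str):
    """Return a description of the first problem found, or None if consistent."""
    if len(sequence) != len(structure):
        return (f"length mismatch: {len(sequence)} nt vs "
                f"{len(structure)} structure characters")
    depth = 0  # number of currently open base pairs
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"unmatched ')' at position {pos}"
        elif ch != ".":
            return f"unexpected character {ch!r} at position {pos}"
    return None if depth == 0 else f"{depth} unmatched '('"

print(dot_bracket_problem("GGGAAACCC", "(((...)))"))  # consistent hairpin
print(dot_bracket_problem("GGGAAACC", "(((...))"))    # one '(' left open
```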

Regions (Optional)

Regions are genomic features displayed as rectangular overlays around genes. They are defined in the GFF file as features whose type column (column 3) is region, and the region_type attribute specifies what kind of region each one represents.

How to Define Regions

All region features must have type set to region in the GFF. The actual region type (CRISPR, prophage, etc.) is specified in the region_type attribute.

`region_type` Value	Description	Example Source
`operon`	Operon boundaries	Custom annotation
`CRISPR`	CRISPR repeat arrays	CCTyper
`prophage`	Prophage regions	geNomad, PHASTER
`genomic_island`	Genomic islands	IslandViewer
`mobile_element`	Mobile genetic elements	geNomad
`defense`	Defense system regions	PADLOC, DefenseFinder

GFF Region Example

Regions are included directly in the GFF file, with region in the type column:


genome1	CCTyper	region	15000	16500	.	+	.	ID=crispr_1;Name=CRISPR-I-E;region_type=CRISPR
genome1	geNomad	region	45000	68000	.	+	.	ID=prophage_1;Name=Prophage_region_1;region_type=prophage
genome1	PADLOC	region	22000	25000	.	+	.	ID=defense_1;Name=RM_Type_I;region_type=defense

The GFF type column must be region for Hoodini-Viz to recognize the feature. Use region_type in the attributes column to specify CRISPR, prophage, etc.
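To see which lines of a GFF file would qualify as regions under this rule, the sketch below reads the type column and the region_type attribute. It is an illustration rather than the parser Hoodini-Viz uses. The function names are hypothetical, and the CDS line in the sample is invented to show a non-region feature being skipped.

```python
from urllib.parse import unquote

def parse_attributes(column9: str) -> dict[str, str]:
    """Split a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs

def region_features(gff_text: str):
    """Return (seqid, start, end, region_type) for every feature of type region."""
    regions = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "region":
            continue
        attrs = parse_attributes(cols[8])
        regions.append((cols[0], int(cols[3]), int(cols[4]), attrs.get("region_type")))
    return regions

gff = (
    "##gff-version 3\n"
    "genome1\tCCTyper\tregion\t15000\t16500\t.\t+\t.\tID=crispr_1;Name=CRISPR-I-E;region_type=CRISPR\n"
    "genome1\tgeNomad\tregion\t45000\t68000\t.\t+\t.\tID=prophage_1;Name=Prophage_region_1;region_type=prophage\n"
    "genome1\tExample\tCDS\t1000\t1500\t.\t+\t0\tID=gene_001\n"  # invented non-region line
)
for region in region_features(gff):
    print(region)
```

A region line without a region_type attribute comes back with `None` in the last position, which makes such lines easy to spot.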

Converting to Parquet

Use the provided Python script:


python scripts/convert_to_parquet.py input.tsv output.parquet

Or with pandas:


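import pandas as pd

# Read the tab-separated file, then write it back out as Parquet.
# to_parquet needs a Parquet engine installed, such as pyarrow or fastparquet.
df = pd.read_csv('genes.tsv', sep='\t')
df.to_parquet('genes.parquet', index=False)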