Step-by-Step Tutorial
This tutorial walks you through the complete Hoodini pipeline, showing both CLI and Python approaches for each stage.
Choose your preferred approach: CLI for quick runs, Python for custom workflows and programmatic control.
What You’ll Learn
- Run the full pipeline step-by-step
- Understand what each stage produces and what options are available
- Add custom columns to your input that appear in the viewer (via inputsheet or Python)
- Configure optional annotation tools
Pipeline Overview
Quick Start
CLI
Run the entire pipeline with a single command:
hoodini run \
--input proteins.txt \
--output my_analysis \
--num-threads 8 \
--tree-mode taxonomy \
--domains pfam \
--cctyper \
--genomad
Pipeline Stages
Stage 1: Initialize Inputs
Reads your input file and prepares records for the pipeline.
CLI
# With a single protein ID (triggers remote BLAST to find homologs)
hoodini run --input WP_010922251.1 --output my_analysis
# With a file of protein IDs (one per line, no BLAST)
hoodini run --input proteins.txt --output my_analysis
# With a file of nucleotide IDs (one per line)
hoodini run --input nucleotides.txt --output my_analysis
# With an input sheet (TSV with custom columns and local files)
hoodini run --inputsheet my_samples.tsv --output my_analysis
Single protein ID mode: Using a single protein ID triggers a remote BLAST search to find homologs. This can take several minutes. Control it with --remote-evalue and --remote-max-targets.
Supported input formats:
| Format | Example | Description |
|---|---|---|
| Single protein ID | WP_010922251.1 | Triggers BLAST to find homologs |
| Protein ID file | proteins.txt | One ID per line, no BLAST |
| Nucleotide ID file | nucleotides.txt | One ID per line (contigs, chromosomes) |
| UniProt ID | P12345 | Auto-converted to NCBI protein ID |
| Coordinates | NC_000913.3:1000-2000 | Specific genomic region |
| Input sheet | --inputsheet samples.tsv | TSV with local file paths |
Remote BLAST options (when using single protein ID):
| Option | Default | Description |
|---|---|---|
| --remote-evalue | 1e-5 | E-value threshold for BLAST hits |
| --remote-max-targets | 500 | Maximum number of homologs to retrieve |
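If you prefer to drive the CLI from Python (for example inside a larger workflow), you can shell out to hoodini run with the standard library. The sketch below is illustrative: the accessions are placeholders, and the flags mirror the CLI examples above.

```python
import subprocess
from pathlib import Path

# Write a protein ID file (one accession per line, so no remote BLAST is triggered)
ids = ["WP_010922251.1", "WP_002989955.1", "NP_472073.1"]
Path("proteins.txt").write_text("\n".join(ids) + "\n")

# Invoke the CLI exactly as in the examples above
subprocess.run(
    [
        "hoodini", "run",
        "--input", "proteins.txt",
        "--output", "my_analysis",
        "--num-threads", "8",
    ],
    check=True,  # raise if the pipeline exits with an error
)
```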
Stage 2: Resolve IPG Records
Queries NCBI’s Identical Protein Groups to find all genomes containing your query proteins. This expands your initial set to all available genomic contexts.
CLI
# Get all IPG candidates
hoodini run --input proteins.txt --output my_analysis --cand-mode any_ipg
# Keep only the best IPG hit per protein
hoodini run --input proteins.txt --output my_analysis --cand-mode best_ipg
# Keep only the best ID per input
hoodini run --input proteins.txt --output my_analysis --cand-mode best_id
# One genome per input protein
hoodini run --input proteins.txt --output my_analysis --cand-mode one_id
# Only genomes with the same ID as input
hoodini run --input proteins.txt --output my_analysis --cand-mode same_id
Candidate selection modes (--cand-mode):
| Mode | Description | Use case |
|---|---|---|
| best_id (default) | Best representative per input, protein ID must match | One representative per query |
| best_ipg | Best representative per input, allows different IDs | Balanced coverage |
| same_id | All IPG records with same protein ID | For non-redundant proteins (WP_, YP_, NP_) |
| any_ipg | All genomes from IPG groups | Maximum coverage (can expand massively) |
| one_id | First IPG record per input | Minimal set, ignores assembly quality |
Using any_ipg or same_id can dramatically increase the number of neighborhoods if your query protein is highly conserved across many assemblies.
IPG finds identical protein sequences across genomes. BLAST homology search (for similar but not identical proteins) happens in Stage 1 when using single-query mode.
Stage 3: Extract Neighborhoods
Downloads assemblies from NCBI and extracts genomic contexts around your query proteins.
CLI
# Default: ±10kb window (20000 nucleotides total)
hoodini run --input proteins.txt --output my_analysis
# Custom window size in nucleotides (±15kb = 30000 total)
hoodini run --input proteins.txt --output my_analysis \
--win-mode win_nts --win 30000
# Window by gene count (20 genes on each side)
hoodini run --input proteins.txt --output my_analysis \
--win-mode win_genes --win 20
# Set minimum window size requirements
hoodini run --input proteins.txt --output my_analysis \
--win 20000 --min-win 5000 --min-win-type both
# Use local assemblies instead of downloading
hoodini run --input proteins.txt --output my_analysis \
--assembly-folder /path/to/assemblies
# Limit concurrent downloads (default: 8)
hoodini run --input proteins.txt --output my_analysis \
--max-concurrent-downloads 4
Window options:
| Option | Values | Description |
|---|---|---|
| --win-mode | win_nts, win_genes | Define window by nucleotides or gene count |
| --win | integer | Window size (total, split ±half on each side) |
| --min-win | integer | Minimum acceptable window size |
| --min-win-type | total, upstream, downstream, both | How to apply minimum window requirement |
Download time: Depends on how many assemblies need downloading. ~100 assemblies typically take 5-15 minutes. Use --max-concurrent-downloads to adjust parallelism.
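To make the window arithmetic concrete, here is a small, purely illustrative calculation of how a window of --win nucleotides is split evenly around the target gene, clamped at contig edges, and checked against a total minimum size. This is a conceptual sketch of the win_nts behaviour described above, not Hoodini's own code.

```python
def neighborhood_window(target_start: int, target_end: int,
                        contig_length: int, win: int = 20000,
                        min_win: int = 0):
    """Return (start, end) of a window of `win` nt split ±half around the target,
    clamped to the contig, or None if the result is shorter than `min_win`."""
    half = win // 2
    start = max(1, target_start - half)
    end = min(contig_length, target_end + half)
    if (end - start + 1) < min_win:
        return None  # neighborhood too small, e.g. target sits near a contig edge
    return start, end

# Target gene at 3,000-4,000 on a 12,000 nt contig, default ±10 kb window:
print(neighborhood_window(3000, 4000, 12000))  # (1, 12000) – clamped on both sides
```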
Stage 4: Cluster Proteins
Groups similar proteins into families for easier visualization. Proteins in the same cluster get the same color in the viewer.
CLI
# Diamond DeepClust (default, fast hierarchical clustering)
hoodini run --input proteins.txt --output my_analysis \
--clust-method diamond_deepclust
# DeepMMseqs (MMseqs2-based deep clustering)
hoodini run --input proteins.txt --output my_analysis \
--clust-method deepmmseqs
# JackHMMER iterative search
hoodini run --input proteins.txt --output my_analysis \
--clust-method jackhmmer
# BLASTp all-vs-all
hoodini run --input proteins.txt --output my_analysis \
--clust-method blastp
Clustering methods (--clust-method):
| Method | Speed | Description |
|---|---|---|
| diamond_deepclust | ⚡ Fast | Hierarchical clustering with Diamond. Best for most cases |
| deepmmseqs | 🐢 Slow | MMseqs2-based deep clustering. Good sensitivity |
| jackhmmer | 🐢 Slow | Iterative HMM search. Best sensitivity for divergent proteins |
| blastp | ⏱️ Medium | Traditional all-vs-all BLAST. Most thorough |
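After the run, you can get a feel for how large the resulting protein families are by looking at the cluster assignments in protein_metadata.parquet. The column name below ("cluster") is an assumption and may differ between versions, so check the schema first; this is a sketch, not part of the pipeline.

```python
import polars as pl

prots = pl.read_parquet("my_analysis/protein_metadata.parquet")

# Assumed column name for the cluster/family assignment; verify with prots.schema
cluster_col = "cluster"
if cluster_col in prots.columns:
    sizes = prots.group_by(cluster_col).len().sort("len", descending=True)
    print(sizes.head(10))
else:
    print(prots.schema)  # inspect and adjust cluster_col accordingly
```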
Stage 5: Build Phylogenetic Tree
Creates a tree showing relationships between genomes. The tree determines how neighborhoods are ordered in the visualization.
CLI
# Taxonomy-based tree (instant, uses NCBI hierarchy)
hoodini run --input proteins.txt --output my_analysis \
--tree-mode taxonomy
# Fast neighbor-joining tree
hoodini run --input proteins.txt --output my_analysis \
--tree-mode fast_nj
# AAI-based tree (protein similarity)
hoodini run --input proteins.txt --output my_analysis \
--tree-mode aai_tree
# ANI-based tree (nucleotide identity)
hoodini run --input proteins.txt --output my_analysis \
--tree-mode ani_tree
# Fast ML tree
hoodini run --input proteins.txt --output my_analysis \
--tree-mode fast_ml
# Use your own tree file
hoodini run --input proteins.txt --output my_analysis \
--tree-mode use_input_tree --tree-file my_tree.nwk
# FoldMason structure-based tree
hoodini run --input proteins.txt --output my_analysis \
--tree-mode foldmason_tree
# Neighborhood similarity tree
hoodini run --input proteins.txt --output my_analysis \
--tree-mode neigh_similarity_tree
# Neighborhood phylogenetic tree
hoodini run --input proteins.txt --output my_analysis \
--tree-mode neigh_phylo_tree
Tree building methods (--tree-mode):
| Mode | Speed | Description |
|---|---|---|
| taxonomy | ⚡ Instant | Builds tree from NCBI taxonomy hierarchy. Pairwise distances computed from taxonomic rank differences |
| fast_nj | ⚡ Fast | FAMSA pairwise distance matrix → DecentTree neighbor-joining |
| aai_tree | 🔄 Medium | Requires --pairwise-aai. DecentTree NJ from AAI distances (100 - AAI%). Missing pairs filled with max+2σ |
| ani_tree | 🔄 Medium | Requires --pairwise-ani. DecentTree NJ from ANI distances (100 - ANI%). Only meaningful above ~70% ANI |
| fast_ml | 🐢 Slow | FAMSA alignment → VeryFastTree maximum likelihood |
| use_input_tree | ⚡ Instant | Load user-provided Newick file. Requires --tree-file path/to/tree.nwk |
| foldmason_tree | 🐢 Slow | Maps proteins to UniProt → fetches AlphaFold structures → FoldMason structural alignment → VeryFastTree. Falls back to fast_ml if mapping fails |
| neigh_similarity_tree | 🔄 Medium | Jaccard distance from neighborhood gene family presence/absence matrix → hierarchical clustering |
| neigh_phylo_tree | 🐢 Slow | Position-weighted neighborhood gene content (genes closer to target weighted higher) → cosine distance → hierarchical clustering |
Target protein trees
Target protein trees (taxonomy, fast_nj, fast_ml, foldmason_tree, use_input_tree):
- Group neighborhoods by the evolutionary relationship of the target protein
- taxonomy: fastest, groups by species/genus/family but doesn’t reflect sequence divergence
- fast_nj: good balance of speed and accuracy for most cases
- fast_ml: more accurate topology, slower
- foldmason_tree: best for remote homologs where sequence similarity is low but structure is conserved
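Whichever mode you choose, the resulting tree is written as a standard Newick file (tree.nwk, listed under Output Structure below), so you can inspect or post-process it outside the viewer. A small sketch using Biopython; adjust the path to wherever your run placed the file.

```python
from Bio import Phylo

# tree.nwk appears in the output structure below; the exact location may differ per run
tree = Phylo.read("my_analysis/tree.nwk", "newick")

print(f"{tree.count_terminals()} leaves")
for leaf in tree.get_terminals()[:5]:
    print(leaf.name)
```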
Stage 6: Pairwise Comparisons (Optional)
Compute similarities between neighborhoods for visualization links.
CLI
# Enable protein links (colored ribbons between similar proteins)
hoodini run --input proteins.txt --output my_analysis --prot-links
# Enable nucleotide links (synteny ribbons between neighborhoods)
hoodini run --input proteins.txt --output my_analysis --nt-links
# Both
hoodini run --input proteins.txt --output my_analysis \
--prot-links --nt-links
# Configure nucleotide alignment method
hoodini run --input proteins.txt --output my_analysis \
--nt-links --nt-aln-mode blastn
# Configure AAI calculation for trees
hoodini run --input proteins.txt --output my_analysis \
--tree-mode aai_tree --aai-mode wgrr --min-pident 30
Pairwise comparison options:
| Option | Values | Description |
|---|---|---|
| --nt-aln-mode | blastn, fastani, minimap2, intergenic_blastn | Method for nucleotide alignments |
| --ani-mode | skani, blastn | ANI calculation method for trees |
| --aai-mode | wgrr, aai | wGRR (weighted Gene Repertoire Relatedness) or plain AAI |
| --aai-subset-mode | target_prot, target_region, window | Which proteins to include in AAI |
| --min-pident | float | Minimum percent identity for BLAST hits |
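For orientation, wGRR is commonly computed from the best bidirectional protein hits between two neighborhoods: the identities of those hit pairs are summed and divided by the size of the smaller gene repertoire. The sketch below illustrates that common formula conceptually; it is not Hoodini's implementation, and the inputs are placeholders.

```python
def wgrr(bbh_identities: list[float], n_genes_a: int, n_genes_b: int) -> float:
    """weighted Gene Repertoire Relatedness: sum of percent identities of
    best bidirectional hits, divided by the smaller gene repertoire size."""
    return sum(bbh_identities) / (100 * min(n_genes_a, n_genes_b))

# Two neighborhoods of 8 and 10 genes sharing 5 bidirectional best hits:
print(wgrr([95.0, 80.0, 60.0, 55.0, 40.0], 8, 10))  # ≈ 0.41
```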
Stage 7: Annotations (Optional)
Add functional annotations to proteins and neighborhoods.
Domains
Protein domains via MetaCerberus (Pfam, TIGRfam, COG, etc.)
CLI
# Annotate with Pfam domains
hoodini run --input proteins.txt --output my_analysis --domains pfam
# Multiple databases (comma-separated)
hoodini run --input proteins.txt --output my_analysis \
--domains pfam,tigrfam,cog
Available domain databases: pfam, tigrfam, cog, kegg, cazy, vog, phrogs
Stage 8: Export Visualization
Package everything into the interactive viewer.
CLI
The CLI automatically exports the visualization. Find it at:
# Open the visualization (macOS)
open my_analysis/hoodini-viz/hoodini-viz.html
# Open the visualization (Linux)
xdg-open my_analysis/hoodini-viz/hoodini-viz.html
# Or just open in your browser manually
firefox my_analysis/hoodini-viz/hoodini-viz.html
Output Structure
- gff.parquet
- hoods.parquet
- protein_metadata.parquet
- tree_metadata.parquet
- domains.parquet
- nucleotide_links.parquet
- protein_links.parquet
- tree.nwk
- hoodini-viz.html
- all_neigh.tsv
- records.csv
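All of the tabular outputs are ordinary Parquet/TSV files, so you can inspect them directly with Polars (already used elsewhere in this tutorial). A quick sketch; adjust the paths to wherever your run placed the files, since the exact directory layout can vary.

```python
import polars as pl

out = "my_analysis"  # your --output directory; adjust if files live in a subfolder

hoods = pl.read_parquet(f"{out}/hoods.parquet")
prots = pl.read_parquet(f"{out}/protein_metadata.parquet")

# See what columns each table actually carries before building on them
print(hoods.schema)
print(prots.schema)
print(prots.head())
```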
Adding Custom Columns
One of Hoodini’s powerful features is passing custom metadata through the pipeline to the visualization. Any extra columns you add will appear in the viewer’s tooltip and can be used for filtering/coloring.
There are two ways to add custom columns:
- Input sheet - Add columns to your TSV input file
- Python - Add columns to the Polars DataFrames mid-pipeline
Method 1: Input Sheet with Custom Columns
Create a TSV file with your custom columns alongside the required ones.
CLI
Create my_samples.tsv:
protein_id nucleotide_id sample_source collection_date host_species my_category
WP_010922251.1 NC_002516.2 soil 2023-01-15 environmental group_A
WP_002989955.1 NC_003028.3 clinical 2022-06-20 human group_B
NP_472073.1 NC_000964.3 marine 2021-11-30 fish group_A
Then run:
hoodini run --inputsheet my_samples.tsv --output my_analysis --num-threads 8
Your custom columns (sample_source, collection_date, host_species, my_category) will flow through the entire pipeline and appear in the visualization.
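Before launching a long run, it can be worth sanity-checking the sheet programmatically. A small sketch with Polars that verifies the required columns (listed below under Required vs Custom Columns) are present and reports which custom columns will travel to the viewer:

```python
import polars as pl

sheet = pl.read_csv("my_samples.tsv", separator="\t")

required = {"protein_id", "nucleotide_id"}
missing = required - set(sheet.columns)
if missing:
    raise ValueError(f"inputsheet is missing required columns: {missing}")

# Everything beyond the required columns becomes custom metadata in the viewer
custom = [c for c in sheet.columns if c not in required]
print("custom columns:", custom)
```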
Method 2: Adding Columns Mid-Pipeline (Python only)
You can add or modify columns at any point during the pipeline using Polars operations:
import polars as pl

# === After initialization: Add computed columns ===
records = records.with_columns(
    # Classify based on taxonomy
    pl.when(pl.col("taxid").is_in([562, 573, 287]))
    .then(pl.lit("pathogen"))
    .otherwise(pl.lit("environmental"))
    .alias("pathogen_status"),
    # Add a constant label
    pl.lit("experiment_2024").alias("batch_id"),
)

# === Join external metadata ===
external_data = pl.read_csv("my_annotations.tsv", separator="\t")
records = records.join(
    external_data,
    on="protein_id",
    how="left"
)

# === After extraction: Add to protein metadata ===
all_prots = all_prots.with_columns(
    pl.when(pl.col("length") > 500)
    .then(pl.lit("large"))
    .otherwise(pl.lit("small"))
    .alias("size_category")
)

# === Add to neighborhood data ===
all_neigh = all_neigh.with_columns(
    pl.col("gc_content").round(2).alias("gc_percent")
)
When to use each method:
- Input sheet: Best for metadata you already have (sample info, experimental conditions)
- Mid-pipeline Python: Best for computed values (classifications, joined data, derived metrics)
Required vs Custom Columns
Required Columns
These columns are required for inputsheet mode:
| Column | Description |
|---|---|
| protein_id | NCBI protein accession |
| nucleotide_id | NCBI nucleotide accession |
If providing local files, also include:
| Column | Description |
|---|---|
| gff_path | Path to GFF3 annotation file |
| fna_path | Path to genome FASTA file |
| faa_path | Path to protein FASTA file |
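If your assemblies are local, you can also build the sheet programmatically instead of by hand. A sketch with Polars; the accessions, paths, and extra column here are placeholders.

```python
import polars as pl

sheet = pl.DataFrame(
    {
        "protein_id": ["WP_010922251.1", "WP_002989955.1"],
        "nucleotide_id": ["NC_002516.2", "NC_003028.3"],
        # Local files, one set per row (see the columns above)
        "gff_path": ["assemblies/a1.gff3", "assemblies/a2.gff3"],
        "fna_path": ["assemblies/a1.fna", "assemblies/a2.fna"],
        "faa_path": ["assemblies/a1.faa", "assemblies/a2.faa"],
        # Any extra columns become custom metadata in the viewer
        "sample_source": ["soil", "clinical"],
    }
)
sheet.write_csv("my_samples.tsv", separator="\t")
```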
Viewing Custom Data in the Visualization
After running the pipeline, your custom columns appear in:
- Tree metadata - Hover over leaves to see sample info
- Neighborhood tooltips - Click neighborhoods to see associated metadata
- Filtering sidebar - Use custom columns to filter/highlight specific groups
- Color by - Categorical columns can be used to color the tree or neighborhoods
Complete Examples
CLI
#!/bin/bash
# Complete Hoodini pipeline with all options
hoodini run \
--inputsheet cas9_proteins.tsv \
--output cas9_analysis \
--num-threads 8 \
--win-mode win_nts \
--win 20000 \
--cand-mode any_ipg \
--clust-method diamond_deepclust \
--tree-mode taxonomy \
--domains pfam,tigrfam \
--cctyper \
--genomad \
--ncrna
echo "🎉 Done! Open cas9_analysis/hoodini-viz/hoodini-viz.html"
Next Steps
- CLI Reference - Full command-line options with descriptions
- Outputs - Detailed output file formats
- API Reference - Complete Python API documentation