
Step-by-Step Tutorial

This tutorial walks you through the complete Hoodini pipeline, showing both CLI and Python approaches for each stage.

📓

Choose your preferred approach: CLI for quick runs, Python for custom workflows and programmatic control.

What You’ll Learn

  • Run the full pipeline step-by-step
  • Understand what each stage produces and what options are available
  • Add custom columns to your input that appear in the viewer (via inputsheet or Python)
  • Configure optional annotation tools

Pipeline Overview


Quick Start

Run the entire pipeline with a single command:

hoodini run \
  --input proteins.txt \
  --output my_analysis \
  --num-threads 8 \
  --tree-mode taxonomy \
  --domains pfam \
  --cctyper \
  --genomad

Pipeline Stages

Stage 1: Initialize Inputs

Reads your input file and prepares records for the pipeline.

# With a single protein ID (triggers remote BLAST to find homologs)
hoodini run --input WP_010922251.1 --output my_analysis

# With a file of protein IDs (one per line, no BLAST)
hoodini run --input proteins.txt --output my_analysis

# With a file of nucleotide IDs (one per line)
hoodini run --input nucleotides.txt --output my_analysis

# With an input sheet (TSV with custom columns and local files)
hoodini run --inputsheet my_samples.tsv --output my_analysis
⚠️

Single protein ID mode: Using a single protein ID triggers a remote BLAST search to find homologs. This can take several minutes. Control it with --remote-evalue and --remote-max-targets.

Supported input formats:

| Format | Example | Description |
|---|---|---|
| Single protein ID | WP_010922251.1 | Triggers BLAST to find homologs |
| Protein ID file | proteins.txt | One ID per line, no BLAST |
| Nucleotide ID file | nucleotides.txt | One ID per line (contigs, chromosomes) |
| UniProt ID | P12345 | Auto-converted to NCBI protein ID |
| Coordinates | NC_000913.3:1000-2000 | Specific genomic region |
| Input sheet | --inputsheet samples.tsv | TSV with local file paths |

Remote BLAST options (when using single protein ID):

| Option | Default | Description |
|---|---|---|
| --remote-evalue | 1e-5 | E-value threshold for BLAST hits |
| --remote-max-targets | 500 | Maximum number of homologs to retrieve |
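As a minimal sketch (a hypothetical helper, not part of Hoodini) of how these two options interact, hits above the e-value threshold are discarded first and the survivors are then capped at the maximum target count:

```python
def filter_remote_hits(hits, evalue=1e-5, max_targets=500):
    """hits: list of (protein_id, evalue) tuples, e.g. parsed BLAST output."""
    # Keep only hits at or below the e-value threshold
    kept = [(pid, e) for pid, e in hits if e <= evalue]
    # Best (lowest) e-value first, then cap at max_targets
    kept.sort(key=lambda h: h[1])
    return kept[:max_targets]

hits = [("WP_000001.1", 1e-30), ("WP_000002.1", 1e-3), ("WP_000003.1", 1e-12)]
print(filter_remote_hits(hits))  # the 1e-3 hit is dropped
```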

Stage 2: Resolve IPG Records

Queries NCBI’s Identical Protein Groups to find all genomes containing your query proteins. This expands your initial set to all available genomic contexts.

# Get all IPG candidates
hoodini run --input proteins.txt --output my_analysis --cand-mode any_ipg

# Keep only the best IPG hit per protein
hoodini run --input proteins.txt --output my_analysis --cand-mode best_ipg

# Keep only the best ID per input (default)
hoodini run --input proteins.txt --output my_analysis --cand-mode best_id

# One genome per input protein
hoodini run --input proteins.txt --output my_analysis --cand-mode one_id

# Only genomes with the same ID as input
hoodini run --input proteins.txt --output my_analysis --cand-mode same_id

Candidate selection modes (--cand-mode):

| Mode | Description | Use case |
|---|---|---|
| best_id (default) | Best representative per input, protein ID must match | One representative per query |
| best_ipg | Best representative per input, allows different IDs | Balanced coverage |
| same_id | All IPG records with same protein ID | For non-redundant proteins (WP_, YP_, NP_) |
| any_ipg | All genomes from IPG groups | Maximum coverage (can expand massively) |
| one_id | First IPG record per input | Minimal set, ignores assembly quality |

Using any_ipg or same_id can dramatically increase the number of neighborhoods if your query protein is highly conserved across many assemblies.

IPG finds identical protein sequences across genomes. BLAST homology search (for similar but not identical proteins) happens in Stage 1 when using single-query mode.

Stage 3: Extract Neighborhoods

Downloads assemblies from NCBI and extracts genomic contexts around your query proteins.

# Default: ±10kb window (20000 nucleotides total)
hoodini run --input proteins.txt --output my_analysis

# Custom window size in nucleotides (±15kb = 30000 total)
hoodini run --input proteins.txt --output my_analysis \
  --win-mode win_nts --win 30000

# Window by gene count (20 genes total, ±10 on each side)
hoodini run --input proteins.txt --output my_analysis \
  --win-mode win_genes --win 20

# Set minimum window size requirements
hoodini run --input proteins.txt --output my_analysis \
  --win 20000 --min-win 5000 --min-win-type both

# Use local assemblies instead of downloading
hoodini run --input proteins.txt --output my_analysis \
  --assembly-folder /path/to/assemblies

# Limit concurrent downloads (default: 8)
hoodini run --input proteins.txt --output my_analysis \
  --max-concurrent-downloads 4

Window options:

| Option | Values | Description |
|---|---|---|
| --win-mode | win_nts, win_genes | Define window by nucleotides or gene count |
| --win | integer | Window size (total, split ±half on each side) |
| --min-win | integer | Minimum acceptable window size |
| --min-win-type | total, upstream, downstream, both | How to apply minimum window requirement |
⏱️

Download time: Depends on how many assemblies need downloading. ~100 assemblies typically take 5-15 minutes. Use --max-concurrent-downloads to adjust parallelism.
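The window arithmetic described above can be sketched in a few lines (this is an illustrative helper, not Hoodini's code): the `--win` total is split in half around the target gene's midpoint and clamped to the contig boundaries.

```python
def window_bounds(gene_start, gene_end, win, contig_len):
    """Return (start, end) of a ±win/2 window around a gene, clamped to the contig."""
    mid = (gene_start + gene_end) // 2
    half = win // 2
    return max(0, mid - half), min(contig_len, mid + half)

# Target gene at 50,000-51,000 on a 60 kb contig with --win 30000:
print(window_bounds(50_000, 51_000, 30_000, 60_000))  # (35500, 60000), clamped at the 3' end
```

A clamped window like this is one case where a `--min-win` check would reject neighborhoods that end up too short.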

Stage 4: Cluster Proteins

Groups similar proteins into families for easier visualization. Proteins in the same cluster get the same color in the viewer.

# Diamond DeepClust (default, fast hierarchical clustering)
hoodini run --input proteins.txt --output my_analysis \
  --clust-method diamond_deepclust

# DeepMMseqs (MMseqs2-based deep clustering)
hoodini run --input proteins.txt --output my_analysis \
  --clust-method deepmmseqs

# JackHMMER iterative search
hoodini run --input proteins.txt --output my_analysis \
  --clust-method jackhmmer

# BLASTp all-vs-all
hoodini run --input proteins.txt --output my_analysis \
  --clust-method blastp

Clustering methods (--clust-method):

| Method | Speed | Description |
|---|---|---|
| diamond_deepclust | ⚡ Fast | Hierarchical clustering with Diamond. Best for most cases |
| deepmmseqs | 🐢 Slow | MMseqs2-based deep clustering. Good sensitivity |
| jackhmmer | 🐢 Slow | Iterative HMM search. Best sensitivity for divergent proteins |
| blastp | ⏱️ Medium | Traditional all-vs-all BLAST. Most thorough |

Stage 5: Build Phylogenetic Tree

Creates a tree showing relationships between genomes. The tree determines how neighborhoods are ordered in the visualization.

# Taxonomy-based tree (instant, uses NCBI hierarchy)
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode taxonomy

# Fast neighbor-joining tree
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode fast_nj

# AAI-based tree (protein similarity)
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode aai_tree

# ANI-based tree (nucleotide identity)
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode ani_tree

# Fast ML tree
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode fast_ml

# Use your own tree file
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode use_input_tree --tree-file my_tree.nwk

# FoldMason structure-based tree
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode foldmason_tree

# Neighborhood similarity tree
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode neigh_similarity_tree

# Neighborhood phylogenetic tree
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode neigh_phylo_tree

Tree building methods (--tree-mode):

| Mode | Speed | Description |
|---|---|---|
| taxonomy | ⚡ Instant | Builds tree from NCBI taxonomy hierarchy. Pairwise distances computed from taxonomic rank differences |
| fast_nj | ⚡ Fast | FAMSA pairwise distance matrix → DecentTree neighbor-joining |
| aai_tree | 🔄 Medium | Requires --pairwise-aai. DecentTree NJ from AAI distances (100 - AAI%). Missing pairs filled with max+2σ |
| ani_tree | 🔄 Medium | Requires --pairwise-ani. DecentTree NJ from ANI distances (100 - ANI%). Only meaningful above ~70% ANI |
| fast_ml | 🐢 Slow | FAMSA alignment → VeryFastTree maximum likelihood |
| use_input_tree | ⚡ Instant | Load user-provided Newick file. Requires --tree-file path/to/tree.nwk |
| foldmason_tree | 🐢 Slow | Maps proteins to UniProt → fetches AlphaFold structures → FoldMason structural alignment → VeryFastTree. Falls back to fast_ml if mapping fails |
| neigh_similarity_tree | 🔄 Medium | Jaccard distance from neighborhood gene family presence/absence matrix → hierarchical clustering |
| neigh_phylo_tree | 🐢 Slow | Position-weighted neighborhood gene content (genes closer to target weighted higher) → cosine distance → hierarchical clustering |
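The aai_tree distance construction described above (distance = 100 - AAI%, missing pairs filled with max + 2σ of the observed distances) can be sketched with NumPy. The matrix values are invented and this is an illustration of the stated rule, not Hoodini's exact code:

```python
import numpy as np

# Pairwise AAI% for three genomes; NaN marks a pair with no AAI value.
aai = np.array([
    [100.0,  80.0, np.nan],
    [ 80.0, 100.0,  65.0],
    [np.nan, 65.0, 100.0],
])

dist = 100.0 - aai                       # AAI% -> distance
observed = dist[~np.isnan(dist)]
fill = observed.max() + 2 * observed.std()  # max + 2 sigma fill for missing pairs
dist = np.where(np.isnan(dist), fill, dist)
np.fill_diagonal(dist, 0.0)              # self-distance stays zero
```

The filled matrix is then a valid input for neighbor-joining, with unknown pairs pushed safely beyond every observed distance.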

Target protein trees (taxonomy, fast_nj, fast_ml, foldmason_tree, use_input_tree):

  • Group neighborhoods by the evolutionary relationship of the target protein
  • taxonomy: fastest, groups by species/genus/family but doesn’t reflect sequence divergence
  • fast_nj: good balance of speed and accuracy for most cases
  • fast_ml: more accurate topology, slower
  • foldmason_tree: best for remote homologs where sequence similarity is low but structure is conserved
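The neighborhood-based modes (neigh_similarity_tree, neigh_phylo_tree) instead order genomes by gene content. A minimal sketch of the Jaccard distance behind neigh_similarity_tree, with made-up gene family IDs:

```python
# Each neighborhood becomes a set of gene-family IDs; pairwise Jaccard
# distance (1 - |A ∩ B| / |A ∪ B|) then feeds hierarchical clustering.
hoods = {
    "hood1": {"famA", "famB", "famC"},
    "hood2": {"famA", "famB", "famD"},
    "hood3": {"famX", "famY"},
}

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance(hoods["hood1"], hoods["hood2"]))  # 0.5: 2 shared of 4 total
print(jaccard_distance(hoods["hood1"], hoods["hood3"]))  # 1.0: nothing shared
```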

Stage 6: Pairwise Comparisons (Optional)

Compute similarities between neighborhoods for visualization links.

# Enable protein links (colored ribbons between similar proteins)
hoodini run --input proteins.txt --output my_analysis --prot-links

# Enable nucleotide links (synteny ribbons between neighborhoods)
hoodini run --input proteins.txt --output my_analysis --nt-links

# Both
hoodini run --input proteins.txt --output my_analysis \
  --prot-links --nt-links

# Configure nucleotide alignment method
hoodini run --input proteins.txt --output my_analysis \
  --nt-links --nt-aln-mode blastn

# Configure AAI calculation for trees
hoodini run --input proteins.txt --output my_analysis \
  --tree-mode aai_tree --aai-mode wgrr --min-pident 30

Pairwise comparison options:

| Option | Values | Description |
|---|---|---|
| --nt-aln-mode | blastn, fastani, minimap2, intergenic_blastn | Method for nucleotide alignments |
| --ani-mode | skani, blastn | ANI calculation method for trees |
| --aai-mode | wgrr, aai | wGRR (weighted Gene Repertoire Relatedness) or plain AAI |
| --aai-subset-mode | target_prot, target_region, window | Which proteins to include in AAI |
| --min-pident | float | Minimum percent identity for BLAST hits |
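For intuition on the wgrr option: wGRR is commonly computed as the sum of identity fractions over bidirectional best hits between two gene sets, normalized by the smaller set. This rough sketch of that idea uses invented hit data and makes no claim about Hoodini's exact implementation:

```python
def wgrr(bbh_identities, n_a, n_b):
    """bbh_identities: identity fractions (0-1) of bidirectional best hits
    between neighborhoods A (n_a proteins) and B (n_b proteins)."""
    return sum(bbh_identities) / min(n_a, n_b)

# 3 bidirectional best hits between a 5-protein and a 4-protein neighborhood:
print(wgrr([0.9, 0.8, 0.7], n_a=5, n_b=4))  # 0.6
```

Unlike plain AAI (the mean identity of matched pairs), wGRR also penalizes proteins with no match, so sparse sharing lowers the score.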

Stage 7: Annotations (Optional)

Add functional annotations to proteins and neighborhoods.

Protein domains via MetaCerberus (Pfam, TIGRfam, COG, etc.)

# Annotate with Pfam domains
hoodini run --input proteins.txt --output my_analysis --domains pfam

# Multiple databases (comma-separated)
hoodini run --input proteins.txt --output my_analysis \
  --domains pfam,tigrfam,cog

Available domain databases: pfam, tigrfam, cog, kegg, cazy, vog, phrogs

Stage 8: Export Visualization

Package everything into the interactive viewer.

The CLI automatically exports the visualization. Find it at:

# Open the visualization (macOS)
open my_analysis/hoodini-viz/hoodini-viz.html

# Open the visualization (Linux)
xdg-open my_analysis/hoodini-viz/hoodini-viz.html

# Or just open in your browser manually
firefox my_analysis/hoodini-viz/hoodini-viz.html

Output Structure

my_analysis/
├── all_neigh.tsv
├── records.csv
└── hoodini-viz/
    ├── hoodini-viz.html
    ├── tree.nwk
    └── (data subfolder)/
        ├── gff.parquet
        ├── hoods.parquet
        ├── protein_metadata.parquet
        ├── tree_metadata.parquet
        ├── domains.parquet
        ├── nucleotide_links.parquet
        └── protein_links.parquet

Adding Custom Columns

One of Hoodini’s powerful features is passing custom metadata through the pipeline to the visualization. Any extra columns you add will appear in the viewer’s tooltip and can be used for filtering/coloring.

💡

There are two ways to add custom columns:

  1. Input sheet - Add columns to your TSV input file
  2. Python - Add columns to the Polars DataFrames mid-pipeline

Method 1: Input Sheet with Custom Columns

Create a TSV file with your custom columns alongside the required ones.

Create my_samples.tsv:

my_samples.tsv
protein_id	nucleotide_id	sample_source	collection_date	host_species	my_category
WP_010922251.1	NC_002516.2	soil	2023-01-15	environmental	group_A
WP_002989955.1	NC_003028.3	clinical	2022-06-20	human	group_B
NP_472073.1	NC_000964.3	marine	2021-11-30	fish	group_A

Then run:

hoodini run --inputsheet my_samples.tsv --output my_analysis --num-threads 8

Your custom columns (sample_source, collection_date, host_species, my_category) will flow through the entire pipeline and appear in the visualization.

Method 2: Adding Columns Mid-Pipeline (Python only)

You can add or modify columns at any point during the pipeline using Polars operations:

import polars as pl

# === After initialization: Add computed columns ===
records = records.with_columns(
    # Classify based on taxonomy
    pl.when(pl.col("taxid").is_in([562, 573, 287]))
      .then(pl.lit("pathogen"))
      .otherwise(pl.lit("environmental"))
      .alias("pathogen_status"),
    # Add a constant label
    pl.lit("experiment_2024").alias("batch_id"),
)

# === Join external metadata ===
external_data = pl.read_csv("my_annotations.tsv", separator="\t")
records = records.join(external_data, on="protein_id", how="left")

# === After extraction: Add to protein metadata ===
all_prots = all_prots.with_columns(
    pl.when(pl.col("length") > 500)
      .then(pl.lit("large"))
      .otherwise(pl.lit("small"))
      .alias("size_category")
)

# === Add to neighborhood data ===
all_neigh = all_neigh.with_columns(
    pl.col("gc_content").round(2).alias("gc_percent")
)
💡

When to use each method:

  • Input sheet: Best for metadata you already have (sample info, experimental conditions)
  • Mid-pipeline Python: Best for computed values (classifications, joined data, derived metrics)

Required vs Custom Columns

These columns are required for inputsheet mode:

| Column | Description |
|---|---|
| protein_id | NCBI protein accession |
| nucleotide_id | NCBI nucleotide accession |

If providing local files, also include:

| Column | Description |
|---|---|
| gff_path | Path to GFF3 annotation file |
| fna_path | Path to genome FASTA file |
| faa_path | Path to protein FASTA file |

Viewing Custom Data in the Visualization

After running the pipeline, your custom columns appear in:

  1. Tree metadata - Hover over leaves to see sample info
  2. Neighborhood tooltips - Click neighborhoods to see associated metadata
  3. Filtering sidebar - Use custom columns to filter/highlight specific groups
  4. Color by - Categorical columns can be used to color the tree or neighborhoods

Complete Examples

run_hoodini.sh
#!/bin/bash
# Complete Hoodini pipeline with all options

hoodini run \
  --inputsheet cas9_proteins.tsv \
  --output cas9_analysis \
  --num-threads 8 \
  --win-mode win_nts \
  --win 20000 \
  --cand-mode any_ipg \
  --clust-method diamond_deepclust \
  --tree-mode taxonomy \
  --domains pfam,tigrfam \
  --cctyper \
  --genomad \
  --ncrna

echo "🎉 Done! Open cas9_analysis/hoodini-viz/hoodini-viz.html"
