
CLI Reference

Hoodini provides a powerful command-line interface for comparative genomics analysis.

Commands Overview

  • hoodini run: the main pipeline command that orchestrates the complete analysis workflow.
  • hoodini download: downloads the databases and resources used by Hoodini.
  • hoodini utils: helper commands for working with metadata.


hoodini run

The main pipeline command that orchestrates the complete analysis workflow.

hoodini run --input <accessions.txt> --output results/

Input Options

Single file or literal input

# File with one accession per line
hoodini run --input accessions.txt

# Literal protein accession
hoodini run --input "WP_012345678.1"

When given a literal accession, Hoodini runs a remote BLAST search to expand the search set. The search is controlled by --remote-evalue and --remote-max-targets.
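As a minimal sketch of preparing a file input, the snippet below tidies a raw list into one accession per line before handing it to hoodini run. The accession WP_000000001.1 and the file name raw_accessions.txt are placeholders, and removing blank lines and duplicates is a precaution assumed here, not a documented requirement.

```shell
# Hypothetical raw list containing a blank line and a duplicate entry.
printf '%s\n' 'WP_012345678.1' '' 'WP_012345678.1' 'WP_000000001.1' > raw_accessions.txt

# Keep one accession per line: drop blank lines, then de-duplicate.
grep -v '^[[:space:]]*$' raw_accessions.txt | sort -u > accessions.txt

n_accessions=$(wc -l < accessions.txt | tr -d ' ')
echo "accessions.txt holds $n_accessions unique accessions"

# Print the command instead of running it; remove "echo" to execute.
echo hoodini run --input accessions.txt --output results/
```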

Output & Config

| Option | Type | Default | Description |
|---|---|---|---|
| --config | path | | TOML config file |
| --output | path | results | Output directory |
| --force | flag | false | Overwrite existing output |
| --keep | flag | false | Keep intermediate files |

Performance

| Option | Type | Default | Description |
|---|---|---|---|
| --num-threads | int | 10 | Number of threads |
| --max-concurrent-downloads | int | 8 | Parallel NCBI downloads |
| --api-key | str | | NCBI API key (or NCBI_API_KEY env var) |
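The performance options can be combined as in the sketch below, which sizes --num-threads to the machine and relies on the NCBI_API_KEY environment variable instead of the --api-key flag. Matching the thread count to the CPU count is a choice made for this example, not a recommendation from Hoodini; the command is printed, not executed.

```shell
# Detect the number of online CPUs; fall back to 4 if getconf cannot report it.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# The key is taken from the NCBI_API_KEY environment variable when it is set,
# so the --api-key flag is left out of the command here.
if [ -z "${NCBI_API_KEY:-}" ]; then
  echo "note: no NCBI API key found in NCBI_API_KEY" >&2
fi

cmd="hoodini run --input accessions.txt --output results/ --num-threads $threads --max-concurrent-downloads 8"

# Print the assembled command for inspection.
echo "$cmd"
```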

Data Sources

| Option | Type | Default | Description |
|---|---|---|---|
| --assembly-folder | path | | Use local assemblies instead of downloading |

Neighborhood Window

| Option | Type | Default | Description |
|---|---|---|---|
| --win-mode | str | win_nts | Window mode: win_nts or win_genes |
| --win | int | 20000 | Window size (nucleotides or genes) |
| --min-win | int | 2000 | Minimum window per side |
| --min-win-type | str | both | total, upstream, downstream, or both |
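Because --win is counted in nucleotides or in genes depending on --win-mode, the two settings should be chosen together. The helper below, win_args, is a hypothetical function written for this illustration. The win_nts branch repeats the documented defaults, while the value of 10 genes is an arbitrary example; --min-win is omitted in gene mode because its unit there is not stated on this page.

```shell
# Map a window mode to a matching set of example arguments.
win_args() {
  case "$1" in
    win_nts)   echo "--win-mode win_nts --win 20000 --min-win 2000" ;;
    win_genes) echo "--win-mode win_genes --win 10" ;;
    *)         echo "unknown window mode: $1" >&2; return 1 ;;
  esac
}

# The substitution is left unquoted on purpose so it splits into separate arguments.
# Print the command instead of running it; remove "echo" to execute.
echo hoodini run --input accessions.txt $(win_args win_genes)
```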

Clustering

| Option | Type | Default | Description |
|---|---|---|---|
| --cand-mode | str | best_id | Candidate selection mode (see below) |
| --clust-method | str | diamond_deepclust | Clustering method |

Candidate selection modes (--cand-mode)

  • best_id (default): Pick a single best representative per input homolog, prioritizing assembly quality and edge proximity. The protein ID matches the original query when possible.
  • best_ipg: Same as best_id, but allows a different protein ID (e.g., GenBank vs RefSeq versions of the same protein).
  • same_id: Keep all Identical Protein Groups (IPG) records that share the query's protein ID. Especially relevant for non-redundant proteins (WP_, YP_, NP_) present in multiple assemblies.
  • any_ipg: Keep all identical proteins from IPG regardless of ID. This can cause massive expansion if the protein exists in thousands of assemblies.
  • one_id: Keep the first IPG record per input homolog, in the order returned by NCBI IPG, regardless of assembly quality.

Warning: Using any_ipg or same_id can dramatically increase the number of neighborhoods if your query protein is highly conserved across many assemblies.
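One way to compare selection modes without triggering the expansion described in the warning is to try only the modes that keep a single record per input homolog, each in its own output directory. The sketch below prints the three commands without running them; the results_<mode>/ directory naming is an assumption of this example.

```shell
# Build one command per single-representative mode, each with its own output directory.
cmds=$(
  for mode in best_id best_ipg one_id; do
    echo "hoodini run --input accessions.txt --cand-mode $mode --output results_$mode/"
  done
)

# Show the commands; pipe them to "sh" to execute them in sequence.
printf '%s\n' "$cmds"
```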

Pairwise Comparisons

| Option | Type | Default | Description |
|---|---|---|---|
| --prot-links | flag | false | Compute protein similarity links |
| --nt-links | flag | false | Compute nucleotide links |
| --ani-mode | str | fastani | ANI calculation: skani or blastn |
| --nt-aln-mode | str | blastn | Nucleotide alignment: blastn, fastani, minimap2, intergenic_blastn |
| --min-pident | float | 30.0 | Minimum percent identity for AAI/wGRR |

Tree Construction

| Option | Type | Default | Description |
|---|---|---|---|
| --tree-mode | str | fast_ml | Tree building method (see below) |
| --tree-file | path | | Input Newick tree for use_input_tree |
| --aai-mode | str | wgrr | AAI mode: wgrr or aai |
| --aai-subset-mode | str | target_region | Subset for AAI tree: target_prot, target_region, window |

Tree modes (--tree-mode)

| Mode | Description |
|---|---|
| taxonomy | NCBI taxonomy distances with single-linkage clustering |
| fast_nj | FAMSA distance matrix → DecentTree NJ/UPGMA |
| fast_ml | FAMSA alignment → VeryFastTree (default) |
| aai_tree | AAI/wGRR pairwise distances → DecentTree |
| ani_tree | ANI pairwise distances → DecentTree |
| use_input_tree | Load from --tree-file |
| foldmason_tree | AlphaFold structures → foldmason MSA → VeryFastTree |
| neigh_similarity_tree | Jaccard distance on protein cluster presence/absence |
| neigh_phylo_tree | Weighted neighborhood similarity using gene positions |
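Since use_input_tree depends on a file passed with --tree-file, a quick sanity check before launching the pipeline can save a failed run. The sketch below tests only that the file is non-empty and ends with the semicolon that terminates a Newick tree. The function check_newick, the file my_tree.nwk, and the leaf names A, B and C are hypothetical; how leaf names must relate to the input accessions is not covered here.

```shell
# Return success only if the file is non-empty and its last
# non-whitespace character is the Newick terminator ";".
check_newick() {
  [ -s "$1" ] || { echo "tree file missing or empty: $1" >&2; return 1; }
  last=$(tr -d '[:space:]' < "$1" | tail -c 1)
  [ "$last" = ";" ] || { echo "tree does not end with ';': $1" >&2; return 1; }
}

# Placeholder tree with made-up leaf names.
printf '(A,(B,C));\n' > my_tree.nwk

if check_newick my_tree.nwk; then
  # Print the command instead of running it; remove "echo" to execute.
  echo hoodini run --input accessions.txt --tree-mode use_input_tree --tree-file my_tree.nwk
fi
```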

Remote BLAST

For single-query expansion:

| Option | Type | Default | Description |
|---|---|---|---|
| --remote-evalue | float | 1e-5 | E-value for remote BLAST |
| --remote-max-targets | int | 100 | Max hits for remote BLAST |
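The remote BLAST options only matter when the input is a literal accession. The sketch below adds them only in that case. The helper build_cmd is hypothetical, and its file-existence test is a stand-in for however Hoodini itself distinguishes a file from a literal, which this page does not specify. The values 1e-10 and 50 are illustrative, chosen to be stricter than the defaults.

```shell
# Add remote BLAST settings only when the input is not an existing file.
build_cmd() {
  input=$1
  if [ -f "$input" ]; then
    echo "hoodini run --input $input"
  else
    echo "hoodini run --input $input --remote-evalue 1e-10 --remote-max-targets 50"
  fi
}

# A literal accession, so the remote BLAST flags are included.
build_cmd "WP_012345678.1"
```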

Annotations

| Option | Type | Default | Description |
|---|---|---|---|
| --padloc | flag | false | Run PADLOC defense system detection |
| --deffinder | flag | false | Run DefenseFinder |
| --cctyper | flag | false | Run CCTyper CRISPR-Cas typing |
| --genomad | flag | false | Run geNomad virus/plasmid detection |
| --ncrna | flag | false | Infernal ncRNA prediction |
| --domains | str | | Comma-separated MetaCerberus domains |
| --emapper | flag | false | Run eggNOG-mapper |
| --blast | path | | FASTA file for BLAST search against neighborhood nucleotides (e.g., IS elements) |
| --sorfs | flag | false | Re-annotate small ORFs in extracted regions |
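Annotation flags can be stacked in one run. The sketch below pairs two defense-system annotators with a custom --blast query and first confirms that the FASTA file holds at least one record. The file is_elements.fasta and its single made-up sequence are placeholders for a real query set; the command is printed, not executed.

```shell
# Placeholder FASTA with one made-up record.
printf '>IS_example\nATGCATGCATGC\n' > is_elements.fasta

# Count FASTA records by counting header lines.
n_records=$(grep -c '^>' is_elements.fasta)

if [ "$n_records" -gt 0 ]; then
  # Print the command instead of running it; remove "echo" to execute.
  echo hoodini run --input accessions.txt --padloc --deffinder --blast is_elements.fasta
fi
```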

Logging

| Option | Type | Default | Description |
|---|---|---|---|
| --quiet | flag | false | Silence non-error output |
| --debug | flag | false | Verbose debug logging |

hoodini download

Download databases and resources used by Hoodini.

hoodini download databases --threads 8

Subcommands

Download all databases (~35 GB total)

Downloads all required databases for full functionality: PADLOC, DefenseFinder, geNomad, eggNOG-mapper, and supporting files.

hoodini download databases [OPTIONS]
| Option | Description |
|---|---|
| --force | Re-download existing files |
| --threads | Number of threads |
| --skip-padloc | Skip PADLOC models |
| --skip-deffinder | Skip DefenseFinder models |
| --skip-genomad | Skip geNomad database |
| --skip-emapper | Skip eggNOG-mapper data |
| --skip-parquet | Skip parquet files |
| --skip-contig-lengths | Skip contig length database |
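When only some annotators are needed, the --skip-* flags can cut the size of the download. The sketch below builds the flags from a short list of components to leave out. Skipping emapper and genomad is an arbitrary choice for illustration, and the command is printed, not executed.

```shell
# Components to leave out; each name must match one of the --skip-* options above.
skip="emapper genomad"

flags=""
for name in $skip; do
  flags="$flags --skip-$name"
done

cmd="hoodini download databases --threads 8$flags"

# Print the assembled command for inspection.
echo "$cmd"
```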

hoodini utils

Helper commands for working with metadata.

hoodini utils nuc2asmlen --output out.tsv input.tsv

Subcommands

Convert nucleotide IDs to assembly lengths

hoodini utils nuc2asmlen --output out.tsv input.tsv

Takes a TSV with nucleotide/contig accessions and adds assembly and contig length metadata.


Configuration File

You can supply a TOML config file using --config. CLI flags always override config values.

hoodini run --config my_config.toml --input accessions.txt

Default values are defined in hoodini/config/defaults.toml under sections: general, window, tree, aai, ani, clustering, annotations, pairwise, and paths.

See outputs for details on output files and directory structure.
