
✨ What's new

gget officially became part of scverse on June 9, 2026. 🥳🥳🥳

Version ≥ 0.30.8 (Jun 28, 2026):

  • gget g2p: Either gene or --uniprot_id is now sufficient — whichever is missing is resolved via UniProt and cached. Gene→UniProt picks the canonical reviewed human Swiss-Prot entry; the resolution and its limitations are logged. The canonical pair is always prepended to the result as gene_name / uniprot_id columns (and stored on df.attrs), so the output schema is invariant regardless of input mode. Existing call sites continue to work.
    • New residues= filter (Python: int / list / range / set; CLI --residues 100,200,300 or 100-200) restricts features / alignment to specific positions client-side.
    • map results gain a parsed PDB Ids List column (list[str]) alongside the comma-joined PDB Ids string, for direct chaining into gget pdb.
    • Fixed silent failure when the gene/UniProt pair was unknown: G2P returns HTTP 200 with a JSON {"status":"failure",...} body that was being parsed as a single TSV column. Now logged as an error and returns None.
    • All failure modes now return None (was a mix of None and empty DataFrame).
    • Added retries with exponential backoff on transient failures (5xx, connection errors, timeouts).
    • URL-encoded path segments.
    • New out= Python argument writes the result to an explicit CSV path (takes precedence over save).
  • gget alphafold: Added a new jackhmmer_savedir argument (-jhd/--jackhmmer_savedir on the command line) that lets you choose where the temporary jackhmmer files are stored. By default, gget alphafold still creates a ~/tmp/jackhmmer/ folder in your home directory (which can take up to ~2 GB of disk space); the new argument lets you redirect these files elsewhere, e.g. to a disk with more free space. Resolves issue 49.
  • gget pdb: Added support for the PDBx/mmCIF structure format (fixes issue 178 and issue 177).
    • New resource="mmcif" option downloads the structure in PDBx/mmCIF format (.cif).
    • The default resource="pdb" now automatically falls back to PDBx/mmCIF when the legacy PDB file is unavailable (e.g. for large structures), since the legacy PDB format is being phased out by RCSB. A warning is logged and saved files use the correct extension (.cif).
  • gget opentargets: Adapted to several upstream Open Targets GraphQL API changes:
    • Fixed: gget opentargets resource="drugs" was failing with HTTP 400 — Field 'synonyms' of type '[DrugLabelAndSource!]!' must have a sub selection. Open Targets changed Drug.synonyms and Drug.tradeNames from scalar lists to lists of structured DrugLabelAndSource objects. The internal GraphQL query was updated to request the label sub-field; the drug.synonyms / drug.tradeNames columns in the returned DataFrame remain list[str], so existing user code is unaffected.
    • Fixed: gget opentargets resource="expression" had started returning an empty DataFrame because Open Targets retired the Target.expressions field. The query now uses the current Target.baselineExpression field. The returned columns have changed because the upstream data model changed: results are now per-biosample (tissue/cell type) baseline expression summary statistics (median, min, q1, q3, max, unit) with tissueBiosample/celltypeBiosample identifiers and datasourceId/datatypeId, instead of the old per-tissue RNA zscore/value/level fields. A gene can have thousands of biosamples; the page size follows --limit (capped at the API's 3000-per-request maximum), and a warning is logged if results are truncated — narrow with --filters (e.g. datasourceId, datatypeId). Resolves issue 247.
    • Docs: Clarified the meaning of the diseases resource score column (OpenTargets' single overall target–disease association score, 0–1, aggregated across all data types/sources — not a per-data-source score) and that disease.id values are EFO-mapped traits that include not only MONDO diseases but also phenotypes (HP_*) and measurements (EFO_*), with an example of how to filter to MONDO disease terms only. Addresses issue 168.
  • gget archs4 (tissue mode): No longer crashes with KeyError: ['color'] not found in axis when ARCHS4 intermittently omits the optional color column from its CSV response. The column is now dropped only if present. Output also has a deterministic row order (sorted by median descending, with id as tiebreaker) so equal-median tissues no longer flip order between requests.
  • gget bgee: Outbound Bgee API requests now send a User-Agent: gget/<version> header so the Bgee service can attribute (and, if needed, allow-list) gget traffic, improving resilience against intermittent request blocking.
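As a rough illustration of the residues= filter described above for gget g2p, the sketch below normalizes the accepted input shapes (int, list, range, set, or a CLI-style string such as 100,200,300 or 100-200) into a sorted list of positions. The helper name parse_residues is hypothetical and this is not gget's actual implementation; treating CLI ranges as inclusive is also an assumption.

```python
def parse_residues(spec):
    """Normalize a residue selection into a sorted list of positions.

    Hypothetical helper: it mirrors the input shapes named in the release
    note, not gget's internal code.
    """
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, str):
        positions = set()
        for token in spec.split(","):
            token = token.strip()
            if not token:
                continue
            if "-" in token:
                start, end = (int(part) for part in token.split("-", 1))
                if start > end:
                    raise ValueError(f"Invalid residue range: {token!r}")
                # Assumption: a CLI range such as 100-200 includes both ends.
                positions.update(range(start, end + 1))
            else:
                positions.add(int(token))
        return sorted(positions)
    # list / range / set (or any other iterable of integers)
    return sorted({int(p) for p in spec})


def filter_rows(rows, residues, key="position"):
    """Keep only the rows whose position is in the requested residues."""
    wanted = set(parse_residues(residues))
    return [row for row in rows if row[key] in wanted]
```

Filtering client-side in this way means the full result is fetched once and then narrowed locally, which matches the "client-side" wording of the release note.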

Version ≥ 0.30.7 (Jun 21, 2026):

  • gget cellxgene: Added support for the three non-human primate species available in the CZ CELLxGENE Census LTS 2025-11-08: rhesus macaque (macaca_mulatta), common marmoset (callithrix_jacchus), and chimpanzee (pan_troglodytes).
    • The species argument (both Python and command line) now accepts all five supported organisms; the CLI choices, help text, and docstrings list them.
    • Added early validation of the species argument that raises a clear ValueError listing the supported species, instead of failing later inside the Census API call.
    • Note: the new primate species require census_version="2025-11-08" (LTS) or newer.
  • Docs/README: updated gget repository and manual URLs from pachterlab to scverse (github.com/scverse/gget, scverse.org/gget) to reflect the project's move under the scverse organization. Links to separate resources (pachterlab/gget_examples, pachterlab/kvar, pachterlab/varseek, and the Pachter Lab homepage) were left unchanged. Resolves issue 217.
  • gget g2p: New module to query the Genomics 2 Proteins (G2P) portal for residue-level protein structure/function annotations — per-residue features (AlphaFold pLDDT, UniProt sites, predicted pockets, PTMs), the gene–transcript–protein–isoform–structure map, and isoform alignments. Resolves issue 138.
  • Deprecations:
    • gget alphafold and gget gpt are no longer actively maintained. Both now emit a warning when invoked, and a deprecation notice was added to the top of each module's docs.
  • Bug fixes:
    • gget search: Missing values are now consistently returned as None instead of NaN, both in scalar cells and inside synonym lists ([None] rather than [nan] for genes with no synonyms). The previous output was an artifact of SQL LEFT JOINs surfacing as pandas NaNs; the JSON output was already null either way, so this only affects the DataFrame return path.
    • gget mutate: Fixed pyarrow.lib.ArrowNotImplementedError: Function 'binary_join_element_wise' has no kernel matching input types (large_string, null, large_string) when the input contained no substitutions (only deletions/insertions/delins/duplications/inversions). The substitution-only sequence-build branches now short-circuit on an empty selection instead of triggering an arrow-string kernel that older pyarrow versions don't implement.
    • gget muscle and gget diamond: When the bundled MUSCLE / DIAMOND binary fails because a system library is missing (most commonly libgomp / libomp on macOS without Homebrew gcc/libomp installed), gget now raises a clear RuntimeError naming the missing library and the exact brew install / apt install command to fix it — instead of the raw dyld / ld.so error spilling onto stderr.
    • gget muscle: When the MUSCLE binary fails, gget now exits with a non-zero status code instead of silently exiting 0. Pipelines and CI scripts can finally detect MUSCLE failures programmatically.
  • Developer tooling / packaging:
    • Migrated packaging to a single pyproject.toml (the hatchling build backend); removed setup.py, setup.cfg, requirements.txt, dev-requirements.txt, and MANIFEST.in. Runtime dependencies and the test dependency group are now declared in pyproject.toml.
    • The minimum supported Python version is now 3.12; CI tests on 3.12, 3.13, and 3.14.
    • cellxgene-census is now declared as an optional extra (pip install gget[cellxgene]) because it has no wheels for Python 3.14 yet. Users who install it via gget setup cellxgene are unaffected.
    • Added a pre-commit configuration (lint + format via ruff, plus standard hygiene hooks). Run prek run --all-files (or pre-commit run --all-files) before opening a PR.
    • Added type annotations across the entire gget/ source tree (175 functions across all 28 source files) and wired mypy into the pre-commit config with a permissive baseline. IDE intellisense and downstream typed-code consumers benefit immediately; the type-error baseline can be tightened module-by-module over time.
    • Modernized the test CI to use uv and run on pull requests, and added package-build-check and PyPI trusted-publishing workflows.
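The early species validation described above for gget cellxgene can be sketched as follows. This is a minimal illustration, not gget's code: the function name validate_species is hypothetical, and the two non-primate entries (homo_sapiens, mus_musculus) are an assumption about which organisms make up the five supported species.

```python
SUPPORTED_SPECIES = (
    "homo_sapiens",        # assumption: previously supported species
    "mus_musculus",        # assumption: previously supported species
    "macaca_mulatta",      # rhesus macaque
    "callithrix_jacchus",  # common marmoset
    "pan_troglodytes",     # chimpanzee
)


def validate_species(species):
    """Fail fast with a clear message before any remote call is made."""
    if species not in SUPPORTED_SPECIES:
        raise ValueError(
            f"Unsupported species {species!r}. Supported species: "
            + ", ".join(SUPPORTED_SPECIES)
        )
    return species
```

Validating up front keeps the error message close to the user's mistake instead of surfacing it from deep inside a downstream API call.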

Version ≥ 0.30.6 (Jun 10, 2026):

  • gget blat: Improved resilience against UCSC BLAT endpoint failures (fixes intermittently failing tests).
    • Added retry-with-exponential-backoff for transient failures (HTTP 429/5xx, network errors, and non-JSON 200 responses caused by UCSC rate-limiting or HTML error pages). Up to 4 attempts with 1.5s → 3s → 6s backoff.
    • Replaced the misleading "sequence too short or assembly invalid" message with the actual server response (status code, response preview) so failures are diagnosable.
    • HTTPError and URLError are now caught explicitly instead of bubbling up as unhandled exceptions.
  • Bug fixes:
    • gget cosmic: Fixed misleading error message when the download step fails — was reporting the previous command's return code/stderr instead of the failing command's.
    • gget cosmic: Narrowed the JSON parse exception handler to json.JSONDecodeError so unrelated ValueErrors are no longer masked by the "Failed to download file" message.
    • gget --version, gget --help, gget invoked with no arguments, and gget <module> with no further arguments now all exit with status 0 instead of 1, so CI scripts and shell pipelines no longer treat these informational outputs as failures.
    • Added request timeouts to previously-unguarded requests calls in gget ref, gget info, gget 8cube, gget enrichr, and gget opentargets. Default is 10s connect / 60s read; configurable via the new DEFAULT_REQUESTS_TIMEOUT constant.
    • Narrowed a bare except: in utils.get_uniprot_seqs to (KeyError, IndexError, TypeError) so unrelated errors (including KeyboardInterrupt) are no longer swallowed.
    • Added utils.http_json() and utils.dig() helpers that issue a request and parse JSON / walk a nested response path with consistent error reporting. Migrated gget bgee, gget opentargets, and one .json() callsite in gget virus to use them; remaining modules will migrate opportunistically. Upstream HTML error pages, malformed JSON, and missing response keys now surface as clear RuntimeErrors naming the failing service instead of cryptic JSONDecodeError / KeyError tracebacks.
    • utils.http_json() now retries transient failures (connection errors, read timeouts, HTTP 5xx) up to 3 times with exponential backoff. Smooths over short upstream blips (e.g. bgee.org read timeouts) without affecting 4xx errors, which still raise immediately.
    • gget virus: Replaced 11 bare except: pass blocks around file.close() / os.remove() cleanup calls with narrowed except OSError handlers that log the failure at DEBUG. Previously, real I/O issues during cleanup (disk full, permissions) were silently dropped and the cleanup path also swallowed KeyboardInterrupt.
    • gget cbio: Fixed a code path in cbio_plot that called the removed-in-pandas-2.0 DataFrame.append() inside a loop when filling missing CNA genes — the entire branch crashed on modern pandas. It now builds a single DataFrame of missing rows and concatenates once.
  • Performance:
    • utils.get_uniprot_seqs: Collect per-ID DataFrames in a list and pd.concat(..., ignore_index=True) once at the end, avoiding the O(n²) cost of growing a DataFrame inside the request loop.
    • Cached utils.find_latest_ens_rel, utils.search_species_options, utils.ref_species_options, and utils.find_nv_kingdom with functools.lru_cache. These hit Ensembl FTP listings that are stable for a release; repeated calls within one Python process are now free.
    • Added utils.parallel_map, a thin ThreadPoolExecutor wrapper for I/O-bound work. Used to fan out utils.get_uniprot_seqs across the input ID list — looking up N IDs is now bounded by ~N / pool_size UniProt round-trips instead of N. Pool size defaults to 8 and can be overridden via the GGET_MAX_WORKERS environment variable.
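The retry behaviour described above for gget blat (up to 4 attempts with 1.5s → 3s → 6s backoff) can be sketched as below. This is a generic illustration under stated assumptions, not the code used in gget: the function name retry_with_backoff, its parameters, and the choice of retryable exception types are all hypothetical.

```python
import time


def retry_with_backoff(
    func,
    attempts=4,
    base_delay=1.5,
    retry_on=(ConnectionError, TimeoutError),  # assumption: what counts as transient
    sleep=time.sleep,
):
    """Call func(), retrying transient errors with a doubling delay."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delays are 1.5s, 3s, 6s with the default settings.
            sleep(base_delay * 2**attempt)
```

Errors that are not listed in retry_on propagate immediately, which corresponds to the note that 4xx responses still raise without retrying. The sleep parameter is injectable so that the schedule can be checked without actually waiting.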

Version ≥ 0.30.5 (May 23, 2026):

  • gget opentargets: Rewrote this module to reflect the new Open Targets API structure
    • Some output column/key names may differ to reflect the new API structure
    • Removed the --filter_mode argument
  • gget blast: Fixed compatibility with newer pandas versions (≥ 2.0) where pd.read_html() no longer accepts raw HTML strings directly, causing a FileNotFoundError / OSError: Filename too long error when parsing BLAST results
  • gget cosmic: Added overwrite and gzip arguments to internals.

Version ≥ 0.30.3 (Feb 26, 2026):

  • gget virus: New filtering options, quiet mode, and improved download reliability
    • Added --segment filter for segmented viruses (e.g., Influenza A segments like 'HA', 'NA', 'PB1')
    • Added --vaccine_strain filter to include or exclude vaccine strain sequences
    • Added --source_database filter to select sequences from 'genbank' or 'refseq' (replaces refseqOnly)
    • Added -q / --quiet flag to suppress progress information
    • Extended fallback strategies for improved download reliability on large datasets
    • Command summary file now includes software version

Version ≥ 0.30.2 (Feb 08, 2026):

  • gget virus: Metadata streaming optimization, improved protein filtering, and enhanced error handling and retry logic
    • Metadata now streams to disk during fetch to prevent memory exhaustion on large datasets (100,000+ records)
    • Fixed metadata CSV mapping (camelCase → snake_case) for organism name, host, and collection date
    • Enhanced protein filtering for segmented viruses with improved FASTA header parsing
    • Added annotated=False option for filtering unannotated sequences
    • Added progress bars to batched sequence downloads
    • Fixed collection date naming bug
    • Improved error messages for invalid filter dates
    • Added enhanced retry attempts for virus name resolution
    • Added verbosity to influenza A and COVID-19 checking steps

Version ≥ 0.30.0 (Jan 19, 2026):

  • NEW MODULES:
  • SECURITY IMPROVEMENTS:
    • Replaced os.system() calls that were built from f-strings containing URLs from external APIs in gget/main.py
    • Replaced exec() with importlib.import_module() in gget setup for safer dynamic imports
    • Replaced shell=True subprocess calls with list-based arguments in gget muscle, gget diamond, and gget setup to prevent command injection
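To illustrate why list-based subprocess arguments prevent command injection, the sketch below passes an untrusted string to a child process as a single argument. It is a generic example, not code from gget; the helper name echo_argument and the sample string are hypothetical.

```python
import subprocess
import sys


def echo_argument(value):
    """Pass `value` to a child process as one argument, with no shell involved."""
    result = subprocess.run(
        # Each list element is delivered to the child verbatim; nothing is
        # interpreted by a shell, so ';' or '&&' have no special meaning.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# A string that would run a second command if interpolated into a shell line.
untrusted = "example.org/file.txt; echo injected"
print(echo_argument(untrusted))
```

With shell=True and string interpolation, the text after the semicolon would be executed as a separate command; with a list, the child simply receives the whole string as data.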

Version ≥ 0.29.3 (Sep 11, 2025):

Version ≥ 0.29.2 (Jul 03, 2025):

  • gget can now be installed using uv pip install gget
    • All package metadata (version, author, description, etc.) is now managed in setup.cfg for full compatibility with modern tools like uv, pip, and PyPI
    • gget now uses a minimal setup.py and is fully PEP 517/518 compatible
  • gget setup will now try to use uv pip install first for speed and modern dependency resolution, and fall back onto pip install if uv fails or is not available
    • Users are informed at each step which installer is being used and if a retry is happening
    • Note: Some scientific dependencies (e.g., cellxgene-census) may not yet support Python 3.12. If you encounter installation errors, try using Python 3.9 or 3.10. (The pip installation might also still succeed in these cases.)
  • All required dependencies are now listed in setup.cfg under install_requires, so installing gget with pip install . or uv pip install . will automatically install all dependencies

Version ≥ 0.29.1 (Apr 21, 2025):

  • gget mutate:
    • gget mutate has been simplified: it now takes a list of mutations and the associated reference genome with corresponding annotation information as input, and returns the sequences with the mutation incorporated plus a short region of surrounding context. For the full functionality of the previous version, and how it fits into a novel variant screening pipeline, visit the varseek repository being developed by members of the gget team at https://github.com/pachterlab/varseek.git.
    • Added additional information to returned data frames as described here: https://github.com/scverse/gget/pull/169
  • gget cosmic:
    • Major restructuring of the gget cosmic module to adhere to new login requirements set by COSMIC
    • New arguments email and password were added so that users can pass their login credentials directly, allowing data download without interactive input
    • Default changed: gget_mutate=False
    • Deprecated argument: entity
    • Argument mutation_class is now cosmic_project
  • gget bgee:
    • type="orthologs" is now the default, removing the need to specify the type argument when calling orthologs
    • Allow querying multiple genes at once.
  • gget diamond:
    • Now supports translated alignment of nucleotide sequences to amino acid reference sequences using the --translated flag.
  • gget elm:
    • Improved server error handling.

Version ≥ 0.29.0 (Sep 25, 2024):

  • New modules:
  • gget enrichr now also supports species other than human and mouse (fly, yeast, worm, and fish) via modEnrichR
  • gget mutate:
    • gget mutate will now merge identical sequences in the final file by default.
    • Mutation creation was vectorized to decrease runtime.
    • Improved flanking sequence check for non-substitution mutations to make sure no wildtype kmer is retained in the mutation-containing sequence.
    • Added several new arguments to customize sequence generation and output.
  • gget cosmic:
    • Added support for targeted as well as gene screens.
    • The CSV file created for gget mutate now also contains protein mutation info.
  • gget ref:
    • Added out file option.
  • gget info and gget seq:
    • Switched to Ensembl POST API to increase speed (nothing changes in front end).
  • Other "behind the scenes" changes:

Version ≥ 0.28.6 (Jun 2, 2024):

  • New module: gget mutate
  • gget cosmic: You can now download entire COSMIC databases using the download_cosmic argument
  • gget ref: Can now fetch the GRCh37 genome assembly using species='human_grch37'
  • gget search: Adjust access of human data to the structure of Ensembl release 112 (fixes issue 129)

Version ≥ 0.28.5 (May 29, 2024):

  • Yanked due to a logging bug in gget.setup("alphafold") and because inversion mutations in gget mutate only reversed the string instead of also computing the complementary strand
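The inversion issue noted above comes down to the difference between reversing a sequence and taking its reverse complement. The sketch below shows both; it is a generic illustration rather than the gget mutate implementation, and the function names and the 0-based, half-open coordinates are assumptions made for the example.

```python
# Translation table for DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_only(seq):
    """Reverse the string without complementing it (the incorrect behaviour)."""
    return seq[::-1]


def reverse_complement(seq):
    """Complement each base, then reverse: the sequence read from the other strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_region(seq, start, end):
    """Apply an inversion to seq[start:end] (0-based, half-open; an assumption)."""
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]
```

For the segment AACG, reversing alone gives GCAA, whereas the reverse complement is CGTT, so the two approaches produce different mutated sequences whenever the segment is not its own reverse complement.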

Version ≥ 0.28.4 (January 31, 2024):

  • gget setup: Fix bug with filepath when running gget.setup("elm") on Windows OS.

Version ≥ 0.28.3 (January 22, 2024):

  • gget search and gget ref now also support fungi 🍄, protists 🌝, and invertebrate metazoa 🐝 🐜 🐌 🐙 (in addition to vertebrates and plants)
  • New module: gget cosmic
  • gget enrichr: Fix duplicate scatter dots in plot when pathway names are duplicated
  • gget elm:
    • Changed ortho results column name 'Ortholog_UniProt_ID' to 'Ortholog_UniProt_Acc' to correctly reflect the column contents, which are UniProt Accessions. 'UniProt ID' was changed to 'UniProt Acc' in the documentation for all gget modules.
    • Changed ortho results column name 'motif_in_query' to 'motif_inside_subject_query_overlap'.
    • Added interaction domain information to results (new columns: "InteractionDomainId", "InteractionDomainDescription", "InteractionDomainName").
    • The regex string for regular expression matches was encapsulated as follows: "(?=(regex))" (instead of directly passing the regex string "regex") to enable capturing all occurrences of a motif when the motif length is variable and there are repeats in the sequence (https://regex101.com/r/HUWLlZ/1).
  • gget setup: Use the out argument to specify a directory the ELM database will be downloaded into. Completes this feature request.
  • gget diamond: The DIAMOND command is now run with --ignore-warnings flag, allowing niche sequences such as amino acid sequences that only contain nucleotide characters and repeated sequences. This is also true for DIAMOND alignments performed within gget elm.
  • gget ref and gget search back-end change: the current Ensembl release is fetched from the new release file on the Ensembl FTP site to avoid errors during uploads of new releases.
  • gget search:
    • FTP link results (--ftp) are saved in txt file format instead of json.
    • Fix URL links to Ensembl gene summary for species with a subspecies name and invertebrates.
  • gget ref:
    • Back-end changes to increase speed
    • New argument: list_iv_species to list all available invertebrate species (can be combined with the release argument to fetch all species available from a specific Ensembl release)
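The "(?=(regex))" wrapping described above for gget elm relies on a zero-width lookahead with a capturing group, which lets a regex engine report matches that overlap. The sketch below demonstrates the effect with Python's re module; the helper name find_motifs and the toy pattern are hypothetical and are not taken from gget.

```python
import re


def find_motifs(pattern, sequence, overlapping=True):
    """Return (start, matched_text) pairs for every occurrence of `pattern`.

    With overlapping=True the pattern is wrapped as (?=(pattern)): the
    lookahead consumes no characters, so the scan advances one position at a
    time and the inner group still captures the matched text.
    """
    if overlapping:
        regex = f"(?=({pattern}))"
        return [(m.start(), m.group(1)) for m in re.finditer(regex, sequence)]
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, sequence)]


plain = find_motifs("A.A", "ABABABA", overlapping=False)
wrapped = find_motifs("A.A", "ABABABA", overlapping=True)
print(plain)    # matches at positions 0 and 4 only
print(wrapped)  # matches at positions 0, 2 and 4
```

Without the wrapper, each match consumes its characters, so an occurrence that begins inside a previous match is skipped; this is the situation the release note describes for repeated, variable-length motifs.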

Version ≥ 0.28.2 (November 15, 2023):

  • gget info: Return a logging error message when the NCBI server fails for a reason other than a fetch fail (this is an error on the server side rather than an error with gget)
  • Replace deprecated 'text' argument to find()-type methods whenever used with dependency BeautifulSoup
  • gget elm: Remove false positive and true negative instances from returned results
  • gget elm: Add expand argument

Version ≥ 0.28.0 (November 5, 2023):

Version ≥ 0.27.9 (August 7, 2023):

  • gget enrichr: Use new argument background_list to provide a list of background genes
  • gget search now also searches Ensembl synonyms (in addition to gene descriptions and names) to return more comprehensive search results (thanks to Samuel Klein for the suggestion)

Version ≥ 0.27.8 (July 12, 2023):

  • gget search: Specify the Ensembl release from which information is fetched with new argument -r --release
  • Fixed bug in gget pdb (this bug was introduced in version 0.27.5)

Version ≥ 0.27.7 (May 15, 2023):

Version ≥ 0.27.6 (May 1, 2023) (YANKED due to problems with dependencies -> replaced with version 0.27.7):

Version ≥ 0.27.5 (April 6, 2023):

  • Updated gget search to function correctly with new Pandas version 2.0.0 (released on April 3rd, 2023) as well as older versions of Pandas
  • Updated gget info with new flags uniprot and ncbi which allow turning off results from these databases independently to save runtime (note: flag ensembl_only was deprecated)
  • All gget modules now feature a -q / --quiet (Python: verbose=False) flag to turn off progress information

Version ≥ 0.27.4 (March 19, 2023):

Version ≥ 0.27.3 (March 11, 2023):

  • gget info excludes PDB IDs by default to increase speed (PDB results can be included using flag --pdb / pdb=True).

Version ≥ 0.27.2 (January 1, 2023):

Version ≥ 0.27.0 (December 10, 2022):

  • Updated gget alphafold to match recent changes by DeepMind
  • Updated version number to match gget's creator's age following a long-standing Pachter lab tradition

Version ≥ 0.3.13 (November 11, 2022):

Version ≥ 0.3.12 (November 10, 2022):

  • gget info now also returns subcellular localisation data from UniProt
  • New gget info flag ensembl_only returns only Ensembl results
  • Reduced runtime for gget info and gget seq

Version ≥ 0.3.11 (September 7, 2022):

Version ≥ 0.3.10 (September 2, 2022):

Version ≥ 0.3.9 (August 25, 2022):

Version ≥ 0.3.8 (August 12, 2022):

  • Fixed mysql-connector-python version requirements

Version ≥ 0.3.7 (August 9, 2022):

  • NOTE: The Ensembl FTP site changed its structure on August 8, 2022. Please upgrade to gget version ≥ 0.3.7 if you use gget ref

Version ≥ 0.3.5 (August 6, 2022):

Version ≥ 0.2.6 (July 7, 2022):

  • gget ref now supports plant genomes! 🌱

Version ≥ 0.2.5 (June 30, 2022):

  • NOTE: UniProt changed the structure of their API on June 28, 2022. Please upgrade to gget version ≥ 0.2.5 if you use any of the modules querying data from UniProt (gget info and gget seq).

Version ≥ 0.2.3 (June 26, 2022):

  • JSON is now the default output format for the command-line interface for modules that previously returned data frame (CSV) format by default (the output can be converted to data frame/CSV using flag [-csv][--csv]). Data frame/CSV remains the default output for Jupyter Lab / Google Colab (and can be converted to JSON with json=True).
  • For all modules, the first required argument was converted to a positional argument and should not be named anymore in the command-line, e.g. gget ref -s human → gget ref human.
  • gget info: [--expand] is deprecated. The module will now always return all of the available information.
  • Slight changes to the output returned by gget info, including the return of versioned Ensembl IDs.
  • gget info and gget seq now support 🪱 WormBase and 🪰 FlyBase IDs.
  • gget archs4 and gget enrichr now also take Ensembl IDs as input with added flag [-e][--ensembl] (ensembl=True in Jupyter Lab / Google Colab).
  • gget seq argument seqtype was replaced by flag [-t][--translate] (translate=True/False in Jupyter Lab / Google Colab) which will return either nucleotide (False) or amino acid (True) sequences.
  • gget search argument seqtype was renamed to id_type for clarity (still taking the same arguments 'gene' or 'transcript').