Overview
OpenAntigens is generated as a local, versioned computational snapshot. Each target is processed into a report that combines target identity, topology, extracellular or membrane-protein design scope, structure confidence, experimental structure precedent, domain/family context, ortholog mappings, cross-reactivity searches, disease associations, and precomputed construct candidates.
Reports provide computational evidence for antigen construct design and antibody-discovery planning. Final construct choice still requires biological review, expression testing, and experimental validation.
Target universe and resolution
The target universe is sourced from the refined human cell-surface and secreted protein topology table under data/inputs/. The active public portal focuses on proteins categorized as secreted, GPI-anchored, single-pass membrane, or multipass membrane.
Each input row is resolved primarily through reviewed human UniProt records. Resolution prefers stable UniProt entry names or accessions when available and preserves the original input identifiers for traceability. Ambiguous or unresolved rows remain represented in batch outputs with explicit error status.
Topology and construct scope
Topology controls which design track is used. Secreted proteins use the mature extracellular protein as the soluble design scope. For GPI-anchored and single-pass proteins, the design scope emphasizes extracellular regions, with boundaries inferred from curated topology, signal peptide, propeptide, and transmembrane annotations.
Multipass proteins are treated separately. When the largest extracellular region exceeds 80 amino acids, OpenAntigens treats the protein as a mixed case with both soluble extracellular-region constructs and full-length membrane-expression context. When no extracellular region exceeds 80 amino acids, the report emphasizes membrane-protein expression and full-length context.
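The multipass rule can be sketched as a small function. This is a minimal illustration and not the pipeline's code: the function name, the track labels "mixed" and "membrane", and the 1-based inclusive region format are assumptions; only the 80-residue threshold comes from the text above.

```python
from typing import List, Tuple

# From the text: the largest extracellular region must exceed 80 amino acids.
MIXED_THRESHOLD_AA = 80


def multipass_track(extracellular_regions: List[Tuple[int, int]]) -> str:
    """Return an assumed track label for a multipass protein.

    Regions are 1-based inclusive (start, end) residue ranges.
    """
    largest = max(
        (end - start + 1 for start, end in extracellular_regions),
        default=0,  # no annotated extracellular region at all
    )
    return "mixed" if largest > MIXED_THRESHOLD_AA else "membrane"


print(multipass_track([(1, 35), (120, 260)]))  # largest region is 141 aa
print(multipass_track([(1, 20), (90, 110)]))   # largest region is 21 aa
```

Note that the comparison is strict: a region of exactly 80 residues stays on the membrane-protein track in this sketch.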
Sequence canonicality
OpenAntigens requires the report sequence, UniProt sequence, and AlphaFold model sequence to represent the same canonical isoform. Mismatching AlphaFold models are excluded from structure-dependent metrics because residue-number mismatches corrupt construct boundaries, pLDDT coloring, and sequence-to-structure selection.
AlphaFold DB models are used whenever a compatible canonical model is available. If AlphaFold DB artifact resolution fails and local AlphaFold 3 artifacts are enabled, the client accepts a local AlphaFold 3 model only when its stored sequence hash and parsed PDB sequence match the canonical report sequence. Prepared AlphaFold 3 artifacts are selected by AlphaFold Server ranking score, converted into the same PDB/PAE artifact shape used by the rest of the pipeline, and labeled separately in the portal.
When a target lacks a matching AlphaFold DB model or accepted local AlphaFold 3 artifact, non-structure-dependent annotations remain available; structure-derived segmentation and interactive structure features are limited.
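The acceptance check for a local AlphaFold 3 model can be illustrated as follows. This is a sketch under stated assumptions: the hash algorithm (SHA-256 here), the whitespace and case normalization, and the function names are hypothetical, since the text only says that the stored sequence hash and the parsed PDB sequence must both match the canonical report sequence.

```python
import hashlib


def sequence_hash(seq: str) -> str:
    """Hash a protein sequence after stripping whitespace and uppercasing.

    SHA-256 is an assumption; the source does not name the algorithm.
    """
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def accept_local_model(report_seq: str, stored_hash: str, parsed_pdb_seq: str) -> bool:
    """Accept the model only if both the stored hash and the parsed sequence match."""
    expected = sequence_hash(report_seq)
    return stored_hash == expected and sequence_hash(parsed_pdb_seq) == expected


canonical = "MKTLLVAGD"
print(accept_local_model(canonical, sequence_hash(canonical), "MKTLLVAGD"))  # both match
print(accept_local_model(canonical, sequence_hash(canonical), "MKTLLVAGE"))  # parsed sequence differs
```

Checking both values guards against a stale hash that no longer describes the coordinates actually stored in the artifact.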
AlphaFold confidence
pLDDT is used as local residue-confidence evidence. High pLDDT supports local fold confidence, while low pLDDT often marks flexible linkers, disordered tails, or uncertain loops. PAE is used as regional confidence evidence: low internal PAE supports a coherent structural unit, while high PAE between neighboring blocks suggests flexible domain motion or uncertain relative orientation.
Construct diagnostics summarize mean pLDDT, structured fraction, mean intrablock PAE, interblock PAE, local boundary PAE, separation, and linker pLDDT when available. These metrics are evidence features, not absolute pass/fail criteria.
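A few of these diagnostics can be computed from a per-residue pLDDT list and a square PAE matrix, as in the sketch below. The function names, the 0-based half-open windows, and the pLDDT cutoff of 70 used for the structured fraction are assumptions made for illustration; the source does not state the cutoff the pipeline uses.

```python
from statistics import mean
from typing import List, Sequence


def mean_plddt(plddt: List[float], start: int, end: int) -> float:
    """Mean pLDDT over the 0-based half-open window [start, end)."""
    return mean(plddt[start:end])


def structured_fraction(plddt: List[float], start: int, end: int, cutoff: float = 70.0) -> float:
    """Fraction of residues in the window at or above an assumed pLDDT cutoff."""
    window = plddt[start:end]
    return sum(1 for value in window if value >= cutoff) / len(window)


def mean_block_pae(pae: List[List[float]], rows: Sequence[int], cols: Sequence[int]) -> float:
    """Mean PAE over residue pairs (i, j), skipping the i == j diagonal.

    Pass the same block twice for intrablock PAE, two blocks for interblock PAE.
    """
    return mean(pae[i][j] for i in rows for j in cols if i != j)


plddt = [90, 85, 40, 88]
pae = [
    [0, 2, 20, 22],
    [2, 0, 18, 21],
    [19, 20, 0, 3],
    [21, 19, 3, 0],
]
block_a, block_b = range(0, 2), range(2, 4)
print(mean_plddt(plddt, 0, 4), structured_fraction(plddt, 0, 4))
print(mean_block_pae(pae, block_a, block_a), mean_block_pae(pae, block_a, block_b))
```

In this toy matrix the intrablock PAE is low and the interblock PAE is high, which is the pattern the section describes for two well-folded units with uncertain relative orientation.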
Residue topology and accessibility
Residues are classified by topology first, using UniProt topology features, transmembrane segments, signal peptides, propeptides, mature-chain annotations, and the input topology track. The reported location categories are extracellular, membrane, intracellular, or unknown. For multipass proteins, extracellular termini and loops are treated as extracellular regions inside a membrane-protein model.
Surface accessibility is computed from the canonical AlphaFold structure with FreeSASA using the Lee-Richards algorithm. The default probe radius is 1.4 Å, the default slice count is 20, and residues are called solvent-accessible when relative SASA is at least 0.20.
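The accessibility call itself reduces to a threshold on relative SASA. The sketch below assumes the relative values have already been computed (for example by FreeSASA) and are keyed by residue number; the function name and input shape are hypothetical.

```python
from typing import Dict, List

# From the text: residues are accessible when relative SASA is at least 0.20.
REL_SASA_CUTOFF = 0.20


def accessible_residues(relative_sasa: Dict[int, float], cutoff: float = REL_SASA_CUTOFF) -> List[int]:
    """Return sorted residue numbers whose relative SASA meets the cutoff."""
    return sorted(pos for pos, rel in relative_sasa.items() if rel >= cutoff)


# Hypothetical relative SASA values for three residues.
print(accessible_residues({1: 0.05, 2: 0.20, 3: 0.61}))
```

The comparison is inclusive, so a residue at exactly 0.20 is treated as accessible.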
Extracellular accessible residues
Antibody recognition depends most directly on residues that are both extracellular and accessible. OpenAntigens therefore tracks an extracellular accessible residue set and uses it in ortholog identity, paralog/family identity, BLAST summaries, and alignment highlighting.
PAE-block SASA, meaning accessibility recomputed within each PAE-coherent block, is used because AlphaFold sometimes packs flexible domains or disordered segments against each other even when PAE indicates their relative placement is uncertain. This recomputation reduces false buried calls for residues that may be exposed in solution or on cells.
The extracellular accessible identity reported in homolog, family, and BLAST sections is calculated only over aligned human residues that are both extracellular and classified as exposed or conditional.
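The restricted identity can be sketched from a gapped pairwise alignment and a set of eligible human positions. Several details here are assumptions: the function name, the 1-based human numbering, and the choice to skip columns where the other sequence has a gap rather than count them as mismatches.

```python
from typing import Optional, Set


def accessible_identity(human_aln: str, other_aln: str, eligible: Set[int]) -> Optional[float]:
    """Identity over aligned human residues whose 1-based position is in `eligible`.

    `eligible` holds human positions that are extracellular and classified as
    exposed or conditional. Returns None when no eligible column is aligned.
    """
    if len(human_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    human_pos = compared = matches = 0
    for h, o in zip(human_aln, other_aln):
        if h == "-":
            continue  # insertion in the other sequence; no human residue here
        human_pos += 1
        if o == "-" or human_pos not in eligible:
            continue  # assumption: gapped columns are not counted
        compared += 1
        matches += h == o
    return matches / compared if compared else None


# Human positions: M1 K2 T3 L4 L5 V6 A7 (the gap column carries no position).
print(accessible_identity("MKT-LLVA", "MRTALL-A", {2, 3, 6, 7}))
```

In this example positions 2, 3, and 7 are compared (position 6 aligns to a gap), and two of the three match.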
Construct generation
The analysis engine scores construct candidates, removes exact duplicate boundaries across calculated and annotated candidate sources (PDB-backed constructs are kept individually as evidence), and filters out soluble candidates that fall outside the inferred design scope or below the minimum length. The portal then displays the surviving classes in a fixed review order.
The dedicated Pre-generated Constructs page expands this section with construct classes, diagnostics, card fields, homolog transfer, PTM-aware review, and practical selection guidance.
The minimum construct size defaults to 50 amino acids. Soluble constructs outside the inferred design scope are excluded. Exact duplicate boundaries from different calculated or annotated sources are collapsed during construct deduplication; PDB-backed constructs are exempt, so every deposited structure is retained as evidence.
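The filtering and deduplication rules can be illustrated with a short sketch. The `Construct` fields, the source labels, and the function name are hypothetical. Three behaviors are assumptions: "outside the design scope" means not fully contained in it, the first candidate seen with a given boundary pair is the one kept, and the scope and length filters apply to every soluble candidate, PDB-backed or not.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

MIN_LENGTH = 50  # default minimum construct size from the text


@dataclass(frozen=True)
class Construct:
    start: int          # 1-based inclusive
    end: int            # 1-based inclusive
    source: str         # assumed labels: "strict", "lenient", "annotated", "pdb"
    soluble: bool = True


def filter_constructs(
    candidates: List[Construct],
    scope: Tuple[int, int],
    min_length: int = MIN_LENGTH,
) -> List[Construct]:
    scope_start, scope_end = scope
    seen: Set[Tuple[int, int]] = set()
    kept: List[Construct] = []
    for c in candidates:
        length = c.end - c.start + 1
        out_of_scope = c.start < scope_start or c.end > scope_end
        if c.soluble and (out_of_scope or length < min_length):
            continue
        if c.source != "pdb":  # PDB-backed constructs are exempt from deduplication
            key = (c.start, c.end)
            if key in seen:
                continue
            seen.add(key)
        kept.append(c)
    return kept


candidates = [
    Construct(25, 140, "strict"),
    Construct(25, 140, "annotated"),  # exact duplicate boundaries, collapsed
    Construct(25, 140, "pdb"),
    Construct(25, 140, "pdb"),        # second deposited structure, retained
    Construct(30, 60, "lenient"),     # 31 aa, below the minimum
    Construct(10, 140, "lenient"),    # starts before the design scope
]
for construct in filter_constructs(candidates, scope=(22, 300)):
    print(construct)
```

Three constructs survive in this example: the strict candidate and both PDB-backed entries.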
Strict and lenient calculated constructs come from two AlphaFold segmentation passes: a conservative strict pass that isolates compact, domain-like units, and a permissive lenient pass that preserves larger, sometimes multi-domain regions. Lenient constructs are often larger and more epitope-preserving; strict constructs are smaller and more single-domain-like. The exact pLDDT thresholds, PAE split criteria, and boundary scoring are documented on the Pre-generated Constructs page.
Domain annotations
UniProt feature annotations and InterPro records are merged into the residue annotation map. Domain annotations must lie inside the relevant design scope to be used as construct recommendations. Large whole-protein signatures that extend beyond the mature secreted or extracellular region are retained as descriptive context when appropriate but are excluded from soluble construct recommendations.
InterPro is treated as a higher-level curated domain/family source. Pfam entries support domain annotation. Pfam-only family IDs are excluded as canonical family identifiers for precomputed family matrices.
Family and paralog context
Canonical family assignment prefers InterPro family annotations when they are specific and informative. HGNC and Ensembl-derived paralog references are used to recover biologically relevant families when InterPro is incomplete or too broad. If multiple families are available, smaller and more specific families are preferred over large generic families.
Large families can be skipped from detailed matrix rendering to keep pages readable. Directional identity matrices are used because X-to-Y identity and Y-to-X identity can differ when one protein is a subregion or short paralog relative to another.
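A small example shows why the two directions can differ. The sketch assumes that directional identity is the number of identical aligned columns divided by the ungapped length of the query; the source does not give the exact formula, and the function name is hypothetical.

```python
def directional_identity(query_aln: str, subject_aln: str) -> float:
    """Identical columns divided by the query's ungapped length (assumed definition)."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for q, s in zip(query_aln, subject_aln) if q != "-" and q == s)
    query_length = sum(1 for q in query_aln if q != "-")
    return matches / query_length


x = "ACDEFGHIKL"  # full-length protein, 10 residues
y = "--DEFG----"  # short paralog matching a subregion of x
print(directional_identity(x, y))  # X-to-Y: 4 matches over 10 residues
print(directional_identity(y, x))  # Y-to-X: 4 matches over 4 residues
```

Under this definition the short paralog is fully covered by the longer protein, while the longer protein is only partly covered by the paralog, so a single symmetric value would hide the relationship.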
Orthologs and homology
Precomputed ortholog reference tables provide human, mouse, and cynomolgus monkey canonical RefSeq accessions, sequences, and pairwise identity to the human sequence. Reports use these tables first. If no precomputed row covers the target, the analyzer runs a live species lookup.
Construct-level homolog sequences are inferred by aligning the full human design scope to species ortholog sequences, projecting the human construct boundaries through that alignment, and extracting the corresponding species region. Construct-specific identity is then calculated against the human construct sequence.
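Boundary projection through an alignment can be sketched as a single pass over the aligned columns. The function name and the handling of ortholog insertions (kept when they fall strictly between the first and last human residue of the construct) are assumptions, not details confirmed by the source.

```python
def project_construct(human_aln: str, ortholog_aln: str, start: int, end: int) -> str:
    """Extract the ortholog region aligned to human residues start..end (1-based, inclusive)."""
    if len(human_aln) != len(ortholog_aln):
        raise ValueError("aligned sequences must have equal length")
    human_pos = 0
    region = []
    for h, o in zip(human_aln, ortholog_aln):
        if h != "-":
            human_pos += 1
            inside = start <= human_pos <= end
        else:
            # Ortholog insertion: keep it only if it sits between two
            # human residues that both belong to the construct.
            inside = start <= human_pos < end
        if inside and o != "-":
            region.append(o)
    return "".join(region)


# Human positions: M1 K2 T3 L4 L5 V6 A7 G8 D9 (the gap column carries no position).
human_aln = "MKTLL-VAGD"
ortholog_aln = "MRTLLSVA-D"
print(project_construct(human_aln, ortholog_aln, 3, 7))
```

The extracted region includes the ortholog's inserted serine because it lies inside the projected window; construct-specific identity would then be computed between this region and the human construct sequence.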
Experimental structures
PDB/RCSB evidence is mapped back to UniProt residue numbering where possible. Experimental construct boundaries that overlap the design scope are summarized and can become PDB-backed construct suggestions. PDB entries fully outside the mature secreted, extracellular, or relevant membrane-protein scope are ignored for construct recommendation.
PDB evidence is treated as design precedent. Exact construct choice still depends on the current antibody-discovery context.
BLAST cross-reactivity
Cross-reactivity is estimated with local blastp searches so the portal can operate reproducibly without remote BLAST jobs. Human, mouse, and cynomolgus monkey primary databases use reviewed UniProt protein sets. For cynomolgus monkey, targets without hits in the reviewed set are searched against a broader local UniProt proteome database when that database is configured.
Self hits are filtered by accession and entry name. BLAST results are rank-only in the current implementation; no universal cross-reactivity threshold is applied because acceptable risk depends on antigen class, screening goal, and downstream assay design.
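Self-hit filtering and rank-only reporting can be sketched as below. The `Hit` fields, the placeholder identifiers, and the use of bit score as the ranking key are assumptions; the text states only that self hits are removed by accession and entry name and that no threshold is applied.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    accession: str
    entry_name: str
    bitscore: float


def rank_hits(hits: List[Hit], self_accession: str, self_entry_name: str) -> List[Hit]:
    """Drop self hits, then sort the rest by descending bit score with no cutoff."""
    kept = [
        h for h in hits
        if h.accession != self_accession and h.entry_name != self_entry_name
    ]
    return sorted(kept, key=lambda h: h.bitscore, reverse=True)


# Placeholder identifiers, not real UniProt records.
hits = [
    Hit("X00001", "TGT_HUMAN", 950.0),   # the query itself
    Hit("X00002", "PARA_HUMAN", 310.5),
    Hit("X00003", "PARB_HUMAN", 420.0),
]
for rank, hit in enumerate(rank_hits(hits, "X00001", "TGT_HUMAN"), start=1):
    print(rank, hit.entry_name, hit.bitscore)
```

Matching on either identifier removes the query even when a database lists it under a different accession but the same entry name.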
PTMs and processing features
OpenAntigens extracts curated UniProt PTM-like features including glycosylation, modified residues, lipidation, disulfide bonds, cross-links, initiator methionine processing, signal peptides, propeptides, mature chains, peptides, and cleavage-related site annotations.
Every target page lists these annotations in a dedicated PTM section. Construct cards list which PTMs overlap each construct, and the Interactive Construct Builder updates the included PTMs in real time as the selected residue window changes. Processing and cleavage annotations raise boundary-review warnings because cutting through or including these sites unintentionally can affect expression and antigen quality.
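The overlap between annotations and a selected residue window can be sketched with a simple interval test. The `Feature` shape, the set of kinds treated as processing features, and the warning text are hypothetical. Pairwise features such as disulfide bonds are left out because their two positions are partner residues rather than a continuous range.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    kind: str
    start: int  # 1-based inclusive
    end: int    # 1-based inclusive


# Assumed grouping of processing and cleavage annotation kinds.
PROCESSING_KINDS = {"signal peptide", "propeptide", "cleavage site"}


def review_window(features: List[Feature], start: int, end: int) -> Tuple[List[Feature], List[str]]:
    """Return features overlapping the window and boundary-review warnings."""
    included = [f for f in features if f.start <= end and f.end >= start]
    warnings = [
        f"{f.kind} {f.start}-{f.end} overlaps construct {start}-{end}"
        for f in included
        if f.kind in PROCESSING_KINDS
    ]
    return included, warnings


features = [
    Feature("signal peptide", 1, 21),
    Feature("propeptide", 22, 30),
    Feature("glycosylation", 45, 45),
    Feature("glycosylation", 210, 210),
]
included, warnings = review_window(features, 25, 150)
print([f.kind for f in included])
print(warnings)
```

Re-running the function whenever the window changes gives the kind of live update the Interactive Construct Builder is described as providing.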
Cysteine analysis
Cysteines are reported as paired or unpaired when curated disulfide evidence or inferred pairing context is available. Unpaired cysteines inside a construct trigger warnings, especially when they appear accessible or are associated with unwanted disulfides, aggregation, or expression heterogeneity.
The Interactive Construct Builder can optionally apply Cys-to-Ser edits to selected unpaired cysteines and updates copied TSV/FASTA sequences in real time.
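A Cys-to-Ser edit on a construct sequence can be sketched as follows. The function name and the convention that cysteine positions use full-length 1-based numbering while the sequence starts at `construct_start` are assumptions for illustration.

```python
from typing import Iterable


def apply_cys_to_ser(construct_seq: str, construct_start: int, positions: Iterable[int]) -> str:
    """Replace selected cysteines with serine.

    `positions` are 1-based residue numbers in the full-length protein;
    positions outside the construct are ignored.
    """
    residues = list(construct_seq)
    for pos in positions:
        index = pos - construct_start
        if not 0 <= index < len(residues):
            continue
        if residues[index] != "C":
            raise ValueError(f"residue {pos} is {residues[index]}, not C")
        residues[index] = "S"
    return "".join(residues)


# Construct covering residues 101-107; cysteines sit at 102, 104, and 107.
edited = apply_cys_to_ser("ACDCEFC", 101, [104])
print(f">construct_101-107_C104S\n{edited}")
```

Raising an error when the selected position is not a cysteine catches numbering mistakes before an edited sequence is copied out.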
Furin-site scanning
Basic furin-like cleavage motifs are scanned across the design region and constructs. The pipeline reports suggested edits as annotations and leaves application to manual review because cleavage risk is context-dependent and some basic motifs may be biologically important.
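Motif scanning of this kind is commonly done with a regular expression. The sketch below assumes the widely used minimal furin consensus R-X-[K/R]-R; the source does not state which pattern the pipeline uses, and the function name is hypothetical. A lookahead is used so that overlapping motifs are all reported.

```python
import re
from typing import List, Tuple

# Assumed pattern: minimal furin consensus R-X-[K/R]-R, matched with a
# lookahead so overlapping occurrences are not skipped.
FURIN_LIKE = re.compile(r"(?=(R.[KR]R))")


def scan_furin_like(seq: str, offset: int = 1) -> List[Tuple[int, str]]:
    """Return (1-based start position, matched motif) for each furin-like site.

    `offset` is the residue number of the first character in `seq`.
    """
    return [(m.start() + offset, m.group(1)) for m in FURIN_LIKE.finditer(seq)]


print(scan_furin_like("MAGRSKRSVLRRRRA"))
```

The function only reports sites, in keeping with the section's point that whether to edit a motif is left to manual review.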
Ligand and interaction regions
Curated UniProt features, PDB mappings, and annotation text are used to report ligand-binding or interaction-related regions when available. These annotations help users avoid trimming known functional surfaces that may be needed for conformational antibody discovery.
Assembly and partner requirements
Interaction and complex information is summarized separately from obligatory-partner classification. Known interactions provide context; obligatory-partner warnings require stronger curated complex evidence.
Obligatory-partner warnings are anchored to curated Complex Portal records when that module is enabled. UniProt SUBUNIT comments are retained as interaction and assembly context without independently creating obligatory-partner warnings.
This conservative behavior avoids false mandatory-partner calls that would incorrectly discourage soluble antigen designs. Reports can list many known interactions without an obligatory requirement when curated evidence falls short of the stronger mandatory-partner criteria.
Disease, literature, and antibody-resource context
Open Targets direct target-disease associations are precomputed as a separate resumable artifact. The portal shows the top disease association in the index, loads a disease-name index for disease-aware filtering, and lists disease score plus datasource-score context on each target page.
PubTator3 is queried for approximate gene-linked PubMed hit counts. These counts support triage and literature awareness; they do not establish target validity, which requires separate curation. CiteAb links are provided as external antibody-resource shortcuts; CiteAb catalog content remains external.
Static assets and visualization
When assets are generated, each construct can include structure images, pLDDT plots, and PAE plots with the construct region highlighted. Structure renderings use AlphaFold models colored by pLDDT and emphasize cysteines when possible. The portal also includes an Interactive Construct Builder that links sequence selection, structure highlighting, pLDDT, PAE, warnings, and copy-ready sequence outputs.
Static image assets support report review. Interactive widgets support boundary tuning and exploratory design.
Update cadence and limitations
OpenAntigens is released as a static, versioned snapshot. Each release records the build date, target universe, OpenAntigens version when available, and source database versions when available. The planned public update cadence is every 6 months.
Because each release is static, source databases may have changed after the displayed release date. Cite the release date and downloaded artifact names when using OpenAntigens data in downstream analyses.