Data generation

How OpenAntigens generates portal annotations

The portal combines curated target metadata, UniProt feature annotations, AlphaFold confidence data, structural evidence, ortholog/paralog references, disease associations, and local BLAST searches for antigen construct review.

Overview

OpenAntigens is generated as a local, versioned computational snapshot. Each target is processed into a report that combines target identity, topology, extracellular or membrane-protein design scope, structure confidence, experimental structure precedent, domain/family context, ortholog mappings, cross-reactivity searches, disease associations, and precomputed construct candidates.

Reports provide computational evidence for antigen construct design and antibody-discovery planning. Final construct choice still requires biological review, expression testing, and experimental validation.

Evidence class | Primary source | Used for
Target identity and sequence | Reviewed UniProt human records | Canonical protein sequence, gene symbol, entry name, protein name, feature annotations.
Topology and extracellular scope | Input topology table plus UniProt features | Secreted, GPI, single-pass, and multipass track assignment; soluble design-region or full-length membrane design scope.
Predicted structure confidence | AlphaFold DB; local AlphaFold 3 artifacts when enabled and sequence-validated | pLDDT, PAE, structure visualization, structural segmentation, construct diagnostics.
Experimental structure precedent | RCSB PDB mappings | Deposited construct boundaries and experimentally observed extracellular regions.
GPCR-specific annotations | GPCRdb | GPCR class/family, segment boundaries, generic residue numbering, conserved motif context, and GPCR-focused construct interpretation.
Domains and families | InterPro, UniProt, HGNC, Ensembl-derived references | Domain annotation, canonical family selection, paralog identity matrices.
Orthologs | Precomputed RefSeq/HCOP-derived ortholog table | Human, mouse, and cynomolgus monkey sequence mapping and construct-level identity.
Cross-reactivity | Local blastp databases | Ranked non-self sequence-similarity hits and original alignment text.
Disease context | Open Targets Platform | Target-disease association scores and disease-aware portal search.
Literature context | PubTator3 | Approximate target-linked PubMed hit counts and literature search links.
PTMs and processing | UniProt feature annotations | Glycosylation, modified residues, lipidation, disulfides, cross-links, signal/propeptide/chain annotations, and cleavage-related sites.

Target universe and resolution

The target universe is sourced from the refined human cell-surface and secreted protein topology table under data/inputs/. The active public portal focuses on proteins categorized as secreted, GPI-anchored, single-pass membrane, or multipass membrane.

Each input row is resolved primarily through reviewed human UniProt records. Resolution prefers stable UniProt entry names or accessions when available and preserves the original input identifiers for traceability. Ambiguous or unresolved rows remain represented in batch outputs with explicit error status.

Topology and construct scope

Topology controls which design track is used. Secreted proteins use the mature extracellular protein as the soluble design scope. GPI-anchored proteins and single-pass proteins emphasize extracellular regions, with boundaries inferred from curated topology, signal peptide, propeptide, and transmembrane annotations.

Multipass proteins are treated separately. When the largest extracellular region exceeds 80 amino acids, OpenAntigens treats the protein as a mixed case with both soluble extracellular-region constructs and full-length membrane-expression context. With shorter extracellular loops, the report emphasizes membrane-protein expression and full-length context.
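
The multipass rule above can be sketched in a few lines. This is a minimal illustration, not the OpenAntigens implementation: the function name, the return labels, and the 1-based inclusive region format are assumptions; only the 80-residue threshold comes from the text.

```python
from typing import List, Tuple

# Threshold stated in the text; the constant name is illustrative.
MIXED_CASE_MIN_REGION_LENGTH = 80


def multipass_design_mode(extracellular_regions: List[Tuple[int, int]]) -> str:
    """Classify a multipass target from its extracellular regions.

    Regions are assumed to be 1-based inclusive (start, end) residue ranges.
    Returns "mixed" when the largest region exceeds the threshold,
    otherwise "membrane_expression". Both labels are hypothetical.
    """
    longest = max(
        (end - start + 1 for start, end in extracellular_regions),
        default=0,
    )
    if longest > MIXED_CASE_MIN_REGION_LENGTH:
        return "mixed"
    return "membrane_expression"
```

Note that a region of exactly 80 residues does not exceed the threshold, so it stays on the membrane-expression track in this sketch.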

Sequence canonicality

OpenAntigens requires the report sequence, UniProt sequence, and AlphaFold model sequence to represent the same canonical isoform. Mismatching AlphaFold models are excluded from structure-dependent metrics because residue-number mismatches corrupt construct boundaries, pLDDT coloring, and sequence-to-structure selection.

AlphaFold DB models are used whenever a compatible canonical model is available. If AlphaFold DB artifact resolution fails and local AlphaFold 3 artifacts are enabled, the client accepts a local AlphaFold 3 model only when its stored sequence hash and parsed PDB sequence match the canonical report sequence. Prepared AlphaFold 3 artifacts are selected by AlphaFold Server ranking score, converted into the same PDB/PAE artifact shape used by the rest of the pipeline, and labeled separately in the portal.
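
The acceptance check for a local AlphaFold 3 model can be illustrated as follows. This is a sketch under stated assumptions: the text only says that a stored sequence hash and the parsed PDB sequence must both match the canonical sequence, so the choice of SHA-256, the normalization step, and the function names are all hypothetical.

```python
import hashlib


def sequence_sha256(sequence: str) -> str:
    """Hash a protein sequence after removing whitespace and upper-casing.

    SHA-256 is an assumption; the source does not name the hash algorithm.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def accept_local_model(
    canonical_sequence: str,
    stored_hash: str,
    parsed_pdb_sequence: str,
) -> bool:
    """Accept the model only if both the stored hash and the sequence parsed
    from the model's PDB file match the canonical report sequence."""
    canonical_hash = sequence_sha256(canonical_sequence)
    return (
        stored_hash == canonical_hash
        and sequence_sha256(parsed_pdb_sequence) == canonical_hash
    )
```

Requiring both checks guards against a stale hash file as well as a model built from a different isoform.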

When a target lacks a matching AlphaFold DB model or accepted local AlphaFold 3 artifact, non-structure-dependent annotations remain available; structure-derived segmentation and interactive structure features are limited.

AlphaFold confidence

pLDDT is used as local residue-confidence evidence. High pLDDT supports local fold confidence, while low pLDDT often marks flexible linkers, disordered tails, or uncertain loops. PAE is used as regional confidence evidence: low internal PAE supports a coherent structural unit, while high PAE between neighboring blocks suggests flexible domain motion or uncertain relative orientation.

Construct diagnostics summarize mean pLDDT, structured fraction, mean intrablock PAE, interblock PAE, local boundary PAE, separation, and linker pLDDT when available. These metrics are evidence features, not absolute pass/fail criteria.
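
Three of these diagnostics are simple to compute from per-residue pLDDT values and a PAE matrix. The sketch below is illustrative only: the function name, the 1-based inclusive construct range, and the pLDDT cutoff of 70 used for the structured fraction are assumptions rather than documented pipeline settings.

```python
from statistics import mean
from typing import Dict, List, Optional


def construct_diagnostics(
    plddt: List[float],
    pae: List[List[float]],
    start: int,
    end: int,
    structured_cutoff: float = 70.0,  # assumed cutoff, not from the source
) -> Dict[str, Optional[float]]:
    """Summarize confidence for the construct spanning residues start..end
    (1-based, inclusive)."""
    indices = range(start - 1, end)
    block_plddt = [plddt[i] for i in indices]
    # Off-diagonal PAE values between residue pairs inside the construct.
    block_pae = [pae[i][j] for i in indices for j in indices if i != j]
    return {
        "mean_plddt": mean(block_plddt),
        "structured_fraction": sum(v >= structured_cutoff for v in block_plddt)
        / len(block_plddt),
        "mean_intrablock_pae": mean(block_pae) if block_pae else None,
    }
```

The values are returned as plain features, in keeping with the point above that they inform review and are not pass/fail criteria.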

Residue topology and accessibility

Residues are classified by topology first, using UniProt topology features, transmembrane segments, signal peptides, propeptides, mature-chain annotations, and the input topology track. The reported location categories are extracellular, membrane, intracellular, or unknown. For multipass proteins, extracellular termini and loops are treated as extracellular regions inside a membrane-protein model.

Surface accessibility is computed from the canonical AlphaFold structure with FreeSASA using the Lee-Richards algorithm. The default probe radius is 1.4 Å, the default slice count is 20, and residues are called solvent-accessible when relative SASA is at least 0.20.

Extracellular accessible residues

Antibody recognition depends most directly on residues that are both extracellular and accessible. OpenAntigens therefore tracks an extracellular accessible residue set and uses it in ortholog identity, paralog/family identity, BLAST summaries, and alignment highlighting.

Accessibility class | How it is assigned | Design interpretation
Exposed | Relative SASA in the full AlphaFold model is at least the configured threshold. | Residue is directly solvent-accessible in the full model.
Conditional | The residue is buried in the full model but exposed when SASA is recomputed inside a PAE-coherent block, or it has low pLDDT and is treated as accessible but low-confidence/disordered. | Flag for flexible or uncertain regions where full-model AlphaFold packing may hide residues exposed in solution or on cells.
Buried | Relative SASA is below threshold in the full model and in available PAE-block contexts. | Lower antibody accessibility unless the biological conformation differs from the model.
Uncertain | No structure or SASA context is available. | Manual accessibility review required.

PAE-block SASA is used because AlphaFold sometimes packs flexible domains or disordered segments against each other even when PAE indicates their relative placement is uncertain. Recomputing accessibility inside the PAE-coherent block reduces false buried calls for residues that may be exposed in solution or on cells.
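
The four accessibility classes in the table above can be expressed as a small decision function. This is a sketch, not the portal's code: the function and argument names are hypothetical, the 0.20 threshold is the default quoted earlier, and the pLDDT cutoff of 50 for "low confidence" is an assumption because the source does not state one.

```python
from typing import Optional


def accessibility_class(
    full_rsasa: Optional[float],
    block_rsasa: Optional[float],
    plddt: Optional[float],
    threshold: float = 0.20,
    low_plddt_cutoff: float = 50.0,  # assumed value, not from the source
) -> str:
    """Assign exposed / conditional / buried / uncertain to one residue.

    full_rsasa:  relative SASA in the full model (None if no structure).
    block_rsasa: relative SASA recomputed inside the residue's PAE-coherent
                 block (None if no block context is available).
    """
    if full_rsasa is None:
        return "uncertain"
    if full_rsasa >= threshold:
        return "exposed"
    if block_rsasa is not None and block_rsasa >= threshold:
        return "conditional"
    if plddt is not None and plddt < low_plddt_cutoff:
        return "conditional"
    return "buried"
```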

The extracellular accessible identity reported in homolog, family, and BLAST sections is calculated only over aligned human residues that are both extracellular and classified as exposed or conditional.
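
As a rough illustration of that restriction, the sketch below computes identity from a gapped pairwise alignment while counting only human residues that are extracellular and exposed or conditional. The function name and the dictionary inputs keyed by 1-based human residue number are hypothetical, and skipping human residues aligned to a gap is an assumed reading of "aligned human residues".

```python
from typing import Dict, Optional

ACCESSIBLE_CLASSES = {"exposed", "conditional"}


def extracellular_accessible_identity(
    human_aln: str,
    other_aln: str,
    location: Dict[int, str],
    accessibility: Dict[int, str],
) -> Optional[float]:
    """Identity over extracellular, accessible, aligned human residues.

    human_aln and other_aln are equal-length gapped strings ('-' = gap).
    Returns None when no residue qualifies.
    """
    position = 0  # 1-based human residue number
    considered = 0
    identical = 0
    for h, o in zip(human_aln, other_aln):
        if h == "-":
            continue
        position += 1
        if o == "-":
            continue  # human residue has no aligned partner
        if location.get(position) != "extracellular":
            continue
        if accessibility.get(position) not in ACCESSIBLE_CLASSES:
            continue
        considered += 1
        if h == o:
            identical += 1
    return identical / considered if considered else None
```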

Construct generation

The analysis engine scores construct candidates and removes exact duplicate boundaries across calculated and annotated candidate sources; PDB-backed constructs are kept individually as evidence. It then filters out soluble candidates that fall outside the inferred design scope or below the minimum length. The portal displays the surviving classes in a fixed review order.

The dedicated Pre-generated Constructs page expands this section with construct classes, diagnostics, card fields, homolog transfer, PTM-aware review, and practical selection guidance.

Construct class | How it is generated | Primary use
Full design region | Uses the inferred mature secreted or extracellular design region after signal peptide/propeptide/topology processing. | Preserves conformational epitopes and complete soluble context.
PDB-backed | Uses experimental construct boundaries mapped to UniProt numbering, excluding structures fully outside the design scope. Every overlapping structure is retained, even when its boundaries match another construct, and links to its RCSB entry. | Captures experimentally observed and often expression-tested boundaries.
Domain annotated | Uses curated UniProt/InterPro domain boundaries that lie fully inside the design scope. | Captures biologically named domain units.
Strict calculated | Uses more conservative pLDDT/PAE segmentation to favor compact single-domain-like units. | Emphasizes smaller constructs with stronger cohesive-domain support.
Lenient calculated | Uses more permissive segmentation to retain larger structured units and multi-domain regions when support is acceptable. | Captures bigger surfaces and possible conformational epitopes.
Membrane-expression | For multipass targets, trims obviously disordered cytoplasmic regions while retaining membrane-spanning context. | Supports whole membrane-protein expression planning.

The minimum construct size defaults to 50 amino acids. Soluble constructs outside the inferred design scope are excluded. Exact duplicate boundaries from different calculated or annotated sources are collapsed during construct deduplication; PDB-backed constructs are exempt, so every deposited structure is retained as evidence.
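
The filtering and deduplication rules can be sketched as below. The `Candidate` class, its field names, and the source labels are hypothetical. Two readings are assumed and flagged in the comments: PDB-backed candidates only need to overlap the design scope, while other soluble candidates must lie fully inside it, and the length and scope filters apply to soluble candidates only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    source: str  # hypothetical labels: "pdb", "domain", "strict", "lenient"
    start: int   # 1-based inclusive
    end: int
    soluble: bool = True


def filter_candidates(
    candidates: List[Candidate],
    scope_start: int,
    scope_end: int,
    min_length: int = 50,
) -> List[Candidate]:
    """Drop out-of-scope or too-short soluble candidates and collapse exact
    duplicate boundaries, keeping every PDB-backed candidate."""
    seen_boundaries = set()
    kept = []
    for c in candidates:
        if c.soluble:
            if c.source == "pdb":
                # Assumption: PDB-backed constructs only need to overlap.
                in_scope = c.start <= scope_end and c.end >= scope_start
            else:
                # Assumption: other soluble constructs must be fully inside.
                in_scope = scope_start <= c.start and c.end <= scope_end
            if not in_scope or (c.end - c.start + 1) < min_length:
                continue
        if c.source != "pdb":
            key = (c.start, c.end)
            if key in seen_boundaries:
                continue
            seen_boundaries.add(key)
        kept.append(c)
    return kept
```

Because PDB-backed boundaries are never added to the seen set, a deposited structure with the same boundaries as a domain annotation is kept alongside it.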

Strict and lenient calculated constructs come from two AlphaFold segmentation passes: a conservative strict pass that isolates compact, domain-like units, and a permissive lenient pass that preserves larger, sometimes multi-domain regions. Lenient constructs are often larger and more epitope-preserving; strict constructs are smaller and more single-domain-like. The exact pLDDT thresholds, PAE split criteria, and boundary scoring are documented on the Pre-generated Constructs page.

Domain annotations

UniProt feature annotations and InterPro records are merged into the residue annotation map. Domain annotations must lie inside the relevant design scope to be used as construct recommendations. Large whole-protein signatures that extend beyond the mature secreted or extracellular region are retained as descriptive context when appropriate but are excluded from soluble construct recommendations.

InterPro is treated as a higher-level curated domain/family source. Pfam entries support domain annotation. Pfam-only family IDs are excluded as canonical family identifiers for precomputed family matrices.

Family and paralog context

Canonical family assignment prefers InterPro family annotations when they are specific and informative. HGNC and Ensembl-derived paralog references are used to recover biologically relevant families when InterPro is incomplete or too broad. If multiple families are available, smaller and more specific families are preferred over large generic families.

Large families can be skipped from detailed matrix rendering to keep pages readable. Directional identity matrices are used because X-to-Y identity and Y-to-X identity can differ when one protein is a subregion or short paralog relative to another.
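
A small example shows why the two directions can differ. The sketch assumes that directional identity divides the number of identical aligned positions by the length of the protein named first; the source does not give the exact denominator, and the function name is hypothetical.

```python
from typing import Tuple


def directional_identity(aln_x: str, aln_y: str) -> Tuple[float, float]:
    """Return (X-to-Y identity, Y-to-X identity) from a gapped alignment.

    Assumption: each direction is normalized by the ungapped length of the
    protein named first in that direction.
    """
    matches = sum(1 for a, b in zip(aln_x, aln_y) if a == b and a != "-")
    len_x = sum(1 for a in aln_x if a != "-")
    len_y = sum(1 for b in aln_y if b != "-")
    return matches / len_x, matches / len_y


# Y is a short paralog that matches a subregion of X exactly.
x_to_y, y_to_x = directional_identity("ACDEFGHIKL", "--DEFG----")
print(x_to_y, y_to_x)  # prints: 0.4 1.0
```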

Orthologs and homology

Precomputed ortholog reference tables provide human, mouse, and cynomolgus monkey canonical RefSeq accessions, sequences, and pairwise identity to the human sequence. Reports use these tables first. If no precomputed row covers the target, the analyzer runs a live species lookup.

Construct-level homolog sequences are inferred by aligning the full human design scope to species ortholog sequences, projecting the human construct boundaries through that alignment, and extracting the corresponding species region. Construct-specific identity is then calculated against the human construct sequence.
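
The projection step can be sketched with a precomputed gapped alignment. The function name and inputs are hypothetical; the choices to keep species insertions that fall strictly between construct residues and to normalize identity by the human construct length are assumptions, not documented behavior.

```python
from typing import Tuple


def project_construct(
    human_aln: str,
    species_aln: str,
    start: int,
    end: int,
) -> Tuple[str, float]:
    """Project human construct boundaries (1-based, inclusive) through a
    gapped alignment of the human design scope to a species ortholog.

    Returns the ungapped species region and its identity to the human
    construct.
    """
    human_pos = 0
    region = []
    matches = 0
    for h, s in zip(human_aln, species_aln):
        if h != "-":
            human_pos += 1
            inside = start <= human_pos <= end
            if inside and h == s:
                matches += 1
        else:
            # Species insertion: keep it only when it lies strictly between
            # two residues of the human construct.
            inside = start <= human_pos < end
        if inside and s != "-":
            region.append(s)
    return "".join(region), matches / (end - start + 1)
```

For the alignment `MKT-AYIAK` / `MRTGAY-AK` and human construct 2 to 6 (`KTAYI`), this returns the species region `RTGAY` with an identity of 0.6.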

Experimental structures

PDB/RCSB evidence is mapped back to UniProt residue numbering where possible. Experimental construct boundaries that overlap the design scope are summarized and can become PDB-backed construct suggestions. PDB entries fully outside the mature secreted, extracellular, or relevant membrane-protein scope are ignored for construct recommendation.

PDB evidence is treated as design precedent. Exact construct choice still depends on the current antibody-discovery context.

BLAST cross-reactivity

Cross-reactivity is estimated with local blastp searches so the portal can operate reproducibly without remote BLAST jobs. Human, mouse, and cynomolgus monkey primary databases use reviewed UniProt protein sets. For cynomolgus monkey, targets without hits in the reviewed set are searched against a broader local UniProt proteome database when that database is configured.

Reported field | Meaning | Interpretation
Bit score | BLAST alignment score normalized for database-independent comparison. | Primary ranking field. Higher values indicate stronger sequence similarity.
E-value | Expected number of alignments of similar quality by chance. | Lower values indicate more statistically significant alignments.
Identity | Fraction of identical residues across the aligned region. | Useful for local similarity, but must be interpreted with coverage.
Coverage | Aligned query span divided by query length, capped at 100%. | Distinguishes full-region similarity from short conserved motifs.
Alignment text | Original BLAST alignment block. | Inspect this to determine whether similarity overlaps the selected construct or only a small motif.

Self hits are filtered by accession and entry name. BLAST results are rank-only in the current implementation; no universal cross-reactivity threshold is applied because acceptable risk depends on antigen class, screening goal, and downstream assay design.
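
The coverage definition, self-hit filter, and bit-score ranking can be illustrated together. This is a sketch over already-parsed hits: the `Hit` class and its field names are hypothetical and do not correspond to a specific BLAST output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    accession: str
    entry_name: str
    bit_score: float
    evalue: float
    identity: float   # fraction of identical residues in the aligned region
    query_start: int  # 1-based inclusive
    query_end: int


def query_coverage(hit: Hit, query_length: int) -> float:
    """Aligned query span divided by query length, capped at 1.0 (100%)."""
    span = hit.query_end - hit.query_start + 1
    return min(1.0, span / query_length)


def rank_non_self(
    hits: List[Hit],
    self_accession: str,
    self_entry_name: str,
) -> List[Hit]:
    """Remove self hits by accession or entry name, then rank by bit score.

    No similarity threshold is applied; the result is rank-only.
    """
    non_self = [
        h
        for h in hits
        if h.accession != self_accession and h.entry_name != self_entry_name
    ]
    return sorted(non_self, key=lambda h: h.bit_score, reverse=True)
```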

PTMs and processing features

OpenAntigens extracts curated UniProt PTM-like features including glycosylation, modified residues, lipidation, disulfide bonds, cross-links, initiator methionine processing, signal peptides, propeptides, mature chains, peptides, and cleavage-related site annotations.

Every target page lists these annotations in a dedicated PTM section. Construct cards list which PTMs overlap each construct, and the Interactive Construct Builder updates the included PTMs in real time as the selected residue window changes. Processing and cleavage annotations raise boundary-review warnings because cutting through or including these sites unintentionally can affect expression and antigen quality.

Cysteine analysis

Cysteines are reported as paired or unpaired when curated disulfide evidence or inferred pairing context is available. Unpaired cysteines inside a construct trigger warnings, especially when they appear accessible, because free cysteines can promote unwanted disulfides, aggregation, or expression heterogeneity.

The Interactive Construct Builder can optionally apply Cys-to-Ser edits to selected unpaired cysteines and updates copied TSV/FASTA sequences in real time.
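
A Cys-to-Ser edit of this kind amounts to a checked point substitution. The sketch below is illustrative and independent of the builder's actual code; the function name and the use of 1-based positions within the construct sequence are assumptions.

```python
from typing import Iterable


def apply_cys_to_ser(sequence: str, positions: Iterable[int]) -> str:
    """Replace selected cysteines with serine.

    positions are 1-based indices into the construct sequence. A ValueError
    is raised if a selected position is out of range or is not a cysteine,
    so a numbering mistake cannot silently mutate another residue.
    """
    residues = list(sequence)
    for pos in positions:
        if not 1 <= pos <= len(residues):
            raise ValueError(f"position {pos} is outside the sequence")
        if residues[pos - 1] != "C":
            raise ValueError(f"residue {pos} is {residues[pos - 1]}, not C")
        residues[pos - 1] = "S"
    return "".join(residues)


print(apply_cys_to_ser("MKCAYCIAK", [6]))  # prints: MKCAYSIAK
```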

Furin-site scanning

Basic furin-like cleavage motifs are scanned across the design region and constructs. The pipeline reports suggested edits as annotations and leaves application to manual review because cleavage risk is context-dependent and some basic motifs may be biologically important.
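
A motif scan of this type can be sketched with a regular expression. The exact motif definition used by the pipeline is not given here, so the minimal consensus R-X-[K/R]-R below is an assumption, as is the function name. The scan only reports matches and leaves any edit to manual review.

```python
import re
from typing import List, Tuple

# Assumed minimal furin-like consensus: R-X-[K/R]-R. The lookahead lets
# overlapping motifs be reported separately.
FURIN_LIKE = re.compile(r"(?=(R.[KR]R))")


def scan_furin_like_sites(sequence: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, motif) for each match, 1-based inclusive."""
    sites = []
    for match in FURIN_LIKE.finditer(sequence):
        start = match.start() + 1
        motif = match.group(1)
        sites.append((start, start + len(motif) - 1, motif))
    return sites


print(scan_furin_like_sites("AARAKRSGG"))  # prints: [(3, 6, 'RAKR')]
```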

Ligand and interaction regions

Curated UniProt features, PDB mappings, and annotation text are used to report ligand-binding or interaction-related regions when available. These annotations help users avoid trimming known functional surfaces that may be needed for conformational antibody discovery.

Assembly and partner requirements

Interaction and complex information is summarized separately from obligatory-partner classification. Known interactions provide context; obligatory-partner warnings require stronger curated complex evidence.

Obligatory-partner warnings are anchored to curated Complex Portal records when that module is enabled. UniProt SUBUNIT comments are retained as interaction and assembly context without independently creating obligatory-partner warnings.

Class | Evidence rule | How to use it
Obligatory partner requirement | Human Complex Portal evidence supports a biologically relevant mandatory partner relationship, currently including curated integrin alpha/beta complexes. | Plan co-expression, partner-chain inclusion, or extra validation before treating an isolated construct as biologically complete.
Mouse complex support | Mouse Complex Portal evidence supports a comparable ortholog complex. Mouse-only evidence stays contextual for the human report. | Cross-species context for reagent strategy and validation planning.
Conditional assembly | Language such as ligand-induced, upon ligand, or ligand-triggered dimerization. | Context for conditional biology; partner inclusion depends on the assay and construct goal.
Stable complex context | Curated human Complex Portal membership without mandatory-partner status for antigen design. | Biological context for deciding whether to co-express partners.
Interaction context | Generic phrases such as interacts with or forms a complex with. | Interaction awareness without mandatory-partner status.

This conservative behavior avoids false mandatory-partner calls that would incorrectly discourage soluble antigen designs. Reports can list many known interactions without an obligatory requirement when curated evidence falls short of the stronger mandatory-partner criteria.

Disease, literature, and antibody-resource context

Open Targets direct target-disease associations are precomputed as a separate resumable artifact. The portal shows the top disease association in the index, loads a disease-name index for disease-aware filtering, and lists disease score plus datasource-score context on each target page.

PubTator3 is queried for approximate gene-linked PubMed hit counts. These counts support triage and literature awareness; they do not establish target validity, which requires separate curation. CiteAb links are provided as external antibody-resource shortcuts; CiteAb catalog content remains external.

Static assets and visualization

When assets are generated, each construct can include structure images, pLDDT plots, and PAE plots with the construct region highlighted. Structure renderings use AlphaFold models colored by pLDDT and emphasize cysteines when possible. The portal also includes an Interactive Construct Builder that links sequence selection, structure highlighting, pLDDT, PAE, warnings, and copy-ready sequence outputs.

Static image assets support report review. Interactive widgets support boundary tuning and exploratory design.

Update cadence and limitations

OpenAntigens is released as a static, versioned snapshot. Each release records the build date, target universe, OpenAntigens version when available, and source database versions when available. The planned public update cadence is every 6 months.

Because each release is static, source databases may have changed after the displayed release date. Cite the release date and downloaded artifact names when using OpenAntigens data in downstream analyses.

Limitation | Consequence
AlphaFold confidence is computational. | High pLDDT/low PAE supports design reasoning. Expression, folding, and epitope presentation require experimental evidence. AlphaFold 3 artifacts are computational fallback models for non-commercial use under the AlphaFold Server Output Terms; experimental structures remain separate evidence.
Topology annotations can be incomplete or inconsistent. | Mature secreted and extracellular-region boundaries require manual review for important targets.
BLAST is sequence-based. | It cannot fully predict antibody cross-reactivity, conformational epitope overlap, glycosylation effects, or cell-surface accessibility.
Family assignment can be imperfect. | Some proteins have broad, overlapping, or missing family annotations; paralog matrices provide context and require biological review.
Ortholog mapping depends on available canonical sequences. | Missing or predicted RefSeq records can limit mouse or cynomolgus monkey construct transfer.
Construct heuristics are general-purpose. | Known biology, literature constructs, ligand-binding requirements, and assay-specific constraints can override automated suggestions.