Pre-generated antigen constructs and design heuristics

OpenAntigens generates construct candidates from resolved design scope, domain annotations, structure-derived confidence, PDB chain coverage, and ortholog mappings when those inputs are available.

Why pre-generate constructs?

The construct list keeps boundary generation and evidence fields consistent across reports. When enough information is available, protein reports contain candidate boundaries derived from topology, curated annotations, AlphaFold confidence, PDB evidence, PTM/processing features, cysteine context, and ortholog mappings.

Use the list to inspect candidate boundaries alongside the evidence behind them. Before ordering, review target biology, literature, partner requirements, assay goals, and expression constraints.

Strict and lenient construct algorithm

This section summarizes the AlphaFold-derived portion of the construct generator, which runs when a compatible AlphaFold model and PAE matrix are available. Strict constructs use conservative pLDDT seeds followed by recursive PAE boundary scoring; lenient constructs use a lower pLDDT threshold to preserve larger structured regions.

The two AlphaFold confidence signals are used for different decisions. pLDDT is a one-dimensional per-residue signal, so it first defines candidate structured intervals. The lenient track stops after this step and keeps larger intervals using a lower pLDDT threshold. The strict track takes only higher-confidence pLDDT seed intervals and then uses the two-dimensional PAE matrix to decide whether each seed should be split into smaller domain-like units.

For strict splitting, every candidate boundary that leaves both resulting fragments at least 40 amino acids long is tested. A boundary is eligible when the predicted error between the two sides is high relative to the predicted error within each side. Low linker pLDDT near the boundary adds support for a split because it suggests a flexible connector between structured blocks.

Step-by-step strict and lenient generation

Define the eligible design region. The pipeline first determines the region that can be used for soluble construct design. For secreted proteins this is usually the mature secreted chain; for single-pass proteins it is usually the extracellular region; for multipass proteins, large extracellular regions can be treated as mixed soluble/membrane cases, while membrane-expression constructs are handled separately. Strict and lenient soluble constructs stay inside this region.
Retrieve canonical AlphaFold confidence data when available. The canonical UniProt sequence is matched to the AlphaFold structure and PAE matrix. pLDDT is used as a per-residue local-confidence signal. PAE is used as a pairwise confidence signal that indicates whether two parts of the model have a confident relative orientation.
Generate lenient pLDDT regions. The lenient track marks residues with pLDDT at least 60, allows low-confidence gaps up to 12 residues, and discards resulting regions shorter than 25 residues. This track preserves larger structured units and can include more than one domain when the intervening linker is short or only moderately uncertain.
Generate strict pLDDT seed regions. The strict track marks residues with pLDDT at least 70, allows low-confidence gaps up to 8 residues, and discards seed regions shorter than 25 residues. These high-confidence intervals then feed PAE-based domain splitting.
Evaluate PAE split candidates inside each strict seed. For every possible cut, both resulting sides must remain at least 40 amino acids. The pipeline computes intrablock PAE within each side, interblock PAE between the two sides, and separation, defined as interblock PAE minus mean intrablock PAE.
Accept only convincing PAE boundaries. A cut is eligible only when interblock PAE is at least 12 Å and separation is at least 4 Å. This means each side is more coherent internally than the two sides are with respect to each other, which is the signal expected at a domain boundary or flexible hinge.
Score eligible strict split boundaries. Candidate boundaries are ranked using split score = separation + 0.35 × max(0, boundary PAE − intrablock PAE) + linker bonus. The linker bonus increases when local pLDDT near the cut is low, because low-confidence linker residues support splitting adjacent structured blocks.
Recursively split strict regions. The highest-scoring eligible cut is applied, then the same PAE split search is repeated on the left and right child regions. Recursion stops when no valid cut remains, child regions would become too small, or the configured maximum recursion depth is reached.
Apply final construct filters. Calculated soluble constructs must stay inside the design region and must be at least 50 amino acids. Exact duplicate boundaries from different calculated or annotated sources are collapsed to keep reports concise; PDB-backed constructs are kept individually so no experimental structure is dropped.
Annotate each construct for review. Construct cards list sequence, boundaries, exports, included UniProt/InterPro annotations, PTMs, furin-like motifs, cysteine warnings, homolog-equivalent regions, and construct-specific conservation. Structural metrics, PAE split diagnostics, ligand-interaction annotations, and images appear when available.
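
The PAE steps above (split evaluation, eligibility, scoring, and recursion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the OpenAntigens implementation. The 40-residue side minimum, the 12 Å and 4 Å thresholds, and the 0.35 weight come from the text. The window size, the maximum depth, the exact form of the linker bonus, and all function names are hypothetical. Coordinates are 0-based and half-open, and `pae` is a square list of lists.

```python
# Sketch of strict PAE splitting. Coordinates are 0-based, half-open [start, end).
MIN_SIDE = 40           # residues each side of a cut must keep (from the text)
MIN_INTERBLOCK = 12.0   # Å, minimum interblock PAE (from the text)
MIN_SEPARATION = 4.0    # Å, minimum separation (from the text)
BOUNDARY_WEIGHT = 0.35  # weight on local boundary enrichment (from the text)
WINDOW = 5              # ASSUMPTION: residues on each side of the cut treated as "local"
MAX_DEPTH = 3           # ASSUMPTION: stand-in for the configured maximum recursion depth


def block_mean(pae, rows, cols):
    """Mean PAE over the rectangle rows x cols, each a half-open (start, end) range."""
    total = sum(pae[i][j] for i in range(*rows) for j in range(*cols))
    return total / ((rows[1] - rows[0]) * (cols[1] - cols[0]))


def linker_bonus(plddt, lo, hi):
    """ASSUMPTION: hypothetical bonus that grows as mean pLDDT near the cut drops below 70."""
    local = sum(plddt[lo:hi]) / (hi - lo)
    return max(0.0, (70.0 - local) / 10.0)


def score_cut(pae, plddt, start, cut, end):
    """Return the split score for cutting [start, end) at `cut`, or None if ineligible."""
    left, right = (start, cut), (cut, end)
    intra = (block_mean(pae, left, left) + block_mean(pae, right, right)) / 2
    # PAE is not symmetric, so average both off-diagonal rectangles.
    inter = (block_mean(pae, left, right) + block_mean(pae, right, left)) / 2
    separation = inter - intra
    if inter < MIN_INTERBLOCK or separation < MIN_SEPARATION:
        return None
    near_left = (max(start, cut - WINDOW), cut)
    near_right = (cut, min(end, cut + WINDOW))
    boundary = (block_mean(pae, near_left, near_right)
                + block_mean(pae, near_right, near_left)) / 2
    return (separation
            + BOUNDARY_WEIGHT * max(0.0, boundary - intra)
            + linker_bonus(plddt, near_left[0], near_right[1]))


def split_region(pae, plddt, start, end, depth=0):
    """Recursively split a strict seed; returns a list of (start, end) blocks."""
    if depth >= MAX_DEPTH:
        return [(start, end)]
    best_score, best_cut = None, None
    # Only cuts that leave at least MIN_SIDE residues on both sides are tested.
    for cut in range(start + MIN_SIDE, end - MIN_SIDE + 1):
        score = score_cut(pae, plddt, start, cut, end)
        if score is not None and (best_score is None or score > best_score):
            best_score, best_cut = score, cut
    if best_cut is None:
        return [(start, end)]
    return (split_region(pae, plddt, start, best_cut, depth + 1)
            + split_region(pae, plddt, best_cut, end, depth + 1))
```

On a synthetic 100-residue seed with two 50-residue blocks (2 Å within each block, 20 Å between them), `split_region` cuts once at the block boundary and then stops, because neither child can be cut again without leaving a side shorter than 40 residues. The brute-force means are written for readability; a real matrix would usually be handled with prefix sums or an array library.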

Construct classes

Constructs are displayed in a fixed order so reports can be compared target-to-target. The order starts with broad biological context, then moves toward narrower or more computationally inferred regions.

Display order	Construct class	How boundaries are generated	Review focus
1	Full design region	Uses the inferred mature secreted or extracellular design scope from topology, signal peptide, propeptide, chain, and transmembrane annotations.	Preserving complete soluble context and conformational epitopes.
2	PDB-backed constructs	Uses PDB chain coverage mapped to UniProt numbering. Every deposited structure is listed individually — even when its boundaries duplicate another construct — because each is independent experimental evidence. The construct name links to its RCSB entry.	Reviewing deposited structure boundaries.
3	Domain annotated constructs	Uses UniProt and InterPro domain-like annotations that lie fully inside the design scope.	Designing named biological domains or domain modules.
4	Strict calculated constructs	Uses more conservative AlphaFold pLDDT/PAE segmentation to isolate compact, cohesive structural regions.	Single-domain-like antigens and cleaner soluble expression candidates.
5	Lenient calculated constructs	Uses more permissive structural thresholds to retain larger coherent or partially coherent regions.	Multi-domain surfaces and larger conformational epitopes.
6	Membrane-expression constructs	For multipass proteins, retains membrane-spanning context while trimming obvious disordered cytoplasmic tails when appropriate.	Whole membrane protein expression, GPCR-like targets, transporters, channels, and mixed soluble/membrane cases.

The default minimum construct length is 50 amino acids; shorter candidates are filtered from the recommendation set.

Design scope comes first

OpenAntigens first determines the design scope: a mature secreted sequence, a GPI-anchored or single-pass extracellular region, a large multipass extracellular region, or a full-length membrane-protein expression region. Soluble constructs stay inside that inferred scope.

For secreted proteins, signal peptide and propeptide annotations help define the mature sequence. For single-pass proteins, transmembrane and topology annotations define the extracellular boundaries. For multipass proteins, large extracellular regions are treated as mixed cases; proteins with only short loops are usually routed to membrane-expression target review.

Boundary filtering

Recommended soluble constructs must fit inside the design scope. InterPro or UniProt annotations that extend outside the mature secreted or extracellular region remain descriptive context.

PDB-derived boundaries that are fully outside the relevant scope are ignored. PDB entries that overlap the scope are each retained as a separate construct, even when their boundaries duplicate another construct, so every experimental structure is captured as evidence.
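
The filtering rules above can be sketched in a few lines. This is a minimal illustration, not the OpenAntigens implementation: the `Candidate` class, the source labels, and `filter_candidates` are hypothetical names, and boundaries are assumed to be 1-based and inclusive. Two points the text leaves open are decided here by assumption: the minimum length is applied only to non-PDB candidates, and overlapping PDB boundaries are kept unclipped.

```python
from dataclasses import dataclass

MIN_LENGTH = 50  # default minimum construct length from the text


@dataclass(frozen=True)
class Candidate:
    source: str      # hypothetical labels: "pdb", "domain", "strict", "lenient"
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    label: str = ""  # free-text identifier, e.g. a structure or domain name


def filter_candidates(candidates, scope_start, scope_end):
    """Apply scope, length, and duplicate rules; input order is preserved."""
    kept = []
    seen = set()  # boundaries already emitted by a non-PDB source
    for c in candidates:
        if c.source == "pdb":
            # PDB entries: ignored only when fully outside the scope,
            # otherwise always kept, even with duplicate boundaries.
            if c.start <= scope_end and c.end >= scope_start:
                kept.append(c)
            continue
        inside = scope_start <= c.start and c.end <= scope_end
        long_enough = (c.end - c.start + 1) >= MIN_LENGTH
        key = (c.start, c.end)
        if inside and long_enough and key not in seen:
            seen.add(key)  # later exact duplicates are collapsed
            kept.append(c)
    return kept
```

With a scope of residues 25 to 300, two PDB candidates sharing the boundaries 30 to 200 are both kept, a PDB candidate at 310 to 400 is dropped, and a strict candidate that repeats the boundaries of an earlier domain candidate is collapsed into it.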

Strict versus lenient

Strict calculated constructs favor smaller regions with stronger local confidence and cleaner structural cohesion. They are useful when the goal is a compact antigen domain with fewer flexible residues.

Lenient calculated constructs tolerate broader regions and can preserve multi-domain surfaces. They are useful when the antibody-discovery goal may require a larger conformational epitope or when domains are structurally adjacent.

Interpretable evidence

Construct cards emphasize interpretable evidence: boundaries, sequence, class, structural metrics, PTMs, cysteines, homolog equivalents, PDB evidence, and warnings.

These fields keep each boundary traceable. A lower-confidence construct may still be appropriate if it preserves an important ligand interface or known epitope.

AlphaFold-derived structural heuristics

When a compatible AlphaFold model and PAE matrix are available, OpenAntigens uses AlphaFold confidence in two complementary ways: pLDDT estimates local residue confidence, and PAE estimates confidence in relative placement between residue pairs or regions. The construct generator uses both signals to identify coherent structural blocks and plausible split points.

The strict workflow is designed to find compact, domain-like units. It starts with residues whose pLDDT is at least 70, merges short low-confidence gaps up to 8 residues, discards structured seeds shorter than 25 residues, and then evaluates PAE-based split points. A split is considered only if each side would remain at least 40 residues long. The split must have an interblock PAE of at least 12 Å and a PAE separation of at least 4 Å, where separation is the difference between interblock PAE and the average intrablock PAE of the two sides.

Accepted split candidates are scored by combining PAE separation, local boundary PAE enrichment, and linker pLDDT. Boundaries are favored when the candidate blocks are internally coherent, the two sides have uncertain relative placement, and the linker around the boundary has lower confidence. The splitter recurses up to the configured maximum depth, producing strict calculated constructs.
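
The eligibility test and score described in this section can be written compactly. The notation below is introduced here for illustration only: $c$ is a candidate cut, $L_c$ and $R_c$ are the left and right blocks it produces, and $b_{\text{linker}}(c)$ stands for the linker bonus, whose exact form the text does not specify. Using the mean intrablock PAE inside the boundary term is an assumption.

```latex
\begin{aligned}
\overline{\mathrm{PAE}}_{\text{intra}}(c) &= \tfrac{1}{2}\left(\mathrm{PAE}_{\text{intra}}(L_c) + \mathrm{PAE}_{\text{intra}}(R_c)\right) \\
\mathrm{sep}(c) &= \mathrm{PAE}_{\text{inter}}(c) - \overline{\mathrm{PAE}}_{\text{intra}}(c) \\
\text{eligible}(c) &\iff \mathrm{PAE}_{\text{inter}}(c) \ge 12\,\text{\AA} \;\wedge\; \mathrm{sep}(c) \ge 4\,\text{\AA} \;\wedge\; \min\bigl(|L_c|, |R_c|\bigr) \ge 40 \\
\mathrm{score}(c) &= \mathrm{sep}(c) + 0.35\,\max\bigl(0,\ \mathrm{PAE}_{\text{boundary}}(c) - \overline{\mathrm{PAE}}_{\text{intra}}(c)\bigr) + b_{\text{linker}}(c)
\end{aligned}
```

As a made-up worked example: with an interblock PAE of 18 Å and intrablock PAEs of 3 Å and 5 Å, the mean intrablock PAE is 4 Å and the separation is 14 Å, so the cut is eligible. If the local boundary PAE is 20 Å, the enrichment term is 0.35 × 16 = 5.6, giving a score of 19.6 plus the linker bonus.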

The lenient workflow is intentionally less conservative. It uses a pLDDT threshold of 60 and merges gaps up to 12 residues, which often preserves larger structured regions that strict splitting would divide. Lenient constructs are therefore useful when preserving a multi-domain surface may be more important than isolating a compact single domain.
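
The pLDDT step shared by both workflows (threshold, gap merging, length filter) can be sketched as one function with two parameter sets. The thresholds are the ones stated in this document; the function name, the parameter names, and the 0-based half-open coordinates are choices made for this sketch.

```python
def plddt_regions(plddt, threshold, max_gap, min_len):
    """Return 0-based, half-open (start, end) intervals of structured residues."""
    # 1. Collect maximal runs of residues at or above the threshold.
    runs = []
    run_start = None
    for i, value in enumerate(plddt):
        if value >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(plddt)))

    # 2. Merge neighbouring runs separated by a gap of at most max_gap residues.
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # 3. Discard merged regions shorter than min_len residues.
    return [(s, e) for s, e in merged if e - s >= min_len]


STRICT = {"threshold": 70, "max_gap": 8, "min_len": 25}
LENIENT = {"threshold": 60, "max_gap": 12, "min_len": 25}
```

For a toy profile of 30 confident residues, a 10-residue dip to pLDDT 65, and 30 more confident residues, `plddt_regions(profile, **STRICT)` yields two seeds because the dip is below 70 and longer than 8 residues, while `plddt_regions(profile, **LENIENT)` yields a single region because every residue is at least 60.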

Diagnostic	What it measures	How to interpret it
Mean pLDDT	Average local AlphaFold confidence across the construct.	Higher values support local fold confidence. Low values suggest disorder, flexible tails, or uncertain loops.
Structured coverage	Fraction of construct residues above the configured pLDDT structured threshold.	Higher coverage supports a more consistently structured construct.
Mean intra-domain PAE	Average PAE among residues inside the construct or block.	Lower values suggest a coherent structural unit.
Interblock PAE	PAE between two candidate neighboring blocks around a possible split.	Higher interblock PAE can support separating blocks into distinct constructs.
Separation	Difference between interblock and intrablock PAE around a candidate boundary.	Larger separation supports a domain boundary or flexible hinge.
Local boundary PAE	PAE near a candidate split boundary.	High boundary PAE can indicate uncertain relative orientation near the split.
Linker pLDDT	pLDDT around residues connecting blocks.	Low linker pLDDT supports trimming or splitting near that linker.

Interpret these diagnostics with domain annotations, PDB evidence, PTMs, ligand sites, cysteines, and the intended antibody-discovery strategy.
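
Three of the per-construct diagnostics in the table above are simple averages, sketched below. The function name, the dictionary keys, the default structured threshold of 70, and the decision to include the PAE diagonal in the mean are assumptions of this sketch. Coordinates are 0-based and half-open.

```python
def construct_diagnostics(plddt, pae, start, end, structured_threshold=70.0):
    """Summarise AlphaFold confidence for residues [start, end) of a model.

    plddt is a per-residue list; pae is a square list of lists in Å.
    """
    n = end - start
    if n <= 0:
        raise ValueError("construct must contain at least one residue")
    values = plddt[start:end]
    # Mean pLDDT: average local confidence across the construct.
    mean_plddt = sum(values) / n
    # Structured coverage: fraction of residues at or above the threshold.
    coverage = sum(1 for v in values if v >= structured_threshold) / n
    # Mean intra-domain PAE: average over all residue pairs inside the construct
    # (ASSUMPTION: the zero diagonal is included).
    pae_total = sum(pae[i][j] for i in range(start, end) for j in range(start, end))
    return {
        "mean_plddt": mean_plddt,
        "structured_coverage": coverage,
        "mean_intra_pae": pae_total / (n * n),
    }
```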

Domain and PDB evidence

Evidence	How it is used	Important limitation
UniProt domains/features	Curated features are merged into the residue map and can define domain annotated constructs when boundaries fit inside the design scope.	Feature granularity varies by protein and curator coverage.
InterPro domains	Specific domain signatures can define domain annotated constructs and help classify construct content.	Broad whole-protein signatures are filtered from construct recommendations when they extend outside the relevant design scope.
Pfam domains	Useful as domain evidence and descriptive annotations.	Paralog matrices use canonical family identifiers; Pfam-only family IDs remain domain evidence.
PDB mapped regions	Mapped experimental boundaries can become PDB-backed constructs and provide precedent for construct boundaries.	PDB constructs document deposited structure boundaries; antibody-discovery context still determines final suitability.

What each construct card includes

Each construct card lists the primary human sequence, boundaries, export-ready TSV/FASTA, ortholog-equivalent rows when available, and construct-specific evidence, diagnostics, and warnings when available.

Field	Meaning	Why it matters
Name	`GENE_SPECIES_start-end`, with optional mutation suffixes when edits are applied in the builder.	Provides traceable construct identity for cloning, ordering, and analysis.
Boundary	Amino-acid start/end in the source protein sequence.	Defines the exact construct region.
Sequence	Construct amino-acid sequence.	Can be copied directly into cloning or ordering workflows.
Homologs	Mouse and cynomolgus monkey equivalent regions when ortholog mapping is available.	Supports cross-species reagent and screening strategy.
Identity to human	Construct-specific sequence identity between species-equivalent and human regions.	Helps triage conservation and cross-species reagent risk; antibody behavior still depends on epitope, structure, glycans, and assay context.
PTMs included	Curated UniProt PTM or processing annotations overlapping the construct.	Flags glycosylation, cleavage, chain, propeptide, lipidation, or other features that may affect design.
Cysteines	Construct-specific paired/unpaired cysteine analysis.	Flags potential disulfide, aggregation, or expression liabilities.
Furin motifs	Basic furin-like motifs inside the construct.	Raises cleavage-risk review items.
Images/plots	Static structure PNGs when generated, plus pLDDT/PAE quality plots when available.	Provides a quick check of model confidence and boundary placement.
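
As an illustration of the furin motif field, the sketch below scans a construct for the minimal furin consensus R-X-[K/R]-R. Whether OpenAntigens uses exactly this pattern is an assumption; the function name and the 1-based position convention are also choices made here.

```python
import re

# Minimal furin consensus R-X-[K/R]-R. The lookahead lets overlapping matches be reported.
FURIN_LIKE = re.compile(r"(?=(R.[KR]R))")


def find_furin_like(sequence, construct_start=1):
    """Return (position, motif) pairs; position is the residue number of the first R.

    construct_start is the source-protein residue number of sequence[0].
    """
    return [(m.start() + construct_start, m.group(1))
            for m in FURIN_LIKE.finditer(sequence.upper())]
```

For a toy construct `AAARSKRGG` starting at residue 101, this reports one hit, `RSKR`, at residue 104. A hit only marks a sequence pattern for review; it does not establish that cleavage occurs.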

Homolog transfer

Human construct boundaries are transferred to mouse and cynomolgus monkey by aligning the full design scope to each ortholog sequence. The alignment projects human boundary positions onto ortholog sequences, so residue numbers can differ across species.

Unreliable or missing ortholog alignments are marked unavailable.
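
Boundary projection through a pairwise alignment can be sketched as below. The sketch assumes the alignment has already been computed by some aligner and is supplied as two equal-length gapped strings. The function name, the inward shrinking to the nearest aligned residues, and returning `None` for "unavailable" are assumptions, not the documented behavior.

```python
def project_boundaries(aligned_human, aligned_ortholog, human_start, human_end,
                       human_offset=1):
    """Project 1-based inclusive human boundaries onto ortholog numbering.

    aligned_human / aligned_ortholog: equal-length strings with '-' for gaps.
    human_offset: human residue number of the first human letter in the alignment
    (for example, the first residue of the design scope).
    Ortholog numbering starts at 1 at the first ortholog letter.
    Returns (start, end) in ortholog numbering, or None if nothing maps.
    """
    if len(aligned_human) != len(aligned_ortholog):
        raise ValueError("aligned strings must have equal length")
    human_pos = human_offset - 1
    ortholog_pos = 0
    mapping = {}  # human residue number -> ortholog residue number
    for h, o in zip(aligned_human, aligned_ortholog):
        if h != "-":
            human_pos += 1
        if o != "-":
            ortholog_pos += 1
        if h != "-" and o != "-":
            mapping[human_pos] = ortholog_pos
    # Keep only human positions in the construct that face an ortholog residue.
    covered = [p for p in range(human_start, human_end + 1) if p in mapping]
    if not covered:
        return None  # treated here as "unavailable"
    return mapping[covered[0]], mapping[covered[-1]]
```

With the toy alignment `MKT-AYLV` over `M-TGAYLV`, human residues 3 to 5 project to ortholog residues 2 to 5, which shows how an insertion or deletion shifts the residue numbers between species.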

Cysteine edits

Pre-generated construct cards report unpaired cysteine warnings. In the Interactive Construct Builder, checkboxes can apply Cys-to-Ser edits to selected unpaired cysteines; names, TSV, and FASTA update immediately.

Cys-to-Ser edits are optional substitutions for flagged unpaired cysteines. Review known disulfides, structure, and functional biology before ordering.
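
A minimal sketch of how a Cys-to-Ser edit could propagate to the construct name and FASTA export is shown below. The `GENE_SPECIES_start-end` pattern comes from the construct card table; the `C<position>S` suffix format, the species token, and all function names are hypothetical.

```python
def apply_cys_to_ser(sequence, construct_start, positions):
    """Replace selected cysteines with serine.

    positions are residue numbers in the source protein; construct_start is the
    residue number of sequence[0]. Returns (edited_sequence, mutation_labels).
    """
    residues = list(sequence)
    labels = []
    for pos in sorted(positions):
        idx = pos - construct_start
        if not 0 <= idx < len(residues):
            raise ValueError(f"position {pos} is outside the construct")
        if residues[idx] != "C":
            raise ValueError(f"residue {pos} is {residues[idx]}, not C")
        residues[idx] = "S"
        labels.append(f"C{pos}S")  # hypothetical suffix format
    return "".join(residues), labels


def construct_name(gene, species, start, end, mutations=()):
    """Build GENE_SPECIES_start-end, followed by any mutation suffixes."""
    return "_".join([f"{gene}_{species}_{start}-{end}", *mutations])


def to_fasta(name, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"
```

For a toy construct `ACDCE` spanning residues 10 to 14, editing residue 13 gives the sequence `ACDSE` and, under the assumed suffix format, the name `GENE1_HUMAN_10-14_C13S`.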

PTM-aware review

Construct cards list overlapping PTMs and processing features for boundary review.

Cleavage, propeptide, signal peptide, and chain annotations deserve special attention because they can define mature protein boundaries.

Multipass proteins

Large extracellular regions on multipass proteins are handled as mixed cases with soluble extracellular-region constructs and full-length membrane-protein context. Proteins with only short extracellular loops are treated primarily as membrane-expression targets.

Membrane-expression suggestions are intentionally separate from soluble extracellular-region constructs because they answer a different experimental question.

How to choose among pre-generated constructs

Start with the full design region to understand the complete mature secreted or extracellular context.
Check PDB-backed constructs for experimentally observed boundary precedent.
Use domain annotated constructs when the goal is a named biological domain.
Use strict calculated constructs when compact soluble expression is the priority.
Use lenient calculated constructs when a larger conformational epitope or multi-domain surface may matter.
Review PTMs, cysteines, furin motifs, ligand regions, and assembly/partner requirements before finalizing.
Inspect homolog-equivalent sequences if mouse or cynomolgus monkey screening, immunization, or reagent validation is planned.
Open the Interactive Construct Builder to refine boundaries and export the exact final sequence.

Limitations

Pre-generated constructs are computational recommendations. They can miss literature-specific constructs, expression-system constraints, epitope-specific requirements, glycan-dependent biology, partner-dependent folding, and context-dependent cleavage or processing. AlphaFold confidence supports structural reasoning; expression, secretion, folding, and antibody accessibility require experimental validation.

Use these constructs as a reproducible starting set. Final designs should be reviewed by a scientist familiar with the target biology and validated experimentally.