API Reference

Module: sig2dna_core.signomics.py (from the Generative Simulation Initiative)

This is the core module of the sig2dna framework, dedicated to transforming numerical chemical signals into DNA-like symbolic representations. The module enables symbolic analysis, fingerprinting, alignment, and classification of complex analytical signals, such as:

GC-MS / GC-FID
HPLC-MS
NMR / FTIR / Raman
X-ray and other spectroscopy data

It is designed to facilitate high-throughput pattern recognition, compression, clustering, and AI/ML-based classification. Symbolic transformation is based on wavelet decomposition (Mexican Hat/Ricker) and segment encoding into letters (e.g., A, B, C, X, Y, Z). The representation preserves key structural patterns (e.g., peak transitions) across multiple scales and supports entropy-based distances.

Main Components

DNAsignal — core class to transform a signal into DNA-like symbolic representation
DNAstr — string subclass enabling alignment, entropy analysis, visualization, and reconstruction
DNApairwiseAnalysis — distance and clustering toolbox for aligned DNA codes (PCoA, dendrogram, 2D/3D plots)

Key Features

Multi-scale wavelet transform with symbolic encoding
Symbolic entropy and mutual information measures
Fast symbolic alignment using difflib or biopython
Pairwise symbolic distances (Shannon, excess entropy, Jaccard, Jensen-Shannon)
Interactive plotting: segments, alignment masks, triangle patches
Motif search, alignment visualization, HTML and terminal rendering
Dimensionality reduction (MDS), clustering, dendrogram and heatmaps

Core Concept

Input Signal

- 1D NumPy array S of shape (m,)
- Data type: np.float64 (default) or np.float32
- Typically sparse and non-negative, such as GC-MS total ion chromatograms

Wavelet Transform

- CWT with Ricker (Mexican hat) wavelet
- Scales: $s = 2^0, 2^1, \dots, 2^n$
- Downsampling by scale (to reduce data volume and capture features at relevant resolutions)
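As an illustration of the transform step, here is a minimal pure-Python sketch of a Ricker CWT at dyadic scales followed by downsampling by scale. The function names (`ricker`, `cwt_ricker`), the support of plus or minus five widths, and the slicing used for downsampling are assumptions made for this example; the module itself relies on PyWavelets, whose normalisation may differ.

```python
import math

def ricker(n_points, a):
    """Sample a Ricker (Mexican hat) wavelet of width `a` on `n_points` centred points."""
    norm = 2.0 / (math.sqrt(3.0 * a) * math.pi ** 0.25)
    centre = (n_points - 1) / 2.0
    out = []
    for i in range(n_points):
        t = (i - centre) / a
        out.append(norm * (1.0 - t * t) * math.exp(-0.5 * t * t))
    return out

def cwt_ricker(y, scales):
    """Convolve `y` with one wavelet per scale (zero-padded, same-length output)."""
    n = len(y)
    result = {}
    for s in scales:
        half = int(5 * s)                 # assumed support: +/- 5 widths
        w = ricker(2 * half + 1, s)       # symmetric, so convolution == correlation
        row = []
        for i in range(n):
            acc = 0.0
            for j in range(-half, half + 1):
                if 0 <= i + j < n:
                    acc += y[i + j] * w[j + half]
            row.append(acc)
        result[s] = row
    return result

scales = [2 ** k for k in range(4)]       # 1, 2, 4, 8
y = [0.0] * 64
y[32] = 1.0                               # a single sharp peak
T = cwt_ricker(y, scales)

# Assumed downsampling rule: keep one point out of `s` at scale `s`.
downsampled = {s: row[::s] for s, row in T.items()}
```

The transform of an isolated peak is positive at the peak centre and negative in the two side lobes, which is what produces the zero crossings used by the symbolic encoding.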

Symbolic Encoding and compressed representation

The transformed signal Ts at scale s is converted to a sequence of symbolic letters using the rules:

Symbol	Description
A	Monotonic increase crossing from − to +
B	Monotonic increase from − to − (no zero crossing)
C	Monotonic increase from + to +
X	Monotonic decrease from + to + (no zero crossing)
Y	Monotonic decrease from − to −
Z	Monotonic decrease crossing from + to −
_	Zero or noise (after filtering)

Each encoded segment is associated with:

- width: number of points
- height: amplitude difference

These form the compressed representation.
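The rules in the symbol table can be sketched as follows. This is a hedged illustration only: `classify` and `encode_segments` are hypothetical helpers, and the handling of segments that start or end exactly at zero is an assumption, not necessarily what the module does.

```python
def classify(y0, y1):
    """Map a monotonic segment going from y0 to y1 to a letter of the table."""
    if y1 > y0:                              # increasing segment
        if y0 < 0 < y1:
            return "A"                       # crosses zero upward
        return "B" if y1 <= 0 else "C"
    if y1 < y0:                              # decreasing segment
        if y1 < 0 < y0:
            return "Z"                       # crosses zero downward
        return "X" if y1 >= 0 else "Y"
    return "_"                               # flat segment

def sign(v):
    return (v > 0) - (v < 0)

def encode_segments(y):
    """Split y into maximal monotonic runs, returned as (letter, width, height)."""
    segments, start = [], 0
    for i in range(1, len(y)):
        step = sign(y[i] - y[i - 1])
        nxt = sign(y[i + 1] - y[i]) if i + 1 < len(y) else None
        if nxt != step:                      # direction changes or signal ends
            segments.append((classify(y[start], y[i]), i - start, y[i] - y[start]))
            start = i
    return segments

# Shape of a wavelet-transformed peak: negative lobe, positive lobe, negative lobe.
y = [0.0, -1.0, -2.0, 1.0, 3.0, 1.0, -2.0, -1.0, 0.0]
segments = encode_segments(y)
letters = "".join(letter for letter, _, _ in segments)
```

On this toy input the letters spell the motif YAZB, with one (width, height) pair per letter forming the compressed representation.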

Installation

Install all dependencies with:

conda install pywavelets seaborn scikit-learn
conda install -c conda-forge python-Levenshtein biopython

Examples

>>> from signomics import DNAsignal
>>> from signal import signal

>>> # Load a sampled signal (e.g., from GC-MS, Raman)
>>> S = signal.from_peaks(...)  # or any constructor for sampled signals

>>> # Encode into DNA-like format
>>> D = DNAsignal(S, encode=True)
>>> D.encode_dna()
>>> D.plot_codes(scale=4)

>>> # Compare samples and cluster
>>> Dlist = [DNAsignal(S1, encode=True), DNAsignal(S2, encode=True), ...]
>>> analysis = DNAsignal._pairwiseEntropyDistance(Dlist, scale=4)
>>> analysis.plot_dendrogram()
>>> analysis.scatter(n_clusters=3)

Notes

The methodology implemented in this module covers and extends the approaches initially tested during the PhD of Julien Kermorvant: “Concept of chemical fingerprints applied to the management of chemical risk of materials, recycled deposits and food packaging”, PhD thesis, AgroParisTech, December 2023. https://theses.hal.science/tel-04194172

Maintenance & forking

$ git init -b main
$ gh repo create sig2dna --public --source=. --remote=origin --push
$ # alternatively
$ # git remote add origin git@github.com:ovitrac/sig2dna.git
$ # git branch -M main  # Ensure current branch is named 'main'
$ # git push -u origin main  # Push and set upstream tracking

$ tree -P '.py' -P '.md' -P 'LICENSE' -I '__pycache__|.*' --prune
$ conda activate base
$ pdoc ./sig2dna_core/signomics.py -f --html -o ./docs
$ doctoc --github --maxlevel 2 README.md

$ conda activate sphinxdoc
$ cd docs_sphinx/
$ make clean
$ make html
$ cp -rp build/html/. ../docs

Author: Olivier Vitrac, olivier.vitrac@gmail.com
Revision: 2025-06-13

class sig2dna_core.signomics.DNACodes(*args, meta=None, encoded=False, **kwargs)

Bases: UserDict

🧬 DNACodes Dictionary-like container for symbolic signal encodings at multiple scales.

meta

Metadata describing the signal and encoding parameters.

Type:: dict

encoded

Whether the content has been sinusoidally encoded.

Type:: bool

sinencode(d_part=32, N=10000): Encodes symbolic segments using transformer-style sinusoidal embeddings.

sindecode(reference_dx=None): Decodes sinusoidal embeddings back to symbolic segment structure.

summary(): Displays segment or vector counts by scale.

plot(figsize=(12, 4), d_part=None, N=None): Plot method for DNACodes

plot(figsize=(12, 4), d_part=None, N=None)

Plot method for DNACodes: visualizes encoded vectors or symbolic segment distribution.

Parameters:

figsize (tuple) – Figure size for the entire plot.
d_part (int, optional) – Number of dimensions per segment part (only for encoded).
N (int, optional) – Frequency base (for metadata or title info).

sindecode(reference_dx=None)

Decode sinusoidally embedded codes grouped by letter into symbolic segment structure.

Parameters:: reference_dx (float, optional) – Sampling interval used to reconstruct xloc. Defaults to meta[“sampling_dt”].
Returns:: Decoded symbolic codes for each scale.
Return type:: DNACodes

sinencode(d_part=32, N=10000)

Encode symbolic segments at each scale using sinusoidal encoding grouped by letter.

Parameters:

d_part (int) – Number of dimensions for each component (start, width, height).
N (int) – Frequency base for sinusoidal embedding.

Returns:

Encoded version of the current codes, grouped by letter per scale.

Return type:

DNACodes
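To make the role of d_part and N concrete, the sketch below encodes one segment with the standard transformer-style sinusoidal formula. It assumes that each of the three components (start, width, height) is embedded independently into d_part dimensions and concatenated, and that sine and cosine terms are interleaved; `sinusoidal` and `encode_segment` are hypothetical names, and the module's actual layout may differ.

```python
import math

def sinusoidal(value, d_part=32, N=10000):
    """Transformer-style encoding of one scalar into d_part dimensions (d_part even)."""
    vec = []
    for i in range(d_part // 2):
        angle = value / (N ** (2 * i / d_part))
        vec.extend((math.sin(angle), math.cos(angle)))
    return vec

def encode_segment(start, width, height, d_part=32, N=10000):
    """Concatenate the encodings of the three segment components."""
    return (sinusoidal(start, d_part, N)
            + sinusoidal(width, d_part, N)
            + sinusoidal(height, d_part, N))

vector = encode_segment(start=3, width=2, height=5.0)
```

Under these assumptions a segment becomes a vector of length 3 × d_part, with every entry bounded between -1 and 1.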

summary(): Print the number of encoded vectors or symbolic segments per scale.

class sig2dna_core.signomics.DNAFullCodes(*args, meta=None, encoded=False, **kwargs)

Bases: dict

🧬 DNAFullCodes(dict)

A container for symbolic full-resolution DNA-like strings or their sinusoidal embeddings, organized per scale.

This structure maps each scale (typically corresponding to a wavelet or resolution level) to either:

a DNA-like string (str or DNAstr) representing symbolic patterns over time, or

a compressed embedding (dict of vectors) after sinusoidal encoding.

It supports signal discretization, symbolic transformation, sinusoidal encoding, dimensionally-reduced analysis, and visual comparison of encoded motifs.

meta

Optional metadata (e.g. sampling rate, units, scale definitions, etc.).

Type:: dict

encoded

Whether this instance contains sinusoidally encoded data.

Type:: bool

unwrapped_matrix

When applicable, stores a matrix {scale: ndarray (n_letters, d_model)} from compressed representations via unwrap_letters_to_matrix().

Type:: dict, optional

sinencode(d_model=96, N=10000, operation=None): Encodes symbolic data with sinusoidal positional encoding. Supports per-letter reduction via ‘sum’ or ‘mean’. Returns a new encoded instance.

sindecode(): Attempts to reconstruct the symbolic string by repeating letters. Only works if the original operation did not compress to a single vector per letter.

unwrap_letters_to_matrix(): Converts compressed encodings (after ‘sum’ or ‘mean’) into (n_letters × d_model) matrices per scale. Required for d-space plotting.

plot(figsize=(12, 4)): Plots the letter-wise composition (symbolic form) or encoded means (if encoded=True).

plot_unwrapped_matrix(figsize=(12, 4)): Visualizes each letter’s embedding vector in d-space, with one subplot per scale.

Example

>>> codes = DNAFullCodes({4: 'YAABZZ'}, meta={"sampling_dt": 0.5})
>>> encoded = codes.sinencode(operation="mean")
>>> encoded.unwrap_letters_to_matrix()
>>> encoded.plot_unwrapped_matrix()

plot(figsize=(12, 4))

Plot method for DNAFullCodes: visualizes encoded vectors or DNA string composition.

Parameters:: figsize (tuple) – Figure size for the entire plot.

plot_unwrapped_matrix(figsize=(12, 4))

Plot each letter’s encoded vector in the abstract embedding space (d-space). One curve per letter, one subplot per scale.

Requires unwrap_letters_to_matrix() to have been called.

Parameters:: figsize (tuple) – Base figure size. Height will be scaled based on number of scales.
Returns:: The generated matplotlib figure.
Return type:: matplotlib.figure.Figure

sindecode(): Sindecode method

sinencode(d_model=96, N=10000, operation='sum')

Encode each letter in the DNAFullCodes as a set of sinusoidal embeddings.

Parameters:

d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 96).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None, optional) – If “sum”, sums all encodings per letter. If “mean”, averages all encodings per letter. If None, keeps the full (n_occurrences, d_model) matrix per letter. Raises a ValueError if the operation is not one of the above.

Returns:

DNAFullCodes (without aggregation) – A new DNAFullCodes instance with encoded representations and metadata.
DNAFullCodes (with aggregation) – A new DNAFullCodes instance with encoded representations aggregated according to operation, and metadata.
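The effect of the operation argument can be illustrated with a small stand-alone sketch. It assumes that each character position in the string receives a standard sinusoidal positional vector and that vectors are grouped by letter before the optional reduction; `positional` and `sinencode_string` are hypothetical helpers, not the module's API.

```python
import math
from collections import defaultdict

def positional(pos, d_model=96, N=10000):
    """Standard sinusoidal positional vector for one position (d_model even)."""
    vec = []
    for i in range(d_model // 2):
        angle = pos / (N ** (2 * i / d_model))
        vec.extend((math.sin(angle), math.cos(angle)))
    return vec

def sinencode_string(code, d_model=96, N=10000, operation=None):
    """Group positional vectors by letter, then optionally sum or average them."""
    if operation not in (None, "sum", "mean"):
        raise ValueError("operation must be None, 'sum' or 'mean'")
    grouped = defaultdict(list)
    for pos, letter in enumerate(code):
        grouped[letter].append(positional(pos, d_model, N))
    if operation is None:
        return dict(grouped)                 # full (n_occurrences, d_model) rows
    reduced = {}
    for letter, rows in grouped.items():
        total = [sum(col) for col in zip(*rows)]
        reduced[letter] = total if operation == "sum" else [v / len(rows) for v in total]
    return reduced

full = sinencode_string("YAABZZ")                       # no aggregation
mean = sinencode_string("YAABZZ", operation="mean")     # one vector per letter
```

Without aggregation the positions of each letter remain recoverable; after "sum" or "mean" each letter is summarised by a single d_model vector.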

summary()

Return a brief summary of the full codes per scale.

Returns:: Mapping from scale to summary string or code length.
Return type:: dict

unwrap_letters_to_matrix()

Assemble all encoded letter vectors into a matrix of shape (n_letters, d_model) for each scale.

Applies only when the encoding was performed with an operation (“sum” or “mean”). Stores the result in self.unwrapped_matrix as a dict {scale: matrix}. Returns the dictionary for chaining or inspection.

Raises:

ValueError – If the encoding is not compressed (i.e., operation is None or missing), or if encoded entries are inconsistent in shape.

class sig2dna_core.signomics.DNApairwiseAnalysis(D, names, DNAsignals, name=None)

Bases: object

Class to handle pairwise distance analysis, PCoA, clustering, and visualization for DNA-coded signals.

D

Pairwise excess entropy distance matrix.

Type:: np.ndarray

names

Names of the DNA signals.

Type:: list

DNAsignals

Original DNAsignal objects.

Type:: list

coords

Coordinates in reduced space (PCoA).

Type:: np.ndarray

dimensions

Selected dimensions for reduced analysis.

Type:: list

linkage_matrix

Linkage matrix used for hierarchical clustering.

Type:: np.ndarray

best_dimension(max_dim=10): Determine optimal dimension by maximizing silhouette score.

cluster(t=1.0, criterion='distance'): Assign cluster labels from linkage matrix.

compute_linkage(method='ward'): Compute hierarchical clustering.

dimension_variance_curve(threshold=0.5, plot=True, figsize=(8, 5))

Computes the cumulative explained variance (based on pairwise distances) as a function of the number of dimensions used (from 1 to n-1). Optionally plots the curve and the point where the threshold (default 0.5) is reached.

Parameters:

threshold (float) – Fraction of total variance to reach (default 0.5).
plot (bool) – If True, display the variance curve and highlight dhalf.
figsize (tuple) – Size of the figure if plotted.

Returns:

dhalf (int) – Number of dimensions needed to reach the threshold.
curve (list of float) – Normalized cumulative variance (in [0, 1]) for dimensions 1 to n-1.
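The relation between the curve and dhalf can be sketched as below. The example assumes that a per-dimension variance contribution (for instance a PCoA eigenvalue) is already available; how the module derives these contributions from the pairwise distances is not shown here, and `variance_curve` is a hypothetical helper.

```python
def variance_curve(contributions, threshold=0.5):
    """Return (dhalf, curve) from per-dimension variance contributions."""
    positive = [max(c, 0.0) for c in contributions]   # ignore negative values
    total = sum(positive)
    curve, running = [], 0.0
    for c in positive:
        running += c
        curve.append(running / total)                 # normalised to [0, 1]
    # First number of dimensions whose cumulative share reaches the threshold.
    dhalf = next(i + 1 for i, share in enumerate(curve) if share >= threshold)
    return dhalf, curve

dhalf, curve = variance_curve([4.0, 3.0, 2.0, 1.0], threshold=0.5)
```

With contributions 4, 3, 2, 1 the first dimension explains 40% and the first two explain 70%, so two dimensions are needed to reach the default threshold.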

get_cluster_labels(n_clusters=2, method='ward')

Returns cluster labels from hierarchical clustering. If not computed yet, computes linkage.

Parameters:

n_clusters (int) – Number of clusters to assign.
method (str) – Linkage method to use if recomputing linkage.

Returns:

labels – Cluster IDs for each sample.

Return type:

np.ndarray of int

heatmap(figsize=(10, 8)): Plot heatmap of pairwise distances.

static load(path): Load analysis from file.

pcoa(n_components=None): Perform Principal Coordinate Analysis (PCoA).

plot_dendrogram(truncate_mode=None, p=10): Plot dendrogram from linkage matrix.

reduced_distances(): Recompute distances on selected subspace.

save(path): Save current analysis to file.

scatter(dims=(0, 1), annotate=True, figsize=(8, 6), n_clusters=None)

2D scatter plot in selected dimensions with optional cluster-based coloring.

Parameters:

dims (tuple) – Dimensions to plot (default: (0, 1)).
annotate (bool) – If True, annotate points with their index.
figsize (tuple) – Size of the plot.
n_clusters (int or None) – If provided, use clustering to color points.

Returns:

fig

Return type:

matplotlib.figure.Figure

scatter3d(dims=(0, 1, 2), annotate=True, n_clusters=None)

3D scatter plot in selected dimensions with optional cluster-based coloring.

Parameters:

dims (tuple) – Dimensions to plot.
annotate (bool) – Annotate points with their index.
n_clusters (int or None) – If provided, use clustering to color points.

Returns:

fig

Return type:

matplotlib.figure.Figure

select_dimensions(dims): Update active dimensions.

class sig2dna_core.signomics.DNAsignal(signal_obj, sampling_dt=1.0, dtype=<class 'numpy.float64'>, encode=False, encoder=['compute_cwt', 'encode_dna', 'encode_dna_full'], scales=[1, 2, 4, 8, 16, 32], x_label='index', x_unit='-', y_label='Intensity', y_unit='', plot=False, plotter=['plot_signals', 'plot_transforms', 'plot_codes'])

Bases: object

DNAsignal(signal):

A class to encode a numerical signal (typically a 1D GC-MS trace, NMR/FTIR/Raman spectra or time series) into a DNA-like symbolic representation using wavelet analysis. This symbolic coding enables fast comparison, search, and alignment of signal features using abstracted patterns (e.g., ‘YAZB’).

The class supports:

- Continuous Wavelet Transform (CWT) using Ricker wavelets
- Symbolic conversion of wavelet features to DNA-like letters (A, B, C, X, Y, Z)
- Visualization of CWTs, symbolic encodings, and signal overlays
- Reversible decoding of symbolic segments back into approximate signals
- Substring extraction and matching
- Storage of multi-scale representations (multi-resolution DNA encoding)

This class is part of the symbolic signal transformation pipeline (sig2dna), compatible with signal, peaks, and signal_collection.

Parameters:

signal_obj (signal) – An instance of the signal class representing a sampled waveform. It must have x, y, and a valid name or identifier.
encode (bool, default=False) – If True, run the encoders ["compute_cwt", "encode_dna", "encode_dna_full"].
plot (bool, default=False) – If True, plot with the plotters ["plot_signals", "plot_transforms", "plot_codes"].

signal

Original numerical signal (values only; x stored via sampling_dt).

Type:: np.ndarray

dtype

Data type of the stored signal.

Type:: data-type

sampling_dt

Sampling interval along the x-axis.

Type:: float

dx

The nominal x-resolution of the signal (automatically derived).

Type:: float (deprecated)

n

Number of points in the signal (length of x/y arrays).

Type:: int

name

Name of the signal.

Type:: str

x_label

Label of the x-axis.

Type:: str

x_unit

Unit of the x-axis.

Type:: str

y_label

Label of the y-axis.

Type:: str

y_unit

Unit of the y-axis.

Type:: str

scales

List of scales used for encoding.

Type:: list[int]

codes

Dictionary-like container of symbolic triplet encodings by scale.

Type:: DNACodes

codesfull

Dictionary-like container of full-resolution symbolic strings per scale.

Type:: DNAFullCodes

scales

Set of scales used in the Continuous Wavelet Transform (powers of 2 by default).

Type:: array-like

transforms

Stores the CWT-transformed signals, each as a signal object, indexed by scale.

Type:: signal_collection

codes

Symbolic codes by scale level. Each is a DNAstr object representing the symbolic sequence at that scale.

Type:: dict[int, DNAstr]

codesfull

Same as codes, but uses full resolution symbolic representation.

Type:: dict[int, DNAstr]

sincodesfull

Sinusoidal position encoded DNAstr (without aggregation)

Type:: dict[int, DNAFullCodes]

sincodesfull_aggregated: Sinusoidal position encoded DNAstr (with aggregation)

peaks

Optional peaks object used to index real peak positions from the signal.

Type:: peaks

codebook

Mapping between symbolic characters (A, B, C, X, Y, Z, _) and wavelet features.

Type:: dict

generator

Name of the wavelet basis used (default: ‘ricker’).

Type:: str

normalize_signal(mode='zscore+shift'): Normalizes the internal signal (preserves positivity).

compute_cwt(scales=None, apply_filter=False, wavelet='mexh'): Computes the Continuous Wavelet Transform using the Ricker wavelet and stores the transformed signals in transforms.

sparsify_cwt(scale=None, threshold=None, inplace=True): Zero out wavelet coefficients below a threshold for one or several scales.

encode_dna(): Encodes each scale’s transformed signal into a symbolic DNAstr sequence using local maximum coding (ABCXYZ).

encode_dna_full(): Encodes signals at each scale using the full encoding scheme, preserving the flat regions (_) and finer symbolic transitions.

plot_signals(): Plots signals.

plot_codes(scale): Plots both the wavelet transform and the symbolic code for the given scale.

plot_transforms(): Plots the stored CWT-transformed signals as a signal collection.

plot_scalogram(): Plots a scalogram with two subplots.

decode_dna(scale): Reconstructs the approximate signal for a given scale from its DNA encoding.

__getitem__(scale): Shortcut to access the DNAstr object for a specific scale.

summary(): Returns a dictionary summarizing encoded scales and metadata.

has(scale): Checks whether a DNAstr encoding exists at the given scale.

pseudoinverse(scales=None, rank=None, return_weights=False, name=None): Approximates signal reconstruction via pseudo-inverse using stored CWT coefficients.

Static pairwise distance methods

--------------------------------

_pairwiseEntropyDistance(list of DNAstr objects, scale): Return a DNApairwiseAnalysis instance based on the mutually exclusive information after DNA/code alignment.

_pairwiseJaccardMotifDistance(list of DNAstr objects, scale): Return a DNApairwisedistance based on the presence/absence of a pattern (default=YAZB)

_pairwiseJensenShannonDistance(list of DNAstr objects, scale): Return a DNApairwisedistance based on the Jensen-Shannon distance of a pattern

_pairwiseLevenshteinDistance(list of DNAstr objects, scale): Return a DNApairwisedistance based on the Levenshtein Distance
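Two of the distances listed above can be illustrated on plain strings. The sketch below is generic: `levenshtein` is the textbook edit distance and `jensen_shannon` compares letter-frequency distributions in base 2. Whether the module computes the Jensen-Shannon distance on single letters or on longer patterns is not specified here, so treat the second function as an assumption.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jensen_shannon(a, b):
    """Jensen-Shannon distance (base 2) between letter distributions of a and b."""
    ca, cb = Counter(a), Counter(b)
    letters = set(ca) | set(cb)
    p = {k: ca[k] / len(a) for k in letters}
    q = {k: cb[k] / len(b) for k in letters}
    m = {k: 0.5 * (p[k] + q[k]) for k in letters}

    def kl(u, v):
        return sum(u[k] * math.log2(u[k] / v[k]) for k in letters if u[k] > 0)

    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))

d_edit = levenshtein("YAZB", "YAZZB")
d_js = jensen_shannon("YAZB", "YAZZB")
```

The edit distance is sensitive to the order of letters, whereas the frequency-based Jensen-Shannon distance only sees composition; the two therefore answer different questions about a pair of codes.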

Methods: Symbolic Encoding

--------------------------

encode_dna(scales=None): Convert signal into triplet-based symbolic encoding (per scale).

encode_dna_full(scales=None, resolution='index', repeat=True, n_points=None): Generate full-resolution DNA strings by repeating letters (to codesfull).

Methods: Sinusoidal Encoding

----------------------------

sinencode_dna(scales=None, d_part=32, N=10000): Encode symbolic segments (from codes) into sinusoidal vectors.

sinencode_dna_full(d_model=96, N=10000, operation=None): Encode symbolic full strings (from codesfull) into sinusoidal vectors.

Methods: Sinusoidal Decoding (Static)

-------------------------------------

sindecode_dna(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000): [static] Decode grouped sinusoidal embeddings → DNACodes structure.

sindecode_dna_full(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000): [static] Decode full sinusoidal embeddings → DNAFullCodes structure.

Methods: Signal Reconstruction

------------------------------

tosignal(scale=None, codes_attr='codes'): Reconstruct approximate signal (as signal instance) from symbolic encoding.

Examples

>>> S = signal.from_peaks(...)  # define a signal
>>> dna = DNAsignal(S)
>>> dna.compute_cwt()
>>> dna.encode_dna()
>>> dna.codes[4]
DNAstr("AAAZZZYY...")
>>> dna.plot_codes(4)
>>> dna.codes[4].find("YAZB")

align_with(other, scale=1)

Align symbolic sequences and compute mutual entropy.

Returns:

Namespace with fields:

seq1_aligned (str)
seq2_aligned (str)
aligned_signal (list of tuples)
mutual_entropy (float)

Return type:

SimpleNamespace
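The alignment step can be sketched with difflib from the standard library, which the feature list names as one of the alignment back ends. This is an illustration under assumptions: `align` is a hypothetical helper, the gap character "-" is chosen only because it is not part of the alphabet, and the mutual entropy computed by the module is not reproduced.

```python
from difflib import SequenceMatcher

def align(seq1, seq2, gap="-"):
    """Pad two symbolic strings with gaps so that matching blocks line up."""
    out1, out2 = [], []
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = seq1[i1:i2], seq2[j1:j2]
        width = max(len(a), len(b))          # pad the shorter block with gaps
        out1.append(a.ljust(width, gap))
        out2.append(b.ljust(width, gap))
    return "".join(out1), "".join(out2)

seq1_aligned, seq2_aligned = align("YAZBYAZB", "YAZBAZB")
```

Both returned strings have the same length, and removing the gap characters restores the original sequences.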

static apply_baseline_filter(signal, w=None, k=2, delta_t=1.0)

Apply baseline filtering using moving median and local Poisson-based thresholding.

Parameters:

signal (np.ndarray) – Input signal (expected to be non-negative or baseline-dominated).
w (int or None) – Window size for baseline and statistics (must be odd). Defaults to max(11, 2% of signal length).
k (float) – Bienaymé-Tchebychev multiplier.
delta_t (float) – Sampling time step.

Returns:

filtered – Signal with baseline removed and low-intensity noise suppressed.

Return type:

np.ndarray

Note

This method is static; prefer signal.apply_baseline_filter() whenever appropriate.
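A simplified version of this kind of filter is sketched below. It assumes that the baseline is a moving median, that the local noise level follows Poisson statistics (standard deviation equal to the square root of the baseline), and that residuals not exceeding k standard deviations are set to zero. The role of delta_t is not modelled, and `baseline_filter` is a hypothetical stand-in rather than the module's implementation.

```python
import math
from statistics import median

def baseline_filter(y, w=11, k=2.0):
    """Subtract a moving-median baseline and suppress residuals below k * sigma."""
    half = w // 2
    n = len(y)
    filtered = []
    for i in range(n):
        window = y[max(0, i - half): min(n, i + half + 1)]
        base = median(window)
        sigma = math.sqrt(max(base, 0.0))    # Poisson assumption: variance == mean
        residual = y[i] - base
        filtered.append(residual if residual > k * sigma else 0.0)
    return filtered

y = [4.0] * 30
y[15] = 104.0                                # one peak on a flat baseline of 4 counts
filtered = baseline_filter(y)
```

On this toy signal the flat baseline is removed entirely and only the peak survives, with its height measured from the baseline.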

compute_cwt(scales=None, apply_filter=False, wavelet='mexh')

Compute Continuous Wavelet Transform (CWT) using the Mexican Hat wavelet.

Parameters:

scales (list, int, or None) – List of scales (or a single scale) to compute. If None, default to [1, 2, 4, 8, 16].
apply_filter (bool) – Whether to apply a baseline filter to the input signal before transforming.
wavelet (str (default='mexh')) – The name of the PyWavelets-compatible wavelet.

Sets:

self.scales (list) – The list of actual scales used.
self.filtered_signal (ndarray) – Filtered or raw signal used for CWT.
self.cwt_coeffs (dict) – Dictionary mapping each scale to its 1D coefficient array.
self.transforms (signal_collection) – Collection of signal objects storing the transformed signals for each scale.

encode_dna(scales=None)

Encode each transformed signal into a symbolic DNA-like sequence of monotonic segments.

Parameters:

scales (list, int, or None) – List of scales (or a single scale) to encode. If None, use self.scales.

The encoding detects strictly monotonic (or flat) segments and labels them with the symbolic letters A, Z, B, Y, C, X and _.

Sets:

self.codes (dict) –
Dictionary mapping each scale to a struct with:
- letters : str (symbolic encoding)
- widths : list of float (x-span of each segment)
- heights : list of float (y-delta of each segment)
- iloc : list of index-pair tuples (start, end+1)
- xloc : list of x-span tuples (x_start, x_end)
- dx : segment step (dx)

encode_dna_full(scales=None, resolution='index', repeat=True, n_points=None)

Convert symbolic codes into DNA-like strings by repeating letters proportionally to their span.

Parameters:

scales (list, int, or None) – List of scales (or a single scale) to convert. If None, use self.scales.
resolution ({'index', 'x'}) –
Repetition mode:
- 'index': repeat letters by number of indices (j - i from iloc)
- 'x': interpolate letter values over physical x-axis distance (xloc)
repeat (bool) – If True, repeat or interpolate letters to form a string of desired resolution. If False, return the symbolic sequence without repetition.
n_points (int or None) – Used only for resolution='x' to control the number of interpolation points. If None, defaults to ~10 points per x-unit.

Returns:

dict – Dictionary mapping each scale to its DNA-like string.

Sets:

self.codesfull (dict) – Dictionary storing the resulting full DNA-like string per scale.
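For resolution='index', the repetition amounts to a run-length expansion of the letters by their widths. The following sketch shows the expansion and its inverse; `expand` and `compress` are hypothetical helpers, and the inverse is only exact when adjacent segments carry different letters.

```python
from itertools import groupby

def expand(letters, widths):
    """Repeat each letter by its width (number of indices it spans)."""
    return "".join(letter * int(width) for letter, width in zip(letters, widths))

def compress(full):
    """Inverse operation: collapse runs of identical letters into (letters, widths)."""
    runs = [(letter, len(list(group))) for letter, group in groupby(full)]
    return "".join(letter for letter, _ in runs), [width for _, width in runs]

full = expand("YAZB", [2, 3, 3, 1])
letters, widths = compress(full)
```

The expanded string has one character per sample, which is what makes position-wise comparison and alignment between signals possible.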

static entropy_from_string(s): return the entropy of a string

find_sequence(pattern, scale): Find occurrences of a specific letter pattern in encoded sequence.

get_code(scale): Retrieve encoded data for a specific scale.

get_entropy(scale): Calculate Shannon entropy for encoded signal.
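As a reminder of what a Shannon entropy of a symbolic string measures, here is a minimal sketch. The logarithm base (2, giving bits) is an assumption of this example, and `shannon_entropy` is a hypothetical name rather than the module's function.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy, in bits, of the letter distribution of a string."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((count / n) * math.log2(count / n) for count in Counter(s).values())

h_uniform = shannon_entropy("YAZB")      # four equiprobable letters
h_constant = shannon_entropy("AAAA")     # a single repeated letter
```

A string made of one repeated letter carries no information, while four equally frequent letters give the maximum of two bits for a four-letter alphabet.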

has(scale)

Check if a DNA encoding exists for the specified scale.

Parameters:: scale (int) – The wavelet scale to check.
Returns:: True if a symbolic DNAstr encoding exists at the given scale, False otherwise.
Return type:: bool

Examples

>>> dna.has(4)
True
>>> dna.has(16)
False

property letters: Return used letters

normalize_signal(mode='zscore+shift')

Normalize the internal signal using one of several strategies that ensure positivity.

Parameters:: mode (str) – Normalization mode passed to signal.normalize(). See signal.normalize() for available modes.
Raises:: AttributeError – If signal attribute is missing or of the wrong type.

plot_codes(scale, ax=None, colormap=None, alpha=0.4)

Plot the symbolic DNA-like encoding as colored triangle segments.

Parameters:

scale (int) – The scale at which the signal was encoded.
ax (matplotlib.axes.Axes, optional) – Axis to draw on. If None, a new figure is created.
colormap (dict, optional) – Custom mapping of letters to colors. Default uses 7 distinct colors.
alpha (float) – Transparency for the patches. Default is 0.4.

plot_scalogram()

Plot a scalogram with two subplots:

- Top: colored image of CWT coefficient amplitudes
- Bottom: line curves of selected scales

Returns:: fig – The matplotlib figure object.
Return type:: matplotlib.figure.Figure

plot_signals(scales=None): Plot signals.

plot_transforms(indices=None, **kwargs)

Plot the stored CWT-transformed signals as a signal collection.

Parameters:

indices (list[int or str], optional) – Specific scales or names to plot.
**kwargs – Passed to signal_collection.plot.

static print_alignment(seq1, seq2, width=80): print aligned sequences

pseudoinverse(scales=None, rank=None, return_weights=False, name=None)

Approximate signal reconstruction via pseudo-inverse using stored CWT coefficients.

Parameters:

scales (list, float, int, or None) – Scales to include in the reconstruction. If None, all scales in self.cwt_coeffs are used.
rank (int or None) – Optional truncation rank for the SVD decomposition (for denoising or dimensionality reduction).
return_weights (bool) – If True, also return the weights (contributions) of each scale.
name (str or None) – Optional name for the returned signal. Defaults to “pseudoinverse” with included scales.

Returns:

reconstructed_signal (signal) – Reconstructed signal instance from the pseudo-inverse of the CWT decomposition.
weights (np.ndarray, optional) – Returned only if return_weights=True, gives the contribution of each scale.

Raises:

ValueError – If CWT coefficients are not available.

static reconstruct_aligned_string(seq, aligned): Fast reconstruction of aligned signals

reconstruct_signal(scale, return_signal=True)

Reconstruct the signal from symbolic features (e.g., YAZB).

Parameters:

scale (int) – Scale to use for reconstruction.
return_signal (bool) – If True, return a signal object. Else return y array.

Returns:

Reconstructed signal.

Return type:

signal or np.ndarray
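The idea of rebuilding a signal from symbolic features can be sketched with a piecewise-linear interpolation of (letter, width, height) triplets. This is an assumption made for illustration: `reconstruct` is a hypothetical helper, and the module may use a different interpolation or additional information.

```python
def reconstruct(segments, y0=0.0):
    """Rebuild an approximate signal from (letter, width, height) triplets."""
    y = [y0]
    for _letter, width, height in segments:
        start = y[-1]
        for step in range(1, width + 1):
            # Linear ramp: the letter only documents the shape of the segment.
            y.append(start + height * step / width)
    return y

segments = [("Y", 2, -2.0), ("A", 2, 5.0), ("Z", 2, -5.0), ("B", 2, 2.0)]
y = reconstruct(segments)
```

The segment end points are recovered exactly, while values inside each segment are only approximated by the linear ramp.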

static sindecode_dna(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000)

Decode sinusoidal grouped embeddings into a DNACodes structure.

Parameters:

grouped_embeddings (dict) – {scale: {letter: np.ndarray}} sinusoidal representations
reference_dx (float) – Sampling resolution used to reconstruct xloc and iloc
d_part (int) – Dimensionality per component (start, width, height)
N (int) – Frequency base

Returns:

Decoded symbolic structure

Return type:

DNACodes

sinencode_dna(scales=None, d_part=32, N=10000)

Encode self.codes into sinusoidal embeddings (grouped by letter).

Sets:

self.codes (DNACodes): Encoded version of original codes.

sinencode_dna_full(d_model=96, N=10000, operation=None)

🌀 Encode full-resolution DNA-like strings into sinusoidal embeddings grouped by letter.

Parameters:

d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 96).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None, optional) – If “sum”, sums all encodings per letter. If “mean”, averages all encodings per letter. If None, keeps the full (n_occurrences, d_model) matrix per letter. Raises a ValueError if the operation is not one of the above.

Sets:

self.codesfull (DNAFullCodes) – Full-resolution symbolic strings at each scale (if not already set).
self.codesfull_encoded (DNAFullCodes) – Sinusoidally encoded version of the full DNA strings.

sparsify_cwt(scale=None, threshold=None, inplace=True)

Sparsify CWT coefficients by zeroing values below a threshold.

Parameters:

scale (int, float, list, or None) – Scale(s) to sparsify. If None, all available scales in self.cwt_coeffs are used.
threshold (float or None) – Absolute value below which coefficients are set to zero. If None, uses 1% of the maximum absolute value at each scale.
inplace (bool) – If True, modifies current instance. If False, returns a modified copy.

Returns:

Modified copy if inplace is False, otherwise None.

Return type:

DNAsignal or None

Raises:

ValueError – If scale(s) not found in self.cwt_coeffs.
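The thresholding rule described above can be sketched on a plain dictionary of coefficients. The helper below is hypothetical: it mirrors the documented defaults (all scales, 1% of the maximum absolute value per scale) and always returns a new dictionary, which corresponds to the inplace=False behaviour.

```python
def sparsify_cwt(cwt_coeffs, scale=None, threshold=None):
    """Zero out coefficients whose absolute value is below the threshold."""
    scales = list(cwt_coeffs) if scale is None else [scale]
    out = dict(cwt_coeffs)                   # untouched scales are kept as they are
    for s in scales:
        if s not in cwt_coeffs:
            raise ValueError(f"scale {s} not found in cwt_coeffs")
        row = cwt_coeffs[s]
        # Default threshold: 1% of the largest absolute coefficient at this scale.
        thr = threshold if threshold is not None else 0.01 * max(abs(c) for c in row)
        out[s] = [c if abs(c) >= thr else 0.0 for c in row]
    return out

cwt_coeffs = {4: [10.0, 0.05, -0.2, -8.0]}
sparse = sparsify_cwt(cwt_coeffs)
```

Here the default threshold is 0.1, so only the coefficient 0.05 is removed and the input dictionary is left unchanged.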

static synthetic_signal(x, peaks, baseline=None): Generate flexible synthetic signals. (obsolete)

tosignal(scale=None, codes_attr='codes')

Reconstruct an approximate signal from symbolic encodings.

Parameters:

scale (int or None) – Scale level to use (defaults to first available if None).
codes_attr (str) – Attribute from which to decode (‘codes’ or ‘codesfull’).

Returns:

An approximate signal object reconstructed from symbolic information.

Return type:

signal

class sig2dna_core.signomics.DNAsignal_collection(*signals, vtmscale=None, rasterscan=True, dtype=<class 'numpy.float32'>)

Bases: list

A collection of DNAsignal objects (e.g., from a GC-MS chromatogram) supporting symbolic sinusoidal encoding, full tensor construction, and blind deconvolution using latent component analysis.

Purpose

DNAsignal_collection is designed to enable symbolic and positional encoding of multiple 1D analytical signals (e.g., ion channels from GC-MS data) using lettered segments and sinusoidal encodings. It allows combining multiple encoded signals into a single tensor for processing with machine learning methods, including dimensionality reduction and blind source separation.

Theory

Each signal is decomposed into symbolic segments based on their local morphology (encoded as letters like A, B, Y, Z, _). Each segment is represented in an embedding space of dimension d via sinusoidal encoding. The 3D tensor $v_{t,m,d}$ is composed of:

$E_{t,m,d}$: symbolic encoding across segments.
$PE_t$: positional encoding along the time/segment axis (t).
$PE_m$: positional encoding along the mass channel/ion axis (m).

Combining these yields:

$$ v_{t,m,d} = E_{t,m,d} + PE_t(t,d) + PE_m(m,d) $$

- `.sinencode_dna_full(scale=4)`: performs symbolic encoding at a given scale.

- `.E_symbol`: property returning symbolic component E for each letter and scale.

- `.PE_t`: positional encoding per letter along t for each scale.

- `.PE_m`: positional encoding along m (mass channels).

- `.vtm`: dictionary of $v_{t,m,d}$ matrices per letter.

- `.vtm_full`: complete tensor (sum of all letters) used for machine learning.

- `.deconvolve_latent_sources(...)`: uses PCA to decompose full tensor into component chromatograms.

- `.plot_v_symbol_components(...)`: visualizes the construction of $v_{t,m}$ for each letter.

- `.plot_vtm_full(...)`: visualizes the components of the full $v_{t,m}$ tensor.
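The sum $v_{t,m,d} = E_{t,m,d} + PE_t(t,d) + PE_m(m,d)$ is a broadcast addition: $PE_t$ is repeated along the channel axis and $PE_m$ along the segment axis. The sketch below spells this out with nested lists; `combine` is a hypothetical helper, and the small arrays are made-up values used only to show the indexing.

```python
def combine(E, PE_t, PE_m):
    """Return v[t][m][d] = E[t][m][d] + PE_t[t][d] + PE_m[m][d]."""
    T, M = len(PE_t), len(PE_m)
    D = len(PE_m[0])
    return [[[E[t][m][d] + PE_t[t][d] + PE_m[m][d] for d in range(D)]
             for m in range(M)]
            for t in range(T)]

# Toy sizes: T = 2 segments, M = 3 channels, D = 2 embedding dimensions.
E = [[[1.0, 1.0] for _ in range(3)] for _ in range(2)]
PE_t = [[0.0, 1.0], [0.5, 0.5]]
PE_m = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
v = combine(E, PE_t, PE_m)
```

Each entry of the resulting tensor therefore carries the symbolic content together with its position along both the segment axis and the channel axis.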

Parameters:

signals (list of DNAsignal) – List of DNAsignal instances (e.g., one per ion channel).
vtmscale (int) – Scale used to calculate vtm_full.
rasterscan (bool, default=True) – True if the detector reads one point at a time. In practice, the 2D signal (T × m) is flattened into a single 1D time axis by appending all temporal channels one after another.
dtype (type or np.dtype, optional) – Numeric dtype used for storing encoding arrays (E_symbol, PE_t, PE_m, vtm, vtm_full). Defaults to np.float32 for reduced memory usage.

m

Number of signals (ion channels).

Type:: int

d

Embedding dimension.

Type:: int

letters

List of symbolic segment labels used in the encodings.

Type:: list of str

scales

Available scales for the symbolic encoding.

Type:: list of int

_E_symbol

Cached symbolic encoding tensors for each letter and scale.

Type:: dict

_PE_t

Cached positional encoding along t for each letter and scale.

Type:: dict

_PE_m

Cached positional encoding along m for each scale.

Type:: dict

_vtm

Cached symbolic+positional tensors per letter.

Type:: dict

_vtm_full

Cached full encoding tensor combining all letters.

Type:: np.ndarray

property E_symbol

Symbolic component of the 2D encoding for each scale and letter.

For each letter, builds a tensor of shape (T_letter, M, D), where:

T_letter is the total number of segments of that letter across all M signals
M is the number of ion channels
D is the embedding dimension

Returns:: Mapping {scale: {letter: ndarray of shape (T_letter, m, d)}}
Return type:: dict of dict
Type:: E_symbol(t, m)

property PE_m

Positional encoding along m (IC index axis) per scale.

Returns:: Mapping: {scale: array of shape (m, d)}
Return type:: dict
Type:: PE_m(m)

property PE_t

Positional encoding along t (segment axis) per scale and letter.

For each scale and letter, this provides a matrix of shape (n_segments, d), where n_segments is the total number of segments of that letter across the m signals, and d is the embedding dimension.

Returns:: Mapping {scale: {letter: ndarray of shape (n_segments, d)}}
Return type:: dict of dict
Type:: PE_t(t)

combine_embeddings(selected_letters=None)

Combine unwrapped embeddings across all signals for each scale.

Parameters:: selected_letters (list of str, optional) – If provided, restrict to these letters.
Returns:: Dictionary {scale: {letter: (m, d)}} for each selected scale and letter.
Return type:: dict

deconvolve_latent_sources(n_components=64, inertia_loss_threshold=0.25, plot=True, nmax_plot=8)

Perform dimensionality reduction on the 3D tensor v_{t,m,d} to extract non-coeluted compound chromatograms using PCA, with optional plotting.

Parameters:

n_components (int, optional) – Maximum number of latent components (e.g., pure compounds) to extract. Default is 64.
inertia_loss_threshold (float, optional) – The maximum allowed proportion of total variance to lose in the projection. Default is 0.25 (i.e., at least 75% of the variance should be preserved).
plot (bool, optional) – Whether to display diagnostic plots.
nmax_plot (int, optional) – Maximum number of components to visualize in plots.

Returns:

components (np.ndarray) – Component matrix of shape (n_selected_components, D), representing the spectral basis vectors (latent features).
chromatograms (np.ndarray) – Projected chromatograms for each component, shape (T, M, n_selected_components).
explained_variance_ratio (np.ndarray) – Variance explained by each selected component.
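
One plausible reading of inertia_loss_threshold is "keep the fewest components whose cumulative explained variance reaches 1 - threshold". The sketch below illustrates that rule with a plain SVD-based PCA on synthetic data. The function name `select_components_sketch` and the flattening of the tensor to (T*M, D) are assumptions, not the library's code.

```python
import numpy as np

def select_components_sketch(X, n_components=64, inertia_loss_threshold=0.25):
    """PCA via SVD; keep the fewest components whose cumulative explained
    variance reaches 1 - inertia_loss_threshold (capped at n_components)."""
    Xc = X - X.mean(axis=0)                              # center the features
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)                          # explained variance ratio
    cum = np.cumsum(ratio)
    k = int(np.searchsorted(cum, 1.0 - inertia_loss_threshold)) + 1
    k = min(k, n_components, len(s))
    return Vt[:k], Xc @ Vt[:k].T, ratio[:k]

rng = np.random.default_rng(0)
T, M, D = 40, 3, 6
# Synthetic rank-2 data standing in for the full tensor flattened to (T*M, D).
X = rng.normal(size=(T * M, 2)) @ rng.normal(size=(2, D))
components, scores, evr = select_components_sketch(X)
chromatograms = scores.reshape(T, M, -1)                 # (T, M, n_selected)
print(components.shape, chromatograms.shape)
```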

property m: Return the number of signals in the collection.

plot(letters=None, scales=None, figsize=(18, 10), max_legend=25)

Plot the encoded signals in subplots. Rows represent letters, columns represent scales. Each subplot contains overlaid colored curves from all signals.

Parameters:

letters (list or None) – Letters to be plotted. If None, all available letters are plotted.
scales (list or None) – Scales to be plotted. If None, all available scales are plotted.
figsize (tuple) – Size of the full figure.
max_legend (int) – Maximum number of signals to label in the legend.

Returns:

The figure containing the plots.

Return type:

matplotlib.figure.Figure

plot_embedding_projection(letters=None, scales=None, method='pca', max_points=25, figsize=(14, 10))

Plot embedding projections of the encoded signals using PCA (default) or other DR methods.

Parameters:

letters (list or None) – Letters to include in the projection. If None, all available letters are used.
scales (list or None) – Scales to include in the projection. If None, all available scales are used.
method (str) – Dimensionality reduction method (only ‘pca’ is supported for now).
max_points (int) – Maximum number of signal points to label explicitly.
figsize (tuple) – Size of the figure.

Returns:

fig

Return type:

matplotlib.figure.Figure

plot_letters(scale=None, figsize=(12, 6), cmap='viridis')

Plot a heatmap of the letter codes (symbolic DNA) across all signals.

Parameters:

scale (int, optional) – Scale to use (default: self.vtmscale).
figsize (tuple) – Size of the figure.
cmap (str) – Matplotlib colormap name (default: “viridis”).

Return type:

matplotlib.figure.Figure

plot_v_symbol_components(scale=None, dims='all')

Plot the components E_symbol, PE_t, PE_m and their sum v_{t,m} as image matrices for each letter at a given scale.

Parameters:

scale (int, optional) – The scale to use. Defaults to the first available scale.
dims ("all", list or slice) – Which dimensions to include in the sum over d. Default is all.

Return type:

matplotlib.figure.Figure

plot_vtm_full(scale=None, dims='all')

Visualize the components and sum of the full encoded GC-MS signal at a given scale.

Parameters:

scale (int, optional) – Scale to visualize. Defaults to the first scale.
dims ("all", list or slice) – Dimensions of the embedding d to include (default: all).

Return type:

matplotlib.figure.Figure

reduce_dimensions(method='pca', selected_letters=None, n_components=2)

Apply dimensionality reduction (PCA or UMAP) across signals for each scale.

Parameters:

method (str) – “pca” or “umap”.
selected_letters (list of str, optional) – Restrict to a subset of letters.
n_components (int) – Number of projection dimensions.

Returns:

Dictionary {scale: ndarray} with shape (m, n_components), one per scale.

Return type:

dict

scale_alignment(method='zscore')

Normalize embeddings across all signals and all scales.

Parameters:: method (str) – One of {“zscore”, “minmax”}.

sinencode_dna_full(d_model=128, N=10000, operation='sum')

🌀 Encode all DNAsignal instances using full-resolution sinusoidal embeddings, grouped by letter and organized per scale.

Parameters:

d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 128).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None) – If “sum”, sum all position encodings per letter. If “mean”, average encodings. If None, retain full (n_occurrences × d_model) arrays.

to_dataframe(selected_letters=None)

Export combined embeddings for all scales as a tidy pandas DataFrame, suitable for machine learning tasks.

Parameters:: selected_letters (list of str, optional) – Subset of letters to include. If None, include all letters.
Returns:: A long-form DataFrame with columns: [‘signal_index’, ‘scale’, ‘letter’, ‘dim_0’, …, ‘dim_{d-1}’]
Return type:: pd.DataFrame

property vtm

Compute the full encoded matrix v_{t,m} for each letter at each scale.

This property combines three orthogonal components:

E_symbol(t, m): the original per-segment encoding for each letter
PE_t(t): a sinusoidal encoding applied along the segment (time) axis
PE_m(m): a sinusoidal encoding applied along the signal (IC) axis

The resulting tensor for each scale and letter is of shape (n_segments, m, d), where:

n_segments: number of segments (time positions) per letter
m: number of DNAsignal instances in the collection
d: dimensionality of the encoding space (d_model)

Returns:: A nested dictionary {scale: {letter: array of shape (n_segments, m, d)}}.
Return type:: dict

property vtm_full

Compute and store the full encoded tensor for the GC-MS signal:

If self.rasterscan is False:
shape = (T, m, d)
If self.rasterscan is True:
shape = (T * m, d)

This combines:

Symbolic embedding per character (one-hot or learned)
Positional encoding along the time axis
PE_m (mass/IC identity), used only if rasterscan is False

Returns:: Encoded array (T, m, d) or (T*m, d)
Return type:: np.ndarray
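
The two output shapes differ only by a flattening of the (T, m) axes. The sketch below shows two possible orderings on a small stand-in array. Which ordering the library uses when rasterscan is True is not stated here, so both variants are shown as assumptions.

```python
import numpy as np

T, m, d = 4, 3, 2
# Stand-in for an encoded tensor of shape (T, m, d).
v = np.arange(T * m * d, dtype=np.float32).reshape(T, m, d)

# Time-major flattening: the m channels of each time step stay adjacent.
flat_time_major = v.reshape(T * m, d)

# Channel-major flattening: each channel's full time trace is appended in turn.
flat_channel_major = v.transpose(1, 0, 2).reshape(m * T, d)

print(flat_time_major.shape, flat_channel_major.shape)
```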

class sig2dna_core.signomics.DNAstr(content, dx=1.0, iloc=0, xloc=None, x_label='index', x_unit='-', engine='difflib', engineOpts=None)

Bases: str

A symbolic DNA-like sequence class supporting alignment, entropy analysis, edit-distance metrics, and signal reconstruction from symbolic codes.

Extended from str, it is designed for symbolic transformations of signals (e.g., wavelet-encoded GC-MS peaks or time series).

Main Features

Supports symbolic operations for pattern recognition, entropy, alignment.
Encodes x-resolution (dx), original index (iloc), and physical x-range (xloc).
Aligns sequences with visual inspection and rich diffs.
Converts symbolic strings into synthetic numerical signals.

Operators

+ : concatenate two DNAstr objects

- : symbolic difference after alignment (mismatches only)

== : equality comparison (exact content and dx)

Key Methods

__init__ / __new__ : Constructor with metadata (dx, iloc, xloc)
align(other) : Align this DNAstr to another, update mask and aligned views
wrapped_alignment() : Pretty terminal view of the alignment with colors and symbols
html_alignment() : Rich HTML display of the alignment (Jupyter)
plot_alignment() : Visualize waveform alignment with symbolic signals
plot_mask() : Color block plot showing matches/mismatches/gaps
find(pattern, regex=False) : Search for symbolic patterns with fuzziness or regex
to_signal() : Convert symbolic code into synthetic signal (NumPy)
vectorized() : Convert string to integer codes
summary() : Print entropy and character frequencies
mutation_counts : Property: {‘matches’, ‘mismatches’, ‘indels’}
entropy : Property: Shannon entropy
mutual_entropy(other) : Mutual entropy of two sequences
excess_entropy(other) : Excess entropy H1 + H2 - 2 * H12
jensen_shannon(other) : Jensen-Shannon divergence
jaccard(other) : Jaccard distance
alignment_stats : Property: Match, substitution, gap counts
score(normalized=True) : Alignment score (fraction of matches)
has(other: str) : Check if a pattern or substring exists

dx

Average resolution along the x-axis.

Type:: float

iloc

Positional index or index range in the source DNA string.

Type:: int or tuple of int

xloc

Corresponding x-value(s) for the symbolic sequence.

Type:: float or tuple of float

aligned_with

Aligned form of self with insertions (spaces) where needed.

Type:: str or None

other_copy

Aligned form of the reference sequence.

Type:: str or None

ref_hash

SHA256 hash of the aligned reference sequence.

Type:: str or None

mask

Alignment mask: ‘=’ for matches, ‘*’ for substitutions, ‘ ‘ for gaps.

Type:: str or None

engine

Alignment engine: ‘difflib’ or ‘bio’.

Type:: str

engineOpts

Options passed to the alignment engine.

Type:: dict

Examples

>>> s1 = DNAstr("YYAAZZBB", dx=0.5)
>>> s2 = DNAstr("YAABZBB", dx=0.5)
>>> s1.align(s2)
>>> print(s1.wrapped_alignment(40))
>>> s1.plot_alignment()
>>> segments = s1.find("YAZB")
>>> segments[0].to_signal().plot()

align(other, engine=None, engineOpts=None, forced=False)

Align this DNAstr sequence to another, allowing insertions/deletions to maximize matches.

Parameters:

other (DNAstr) – Another DNAstr object to align with.
engine ({'difflib', 'bio'} or None) –
Alignment engine to use:
- ’difflib’: uses difflib.SequenceMatcher (fast, approximate).
- ’bio’ : uses Bio.Align.PairwiseAligner (biologically inspired global alignment).
If None, defaults to self.engine.
engineOpts (dict, optional) – Dictionary of alignment parameters for the selected engine.
forced (bool) – If True, allow alignment even if dx values differ. If False (default), a mismatch in dx will raise an error to prevent incorrect alignment of signals with different sampling.

Returns:

aligned_self (str) – Aligned version of this sequence (with gaps inserted where needed).
aligned_other (str) – Aligned version of the other sequence.

Notes

The alignment is symmetric and permanent: both sequences are aligned with gaps introduced (spaces) to preserve positional correspondence. A hash of the aligned other sequence is stored to detect redundant alignments.

A match mask (self.mask) is generated with:

‘=’ for exact matches, ‘*’ for mismatches (substitutions), ‘ ‘ for insertions/deletions (gaps).

The method updates:

self.aligned_with
self.other_copy
self.mask
self.ref_hash
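
To make the mask convention concrete, here is a small sketch that builds gapped strings and a mask from difflib.SequenceMatcher opcodes. It only illustrates the '=' / '*' / ' ' convention described above. The function `align_with_mask_sketch` is hypothetical, and the real method may pad and order gaps differently.

```python
from difflib import SequenceMatcher

def align_with_mask_sketch(a: str, b: str, gap: str = " "):
    """Return (aligned_a, aligned_b, mask) using difflib opcodes."""
    out_a, out_b, mask = [], [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        seg_a, seg_b = a[i1:i2], b[j1:j2]
        n = max(len(seg_a), len(seg_b))
        pad_a, pad_b = seg_a.ljust(n, gap), seg_b.ljust(n, gap)
        out_a.append(pad_a)
        out_b.append(pad_b)
        if tag == "equal":
            mask.append("=" * n)
        else:
            # Substitution where both sides carry a symbol, gap otherwise.
            mask.append("".join(
                "*" if x != gap and y != gap else " "
                for x, y in zip(pad_a, pad_b)
            ))
    return "".join(out_a), "".join(out_b), "".join(mask)

print(align_with_mask_sketch("AABBCC", "AACBCC"))
```

The sketch assumes the gap character never appears as a symbol in either sequence.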

Example:

>>> S1 = DNAstr("AABBCC")
>>> S2 = DNAstr("AACBCC")
>>> S1.align(S2, "difflib")
>>> print(S1.mask)
==*===
>>> print(S1.wrapped_alignment())
AACBCC
|| |||
AABBCC

>>> S1 = DNAstr("AABBCC")
>>> S2 = DNAstr("AACBCC")
>>> S1.align(S2, "bio")
>>> print(S1.mask)
== ==
>>> print(S1.wrapped_alignment())
AAB·CC
|| ||
AA·BCC

>>> S1 = DNAstr("AABBCCXYZZZ")
>>> S2 = DNAstr("AACBCCZZXXX")
>>> S1.align(S2, "bio")
>>> print(S1.mask)
== * ==
>>> print(S1.wrapped_alignment())
AABCC··ZZ
|| ||
AA·B·CCZZ

property aligned_code: return aligned code

property alignment_stats: Retrun DNAstr alignment statistics

property entropy: Compute the Shannon entropy of the DNAstr sequence

excess_entropy(other): Compute the excess Shannon entropy of two DNAstr sequences H(A)+H(B)-2*H(AB)

extract_motifs(pattern='YAZB', minlen=4, plot=True)

Extract and analyze YAZB motifs (canonical and distorted) from the symbolic sequence.

Parameters:

pattern (str) – Canonical motif pattern (default is ‘YAZB’).
minlen (int) – Minimum motif length to be considered valid.
plot (bool) – If True, generate a motif density plot using xloc or sequence index.

Returns:

Table of detected motifs with start/end positions, length, and classification.

Return type:

pd.DataFrame

find(pattern, regex=False)

Finds all fuzzy (or regex-based) occurrences of a DNA-like sequence pattern.

Parameters:

pattern (str) – The symbolic sequence to search for (e.g., “YAZB”).
regex (bool, optional) – If False (default), interprets pattern as symbolic and inserts ‘.’ between characters. If True, uses the raw pattern as a regular expression.

Returns:

A list of DNAstr slices with attributes:

iloc: (start_idx, end_idx)
xloc: (x_start, x_end)
width: segment width

Return type:

list of DNAstr

html_alignment()

Render the alignment using HTML with color coding:

green: match
blue: gap
red: substitution

Returns:: Displays HTML directly in Jupyter/Notebook environments.
Return type:: None

jaccard(other)

Compute the Jaccard distance between two DNAstr sequences.

Parameters:: other (DNAstr) – The other DNAstr sequence to compare with.
Returns:: Jaccard distance: 1 - (intersection / union) of unique letters.
Return type:: float
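
The formula in the Returns field can be written directly on Python sets. This is a minimal sketch of that definition with a hypothetical function name; it is not the method's source.

```python
def jaccard_distance_sketch(a: str, b: str) -> float:
    """1 - |intersection| / |union| over the unique letters of a and b."""
    letters_a, letters_b = set(a), set(b)
    union = letters_a | letters_b
    if not union:
        return 0.0  # two empty sequences: treated as identical (assumption)
    return 1.0 - len(letters_a & letters_b) / len(union)

print(jaccard_distance_sketch("YAZB", "YAZC"))
```

Because only unique letters count, this distance ignores letter order and repetition.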

jensen_shannon(other, base=2)

Compute the Jensen-Shannon distance between self and another DNAstr.

Parameters:

other (DNAstr) – Another DNAstr instance.
base (float, optional) – Base for the logarithm (default: 2)

Returns:

Jensen-Shannon distance.

Return type:

float
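
For orientation, the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence between two probability distributions. The sketch below applies it to letter-frequency distributions, which is an assumption about what the method compares. The function name is hypothetical and both sequences must be non-empty.

```python
from collections import Counter
from math import log, sqrt

def jensen_shannon_sketch(a: str, b: str, base: float = 2) -> float:
    """Jensen-Shannon distance between the letter distributions of a and b."""
    alphabet = sorted(set(a) | set(b))
    count_a, count_b = Counter(a), Counter(b)
    p = [count_a[ch] / len(a) for ch in alphabet]
    q = [count_b[ch] / len(b) for ch in alphabet]
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(u, v):
        # Kullback-Leibler divergence; terms with u_i = 0 contribute nothing.
        return sum(ui * log(ui / vi, base) for ui, vi in zip(u, v) if ui > 0)

    divergence = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
    return sqrt(max(divergence, 0.0))

print(jensen_shannon_sketch("YYAAZZBB", "YAABZBB"))
```

With base 2 the distance lies in [0, 1]: 0 for identical distributions and 1 for distributions with no letter in common.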

levenshtein(other, use_alignment=True, engine=None, engineOpts=None, forced=False)

Compute the Levenshtein distance between this DNAstr and another one.

Parameters:

other (DNAstr) – Another DNAstr object to compare against.
use_alignment (bool, default=True) – If True, uses the aligned sequences (computed if necessary). If False, compares the raw sequences directly.
engine ({'difflib', 'bio'}, optional) – Alignment engine to use if alignment is needed.
engineOpts (dict, optional) – Parameters for the selected alignment engine.
forced (bool, default=False) – Force alignment even if dx values differ.

Returns:

dist – Levenshtein distance between the two sequences (aligned or raw).

Return type:

int

Examples

>>> A = DNAstr("YAZBZAY")
>>> B = DNAstr("YAZBZZY")
>>> A.levenshtein(B, use_alignment=False)  # raw
>>> A.levenshtein(B, use_alignment=True, engine="bio")  # aligned
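
For the raw (unaligned) case, the Levenshtein distance is the classic dynamic-programming edit distance. The sketch below is a standard two-row implementation on plain strings, given only as a reference for the quantity; the function name is hypothetical and the method itself may compute it differently.

```python
def levenshtein_sketch(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + cost,         # match or substitution
            ))
        prev = curr
    return prev[-1]

print(levenshtein_sketch("YAZBZAY", "YAZBZZY"))
```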

property mutation_counts: Counts of insertions, deletions/substitutions, and matches.

mutual_entropy(other=None): Compute the Shannon mutual entropy of two DNAstr sequences from their aligned segments

plot_alignment(dx=1.0, dy=1.0, width=20, normalize=True)

Plot a block alignment view of two DNAstr sequences with color-coded segments.

Parameters:

dx (float) – Horizontal step between segments (defaults to 1.0).
dy (float) – Vertical height increment for symbolic waveform visualization.
width (int) – Number of characters per row (line wrapping).

Returns:

matplotlib.figure.Figure
matplotlib.axes.Axes

plot_mask()

Plot a color-coded mask of the alignment between sequences.

Returns:: Matplotlib figure of the alignment mask.
Return type:: matplotlib.figure.Figure

score(normalized=True)

Return an alignment score, optionally normalized.

Parameters:: normalized (bool) – If True (default), return score as a fraction of total aligned positions.
Returns:: Alignment score.
Return type:: float
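
Given the mask convention ('=' for matches), a normalized score of this kind can be read straight off the mask. The sketch below assumes the score counts '=' positions and divides by the mask length when normalized; the function name is hypothetical and the library may weight gaps or substitutions differently.

```python
def alignment_score_sketch(mask: str, normalized: bool = True) -> float:
    """Number of matches in an alignment mask, optionally as a fraction."""
    matches = mask.count("=")
    if not normalized:
        return float(matches)
    return matches / len(mask) if mask else 0.0

print(alignment_score_sketch("==*==="))
```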

summary()

Summarize the DNAstr with key stats: length, unique letters, entropy, etc.

Returns:: Dictionary containing length, letter frequency, Shannon entropy, and dx.
Return type:: dict

to_signal()

Converts the symbolic DNA sequence into a synthetic NumPy array mimicking the original wavelet-transformed signal.

Rules per letter:

‘A’: Crosses zero upward → linear from -1 to +1, zero in the middle
‘Z’: Crosses zero downward → linear from +1 to -1, zero in the middle
‘B’: Increasing negative → from -1 to 0
‘Y’: Decreasing negative → from 0 to -1
‘C’: Increasing positive → from 0 to +1
‘X’: Decreasing positive → from +1 to 0
‘_’: Flat at 0

Returns:: Synthetic signal array matching the symbolic encoding.
Return type:: numpy.ndarray
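
The rules above amount to one linear ramp per letter. A minimal sketch follows, assuming a fixed number of points per letter; the real method presumably accounts for segment widths and dx, and the names `RAMPS` and `code_to_signal_sketch` are hypothetical.

```python
import numpy as np

# (start, end) of the linear ramp for each letter, following the rules above.
RAMPS = {
    "A": (-1.0, 1.0),
    "Z": (1.0, -1.0),
    "B": (-1.0, 0.0),
    "Y": (0.0, -1.0),
    "C": (0.0, 1.0),
    "X": (1.0, 0.0),
    "_": (0.0, 0.0),
}

def code_to_signal_sketch(code: str, points_per_letter: int = 5) -> np.ndarray:
    """Concatenate one linear ramp per letter of the symbolic code."""
    pieces = [np.linspace(*RAMPS[ch], points_per_letter) for ch in code]
    return np.concatenate(pieces) if pieces else np.empty(0)

y = code_to_signal_sketch("YAZB")
print(y.shape)
```

For the canonical YAZB motif the ramps join end to end (0 → -1 → +1 → -1 → 0), which is the shape of a Ricker-transformed peak.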

vectorized(codebook={'A': 1, 'B': 2, 'C': 3, 'X': 4, 'Y': 5, 'Z': 6, '_': 0})

Map the DNAstr content to an integer array using a codebook.

Parameters:: codebook (dict, optional) – Dictionary mapping characters to integer values. Default: {"A": 1, "B": 2, "C": 3, "X": 4, "Y": 5, "Z": 6, "_": 0}. If None, a codebook is generated from the symbols present in the current string only.
Returns:: Vectorized integer representation of the string.
Return type:: np.ndarray

wrapped_alignment(width=80, colors=True)

Return a line-wrapped alignment view (multi-line), optionally color-coded for terminal/IPython usage (Spyder, Jupyter).

Parameters:

width (int) – Number of characters per line in wrapped display.
colors (bool) – If True, use ANSI codes to highlight differences. May be overridden if terminal does not support ANSI (e.g., Spyder).

Returns:

Wrapped, optionally colorized alignment.

Return type:

str

class sig2dna_core.signomics.SinusoidalEncoder(d_model=96, N=10000, dtype=<class 'numpy.float32'>)

Bases: object

🌀 Generic sinusoidal encoder/decoder supporting symbolic and numeric sequences.

Each scalar value is transformed into a vector of dimension d_model, where alternating components contain sinusoidal features of increasing frequency. The mapping is based on:

For k = 0 to d_model/2 - 1:

f_{2k}(x) = sin(x / r_k)
f_{2k+1}(x) = cos(x / r_k)

where r_k = N^(2k / d_model).

This representation preserves relative positions and scaling in a smooth, topologically faithful embedding space. The class supports multiple decoding strategies, scaling logic, residual control, and round-trip verification.
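
The mapping above can be sketched in a few lines of NumPy. This is an illustration of the formula only, with hypothetical function names; the decoder shown uses just the k = 0 pair (r_0 = 1) and is therefore valid only for |x| < π, unlike the class's own decoding methods.

```python
import numpy as np

def sin_encode_sketch(values, d_model=8, N=10000):
    """Encode scalars as interleaved sin/cos features of shape (n, d_model)."""
    x = np.asarray(values, dtype=float)[:, None]     # (n, 1)
    k = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    r = N ** (2 * k / d_model)                       # r_k = N^(2k / d_model)
    emb = np.empty((x.shape[0], d_model))
    emb[:, 0::2] = np.sin(x / r)                     # f_{2k}
    emb[:, 1::2] = np.cos(x / r)                     # f_{2k+1}
    return emb

def naive_decode_sketch(emb):
    """Recover x from the k = 0 pair only; valid for |x| < pi."""
    return np.arctan2(emb[:, 0], emb[:, 1])

emb = sin_encode_sketch([0.0, 1.0, 2.5])
print(emb.shape)
print(naive_decode_sketch(emb))
```

Larger inputs wrap around in the k = 0 pair, which is why the slower-varying pairs (k > 0) and input scaling matter for decoding.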

Parameters:

d_model (int) – Dimensionality of each sinusoidal encoding (must be even).
N (int) – Frequency base for the positional encoding.
dtype (np.dtype) – Output data type (default: np.float32).

d_model

Embedding dimensionality.

Type:: int

N

Frequency base.

Type:: int

dtype

Data type for encoded output.

Type:: np.dtype

_last_input_type

Last input type passed to encode.

Type:: type

_last_input_length

Last input length passed to encode.

Type:: int

_scale

Scaling factor applied to normalize input values.

Type:: float or None

_auto_scale_enabled

Whether autoscaling is enabled.

Type:: bool

_decode_residual_tolerance

Tolerance for residual error checking in decode verification.

Type:: float

encode(values, scale=None): Encodes a sequence of values (scalar or symbolic) into sinusoidal embeddings.

decode(embedding, method='least_squares', return_error=False): Decodes the embedding to the original values using the selected inverse method.

fit_encoder(values, target_range=10.0): Automatically estimates and stores a scaling factor to normalize input values.

set_decode_tolerance(tol): Sets the residual error threshold above which decoding results will raise a warning.

verify_roundtrip(values, method='least_squares', scale='auto', verbose=True, return_details=False): Checks round-trip accuracy of encoding and decoding. Warns if residuals exceed tolerance.

sinencode_dna_grouped(code, d_part, N): Encodes a codes entry (triplet-based segments grouped by letter).

sindecode_dna_grouped(grouped, reference_code, d_part, N): Reconstructs code dictionary from grouped sinusoidal embeddings.

sinencode_dnafull_grouped(dnafull, d_model, N): Encodes codesfull entries (strings) into grouped embeddings.

sindecode_dnafull_grouped(grouped): Decodes grouped full embeddings back to DNAstr.

Static Methods

to_complex(emb): Convert a sinusoidal embedding (sin, cos) into complex numbers using Euler’s identity.

complex_distance(emb1, emb2, norm='L2'): Compute pointwise distances between two embeddings in the complex sinusoidal space.

angle_difference(emb): Compute angular differences Δθ between consecutive elements of a sinusoidal embedding.

phase_alignment(emb, ref): Align the phase of an embedding emb to a reference embedding ref using complex phase factors.

pairwise_similarity(emb, metric='cosine'): Compute a pairwise similarity or distance matrix (‘cosine’ or ‘L2’) between all elements.

group_centroid(emb, labels=None, return_std=False): Compute the centroid (and optionally standard deviation) of groups in complex embedding space.

phase_unwrap(emb, normalize=False): Perform phase unwrapping (à la Fourier) on the sinusoidal embedding, optionally normalized to [0, 1].

Example (without scaling)

>>> s = SinusoidalEncoder(8, 100) # poor encoder (8 dimensions, high N)

>>> a = s.encode([0, 1, 1, 2, 2, 3, 4, 5, 6, 6])

>>> s.decode(a)

Output:

[0.0, 1.0000000072927564, 1.0000000072927564, 1.9999999989612762, 1.9999999989612762, 2.9999999881593866, 3.9999997833243714, 5.000000005427378, 5.999999552487366, 5.999999552487366]

Notes:

Use lower N (e.g., N = 1000) to compress phase variation and allow larger input range.
Use scaling (via fit_encoder() or scale=) for large or high-resolution inputs.

Example (with scaling)

>>> s2 = SinusoidalEncoder(128, 10000)

>>> a2 = s2.encode([0, 1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 16, 99, 130], scale=100)

>>> s2.decode(a2)

Output:

[0.0, 1.0000000127956146, 1.0000000127956146, 1.999999922004248, 1.999999922004248, 2.999999960015871, 3.9999999840341935, 5.000000076901334, 5.999999935247362, 5.999999935247362, 6.999999909109446, 6.999999909109446, 6.999999909109446, 8.000000084954152, 8.000000084954152, 8.000000084954152, 8.000000084954152, 16.000000431016925, 99.00000148970207, 129.99999823222058]

Advanced Example

>>> import numpy as np

>>> import matplotlib.pyplot as plt

>>> # 1. Construct a test input signal with smooth and jump segments

>>> x_smooth = np.linspace(0, 20, 100)

>>> x_jumps = np.array([25, 25, 26, 27, 100, 101, 130])

>>> x = np.concatenate([x_smooth, x_jumps])

>>> # 2. Initialize encoder with high d_model and N

>>> s = SinusoidalEncoder(d_model=128, N=10000)

>>> # 3. Fit auto-scaling to compress input into sinusoidal-friendly space

>>> s.fit_encoder(x, target_range=10)

>>> # 4. Encode and decode using all robust methods

>>> a = s.encode(x)

>>> decoded_lsq, err_lsq = s.decode(a, method='least_squares', return_error=True)

>>> decoded_svd, err_svd = s.decode(a, method='svd', return_error=True)

>>> # 5. Compare errors

>>> true = x

>>> lsq_error = np.abs(decoded_lsq - true)

>>> svd_error = np.abs(decoded_svd - true)

>>> # 6. Plot results

>>> fig, axs = plt.subplots(2, 2, figsize=(12, 8))

>>> axs[0, 0].plot(true, label="Original")

>>> axs[0, 0].plot(decoded_lsq, '--', label="Decoded (LSQ)")

>>> axs[0, 0].plot(decoded_svd, ':', label="Decoded (SVD)")

>>> axs[0, 0].set_title("Decoded vs Original")

>>> axs[0, 0].legend()

>>> axs[0, 1].plot(lsq_error, label="Abs Error (LSQ)")

>>> axs[0, 1].plot(svd_error, label="Abs Error (SVD)")

>>> axs[0, 1].set_yscale('log')

>>> axs[0, 1].set_title("Absolute Decoding Error (log scale)")

>>> axs[0, 1].legend()

>>> axs[1, 0].plot(err_lsq, label="Residual Norm (LSQ)")

>>> axs[1, 0].plot(err_svd, label="Residual Norm (SVD)")

>>> axs[1, 0].set_yscale('log')

>>> axs[1, 0].set_title("Reconstruction Residuals")

>>> axs[1, 0].legend()

>>> axs[1, 1].hist(lsq_error, bins=50, alpha=0.7, label="LSQ")

>>> axs[1, 1].hist(svd_error, bins=50, alpha=0.5, label="SVD")

>>> axs[1, 1].set_title("Histogram of Absolute Errors")

>>> axs[1, 1].legend()

>>> plt.suptitle("🌀 SinusoidalEncoder: Accuracy Evaluation", fontsize=14)

>>> plt.tight_layout()

>>> plt.show()

References (for the encoding)

- Vaswani et al. (2017), "Attention Is All You Need"

- https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)

static angle_difference(emb)

Compute angular differences (∆θ) between consecutive embeddings.

Parameters:: emb (np.ndarray) – Encoded array of shape (n, 2d)
Returns:: Array of shape (n-1, d) of angular differences in radians
Return type:: np.ndarray

static complex_distance(emb1, emb2, norm='L2')

Compute distance between two encoded arrays using complex projection.

Parameters:

emb1 (np.ndarray) – Encoded arrays of shape (…, 2d) to compare.
emb2 (np.ndarray) – Encoded arrays of shape (…, 2d) to compare.
norm (str) – ‘L2’ for Euclidean norm, ‘cos’ for cosine angle distance.

Returns:

Distance values per sample (1D array)

Return type:

np.ndarray

decode(embedding, method='least_squares', return_error=False)

Decode sinusoidal embeddings back to original values using selected method.

Parameters:

embedding (np.ndarray) – Encoded sinusoidal array of shape (n, d_model)
method (str) – Decoding strategy: ‘least_squares’ (default), ‘svd’, ‘optimize’, or ‘naive’
return_error (bool) – If True, returns (decoded_values, residual_error) as a tuple

Returns:

decoded (list or array) – Reconstructed input values
residual (np.ndarray, optional) – Residuals of decoding (per sample), returned only if return_error=True

encode(values, scale=None)

Encode input values into sinusoidal embeddings.

Parameters:

values (array-like) – Values to encode.
scale (float or None) – Rescaling factor. If None, auto-scaling is applied if enabled.

Returns:

Embedded values of shape (n, d_model)

Return type:

np.ndarray

fit_encoder(values, target_range=10.0)

Fit a scaling factor to normalize values into a target sinusoidal-safe range.

Parameters:

values (array-like) – Original values to encode (will determine scale).
target_range (float) – Maximum scaled range to span (e.g. [0, 10]).

Returns:

Recommended scale factor stored internally.

Return type:

float

static group_centroid(emb, labels=None, return_std=False)

Compute the centroid (average embedding) of each group in complex sinusoidal space.

Parameters:

emb (np.ndarray) – Encoded array of shape (n, 2d), with sin and cos interleaved.
labels (list or np.ndarray, optional) – Group labels (n,). If None, the entire set is treated as one group.
return_std (bool) – Whether to also return the standard deviation per group.

Returns:

Dictionary mapping each group label to its centroid (2d real array). If return_std=True, also includes key ‘<label>_std’ with standard deviation.

Return type:

dict

static pairwise_similarity(emb, metric='cosine')

Compute a pairwise similarity (or distance) matrix in sinusoidal embedding space.

Parameters:

emb (np.ndarray) – Encoded array of shape (n, 2d).
metric (str) – Distance metric: ‘cosine’ for 1 - cosine similarity, ‘L2’ for Euclidean norm.

Returns:

Pairwise similarity matrix of shape (n, n)

Return type:

np.ndarray

static phase_alignment(emb, ref)

Align embedding emb to reference ref using complex phase.

Parameters:

emb (np.ndarray) – Encoded array to align (n, 2d)
ref (np.ndarray) – Reference encoded array (n, 2d)

Returns:

Aligned encoding of emb, same shape as input.

Return type:

np.ndarray

static phase_unwrap(emb, normalize=False)

Perform Fourier-like phase unwrapping on sinusoidal embedding.

Parameters:

emb (np.ndarray) – Encoded array of shape (n, 2d).
normalize (bool) – Whether to scale unwrapped phases to [0, 1].

Returns:

Phase unwrapped matrix of shape (n, d), optionally normalized.

Return type:

np.ndarray

set_decode_tolerance(tol=0.001)

Set maximum acceptable residual error for decoding.

Parameters:: tol (float) – Residual threshold above which a warning is triggered.

static sindecode_dna_grouped(grouped, reference_code, d_part=32, N=10000)

Decode sinusoidal embeddings grouped by letter into symbolic segments.

Parameters:

grouped (dict) – Dictionary of letter: embeddings
reference_code (dict) – Must include ‘dx’ (sampling resolution).
d_part (int) – Number of dimensions per part (start, width, height)
N (int) – Frequency base

Returns:

Dictionary with keys: letters, widths, heights, xloc, iloc, dx

Return type:

dict

static sinencode_dna_grouped(code, d_part=32, N=10000)

Encode symbolic code segments grouped by letter into sinusoidal embeddings.

Parameters:

code (dict) –
Must contain:
- ’letters’: str
- ’xloc’: list of (start, end) tuples
- ’widths’: list of float
- ’heights’: list of float
d_part (int) – Number of dimensions per field (position, width, height)
N (int) – Sinusoidal frequency base

Returns:

Dictionary of embeddings by letter: {letter: np.ndarray(n, 3*d_part)}

Return type:

dict

static to_complex(emb)

Convert sinusoidal embedding into a complex array using Euler’s identity.

Parameters:: emb (np.ndarray) – Array of shape (…, 2d) where sin/cos pairs are stored.
Returns:: Complex array of shape (…, d) with values exp(i * theta)
Return type:: np.ndarray
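
Since each (sin, cos) pair is a point on the unit circle, the conversion is cos θ + i·sin θ per pair. The sketch below assumes sin values sit at even indices and cos values at odd indices, as in the class formula; the function name is hypothetical.

```python
import numpy as np

def to_complex_sketch(emb):
    """Map interleaved (sin, cos) pairs of shape (..., 2d) to exp(i*theta), shape (..., d)."""
    emb = np.asarray(emb)
    sin_part = emb[..., 0::2]
    cos_part = emb[..., 1::2]
    return cos_part + 1j * sin_part

theta = np.array([0.3, 1.2])
# One sample with two interleaved pairs: [sin t0, cos t0, sin t1, cos t1].
emb = np.stack([np.sin(theta), np.cos(theta)], axis=-1).reshape(1, -1)
z = to_complex_sketch(emb)
print(np.angle(z))
```

The phase of each complex entry recovers θ modulo 2π, and its modulus is 1 for an exact encoding.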

verify_roundtrip(values, method='least_squares', scale='auto', verbose=True, return_details=False)

Perform an encode → decode → compare roundtrip and report accuracy.

Parameters:

values (array-like) – Original values to test.
method (str) – Decoding method: ‘least_squares’, ‘svd’, ‘optimize’, or ‘naive’.
scale (float or 'auto' or None) – Scaling strategy: ‘auto’ uses fit_encoder(), float uses fixed scaling, None disables scaling.
verbose (bool) – If True, prints accuracy report.
return_details (bool) – If True, also returns the encoded array, decoded values, and residuals.

Returns:

success (bool) – Whether all values were accurately recovered within residual tolerance.
details (tuple, optional) – Tuple (encoded, decoded, residuals) if return_details=True

class sig2dna_core.signomics.generator(kind='gauss'): Bases: object

sig2dna_core.signomics.import_local_module(name: str, relative_path: str)

Import a module by name from a file path relative to the calling module.

Usage here:

import_local_module("figprint", "figprint.py") replaces `import figprint` (zero-installation)

Parameters:

name (str) – Name to assign to the module (used internally).
relative_path (str) – Path relative to the calling module’s location.

Returns:

module: Imported module object.

Example:

>>> figprint = import_local_module("figprint", "figprint.py")
>>> figprint.print_pdf(...)
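
Loading a module from a file path is typically done with importlib. The sketch below shows that standard mechanism with an explicit path; it is not the function's source. The name `load_module_from_path_sketch` is hypothetical, and resolving the path relative to the caller (for example with `Path(__file__).parent`) is left to the caller here.

```python
import importlib.util
from pathlib import Path

def load_module_from_path_sketch(name: str, path):
    """Import the Python file at `path` as a module called `name`."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)     # run the file's top-level code
    return module
```

A call such as `load_module_from_path_sketch("figprint", Path(__file__).parent / "figprint.py")` would mirror the usage shown above.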

class sig2dna_core.signomics.peaks(data=None)

Bases: object

A class for managing a collection of peak definitions used in synthetic signal generation.

Each peak is represented as a dictionary with the following fields:

‘name’ (str): unique identifier (autogenerated if not provided)
‘x’ (float): center position (e.g., time, wavenumber, index)
‘w’ (float): width (related to FWHM)
‘h’ (float): peak height
‘type’ (str): generator type (e.g., ‘gauss’, ‘lorentz’, ‘triangle’)

Supports:

Flexible addition and broadcasting of peak parameters
Named or indexed access to individual or multiple peaks
Overloaded operators for peak translation and scaling
Utility methods: update, sort, rename, remove_duplicates, copy
Conversion to signal object via .to_signal()
Informative __str__ and __repr__ output

This class is used to build reproducible and structured test cases for symbolic encoding (e.g., sig2dna).

add(x, w=1.0, h=1.0, name=None, type='gauss')

Add one or multiple peaks to the collection.

Parameters:

x (float or array-like) – Center positions of the peaks.
w (float or array-like) – Width(s) of the peaks (broadcastable).
h (float or array-like) – Height(s) of the peaks (broadcastable).
name (str or list of str or None) – Peak name(s); auto-generated if None or duplicate.
type (str) – Generator type, e.g., ‘gauss’, ‘lorentz’, etc.
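
The broadcasting of x, w and h described above can be illustrated with NumPy. This is only a sketch under assumptions: broadcast_peaks and the "P0, P1, …" naming scheme are hypothetical and do not reproduce the class's own name generation or duplicate handling.

```python
import numpy as np


def broadcast_peaks(x, w=1.0, h=1.0, kind="gauss", prefix="P"):
    """Expand scalar or array-like peak parameters into a list of peak dicts."""
    # scalars are stretched to the length of the longest argument
    xs, ws, hs = np.broadcast_arrays(np.atleast_1d(x), w, h)
    return [
        {"name": f"{prefix}{i}", "x": float(xi), "w": float(wi),
         "h": float(hi), "type": kind}
        for i, (xi, wi, hi) in enumerate(zip(xs, ws, hs))
    ]


# one shared width, three centers and three heights
plist = broadcast_peaks(x=[400, 800, 1600], w=30, h=[1.0, 0.6, 0.9])
```

Here the single width 30 is reused for all three peaks, while the centers and heights are paired element by element.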

as_dict(): Return the list of peaks as dict

copy(): Return a deep-copy of the peaks

names(): Return the list of names

remove_duplicates()

rename(prefix='P')

sort(order='asc')

Sort peaks in-place based on their center positions (x values).

Parameters:

order (str) –

Sorting direction. Use:

”asc” for ascending (default)
”desc” for descending

to_signal(index=None, name=None, generator_map=None, x=None, x0=0.0, n=1000): Generate a signal from a peaks object. Optionally restrict to a subset.

update(data)

Update or insert peaks from a list of dictionaries.

Parameters:

data (list of dict) – Each dict must include at least ‘x’, ‘w’, ‘h’. If ‘name’ matches an existing peak, that peak is updated. If ‘name’ is new or missing, the peak is appended.
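
The update-or-insert rule can be sketched on a plain list of dicts. upsert_peaks is a hypothetical helper, not the class's method; unlike the class, it does not auto-generate a name for appended peaks.

```python
def upsert_peaks(peak_list, data):
    """Update peaks whose 'name' already exists; append all other entries."""
    index = {p["name"]: i for i, p in enumerate(peak_list) if "name" in p}
    for entry in data:
        missing = {"x", "w", "h"} - entry.keys()
        if missing:
            raise ValueError(f"peak is missing required keys: {sorted(missing)}")
        name = entry.get("name")
        if name in index:
            peak_list[index[name]].update(entry)   # existing name: update in place
        else:
            peak_list.append(dict(entry))          # new or missing name: append
            if name is not None:
                index[name] = len(peak_list) - 1
    return peak_list
```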

class sig2dna_core.signomics.signal(x=None, y=None, name='signal', type='generic', x_label='index', x_unit='-', y_label='intensity', y_unit='a.u.', metadata=None, source='array', user=None, date=None, host=None, cwd=None, version=None, color=None, linewidth=2, linestyle='-', message=None, fullhistory=True)

Bases: object

signal: A self-documented 1D analytical signal container for reproducible scientific workflows.

This class is designed for lab-grade signal processing and traceable data storage. It represents a discrete 1D signal (e.g., chromatogram, spectrum, transient) with full metadata, support for symbolic transformation, numerical operations, plotting, and structured saving/loading.

Key features include:
- Portable metadata (user, time, host, cwd, version)
- Domain-aware plots and operations
- Reproducible signal serialization in JSON or compressed format
- Full traceability of all transformation events
- Optional recursive backup of prior states

Attributes:

x (np.ndarray) – Sampling domain (e.g., time, wavelength, chemical shift).
y (np.ndarray) – Signal values aligned with x.
name (str) – Label for plots and file storage (used as default filename).
type (str) – Optional tag (e.g., ‘GC-MS’, ‘FTIR’, ‘NMR’, ‘synthetic’).
x_label (str) – Label for the x-axis (e.g., ‘wavenumber’).
x_unit (str) – Unit of the x-axis (e.g., ‘cm⁻¹’).
source (str) – Origin label (‘array’, ‘peaks’, ‘noise’, ‘imported’…).
metadata (dict) – Includes user, date, host, cwd, version; filled automatically unless overridden.
color (str or [rgb]), linestyle (str), linewidth (float) – Plot style attributes.
_previous (signal) – Deep copy of the object as stored by backup().
_history (dict) – Entries of the form “user@host:timestamp | uidkey”: {“action”: str, “details”: str}.

Key Methods:

normalize(...) → normalize the signal to positive values
from_peaks(...) → construct a signal from a peaks object
add_noise(...) → return a noisy variant (Poisson, Gaussian, ramp or constant bias)
align_with(...) → align this signal with another (same x domain)
copy() → deep copy
save(...) → save as JSON or .gz (optional CSV export)
load(...) → load from a saved file
plot(...) → plot the signal with axis labels
backup(...) → back up the current signal (deep copy stored in _previous)
restore(...) → restore the previous state of the signal
apply_poisson_baseline_filter(...) → apply a Poisson-based filter
enable_fullhistory → enable full history
disable_fullhistory → disable full history
_toDNA(signal) → DNAsignal

Overloaded Operators:

+, -, *, / → operate on signals or scalars, aligning them if needed
+=, -=, *=, /= → in-place functional versions (return a new signal)

Low-level Methods:

_current_stamp() → stamp for events (static method)
_copystatic() → deep copy of the signal only (use copy for a full copy) (static method)
_events() → register a processing step
_to_serializable → convert the signal into a dictionary suitable for JSON export
_from_serizalizable → convert a dict (e.g., from JSON import) into a signal

Example

>>> s = signal(x, y, name="sample", type="FTIR", x_label="wavenumber", x_unit="cm⁻¹")
>>> s.add_noise("gaussian", 0.05).plot()
>>> s.save()  # saves to ./sample.json.gz
>>> s2 = signal.load("sample.json.gz")

add_noise(kind='gaussian', scale=1.0, bias=None): Return a new signal with noise and/or bias added.

align_with(other, mode='union', n=1000)

Align two signals to a common x grid with interpolation and padding.

Parameters:

other (signal) – the other signal to align with
mode (str) – ‘union’ (default) or ‘intersection’
n (int) – number of points for the new grid

Returns:

(self_interp, other_interp) as new signal instances

Return type:

tuple
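
The two alignment modes can be sketched on raw arrays. The sketch assumes linear interpolation and zero padding outside each signal's original range; the method's actual interpolation and padding rules are not stated here, and align_pair is a hypothetical name.

```python
import numpy as np


def align_pair(x1, y1, x2, y2, mode="union", n=1000):
    """Resample two sampled curves onto one shared, evenly spaced x grid."""
    if mode == "union":
        lo, hi = min(x1[0], x2[0]), max(x1[-1], x2[-1])
    elif mode == "intersection":
        lo, hi = max(x1[0], x2[0]), min(x1[-1], x2[-1])
        if lo >= hi:
            raise ValueError("the two x-ranges do not overlap")
    else:
        raise ValueError("mode must be 'union' or 'intersection'")
    grid = np.linspace(lo, hi, n)
    # np.interp expects increasing x; left/right=0.0 pads outside each range
    y1i = np.interp(grid, x1, y1, left=0.0, right=0.0)
    y2i = np.interp(grid, x2, y2, left=0.0, right=0.0)
    return grid, y1i, y2i


x1, y1 = np.linspace(0, 10, 11), np.ones(11)
x2, y2 = np.linspace(5, 15, 11), 2 * np.ones(11)
gu, a_u, b_u = align_pair(x1, y1, x2, y2, mode="union", n=16)        # grid spans 0..15
gi, a_i, b_i = align_pair(x1, y1, x2, y2, mode="intersection", n=6)  # grid spans 5..10
```

With ‘union’ each signal is padded where it has no data; with ‘intersection’ only the shared range is kept, so no padding is needed.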

apply_poisson_baseline_filter(window_ratio=0.02, gain=1.0, proba=0.9)

Apply a baseline filter assuming Poisson-dominated statistics with adjustable gain and a rejection threshold based on the Bienaymé-Tchebychev inequality.

The signal is filtered by removing values likely caused by statistical noise (false peaks) using a per-point threshold defined from local statistics:

Local mean: $$ \mu_t = \frac{1}{w} \sum_{i \in W(t)} y_i $$

Local std dev: $$ \sigma_t = \sqrt{\mu_t \cdot \text{gain}} $$

Coefficient of variation: $$ \text{cv}_t = \frac{\sigma_t}{\mu_t} $$

Estimated local intensity (lambda): $$ \lambda_t = \frac{1}{\text{cv}_t^2} $$

Bienaymé-Tchebychev threshold: $$ \text{threshold}_t = \frac{1}{\sqrt{1 - p}} \cdot \sqrt{10 \lambda_t \cdot \Delta t} $$

Parameters:

window_ratio (float, default=0.02) – Ratio of signal length used as window size (must yield an odd integer ≥ 11).
gain (float, default=1.0) – Linear amplification factor applied to simulate signal counts.
proba (float, default=0.9) – Minimum probability to consider a signal point significant. Must be in (0, 1).

Returns:

The current signal instance (self), with updated y.

Return type:

signal

Raises:

ValueError – If the window size is too small for reliable statistics.
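
The threshold formulas listed for this method can be evaluated numerically as follows. This is a sketch under assumptions, not the method's implementation: poisson_threshold is a hypothetical name, the window is a centered moving average with edge padding, and dt is taken as the sampling step. How the method then uses the threshold to reject points is not reproduced here.

```python
import numpy as np


def poisson_threshold(y, dt=1.0, window_ratio=0.02, gain=1.0, proba=0.9):
    """Per-point threshold computed from the local-mean formulas."""
    y = np.asarray(y, dtype=float)
    w = int(round(window_ratio * y.size))
    w += (w + 1) % 2                          # force an odd window length
    if w < 11:
        raise ValueError("window too small for reliable local statistics")
    half = w // 2
    padded = np.pad(y, half, mode="edge")     # assumption: edge padding at borders
    mu = np.convolve(padded, np.ones(w) / w, mode="valid")  # local mean
    mu = np.maximum(mu, np.finfo(float).tiny)               # avoid division by zero
    sigma = np.sqrt(mu * gain)                # local std dev
    cv = sigma / mu                           # coefficient of variation
    lam = 1.0 / cv**2                         # algebraically equal to mu / gain
    return np.sqrt(10.0 * lam * dt) / np.sqrt(1.0 - proba)


thr = poisson_threshold(np.full(1000, 4.0))   # constant signal of 4 counts
```

For this constant signal, μ = 4, σ = 2, cv = 0.5 and λ = 4, so with p = 0.9 and Δt = 1 the threshold is √40 / √0.1 = 20 at every point.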

backup(fullhistory=None, message=None): Backup current state in _previous

copy(): Deep copy of the signal, excluding full history control flag

disable_fullhistory(): Disable full history tracking

enable_fullhistory(): Enable full history tracking

classmethod from_peaks(peaks_obj, x=None, generator_map=None, name='from_peaks', x0=None, n=1000)

Generate a signal from a set of peaks.

Parameters:

peaks_obj (peaks) – A list-like object containing peak definitions.
x (array-like, float, or None) – If None: compute x domain from peaks. If scalar: interpreted as xmax; linspace from x0 to xmax. If array: use as x directly.
generator_map (dict or None) – Optional map of peak type → generator instance (default is Gaussian).
name (str) – Name of the signal instance.
x0 (float or None) – Left bound of the domain (used only if x is None or scalar). If None: inferred from peaks.
n (int) – Number of points in the generated x array.

Returns:

A new signal instance generated from the peaks.

Return type:

signal

Example

>>> p = peaks()
>>> p.add(x=[400, 800, 1600], w=30, h=[1.0, 0.6, 0.9], type="gauss")
>>> s = signal.from_peaks(p, x0=300, n=2048)
>>> s.plot()
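
What such a call produces can be approximated with plain NumPy. The sketch assumes that ‘w’ is the full width at half maximum (the class only states that it is related to FWHM) and that the domain extends three widths past the last peak when no upper bound is given; gaussian_sum is a hypothetical name.

```python
import numpy as np


def gaussian_sum(peak_list, x0=0.0, xmax=None, n=1000):
    """Sum Gaussian peaks on a linspace grid (assumes 'w' is the FWHM)."""
    if xmax is None:
        # assumption: extend the domain three widths past the last peak
        xmax = max(p["x"] + 3.0 * p["w"] for p in peak_list)
    x = np.linspace(x0, xmax, n)
    y = np.zeros_like(x)
    for p in peak_list:
        sigma = p["w"] / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> sigma
        y += p["h"] * np.exp(-0.5 * ((x - p["x"]) / sigma) ** 2)
    return x, y


plist = [{"x": 400, "w": 30, "h": 1.0},
         {"x": 800, "w": 30, "h": 0.6},
         {"x": 1600, "w": 30, "h": 0.9}]
x, y = gaussian_sum(plist, x0=300, n=2048)
```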

static load(filepath)

Load a signal from a JSON or gzipped JSON file, including recursive _previous.

Parameters:

filepath (str or Path) – Path to the JSON or .gz file.

Returns:

A fully reconstructed signal object.

Return type:

signal

property n: Return the length of the signal and None if it is None

normalize(mode='zscore+shift', inplace=True, shift_eps=1e-06)

Normalize the signal to positive values using different normalization strategies.

Parameters:

mode (str) – Normalization mode:
- “zscore+shift”: (y - mean) / std, then shift so that the minimum is shift_eps
- “minmax”: (y - min) / (max - min), scales to [0, 1]
- “max”: y / max, scales to [0, 1] for non-negative y
- “l1”: y / sum(|y|), sums to 1 for non-negative y (like a probability)
- “energy”: y / sqrt(sum(y^2)), unit energy
- “none”: no normalization; returns a copy or the signal itself
inplace (bool) – Whether to modify the signal in place. If False, returns a new signal.
shift_eps (float) – Minimum value to add after z-score shift to ensure strictly positive output.

Returns:

Normalized signal (if inplace is False), else None.

Return type:

signal or None

Raises:

ValueError – If the normalization fails (e.g., due to division by zero).
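
The normalization modes can be written out on a raw array as follows. This is a sketch that follows the formulas given for each mode; normalize_values is a hypothetical name, and the in-place behavior and history tracking of the method are not modeled.

```python
import numpy as np


def normalize_values(y, mode="zscore+shift", shift_eps=1e-6):
    """Return a normalized copy of y for the documented modes."""
    y = np.asarray(y, dtype=float)
    if mode == "none":
        return y.copy()
    if mode == "zscore+shift":
        denom = y.std()
    elif mode == "minmax":
        denom = y.max() - y.min()
    elif mode == "max":
        denom = y.max()
    elif mode == "l1":
        denom = np.abs(y).sum()
    elif mode == "energy":
        denom = np.sqrt(np.sum(y**2))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if denom == 0:
        raise ValueError(f"cannot normalize with mode {mode!r}: zero denominator")
    if mode == "zscore+shift":
        z = (y - y.mean()) / denom
        return z - z.min() + shift_eps   # the minimum becomes shift_eps
    if mode == "minmax":
        return (y - y.min()) / denom
    return y / denom                     # 'max', 'l1' and 'energy'
```

A constant signal has zero standard deviation and zero range, which is the division-by-zero case that raises ValueError in this sketch.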

plot(ax=None, label=None, color=None, linestyle=None, linewidth=None, fontsize=12, newfig=False)

Plot the signal using matplotlib, applying either internal style settings or overrides provided at call time.

Parameters:

ax (matplotlib.axes.Axes, optional) – Axis to plot on. If None, uses current axis or new figure if newfig=True.
label (str, optional) – Legend label. Defaults to self.name.
color (str or None) – Line color. If None, uses default matplotlib cycling.
linestyle (str or None) – Line style (e.g., ‘-’, ‘--’). If None, uses self.linestyle.
linewidth (float or None) – Line width. If None, uses self.linewidth.
fontsize (int or str) – Font size for axis labels and legend. Can use values like ‘small’, ‘large’.
newfig (bool) – If True, creates a new figure before plotting.

Returns:

matplotlib.figure.Figure
matplotlib.axes.Axes

restore(): Restore the previous signal version if available

sample(x_new): Interpolate values from x

save(filepath=None, zip=True, export_csv=False)

Save signal to JSON (optionally compressed) and optionally CSV.

Parameters:

filepath (str or Path or None) – If None, builds path from metadata[‘cwd’] and self.name + ‘.json[.gz]’. If a directory, appends name + ‘.json[.gz]’. If a file, uses as is.
zip (bool) – Whether to compress the JSON file using gzip. Default: True.
export_csv (bool) – If True, also save a .csv file (x,y) alongside the JSON.
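
The gzip-compressed JSON round trip can be sketched with the standard library. The helper names and the use of a plain dict payload are assumptions for illustration; the class's own serialized layout is not reproduced, and the flag is called compress here to avoid shadowing the built-in zip.

```python
import gzip
import json
from pathlib import Path


def save_json(payload, filepath, compress=True):
    """Write a dict as JSON, gzip-compressed when compress is True."""
    filepath = Path(filepath)
    if compress:
        with gzip.open(filepath, "wt", encoding="utf-8") as f:
            json.dump(payload, f)
    else:
        filepath.write_text(json.dumps(payload), encoding="utf-8")
    return filepath


def load_json(filepath):
    """Read a JSON file back, detecting gzip from the .gz suffix."""
    filepath = Path(filepath)
    opener = gzip.open if filepath.suffix == ".gz" else open
    with opener(filepath, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Because JSON nests naturally, a prior state stored under a key such as "_previous" can itself contain another prior state, which is one way a recursive backup chain could be serialized.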

class sig2dna_core.signomics.signal_collection(*signals, n=1024, mode='union', name=None, force=True)

Bases: list

A container class for multiple signal instances that ensures alignment on a shared x-grid.

The collection is used to manage, compare, combine, or visualize multiple signals (e.g., from replicates, experiments, synthetic scenarios). Signals are interpolated and padded on insertion so all have the same shape and domain. Arithmetic, matrix extraction, and overlay plots are supported.

Parameters:

*signals (signal) – One or more signal instances to include (they are copied and aligned).
n (int) – Number of sampling points in the aligned x grid (default: 1024).
mode (str) – Alignment mode: ‘union’ or ‘intersection’ of x-ranges.

Core Attributes:

mode (str) – Alignment strategy used (“union” or “intersection”).
n (int) – Number of x-points used in alignment (default=1024).

Key Methods:

append(signal) → add and align a new signal
to_matrix() → convert signals to a 2D array (n_signals x n_points)
mean(coeffs=None) → weighted or unweighted mean
sum(coeffs=None) → weighted or unweighted sum
plot(…) → overlay signals with optional mean/sum
copy → all signals stored are deep copies
generate_synthetic → signal collection composed of random peaks.
__getitem__(…) → slice, list, or name-based access to signals
__repr__ / __str__ → report contents with span and names
_toDNA(signal_collection) → list of DNAsignals

Access Patterns:

sc[0:3] → subcollection by slice
sc[[0, 2]] → subcollection by list of indices
sc["name"] → return a copy of signal with that name
sc["A", "B"] → return a subcollection with those names

Supports arithmetic operations for aligned signal mixtures.

Arithmetic operations on aligned signal collections

Scalar multiplication: a * sc scales each signal by a constant a.
Collection addition: sc1 + sc2 adds two collections element-wise.
Linear combinations: a * A + b * B + c * C constructs mixtures of compatible collections.
Compatible with sum([a*A, b*B, …]) for aggregating multiple weighted collections.

Requirements

All signal_collections must share the same number of signals.
Signals are aligned on a common x-grid (same n, mode, and domain).
Element-wise operations preserve signal names and metadata when possible.

Examples:

>>> sc = signal_collection(s1, s2, s3)
>>> sc.plot(show_mean=True)

>>> sc[0:2]         # sub-collection (copy)
>>> sc["peak1"]     # get copy of signal named 'peak1'
>>> mat = sc.to_matrix()

>>> sc.mean().plot()
>>> sc.sum(indices_or_names=[0, 1], coeffs=[0.4, 0.6]).plot()

append(new_signal): Append and align the new signal to the existing collection.

classmethod generate_mixtures(n_mixtures=10, max_peaks=16, peaks_per_mixture=(3, 8), amplitude_range=(0.5, 2), flatten='mean', n_signals=1, n_peaks=1, kinds=('gauss',), width_range=(0.5, 3), height_range=(1.0, 5.0), x_range=(0, 500), n_points=1024, normalize=False, seed=None, **kwargs)

Generate synthetic mixtures of signals by combining a subset of base peaks.

Parameters:

n_mixtures (int) – Number of synthetic mixtures to generate.
max_peaks (int) – Maximum number of base signals (from which peaks are taken).
peaks_per_mixture (tuple of (int, int)) – Range (min, max) for the number of peaks to combine in each mixture. Cannot exceed max_peaks.
amplitude_range (tuple of (float, float)) – Random scaling range applied to peak amplitudes in each mixture.
flatten ({'sum', 'mean'}, default='mean') – How to combine the signals for each mixture.
**kwargs (dict) – All other keyword arguments passed to generate_synthetic.

Returns:

result_collection (signal_collection) – A collection of synthetic mixed signals.
all_peaks (list of dict) – All individual peaks originally generated.
used_peak_ids (list of list of str) – For each mixture, the list of peak names used.

Examples

>>> S, pS, idS = signal_collection.generate_mixtures(
...     n_mixtures=30,
...     max_peaks=12,
...     peaks_per_mixture=(4, 8),
...     amplitude_range=(0.2, 1.5),
...     n_signals=12,
...     kinds=("gauss",),
...     width_range=(0.5, 3),
...     height_range=(1.0, 5.0),
...     x_range=(0, 500),
...     n_points=2048,
...     normalize=False,
...     seed=123
... )
>>> S.plot()

classmethod generate_synthetic(n_signals=5, n_peaks=5, kind_distribution='uniform', kinds=('gauss', 'lorentz', 'triangle'), x_range=(0, 1000), n_points=1024, avoid_overlap=True, width_range=(20, 60), height_range=(0.5, 1.0), normalize=True, noise=None, bias=None, name_prefix='synthetic', seed=None)

Generate a synthetic signal collection composed of random peaks.

Parameters:

n_signals (int) – Number of synthetic signals to generate.
n_peaks (int or tuple[int,int]) – Number of peaks per signal or its range.
kind_distribution (str) – ‘uniform’ → use all peak kinds equally; ‘random’ → random draw from kinds.
kinds (tuple[str]) – Generator types to choose from (‘gauss’, ‘lorentz’, ‘triangle’).
x_range (tuple[float, float]) – Start and end of x-domain.
n_points (int) – Number of sampling points for each signal (default: 1024).
avoid_overlap (bool) – Prevent peaks from overlapping by checking spacing vs. width.
width_range (tuple[float, float]) – Range of widths for the peaks.
height_range (tuple[float, float]) – Range of peak heights.
normalize (bool) – Normalize each signal so the highest peak has intensity 1.
noise (dict or None) – Optional noise model, e.g. {"kind": "gaussian", "scale": 0.01}.
bias (float, str, np.ndarray, or signal) – Optional signal bias: can be a constant, ‘ramp’, or signal.
name_prefix (str) – Base name for each generated signal.
seed (int or None) – Random seed for reproducibility.

Returns:

A collection of generated signals.

Return type:

signal_collection
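
The seeded, non-overlapping placement of peaks can be sketched with rejection sampling. The overlap criterion used here (centers closer than the sum of the two widths) is an assumption, since the description only says that spacing is checked against width; place_peaks and max_tries are hypothetical names.

```python
import numpy as np


def place_peaks(n_peaks, x_range=(0, 1000), width_range=(20, 60), seed=None,
                max_tries=10_000):
    """Draw peak centers and widths so that no two peaks overlap."""
    rng = np.random.default_rng(seed)   # same seed -> same draw sequence
    placed = []                         # accepted (center, width) pairs
    for _ in range(max_tries):
        if len(placed) == n_peaks:
            break
        c = rng.uniform(*x_range)
        w = rng.uniform(*width_range)
        # assumption: peaks overlap when centers are closer than w1 + w2
        if all(abs(c - c0) >= w + w0 for c0, w0 in placed):
            placed.append((c, w))
    if len(placed) < n_peaks:
        raise RuntimeError("could not place all peaks without overlap")
    return sorted(placed)


layout = place_peaks(5, seed=123)
```

Passing the same seed reproduces the same layout, which is what makes synthetic test cases repeatable.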

Examples

# 1. Default random peaks, Gaussian + ramp bias
sc = signal_collection.generate_synthetic(
    n_signals=5, n_peaks=6, kinds=("gauss", "lorentz", "triangle"),
    noise={"kind": "gaussian", "scale": 0.02}, bias="ramp", name_prefix="test"
)
sc.plot(show_mean=True, fontsize="large")

# 2. High-res signal, fixed width and height
sc2 = signal_collection.generate_synthetic(
    n_signals=3, n_peaks=8, kinds=("gauss",), width_range=(30, 30),
    height_range=(1.0, 1.0), x_range=(0, 500), n_points=2048,
    normalize=False, seed=123
)
sc2.plot(fontsize=14)

# 3. Poisson noise, no overlap, save output
sc3 = signal_collection.generate_synthetic(
    n_signals=2, n_peaks=5, noise={"kind": "poisson", "scale": 2.0},
    name_prefix="poisson_example"
)
for s in sc3:
    s.save(export_csv=True)

mean(indices_or_names=None, coeffs=None)

Mean of selected signals, optionally weighted.

Parameters:

indices_or_names (list[int or str], optional) – Signal names or indices to include.
coeffs (list[float], optional) – Weights for selected signals.

Returns:

Averaged signal.

Return type:

signal

plot(indices=None, labels=True, title=None, newfig=None, ax=None, show_mean=False, show_sum=False, coeffs=None, fontsize=12, colormap=None)

Plot selected signals with style attributes and optional overlays.

Parameters:

indices (list[int] or list[str], optional) – Signals to plot by index or name.
labels (bool) – Whether to show signal labels.
title (str) – Plot title.
newfig (bool or None) – If True, always open a new figure. If False, use current axes. If None, open new figure only the first time this collection is plotted.
ax (matplotlib axis, optional) – Axis to draw on.
show_mean (bool) – Overlay mean curve.
show_sum (bool) – Overlay sum curve.
coeffs (list[float], optional) – Optional weights for mean/sum.
fontsize (int or str) – Font size for labels and legend.
colormap (list[str], optional) – List of colors to cycle through when signal.color is None.

Return type:

matplotlib.figure.Figure

sum(indices_or_names=None, coeffs=None)

Sum selected signals, optionally weighted by coeffs.

Parameters:

indices_or_names (list[int or str], optional) – If provided, selects a subset by index or name.
coeffs (list[float], optional) – Weights matching the number of selected signals.

Returns:

Summed signal.

Return type:

signal

to_matrix(): Return a 2D array (n_signals x n_points) of aligned signal values.