API Reference
Module: sig2dna_core.signomics.py (from the Generative Simulation Initiative)
This is the core module of the sig2dna framework, dedicated to transforming numerical chemical signals into DNA-like symbolic representations. The module enables symbolic analysis, fingerprinting, alignment, and classification of complex analytical signals, such as:
GC-MS / GC-FID
HPLC-MS
NMR / FTIR / Raman
X-ray (RX) and other spectroscopy data
It is designed to facilitate high-throughput pattern recognition, compression, clustering, and AI/ML-based classification. Symbolic transformation is based on wavelet decomposition (Mexican Hat/Ricker) and segment encoding into letters (e.g., A, B, C, X, Y, Z). The representation preserves key structural patterns (e.g., peak transitions) across multiple scales and supports entropy-based distances.
Main Components
DNAsignal — core class to transform a signal into DNA-like symbolic representation
DNAstr — string subclass enabling alignment, entropy analysis, visualization, and reconstruction
DNApairwiseAnalysis — distance and clustering toolbox for aligned DNA codes (PCoA, dendrogram, 2D/3D plots)
Key Features
Multi-scale wavelet transform with symbolic encoding
Symbolic entropy and mutual information measures
Fast symbolic alignment using difflib or biopython
Pairwise symbolic distances (Shannon, excess entropy, Jaccard, Jensen-Shannon)
Interactive plotting: segments, alignment masks, triangle patches
Motif search, alignment visualization, HTML and terminal rendering
Dimensionality reduction (MDS), clustering, dendrogram and heatmaps
Core Concept
Input Signal
- 1D NumPy array S of shape (m,)
- Data type: np.float64 (default) or np.float32
- Typically sparse and non-negative, such as GC-MS total ion chromatograms
Wavelet Transform
- CWT with Ricker (Mexican hat) wavelet
- Scales: $s = 2^0, 2^1, \dots, 2^n$
- Downsampling by scale (to reduce data volume and capture features at relevant resolutions)
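The transform step can be sketched with plain NumPy. This is a minimal illustration, not the module's implementation: the helper names `ricker` and `ricker_cwt`, the kernel support of about ten scale widths, and the `[::s]` downsampling are assumptions made for the example (the module itself relies on PyWavelets).

```python
import numpy as np

def ricker(n_points, a):
    """Ricker (Mexican hat) wavelet of width parameter a, sampled on n_points."""
    x = np.arange(n_points) - (n_points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1.0 - (x / a) ** 2) * np.exp(-(x ** 2) / (2.0 * a ** 2))

def ricker_cwt(y, scales):
    """Convolve y with a Ricker kernel at each scale (hypothetical helper)."""
    y = np.asarray(y, dtype=np.float64)
    out = {}
    for s in scales:
        n = min(10 * int(s) + 1, len(y))  # assumed kernel support, odd length
        out[s] = np.convolve(y, ricker(n, s), mode="same")
    return out

# Synthetic single Gaussian peak centered at index 200
x = np.arange(512)
y = np.exp(-0.5 * ((x - 200) / 6.0) ** 2)

scales = [2 ** k for k in range(5)]          # 1, 2, 4, 8, 16
coeffs = ricker_cwt(y, scales)

# Assumed reading of "downsampling by scale": keep one point every s samples
downsampled = {s: c[::s] for s, c in coeffs.items()}
```

Each transformed signal is positive under the peak and negative on its flanks, which is what the letter rules in the next part rely on.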
Symbolic Encoding and compressed representation
The transformed signal Ts at scale s is converted to a sequence of symbolic letters using the rules:
Symbol | Description |
---|---|
A | Monotonic increase crossing from − to + |
B | Monotonic increase from − to − (no zero crossing) |
C | Monotonic increase from + to + |
X | Monotonic decrease from + to + (no zero crossing) |
Y | Monotonic decrease from − to − |
Z | Monotonic decrease crossing from + to − |
_ | Zero or noise (after filtering) |
Each encoded segment is associated with:
- width: number of points
- height: amplitude difference
These form the compressed representation.
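The letter rules can be sketched as follows. This is a hedged illustration only: `encode_segments` and `_letter` are hypothetical names, the width is counted in sample steps, and the treatment of segments that start or end exactly at zero is an assumption rather than the module's documented behavior.

```python
import numpy as np

def _letter(direction, y0, y1):
    """Map a monotonic segment (direction, start value, end value) to a letter."""
    if direction == 0:
        return "_"                       # flat segment
    if direction > 0:                    # monotonic increase
        if y0 < 0 and y1 > 0:
            return "A"                   # crossing from - to +
        return "B" if y1 <= 0 else "C"   # stays negative / stays positive
    if y0 > 0 and y1 < 0:                # monotonic decrease
        return "Z"                       # crossing from + to -
    return "X" if y1 >= 0 else "Y"       # stays positive / stays negative

def encode_segments(t, tol=0.0):
    """Split t into monotonic runs; return (letter, width, height) triplets."""
    t = np.asarray(t, dtype=float)
    diff = np.diff(t)
    d = np.sign(np.where(np.abs(diff) <= tol, 0.0, diff))
    segments, start = [], 0
    for i in range(1, len(d) + 1):
        if i == len(d) or d[i] != d[start]:
            y0, y1 = t[start], t[i]
            segments.append((_letter(d[start], y0, y1), i - start, float(y1 - y0)))
            start = i
    return segments

# A single positive lobe with negative flanks, as a Ricker transform produces
segs = encode_segments([0, -1, -2, 1, 3, 1, -2, -1, 0, 0])
code = "".join(letter for letter, _, _ in segs)
```

On this toy input the code reads `YAZB_`, the pattern the reference uses elsewhere as the signature of a peak.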
Installation
Install all dependencies with:
conda install pywavelets seaborn scikit-learn
conda install -c conda-forge python-Levenshtein biopython
Examples
>>> from signomics import DNAsignal
>>> from signal import signal
>>> # Load a sampled signal (e.g., from GC-MS, Raman)
>>> S = signal.from_peaks(...) # or any constructor for sampled signals
>>> # Encode into DNA-like format
>>> D = DNAsignal(S, encode=True)
>>> D.encode_dna()
>>> D.plot_codes(scale=4)
>>> # Compare samples and cluster
>>> Dlist = [DNAsignal(S1, encode=True), DNAsignal(S2, encode=True), ...]
>>> analysis = DNAsignal._pairwiseEntropyDistance(Dlist, scale=4)
>>> analysis.plot_dendrogram()
>>> analysis.scatter(n_clusters=3)
Notes
The methodology implemented in this module covers and extends the approaches initially tested during the PhD of Julien Kermorvant. “Concept of chemical fingerprints applied to the management of chemical risk of materials, recycled deposits and food packaging”. PhD thesis AgroParisTech. December 2023. https://theses.hal.science/tel-04194172
Maintenance & forking
$ git init -b main
$ gh repo create sig2dna --public --source=. --remote=origin --push
$ # alternatively
$ # git remote add origin git@github.com:ovitrac/sig2dna.git
$ # git branch -M main # Ensure current branch is named 'main'
$ # git push -u origin main # Push and set upstream tracking
$ tree -P '.py' -P '.md' -P 'LICENSE' -I '__pycache__|.*' --prune
$ conda activate base
$ pdoc ./sig2dna_core/signomics.py -f --html -o ./docs
$ doctoc --github --maxlevel 2 README.md
$ conda activate sphinxdoc
$ cd docs_sphinx/
$ make clean
$ make html
$ cp -rp build/html/. ../docs
Author: Olivier Vitrac (olivier.vitrac@gmail.com)
Revision: 2025-06-13
- class sig2dna_core.signomics.DNACodes(*args, meta=None, encoded=False, **kwargs)
Bases:
UserDict
🧬 DNACodes Dictionary-like container for symbolic signal encodings at multiple scales.
- meta
Metadata describing the signal and encoding parameters.
- Type:
dict
- encoded
Whether the content has been sinusoidally encoded.
- Type:
bool
- sinencode(d_part=32, N=10000)
Encodes symbolic segments using transformer-style sinusoidal embeddings.
- sindecode(reference_dx=None)
Decodes sinusoidal embeddings back to symbolic segment structure.
- summary()
Displays segment or vector counts by scale.
- plot(figsize=(12, 4), d_part=None, N=None)
Plot method for DNACodes
- plot(figsize=(12, 4), d_part=None, N=None)
Plot method for DNACodes: visualizes encoded vectors or symbolic segment distribution.
- Parameters:
figsize (tuple) – Figure size for the entire plot.
d_part (int, optional) – Number of dimensions per segment part (only for encoded).
N (int, optional) – Frequency base (for metadata or title info).
- sindecode(reference_dx=None)
Decode sinusoidally embedded codes grouped by letter into symbolic segment structure.
- Parameters:
reference_dx (float, optional) – Sampling interval used to reconstruct xloc. Defaults to meta["sampling_dt"].
- Returns:
Decoded symbolic codes for each scale.
- Return type:
DNACodes
- sinencode(d_part=32, N=10000)
Encode symbolic segments at each scale using sinusoidal encoding grouped by letter.
- Parameters:
d_part (int) – Number of dimensions for each component (start, width, height).
N (int) – Frequency base for sinusoidal embedding.
- Returns:
Encoded version of the current codes, grouped by letter per scale.
- Return type:
DNACodes
- summary()
Print the number of encoded vectors or symbolic segments per scale.
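As an illustration of what sinencode() describes, the sketch below embeds the three segment components (start, width, height) with transformer-style sinusoids of `d_part` dimensions each. The function names `sinusoidal` and `encode_segment`, and the interleaving of sine and cosine terms, are assumptions for the example; the actual layout used by DNACodes may differ.

```python
import numpy as np

def sinusoidal(value, d_part=32, N=10000):
    """Embed one scalar into d_part dimensions (d_part must be even)."""
    i = np.arange(d_part // 2)
    angle = value / N ** (2 * i / d_part)
    out = np.empty(d_part)
    out[0::2] = np.sin(angle)   # even slots: sine terms
    out[1::2] = np.cos(angle)   # odd slots: cosine terms
    return out

def encode_segment(start, width, height, d_part=32, N=10000):
    """Concatenate the embeddings of start, width and height (3 * d_part values)."""
    return np.concatenate([sinusoidal(v, d_part, N) for v in (start, width, height)])

vec = encode_segment(start=12.0, width=5.0, height=0.8)
```

With the default `d_part=32`, each segment becomes a vector of 96 values.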
- class sig2dna_core.signomics.DNAFullCodes(*args, meta=None, encoded=False, **kwargs)
Bases:
dict
🧬 DNAFullCodes(dict)
A container for symbolic full-resolution DNA-like strings or their sinusoidal embeddings, organized per scale.
This structure maps each scale (typically corresponding to a wavelet or resolution level) to either:
a DNA-like string (str or DNAstr) representing symbolic patterns over time, or
a compressed embedding (dict of vectors) after sinusoidal encoding.
It supports signal discretization, symbolic transformation, sinusoidal encoding, dimensionally-reduced analysis, and visual comparison of encoded motifs.
- meta
Optional metadata (e.g. sampling rate, units, scale definitions, etc.).
- Type:
dict
- encoded
Whether this instance contains sinusoidally encoded data.
- Type:
bool
- unwrapped_matrix
When applicable, stores a matrix {scale: ndarray (n_letters, d_model)} from compressed representations via unwrap_letters_to_matrix().
- Type:
dict, optional
- sinencode(d_model=96, N=10000, operation=None)
Encodes symbolic data with sinusoidal positional encoding. Supports per-letter reduction via ‘sum’ or ‘mean’. Returns a new encoded instance.
- sindecode()
Attempts to reconstruct the symbolic string by repeating letters. Only works if the original operation did not compress to a single vector per letter.
- unwrap_letters_to_matrix()
Converts compressed encodings (after ‘sum’ or ‘mean’) into (n_letters × d_model) matrices per scale. Required for d-space plotting.
- plot(figsize=(12, 4))
Plots the letter-wise composition (symbolic form) or encoded means (if encoded=True).
- plot_unwrapped_matrix(figsize=(12, 4))
Visualizes each letter’s embedding vector in d-space, with one subplot per scale.
Example
>>> codes = DNAFullCodes({4: 'YAABZZ'}, meta={"sampling_dt": 0.5})
>>> encoded = codes.sinencode(operation="mean")
>>> encoded.unwrap_letters_to_matrix()
>>> encoded.plot_unwrapped_matrix()
- plot(figsize=(12, 4))
Plot method for DNAFullCodes: visualizes encoded vectors or DNA string composition.
- Parameters:
figsize (tuple) – Figure size for the entire plot.
- plot_unwrapped_matrix(figsize=(12, 4))
Plot each letter’s encoded vector in the abstract embedding space (d-space). One curve per letter, one subplot per scale.
Requires unwrap_letters_to_matrix() to have been called.
- Parameters:
figsize (tuple) – Base figure size. Height will be scaled based on number of scales.
- Returns:
The generated matplotlib figure.
- Return type:
matplotlib.figure.Figure
- sindecode()
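Attempt to reconstruct the symbolic string by repeating letters. This only works if the encoding did not compress each letter to a single vector (i.e., no ‘sum’ or ‘mean’ operation was applied).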
- sinencode(d_model=96, N=10000, operation='sum')
Sinencode method — encodes each letter in the DNAFullCodes as a set of sinusoidal embeddings.
- Parameters:
d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 96).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None, optional) – If “sum”, sums all encodings per letter. If “mean”, averages all encodings per letter. If None, keeps the full (n_occurrences, d_model) matrix per letter. Raises a ValueError if the operation is not one of the above.
- Returns:
DNAFullCodes (without aggregation) – A new DNAFullCodes instance with encoded representations and metadata.
DNAFullCodes (with aggregation) – A new DNAFullCodes instance with encoded and aggregated (based on operation) representations and metadata.
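The effect of the operation argument can be sketched as below. This is a minimal stand-in under stated assumptions: `positional_encoding` and `encode_by_letter` are hypothetical helpers, and positions are taken as character indices in the full-resolution string.

```python
import numpy as np

def positional_encoding(positions, d_model=96, N=10000):
    """Sinusoidal encoding of integer positions, shape (n_positions, d_model)."""
    pos = np.asarray(positions, dtype=float)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / N ** (2 * i / d_model)
    pe = np.empty((pos.shape[0], d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def encode_by_letter(code, d_model=96, N=10000, operation=None):
    """Group position encodings by letter, optionally reduced by 'sum' or 'mean'."""
    if operation not in (None, "sum", "mean"):
        raise ValueError("operation must be None, 'sum' or 'mean'")
    pe = positional_encoding(np.arange(len(code)), d_model, N)
    encoded = {}
    for letter in sorted(set(code)):
        idx = [k for k, ch in enumerate(code) if ch == letter]
        block = pe[idx]                      # (n_occurrences, d_model)
        if operation == "sum":
            block = block.sum(axis=0)        # one vector per letter
        elif operation == "mean":
            block = block.mean(axis=0)       # one vector per letter
        encoded[letter] = block
    return encoded

full = encode_by_letter("YAABZZ")                     # keeps every occurrence
mean = encode_by_letter("YAABZZ", operation="mean")   # compressed form
```

Only the uncompressed form keeps one row per occurrence, which is why decoding back to a string is restricted to that case.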
- summary()
Return a brief summary of the full codes per scale.
- Returns:
Mapping from scale to summary string or code length.
- Return type:
dict
- unwrap_letters_to_matrix()
Assemble all encoded letter vectors into a matrix of shape (n_letters, d_model) for each scale.
Applies only when the encoding was performed with an operation (“sum” or “mean”). Stores the result in self.unwrapped_matrix as a dict {scale: matrix}. Returns the dictionary for chaining or inspection.
- Raises:
ValueError – If the encoding is not compressed (i.e., operation is None or missing), or if encoded entries are inconsistent in shape.
- class sig2dna_core.signomics.DNApairwiseAnalysis(D, names, DNAsignals, name=None)
Bases:
object
Class to handle pairwise distance analysis, PCoA, clustering, and visualization for DNA-coded signals.
- D
Pairwise excess entropy distance matrix.
- Type:
np.ndarray
- names
Names of the DNA signals.
- Type:
list
- DNAsignals
Original DNAsignal objects.
- Type:
list
- coords
Coordinates in reduced space (PCoA).
- Type:
np.ndarray
- dimensions
Selected dimensions for reduced analysis.
- Type:
list
- linkage_matrix
Linkage matrix used for hierarchical clustering.
- Type:
np.ndarray
- best_dimension(max_dim=10)
Determine optimal dimension by maximizing silhouette score.
- cluster(t=1.0, criterion='distance')
Assign cluster labels from linkage matrix.
- compute_linkage(method='ward')
Compute hierarchical clustering.
- dimension_variance_curve(threshold=0.5, plot=True, figsize=(8, 5))
Computes the cumulative explained variance (based on pairwise distances) as a function of the number of dimensions used (from 1 to n-1). Optionally plots the curve and the point where the threshold (default 0.5) is reached.
- Parameters:
threshold (float) – Fraction of total variance to reach (default 0.5).
plot (bool) – If True, display the variance curve and highlight dhalf.
figsize (tuple) – Size of the figure if plotted.
- Returns:
dhalf (int) – Number of dimensions needed to reach the threshold.
curve (list of float) – Normalized cumulative variance (in [0, 1]) for dimensions 1 to n-1.
- get_cluster_labels(n_clusters=2, method='ward')
Returns cluster labels from hierarchical clustering. If not computed yet, computes linkage.
- Parameters:
n_clusters (int) – Number of clusters to assign.
method (str) – Linkage method to use if recomputing linkage.
- Returns:
labels – Cluster IDs for each sample.
- Return type:
np.ndarray of int
- heatmap(figsize=(10, 8))
Plot heatmap of pairwise distances.
- static load(path)
Load analysis from file.
- pcoa(n_components=None)
Perform Principal Coordinate Analysis (PCoA).
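For readers unfamiliar with PCoA, the classical algorithm (double centering of squared distances followed by an eigendecomposition) can be sketched with NumPy. The function below is an illustrative stand-in, not the class method; its name and the eigenvalue cutoff are assumptions.

```python
import numpy as np

def pcoa(D, n_components=None):
    """Classical PCoA: coordinates whose Euclidean distances approximate D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, v = np.linalg.eigh(B)
    order = np.argsort(w)[::-1]                  # largest eigenvalues first
    w, v = w[order], v[:, order]
    keep = w > 1e-10 * w.max()                   # drop null/negative directions
    coords = v[:, keep] * np.sqrt(w[keep])
    return coords[:, :n_components] if n_components else coords

# Four points on a 3 x 4 rectangle: distances are exactly Euclidean
pts = np.array([[0, 0], [3, 0], [0, 4], [3, 4]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
coords = pcoa(D)
```

For a distance matrix that is truly Euclidean, the recovered coordinates reproduce the input distances up to rotation and reflection.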
- plot_dendrogram(truncate_mode=None, p=10)
Plot dendrogram from linkage matrix.
- reduced_distances()
Recompute distances on selected subspace.
- save(path)
Save current analysis to file.
- scatter(dims=(0, 1), annotate=True, figsize=(8, 6), n_clusters=None)
2D scatter plot in selected dimensions with optional cluster-based coloring.
- Parameters:
dims (tuple) – Dimensions to plot (default: (0, 1)).
annotate (bool) – If True, annotate points with their index.
figsize (tuple) – Size of the plot.
n_clusters (int or None) – If provided, use clustering to color points.
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- scatter3d(dims=(0, 1, 2), annotate=True, n_clusters=None)
3D scatter plot in selected dimensions with optional cluster-based coloring.
- Parameters:
dims (tuple) – Dimensions to plot.
annotate (bool) – Annotate points with their index.
n_clusters (int or None) – If provided, use clustering to color points.
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- select_dimensions(dims)
Update active dimensions.
- class sig2dna_core.signomics.DNAsignal(signal_obj, sampling_dt=1.0, dtype=<class 'numpy.float64'>, encode=False, encoder=['compute_cwt', 'encode_dna', 'encode_dna_full'], scales=[1, 2, 4, 8, 16, 32], x_label='index', x_unit='-', y_label='Intensity', y_unit='', plot=False, plotter=['plot_signals', 'plot_transforms', 'plot_codes'])
Bases:
object
DNAsignal(signal):
A class to encode a numerical signal (typically a 1D GC-MS trace, NMR/FTIR/Raman spectra or time series) into a DNA-like symbolic representation using wavelet analysis. This symbolic coding enables fast comparison, search, and alignment of signal features using abstracted patterns (e.g., ‘YAZB’).
The class supports:
- Continuous Wavelet Transform (CWT) using Ricker wavelets
- Symbolic conversion of wavelet features to DNA-like letters (A, B, C, X, Y, Z)
- Visualization of CWTs, symbolic encodings, and signal overlays
- Reversible decoding of symbolic segments back into approximate signals
- Substring extraction and matching
- Storage of multi-scale representations (multi-resolution DNA encoding)
This class is part of the symbolic signal transformation pipeline (sig2dna), compatible with signal, peaks, and signal_collection.
- Parameters:
signal (signal) – An instance of the signal class representing a sampled waveform. It must have x, y, and a valid name or identifier.
encode (bool, default False) – If True, launch the encoders ["compute_cwt", "encode_dna", "encode_dna_full"].
plot (bool, default False) – If True, plot with the plotters ["plot_signals", "plot_transforms", "plot_codes"].
- signal
Original numerical signal (values only; x stored via sampling_dt).
- Type:
np.ndarray
- dtype
Data type of the stored signal.
- Type:
data-type
- sampling_dt
Sampling interval along the x-axis.
- Type:
float
- dx
The nominal x-resolution of the signal (automatically derived).
- Type:
float (deprecated)
- n
Number of points in the signal (length of x/y arrays).
- Type:
int
- name
Name of the signal.
- Type:
str
- x_label
Label of the x-axis.
- Type:
str
- x_unit
Unit of the x-axis.
- Type:
str
- y_label
Label of the y-axis.
- Type:
str
- y_unit
Unit of the y-axis.
- Type:
str
- scales
List of scales used for encoding.
- Type:
list[int]
- codesfull
Dictionary-like container of full-resolution symbolic strings per scale.
- Type:
DNAFullCodes
- scales
Set of scales used in the Continuous Wavelet Transform (powers of 2 by default).
- Type:
array-like
- transforms
Stores the CWT-transformed signals, each as a signal object, indexed by scale.
- Type:
signal_collection
- codes
Symbolic codes by scale level. Each is a DNAstr object representing the symbolic sequence at that scale.
- Type:
dict[int, DNAstr]
- sincodesfull
Sinusoidal position encoded DNAstr (without aggregation)
- Type:
dict[int, DNAFullCodes]
- sincodesfull_aggregated
Sinusoidal position encoded DNAstr (with aggregation)
- codebook
Mapping between symbolic characters (A, B, C, X, Y, Z, _) and wavelet features.
- Type:
dict
- generator
Name of the wavelet basis used (default: ‘ricker’).
- Type:
str
- normalize_signal(mode='zscore+shift')
Normalizes the internal signal (preserves positivity).
- compute_cwt(scales=None, normalize=False)
Computes the Continuous Wavelet Transform using the Ricker wavelet and stores the transformed signals in transforms.
- sparsify_cwt(self, scale: int | float, threshold: float, inplace: bool = True)
Zero out wavelet coefficients below a threshold for a specific scale.
- encode_dna()
Encodes each scale’s transformed signal into a symbolic DNAstr sequence using local maximum coding (ABCXYZ).
- encode_dna_full()
Encodes signals at each scale using the full encoding scheme, preserving the flat regions (_) and finer symbolic transitions.
- plot_signals()
Plots signals.
- plot_codes(scale)
Plots both the wavelet transform and the symbolic code for the given scale.
- plot_transforms()
Plots the stored CWT-transformed signals as a signal collection.
- plot_scalogram()
Plots a scalogram with two subplots.
- decode_dna(scale)
Reconstructs the approximate signal for a given scale from its DNA encoding.
- __getitem__(scale)
Shortcut to access the DNAstr object for a specific scale.
- summary()
Returns a dictionary summarizing encoded scales and metadata.
- has(scale)
Checks whether a DNAstr encoding exists at the given scale.
- pseudoinverse(scales=None, rank=None, return_weights=False, name=None)
Approximates signal reconstruction via pseudo-inverse using stored CWT coefficients.
Static pairwise distance methods
- _pairwiseEntropyDistance(list of DNAstr objects, scale)
Return a DNApairwiseAnalysis instance based on the mutually exclusive information after DNA/code alignment.
- _pairwiseJaccardMotifDistance(list of DNAstr objects, scale)
Return a DNApairwiseAnalysis instance based on the presence/absence of a pattern (default: YAZB).
- _pairwiseJensenShannonDistance(list of DNAstr objects, scale)
Return a DNApairwiseAnalysis instance based on the Jensen-Shannon distance of a pattern.
- _pairwiseLevenshteinDistance(list of DNAstr objects, scale)
Return a DNApairwiseAnalysis instance based on the Levenshtein distance.
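To make the Levenshtein-based distance concrete, here is a pure-Python sketch that builds a symmetric distance matrix from a list of symbolic codes. The helper names are hypothetical, and the module itself lists python-Levenshtein among its dependencies, so this is only an illustration of the quantity being computed.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def pairwise_levenshtein(codes):
    """Symmetric matrix of edit distances with a zero diagonal."""
    n = len(codes)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = levenshtein(codes[i], codes[j])
    return D

D = pairwise_levenshtein(["YAZB", "YAAZB", "YAZB"])
```

A matrix of this shape is the kind of input that the clustering and PCoA tools of DNApairwiseAnalysis operate on.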
Methods: Symbolic Encoding
- encode_dna(scales=None)
Convert signal into triplet-based symbolic encoding (per scale).
- encode_dna_full(scales=None, resolution='index', repeat=True, n_points=None)
Generate full-resolution DNA strings by repeating letters (to codesfull).
Methods: Sinusoidal Encoding
- sinencode_dna(scales=None, d_part=32, N=10000)
Encode symbolic segments (from codes) into sinusoidal vectors.
- sinencode_dna_full(scales=None, d_part=32, N=10000)
Encode symbolic full strings (from codesfull) into sinusoidal vectors.
Methods: Sinusoidal Decoding (Static)
- sindecode_dna(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000)
[static] Decode grouped sinusoidal embeddings → DNACodes structure.
- sindecode_dna_full(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000)
[static] Decode full sinusoidal embeddings → DNAFullCodes structure.
Methods: Signal Reconstruction
- tosignal(scale=None, codes_attr='codes')
Reconstruct approximate signal (as signal instance) from symbolic encoding.
Examples
>>> S = signal.from_peaks(...) # define a signal
>>> dna = DNAsignal(S)
>>> dna.compute_cwt()
>>> dna.encode_dna()
>>> dna.codes[4]
DNAstr("AAAZZZYY...")
>>> dna.plot_codes(4)
>>> dna.codes[4].find("YAZB")
- align_with(other, scale=1)
Align symbolic sequences and compute mutual entropy.
- Returns:
Namespace with fields:
seq1_aligned (str)
seq2_aligned (str)
aligned_signal (list of tuples)
mutual_entropy (float)
- Return type:
SimpleNamespace
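The alignment step can be illustrated with the standard library's difflib, which the feature list names as one of the alignment backends. The sketch below only produces gap-padded aligned strings; the `align` helper and the `-` gap character are assumptions, and the mutual entropy computation is not reproduced here.

```python
from difflib import SequenceMatcher

def align(seq1, seq2, gap="-"):
    """Return seq1 and seq2 padded with gaps so that matching blocks line up."""
    a, b = [], []
    matcher = SequenceMatcher(None, seq1, seq2, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        s1, s2 = seq1[i1:i2], seq2[j1:j2]
        width = max(len(s1), len(s2))
        a.append(s1.ljust(width, gap))   # pad the shorter chunk with gaps
        b.append(s2.ljust(width, gap))
    return "".join(a), "".join(b)

a1, a2 = align("YAZB", "YAAZB")
```

Both outputs always have the same length, and removing the gap characters gives back the original sequences.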
- static apply_baseline_filter(signal, w=None, k=2, delta_t=1.0)
Apply baseline filtering using moving median and local Poisson-based thresholding.
- Parameters:
signal (np.ndarray) – Input signal (expected to be non-negative or baseline-dominated).
w (int or None) – Window size for baseline and statistics (must be odd). Defaults to max(11, 2% of signal length).
k (float) – Bienaymé-Tchebychev multiplier.
delta_t (float) – Sampling time step.
- Returns:
filtered – Signal with baseline removed and low-intensity noise suppressed.
- Return type:
np.ndarray
Note
This method is static; use signal.apply_baseline_filter() instead whenever appropriate.
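One plausible reading of "moving median and local Poisson-based thresholding" is sketched below. Everything in it is an assumption made for illustration: the function name, the edge padding, and the choice of `k * sqrt(baseline)` as the noise threshold (for Poisson counts the variance equals the mean). The `delta_t` parameter is left out of the sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def baseline_filter(y, w=None, k=2.0):
    """Subtract a moving-median baseline, then zero values below k * sqrt(baseline)."""
    y = np.asarray(y, dtype=float)
    if w is None:
        w = max(11, int(0.02 * len(y)))   # default window from the docstring
    if w % 2 == 0:
        w += 1                            # window must be odd
    padded = np.pad(y, w // 2, mode="edge")
    windows = sliding_window_view(padded, w)          # shape (len(y), w)
    baseline = np.median(windows, axis=1)
    residual = np.clip(y - baseline, 0.0, None)
    # Assumed Poisson noise model: standard deviation ~ sqrt(local level)
    threshold = k * np.sqrt(np.clip(baseline, 0.0, None))
    return np.where(residual > threshold, residual, 0.0)

# Flat background of 100 counts with one narrow peak
y = np.full(200, 100.0)
y[49:52] += [200.0, 500.0, 200.0]
filtered = baseline_filter(y)
```

The narrow peak survives with the background removed, while the flat regions are set to zero.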
- compute_cwt(scales=None, apply_filter=False, wavelet='mexh')
Compute Continuous Wavelet Transform (CWT) using the Mexican Hat wavelet.
- Parameters:
scales (list, int, or None) – List of scales (or a single scale) to compute. If None, default to [1, 2, 4, 8, 16].
apply_filter (bool) – Whether to apply a baseline filter to the input signal before transforming.
wavelet (str (default='mexh')) – The name of the PyWavelets-compatible wavelet.
- Sets:
self.scales (list) – The list of actual scales used.
self.filtered_signal (ndarray) – Filtered or raw signal used for CWT.
self.cwt_coeffs (dict) – Dictionary mapping each scale to its 1D coefficient array.
self.transforms (signal_collection) – Collection of signal objects storing the transformed signals for each scale.
- encode_dna(scales=None)
Encode each transformed signal into a symbolic DNA-like sequence of monotonic segments.
- Parameters:
scales (list, int, or None) – List of scales (or a single scale) to encode. If None, use self.scales.
The encoding detects strictly monotonic (or flat) segments and labels them with the symbolic letters A, Z, B, Y, C, X, and _.
- Sets:
self.codes (dict) – Dictionary mapping each scale to a struct with:
letters : str (symbolic encoding)
widths : list of float (x-span of each segment)
heights : list of float (y-delta of each segment)
iloc : list of index-pair tuples (start, end+1)
xloc : list of x-span tuples (x_start, x_end)
dx : segment step (dx)
- encode_dna_full(scales=None, resolution='index', repeat=True, n_points=None)
Convert symbolic codes into DNA-like strings by repeating letters proportionally to their span.
- Parameters:
scales (list, int, or None) – List of scales (or a single scale) to convert. If None, use self.scales.
resolution ({'index', 'x'}) – Repetition mode:
'index': repeat letters by number of indices (j - i from iloc)
'x': interpolate letter values over physical x-axis distance (xloc)
repeat (bool) – If True, repeat or interpolate letters to form a string of desired resolution. If False, return the symbolic sequence without repetition.
n_points (int or None) – Used only for resolution='x' to control the number of interpolation points. If None, defaults to ~10 points per x-unit.
- Returns:
dict – Dictionary mapping each scale to its DNA-like string.
- Sets:
self.codesfull (dict) – Dictionary storing the resulting full DNA-like string per scale.
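The 'index' repetition mode amounts to repeating each letter by the length of its index span. A minimal sketch, assuming a hypothetical `expand_code` helper that takes the letters and their (start, end+1) index pairs:

```python
def expand_code(letters, iloc, repeat=True):
    """Repeat each letter (j - i) times, where (i, j) is its index span."""
    if len(letters) != len(iloc):
        raise ValueError("letters and iloc must have the same length")
    if not repeat:
        return letters                      # compact symbolic sequence
    return "".join(ch * (j - i) for ch, (i, j) in zip(letters, iloc))

full = expand_code("YAZB", [(0, 2), (2, 5), (5, 8), (8, 9)])
```

The expanded string has one character per sample, so strings from signals sampled on the same grid can be compared position by position.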
- static entropy_from_string(s)
Return the entropy of a string.
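For reference, the Shannon entropy of a symbolic string (in bits per symbol) can be computed from its letter frequencies as below. This is a generic sketch, assuming base-2 logarithms; the method's own choice of logarithm base is not stated in this reference.

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    """Shannon entropy of the letter distribution of s, in bits per symbol."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

h_uniform = shannon_entropy("ABCD")   # four equally likely letters
h_single = shannon_entropy("AAAA")    # a single repeated letter
```

A string made of one repeated letter carries no information, while a uniform mix of four letters reaches the maximum of 2 bits per symbol.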
- find_sequence(pattern, scale)
Find occurrences of a specific letter pattern in encoded sequence.
- get_code(scale)
Retrieve encoded data for a specific scale.
- get_entropy(scale)
Calculate Shannon entropy for encoded signal.
- has(scale)
Check if a DNA encoding exists for the specified scale.
- Parameters:
scale (int) – The wavelet scale to check.
- Returns:
True if a symbolic DNAstr encoding exists at the given scale, False otherwise.
- Return type:
bool
Examples
>>> dna.has(4)
True
>>> dna.has(16)
False
- property letters
Return the letters used.
- normalize_signal(mode='zscore+shift')
Normalize the internal signal using one of several strategies that ensure positivity.
- Parameters:
mode (str) – Normalization mode passed to signal.normalize(). See signal.normalize() for available modes.
- Raises:
AttributeError – If signal attribute is missing or of the wrong type.
- plot_codes(scale, ax=None, colormap=None, alpha=0.4)
Plot the symbolic DNA-like encoding as colored triangle segments.
- Parameters:
scale (int) – The scale at which the signal was encoded.
ax (matplotlib.axes.Axes, optional) – Axis to draw on. If None, a new figure is created.
colormap (dict, optional) – Custom mapping of letters to colors. Default uses 7 distinct colors.
alpha (float) – Transparency for the patches. Default is 0.4.
- plot_scalogram()
Plot a scalogram with two subplots:
- Top: colored image of CWT coefficient amplitudes
- Bottom: line curves of selected scales
- Returns:
fig – The matplotlib figure object.
- Return type:
matplotlib.figure.Figure
- plot_signals(scales=None)
Plot signals.
- plot_transforms(indices=None, **kwargs)
Plot the stored CWT-transformed signals as a signal collection.
- Parameters:
indices (list[int or str], optional) – Specific scales or names to plot.
**kwargs – Passed to signal_collection.plot.
- static print_alignment(seq1, seq2, width=80)
Print aligned sequences.
- pseudoinverse(scales=None, rank=None, return_weights=False, name=None)
Approximate signal reconstruction via pseudo-inverse using stored CWT coefficients.
- Parameters:
scales (list, float, int, or None) – Scales to include in the reconstruction. If None, all scales in self.cwt_coeffs are used.
rank (int or None) – Optional truncation rank for the SVD decomposition (for denoising or dimensionality reduction).
return_weights (bool) – If True, also return the weights (contributions) of each scale.
name (str or None) – Optional name for the returned signal. Defaults to “pseudoinverse” with included scales.
- Returns:
reconstructed_signal (signal) – Reconstructed signal instance from the pseudo-inverse of the CWT decomposition.
weights (np.ndarray, optional) – Returned only if return_weights=True, gives the contribution of each scale.
- Raises:
ValueError – If CWT coefficients are not available.
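One way to read the pseudo-inverse reconstruction is as a least-squares fit of the signal on the stacked per-scale coefficient arrays, with an optional rank truncation of the SVD. The sketch below follows that reading; it is an assumption about the approach, and `pseudoinverse_reconstruct` is a hypothetical name.

```python
import numpy as np

def pseudoinverse_reconstruct(y, coeffs, rank=None):
    """Least-squares weights of each scale via a (truncated) SVD pseudo-inverse."""
    scales = sorted(coeffs)
    W = np.column_stack([coeffs[s] for s in scales])      # (n_points, n_scales)
    U, sv, Vt = np.linalg.svd(W, full_matrices=False)
    r = len(sv) if rank is None else min(rank, len(sv))
    keep = sv[:r] > 1e-12 * sv[0]                         # ignore null directions
    U_r, sv_r, Vt_r = U[:, :r][:, keep], sv[:r][keep], Vt[:r][keep]
    weights = Vt_r.T @ ((U_r.T @ np.asarray(y, dtype=float)) / sv_r)
    return W @ weights, dict(zip(scales, weights))

# Toy check: a signal built from two of three "scales" is recovered exactly
rng = np.random.default_rng(0)
coeffs = {1: rng.normal(size=50), 2: rng.normal(size=50), 4: rng.normal(size=50)}
y = 2.0 * coeffs[1] - 0.5 * coeffs[4]
y_hat, weights = pseudoinverse_reconstruct(y, coeffs)
```

Lowering `rank` below the number of scales trades reconstruction accuracy for robustness to noise.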
- static reconstruct_aligned_string(seq, aligned)
Fast reconstruction of aligned signals.
- reconstruct_signal(scale, return_signal=True)
Reconstruct the signal from symbolic features (e.g., YAZB).
- Parameters:
scale (int) – Scale to use for reconstruction.
return_signal (bool) – If True, return a signal object. Else return y array.
- Returns:
Reconstructed signal.
- Return type:
signal or np.ndarray
- static sindecode_dna(grouped_embeddings, reference_dx=1.0, d_part=32, N=10000)
Decode sinusoidal grouped embeddings into a DNACodes structure.
- Parameters:
grouped_embeddings (dict) – {scale: {letter: np.ndarray}} sinusoidal representations
reference_dx (float) – Sampling resolution used to reconstruct xloc and iloc
d_part (int) – Dimensionality per component (start, width, height)
N (int) – Frequency base
- Returns:
Decoded symbolic structure
- Return type:
DNACodes
- sinencode_dna(scales=None, d_part=32, N=10000)
Encode self.codes into sinusoidal embeddings (grouped by letter).
- Sets:
self.codes (DNACodes): Encoded version of original codes.
- sinencode_dna_full(d_model=96, N=10000, operation=None)
🌀 Encode full-resolution DNA-like strings into sinusoidal embeddings grouped by letter.
- Parameters:
d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 96).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None, optional) – If “sum”, sums all encodings per letter. If “mean”, averages all encodings per letter. If None, keeps the full (n_occurrences, d_model) matrix per letter. Raises a ValueError if the operation is not one of the above.
- Sets:
self.codesfull (DNAFullCodes) – Full-resolution symbolic strings at each scale (set if not already present).
self.codesfull_encoded (DNAFullCodes) – Sinusoidally encoded version of the full DNA strings.
- sparsify_cwt(scale=None, threshold=None, inplace=True)
Sparsify CWT coefficients by zeroing values below a threshold.
- Parameters:
scale (int, float, list, or None) – Scale(s) to sparsify. If None, all available scales in self.cwt_coeffs are used.
threshold (float or None) – Absolute value below which coefficients are set to zero. If None, uses 1% of the maximum absolute value at each scale.
inplace (bool) – If True, modifies current instance. If False, returns a modified copy.
- Returns:
Modified copy if inplace is False, otherwise None.
- Return type:
DNAsignal or None
- Raises:
ValueError – If scale(s) not found in self.cwt_coeffs.
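The thresholding rule, including the default of 1% of the maximum absolute value, can be sketched for a single coefficient array as follows. The standalone `sparsify` function is a hypothetical illustration and always returns a copy.

```python
import numpy as np

def sparsify(coeffs, threshold=None):
    """Zero out coefficients whose absolute value is below the threshold."""
    c = np.asarray(coeffs, dtype=float)
    if threshold is None:
        threshold = 0.01 * np.max(np.abs(c))   # default: 1% of max |coefficient|
    return np.where(np.abs(c) < threshold, 0.0, c)

sparse = sparsify([100.0, 0.5, -0.5, -50.0, 2.0])   # default threshold is 1.0
```

Small coefficients on both sides of zero are removed, while large negative ones are kept because the comparison uses absolute values.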
- static synthetic_signal(x, peaks, baseline=None)
Generate flexible synthetic signals. (obsolete)
- tosignal(scale=None, codes_attr='codes')
Reconstruct an approximate signal from symbolic encodings.
- Parameters:
scale (int or None) – Scale level to use (defaults to first available if None).
codes_attr (str) – Attribute from which to decode (‘codes’ or ‘codesfull’).
- Returns:
An approximate signal object reconstructed from symbolic information.
- Return type:
signal
- class sig2dna_core.signomics.DNAsignal_collection(*signals, vtmscale=None, rasterscan=True, dtype=<class 'numpy.float32'>)
Bases:
list
A collection of DNAsignal objects (e.g., from a GC-MS chromatogram) supporting symbolic sinusoidal encoding, full tensor construction, and blind deconvolution using latent component analysis.
Purpose
DNAsignal_collection is designed to enable symbolic and positional encoding of multiple 1D analytical signals (e.g., ion channels from GC-MS data) using lettered segments and sinusoidal encodings. It allows combining multiple encoded signals into a single tensor for processing with machine learning methods, including dimensionality reduction and blind source separation.
Theory
Each signal is decomposed into symbolic segments based on their local morphology (encoded as letters like A, B, Y, Z, _). Each segment is represented in an embedding space of dimension d via sinusoidal encoding. The 3D tensor $v_{t,m,d}$ is composed of:
$E_{t,m,d}$: symbolic encoding across segments.
$PE_t$: positional encoding along the time/segment axis (t).
$PE_m$: positional encoding along the mass channel/ion axis (m).
Combining these yields:
$$ v_{t,m,d} = E_{t,m,d} + PE_t(t,d) + PE_m(m,d) $$
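The sum above maps directly onto NumPy broadcasting. In this sketch the shapes follow the formula, with E of shape (T, M, D), PE_t of shape (T, D), and PE_m of shape (M, D); the random E and the `positional_encoding` helper are placeholders chosen for the example.

```python
import numpy as np

def positional_encoding(n_positions, d, N=10000):
    """Sinusoidal positional encoding, shape (n_positions, d); d must be even."""
    pos = np.arange(n_positions, dtype=float)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / N ** (2 * i / d)
    pe = np.empty((n_positions, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

T, M, D = 5, 3, 8                       # segments, ion channels, embedding size
rng = np.random.default_rng(0)
E = rng.normal(size=(T, M, D))          # placeholder symbolic component
PE_t = positional_encoding(T, D)        # (T, D), varies along t only
PE_m = positional_encoding(M, D)        # (M, D), varies along m only

# v[t, m, d] = E[t, m, d] + PE_t[t, d] + PE_m[m, d]
v = E + PE_t[:, None, :] + PE_m[None, :, :]
```

The inserted singleton axes make PE_t constant across channels and PE_m constant across segments, as the formula requires.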
- `.sinencode_dna_full(scale=4)`: performs symbolic encoding at a given scale.
- `.E_symbol`: property returning symbolic component E for each letter and scale.
- `.PE_t`: positional encoding per letter along t for each scale.
- `.PE_m`: positional encoding along m (mass channels).
- `.vtm`: dictionary of $v_{t,m,d}$ matrices per letter.
- `.vtm_full`: complete tensor (sum of all letters) used for machine learning.
- `.deconvolve_latent_sources(...)`: uses PCA to decompose full tensor into component chromatograms.
- `.plot_v_symbol_components(...)`: visualizes the construction of $v_{t,m}$ for each letter.
- `.plot_vtm_full(...)`: visualizes the components of the full $v_{t,m}$ tensor.
- Parameters:
signals (list of DNAsignal) – List of DNAsignal instances (e.g., one per ion channel).
vtmscale (int) – Scale used to calculate vtm_full.
rasterscan (bool, default True) – True if the detector reads one point at a time. In practice, the 2D signal (T × m) is flattened into a single 1D time axis by appending all temporal channels one after another.
dtype (type or np.dtype, optional) – Numeric dtype used for storing encoding arrays (E_symbol, PE_t, PE_m, vtm, vtm_full). Defaults to np.float32 for reduced memory usage.
- m
Number of signals (ion channels).
- Type:
int
- d
Embedding dimension.
- Type:
int
- letters
List of symbolic segment labels used in the encodings.
- Type:
list of str
- scales
Available scales for the symbolic encoding.
- Type:
list of int
- _E_symbol
Cached symbolic encoding tensors for each letter and scale.
- Type:
dict
- _PE_t
Cached positional encoding along t for each letter and scale.
- Type:
dict
- _PE_m
Cached positional encoding along m for each scale.
- Type:
dict
- _vtm
Cached symbolic+positional tensors per letter.
- Type:
dict
- _vtm_full
Cached full encoding tensor combining all letters.
- Type:
np.ndarray
- property E_symbol
Symbolic component of the 2D encoding for each scale and letter.
For each letter, builds a tensor (T_letter, M, D) where:
- T_letter is the total number of segments of that letter across all M signals,
- D is the embedding dimension,
- M is the number of ion channels.
- Returns:
Mapping {scale: {letter: ndarray of shape (T_letter, m, d)}}
- Return type:
dict of dict
- Type:
E_symbol(t, m)
- property PE_m
Positional encoding along m (IC index axis) per scale.
- Returns:
Mapping: {scale: array of shape (m, d)}
- Return type:
dict
- Type:
PE_m(m)
- property PE_t
Positional encoding along t (segment axis) per scale and letter.
For each scale and letter, this provides a matrix of shape (n_segments, d), where n_segments is the total number of segments of that letter across the m signals, and d is the embedding dimension.
- Returns:
Mapping {scale: {letter: ndarray of shape (n_segments, d)}}
- Return type:
dict of dict
- Type:
PE_t(t)
- combine_embeddings(selected_letters=None)
Combine unwrapped embeddings across all signals for each scale.
- Parameters:
selected_letters (list of str, optional) – If provided, restrict to these letters.
- Returns:
Dictionary {scale: {letter: (m, d)}} for each selected scale and letter.
- Return type:
dict
- deconvolve_latent_sources(n_components=64, inertia_loss_threshold=0.25, plot=True, nmax_plot=8)
Perform dimensionality reduction on the 3D tensor v_{t,m,d} to extract non-coeluted compound chromatograms using PCA, with optional plotting.
- Parameters:
n_components (int, optional) – Maximum number of latent components (e.g., pure compounds) to extract. Default is 64.
inertia_loss_threshold (float, optional) – The maximum allowed proportion of total variance to lose in the projection. Default is 0.25 (i.e., at least 75% of the variance should be preserved).
plot (bool, optional) – Whether to display diagnostic plots.
nmax_plot (int, optional) – Maximum number of components to visualize in plots.
- Returns:
components (np.ndarray) – Component matrix of shape (n_selected_components, D), representing the spectral basis vectors (latent features).
chromatograms (np.ndarray) – Projected chromatograms for each component, shape (T, M, n_selected_components).
explained_variance_ratio (np.ndarray) – Variance explained by each selected component.
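The role of inertia_loss_threshold can be illustrated with plain NumPy. The sketch below is not the library's implementation: `select_components` is a hypothetical helper, and it assumes the (T, M, D) tensor is flattened to (T*M, D) before PCA. It keeps the fewest components whose cumulative explained variance reaches 1 - inertia_loss_threshold.

```python
import numpy as np

def select_components(X, n_components=64, inertia_loss_threshold=0.25):
    """Hypothetical helper: PCA via SVD, keeping the fewest components whose
    cumulative explained variance reaches 1 - inertia_loss_threshold."""
    Xc = X - X.mean(axis=0)                        # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)                    # explained variance ratio
    target = 1.0 - inertia_loss_threshold
    k = int(np.searchsorted(np.cumsum(ratio), target)) + 1
    k = min(k, n_components, len(ratio))           # respect the upper bound
    return Vt[:k], Xc @ Vt[:k].T, ratio[:k]

# Toy tensor of shape (T, M, D) built from two latent sources plus noise
rng = np.random.default_rng(0)
T, M, D = 50, 4, 16
v = rng.normal(size=(T, M, 2)) @ rng.normal(size=(2, D))
v += 0.01 * rng.normal(size=v.shape)

components, scores, evr = select_components(v.reshape(T * M, D))
chromatograms = scores.reshape(T, M, -1)           # (T, M, n_selected_components)
```

The returned shapes mirror the ones documented above: components is (n_selected_components, D) and chromatograms is (T, M, n_selected_components).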
- property m
Return the number of signals in the collection.
- plot(letters=None, scales=None, figsize=(18, 10), max_legend=25)
Plot the encoded signals in subplots. Rows represent letters, columns represent scales. Each subplot contains overlaid colored curves from all signals.
- Parameters:
letters (list or None) – Letters to be plotted. If None, all available letters are plotted.
scales (list or None) – Scales to be plotted. If None, all available scales are plotted.
figsize (tuple) – Size of the full figure.
max_legend (int) – Maximum number of signals to label in the legend.
- Returns:
The figure containing the plots.
- Return type:
matplotlib.figure.Figure
- plot_embedding_projection(letters=None, scales=None, method='pca', max_points=25, figsize=(14, 10))
Plot embedding projections of the encoded signals using PCA (default) or other DR methods.
- Parameters:
letters (list or None) – Letters to include in the projection. If None, all available letters are used.
scales (list or None) – Scales to include in the projection. If None, all available scales are used.
method (str) – Dimensionality reduction method (only ‘pca’ is supported for now).
max_points (int) – Maximum number of signal points to label explicitly.
figsize (tuple) – Size of the figure.
- Returns:
fig
- Return type:
matplotlib.figure.Figure
- plot_letters(scale=None, figsize=(12, 6), cmap='viridis')
Plot a heatmap of the letter codes (symbolic DNA) across all signals.
- Parameters:
scale (int, optional) – Scale to use (default: self.vtmscale).
figsize (tuple) – Size of the figure.
cmap (str) – Matplotlib colormap name (default: “viridis”).
- Return type:
matplotlib.figure.Figure
- plot_v_symbol_components(scale=None, dims='all')
Plot the components E_symbol, PE_t, PE_m and their sum v_{t,m} as image matrices for each letter at a given scale.
- Parameters:
scale (int, optional) – The scale to use. Defaults to the first available scale.
dims ("all", list or slice) – Which dimensions to include in the sum over d. Default is all.
- Return type:
matplotlib.figure.Figure
- plot_vtm_full(scale=None, dims='all')
Visualize the components and sum of the full encoded GC-MS signal at a given scale.
- Parameters:
scale (int, optional) – Scale to visualize. Defaults to the first scale.
dims ("all", list or slice) – Dimensions of the embedding d to include (default: all).
- Return type:
matplotlib.figure.Figure
- reduce_dimensions(method='pca', selected_letters=None, n_components=2)
Apply dimensionality reduction (PCA or UMAP) across signals for each scale.
- Parameters:
method (str) – “pca” or “umap”.
selected_letters (list of str, optional) – Restrict to a subset of letters.
n_components (int) – Number of projection dimensions.
- Returns:
Dictionary {scale: ndarray} with shape (m, n_components), one per scale.
- Return type:
dict
- scale_alignment(method='zscore')
Normalize embeddings across all signals and all scales.
- Parameters:
method (str) – One of {“zscore”, “minmax”}.
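The two normalization modes can be sketched as follows. This is an illustration only: `normalize` is a hypothetical function, and it assumes that normalization is applied per embedding dimension (column-wise) across signals, which the documentation does not state explicitly.

```python
import numpy as np

def normalize(X, method="zscore", eps=1e-12):
    """Hypothetical column-wise normalization of an (n_signals, d) array."""
    X = np.asarray(X, dtype=float)
    if method == "zscore":
        # zero mean, unit standard deviation per column
        return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)
    if method == "minmax":
        # rescale each column to [0, 1]
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo + eps)
    raise ValueError(f"unknown method: {method!r}")

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = normalize(X, "zscore")
R = normalize(X, "minmax")
```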
- sinencode_dna_full(d_model=128, N=10000, operation='sum')
🌀 Encode all DNAsignal instances using full-resolution sinusoidal embeddings, grouped by letter and organized per scale.
- Parameters:
d_model (int, optional) – Dimensionality of the sinusoidal embedding (default is 128).
N (int, optional) – Maximum number of positions for the encoding (default is 10000).
operation (str or None) – If “sum”, sum all position encodings per letter. If “mean”, average encodings. If None, retain full (n_occurrences × d_model) arrays.
- to_dataframe(selected_letters=None)
Export combined embeddings for all scales as a tidy pandas DataFrame, suitable for machine learning tasks.
- Parameters:
selected_letters (list of str, optional) – Subset of letters to include. If None, include all letters.
- Returns:
A long-form DataFrame with columns: [‘signal_index’, ‘scale’, ‘letter’, ‘dim_0’, …, ‘dim_{d-1}’]
- Return type:
pd.DataFrame
- property vtm
Compute the full encoded matrix v_{t,m} for each letter at each scale.
This property combines three orthogonal components:
E_symbol(t, m): the original per-segment encoding for each letter
PE_t(t): a sinusoidal encoding applied along the segment (time) axis
PE_m(m): a sinusoidal encoding applied along the signal (IC) axis
- The resulting tensor for each scale and letter is of shape (n_segments, m, d), where:
n_segments: number of segments (time positions) per letter
m: number of DNAsignal instances in the collection
d: dimensionality of the encoding space (d_model)
- Returns:
A nested dictionary {scale: {letter: array of shape (n_segments, m, d)}}.
- Return type:
dict
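The shape bookkeeping behind this combination can be checked with NumPy broadcasting. The sketch below uses random placeholder arrays rather than real encodings, and it assumes the three components are simply added, as the description of their sum suggests.

```python
import numpy as np

n_segments, m, d = 5, 3, 8
rng = np.random.default_rng(0)

# Placeholder arrays standing in for the three components (assumed shapes)
E_symbol = rng.normal(size=(n_segments, m, d))   # per-segment, per-signal encoding
PE_t = rng.normal(size=(n_segments, d))          # positional encoding along t
PE_m = rng.normal(size=(m, d))                   # positional encoding along m

# Broadcast PE_t over the signal axis and PE_m over the segment axis
v = E_symbol + PE_t[:, None, :] + PE_m[None, :, :]
print(v.shape)  # (5, 3, 8)
```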
- property vtm_full
- Compute and store the full encoded tensor for the GC-MS signal:
- If self.rasterscan is False:
shape = (T, m, d)
- If self.rasterscan is True:
shape = (T * m, d)
This combines:
Symbolic embedding per character (one-hot or learned)
Positional encoding along the time axis
PE_m (mass/IC identity), used only if rasterscan is False
- Returns:
Encoded array (T, m, d) or (T*m, d)
- Return type:
np.ndarray
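The difference between the two output shapes is a reshape. The sketch below assumes that raster scanning appends the channels one after another (channel-major order), as the rasterscan parameter description suggests; the actual ordering used by the library is not confirmed here.

```python
import numpy as np

T, m, d = 4, 3, 2
v = np.arange(T * m * d, dtype=np.float32).reshape(T, m, d)   # (T, m, d)

# Assumed channel-major flattening: all time points of channel 0,
# then all time points of channel 1, and so on.
raster = v.transpose(1, 0, 2).reshape(T * m, d)               # (T*m, d)
print(raster.shape)  # (12, 2)
```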
- class sig2dna_core.signomics.DNAstr(content, dx=1.0, iloc=0, xloc=None, x_label='index', x_unit='-', engine='difflib', engineOpts=None)
Bases:
str
A symbolic DNA-like sequence class supporting alignment, entropy analysis, edit-distance metrics, and signal reconstruction from symbolic codes.
Extended from str, it is designed for symbolic transformations of signals (e.g., wavelet-encoded GC-MS peaks or time series).
Main Features
Supports symbolic operations for pattern recognition, entropy, alignment.
Encodes x-resolution (dx), original index (iloc), and physical x-range (xloc).
Aligns sequences with visual inspection and rich diffs.
Converts symbolic strings into synthetic numerical signals.
Operators
+ : concatenate two DNAstr objects
- : symbolic difference after alignment (mismatches only)
== : equality comparison (exact content and dx)
Key Methods
__init__ / __new__ : Constructor with metadata (dx, iloc, xloc)
align(other) : Align this DNAstr to another, update mask and aligned views
wrapped_alignment() : Pretty terminal view of the alignment with colors and symbols
html_alignment() : Rich HTML display of the alignment (Jupyter)
plot_alignment() : Visualize waveform alignment with symbolic signals
plot_mask() : Color block plot showing matches/mismatches/gaps
find(pattern, regex=False) : Search for symbolic patterns with fuzziness or regex
to_signal() : Convert symbolic code into synthetic signal (NumPy)
vectorized() : Convert string to integer codes
summary() : Print entropy and character frequencies
mutation_counts : Property: {‘matches’, ‘mismatches’, ‘indels’}
entropy : Property: Shannon entropy
mutual_entropy(other) : Mutual entropy of two sequences
excess_entropy(other) : Excess entropy H1 + H2 - 2 * H12
jensen_shannon(other) : Jensen-Shannon divergence
jaccard(other) : Jaccard distance
alignment_stats : Property: Match, substitution, gap counts
score(normalized=True) : Alignment score (fraction of matches)
has(other: str) : Check if a pattern or substring exists
- dx
Average resolution along the x-axis.
- Type:
float
- iloc
Positional index or index range in the source DNA string.
- Type:
int or tuple of int
- xloc
Corresponding x-value(s) for the symbolic sequence.
- Type:
float or tuple of float
- aligned_with
Aligned form of self with insertions (spaces) where needed.
- Type:
str or None
- other_copy
Aligned form of the reference sequence.
- Type:
str or None
- ref_hash
SHA256 hash of the aligned reference sequence.
- Type:
str or None
- mask
Alignment mask: ‘=’ for matches, ‘*’ for substitutions, ‘ ‘ for gaps.
- Type:
str or None
- engine
Alignment engine: ‘difflib’ or ‘bio’.
- Type:
str
- engineOpts
Options passed to the alignment engine.
- Type:
dict
Examples
>>> s1 = DNAstr("YYAAZZBB", dx=0.5)
>>> s2 = DNAstr("YAABZBB", dx=0.5)
>>> s1.align(s2)
>>> print(s1.wrapped_alignment(40))
>>> s1.plot_alignment()
>>> segments = s1.find("YAZB")
>>> segments[0].to_signal().plot()
- align(other, engine=None, engineOpts=None, forced=False)
Align this DNAstr sequence to another, allowing insertions/deletions to maximize matches.
- Parameters:
other (DNAstr) – Another DNAstr object to align with.
engine ({'difflib', 'bio'} or None) –
- Alignment engine to use:
’difflib’: uses difflib.SequenceMatcher (fast, approximate).
’bio’ : uses Bio.Align.PairwiseAligner (biologically inspired global alignment).
If None, defaults to self.engine.
engineOpts (dict, optional) – Dictionary of alignment parameters for the selected engine.
forced (bool) – If True, allow alignment even if dx values differ. If False (default), a mismatch in dx will raise an error to prevent incorrect alignment of signals with different sampling.
- Returns:
aligned_self (str) – Aligned version of this sequence (with gaps inserted where needed).
aligned_other (str) – Aligned version of the other sequence.
Notes
The alignment is symmetric and permanent: both sequences are aligned with gaps introduced (spaces) to preserve positional correspondence. A hash of the aligned other sequence is stored to detect redundant alignments.
- A match mask (self.mask) is generated with:
‘=’ for exact matches, ‘*’ for mismatches (substitutions), ‘ ‘ for insertions/deletions (gaps).
- The method updates:
self.aligned_with
self.other_copy
self.mask
self.ref_hash
Example:
>>> S1 = DNAstr("AABBCC")
>>> S2 = DNAstr("AACBCC")
>>> S1.align(S2, "difflib")
>>> print(S1.mask)
>>> print(S1.wrapped_alignment())
==*===
AACBCC
|| |||
AABBCC

>>> S1 = DNAstr("AABBCC")
>>> S2 = DNAstr("AACBCC")
>>> S1.align(S2, "bio")
>>> print(S1.mask)
>>> print(S1.wrapped_alignment())
== ==
AAB·CC
|| ||
AA·BCC

>>> S1 = DNAstr("AABBCCXYZZZ")
>>> S2 = DNAstr("AACBCCZZXXX")
>>> S1.align(S2, "bio")
>>> print(S1.mask)
>>> print(S1.wrapped_alignment())
== * ==
AABCC··ZZ
|| ||
AA·B·CCZZ
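The mask convention (‘=’ for matches, ‘*’ for substitutions, ‘ ‘ for gaps) can be reproduced with the standard library alone. The sketch below is not the DNAstr implementation: `align_with_mask` is a hypothetical helper built on difflib.SequenceMatcher opcodes, and it pads gaps with spaces.

```python
from difflib import SequenceMatcher

def align_with_mask(a, b, gap=" "):
    """Hypothetical helper: gap-padded copies of a and b plus a match mask."""
    out_a, out_b, mask = [], [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        sa, sb = a[i1:i2], b[j1:j2]
        if tag == "equal":
            out_a.append(sa)
            out_b.append(sb)
            mask.append("=" * len(sa))
        elif tag == "replace":
            n, k = max(len(sa), len(sb)), min(len(sa), len(sb))
            out_a.append(sa.ljust(n, gap))
            out_b.append(sb.ljust(n, gap))
            mask.append("*" * k + " " * (n - k))   # substitutions, then gaps
        elif tag == "delete":                      # present in a only
            out_a.append(sa)
            out_b.append(gap * len(sa))
            mask.append(" " * len(sa))
        else:                                      # "insert": present in b only
            out_a.append(gap * len(sb))
            out_b.append(sb)
            mask.append(" " * len(sb))
    return "".join(out_a), "".join(out_b), "".join(mask)

aligned_a, aligned_b, mask = align_with_mask("AABBCC", "AACBCC")
print(mask)  # ==*===
```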
- property aligned_code
Return the aligned code.
- property alignment_stats
Return the DNAstr alignment statistics.
- property entropy
Compute the Shannon entropy of the DNAstr sequence
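As a point of reference, Shannon entropy over letter frequencies can be computed in a few lines. The sketch assumes entropy is measured in bits over single-letter frequencies; `shannon_entropy` is a hypothetical stand-in, not the property's actual code.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Entropy (in bits) of the letter frequency distribution of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

print(shannon_entropy("AABB"))  # 1.0
```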
- excess_entropy(other)
Compute the excess Shannon entropy of two DNAstr sequences H(A)+H(B)-2*H(AB)
- extract_motifs(pattern='YAZB', minlen=4, plot=True)
Extract and analyze YAZB motifs (canonical and distorted) from the symbolic sequence.
- Parameters:
pattern (str) – Canonical motif pattern (default is ‘YAZB’).
minlen (int) – Minimum motif length to be considered valid.
plot (bool) – If True, generate a motif density plot using xloc or sequence index.
- Returns:
Table of detected motifs with start/end positions, length, and classification.
- Return type:
pd.DataFrame
- find(pattern, regex=False)
Finds all fuzzy (or regex-based) occurrences of a DNA-like sequence pattern.
- Parameters:
pattern (str) – The symbolic sequence to search for (e.g., “YAZB”).
regex (bool, optional) – If False (default), interprets pattern as symbolic and inserts ‘.’ between characters. If True, uses the raw pattern as a regular expression.
- Returns:
- A list of DNAstr slices with attributes:
iloc: (start_idx, end_idx)
xloc: (x_start, x_end)
width: segment width
- Return type:
list of DNAstr
- html_alignment()
Render the alignment using HTML with color coding:
green: match
blue: gap
red: substitution
- Returns:
Displays HTML directly in Jupyter/Notebook environments.
- Return type:
None
- jaccard(other)
Compute the Jaccard distance between two DNAstr sequences.
- Parameters:
other (DNAstr) – The other DNAstr sequence to compare with.
- Returns:
Jaccard distance: 1 - (intersection / union) of unique letters.
- Return type:
float
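The formula in the Returns field translates directly into set operations. `jaccard_distance` below is a hypothetical standalone function operating on plain strings.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of unique letters of a and b."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:            # both sequences empty: treat as identical
        return 0.0
    return 1.0 - len(sa & sb) / len(union)

print(jaccard_distance("ABC", "BCD"))  # 0.5
```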
- jensen_shannon(other, base=2)
Compute the Jensen-Shannon distance between self and another DNAstr.
- Parameters:
other (DNAstr) – Another DNAstr instance.
base (float, optional) – Base for the logarithm (default: 2)
- Returns:
Jensen-Shannon distance.
- Return type:
float
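A minimal sketch of a Jensen-Shannon distance between two symbolic strings follows. It assumes the comparison is made on single-letter frequency distributions and that the distance is the square root of the divergence; both are assumptions, and `jensen_shannon_distance` is a hypothetical name.

```python
from collections import Counter
import numpy as np

def letter_distribution(seq, alphabet):
    """Relative frequency of each alphabet letter in seq (seq must be non-empty)."""
    counts = Counter(seq)
    p = np.array([counts.get(ch, 0) for ch in alphabet], dtype=float)
    return p / p.sum()

def jensen_shannon_distance(a, b, base=2):
    alphabet = sorted(set(a) | set(b))
    p = letter_distribution(a, alphabet)
    q = letter_distribution(b, alphabet)
    mix = 0.5 * (p + q)

    def kl(x, y):
        nz = x > 0                                   # 0 * log(0) is taken as 0
        return np.sum(x[nz] * np.log(x[nz] / y[nz])) / np.log(base)

    divergence = 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
    return float(np.sqrt(max(divergence, 0.0)))      # guard against round-off

print(jensen_shannon_distance("AA", "BB"))  # 1.0
```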
- levenshtein(other, use_alignment=True, engine=None, engineOpts=None, forced=False)
Compute the Levenshtein distance between this DNAstr and another one.
- Parameters:
other (DNAstr) – Another DNAstr object to compare against.
use_alignment (bool, default=True) – If True, uses the aligned sequences (computed if necessary). If False, compares the raw sequences directly.
engine ({'difflib', 'bio'}, optional) – Alignment engine to use if alignment is needed.
engineOpts (dict, optional) – Parameters for the selected alignment engine.
forced (bool, default=False) – Force alignment even if dx values differ.
- Returns:
dist – Levenshtein distance between the two sequences (aligned or raw).
- Return type:
int
Examples
>>> A = DNAstr("YAZBZAY")
>>> B = DNAstr("YAZBZZY")
>>> A.levenshtein(B, use_alignment=False) # raw
>>> A.levenshtein(B, use_alignment=True, engine="bio") # aligned
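For the raw (unaligned) case, the Levenshtein distance is the classic dynamic-programming edit distance. The function below is a generic textbook version on plain strings, given for orientation; it is not taken from the module.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("YAZBZAY", "YAZBZZY"))  # 1
```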
- property mutation_counts
Counts of insertions, deletions/substitutions, and matches.
- mutual_entropy(other=None)
Compute the Shannon mutual entropy of two DNAstr sequences from their aligned segments
- plot_alignment(dx=1.0, dy=1.0, width=20, normalize=True)
Plot a block alignment view of two DNAstr sequences with color-coded segments.
- Parameters:
dx (float) – Horizontal step between segments (defaults to 1.0).
dy (float) – Vertical height increment for symbolic waveform visualization.
width (int) – Number of characters per row (line wrapping).
- Returns:
matplotlib.figure.Figure
matplotlib.axes.Axes
- plot_mask()
Plot a color-coded mask of the alignment between sequences.
- Returns:
Matplotlib figure of the alignment mask.
- Return type:
matplotlib.figure.Figure
- score(normalized=True)
Return an alignment score, optionally normalized.
- Parameters:
normalized (bool) – If True (default), return score as a fraction of total aligned positions.
- Returns:
Alignment score.
- Return type:
float
- summary()
Summarize the DNAstr with key stats: length, unique letters, entropy, etc.
- Returns:
Dictionary containing length, letter frequency, Shannon entropy, and dx.
- Return type:
dict
- to_signal()
Converts the symbolic DNA sequence into a synthetic NumPy array mimicking the original wavelet-transformed signal.
- Rules per letter:
‘A’: Crosses zero upward → linear from -1 to +1, zero in the middle
‘Z’: Crosses zero downward → linear from +1 to -1, zero in the middle
‘B’: Increasing negative → from -1 to 0
‘Y’: Decreasing negative → from 0 to -1
‘C’: Increasing positive → from 0 to +1
‘X’: Decreasing positive → from +1 to 0
‘_’: Flat at 0
- Returns:
Synthetic signal array matching the symbolic encoding.
- Return type:
numpy.ndarray
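The per-letter rules can be turned into a small lookup table of linear ramps. The sketch below is a simplification under stated assumptions: every letter is expanded to the same number of points (`points_per_letter` is a hypothetical parameter), and segment widths and heights are ignored.

```python
import numpy as np

# (start, end) amplitude of the linear ramp associated with each letter
RAMPS = {
    "A": (-1.0, 1.0), "Z": (1.0, -1.0),
    "B": (-1.0, 0.0), "Y": (0.0, -1.0),
    "C": (0.0, 1.0),  "X": (1.0, 0.0),
    "_": (0.0, 0.0),
}

def letters_to_signal(code, points_per_letter=11):
    """Concatenate one linear ramp per letter (fixed width, unit height)."""
    pieces = [np.linspace(*RAMPS[ch], points_per_letter) for ch in code]
    return np.concatenate(pieces) if pieces else np.empty(0)

y = letters_to_signal("YAZB")   # trough, rise through zero, fall through zero, recovery
print(y.shape)  # (44,)
```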
- vectorized(codebook={'A': 1, 'B': 2, 'C': 3, 'X': 4, 'Y': 5, 'Z': 6, '_': 0})
Map the DNAstr content to an integer array using a codebook.
- Parameters:
codebook (dict, optional) – Dictionary mapping characters to integer values. Default: {"A": 1, "B": 2, "C": 3, "X": 4, "Y": 5, "Z": 6, "_": 0}. If None, a codebook is generated from the symbols present in the string.
- Returns:
Vectorized integer representation of the string.
- Return type:
np.ndarray
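The mapping itself is a dictionary lookup per character. In the sketch below, `vectorize` is a hypothetical standalone function; the numbering used when codebook is None (sorted symbols, starting at 0) is an assumption, since the documentation does not specify it.

```python
import numpy as np

DEFAULT_CODEBOOK = {"A": 1, "B": 2, "C": 3, "X": 4, "Y": 5, "Z": 6, "_": 0}

def vectorize(code, codebook=DEFAULT_CODEBOOK):
    """Map each character of code to an integer using codebook."""
    if codebook is None:
        # Assumed behavior: number the symbols actually present, in sorted order
        codebook = {ch: i for i, ch in enumerate(sorted(set(code)))}
    return np.array([codebook[ch] for ch in code], dtype=int)

print(vectorize("YAZB_"))  # [5 1 6 2 0]
```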
- wrapped_alignment(width=80, colors=True)
Return a line-wrapped alignment view (multi-line), optionally color-coded for terminal/IPython usage (Spyder, Jupyter).
- Parameters:
width (int) – Number of characters per line in wrapped display.
colors (bool) – If True, use ANSI codes to highlight differences. May be overridden if terminal does not support ANSI (e.g., Spyder).
- Returns:
Wrapped, optionally colorized alignment.
- Return type:
str
- class sig2dna_core.signomics.SinusoidalEncoder(d_model=96, N=10000, dtype=<class 'numpy.float32'>)
Bases:
object
🌀 Generic sinusoidal encoder/decoder supporting symbolic and numeric sequences.
Each scalar value is transformed into a vector of dimension d_model, where alternating components contain sinusoidal features of increasing frequency. The mapping is based on:
- For k = 0 to d_model/2 - 1:
f_{2k}(x) = sin(x / r_k)
f_{2k+1}(x) = cos(x / r_k)
where r_k = N^(2k / d_model)
This representation preserves relative positions and scaling in a smooth, topologically faithful embedding space. The class supports multiple decoding strategies, scaling logic, residual control, and round-trip verification.
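The mapping above can be written directly in NumPy. The sketch below is independent of the class: `sin_encode` and `naive_decode` are hypothetical names, and the naive decoder only reads the first (sin, cos) pair, for which r_0 = 1, so it is unambiguous only for |x| < π.

```python
import numpy as np

def sin_encode(values, d_model=8, N=100):
    """Interleaved encoding: even columns sin(x / r_k), odd columns cos(x / r_k)."""
    x = np.asarray(values, dtype=np.float64)[:, None]    # (n, 1)
    k = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    r = N ** (2 * k / d_model)                           # r_k = N^(2k / d_model)
    emb = np.empty((x.shape[0], d_model))
    emb[:, 0::2] = np.sin(x / r)
    emb[:, 1::2] = np.cos(x / r)
    return emb

def naive_decode(emb):
    """Recover x from the first pair only (valid for |x| < pi since r_0 = 1)."""
    return np.arctan2(emb[:, 0], emb[:, 1])

emb = sin_encode([0, 1, 2, 3])
decoded = naive_decode(emb)
```

Larger inputs wrap around in the first pair, which is one reason to rescale values or to use all frequency pairs when decoding.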
- Parameters:
d_model (int) – Dimensionality of each sinusoidal encoding (must be even).
N (int) – Frequency base for the positional encoding.
dtype (np.dtype) – Output data type (default: np.float32).
- d_model
Embedding dimensionality.
- Type:
int
- N
Frequency base.
- Type:
int
- dtype
Data type for encoded output.
- Type:
np.dtype
- _last_input_type
Last input type passed to encode.
- Type:
type
- _last_input_length
Last input length passed to encode.
- Type:
int
- _scale
Scaling factor applied to normalize input values.
- Type:
float or None
- _auto_scale_enabled
Whether autoscaling is enabled.
- Type:
bool
- _decode_residual_tolerance
Tolerance for residual error checking in decode verification.
- Type:
float
- encode(values, scale=None)
Encodes a sequence of values (scalar or symbolic) into sinusoidal embeddings.
- decode(embedding, method='least_squares', return_error=False)
Decodes the embedding to the original values using the selected inverse method.
- fit_encoder(values, target_range=10.0)
Automatically estimates and stores a scaling factor to normalize input values.
- set_decode_tolerance(tol)
Sets the residual error threshold above which decoding results will raise a warning.
- verify_roundtrip(values, method='least_squares', scale='auto', verbose=True, return_details=False)
Checks round-trip accuracy of encoding and decoding. Warns if residuals exceed tolerance.
- sinencode_dna_grouped(code, d_part, N)
Encodes a codes entry (triplet-based segments grouped by letter).
- sindecode_dna_grouped(grouped, reference_code, d_part, N)
Reconstructs code dictionary from grouped sinusoidal embeddings.
- sinencode_dnafull_grouped(dnafull, d_model, N)
Encodes codesfull entries (strings) into grouped embeddings.
- sindecode_dnafull_grouped(grouped)
Decodes grouped full embeddings back to DNAstr.
Static Methods
- to_complex(emb)
Convert a sinusoidal embedding (sin, cos) into complex numbers using Euler’s identity.
- complex_distance(emb1, emb2, norm='L2')
Compute pointwise distances between two embeddings in the complex sinusoidal space.
- angle_difference(emb)
Compute angular differences Δθ between consecutive elements of a sinusoidal embedding.
- phase_alignment(emb, ref)
Align the phase of an embedding emb to a reference embedding ref using complex phase factors.
- pairwise_similarity(emb, metric='cosine')
Compute a pairwise similarity or distance matrix (‘cosine’ or ‘L2’) between all elements.
- group_centroid(emb, labels=None, return_std=False)
Compute the centroid (and optionally standard deviation) of groups in complex embedding space.
- phase_unwrap(emb, normalize=False)
Perform phase unwrapping (à la Fourier) on the sinusoidal embedding, optionally normalized to [0, 1].
Example (without scaling)
>>> s = SinusoidalEncoder(8, 100) # poor encoder (8 dimensions, high N)
>>> a = s.encode([0, 1, 1, 2, 2, 3, 4, 5, 6, 6])
>>> s.decode(a)
Output:
[0.0, 1.0000000072927564, 1.0000000072927564, 1.9999999989612762, 1.9999999989612762, 2.9999999881593866, 3.9999997833243714, 5.000000005427378, 5.999999552487366, 5.999999552487366]
- Notes:
Use lower N (e.g., N = 1000) to compress phase variation and allow larger input range.
Use scaling (via fit_encoder() or scale=) for large or high-resolution inputs.
Example (with scaling)
>>> s2 = SinusoidalEncoder(128, 10000)
>>> a2 = s2.encode([0, 1, 1, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 16, 99, 130], scale=100)
>>> s2.decode(a2)
Output:
[0.0, 1.0000000127956146, 1.0000000127956146, 1.999999922004248, 1.999999922004248, 2.999999960015871, 3.9999999840341935, 5.000000076901334, 5.999999935247362, 5.999999935247362, 6.999999909109446, 6.999999909109446, 6.999999909109446, 8.000000084954152, 8.000000084954152, 8.000000084954152, 8.000000084954152, 16.000000431016925, 99.00000148970207, 129.99999823222058]
Advanced Example
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> # 1. Construct a test input signal with smooth and jump segments
>>> x_smooth = np.linspace(0, 20, 100)
>>> x_jumps = np.array([25, 25, 26, 27, 100, 101, 130])
>>> x = np.concatenate([x_smooth, x_jumps])
>>> # 2. Initialize encoder with high d_model and N
>>> s = SinusoidalEncoder(d_model=128, N=10000)
>>> # 3. Fit auto-scaling to compress input into sinusoidal-friendly space
>>> s.fit_encoder(x, target_range=10)
>>> # 4. Encode and decode using all robust methods
>>> a = s.encode(x)
>>> decoded_lsq, err_lsq = s.decode(a, method='least_squares', return_error=True)
>>> decoded_svd, err_svd = s.decode(a, method='svd', return_error=True)
>>> # 5. Compare errors
>>> true = x
>>> lsq_error = np.abs(decoded_lsq - true)
>>> svd_error = np.abs(decoded_svd - true)
>>> # 6. Plot results
>>> fig, axs = plt.subplots(2, 2, figsize=(12, 8))
>>> axs[0, 0].plot(true, label="Original")
>>> axs[0, 0].plot(decoded_lsq, '--', label="Decoded (LSQ)")
>>> axs[0, 0].plot(decoded_svd, ':', label="Decoded (SVD)")
>>> axs[0, 0].set_title("Decoded vs Original")
>>> axs[0, 0].legend()
>>> axs[0, 1].plot(lsq_error, label="Abs Error (LSQ)")
>>> axs[0, 1].plot(svd_error, label="Abs Error (SVD)")
>>> axs[0, 1].set_yscale('log')
>>> axs[0, 1].set_title("Absolute Decoding Error (log scale)")
>>> axs[0, 1].legend()
>>> axs[1, 0].plot(err_lsq, label="Residual Norm (LSQ)")
>>> axs[1, 0].plot(err_svd, label="Residual Norm (SVD)")
>>> axs[1, 0].set_yscale('log')
>>> axs[1, 0].set_title("Reconstruction Residuals")
>>> axs[1, 0].legend()
>>> axs[1, 1].hist(lsq_error, bins=50, alpha=0.7, label="LSQ")
>>> axs[1, 1].hist(svd_error, bins=50, alpha=0.5, label="SVD")
>>> axs[1, 1].set_title("Histogram of Absolute Errors")
>>> axs[1, 1].legend()
>>> plt.suptitle("🌀 SinusoidalEncoder: Accuracy Evaluation", fontsize=14)
>>> plt.tight_layout()
>>> plt.show()
References (for the encoding)
Vaswani et al. (2017), "Attention Is All You Need"
https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)
- static angle_difference(emb)
Compute angular differences (∆θ) between consecutive embeddings.
- Parameters:
emb (np.ndarray) – Encoded array of shape (n, 2d)
- Returns:
Array of shape (n-1, d) of angular differences in radians
- Return type:
np.ndarray
- static complex_distance(emb1, emb2, norm='L2')
Compute distance between two encoded arrays using complex projection.
- Parameters:
emb1 (np.ndarray) – Encoded arrays of shape (…, 2d) to compare.
emb2 (np.ndarray) – Encoded arrays of shape (…, 2d) to compare.
norm (str) – ‘L2’ for Euclidean norm, ‘cos’ for cosine angle distance.
- Returns:
Distance values per sample (1D array)
- Return type:
np.ndarray
- decode(embedding, method='least_squares', return_error=False)
Decode sinusoidal embeddings back to original values using selected method.
- Parameters:
embedding (np.ndarray) – Encoded sinusoidal array of shape (n, d_model)
method (str) – Decoding strategy: ‘least_squares’ (default), ‘svd’, ‘optimize’, or ‘naive’
return_error (bool) – If True, returns (decoded_values, residual_error) as a tuple
- Returns:
decoded (list or array) – Reconstructed input values
residual (np.ndarray, optional) – Residuals of decoding (per sample), returned only if return_error=True
- encode(values, scale=None)
Encode input values into sinusoidal embeddings.
- Parameters:
values (array-like) – Values to encode.
scale (float or None) – Rescaling factor. If None, auto-scaling is applied if enabled.
- Returns:
Embedded values of shape (n, d_model)
- Return type:
np.ndarray
- fit_encoder(values, target_range=10.0)
Fit a scaling factor to normalize values into a target sinusoidal-safe range.
- Parameters:
values (array-like) – Original values to encode (will determine scale).
target_range (float) – Maximum scaled range to span (e.g. [0, 10]).
- Returns:
Recommended scale factor stored internally.
- Return type:
float
- static group_centroid(emb, labels=None, return_std=False)
Compute the centroid (average embedding) of each group in complex sinusoidal space.
- Parameters:
emb (np.ndarray) – Encoded array of shape (n, 2d), with sin and cos interleaved.
labels (list or np.ndarray, optional) – Group labels (n,). If None, the entire set is treated as one group.
return_std (bool) – Whether to also return the standard deviation per group.
- Returns:
Dictionary mapping each group label to its centroid (2d real array). If return_std=True, also includes key ‘<label>_std’ with standard deviation.
- Return type:
dict
- static pairwise_similarity(emb, metric='cosine')
Compute a pairwise similarity (or distance) matrix in sinusoidal embedding space.
- Parameters:
emb (np.ndarray) – Encoded array of shape (n, 2d).
metric (str) – Distance metric: ‘cosine’ for 1 - cosine similarity, ‘L2’ for Euclidean norm.
- Returns:
Pairwise similarity matrix of shape (n, n)
- Return type:
np.ndarray
- static phase_alignment(emb, ref)
Align embedding emb to reference ref using complex phase.
- Parameters:
emb (np.ndarray) – Encoded array to align (n, 2d)
ref (np.ndarray) – Reference encoded array (n, 2d)
- Returns:
Aligned encoding of emb, same shape as input.
- Return type:
np.ndarray
- static phase_unwrap(emb, normalize=False)
Perform Fourier-like phase unwrapping on sinusoidal embedding.
- Parameters:
emb (np.ndarray) – Encoded array of shape (n, 2d).
normalize (bool) – Whether to scale unwrapped phases to [0, 1].
- Returns:
Phase unwrapped matrix of shape (n, d), optionally normalized.
- Return type:
np.ndarray
- set_decode_tolerance(tol=0.001)
Set maximum acceptable residual error for decoding.
- Parameters:
tol (float) – Residual threshold above which a warning is triggered.
- static sindecode_dna_grouped(grouped, reference_code, d_part=32, N=10000)
Decode sinusoidal embeddings grouped by letter into symbolic segments.
- Parameters:
grouped (dict) – Dictionary of letter: embeddings
reference_code (dict) – Must include ‘dx’ (sampling resolution).
d_part (int) – Number of dimensions per part (start, width, height)
N (int) – Frequency base
- Returns:
Dictionary with keys: letters, widths, heights, xloc, iloc, dx
- Return type:
dict
- static sinencode_dna_grouped(code, d_part=32, N=10000)
Encode symbolic code segments grouped by letter into sinusoidal embeddings.
- Parameters:
code (dict) –
- Must contain:
’letters’: str
’xloc’: list of (start, end) tuples
’widths’: list of float
’heights’: list of float
d_part (int) – Number of dimensions per field (position, width, height)
N (int) – Sinusoidal frequency base
- Returns:
Dictionary of embeddings by letter: {letter: np.ndarray(n, 3*d_part)}
- Return type:
dict
- static to_complex(emb)
Convert sinusoidal embedding into a complex array using Euler’s identity.
- Parameters:
emb (np.ndarray) – Array of shape (…, 2d) where sin/cos pairs are stored.
- Returns:
Complex array of shape (…, d) with values exp(i * theta)
- Return type:
np.ndarray
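With sin stored in even columns and cos in odd columns (the interleaving described for the encoder), the complex form is cos + i·sin per pair. The sketch below assumes that layout; `to_complex_sketch` is a hypothetical standalone version, and the last line shows how angular differences between consecutive samples follow from it.

```python
import numpy as np

def to_complex_sketch(emb):
    """Map interleaved (sin, cos) pairs to exp(i*theta) = cos(theta) + i*sin(theta)."""
    emb = np.asarray(emb)
    return emb[..., 1::2] + 1j * emb[..., 0::2]

theta = np.array([[0.0, 0.5], [1.0, 2.0]])   # two samples, two angles each
emb = np.empty((2, 4))
emb[:, 0::2] = np.sin(theta)                 # even columns: sin
emb[:, 1::2] = np.cos(theta)                 # odd columns: cos

z = to_complex_sketch(emb)                   # shape (2, 2)
dtheta = np.angle(z[1:] * np.conj(z[:-1]))   # angle differences, shape (1, 2)
```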
- verify_roundtrip(values, method='least_squares', scale='auto', verbose=True, return_details=False)
Perform an encode → decode → compare roundtrip and report accuracy.
- Parameters:
values (array-like) – Original values to test.
method (str) – Decoding method: ‘least_squares’, ‘svd’, ‘optimize’, or ‘naive’.
scale (float or 'auto' or None) – Scaling strategy: ‘auto’ uses fit_encoder(), float uses fixed scaling, None disables scaling.
verbose (bool) – If True, prints accuracy report.
return_details (bool) – If True, also returns the encoded array, decoded values, and residuals.
- Returns:
success (bool) – Whether all values were accurately recovered within residual tolerance.
details (tuple, optional) – Tuple (encoded, decoded, residuals) if return_details=True
- class sig2dna_core.signomics.generator(kind='gauss')
Bases:
object
- sig2dna_core.signomics.import_local_module(name: str, relative_path: str)
Import a module by name from a file path relative to the calling module.
Usage here:
import_local_module("figprint", "figprint.py") replaces import figprint (zero-installation)
- Parameters:
name (str) – Name to assign to the module (used internally).
relative_path (str) – Path relative to the calling module’s location.
- Returns:
Imported module object.
- Return type:
module
Example:
>>> figprint = import_local_module("figprint", "figprint.py")
>>> figprint.print_pdf(...)
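Loading a module from a file path is typically done with importlib. The sketch below shows the general pattern, not the function's actual source: `import_from_path` is a hypothetical helper that takes an explicit path instead of resolving it relative to the calling module.

```python
import importlib.util
import tempfile
from pathlib import Path

def import_from_path(name, path):
    """Load the Python file at path as a module object called name."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)      # run the module body
    return module

# Demonstration with a throwaway module written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo_mod.py"
    target.write_text("VALUE = 42\n")
    demo = import_from_path("demo_mod", target)

print(demo.VALUE)  # 42
```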
- class sig2dna_core.signomics.peaks(data=None)
Bases:
object
A class for managing a collection of peak definitions used in synthetic signal generation.
Each peak is represented as a dictionary with the following fields:
‘name’ (str): unique identifier (autogenerated if not provided)
‘x’ (float): center position (e.g., time, wavenumber, index)
‘w’ (float): width (related to FWHM)
‘h’ (float): peak height
‘type’ (str): generator type (e.g., ‘gauss’, ‘lorentz’, ‘triangle’)
Supports:
Flexible addition and broadcasting of peak parameters
Named or indexed access to individual or multiple peaks
Overloaded operators for peak translation and scaling
Utility methods: update, sort, rename, remove_duplicates, copy
Conversion to signal object via .to_signal()
Informative __str__ and __repr__ output
This class is used to build reproducible and structured test cases for symbolic encoding (e.g., sig2dna).
- add(x, w=1.0, h=1.0, name=None, type='gauss')
Add one or multiple peaks to the collection.
- Parameters:
x (float or array-like) – Center positions of the peaks.
w (float or array-like) – Width(s) of the peaks (broadcastable).
h (float or array-like) – Height(s) of the peaks (broadcastable).
name (str or list of str or None) – Peak name(s); auto-generated if None or duplicate.
type (str) – Generator type, e.g., ‘gauss’, ‘lorentz’, etc.
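Broadcasting of w and h against x can be pictured with np.broadcast_arrays. The sketch below is illustrative only: `expand_peaks` is a hypothetical function, and the automatic naming scheme (‘P0’, ‘P1’, …) is an assumption.

```python
import numpy as np

def expand_peaks(x, w=1.0, h=1.0, kind="gauss"):
    """Build one peak dictionary per center, broadcasting scalar w and h."""
    xs, ws, hs = np.broadcast_arrays(np.atleast_1d(x), w, h)
    return [
        {"name": f"P{i}", "x": float(a), "w": float(b), "h": float(c), "type": kind}
        for i, (a, b, c) in enumerate(zip(xs, ws, hs))
    ]

plist = expand_peaks([100, 300, 500], w=2.0, h=[1.0, 0.5, 0.25])
print(len(plist))  # 3
```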
- as_dict()
Return the list of peaks as dict
- copy()
Return a deep-copy of the peaks
- names()
Return the list of names
- remove_duplicates()
- rename(prefix='P')
- sort(order='asc')
Sort peaks in-place based on their center positions (x values).
- Parameters:
order (str) – Sorting direction: "asc" for ascending (default) or "desc" for descending.
- to_signal(index=None, name=None, generator_map=None, x=None, x0=0.0, n=1000)
Generate a signal from a peaks object. Optionally restrict to a subset.
- update(data)
Update or insert peaks from a list of dictionaries.
- Parameters:
data (list of dict) – Each dict must include at least ‘x’, ‘w’, ‘h’. If ‘name’ matches an existing peak, it will be updated. If ‘name’ is new or missing, the peak is appended.
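The update-or-append rule described above can be sketched on a plain list of dictionaries. update_peaks is a hypothetical helper; the default type and the naming of unnamed peaks are assumptions made for the illustration.

```python
def update_peaks(existing, data, prefix="P"):
    """Hypothetical helper: update peaks whose name matches, append the others."""
    by_name = {p["name"]: p for p in existing}
    for entry in data:
        missing = {"x", "w", "h"} - entry.keys()
        if missing:
            raise ValueError(f"peak definition lacks {sorted(missing)}")
        name = entry.get("name")
        if name is not None and name in by_name:
            by_name[name].update(entry)  # known name: overwrite its fields
            continue
        new_peak = dict(entry)
        if name is None:
            # Assumed naming scheme: first free "<prefix><index>"
            counter = len(existing)
            while f"{prefix}{counter}" in by_name:
                counter += 1
            new_peak["name"] = f"{prefix}{counter}"
        new_peak.setdefault("type", "gauss")  # assumed default generator
        existing.append(new_peak)
        by_name[new_peak["name"]] = new_peak
    return existing


current = [{"name": "P0", "x": 10.0, "w": 1.0, "h": 1.0, "type": "gauss"}]
update_peaks(current, [
    {"name": "P0", "x": 12.0, "w": 2.0, "h": 0.5},  # matches P0: updated
    {"x": 50.0, "w": 3.0, "h": 1.0},                # no name: appended
])
```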
- class sig2dna_core.signomics.signal(x=None, y=None, name='signal', type='generic', x_label='index', x_unit='-', y_label='intensity', y_unit='a.u.', metadata=None, source='array', user=None, date=None, host=None, cwd=None, version=None, color=None, linewidth=2, linestyle='-', message=None, fullhistory=True)
Bases:
object
signal: A self-documented 1D analytical signal container for reproducible scientific workflows.
This class is designed for lab-grade signal processing and traceable data storage. It represents a discrete 1D signal (e.g., chromatogram, spectrum, transient) with full metadata, support for symbolic transformation, numerical operations, plotting, and structured saving/loading.
Key features include:
- Portable metadata (user, time, host, cwd, version)
- Domain-aware plots and operations
- Reproducible signal serialization in JSON or compressed format
- Full traceability of all transformation events
- Optional recursive backup of prior states
- x
Sampling domain (e.g., time, wavelength, chemical shift).
- Type:
np.ndarray
- y
Signal values aligned with x.
- Type:
np.ndarray
- name
Label for plots and file storage (used as default filename).
- Type:
str
- type
Optional tag (e.g., ‘GC-MS’, ‘FTIR’, ‘NMR’, ‘synthetic’).
- Type:
str
- x_label
Label for the x-axis (e.g., ‘wavenumber’).
- Type:
str
- x_unit
Unit of the x-axis (e.g., ‘cm⁻¹’).
- Type:
str
- source
Origin label (‘array’, ‘peaks’, ‘noise’, ‘imported’…).
- Type:
str
- metadata
Includes user, date, host, cwd, version — filled automatically unless overridden.
- Type:
dict
- color (str or [rgb]), linestyle (str), linewidth (float)
Key Methods
normalize(...) → Normalize the signal to positive values
from_peaks(...) → Construct signal from a peaks object
add_noise(...) → Return noisy variant (Poisson, Gaussian, ramp or constant bias)
align_with(...) → Align this signal with another (same x domain)
copy() → Deep copy
save(...) → Save as JSON or .gz (optional CSV export)
load(...) → Load from saved file
plot(...) → Plot the signal with axis labels
backup(...) → Backup current signal (deep copy stored in _previous)
restore(...) → Restore the previous state of the signal
apply_poisson_baseline_filter(...) → Apply a Poisson-based filter
enable_fullhistory → Enable full history
disable_fullhistory → Disable full history
_toDNA(signal)
Overloaded Operators
+, -, *, / → Operate on signals or scalars, aligning them if needed
+=, -=, *=, /= → In-place functional versions (return a new signal)
Low-level Methods
_current_stamp() → Stamp for events (static method)
_copystatic() → Deep copy of the signal only (use copy for a full copy) (static method)
_events() → Register a processing step
_to_serializable → Convert the signal into a dictionary suitable for JSON export
_from_serizalizable → Convert a dict (e.g., from JSON import) to a signal
Example
>>> s = signal(x, y, name="sample", type="FTIR", x_label="wavenumber", x_unit="cm⁻¹")
>>> s.add_noise("gaussian", 0.05).plot()
>>> s.save()  # saves to ./sample.json.gz
>>> s2 = signal.load("sample.json.gz")
- add_noise(kind='gaussian', scale=1.0, bias=None)
Return a new signal with noise and/or bias added.
- align_with(other, mode='union', n=1000)
Align two signals to a common x grid with interpolation and padding.
- Parameters:
other (signal) – the other signal to align with
mode (str) – ‘union’ (default) or ‘intersection’
n (int) – number of points for the new grid
- Returns:
(self_interp, other_interp) as new signal instances
- Return type:
tuple
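The union and intersection modes can be illustrated with linear interpolation on plain arrays. This is a sketch under assumptions, not the library's code: align_on_common_grid is a hypothetical helper, and padding with zeros outside a signal's own range is an assumed choice.

```python
import numpy as np


def align_on_common_grid(x1, y1, x2, y2, mode="union", n=1000):
    """Hypothetical helper: resample two signals on one shared x grid."""
    if mode == "union":
        lo, hi = min(x1[0], x2[0]), max(x1[-1], x2[-1])
    elif mode == "intersection":
        lo, hi = max(x1[0], x2[0]), min(x1[-1], x2[-1])
        if lo >= hi:
            raise ValueError("the two x ranges do not overlap")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    grid = np.linspace(lo, hi, n)
    # left/right=0.0 pads with zeros outside each signal's range (assumed padding value)
    y1_new = np.interp(grid, x1, y1, left=0.0, right=0.0)
    y2_new = np.interp(grid, x2, y2, left=0.0, right=0.0)
    return grid, y1_new, y2_new


xa = np.linspace(0.0, 10.0, 11)
ya = np.ones_like(xa)
xb = np.linspace(5.0, 15.0, 11)
yb = 2.0 * np.ones_like(xb)
grid_u, ya_u, yb_u = align_on_common_grid(xa, ya, xb, yb, mode="union", n=16)
grid_i, ya_i, yb_i = align_on_common_grid(xa, ya, xb, yb, mode="intersection", n=6)
```

In union mode the grid spans 0 to 15 and each signal is zero where it has no data; in intersection mode the grid spans 5 to 10, where both signals are defined.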
- apply_poisson_baseline_filter(window_ratio=0.02, gain=1.0, proba=0.9)
Apply a baseline filter assuming Poisson-dominated statistics with adjustable gain and a rejection threshold based on the Bienaymé-Tchebychev inequality.
The signal is filtered by removing values likely caused by statistical noise (false peaks) using a per-point threshold defined from local statistics:
Local mean: $$ \mu_t = \frac{1}{w} \sum_{i \in W(t)} y_i $$
Local std dev: $$ \sigma_t = \sqrt{\mu_t \cdot \text{gain}} $$
Coefficient of variation: $$ \text{cv}_t = \frac{\sigma_t}{\mu_t} $$
Estimated local intensity (lambda): $$ \lambda_t = \frac{1}{\text{cv}_t^2} $$
Bienaymé-Tchebychev threshold: $$ \text{threshold}_t = \frac{1}{\sqrt{1 - p}} \cdot \sqrt{10 \lambda_t \cdot \Delta t} $$
- Parameters:
window_ratio (float, default=0.02) – Ratio of signal length used as window size (must yield odd integer ≥ 11).
gain (float, default=1.0) – Linear amplification factor applied to simulate signal counts.
proba (float, default=0.9) – Minimum probability to consider a signal point significant. Must be in (0, 1).
- Returns:
The current signal instance (self), with updated y.
- Return type:
signal
- Raises:
ValueError – If the window size is too small for reliable statistics.
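The per-point threshold can be sketched numerically from the local statistics listed above. This is an illustration under assumptions, not the method's implementation: poisson_threshold is a hypothetical helper, the window is given directly in points, Δt defaults to 1, edges are handled by repeating the end values, and the way the threshold is then applied to y is deliberately left out because the text does not specify it.

```python
import numpy as np


def poisson_threshold(y, window=11, gain=1.0, proba=0.9, dt=1.0):
    """Hypothetical helper: local mean -> sigma -> cv -> lambda -> threshold."""
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    if not 0.0 < proba < 1.0:
        raise ValueError("proba must be in (0, 1)")
    y = np.asarray(y, dtype=float)
    # Moving average over W(t); edge padding keeps the output as long as y
    padded = np.pad(y, window // 2, mode="edge")
    mu = np.convolve(padded, np.ones(window) / window, mode="valid")
    mu = np.maximum(mu, 1e-12)           # guard against division by zero
    sigma = np.sqrt(mu * gain)           # Poisson-like standard deviation
    cv = sigma / mu                      # coefficient of variation
    lam = 1.0 / cv ** 2                  # estimated local intensity
    return np.sqrt(10.0 * lam * dt) / np.sqrt(1.0 - proba)


flat = np.full(50, 4.0)
thr = poisson_threshold(flat, window=11, gain=1.0, proba=0.9)
```

For a flat signal of 4 counts with gain 1, λ equals 4 and the threshold is sqrt(40) / sqrt(0.1) = 20. The factor 1/sqrt(1 - p) is consistent with the Bienaymé-Tchebychev bound P(|X - μ| ≥ kσ) ≤ 1/k², which gives a tail probability of at most 1 - p when k = 1/sqrt(1 - p).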
- backup(fullhistory=None, message=None)
Backup current state in _previous
- copy()
Deep copy of the signal, excluding full history control flag
- disable_fullhistory()
Disable full history tracking
- enable_fullhistory()
Enable full history tracking
- classmethod from_peaks(peaks_obj, x=None, generator_map=None, name='from_peaks', x0=None, n=1000)
Generate a signal from a set of peaks.
- Parameters:
peaks_obj (peaks) – A list-like object containing peak definitions.
x (array-like, float, or None) – If None: compute x domain from peaks. If scalar: interpreted as xmax; linspace from x0 to xmax. If array: use as x directly.
generator_map (dict or None) – Optional map of peak type → generator instance (default is Gaussian).
name (str) – Name of the signal instance.
x0 (float or None) – Left bound of the domain (used only if x is None or scalar). If None: inferred from peaks.
n (int) – Number of points in the generated x array.
- Returns:
A new signal instance generated from the peaks.
- Return type:
signal
Example
>>> p = peaks()
>>> p.add(x=[400, 800, 1600], w=30, h=[1.0, 0.6, 0.9], type="gauss")
>>> s = signal.from_peaks(p, x0=300, n=2048)
>>> s.plot()
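What a Gaussian generator does with such peak definitions can be sketched with NumPy. This is an illustration only: gaussian_peaks_to_signal is a hypothetical helper, and it assumes that w is exactly the full width at half maximum (the documentation only says w is related to FWHM) and that the default domain extends three widths past the last peak.

```python
import numpy as np


def gaussian_peaks_to_signal(peak_list, x0=0.0, xmax=None, n=1000):
    """Hypothetical helper: sum of Gaussian peaks sampled on a linear grid."""
    centers = np.array([p["x"] for p in peak_list], dtype=float)
    widths = np.array([p["w"] for p in peak_list], dtype=float)
    heights = np.array([p["h"] for p in peak_list], dtype=float)
    if xmax is None:
        xmax = centers.max() + 3.0 * widths.max()  # assumed right margin
    x = np.linspace(x0, xmax, n)
    # Assumption: w is the FWHM, so sigma = w / (2 * sqrt(2 * ln 2))
    sigma = widths / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    z = (x[None, :] - centers[:, None]) / sigma[:, None]
    y = (heights[:, None] * np.exp(-0.5 * z ** 2)).sum(axis=0)
    return x, y


demo = [
    {"x": 400, "w": 30, "h": 1.0},
    {"x": 800, "w": 30, "h": 0.6},
    {"x": 1600, "w": 30, "h": 0.9},
]
x_demo, y_demo = gaussian_peaks_to_signal(demo, x0=300, xmax=1700, n=1401)
```

With a grid step of 1, the sample at x = 400 carries the full height of the first peak and the sample at x = 415 (half a width away) carries half of it.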
- static load(filepath)
Load a signal from a JSON or gzipped JSON file, including recursive _previous.
- Parameters:
filepath (str or Path) – Path to the JSON or .gz file
- Returns:
A fully reconstructed signal object
- Return type:
signal
- property n
Return the length of the signal, or None if the signal is undefined (None).
- normalize(mode='zscore+shift', inplace=True, shift_eps=1e-06)
Normalize the signal to positive values using different normalization strategies.
- Parameters:
mode (str) – Normalization mode:
- "zscore+shift": (y - mean) / std, then shift so min is shift_eps
- "minmax": (y - min) / (max - min), scales to [0, 1]
- "max": y / max, scales to [0, 1]
- "l1": y / sum(|y|), sums to 1 (like probability)
- "energy": y / sqrt(sum(y^2)), unit energy
- "none": No normalization, just returns a copy or itself
inplace (bool) – Whether to modify the signal in place. If False, returns a new signal.
shift_eps (float) – Minimum value to add after z-score shift to ensure strictly positive output.
- Returns:
Normalized signal (if inplace is False), else None.
- Return type:
signal or None
- Raises:
ValueError – If the normalization fails (e.g., due to division by zero).
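The formulas behind each mode can be written out on a plain array. normalize_values is a hypothetical helper used only to make the arithmetic concrete; it works on arrays, not on signal objects, and always returns a new array.

```python
import numpy as np


def normalize_values(y, mode="zscore+shift", shift_eps=1e-6):
    """Hypothetical helper mirroring the documented modes on a plain array."""
    y = np.asarray(y, dtype=float)
    if mode == "none":
        return y.copy()
    if mode == "zscore+shift":
        scale = y.std()
    elif mode == "minmax":
        scale = y.max() - y.min()
    elif mode == "max":
        scale = y.max()
    elif mode == "l1":
        scale = np.abs(y).sum()
    elif mode == "energy":
        scale = np.sqrt(np.sum(y ** 2))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if scale == 0:
        raise ValueError(f"cannot normalize with mode {mode!r}: zero denominator")
    if mode == "zscore+shift":
        z = (y - y.mean()) / scale
        return z - z.min() + shift_eps  # smallest value becomes shift_eps
    if mode == "minmax":
        return (y - y.min()) / scale
    return y / scale  # "max", "l1" and "energy" are plain rescalings


y_raw = np.array([1.0, 2.0, 4.0, 3.0])
y_minmax = normalize_values(y_raw, "minmax")
y_l1 = normalize_values(y_raw, "l1")
y_energy = normalize_values(y_raw, "energy")
y_z = normalize_values(y_raw, "zscore+shift")
```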
- plot(ax=None, label=None, color=None, linestyle=None, linewidth=None, fontsize=12, newfig=False)
Plot the signal using matplotlib, applying either internal style settings or overrides provided at call time.
- Parameters:
ax (matplotlib.axes.Axes, optional) – Axis to plot on. If None, uses current axis or new figure if newfig=True.
label (str, optional) – Legend label. Defaults to self.name.
color (str or None) – Line color. If None, uses default matplotlib cycling.
linestyle (str or None) – Line style (e.g., '-', '--'). If None, uses self.linestyle.
linewidth (float or None) – Line width. If None, uses self.linewidth.
fontsize (int or str) – Font size for axis labels and legend. Can use values like ‘small’, ‘large’.
newfig (bool) – If True, creates a new figure before plotting.
- Returns:
matplotlib.figure.Figure
matplotlib.axes.Axes
- restore()
Restore the previous signal version if available
- sample(x_new)
Interpolate the signal values at the new positions x_new.
- save(filepath=None, zip=True, export_csv=False)
Save signal to JSON (optionally compressed) and optionally CSV.
- Parameters:
filepath (str or Path or None) – If None, builds path from metadata[‘cwd’] and self.name + ‘.json[.gz]’. If a directory, appends name + ‘.json[.gz]’. If a file, uses as is.
zip (bool) – Whether to compress the JSON file using gzip. Default: True.
export_csv (bool) – If True, also save a .csv file (x,y) alongside the JSON.
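The JSON-plus-gzip round trip can be sketched with the standard library alone. save_json and load_json are hypothetical helpers operating on a plain dictionary; the real method serializes a full signal with its metadata and history, which is not reproduced here. The flag is called compress in the sketch to avoid shadowing Python's built-in zip.

```python
import gzip
import json
import tempfile
from pathlib import Path


def save_json(payload, filepath, compress=True):
    """Hypothetical helper: write a dict as JSON, gzip-compressed by default."""
    path = Path(filepath)
    if compress:
        if path.suffix != ".gz":
            path = path.with_name(path.name + ".gz")  # sample.json -> sample.json.gz
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            json.dump(payload, fh)
    else:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
    return path


def load_json(filepath):
    """Hypothetical helper: read back plain or gzipped JSON based on the suffix."""
    path = Path(filepath)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


record = {"name": "sample", "x": [0.0, 1.0, 2.0], "y": [0.1, 0.5, 0.2], "_previous": None}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_json(record, Path(tmp) / "sample.json")
    saved_name = saved.name
    restored = load_json(saved)
```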
- class sig2dna_core.signomics.signal_collection(*signals, n=1024, mode='union', name=None, force=True)
Bases:
list
A container class for multiple signal instances that ensures alignment on a shared x-grid.
The collection is used to manage, compare, combine, or visualize multiple signals (e.g., from replicates, experiments, synthetic scenarios). Signals are interpolated and padded on insertion so all have the same shape and domain. Arithmetic, matrix extraction, and overlay plots are supported.
Parameters:
- *signals (signal)
One or more signal instances to include (they are copied and aligned).
- n (int)
Number of sampling points in the aligned x grid (default: 1024).
- mode (str)
Alignment mode: ‘union’ or ‘intersection’ of x-ranges.
Core Attributes:
- mode (str)
Alignment strategy used (“union” or “intersection”).
- n (int)
Number of x-points used in alignment (default=1024).
Key Methods:
append(signal) → add and align a new signal
to_matrix() → convert signals to a 2D array (n_signals x n_points)
mean(coeffs=None) → weighted or unweighted mean
sum(coeffs=None) → weighted or unweighted sum
plot(…) → overlay signals with optional mean/sum
copy → all signals stored are deep copies
generate_synthetic → signal collection composed of random peaks.
__getitem__(…) → slice, list, or name-based access to signals
__repr__ / __str__ → report contents with span and names
_toDNA(signal_collection) → list of DNAsignals
Access Patterns:
sc[0:3] → subcollection by slice
sc[[0, 2]] → subcollection by list of indices
sc["name"] → return a copy of signal with that name
sc["A", "B"] → return a subcollection with those names
Supports arithmetic operations for aligned signal mixtures.
Arithmetic operations on aligned signal collections
Scalar multiplication: a * sc scales each signal by a constant a.
Collection addition: sc1 + sc2 adds two collections element-wise.
Linear combinations: a * A + b * B + c * C constructs mixtures of compatible collections.
Compatible with sum([a*A, b*B, …]) for aggregating multiple weighted collections.
- Requirements
All signal_collections must share the same number of signals.
Signals are aligned on a common x-grid (same n, mode, and domain).
Element-wise operations preserve signal names and metadata when possible.
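Compatibility with Python's built-in sum deserves a note: sum starts from the integer 0, so the first operation it performs is 0 + collection. The sketch below shows one way such arithmetic can be supported. MiniCollection is a hypothetical stand-in that stores aligned signals as rows of a matrix; it is not the signal_collection implementation.

```python
import numpy as np


class MiniCollection:
    """Hypothetical stand-in: rows are signals sampled on one shared x grid."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)

    def __mul__(self, scalar):
        return MiniCollection(self.values * scalar)

    __rmul__ = __mul__  # lets "a * sc" work as well as "sc * a"

    def __add__(self, other):
        if isinstance(other, (int, float)) and other == 0:
            return MiniCollection(self.values.copy())  # start value used by sum()
        if not isinstance(other, MiniCollection):
            return NotImplemented
        if self.values.shape != other.values.shape:
            raise ValueError("collections must share signal count and grid size")
        return MiniCollection(self.values + other.values)

    __radd__ = __add__  # handles "0 + sc"


A = MiniCollection(np.ones((2, 4)))
B = MiniCollection(2.0 * np.ones((2, 4)))
mix = sum([0.5 * A, 0.25 * B])  # same result as 0.5 * A + 0.25 * B
```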
Examples:
>>> sc = signal_collection(s1, s2, s3)
>>> sc.plot(show_mean=True)
>>> sc[0:2]  # sub-collection (copy)
>>> sc["peak1"]  # get copy of signal named 'peak1'
>>> mat = sc.to_matrix()
>>> sc.mean().plot()
>>> sc.sum(coeffs=[0.4, 0.6]).plot()
- append(new_signal)
Append and align the new signal to the existing collection.
- classmethod generate_mixtures(n_mixtures=10, max_peaks=16, peaks_per_mixture=(3, 8), amplitude_range=(0.5, 2), flatten='mean', n_signals=1, n_peaks=1, kinds=('gauss',), width_range=(0.5, 3), height_range=(1.0, 5.0), x_range=(0, 500), n_points=1024, normalize=False, seed=None, **kwargs)
Generate synthetic mixtures of signals by combining a subset of base peaks.
- Parameters:
n_mixtures (int) – Number of synthetic mixtures to generate.
max_peaks (int) – Maximum number of base signals (from which peaks are taken).
peaks_per_mixture (tuple of (int, int)) – Range (min, max) for the number of peaks to combine in each mixture. Cannot exceed max_peaks.
amplitude_range (tuple of (float, float)) – Random scaling range applied to peak amplitudes in each mixture.
flatten ({'sum', 'mean'}, default='mean') – How to combine the signals for each mixture.
**kwargs (dict) – All other keyword arguments passed to generate_synthetic.
- Returns:
result_collection (signal_collection) – A collection of synthetic mixed signals.
all_peaks (list of dict) – All individual peaks originally generated.
used_peak_ids (list of list of str) – For each mixture, the list of peak names used.
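The mixing step itself (pick a random subset of base signals, scale each by a random amplitude, then flatten) can be sketched on a matrix of base signals. mix_base_signals is a hypothetical helper; it returns row indices instead of peak names, and the actual random draws in generate_mixtures may be organized differently.

```python
import numpy as np


def mix_base_signals(base, n_mixtures=10, peaks_per_mixture=(3, 8),
                     amplitude_range=(0.5, 2.0), flatten="mean", seed=None):
    """Hypothetical helper: random weighted mixtures of the rows of `base`."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base, dtype=float)
    lo, hi = peaks_per_mixture
    if hi > base.shape[0]:
        raise ValueError("peaks_per_mixture cannot exceed the number of base signals")
    mixtures, used = [], []
    for _ in range(n_mixtures):
        k = int(rng.integers(lo, hi + 1))                   # subset size, hi inclusive
        chosen = rng.choice(base.shape[0], size=k, replace=False)
        amps = rng.uniform(*amplitude_range, size=k)        # one amplitude per base signal
        stacked = amps[:, None] * base[chosen]
        row = stacked.mean(axis=0) if flatten == "mean" else stacked.sum(axis=0)
        mixtures.append(row)
        used.append(sorted(int(i) for i in chosen))
    return np.vstack(mixtures), used


# Identity rows act as 12 non-overlapping unit "peaks" for the demo
mixed, used_ids = mix_base_signals(np.eye(12), n_mixtures=30, peaks_per_mixture=(4, 8),
                                   amplitude_range=(0.2, 1.5), flatten="sum", seed=123)
```

Keeping the list of indices used for each mixture is what makes the synthetic set usable as ground truth for later classification tests.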
Examples
>>> S, pS = signal_collection.generate_mixtures(
...     n_mixtures=30,
...     max_peaks=12,
...     peaks_per_mixture=(4, 8),
...     amplitude_range=(0.2, 1.5),
...     n_signals=12,
...     kinds=("gauss",),
...     width_range=(0.5, 3),
...     height_range=(1.0, 5.0),
...     x_range=(0, 500),
...     n_points=2048,
...     normalize=False,
...     seed=123
... )
>>> S.plot()
- classmethod generate_synthetic(n_signals=5, n_peaks=5, kind_distribution='uniform', kinds=('gauss', 'lorentz', 'triangle'), x_range=(0, 1000), n_points=1024, avoid_overlap=True, width_range=(20, 60), height_range=(0.5, 1.0), normalize=True, noise=None, bias=None, name_prefix='synthetic', seed=None)
Generate a synthetic signal collection composed of random peaks.
- Parameters:
n_signals (int) – Number of synthetic signals to generate.
n_peaks (int or tuple[int,int]) – Number of peaks per signal or its range.
kind_distribution (str) – ‘uniform’ → use all peak kinds equally; ‘random’ → random draw from kinds.
kinds (tuple[str]) – Generator types to choose from (‘gauss’, ‘lorentz’, ‘triangle’).
x_range (tuple[float, float]) – Start and end of x-domain.
n_points (int) – Number of sampling points for each signal (default: 1024).
avoid_overlap (bool) – Prevent peaks from overlapping by checking spacing vs. width.
width_range (tuple[float, float]) – Range of widths for the peaks.
height_range (tuple[float, float]) – Range of peak heights.
normalize (bool) – Normalize each signal so the highest peak has intensity 1.
noise (dict or None) – Optional noise model, e.g. {“kind”: “gaussian”, “scale”: 0.01}.
bias (float, str, np.ndarray, or signal) – Optional signal bias: can be a constant, ‘ramp’, or signal.
name_prefix (str) – Base name for each generated signal.
seed (int or None) – Random seed for reproducibility.
- Returns:
A collection of generated signals.
- Return type:
signal_collection
Examples
>>> # 1. Default random peaks, Gaussian + ramp bias
>>> sc = signal_collection.generate_synthetic(
...     n_signals=5, n_peaks=6, kinds=("gauss", "lorentz", "triangle"),
...     noise={"kind": "gaussian", "scale": 0.02}, bias="ramp", name_prefix="test"
... )
>>> sc.plot(show_mean=True, fontsize="large")
>>> # 2. High-res signal, fixed width and height
>>> sc2 = signal_collection.generate_synthetic(
...     n_signals=3, n_peaks=8, kinds=("gauss",), width_range=(30, 30),
...     height_range=(1.0, 1.0), x_range=(0, 500), n_points=2048,
...     normalize=False, seed=123
... )
>>> sc2.plot(fontsize=14)
>>> # 3. Poisson noise, no overlap, save output
>>> sc3 = signal_collection.generate_synthetic(
...     n_signals=2, n_peaks=5, noise={"kind": "poisson", "scale": 2.0},
...     name_prefix="poisson_example"
... )
>>> for s in sc3:
...     s.save(export_csv=True)
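The avoid_overlap option is described as checking spacing against width. One simple way to meet such a constraint is rejection sampling, sketched below. sample_separated_centers is a hypothetical helper and the rule "centers at least one width apart" is an assumption; the library's actual criterion is not documented here.

```python
import numpy as np


def sample_separated_centers(n_peaks, x_range, min_gap, seed=None, max_tries=10_000):
    """Hypothetical helper: draw peak centers that stay at least min_gap apart."""
    rng = np.random.default_rng(seed)
    lo, hi = x_range
    centers = []
    tries = 0
    while len(centers) < n_peaks:
        if tries >= max_tries:
            raise RuntimeError("could not place all peaks; relax min_gap or n_peaks")
        tries += 1
        candidate = rng.uniform(lo, hi)
        # Keep the candidate only if it is far enough from every accepted center
        if all(abs(candidate - c) >= min_gap for c in centers):
            centers.append(candidate)
    return np.sort(np.array(centers))


centers_demo = sample_separated_centers(5, x_range=(0, 1000), min_gap=60, seed=123)
```

The max_tries guard matters because a crowded domain (many wide peaks in a short x_range) may have no valid placement at all.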
- mean(indices_or_names=None, coeffs=None)
Mean of selected signals, optionally weighted.
- Parameters:
indices_or_names (list[int or str], optional) – Signal names or indices to include.
coeffs (list[float], optional) – Weights for selected signals.
- Returns:
Averaged signal.
- Return type:
signal
- plot(indices=None, labels=True, title=None, newfig=None, ax=None, show_mean=False, show_sum=False, coeffs=None, fontsize=12, colormap=None)
Plot selected signals with style attributes and optional overlays.
- Parameters:
indices (list[int] or list[str], optional) – Signals to plot by index or name.
labels (bool) – Whether to show signal labels.
title (str) – Plot title.
newfig (bool or None) – If True, always open a new figure. If False, use current axes. If None, open new figure only the first time this collection is plotted.
ax (matplotlib axis, optional) – Axis to draw on.
show_mean (bool) – Overlay mean curve.
show_sum (bool) – Overlay sum curve.
coeffs (list[float], optional) – Optional weights for mean/sum.
fontsize (int or str) – Font size for labels and legend.
colormap (list[str], optional) – List of colors to cycle through when signal.color is None.
- Return type:
matplotlib.figure.Figure
- sum(indices_or_names=None, coeffs=None)
Sum selected signals, optionally weighted by coeffs.
- Parameters:
indices_or_names (list[int or str], optional) – If provided, selects a subset by index or name.
coeffs (list[float], optional) – Weights matching the number of selected signals.
- Returns:
Summed signal.
- Return type:
signal
- to_matrix()
Return a 2D array (n_signals x n_points) of aligned signal values.
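Once the signals are stacked as an (n_signals x n_points) matrix, weighted sums and means reduce to matrix products. The helpers below are hypothetical, and dividing the weighted mean by the sum of the coefficients is an assumption: the documentation does not state how mean() normalizes its weights.

```python
import numpy as np


def weighted_sum(matrix, coeffs=None):
    """Hypothetical helper: sum of the rows, optionally weighted."""
    matrix = np.asarray(matrix, dtype=float)
    if coeffs is None:
        return matrix.sum(axis=0)
    coeffs = np.asarray(coeffs, dtype=float)
    if coeffs.shape != (matrix.shape[0],):
        raise ValueError("need exactly one coefficient per signal")
    return coeffs @ matrix


def weighted_mean(matrix, coeffs=None):
    """Hypothetical helper: mean of the rows; weights normalized by their sum (assumed)."""
    matrix = np.asarray(matrix, dtype=float)
    if coeffs is None:
        return matrix.mean(axis=0)
    return weighted_sum(matrix, coeffs) / np.sum(coeffs)


mat_demo = np.array([[1.0, 2.0, 3.0],
                     [3.0, 2.0, 1.0]])
total = weighted_sum(mat_demo, coeffs=[0.4, 0.6])
average = weighted_mean(mat_demo, coeffs=[1.0, 3.0])
```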