๐ก๐งฌ sig2dna๏
Symbolic Signal Transformation for Fingerprinting, Alignment, and AI-Based Classification
๏น ๏ฎฉูจู๏ฎฉ๏ฎฉูจู๏ฎฉูจู๏ฎฉ๏ฎฉูจู๏น๏น
sig2dna
is a Python module that transforms complex 1D, 2Dโฆ analytical signals into DNA-like symbolic sequences via morphological encoding. These symbolic fingerprints enable fast alignment, motif recognition, classification, and high-throughput comparison of signals originating from:
GC-MS
/GC-FID
- ๐low and๐ฌhigh resolution
HPLC-MS
- ๐low and๐ฌhigh resolution
NMR
/FTIR
/Raman
/RX
It supports large-scale applications such as identifying unknown substances in โป๏ธ recycled materials or mixtures containing NIAS (Non-Intentionally Added Substances). ๐๏ธ Symbolic compression (up to 95%+) enables scalable storage and alignmentโand seamless integration with Large Language Models (LLMs).
๐จ Credits: Olivier Vitrac
๐ This approach was developed and tested as part of the PhD thesis:
Julien Kermorvant, โConcept of chemical fingerprints applied to the management of chemical risk of materials, recycled deposits and food packagingโ, AgroParisTech. 2023. https://theses.hal.science/tel-04194172
๐ก Note for ND-signals and multi-detector or multi-technique data
sig2dna
natively supports signal alignment and comparison across heterogeneous sources. This makes it particularly suited for ND-signals (non-destructive signals) or aggregated data from different detectors or acquisition techniques.
โก๏ธ You can seamlessly concatenate or compare signals originating from different instruments or modes (e.g., UV + MS, GCรGC, LC-FTIR, etc.) โ the symbolic coding abstracts away intensity scales and detector-specific artifacts, focusing instead on morphological motifs.
Additional recommendations for ๐ผ๏ธ 2D and ๐๏ธ multimodal acquisition systems are given at the end of this document ๐.
๐ Table of Contents๏
๐งฉ 1| Main Components๏
Class |
Description |
---|---|
|
Encodes numerical signal as symbolic sequence (multi-scale wavelet transform) |
|
A symbolic string with alignment, entropy, motif search, plotting, etc. |
|
Computes and visualizes distances, PCoA, clustering, scatter, dendrogram |
๐ง 2| Applications๏
High-throughput chemical pattern recognition ๐, -หหโโโโโ
NIAS tracking in complex matrices โฌ,๐โ ๏ธ
AI-compatible signal fingerprints ๐ซ๐
Classification of recycled material batches โป๏ธ ๐
Detection of structural motif distortions (due to overlapping compounds) ๐ต๐ป
AI-assisted quality control
AI-assisted compliance testing ๐ฝ๏ธ
๐งฌ 3| Core Concepts - Overview๏
Morphology Encoding as a โGeneticโ Code. Chemical signals are subjected to signal morphology encoding using continuous, symmetric wavelet transforms. The symbolic sequences are similar to genetic code. It is based on a limited number of letters, symbols appear grouped into motifs which behave like codons As a result, a motif table can be used to recognize n-upplets in ^1^H-NMR, mass spectra, retention times, etc.
Motif Recognition. Searches of substances or typical patterns can be carried out via regular expressions or via transition probabilities (AโZ vs. ZโB vs. AโA) over a sliding window. All operations can be carried out in parallel for efficiency and automated treatment.
Scalable Machine-Learning. High compression ratios enable the efficient storage of millions of chemical signatures.
3.1 Input Signal โก๏ธ๏
One-dimensional
signal
objects (NumPy-based)Supports synthetic and experimental sources
S = signal.from_peaks(...)
๐ฅ
3.2 Wavelet Transform ใฐ๏
A Mexican hat (Ricker) wavelet is used:
The Continuous Wavelet Transform (CWT) of a signal \(x(t)\) is:
where \(s\) is the scale parameter (typically powers of two, e.g., \(s = 2^n\)) and \(*\) the convolution operator.
๐ช
3.3 Relationship of \(W\_s(t)\) with the second derivative \(x''(t)=\frac{\partial^2 x(t)}{\partial t^2}\)๏
Applying the Ricker wavelet (second derivative of a Gaussian) to the signal \(x(t)\) via convolution (i.e., CWT) is equivalent to computing the second derivative of \(x(t)\) smoothed by the Gaussian kernel \(g\_s(t)\):
Click here for the demonstration
The convolution of \(x(t)\) with the second derivative of \(g(t)\) is:
\((x * g'')(t) = \int\_{-\infty}^{\infty} x(\tau) g''(t - \tau) \, d\tau\)
We perform a change of variable: let \(u = t - \tau\), so \(\tau = t - u\) and \(d\tau = -du\). This gives:
\((x * g'')(t) = \int\_{-\infty}^{\infty} x(t - u) g''(u) \, du\)
Now consider the convolution of the second derivative of \(x(t)\) with \(g(t)\):
\((x'' * g)(t) = \int\_{-\infty}^{\infty} x''(\tau) g(t - \tau) \, d\tau\)
Again, perform the change of variable \(u = t - \tau\), yielding:
\((x'' * g)(t) = \int\_{-\infty}^{\infty} x''(t - u) g(u) \, du\)
Now integrate by parts twice:
First integration by parts (assuming \(g(u) \to 0\) and \(x'(t - u) \to 0\) as \(u \to \pm \infty\)):
\(\int x''(t - u) g(u) \, du = - \int x'(t - u) g'(u) \, du\)
Second integration by parts:
\(-\int x'(t - u) g'(u) \, du = \int x(t - u) g''(u) \, du\)
Which yields:
\((x'' * g)(t) = \int x(t - u) g''(u) \, du = (x * g'')(t)\)
๐ฆ
3.4 Symbolic Encoding ๐ก๏
Each segment of the wavelet-transformed signal is encoded into one of the symbolic codes corresponding to the table of variation \(\text{sign}\left(\frac{\partial}{\partial t}W\_s(t)\right)\) and \(\text{sign}\left(W\_s(t)\right)\).
Symbol: \(\ell\_i\) |
Variation |
Description |
---|---|---|
A |
-โ+ |
Increasing crossing from โ to + (zero-crossing) |
B |
-โ- |
Increasing negative |
C |
+โ+ |
Increasing positive |
X |
+โ+ |
Decreasing positive |
Y |
-โ- |
Decreasing negative |
Z |
+โ- |
Decreasing crossing from + to โ (zero-crossing) |
_ |
โโ |
Flat or noise segment |
Each segment stores its width
, height
, and position
.
The full-resolution symbolic sequence is reconstructed by interpolating or repeating these symbols proportionally to their span. A quantitative pseudo-inverse is proposed to reconstruct chemical signals from their code.
๐ฉ
3.5 Symbolic Compression ๐๏ธ๏
Symbolic sequences can be compressed and encoded at full resolution via:
dna.encode_dna()
dna.encode_dna_full(resolution="index")
Resulting in DNA-like sequences like:
"YYAAZZBB_YAZB"
๐จ
3.6 Structural Meaning (e.g., YAZB Motif)๏
A single Gaussian peak transformed via the Ricker wavelet results in:
Y: rising pre-lobe ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท
A: left inflection (โ to + crossing)
Z: right inflection (+ to โ crossing)
B: trailing decay
The YAZB
motif is a symbolic map of the Ricker wavelet transform (CWT) of a Gaussian. An alteration of the pattern reveals overlapping Gaussians , asymmetric signals or more generally interactions and interferences.
๐ง
3.7 Interpretation When Gaussians Overlap ๐โค๏
When two Gaussians overlap, especially at close proximity or with different amplitudes:
The central peak becomes asymmetric.
The Ricker transform becomes a superposition of two wavelets.
This causes:
Emergence of extra inflection points,
Distortion of the clean YAZB sequence,
Insertion of intermediary motifs (e.g., repeating A-Z transitions, shortened or elongated lobes),
Possible merging of YAZB motifs or partial truncation.
So changes in the symbolic code structure directly reflect signal interference, i.e., nonlinearity in overlapping peaks.
๐ง 4| Entropy and Distance Metrics๏
Sig2dna
implements several metrics to evaluate the similarity of coded chemical signals. Alignment is essential to compare them while respecting order. It is performed via global/local pairwise alignment usingdifflib
orBiopython
. Excess Entropy and Jensen-Shannon are best choices in the presence of complex mixtures by enabling the detection of small structural changes.
Distance |
Sensitive to |
Based on |
Alignment needed |
Suitable for |
---|---|---|---|---|
Excess Entropy |
Symbol order |
Shannon entropy |
โ Yes |
Structural motif similarity |
Jensen-Shannon |
Symbol usage |
Probability dist. |
โ No |
Profile similarity (e.g., peak types) |
Levenshtein |
Edit steps |
Insertion/Deletion |
โ Yes |
Sequence-level variation |
Jaccard |
Pattern sets |
Motif occurrences |
โ No |
Motif overlap, Motif density map |
4.1 Shannon Entropy โโโโโโ ๏
Entropy provides a robust, physics-informed metric for morphological comparisons. For a symbolic sequence \(X\), it reads:
where \(p(\ell\_i)\) is the frequency of letter \(l\_i\) in the sequence \(X\).
Entropy \(H\) is an extensive quantity verify additivity properties for independent sequences. Its value is accumulated between structured and low structured regions. Entropy is invariant under translation and stable under small perturbations, especially when using symbolic codes rather than raw intensities. This makes it ideal for comparing:
Signals with shifts in baseline
Morphologically similar but intensity-scaled signals
Partially distorted sequences (e.g., from mixtures or degradation)
๐ด
4.2 Aligned sequences and Excess Entropy Distance โ๏ธ๏
Let \(A\) and \(B\) be two symbolic sequences (DNAstr
) representing two signals. After alignment (e.g., via global/local pairwise alignment using difflib
or Biopython
), we obtain:
\(\tilde{A}\): aligned version of \(A\) (with possible gap insertions)
\(\tilde{B}\): aligned version of \(B\)
\(\tilde{A} * \tilde{B}\): a new sequence formed by pairing corresponding symbols (possibly with gaps)
Given sequences \(A\) and \(B\), the mutually exclusive information or excess entropy is defined as:
where:
\(H(A)\) and \(H(B)\) are the Shannon entropies of the original sequences
\(H(\tilde{A} * \tilde{B})\) is the Shannon entropy of the aligned signal pairs (treated as โjoint lettersโ)
๐
4.3 Jensen-Shannon Distance โ๏ธ๏
Let \(P\) and \(Q\) be the empirical frequency distributions of symbolic letters in two DNA-like coded signals \(A\) and \(B\), respectively. That is:
\(P = {p\_\ell}\) where \(p\_\ell = \frac{\text{count of symbol } \ell \text{ in } A}{|A|}\)
\(Q = {q\_\ell}\) where \(q\_\ell = \frac{\text{count of symbol } \ell \text{ in } B}{|B|}\)
Let \(M\) be the average distribution:
Then, the JensenโShannon distance between \(P\) and \(Q\) is defined as:
where \(D\_{\text{KL}}\) is the Kullback-Leibler divergence:
and \(m\_\ell\) is the frequency of symbol \(\ell\) in the average distribution \(M\).
4.3.1 Interpretation ๐ก๏
The JensenโShannon distance quantifies how different the symbol usage is between two signals, ignoring the order in which the symbols appear.
It is bounded between 0 and 1, symmetric, and always finite (even when some symbols are missing in one sequence).
A value of 0 indicates identical symbol distributions, while 1 indicates completely disjoint symbol usage.
4.3.2 Use Cases ๐งช๏
Robust against misalignment or noise: two signals with similar overall composition but different positions will still score low JSD.
Useful for clustering symbolic signals by type or composition, regardless of temporal structure.
Complementary to entropy or edit-based distances, which capture positional or morphological changes.
๐ก
4.4 Jaccard Motif Distance ๐๏
The Jaccard distance measures the similarity between two symbolic signals by comparing the sets of motifs (short symbolic substrings) they contain, without requiring alignment. It is particularly suited for identifying common structural patterns across signals, regardless of their order or spacing.
Given two sequences \(A\) and \(B\), and a set of motifs \(\mathcal{M}\) of length \(k\) (typically 3โ5 characters), we define:
\(\mathcal{M}(A)\): set of motifs found in \(A\)
\(\mathcal{M}(B)\): set of motifs found in \(B\)
Then the Jaccard distance is defined as:
4.4.1 Key Features:๏
โ No alignment needed โ motif presence is evaluated globally
๐ Sensitive to local patterns โ detects repeated or shared symbolic structures
๐ Sparse and interpretable โ suitable for heatmaps and clustering
4.4.2 Implementation Notes:๏
Motifs are extracted using a sliding window of fixed length (default:
k=4
)Symbol sequences are assumed to be from the encoded
DNAstr
outputsMotif sets are hashed to speed up large comparisons
Jaccard scores are computed pairwise across a collection of symbolic sequences
This metric is especially useful when:
You expect common substructures across signals
Signals may differ in length or alignment is unreliable
You want to create density maps of motif usage or explore structural similarity clusters
Here is the complete, cleanly formatted README.md documentation section for the new sinusoidal encoder/decoder functions added to sig2dna
, including appropriate emojis and explanations:
๐ 5 | Sinusoidal Encoding of Symbolic Segments๏
sig2dna
integrates a transformer-style positional encoding for symbolic segments, enabling conversion of morphological features into fixed-size vectors. This provides a compact, AI-ready representation of:
โฑ๏ธ Position (\(x\_0\))
๐ Width (\(\Delta x\))
๐ถ Amplitude (\(\Delta y\))
๐ก This mechanism replaces long repetitions of letters by a numerically invertible vector encoding, useful for clustering, attention-based models, or compressed storage.
5.1 Mathematical Basis ๐๏
Let \(t \in \mathbb{R}\) be a scalar quantity (e.g., position, width, or height). The sinusoidal encoding \(\mathbf{f}(t) \in \mathbb{R}^d\) is defined by:
where:
\(r = N^{2/d}\) is a frequency base (default: \(N = 10000\))
\(d\) is the number of embedding dimensions for the feature (default: \(d = 32\))
Each encoded feature (position, width, amplitude) gets its own \(d\)-vector
Then the full vector for one symbolic segment becomes:
These vectors are computed for each letter (A, B, โฆ, Z) and grouped accordingly.
This encoding maps any scalar value \(t\) (e.g., โฑ๏ธ, ๐, ๐ถ) onto periodic functions. Due to the nature of sine and cosine, this representation is:
translation-equivariant for local displacements (relative order and spacing are preservd),
periodic, so absolute positions wrap with ambiguity (exact localization may be lossy).
The key mathematical identity is:
\[f(t + \Delta t) = \mathrm{diag}(f(\Delta t)) \cdot f(t) \]
>
>
> ๐ **shifting** a position $t$ by $\Delta t$ corresponds to a **linear transformation** of its embedding.
>
> โ ๏ธ To enable **invertibility**, we restrict $x_0$ within a known range $[0, L]$ with resolution determined by $N$ (current implementation), or add explicit absolute anchor
๐ตใฐ๏ธใฐ๏ธโช๏ธใฐ๏ธใฐ๏ธใฐ๏ธ๐ด
### 5.2 Decoding implementation ๐๏ธ
- **Encoding**: $t \mapsto [\sin(t/r_k), \cos(t/r_k)]_{k=0}^{d/2 - 1}$
- **Decoding**:
- Convert $\sin$, $\cos$ pairs into $z_k = \cos + i\sin = e^{ix/r_k}$
- Unwrap $\angle(z_k)$ โ gives $\theta_k \approx t/r_k$
- Fit $t$ via least-squares:
```{math}
x\_i = \frac{\sum\_k \theta\_{ik} \cdot \frac{1}{r\_k}}{\sum\_k \left(\frac{1}{r\_k}\right)^2}
Robust, differentiable, and avoids scalar-local minima traps.
Four decoders have been implemented ๐ง:
Method
Description
Stability
'least_squares'
Fast, phase-unwrapped projection
โ Excellent
'svd'
SVD-regularized LSQ for robust inversion
โ Excellent
'optimize'
Scalar optimization (slow, fragile)
โ Unstable
'naive'
Mean of phase-projected values (quick + dirty)
โ Wrong shifts
Rules of Thumb ๐ง:
Option
Action
Effect
Use scaling
Normalize input to
[0, 10]
Accurate decoding for wide range
Reduce
N
Use e.g.
N = 1000
Higher range support
๐ทใฐ๏ธใฐ๏ธ๐ทใฐ๏ธใฐ๏ธใฐ๏ธ๐ท
5.3 sinencode_dna()
โ Letter-wise Sinusoidal Encoder ๐ก๏
Encodes all symbolic segments at selected scale(s) into sinusoidal vectors, grouped by letter (A
, Z
, B
, etc.).
dna.sinencode\_dna(scales=[4], d\_part=32)
๐ง Stored outputs:
self.code_embeddings_grouped
:{ 4: { "A": np.ndarray (n\_A, 96), "Z": np.ndarray (n\_Z, 96), ... } }
self.code_embeddings_meta
: Metadata required for reconstruction:{ "sampling\_dt": 0.1, "x\_label": "RT", "x\_unit": "min", "y\_label": "Intensity", "y\_unit": "a.u.", "name": "GC-MS peak trace", "scales": [4], "d\_part": 32, "N": 10000 }
๐ถใฐ๏ธใฐ๏ธ๐ถใฐ๏ธใฐ๏ธใฐ๏ธ๐ถ
5.4 sindecode_dna(...)
โ Static Decoder to DNAsignal ๐๏
Reconstructs a new DNAsignal
instance from sinusoidal embeddings:
reconstructed = DNAsignal.sindecode\_dna(
grouped\_embeddings = dna.code\_embeddings\_grouped,
meta\_info = dna.code\_embeddings\_meta
)
๐งฌ Returns a complete DNAsignal
object with:
reconstructed
codes[scale]
dictionaries:letters
,widths
,heights
,iloc
,xloc
,dx
empty signal (since waveform cannot be recovered from symbol encoding alone)
๐ง Ideal for:
Embedding symbolic sequences for AI/ML workflows
Comparing motifs without repeating long letters
Visualizing symbolic structure in latent spaces
โญใฐ๏ธใฐ๏ธโญใฐ๏ธใฐ๏ธใฐ๏ธโญ
5.5 Summary and error estimation \(\varepsilon = |\hat{t} - t|\) ๐ฌ๏
Each scalar \(t\) (like \(x_0\) or \(\Delta x\)) is encoded as:
with \(r = N^{2/d}\), typically \(N = 10000\), and \(d \sim 32\).
In decoding, we estimate \(t\) by averaging multiple phase inversions:
Let \(L\) be the maximum span of \(t\) values to encode (e.g., total signal length), and \(d\) the embedding size (e.g., 32). Then:
For \(k=0\) (highest freq), \(\text{period}_0 \sim 2\pi\)
For \(k = d/2 - 1\), \(\text{period}_k \sim 2\pi N\)
So the resolution behaves like:
where \(N\) is the frequency base and \(L\) is the range of \(t\) values being encoded (e.g., max segment length or signal length)
Feature |
Value |
---|---|
Error scales |
\(\varepsilon \sim L / N\) |
Depends on |
Signal span \(L\), base \(N\) |
Tunable by |
Increasing \(d\) or \(N\) |
Accuracy |
Typically \(<0.1%\) %of signal range (\(L=500\) and \(N=10^4\) gives \(\varepsilon \approx \frac{500}{10000} = 0.05\) ) |
Robustness |
Stable across most morphologies |
The errors are acceptable for:
Motif alignment
Classifiers
Density maps
Latent embeddings
๐ 6| Baseline Filtering and Poisson Noise Rejection๏
The Ricker wavelet \(\psi_s(t)\) used in
sig2dna
is mathematically the second derivative of a Gaussian kernel. As such, applying the Continuous Wavelet Transform (CWT) with \(\psi_s(t)\) is equivalent to performing a second-order differentiation of the signal \(x(t)\) followed by a Gaussian smoothing, where the scale parameter \(s\) controls the bandwidth.This structure makes the CWT intrinsically robust to low-frequency noise, baseline drifts, and stationary random noise (such as column bleeding in GC). Moreover, the symmetry of \(\psi_s(t)\) ensures suppression of linear trends, enhancing signal clarity without distorting peak structures.
For ideal Gaussian-shaped peaks, the optimal CWT response is obtained when the scale \(s\) matches the peakโs width at its inflection points, which corresponds to half-height for a Gaussian. This is where the symbolic motif
YAZB
is most cleanly detected.However, on real-life signals, maximizing noise rejection by increasing \(s\) can blur peak details. Preserving the morphological fidelity of peaks while ensuring their detectability requires operating near the optimal scale, not beyond it. To this end,
sig2dna
integrates a robust preprocessing methodology tailored for signals acquired through accumulation or integration (i.e., counting statistics), such as total ion counts in mass spectrometry or spectroscopic intensities.
Step 1 โ Median Baseline Subtraction ๏น๐๏น๏
Let \(x(t)\) be the input signal. We compute a moving median over a window of width \(w\):
Then, apply a non-negative correction:
๐ปโ๐ผโ๐ฝโ๐พ๐ฟ
Step 2 โ Poisson Noise Estimation โถ๏ธ แแ||แ|แ|||| |๏
From the baseline-corrected signal \(x_b(t)\):
Compute the local mean \(\mu(t)\) and standard deviation \(\sigma(t)\) using a uniform filter.
Estimate the coefficient of variation:
Assuming Poisson noise, infer the local Poisson parameter:
๐ปโ๐ผโ๐ฝโ๐พ๐ฟ
Step 3 โ BienaymรฉโTchebychev Thresholding ๐๏ธ๏
To reject noise, use a threshold \(T(t)\) derived from \(\lambda(t)\):
Filtered signal is then:
๐งช 7| Synthetic Signal Generation๏
Synthetic signals are modeled as a sum of Gaussian/Lorentzian/Triangle peaks. For Gaussian, they read
where:
\(h_i\): peak height
\(\mu_i\): center
\(w_i\): peak width (calibrated to Full Width Half Maximum)
This is used to:
Reconstruct symbolic segments
Generate artificial mixtures
Simulate motifs for clustering or ML training
Parses sequences into
YAZB
motif candidates (Mass spectra)
๐ฆ 8| Available Classes๏
Module sig2dna_core.signomics.py
Class Name |
Description |
---|---|
|
Peak shape generator: Gaussian, Lorentzian, triangle |
|
Peak library with synthesis, parameter control, and arithmetic operations |
|
1D signal class with plotting, peak summation, transformations, and noise |
|
Wrapper for multi-signal analysis: mean, sum, scaling, alignment, synthesis |
|
Symbolic sequence class with entropy, motif search, edit distances |
|
Symbolic encoding/decoding from signals (DNA-like) |
|
Tools for clustering, dimensionality reduction, dendrograms, visual metrics |
|
Wrapper for 2D, nD DNAsignals |
|
Encoder/decoder for symbolic and numeric data using sinusoidal projections |
|
Dictionary-like symbolic representation of triplet codes (letter, width, height) |
|
Dictionary-based encoder for resolution-based symbolic repetition |
Class Inheritance Diagram
graph TD; DNACodes DNAFullCodes DNApairwiseAnalysis DNAsignal DNAsignal_collection DNAstr SinusoidalEncoder generator peaks signal signal_collection UserDict --> DNACodes dict --> DNAFullCodes list --> DNAsignal_collection list --> signal_collection object --> DNApairwiseAnalysis object --> DNAsignal object --> SinusoidalEncoder object --> generator object --> peaks object --> signal str --> DNAstr
๐ 9| Example Workflow๏
from signomics import DNAsignal
# Load and encode
D = DNAsignal(S, encode=True)
D.encode\_dna()
D.encode\_dna\_full()
# Visualize
D.plot\_codes(scale=4)
# Entropy and distances
entropy = D.get\_entropy(scale=4)
analysis = DNAsignal.\_pairwiseEntropyDistance([D1, D2, D3], scale=4)
๐ 10| Visualization๏
signal.plot()
,signal_collection.plot()
: plot signalsDNAsignal.plot_signals()
: Original + CWT overlayDNAsignal.plot_transforms()
: plot transformed signals a collection of signalsDNAsignal.plot_codes(scale=4)
: Colored triangle segmentsDNAstr.plot_mask
: plot alignment maskDNAstr.plot_alignment
: plot aligned codes as reconstructed signalsDNApairwiseAnalysis.plot_dendrogram()
,scatter3d(), scatter(), heatmap
,dimension_variance_curve
: Cluster and distance views
๐ 11| Motif Detection๏
Pattern search: ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท
listPat=D.codes[4].find("YAZB")
listPat[0].to\_signal().plot() # show the first match as a signal
Extract and plot motifs: โโโโโโโโ
D.codesfull[4].extract\_motifs("YAZB", minlen=4, plot=True)
๐ค 12| Alignment๏
โด Fast symbolic alignment:โ๏ธโฑ๏ธ
D1.codes[4].align(D2.codes[4], engine="bio")
D1.codes[4].wrapped\_alignment()
D1.html\_alignment()
D1.plot\_alignment()
๐งช 13| Examples (unsorted)๏
from sig2dna\_core.signomics import peaks, signal\_collection, DNAsignal
# 1. Peak creation and basic signals ๐๏ธ
p = peaks()
p.add(x=10, w=2, h=1)
p.add(x=20, w=2, h=1)
s = p.to\_signal()
s.plot()
# 2. Signal collection ๐๏ธ
s\_noisy = s.add\_noise("gaussian", scale=0.01, bias=5)
s\_scaled = s * 0.5
coll = signal\_collection(s, s\_noisy, s\_scaled)
s\_mean = coll.mean()
s\_mean.plot(label="Mean")
# 3. Synthetic mixtures ๐ฅฃ
S, pS = signal\_collection.generate\_synthetic(n\_signals=12, n\_peaks=1, ...)
Sfull = S.mean()
dna = DNAsignal(Sfull)
dna.compute\_cwt()
dna.encode\_dna\_full()
dna.plot\_codes(scale=4)
# 4. Alignment of encoded sequences ๐งฌ๐งฌ
A = dna.codesfull[4]
B = dna.codesfull[2]
A.align(B)
A.html\_alignment()
A.plot\_alignment()
# 5. Extract motifs (e.g., YAZB segments โ๏ธ
pA = A.find("YAZB")
pAs = signal\_collection(*[s.to\_signal() for s in pA])
pAs.plot()
# 6. Classification from mixtures ๐
Smix, pSmix, idSmix = signal\_collection.generate\_mixtures(...)
dnaSmix = Smix.\_toDNA(scales=[1,2,4,8,16,32])
# 7. Excess entropy distance & clustering ๐ฒ
D = DNAsignal.\_pairwiseEntropyDistance(dnaSmix, scale=4, engine="bio")
D.name = "Excess Entropy"
D.dimension\_variance\_curve()
D.select\_dimensions(10)
D.plot\_dendrogram()
D.scatter3d(n\_clusters=5)
# 8. Jaccard motif distance โ๏ธ
J = DNAsignal.\_pairwiseJaccardMotifDistance(dnaSmix, scale=4)
J.name = "YAZB Jaccard"
J.dimension\_variance\_curve()
J.select\_dimensions(10)
J.plot\_dendrogram()
J.scatter3d(n\_clusters=5)
๐ฆ 14| Installation๏
The sig2dna
toolkit is composed of two core modules that must be used together:
๐งฉ Module |
Description |
---|---|
๐งฌ |
Core module implementing symbolic transformation, wavelet coding, and signal comparison (compact code, >7 Klines) |
๐จ๏ธ |
Utility module for saving and exporting Matplotlib figures (PDF, PNG, SVG) |
Recommended File Structure ๐ ๏
For simplicity and consistency, it is recommended to use both modules from a local subfolder (e.g., sig2dna_core
) within your working directory. You can clone or place the source files accordingly:
๐ sig2dna/ <- your working directory
โ
โโโ ๐ sig2dna\_core/ <- folder for core modules
โ โโโ ๐จ๏ธ figprint.py <- figure saving utilities
โ โโโ ๐งฌ signomics.py <- main symbolic signal processing module (>4 Klines)
โ
โโโ ๐ sig2dna\_tools/ <- folder for tools (not included in this release)
โ
โโโ ๐ images/ <- output folder for saved figures (PDF, PNG, SVG)
โ
โโโ ๐ yourscript.py <- your script using sig2dna\_core modules
โ
โโโ ๐ test\_signomics.py <- minimal test and plotting script
โโโ ๐ casestudy\_signomics.py <- in-depth classification and clustering example
โโโ ๐ LICENSE
โโโ ๐ README.md
Import Example ๐ฅ๏
In your scripts, import the components directly:
from sig2dna\_core.signomics import peaks, signal\_collection, DNAsignal
Dependencies ๐ฆ๏
The project relies only on standard scientific Python libraries and a few well-known optional packages. All can be installed with conda
or pip
:
conda install pywavelets seaborn scikit-learn
conda install -c conda-forge python-Levenshtein biopython
Or using pip
:
pip install PyWavelets seaborn scikit-learn python-Levenshtein biopython
โ No installation script is needed; simply place the module files in your working directory and ensure the structure above is respected.
๐ก15| Recommendations๏
Strategy for 2D or Multi-modal Chromatography ๐งญ๏
For 2D chromatographic systems, such as GCรGC or LCรLC, or in workflows combining retention time and mass detection, we suggest the following dual encoding strategy:
Along the retention axis: perform symbolic encoding of TIC (Total Ion Current) or a selected ion trace, to track retention-based morphology.
Along the \(m/z\) axis: use time-averaged spectra to encode mass distribution patterns, capturing molecular-level information.
๐ This combined coding captures both substance separation and substance identity, improving both detection (peak finding) and quantification.
๐ฏ Starting from version \(0.45\), 2D signals are handled natively with the class
DNAsignal_collection
. Look at the detailed tutorial ``
Substance Identification and Library Matching ๐๏
sig2dna
includes signal reconstruction capabilities from the symbolic code, allowing for approximate substance identification against reference libraries.
However, when precise identification is required:
โ It is preferable to transform the mass spectra of reference substances using
sig2dna
and compare them directly to the coded signal.
This enables symbol-level matching, which is more robust to noise, shifts, and peak distortion than traditional numerical similarity or library lookup.
๐ | License๏
MIT License โ 2025 Olivier Vitrac
๐ง | Contact๏
Author: Olivier Vitrac Contact: olivier.vitrac@gmail.com Version: 0.43 (2025-06-13)
Sig2dna
is part of the Generative Simulation initiative ๐ฑ: building modular, interpretable AI-ready tools for scientific modeling.
๐ |Appendices๏
๐งฉ | Methods Table (module: sig2dna.signomics
)๏
Class |
Method |
Docstring First Paragraph |
# Lines |
version |
---|---|---|---|---|
(module-level) |
|
Import a module by name from a file path relative to the calling module. |
37 |
0.51 |
|
|
Initialize self. See help(type(self)) for accurate signature. |
4 |
0.51 |
|
|
Plot method for DNACodes: visualizes encoded vectors or symbolic segment distribution. |
57 |
0.51 |
|
|
Decode sinusoidally embedded codes grouped by letter into symbolic segment structure. |
24 |
0.51 |
|
|
Encode symbolic segments at each scale using sinusoidal encoding grouped by letter. |
22 |
0.51 |
|
|
Print the number of encoded vectors or symbolic segments per scale. |
12 |
0.51 |
|
|
Initialize self. See help(type(self)) for accurate signature. |
4 |
0.51 |
|
|
Return repr(self). |
2 |
0.51 |
|
|
Return str(self). |
2 |
0.51 |
|
|
Plot method for DNAFullCodes: visualizes encoded vectors or DNA string composition. |
53 |
0.51 |
|
|
Plot each letterโs encoded vector in the abstract embedding space (d-space). One curve per letter, one subplot per scale. |
38 |
0.51 |
|
|
Sindecode method |
11 |
0.51 |
|
|
Sinencode method โ encodes each letter in the DNAFullCodes as a set of sinusoidal embeddings. |
45 |
0.51 |
|
|
Return a brief summary of the full codes per scale. |
10 |
0.51 |
|
|
Assemble all encoded letter vectors into a matrix of shape (n_letters, d_model) for each scale. |
38 |
0.51 |
|
|
Initialize self. See help(type(self)) for accurate signature. |
10 |
0.51 |
|
|
Return repr(self). |
7 |
0.51 |
|
|
Return str(self). |
2 |
0.51 |
|
|
Determine optimal dimension by maximizing silhouette score. |
16 |
0.51 |
|
|
Assign cluster labels from linkage matrix. |
5 |
0.51 |
|
|
Compute hierarchical clustering. |
6 |
0.51 |
|
|
Computes the cumulative explained variance (based on pairwise distances) as a function of the number of dimensions used (from 1 to n-1). Optionally plots the curve and the point where the threshold (default 0.5) is reached. |
46 |
0.51 |
|
|
Returns cluster labels from hierarchical clustering. If not computed yet, computes linkage. |
20 |
0.51 |
|
|
Plot heatmap of pairwise distances. |
9 |
0.51 |
|
|
Load analysis from file. |
5 |
0.51 |
|
|
Perform Principal Coordinate Analysis (PCoA). |
7 |
0.51 |
|
|
Plot dendrogram from linkage matrix. |
14 |
0.51 |
|
|
Recompute distances on selected subspace. |
4 |
0.51 |
|
|
Save current analysis to file. |
4 |
0.51 |
|
|
2D scatter plot in selected dimensions with optional cluster-based coloring. |
40 |
0.51 |
|
|
3D scatter plot in selected dimensions with optional cluster-based coloring. |
38 |
0.51 |
|
|
Update active dimensions. |
8 |
0.51 |
|
|
Initialize DNAsignal with a signal object or 1D array. |
58 |
0.51 |
|
|
Return repr(self). |
6 |
0.51 |
|
|
Return str(self). |
2 |
0.51 |
|
|
Determine letter based on monotonicity and signal range. |
31 |
0.51 |
|
|
returns the triangle (counter-clockwise, x are incr) |
27 |
0.51 |
|
|
Calculate excess-entropy pairwise distances. |
52 |
0.51 |
|
|
Compute pairwise Jaccard distances based on motif presence across symbolic DNAstr sequences. |
102 |
0.51 |
|
|
Calculate pairwise Jensen-Shannon distances between DNAstr codes at a given scale. |
45 |
0.51 |
|
|
Compute pairwise Levenshtein distances between codes at a given scale. |
61 |
0.51 |
|
|
Align symbolic sequences and compute mutual entropy. |
27 |
0.51 |
|
|
Apply baseline filtering using moving median and local Poisson-based thresholding. |
58 |
0.51 |
|
|
Compute Continuous Wavelet Transform (CWT) using the Mexican Hat wavelet. |
46 |
0.51 |
|
|
Encode each transformed signal into a symbolic DNA-like sequence of monotonic segments. |
67 |
0.51 |
|
|
Convert symbolic codes into DNA-like strings by repeating letters proportionally to their span. |
78 |
0.51 |
|
|
return the entropy of a string |
5 |
0.51 |
|
|
Find occurrences of a specific letter pattern in encoded sequence. |
9 |
0.51 |
|
|
Retrieve encoded data for a specific scale. |
3 |
0.51 |
|
|
Calculate Shannon entropy for encoded signal. |
5 |
0.51 |
|
|
Check if a DNA encoding exists for the specified scale. |
22 |
0.51 |
|
|
Normalize the internal signal using one of several strategies that ensure positivity. |
18 |
0.51 |
|
|
Plot the symbolic DNA-like encoding as colored triangle segments. |
77 |
0.51 |
|
|
Plot a scalogram with two subplots: - Top: colored image of CWT coefficient amplitudes - Bottom: line curves of selected scales |
34 |
0.51 |
|
|
Plot signals. |
13 |
0.51 |
|
|
Plot the stored CWT-transformed signals as a signal collection. |
13 |
0.51 |
|
|
print aligned sequences |
11 |
0.51 |
|
|
Approximate signal reconstruction via pseudo-inverse using stored CWT coefficients. |
64 |
0.51 |
|
|
Fast reconstruction of aligned signals |
9 |
0.51 |
|
|
Reconstruct the signal from symbolic features (e.g., YAZB). |
31 |
0.51 |
|
|
Decode sinusoidal grouped embeddings into a DNACodes structure. |
27 |
0.51 |
|
|
Encode |
15 |
0.51 |
|
|
๐ Encode full-resolution DNA-like strings into sinusoidal embeddings grouped by letter. |
37 |
0.51 |
|
|
Sparsify CWT coefficients by zeroing values below a threshold. |
58 |
0.51 |
|
|
Generate flexible synthetic signals. (obsolete) |
9 |
0.51 |
|
|
Reconstruct an approximate signal from symbolic encodings. |
46 |
0.51 |
|
|
Initialize a DNAsignal_collection from DNAsignal instances. |
41 |
0.51 |
|
|
Return a readable summary of the DNAsignal_collection contents. Shows the number of signals, available letters, scales, and embedding dimension. Flags whether sinusoidal encoding has been performed. |
14 |
0.51 |
|
|
Return str(self). |
2 |
0.51 |
|
|
Combine unwrapped embeddings across all signals for each scale. |
30 |
0.51 |
|
|
Perform dimensionality reduction on the 3D tensor v_{t,m,d} to extract non-coeluted compound chromatograms using PCA, with optional plotting. |
138 |
0.51 |
|
|
Plot the encoded signals in subplots. Rows represent letters, columns represent scales. Each subplot contains overlaid colored curves from all signals. |
65 |
0.51 |
|
|
Plot embedding projections of the encoded signals using PCA (default) or other DR methods. |
72 |
0.51 |
|
|
Plot a heatmap of the letter codes (symbolic DNA) across all signals. |
48 |
0.51 |
|
|
Plot the components E_symbol, PE_t, PE_m and their sum v_{t,m} as image matrices for each letter at a given scale. |
56 |
0.51 |
|
|
Visualize the components and sum of the full encoded GC-MS signal at a given scale. |
66 |
0.51 |
|
|
Apply dimensionality reduction (PCA or UMAP) across signals for each scale. |
39 |
0.51 |
|
|
Normalize embeddings across all signals and all scales. |
24 |
0.51 |
|
|
๐ Encode all DNAsignal instances using full-resolution sinusoidal embeddings, grouped by letter and organized per scale. |
22 |
0.51 |
|
|
Export combined embeddings for all scales as a tidy pandas DataFrame, suitable for machine learning tasks. |
29 |
0.51 |
|
|
Concatenate two DNAstr instances with identical dx values. |
27 |
0.51 |
|
|
Check equality based on symbolic content and dx resolution. |
15 |
0.51 |
|
|
Return hash combining the string content and dx. |
10 |
0.51 |
|
|
Construct a new DNAstr object. |
42 |
0.51 |
|
|
Return a short technical representation of the DNAstr instance. |
15 |
0.51 |
|
|
String representation. |
11 |
0.51 |
|
|
Subtract two DNAstr sequences by aligning and removing matched regions. |
20 |
0.51 |
|
|
Returns True if ther terminal supports colors |
4 |
0.51 |
|
|
Align this DNAstr sequence to another, allowing insertions/deletions to maximize matches. |
158 |
0.51 |
|
|
Compute the excess Shannon entropy of two DNAstr sequences H(A)+H(B)-2*H(AB) |
3 |
0.51 |
|
|
Extract and analyze YAZB motifs (canonical and distorted) from the symbolic sequence. |
68 |
0.51 |
|
|
Finds all fuzzy (or regex-based) occurrences of a DNA-like sequence pattern. |
42 |
0.51 |
|
|
Render the alignment using HTML with color coding: - green: match - blue: gap - red: substitution |
24 |
0.51 |
|
|
Compute the Jaccard distance between two DNAstr sequences. |
19 |
0.51 |
|
|
Compute the Jensen-Shannon distance between self and another DNAstr. |
24 |
0.51 |
|
|
Compute the Levenshtein distance between this DNAstr and another one. |
38 |
0.51 |
|
|
Compute the Shannon mutual entropy of two DNAstr sequences from their aligned segments |
12 |
0.51 |
|
|
Plot a block alignment view of two DNAstr sequences with color-coded segments. |
62 |
0.51 |
|
|
Plot a color-coded mask of the alignment between sequences. |
20 |
0.51 |
|
|
Return an alignment score, optionally normalized. |
17 |
0.51 |
|
|
Summarize the DNAstr with key stats: length, unique letters, entropy, etc. |
19 |
0.51 |
|
|
Converts the symbolic DNA sequence into a synthetic NumPy array mimicking the original wavelet-transformed signal. |
54 |
0.51 |
|
|
Map the DNAstr content to an integer array using a codebook. |
21 |
0.51 |
|
|
Return a line-wrapped alignment view (multi-line), optionally color-coded for terminal/IPython usage (Spyder, Jupyter). |
48 |
0.51 |
|
|
SinuosidalEncoder Constructor |
22 |
0.51 |
|
|
Least-squares phase unwrapping decoder (robust fast decoder). |
29 |
0.51 |
|
|
Simple decoder based on mean projection (coarse and periodic). |
19 |
0.51 |
|
|
Decode each embedding vector via scalar minimization. |
35 |
0.51 |
|
|
SVD-based phase unwrapping decoder with regularized least-squares. |
33 |
0.51 |
|
|
Inverse sinusoidal encoding (approximate). |
23 |
0.51 |
|
|
Parse values |
10 |
0.51 |
|
|
Reconstruct types |
12 |
0.51 |
|
|
Project scalar values into sinusoidal embedding space. |
20 |
0.51 |
|
|
Compute angular differences (โฮธ) between consecutive embeddings. |
17 |
0.51 |
|
|
Compute distance between two encoded arrays using complex projection. |
28 |
0.51 |
|
|
Decode sinusoidal embeddings back to original values using selected method. |
42 |
0.51 |
|
|
Encode input values into sinusoidal embeddings. |
31 |
0.51 |
|
|
Fit a scaling factor to normalize values into a target sinusoidal-safe range. |
24 |
0.51 |
|
|
Compute the centroid (average embedding) of each group in complex sinusoidal space. |
43 |
0.51 |
|
|
Compute a pairwise similarity (or distance) matrix in sinusoidal embedding space. |
27 |
0.51 |
|
|
Align embedding |
23 |
0.51 |
|
|
Perform Fourier-like phase unwrapping on sinusoidal embedding. |
24 |
0.51 |
|
|
Set maximum acceptable residual error for decoding. |
10 |
0.51 |
|
|
Decode sinusoidal embeddings grouped by letter into symbolic segments. |
41 |
0.51 |
|
|
Encode symbolic code segments grouped by letter into sinusoidal embeddings. |
36 |
0.51 |
|
|
Convert sinusoidal embedding into a complex array using Eulerโs identity. |
18 |
0.51 |
|
|
Perform an encode โ decode โ compare roundtrip and report accuracy. |
56 |
0.51 |
|
|
Call self as a function. |
12 |
0.51 |
|
|
define a peak generator gauss/lorentz/triangle |
3 |
0.51 |
|
|
Return repr(self). |
2 |
0.51 |
|
|
8 |
0.51 |
|
|
|
14 |
0.51 |
|
|
|
11 |
0.51 |
|
|
|
Initialize a peak collection. |
51 |
0.51 |
|
|
2 |
0.51 |
|
|
|
14 |
0.51 |
|
|
|
Return repr(self). |
20 |
0.51 |
|
|
Return str(self). |
6 |
0.51 |
|
|
3 |
0.51 |
|
|
|
Add one or multiple peaks to the collection. |
40 |
0.51 |
|
|
Return the list of peaks as dict |
3 |
0.51 |
|
|
Return a deep-copy of the peaks |
5 |
0.51 |
|
|
Return the list of names |
3 |
0.51 |
|
|
10 |
0.51 |
|
|
|
5 |
0.51 |
|
|
|
Sort peaks in-place based on their center positions (x values). |
17 |
0.51 |
|
|
Generate a signal from a peaks object. Optionally restrict to a subset. |
8 |
0.51 |
|
|
Update or insert peaks from a list of dictionaries. |
33 |
0.51 |
|
|
1 |
0.51 |
|
|
|
1 |
0.51 |
|
|
|
1 |
0.51 |
|
|
|
Initialize a signal instance. |
76 |
0.51 |
|
|
1 |
0.51 |
|
|
|
1 |
0.51 |
|
|
|
1 |
0.51 |
|
|
|
Return repr(self). |
21 |
0.51 |
|
|
Return str(self). |
5 |
0.51 |
|
|
1 |
0.51 |
|
|
|
1 |
0.51 |
|
|
|
Binary operation on signals |
10 |
0.51 |
|
|
Static copy of a signal (x and y only) |
4 |
0.51 |
|
|
6 |
0.51 |
|
|
|
Register a traceable action in the signalโs history. |
23 |
0.51 |
|
|
Convert a serialized dict to a signal |
24 |
0.51 |
|
|
Return a DNA encoded signal |
9 |
0.51 |
|
|
Convert the signal into a dictionary suitable for JSON export. |
19 |
0.51 |
|
|
Return a new signal with noise and/or bias added. |
33 |
0.51 |
|
|
Align two signals to a common x grid with interpolation and padding. |
35 |
0.51 |
|
|
Apply a baseline filter assuming Poisson-dominated statistics with adjustable gain and a rejection threshold based on the Bienaymรฉ-Tchebychev inequality. |
69 |
0.51 |
|
|
Backup current state in _previous |
8 |
0.51 |
|
|
Deep copy of the signal, excluding full history control flag |
24 |
0.51 |
|
|
Disable full history tracking |
5 |
0.51 |
|
|
Enable full history tracking |
4 |
0.51 |
|
|
Load a signal from a JSON or gzipped JSON file, including recursive _previous. |
23 |
0.51 |
|
|
Normalize the signal to positive values using different normalization strategies. |
76 |
0.51 |
|
|
Plot the signal using matplotlib, applying either internal style settings or overrides provided at call time. |
72 |
0.51 |
|
|
Restore the previous signal version if available |
11 |
0.51 |
|
|
Interpolate values from x |
3 |
0.51 |
|
|
Save signal to JSON (optionally compressed) and optionally CSV. |
36 |
0.51 |
|
|
overrride + |
14 |
0.51 |
|
|
Delete self[key]. |
7 |
0.51 |
|
|
sc[โmy_signalโ] # returns a copy of the signal named โmy_signalโ sc[โAโ, โBโ, โCโ] # returns a sub-collection with those names sc[0:2] or sc[[0, 2]] # still works for index-based access |
19 |
0.51 |
|
|
override += |
3 |
0.51 |
|
|
Initialize collection with aligned signals of the same type. |
30 |
0.51 |
|
|
override * |
6 |
0.51 |
|
|
override + |
5 |
0.51 |
|
|
Return repr(self). |
4 |
0.51 |
|
|
override * |
6 |
0.51 |
|
|
Set self[key] to value. |
5 |
0.51 |
|
|
Return str(self). |
2 |
0.51 |
|
|
11 |
0.51 |
|
|
|
Return a collection of DNA encoded signals (class DNAsignal_collection) |
9 |
0.51 |
|
|
Append and align the new signal to the existing collection. |
5 |
0.51 |
|
|
Mean of selected signals, optionally weighted. |
23 |
0.51 |
|
|
Plot selected signals with style attributes and optional overlays. |
75 |
0.51 |
|
|
Sum selected signals, optionally weighted by coeffs. |
41 |
0.51 |
|
|
Return a 2D array (n_signals x n_points) of aligned signal values. |
3 |
0.51 |
๐งฉ | Methods Table (module: sig2dna.PrintableFigure
)
Class |
Method |
Docstring First Paragraph |
# Lines |
version |
---|---|---|---|---|
(module-level) |
|
Generate a cleaned filename using figure metadata or the current datetime. |
19 |
|
(module-level) |
|
Generic saving logic for figure files. |
27 |
|
(module-level) |
|
Override |
9 |
|
(module-level) |
|
Override |
10 |
|
(module-level) |
|
Check if |
10 |
|
(module-level) |
|
Save a figure as PDF, PNG, and SVG. |
25 |
|
(module-level) |
|
Save a figure as a PDF. Example: โโโ >>> fig, ax = plt.subplots() >>> ax.plot([0, 1], [0, 1]) >>> print_pdf(fig, โmyplotโ, overwrite=True) |
10 |
|
(module-level) |
|
Save a figure as a PNG. |
9 |
|
(module-level) |
|
Save a figure as an SVG (vector format, dpi-independent). |
9 |
|
|
|
Save figure in PDF, PNG, and SVG formats. |
10 |
|
|
|
2 |
||
|
|
2 |
||
|
|
2 |
||
|
|
Display figure intelligently based on context (Jupyter/script). |
15 |