# ๐ก๐งฌ sig2dna
**Symbolic Signal Transformation for Fingerprinting, Alignment, and AI-Based Classification**
๏น ๏ฎฉูจู๏ฎฉ๏ฎฉูจู๏ฎฉูจู๏ฎฉ๏ฎฉูจู๏น๏น
>`sig2dna` is a Python module that transforms **complex 1D, 2Dโฆ analytical signals** into **DNA-like symbolic sequences** via morphological encoding. These symbolic fingerprints enable fast alignment, motif recognition, classification, and high-throughput comparison of signals originating from:
>
>- `GC-MS` / `GC-FID` - ๐low and๐ฌhigh resolution
>- `HPLC-MS` - ๐low and๐ฌhigh resolution
>- `NMR` / `FTIR` / `Raman` / `RX`
It supports **large-scale applications** such as identifying unknown substances in โป๏ธ recycled materials or mixtures containing **NIAS** (*Non-Intentionally Added Substances*). ๐๏ธ Symbolic compression (up to 95%+) enables scalable storage and alignmentโand seamless integration with **Large Language Models (LLMs)**.

๐จ Credits: Olivier Vitrac
> ๐ This approach was developed and tested as part of the PhD thesis:
>
> **Julien Kermorvant**, *"Concept of chemical fingerprints applied to the management of chemical risk of materials, recycled deposits and food packaging"*, AgroParisTech. 2023.
> https://theses.hal.science/tel-04194172
๐ก **Note for ND-signals and multi-detector or multi-technique data**
`sig2dna` natively supports signal alignment and comparison across heterogeneous sources. This makes it particularly suited for **ND-signals** (*non-destructive signals*) or **aggregated data** from different detectors or acquisition techniques.
โก๏ธ You can seamlessly concatenate or compare signals originating from **different instruments or modes** (e.g., UV + MS, GCรGC, LC-FTIR, etc.) โ the symbolic coding abstracts away intensity scales and detector-specific artifacts, focusing instead on **morphological motifs**.
> Additional recommendations for ๐ผ๏ธ 2D and ๐๏ธ multimodal acquisition systems are given at the end of this document ๐.
---
## ๐ Table of Contents
- [๐งฉ 1| **Main Components**](#-1-main-components)
- [๐ง 2| **Applications**](#-2-applications)
- [๐งฌ 3| **Core Concepts** - Overview](#-3-core-concepts---overview)
- [๐ง 4| **Entropy and Distance Metrics**](#-4-entropy-and-distance-metrics)
- [๐ 5 | **Sinusoidal Encoding of Symbolic Segments**](#-5--sinusoidal-encoding-of-symbolic-segments)
- [๐ 6| **Baseline Filtering and Poisson Noise Rejection**](#-6-baseline-filtering-and-poisson-noise-rejection)
- [๐งช 7| **Synthetic Signal Generation**](#-7-synthetic-signal-generation)
- [๐ฆ 8| **Available Classes**](#-8-available-classes)
- [๐ 9| **Example Workflow**](#--9-example-workflow)
- [๐ 10| **Visualization**](#-10-visualization)
- [๐ 11| **Motif Detection**](#-11-motif-detection)
- [๐ค 12| **Alignment**](#-12-alignment)
- [๐งช 13| **Examples** (unsorted)](#-13-examples-unsorted)
- [๐ฆ 14| **Installation**](#-14-installation)
- [๐ก15| **Recommendations**](#15-recommendations)
- [๐ | License](#--license)
- [๐ง | Contact](#--contact)
- [๐ |Appendices](#-appendices)
------
## ๐งฉ 1| **Main Components**
| Class | Description |
| --------------------- | ------------------------------------------------------------ |
| `DNAsignal` | Encodes numerical signal as symbolic sequence (multi-scale wavelet transform) |
| `DNAstr` | A symbolic string with alignment, entropy, motif search, plotting, etc. |
| `DNApairwiseAnalysis` | Computes and visualizes distances, PCoA, clustering, scatter, dendrogram |
---
## ๐ง 2| **Applications**
- High-throughput chemical pattern recognition ๐, -หหโโโโโ
- NIAS tracking in complex matrices โฌ,๐โ ๏ธ
- AI-compatible signal fingerprints ๐ซ๐
- Classification of recycled material batches โป๏ธ ๐
- Detection of structural motif distortions (due to overlapping compounds) ๐ต๐ป
- AI-assisted quality control
- AI-assisted compliance testing ๐ฝ๏ธ
---
## ๐งฌ 3| **Core Concepts** - Overview
> **Morphology Encoding as a โGeneticโ Code**. Chemical signals are subjected to signal morphology encoding using continuous, symmetric wavelet transforms. The symbolic sequences are similar to genetic code. It is based on a limited number of letters, symbols appear grouped into motifs which behave like codons As a result, a motif table can be used to recognize n-upplets in ^1^H-NMR, mass spectra, retention times, etc.
>
> **Motif Recognition.** Searches of substances or typical patterns can be carried out via regular expressions or via transition probabilities (AโZ vs. ZโB vs. AโA) over a sliding window. All operations can be carried out in parallel for efficiency and automated treatment.
>
> **Scalable Machine-Learning**. High compression ratios enable the efficient storage of millions of chemical signatures.
### 3.1 **Input Signa**l โก๏ธ
- One-dimensional `signal` objects (NumPy-based)
- Supports synthetic and experimental sources
```python
S = signal.from_peaks(...)
```
๐ฅ
### 3.2 **Wavelet Transform** ใฐ
A **Mexican hat (Ricker)** wavelet is used:
```{math}
\psi_s(t) = \left(1 - \frac{t^2}{s^2}\right)e^{-\frac{t^2}{2s^2}}
```
The Continuous Wavelet Transform (CWT) of a signal $x(t)$ is:
```{math}
W_s(t) = x(t)*\psi_s(t) = \int x(\tau) \cdot \psi_s(t - \tau) \, d\tau
```
where $s$ is the scale parameter (typically powers of two, e.g., $s = 2^n$) and $*$ the convolution operator.
๐ช
### 3.3 **Relationship of $W\_s(t)$ with the second derivative** $x''(t)=\frac{\partial^2 x(t)}{\partial t^2}$
Applying the **Ricker wavelet (second derivative of a Gaussian**) to the signal $x(t)$ via convolution (i.e., CWT) is equivalent to computing the **second derivative of $x(t)$** smoothed by the Gaussian kernel $g\_s(t)$:
```{math}
W_s(t) = x(t)*\psi_s(t) = x(t) * g''(t) = x''(t) * g(t)
```
Click here for the demonstration
The **convolution** of $x(t)$ with the second derivative of $g(t)$ is:
$(x * g'')(t) = \int\_{-\infty}^{\infty} x(\tau) g''(t - \tau) \, d\tau$
We perform a **change of variable**: let $u = t - \tau$, so $\tau = t - u$ and $d\tau = -du$. This gives:
$(x * g'')(t) = \int\_{-\infty}^{\infty} x(t - u) g''(u) \, du$
Now consider the convolution of the **second derivative of $x(t)$** with $g(t)$:
$(x'' * g)(t) = \int\_{-\infty}^{\infty} x''(\tau) g(t - \tau) \, d\tau$
Again, perform the **change of variable** $u = t - \tau$, yielding:
$(x'' * g)(t) = \int\_{-\infty}^{\infty} x''(t - u) g(u) \, du$
*Now integrate by parts twice*:
* **First integration by parts** (assuming $g(u) \to 0$ and $x'(t - u) \to 0$ as $u \to \pm \infty$):
$\int x''(t - u) g(u) \, du = - \int x'(t - u) g'(u) \, du$
* **Second integration by part**s:
$-\int x'(t - u) g'(u) \, du = \int x(t - u) g''(u) \, du$
Which yields:
$(x'' * g)(t) = \int x(t - u) g''(u) \, du = (x * g'')(t)$
๐ฆ
### 3.4 **Symbolic Encoding** ๐ก
Each segment of the wavelet-transformed signal is encoded into one of the symbolic codes corresponding to the table of variation $\text{sign}\left(\frac{\partial}{\partial t}W\_s(t)\right)$ and $\text{sign}\left(W\_s(t)\right)$.
| Symbol: $\ell\_i$ | Variation | Description |
| ---------------- | :-------: | ----------------------------------------------- |
| A | -โ+ | Increasing crossing from โ to + (zero-crossing) |
| B | -โ- | Increasing negative |
| C | +โ+ | Increasing positive |
| X | +โ+ | Decreasing positive |
| Y | -โ\- | Decreasing negative |
| Z | +โ- | Decreasing crossing from + to โ (zero-crossing) |
| \_ | โโ | Flat or noise segment |
Each segment stores its `width`, `height`, and `position`.
> The full-resolution symbolic sequence is reconstructed by interpolating or repeating these symbols proportionally to their span. A quantitative pseudo-inverse is proposed to reconstruct chemical signals from their code.
๐ฉ
### 3.5 **Symbolic Compression** ๐๏ธ
Symbolic sequences can be compressed and encoded at full resolution via:
```python
dna.encode_dna()
dna.encode_dna_full(resolution="index")
```
Resulting in DNA-like sequences like:
```text
"YYAAZZBB_YAZB"
```
๐จ
### 3.6 **Structural Meaning** (e.g., YAZB Motif)
A single Gaussian peak transformed via the Ricker wavelet results in:
- Y: rising pre-lobe ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท
- A: left inflection (โ to + crossing)
- Z: right inflection (+ to โ crossing)
- B: trailing decay
The `YAZB` motif is a **symbolic map of the Ricker wavelet transform (CWT) of a Gaussian**. An alteration of the pattern reveals overlapping Gaussians , asymmetric signals or more generally interactions and interferences.
๐ง
### 3.7 **Interpretation When Gaussians Overlap** ๐โค
When two Gaussians overlap, especially at close proximity or with different amplitudes:
1. The **central peak becomes asymmetric**.
2. The **Ricker transform** becomes a **superposition** of two wavelets.
3. This causes:
- Emergence of **extra inflection points**,
- Distortion of the clean YAZB sequence,
- Insertion of intermediary motifs (e.g., repeating A-Z transitions, shortened or elongated lobes),
- Possible merging of YAZB motifs or partial truncation.
So **changes in the symbolic code structure** directly reflect **signal interference**, i.e., **nonlinearity in overlapping peaks**.
---
## ๐ง 4| **Entropy and Distance Metrics**
> `Sig2dna` implements several metrics to evaluate the similarity of coded chemical signals. Alignment is essential to compare them while respecting order. It is performed via global/local pairwise alignment using `difflib` or `Biopython`. Excess Entropy and Jensen-Shannon are best choices in the presence of complex mixtures by enabling the detection of small structural changes.
| Distance | Sensitive to | Based on | Alignment needed | Suitable for |
| ------------------ | ------------ | ------------------ | ---------------- | ------------------------------------- |
| **Excess Entropy** | Symbol order | Shannon entropy | โ
Yes | Structural motif similarity |
| **Jensen-Shannon** | Symbol usage | Probability dist. | โ No | Profile similarity (e.g., peak types) |
| **Levenshtein** | Edit steps | Insertion/Deletion | โ
Yes | Sequence-level variation |
| **Jaccard** | Pattern sets | Motif occurrences | โ No | Motif overlap, Motif density map |
### 4.1 **Shannon Entropy** โโโโโโ
Entropy provides a **robust, physics-informed metric** for morphological comparisons. For a symbolic sequence $X$, it reads:
```{math}
H(X) = -\sum_i p(\ell_i) \log_2 p(\ell_i)
```
where $p(\ell\_i)$ is the frequency of letter $l\_i$ in the sequence $X$.
Entropy $H$ is an extensive quantity verify additivity properties for independent sequences. Its value is accumulated between structured and low structured regions. Entropy is **invariant under translation and stable under small perturbations**, especially when using symbolic codes rather than raw intensities. This makes it ideal for comparing:
- Signals with **shifts in baseline**
- Morphologically similar but **intensity-scaled** signals
- **Partially distorted** sequences (e.g., from mixtures or degradation)
๐ด
### 4.2 **Aligned sequences and Excess Entropy Distance** โ๏ธ
Let $A$ and $B$ be two symbolic sequences (`DNAstr`) representing two signals. After alignment (*e.g.,* via global/local pairwise alignment using `difflib` or `Biopython`), we obtain:
- $\tilde{A}$: aligned version of $A$ (with possible gap insertions)
- $\tilde{B}$: aligned version of $B$
- $\tilde{A} * \tilde{B}$: a new sequence formed by pairing corresponding symbols (possibly with gaps)
Given sequences $A$ and $B$, the mutually exclusive information or excess entropy is defined as:
```{math}
D_{\text{excess}}(A, B) = H(A) + H(B) - 2 H(\tilde{A} * \tilde{B})
```
where:
- $H(A)$ and $H(B)$ are the Shannon entropies of the original sequences
- $H(\tilde{A} * \tilde{B})$ is the Shannon entropy of the aligned signal pairs (treated as "joint letters")
๐
### 4.3 **Jensen-Shannon Distance** โ๏ธ
Let $P$ and $Q$ be the **empirical frequency distributions** of symbolic letters in two DNA-like coded signals $A$ and $B$, respectively. That is:
* $P = {p\_\ell}$ where $p\_\ell = \frac{\text{count of symbol } \ell \text{ in } A}{|A|}$
* $Q = {q\_\ell}$ where $q\_\ell = \frac{\text{count of symbol } \ell \text{ in } B}{|B|}$
Let $M$ be the average distribution:
```{math}
M = \frac{1}{2}(P + Q)
```
Then, the **JensenโShannon distance** between $P$ and $Q$ is defined as:
```{math}
D_{\text{JS}}(P, Q) = \sqrt{ \frac{1}{2} D_{\text{KL}}(P \| M) + \frac{1}{2} D_{\text{KL}}(Q \| M) }
```
where $D\_{\text{KL}}$ is the Kullback-Leibler divergence:
```{math}
D_{\text{KL}}(P \| M) = \sum_\ell p_\ell \log_2 \left( \frac{p_\ell}{m_\ell} \right)
```
and $m\_\ell$ is the frequency of symbol $\ell$ in the average distribution $M$.
#### 4.3.1 **Interpretation** ๐ก
* The **JensenโShannon distance** quantifies **how different the symbol usage is** between two signals, **ignoring the order** in which the symbols appear.
* It is **bounded between 0 and 1**, symmetric, and always finite (even when some symbols are missing in one sequence).
* A value of **0** indicates identical symbol distributions, while **1** indicates completely disjoint symbol usage.
#### 4.3.2 **Use Cases** ๐งช
* **Robust against misalignment or noise**: two signals with similar overall composition but different positions will still score low JSD.
* **Useful for clustering** symbolic signals by type or composition, regardless of temporal structure.
* **Complementary to entropy or edit-based distances**, which capture positional or morphological changes.
๐ก
### 4.4 **Jaccard Motif Distance** ๐
The **Jaccard distance** measures the similarity between two symbolic signals by comparing the sets of **motifs** (short symbolic substrings) they contain, without requiring alignment. It is particularly suited for identifying common structural patterns across signals, regardless of their order or spacing.
Given two sequences $A$ and $B$, and a set of motifs $\mathcal{M}$ of length $k$ (typically 3โ5 characters), we define:
* $\mathcal{M}(A)$: set of motifs found in $A$
* $\mathcal{M}(B)$: set of motifs found in $B$
Then the **Jaccard distance** is defined as:
```{math}
D_{\text{Jaccard}}(A, B) = 1 - \frac{|\mathcal{M}(A) \cap \mathcal{M}(B)|}{|\mathcal{M}(A) \cup \mathcal{M}(B)|}
```
#### 4.4.1 Key Features:
* โ
**No alignment needed** โ motif presence is evaluated globally
* ๐ **Sensitive to local patterns** โ detects repeated or shared symbolic structures
* ๐ **Sparse and interpretable** โ suitable for heatmaps and clustering
#### 4.4.2 Implementation Notes:
* Motifs are extracted using a sliding window of fixed length (default: `k=4`)
* Symbol sequences are assumed to be from the encoded `DNAstr` outputs
* Motif sets are hashed to speed up large comparisons
* Jaccard scores are computed pairwise across a collection of symbolic sequences
This metric is especially useful when:
* You expect **common substructures** across signals
* Signals may differ in length or alignment is unreliable
* You want to create **density maps of motif usage** or explore **structural similarity clusters**
---
Here is the complete, cleanly formatted **README.md documentation section** for the new **sinusoidal encoder/decoder** functions added to `sig2dna`, including appropriate emojis and explanations:
------
## ๐ 5 | **Sinusoidal Encoding of Symbolic Segments**
`sig2dna` integrates a **transformer-style positional encoding** for symbolic segments, enabling conversion of **morphological features** into **fixed-size vectors**. This provides a compact, AI-ready representation of:
* โฑ๏ธ Position ($x\_0$)
* ๐ Width ($\Delta x$)
* ๐ถ Amplitude ($\Delta y$)
> ๐ก This mechanism replaces long repetitions of letters by a **numerically invertible vector encoding**, useful for clustering, attention-based models, or compressed storage.
### 5.1 Mathematical Basis ๐
Let $t \in \mathbb{R}$ be a scalar quantity (e.g., position, width, or height). The sinusoidal encoding $\mathbf{f}(t) \in \mathbb{R}^d$ is defined by:
```{math}
\begin{aligned}
f_{2k}(t) &= \sin\left(\frac{t}{r^k}\right), \\
f_{2k+1}(t) &= \cos\left(\frac{t}{r^k}\right),
\end{aligned}
\quad \text{for } k = 0, \dots, \frac{d}{2}-1
```
where:
* $r = N^{2/d}$ is a frequency base (default: $N = 10000$)
* $d$ is the number of embedding dimensions for the feature (default: $d = 32$)
* Each encoded feature (position, width, amplitude) gets its own $d$-vector
Then the full vector for one symbolic segment becomes:
```{math}
\mathbf{v} = [\mathbf{f}(x_0) \, \| \, \mathbf{f}(\Delta x) \, \| \, \mathbf{f}(\Delta y)]
\in \mathbb{R}^{3d}
```
These vectors are computed for each letter (A, B, ..., Z) and grouped accordingly.
> This encoding maps any scalar value $t$ (e.g., โฑ๏ธ, ๐, ๐ถ) onto periodic functions. Due to the nature of sine and cosine, this representation is:
>
> * **translation-equivariant** for local displacements (relative order and spacing are preservd),
> * **periodic**, so absolute positions wrap with ambiguity (exact localization may be lossy).
>
> The **key mathematical identity** is:
>
> ```{math}
> f(t + \Delta t) = \mathrm{diag}(f(\Delta t)) \cdot f(t)
>
```
>
>
> ๐ **shifting** a position $t$ by $\Delta t$ corresponds to a **linear transformation** of its embedding.
>
> โ ๏ธ To enable **invertibility**, we restrict $x_0$ within a known range $[0, L]$ with resolution determined by $N$ (current implementation), or add explicit absolute anchor
๐ตใฐ๏ธใฐ๏ธโช๏ธใฐ๏ธใฐ๏ธใฐ๏ธ๐ด
### 5.2 Decoding implementation ๐๏ธ
- **Encoding**: $t \mapsto [\sin(t/r_k), \cos(t/r_k)]_{k=0}^{d/2 - 1}$
- **Decoding**:
- Convert $\sin$, $\cos$ pairs into $z_k = \cos + i\sin = e^{ix/r_k}$
- Unwrap $\angle(z_k)$ โ gives $\theta_k \approx t/r_k$
- Fit $t$ via least-squares:
```{math}
x\_i = \frac{\sum\_k \theta\_{ik} \cdot \frac{1}{r\_k}}{\sum\_k \left(\frac{1}{r\_k}\right)^2}
```
- Robust, differentiable, and avoids scalar-local minima traps.
> Four decoders have been implemented ๐ง:
>
> | Method | Description | Stability |
> | ----------------- | ---------------------------------------------- | -------------- |
> | `'least_squares'` | Fast, phase-unwrapped projection | โ
Excellent |
> | `'svd'` | SVD-regularized LSQ for robust inversion | โ
Excellent |
> | `'optimize'` | Scalar optimization (slow, fragile) | โ Unstable |
> | `'naive'` | Mean of phase-projected values (quick + dirty) | โ Wrong shifts |
>
> Rules of Thumb ๐ง:
>
> | Option | Action | Effect |
> | ----------- | ---------------------------- | -------------------------------- |
> | Use scaling | Normalize input to `[0, 10]` | Accurate decoding for wide range |
> | Reduce `N` | Use e.g. `N = 1000` | Higher range support |
๐ทใฐ๏ธใฐ๏ธ๐ทใฐ๏ธใฐ๏ธใฐ๏ธ๐ท
### 5.3 `sinencode_dna()` โ Letter-wise Sinusoidal Encoder ๐ก
Encodes all symbolic segments at selected scale(s) into sinusoidal vectors, grouped by letter (`A`, `Z`, `B`, etc.).
```python
dna.sinencode\_dna(scales=[4], d\_part=32)
```
๐ง Stored outputs:
- `self.code_embeddings_grouped`:
```python
{
4: {
"A": np.ndarray (n\_A, 96),
"Z": np.ndarray (n\_Z, 96),
...
}
}
```
- `self.code_embeddings_meta`:
Metadata required for reconstruction:
```python
{
"sampling\_dt": 0.1,
"x\_label": "RT",
"x\_unit": "min",
"y\_label": "Intensity",
"y\_unit": "a.u.",
"name": "GC-MS peak trace",
"scales": [4],
"d\_part": 32,
"N": 10000
}
```
๐ถใฐ๏ธใฐ๏ธ๐ถใฐ๏ธใฐ๏ธใฐ๏ธ๐ถ
### 5.4 `sindecode_dna(...)` โ Static Decoder to DNAsignal ๐
Reconstructs a new `DNAsignal` instance from sinusoidal embeddings:
```python
reconstructed = DNAsignal.sindecode\_dna(
grouped\_embeddings = dna.code\_embeddings\_grouped,
meta\_info = dna.code\_embeddings\_meta
)
```
๐งฌ Returns a complete `DNAsignal` object with:
- reconstructed `codes[scale]` dictionaries:
- `letters`, `widths`, `heights`, `iloc`, `xloc`, `dx`
- empty signal (since waveform cannot be recovered from symbol encoding alone)
๐ง Ideal for:
- Embedding symbolic sequences for AI/ML workflows
- Comparing motifs without repeating long letters
- Visualizing symbolic structure in latent spaces
โญใฐ๏ธใฐ๏ธโญใฐ๏ธใฐ๏ธใฐ๏ธโญ
### 5.5 Summary and error estimation $\varepsilon = |\hat{t} - t|$ ๐ฌ
Each scalar $t$ (like $x_0$ or $\Delta x$) is **encoded** as:
```{math}
\mathbf{f}(t) = \left[ \sin\left(\frac{t}{r^0}\right), \cos\left(\frac{t}{r^0}\right), \dots, \sin\left(\frac{t}{r^{d/2-1}}\right), \cos\left(\frac{t}{r^{d/2-1}}\right) \right]
```
with $r = N^{2/d}$, typically $N = 10000$, and $d \sim 32$.
In decoding, we estimate $t$ by averaging multiple phase inversions:
```{math}
\hat{t} \approx \frac{1}{d/2} \sum\_{k=0}^{d/2 - 1} r^k \cdot \theta\_k,
\quad \text{where } \theta\_k = \arctan\left( \frac{\sin(t/r^k)}{\cos(t/r^k)} \right)
```
Let $L$ be the **maximum span** of $t$ values to encode (e.g., total signal length), and $d$ the embedding size (e.g., 32). Then:
* For $k=0$ (highest freq), $\text{period}_0 \sim 2\pi$
* For $k = d/2 - 1$, $\text{period}_k \sim 2\pi N$
So the **resolution** behaves like:
```{math}
\varepsilon \sim \frac{L}{N}
```
where $N$ is the frequency base and $L$ is the range of $t$ values being encoded (*e.g.*, max segment length or signal length)
| Feature | Value |
| ------------ | ------------------------------------------------------------ |
| Error scales | $\varepsilon \sim L / N$ |
| Depends on | Signal span $L$, base $N$ |
| Tunable by | Increasing $d$ or $N$ |
| Accuracy | Typically $<0.1%$ %of signal range ($L=500$ and $N=10^4$ gives $\varepsilon \approx \frac{500}{10000} = 0.05$ ) |
| Robustness | Stable across most morphologies |
The errors are acceptable for:
* Motif alignment
* Classifiers
* Density maps
* Latent embeddings
----
## ๐ 6| **Baseline Filtering and Poisson Noise Rejection**
> The **Ricker wavelet** $\psi_s(t)$ used in `sig2dna` is mathematically the **second derivative of a Gaussian kernel**. As such, applying the Continuous Wavelet Transform (CWT) with $\psi_s(t)$ is equivalent to performing a **second-order differentiation** of the signal $x(t)$ followed by a **Gaussian smoothing**, where the scale parameter $s$ controls the bandwidth.
>
> This structure makes the CWT intrinsically robust to **low-frequency noise**, **baseline drifts**, and **stationary random noise** (such as column bleeding in GC). Moreover, the symmetry of $\psi_s(t)$ ensures suppression of **linear trends**, enhancing signal clarity without distorting peak structures.
>
> For ideal Gaussian-shaped peaks, the optimal CWT response is obtained when the scale $s$ matches the peak's width at its **inflection points**, which corresponds to **half-height** for a Gaussian. This is where the symbolic motif `YAZB` is most cleanly detected.
>
> However, on real-life signals, maximizing noise rejection by increasing $s$ can blur peak details. Preserving the **morphological fidelity** of peaks while ensuring their **detectability** requires operating **near the optimal scale**, not beyond it. To this end, `sig2dna` integrates a **robust preprocessing methodology** tailored for signals acquired through **accumulation or integration** (i.e., **counting statistics**), such as total ion counts in mass spectrometry or spectroscopic intensities.
### Step 1 โ Median Baseline Subtraction ๏น๐๏น
Let $x(t)$ be the input signal. We compute a moving median over a window of width $w$:
```{math}
\text{baseline}(t) = \text{median}\left[x(t - w/2), \dots, x(t + w/2)\right]
```
Then, apply a non-negative correction:
```{math}
x\_b(t) = \max\left(0,\, x(t) - \text{baseline}(t)\right)
```
๐ปโ๐ผโ๐ฝโ๐พ๐ฟ
### Step 2 โ Poisson Noise Estimation โถ๏ธ แแ||แ|แ|||| |
From the baseline-corrected signal $x_b(t)$:
* Compute the local mean $\mu(t)$ and standard deviation $\sigma(t)$ using a uniform filter.
* Estimate the coefficient of variation:
```{math}
\text{cv}(t) = \frac{\sigma(t)}{\mu(t)}
```
Assuming Poisson noise, infer the local Poisson parameter:
```{math}
\lambda(t) = \frac{1}{\text{cv}(t)^2}
```
๐ปโ๐ผโ๐ฝโ๐พ๐ฟ
### Step 3 โ BienaymรฉโTchebychev Thresholding ๐๏ธ
To reject noise, use a threshold $T(t)$ derived from $\lambda(t)$:
```{math}
T(t) = k \cdot \sqrt{10 \lambda(t) \Delta t}
```
Filtered signal is then:
```{math}
x\_{bf}(t) =
\begin{cases}
x\_b(t) & \text{if } x\_b(t) > T(t) \\
0 & \text{otherwise}
\end{cases}
```
---
## ๐งช 7| **Synthetic Signal Generation**
Synthetic signals are modeled as a sum of Gaussian/Lorentzian/Triangle peaks. For Gaussian, they read
```{math}
s(t) = \sum\_{i} h\_i \cdot \exp\left(-\left(\frac{t - \mu\_i}{0.6006 \cdot w\_i}\right)^2\right)
```
where:
* $h_i$: peak height
* $\mu_i$: center
* $w_i$: peak width (calibrated to Full **Width Half Maximum**)
This is used to:
* Reconstruct symbolic segments
* Generate artificial mixtures
* Simulate motifs for clustering or ML training
* Parses sequences into `YAZB` motif candidates (Mass spectra)
---
## ๐ฆ 8| **Available Classes**
**Module** `sig2dna_core.signomics.py`
| Class Name | Description |
| ---------------------- | ------------------------------------------------------------ |
| `generator` | Peak shape generator: Gaussian, Lorentzian, triangle |
| `peaks` | Peak library with synthesis, parameter control, and arithmetic operations |
| `signal` | 1D signal class with plotting, peak summation, transformations, and noise |
| `signal_collection` | Wrapper for multi-signal analysis: mean, sum, scaling, alignment, synthesis |
| `DNAstr` | Symbolic sequence class with entropy, motif search, edit distances |
| `DNAsignal` | Symbolic encoding/decoding from signals (DNA-like) |
| `DNApairwiseAnalysis` | Tools for clustering, dimensionality reduction, dendrograms, visual metrics |
| `DNAsignal_collection` | Wrapper for 2D, nD DNAsignals |
| `SinusoidalEncoder` | Encoder/decoder for symbolic and numeric data using sinusoidal projections |
| `DNACodes` | Dictionary-like symbolic representation of triplet codes (letter, width, height) |
| `DNAFullCodes` | Dictionary-based encoder for resolution-based symbolic repetition |
**Class Inheritance Diagram**
```{mermaid}
graph TD;
DNACodes
DNAFullCodes
DNApairwiseAnalysis
DNAsignal
DNAsignal_collection
DNAstr
SinusoidalEncoder
generator
peaks
signal
signal_collection
UserDict --> DNACodes
dict --> DNAFullCodes
list --> DNAsignal_collection
list --> signal_collection
object --> DNApairwiseAnalysis
object --> DNAsignal
object --> SinusoidalEncoder
object --> generator
object --> peaks
object --> signal
str --> DNAstr
```
---
## ๐ 9| **Example Workflow**
```python
from signomics import DNAsignal
# Load and encode
D = DNAsignal(S, encode=True)
D.encode\_dna()
D.encode\_dna\_full()
# Visualize
D.plot\_codes(scale=4)
# Entropy and distances
entropy = D.get\_entropy(scale=4)
analysis = DNAsignal.\_pairwiseEntropyDistance([D1, D2, D3], scale=4)
```
---
## ๐ 10| **Visualization**
- `signal.plot()`, `signal_collection.plot()` : plot signals
- `DNAsignal.plot_signals()`: Original + CWT overlay
- `DNAsignal.plot_transforms()`: plot transformed signals a collection of signals
- `DNAsignal.plot_codes(scale=4)`: Colored triangle segments
- `DNAstr.plot_mask`: plot alignment mask
- `DNAstr.plot_alignment`: plot aligned codes as reconstructed signals
- `DNApairwiseAnalysis.plot_dendrogram()`, `scatter3d(), scatter(), heatmap`, `dimension_variance_curve`: Cluster and distance views
------
## ๐ 11| **Motif Detection**
Pattern search: ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท๊ฆ๊ท
```python
listPat=D.codes[4].find("YAZB")
listPat[0].to\_signal().plot() # show the first match as a signal
```
Extract and plot motifs: โโโโโโโโ
```python
D.codesfull[4].extract\_motifs("YAZB", minlen=4, plot=True)
```
------
## ๐ค 12| **Alignment**
โด Fast symbolic alignment:โ๏ธโฑ๏ธ
```python
D1.codes[4].align(D2.codes[4], engine="bio")
D1.codes[4].wrapped\_alignment()
D1.html\_alignment()
D1.plot\_alignment()
```
------
## ๐งช 13| **Examples** (unsorted)
```python
from sig2dna\_core.signomics import peaks, signal\_collection, DNAsignal
# 1. Peak creation and basic signals ๐๏ธ
p = peaks()
p.add(x=10, w=2, h=1)
p.add(x=20, w=2, h=1)
s = p.to\_signal()
s.plot()
# 2. Signal collection ๐๏ธ
s\_noisy = s.add\_noise("gaussian", scale=0.01, bias=5)
s\_scaled = s * 0.5
coll = signal\_collection(s, s\_noisy, s\_scaled)
s\_mean = coll.mean()
s\_mean.plot(label="Mean")
# 3. Synthetic mixtures ๐ฅฃ
S, pS = signal\_collection.generate\_synthetic(n\_signals=12, n\_peaks=1, ...)
Sfull = S.mean()
dna = DNAsignal(Sfull)
dna.compute\_cwt()
dna.encode\_dna\_full()
dna.plot\_codes(scale=4)
# 4. Alignment of encoded sequences ๐งฌ๐งฌ
A = dna.codesfull[4]
B = dna.codesfull[2]
A.align(B)
A.html\_alignment()
A.plot\_alignment()
# 5. Extract motifs (e.g., YAZB segments โ๏ธ
pA = A.find("YAZB")
pAs = signal\_collection(*[s.to\_signal() for s in pA])
pAs.plot()
# 6. Classification from mixtures ๐
Smix, pSmix, idSmix = signal\_collection.generate\_mixtures(...)
dnaSmix = Smix.\_toDNA(scales=[1,2,4,8,16,32])
# 7. Excess entropy distance & clustering ๐ฒ
D = DNAsignal.\_pairwiseEntropyDistance(dnaSmix, scale=4, engine="bio")
D.name = "Excess Entropy"
D.dimension\_variance\_curve()
D.select\_dimensions(10)
D.plot\_dendrogram()
D.scatter3d(n\_clusters=5)
# 8. Jaccard motif distance โ๏ธ
J = DNAsignal.\_pairwiseJaccardMotifDistance(dnaSmix, scale=4)
J.name = "YAZB Jaccard"
J.dimension\_variance\_curve()
J.select\_dimensions(10)
J.plot\_dendrogram()
J.scatter3d(n\_clusters=5)
```
---
## ๐ฆ 14| **Installation**
The `sig2dna` toolkit is composed of two core modules that must be used together:
| ๐งฉ Module | Description |
| -------------------------- | ------------------------------------------------------------ |
| ๐งฌ `sig2dna_core.signomics` | Core module implementing symbolic transformation, wavelet coding, and signal comparison (compact code, >7 Klines) |
| ๐จ๏ธ `sig2dna_core.figprint` | Utility module for saving and exporting Matplotlib figures (PDF, PNG, SVG) |
### Recommended File Structure ๐
For simplicity and consistency, it is recommended to use both modules from a local subfolder (e.g., `sig2dna_core`) within your working directory. You can clone or place the source files accordingly:
```text
๐ sig2dna/ <- your working directory
โ
โโโ ๐ sig2dna\_core/ <- folder for core modules
โ โโโ ๐จ๏ธ figprint.py <- figure saving utilities
โ โโโ ๐งฌ signomics.py <- main symbolic signal processing module (>4 Klines)
โ
โโโ ๐ sig2dna\_tools/ <- folder for tools (not included in this release)
โ
โโโ ๐ images/ <- output folder for saved figures (PDF, PNG, SVG)
โ
โโโ ๐ yourscript.py <- your script using sig2dna\_core modules
โ
โโโ ๐ test\_signomics.py <- minimal test and plotting script
โโโ ๐ casestudy\_signomics.py <- in-depth classification and clustering example
โโโ ๐ LICENSE
โโโ ๐ README.md
```
### Import Example ๐ฅ
In your scripts, import the components directly:
```python
from sig2dna\_core.signomics import peaks, signal\_collection, DNAsignal
```
### Dependencies ๐ฆ
The project relies only on standard scientific Python libraries and a few well-known optional packages. All can be installed with `conda` or `pip`:
```bash
conda install pywavelets seaborn scikit-learn
conda install -c conda-forge python-Levenshtein biopython
```
Or using `pip`:
```bash
pip install PyWavelets seaborn scikit-learn python-Levenshtein biopython
```
> โ
*No installation script is needed; simply place the module files in your working directory and ensure the structure above is respected.*
---
## ๐ก15| **Recommendations**
### Strategy for 2D or Multi-modal Chromatography ๐งญ
For **2D chromatographic systems**, such as GCรGC or LCรLC, or in workflows combining retention time and mass detection, we suggest the following dual encoding strategy:
- **Along the retention axis**: perform symbolic encoding of **TIC** (Total Ion Current) or a selected ion trace, to track **retention-based morphology**.
- **Along the $m/z$ axis**: use time-averaged spectra to encode **mass distribution patterns**, capturing molecular-level information.
๐ This combined coding captures both **substance separation** and **substance identity**, improving both **detection** (peak finding) and **quantification**.
> ๐ฏ Starting from version $0.45$, 2D signals are handled natively with the class `DNAsignal_collection`. Look at the detailed tutorial ``
### Substance Identification and Library Matching ๐
`sig2dna` includes signal reconstruction capabilities from the symbolic code, allowing for **approximate substance identification** against reference libraries.
However, when precise identification is required:
> โ
It is preferable to **transform the mass spectra of reference substances using `sig2dna`** and compare them directly to the coded signal.
This enables **symbol-level matching**, which is more robust to noise, shifts, and peak distortion than traditional numerical similarity or library lookup.
---
## ๐ | License
MIT License โ 2025 Olivier Vitrac
## ๐ง | Contact
Author: Olivier Vitrac
Contact: [olivier.vitrac@gmail.com](mailto:olivier.vitrac@gmail.com)
Version: 0.43 (2025-06-13)
------
> `Sig2dna` is part of the **Generative Simulation** initiative ๐ฑ: building modular, interpretable **AI-ready** tools for scientific modeling.
---
## ๐ |Appendices
### ๐งฉ | Methods Table (module: `sig2dna.signomics`)
| Class | Method | Docstring First Paragraph | # Lines | __version__ |
| ---------------------- | -------------------------------- | ------------------------------------------------------------ | ------- | ----------- |
| (module-level) | `import_local_module` | Import a module by name from a file path relative to the calling module. | 37 | 0.51 |
| `DNACodes` | `__init__` | Initialize self. See help(type(self)) for accurate signature. | 4 | 0.51 |
| `DNACodes` | `plot` | Plot method for DNACodes: visualizes encoded vectors or symbolic segment distribution. | 57 | 0.51 |
| `DNACodes` | `sindecode` | Decode sinusoidally embedded codes grouped by letter into symbolic segment structure. | 24 | 0.51 |
| `DNACodes` | `sinencode` | Encode symbolic segments at each scale using sinusoidal encoding grouped by letter. | 22 | 0.51 |
| `DNACodes` | `summary` | Print the number of encoded vectors or symbolic segments per scale. | 12 | 0.51 |
| `DNAFullCodes` | `__init__` | Initialize self. See help(type(self)) for accurate signature. | 4 | 0.51 |
| `DNAFullCodes` | `__repr__` | Return repr(self). | 2 | 0.51 |
| `DNAFullCodes` | `__str__` | Return str(self). | 2 | 0.51 |
| `DNAFullCodes` | `plot` | Plot method for DNAFullCodes: visualizes encoded vectors or DNA string composition. | 53 | 0.51 |
| `DNAFullCodes` | `plot_unwrapped_matrix` | Plot each letter's encoded vector in the abstract embedding space (d-space). One curve per letter, one subplot per scale. | 38 | 0.51 |
| `DNAFullCodes` | `sindecode` | Sindecode method | 11 | 0.51 |
| `DNAFullCodes` | `sinencode` | Sinencode method โ encodes each letter in the DNAFullCodes as a set of sinusoidal embeddings. | 45 | 0.51 |
| `DNAFullCodes` | `summary` | Return a brief summary of the full codes per scale. | 10 | 0.51 |
| `DNAFullCodes` | `unwrap_letters_to_matrix` | Assemble all encoded letter vectors into a matrix of shape (n_letters, d_model) for each scale. | 38 | 0.51 |
| `DNApairwiseAnalysis` | `__init__` | Initialize self. See help(type(self)) for accurate signature. | 10 | 0.51 |
| `DNApairwiseAnalysis` | `__repr__` | Return repr(self). | 7 | 0.51 |
| `DNApairwiseAnalysis` | `__str__` | Return str(self). | 2 | 0.51 |
| `DNApairwiseAnalysis` | `best_dimension` | Determine optimal dimension by maximizing silhouette score. | 16 | 0.51 |
| `DNApairwiseAnalysis` | `cluster` | Assign cluster labels from linkage matrix. | 5 | 0.51 |
| `DNApairwiseAnalysis` | `compute_linkage` | Compute hierarchical clustering. | 6 | 0.51 |
| `DNApairwiseAnalysis` | `dimension_variance_curve` | Computes the cumulative explained variance (based on pairwise distances) as a function of the number of dimensions used (from 1 to n-1). Optionally plots the curve and the point where the threshold (default 0.5) is reached. | 46 | 0.51 |
| `DNApairwiseAnalysis` | `get_cluster_labels` | Returns cluster labels from hierarchical clustering. If not computed yet, computes linkage. | 20 | 0.51 |
| `DNApairwiseAnalysis` | `heatmap` | Plot heatmap of pairwise distances. | 9 | 0.51 |
| `DNApairwiseAnalysis` | `load` | Load analysis from file. | 5 | 0.51 |
| `DNApairwiseAnalysis` | `pcoa` | Perform Principal Coordinate Analysis (PCoA). | 7 | 0.51 |
| `DNApairwiseAnalysis` | `plot_dendrogram` | Plot dendrogram from linkage matrix. | 14 | 0.51 |
| `DNApairwiseAnalysis` | `reduced_distances` | Recompute distances on selected subspace. | 4 | 0.51 |
| `DNApairwiseAnalysis` | `save` | Save current analysis to file. | 4 | 0.51 |
| `DNApairwiseAnalysis` | `scatter` | 2D scatter plot in selected dimensions with optional cluster-based coloring. | 40 | 0.51 |
| `DNApairwiseAnalysis` | `scatter3d` | 3D scatter plot in selected dimensions with optional cluster-based coloring. | 38 | 0.51 |
| `DNApairwiseAnalysis` | `select_dimensions` | Update active dimensions. | 8 | 0.51 |
| `DNAsignal` | `__init__` | Initialize DNAsignal with a signal object or 1D array. | 58 | 0.51 |
| `DNAsignal` | `__repr__` | Return repr(self). | 6 | 0.51 |
| `DNAsignal` | `__str__` | Return str(self). | 2 | 0.51 |
| `DNAsignal` | `_get_letter` | Determine letter based on monotonicity and signal range. | 31 | 0.51 |
| `DNAsignal` | `_get_triangle_from_letter` | returns the triangle (counter-clockwise, x are incr) | 27 | 0.51 |
| `DNAsignal` | `_pairwiseEntropyDistance` | Calculate excess-entropy pairwise distances. | 52 | 0.51 |
| `DNAsignal` | `_pairwiseJaccardMotifDistance` | Compute pairwise Jaccard distances based on motif presence across symbolic DNAstr sequences. | 102 | 0.51 |
| `DNAsignal` | `_pairwiseJensenShannonDistance` | Calculate pairwise Jensen-Shannon distances between DNAstr codes at a given scale. | 45 | 0.51 |
| `DNAsignal` | `_pairwiseLevenshteinDistance` | Compute pairwise Levenshtein distances between codes at a given scale. | 61 | 0.51 |
| `DNAsignal` | `align_with` | Align symbolic sequences and compute mutual entropy. | 27 | 0.51 |
| `DNAsignal` | `apply_baseline_filter` | Apply baseline filtering using moving median and local Poisson-based thresholding. | 58 | 0.51 |
| `DNAsignal` | `compute_cwt` | Compute Continuous Wavelet Transform (CWT) using the Mexican Hat wavelet. | 46 | 0.51 |
| `DNAsignal` | `encode_dna` | Encode each transformed signal into a symbolic DNA-like sequence of monotonic segments. | 67 | 0.51 |
| `DNAsignal` | `encode_dna_full` | Convert symbolic codes into DNA-like strings by repeating letters proportionally to their span. | 78 | 0.51 |
| `DNAsignal` | `entropy_from_string` | return the entropy of a string | 5 | 0.51 |
| `DNAsignal` | `find_sequence` | Find occurrences of a specific letter pattern in encoded sequence. | 9 | 0.51 |
| `DNAsignal` | `get_code` | Retrieve encoded data for a specific scale. | 3 | 0.51 |
| `DNAsignal` | `get_entropy` | Calculate Shannon entropy for encoded signal. | 5 | 0.51 |
| `DNAsignal` | `has` | Check if a DNA encoding exists for the specified scale. | 22 | 0.51 |
| `DNAsignal` | `normalize_signal` | Normalize the internal signal using one of several strategies that ensure positivity. | 18 | 0.51 |
| `DNAsignal` | `plot_codes` | Plot the symbolic DNA-like encoding as colored triangle segments. | 77 | 0.51 |
| `DNAsignal` | `plot_scalogram` | Plot a scalogram with two subplots: - Top: colored image of CWT coefficient amplitudes - Bottom: line curves of selected scales | 34 | 0.51 |
| `DNAsignal` | `plot_signals` | Plot signals. | 13 | 0.51 |
| `DNAsignal` | `plot_transforms` | Plot the stored CWT-transformed signals as a signal collection. | 13 | 0.51 |
| `DNAsignal` | `print_alignment` | print aligned sequences | 11 | 0.51 |
| `DNAsignal` | `pseudoinverse` | Approximate signal reconstruction via pseudo-inverse using stored CWT coefficients. | 64 | 0.51 |
| `DNAsignal` | `reconstruct_aligned_string` | Fast reconstruction of aligned signals | 9 | 0.51 |
| `DNAsignal` | `reconstruct_signal` | Reconstruct the signal from symbolic features (e.g., YAZB). | 31 | 0.51 |
| `DNAsignal` | `sindecode_dna` | Decode sinusoidal grouped embeddings into a DNACodes structure. | 27 | 0.51 |
| `DNAsignal` | `sinencode_dna` | Encode `self.codes` into sinusoidal embeddings (grouped by letter). | 15 | 0.51 |
| `DNAsignal` | `sinencode_dna_full` | ๐ Encode full-resolution DNA-like strings into sinusoidal embeddings grouped by letter. | 37 | 0.51 |
| `DNAsignal` | `sparsify_cwt` | Sparsify CWT coefficients by zeroing values below a threshold. | 58 | 0.51 |
| `DNAsignal` | `synthetic_signal` | Generate flexible synthetic signals. (obsolete) | 9 | 0.51 |
| `DNAsignal` | `tosignal` | Reconstruct an approximate signal from symbolic encodings. | 46 | 0.51 |
| `DNAsignal_collection` | `__init__` | Initialize a DNAsignal_collection from DNAsignal instances. | 41 | 0.51 |
| `DNAsignal_collection` | `__repr__` | Return a readable summary of the DNAsignal_collection contents. Shows the number of signals, available letters, scales, and embedding dimension. Flags whether sinusoidal encoding has been performed. | 14 | 0.51 |
| `DNAsignal_collection` | `__str__` | Return str(self). | 2 | 0.51 |
| `DNAsignal_collection` | `combine_embeddings` | Combine unwrapped embeddings across all signals for each scale. | 30 | 0.51 |
| `DNAsignal_collection` | `deconvolve_latent_sources` | Perform dimensionality reduction on the 3D tensor v_{t,m,d} to extract non-coeluted compound chromatograms using PCA, with optional plotting. | 138 | 0.51 |
| `DNAsignal_collection` | `plot` | Plot the encoded signals in subplots. Rows represent letters, columns represent scales. Each subplot contains overlaid colored curves from all signals. | 65 | 0.51 |
| `DNAsignal_collection` | `plot_embedding_projection` | Plot embedding projections of the encoded signals using PCA (default) or other DR methods. | 72 | 0.51 |
| `DNAsignal_collection` | `plot_letters` | Plot a heatmap of the letter codes (symbolic DNA) across all signals. | 48 | 0.51 |
| `DNAsignal_collection` | `plot_v_symbol_components` | Plot the components E_symbol, PE_t, PE_m and their sum v_{t,m} as image matrices for each letter at a given scale. | 56 | 0.51 |
| `DNAsignal_collection` | `plot_vtm_full` | Visualize the components and sum of the full encoded GC-MS signal at a given scale. | 66 | 0.51 |
| `DNAsignal_collection` | `reduce_dimensions` | Apply dimensionality reduction (PCA or UMAP) across signals for each scale. | 39 | 0.51 |
| `DNAsignal_collection` | `scale_alignment` | Normalize embeddings across all signals and all scales. | 24 | 0.51 |
| `DNAsignal_collection` | `sinencode_dna_full` | ๐ Encode all DNAsignal instances using full-resolution sinusoidal embeddings, grouped by letter and organized per scale. | 22 | 0.51 |
| `DNAsignal_collection` | `to_dataframe` | Export combined embeddings for all scales as a tidy pandas DataFrame, suitable for machine learning tasks. | 29 | 0.51 |
| `DNAstr` | `__add__` | Concatenate two DNAstr instances with identical dx values. | 27 | 0.51 |
| `DNAstr` | `__eq__` | Check equality based on symbolic content and dx resolution. | 15 | 0.51 |
| `DNAstr` | `__hash__` | Return hash combining the string content and dx. | 10 | 0.51 |
| `DNAstr` | `__new__` | Construct a new DNAstr object. | 42 | 0.51 |
| `DNAstr` | `__repr__` | Return a short technical representation of the DNAstr instance. | 15 | 0.51 |
| `DNAstr` | `__str__` | String representation. | 11 | 0.51 |
| `DNAstr` | `__sub__` | Subtract two DNAstr sequences by aligning and removing matched regions. | 20 | 0.51 |
| `DNAstr` | `_supports_color` | Returns True if ther terminal supports colors | 4 | 0.51 |
| `DNAstr` | `align` | Align this DNAstr sequence to another, allowing insertions/deletions to maximize matches. | 158 | 0.51 |
| `DNAstr` | `excess_entropy` | Compute the excess Shannon entropy of two DNAstr sequences H(A)+H(B)-2*H(AB) | 3 | 0.51 |
| `DNAstr` | `extract_motifs` | Extract and analyze YAZB motifs (canonical and distorted) from the symbolic sequence. | 68 | 0.51 |
| `DNAstr` | `find` | Finds all fuzzy (or regex-based) occurrences of a DNA-like sequence pattern. | 42 | 0.51 |
| `DNAstr` | `html_alignment` | Render the alignment using HTML with color coding: - green: match - blue: gap - red: substitution | 24 | 0.51 |
| `DNAstr` | `jaccard` | Compute the Jaccard distance between two DNAstr sequences. | 19 | 0.51 |
| `DNAstr` | `jensen_shannon` | Compute the Jensen-Shannon distance between self and another DNAstr. | 24 | 0.51 |
| `DNAstr` | `levenshtein` | Compute the Levenshtein distance between this DNAstr and another one. | 38 | 0.51 |
| `DNAstr` | `mutual_entropy` | Compute the Shannon mutual entropy of two DNAstr sequences from their aligned segments | 12 | 0.51 |
| `DNAstr` | `plot_alignment` | Plot a block alignment view of two DNAstr sequences with color-coded segments. | 62 | 0.51 |
| `DNAstr` | `plot_mask` | Plot a color-coded mask of the alignment between sequences. | 20 | 0.51 |
| `DNAstr` | `score` | Return an alignment score, optionally normalized. | 17 | 0.51 |
| `DNAstr` | `summary` | Summarize the DNAstr with key stats: length, unique letters, entropy, etc. | 19 | 0.51 |
| `DNAstr` | `to_signal` | Converts the symbolic DNA sequence into a synthetic NumPy array mimicking the original wavelet-transformed signal. | 54 | 0.51 |
| `DNAstr` | `vectorized` | Map the DNAstr content to an integer array using a codebook. | 21 | 0.51 |
| `DNAstr` | `wrapped_alignment` | Return a line-wrapped alignment view (multi-line), optionally color-coded for terminal/IPython usage (Spyder, Jupyter). | 48 | 0.51 |
| `SinusoidalEncoder` | `__init__` | SinuosidalEncoder Constructor | 22 | 0.51 |
| `SinusoidalEncoder` | `_decode_lsq` | Least-squares phase unwrapping decoder (robust fast decoder). | 29 | 0.51 |
| `SinusoidalEncoder` | `_decode_naive` | Simple decoder based on mean projection (coarse and periodic). | 19 | 0.51 |
| `SinusoidalEncoder` | `_decode_optimize` | Decode each embedding vector via scalar minimization. | 35 | 0.51 |
| `SinusoidalEncoder` | `_decode_svd` | SVD-based phase unwrapping decoder with regularized least-squares. | 33 | 0.51 |
| `SinusoidalEncoder` | `_inverse_sin_embed` | Inverse sinusoidal encoding (approximate). | 23 | 0.51 |
| `SinusoidalEncoder` | `_parse` | Parse values | 10 | 0.51 |
| `SinusoidalEncoder` | `_reconstruct_type` | Reconstruct types | 12 | 0.51 |
| `SinusoidalEncoder` | `_sin_embed` | Project scalar values into sinusoidal embedding space. | 20 | 0.51 |
| `SinusoidalEncoder` | `angle_difference` | Compute angular differences (โฮธ) between consecutive embeddings. | 17 | 0.51 |
| `SinusoidalEncoder` | `complex_distance` | Compute distance between two encoded arrays using complex projection. | 28 | 0.51 |
| `SinusoidalEncoder` | `decode` | Decode sinusoidal embeddings back to original values using selected method. | 42 | 0.51 |
| `SinusoidalEncoder` | `encode` | Encode input values into sinusoidal embeddings. | 31 | 0.51 |
| `SinusoidalEncoder` | `fit_encoder` | Fit a scaling factor to normalize values into a target sinusoidal-safe range. | 24 | 0.51 |
| `SinusoidalEncoder` | `group_centroid` | Compute the centroid (average embedding) of each group in complex sinusoidal space. | 43 | 0.51 |
| `SinusoidalEncoder` | `pairwise_similarity` | Compute a pairwise similarity (or distance) matrix in sinusoidal embedding space. | 27 | 0.51 |
| `SinusoidalEncoder` | `phase_alignment` | Align embedding `emb` to reference `ref` using complex phase. | 23 | 0.51 |
| `SinusoidalEncoder` | `phase_unwrap` | Perform Fourier-like phase unwrapping on sinusoidal embedding. | 24 | 0.51 |
| `SinusoidalEncoder` | `set_decode_tolerance` | Set maximum acceptable residual error for decoding. | 10 | 0.51 |
| `SinusoidalEncoder` | `sindecode_dna_grouped` | Decode sinusoidal embeddings grouped by letter into symbolic segments. | 41 | 0.51 |
| `SinusoidalEncoder` | `sinencode_dna_grouped` | Encode symbolic code segments grouped by letter into sinusoidal embeddings. | 36 | 0.51 |
| `SinusoidalEncoder` | `to_complex` | Convert sinusoidal embedding into a complex array using Euler's identity. | 18 | 0.51 |
| `SinusoidalEncoder` | `verify_roundtrip` | Perform an encode โ decode โ compare roundtrip and report accuracy. | 56 | 0.51 |
| `generator` | `__call__` | Call self as a function. | 12 | 0.51 |
| `generator` | `__init__` | define a peak generator gauss/lorentz/triangle | 3 | 0.51 |
| `generator` | `__repr__` | Return repr(self). | 2 | 0.51 |
| `peaks` | `__add__` | | 8 | 0.51 |
| `peaks` | `__delitem__` | | 14 | 0.51 |
| `peaks` | `__getitem__` | | 11 | 0.51 |
| `peaks` | `__init__` | Initialize a peak collection. | 51 | 0.51 |
| `peaks` | `__len__` | | 2 | 0.51 |
| `peaks` | `__mul__` | | 14 | 0.51 |
| `peaks` | `__repr__` | Return repr(self). | 20 | 0.51 |
| `peaks` | `__str__` | Return str(self). | 6 | 0.51 |
| `peaks` | `__truediv__` | | 3 | 0.51 |
| `peaks` | `add` | Add one or multiple peaks to the collection. | 40 | 0.51 |
| `peaks` | `as_dict` | Return the list of peaks as dict | 3 | 0.51 |
| `peaks` | `copy` | Return a deep-copy of the peaks | 5 | 0.51 |
| `peaks` | `names` | Return the list of names | 3 | 0.51 |
| `peaks` | `remove_duplicates` | | 10 | 0.51 |
| `peaks` | `rename` | | 5 | 0.51 |
| `peaks` | `sort` | Sort peaks in-place based on their center positions (x values). | 17 | 0.51 |
| `peaks` | `to_signal` | Generate a signal from a peaks object. Optionally restrict to a subset. | 8 | 0.51 |
| `peaks` | `update` | Update or insert peaks from a list of dictionaries. | 33 | 0.51 |
| `signal` | `__add__` | | 1 | 0.51 |
| `signal` | `__iadd__` | | 1 | 0.51 |
| `signal` | `__imul__` | | 1 | 0.51 |
| `signal` | `__init__` | Initialize a signal instance. | 76 | 0.51 |
| `signal` | `__isub__` | | 1 | 0.51 |
| `signal` | `__itruediv__` | | 1 | 0.51 |
| `signal` | `__mul__` | | 1 | 0.51 |
| `signal` | `__repr__` | Return repr(self). | 21 | 0.51 |
| `signal` | `__str__` | Return str(self). | 5 | 0.51 |
| `signal` | `__sub__` | | 1 | 0.51 |
| `signal` | `__truediv__` | | 1 | 0.51 |
| `signal` | `_binary_op` | Binary operation on signals | 10 | 0.51 |
| `signal` | `_copystatic` | Static copy of a signal (x and y only) | 4 | 0.51 |
| `signal` | `_current_stamp` | | 6 | 0.51 |
| `signal` | `_events` | Register a traceable action in the signal's history. | 23 | 0.51 |
| `signal` | `_from_serializable` | Convert a serialized dict to a signal | 24 | 0.51 |
| `signal` | `_toDNA` | Return a DNA encoded signal | 9 | 0.51 |
| `signal` | `_to_serializable` | Convert the signal into a dictionary suitable for JSON export. | 19 | 0.51 |
| `signal` | `add_noise` | Return a new signal with noise and/or bias added. | 33 | 0.51 |
| `signal` | `align_with` | Align two signals to a common x grid with interpolation and padding. | 35 | 0.51 |
| `signal` | `apply_poisson_baseline_filter` | Apply a baseline filter assuming Poisson-dominated statistics with adjustable gain and a rejection threshold based on the Bienaymรฉ-Tchebychev inequality. | 69 | 0.51 |
| `signal` | `backup` | Backup current state in _previous | 8 | 0.51 |
| `signal` | `copy` | Deep copy of the signal, excluding full history control flag | 24 | 0.51 |
| `signal` | `disable_fullhistory` | Disable full history tracking | 5 | 0.51 |
| `signal` | `enable_fullhistory` | Enable full history tracking | 4 | 0.51 |
| `signal` | `load` | Load a signal from a JSON or gzipped JSON file, including recursive _previous. | 23 | 0.51 |
| `signal` | `normalize` | Normalize the signal to positive values using different normalization strategies. | 76 | 0.51 |
| `signal` | `plot` | Plot the signal using matplotlib, applying either internal style settings or overrides provided at call time. | 72 | 0.51 |
| `signal` | `restore` | Restore the previous signal version if available | 11 | 0.51 |
| `signal` | `sample` | Interpolate values from x | 3 | 0.51 |
| `signal` | `save` | Save signal to JSON (optionally compressed) and optionally CSV. | 36 | 0.51 |
| `signal_collection` | `__add__` | overrride + | 14 | 0.51 |
| `signal_collection` | `__delitem__` | Delete self[key]. | 7 | 0.51 |
| `signal_collection` | `__getitem__` | sc["my_signal"] # returns a copy of the signal named "my_signal" sc["A", "B", "C"] # returns a sub-collection with those names sc[0:2] or sc[[0, 2]] # still works for index-based access | 19 | 0.51 |
| `signal_collection` | `__iadd__` | override += | 3 | 0.51 |
| `signal_collection` | `__init__` | Initialize collection with aligned signals of the same type. | 30 | 0.51 |
| `signal_collection` | `__mul__` | override * | 6 | 0.51 |
| `signal_collection` | `__radd__` | override + | 5 | 0.51 |
| `signal_collection` | `__repr__` | Return repr(self). | 4 | 0.51 |
| `signal_collection` | `__mul__` | override * | 6 | 0.51 |
| `signal_collection` | `__setitem__` | Set self[key] to value. | 5 | 0.51 |
| `signal_collection` | `__str__` | Return str(self). | 2 | 0.51 |
| `signal_collection` | `_align_all` | | 11 | 0.51 |
| `signal_collection` | `_toDNA` | Return a collection of DNA encoded signals (class DNAsignal_collection) | 9 | 0.51 |
| `signal_collection` | `append` | Append and align the new signal to the existing collection. | 5 | 0.51 |
| `signal_collection` | `mean` | Mean of selected signals, optionally weighted. | 23 | 0.51 |
| `signal_collection` | `plot` | Plot selected signals with style attributes and optional overlays. | 75 | 0.51 |
| `signal_collection` | `sum` | Sum selected signals, optionally weighted by coeffs. | 41 | 0.51 |
| `signal_collection` | `to_matrix` | Return a 2D array (n_signals x n_points) of aligned signal values. | 3 | 0.51 |
----
๐งฉ | Methods Table (module: `sig2dna.PrintableFigure`)
| Class | Method | Docstring First Paragraph | # Lines | __version__ |
| ----------------- | --------------------- | ------------------------------------------------------------ | ------- | ----------- |
| (module-level) | `_generate_figname` | Generate a cleaned filename using figure metadata or the current datetime. | 19 | |
| (module-level) | `_print_generic` | Generic saving logic for figure files. | 27 | |
| (module-level) | `custom_plt_figure` | Override `plt.figure()` to return PrintableFigure by default. | 9 | |
| (module-level) | `custom_plt_subplots` | Override `plt.subplots()` to return PrintableFigure. | 10 | |
| (module-level) | `is_valid_figure` | Check if `fig` is a valid and open Matplotlib figure. Parameters: fig (object): Any object to check. | 10 | |
| (module-level) | `print_figure` | Save a figure as PDF, PNG, and SVG. | 25 | |
| (module-level) | `print_pdf` | Save a figure as a PDF. Example: -------- >>> fig, ax = plt.subplots() >>> ax.plot([0, 1], [0, 1]) >>> print_pdf(fig, "myplot", overwrite=True) | 10 | |
| (module-level) | `print_png` | Save a figure as a PNG. | 9 | |
| (module-level) | `print_svg` | Save a figure as an SVG (vector format, dpi-independent). | 9 | |
| `PrintableFigure` | `print` | Save figure in PDF, PNG, and SVG formats. | 10 | |
| `PrintableFigure` | `print_pdf` | | 2 | |
| `PrintableFigure` | `print_png` | | 2 | |
| `PrintableFigure` | `print_svg` | | 2 | |
| `PrintableFigure` | `show` | Display figure intelligently based on context (Jupyter/script). | 15 | |