📡🧬 Blind Deconvolution of 2D Signals using DNA Representations

This tutorial demonstrates how to transform, encode, and blindly decompose 2D analytical signals (e.g., GC-MS spectra) into latent symbolic representations using sinusoidal encodings and projection techniques. The approach is part of the **Generative Simulation Initiative** 🌱 and targets applications such as:

📦 Applications

  • Polymer fingerprinting 🫆

  • Formulation demixing 🔢

  • NIAS identification in recycled materials 🪪

  • Non-targeted screening 🔬

  • Chromatogram simplification and annotation 📊

  • Separation of latent signals in multiplexed detectors 📈

  • General symbolic representation of continuous 1D/2D data streams 🔡

We illustrate the methodology using the Python-based 📡🧬 sig2dna framework.


1 | 📖 Generalizing from 1D to 2D Symbolic Encodings

In one-dimensional (1D) signals (e.g., chromatograms, spectrograms), meaningful structures can be identified via local curvature analysis (e.g., inflection points, peaks) and symbolically encoded into a discrete alphabet.

But in 2D signals, such as time × mass GC-MS maps, we face new challenges:

  • Multiple overlapping sources

  • Lack of predefined time-m/z semantics

  • Unclear spatial symmetries

To overcome these challenges, we adopt a double-layered encoding:

  1. Symbolic Layer: numeric signal → letter codes (‘A’, ‘B’, ‘C’, ‘X’, ‘Y’, ‘Z’, ‘_’)

  2. Geometric Layer: letter codes → latent sinusoidal embedding (\(d\)-dimensional)

The final representation becomes a tensor in 2D signal × latent space.

We write:

\[ v_{t, m, d} = E_{t, m, d} + PE_t(t, d) + PE_m(m, d) \]

Or, under the multiplexed acquisition assumption:

\[ v_{u, d} = E_{u, d} + PE_t(u, d) \]

where:

  • \(E\) is the embedding of the symbolic sequence (i.e., it encodes letter identity per segment),

  • \(PE_t\) is sinusoidal positional encoding in time (or raster position),

  • \(PE_m\) (optionally) encodes identity along mass/ion channels.

⚠️ GC-MS signals are time-multiplexed: m/z channels are not acquired in parallel but scanned sequentially by the detector.
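
To make the raster-mode formula concrete, below is a minimal NumPy sketch. The letter embedding E is a hypothetical random basis chosen for illustration, not the embedding used internally by sig2dna.

import numpy as np

def sinusoidal_pe(n_positions, d_model):
    # Transformer-style positional encoding: sine on even dims, cosine on odd
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
alphabet = "YAZBCX_"
E = dict(zip(alphabet, rng.normal(size=(len(alphabet), 128))))  # toy embedding

sequence = "YAZB___YAZB"                  # toy symbolic sequence
PE_t = sinusoidal_pe(len(sequence), 128)  # positional encoding along u
v = np.stack([E[c] for c in sequence]) + PE_t  # v_{u,d} = E_{u,d} + PE_t(u,d)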


2 | 🧪 Pipeline Demonstration

2.1 | 🔬 Synthetic Signal Generation: gcms

We simulate 5 source signals, each a sparse collection of Gaussian peaks in a 2D (time × ion channel) matrix. Signals are then mixed by linear addition, simulating the co-elution of compounds.

from sig2dna_core.signomics import signal_collection  # import path may vary
                                                      # with your installation

t = 1024  # Number of time samples (dim1)
m = 32    # Number of ion channels or m/z values (dim2)
n_peaks = (6, 10)  # Number of peaks per channel, drawn randomly in this range
n2Dsignals = 5  # Number of source signals (e.g., hypothetical pure substances)

for i in range(n2Dsignals):
    sig = signal_collection.generate_synthetic(
        n_signals=m,           # Each signal is a row of the (t, m) 2D matrix
        n_peaks=n_peaks,       # Random number of peaks
        kinds=("gauss",),      # Shape of the peaks
        width_range=(0.5, 3),  # Peak widths
        height_range=(1.0, 5.0),  # Peak heights
        x_range=(0, t-1),
        n_points=t,
        normalize=False,
        seed=40 + i * 10,      # Distinct seed per source
        name_prefix=f"G{i}"
    )[0]
    gcms = sig if i == 0 else gcms + sig  # Sum sources: simulated co-elution

💡 This example mimics 5 pure substances overlapping in a GC-MS acquisition.

Simulated overlapping 2D signals


2.2 | 🧬 Symbolic Encoding into a DNA Alphabet

Each 1D signal (ion channel) is symbolically encoded into a string using 7 discrete letters:

dna_gcms = gcms._toDNA(scales=4)

💡 A 2D signal is converted into plain text. An isolated peak corresponds to the sequence Y+A+Z+B+, where + indicates at least one occurrence.

  • Y, A, Z, B: characterize peak shapes

  • C, X: represent broader curvature regions

  • _: represents silence or gaps

Symbolic encoding (7-letter alphabet)

🖋 This step compresses local curvature into text — enabling symbolic analysis and reconstruction.
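
Because the encoding is plain text, standard string tools apply directly. The sketch below uses a regular expression to locate the Y+A+Z+B+ peak motif in a made-up string (the way the encoded channel is retrieved may differ in your sig2dna version).

import re

dna = "___YYAAZZBB___YAZB__"  # toy encoded ion channel
peaks = [(p.start(), p.end()) for p in re.finditer(r"Y+A+Z+B+", dna)]
print(peaks)                  # [(3, 11), (14, 18)]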


2.3 | 📡 Sinusoidal Encoding in Latent Space

The symbolic strings are embedded into a high-dimensional space using sinusoidal encoding (akin to transformers’ positional encoding).

dna_gcms.sinencode_dna_full(d_model=128, operation='sum')

  • Embedding matrix: maps each letter to a basis in ℝᵈ

  • Positional encoding: uses cosine/sine frequencies to encode position

  • The final tensor has shape (t, m, d) or (t·m, d)

Latent projections of each letter

📌 This step preserves spatial order while lifting symbolic structure into a latent space.

2D embedding of symbol space

🎯 Peak-associated letters (YAZB) form separable clusters — useful for blind deconvolution.
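
A quick way to check this separability is to project the per-letter embedding vectors to 2D with PCA. The embeddings below are random stand-ins for illustration; with the real sinusoidal embeddings, the YAZB letters group as shown in the figure above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
letters = list("YAZBCX_")
X = rng.normal(size=(len(letters), 128))  # stand-in for the real embeddings
xy = PCA(n_components=2).fit_transform(X)
for c, (px, py) in zip(letters, xy):
    print(f"{c}: ({px:+.2f}, {py:+.2f})")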


2.4 | 🗺️ Full vs Raster Encoding Modes

In full encoding:

\[ v[t, m, d] = E[t, m, d] + PE_t[t, d] + PE_m[m, d] \]

💡 The full mode has a large memory footprint, as it also requires a positional encoding along dim2.

Full 2D tensor encoding

In raster scan (flattened space):

\[ v[u, d] = E[u, d] + PE_t[u, d] \]

💡 Raster encoding is memory-efficient and models the real acquisition order of a GC-MS detector.

Rasterized positional encoding (no PE_m)
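
The two modes differ only in how positions are indexed. The sketch below illustrates both, reusing the sinusoidal_pe helper from Section 1 and a zero tensor as a stand-in for the symbolic embedding.

import numpy as np

t, m, d = 1024, 32, 128
E = np.zeros((t, m, d))  # stand-in for the symbolic embedding tensor

# Full mode: one positional table per axis, broadcast over the (t, m) grid
v_full = E + sinusoidal_pe(t, d)[:, None, :] + sinusoidal_pe(m, d)[None, :, :]

# Raster mode: flatten (t, m) into a single acquisition axis u, matching the
# sequential scan order of the detector (no PE_m table required)
v_raster = E.reshape(t * m, d) + sinusoidal_pe(t * m, d)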


3 | 🧊 Blind Deconvolution in Latent Space

We use PCA to project the latent tensor and identify linearly independent components.

components, chroma, variance, figs = dna_gcms.deconvolve_latent_sources()

🤔💭 Think of it as recovering two overlapping texts printed on the same sheet — we don’t understand the language, but we can separate writing styles.

This is a blind decomposition: no labels or training data are used. It is particularly useful for identifying:

  • Polymer backbone signals

  • Additive/formulation fingerprints

  • Contamination profiles
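
Under the hood, the decomposition amounts to a PCA of the flattened latent tensor. The generic sketch below continues the raster example of Section 2.4 and is not the sig2dna internals:

from sklearn.decomposition import PCA

pca = PCA(n_components=8)
scores = pca.fit_transform(v_raster)  # (t*m, 8) per-position activations
chroma = scores.reshape(t, m, -1)     # back-projection to (t, m) maps
explained = pca.explained_variance_ratio_.cumsum()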

3.1 | 📏 Number of sources

We use a corner detection algorithm on the cumulative explained variance to identify how many components are meaningful:

Variance Analysis

🧠 After 4–5 components, new ones mainly reconstruct sinusoidal background — not actual structures.
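
A simple elbow heuristic picks the point of the cumulative-variance curve farthest above the chord joining its endpoints; sig2dna's own criterion may differ. A minimal sketch, assuming the explained array from the previous sketch:

import numpy as np

def corner_index(cumvar):
    # Normalize both axes to [0, 1], then find the largest vertical gap
    # between the curve and the straight line y = x (the chord)
    x = np.linspace(0.0, 1.0, len(cumvar))
    y = (cumvar - cumvar[0]) / (cumvar[-1] - cumvar[0])
    return int(np.argmax(y - x)) + 1  # number of components to keep

n_sources = corner_index(explained)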


3.2 | 🔍 🔬 Chromatographic Features per Component

Each PCA component can be projected back into (t, m) space — recovering source signals.

Latent Features

🧬 The first 4 components match individual compounds with sparse, well-separated peaks. The 5th and 6th components exhibit a strong band structure.


3.3 | 📽️ Projection of Latent Components

Summing over mass or time (to get total ion chromatograms or total mass spectra) shows distinct chemical signals in early components, and periodic noise beyond.

These trends are confirmed by projecting the sources either along dim1 (TIC signal) or along dim2 (total mass spectrum). Since the positional encoding uses independent frequencies, each non-significant component locks onto a specific frequency and exhibits a strong sinusoidal shape.

Projection of Sources

🧩 Higher components capture encoding artifacts — not physical structure.
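
In code, these projections are plain sums over one axis of the back-projected maps (a continuation of the hypothetical PCA sketch from Section 3):

tic_per_component = chroma.sum(axis=1)   # (t, k): one TIC per component
spec_per_component = chroma.sum(axis=0)  # (m, k): one total mass spectrum each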

3.4 | 🏋 Loadings: the Basis in ℝᵈ

The loadings (PCA eigenvectors) represent latent directions in the sinusoidal embedding space.

Loadings

🖖 Interpretation of latent directions can help filter artifacts and reconstruct sources.
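
Since each sinusoidal frequency occupies one (sin, cos) pair of embedding dimensions, the energy of a loading per pair indicates whether a component tracks a single encoding frequency (an artifact) or spreads across many (structure). A minimal sketch, assuming the pca object from Section 3:

import numpy as np

loadings = pca.components_  # (k, d) latent directions, unit norm each
pair_energy = (loadings ** 2).reshape(loadings.shape[0], -1, 2).sum(axis=2)
artifact_like = pair_energy.max(axis=1) > 0.5  # ad-hoc threshold: most energy
                                               # concentrated on a single pair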

🧾 Notes

  • Encoding dimension (\(d\)) should be overcomplete: \(d \geq n_\text{sources}\)

  • Corner detection in PCA helps estimate signal count

  • Blind projection works without knowing signal semantics

  • Compression is extreme: signal → DNA → ℝᵈ → PCA → signal


🧭 Next Steps

The present tutorial stops at the latent space decomposition of complex signals. However, the full potential of the sig2dna framework lies in reconstruction, annotation, and semantic projection. Future steps include:

  • Letter Reconstruction from identified sources

  • Mass spectrum reconstruction (dim 2)

  • Robust fingerprinting for authentication and traceability

  • Semantic clustering with UMAP (or t-SNE)


🔤 Letter Sequence Reconstruction from Latent Components

After PCA separation, each component corresponds to a latent signal that can be mapped back to its symbolic representation. This backward transformation allows:

  • Visualization of symbolic segments (YAZB) contributing to a source

  • Identification of peak-rich zones and sparse regions

  • Filtering of latent components by symbolic entropy or curvature density (see the sketch below)

🧠 This symbolic back-projection helps understand and validate the chemical or physical nature of each latent signal — akin to recovering linguistic structure in unknown scripts.
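
A minimal sketch of the entropy filter mentioned in the list above (the function is illustrative, not part of the current sig2dna API):

from collections import Counter
import numpy as np

def symbolic_entropy(dna):
    # Shannon entropy of the letter distribution: peak-rich sources spread
    # probability mass over Y, A, Z, B, while artifacts stay close to '_'
    counts = np.array(list(Counter(dna).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

print(symbolic_entropy("YAZB" + "_" * 16))  # low: mostly silence
print(symbolic_entropy("YAZBYAZBYA"))       # higher: peak-rich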


💥 Mass Spectrum Reconstruction (dim 2)

By integrating along the retention time dimension (dim1), each latent component yields a pseudo-mass spectrum:

  • Projection along dim1 (the counterpart of the TIC, Total Ion Chromatogram) ⇒ spectral profile

  • Useful for identifying co-eluting compounds and additives

  • Can be used for non-targeted mass fingerprinting

Combined with external libraries (e.g., NIST, PubChem), these reconstructed mass spectra can be matched against real-world compounds.

🧬 Blind mass fingerprinting becomes possible even for overlapping or degraded signals — particularly relevant in recycled polymer matrices.


🧷 Fingerprinting, Traceability, and Classification

The symbolic encoding and latent separation provide a robust basis for:

  • Batch traceability in recycled or multi-layered materials

  • Quality classification using symbolic entropy or PCA projections

  • Authentication via distance metrics in symbolic or latent space

Because symbolic encodings compress structure without relying on absolute amplitudes or noise-sensitive features, they are robust to minor distortions, instrumental drift, and chemical degradation.
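
As an illustration of latent-space authentication, a cosine distance between mean latent vectors can serve as a first-pass similarity score. This is a hypothetical scheme, reusing the v_raster tensor from Section 2.4:

import numpy as np

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

fp_reference = v_raster.mean(axis=0)  # latent fingerprint of a known batch
fp_sample = v_raster.mean(axis=0)     # fingerprint of the sample under test
print(cosine_distance(fp_reference, fp_sample))  # 0.0 for identical samples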

🔐 This enables a new generation of chemical barcodes or symbolic fingerprints suitable for industrial monitoring and forensic chemistry.


🌱Generative Simulation | olivier.vitrac@gmail.com | June 2025