Technical Deep-Dive

How It Works: The Music Embedding Pipeline

This project is powered by an end-to-end machine learning pipeline that downloads music, extracts frame-level acoustic features, and learns a compact embedding for each clip. From raw audio acquisition to contrastive deep learning, here is the methodology behind the embedding model.

Data Acquisition & Resiliency

Training a robust model requires a massive, clean dataset. Building this foundation meant engineering an automated, highly resilient pipeline capable of pulling audio from YouTube at scale.

  • Intelligent Rate Limiting: The downloader uses adaptive delays, exponential backoff, and a circuit breaker pattern to smoothly handle API limits and prevent cascading failures.
  • Concurrency & Pacing: Multi-threaded workers pull audio and immediately pipe it through ffmpeg, standardizing the entire dataset into uniform 16 kHz mono MP3s.
  • Quality Control: Automated validation sweeps purge corrupted files and filter tracks by strict duration bounds (50 seconds to 7 minutes) to eliminate short sound bites and multi-hour compilations.
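The retry and validation behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's downloader: the names `backoff_delay`, `CircuitBreaker`, `fetch_with_retry`, and `within_duration_bounds` are hypothetical, the jitter strategy is one common choice, and every threshold except the 50-second and 7-minute duration bounds is made up for the example.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


class CircuitBreaker:
    """Blocks requests after `max_failures` consecutive failures until `cooldown` seconds pass."""

    def __init__(self, max_failures=5, cooldown=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (requests allowed)

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (the "half-open" state).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open (or re-open) the circuit


def fetch_with_retry(fetch, breaker, max_attempts=4, sleep=time.sleep):
    """Call `fetch()` with backoff between failures, respecting the breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open; skipping request")
        try:
            result = fetch()
        except Exception:  # a real downloader would catch narrower error types
            breaker.record_failure()
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def within_duration_bounds(seconds, lo=50.0, hi=7 * 60.0):
    """Duration filter from the text: keep tracks between 50 seconds and 7 minutes."""
    return lo <= seconds <= hi
```

Sharing one breaker across all worker threads is what would stop a burst of rate-limit errors from cascading: once it opens, every worker skips requests until the cooldown ends. A shared instance would need a lock around its counters, which is omitted here for brevity.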

Feature Engineering: Translating Audio to Data

Raw audio waveforms are high-dimensional and noisy: at 16 kHz, a single second of audio is 16,000 samples. Instead of feeding raw samples directly into the model, every song is analyzed at 10 frames per second, and each frame is summarized as a dense, 160-dimensional feature vector:

Dims  Feature Group         Description
128   Mel Spectrogram       Captures the core frequencies and timbral textures.
12    Chroma Features       Maps the harmonic and melodic pitch classes.
3     Acoustic Descriptors  RMS energy, spectral centroid, and onset strength.
17    Positional Encodings  Linear timestamp plus 16 sinusoidal waves (similar to Transformer positional encodings).
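The frame arithmetic and the positional block can be illustrated with a short, dependency-free Python sketch. The 16 kHz rate, 10 frames per second, and the 128 + 12 + 3 + 17 layout come from the text; the function name `positional_features`, the normalization of the linear timestamp to [0, 1], the sine/cosine pairing, and the `max_period` of 10,000 (borrowed from the original Transformer formulation) are assumptions.

```python
import math

SAMPLE_RATE = 16_000                            # Hz, as standardized by ffmpeg
FRAMES_PER_SECOND = 10
HOP_LENGTH = SAMPLE_RATE // FRAMES_PER_SECOND   # 1,600 samples between frames

# Per-frame layout as listed above.
FEATURE_LAYOUT = {"mel": 128, "chroma": 12, "descriptors": 3, "position": 17}


def positional_features(frame_index, total_frames, n_sinusoids=16, max_period=10_000.0):
    """Return 1 linear timestamp plus `n_sinusoids` sine/cosine values for one frame."""
    # Assumption: the linear timestamp is the frame's relative position in [0, 1].
    linear = frame_index / max(total_frames - 1, 1)
    features = [linear]
    n_pairs = n_sinusoids // 2
    for k in range(n_pairs):
        # Geometric frequency ladder, as in sin(pos / 10000^(2i/d)) with d = n_sinusoids.
        freq = 1.0 / (max_period ** (k / n_pairs))
        features.append(math.sin(frame_index * freq))
        features.append(math.cos(frame_index * freq))
    return features
```

With a feature library such as librosa, a hop length of 1,600 samples at a 16 kHz sample rate is what yields 10 frames per second for the spectral features.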

To avoid data-loading bottlenecks during training, the normalized feature arrays are packed into a single HDF5 file.

The Transformer Architecture

The core of the system is a 4-layer, 4-head Transformer encoder that maps sequences of acoustic feature frames into a metric embedding space.

  • Sequence Processing: The network digests 30-second windows (300 frames) of feature vectors, tracking how the song evolves over time.
  • Dual-View Pooling: By blending a learnable classification token (CLS) with the average of all audio frames, the model generates a comprehensive summary of the clip.
  • The Latent Space: The final output is L2-normalized into a compact 128-dimensional embedding, mapping songs onto a hypersphere where the distance between any two points represents their acoustic similarity.
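The pooling and normalization steps above can be shown without any framework. The sketch below is illustrative only: the equal-weight blend (`alpha=0.5`) and the names `dual_view_pool`, `l2_normalize`, and `squared_distance` are assumptions, and the actual model may combine the CLS and mean views differently (for example with a learned projection).

```python
import math

WINDOW_SECONDS = 30
FRAMES_PER_SECOND = 10
WINDOW_FRAMES = WINDOW_SECONDS * FRAMES_PER_SECOND  # 300 frames per window


def dual_view_pool(cls_vec, frame_vecs, alpha=0.5):
    """Blend the CLS output with the mean of the per-frame outputs."""
    n = len(frame_vecs)
    mean_vec = [sum(column) / n for column in zip(*frame_vecs)]
    return [alpha * c + (1.0 - alpha) * m for c, m in zip(cls_vec, mean_vec)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def cosine_similarity(a, b):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Because the embeddings are unit length, squared Euclidean distance and cosine similarity carry the same information: `squared_distance(a, b)` equals `2 - 2 * cosine_similarity(a, b)`, so ranking neighbors by either measure gives the same order.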

Multi-Positive Contrastive Learning

The model does not use traditional classification categories like "Rock" or "Jazz." Instead, it organizes the latent space using a custom contrastive learning methodology. It learns by pure comparison.

  • The Setup: During training, the model looks at an anchor clip of a song alongside 15 positive clips (from the same song) and 100 negative clips (from entirely different songs).
  • The Objective: Using a Supervised Contrastive (SupCon) style InfoNCE loss function, the network is rewarded for mathematically pulling all the positive clips together simultaneously, while a secondary soft-repulsion term explicitly pushes the negative clips away.
  • Dynamic Refinement: The training loop anneals the contrastive temperature on a cosine schedule. It starts high (forgiving), which helps the model learn basic representations, and ends low (sharp), which establishes strict boundaries between songs.
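To make the objective and the schedule concrete, here is a small pure-Python sketch for a single anchor. It follows the standard SupCon form, averaging the log-probability over all positives against a shared denominator. The function names and the start and end temperatures (0.5 and 0.05) are assumptions, and the secondary soft-repulsion term is left out because the text does not specify its form.

```python
import math


def cosine_temperature(step, total_steps, tau_start=0.5, tau_end=0.05):
    """Cosine-anneal the temperature from tau_start (soft) down to tau_end (sharp)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return tau_end + 0.5 * (tau_start - tau_end) * (1.0 + math.cos(math.pi * progress))


def multi_positive_info_nce(pos_sims, neg_sims, tau):
    """SupCon-style loss for one anchor.

    pos_sims / neg_sims are cosine similarities between the anchor and its
    positive / negative clips (e.g. 15 and 100 values respectively).
    """
    logits = [s / tau for s in list(pos_sims) + list(neg_sims)]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    pos_logits = logits[: len(pos_sims)]
    # Average negative log-probability over all positives simultaneously.
    return -sum(z - log_denominator for z in pos_logits) / len(pos_logits)
```

Two sanity checks follow from the formula. If every similarity is identical, the loss equals log(115) for 15 positives and 100 negatives. If the positives are well separated from the negatives, the loss in this form approaches log(15) rather than zero, because the positives share probability mass in the common denominator.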

Built with PyTorch, librosa, yt-dlp, and ffmpeg.