Personalization in the Era of LLMs - Shivam Verma, Spotify


2026-05-24 By Sean Weldon

Abstract

Spotify's approach to personalization in the Large Language Model era represents a fundamental architectural shift from traditional multi-stage recommender systems to unified generative models. This analysis examines three core technical innovations: transformer-based foundational user modeling that generates daily embeddings for over one billion users, semantic ID tokenization that compresses high-dimensional content vectors into hierarchical discrete tokens, and soft tokenization techniques that project user embeddings into LLM token spaces for personalization without per-user retraining. Operating at scale across 750 million monthly active users and a catalog exceeding 100 million tracks, Spotify demonstrates how domain adaptation of open-weight LLMs enables steerable, explainable recommendations while maintaining collaborative filtering capabilities. These techniques have been productionized across AI DJ, Prompted Playlist, and Taste Profile, marking a transition from siloed product-specific models to a unified LLM backbone that combines platform knowledge with world understanding.

1. Introduction

The integration of Large Language Models into recommender systems presents fundamental architectural challenges that extend beyond conventional prompt engineering. Traditional recommendation systems employ multi-stage pipelines—candidate generation followed by successive ranking stages—that reduce millions of potential items to final ordered lists. While computationally efficient, these architectures lack inherent steerability, explainability, and the capacity to incorporate world knowledge beyond platform-specific interaction data.

Spotify's operational scale amplifies these challenges considerably. With 750 million monthly active users across 184 markets interacting with a catalog of 100 million tracks, 400,000 audiobooks, and millions of podcasts and videos, the platform requires recommendation infrastructure that sustains personalization quality at very large computational scale. The User Representations team addresses this complexity through foundational models that serve the entire recommendation stack, and it has moved those models from generalized autoencoder-based embeddings to sequential transformer architectures.

This analysis examines how Spotify's technical innovations enable a paradigm shift toward generative recommender systems. The central thesis posits that combining foundational user modeling, semantic content tokenization, and soft personalization techniques allows LLM-based systems to achieve both collaborative filtering generalization and user-specific adaptation. Furthermore, this approach provides steerability and explainability as inherent architectural properties rather than post-hoc additions, fundamentally altering the relationship between users and recommendation algorithms.

2. Background and Related Work

2.1 Traditional Multi-Stage Recommender Architectures

Conventional recommender systems at web scale employ a multi-stage pipeline architecture to manage computational constraints. The candidate generation stage reduces the search space from millions of items to hundreds through efficient retrieval methods such as approximate nearest neighbor search or learned retrieval models. Subsequent ranking stages apply increasingly sophisticated models to progressively smaller item sets, ultimately producing the final recommendation list. This approach optimizes computational efficiency by applying expensive models only to pre-filtered candidates, but creates siloed models specific to individual product surfaces with limited cross-product knowledge transfer.
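The two-stage flow described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the item IDs, the 2-d embeddings, and the freshness signal are invented for illustration, and a brute-force dot product stands in for approximate nearest neighbor search.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user: Vector, items: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: cheap retrieval. Brute-force dot product stands in for ANN search."""
    return sorted(items, key=lambda item_id: dot(user, items[item_id]), reverse=True)[:k]


def rank(candidates: List[str], score: Callable[[str], float], n: int) -> List[str]:
    """Stage 2: apply a more expensive scorer to the small candidate set only."""
    return sorted(candidates, key=score, reverse=True)[:n]


# Toy catalog (hypothetical item IDs and 2-d embeddings).
catalog = {
    "track_a": [0.9, 0.1],
    "track_b": [0.8, 0.3],
    "track_c": [0.1, 0.9],
    "track_d": [-0.5, 0.2],
}
user_vec = [1.0, 0.2]

# Hypothetical extra signal that only the ranking stage looks at.
freshness = {"track_a": 0.1, "track_b": 0.9, "track_c": 0.5, "track_d": 0.7}

candidates = generate_candidates(user_vec, catalog, k=2)
final = rank(candidates, score=lambda i: dot(user_vec, catalog[i]) + freshness[i], n=2)
print(candidates, final)
```

The ranking stage never sees track_c or track_d, which is the efficiency gain of the pipeline and also the reason each stage tends to become its own siloed model.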

2.2 Embedding-Based User Representation

Prior approaches to user modeling employed autoencoder-based architectures that compress user interaction histories into fixed-dimensional vectors. These generalized embeddings capture broad user preferences through dimensionality reduction but struggle with sequential dependencies and temporal dynamics inherent in consumption patterns. The limitation becomes particularly acute when modeling users whose preferences evolve over time or who exhibit context-dependent behavior across different content modalities.

2.3 Domain Adaptation in Language Models

Open-weight LLMs such as Llama and Qwen provide extensive world knowledge through pre-training on large text corpora but lack platform-specific understanding of non-textual content. Domain adaptation through post-training enables these models to incorporate proprietary data while retaining general capabilities. The fundamental challenge lies in representing non-textual content, such as music tracks and podcast episodes, in formats compatible with language model architectures designed for discrete token sequences.

3. Core Analysis

3.1 Foundational User Modeling at Scale

Spotify's foundational user modeling infrastructure generates embeddings daily for over one billion users, compressing interaction histories into vector representations that serve the entire recommendation stack. The transition from autoencoder-based generalized embeddings to transformer-based sequential models addresses limitations in temporal modeling by explicitly capturing the order and context of user interactions.

The architecture employs a cross-content embedding space that unifies representation learning across content modalities. Users, tracks, and podcast episodes are embedded in a shared hypersphere, enabling visualization and similarity computation across heterogeneous content types. This approach facilitates cross-content modeling where collaborative filtering signals from music consumption inform podcast recommendations and vice versa, addressing the cold-start problem for users with limited history in specific content domains.
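As a rough illustration of why a shared space helps with cold start, the sketch below ranks podcast episodes for a user whose history is almost entirely music. The 3-d vectors and item names are made up, and cosine similarity on unit-normalized vectors is assumed as the similarity measure; the source says only that the embeddings share a hypersphere.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Project a vector onto the unit hypersphere (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def nearest(user: Sequence[float], items: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank every item, regardless of content type, by similarity to the user."""
    scored = [(item_id, cosine(user, vec)) for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-d embeddings; tracks and episodes live in the same space.
shared_space = {
    "track:indie_song": [0.9, 0.1, 0.0],
    "episode:music_history": [0.7, 0.6, 0.1],
    "episode:true_crime": [0.0, 0.2, 0.9],
}
music_heavy_user = [1.0, 0.3, 0.0]  # history is almost entirely music

ranking = nearest(music_heavy_user, shared_space)
best_episode = next(item_id for item_id, _ in ranking if item_id.startswith("episode:"))
print(best_episode)
```

Because the user vector and the episode vectors are directly comparable, an episode can be suggested without any podcast listening history for that user.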

The computational scale of this infrastructure is substantial. Generating daily embeddings for over one billion users requires distributed training and inference pipelines optimized for throughput. The resulting user representations serve as foundational inputs to downstream models across all product surfaces, creating a unified user understanding layer that replaces product-specific embedding systems.
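A back-of-the-envelope calculation gives a sense of the throughput involved. It assumes, purely for illustration, that exactly one billion embeddings are produced and that the work is spread evenly over 24 hours; real batch jobs usually run in a shorter window, so the true peak rate would be higher.

```python
USERS_PER_DAY = 1_000_000_000  # "over one billion users" from the text
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: inference is spread evenly across the whole day.
required_per_second = USERS_PER_DAY / SECONDS_PER_DAY
print(f"{required_per_second:,.0f} user embeddings per second")  # 11,574
```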

3.2 Semantic ID Tokenization for Content Representation

Semantic IDs address the fundamental challenge of representing high-dimensional content vectors in formats compatible with LLM architectures. The technique tokenizes content vectors—typically 1000-dimensional embeddings—into sequences of 4-6 discrete tokens, analogous to how text is decomposed into word or subword tokens. This compression enables auto-regressive generation where LLMs predict the next item in a recommendation sequence token-by-token.

The tokenization exhibits hierarchical structure with semantic properties. The first two tokens are shared across similar artists within broad genres—for example, pop artists like Ariana Grande and Bruno Mars share initial tokens—while later tokens encode increasingly niche characteristics. This hierarchy emerges naturally from the tokenization process and mirrors the hierarchical structure of musical taste, where users first exhibit genre preferences before developing artist-specific and track-specific preferences.
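The source does not say how the semantic IDs are computed. One common way to obtain hierarchical discrete codes from a vector is residual quantization, sketched below as an assumption rather than as Spotify's method. The codebooks and 2-d vectors are toy values chosen by hand; in practice codebooks would be learned.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[List[float]]


def nearest_code(v: Vector, codebook: Codebook) -> int:
    """Index of the codebook entry closest to v in squared Euclidean distance."""
    def dist(c: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(v, c))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def semantic_id(v: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize v level by level; each level encodes what earlier levels missed."""
    residual = list(v)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


# Hand-picked toy codebooks: level 0 is coarse, level 1 refines the residual.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.1], [-0.1, -0.1]],
]

pop_a = semantic_id([1.1, 0.1], codebooks)   # hypothetical pop artist
pop_b = semantic_id([0.9, -0.1], codebooks)  # another hypothetical pop artist
jazz = semantic_id([0.1, 1.1], codebooks)    # hypothetical jazz artist
print(pop_a, pop_b, jazz)  # [0, 0] [0, 1] [1, 0]
```

The two nearby vectors share their first token and differ only in the second, which is the prefix-sharing behavior described above in miniature. On capacity: with a hypothetical codebook of 256 entries per level, 4 tokens can address 256^4, roughly 4.3 billion distinct IDs, more than enough for a 100-million-track catalog.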

Domain adaptation through post-training leverages semantic IDs to incorporate Spotify's catalog understanding into open-weight LLMs. By training models to predict sequences of semantic ID tokens representing user listening histories, the approach combines platform-specific collaborative filtering signals with the world knowledge embedded in pre-trained language models. This enables the model to reason about content relationships using both acoustic and cultural understanding—for instance, recognizing that certain artists are associated with specific geographic regions or time periods based on general knowledge while simultaneously understanding their position in Spotify's collaborative filtering space.
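One plausible way to set up this training objective is to render each semantic ID as special tokens, add them to the LLM vocabulary, and train with ordinary next-token prediction. The sketch below shows only the data preparation. The token format `<sid_LEVEL_CODE>`, the tiny base vocabulary, and the helper names are all hypothetical and not taken from the source.

```python
from typing import Dict, List, Tuple


def sid_to_tokens(sid: List[int]) -> List[str]:
    """Render one semantic ID as level-tagged special tokens, e.g. <sid_0_12>."""
    return [f"<sid_{level}_{code}>" for level, code in enumerate(sid)]


def extend_vocab(vocab: Dict[str, int], new_tokens: List[str]) -> Dict[str, int]:
    """Append unseen tokens to the end of a copy of the vocabulary."""
    out = dict(vocab)
    for tok in new_tokens:
        if tok not in out:
            out[tok] = len(out)
    return out


def next_token_pairs(token_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Causal LM targets: predict token t from the tokens before t."""
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


# Hypothetical base vocabulary and a two-track listening history.
base_vocab = {"<bos>": 0, "user": 1, "listened": 2, "to": 3}
history = [[0, 0], [0, 1]]  # two semantic IDs of two tokens each

item_tokens = [tok for sid in history for tok in sid_to_tokens(sid)]
vocab = extend_vocab(base_vocab, item_tokens)
sequence = ["<bos>", "user", "listened", "to"] + item_tokens
token_ids = [vocab[t] for t in sequence]
pairs = next_token_pairs(token_ids)
print(token_ids)  # [0, 1, 2, 3, 4, 5, 4, 6]
```

Mixing ordinary text tokens and item tokens in one sequence is what lets a single model draw on both its pre-trained world knowledge and the platform's interaction data.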

3.3 Soft Tokenization for Personalization

While semantic IDs enable LLMs to reason about content, soft tokenization addresses the complementary challenge of incorporating user-specific personalization without retraining models for each of 750 million users. The technique projects user embeddings from the foundational user modeling system into the LLM's token embedding space, creating "soft tokens" that represent individual users.

These soft tokens are inserted into model prompts alongside semantic ID sequences, enabling the LLM to condition generation on both the user's historical preferences and the current context. Critically, this approach allows personalization without requiring per-user model parameters. The foundational user embedding captures collaborative filtering signals—patterns learned from similar users—while the LLM learns to interpret these soft tokens during domain adaptation training on a subset of users.

The mechanism effectively bridges two paradigms: collaborative filtering's ability to generalize across users with similar preferences, and personalization's requirement for user-specific adaptation. The model learns during training how soft tokens representing different user archetypes influence recommendation sequences, then applies this learned interpretation to new users' embeddings at inference time. This architecture enables the system to personalize for users who were not present in the LLM's training data, addressing the practical impossibility of training on all platform users.
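The projection step can be sketched as follows, assuming a single learned linear layer; the source does not specify the projection's form, so it could equally be a small MLP. The dimensions, weights, and embedding table are toy values, and the function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def project(user_embedding: Sequence[float], weight: Matrix, bias: Sequence[float]) -> List[float]:
    """Linear map from user-embedding space to the LLM's token-embedding space."""
    return [
        sum(w * x for w, x in zip(row, user_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def build_inputs(soft_token: List[float], token_ids: List[int], embedding_table: Matrix) -> Matrix:
    """Prepend the soft token to the looked-up embeddings of the discrete tokens."""
    return [soft_token] + [embedding_table[i] for i in token_ids]


# Hypothetical sizes: 2-d user embedding, 3-d LLM hidden size, 4-token vocab.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # shape (3, 2); would be learned
b = [0.0, 0.0, 0.1]
embedding_table = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

unseen_user = [0.2, 0.4]  # from the user model; this user was not in LLM training
soft = project(unseen_user, W, b)
inputs = build_inputs(soft, [1, 3], embedding_table)
print(len(inputs), len(inputs[0]))  # 3 3
```

Only the projection weights are shared parameters. A new user needs nothing more than an embedding from the foundational user model, which is why no per-user training is required.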

3.4 Production Deployment and User Steerability

The unified generative architecture has been productionized across multiple surfaces including AI DJ, Prompted Playlist (now supporting podcast recommendations), and the Next Episode feature for podcast consumption. The Taste Profile product exemplifies the shift toward user steerability, exposing the learned user representation and allowing users to select what preferences the system retains or forgets through natural language interaction.

User edits in Taste Profile feed back into the generative model, creating a feedback loop where explicit user corrections improve personalization. This represents a fundamental departure from traditional recommender systems where user control is limited to binary feedback (likes/dislikes) or implicit signals. The natural language interface enables nuanced preference specification—users can indicate that they enjoyed an artist during a specific period but no longer wish to receive similar recommendations.

4. Technical Insights

The implementation reveals several critical architectural considerations. First, the hierarchical structure of semantic IDs, where initial tokens capture broad categorical information and later tokens encode specificity, fits auto-regressive decoding: because each token is predicted conditioned on the tokens before it, the model effectively commits to a broad category before narrowing to a specific item. This coarse-to-fine alignment likely contributes to the effectiveness of auto-regressive generation for recommendation sequences.

Second, the soft tokenization approach demonstrates that LLMs can learn to interpret continuous vector inputs projected into their token embedding spaces, despite being trained primarily on discrete tokens. This suggests broader applicability for incorporating non-textual modalities into language models without requiring modality-specific architectural modifications.

Third, domain adaptation carries a trade-off: post-training on platform data is what lets the model combine platform-specific understanding with world knowledge, but it also risks catastrophic forgetting, where the model's general capabilities degrade. The approach mitigates this through careful post-training procedures that preserve world knowledge while incorporating platform-specific understanding. The balance proves sufficient for production deployment, though the presentation acknowledges ongoing challenges.

Implementation at scale requires daily embedding generation for over one billion users, presenting substantial computational infrastructure requirements. The cross-content embedding space necessitates joint training across content modalities, increasing training complexity relative to modality-specific systems. However, the unified architecture eliminates the need for separate candidate generation and ranking models across product surfaces, potentially reducing overall system complexity despite increased model sophistication.

5. Discussion

Spotify's approach demonstrates that generative recommender systems can operate at web scale while providing capabilities—steerability, explainability, cross-content reasoning—that traditional architectures struggle to deliver. The integration of foundational user modeling, semantic content tokenization, and soft personalization creates a coherent technical framework that addresses multiple limitations of conventional systems simultaneously.

The broader implication extends beyond recommendation systems to multi-modal learning more generally. The success of semantic IDs and soft tokenization suggests that LLMs can effectively incorporate non-textual modalities through learned discrete representations and continuous vector projections respectively. This challenges the assumption that multi-modal models require architectural modifications such as separate encoders or cross-attention mechanisms for each modality.

Nevertheless, several questions merit further investigation. The presentation notes that models experience catastrophic forgetting during domain adaptation but does not quantify the extent of degradation or specify mitigation strategies beyond careful post-training procedures. Additionally, the computational cost of daily embedding generation for over one billion users raises questions about the environmental and economic sustainability of this approach as platforms scale further. The trade-off between model sophistication and inference latency—particularly critical for real-time recommendation scenarios—receives limited discussion.

The shift toward user-steerable systems through products like Taste Profile represents a significant departure from the opacity characteristic of traditional recommender systems. However, this transparency creates new challenges around user understanding of model capabilities and limitations. Users may develop incorrect mental models of how their interactions influence recommendations, potentially leading to frustration when the system's behavior diverges from expectations.

6. Conclusion

This analysis demonstrates that Large Language Models can serve as effective backbones for recommender systems at unprecedented scale through careful integration of foundational user modeling, semantic content tokenization, and soft personalization techniques. Spotify's production deployment across multiple surfaces validates the practical viability of generative recommender systems that provide steerability and explainability as inherent architectural properties.

The key technical contributions—semantic IDs for content representation, soft tokenization for personalization, and unified cross-content modeling—collectively address fundamental limitations of traditional multi-stage pipelines while maintaining computational feasibility at web scale. The approach demonstrates that domain adaptation of open-weight LLMs can effectively combine platform-specific collaborative filtering signals with world knowledge, enabling reasoning about content that transcends purely interaction-based understanding.

For practitioners, the framework suggests that transitioning from traditional recommender architectures to LLM-based systems requires rethinking not only model architecture but also the entire recommendation stack, from user representation to content encoding to personalization mechanisms. The production success across Spotify's diverse product portfolio indicates that such transitions are achievable at scale, though they necessitate substantial infrastructure investment and careful attention to trade-offs between model sophistication and operational constraints.

Sources

Personalization in the Era of LLMs - Shivam Verma, Spotify - Original Creator (YouTube)
Analysis and summary by Sean Weldon using AI-assisted research tools

About the Author

Sean Weldon is an AI engineer and systems architect specializing in autonomous systems, agentic workflows, and applied machine learning. He builds production AI systems that automate complex business operations.
