Serrano.Academy
Direct Preference Optimization (DPO) - How to fine-tune LLMs directly without reinforcement learning
Direct Preference Optimization (DPO) is a method used for fine-tuning Large Language Models (LLMs). DPO trains the LLM directly on human preference data, without the need for reinforcement learning, which makes it both more effective and more efficient.
Learn about it in this simple video!
This is the third of the three main videos (Video 0 below is an optional introduction) in a series dedicated to the reinforcement learning methods used for training LLMs.
Full Playlist: ua-cam.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html
Video 0 (Optional): Introduction to deep reinforcement learning ua-cam.com/video/SgC6AZss478/v-deo.html
Video 1: Proximal Policy Optimization ua-cam.com/video/TjHH_--7l8g/v-deo.html
Video 2: Reinforcement Learning with Human Feedback ua-cam.com/video/Z_JUqJBpVOk/v-deo.html
Video 3 (This one!): Direct Preference Optimization
00:00 Introduction
01:08 RLHF vs DPO
07:19 The Bradley-Terry Model
11:25 KL Divergence
14:36 The Loss Function
16:32 Conclusion
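
For reference, here is a minimal PyTorch sketch of the DPO loss covered in "The Loss Function" chapter. The variable names (policy_chosen_logp, ref_chosen_logp, beta, etc.) are placeholders rather than the video's notation: the loss rewards the trained policy for raising the log-ratio of the preferred response and lowering it for the rejected one, relative to a frozen reference model.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # log(pi_theta / pi_ref) for the preferred and the rejected response
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Bradley-Terry style preference probability, scaled by beta;
    # logsigmoid keeps the computation numerically stable
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()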
Get the Grokking Machine Learning book!
manning.com/books/grokking-ma...
Discount code (40%): serranoyt
(Use the discount code at checkout)
Views: 1,640

Videos

KL Divergence - How to tell how different two distributions are
Views: 2.8K · 1 day ago
Correction (10:26): the probabilities are wrong. The correct ones are: for Die 1: 0.4^4 * 0.2^2 * 0.1^1 * 0.1^1 * 0.2^2; for Die 2: 0.4^4 * 0.1^2 * 0.2^1 * 0.2^1 * 0.1^2; for Die 3: 0.1^4 * 0.2^2 * 0.4^1 * 0.2^1 * 0.1^2. Kullback-Leibler (KL) divergence is a way to measure how far apart two distributions are. In this video, we learn KL-divergence in a simple way, using a probability game with...
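
As a quick companion to the description above, here is a minimal sketch of the discrete KL divergence formula; the two dice below reuse the probabilities from the correction and are purely illustrative.

import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

die1 = [0.4, 0.2, 0.1, 0.1, 0.2]
die2 = [0.4, 0.1, 0.2, 0.2, 0.1]
print(kl_divergence(die1, die2))  # positive; it would be 0 for two identical dice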
Why do we divide by n-1 to estimate the variance? A visual tour through Bessel correction
Views: 11K · 21 days ago
Correction: At 30:42 I write "X = Y". They're not equal; what I meant to say is "X and Y are identically distributed". The variance is a measure of how spread out a distribution is. In order to estimate the variance, one takes a sample of n points from the distribution and calculates the average squared deviation from the mean. However, this doesn't give a good estimate of the variance of the di...
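
A small simulation sketch (assuming a normal distribution with variance 4 and samples of size 5, both arbitrary choices) shows why dividing by n underestimates the variance while dividing by n - 1 does not:

import numpy as np

rng = np.random.default_rng(0)
true_var, n = 4.0, 5
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n - 1 (Bessel's correction)

print(biased)    # about true_var * (n - 1) / n = 3.2, systematically too small
print(unbiased)  # about 4.0, unbiased on average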
Reinforcement Learning with Human Feedback - How to train and fine-tune Transformer Models
Views: 8K · 4 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the second one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: ua-cam.c...
Proximal Policy Optimization (PPO) - How to train Large Language Models
Views: 17K · 5 months ago
Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the first one in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs. Full Playlist: ua-cam.c...
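
For reference, a minimal sketch of the clipped surrogate objective at the core of PPO; the names (logp_new, logp_old, advantages, eps) are placeholders, not the video's notation.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # PPO maximizes the smaller of the two terms, so the loss is its negation
    return -torch.min(unclipped, clipped).mean()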
Stable Diffusion - How to build amazing images with AI
Views: 17K · 6 months ago
This video is about Stable Diffusion, an AI method for building amazing images from a prompt. If you like this material, check out LLM University from Cohere! llm.university Get the Grokking Machine Learning book! manning.com/books/grokking-ma... Discount code (40%): serranoyt (Use the discount code at checkout) 0:00 Introduction 1:27 How does Stable Diffusion work? 2:55 Embeddings 12:55 Diffusion...
What are Transformer Models and how do they work?
Views: 101K · 7 months ago
This is the last of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level ua-cam.com/video/OxCpWwDCDFQ/v-deo.html Video 2: The attention mechanism with math ua-cam.com/video/UPtG_38Oq8o/v-deo.html Video 3 (This one): Transformer models If you like this material, check out LLM University from...
The math behind Attention: Keys, Queries, and Values matrices
Views: 212K · 9 months ago
This is the second of a series of 3 videos where we demystify Transformer models and explain them with visuals and friendly examples. Video 1: The attention mechanism in high level ua-cam.com/video/OxCpWwDCDFQ/v-deo.html Video 2: The attention mechanism with math (this one) Video 3: Transformer models ua-cam.com/video/qaWMOYf4ri8/v-deo.html If you like this material, check out LLM University fr...
The Attention Mechanism in Large Language Models
Views: 82K · 10 months ago
Attention mechanisms are crucial to the recent boom in LLMs. In this video you'll see a friendly pictorial explanation of how attention mechanisms work in Large Language Models. This is the first of a series of three videos on Transformer models. Video 1: The attention mechanism in high level (this one) Video 2: The attention mechanism with math: ua-cam.com/video/UPtG_38Oq8o/v-deo....
The Binomial and Poisson Distributions
Views: 9K · 1 year ago
If, on average, 3 people enter a store every hour, what is the probability that over the next hour, 5 people will enter the store? The answer lies in the Poisson distribution. In this video you'll learn this distribution, starting from a much simpler one, the Binomial distribution. Euler number video: ua-cam.com/video/oikl9FCISqU/v-deo.html Grokking Machine Learning book: bit.ly/grokkingML 40% d...
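
A quick numerical check of the example in the description, using the Poisson formula and its Binomial approximation:

import math

lam, k = 3, 5  # 3 arrivals per hour on average; we ask about exactly 5

poisson = math.exp(-lam) * lam**k / math.factorial(k)
print(poisson)  # about 0.1008

# Binomial approximation: split the hour into n tiny slots, each with
# probability lam/n of one arrival; as n grows this approaches the Poisson.
n, p = 3600, lam / 3600
binomial = math.comb(n, k) * p**k * (1 - p)**(n - k)
print(binomial)  # also about 0.1008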
Euler's number, derivatives, and the bank at the end of the universe
Views: 3.6K · 1 year ago
Euler's number, e, is defined as a limit. The function e to the x is (up to multiplying by a constant) the only function that is its own derivative. How are these two related? In this video you'll find an explanation for this phenomenon using banking interest rates, and a very particular bank, located at the end of the universe.
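
A tiny sketch of the compound-interest limit described above, showing (1 + 1/n)^n approaching e as the compounding becomes more frequent:

import math

for n in (1, 12, 365, 1_000_000):
    print(n, (1 + 1 / n) ** n)
print("e =", math.e)  # 2.718281828...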
Decision trees - A friendly introduction
Views: 11K · 1 year ago
A video about decision trees, and how to train them on a simple example. Accompanying blog post: medium.com/@luis.serrano/splitting-data-by-asking-questions-decision-trees-74afed9cd849 For a code implementation, check out this repo: github.com/luisguiserrano/manning/tree/master/Chapter_9_Decision_Trees Helper videos: - Gini index: ua-cam.com/video/u4IxOk2ijSs/v-deo.html - Entropy and informatio...
How do you minimize a function when you can't take derivatives? CMA-ES and PSO
Views: 8K · 1 year ago
What is Quantum Machine Learning?
Views: 10K · 1 year ago
Denoising and Variational Autoencoders
Views: 23K · 2 years ago
Eigenvectors and Generalized Eigenspaces
Views: 26K · 2 years ago
Thompson sampling, one armed bandits, and the Beta distribution
Views: 21K · 2 years ago
The Beta distribution in 12 minutes!
Views: 79K · 3 years ago
A friendly introduction to deep reinforcement learning, Q-networks and policy gradients
Views: 93K · 3 years ago
The Gini Impurity Index explained in 8 minutes!
Views: 38K · 3 years ago
The covariance matrix
Views: 93K · 3 years ago
Gaussian Mixture Models
Views: 68K · 3 years ago
Singular Value Decomposition (SVD) and Image Compression
Views: 89K · 3 years ago
ROC (Receiver Operating Characteristic) Curve in 10 minutes!
Views: 59K · 3 years ago
Restricted Boltzmann Machines (RBM) - A friendly introduction
Views: 63K · 3 years ago
A Friendly Introduction to Generative Adversarial Networks (GANs)
Views: 244K · 4 years ago
You are much better at math than you think
Views: 7K · 4 years ago
Training Latent Dirichlet Allocation: Gibbs Sampling (Part 2 of 2)
Views: 52K · 4 years ago
Latent Dirichlet Allocation (Part 1 of 2)
Views: 128K · 4 years ago
Book by Luis Serrano - "Grokking Machine Learning" (40% off promo code)
Views: 14K · 4 years ago

COMMENTS

  • @frankl1 · 17 hours ago

    Really love the way you broke down the DPO loss; this direct way is more welcome by my brain :). Just one question on the video: I am wondering how important it is to choose the initial transformer carefully. I suspect that if it is very bad at the task, then we will have to change the initial response a lot, but because the loss function prevents it from changing too much in one iteration, we will need to perform a lot of tiny changes toward the good answer, making the training extremely long. Am I right?

  • @rb4754 · 20 hours ago

    Very nice lecture on attention.

  • @mayyutyagi · 21 hours ago

    Now whenever I watch a Serrano video, I first like it and then start watching it, because I know the video is going to be outstanding as always.

  • @mayyutyagi · 22 hours ago

    Liked this video and subscribed to your channel today.

  • @mayyutyagi · 22 hours ago

    Amazing video... Thanks, sir, for this pictorial representation and for explaining this complex topic in such an easy way.

  • @AravindUkrd · 1 day ago

    Thanks for the simplified explanation. Awesome as always. The book link in the description is not working.

  • @johnzhu5735 · 1 day ago

    This was very helpful

  • @siddharthabhakta3261 · 1 day ago

    The best explanation & depiction of SVD.

  • @melihozcan8676 · 1 day ago

    Thanks for the excellent explanation! I used to know the KL Divergence, but now I understand it!

  • @saedsaify9944 · 1 day ago

    Great one; the simpler it looks, the harder it is to build!

  • @stephenlashley6313 · 1 day ago

    This and your whole series of attention NN is a thing of beauty! There are many ways of simplifying this here, but you come the closest to understanding Attention NN and QC are identical and QC is much better. In my opinion QC has never been done correctly, the gates are too confusing and poorly understood. QC is not still in simplified infant stage, it is mature what QC can do and matches all Psychology observations. All problems in Biology and NLP are sequences of strings.

  • @cloudshoring · 2 days ago

    awesome!

  • @bifidoc · 2 days ago

    Thanks!

    • @SerranoAcademy · 2 days ago

      Thank you so much for your kind contribution @bifidoc!!! 💜🙏🏼

  • @user-xc8vy4cw9k · 2 days ago

    I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future study in the field of robotics. I have seen that you only have 4 videos about RL. I am hungry for more of your videos. I found that your videos are easier to understand because you explain well. Please add more RL videos. Thank you 🙏

    • @SerranoAcademy · 2 days ago

      Thank you for the suggestion! Definitely! Any ideas on what topics in RL to cover?

    • @user-xc8vy4cw9k · 21 hours ago

      @SerranoAcademy More videos in the field of robotics, please. Thank you. Could you also guide me on how to approach the study of reinforcement learning?

  • @Omsip123 · 2 days ago

    So well explained

  • @guzh · 2 days ago

    DPO main equation should be PPO main equation.

  • @epepchuy · 3 days ago

    Excellent explanation!!!

  • @iantanwx · 3 days ago

    Most intuitive explanation for QKV, as someone with only an elementary understanding of linear algebra.

  • @VerdonTrigance · 3 days ago

    It's kinda hard to remember all of these formulas and it's demotivating me from further learning.

    • @javiergimenezmoya86 · 3 days ago

      You do not have to remember those formulas. You only have to understand the logic behind them.

  • @IceMetalPunk · 3 days ago

    I'm a little confused about one thing: the reward function, even in the Bradley-Terry model, is based on the human-given scores for individual context-prediction pairs, right? And πθ is the probability from the current iteration of the network, and πRef is the probability from the original, untuned network? So then after that "mathematical manipulation", how does the human-given set of scores become represented by the network's predictions all of a sudden?

  • @user-xc8vy4cw9k · 3 days ago

    Thank you for the wonderful video. Please add more practical example videos for the application of reinforcement learning.

    • @SerranoAcademy · 3 days ago

      Thank you! Definitely! Here's a playlist of applications of RL to training large language models. ua-cam.com/play/PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-.html

  • @laodrofotic7713 · 3 days ago

    None of the videos I've seen on this subject actually explain where the hell the QKV values come from! It's amazing that people jump into making videos without understanding the concepts clearly! I guess YouTube must pay a lot of money! But this video does a good job of explaining most of the things; it just never tells us where the actual QKV values come from or how the embeddings turn into them, and it actually got some things wrong in my opinion. The q comes from embeddings that are multiplied by Wq, which is a weight and a parameter in the model, but then the question is: where do Wq, Wk, Wv come from???
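
    (For context, a minimal sketch, with purely illustrative shapes and names, of where Q, K, and V typically come from: the embeddings are multiplied by weight matrices Wq, Wk, Wv, which are model parameters initialized randomly and learned by backpropagation like any others.)

    import torch

    d_model, d_head, seq_len = 8, 4, 3
    X = torch.randn(seq_len, d_model)          # token embeddings, one row per token

    # Wq, Wk, Wv are ordinary learned weight matrices
    Wq = torch.nn.Parameter(torch.randn(d_model, d_head))
    Wk = torch.nn.Parameter(torch.randn(d_model, d_head))
    Wv = torch.nn.Parameter(torch.randn(d_model, d_head))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)
    attention_output = scores @ V              # weighted combination of the values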

  • @bendim94 · 3 days ago

    how do you choose the number of features in the 2 matrices, i.e. how did you choose to have 2 features only?

  • @Priyanshuc2425 · 3 days ago

    Hey, I know this 👦. He is my maths teacher who doesn't just teach but makes us visualize why we learn the topic and how it will be useful in the real world ❤

  • @Q793148210 · 3 days ago

    It was just so clear. 😃

  • @DienTran-zh6kj · 4 days ago

    I love his teaching, he makes complex things seem simple.

  • @shouvikdey7078 · 4 days ago

    Love your videos, please make more such videos on mathematical description of generative models such as GAN, Diffusion, etc.

    • @SerranoAcademy · 3 days ago

      Thank you! I got some on GANs and Diffusion models, check them out! GANs: ua-cam.com/video/8L11aMN5KY8/v-deo.html Stable diffusion: ua-cam.com/video/JmATtG0yA5E/v-deo.html

  • @mohammadarafah7757 · 4 days ago

    We hope you'll describe the Wasserstein distance 😊

    • @SerranoAcademy · 3 days ago

      Ah good idea! I'll add it to the list, as well as earth-mover's distance. :)

    • @mohammadarafah7757 · 3 days ago

      @SerranoAcademy I also highly recommend describing Explainable AI (XAI), which depends on statistics.

  • @mehdiberchid1974 · 4 days ago

    thank u

  • @bernardorinconceron6139 · 4 days ago

    Thank you Luis. I'm sure I'll use this very soon.

  • @shahnawazalam55 · 4 days ago

    That was intuitive as butter

  • @frankl1 · 4 days ago

    Great video. One question I have: why would I use KL instead of CE? Are there situations in which one would be more suitable than the other?

    • @SerranoAcademy · 4 days ago

      That is a great question! KL(P,Q) is really the CE(P,Q), except you subtract the entropy H(P). The reason for this is that if you compare a distribution with itself, you want to get a zero. With CE, you don't get zero, so the CE of a distribution with itself could potentially be very high.
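
      (A quick numerical check of this relationship, with two arbitrarily chosen distributions:)

      import numpy as np

      def entropy(p):           # H(P) = -sum p_i log p_i
          return -np.sum(p * np.log(p))

      def cross_entropy(p, q):  # CE(P, Q) = -sum p_i log q_i
          return -np.sum(p * np.log(q))

      p = np.array([0.4, 0.2, 0.1, 0.1, 0.2])
      q = np.array([0.4, 0.1, 0.2, 0.2, 0.1])

      print(cross_entropy(p, q) - entropy(p))  # KL(P || Q) = CE(P, Q) - H(P) > 0
      print(cross_entropy(p, p) - entropy(p))  # KL(P || P) = 0, while CE(P, P) = H(P) > 0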

  • @Ashishkumar-id1nn · 4 days ago

    Why did you take the average at 6:30?

    • @SerranoAcademy · 4 days ago

      Great question! I took the average because the product is p_i^(n q_i), so the log is n q_i log(p_i), and I want to get rid of that n. It's not strictly needed for the math, but I did it so that it gives exactly the KL divergence instead of n times it.
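      (In symbols, the same step reads: \frac{1}{n}\log\prod_i p_i^{n q_i} = \frac{1}{n}\sum_i n q_i \log p_i = \sum_i q_i \log p_i.)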

    • @Ashishkumar-id1nn · 4 days ago

      @@SerranoAcademy thanks for the clarification

  • @__-de6he · 4 days ago

    Thanks. That was good, except for the explanation of elementary things like logarithm manipulations (everyone who is interested in your video already knows elementary math).

  • @debashisghosh3133 · 4 days ago

    most intuitive video on KL Divergence...loved it.

  • @aruntakhur · 5 days ago

    The numbers shown in cells (2,1) and (3,1) of the table (time 10:32), used to calculate the probabilities of the sequences, are typos. Please correct them. ua-cam.com/video/sjgZxuCm_8Q/v-deo.html

    • @SerranoAcademy · 4 days ago

      Oh yikes you’re right, thank you! I can’t fix it but I’ll add a note

  • @sra-cu6fz · 5 days ago

    Thanks for posting this.

  • @johanaluna7385 · 5 days ago

    Wow!! Thank you!!! Finally I got it!

  • @paedrufernando2351 · 5 days ago

    What a morning surprise... lovely video

  • @motizin1 · 5 days ago

    Thanks!

    • @SerranoAcademy · 5 days ago

      Thank you so much for your kind contribution @motizin1, it means a lot! 💜🙏🏼

  • @_erika.be.bee_ · 6 days ago

    I've watched so many videos, read through so many websites, and asked so many questions to ChatGPT about PCA. I have to learn it as quickly as possible for a research program. This is by far the best explanation I've seen!!

  • @paveltsvetkov7948 · 6 days ago

    Why do you need the value neural network? Why can't you train the policy neural network alone? Is it because the value network lets you replace the human evaluator and get more training samples for the policy network without the need for human input?

  • @HitAndMissLab · 7 days ago

    Here's some props for a great channel.

  • @bhavikdudhrejiya852 · 7 days ago

    Ohh I missed this.

  • @HitAndMissLab · 7 days ago

    Just discovered it. Phenomenally perfect balance between intuition and quantitative explanation.

  • @27equalsawesome · 7 days ago

    dawg...what is this music lmaoo

  • @wirotep.1210 · 8 days ago

    2 Favorite teachers.

  • @amalradwan7193 · 10 days ago

    thanks really