3-PyCaret regression (automated machine learning) from zero to almost hero

  • Published 26 Jun 2024
  • @Pedram Jahangiry
    My GitHub repository: github.com/PJalgotrader/platf...
    0:20 List of regression models in PyCaret
    1:55 Where to find the python notebooks
    3:40 Running the notebook on Google Colab
    5:10 Links and resources (API documentation)
    10:12 Import dataset into PyCaret
    17:35 Split data into train, hold-out, and unseen data
    20:55 setup()
    35:05 compare_models()
    45:00 create_model()
    48:30 tune_model()
    53:00 plot_model(): residuals, parameters, learning curve, validation curve
    1:07:42 interpret_model()
    1:10:20 ensemble_model()
    1:14:12 blend_models() vs stack_models()
    1:22:45 predict_model() and finalize_model()
    1:28:00 save_model() and load_model()
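    For readers who want the chapters above condensed into code, below is a minimal end-to-end sketch of the PyCaret regression workflow (a sketch assuming PyCaret 3.x and its bundled 'insurance' example dataset; swap in your own data and target):

    ```python
    # Minimal PyCaret 3.x regression workflow, mirroring the video chapters.
    # Assumes the bundled 'insurance' demo dataset with target column 'charges'.
    from pycaret.datasets import get_data
    from pycaret.regression import (
        setup, compare_models, create_model, tune_model, plot_model,
        predict_model, finalize_model, save_model, load_model,
    )

    data = get_data("insurance")

    # setup(): initialize the experiment (train/test split, CV folds, etc.)
    s = setup(data=data, target="charges", session_id=123)

    best = compare_models()              # rank all regressors by cross-validated error
    model = create_model("rf")           # train one model, e.g. a random forest
    tuned = tune_model(model)            # hyperparameter tuning via random grid search
    plot_model(tuned, plot="residuals")  # diagnostics: residuals, learning curve, ...

    final = finalize_model(tuned)        # refit the pipeline on the full training data
    preds = predict_model(final, data=data)

    save_model(final, "my_pipeline")     # persist pipeline + model to disk
    loaded = load_model("my_pipeline")   # reload later for inference
    ```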

COMMENTS • 16

  • @marziehmirzaei3603
    @marziehmirzaei3603 1 month ago

    Thanks, Dr. Jahangiry. I mostly got the concept from this video and hope to learn how to work with it on my own laptop in the next tutorials. :)

  • @maheshsanjaychivateres982
    @maheshsanjaychivateres982 1 year ago

    Nice

  • @shakeelahmad3162
    @shakeelahmad3162 1 year ago +1

    Hi Pedram, thanks for the great video. If possible, please answer my question.
    --My dataset has 20 input features and 12 output targets. Is there any model that performs multi-output prediction? I could use deep learning ANN models, but what about classical ML? Are there better ways to tackle this kind of multi-output problem?

    • @pedramjahangiry
      @pedramjahangiry 1 year ago +1

      Absolutely, multi-output (or multi-target) prediction is a task that some machine learning models are designed to handle. Let's discuss both traditional machine learning models and deep learning approaches (see the sketch after this reply):
      Traditional Machine Learning Models:
      Several traditional machine learning algorithms can handle multi-output prediction directly, meaning they can predict multiple dependent variables at once.
      Decision Trees and Random Forests: These models support multiple outputs natively; in scikit-learn you can simply fit them on a 2-D target array.
      MultiOutputRegressor in Scikit-learn: This is a strategy for performing multi-output regression that fits one regressor per target. It is a simple way to turn any single-output regressor into a multi-output one. For example, an SVM regressor is typically single-output, but wrapping it in MultiOutputRegressor makes it compatible with multi-output tasks.
      Deep Learning Models:
      Artificial Neural Networks (ANNs), including deep learning models, are well suited to multi-output prediction tasks. A typical setup involves an input layer with a node for each feature in your dataset, one or more hidden layers, and an output layer with a node for each of your targets.
      Multi-output Dense Layers: You can define a Dense output layer with 12 nodes (for your 12 outputs) and train it on your data. This effectively performs multi-output regression.
      Multiple Output Models: In some cases, you might want a separate output layer for each target, for example when the targets have different scales or units. Keras supports this by letting you specify a list of output layers.
      Next week, I am releasing a video covering multi-output regression. Keep an eye out for it!
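      As a concrete illustration of the MultiOutputRegressor strategy above, here is a minimal sketch on synthetic data (20 features and 12 targets, matching your setup; the SVR settings are arbitrary):

      ```python
      # Sketch: turning a single-output SVR into a multi-output regressor
      # with scikit-learn. The data below is synthetic.
      from sklearn.datasets import make_regression
      from sklearn.model_selection import train_test_split
      from sklearn.multioutput import MultiOutputRegressor
      from sklearn.svm import SVR

      # 20 input features, 12 output targets, as in the question
      X, y = make_regression(n_samples=500, n_features=20, n_targets=12,
                             noise=0.1, random_state=42)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

      # SVR is single-output; MultiOutputRegressor fits one SVR per target
      model = MultiOutputRegressor(SVR(kernel="rbf", C=1.0))
      model.fit(X_train, y_train)

      print(model.predict(X_test).shape)  # (n_samples, 12): one column per target
      ```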

  • @wowbestboy
    @wowbestboy 1 year ago

    Hi Pedram, first of all, congratulations. You are doing an amazing, priceless job! I appreciate it.
    I have several questions and issues; since they may apply to others as well, I am putting them here to get your thoughts:
    1- Are all the ML models applied in this video applicable to time series forecasting as well? (I assume they do not touch the order of the data.)
    2- If the answer to question 1 is yes, how about the K-fold number? Put another way, how does the choice of K affect the analysis?
    3- Is there a quick way to optimize K, or should I just continue with the default?
    4- Is there any other parameter or factor that affects the applicability of these models to time series prediction?
    5- When running the exp1.plot_model(tune_xgboost, plot='learning') line, I received the following error:
    ImportError: cannot import name '_png' from 'matplotlib' (/usr/local/lib/python3.8/dist-packages/matplotlib/__init__.py)
    I checked different resources to overcome this error; many pointed to the matplotlib package version, and I tried several versions, but the result is the same. Do you have any idea about it? In the end, I was unable to plot the results.
    Sorry for the lengthy comment, and sorry if I missed replies where you already explained this. Looking forward to receiving your answers.
    Thanks.
    Saeed

    • @pedramjahangiry
      @pedramjahangiry 1 year ago +1

      Hi Saeed, thanks for your feedback and for asking great questions:
      1- Almost all of them are available for time series. For univariate time series there are additional models, including exponential smoothing, ETS, ARIMA, theta, and a bunch of others. I will cover them on this channel within the next couple of months.
      2- Cross-validation is a different story for time series. Next week I will release a video on the challenges of ML for time series, and I will cover different techniques for time series CV soon (see the sketch after this reply).
      3- In practice, K=5 or 10 works just fine; there is no optimization going on here. The point is to get a more stable estimate from CV, and that is usually achieved with K=10. If the variance of the performance metric is still high, we can do something called repeated cross-validation, but that is computationally expensive.
      4- Again, time series is a different animal. Stay tuned over the next couple of weeks as I talk about this more.
      5- Try copying the exact error message into ChatGPT and take it from there.
      Let me know if you have any other questions.
      Cheers,
      Pedram
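      As a stopgap for point 2 until that video is out, here is a minimal sketch of order-preserving cross-validation using scikit-learn's TimeSeriesSplit on synthetic data (one common technique, not PyCaret's own time-series CV):

      ```python
      # Sketch: time-series cross-validation with expanding training windows.
      # Each fold trains only on observations that precede the test window,
      # so the temporal order of the data is never violated.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import r2_score
      from sklearn.model_selection import TimeSeriesSplit

      rng = np.random.default_rng(0)
      X = rng.normal(size=(300, 5))                                 # synthetic features
      y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)  # synthetic target

      tscv = TimeSeriesSplit(n_splits=5)
      for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
          model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
          print(f"fold {fold}: R2 = {r2_score(y[test_idx], model.predict(X[test_idx])):.3f}")
      ```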

  • @MJJi-tw8uj
    @MJJi-tw8uj 3 months ago

    Hi Pedram,
    Thanks for the insightful video on regression with PyCaret! I have a question about handling categorical features during preprocessing in PyCaret.
    PyCaret normalizes the entire dataset by default, including the encoded categorical columns. Would it be better to preprocess the data myself before passing it to PyCaret?

    • @pedramjahangiry
      @pedramjahangiry 2 months ago

      Absolutely, MJ. I personally don't rely on the preprocessing step in PyCaret for exactly the reason you mentioned. So yes, preprocess the data first and then pass it to PyCaret's setup() while turning off its automatic preprocessing (see the sketch below).
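      A minimal sketch of that workflow, assuming PyCaret 3.x (the file name data.csv and the target column price are placeholders):

      ```python
      # Sketch: preprocess the data yourself, then hand it to PyCaret with
      # preprocess=False so its automatic preprocessing pipeline is skipped.
      import pandas as pd
      from pycaret.regression import setup, compare_models

      df = pd.read_csv("data.csv")              # hypothetical dataset

      # Your own preprocessing, e.g. one-hot encode the categorical columns
      df = pd.get_dummies(df, drop_first=True)

      exp = setup(data=df, target="price", preprocess=False, session_id=123)
      best = compare_models()
      ```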

    • @MJJi-tw8uj
      @MJJi-tw8uj 1 month ago

      @pedramjahangiry Thank you for your answer, Pedram, and congratulations on your recent award. :)

  • @user-hi4tw6mq3r
    @user-hi4tw6mq3r 6 months ago

    Do you know if interpret_model() can be run on the stacked model?

    • @pedramjahangiry
      @pedramjahangiry 6 months ago

      Hi there, I haven't tried using interpret_model with a stacked model in PyCaret myself, so I can't confirm its compatibility. It might be worth experimenting with. If you try it, I'd appreciate hearing about your experience.

  • @user-fq1ff9zo1n
    @user-fq1ff9zo1n 6 months ago

    Dear Pedram,
    I appreciate your informative video. Given the R-squared scores for my model (Train R² = 0.983, Test R² = 0.811, CV R² = 0.8396), do you believe my model is exhibiting signs of overfitting?
    Thank you.

    • @pedramjahangiry
      @pedramjahangiry 6 months ago

      I really can't tell without looking into your dataset and problem statement in general. However, the R-squared scores you've provided (Train R² = 0.983, Test R² = 0.811, CV R² = 0.8396) suggest there might be some degree of overfitting. Here's how to read these scores:
      Training R² (0.983): This is a very high R-squared value, indicating that the model fits the training data almost perfectly. While a high training score is generally desirable, an excessively high score can be a sign that the model is too complex and is capturing not only the underlying patterns but also the noise in the training data.
      Test R² (0.811): The test R-squared is significantly lower than the training R-squared. This drop in performance indicates that the model does not generalize to unseen data as well as it fits the training data. In a well-fitting model, you would expect the training and test R-squared values to be closer.
      Cross-Validation R² (0.8396): Cross-validation assesses the model's ability to generalize to an independent dataset. The CV R² is closer to the test R² than to the training R², reinforcing the notion that the model's performance on unseen data is lower than on the training data.
      In short, the substantial gap between the training R² and the test/CV R² scores suggests overfitting: your model has learned the training data too well, including its noise and outliers, which do not generalize to the test data (see the sketch after this reply).
      I would suggest using ChatGPT to explore ways to address the overfitting. Let me know if you have other questions.
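      For illustration, here is a minimal sketch of this diagnostic on synthetic data: fit one model, then compare train, test, and cross-validated R²; a large train-to-test gap is the warning sign described above:

      ```python
      # Sketch: overfitting check by comparing train / test / CV R-squared.
      # Synthetic data; a flexible model on a small noisy sample tends to
      # show a near-perfect train score and a noticeably lower test score.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import cross_val_score, train_test_split

      rng = np.random.default_rng(1)
      X = rng.normal(size=(110, 8))                       # small sample
      y = X[:, 0] * 3 + rng.normal(scale=1.0, size=110)   # signal + noise

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
      model = RandomForestRegressor(random_state=1).fit(X_tr, y_tr)

      print("train R2:", round(model.score(X_tr, y_tr), 3))
      print("test  R2:", round(model.score(X_te, y_te), 3))
      print("CV    R2:", round(cross_val_score(model, X_tr, y_tr, cv=5).mean(), 3))
      ```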

    • @user-fq1ff9zo1n
      @user-fq1ff9zo1n 6 months ago

      @pedramjahangiry
      Thank you for your response and your time. I'm currently predicting the compressive strength of concrete with a dataset of just 110 entries for both training and testing. To mitigate overfitting, I've used the tuning options and cross-validation in PyCaret, which produced the R-squared values above.

  • @superfreiheit1
    @superfreiheit1 5 days ago

    Can you make the code area bigger? It's hard to read.