Use cross_val_score and GridSearchCV on a Pipeline

  • Published Jan 11, 2025

COMMENTS • 40

  • @dataschool
    @dataschool 3 years ago

    Thanks for watching! 🙌 If you have any questions about cross-validation, grid search, or hyperparameter tuning, let me know! 💬 And if you're new to pipelines, I recommend starting with this video: ua-cam.com/video/1Y6O9nCo0-I/v-deo.html

  • @theodoretourneux5662
    @theodoretourneux5662 2 years ago +4

    Amazing! Exactly what I was looking for

  • @willw4096
    @willw4096 2 years ago

    2:25 cross_val_score splits the data and THEN applies the pipeline steps, versus preprocessing the data and then using cross-validation on just the model; 3:20 Preprocessing before splitting the data doesn't properly simulate reality; splitting then preprocessing does simulate reality.

    • @dataschool
      @dataschool 2 years ago

      Exactly! Thanks for pulling out these quotes, Will, and thanks also for joining as a channel member! 🙏
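
For readers following along, here is a minimal sketch of the "split, then apply the pipeline steps" idea from the quotes above. The dataset (`load_breast_cancer`), the scaler, and the model settings are assumptions chosen for illustration, not taken from the video.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption, not the one used in the video).
X, y = load_breast_cancer(return_X_y=True)

# The scaler and the model travel together, so for every CV split the
# scaler is fitted on that split's training folds only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Passing the whole pipeline (rather than pre-scaled data plus a bare model) is what keeps the preprocessing inside each split.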

  • @Harry-ex1vw
    @Harry-ex1vw 1 year ago

    To those who wonder what Kevin was trying to say about the order of preprocessing and splitting the data, this is the clarification I got from ChatGPT. I can't validate its answer due to my limited knowledge, but hell yeah, it makes a lot of sense to me!
    "In machine learning, it is common to use techniques like cross-validation to evaluate the performance of a model. Cross-validation involves splitting the data into several subsets, then using each subset to train the model and evaluate its performance.
    It is better to apply data preprocessing steps (such as scaling or encoding categorical variables) after splitting the data into subsets, rather than before.
    The reason for this is that if you preprocess the entire dataset before splitting, you may inadvertently introduce information leakage between the training and testing subsets. For example, if you scale the entire dataset before splitting it into subsets, the scaling factors will be influenced by the values in the testing subset, which should not be used during training.
    By contrast, if you apply preprocessing steps after splitting the data, you can be sure that the preprocessing is only based on the training data and not the testing data. This better simulates the real-world scenario, where you only have access to the training data when building a model."

    • @dataschool
      @dataschool 1 year ago

      Hi Harry, thanks so much for posting this! The ChatGPT response is absolutely correct!
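
The leakage described in the quoted explanation can be made visible with a small sketch. The data here is synthetic and deliberately exaggerated (one extreme value placed in the first test fold), so treat it as an illustration of the mechanism rather than a realistic benchmark.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic, exaggerated data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
X[0, 0] = 50.0  # extreme value; row 0 falls in the first test fold below

# Without shuffling, the first split holds out rows 0..19 for testing.
train_idx, test_idx = next(KFold(n_splits=5).split(X))

leaky_scaler = StandardScaler().fit(X)             # statistics include test rows
clean_scaler = StandardScaler().fit(X[train_idx])  # statistics from training rows only

print("mean :", leaky_scaler.mean_[0], "vs", clean_scaler.mean_[0])
print("scale:", leaky_scaler.scale_[0], "vs", clean_scaler.scale_[0])
```

The two scalers disagree because the first one has "seen" a test-fold value. Scaling before splitting bakes that information into the training data.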

  • @breegette8263
    @breegette8263 2 years ago +1

    How would you set it up to test multiple classifiers to find the best model and hyperparameters?

  • @BiranchiNarayanNayak
    @BiranchiNarayanNayak 3 years ago +2

    Excellent tutorial.

    • @dataschool
      @dataschool 3 years ago

      Thanks! Glad it was helpful to you!

  • @abimbolaobadare6691
    @abimbolaobadare6691 3 years ago +1

    Welcome back sensei.

  • @yohkk8zaa
    @yohkk8zaa 2 years ago

    Thanks for the clear description!
    I want to ask a question: do I need to split all the samples into training data and test data with train_test_split before I do the grid search, or should I do the CV split first and then the grid search?

  • @dnguyendev
    @dnguyendev 3 years ago +1

    Great content! Can you make a video about data scaling? Good to see you're back!

    • @dataschool
      @dataschool 3 years ago

      Thanks for the suggestion, and for your kind comment! 🙏

  • @chetansehgal3228
    @chetansehgal3228 1 year ago

    I have a question. Are param names for GridSearchCV case-insensitive, or do we need to write step names in lowercase?

    • @dataschool
      @dataschool 1 year ago

      Great question. Param names are case-sensitive! I'm using lowercase step names because make_pipeline automatically creates lowercase step names. Hope that helps!
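
A small sketch of the case-sensitivity point in the reply above. It uses `set_params`, which resolves `step__param` names on a pipeline and rejects unknown ones; the estimator choices are assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its class, in lowercase,
# so this key resolves.
pipe.set_params(logisticregression__C=10)

# The same key with the class-style capitalization does not match any step.
rejected = False
try:
    pipe.set_params(LogisticRegression__C=10)
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

Misspelled or wrongly cased keys in a parameter grid are rejected the same way once the search starts fitting.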

  • @RoyAAD
    @RoyAAD 1 year ago

    The step name is not affected by how you define the transformer?
    For example you wrote
    logisticregression__C and not clf__C.
    How do you know it should be logisticregression? Is there a way to check?

    • @dataschool
      @dataschool 1 year ago +1

      Great question! You examine the Pipeline step names.
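
One way to examine the step names, sketched under the assumption of a two-step pipeline like the one discussed in this thread: `named_steps` lists the steps, and `get_params()` lists every `step__param` key a grid search could target.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# The keys of named_steps are the step names used as prefixes.
step_names = list(pipe.named_steps)
print(step_names)

# Every key containing "__" is a tunable parameter of one of the steps.
tunable = sorted(key for key in pipe.get_params() if "__" in key)
print(tunable)
```

If the pipeline were built with `Pipeline([("clf", ...)])` instead, the same calls would show `clf` and `clf__C`.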

  • @pinksincerely
    @pinksincerely 3 years ago

    Hello! Thank you so much for this! Super useful. Can I ask: is there no need to do the train/test split outside of the cross-validation or pipeline? Thanks!

    • @dataschool
      @dataschool 2 years ago

      Depends on your goals... it's complicated to explain briefly, I'm sorry!
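
Since the answer depends on the goal, here is a sketch of just one common option, not something the video or the reply above prescribes: hold out a test set first, tune with cross-validation on the remainder, then check the tuned pipeline once on the held-out rows. The dataset, split size, and grid values are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# Set aside rows that the tuning process never touches.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)  # cross-validation happens inside the training portion

print(grid.best_params_)
test_score = grid.score(X_test, y_test)  # single final check on held-out data
print(test_score)
```

If the goal is only to choose between candidate settings, the cross-validated score alone may be enough; the held-out set matters when an unbiased estimate of the final model is needed.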

  • @Techbyyogi
    @Techbyyogi 3 years ago +1

    Good to see you after a long time, sir 🤘🤘

    • @dataschool
      @dataschool 3 years ago

      Good to be back! New videos coming every Tuesday and Thursday through the end of October 😄

  • @petroskoulouris3225
    @petroskoulouris3225 3 years ago

    Your teaching style and presentation are excellent. Quick question: how can you get feature importance from each model during cross-validation?

    • @theodoretourneux5662
      @theodoretourneux5662 2 years ago

      Sklearn has permutation feature importance built in (permutation_importance in sklearn.inspection). I think you could run that on the fitted pipe. If you’re looking for feature selection, you can do something like sklearn's SequentialFeatureSelector as a step in a pipe, I think.
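
As a possible answer to the question in this thread, here is a sketch that keeps the pipeline fitted on each split and reads the model's coefficients from it. It assumes a linear model on standardized features, where coefficient magnitude is a rough importance signal; the dataset is also an assumption.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# return_estimator=True returns the pipeline fitted on each training split.
results = cross_validate(pipe, X, y, cv=5, return_estimator=True)

# One row of coefficients per fold, one column per feature.
coefs = np.vstack(
    [est.named_steps["logisticregression"].coef_[0] for est in results["estimator"]]
)
print(coefs.shape)  # (5, 30)

# Rank features by mean absolute coefficient across folds.
ranking = np.argsort(-np.abs(coefs).mean(axis=0))
print(ranking[:5])
```

For models without coefficients, `permutation_importance` from `sklearn.inspection` can be called on each fitted pipeline in `results["estimator"]` instead.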

  • @CJP3
    @CJP3 3 years ago +1

    Really good to know! Thanks!

    • @dataschool
      @dataschool 3 years ago

      You're very welcome! Glad it's helpful to you! 🙌

  • @AG2009to2013
    @AG2009to2013 3 years ago

    Thanks for the video!
    Could you please clarify one thing:
    if I do the following steps, will standardization inside cross_val_score be applied only to X, or to X and y?
    If it is applied to X and y, how can I make it apply only to X?
    from sklearn import svm
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    clf = svm.LinearSVC()
    pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])
    cv = KFold(n_splits=4)
    scores = cross_val_score(pipeline, X, y, cv=cv)

    • @dataschool
      @dataschool 3 years ago

      Great question! The standardization will only be applied to X.

    • @AG2009to2013
      @AG2009to2013 3 years ago

      Thanks!
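
To see the "only applied to X" answer in action, here is a sketch using the same two-step layout as the question. The synthetic data is an assumption; the check transforms X with every step except the final estimator and confirms the labels come through unchanged.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data far from zero mean and unit variance (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y = (X[:, 0] > 100.0).astype(int)

pipeline = Pipeline([("transformer", StandardScaler()), ("estimator", LinearSVC())])
pipeline.fit(X, y)

# pipeline[:-1] is the pipeline without its final estimator, so this is
# the feature matrix the classifier actually received.
X_scaled = pipeline[:-1].transform(X)
print(X_scaled.mean(axis=0))  # close to zero in every column

# The labels the classifier learned are the original, unscaled ones.
print(pipeline.classes_)
```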

  • @christopherwickens8199
    @christopherwickens8199 3 years ago

    Thanks for the tutorial. Are you effectively doing nested cross validation here?

    • @dataschool
      @dataschool 3 years ago +1

      No, this is not nested cross-validation, though you can actually achieve that by running cross_val_score on the GridSearchCV object!

    • @christopherwickens8199
      @christopherwickens8199 3 years ago

      @dataschool I see! Thanks for the tip.
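
The tip above (running cross_val_score on the GridSearchCV object) might look like this in practice. The dataset, fold counts, and grid values are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Inner loop: choose C by 3-fold CV inside each outer training split.
inner = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=3)

# Outer loop: score the whole "tune, then refit" procedure on 5 held-out folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Each outer score comes from data that played no part in choosing C for that fold, which is the point of nesting.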

  • @eatbreathedatascience9593
    @eatbreathedatascience9593 3 years ago

    Great Video !!!

  • @marinapachecovillaschi2367
    @marinapachecovillaschi2367 2 years ago

    I'm looking for a solution to loop through a few models and run a cross_val_score on each one of them but I can't seem to have that inside a pipeline. Any thoughts?

    • @pritamdodeja
      @pritamdodeja 2 years ago

      Your params can actually be a list of dictionaries, and one of the elements of that list should have the estimator that you want to try out. The way I think of it is the set of parameters passed needs to be applicable to a particular element in that list. In this way, you can vary both the preprocessing steps as well as the models themselves. I suspect you can also vary the features passed to various parts of ColumnTransformer too, but this part I haven't yet done.
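
A sketch of the list-of-dictionaries idea described above. The step name `clf`, the two candidate models, and the grid values are hypothetical choices for illustration; each dictionary swaps in one estimator and lists only parameters that apply to it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # illustrative dataset

# "clf" is a placeholder step; the grid replaces its estimator.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [50, 100]},
]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X, y)

best_model_name = type(grid.best_params_["clf"]).__name__
print(best_model_name, grid.best_score_)
```

Here the search evaluates five candidates in total (three for logistic regression, two for the random forest) and reports the best combination of model and settings.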

  • @DiaaHaresYusf
    @DiaaHaresYusf 2 years ago

    Bro, something is not correct here. If you have a lot of categorical data, cross-validation will throw an error: when it splits for the second or third time, it will find new data that was not one-hot encoded.
    Please correct this for your audience ♥
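
The situation described above does arise when a category shows up only in a test fold, because OneHotEncoder raises on unseen categories by default (`handle_unknown="error"`). A minimal sketch of one way around it, using `handle_unknown="ignore"`; the tiny DataFrame and its column name `color` are made up for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "rare" occurs once, so in one split it appears only in
# the test fold and the encoder never sees it during fitting.
X = pd.DataFrame({"color": ["red", "blue"] * 10 + ["rare", "red", "blue", "red"]})
y = [0, 1] * 10 + [1, 0, 1, 0]

# With handle_unknown="ignore", an unseen category is encoded as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
pipe = make_pipeline(make_column_transformer((encoder, ["color"])), LogisticRegression())

scores = cross_val_score(pipe, X, y, cv=4)
print(scores)
```

Because the encoder lives inside the pipeline, it is refitted on each split's training folds, and the unseen category no longer stops the run.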