End-to-End Text Classification using Python and Scikit-learn

  • Published 11 Oct 2024

COMMENTS • 63

  • @anuradhabalasubramanian9845

    Lovely presentation, Sir. Thank you so much for the detailed video. Hats off to you!

  • @ijeffking
    @ijeffking 4 years ago +1

    Extremely useful tutorial. The explanation and pointers make it especially easy to follow and learn, particularly the usefulness of H2O... Thank you very much.

  • @deepakkumar-zc9ok
    @deepakkumar-zc9ok 4 years ago +1

    Wow, it is a great tutorial; eagerly waiting for the next video.

  • @sukamal5832
    @sukamal5832 4 years ago +1

    Nice tutorial, providing details of every argument and their usefulness. Using H2O AutoML is a new learning for me. Thank you.

  • @Induraj11
    @Induraj11 3 years ago +1

    Excellent tutorial, Srivatsan Sir!

  • @karimbaig8573
    @karimbaig8573 4 years ago +1

    I really liked watching the video. A few approaches, as you mentioned: one could be the word2vec approach, but its drawback is that it is very data-specific and generalizing it to unseen data is tough. One more pipeline could be:
    1. Transforming using GloVe + the H2O pipeline you just mentioned
    2. Using ELMo or BERT for transforming (look into the flair package) and then applying the H2O pipeline
    These two approaches could be very good for boosting your f1 score.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Karim.. Thank you, and yes, these are options to boost the f1 score. I will be covering both approaches on the same dataset in future videos. Let us see how close we can get on the f1 score, as it is difficult to get very high accuracy on this data as well.
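
A minimal sketch of the averaged-word-vector idea from this thread, using a toy vector dictionary in place of real GloVe embeddings (loading actual GloVe vectors, e.g. via gensim, is left out; the words and vectors below are made up for illustration):

```python
import numpy as np

def avg_vector(text, vectors, dim):
    """Average the word vectors of all in-vocabulary tokens in `text`.

    Out-of-vocabulary tokens are skipped; an all-OOV text maps to the
    zero vector. The resulting document vectors can then be fed to the
    same downstream classifier pipeline used for the TF-IDF features.
    """
    words = [vectors[w] for w in text.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean(words, axis=0)

# Toy 3-dimensional "embeddings" standing in for GloVe vectors.
toy_vectors = {
    "late": np.array([1.0, 0.0, 0.0]),
    "fee":  np.array([0.0, 1.0, 0.0]),
    "card": np.array([0.0, 0.0, 1.0]),
}

doc_vec = avg_vector("Late fee on my card", toy_vectors, dim=3)
```

The same averaging trick applies to contextual ELMo/BERT vectors, except those are computed per sentence rather than looked up per word.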

  • @bijaynayak6473
    @bijaynayak6473 2 years ago

    Extremely useful

  • @abhishekbhardwaj7214
    @abhishekbhardwaj7214 4 years ago +1

    Very well summarised.

  • @sharanbabu2001
    @sharanbabu2001 4 years ago +1

    Congrats on hitting the 20K mark, sir! Glad that I am a part of this.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Thanks Sharan and thank you for engaging along the journey :)

  • @vinothkumar-xw1wy
    @vinothkumar-xw1wy 4 years ago +1

    Great content. Happy that I learnt something new!! Thanks so much, Srivatsan.

  • @aboseutube
    @aboseutube 4 years ago +1

    Excellent tutorial, Srivatsan. Learned something new about H2O. Thanks very much for the explanation. However, one request: if you could go a bit slower; sometimes I need to pause the video to understand the concept.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Thanks Aniruddha.. Noted, and I am trying to slow down, but once I am in front of the camera to record I do not realize it. I will try to be slower in future; meanwhile, if you watch my videos at 0.90x speed it will solve the problem temporarily.

    • @aboseutube
      @aboseutube 4 years ago

      @@AIEngineeringLife sure. Thanks for the suggestion.

  • @shrikantkulkarni9461
    @shrikantkulkarni9461 4 years ago +1

    Great tutorial.. Would oversampling or undersampling methods help with the imbalanced classes in this case? I have used them for non-NLP use cases but not yet on NLP.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Shrikant.. Undersampling highly imbalanced classes helps. Typically 1:10 is a good imbalance ratio at which to further use class weights. Oversampling depends on the data, and on a large feature space it can introduce noise which can further deteriorate the model on real-world data. Again, that was my observation, and I have not found much success with oversampling.
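
The undersample-to-1:10-then-reweight recipe from this reply can be sketched in plain Python; the label counts are made up for illustration, and the weight formula mirrors scikit-learn's `class_weight="balanced"` heuristic, n_samples / (n_classes * count):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical imbalanced label list: 950 majority vs 50 minority rows.
labels = ["majority"] * 950 + ["minority"] * 50

# 1. Undersample the majority class down to a 1:10 ratio.
majority = [l for l in labels if l == "majority"]
minority = [l for l in labels if l == "minority"]
majority = random.sample(majority, 10 * len(minority))  # keep 500
balanced = majority + minority

# 2. Compute class weights on the undersampled data:
#    weight(c) = n_samples / (n_classes * count(c)).
counts = Counter(balanced)
n, k = len(balanced), len(counts)
class_weights = {c: n / (k * cnt) for c, cnt in counts.items()}
```

In practice the indices of the sampled rows (not just the labels) would be kept so the matching feature rows can be selected, and `class_weights` passed to the model's class-weight or sample-weight parameter.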

  • @sushantpenshanwar
    @sushantpenshanwar 4 years ago +1

    Thank you for the great tutorial. What if we combine the last two classes into one and later have another binary classifier for them? Will that help improve the recall?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +2

      Sushant.. Yes, it might help. I would start by seeing if we can collect more data. I took 6 months of complaints; maybe for the last 2 classes alone I can go for 2 years of them. We can also try downsampling the majority classes by some percentage. If both do not help, we can build a separate binary classifier. I would also go for training a better vectorizer than TFIDF and see if performance improves.

    • @sushantpenshanwar
      @sushantpenshanwar 4 years ago +1

      @@AIEngineeringLife I will try both options. Thank you again, sir, for the great tutorial. Hope more are coming. Will update here if I get good results.
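
The merge-then-binary-classifier idea discussed above amounts to a two-stage label mapping; a minimal sketch, where "class_d" and "class_e" are hypothetical stand-ins for the two hardest minority classes:

```python
# Hypothetical complaint labels; class_d / class_e are the two
# minority classes being merged for the first-stage model.
labels = ["class_a", "class_d", "class_e", "class_b", "class_d"]

# Stage 1: map both minority classes to one umbrella label, so the
# main multi-class model sees a single "class_d_or_e" target.
merge = {"class_d": "class_d_or_e", "class_e": "class_d_or_e"}
stage1 = [merge.get(l, l) for l in labels]

# Stage 2: rows the first model assigns to the umbrella label are
# routed to a separate binary classifier trained only on the
# original two classes; these are its training labels.
stage2_rows = [l for l in labels if l in merge]
```

At prediction time, anything the stage-1 model tags as the umbrella class gets a second pass through the binary model to recover the original label.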

  • @Mayuresh751
    @Mayuresh751 3 years ago +1

    Amazing tutorial! I have just one query. The aml.train() step just takes too long, more than an hour, and I haven't been able to proceed. Do I need to change the runtime type or something to make it run faster?

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago +1

      If you set the runtime to GPU, only XGBoost uses it in H2O AutoML. Best is to add parameters to AutoML to restrict it to specific model types and give a time or rounds budget to limit iterations.

    • @Mayuresh751
      @Mayuresh751 3 years ago

      @@AIEngineeringLife I did try the exact same parameters as shown. But I shall try adding more parameters to see if it reduces the running time. Thank you so much! Looking forward to viewing more tutorials :)
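
A configuration sketch of the run-limiting suggestion above, using the documented H2OAutoML constructor parameters (the column names in the commented-out train() call are placeholders, and actually training requires a running H2O cluster):

```python
from h2o.automl import H2OAutoML
# import h2o; h2o.init()  # start/attach a local H2O cluster first

# Restrict the AutoML search instead of letting it explore everything:
# cap wall-clock time, cap the number of models, and keep only the
# algorithms of interest (XGBoost is the one that can use a GPU).
aml = H2OAutoML(
    max_runtime_secs=600,           # stop after 10 minutes
    max_models=10,                  # or after 10 models, whichever first
    include_algos=["XGBoost", "GBM"],
    seed=42,
)
# aml.train(x=feature_cols, y="label", training_frame=h2o_train_df)
```

Either `max_runtime_secs` or `max_models` alone is enough to bound the run; `include_algos` mainly trims the search space.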

  • @siddhant17khare
    @siddhant17khare 3 years ago +1

    Extremely useful content, Sir! Thanks a lot for the demonstration (y)
    Also, Sir, I recall that in one of your videos (related to NLP), you used named entities as features while training a classification model; was that the case? I am asking because I am unable to find that video now.
    Please let me know your suggestions.

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago +1

      Siddhant.. In one of the Twitter analysis videos I used NER to add new features to the dataframe, but did not use them to build the model. I was showing more how we can think about multiple aspects.

  • @rupakgoyal1611
    @rupakgoyal1611 4 years ago +1

    Thank you

  • @rockshubham9592
    @rockshubham9592 4 years ago +1

    Just one question: you used a test size of 80%, but generally I know we take something like the reverse for the test size.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      During development I used 20% for training to test it quickly, as H2O AutoML on 80% was taking an hour or more. I forgot to switch it back to 80-20 after testing. But someone pointed it out, and when I tested with 80-20 after the video the output pattern was similar, which validates it. Sorry for the confusion though.
      The only difference in this case is that I am training with less data and validating with the other chunk, but since the technique I used is not data-hungry, it did not impact the model much, as it would complex models like LSTMs, which are data-hungry. Thanks for pointing it out though.
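
For reference, the conventional 80/20 split discussed here looks like this in scikit-learn; the feature and label lists are placeholders:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))             # placeholder feature rows
y = [i % 2 for i in range(100)]  # placeholder labels

# test_size is the *held-out* fraction: 0.2 keeps 80% for training.
# (Passing 0.8 by mistake trains on only 20% of the data.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

`stratify=y` keeps the class proportions the same in both splits, which matters for the imbalanced classes in this dataset.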

  • @sasikiran7172
    @sasikiran7172 3 years ago +1

    Thanks for the lecture. I think the train size must be 75%, but you have taken test_size as 75%; please check it once.

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago +1

      Yes, Sasi.. I have mentioned it in another comment as well. I was testing with a lower ratio and forgot to switch it back before recording the video. But either way the resultant output is the same; the confusion matrix does not change much beyond 25% either, as the technique demonstrated here is not as data-hungry as neural networks :) . Thanks for pointing it out.

  • @abhijitmalode3189
    @abhijitmalode3189 3 years ago +1

    Nice video. Where do I get the dataset?

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago

      You will find it in my git repo here - github.com/srivatsan88/UA-camLI/tree/master/dataset
      You can also follow the link to the dataset from video and use it directly in notebook

  • @Akash5130
    @Akash5130 4 years ago +1

    Hey Srivatsan! I was trying to find the notebook file for this project on your GitHub. Can you please help here? Thank you :)

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Here you go - github.com/srivatsan88/Natural-Language-Processing/blob/master/Text_Classification_using_TFIDF_AutoML_scikit_learn.ipynb

    • @Akash5130
      @Akash5130 4 years ago

      @@AIEngineeringLife Thank you so much!!!

  • @kimayashah7374
    @kimayashah7374 4 years ago +1

    Awesome explanation. Could you please provide the GitHub link for this code?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +4

      Here it is - github.com/srivatsan88/Natural-Language-Processing/blob/master/Text_Classification_using_TFIDF_AutoML_scikit_learn.ipynb

  • @kishorereddy521
    @kishorereddy521 3 years ago

    Hi sir, this class weights concept works for every boosting technique, right?

  • @bhaveshsalvi4437
    @bhaveshsalvi4437 3 years ago +1

    As you converted the H2O XGBoost parameters to actual parameters, how can I convert H2O GBM parameters to actual parameters for running a manual GBM algorithm with the same parameters as H2O used?

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago

      I doubt you need to, Bhavesh. The reason XGB needs it is that it has a lot of hyperparameters. In contrast, GBM has a limited set and can be directly mapped easily.

    • @bhaveshsalvi4437
      @bhaveshsalvi4437 3 years ago

      Thanks for acknowledging.

  • @karimbaig8573
    @karimbaig8573 3 years ago +1

    How do I add l1/l2 regularization in the H2O AutoML itself?

    • @AIEngineeringLife
      @AIEngineeringLife 3 years ago

      You can check the documentation, where you can set individual model hyperparameters. Since this will be at the model level, that is the ideal way to go - docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

  • @moushtaqahmad3570
    @moushtaqahmad3570 4 years ago +1

    How do I convert the H2O hyperparameters into GBM hyperparameters instead of XGBoost hyperparameters? (The GBM model gave the least mean per-class error for my dataset.)

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +2

      Moushtaq.. You do not have to; you can use the H2O GBM parameters directly in scikit as well. The conversion was required for XGBoost because H2O had done some internal mapping to make it consistent with its other APIs.

    • @moushtaqahmad3570
      @moushtaqahmad3570 4 years ago

      Any thread on how to use the H2O GBM parameters in scikit? (I am a newbie.)

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago

      Actually, you can directly save the GBM model and deploy it. I just showed an example in scikit, but if you want to, you can just use the scikit GBM model. I do not have an example of it though.
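
The point in this thread - that H2O GBM hyperparameters map almost one-to-one onto scikit-learn's GradientBoostingClassifier - can be sketched as a simple renaming dict; the mapping covers the common parameters only, and the H2O parameter values below are hypothetical:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Common H2O GBM names -> scikit-learn GradientBoostingClassifier names.
H2O_TO_SKLEARN = {
    "ntrees": "n_estimators",
    "learn_rate": "learning_rate",
    "max_depth": "max_depth",
    "sample_rate": "subsample",
    "min_rows": "min_samples_leaf",
}

# Hypothetical parameters as reported by an H2O GBM leader model.
h2o_params = {"ntrees": 50, "learn_rate": 0.1, "max_depth": 5,
              "sample_rate": 0.8, "min_rows": 10}

# Rename and build an equivalent scikit-learn model.
sk_params = {H2O_TO_SKLEARN[k]: v for k, v in h2o_params.items()}
model = GradientBoostingClassifier(**sk_params)
```

Parameters without a direct scikit-learn counterpart (e.g. H2O's histogram settings) simply have to be dropped or approximated.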

  • @majdoubwided6666
    @majdoubwided6666 4 years ago +1

    Hi, very useful tutorial. When I run this code:
    h2o_train_df = h2o.H2OFrame(train_df)
    h2o_test_df = h2o.H2OFrame(test_df)
    => this error appears:
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 7525-7528: character maps to
    PS: I'm using another file (containing Facebook posts in multilingual Arabic, French dialect..). How do I solve it, please?

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      I think you may have to encode the data into the target language's encoding and then pass it on to the H2O dataframe. The encoding set is not correct. You can either set the encoding on the pandas dataframe or follow the issue tracker below - github.com/llSourcell/twitter_sentiment_challenge/issues/1

    • @majdoubwided6666
      @majdoubwided6666 4 years ago

      @@AIEngineeringLife Yes, I put encoding='utf-8' but in vain! Errors still appear. Can you please make a video about treating multilingual text, because I'm struggling here.

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Majdoub.. I can. Do you have any dataset in the open domain that I can try with?

    • @majdoubwided6666
      @majdoubwided6666 4 years ago

      @@AIEngineeringLife I can send you mine if you want; it's 1400 Facebook posts (I collected them from Facebook groups).

    • @AIEngineeringLife
      @AIEngineeringLife 4 years ago +1

      Yes.. please send me the file as a LinkedIn message if possible.
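
One way to attack the UnicodeEncodeError discussed in this thread is to make sure the multilingual text survives an explicit UTF-8 round trip in pandas before it is handed to h2o.H2OFrame; a sketch with a throwaway two-row CSV (the H2OFrame call itself is omitted, since it needs a running cluster):

```python
import os
import tempfile
import pandas as pd

# Write a small multilingual CSV explicitly as UTF-8.
rows = "text\nmerci beaucoup\nشكرا جزيلا\n"
path = os.path.join(tempfile.mkdtemp(), "posts.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write(rows)

# Read it back with the encoding stated explicitly; from here the
# DataFrame can be passed to h2o.H2OFrame(df) as in the video.
df = pd.read_csv(path, encoding="utf-8")
```

If the error persists, the original file on disk may not actually be UTF-8; detecting and re-saving it in UTF-8 first is the usual fix.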

  • @ansylpinto2301
    @ansylpinto2301 4 years ago

    Please share the code as well. It would be a great help.

  • @JiminPark-ld2xx
    @JiminPark-ld2xx 2 years ago

    I have this issue when I type
    df('Priority').value_counts()
    ---------------------------------------------------------------------------
    TypeError Traceback (most recent call last)
    in ()
    ----> 1 df('Priority').value_counts()
    TypeError: 'DataFrame' object is not callable
    What is the issue here? I can't figure it out.

    • @JiminPark-ld2xx
      @JiminPark-ld2xx 2 years ago

      I actually found the answer. The code should be like this:
      df['Priority'].value_counts()
      You have to change the brackets from ( ) to [ ]. 'Priority' is the column whose categories I want to count, i.e., how many records there are in each category.

  • @PushpitKamboj
    @PushpitKamboj 1 year ago

    Sir, please do share the dataset link.

  • @shankhadeepghosal731
    @shankhadeepghosal731 3 years ago

    Where can I get the data source?

  • @adesojialu1051
    @adesojialu1051 3 years ago

    Error. The jre