Machine Learning Question - Training AI to Detect Bots (Full mock interview)

  • Published 14 Jun 2024
  • Ace your machine learning interviews with Exponent’s ML engineer interview course: bit.ly/45HCQEi
    In this interview, Nathan discusses how to train a machine-learning model to flag potential bot accounts on social networks. He addresses the issue of class imbalance in the training data and suggests subsampling or oversampling based on the data set size. Nathan also discusses the trade-offs between different models and metrics for evaluating model performance.
    Chapters (Powered by ChapterMe) -
    00:00 - Intro
    00:53 - Preventing bots and malicious actors on social networks
    04:20 - Dataset handling subsampling, oversampling, intelligent training
    06:31 - Collecting minority class examples for model learning
    08:33 - Algorithmic model selection for class imbalance
    12:05 - Model Ensembling Weights for models based on empirical distribution
    13:02 - Methods for ensembling models
    17:48 - Split training examples to avoid bias
    21:36 - Classification metrics accuracy, precision, false positives
    23:55 - F1 score, AUROC, and manual review
    26:18 - Metrics for cost-effective model evaluation
    30:00 - Retraining model against bots, robust evaluation pipeline
    31:34 - Data mining for bot classification
    32:06 - Machine learning model training with gold standard labels
    33:36 - Discussing pros and cons of adversarial training
    Want more Machine Learning content?
    Fake News Detection System - Machine Learning Mock Interview - • Fake News Detection Sy...
    Machine Learning Interview - Create a System to Predict Netflix Watch Times • Netflix ML Question - ...
    Amazon Machine Learning Engineer Interview: K-Means Clustering - • Amazon Machine Learnin...
    How to Become a Machine Learning Engineer - • How to Become a Machin...
    ABOUT US:
    Did you enjoy this interview question and answer? Want to land your dream career? Exponent is an online community, course, and coaching platform to help you ace your upcoming interview. Exponent has helped people land their dream careers at companies like Google, Microsoft, Amazon, and high-growth startups. Exponent is currently licensed by Stanford, Yale, UW, and others.
    Our courses include interview lessons, questions, and complete answers with video walkthroughs. Access hours of real interview videos, where we analyze what went right or wrong, and our 1000+ community of expert coaches and industry professionals, to help you get your dream job and more!
    #techjobinterviewprep #interviewtips #jobinterviewpreparation #Exponent #machinelearningengineer #datascientist #artificialintelligence #computervision #MachineLearning #AI #BotDetection

COMMENTS • 13

  • @tryexponent
    @tryexponent  5 months ago

    Make sure you're interview-ready with Exponent's machine learning case interview course: bit.ly/45HCQEi

  • @sophiophile
    @sophiophile 4 months ago +2

    You can still get a probability as your output from a tree-based model by using a tree regressor instead of a classifier. This would also be more amenable to tuning and ensembling.
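A minimal sketch of the idea in this comment, assuming a single made-up feature (`posts_per_hour`) and hypothetical helper names `fit_stump` and `predict`. A regression tree fitted to 0/1 labels predicts the mean label in each leaf, which is the fraction of bots that landed there, so its output can be read as a probability. This is a one-split toy tree, not a production model.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict(stump, x):
    """Return the leaf mean, i.e. the fraction of bots seen in that leaf."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical toy data: label 1 = bot, 0 = human.
posts_per_hour = [1, 2, 3, 4, 5, 20, 25, 30, 35, 40]
is_bot = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

stump = fit_stump(posts_per_hour, is_bot)
low_activity_score = predict(stump, 2)    # leaf with no bots
high_activity_score = predict(stump, 30)  # leaf where 5 of 6 accounts are bots
```

Many tree classifier implementations expose the same leaf fractions directly (for example, scikit-learn's `predict_proba`), so either route yields a score that can be thresholded or averaged with other models.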

  • @sophiophile
    @sophiophile 4 months ago +2

    As for approaches to gathering more data, you can also likely use any number of techniques to generate synthetic data, although you need to be careful that this does not result in overfitting if the synthetic data too closely resembles the real data (although you'll run into the same issue, probably worse, if you repeatedly sample).
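A rough sketch contrasting the two options mentioned here, assuming toy feature vectors and hypothetical helper names (`oversample_repeat`, `oversample_jitter`). Repeated sampling only duplicates the few real bot rows, while the jitter variant adds small Gaussian noise to make new, nearby points. The noise scale is an arbitrary assumption, and real synthetic-data methods are more involved.

```python
import random


def oversample_repeat(rows, minority_label, target, rng):
    """Sample minority rows with replacement until there are `target` of them."""
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    return majority + minority + extra


def oversample_jitter(rows, minority_label, target, rng, scale=0.05):
    """Create synthetic minority rows by adding small Gaussian noise to real ones."""
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    synthetic = []
    for _ in range(max(0, target - len(minority))):
        features, label = rng.choice(minority)
        synthetic.append(([x + rng.gauss(0.0, scale) for x in features], label))
    return majority + minority + synthetic


rng = random.Random(0)
# Toy data: 95 humans (label 0) and 5 bots (label 1); the features are made up.
data = [([rng.random(), rng.random()], 0) for _ in range(95)]
data += [([rng.random() + 1.0, rng.random() + 1.0], 1) for _ in range(5)]

repeated = oversample_repeat(data, 1, 95, rng)
jittered = oversample_jitter(data, 1, 95, rng)

# Count distinct bot feature vectors in each balanced dataset.
distinct_repeat = len({tuple(f) for f, y in repeated if y == 1})
distinct_jitter = len({tuple(f) for f, y in jittered if y == 1})
```

Both outputs are balanced, but `distinct_repeat` stays at the original 5 bot rows, which is why repeated sampling invites memorisation. Either technique should be applied to the training split only, so that validation and test data remain real.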

  • @shilashm5691
    @shilashm5691 8 months ago +2

    Precision-Recall should be used in case of class imbalance
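A small illustration of why accuracy misleads under class imbalance, assuming a toy set of 95 humans and 5 bots and two hypothetical classifiers. The function names are made up for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) treating label 1 (bot) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [0] * 95 + [1] * 5

# Classifier 1: never flags anyone.
always_human = [0] * 100
# Classifier 2 (hypothetical): 2 false alarms, catches 4 of the 5 bots.
some_model = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

acc_trivial = accuracy(y_true, always_human)            # 0.95
prf_trivial = precision_recall_f1(y_true, always_human)  # (0.0, 0.0, 0.0)
acc_model = accuracy(y_true, some_model)                # 0.97
prf_model = precision_recall_f1(y_true, some_model)
```

The do-nothing classifier reaches 95% accuracy while catching no bots, whereas precision and recall separate the two classifiers clearly.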

  • @ArunKumar-bp5lo
    @ArunKumar-bp5lo 9 months ago +1

    more on ml interview pls

  • @sophiophile
    @sophiophile 4 months ago +2

    96/2/2 split seems very extreme. That kind of split might be appropriate when you are updating a model, but during early stages of training the likelihood of spurious results is way too high IMO.

    • @timothybaker2822
      @timothybaker2822 3 months ago

      I think having a larger (i.e., 96%) training dataset decreases the chance of the model learning spurious correlations. Basically, a larger train set implies more bot examples in the train set, which decreases the chance that the model homes in on a spurious feature. On the other hand, a small validation set may increase overfitting from hyperparameter tuning, and a small test set may lead to a less accurate final model evaluation.
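To make the trade-off in this thread concrete, here is a sketch of a stratified 96/2/2 split that counts how many bots land in each part. The 5% bot rate comes from the discussion; the dataset sizes and the helper names `stratified_split` and `bots_per_split` are assumptions for illustration.

```python
import random


def stratified_split(labels, fractions, rng):
    """Split indices so each class keeps roughly the same share in every part."""
    splits = [[] for _ in fractions]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        start, cumulative = 0, 0.0
        for k, frac in enumerate(fractions):
            cumulative += frac
            last = k == len(fractions) - 1
            end = len(idxs) if last else round(cumulative * len(idxs))
            splits[k].extend(idxs[start:end])
            start = end
    return splits


def bots_per_split(n_accounts, bot_rate=0.05, fractions=(0.96, 0.02, 0.02), seed=0):
    """Return the number of bots in the train, validation, and test parts."""
    n_bots = round(n_accounts * bot_rate)
    labels = [1] * n_bots + [0] * (n_accounts - n_bots)
    splits = stratified_split(labels, fractions, random.Random(seed))
    return [sum(labels[i] for i in part) for part in splits]


small = bots_per_split(10_000)    # [480, 10, 10]
large = bots_per_split(100_000)   # [4800, 100, 100]
```

With 10,000 accounts, the validation and test sets each hold only 10 bots, so a single misclassified bot moves recall by 10 points. The same split on a larger dataset leaves far more bots for evaluation, which is when such an extreme ratio becomes easier to defend.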

  • @askwhydude
    @askwhydude 8 months ago

    Don't you think LR is a good method for smaller datasets? For larger datasets, I would take a different approach: extract the top 10 features from the bot set, check whether new data has any of those features, and flag it as a bot if it does. I guess this would work with any MLS dataset.

  • @shilashm5691
    @shilashm5691 8 months ago

    Ensembling is different from deploying multiple models at the same endpoint, right?

    • @tryexponent
      @tryexponent  7 months ago

      Hey Shilash! Ensembling means aggregating the predictions of multiple models to improve overall performance. Multiple models deployed at the same endpoint, on the other hand, are not necessarily used for the same task.
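As a rough illustration of the distinction, here is one simple form of ensembling: a weighted average of per-account bot probabilities from several models, all solving the same task. The scores, the 0.5 threshold, and the function name `ensemble_average` are assumptions for this sketch.

```python
def ensemble_average(prob_lists, weights=None):
    """Combine per-model probabilities into one score per account."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models  # plain average when no weights are given
    total = sum(weights)
    n_items = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_items)
    ]


# Hypothetical bot probabilities from three models for the same four accounts.
model_a = [0.9, 0.2, 0.6, 0.1]
model_b = [0.8, 0.4, 0.3, 0.2]
model_c = [0.7, 0.3, 0.9, 0.0]

combined = ensemble_average([model_a, model_b, model_c])
flags = [p >= 0.5 for p in combined]

# Giving one model more say, e.g. because it scored better on validation data.
weighted = ensemble_average([model_a, model_b, model_c], weights=[2.0, 1.0, 1.0])
```

Models that merely share an endpoint would each return their own answer to their own question; nothing like `combined` is computed from them.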

  • @rahulsahani_AI
    @rahulsahani_AI 4 months ago

    Data augmentation can be done for the 5% detected bot data, right?

    • @timothybaker2822
      @timothybaker2822 3 months ago

      Data augmentation definitely can help with small dataset size, so it makes sense that it can also help with class imbalance. I'm familiar with data augmentation for image applications, but not with this problem's type of data (which I'm guessing is a mix of user data and activity). How would you do it for this problem?

  • @user-uh8ez8ue2y
    @user-uh8ez8ue2y 9 months ago

    Where is the voice coming from? Is it in the headphones? 🤔