ML System Design Mock Interview - Build an ML System That Classifies Which Tweets Are Toxic

  • Published Jun 2, 2024
  • Ace your machine learning interviews with Exponent’s ML engineer interview course: bit.ly/3SSbxC4
    A machine learning engineer walks through building a system that classifies tweets as toxic or not. They explore the dataset, emphasizing data pre-processing and tokenization with a pre-trained tokenizer. A sequential model architecture is chosen, with embedding, LSTM, and non-linear dense layers, and the role of each is explained. The engineer discusses monitoring training and validation loss to detect overfitting or underfitting and suggests countermeasures. For evaluation, metrics such as precision, recall, and accuracy are proposed, taking the dataset's class imbalance into account. The engineer acknowledges the potential benefits of a different model architecture such as BERT and highlights the importance of evaluating model calibration and interpretability.
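    For reference, a minimal Keras sketch of the kind of architecture described above (layer sizes, vocabulary size, and metric choices are illustrative assumptions, not values taken from the video):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    VOCAB_SIZE = 30522   # assumed: vocabulary size of a BERT-style pre-trained tokenizer

    model = models.Sequential([
        layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # token IDs -> dense vectors
        layers.LSTM(64),                                        # captures word-order / contextual information
        layers.Dense(32, activation="relu"),                    # non-linearity on top of the LSTM output
        layers.Dense(1, activation="sigmoid"),                  # probability that the tweet is toxic
    ])

    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])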
    Chapters (Powered by ChapterMe) -
    00:00 - Introduction to Building a Toxic Tweet Classification System
    01:53 - Overview of Binary Classification and Predictions
    02:59 - Model Deployment and Monitoring
    05:11 - Text Classification: Preprocessing Pipeline
    08:46 - Balancing Dataset Samples
    11:27 - Advanced Preprocessing for Machine Learning
    22:23 - Building a Sequential Model with Keras
    28:15 - Understanding LSTM Layers for Contextual Information
    31:31 - Model Summary: Training, GPU Use, and Loss Function
    34:29 - Model Training Strategies and Overfitting Prevention
    38:47 - Evaluating Model Precision and Recall
    43:09 - Automated Sentiment Processing with Instant Models
    46:31 - Leveraging BERT Tokens for Classification
    48:24 - Fundamentals of Machine Learning and Model Validation
    Want more machine learning content?
    - Fake News Detection System - Machine Learning Mock Interview - • Fake News Detection Sy...
    - Amazon Machine Learning Engineer Interview: K-Means Clustering - • Amazon Machine Learnin...
    - How to Become a Machine Learning Engineer - • How to Become a Machin...
    👉 Subscribe to our channel: bit.ly/exponentyt
    🕊️ Follow us on Twitter: bit.ly/exptweet
    💙 Like us on Facebook for special discounts: bit.ly/exponentfb
    📷 Check us out on Instagram: bit.ly/exponentig
    📹 Watch us on TikTok: bit.ly/exponenttikttok
    ABOUT US:
    Did you enjoy this interview question and answer? Want to land your dream career? Exponent is an online community, course, and coaching platform to help you ace your upcoming interview. Exponent has helped people land their dream careers at companies like Google, Microsoft, Amazon, and high-growth startups. Exponent is currently licensed by Stanford, Yale, UW, and others.
    Our courses include interview lessons, questions, and complete answers with video walkthroughs. Get access to hours of real interview videos, where we analyze what went right or wrong, and to our community of 1000+ expert coaches and industry professionals, to help you land your dream job and more!

COMMENTS • 9

  • @kaanbicakci
    @kaanbicakci 15 days ago +1

    Calling the shuffle() method on a tf.data.Dataset instance before splitting the dataset can cause data leakage. By default the dataset is reshuffled on every iteration, so each time one of those take() and skip() methods is called, the order of the elements drawn from "dataset" is different, which may introduce overlapping samples between the splits. Here's a small example (the output will differ every time, but you should see the overlap after running it multiple times):
    import numpy as np
    import tensorflow as tf

    num_rows = 10
    # Toy dataset of IDs 1..10; shuffle() defaults to reshuffle_each_iteration=True
    dataset = tf.data.Dataset.from_tensor_slices(np.arange(1, num_rows + 1))
    dataset = dataset.cache()
    dataset = dataset.shuffle(num_rows)
    dataset = dataset.batch(2)
    dataset = dataset.prefetch(1)

    # Splitting *after* shuffle: every traversal via take()/skip() reshuffles the data,
    # so the "splits" can end up containing overlapping elements.
    train = dataset.take(2)
    val = dataset.skip(2).take(1)
    test = dataset.skip(3).take(1)

    def extract_ids(ds):
        ids = []
        for batch in ds:
            ids.extend(batch.numpy())
        return np.array(ids)

    train_ids = extract_ids(train)
    val_ids = extract_ids(val)
    test_ids = extract_ids(test)

    train_val_overlap = np.intersect1d(train_ids, val_ids)
    train_test_overlap = np.intersect1d(train_ids, test_ids)
    val_test_overlap = np.intersect1d(val_ids, test_ids)

    print("Train IDs:", train_ids)
    print("Val IDs:", val_ids)
    print("Test IDs:", test_ids)
    print("Train-Val Overlap:", train_val_overlap)
    print("Train-Test Overlap:", train_test_overlap)
    print("Val-Test Overlap:", val_test_overlap)

  • @diegofabiano8489
    @diegofabiano8489 2 months ago +6

    I honestly like the machine learning system design interviews much better; the one with the Meta engineer, where he actually applied the steps, was awesome!

  • @jackjill67
    @jackjill67 2 months ago +4

    First useful video... otherwise most people just talk through it.

  • @mandanafasounaki2192
    @mandanafasounaki2192 1 month ago

    Great work, solid coding skills. One thing I'd add: when we use the BERT tokenizer, all the information that needs to be extracted from the text for classification is already embedded in the vectors, so a simple perceptron could work well on top of those embeddings. But your approach is great for demonstrating the development lifecycle of an ML project.
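    For illustration, a minimal sketch of that idea, a frozen BERT encoder with a single sigmoid unit on top (this assumes the Hugging Face transformers library and bert-base-uncased; names like train_texts are placeholders, nothing here comes from the video):

    import tensorflow as tf
    from transformers import AutoTokenizer, TFAutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = TFAutoModel.from_pretrained("bert-base-uncased")
    bert.trainable = False  # use BERT purely as a frozen feature extractor

    def embed(texts):
        # Tokenize and run BERT; take the [CLS] vector as a fixed-size sentence embedding
        enc = tokenizer(list(texts), padding=True, truncation=True,
                        max_length=128, return_tensors="tf")
        return bert(**enc).last_hidden_state[:, 0, :]

    # A single sigmoid unit ("perceptron") on top of the 768-dimensional embeddings
    clf = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(768,))])
    clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # clf.fit(embed(train_texts), train_labels, validation_data=(embed(val_texts), val_labels))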

  • @DrAhdol
    @DrAhdol 2 months ago +1

    Something I'd like to see more of in these ML videos is acknowledgement of approaches that don't rely on neural networks. For something like this, you could use multinomial naive Bayes with bag-of-words/TF-IDF features and get good performance with very fast inference as a baseline to compare against the more complex NN models.
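    A rough sketch of such a baseline with scikit-learn (train_texts/train_labels/test_texts/test_labels are placeholder variables; this is not from the video):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    # TF-IDF bag-of-words features feeding a multinomial naive Bayes classifier
    baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), MultinomialNB())
    baseline.fit(train_texts, train_labels)   # train_texts: list of tweet strings, train_labels: 0/1
    print(classification_report(test_labels, baseline.predict(test_texts)))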

  • @alexb2997
    @alexb2997 18 days ago

    Just to represent for recurrent networks: it's a little unfair to LSTMs to suggest they might struggle with long-term dependencies in tweets. Transformers do have an architecture better suited to long-range retrieval, but LSTMs are a variant of RNNs specifically designed to handle long-term dependencies. For tweet-length documents, you'd be fine. I'm not saying don't use a transformer, just don't write off recurrent models so quickly.

  • @TooManyPBJs
    @TooManyPBJs 1 month ago

    Isn't it a bit duplicative to add an LSTM on top of BERT tokens, since BERT is already sequence-aware?

    • @alexb2997
      @alexb2997 19 days ago

      The tokens are just simple vocab indices; there's no sequence encoding involved at that stage. The sequence magic happens inside the transformer, which wasn't used here.
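      A quick way to see this (assuming the Hugging Face tokenizer; not from the video):

      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
      # The tokenizer only maps wordpieces to integer vocabulary indices; any contextual
      # or sequence information comes from the BERT model itself, which isn't used here.
      print(tokenizer("this tweet is fine")["input_ids"])
      # prints a plain list of ints, starting with 101 ([CLS]) and ending with 102 ([SEP])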

  • @user-dx4un7gg2z
    @user-dx4un7gg2z 2 months ago +1

    How did you scrape this data from Twitter? The Twitter API has lots of restrictions. Can you please explain that?