RDDs, DataFrames and Datasets in Apache Spark - NE Scala 2016

  • Published Aug 27, 2024
  • Traditionally, Apache Spark jobs have been written using Resilient Distributed Datasets (RDDs), a Scala Collections-like API. RDDs are type-safe, but they can be problematic: It's easy to write a suboptimal job, and RDDs are significantly slower in Python than in Scala. DataFrames address some of these problems, and they're much faster, even in Scala; but, DataFrames aren't type-safe, and they're arguably less flexible.
    Enter Datasets: a type-safe, object-oriented programming interface that works with the DataFrames API, provides some of the benefits of RDDs, and can be optimized via the Catalyst optimizer.
    This talk will briefly recap RDDs and DataFrames, introduce the Datasets API, and then, through a live demonstration, compare the performance of all three against the same non-trivial data source.
    Talk by Brian Clapper
    March 4th, 2016
    www.nescala.org/
    Produced by NewCircle - Spark Training & Resources:
    newcircle.com
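
    A minimal sketch of the three APIs the talk compares, in Scala. This is an illustration, not code from the talk: it assumes a hypothetical `people.json` input file and uses the modern `SparkSession` entry point (the talk predates Spark 2.0, which unified `SQLContext` and `HiveContext`).

    ```scala
    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("rdd-df-ds").getOrCreate()
    import spark.implicits._

    // RDD: type-safe Scala objects, but opaque to the Catalyst optimizer,
    // so Spark cannot rewrite or reorder these transformations.
    val rdd = spark.sparkContext
      .textFile("people.csv")                      // hypothetical CSV input
      .map(_.split(","))
      .map(a => Person(a(0), a(1).toLong))
      .filter(_.age > 21)

    // DataFrame: fully optimized by Catalyst, but rows are untyped;
    // a misspelled column name fails only at runtime.
    val df = spark.read.json("people.json")
      .filter($"age" > 21)

    // Dataset: typed objects layered on the DataFrame engine;
    // the filter below is checked at compile time.
    val ds = spark.read.json("people.json").as[Person]
      .filter(_.age > 21)
    ```

    Note that the lambda passed to the Dataset's `filter` is a black box to Catalyst, so the typed form can still lose some optimizations compared with the column-expression form used on the DataFrame.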

COMMENTS • 23

  • @apetiteful
    @apetiteful 4 years ago +2

    This was 4 years ago, but it still helped a ton. Now Datasets are an integral part of Spark.

  • @yonglelyu4117
    @yonglelyu4117 7 years ago +7

    I was confused by Datasets and DataFrames; this video solved my confusion!

  • @williamnarmontas9549
    @williamnarmontas9549 8 years ago +5

    Slides:
    www.ardentex.com/publications/RDDs-DataFrames-and-Datasets-in-Apache-Spark.pdf
    www.ardentex.com/publications/RDDs-DataFrames-and-Datasets-in-Apache-Spark/

  • @prabhubentick7165
    @prabhubentick7165 6 years ago +2

    Awesome explanation. Thanks for uploading.

  • @prateekgautam7398
    @prateekgautam7398 9 months ago

    He mentions "lambdas" a lot. I know what lambda functions are, but can somebody explain the context in which he is talking about "lambdas" in this video? For instance, when starting with Datasets here at 18:12.

  • @dishajain2026
    @dishajain2026 5 years ago +1

    Very nice explanation!!

  • @osamafrankkimemenihian4311
    @osamafrankkimemenihian4311 2 years ago

    Thanks. This was super helpful.

  • @RahulChaudharyy
    @RahulChaudharyy 6 years ago

    This was really helpful. Thanks a ton!!

  • @ArifTak
    @ArifTak 5 years ago

    Very helpful, thank you.

  • @nasreenmohsin
    @nasreenmohsin 5 years ago +1

    Good lecture... Please let me ask one thing: if your hair is raw data, your beard is structured data, and your clothes are semi-structured data, which technique should be used: RDD, DataFrame, or Dataset? Please explain with an example.

  • @AmitKumarGrrowingSlow
    @AmitKumarGrrowingSlow 8 years ago +2

    Does anyone know the answer to the question asked at the end? Are they going to use Datasets in the MLlib libraries?

    • @Rodrio21
      @Rodrio21 7 years ago

      Hey Amit, I was interested in the same question because I used MLlib a week ago. You probably already know by now, but the answer is here: spark.apache.org/docs/latest/ml-guide.html

  • @FaraazAhmad
    @FaraazAhmad 5 years ago

    I had to google UTSL; I'm glad I did.

  • @KA-du7vm
    @KA-du7vm 3 years ago +1

    This guy has got a spark, see his hair!

  • @Ayoub-adventures
    @Ayoub-adventures 3 years ago

    For me, all these presentations are the same and very high-level, unfortunately.

  • @pikachu7173
    @pikachu7173 6 years ago

    Good basic stuff :)

  • @EugenePetrash
    @EugenePetrash 2 years ago

    What, has Kolomoisky gone into Big Data now too? ;)