RDD first() vs take(1)

rdd.first() returns the element at location 0 of the rdd

rdd.take(1) returns an array consisting of only 0th element of the rdd

So if you need to exclude first line of a file(headers) and you are planning to use filter on rdd, then you need to use first()

e.g(python):

header = lines.first() rows = lines.filter(lambda x: x != header)

Key/Value RDD

If you are not modifying the keys from transformation on RDDs, make sure to call mapValues() or flatMapValues() instead of map() or flatMap(). It's more efficient. It allows spark to maintain partitions from original RDDs and not shuffle the data around which can be very expensive when you are running the job on a cluster.

Using these functions means they keys are not being modified and are not being exposed.

Random notes

RDD first() vs take(1)

Key/Value RDD

results matching ""

No results matching ""