How To Specify The Partition For Mappartition In Spark
What I would like to do is compute each list separately. For example, if I have 5 lists ([1,2,3,4,5,6], [2,3,4,5,6], [3,4,5,6], [4,5,6], [5,6]), I would like to get back 5 lists, each processed independently.
Solution 1:
As far as I understand your intentions, all you need here is to keep the individual lists separate when you parallelize your data:
data = [[1,2,3,4,5,6], [2,3,4,5,6,7], [3,4,5,6,7,8],
[4,5,6,7,8,9], [5,6,7,8,9,10]]
rdd = sc.parallelize(data)
rdd.take(1)  # a single element of the RDD is a whole list
## [[1, 2, 3, 4, 5, 6]]
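Since the question's title asks about mapPartitions: parallelize also lets you control how the lists are spread across partitions via its optional numSlices argument, so you can get exactly one list per partition. PySpark may not be installed everywhere, so the sketch below imitates Spark's contiguous slicing in plain Python; the commented rdd lines assume a live SparkContext named sc.

```python
# With Spark, one list per partition (assuming SparkContext `sc`):
#   rdd = sc.parallelize(data, numSlices=len(data))
#   rdd.getNumPartitions()   # 5
#   rdd.glom().collect()     # one single-list partition per input list

# Plain-Python sketch of how parallelize slices a list into
# numSlices contiguous partitions (an approximation of Spark's logic):
def slice_partitions(data, num_slices):
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

data = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [3, 4, 5, 6, 7, 8],
        [4, 5, 6, 7, 8, 9], [5, 6, 7, 8, 9, 10]]

parts = slice_partitions(data, len(data))
# Each of the 5 partitions holds exactly one whole list,
# e.g. parts[0] == [[1, 2, 3, 4, 5, 6]]
```

With one list per partition, a mapPartitions function sees a single list at a time, which matches the "compute each list separately" requirement.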
Now you can simply map using a function of your choice:
def drop_six(xs):
    return [x for x in xs if x != 6]
rdd.map(drop_six).take(3)
## [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
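If you specifically want mapPartitions (as in the title), the function must consume an iterator over all the lists in a partition and yield results. This is a hedged sketch: the Spark call appears in a comment and assumes a SparkContext sc, while the plain-Python loop imitates what Spark would run on each partition.

```python
def drop_six_partition(iterator):
    # Receives an iterator over the lists in one partition and
    # yields each list with the value 6 filtered out.
    for xs in iterator:
        yield [x for x in xs if x != 6]

# With Spark (assuming `rdd` was built with one list per partition):
#   rdd.mapPartitions(drop_six_partition).take(3)

# Plain-Python imitation, using one partition per list:
data = [[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7], [3, 4, 5, 6, 7, 8]]
partitions = [[xs] for xs in data]
result = [out for part in partitions
          for out in drop_six_partition(iter(part))]
# result == [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7], [3, 4, 5, 7, 8]]
```

For this per-list computation, map and mapPartitions give the same result; mapPartitions only pays off when you want per-partition setup (for example, opening a connection once per partition).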