Sum In Spark Gone Bad
Based on Unbalanced factor of KMeans?, I am trying to compute the unbalanced factor, but I fail. Every element of the RDD r2_10 is a pair, where the key is a cluster and the value is the list of points assigned to that cluster.
Solution 1:
The problem is that you never counted the number of points grouped in each cluster, so you have to change how pdd is created:
# Count the points in each cluster's list, then add the counts per cluster key
pdd = r2_10.map(lambda x: (x[0], len(x[1]))).reduceByKey(lambda a, b: a + b)
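As a minimal self-contained sketch (the local SparkContext and the toy points below are assumptions for illustration, not part of the original question), the same computation on a small RDD shaped like r2_10 would look like:

from pyspark import SparkContext

sc = SparkContext("local", "cluster-count-demo")  # hypothetical local context for the demo

# Toy RDD shaped like r2_10: (cluster_id, list_of_points) pairs
r2_10 = sc.parallelize([
    (0, [[1.0, 2.0], [1.5, 1.8]]),
    (1, [[5.0, 8.0]]),
    (0, [[8.0, 8.0]]),
])

# Count the points in each value list, then sum the counts per cluster key
pdd = r2_10.map(lambda x: (x[0], len(x[1]))).reduceByKey(lambda a, b: a + b)
print(pdd.collect())  # e.g. [(0, 3), (1, 1)]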
However, you could obtain the same result in a single pass (without computing pdd) by mapping the values of the RDD to their lengths and then reducing with sum:
# Map each (cluster, points) pair to its point count and sum all counts in one pass
total = r2_10.map(lambda x: len(x[1])).sum()
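On the toy r2_10 sketched above (an assumption for illustration), this would print the total number of points across all clusters:

print(total)  # e.g. 4 for the toy data above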