
Sum In Spark Gone Bad

Based on Unbalanced factor of KMeans?, I am trying to compute the unbalanced factor, but I fail. Every element of the RDD r2_10 is a pair, where the key is the cluster and the value is
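Judging from the solution below, which takes len(x[1]) for every element, the value appears to be a collection of points assigned to that cluster. A minimal, hypothetical sketch of what r2_10 might look like (the data and shape are illustrative, not from the original question):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical stand-in for r2_10: (cluster_id, list_of_points) pairs
r2_10 = sc.parallelize([
    (0, [[1.0, 2.0], [1.1, 2.1], [0.9, 1.8]]),  # 3 points in cluster 0
    (1, [[5.0, 5.0]]),                          # 1 point in cluster 1
    (2, [[9.0, 0.5], [8.8, 0.7]]),              # 2 points in cluster 2
])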

Solution 1:

The problem is that you never counted the number of points grouped in each cluster, so you have to change how pdd is created.

pdd = r2_10.map(lambda x: (x[0], len(x[1]))).reduceByKey(lambda a, b: a + b)  # (cluster, point count) pairs
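With the hypothetical RDD sketched above, pdd.collectAsMap() would return {0: 3, 1: 1, 2: 2}, i.e. one count per cluster.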

However, you could obtain the same result in a single pass (without computing pdd) by mapping each element of the RDD to the length of its value and summing the results.

total = r2_10.map(lambda x: len(x[1])).sum()  # total number of points across all clusters
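Applied to the same hypothetical data, total evaluates to 6. The trade-off is that sum() produces only the grand total in one action, whereas the pdd version keeps the per-cluster counts, which you would still need if the balance metric is computed per cluster.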
