Divide RDD into sub parts

asifyar profile image AsifYar ・2 min read

Problem Definition

Divide the Spark RDD into N(as defined by user) sub parts and perform same function on these sub parts with different data set on each sub part


I am having RDD. I want to divide that RDD into sub parts let call him partitions. Now on each partition i want to perform same task for multiple iteration.
Here is code that explains well

case class shape ( dim:Int) {
  val random = new Random()
  var X      : Array[Double]       =   Array.fill(dim)(random.nextDouble() * (100-10)+11 )
  var Y      : Array[Double]       =   Array.fill(dim)( math.random)
  var Loss       : Double              =   math.random
  var NewLoss       : Double              =   math.random 

val N = 1000 //in real N will be in millions
val d = 100 //in real N will be in millions
val nP = 4  // nP defines number of sub parts into which RDD is divided
val iterations = 1000 // in real analysis iteration will be in millions or billions  
val list = List.fill(N)(new shape(d))

 list.map { x =>
  x.Loss = function. SphereFunc(x.X) // Update Loss of each element

val rdd = sc.parallelize(batList, nP)
var partitioned = rdd.persist()

for(iter <- 1 to iterations) {

 partitioned = partitioned.mapPartitionsWithIndex { (k, iterator  )  =>
 val li = iterator.toList
 val localBest = li.minBy(_.Loss).X
 li.map { j =>
        j.Y = ((j.X, localBest).zipped.map(_ - _).map(_ * math.random), j.Y).zipped.map(_ + _)
        j.X = (j.X, j.Y).zipped.map(_ + _)

 li.filter(math.random > _.Loss)
        .map { j =>
          j.X = localBest.map(_ + math.random)            
 li.map{j => j.NewLoss = SphereFunc(j.X)

 li.filter(j => math.random < j.NewLoss && j.NewLoss < j.Loss).map { j =>
        j.Loss = j.NewLoss  

def SphereFunc(list: List[Double]): Double = {
list.foldLeft(0.0)((x, xs) => x + xs * xs)

In this code I created RDD and for each iteration I call mapPartitionsWithIndex to get each partition data and updated elements of each partition.

Problem with this code is that for each iteration, when I call mapPartitionsWithIndex it will perform required operation on each element of partitions and it will again create RDD. So partitions will not remain same. For each iteration RDD is divided again to partitions. Also will cause lot of shuffeling.

But I want to create partitions from RDD at start and than perform desired operation on partitions till user defined iteration completes.
Here i am calling mapPartitionsWithIndex iterations times. Is this possible to call mapPartitionsWithIndex only one time and than iterate over partitions till iterations

How can I achieve that? I have to run experiments on cluster with more than 100 cores.


Editor guide