
How Does The Number Of Partitions Affect `wholeTextFiles` And `textFile`?

In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure which to use when. Here is what I know so far: When dealing with files that are not split by

Solution 1:

For reference, wholeTextFiles uses WholeTextFileInputFormat which extends CombineFileInputFormat.

A couple of notes on wholeTextFiles.

  • Each record in the RDD returned by wholeTextFiles is a pair of the file path and the entire contents of that file. This means that a file cannot be split (at all).
  • Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition.

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead.

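Now if I were to set minPartitions=3, I would still get back two partitions, because the contract for wholeTextFiles is that each record in the RDD contains an entire file. With only two files there are only two records, and a single record cannot be divided across partitions.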
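
To make the two-file example concrete, here is a rough sketch in plain Python of how unsplittable files might be packed into partitions. It is a simplified model, not Spark's actual code: the function name pack_whole_files is hypothetical, the 100-byte file sizes are made up, and it assumes a target partition size of ceil(total size / minPartitions) with greedy packing, ignoring the node and rack locality that CombineFileInputFormat also takes into account.

```python
import math


def pack_whole_files(file_sizes, min_partitions):
    """Group unsplittable files into partitions (simplified model).

    Assumptions, not taken from Spark's source verbatim:
    - the target partition size is ceil(total_size / min_partitions);
    - files are packed greedily in listing order;
    - data locality is ignored.
    """
    total = sum(file_sizes)
    max_split = math.ceil(total / max(min_partitions, 1))

    partitions, current, current_size = [], [], 0
    for size in file_sizes:
        # A file is never split, so it is always added whole.
        current.append(size)
        current_size += size
        # Close the partition once it reaches the target size.
        if current_size >= max_split:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)
    return partitions


two_small_files = [100, 100]  # hypothetical file sizes in bytes
for n in (1, 2, 3):
    print(n, len(pack_whole_files(two_small_files, n)))
# prints:
# 1 1
# 2 2
# 3 2
```

Under these assumptions, asking for three partitions still yields two, because each file is an indivisible unit. On a real cluster the count can be checked with something like sc.wholeTextFiles(path, minPartitions=3).getNumPartitions() in PySpark.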

