How Does The Number Of Partitions Affect `wholeTextFiles` And `textFile`?
In Spark, I understand how to use `wholeTextFiles` and `textFile`, but I'm not sure which to use when. Here is what I know so far: When dealing with files that are not split by
Solution 1:
For reference, `wholeTextFiles` uses `WholeTextFileInputFormat`, which extends `CombineFileInputFormat`.
A couple of notes on `wholeTextFiles`:
- Each record in the RDD returned by `wholeTextFiles` has the file name and the entire contents of the file. This means that a file cannot be split (at all).
- Because its input format extends `CombineFileInputFormat`, it will try to combine groups of smaller files into one partition.
If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set `minPartitions=2`, then I will likely get two partitions back instead.
Now if I were to set `minPartitions=3`, I will still get back two partitions, because the contract for `wholeTextFiles` is that each record in the RDD contains an entire file.
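To make the two-file example concrete, here is a rough pure-Python model of the packing behaviour described above. It is only a sketch, not Spark's actual code. The function name `pack_whole_files` is made up for illustration. It assumes the size cap per partition is the total size divided by `minPartitions` (rounded up), that files are added in list order until the cap is reached, and it ignores the node and rack locality that the real input format also takes into account.

```python
import math


def pack_whole_files(file_sizes, min_partitions):
    """Simplified model of packing unsplittable files into partitions.

    Assumptions (not taken from Spark's source verbatim):
    - the cap per partition is ceil(total_bytes / min_partitions);
    - files are visited in list order;
    - node and rack locality are ignored.
    Returns a list of partitions, each a list of file indices.
    """
    total = sum(file_sizes)
    max_split_size = math.ceil(total / max(min_partitions, 1))

    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        current.append(index)  # a file is never split, so it is added whole
        current_size += size
        if current_size >= max_split_size:
            # the cap is reached: close this partition and start a new one
            partitions.append(current)
            current, current_size = [], 0
    if current:  # leftover files form a final partition
        partitions.append(current)
    return partitions


two_equal_files = [100, 100]  # hypothetical file sizes in bytes
for n in (1, 2, 3):
    print(n, pack_whole_files(two_equal_files, n))
```

In this model, two equal files land in one partition for `min_partitions=1` and in two partitions for both `min_partitions=2` and `min_partitions=3`, since there are only two whole files to hand out. With very uneven sizes such as `[1, 199]` and `min_partitions=2`, the model keeps both files in one partition, which may be one reason the result is only "likely" two. On a real cluster you can check the actual count with `sc.wholeTextFiles(path, minPartitions=n).getNumPartitions()`.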