Spark ML Gradient Boosted Trees Not Using All Nodes
Solution 1:
Thanks for the clarifying comments above.
It isn't necessarily the case that Spark's implementation will be faster than XGBoost. In fact, what you're seeing is what I would expect.
The biggest factor is that XGBoost was designed and written specifically with gradient boosted trees in mind. Spark, on the other hand, is far more general-purpose and most likely doesn't have the same kind of optimizations that XGBoost has. See here for a comparison of XGBoost with scikit-learn's implementation of the same classifier. If you really want to get into the details, you can read the paper and even the code behind both XGBoost's and Spark's implementations.
Remember, XGBoost is also parallel/distributed. It just uses multiple threads on the same machine. Spark helps you run the algorithm when the data doesn't fit on a single machine.
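To make that concrete, here is a minimal sketch of XGBoost's single-machine, multi-threaded parallelism using the Python scikit-learn wrapper. The synthetic data and the thread count of 8 are just placeholders for illustration.

```python
import numpy as np
import xgboost as xgb

# Small synthetic dataset, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# XGBoost parallelizes tree construction across threads on one machine;
# n_jobs controls how many threads it uses (8 is an arbitrary example).
clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, n_jobs=8)
clf.fit(X, y)
```

All of that parallelism stays inside a single process, which is why XGBoost can be very fast as long as the data fits on one machine.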
A couple of other minor points come to mind: a) Spark has a non-trivial startup time, and communication across machines can also add up; b) XGBoost is written in C++, which is generally great for numerical computation.
As for why only 3-4 cores are being used by Spark: that depends on your dataset size, how it is partitioned across nodes, how many executors Spark is launching, which stage is taking most of the time, memory configuration, and so on. You can use the Spark UI to figure out what's going on. It's hard to say why it's happening for your dataset without looking at it. A rough sketch of what to check is shown below.
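As a starting point, here is a hedged PySpark sketch that sizes executors explicitly and repartitions the training data so every core has work to do. The executor counts, memory setting, and the `train.parquet` path are hypothetical; adjust them to your cluster and data.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier

# Hypothetical cluster sizing: 4 executors x 4 cores each.
spark = (SparkSession.builder
         .appName("gbt-parallelism-check")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

# Hypothetical training data with "features" and "label" columns.
train = spark.read.parquet("train.parquet")

# If the data sits in only a few partitions, most cores will be idle.
# Aim for at least a couple of partitions per available core.
target_partitions = spark.sparkContext.defaultParallelism * 2
if train.rdd.getNumPartitions() < target_partitions:
    train = train.repartition(target_partitions)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
model = gbt.fit(train)
```

Even with this, the Spark UI is the authoritative place to see which stage dominates the runtime and how many tasks actually run in parallel.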
Hope that helps.
Edit: I just found this great answer comparing execution times between a simple Spark application and a standalone Java application - https://stackoverflow.com/a/49241051/5509005. The same principles apply here as well, in fact even more so, since XGBoost is highly optimized.