Spark DataFrame in Python - Execution Stuck When Using UDFs
I have a Spark job written in Python which reads data from CSV files using the Databricks CSV reader. I want to convert some columns from string to double by applying a UDF, but execution gets stuck.
Solution 1:
Long story short: don't use Python UDFs (or UDFs in general) unless it is necessary:
- they are inefficient, since every row makes a full round-trip through the Python interpreter
- they cannot be optimized by Catalyst
- they create long lineages when applied iteratively
For simple operations like this one, just use built-in functions:

from pyspark.sql.functions import regexp_replace

decimal_separator = ","
# columns holds the names of the string columns to convert
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("double").alias(c)
    if c in columns else c
    for c in df.columns
]
df.select(*exprs)