
Spark DataFrame in Python - Execution Stuck When Using UDFs

I have a Spark job written in Python which reads data from CSV files using the Databricks CSV reader. I want to convert some columns from string to double by applying a UDF, but execution gets stuck when the UDF runs.

Solution 1:

Long story short: don't use Python UDFs (or UDFs in general) unless they are necessary:

  • they are inefficient, because every value makes a full round-trip through the Python interpreter
  • they cannot be optimized by Catalyst
  • they create long lineages when applied iteratively
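
For contrast, here is a minimal sketch of the discouraged UDF approach the question describes (the column name "price" and the helper function are hypothetical); every value crosses the JVM/Python boundary row by row, which is the round-trip cost noted above:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def parse_decimal(s):
    # Runs inside a Python worker for every row: each value is
    # serialized from the JVM to Python and back, and Catalyst
    # cannot optimize through this opaque function.
    return float(s.replace(",", ".")) if s is not None else None

parse_decimal_udf = udf(parse_decimal, DoubleType())
df = df.withColumn("price", parse_decimal_udf(df["price"]))  # hypothetical column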

For simple operations like this one just use built-in functions:

from pyspark.sql.functions import regexp_replace

# `columns` is the set of column names to convert (as in the question);
# all other columns are passed through unchanged.
decimal_separator = ","
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("double").alias(c)
    if c in columns else c
    for c in df.columns
]

df = df.select(*exprs)  # select returns a new DataFrame, so keep the result
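
A self-contained usage sketch, assuming a Spark 2.x+ SparkSession; the column names and sample values are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("decimal-separator-fix").getOrCreate()

df = spark.createDataFrame([("a", "1,5"), ("b", "2,75")], ["id", "price"])
columns = {"price"}  # columns to convert from string to double

decimal_separator = ","
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("double").alias(c)
    if c in columns else c
    for c in df.columns
]

df.select(*exprs).show()
# Expected output:
# +---+-----+
# | id|price|
# +---+-----+
# |  a|  1.5|
# |  b| 2.75|
# +---+-----+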
