
Spark DataFrame in Python - Execution Stuck When Using UDFs

I have a Spark job written in Python which reads data from CSV files using the Databricks CSV reader. I want to convert some columns from string to double by applying a UDF, but execution gets stuck when the UDF runs.

Solution 1:

Long story short: don't use Python UDFs (or UDFs in general) unless they are necessary:

  • they are inefficient, because every value makes a full round-trip through the Python interpreter
  • they cannot be optimized by Catalyst
  • they create long lineages when applied iteratively
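
For contrast, here is a minimal sketch of the discouraged UDF approach the question describes (the column name "price" and the helper function are hypothetical); every value crosses the JVM/Python boundary row by row, which is the round-trip cost noted above:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def parse_decimal(s):
    # Runs inside a Python worker for every row: each value is
    # serialized from the JVM to Python and back, and Catalyst
    # cannot optimize through this opaque function.
    return float(s.replace(",", ".")) if s is not None else None

parse_decimal_udf = udf(parse_decimal, DoubleType())
df = df.withColumn("price", parse_decimal_udf(df["price"]))  # hypothetical column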

For simple operations like this one just use built-in functions:

from pyspark.sql.functions import regexp_replace

# `columns` is the set of column names to convert (as in the question);
# all other columns are passed through unchanged.
decimal_separator = ","
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("double").alias(c)
    if c in columns else c
    for c in df.columns
]

df = df.select(*exprs)  # select returns a new DataFrame, so keep the result
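
A self-contained usage sketch, assuming a Spark 2.x+ SparkSession; the column names and sample values are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("decimal-separator-fix").getOrCreate()

df = spark.createDataFrame([("a", "1,5"), ("b", "2,75")], ["id", "price"])
columns = {"price"}  # columns to convert from string to double

decimal_separator = ","
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("double").alias(c)
    if c in columns else c
    for c in df.columns
]

df.select(*exprs).show()
# Expected output:
# +---+-----+
# | id|price|
# +---+-----+
# |  a|  1.5|
# |  b| 2.75|
# +---+-----+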
