
How To Calculate Difference Between Dates Excluding Weekends In Pyspark 2.2.0

I have the PySpark DataFrame below, which can be recreated with the code df = spark.createDataFrame([(1, 'John Doe', '2020-11-30'),(2, 'John Doe', '2020-11-27'),(3, 'John Doe', '2020-11-29')

Solution 1:

You can't call collect() inside a UDF. A UDF only receives column values, one row at a time, so pass it both the date column and the lagged date column (the previous row's date, obtained with F.lag over a window), as shown below:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([
    (1, "John Doe", "2020-11-30"),
    (2, "John Doe", "2020-11-27"),
    (3, "John Doe", "2020-11-29")],
    ("id", "name", "date")
) 

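# Window that exposes the previous row's date within each name, ordered by id.
w = Window.partitionBy("name").orderBy("id")

# date1 is the current row's date, date2 the previous row's date.
# np.busday_count(date2, date1) counts business days between them and is
# negative when date1 is earlier than date2. The first row of each
# partition has no previous date, so the UDF returns null there.
workdaysUDF = F.udf(
    lambda date1, date2: int(np.busday_count(date2, date1))
    if (date1 is not None and date2 is not None) else None,
    IntegerType()
)
df = df.withColumn("date_dif", workdaysUDF(F.col("date"), F.lag(F.col("date")).over(w)))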
df.show()

+---+--------+----------+--------+
| id|    name|      date|date_dif|
+---+--------+----------+--------+
|  1|John Doe|2020-11-30|    null|
|  2|John Doe|2020-11-27|      -1|
|  3|John Doe|2020-11-29|       1|
+---+--------+----------+--------+
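To see where the -1 and 1 in the date_dif column come from, the counting logic can be sketched in plain Python without Spark or NumPy. This is only an illustration: workdays_between is a hypothetical helper (not part of PySpark or NumPy), it assumes ISO-formatted date strings and a Monday-to-Friday work week with no holidays, and its handling of reversed ranges may differ from np.busday_count for dates other than the ones in this example.

```python
from datetime import date, timedelta

def workdays_between(start_iso, end_iso):
    """Count Mon-Fri days in the half-open range [start, end).

    Hypothetical helper for illustration only. Returns None if either
    input is None, and a negative count when end falls before start.
    """
    if start_iso is None or end_iso is None:
        return None
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    sign = 1
    if start > end:
        # Reversed range: count the same way, then negate the result.
        start, end = end, start
        sign = -1
    span = (end - start).days
    count = sum(
        1 for offset in range(span)
        if (start + timedelta(days=offset)).weekday() < 5  # 0-4 is Mon-Fri
    )
    return sign * count

# Rows ordered by id, mirroring the sample DataFrame.
dates = ["2020-11-30", "2020-11-27", "2020-11-29"]
# Shifting the list by one plays the role of F.lag over the window.
previous = [None] + dates[:-1]
date_dif = [workdays_between(prev, cur) for prev, cur in zip(previous, dates)]
print(date_dif)  # [None, -1, 1]
```

Going from Monday 2020-11-30 back to Friday 2020-11-27 crosses one working day in reverse, hence -1, while going from Friday 2020-11-27 to Sunday 2020-11-29 only counts the Friday, hence 1.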
