
How To Explode An Array Without Duplicate Records

This is a continuation of the PySpark SQL question "Add different Qtr start_date, End_date for exploded rows". I have the following dataframe which has an array list as

Solution 1:

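Instead of exploding the array, you can pick the value from the array based on its position. This avoids the duplicate rows, because each existing row keeps only the one array element that belongs to it.

This position can be generated dynamically using row_number over a window, as shown below. row_number starts at 1, while the bracket index into the array starts at 0, so the lookup uses row_num - 1.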

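from pyspark.sql import Window
from pyspark.sql.functions import expr, row_number

# Number the rows of each customer in quarter order (1, 2, 3, ...).
window = Window.partitionBy('customer_number').orderBy('new_sdt')

# Use the row number as the (0-based) index into cf_values, then drop the helper column.
df.withColumn('row_num', row_number().over(window)).\
withColumn('cf_new', expr("cf_values[row_num - 1]")).\
drop('row_num').show()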

Output:

+---------------+------------+----------+----------+---+------------+----------+----------+------+
|customer_number|sales_target|start_date|  end_date|noq|   cf_values|   new_sdt| new_edate|cf_new|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-01-01|2020-03-31|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-04-01|2020-06-30|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-07-01|2020-09-30|     4|
|        A011021|          15|2020-01-01|2020-12-31|  4|[4, 4, 4, 3]|2020-10-01|2020-12-31|     3|
+---------------+------------+----------+----------+---+------------+----------+----------+------+
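
To make the position logic easier to follow, here is a minimal plain-Python sketch of the same idea. It is not PySpark code and not part of the original answer: the function name pick_by_position and the hand-written sample rows are hypothetical, and the rows simply mirror three columns from the output above. Sorting and grouping stand in for the window, and enumerate(..., start=1) stands in for row_number.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows mirroring the table above, deliberately out of order.
rows = [
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3], "new_sdt": "2020-07-01"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3], "new_sdt": "2020-01-01"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3], "new_sdt": "2020-10-01"},
    {"customer_number": "A011021", "cf_values": [4, 4, 4, 3], "new_sdt": "2020-04-01"},
]


def pick_by_position(rows):
    """Attach cf_new to each row by indexing cf_values with the row's rank."""
    # Equivalent of partitionBy('customer_number').orderBy('new_sdt').
    ordered = sorted(rows, key=itemgetter("customer_number", "new_sdt"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("customer_number")):
        # row_num starts at 1 for every customer, like row_number().
        for row_num, row in enumerate(group, start=1):
            # Python lists are 0-based, hence row_num - 1.
            result.append({**row, "cf_new": row["cf_values"][row_num - 1]})
    return result


for row in pick_by_position(rows):
    print(row["new_sdt"], row["cf_new"])
```

Note that the sketch assumes each customer has at least as many array values as rows. If a customer had more rows than values, the list lookup here would raise an IndexError.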
