PySpark DataFrames - Filtering Using Comparisons Between Columns Of Different Types
Solution 1:
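For the examples below, assume a small DataFrame along these lines (a sketch; the question's exact schema is not shown, so a floating-point intcol and a string strcol are assumed here to match the plans that follow):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# intcol is a double column, strcol a string column
df = spark.createDataFrame(
    [(1.0, "a"), (2.0, "miss"), (3.0, "b")],
    ["intcol", "strcol"],
)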
Let's start with why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to operands of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. In general, numeric types have higher precedence; therefore (following the execution plan from df.select(df['intcol'] != 'miss').explain(True)):
== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
is rewritten as
== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false
where 'miss' is cast to double, and later the whole expression is converted to NULL:
== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false
because casting 'miss' to double is undefined, so the comparison is constant-folded to NULL.
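The same behaviour can be observed directly in SQL. With ANSI mode disabled (the default in Spark 2.x and 3.x), an impossible cast yields NULL instead of raising an error:

spark.sql("SELECT CAST('miss' AS double) AS c").show()
# the single value of column c is NULL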
Since equality with NULL is undefined as well (see Difference between === null and isNull in Spark DataFrame), the filter yields an empty result.
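With the sketch from above, this is easy to verify:

# The predicate evaluates to NULL for every row, so nothing passes the filter
df.filter(df["intcol"] != "miss").show()  # empty result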
Now, how to address that? Both explicit casting:
df.filter(df['intcol'].cast("string") != 'miss')
and null-safe equality:
df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))
should do the trick.
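Continuing the sketch, both variants now behave as expected:

# Explicit cast: the comparison is string vs. string, so no implicit
# cast to double is required and the predicate is a real boolean
df.filter(df["intcol"].cast("string") != "miss").show()

# Null-safe equality: <=> evaluates to false rather than NULL when one
# side is NULL, so the negation keeps all non-matching rows
df.filter(~df["intcol"].cast("string").eqNullSafe("miss")).show()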
Also note that NaN values are not NULL, and conversion via Pandas is lossy (see Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?).
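If the data does come from Pandas and contains NaN, one way to normalise it on the Spark side is to rewrite NaN as NULL explicitly (a sketch; isnan applies only to floating-point columns):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, FloatType

# Replace NaN with NULL in every floating-point column so that
# NULL-based predicates (isNull, eqNullSafe) behave consistently
df_clean = df.select(
    *[
        F.when(F.isnan(c), None).otherwise(F.col(c)).alias(c)
        if isinstance(f.dataType, (DoubleType, FloatType))
        else F.col(c)
        for c, f in zip(df.columns, df.schema.fields)
    ]
)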