
PySpark DataFrames - Filtering Using Comparisons Between Columns Of Different Types

Suppose you have a DataFrame with columns of various types (string, double, ...) and a special value 'miss' that represents a 'missing value' in the string-typed columns. How do you filter on a comparison between a column and a value of a different type?
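
For reference, a minimal sketch of the setup. The column names intcol and strcol match the execution plans below; the sample values, and the double type of intcol, are assumptions inferred from the cast in the analyzed plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# intcol is double-typed, strcol is string-typed;
# 'miss' marks a missing value in the string column (sample data assumed)
df = spark.createDataFrame(
    [(1.0, "a"), (2.0, "miss"), (3.0, "b")],
    ["intcol", "strcol"],
)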

Solution 1:

Let's start with why. The DataFrame API is a DSL for SQL, and SQL evaluation rules apply. Whenever you apply an operator to operands of different types, a CAST operation is applied, according to predefined rules, to the operand of lower precedence. In general, numeric types have higher precedence; therefore (following the execution plan of df.select(df['intcol'] != 'miss').explain(True)):

== Parsed Logical Plan ==
'Project [NOT (intcol#0 = miss) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

is rewritten as

== Analyzed Logical Plan ==
(NOT (intcol = miss)): boolean
Project [NOT (intcol#0 = cast(miss as double)) AS (NOT (intcol = miss))#12]
+- LogicalRDD [intcol#0, strcol#1], false

where 'miss' is cast to double, and later converted to NULL

== Optimized Logical Plan ==
Project [null AS (NOT (intcol = miss))#22]
+- LogicalRDD [intcol#0, strcol#1], false

because the cast is undefined for this operand: the string 'miss' cannot be parsed as a double, so it becomes NULL.

Since equality with NULL is undefined as well (see: Difference between === null and isNull in Spark DataFrame), the filter yields an empty result.
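
Running the naive comparison against the sample df above shows the effect; the empty output is what the NULL-folded predicate implies:

df.filter(df['intcol'] != 'miss').show()
# Expected: no rows, because intcol = CAST('miss' AS DOUBLE) is NULL
# +------+------+
# |intcol|strcol|
# +------+------+
# +------+------+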

Now, how to address that? Both explicit casting:

df.filter(df['intcol'].cast("string") != 'miss')

and null safe equality:

df.filter(~df['intcol'].cast("string").eqNullSafe('miss'))

should do the trick.
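
With the sample data above, either form now evaluates as a plain string comparison; a sketch (the shown output assumes the sample rows from the setup):

df.filter(df['intcol'].cast("string") != 'miss').show()
# All rows are kept, since no value of intcol renders as the string 'miss'
# +------+------+
# |intcol|strcol|
# +------+------+
# |   1.0|     a|
# |   2.0|  miss|
# |   3.0|     b|
# +------+------+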

Also note that NaN values are not NULL, and conversion via Pandas is lossy (see: Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?).
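
A short sketch of that distinction, assuming an active SparkSession named spark: a NaN in a numeric Pandas column stays NaN after conversion rather than becoming SQL NULL, so isNull does not match it while isnan does.

import pandas as pd
from pyspark.sql.functions import isnan

pdf = pd.DataFrame({"x": [1.0, float("nan")]})
sdf = spark.createDataFrame(pdf)

# NaN survives the conversion as NaN, not as SQL NULL
sdf.filter(sdf["x"].isNull()).count()  # 0
sdf.filter(isnan(sdf["x"])).count()    # 1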

