
Convert StringType to ArrayType in PySpark

I am trying to run the FPGrowth algorithm in PySpark on my dataset.

from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol='name', minSupport=0.5, minConfidence=0.6)
mod
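For context, FPGrowth requires itemsCol to be an ArrayType column, which is why fitting fails while name is still a StringType column. A minimal sketch with hypothetical data showing the shape it expects:

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# 'name' is array<string> here, which is the type FPGrowth expects.
df = spark.createDataFrame(
    [(0, ['a', 'b']), (1, ['a', 'c']), (2, ['a'])],
    ['id', 'name'],
)
fpGrowth = FPGrowth(itemsCol='name', minSupport=0.5, minConfidence=0.6)
fpGrowth.fit(df).freqItemsets.show()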

Solution 1:

Split by comma for each row in the name column of your DataFrame, e.g. with a pandas UDF:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_comma(v):
    # v is a pandas Series of strings such as '[a,b,c]': strip the
    # surrounding brackets, then split each value on commas.
    return v.str[1:-1].str.split(',')

df = df.withColumn('name', split_comma(df.name))
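Alternatively, Spark's built-in functions can do the same split without the overhead of a Python UDF; a minimal sketch, assuming each name value looks like '[a,b,c]':

from pyspark.sql.functions import regexp_replace, split

# Strip the leading '[' and trailing ']', then split on commas.
df = df.withColumn('name', split(regexp_replace('name', r'^\[|\]$', ''), ','))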

Or better, don't defer this: build name as a list directly when you construct the rows.

from pyspark.sql import Row

# Split the comma-separated string into a list while building the rows.
rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(',')))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))

Solution 2:

Based on your previous question, it seems as though you are building rd2 incorrectly.

Try this:

from pyspark.sql import Row

rd2 = rd.map(lambda x: (x[1], x[0][0], x[0][1].split(",")))
rd3 = rd2.map(lambda p: Row(id=int(p[0]), name=p[2], actor=str(p[1])))

The change is that we call str.split(",") on x[0][1], so that a string like 'a,b' becomes the list ['a', 'b'].
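With name built as a list, a DataFrame created from rd3 carries it as an ArrayType column and can be fed straight into FPGrowth. A minimal sketch, assuming spark is your active SparkSession:

from pyspark.ml.fpm import FPGrowth

# 'name' is already array<string>, so FPGrowth accepts it directly.
df = spark.createDataFrame(rd3)
fpGrowth = FPGrowth(itemsCol='name', minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
model.freqItemsets.show()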
