How to remove array element in PySpark dataframe? - python-3.x

I want to remove barcode from this array.
My dataframe looks like the sample given below,
|-- variants: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- admin_graphql_api_id: string (nullable = true)
| | |-- barcode: string (nullable = true)
| | |-- compare_at_price: string (nullable = true)
Can you help me remove this element from the dataframe using PySpark?

You can use arrays_zip:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType, StructType, StructField
df = df.withColumn("variants", F.arrays_zip("variants.admin_graphql_api_id", "variants.compare_at_price"))
df = df.withColumn("variants", F.col("variants").cast(schema))
df.printSchema()
prints
root
|-- variants: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- admin_graphql_api_id: string (nullable = true)
| | |-- compare_at_price: string (nullable = true)
The second withColumn is necessary to set the field names of the new struct.
arrays_zip is only available in Spark 2.4.0 and later. If you are using an older Spark version, you could use a UDF instead:
def func(array):
    return [[x.admin_graphql_api_id, x.compare_at_price] for x in array]

func_udf = F.udf(func, schema)
df = df.withColumn("variants", func_udf("variants"))

Related

How to convert DataFrame columns from struct<value:double> to struct<values:array<double>> in pyspark?

I have a DataFrame with this structure:
root
|-- features: struct (nullable = true)
| |-- value: double (nullable = true)
and I want to convert the double-typed value field into an array-typed field called values.
How can I do that?
You can specify the conversion explicitly using struct and array:
import pyspark.sql.functions as F
df.printSchema()
#root
# |-- features: struct (nullable = false)
# | |-- value: double (nullable = false)
df2 = df.withColumn(
    'features',
    F.struct(
        F.array(F.col('features')['value']).alias('values')
    )
)
df2.printSchema()
#root
# |-- features: struct (nullable = false)
# | |-- values: array (nullable = false)
# | | |-- element: double (containsNull = false)

How to Convert Map of Struct type to Json in Spark2

I have a map field in a dataset with the schema below:
|-- party: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- partyName: string (nullable = true)
| | |-- cdrId: string (nullable = true)
| | |-- legalEntityId: string (nullable = true)
| | |-- customPartyId: string (nullable = true)
| | |-- partyIdScheme: string (nullable = true)
| | |-- customPartyIdScheme: string (nullable = true)
| | |-- bdrId: string (nullable = true)
I need to convert it to JSON. Please suggest how to do it. Thanks in advance.
Spark provides the to_json function for DataFrame operations:
import org.apache.spark.sql.functions._
import spark.implicits._
val df =
  List(
    ("key1", "party01", "cdrId01"),
    ("key2", "party02", "cdrId02")
  )
    .toDF("key", "partyName", "cdrId")
    .select(struct($"key", struct($"partyName", $"cdrId")).as("col1"))
    .agg(map_from_entries(collect_set($"col1")).as("map_col"))
    .select($"map_col", to_json($"map_col").as("json_col"))

How to convert a list with structure like (key1, list(key2, value)) into a dataframe in pyspark?

I have a list of the type shown below:
[(key1, [(key11, value11), (key12, value12)]), (key2, [(key21, value21), (key22, value22)...])...]
A sample structure is shown below:
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])]
I want to convert this into a dataframe of the form
key1 key2 score
1052762305 1007819788 0.9206884810054885
1052762305 1005886801 0.913818268123084
1052762305 1003863766 0.9131746152849486
... ... ...
1064808607 1007804896 0.9984397647563017
1064808607 1003705572 0.9970498347406341
1064808607 1005879599 0.9951581013190172
... ... ...
How can we implement this in PySpark?
You can create a schema upfront for the input, then use explode and access the elements within the value struct.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

schema = StructType([
    StructField("key1", StringType()),
    StructField("value", ArrayType(
        StructType([StructField("key2", StringType()),
                    StructField("score", DoubleType())])
    ))
])
df = spark.createDataFrame(
[('1052762305',
[('1007819788', 0.9206884810054885),
('1005886801', 0.913818268123084),
('1003863766', 0.9131746152849486),
('1007811435', 0.9128666156173751),
('1005879599', 0.9126368405937075),
('1003705572', 0.9122051062936369),
('1007804896', 0.9083424459788203),
('1005890270', 0.8982097535650703),
('1007806781', 0.8708761186829758),
('1003670458', 0.8452789033694487)]),
('1064808607',
[('1007804896', 0.9984397647563017),
('1003705572', 0.9970498347406341),
('1005879599', 0.9951581013190172),
('1007811435', 0.9934813787902085),
('1005886801', 0.9930572794622374),
('1003863766', 0.9928815742735568),
('1007819788', 0.9869723713790797),
('1005890270', 0.9642640856016443),
('1007806781', 0.9211558765137313),
('1003670458', 0.8519872445941068)])
],schema
)
df.show()
+----------+--------------------+
|      key1|               value|
+----------+--------------------+
|1052762305|[[1007819788, 0.9...|
|1064808607|[[1007804896, 0.9...|
+----------+--------------------+
df.printSchema()
root
|-- key1: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key2: string (nullable = true)
| | |-- score: double (nullable = true)
df1=df.select('key1',F.explode('value').alias('value'))
df1.show()
+----------+--------------------+
|      key1|               value|
+----------+--------------------+
|1052762305|[1007819788, 0.92...|
|1052762305|[1005886801, 0.91...|
|1052762305|[1003863766, 0.91...|
|1052762305|[1007811435, 0.91...|
|1052762305|[1005879599, 0.91...|
|1052762305|[1003705572, 0.91...|
|1052762305|[1007804896, 0.90...|
|1052762305|[1005890270, 0.89...|
|1052762305|[1007806781, 0.87...|
|1052762305|[1003670458, 0.84...|
|1064808607|[1007804896, 0.99...|
|1064808607|[1003705572, 0.99...|
|1064808607|[1005879599, 0.99...|
|1064808607|[1007811435, 0.99...|
|1064808607|[1005886801, 0.99...|
|1064808607|[1003863766, 0.99...|
|1064808607|[1007819788, 0.98...|
|1064808607|[1005890270, 0.96...|
|1064808607|[1007806781, 0.92...|
|1064808607|[1003670458, 0.85...|
+----------+--------------------+
df1.printSchema()
root
|-- key1: string (nullable = true)
|-- value: struct (nullable = true)
| |-- key2: string (nullable = true)
| |-- score: double (nullable = true)
df1.select("key1", "value.key2","value.score").show()
+----------+----------+------------------+
|      key1|      key2|             score|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
|1052762305|1003863766|0.9131746152849486|
|1052762305|1007811435|0.9128666156173751|
|1052762305|1005879599|0.9126368405937075|
|1052762305|1003705572|0.9122051062936369|
|1052762305|1007804896|0.9083424459788203|
|1052762305|1005890270|0.8982097535650703|
|1052762305|1007806781|0.8708761186829758|
|1052762305|1003670458|0.8452789033694487|
|1064808607|1007804896|0.9984397647563017|
|1064808607|1003705572|0.9970498347406341|
|1064808607|1005879599|0.9951581013190172|
|1064808607|1007811435|0.9934813787902085|
|1064808607|1005886801|0.9930572794622374|
|1064808607|1003863766|0.9928815742735568|
|1064808607|1007819788|0.9869723713790797|
|1064808607|1005890270|0.9642640856016443|
|1064808607|1007806781|0.9211558765137313|
|1064808607|1003670458|0.8519872445941068|
+----------+----------+------------------+
You basically need to do the following:
create a dataframe from your list
promote the pairs from elements of array into a separate row by using explode
extract key & value from pair via select
This could be done with something like this (the source data is in a variable called a):
from pyspark.sql.functions import explode, col
df = spark.createDataFrame(a, ['key1', 'val'])
df2 = df.select(col('key1'), explode(col('val')).alias('val'))
df3 = df2.select('key1', col('val')._1.alias('key2'), col('val')._2.alias('value'))
we can check that schema & data is matching:
>>> df3.printSchema()
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- value: double (nullable = true)
>>> df3.show(2)
+----------+----------+------------------+
|      key1|      key2|             value|
+----------+----------+------------------+
|1052762305|1007819788|0.9206884810054885|
|1052762305|1005886801| 0.913818268123084|
+----------+----------+------------------+
only showing top 2 rows
We can also check the schemas of the intermediate results:
>>> df.printSchema()
root
|-- key1: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: double (nullable = true)
>>> df2.printSchema()
root
|-- key1: string (nullable = true)
|-- val: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: double (nullable = true)

PySpark Array<double> is not Array<double>

I am running a very simple Spark (2.4.0 on Databricks) ML script:
from pyspark.ml.clustering import LDA
lda = LDA(k=10, maxIter=100).setFeaturesCol('features')
model = lda.fit(dataset)
But I received the following error:
IllegalArgumentException: 'requirement failed: Column features must be of type equal to one of the following types: [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>] but was actually of type array<double>.'
Why my array<double> is not an array<double>?
Here is the schema:
root
|-- BagOfWords: struct (nullable = true)
| |-- indices: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- size: long (nullable = true)
| |-- type: long (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: double (containsNull = true)
|-- tokens: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: array (nullable = true)
| |-- element: double (containsNull = true)
You probably need to convert it into vector form, for example using VectorAssembler:
from pyspark.ml.feature import VectorAssembler
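Since the features column is already an array<double> rather than separate numeric columns, a minimal alternative sketch (not part of the original answer) wraps each array in an ML DenseVector via a UDF, which is one of the column types LDA accepts:
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Sketch: convert the array<double> column into an ML vector column for LDA.
to_vector = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
dataset = dataset.withColumn('features', to_vector('features'))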

Slice array of structs using column values

I want to use Spark slice function with start and length defined as Column(s).
def slice(x: Column, start: Int, length: Int): Column
x looks like this:
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: double (nullable = true)
| | |-- b : double (nullable = true)
| | |-- c: double (nullable = true)
| | |-- d: string (nullable = true)
| | |-- e: double (nullable = true)
| | |-- f: double (nullable = true)
| | |-- g: long (nullable = true)
| | |-- h: double (nullable = true)
| | |-- i: double (nullable = true)
...
Any idea on how to achieve this? Thanks!
You cannot use the built-in DataFrame DSL function slice for this (it needs constant slice bounds), but you can use a UDF instead. If df is your dataframe and you have from and until columns, then you can do:
val mySlice = udf(
  (data: Seq[Row], from: Int, until: Int) => data.slice(from, until),
  df.schema.fields.find(_.name == "x").get.dataType
)

df
  .select(mySlice($"x", $"from", $"until"))
  .show()
Alternatively, you can use the SQL expression form in Spark SQL:
df
  .select(expr("slice(x,from,until)"))
  .show()
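The same SQL-expression route works from PySpark too. A sketch, assuming the array column x and the bound columns from and until as above (from is a SQL keyword, so backticks are safer):
from pyspark.sql import functions as F

# Slice each array using per-row bounds taken from other columns.
df.select(F.expr("slice(x, `from`, `until`)").alias("x_slice")).show()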
