approx_count_distinct pyspark agg function with rsd argument in Databricks - apache-spark

In Databricks, when I run the approx_count_distinct function with the 'rsd' argument, it returns an error message. It works without this argument.
Dataset
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
Code
from pyspark.sql.functions import approx_count_distinct, col

# this works; the error below appears when an rsd value is also passed as an integer
df.agg(approx_count_distinct(col("salary")).alias("salaryDistinct"))
Error message
py4j.Py4JException: Method approx_count_distinct([class org.apache.spark.sql.Column, class java.lang.Integer]) does not exist

I reproduced the above and got the same error.
The error occurs when the rsd value is given as an integer. As per pyspark.sql.functions.approx_count_distinct(), the rsd value should be a float.
The desired result is returned when a float is given.
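For example, a minimal sketch with rsd passed as a float (the 0.05 below is just an illustrative value, matching the function's documented default relative standard deviation):
from pyspark.sql.functions import approx_count_distinct, col

# rsd must be a float; an integer makes Py4J look for a non-existent
# approx_count_distinct(Column, Integer) overload, producing the error above
df.agg(approx_count_distinct(col("salary"), rsd=0.05).alias("salaryDistinct")).show()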

Related

How to pivot or transform data in ArrayType format in pyspark?

I have data in the following format:
|cust_id |card_num |balance|payment |due |card_type|
|:-------|:--------|:------|:-------|:----|:------- |
|c1 |1234 |567 |344 |33 |A |
|c1 |2345 |57 |44 |3 |B |
|c2 |123 |561 |34 |39 |A |
|c3 |345 |517 |914 |23 |C |
|c3 |127 |56 |34 |32 |B |
|c3 |347 |67 |344 |332 |B |
I want it to be converted into the following ArrayType format:
|cust_id|card_num |balance |payment |due | card_type|
|:------|:-------- |:------ |:------- |:---- |:---- |
|c1 |[1234,2345] |[567,57] |[344,44] |[33,3] |[A,B] |
|c2 |[123] |[561] |[34] |[39] |[A] |
|c3 |[345,127,347]|[517,56,67]|[914,34,344]|[23,32,332]|[C,B,B] |
How can I write generic code in PySpark to do this transformation and save the result in CSV format?
You just need to group by the cust_id column and use the collect_list function to get array-type aggregated columns.
from pyspark.sql.functions import collect_list

df = ...  # input dataframe
df.groupBy("cust_id").agg(
    collect_list("card_num").alias("card_num"),
    collect_list("balance").alias("balance"),
    collect_list("payment").alias("payment"),
    collect_list("due").alias("due"),
    collect_list("card_type").alias("card_type"))

Pyspark how to create a customized csv from data frame

I have the below data frame, which I need to load into CSV with customized rows and values.
common_df.show()
+--------+----------+-----+----+-----+-----------+-------+---+
|name |department|state|id |name | department| state | id|
+--------+----------+-----+----+-----+-----------+-------+---+
|James |Sales |NY |101 |James| Sales1 |null |101|
|Maria |Finance |CA |102 |Maria| Finance | |102|
|Jen |Marketing |NY |103 |Jen | |NY2 |103|
+--------+----------+-----+----+-----+-----------+-------+---+
I am currently following the below approach to convert the df to CSV:
pandasdf=common_df.toPandas()
pandasdf.to_csv("s3://mylocation/result.csv")
The above will convert to CSV with the same structure.
However, I need to restructure the above format into something like the layout below. I think the solution would be to split each row into two rows within the data frame, keeping the id on the left, but I don't see any example or solution for this directly in Spark.
|name |dept |state|id |
------------------------------------
101 |James |Sales |NY |101 |
|James |null |NY |101 |
------------------------------------
102 |Maria |Finance | |102 |
|Maria |Finance |CA |102 |
-------------------------------------
103 |Jen |Marketing |NY |103 |
|Jen | |NY2 |103 |
------------------------------------
Any solution to this?
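One rough sketch of the asker's own idea of splitting each wide row into two rows (an unverified sketch: it first renames the duplicated columns so they can be addressed individually, and it does not reproduce the extra leading id cell shown on the first row of each pair):
from pyspark.sql.functions import col

# give the duplicated columns distinct names so they can be selected individually
renamed = common_df.toDF(
    "name", "department", "state", "id",
    "name_2", "department_2", "state_2", "id_2")

# first row of each pair: the left-hand set of columns
left = renamed.select("name", col("department").alias("dept"), "state", "id")

# second row of each pair: the right-hand set of columns
right = renamed.select(
    col("name_2").alias("name"),
    col("department_2").alias("dept"),
    col("state_2").alias("state"),
    col("id_2").alias("id"))

stacked = left.unionByName(right).orderBy("id")
stacked.toPandas().to_csv("s3://mylocation/result.csv", index=False)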

How to apply a filter to a section of a Pyspark dataframe

I have a PySpark dataframe df that looks like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|James | |Smith |36636|M |3000 |
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
I need to apply a filter of id > 40000 only to the gender = M rows, and preserve all the gender = F rows. Therefore, the final dataframe should look like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
The only way I can think of doing this is:
df_temp1 = df.filter(df.gender == 'F')
df_temp2 = df.where(df.gender == 'M').filter(df.id > 40000)
df = df_temp1.union(df_temp2)
Is this the most efficient way to do this? I'm new to Spark so any help is appreciated!
This should do the trick. where is an alias for filter.
>>> df.show()
+-------+------+-----+
| name|gender| id|
+-------+------+-----+
| James| M|36636|
|Michael| M|40288|
| Robert| F|42114|
| Maria| F|39192|
| Jen| F|30001|
+-------+------+-----+
>>> df.where(''' (gender == 'M' and id > 40000) OR gender == 'F' ''').show()
+-------+------+-----+
| name|gender| id|
+-------+------+-----+
|Michael| M|40288|
| Robert| F|42114|
| Maria| F|39192|
| Jen| F|30001|
+-------+------+-----+
Combine both conditions using OR:
from pyspark.sql import functions as F

df = spark.createDataFrame([(36636,"M"),(40288,"M"),(42114,"M"),(39192,"F"),(30001,"F")],["id","gender"])
df = df.filter(((F.col("id") > 40000) & (F.col("gender") == F.lit("M"))) | (F.col("gender") == F.lit("F")))
df.show()
Output
+-----+------+
| id|gender|
+-----+------+
|40288| M|
|42114| M|
|39192| F|
|30001| F|
+-----+------+

Perform NGram on Spark DataFrame

I'm using Spark 2.3.1 and I have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| placing|
| least|
| one|
| two|
+----------+
I want to apply the Spark ML NGram feature like this:
from pyspark.ml.feature import NGram

bigram = NGram(n=2, inputCol="values", outputCol="bigrams")
bigramDataFrame = bigram.transform(tokenized_df)
The following error occurred on the line bigramDataFrame = bigram.transform(tokenized_df):
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
So I changed my code:
from pyspark.sql.functions import array

df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))
bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show()
So I got my final data frame as follows:
+----------+------------+-------+
| values| testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]| []|
| present| [present]| []|
| invention| [invention]| []|
| include| [include]| []|
| pairing| [pairing]| []|
| two| [two]| []|
| wireless| [wireless]| []|
| device| [device]| []|
| placing| [placing]| []|
| least| [least]| []|
| one| [one]| []|
| two| [two]| []|
+----------+------------+-------+
Why is my bigrams column empty?
I want the output of the bigrams column to be as follows:
+--------------------+
|bigrams             |
+--------------------+
|embodiment present |
|present invention |
|invention include |
|include pairing |
|pairing two |
|two wireless |
|wireless device |
|device placing |
|placing least |
|least one |
|one two |
+--------------------+
Your bi-grams column is empty because no bi-grams can be formed within each row: the arrays built from your 'values' column each contain only a single token.
If the values in your input data frame look like this instead:
+--------------------------------------------+
|values |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing |
|least one two |
+--------------------------------------------+
Then you can get the output in bi-grams as below:
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values |testing |ngrams |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing |[two, wireless, device, placing] |[two wireless, wireless device, device placing] |
|least one two |[least, one, two] |[least one, one two] |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
The Scala Spark code to do this is:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.split

val df_new = df.withColumn("testing", split(df("values"), " "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
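Since the question itself is in PySpark, a rough PySpark equivalent of the same split-then-NGram approach (a sketch, assuming a sentence-per-row 'values' column as shown above) would be:
from pyspark.sql.functions import split
from pyspark.ml.feature import NGram

# split each sentence into an array of tokens, then feed the array to NGram
df_new = df.withColumn("testing", split(df["values"], " "))
bigram = NGram(n=2, inputCol="testing", outputCol="ngrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show(truncate=False)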
A bi-gram is a sequence of two adjacent elements from a string of
tokens, which are typically letters, syllables, or words.
But in your input data frame, you have only one token in each row, hence you are not getting any bi-grams out of it.
So, for your question, you can do something like this:
Input: df1
+----------+
|values |
+----------+
|embodiment|
|present |
|invention |
|include |
|pairing |
|two |
|wireless |
|devic |
|placing |
|least |
|one |
|two |
+----------+
Output: ngramDataFrameInRows
+------------------+
|ngrams |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing |
|pairing two |
|two wireless |
|wireless devic |
|devic placing |
|placing least |
|least one |
|one two |
+------------------+
Spark Scala code:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, collect_list, explode}

// collect every token into a single array-typed row, run NGram on it,
// then explode the resulting array back into one bi-gram per row
val df_new = df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
val ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))
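An equivalent PySpark sketch of the same collect_list-then-NGram approach (assuming, as above, that the single-token-per-row DataFrame is called df1):
from pyspark.sql.functions import col, collect_list, explode
from pyspark.ml.feature import NGram

df_new = df1.agg(collect_list("values").alias("testing"))
ngram = NGram(n=2, inputCol="testing", outputCol="ngrams")
ngramDataFrame = ngram.transform(df_new)
ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))
ngramDataFrameInRows.show(truncate=False)
Note that collect_list does not guarantee row ordering, so the token order in the collected array (and hence the bi-grams) is not strictly deterministic.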

Call a function for each row of a dataframe in pyspark[non pandas]

There is a function in pyspark:
def sum(a, b):
    c = a + b
    return c
It has to be run on each record of a very very large dataframe using spark sql:
x = sum(df.select("NUM1").first()["NUM1"], df.select("NUM2").first()["NUM2"])
But this would run it only for the first record of the df and not for all rows.
I understand it could be done using a lambda, but I am not able to code it in the desired way.
In reality, c would be a dataframe, and the function would be doing a lot of spark.sql work and returning it. I would have to call that function for each row.
I guess I will try to pick it up using this sum(a, b) as an analogy.
+----------+----------+-----------+
| NUM1 | NUM2 | XYZ |
+----------+----------+-----------+
| 10 | 20 | HELLO|
| 90 | 60 | WORLD|
| 50 | 45 | SPARK|
+----------+----------+-----------+
+----------+----------+-----------+------+
| NUM1 | NUM2 | XYZ | VALUE|
+----------+----------+-----------+------+
| 10 | 20 | HELLO|30 |
| 90 | 60 | WORLD|150 |
| 50 | 45 | SPARK|95 |
+----------+----------+-----------+------+
Python: 3.7.4
Spark: 2.2
You can use the .withColumn function:
from pyspark.sql.functions import col
from pyspark.sql.types import LongType
df.show()
+----+----+-----+
|NUM1|NUM2| XYZ|
+----+----+-----+
| 10| 20|HELLO|
| 90| 60|WORLD|
| 50| 45|SPARK|
+----+----+-----+
def mysum(a, b):
    return a + b

# registering makes mysum callable from SQL and expr(); calling it directly
# on Column objects, as below, simply builds the column expression NUM1 + NUM2
spark.udf.register("mysumudf", mysum, LongType())

df2 = df.withColumn("VALUE", mysum(col("NUM1"), col("NUM2")))
df2.show()
+----+----+-----+-----+
|NUM1|NUM2| XYZ|VALUE|
+----+----+-----+-----+
| 10| 20|HELLO| 30|
| 90| 60|WORLD| 150|
| 50| 45|SPARK| 95|
+----+----+-----+-----+
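If you want to go through the registered UDF rather than plain column arithmetic, a small sketch using the mysumudf name registered above:
from pyspark.sql import functions as F

# expr() resolves the SQL-registered UDF by name
df2 = df.withColumn("VALUE", F.expr("mysumudf(NUM1, NUM2)"))
df2.show()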
We can do it in the ways below; while registering a UDF, the third argument (the return type) is not mandatory.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(10,20,'HELLO'),(90,60,'WORLD'),(50,45,'SPARK')],['NUM1','NUM2','XYZ'])
df1.show()
df2=df1.withColumn('VALUE',F.expr('NUM1 + NUM2'))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ |VALUE|
+----+----+-----+-----+
|10 |20 |HELLO|30 |
|90 |60 |WORLD|150 |
|50 |45 |SPARK|95 |
+----+----+-----+-----+
(or)
def sum(c1, c2):
    return c1 + c2

spark.udf.register('sum_udf1', sum)
df2=df1.withColumn('VALUE',F.expr("sum_udf1(NUM1,NUM2)"))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ |VALUE|
+----+----+-----+-----+
|10 |20 |HELLO|30 |
|90 |60 |WORLD|150 |
|50 |45 |SPARK|95 |
+----+----+-----+-----+
(or)
sum_udf2=F.udf(lambda x,y: x+y)
df2=df1.withColumn('VALUE',sum_udf2('NUM1','NUM2'))
df2.show(3,False)
+----+----+-----+-----+
|NUM1|NUM2|XYZ |VALUE|
+----+----+-----+-----+
|10 |20 |HELLO|30 |
|90 |60 |WORLD|150 |
|50 |45 |SPARK|95 |
+----+----+-----+-----+
Use the below simple approach:
1. Import pyspark.sql functions
from pyspark.sql import functions as F
2. Use the F.expr() function
df.withColumn("VALUE", F.expr("NUM1 + NUM2"))
