Can you please explain the parameters of takeSample() in pyspark [duplicate] - apache-spark

Reading the spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
There is this boolean parameter withReplacement without much explanation.
sample(withReplacement, fraction, seed=None)
What is it and how do we use it?

The withReplacement parameter controls the uniqueness of the sample result. If we treat the Dataset as a bucket of balls, withReplacement=true means taking a random ball out of the bucket and placing it back into it, so the same ball can be picked up again.
Assuming all unique elements in a Dataset:
withReplacement=true: the same element can be produced more than once in the sample result.
withReplacement=false: each element of the dataset will be sampled at most once.
import spark.implicits._

val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()

// With replacement: the same row can appear more than once
df.sample(true, 0.5, 5).show()

// Without replacement: each row appears at most once
df.sample(false, 0.5, 5).show()
Result
+---+
|ids|
+---+
| 1|
| 2|
| 3|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 6|
| 7|
| 7|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 1|
| 3|
| 7|
| 8|
| 9|
+---+

This is actually documented in the Spark 2.3 docs:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
withReplacement – Sample with replacement
case class Member(id: Int, name: String, role: String)

val member1 = Member(1, "User1", "Data Engineer")
val member2 = Member(2, "User2", "Software Engineer")
val member3 = Member(3, "User3", "DevOps Engineer")
val memberDF = Seq(member1, member2, member3).toDF

memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 1|User1| Data Engineer|
| 2|User2|Software Engineer|
+---+-----+-----------------+
memberDF.sample(true, 0.4).show
+---+-----+---------------+
| id| name| role|
+---+-----+---------------+
| 3|User3|DevOps Engineer|
+---+-----+---------------+
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 2|User2|Software Engineer|
| 3|User3| DevOps Engineer|
+---+-----+-----------------+
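Since the question itself is about PySpark, here is a minimal PySpark equivalent (my own sketch; the API mirrors the Scala one). Note that fraction is a per-row inclusion probability (or expected count, with replacement), not a guarantee of an exact sample size.

df = spark.range(1, 11).toDF("ids")

# With replacement: the same row may appear more than once
df.sample(withReplacement=True, fraction=0.5, seed=5).show()

# Without replacement: each row appears at most once
df.sample(withReplacement=False, fraction=0.5, seed=5).show()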

Related

How to execute many expressions in selectExpr

Is it possible to apply many expressions in the same selectExpr?
For example, if I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how can I multiply by 2 and rename the column, like this:
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions, pass them as comma-separated strings. Please check the code below.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
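The same works from PySpark: since selectExpr takes a variable number of expression strings, a list of expressions can be unpacked into the call (a small sketch):

exprs = ["id * 2 AS twotimes", "id * 3 AS threetimes"]
df = spark.range(1, 11)
df.selectExpr(*exprs).show()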
Yes, you can pass multiple expressions to df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, age: Int)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala> personDS.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Alternatively, you can use withColumn to achieve the same result:
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
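For completeness, a PySpark sketch of both approaches (the personDF setup here is my own, not from the answer above):

from pyspark.sql import functions as F

personDF = spark.createDataFrame(
    [("Max", 1), ("Adam", 2), ("Muller", 3)], ["name", "age"])

# selectExpr variant
personDF.selectExpr("age * 2 AS multiple", "name").show(truncate=False)

# withColumn variant
personDF.withColumn("multiple", F.col("age") * 2).select("multiple", "name").show(truncate=False)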

PySpark - Convert String to Array

I have a column in my dataframe that is a string with a value like ["value_a", "value_b"].
What is the best way to convert this column to an array and explode it? For now, I'm doing something like:
explode(split(col("value"), ",")).alias("value")
But I'm getting strings like ["awesome" or "John" or "Maria]" (with stray brackets and quotes), while the expected output should be awesome, John, Maria (one item per line, which is why I'm using explode).
Sample code to reproduce:
from pyspark.sql.functions import col, explode, split

sample_input = [
    {"id": 1, "value": "[\"johnny\", \"maria\"]"},
    {"id": 2, "value": "[\"awesome\", \"George\"]"}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(split(col("value"), ",")).alias("value")).show(n=10)
Output generated by code above:
+---+----------+
| id| value|
+---+----------+
| 1| ["johnny"|
| 1| "maria"]|
| 2|["awesome"|
| 2| "George"]|
+---+----------+
Expected should be:
+---+----------+
| id| value|
+---+----------+
| 1| johnny |
| 1| maria |
| 2| awesome|
| 2| George|
+---+----------+
This worked for me. Note that the value column here holds an actual array rather than a string, so it can be exploded directly without split:
from pyspark.sql.functions import col, explode

sample_input = [
    {"id": 1, "value": ["johnny", "maria"]},
    {"id": 2, "value": ["awesome", "George"]}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(col("value")).alias("value")).show(n=10)
+---+-------+
| id| value|
+---+-------+
| 1| johnny|
| 1| maria|
| 2|awesome|
| 2| George|
+---+-------+
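If the column really is a JSON-like string, as in the original question, one way (a sketch, assuming the values themselves never contain commas, quotes, or brackets) is to strip the brackets and quotes with regexp_replace before splitting:

from pyspark.sql.functions import col, explode, regexp_replace, split

# Remove the [, ] and " characters, then split on the comma plus optional space
cleaned = regexp_replace(col("value"), r'[\[\]"]', "")
df.select(col("id"), explode(split(cleaned, r",\s*")).alias("value")).show()

Alternatively, from_json with an ArrayType(StringType()) schema parses the string properly and also copes with values that contain commas.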

How to rename an existing Spark SQL function

I am using Spark to call functions on data submitted by the user.
How can I rename an already existing function to a different name, e.g. REGEXP_REPLACE to REPLACE?
I tried the following code:
ss.udf.register("REPLACE", REGEXP_REPLACE) // This doesn't work
ss.udf.register("sum_in_all", sumInAll)
ss.udf.register("mod", mod)
ss.udf.register("average_in_all", averageInAll)
Import it with an alias:
import org.apache.spark.sql.functions.{regexp_replace => replace }
df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
df.withColumn("replaced", replace($"id", "(\\d)" , "$1+1") ).show
+---+--------+
| id|replaced|
+---+--------+
| 0| 0+1|
| 1| 1+1|
| 2| 2+1|
| 3| 3+1|
| 4| 4+1|
| 5| 5+1|
| 6| 6+1|
| 7| 7+1|
| 8| 8+1|
| 9| 9+1|
+---+--------+
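The same import-time aliasing works in PySpark (a small sketch, given a comparable df):

from pyspark.sql.functions import regexp_replace as replace

df.withColumn("replaced", replace("id", r"(\d)", "$1+1")).show()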
To do it with Spark SQL, you'll have to re-register the function in Hive under a different name:
sqlContext.sql("""create temporary function replace
  as 'org.apache.hadoop.hive.ql.udf.UDFRegExpReplace'""")
sqlContext.sql(""" select replace("a,b,c", "," ,".") """).show
+-----+
| _c0|
+-----+
|a.b.c|
+-----+
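Without going through Hive, another option is to register a small UDF under the desired name; a sketch (note that a Python UDF is slower than the built-in regexp_replace, and the null check is needed because UDFs receive nulls as-is):

import re
from pyspark.sql.types import StringType

spark.udf.register(
    "REPLACE",
    lambda s, pattern, repl: re.sub(pattern, repl, s) if s is not None else None,
    StringType())

spark.sql("SELECT REPLACE('a,b,c', ',', '.') AS replaced").show()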

splitting content of a pyspark dataframe column and aggregating them into new columns

I am trying to extract and split the data within a pyspark dataframe column, and then aggregate it into new columns.
Input table:
+--+-----------+
|id|description|
+--+-----------+
|1 | 3:2,3|2:1|
|2 | 2 |
|3 | 2:12,16 |
|4 | 3:2,4,6 |
|5 | 2 |
|6 | 2:3,7|2:3|
+--+-----------+
Desired output:
+--+-----------+-------+-----------+
|id|description|sum_emp|org_changed|
+--+-----------+-------+-----------+
|1 | 3:2,3|2:1| 5 | 3 |
|2 | 2 | 2 | 0 |
|3 | 2:12,16 | 2 | 2 |
|4 | 3:2,4,6 | 3 | 3 |
|5 | 2 | 2 | 0 |
|6 | 2:3,7|2:3| 4 | 3 |
+--+-----------+-------+-----------+
Before the ":", values ought to be added. The values post the ":" are to be counted. The | marks the shift in the record(can be ignored)
Some data points are as long as 2:3,4,5|3:4,6,3|4:3,7,8
Any help would be greatly appreciated
Scenario explained:
Consider the 6th id, for example. The 6 refers to a biz unit id. The description column describes the teams within that given unit.
The values 2:3,7|2:3 mean the following:
1) First 2, followed by 3 & 7: there are 2 folks in the team, and one of them has been in another org for 3 years and in yet another for 7 years (perhaps it's the second person's first company).
2) Second 2, followed by 3: there are 2 folks again in a sub-team, and 1 person has spent 3 years in another org.
Desired output:
sum_emp = total number of employees in that given biz unit.
org_changed = total number of organizations folks in that biz unit have changed.
First let's create our dataframe:
df = spark.createDataFrame(
    sc.parallelize([[1, "3:2,3|2:1"],
                    [2, "2"],
                    [3, "2:12,16"],
                    [4, "3:2,4,6"],
                    [5, "2"],
                    [6, "2:3,7|2:3"]]),
    ["id", "description"])
+---+-----------+
| id|description|
+---+-----------+
| 1| 3:2,3|2:1|
| 2| 2|
| 3| 2:12,16|
| 4| 3:2,4,6|
| 5| 2|
| 6| 2:3,7|2:3|
+---+-----------+
First we'll split the records and explode the resulting array so we only have one record per line:
import pyspark.sql.functions as psf

df = df.withColumn(
    "record",
    psf.explode(psf.split("description", r'\|'))
)
+---+-----------+-------+
| id|description| record|
+---+-----------+-------+
| 1| 3:2,3|2:1| 3:2,3|
| 1| 3:2,3|2:1| 2:1|
| 2| 2| 2|
| 3| 2:12,16|2:12,16|
| 4| 3:2,4,6|3:2,4,6|
| 5| 2| 2|
| 6| 2:3,7|2:3| 2:3,7|
| 6| 2:3,7|2:3| 2:3|
+---+-----------+-------+
Now we'll split records into the number of players and a list of years:
df = df.withColumn(
    "record",
    psf.split("record", ':')
).withColumn(
    "nb_players",
    psf.col("record")[0]
).withColumn(
    "years",
    psf.split(psf.col("record")[1], ',')
)
+---+-----------+----------+----------+---------+
| id|description| record|nb_players| years|
+---+-----------+----------+----------+---------+
| 1| 3:2,3|2:1| [3, 2,3]| 3| [2, 3]|
| 1| 3:2,3|2:1| [2, 1]| 2| [1]|
| 2| 2| [2]| 2| null|
| 3| 2:12,16|[2, 12,16]| 2| [12, 16]|
| 4| 3:2,4,6|[3, 2,4,6]| 3|[2, 4, 6]|
| 5| 2| [2]| 2| null|
| 6| 2:3,7|2:3| [2, 3,7]| 2| [3, 7]|
| 6| 2:3,7|2:3| [2, 3]| 2| [3]|
+---+-----------+----------+----------+---------+
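Note that record[0] is still a string at this point, which is why sum_emp shows up as a double (5.0 and so on) in the final output below; an explicit cast would keep it integral (an optional tweak):

df = df.withColumn("nb_players", psf.col("record")[0].cast("int"))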
Finally, for each id we sum the number of players and the sizes of the years lists:
df = df.withColumn(
    "years_size",
    psf.when(psf.size("years") > 0, psf.size("years")).otherwise(0)
).groupby("id").agg(
    psf.first("description").alias("description"),
    psf.sum("nb_players").alias("sum_emp"),
    psf.sum("years_size").alias("org_changed")
).sort("id").show()
+---+-----------+-------+-----------+
| id|description|sum_emp|org_changed|
+---+-----------+-------+-----------+
| 1| 3:2,3|2:1| 5.0| 3|
| 2| 2| 2.0| 0|
| 3| 2:12,16| 2.0| 2|
| 4| 3:2,4,6| 3.0| 3|
| 5| 2| 2.0| 0|
| 6| 2:3,7|2:3| 4.0| 3|
+---+-----------+-------+-----------+
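For what it's worth, the same computation can be written more compactly using Spark SQL higher-order functions (a sketch, assuming Spark 2.4+; aggregate folds over the split records, and greatest(..., 0) clamps the -1 that size() returns on null for records with no ':' part):

import pyspark.sql.functions as psf

# df here is the original two-column (id, description) dataframe
result = df.withColumn(
    "sum_emp",
    psf.expr("aggregate(split(description, '[|]'), 0, (acc, p) -> acc + cast(split(p, ':')[0] as int))")
).withColumn(
    "org_changed",
    psf.expr("aggregate(split(description, '[|]'), 0, (acc, p) -> acc + greatest(size(split(split(p, ':')[1], ',')), 0))")
)
result.show()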
