How do you reference a pyspark dataframe during the execution of a UDF on another dataframe?
Here's a dummy example. I am creating two dataframes, scores and lastnames, and each contains a column that is the same across the two dataframes. In the UDF applied to scores, I want to filter on lastnames and return the last name string found there.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local")
sqlCtx = SQLContext(sc)
# Generate Random Data
import itertools
import random
student_ids = ['student1', 'student2', 'student3']
subjects = ['Math', 'Biology', 'Chemistry', 'Physics']
random.seed(1)
data = []
for (student_id, subject) in itertools.product(student_ids, subjects):
    data.append((student_id, subject, random.randint(0, 100)))
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("student_id", StringType(), nullable=False),
StructField("subject", StringType(), nullable=False),
StructField("score", IntegerType(), nullable=False)
])
# Create DataFrame
rdd = sc.parallelize(data)
scores = sqlCtx.createDataFrame(rdd, schema)
# create another dataframe
last_name = ["Granger", "Weasley", "Potter"]
data2 = []
for i in range(len(student_ids)):
    data2.append((student_ids[i], last_name[i]))
schema = StructType([
StructField("student_id", StringType(), nullable=False),
StructField("last_name", StringType(), nullable=False)
])
rdd = sc.parallelize(data2)
lastnames = sqlCtx.createDataFrame(rdd, schema)
scores.show()
lastnames.show()
from pyspark.sql.functions import udf
def getLastName(sid):
    tmp_df = lastnames.filter(lastnames.student_id == sid)
    return tmp_df.last_name
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)
And the following is the last part of the trace:
Py4JError: An error occurred while calling o114.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
You can't directly reference a dataframe (or an RDD) from inside a UDF. The DataFrame object is a handle on your driver that Spark uses to represent the data and the actions that will happen out on the cluster. The code inside your UDFs will run out on the cluster at a time of Spark's choosing. Spark does this by serializing that code, making copies of any variables included in the closure, and sending them out to each worker.
What you want to do instead is use the constructs Spark provides in its API to join/combine the two DataFrames. If one of the data sets is small, you can manually send out the data in a broadcast variable and then access it from your UDF. Otherwise, you can just create the two dataframes as you did, then use the join operation to combine them. Something like this should work:
joined = scores.withColumnRenamed("student_id", "join_id")
joined = joined.join(lastnames, joined.join_id == lastnames.student_id)\
    .drop("join_id")
joined.show()
+---------+-----+----------+---------+
| subject|score|student_id|last_name|
+---------+-----+----------+---------+
| Math| 13| student1| Granger|
| Biology| 85| student1| Granger|
|Chemistry| 77| student1| Granger|
| Physics| 25| student1| Granger|
| Math| 50| student2| Weasley|
| Biology| 45| student2| Weasley|
|Chemistry| 65| student2| Weasley|
| Physics| 79| student2| Weasley|
| Math| 9| student3| Potter|
| Biology| 2| student3| Potter|
|Chemistry| 84| student3| Potter|
| Physics| 43| student3| Potter|
+---------+-----+----------+---------+
It's also worth noting that, under the hood, Spark DataFrames have an optimization where a DataFrame that is part of a join can be converted to a broadcast variable to avoid a shuffle if it is small enough. So if you use the join method listed above, you should get the best possible performance without sacrificing the ability to handle larger data sets.
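If you want to force that behaviour explicitly, you can hint the join with broadcast from pyspark.sql.functions. A minimal sketch reusing the scores and lastnames DataFrames from above:
from pyspark.sql.functions import broadcast
# Mark the small DataFrame so Spark ships it to every executor and does a map-side join
joined = scores.join(broadcast(lastnames), on="student_id")
joined.show()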
Changing the pair to a dictionary for easy lookup of names:
data2 = {}
for i in range(len(student_ids)):
    data2[student_ids[i]] = last_name[i]
Instead of creating the rdd and turning it into a dataframe, create a broadcast variable:
# rdd = sc.parallelize(data2)
# lastnames = sqlCtx.createDataFrame(rdd, schema)
lastnames = sc.broadcast(data2)
Now access this in the udf via the value attribute of the broadcast variable (lastnames).
from pyspark.sql.functions import udf
def getLastName(sid):
    return lastnames.value[sid]
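Registering and applying the UDF then looks the same as in the question; the lookup now happens against the broadcast dict on each worker, so no DataFrame is referenced inside the UDF:
from pyspark.sql.types import StringType
getLastName_udf = udf(getLastName, StringType())
scores.withColumn("last_name", getLastName_udf("student_id")).show(10)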
I want to break the correlation between a column and the rest of the dataframe, while maintaining the distribution of the values in said column.
In pandas, I used to achieve this by simply shuffling the values of the column and then assigning the values back to it. It is not so straightforward in pyspark because the data is partitioned. I do not think there is even a way in pyspark to set a new column in a dataframe using a column from another dataframe.
So, in pyspark, how do I achieve the following?
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
Also, I'm hoping you give me a solution that does not include unnecessary shuffles.
One way to do it is:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import row_number,lit,rand
from pyspark.sql.window import Window
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
dfs = spark.createDataFrame(df)
w = Window().orderBy(lit('A'))
dfs = dfs.withColumn("row_num", row_number().over(w))
dfs_ts = dfs.select('b')
dfs_ts = dfs_ts.withColumn('o',rand()).orderBy('o')
dfs = dfs.drop('b')
dfs_ts = dfs_ts.drop('o')
w = Window().orderBy(lit('A'))
dfs_ts = dfs_ts.withColumn("row_num", row_number().over(w))
dfs = dfs.join(dfs_ts,on='row_num').drop('row_num')
But I do not need the shuffles that come with the join, and they are not necessary. If a blind hstack on a per-partition basis is possible in pyspark, that should be enough. Also, the window function warns me that I have not defined any partitions, so all my data would be collected into one partition. Might as well use pandas in that case.
I think you could achieve something like that on the RDD level. Not sure if it would be worth all the extra steps, and depending on the size of the partitions it might not perform really well because you'd have to hold all the values of the partition in memory for the shuffle.
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({'a':[x for x in range(10)],'b':[x for x in range(10)]})
dfs = spark.createDataFrame(df)
dfs.show()
rdd_a = dfs.select("a").rdd
rdd_b = dfs.select("b").rdd
from random import shuffle
def f(iterator):
    items = list(iterator)
    shuffle(items)
    for x in items:
        yield x
from pyspark.sql import Row
def reconstruct(x):
    return Row(a=x[0].a, b=x[1].b)
rdd_b_reordered = rdd_b.mapPartitions(f)
df_reordered = spark.createDataFrame(rdd_a.zip(rdd_b_reordered).map(reconstruct))
df_reordered.show()
"""
My output:
+---+---+
| a| b|
+---+---+
| 0| 2|
| 1| 4|
| 2| 0|
| 3| 1|
| 4| 3|
| 5| 7|
| 6| 9|
| 7| 5|
| 8| 6|
| 9| 8|
+---+---+
"""
Maybe you can tweak that to your needs. This only shuffles the values within each partition, which avoids a shuffle across partitions. It would probably also be better to do it in Scala.
I am not fully confident in this answer, but I think it's right.
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
So if we take your Pandas example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
We can just wrap the .sample expression in a function.
def shuffle(df: pd.DataFrame) -> pd.DataFrame:
    df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
    return df
And then we can bring it to Spark using the transform function. The `transform` function can take in both Pandas and Spark DataFrames and will convert to Spark if you are using the Spark engine.
from fugue import transform
import fugue_spark
transform(df, shuffle, schema="*", engine="spark")
But of course, this transform is applied per partition. If you don't specify the partitions, it uses the default partitions. If you want to shuffle to really randomize it, you can do:
transform(df, shuffle, schema="*", engine="spark", partition={"algo":"rand"}).show()
And Fugue will partition your data randomly before this operation.
You may not see the shuffle for your test case because the data is really small. If you end up with 4 partitions with 1 row each, they will end up returning the same value.
Hi, I have a Spark SQL dataframe with a whole bunch of columns. One of the columns ("date") is a date field. I want to apply the following transformation to every row in that column.
This is what I would do if it were a pandas dataframe. I can't seem to figure out the Spark equivalent:
df["date"] = df["date"].map(lambda x: x.isoformat() + "Z")
The column has values of the form
2020-12-07 01:01:48
I want the values to be of the form:
2020-12-07T01:01:48Z
Try something like this:
from pyspark.sql.types import StructType, StructField, DateType, StringType, IntegerType
from pyspark.sql.functions import col
from pyspark.sql import functions as F
schema = StructType([
    StructField("date", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame([(None, 22), (None, 25)], schema=schema)
Z = F.lit("Z").cast(StringType())
datetime = F.current_date().cast(StringType())
datetimeZ = F.concat(datetime,Z)
df = df.withColumn("date", datetimeZ)
df.show(5)
+-----------+---+
| date|age|
+-----------+---+
|2021-06-15Z| 22|
|2021-06-15Z| 25|
+-----------+---+
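For the original question's date column, a more direct route is date_format. This is just a sketch, assuming df is the asker's dataframe and the column is a timestamp (or a string in the yyyy-MM-dd HH:mm:ss form shown):
from pyspark.sql import functions as F
# Re-format the existing "date" column as ISO 8601 with a trailing Z
df = df.withColumn("date", F.date_format("date", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
df.show(truncate=False)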
I am trying to read a csv file and convert it into a dataframe.
input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name, Jan, Feb, March
As you can see, the csv file has no trailing "," separators when there are no trailing expenses.
Code:
from pyspark.sql.types import *
input1= sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))
schema = StructType([StructField('Id',StringType(),True), StructField('Name',StringType(),True), StructField('Jan',StringType(),True), StructField('Feb',StringType(),True), StructField('Mar',StringType(),True)])
df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas which should handle everything for you. From there you can then convert the pandas DataFrame to spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:
import pandas as pd
# Reading in txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt',
sep=",")
# Converting to spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)
Which produced the expected output.
The faster method would be to import through spark:
# Importing csv file using pyspark
csv_import = sqlContext.read\
    .format('csv')\
    .options(sep=',', header='true', inferSchema='true')\
    .load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
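One thing to double-check: the sample input.txt has no header row, so header='true' will treat the first data row as the header. A sketch that reads without a header and attaches the column names from the question afterwards:
# Read without a header, then name the columns to match the question's schema
csv_import = sqlContext.read\
    .format('csv')\
    .options(sep=',', header='false', inferSchema='true')\
    .load('<your location>/test.txt')\
    .toDF('Id', 'Name', 'Jan', 'Feb', 'Mar')
display(csv_import)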
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
          StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
          StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
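If you want to stay with the original sc.textFile approach, another option is to pad the short rows out to the schema length before building the DataFrame. A sketch reusing the question's names (schema is the 5-field StructType defined there):
def pad_row(fields, n=5):
    # rows with missing trailing expense columns get None so they match the 5-field schema
    return fields + [None] * (n - len(fields))
input1 = sc.textFile('/FileStore/tables/input.txt').map(lambda x: pad_row(x.split(",")))
df3 = sqlContext.createDataFrame(input1, schema)
df3.show()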
Here are a couple of options for you to consider. These use the wildcard character, so you can loop through all folders and sub-folders, look for files with names that match a specific pattern, and merge everything into a single dataframe.
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.head()
myDFCsv.count()
//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
.withColumn("file_name",input_file_name())
myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
I am trying, in pyspark, to obtain a new column by rounding one column of a table to the precision specified, in each row, by another column of the same table, e.g., from the following table:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
I should be able to obtain the following result:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
In particular, I have tried the following code:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, FloatType, LongType,
    IntegerType
)
pdDF = pd.DataFrame(columns=["Data", "Rounding"],
                    data=[[3.141592, 3], [0.577215, 1]])
mySchema = StructType([StructField("Data", FloatType(), True),
                       StructField("Rounding", IntegerType(), True)])
spark = (SparkSession.builder
.master("local")
.appName("column rounding")
.getOrCreate())
df = spark.createDataFrame(pdDF,schema=mySchema)
df.show()
df.createOrReplaceTempView("df_table")
df_rounded = spark.sql("SELECT Data, Rounding, ROUND(Data, Rounding) AS Rounded_Column FROM df_table")
df_rounded.show()
but I get the following error:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'round(df_table.`Data`, df_table.`Rounding`)' due to data type mismatch: Only foldable Expression is allowed for scale arguments; line 1 pos 23;\n'Project [Data#0, Rounding#1, round(Data#0, Rounding#1) AS Rounded_Column#12]\n+- SubqueryAlias df_table\n +- LogicalRDD [Data#0, Rounding#1], false\n"
Any help would be deeply appreciated :)
With Spark SQL, the catalyst optimizer throws the following error in your run: Only foldable Expression is allowed for scale arguments.
That is, the scale parameter (the new scale to round to) has to be a constant int at runtime.
ROUND only accepts a literal for the scale, so you can write custom code (a UDF) instead of going the spark-sql way.
EDIT:
With a UDF (in Scala):
val df = Seq(
  (3.141592, 3),
  (0.577215, 1)).toDF("Data", "Rounding")
df.show()
df.createOrReplaceTempView("df_table")
import org.apache.spark.sql.functions._
def RoundUDF(customvalue:Double, customscale:Int):Double = BigDecimal(customvalue).setScale(customscale, BigDecimal.RoundingMode.HALF_UP).toDouble
spark.udf.register("RoundUDF", RoundUDF(_:Double,_:Int):Double)
val df_rounded = spark.sql("select Data, Rounding, RoundUDF(Data, Rounding) as Rounded_Column from df_table")
df_rounded.show()
Input:
+--------+--------+
| Data|Rounding|
+--------+--------+
|3.141592| 3|
|0.577215| 1|
+--------+--------+
Output:
+--------+--------+--------------+
| Data|Rounding|Rounded_Column|
+--------+--------+--------------+
|3.141592| 3| 3.142|
|0.577215| 1| 0.6|
+--------+--------+--------------+
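Since the question is tagged pyspark, here is a rough Python equivalent of the same UDF idea. It is only a sketch (the helper names are mine) that uses the decimal module for half-up rounding, applied to the df from the question:
from decimal import Decimal, ROUND_HALF_UP
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
def round_to_scale(value, scale):
    # mimic BigDecimal.setScale(scale, HALF_UP): round value to `scale` decimal places
    if value is None or scale is None:
        return None
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=3 -> Decimal('0.001')
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))
round_udf = udf(round_to_scale, DoubleType())
df_rounded = df.withColumn("Rounded_Column", round_udf(col("Data"), col("Rounding")))
df_rounded.show()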
I read a .csv file into a Spark DataFrame. For a DoubleType column, is there a way to specify at the time of the file read that this column should be rounded to 2 decimal places? I'm also supplying a custom schema to the DataFrameReader API call. Here's my schema and the API calls:
val customSchema = StructType(Array(StructField("id_1", IntegerType, true),
StructField("id_2", IntegerType, true),
StructField("id_3", DoubleType, true)))
// using Spark's CSV reader with custom schema
// spark == SparkSession()
val parsedSchema = spark.read.format("csv").schema(customSchema).option("header", "true").option("nullvalue", "?").load("C:\\Scala\\SparkAnalytics\\block_1.csv")
After the file read into DataFrame I can round the decimals like:
parsedSchema.withColumn("cmp_fname_c1", round($"cmp_fname_c1", 3))
But this creates a new DataFrame, so I'd also like to know if it can be done in-place instead of creating a new DataFrame.
Thanks
You can specify, say, DecimalType(10, 2) for the DoubleType column in your customSchema when loading your CSV file. Let's say you have a CSV file with the following content:
id_1,id_2,Id_3
1,10,5.555
2,20,6.0
3,30,7.444
Sample code below:
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
StructField("id_1", IntegerType, true),
StructField("id_2", IntegerType, true),
StructField("id_3", DecimalType(10, 2), true)
))
spark.read.format("csv").schema(customSchema).
option("header", "true").option("nullvalue", "?").
load("/path/to/csvfile").
show
// +----+----+----+
// |id_1|id_2|id_3|
// +----+----+----+
// | 1| 10|5.56|
// | 2| 20|6.00|
// | 3| 30|7.44|
// +----+----+----+
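On the in-place part of the question: Spark DataFrames are immutable, so every transformation returns a new DataFrame and there is no true in-place update; the usual idiom is simply to reassign the same variable. A PySpark sketch of that flow (the schema and path are placeholders mirroring the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import round as spark_round, col
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
spark = SparkSession.builder.getOrCreate()
custom_schema = StructType([
    StructField("id_1", IntegerType(), True),
    StructField("id_2", IntegerType(), True),
    StructField("id_3", DoubleType(), True),
])
# "In-place" just means rebinding the same name; the underlying data is never mutated
df = (spark.read.format("csv")
      .schema(custom_schema)
      .option("header", "true")
      .option("nullValue", "?")
      .load("/path/to/csvfile"))
df = df.withColumn("id_3", spark_round(col("id_3"), 2))
df.show()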