In Spark I want to be able to parallelise over multiple dataframes.
The method I am trying is to nest dataframes in a parent dataframe, but I am not sure of the syntax or whether it is possible.
For example I have the following 2 dataframes:
df1:
+-----------+---------+--------------------+------+
|         id| asset_id|                date|  text|
+-----------+---------+--------------------+------+
|20160629025|       A1|2016-06-30 11:41:...|aaa...|
|20160423007|       A1|2016-04-23 19:40:...|bbb...|
|20160312012|       A2|2016-03-12 19:41:...|ccc...|
|20160617006|       A2|2016-06-17 10:36:...|ddd...|
|20160624001|       A2|2016-06-24 04:39:...|eee...|
+-----------+---------+--------------------+------+
df2:
+--------+--------------------+------------+
|asset_id|      best_date_time|Other_fields|
+--------+--------------------+------------+
|      A1|2016-09-28 11:33:...|         abc|
|      A1|2016-06-24 00:00:...|         edf|
|      A1|2016-08-12 00:00:...|         hij|
|      A2|2016-07-01 00:00:...|         klm|
|      A2|2016-07-10 00:00:...|         nop|
+--------+--------------------+------------+
So I want to combine these to produce something like this:
+--------+-------------------+-------------------+
|asset_id|                df1|                df2|
+--------+-------------------+-------------------+
|      A1|[df1 - rows for A1]|[df2 - rows for A1]|
|      A2|[df1 - rows for A2]|[df2 - rows for A2]|
+--------+-------------------+-------------------+
Note, I don't want to join or union them as that would be very sparse (I actually have about 30 dataframes and thousands of assets each with thousands of rows).
I then plan to do a groupByKey on this so that I get something like this that I can call a function on:
[('A1', <pyspark.resultiterable.ResultIterable object at 0x2534310>), ('A2', <pyspark.resultiterable.ResultIterable object at 0x25d2310>)]
I'm new to Spark, so any help is greatly appreciated.
TL;DR It is not possible to nest DataFrames but you can use complex types.
In this case you could for example (Spark 2.0 or later):
from pyspark.sql.functions import collect_list, struct

df1_grouped = (df1
    .groupBy("asset_id")
    .agg(collect_list(struct("id", "date", "text")).alias("df1")))

df2_grouped = (df2
    .groupBy("asset_id")
    .agg(collect_list(struct("best_date_time", "Other_fields")).alias("df2")))

df1_grouped.join(df2_grouped, ["asset_id"], "full_outer")
but you have to be aware that:
- It is quite expensive.
- It has limited applications. In general, nested structures are cumbersome to use and require complex and expensive (especially in PySpark) UDFs.
Related
I have two dataframes df1 and df2 somewhat like this:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("someAppname").getOrCreate()
df1 = spark.createDataFrame(pd.DataFrame({"entity_nm": ["Joe B", "Donald", "Barack Obama"]}))
df2 = spark.createDataFrame(pd.DataFrame({"aliases": ["Joe Biden; Biden Joe", "Donald Trump; Donald J. Trump", "Barack Obama", "Joe Burrow"], "id": [1, 2, 3, 4]}))
I want to join df2 onto df1 based on a string-contains match; it works when I do it like this:
df_joined = df1.join(df2, df2.aliases.contains(df1.entity_nm), how="left")
That join gives me my desired result:
+------------+--------------------+---+
|   entity_nm|             aliases| id|
+------------+--------------------+---+
|       Joe B|Joe Biden; Biden Joe|  1|
|       Joe B|          Joe Burrow|  4|
|      Donald|Donald Trump; Don...|  2|
|Barack Obama|        Barack Obama|  3|
+------------+--------------------+---+
Problem here: I tried to do this with a list of 60k entity names in df1 and around 6 million aliases in df2, and this approach takes forever until at some point my Spark session crashes with memory errors. I'm pretty sure my approach is naive and far from optimized.
I've read this blog post which suggests using a UDF, but I don't have any Scala knowledge and struggle to understand and recreate it in PySpark.
Any suggestions or help on how to optimize my approach? I need to do tasks like this a lot, so any help would be greatly appreciated.
I have a dataframe with the below data and columns:
sales_df.select('sales', 'monthly_sales').show()
+----------------+--------------------------+
| sales | monthly_sales |
+----------------+--------------------------+
| mid| 50.0|
| low| 21.0|
| low| 25.0|
| high| 70.0|
| mid| 60.0|
| high| 75.0|
| high| 95.0|
|................|..........................|
|................|..........................|
|................|..........................|
| low| 25.0|
| low| 20.0|
+----------------+--------------------------+
I am trying to find the average of each sales type, so that my final dataframe has only three rows (one for each sales type) with the columns:
sale & average_sale
I used groupBy to achieve this:
from pyspark.sql.functions import avg
sales_df.groupBy("sales").agg(avg("monthly_sales").alias("average_sales")).show()
and I was able to get the average sale as well.
+----------------+-------------------------------+
| sales | average sales |
+----------------+-------------------------------+
| mid| 5.568177828054298|
| high| 1.361184210526316|
| low| 3.014350758853288|
+----------------+-------------------------------+
This ran fast because I am running my logic on test data with only 200 rows, so the code completed in no time. But I have huge data in my actual application, and groupBy causes a data shuffle.
Is there any better way to find the average without using groupBy?
Could anyone let me know an efficient way to achieve this, considering the huge data size?
groupBy is exactly what you're looking for. Spark is designed to handle big data (of any size, really), so what you should do is configure your Spark application properly (i.e., give it the right amount of memory, increase the number of cores, use more executors, improve parallelism, ...).
I have a dataframe as below:
+----------+----------+--------+
| FNAME| LNAME| AGE|
+----------+----------+--------+
| EARL| JONES| 35|
| MARK| WOOD| 20|
+----------+----------+--------+
I am trying to add a new column named VALUE to this dataframe, which should look like this:
+-----+-----+---+-------------------------------------------+
|FNAME|LNAME|AGE|                                      VALUE|
+-----+-----+---+-------------------------------------------+
| EARL|JONES| 35|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
| MARK| WOOD| 20| {"FNAME":"MARK","LNAME":"WOOD","AGE":"20"}|
+-----+-----+---+-------------------------------------------+
I am not able to achieve this using withColumn or any json function.
Any headstart would be appreciated.
Spark: 2.3
Python: 3.7.x
Please consider using the SQL function to_json, which you can find in org.apache.spark.sql.functions.
Here's the solution:
df.withColumn("VALUE", to_json(struct($"FNAME", $"LNAME", $"AGE")))
And you can also avoid specifying the column names, as follows:
df.withColumn("VALUE", to_json(struct(df.columns.map(col): _*)))
PS: the code I provided is written in Scala, but the logic is the same in Python; you just have to use the Spark SQL function, which is available in both languages.
I hope it helps.
Scala solution:
import org.apache.spark.sql.functions._

val df2 = df.select(
  to_json(
    map_from_arrays(lit(df.columns), array('*))
  ).as("value")
)
Python solution (I don't know how to do it for n columns as in Scala, because map_from_arrays does not exist in PySpark):
import pyspark.sql.functions as f
df.select(f.to_json(
f.create_map(f.lit("FNAME"), df.FNAME, f.lit("LNAME"), df.LNAME, f.lit("AGE"), df.AGE)
).alias("value")
).show(truncate=False)
output:
+-------------------------------------------+
|value |
+-------------------------------------------+
|{"FNAME":"EARL","LNAME":"JONES","AGE":"35"}|
|{"FNAME":"MARK","LNAME":"WOOD","AGE":"20"} |
+-------------------------------------------+
Achieved using:
from pyspark.sql.functions import to_json, struct

df.withColumn("VALUE", to_json(struct([df[x] for x in df.columns])))
I've come across something strange recently in Spark. As far as I understand, given the column-based storage method of Spark DataFrames, the order of the columns really doesn't have any meaning; they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would have assumed that it shouldn't, but according to the wisdom of SQL forums it does.
So we have df1:
+---+----+
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+
df2:
+----+---+
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+
result:
+----+----+
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of its original dataframe.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set, if anyone wants to try:
import pandas as pd

d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1=pd.DataFrame(d1)
pdf2=pd.DataFrame(d2)
df1=spark.createDataFrame(pdf1)
df2=spark.createDataFrame(pdf2)
test=df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes, with the columns resolved by name.
In Spark, union is not done on the metadata of the columns, and the data is not shuffled as you might think. Rather, union is done by column position: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into consideration before doing the union. Unlike SQL or Oracle or other RDBMSs, the underlying files in Spark are physical files. Hope that answers your question.
This question already has answers here:
Multiple Aggregate operations on the same column of a spark dataframe
(6 answers)
Closed 5 years ago.
To group a Spark dataframe by columns with PySpark, I use a command like this:
df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})
As a result I get output like that:
+-----------------+-------------+------------------+--------+
| _c1| _c3| avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
| Local-gov| HS-grad| 644952.5714285715| 9|
| Local-gov| Assoc-acdm|365081.64285714284| 12|
| Never-worked| Some-college| 462294.0| 10|
| Local-gov| Bachelors| 398296.35| 13|
| Federal-gov| HS-grad| 493293.0| 9|
| Private| 12th| 632520.5454545454| 8|
| State-gov| Assoc-voc| 412814.0| 11|
| ?| HS-grad| 545870.9230769231| 9|
| Private| Prof-school|340322.89130434784| 15|
+-----------------+-------------+------------------+--------+
Which is nice, but there are two things that I miss:
I would like to have control over the names of the columns. For example, I want a new column to be named avg_c2 instead of avg(_c2).
I want to aggregate the same column in different ways. For example, I might want to know the minimum and maximum of column _c4. I tried the following and it does not work:
df2 = df.groupBy('_c1','_c3').agg({'_c4':('min','max'), '_c2' : 'avg'})
Is there a way to achieve what I need?
You have to use the withColumn API and generate new columns or replace the old ones.
Or you can use alias to get the required column name instead of the default avg(_c2).
I haven't used PySpark yet, but in Scala I do something like:
import org.apache.spark.sql.functions._

val df2 = df.groupBy("_c1", "_c3").agg(
  max(col("_c4")).alias("max_c4"),
  min(col("_c4")).alias("min_c4"),
  avg(col("_c2")).alias("avg_c2"))