Suppose I have the dataframe below:
+----+-------+------+-----+-------+------+-----+----------------+
|s.id|s.first|s.last|s.age|d.first|d.last|d.age|  UPDATED_FIELDS|
+----+-------+------+-----+-------+------+-----+----------------+
|   1|    AAA|   BBB|   10|  AAA__|   BBB|   10|       ["first"]|
|   2|    CCC|   DDD|   20|  CCC__|   DDD|   21|["first", "age"]|
+----+-------+------+-----+-------+------+-----+----------------+
I want to transform it into the format below, so that for each entry in UPDATED_FIELDS in the first dataframe a new row is created in the second dataframe:
+---+-----+-------+-------+
| id|field|s_value|d_value|
+---+-----+-------+-------+
|  1|first|    AAA|  AAA__|
|  2|first|    CCC|  CCC__|
|  2|  age|     20|     21|
+---+-----+-------+-------+
I feel like I need to create a new dataframe, but I couldn't get it working.
This isn't straightforward because after you explode "UPDATED_FIELDS" to create the "field" column, you have the following dataframe:
+----+-------+------+-----+-------+------+-----+--------------+-----+
|s.id|s.first|s.last|s.age|d.first|d.last|d.age|UPDATED_FIELDS|field|
+----+-------+------+-----+-------+------+-----+--------------+-----+
| 1| AAA| BBB| 10| AAA__| BBB| 10| [first]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]| age|
+----+-------+------+-----+-------+------+-----+--------------+-----+
The new columns s_value and d_value depend on using the literal from the exploded column "field" to tell us which column's value to use. It would be nice to do something like:
df.withColumn(
    "field", F.explode("UPDATED_FIELDS")
).withColumn(
    "s_value", F.col(f"s.{F.col('field')}")
)
... but this will result in an error, because F.col('field') inside the f-string is a Column object rather than the string value stored in the row, so it cannot be interpolated as a literal column name.
Update: based on this helpful answer, you can in fact use the string stored in each row to look up a column. You first need to perform the explode separately so that df['field'] is accessible, then use F.coalesce over a list of F.when(...) expressions to pick out the value of whichever column's name matches the string in df['field']:
df = df.withColumn(
"field", F.explode("UPDATED_FIELDS")
)
## obtain list of all unique strings in "field"
base_cols = df.select(F.collect_set("field").alias("column")).first()["column"]
df.withColumn(
    "s_value", F.coalesce(*[F.when(df['field'] == c, df[f"`s.{c}`"]) for c in base_cols])
).withColumn(
    "d_value", F.coalesce(*[F.when(df['field'] == c, df[f"`d.{c}`"]) for c in base_cols])
).select(
    F.col("`s.id`").alias("id"), "field", "s_value", "d_value"
).show()
+---+-----+-------+-------+
| id|field|s_value|d_value|
+---+-----+-------+-------+
| 1|first| AAA| AAA__|
| 2|first| CCC| CCC__|
| 2| age| 20| 21|
+---+-----+-------+-------+
I have a specific requirement where I need to query a dataframe based on a range condition.
The values of the range come from the rows of another dataframe and so I will have as many queries as the rows in this different dataframe.
Using collect() in my scenario seems to be the bottleneck because it brings every row to the driver.
Example:
I need to execute a query on table 2 for every row in table 1
Table 1:
+---+----+----+
|ID1|Num1|Num2|
+---+----+----+
|  1|  10|   3|
|  2|  40|   4|
+---+----+----+
Table 2:
+---+----+
|ID2|Num3|
+---+----+
|  1|   9|
|  2|  39|
|  3|  22|
|  4|  12|
+---+----+
For the first row in table 1, I create a range [10-3, 10+3] = [7, 13] => this becomes the range for the first query.
For the second row in table 1, I create a range [40-4, 40+4] = [36, 44] => this becomes the range for the second query.
I am currently doing collect() and iterating over the rows to get the values. I use these values as ranges in my queries for Table 2.
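Roughly, my current approach looks like the sketch below (table1 and table2 are just stand-ins for my real dataframe names):
# Bring every row of table 1 to the driver, then run one range query per row.
ranges = [
    (r["ID1"], r["Num1"] - r["Num2"], r["Num1"] + r["Num2"])
    for r in table1.collect()
]
outputs = {
    id1: table2.filter((table2["Num3"] >= low) & (table2["Num3"] <= high))
    for id1, low, high in ranges
}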
Output of Query 1:
+---+----+
|ID2|Num3|
+---+----+
|  1|   9|
|  4|  12|
+---+----+
Output of Query 2:
+---+----+
|ID2|Num3|
+---+----+
|  2|  39|
+---+----+
Since the number of rows in table 1 is very large, doing a collect() operation is costly.
And since the values are numeric, I assume a join won't work.
Any help in optimizing this task is appreciated.
Depending on what you want your output to look like, you could solve this with a join. Consider the following code:
case class FirstType(id1: Int, num1: Int, num2: Int)
case class Bounds(id1: Int, lowerBound: Int, upperBound: Int)
case class SecondType(id2: Int, num3: Int)
val df = Seq((1, 10, 3), (2, 40, 4)).toDF("id1", "num1", "num2").as[FirstType]
df.show
+---+----+----+
|id1|num1|num2|
+---+----+----+
| 1| 10| 3|
| 2| 40| 4|
+---+----+----+
val df2 = Seq((1, 9), (2, 39), (3, 22), (4, 12)).toDF("id2", "num3").as[SecondType]
df2.show
+---+----+
|id2|num3|
+---+----+
| 1| 9|
| 2| 39|
| 3| 22|
| 4| 12|
+---+----+
val bounds = df.map(x => Bounds(x.id1, x.num1 - x.num2, x.num1 + x.num2))
bounds.show
+---+----------+----------+
|id1|lowerBound|upperBound|
+---+----------+----------+
| 1| 7| 13|
| 2| 36| 44|
+---+----------+----------+
val test = bounds.join(df2, df2("num3") >= bounds("lowerBound") && df2("num3") <= bounds("upperBound"))
test.show
+---+----------+----------+---+----+
|id1|lowerBound|upperBound|id2|num3|
+---+----------+----------+---+----+
| 1| 7| 13| 1| 9|
| 2| 36| 44| 2| 39|
| 1| 7| 13| 4| 12|
+---+----------+----------+---+----+
In here, I do the following:
- Create 3 case classes to be able to use typed datasets later on
- Create the 2 dataframes
- Create an auxiliary dataframe called bounds, which contains the lower/upper bounds
- Join the second dataframe onto that auxiliary one
As you can see, the test dataframe contains the result. For each unique combination of the id1, lowerBound and upperBound columns, looking at just the id2 and num3 columns gives you the separate result sets you wanted.
You could, for example, use a groupBy operation to group by these 3 columns and then do whatever you want with the output KeyValueGroupedDataset (something like test.groupBy("id1", "lowerBound", "upperBound")). From there it depends on what you need: if you want to apply an operation to each dataset for each of the bounds, you could use the mapValues method of KeyValueGroupedDataset.
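If you are on the Python API, a rough PySpark sketch of the same join approach (assuming dataframes named df and df2 with the columns shown above) would be:
from pyspark.sql import functions as F

# Build the per-row bounds, then join on the range condition.
bounds = df.select(
    "id1",
    (F.col("num1") - F.col("num2")).alias("lowerBound"),
    (F.col("num1") + F.col("num2")).alias("upperBound"),
)
test = bounds.join(
    df2,
    (df2["num3"] >= bounds["lowerBound"]) & (df2["num3"] <= bounds["upperBound"]),
)
test.show()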
Hope this helps!
I have a dataframe with an id column that defines groups. For each id I want to pair its elements in the following way:
title   id
sal     1
summer  1
fada    1
row     2
winter  2
gole    2
jack    3
noway   3

Output:

title   id  pair
sal     1   None
summer  1   summer,sal
fada    1   fada,summer
row     2   None
winter  2   winter,row
gole    2   gole,winter
jack    3   None
noway   3   noway,jack
As you can see in the output, within each id group every element is paired with the element above it; since the first element of a group has nothing above it, its pair is None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark dataframe is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you want.
from pyspark.sql import functions as F, Window
w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())
df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
I have a dataframe that consists of 5 columns. I need to add a new column at the 3rd position. How can I achieve this in Spark?
df.show()
+---------+--------+---+----------+--------+
|last_name|position|age|salary_inc| segment|
+---------+--------+---+----------+--------+
| george| IT| 10| 2313| one|
| jhon| non-it| 21| 34344| null|
| mark| IT| 11| 16161| third|
| spencer| it| 31| 2322| null|
| spencer| non-it| 41| 2322|Valuable|
+---------+--------+---+----------+--------+
Add new_column at position 3
+---------+--------+-----------+---+----------+--------+
|last_name|position|new_column |age|salary_inc| segment|
+---------+--------+-----------+---+----------+--------+
Can you please help me with this?
(
    df.withColumn("new_column", ...)
      .select("last_name",
              "position",
              "new_column",
              ...)
      .show()
)
The first ellipsis stands for whatever you're creating in your new column "new_column"; for example, lit(1) would give you the literal (constant) 1 of type IntegerType. The second ellipsis stands for the remaining columns, in the order you wish to select them.
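For the sample schema above, a minimal concrete sketch (lit(1) is only a placeholder expression for the new column) could look like this:
from pyspark.sql import functions as F

# lit(1) is a placeholder; substitute the expression you actually need.
(
    df.withColumn("new_column", F.lit(1))
      .select("last_name", "position", "new_column", "age", "salary_inc", "segment")
      .show()
)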
I want to perform a subtract between 2 dataframes in PySpark. The challenge is that I have to ignore some columns while subtracting, but the end dataframe should have all the columns, including the ignored ones.
Here is an example:
userLeft = sc.parallelize([
Row(id=u'1',
first_name=u'Steve',
last_name=u'Kent',
email=u's.kent#email.com',
date1=u'2017-02-08'),
Row(id=u'2',
first_name=u'Margaret',
last_name=u'Peace',
email=u'marge.peace#email.com',
date1=u'2017-02-09'),
Row(id=u'3',
first_name=None,
last_name=u'hh',
email=u'marge.hh#email.com',
date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
Row(id=u'2',
first_name=u'Margaret',
last_name=u'Peace',
email=u'marge.peace#email.com',
date1=u'2017-02-11'),
Row(id=u'3',
first_name=None,
last_name=u'hh',
email=u'marge.hh#email.com',
date1=u'2017-02-12')
]).toDF()
Expected:
ActiveDF = userLeft.subtract(userRight), but ignoring the "date1" column while subtracting.
End result should look something like this including "date1" column.
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent#email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent#email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
You can also use a full join and only keep null values:
import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh#email.com| null| 3| hh|2017-02-10| null|
|marge.hh#email.com| null| 3| hh| null|2017-02-12|
| s.kent#email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use joins, whether leftanti or full, you'll need to find default values for the nulls in your joining columns (I think we discussed it in a previous thread).
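For example, a minimal sketch of that idea with the leftanti join, assuming a sentinel string such as "<null>" never appears in the real data:
# Replace nulls in the joining columns with a sentinel on both sides,
# so that null effectively matches null during the anti-join.
join_cols = [c for c in userLeft.columns if c != "date1"]
left_filled = userLeft.fillna("<null>", subset=join_cols)
right_filled = userRight.fillna("<null>", subset=join_cols)
left_filled.join(right_filled, join_cols, "leftanti").show()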
You can also just drop the column that bothers you, subtract, and then join:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent#email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+
I have a Pyspark Dataframe with this structure:
+----+----+---+----+---+
|user| A/B|  C| A/B|  C|
+----+----+---+----+---+
|   1|   0|  1|   1|  2|
|   2|   0|  2|   4|  0|
+----+----+---+----+---+
I originally had two dataframes, but I outer-joined them using user as the key, so there can also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+---+
|user| A/B|  C|
+----+----+---+
|   1|   1|  3|
|   2|   4|  2|
+----+----+---+
Also note that there could be many equal columns, so selecting literally each column is not an option. In pandas this was possible using "user" as Index and then adding both dataframes. How can I do this on Spark?
I have a workaround for this:
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now do the join; the output will contain the values under different names.
Then build the list of column-name tuples to be combined:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
Then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
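For the PySpark side of the question, here is a minimal sketch of the same idea (rename df1's non-key columns, outer-join, then sum each pair back together); the names df1 and df2 and the 0 used in place of nulls are assumptions:
from pyspark.sql import functions as F

# Rename df1's non-key columns so the joined result has unique names.
renamed = df1.select(
    [F.col(f"`{c}`").alias(c if c == "user" else c + "_1") for c in df1.columns]
)
joined = renamed.join(df2, on="user", how="outer")

# Sum each renamed/original pair, treating nulls from the outer join as 0.
result = joined.select(
    "user",
    *[
        (F.coalesce(F.col(f"`{c}_1`"), F.lit(0)) + F.coalesce(F.col(f"`{c}`"), F.lit(0))).alias(c)
        for c in df1.columns
        if c != "user"
    ],
)
result.show()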