I'm using Spark 2.3 and Scala 2.11.8.
I have a DataFrame like the one below:
--------------------------------------------------------
| ID | Name | Desc_map |
--------------------------------------------------------
| 1 | abcd | "Company" -> "aa" , "Salary" -> "1" ....|
| 2 | efgh | "Company" -> "bb" , "Salary" -> "2" ....|
| 3 | ijkl | "Company" -> "cc" , "Salary" -> "3" ....|
| 4 | mnop | "Company" -> "dd" , "Salary" -> "4" ....|
--------------------------------------------------------
Expected DataFrame:
----------------------------------------
| ID | Name | Company | Salary | .... |
----------------------------------------
| 1 | abcd | aa | 1 | .... |
| 2 | efgh | bb | 2 | .... |
| 3 | ijkl | cc | 3 | .... |
| 4 | mnop | dd | 4 | .... |
----------------------------------------
Any help is appreciated.
If data is your dataset that contains:
+---+----+----------------------------+
|ID |Name|Map |
+---+----+----------------------------+
|1 |abcd|{Company -> aa, Salary -> 1}|
|2 |efgh|{Company -> bb, Salary -> 2}|
|3 |ijkl|{Company -> cc, Salary -> 3}|
|4 |mnop|{Company -> aa, Salary -> 4}|
+---+----+----------------------------+
You can get your desired output through:
data = data.selectExpr(
"ID",
"Name",
"Map.Company",
"Map.Salary"
)
Final output:
+---+----+-------+------+
|ID |Name|Company|Salary|
+---+----+-------+------+
|1 |abcd|aa |1 |
|2 |efgh|bb |2 |
|3 |ijkl|cc |3 |
|4 |mnop|aa |4 |
+---+----+-------+------+
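If the map has many keys (the "...." in the question suggests it does) and you don't want to hard-code every name, one option is to collect the distinct keys first and build the select programmatically. A minimal sketch, shown in PySpark for brevity (map_keys, explode and getItem exist in the Scala API as well, from Spark 2.3 on), assuming the map column is called Desc_map as in the question:
from pyspark.sql.functions import col, explode, map_keys
# Collect the distinct map keys (this runs one extra job over the data).
keys = [r[0] for r in df.select(explode(map_keys(col("Desc_map")))).distinct().collect()]
# Build one column per key, keeping ID and Name.
df_flat = df.select("ID", "Name", *[col("Desc_map").getItem(k).alias(k) for k in keys])
df_flat.show(truncate=False)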
Good luck!
I have a csv as described:
s_table | s_name | t_cast | t_d |
aaaaaa | juuoo | TRUE |float |
aaaaaa | juueo | TRUE |float |
aaaaaa | ju4oo | | |
aaaaaa | juuoo | | |
aaaaaa | thyoo | | |
aaaaaa | juioo | | |
aaaaaa | rtyoo | | |
I am trying to use PySpark's when condition to check t_cast against s_table and, if it is TRUE, return a statement in a new column.
What I've tried is:
filters = filters.withColumn("p3", f.when((f.col("s_table") == "aaaaaa") & (f.col("t_cast").isNull()),f.col("s_name")).
when((f.col("s_table") == "aaaaaa") & (f.col("t_cast") == True),
f"CAST({f.col('s_table')} AS {f.col('t_d')}) AS {f.col('s_table')}"))
What I am trying to achieve is for the column p3 to return this:
s_table | s_name | t_cast | t_d | p_3 |
aaaaaa | juuoo | TRUE |float | cast ('juuoo' as float) as 'juuoo' |
aaaaaa | juueo | TRUE |float | cast ('juueo' as float) as 'juuoo' |
aaaaaa | ju4oo | | | ju4oo |
aaaaaa | juuoo | | | juuoo |
aaaaaa | thyoo | | | thyoo |
aaaaaa | juioo | | | juioo |
aaaaaa | rtyoo | | | rtyoo |
But the result that I get is:
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
CAST(Column<'s_field'> AS Column<'t_data_type'>) AS Column<'s_field'>,
I feel like I am almost there, but I can't quite figure it out.
You need to use the Spark concat function instead of a Python format string to get the expected value. Something like:
import pyspark.sql.functions as F
filters = filters.withColumn(
"p3",
(F.when((F.col("s_table") == "aaaaaa") & (F.col("t_cast").isNull()), F.col("s_name"))
.when((F.col("s_table") == "aaaaaa") & F.col("t_cast"),
F.expr(r"concat('CAST(\'', s_name, '\' AS ', t_d, ') AS \'', s_table, '\'')")
)
)
)
filters.show(truncate=False)
#+-------+------+------+-----+----------------------------------+
#|s_table|s_name|t_cast|t_d |p3 |
#+-------+------+------+-----+----------------------------------+
#|aaaaaa |juuoo |true |float|CAST('juuoo' AS float) AS 'aaaaaa'|
#|aaaaaa |juueo |true |float|CAST('juueo' AS float) AS 'aaaaaa'|
#|aaaaaa |ju4oo |null |null |ju4oo |
#|aaaaaa |juuoo |null |null |juuoo |
#|aaaaaa |thyoo |null |null |thyoo |
#|aaaaaa |juioo |null |null |juioo |
#|aaaaaa |rtyoo |null |null |rtyoo |
#+-------+------+------+-----+----------------------------------+
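As a possible alternative (my own suggestion, not part of the answer above), format_string can assemble the same string directly from Column values, avoiding the escaping inside expr:
import pyspark.sql.functions as F
# format_string substitutes Column values, which a Python f-string cannot do.
cast_str = F.format_string("CAST('%s' AS %s) AS '%s'",
                           F.col("s_name"), F.col("t_d"), F.col("s_table"))
filters = filters.withColumn(
    "p3",
    F.when((F.col("s_table") == "aaaaaa") & F.col("t_cast").isNull(), F.col("s_name"))
     .when((F.col("s_table") == "aaaaaa") & F.col("t_cast"), cast_str))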
I have two similar dataframes, one has a single date and the other has multiple dates plus an additional column:
df:
| yyyy_mm_dd | id | region | country | product | count |
|------------|-----|--------|----------|---------|-------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 10 |
| 2021-06-14 | 111 | EMEA | England | P1 | 9 |
| 2021-06-14 | 111 | EMEA | France | P1 | 10 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 299 |
| 2021-06-14 | 111 | EMEA | England | P2 | 39 |
| 2021-06-14 | 111 | EMEA | France | P2 | 10 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 64 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 |
| 2021-06-14 | ... | ... | ... | ... | ... |
df1:
| yyyy_mm_dd | id | region | country | product | count | fullfilments |
|------------|-----|--------|----------|---------|-------|--------------|
| 2021-06-14 | 111 | EMEA | Spain | P1 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P1 | 1 | 3 |
| 2021-06-14 | 111 | EMEA | France | P1 | 2 | 4 |
| 2021-06-14 | 111 | EMEA | Spain | P2 | 1 | 1 |
| 2021-06-14 | 111 | EMEA | England | P2 | 2 | 1 |
| 2021-06-14 | 111 | EMEA | France | P2 | 1 | 5 |
| 2021-06-14 | 112 | LATAM | Brazil | P1 | 2 | 2 |
| 2021-06-14 | 112 | LATAM | Paraguay | P2 | 21 | 1 |
| 2021-06-14 | ... | ... | ... | ... | ... | ... |
| 2021-06-13 | 111 | EMEA | Spain | P1 | 0 | 1 |
| 2021-06-13 | 111 | EMEA | England | P2 | 0 | 2 |
df1 has many dates of grouped data and df has only one date. I would like to replace the count column in df1 with the count in df for matching rows (yyyy_mm_dd, id, region, country, product) and retain fullfilments.
I could probably join both together and drop count in the first df; however, I only want to replace counts where the date matches and retain all other rows in df1.
You can simply join and use the coalesce function.
When you left-join the dataframe you want to keep whole to the one carrying the replacement counts, only the matching records receive a non-null new_count value. Then use the coalesce function, which returns the first value when it is not null and the second value otherwise.
coalesce(a , b ) => a
coalesce(a , null) => a
coalesce(null, b ) => b
From your dataframes,
from pyspark.sql import functions as f
df1 = spark.read.option("inferSchema","true").option("header","true").csv("test1.csv")
+----------+---+------+--------+-------+-----+
|yyyy_mm_dd|id |region|country |product|count|
+----------+---+------+--------+-------+-----+
|2021-06-14|111|EMEA |Spain |P1 |10 |
|2021-06-14|111|EMEA |England |P1 |9 |
|2021-06-14|111|EMEA |France |P1 |10 |
|2021-06-14|111|EMEA |Spain |P2 |299 |
|2021-06-14|111|EMEA |England |P2 |39 |
|2021-06-14|111|EMEA |France |P2 |10 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |
+----------+---+------+--------+-------+-----+
df2 = spark.read.option("inferSchema","true").option("header","true").csv("test2.csv")
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |1 |1 |
|2021-06-14|111|EMEA |England |P1 |1 |3 |
|2021-06-14|111|EMEA |France |P1 |2 |4 |
|2021-06-14|111|EMEA |Spain |P2 |1 |1 |
|2021-06-14|111|EMEA |England |P2 |2 |1 |
|2021-06-14|111|EMEA |France |P2 |1 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |2 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
the join of the two dataframes is given as follows:
cols_to_join = ['yyyy_mm_dd', 'id', 'region', 'country', 'product']
df3 = df2.join(df1.withColumnRenamed('count', 'new_count'), cols_to_join, 'left') \
.withColumn('count', f.coalesce('new_count', 'count')).drop('new_count')
df3.show(truncate=False)
+----------+---+------+--------+-------+-----+------------+
|yyyy_mm_dd|id |region|country |product|count|fullfilments|
+----------+---+------+--------+-------+-----+------------+
|2021-06-14|111|EMEA |Spain |P1 |10 |1 |
|2021-06-14|111|EMEA |England |P1 |9 |3 |
|2021-06-14|111|EMEA |France |P1 |10 |4 |
|2021-06-14|111|EMEA |Spain |P2 |299 |1 |
|2021-06-14|111|EMEA |England |P2 |39 |1 |
|2021-06-14|111|EMEA |France |P2 |10 |5 |
|2021-06-14|112|LATAM |Brazil |P1 |64 |2 |
|2021-06-14|112|LATAM |Paraguay|P2 |21 |1 |
|2021-06-13|111|EMEA |Spain |P1 |0 |1 |
|2021-06-13|111|EMEA |England |P2 |0 |2 |
+----------+---+------+--------+-------+-----+------------+
Every time you need to retrieve columns from different dataframes, you must join them:
import pyspark.sql.functions as f
df2 = df1.join(df.withColumnRenamed('count', 'new_count'),
on=['yyyy_mm_dd', 'id', 'region', 'country', 'product'], how='left')
df2 = (df2
.withColumn('count', f.coalesce('new_count', 'count'))
.drop('new_count'))
df2.show(truncate=False)
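One caveat (my note, not part of the answer): if df ever held more than one row per join key, the left join would duplicate rows in df1. A defensive sketch that de-duplicates df on the keys first, assuming the same column names as above:
import pyspark.sql.functions as f
keys = ['yyyy_mm_dd', 'id', 'region', 'country', 'product']
# Keep a single row per key in df so the join preserves df1's row count.
df_dedup = df.dropDuplicates(keys).withColumnRenamed('count', 'new_count')
df2 = (df1.join(df_dedup, on=keys, how='left')
          .withColumn('count', f.coalesce('new_count', 'count'))
          .drop('new_count'))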
I have a spark dataframe that looks like this:
+----+------+-------------+
|user| level|value_pair |
+----+------+-------------+
| A | 25 |(23.52,25.12)|
| A | 6 |(0,0) |
| A | 2 |(11,12.12) |
| A | 32 |(17,16.12) |
| B | 22 |(19,57.12) |
| B | 42 |(10,3.2) |
| B | 43 |(32,21.0) |
| C | 33 |(12,0) |
| D | 32 |(265.21,19.2)|
| D | 62 |(57.12,50.12)|
| D | 32 |(75.12,57.12)|
| E | 63 |(0,0) |
+----+------+-------------+
How do I extract the values in the value_pair column and add them as two new columns called value1 and value2, using the comma as the separator?
+----+------+-------------+-------+
|user| level|value1 |value2 |
+----+------+-------------+-------+
| A | 25 |23.52 |25.12 |
| A | 6 |0 |0 |
| A | 2 |11 |12.12 |
| A | 32 |17 |16.12 |
| B | 22 |19 |57.12 |
| B | 42 |10 |3.2 |
| B | 43 |32 |21.0 |
| C | 33 |12 |0 |
| D | 32 |265.21 |19.2 |
| D | 62 |57.12 |50.12 |
| D | 32 |75.12 |57.12 |
| E | 63 |0 |0 |
+----+------+-------------+-------+
I know I can separate the values like so:
df = df.withColumn('value1', pyspark.sql.functions.split(df['value_pair'], ',')[0])
df = df.withColumn('value2', pyspark.sql.functions.split(df['value_pair'], ',')[1])
But how do I also get rid of the parentheses?
For the parentheses, as shown in the comments, you can use regexp_replace, but you also need to include a backslash \, which is the escape character in regular expressions.
Also, I believe you need to remove the parentheses first and then split the column.
from pyspark.sql.functions import split, regexp_replace
# Strip the parentheses first (raw strings so the backslashes reach the regex engine), then split on the comma.
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\(", ""))
df = df.withColumn('value_pair', regexp_replace(df.value_pair, r"\)", ""))
df = df.withColumn('value1', split(df['value_pair'], ',').getItem(0)) \
       .withColumn('value2', split(df['value_pair'], ',').getItem(1))
>>> df.show(truncate=False)
+----+-----+-----------+------+---------+
|user|level|value_pair |value1|value2 |
+----+-----+-----------+------+---------+
| A |25 |23.52,25.12|23.52 |25.12 |
| A |6 |0,0 |0 |0 |
| A |2 |11,12.12 |11 |12.12 |
| A |32 |17,16.12 |17 |16.12 |
| B |22 |19,57.12 |19 |57.12 |
| B |42 |10,3.2 |10 |3.2 |
| B |43 |32,21.0 |32 |21.0 |
| C |33 |12,0 |12 |0 |
| D |32 |265.21,19.2|265.21|19.2 |
| D |62 |57.12,50.12|57.12 |50.12 |
| D |32 |75.12,57.12|75.12 |57.12 |
| E |63 |0,0 |0 |0 |
+----+-----+-----------+------+---------+
As you may have noticed, I slightly changed how your code grabs the two items.
More information can be found here
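A slightly more compact variant (my own, not from the answer above) removes both parentheses in a single regexp_replace using a character class, so value_pair itself never needs to be overwritten:
from pyspark.sql.functions import col, regexp_replace, split
# "[()]" matches either parenthesis, so one pass strips both.
cleaned = split(regexp_replace(col("value_pair"), r"[()]", ""), ",")
df = df.withColumn("value1", cleaned.getItem(0)) \
       .withColumn("value2", cleaned.getItem(1))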
I'm trying to get concatenated data as a single column using the datasets below.
Sample DS:
val df = sc.parallelize(Seq(
("a", 1,2,3),
("b", 4,6,5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
From the reference dataset:
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref= sc.parallelize(Seq(
(1,"apple","fruit"),
(2,"banana","fruit"),
(3,"cat","animal"),
(4,"dog","animal"),
(5,"elephant","animal"),
(6,"Flight","object"),
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
Also, I need to concatenate only if (id2, id3) are not null; otherwise, use only id1.
I've been breaking my head over this.
Exploding the first dataframe df, joining it to ref, and then doing a groupBy should work as you expect:
import org.apache.spark.sql.functions._
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
.select("id", "value")
ref.join(dfNew, Seq("id"))
.groupBy("value")
.agg(
concat_ws("+", collect_list("descr")) as "desc",
concat_ws("+", collect_list("parent")) as "parent"
)
.drop("value")
.show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
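Regarding the null requirement in the question: ref.join(dfNew, Seq("id")) is an inner join by default, so exploded rows whose id is null (or not present in ref) are dropped automatically, which means only the non-null ids among id1, id2 and id3 contribute to the concatenation.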
I have more than two tables and I wish to join them to create a single table where queries will be faster.
Table-1
---------------
user | activityId
---------------
user1 | 123
user2 | 123
user3 | 123
user4 | 123
user5 | 123
---------------
Table-2
---------------------------------
user | activityId | event-1-time
---------------------------------
user2 | 123 | 1001
user2 | 123 | 1002
user3 | 123 | 1003
user5 | 123 | 1004
---------------------------------
Table-3
---------------------------------
user | activityId | event-2-time
---------------------------------
user2 | 123 | 10001
user5 | 123 | 10002
---------------------------------
A left join of table-1 with table-2 and table-3 on (user, activityId) will produce a result like:
Joined-data
--------------------------------------------------------------------
user | activityId | event-1 | event-1-time | event-2 | event-2-time
--------------------------------------------------------------------
user1 | 123 | 0 | null | 0 | null
user2 | 123 | 1 | 1001 | 1 | 10001
user2 | 123 | 1 | 1002 | 1 | 10001
user3 | 123 | 1 | 1003 | 0 | null
user4 | 123 | 0 | null | 0 | null
user5 | 123 | 1 | 1004 | 1 | 10002
--------------------------------------------------------------------
I wish to remove the redundancy introduced for event-2 with the same time, i.e. event-2 appeared only once but is reported twice because event-1 appeared twice.
In other words, records grouped by user and activityId should be distinct at every table level.
I want the following output. I do not care about the relationship between event-1 and event-2. Is there anything that allows customizing the join to achieve this behavior?
user | activityId | event-1 | event-1-time | event-2 | event-2-time
--------------------------------------------------------------------
user1 | 123 | 0 | null | 0 | null
user2 | 123 | 1 | 1001 | 1 | 10001
user2 | 123 | 1 | 1002 | 0 | null
user3 | 123 | 1 | 1003 | 0 | null
user4 | 123 | 0 | null | 0 | null
user5 | 123 | 1 | 1004 | 1 | 10002
--------------------------------------------------------------------
Edit:
I'm using Scala for joining these tables. Query used:
val joined = table1.join(table2, Seq("user","activityId"), "left").join(table3, Seq("user","activityId"), "left")
joined.select(table1("user"), table1("activityId"),
  when(table2("activityId").isNull, 0).otherwise(1) as "event-1", table2("timestamp") as "event-1-time",
  when(table3("activityId").isNull, 0).otherwise(1) as "event-2", table3("timestamp") as "event-2-time").show
You should create an additional column populated with the row index for each group of user, ordered by activityId, and then use that added column in the outer join:
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("user").orderBy("activityId")
import org.apache.spark.sql.functions._
val tempTable1 = table1.withColumn("rowNumber", row_number().over(windowSpec))
val tempTable2 = table2.withColumn("rowNumber", row_number().over(windowSpec)).withColumn("event-1", lit(1))
val tempTable3 = table3.withColumn("rowNumber", row_number().over(windowSpec)).withColumn("event-2", lit(1))
tempTable1
.join(tempTable2, Seq("user", "activityId", "rowNumber"), "outer")
.join(tempTable3, Seq("user", "activityId", "rowNumber"), "outer")
.drop("rowNumber")
.na.fill(0)
You should get your desired output dataframe as
+-----+----------+------------+-------+------------+-------+
|user |activityId|event-1-time|event-1|event-2-time|event-2|
+-----+----------+------------+-------+------------+-------+
|user1|123 |null |0 |null |0 |
|user2|123 |1002 |1 |null |0 |
|user2|123 |1001 |1 |10001 |1 |
|user3|123 |1003 |1 |null |0 |
|user4|123 |null |0 |null |0 |
|user5|123 |1004 |1 |10002 |1 |
+-----+----------+------------+-------+------------+-------+
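One thing to be aware of: activityId is identical for every row within a user partition, so the row_number ordering, and therefore which event-1 time ends up on the same row as which event-2 time, is arbitrary. That is acceptable here because the question explicitly says the relationship between event-1 and event-2 does not matter.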
Below is a code implementation of the requirement
from pyspark.sql import Row
ll = [('test',123),('test',123),('test',123),('test',123)]
rdd = sc.parallelize(ll)
test1 = rdd.map(lambda x: Row(user=x[0], activityid=int(x[1])))
test1_df = sqlContext.createDataFrame(test1)
mm = [('test',123,1001),('test',123,1002),('test',123,1003),('test',123,1004)]
rdd1 = sc.parallelize(mm)
test2 = rdd1.map(lambda x: Row(user=x[0], activityid=int(x[1]), event_time_1=int(x[2])))
test2_df = sqlContext.createDataFrame(test2)
nn = [('test',123,10001),('test',123,10002)]
rdd2 = sc.parallelize(nn)
test3 = rdd2.map(lambda x: Row(user=x[0], activityid=int(x[1]), event_time_2=int(x[2])))
test3_df = sqlContext.createDataFrame(test3)
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import dense_rank, rank
n = Window.partitionBy(test2_df.user,test2_df.activityid).orderBy(test2_df.event_time_1)
int2_df = test2_df.select("user","activityid","event_time_1",rank().over(n).alias("col_rank")).filter('col_rank = 1')
o = Window.partitionBy(test3_df.user,test3_df.activityid).orderBy(test3_df.event_time_2)
int3_df = test3_df.select("user","activityid","event_time_2",rank().over(o).alias("col_rank")).filter('col_rank = 1')
test1_df.distinct().join(int2_df,["user","activityid"],"leftouter").join(int3_df,["user","activityid"],"leftouter").show(10)
+----+----------+------------+--------+------------+--------+
|user|activityid|event_time_1|col_rank|event_time_2|col_rank|
+----+----------+------------+--------+------------+--------+
|test| 123| 1001| 1| 10001| 1|
+----+----------+------------+--------+------------+--------+
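To bring this closer to the layout asked for in the question, the helper rank columns can be dropped and 0/1 indicator columns derived from the event times. A sketch building on the dataframes above (the event_1 and event_2 flag names are my own):
import pyspark.sql.functions as func
result = (test1_df.distinct()
          .join(int2_df.drop("col_rank"), ["user", "activityid"], "leftouter")
          .join(int3_df.drop("col_rank"), ["user", "activityid"], "leftouter")
          # Derive 0/1 flags from whether an event time was found.
          .withColumn("event_1", func.when(func.col("event_time_1").isNotNull(), 1).otherwise(0))
          .withColumn("event_2", func.when(func.col("event_time_2").isNotNull(), 1).otherwise(0)))
result.show(10)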