I have a dataframe with separate 'send' and 'receive' rows. I need to combine each pair of rows into a single row with send and receive columns, using PySpark. Note that the ID is the same for both lines of a pair and the action identifier is ACTION_CD:
Original dataframe:
+------------------------------------+------------------------+---------+--------------------+
|ID |MSG_DT |ACTION_CD|MESSAGE |
+------------------------------------+------------------------+---------+--------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|receive |Oi |
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.852Z|send |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|receive |4 |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.688Z|send |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|receive |1 |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.479Z|send |⭐️*Antes de você ir |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|receive |788884 |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:57.435Z|send |Agora |
+------------------------------------+------------------------+---------+--------------------+
What I need:
+------------------------------------+------------------------+-------+-------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+------------------------+-------+-------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|Oi |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|1 |⭐️*Antes de você ir|
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|788884 |Agora |
+------------------------------------+------------------------+-------+-------------------+
P.S.: MSG_DT is taken from the earliest record of the pair.
You can construct RECEIVE and SEND by applying the first expression over conditional columns derived from ACTION_CD (a simpler groupBy alternative is sketched after the output below).
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.552Z", "receive", "Oi",),
("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.852Z", "send", "Olá!",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.565Z", "receive", "4",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.688Z", "send", "Certo",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.408Z", "receive", "1",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.479Z", "send", "️*Antes de você ir",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:52.798Z", "receive", "788884",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:57.435Z", "send", "Agora",), ]
df = spark.createDataFrame(data, ("ID", "MSG_DT", "ACTION_CD", "MESSAGE")).withColumn("MSG_DT", F.to_timestamp("MSG_DT"))
# Window ordered by timestamp within each ID; the framed variant spans the whole partition
ws = W.partitionBy("ID").orderBy("MSG_DT")
first_rows = ws.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
# First non-null MESSAGE for the given ACTION_CD within the partition
action_column_selection = lambda action: F.first(F.when(F.col("ACTION_CD") == action, F.col("MESSAGE")), ignorenulls=True).over(first_rows)
(df.select("*",
action_column_selection("receive").alias("RECEIVE"),
action_column_selection("send").alias("SEND"),
F.row_number().over(ws).alias("rn"))
.where("rn = 1")
.drop("ACTION_CD", "MESSAGE", "rn")).show(truncate=False)
"""
+------------------------------------+-----------------------+-------+------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+-----------------------+-------+------------------+
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07 23:25:52.798|788884 |Agora |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07 23:25:06.565|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07 23:25:30.408|1 |️*Antes de você ir|
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07 23:24:54.552|Oi |Olá! |
+------------------------------------+-----------------------+-------+------------------+
"""
How do I implement this group-by aggregate in Spark SQL? I want to group by the name field and get the latest salary based on the latest date. How should the SQL be written?
The data is:
+----+------+-------+
|name|salary|date   |
+----+------+-------+
|AA  |3000  |2022-01|
|AA  |4500  |2022-02|
|BB  |3500  |2022-01|
|BB  |4000  |2022-02|
+----+------+-------+
The expected result is:
+----+------+
|name|salary|
+----+------+
|AA  |4500  |
|BB  |4000  |
+----+------+
Assuming that the dataframe is registered as a temporary view named tmp, first use the row_number window function to assign a row number (rn) within each group (name), ordered by date in descending order, and then take the rows with rn = 1.
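If the view does not exist yet, it can be registered first (assuming the dataframe from the question is named df):
df.createOrReplaceTempView("tmp")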
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
First, convert your string to a date.
Then convert the date to a Unix timestamp (a numeric representation of the date, so you can use max).
Use first as an aggregate function to retrieve a value from your aggregated results. (It takes the first result, so if there is a date tie, it could pull either one.) A max_by variant for newer Spark versions is sketched after the output below:
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
from pyspark.sql.functions import col, first, max, to_date, unix_timestamp

(df
    .withColumn(
        "dateUpdated",
        unix_timestamp(
            to_date(
                col("updated"),
                "yyyy-MM-dd"
            )
        )
    )
    .groupBy("name")
    .agg(
        max("dateUpdated"),
        first("salary").alias("Salary")
    )
).show()
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
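As a side note, newer Spark versions can express "salary at the latest date" directly: the SQL function max_by exists since Spark 3.0, and pyspark.sql.functions.max_by since 3.3. A sketch assuming one of those versions (ties are still resolved arbitrarily):
from pyspark.sql import functions as F
(df.groupBy("name")
   .agg(F.max_by("salary", F.to_date("updated", "yyyy-MM-dd")).alias("salary"))
   .show())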
My usual trick is to "zip" date and salary together into an array (the element you put first is what gets compared first, since max compares arrays element by element); a struct variant is sketched after the output below.
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+
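A close variant of the same trick, assuming Spark 2.4+ where structs are orderable, uses a struct instead of an array, so salary keeps its numeric type instead of being compared as a string:
from pyspark.sql import functions as F
(df
    .groupBy('name')
    .agg(F.max(F.struct('date', 'salary')).alias('latest'))
    .select('name', F.col('latest.salary').alias('salary'))
    .show()
)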
I have a text column. From each line I want to remove everything starting from the first stop word. For example:
stop_words=['with','is', '/']
One of the rows is:
senior manager with experience
I want to remove everything after with (including with) so the output is:
senior manager
I have big data and am working with Spark in Python.
You can find the location of the first stop word using instr, and take the substring up to that location (a regex-based variant is sketched after the output below).
import pyspark.sql.functions as F
stop_words = ['with', 'is', '/']
df = spark.createDataFrame([
['senior manager with experience'],
['is good'],
['xxx//'],
['other text']
]).toDF('col')
df.show(truncate=False)
+------------------------------+
|col |
+------------------------------+
|senior manager with experience|
|is good |
|xxx // |
|other text |
+------------------------------+
df2 = df.withColumn('idx',
F.coalesce(
# Get the smallest index of a stop word in the string
F.least(*[F.when(F.instr('col', s) != 0, F.instr('col', s)) for s in stop_words]),
# If no stop words found, get the whole string
F.length('col') + 1)
).selectExpr('trim(substring(col, 1, idx-1)) col')
df2.show()
+--------------+
| col|
+--------------+
|senior manager|
| |
| xxx|
| other text|
+--------------+
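Another way to do the same filtering, assuming Spark 3.0+ (for the limit argument of split) and stop words without regex metacharacters, is to split on the first stop word and keep the left part:
import pyspark.sql.functions as F
# Build a regex alternation from the stop words, e.g. 'with|is|/'
pattern = '|'.join(stop_words)
df.withColumn('col', F.trim(F.split('col', pattern, 2).getItem(0))).show()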
You can use a UDF to get the index of the first occurrence of a stop word in col, and then use another UDF to take the substring of col up to that index. A PySpark version of the same idea is sketched after the output below.
import org.apache.spark.sql.functions.udf

val stop_words = List("with", "is", "/")
val df = List("senior manager with experience", "is good", "xxx//", "other text").toDF("col")
val index_udf = udf((col_value: String) => {
  val result = for (elem <- stop_words; if col_value.contains(elem)) yield col_value.indexOf(elem)
  if (result.isEmpty) col_value.length else result.min
})
val substr_udf = udf((elem: String, index: Int) => elem.substring(0, index))
val df3 = df.withColumn("index", index_udf($"col")).withColumn("substr_message", substr_udf($"col", $"index")).select($"substr_message").withColumnRenamed("substr_message", "col")
df3.show()
+---------------+
| col|
+---------------+
|senior manager |
| |
| xxx|
| other text|
+---------------+
I have a dataframe named dataDF whose columns I want to rename. Another dataframe, mapDF, holds an "original_name" -> "code_name" mapping. I want to change dataDF's column names from "original_name" to "code_name" wherever mapDF has a mapping. I am re-assigning dataDF in a loop, but this yields poor performance when the data size is huge and also loses parallelism. Can this be done in a better way to achieve parallelism and good performance on a huge dataDF dataset?
import sparkSession.sqlContext.implicits._
var dataDF = Seq((10, 20, 30, 40, 50),(100, 200, 300, 400, 500),(10, 222, 333, 444, 555),(1123, 2123, 3123, 4123, 5123),(1321, 2321, 3321, 4321, 5321))
.toDF("col_1", "col_2", "col_3", "col_4", "col_5")
dataDF.show(false)
val mapDF = Seq(("col_1", "code_1", "true"),("col_3", "code_3", "true"),("col_4", "code_4", "true"),("col_5", "code_5", "true"))
.toDF("original_name", "code_name", "important")
mapDF.show(false)
val map_of_codename = mapDF.rdd.map(x => (x.getString(0), x.getString(1))).collectAsMap()
dataDF.columns.foreach(x => {
if (map_of_codename.contains(x))
dataDF = dataDF.withColumnRenamed(x, map_of_codename.get(x).get)
else
dataDF = dataDF.withColumnRenamed(x, "None")
}
)
dataDF.show(false)
========================
dataDF
+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|
+-----+-----+-----+-----+-----+
|10 |20 |30 |40 |50 |
|100 |200 |300 |400 |500 |
|10 |222 |333 |444 |555 |
|1123 |2123 |3123 |4123 |5123 |
|1321 |2321 |3321 |4321 |5321 |
+-----+-----+-----+-----+-----+
mapDF
+-------------+---------+---------+
|original_name|code_name|important|
+-------------+---------+---------+
|col_1 |code_1 |true |
|col_3 |code_3 |true |
|col_4 |code_4 |true |
|col_5 |code_5 |true |
+-------------+---------+---------+
expected DF:
+------+----+------+------+------+
|code_1|None|code_3|code_4|code_5|
+------+----+------+------+------+
|10 |20 |30 |40 |50 |
|100 |200 |300 |400 |500 |
|10 |222 |333 |444 |555 |
|1123 |2123|3123 |4123 |5123 |
|1321 |2321|3321 |4321 |5321 |
+------+----+------+------+------+
As an alternative, you can try to use aliases, like this:
val aliases = dataDF.columns.map(columnName => $"${columnName}".as(map_of_codename.getOrElse(columnName, "None")))
dataDF.select(aliases: _*).show()
dataDF.select(aliases: _*).explain(true)
The execution plan will then be composed of a single projection node like the one below, which may help reduce the optimization phase:
== Analyzed Logical Plan ==
code_1: int, None: int, code_3: int, code_4: int, code_5: int
Project [col_1#16 AS code_1#77, col_2#17 AS None#78, col_3#18 AS code_3#79, col_4#19 AS code_4#80, col_5#20 AS code_5#81]
+- Project [_1#5 AS col_1#16, _2#6 AS col_2#17, _3#7 AS col_3#18, _4#8 AS col_4#19, _5#9 AS col_5#20]
+- LocalRelation [_1#5, _2#6, _3#7, _4#8, _5#9]
That being said, I'm not sure it will solve the performance issue because in both cases, your foreach version and the proposal above, the physical plan can be optimized to a single node thanks to the CollapseProject rule.
FYI, withColumnRenamed uses a similar approach under the hood, except that it does it for every column separately:
def withColumnRenamed(existingName: String, newName: String): DataFrame = {
  val resolver = sparkSession.sessionState.analyzer.resolver
  val output = queryExecution.analyzed.output
  val shouldRename = output.exists(f => resolver(f.name, existingName))
  if (shouldRename) {
    val columns = output.map { col =>
      if (resolver(col.name, existingName)) {
        Column(col).as(newName)
      } else {
        Column(col)
      }
    }
    select(columns : _*)
  } else {
    toDF()
  }
}
Do you have any input related to the observed performance issues? Any measurements that could help identify the operation taking time? Maybe it's not necessarily related to the column renaming? What are you doing later with these renamed columns?
One approach is to obtain the full column mapping list outside Spark first, then use a for loop to rename all columns instead of calling columns.foreach.
Here is an example of my solution (sorry, I'm not an expert in Scala, so some of the data handling might be ugly):
var dataDF = Seq((10, 20, 30, 40, 50),(100, 200, 300, 400, 500),(10, 222, 333, 444, 555),(1123, 2123, 3123, 4123, 5123),(1321, 2321, 3321, 4321, 5321))
.toDF("col_1", "col_2", "col_3", "col_4", "col_5")
dataDF.show(false)
val mapDF = Seq(("col_1", "code_1", "true"),("col_3", "code_3", "true"),("col_4", "code_4", "true"),("col_5", "code_5", "true"))
.toDF("original_name", "code_name", "important")
val schema_mapping = mapDF.select("original_name", "code_name").collect()
//For mapping of None column (col 2)
val old_schema = dataDF.columns
val none_mapping = old_schema.map(x => if (!schema_mapping.map(x => x(0)).contains(x)) Array[String](x, "None")).filter(_ != ())
for(i <- 0 until schema_mapping.length){
try {
dataDF = dataDF.withColumnRenamed(schema_mapping(i)(0).toString, schema_mapping(i)(1).toString)
}
catch{
case e : Throwable => println("cannot rename" + schema_mapping(i)(0).toString + " to " + schema_mapping(i)(1).toString)
}
}
for(i <- 0 until none_mapping.length){
try {
dataDF = dataDF.withColumnRenamed(none_mapping(i).asInstanceOf[Array[String]](0), none_mapping(i).asInstanceOf[Array[String]](1))
}
catch{
case e : Throwable => println("cannot rename")
}
}
dataDF.show(false)
In the Spark UI each column rename becomes a stage, but those stages should be executed in parallel, as the DAG visualization shows.
I am using spark-sql 2.4.1 with Java 8. I have the following scenario:
import org.apache.spark.sql.types.DoubleType

val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).toDF("item1_value","item2_value","item3_value")
.withColumn("item1_value", $"item1_value".cast(DoubleType))
.withColumn("item2_value", $"item2_value".cast(DoubleType))
.withColumn("item3_value", $"item3_value".cast(DoubleType))
df.show(20)
I need output something like this:
-----------------------------------------------------------------------------------
col_name | sum_of_column | avg_of_column | vari_of_column
-----------------------------------------------------------------------------------
"item1_value" | sum("item1_value") | avg("item1_value") | variance("item1_value")
"item2_value" | sum("item2_value") | avg("item2_value") | variance("item2_value")
"item3_value" | sum("item3_value") | avg("item3_value") | variance("item3_value")
----------------------------------------------------------------------------------
How can I achieve this dynamically? Tomorrow I may have more columns.
This is sample code that can achieve this. You can make the column list dynamic (for example, derive it from df.columns) and add more aggregate functions if needed.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).
toDF("item1_value","item2_value","item3_value").
withColumn("item1_value", $"item1_value".cast(DoubleType)).
withColumn("item2_value", $"item2_value".cast(DoubleType)).
withColumn("item3_value", $"item3_value".cast(DoubleType))
val aggregateColumns = Seq("item1_value","item2_value","item3_value")
var aggDFs = aggregateColumns.map( c => {
df.groupBy().agg(lit(c).as("col_name"),sum(c).as("sum_of_column"), avg(c).as("avg_of_column"), variance(c).as("var_of_column"))
})
var combinedDF = aggDFs.reduce(_ union _)
This returns the following output:
scala> df.show(10,false)
+-----------+-----------+-----------+
|item1_value|item2_value|item3_value|
+-----------+-----------+-----------+
|0.9192019 |0.1992019 |0.9955999 |
|0.9292018 |0.2992019 |0.99662018 |
|0.9392017 |0.3992019 |0.99772 |
+-----------+-----------+-----------+
scala> combinedDF.show(10,false)
+-----------+------------------+------------------+---------------------+
|col_name |sum_of_column |avg_of_column |var_of_column |
+-----------+------------------+------------------+---------------------+
|item1_value|2.7876054 |0.9292018 |9.999800000999957E-5 |
|item2_value|0.8976057000000001|0.2992019 |0.010000000000000002 |
|item3_value|2.9899400800000002|0.9966466933333334|1.1242332201333484E-6|
+-----------+------------------+------------------+---------------------+