SparkSQL - Extract multiple regex matches (using SQL only) - apache-spark

I have a dataset of SQL queries in raw text and another with a regular expression of all the possible table names:
# queries
+-----+----------------------------------------------+
| id | query |
+-----+----------------------------------------------+
| 1 | select * from table_a, table_b |
| 2 | select * from table_c join table_d... |
+-----+----------------------------------------------+
# regexp
'table_a|table_b|table_c|table_d'
And I wanted the following result:
# expected result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | [table_a, table_b] |
| 2 | [table_c, table_d] |
+-----+----------------------------------------------+
But using the following SQL in Spark, all I get is the first match...
select
  id,
  regexp_extract(query, 'table_a|table_b|table_c|table_d') as tables
from queries
# actual result
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 2 | table_c |
+-----+----------------------------------------------+
Is there any way to do this using only Spark SQL? This is the function I am using https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#regexp_extract
EDIT
I would also accept a solution that returned the following:
# alternative solution
+-----+----------------------------------------------+
| id | tables |
+-----+----------------------------------------------+
| 1 | table_a |
| 1 | table_b |
| 2 | table_c |
| 2 | table_d |
+-----+----------------------------------------------+
SOLUTION
#chlebek solved this below. I reformatted his SQL using CTEs for better readability:
with
split_queries as (
  select
    id,
    explode(split(query, ' ')) as col
  from queries
),
extracted_tables as (
  select
    id,
    regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
  from split_queries
)
select
  id,
  collect_set(rx) as tables
from extracted_tables
where rx != ''
group by id
Bear in mind that the split(query, ' ') part of the query will split your SQL only on spaces. If you have other separators such as tabs, line breaks, comments, etc., you should deal with those before or while splitting; see the sketch below.
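For example, a minimal tweak (my own suggestion, not part of the accepted answer) is to change the first CTE so it splits on any whitespace run instead of a single space; comments and punctuation would still need separate handling:
split_queries as (
  select
    id,
    explode(split(query, '\\s+')) as col
  from queries
)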

If you have only a few values to check, you can achieve this using the contains function instead of a regexp:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, array_remove, when}
import spark.implicits._
val names = Seq("table_a", "table_b", "table_c", "table_d")
def c(col: Column) = names.map(n => when(col.contains(n), n).otherwise(""))
df.select('id, array_remove(array(c('query): _*), "").as("result")).show(false)
but using a regexp it will look like the below (DataFrame API):
df.select('id,explode(split('query," ")))
.select('id,regexp_extract('col,"table_a|table_b|table_c|table_d",0).as("rx"))
.filter('rx=!="")
.groupBy('id)
.agg(collect_list('rx))
and it can be translated to the SQL query below:
select id, collect_list(rx)
from (
  select id, regexp_extract(col, 'table_a|table_b|table_c|table_d', 0) as rx
  from (
    select id, explode(split(query, ' ')) as col
    from df
  ) q1
) q2
where rx != ''
group by id
so the output will be:
+---+------------------+
| id| collect_list(rx)|
+---+------------------+
| 1|[table_a, table_b]|
| 2|[table_c, table_d]|
+---+------------------+

As you are using Spark SQL, you can use the SQL parser and it will do the job for you.
def getTables(query: String): Seq[String] = {
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}
val query = "select * from table_1 as a left join table_2 as b on a.id=b.id"
scala> getTables(query).foreach(println)
table_1
table_2
You can register getTables as a UDF and use it in a query; a sketch follows.
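A minimal sketch of that registration (my addition, not from the original answer; it uses the standalone Catalyst parser so the UDF does not need a SparkSession on the executors, and note that UnresolvedRelation is internal API whose shape varies between Spark versions):
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Parse each query string on the fly and collect the referenced table names.
spark.udf.register("get_tables", (q: String) =>
  CatalystSqlParser.parsePlan(q).collect { case r: UnresolvedRelation => r.tableName })

spark.sql("select id, get_tables(query) as tables from queries").show(false)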

You can use another SQL function available in Spark called collect_list: https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#collect_list. Another sample can be found at https://mungingdata.com/apache-spark/arraytype-columns/
Basically, applied to your code it should be:
val df = spark.sql("select 1 id, 'select * from table_a, table_b' query" )
val df1 = spark.sql("select 2 id, 'select * from table_c join table_d' query" )
val df3 = df.union(df1)
df3.createOrReplaceTempView("tabla")
spark.sql("""
select id, collect_list(tables) from (
select id, explode(split(query, ' ')) as tables
from tabla)
where tables like 'table%' group by id""").show
The output will be
+---+--------------------+
| id|collect_list(tables)|
+---+--------------------+
| 1| [table_a,, table_b]|
| 2| [table_c, table_d]|
+---+--------------------+
Hope this helps
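Note the stray comma in [table_a,, table_b]: splitting only on spaces leaves the trailing comma attached to table_a. A possible refinement (my own tweak, not part of this answer) is to split on spaces and commas together:
spark.sql("""
  select id, collect_list(tables) from (
    select id, explode(split(query, '[ ,]+')) as tables
    from tabla)
  where tables like 'table%' group by id""").show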

If you are on Spark >= 2.4, you can avoid the explode-and-collect round trip by using higher-order functions on arrays, without any subqueries:
Load the test data
val data =
"""
|id | query
|1 | select * from table_a, table_b
|2 | select * from table_c join table_d on table_c.id=table_d.id
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(";"))
.toSeq.toDS()
val df = spark.read
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- id: integer (nullable = true)
* |-- query: string (nullable = true)
*
* +---+-----------------------------------------------------------+
* |id |query |
* +---+-----------------------------------------------------------+
* |1 |select * from table_a, table_b |
* |2 |select * from table_c join table_d on table_c.id=table_d.id|
* +---+-----------------------------------------------------------+
*/
Extract the tables from the query
// spark >= 2.4.0
df.createOrReplaceTempView("queries")
spark.sql(
"""
|select id,
| array_distinct(
| FILTER(
| split(query, '\\.|=|\\s+|,'), x -> x rlike 'table_a|table_b|table_c|table_d'
| )
| )as tables
|FROM
| queries
""".stripMargin)
.show(false)
/**
* +---+------------------+
* |id |tables |
* +---+------------------+
* |1 |[table_a, table_b]|
* |2 |[table_c, table_d]|
* +---+------------------+
*/
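If you are on Spark 3.1 or later (newer than this thread), regexp_extract_all covers the original question directly, with no splitting at all; a minimal sketch against the same queries view:
spark.sql(
  """
    |select id,
    |       array_distinct(regexp_extract_all(query, 'table_a|table_b|table_c|table_d', 0)) as tables
    |from queries
  """.stripMargin)
  .show(false)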

Related

Databricks Spark SQL subquery based query throws TreeNodeException

I am running a pretty simple query in databricks notebook which involves a subquery.
select recorddate, count(*)
from( select record_date as recorddate, column1
from table1
where record_date >= date_sub(current_date(), 1)
)t
group by recorddate
order by recorddate
I get the following exception:
Error in SQL statement: package.TreeNodeException: Binding attribute, tree: recorddate
And when I remove the order by clause, the query runs fine. I see some posts talking about similar issues, but not exactly the same. Is this a known behavior? Any workaround/fix for this?
This works well for me (Spark 2.4.5); I think the problem is something different -
val df = spark.sql("select current_date() as record_date, '1' column1")
df.show(false)
/**
* +-----------+-------+
* |record_date|column1|
* +-----------+-------+
* |2020-07-29 |1 |
* +-----------+-------+
*/
df.createOrReplaceTempView("table1")
spark.sql(
"""
|select recorddate, count(*)
|from( select record_date as recorddate, column1
| from table1
| where record_date >= date_sub(current_date(), 1)
| )t
|group by recorddate
|order by recorddate
|
""".stripMargin)
.show(false)
/**
* +----------+--------+
* |recorddate|count(1)|
* +----------+--------+
* |2020-07-29|1 |
* +----------+--------+
*/
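If the error still appears on your Databricks runtime, one workaround worth trying (purely a guess, since the failure does not reproduce here) is to alias the aggregate and sort by ordinal position, which keeps the ORDER BY from re-binding the inner recorddate attribute:
spark.sql(
  """
    |select recorddate, count(*) as cnt
    |from (
    |  select record_date as recorddate, column1
    |  from table1
    |  where record_date >= date_sub(current_date(), 1)
    |) t
    |group by recorddate
    |order by 1
  """.stripMargin)
  .show(false)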

Median calculation in spark 1.6 Error: expected but identifier DIV found

I am trying to calculate the median of the LATITUDE column, grouped by the DESTINATION_ID and LOCATION_ID columns.
Scala, Spark 1.6
The data (from JSON) looks like:
DESTINATION_ID,LOCATION_ID,LATITUDE
[ENSG00000257017,EAST_0000182,0.07092000000000001]
[ENSG00000257017,WEST_0001397,0.07092000000000001]
[ENSG00000181965,EAST_1001951,0.07056000000000001]
[ENSG00000146648,EAST_0000616,0.07092000000000001]
[ENSG00000111537,WEST_0001845,0.07092000000000001]
[ENSG00000103222,EAST_0000565,0.07056000000000001]
[ENSG00000118137,EAST_0000508,0.07092000000000001]
[ENSG00000112715,EAST_0000616,0.07092000000000001]
[ENSG00000108984,EAST_0000574,0.07056000000000001]
[ENSG00000159640,NORTH_797,0.07092000000000001]
[ENSG00000113522,NORTH_790,0.07056000000000001]
[ENSG00000133895,NORTH_562,0.07056000000000001]
Code
var ds = sqlContext.sql("""
SELECT DESTINATION_ID,LOCATION_ID, avg(LATITUDE) as median
FROM ( SELECT DESTINATION_ID,LOCATION_ID, LATITUDE, rN, (CASE WHEN cN % 2 = 0 then (cN DIV 2) ELSE (cN DIV 2) + 1 end) as m1, (cN DIV 2) + 1 as m2
FROM (
SELECT DESTINATION_ID,LOCATION_ID, LATITUDE, row_number() OVER (PARTITION BY DESTINATION_ID,LOCATION_ID ORDER BY LATITUDE ) as rN,
count(LATITUDE) OVER (PARTITION BY DESTINATION_ID,LOCATION_ID ) as cN
FROM people
) s
) r
WHERE rN BETWEEN m1 and m2
GROUP BY DESTINATION_ID,LOCATION_ID
""")
Error:
Exception in thread "main" java.lang.RuntimeException: [3.98] failure: ``)'' expected but identifier DIV found
Please help me if I am missing something.
Or, please guide me: is there any better way to calculate the median in Spark?
Thanks
I tried to execute the above query with the test input you provided, as below-
val data =
"""
|DESTINATION_ID,LOCATION_ID,LATITUDE
|ENSG00000257017,EAST_0000182,0.07092000000000001
|ENSG00000257017,WEST_0001397,0.07092000000000001
|ENSG00000181965,EAST_1001951,0.07056000000000001
|ENSG00000146648,EAST_0000616,0.07092000000000001
|ENSG00000111537,WEST_0001845,0.07092000000000001
|ENSG00000103222,EAST_0000565,0.07056000000000001
|ENSG00000118137,EAST_0000508,0.07092000000000001
|ENSG00000112715,EAST_0000616,0.07092000000000001
|ENSG00000108984,EAST_0000574,0.07056000000000001
|ENSG00000159640,NORTH_797,0.07092000000000001
|ENSG00000113522,NORTH_790,0.07056000000000001
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\,").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +---------------+------------+-------------------+
* |DESTINATION_ID |LOCATION_ID |LATITUDE |
* +---------------+------------+-------------------+
* |ENSG00000257017|EAST_0000182|0.07092000000000001|
* |ENSG00000257017|WEST_0001397|0.07092000000000001|
* |ENSG00000181965|EAST_1001951|0.07056000000000001|
* |ENSG00000146648|EAST_0000616|0.07092000000000001|
* |ENSG00000111537|WEST_0001845|0.07092000000000001|
* |ENSG00000103222|EAST_0000565|0.07056000000000001|
* |ENSG00000118137|EAST_0000508|0.07092000000000001|
* |ENSG00000112715|EAST_0000616|0.07092000000000001|
* |ENSG00000108984|EAST_0000574|0.07056000000000001|
* |ENSG00000159640|NORTH_797 |0.07092000000000001|
* |ENSG00000113522|NORTH_790 |0.07056000000000001|
* +---------------+------------+-------------------+
*
* root
* |-- DESTINATION_ID: string (nullable = true)
* |-- LOCATION_ID: string (nullable = true)
* |-- LATITUDE: double (nullable = true)
*/
df.createOrReplaceTempView("people")
spark.sql(
"""
|SELECT
| DESTINATION_ID,
| LOCATION_ID,
| avg(LATITUDE) as median
|FROM
| (
| SELECT
| DESTINATION_ID,
| LOCATION_ID,
| LATITUDE,
| rN,
| (
| CASE WHEN cN % 2 = 0 then (cN / 2) ELSE (cN / 2) + 1 end
| ) as m1,
| (cN / 2) + 1 as m2
| FROM
| (
| SELECT
| DESTINATION_ID,
| LOCATION_ID,
| LATITUDE,
| row_number() OVER (
| PARTITION BY DESTINATION_ID,
| LOCATION_ID
| ORDER BY
| LATITUDE
| ) as rN,
| count(LATITUDE) OVER (PARTITION BY DESTINATION_ID, LOCATION_ID) as cN
| FROM
| people
| ) s
| ) r
|WHERE
| rN BETWEEN m1
| and m2
|GROUP BY
| DESTINATION_ID,
| LOCATION_ID
""".stripMargin)
.show(false)
/**
* +--------------+-----------+------+
* |DESTINATION_ID|LOCATION_ID|median|
* +--------------+-----------+------+
* +--------------+-----------+------+
*/
You need to check your query or input; it's not producing any output.
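One likely reason for the empty result (my own diagnosis, not part of the original exchange): the / operator is floating-point division in Spark SQL, whereas the original DIV is integer division, so for groups with an odd row count m1 and m2 end in .5 and row_number() never lands in the BETWEEN range. Emulating integer division with floor() restores the intended behaviour:
spark.sql(
  """
    |SELECT DESTINATION_ID, LOCATION_ID, avg(LATITUDE) as median
    |FROM (
    |  SELECT DESTINATION_ID, LOCATION_ID, LATITUDE, rN,
    |    (CASE WHEN cN % 2 = 0 THEN floor(cN / 2) ELSE floor(cN / 2) + 1 END) as m1,
    |    floor(cN / 2) + 1 as m2
    |  FROM (
    |    SELECT DESTINATION_ID, LOCATION_ID, LATITUDE,
    |      row_number() OVER (PARTITION BY DESTINATION_ID, LOCATION_ID ORDER BY LATITUDE) as rN,
    |      count(LATITUDE) OVER (PARTITION BY DESTINATION_ID, LOCATION_ID) as cN
    |    FROM people
    |  ) s
    |) r
    |WHERE rN BETWEEN m1 AND m2
    |GROUP BY DESTINATION_ID, LOCATION_ID
  """.stripMargin)
  .show(false)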
Check if the below query helps -
spark.sql(
"""
|SELECT *
|FROM people k NATURAL JOIN
|(SELECT
| DESTINATION_ID,
| LOCATION_ID,
| avg(LATITUDE) as median
|FROM
| (
| SELECT
| DESTINATION_ID,
| LOCATION_ID,
| LATITUDE,
| rN,
| (
| CASE WHEN cN % 2 = 0 then (cN / 2) ELSE (cN / 2) - 1 end
| ) as m1,
| (cN / 2) + 1 as m2
| FROM
| (
| SELECT
| DESTINATION_ID,
| LOCATION_ID,
| LATITUDE,
| row_number() OVER (
| PARTITION BY DESTINATION_ID,
| LOCATION_ID
| ORDER BY
| LATITUDE
| ) as rN,
| count(LATITUDE) OVER (PARTITION BY DESTINATION_ID, LOCATION_ID) as cN
| FROM
| people
| ) s
| ) r
|WHERE
| rN BETWEEN m1
| and m2
|GROUP BY
| DESTINATION_ID,
| LOCATION_ID
| ) t
""".stripMargin)
.show(false)
/**
* +---------------+------------+-------------------+-------------------+
* |DESTINATION_ID |LOCATION_ID |LATITUDE |median |
* +---------------+------------+-------------------+-------------------+
* |ENSG00000111537|WEST_0001845|0.07092000000000001|0.07092000000000001|
* |ENSG00000257017|WEST_0001397|0.07092000000000001|0.07092000000000001|
* |ENSG00000103222|EAST_0000565|0.07056000000000001|0.07056000000000001|
* |ENSG00000108984|EAST_0000574|0.07056000000000001|0.07056000000000001|
* |ENSG00000112715|EAST_0000616|0.07092000000000001|0.07092000000000001|
* |ENSG00000113522|NORTH_790 |0.07056000000000001|0.07056000000000001|
* |ENSG00000118137|EAST_0000508|0.07092000000000001|0.07092000000000001|
* |ENSG00000146648|EAST_0000616|0.07092000000000001|0.07092000000000001|
* |ENSG00000159640|NORTH_797 |0.07092000000000001|0.07092000000000001|
* |ENSG00000181965|EAST_1001951|0.07056000000000001|0.07056000000000001|
* |ENSG00000257017|EAST_0000182|0.07092000000000001|0.07092000000000001|
* +---------------+------------+-------------------+-------------------+
*/
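As for the second part of the question (a better way to calculate a median): Spark ships the percentile and percentile_approx aggregates, which avoid the window-function dance entirely; on 1.6 the equivalent Hive UDAF should be reachable through a HiveContext. A minimal sketch against the same people view (my addition, not part of the original answer):
spark.sql(
  """
    |SELECT DESTINATION_ID, LOCATION_ID, percentile_approx(LATITUDE, 0.5) as median
    |FROM people
    |GROUP BY DESTINATION_ID, LOCATION_ID
  """.stripMargin)
  .show(false)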

Get value for latest record incase of multiple records for same group

I have a dataset which will have multiple records for an id column, grouped on other columns. For this dataset, I want to derive a new column only for the latest record of each group. I was using a case statement to derive the new column and a UNION to get the value for the latest record, but I would like to avoid UNION as it is an expensive operation in spark-sql.
Input:
person_id order_id order_ts order_amt
1 1 2020-01-01 10:10:10 10
1 2 2020-01-01 10:15:15 15
2 3 2020-01-01 10:10:10 0
2 4 2020-01-01 10:15:15 15
From the above input, person_id 1 has two orders (1, 2) and person_id 2 has two orders (3, 4). I want to derive a column that flags only the latest order for a given person.
Expected Output:
person_id order_id order_ts order_amt valid_order
1 1 2020-01-01 10:10:10 10 N
1 2 2020-01-01 10:15:15 15 Y
2 3 2020-01-01 10:10:10 0 N
2 4 2020-01-01 10:15:15 15 Y
I tried the below query to get the output, using UNION:
select person_id, order_id, order_ts, order_amt, valid_order
from
(
select *, row_number() over(partition by order_id order by derive_order) as rnk
from
(
select person_id, order_id, order_ts, order_amt, 'N' as valid_order, 'before' as derive_order
from test_table
UNION
select person_id, order_id, order_ts, order_amt,
case when order_amt is not null and order_amt >0 then 'Y' else 'N' end as valid_order,
'after' as derive_order
from
(
select *, row_number() over(partition by person_id order by order_ts desc) as rnk
from test_table
) where rnk = 1
) final
) where rnk = 1 order by person_id, order_id;
I also got the same output using a combination of left outer join and inner join.
Join Query:
select final.person_id, final.order_id, final.order_ts, final.order_amt,
case when final.valid_order is null then 'N' else final.valid_order end as valid_order
from
(
select c.person_id, c.order_id, c.order_ts, c.order_amt, d.valid_order from test_table c
left outer join
(
select a.*, case when a.order_amt is not null and a.order_amt >0 then 'Y' else 'N' end as valid_order
from test_table a
inner join
(
select person_id, max(order_id) as order_id from test_table group by 1
) b on a.person_id = b.person_id and a.order_id = b.order_id
) d on c.order_id = d.order_id
) final order by person_id, order_id;
Our input dataset will have around 20 million records. Is there a better-optimized way to get the same output, apart from the above queries?
Any help would be appreciated.
check if it helps-
val data =
"""
|person_id | order_id | order_ts |order_amt
| 1 | 1 | 2020-01-01 10:10:10 | 10
| 1 | 2 | 2020-01-01 10:15:15 | 15
| 2 | 3 | 2020-01-01 10:10:10 | 0
| 2 | 4 | 2020-01-01 10:15:15 | 15
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.printSchema()
df.show(false)
/**
* root
* |-- person_id: integer (nullable = true)
* |-- order_id: integer (nullable = true)
* |-- order_ts: timestamp (nullable = true)
* |-- order_amt: integer (nullable = true)
*
* +---------+--------+-------------------+---------+
* |person_id|order_id|order_ts |order_amt|
* +---------+--------+-------------------+---------+
* |1 |1 |2020-01-01 10:10:10|10 |
* |1 |2 |2020-01-01 10:15:15|15 |
* |2 |3 |2020-01-01 10:10:10|0 |
* |2 |4 |2020-01-01 10:15:15|15 |
* +---------+--------+-------------------+---------+
*/
Using spark DSL
df.withColumn("latest", max($"order_ts").over(Window.partitionBy("person_id")))
.withColumn("valid_order", when(unix_timestamp($"latest") - unix_timestamp($"order_ts") =!= 0, lit("N"))
.otherwise(lit("Y"))
)
.show(false)
/**
* +---------+--------+-------------------+---------+-------------------+-----------+
* |person_id|order_id|order_ts |order_amt|latest |valid_order|
* +---------+--------+-------------------+---------+-------------------+-----------+
* |2 |3 |2020-01-01 10:10:10|0 |2020-01-01 10:15:15|N |
* |2 |4 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* |1 |1 |2020-01-01 10:10:10|10 |2020-01-01 10:15:15|N |
* |1 |2 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* +---------+--------+-------------------+---------+-------------------+-----------+
*/
Using SPARK SQL
// Spark SQL
df.createOrReplaceTempView("order_table")
spark.sql(
"""
|select person_id, order_id, order_ts, order_amt, latest,
| case when (unix_timestamp(latest) - unix_timestamp(order_ts) != 0) then 'N' else 'Y' end as valid_order
| from
| (select person_id, order_id, order_ts, order_amt, max(order_ts) over (partition by person_id) as latest FROM order_table) a
""".stripMargin)
.show(false)
/**
* +---------+--------+-------------------+---------+-------------------+-----------+
* |person_id|order_id|order_ts |order_amt|latest |valid_order|
* +---------+--------+-------------------+---------+-------------------+-----------+
* |2 |3 |2020-01-01 10:10:10|0 |2020-01-01 10:15:15|N |
* |2 |4 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* |1 |1 |2020-01-01 10:10:10|10 |2020-01-01 10:15:15|N |
* |1 |2 |2020-01-01 10:15:15|15 |2020-01-01 10:15:15|Y |
* +---------+--------+-------------------+---------+-------------------+-----------+
*/
It can be done without joins or unions. Also, the condition a.order_amt is not null and a.order_amt > 0 is redundant, because if the amount is > 0 it is already NOT NULL.
select person_id, order_id, order_ts, order_amt,
case when rn=1 and order_amt>0 then 'Y' else 'N' end as valid_order
from
(
select person_id, order_id, order_ts, order_amt,
row_number() over(partition by person_id order by order_ts desc) as rn
from test_table a
) s
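One caveat (my own observation, not part of the answers above): if two orders for the same person share the same order_ts, the max-over-window version flags both as 'Y' while row_number() picks one arbitrarily. Adding order_id as a tie-breaker makes the choice deterministic, e.g. against the order_table view registered above:
spark.sql(
  """
    |select person_id, order_id, order_ts, order_amt,
    |  case when rn = 1 and order_amt > 0 then 'Y' else 'N' end as valid_order
    |from (
    |  select *, row_number() over(partition by person_id order by order_ts desc, order_id desc) as rn
    |  from order_table
    |) s
  """.stripMargin)
  .show(false)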

Get all Not null columns of spark dataframe in one Column

I need to select all non-null columns from a Hive table and insert them into HBase. For example, consider the table below:
Name Place Department Experience
==============================================
Ram | Ramgarh | Sales | 14
Lakshman | Lakshmanpur |Operations |
Sita | Sitapur | | 14
Ravan | | | 25
I have to write all the non-null columns from the above table to HBase, so I wrote logic to collect the non-null column names into one column of the dataframe, as below. The Name column is mandatory there.
Name Place Department Experience Not_null_columns
================================================================================
Ram Ramgarh Sales 14 Name, Place, Department, Experience
Lakshman Lakshmanpur Operations Name, Place, Department
Sita Sitapur 14 Name, Place, Experience
Ravan 25 Name, Experience
Now my requirement is to create a single column in the dataframe holding the values of all the non-null columns, as shown below.
Name Place Department Experience Not_null_columns_values
Ram Ramgarh Sales 14 Name: Ram, Place: Ramgarh, Department: Sales, Experince: 14
Lakshman Lakshmanpur Operations Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita Sitapur 14 Name: Sita, Place: Sitapur, Experience: 14
Ravan 25 Name: Ravan, Experience: 25
Once I get the above df, I will write it to HBase with Name as the key and the last column as the value.
Please let me know if there is a better approach to do this.
Try this-
Load the test data provided
val data =
"""
|Name | Place | Department | Experience
|
|Ram | Ramgarh | Sales | 14
|
|Lakshman | Lakshmanpur |Operations |
|
|Sita | Sitapur | | 14
|
|Ravan | | | 25
""".stripMargin
val stringDS = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
// .option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+-----------+----------+----------+
* |Name |Place |Department|Experience|
* +--------+-----------+----------+----------+
* |Ram |Ramgarh |Sales |14 |
* |Lakshman|Lakshmanpur|Operations|null |
* |Sita |Sitapur |null |14 |
* |Ravan |null |null |25 |
* +--------+-----------+----------+----------+
*
* root
* |-- Name: string (nullable = true)
* |-- Place: string (nullable = true)
* |-- Department: string (nullable = true)
* |-- Experience: integer (nullable = true)
*/
Convert to a struct and then to JSON
val x = df.withColumn("Not_null_columns_values",
to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Name |Place |Department|Experience|Not_null_columns_values |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
* |Ram |Ramgarh |Sales |14 |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14}|
* |Lakshman|Lakshmanpur|Operations|null |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"} |
* |Sita |Sitapur |null |14 |{"Name":"Sita","Place":"Sitapur","Experience":14} |
* |Ravan |null |null |25 |{"Name":"Ravan","Experience":25} |
* +--------+-----------+----------+----------+---------------------------------------------------------------------+
*/
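If you need the exact Name: value, ... rendering from the question rather than JSON, here is a sketch (my own variant, reusing the same df) built on concat_ws, which silently drops null inputs:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, when}

// Build one "ColumnName: value" fragment per column; when() without otherwise()
// yields null for null values, and concat_ws skips nulls entirely.
val y = df.withColumn("Not_null_columns_values",
  concat_ws(", ", df.columns.map(c =>
    when(col(c).isNotNull, concat(lit(s"$c: "), col(c).cast("string")))): _*))
y.show(false)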

How to concatenate spark dataframe columns using Spark sql in databricks

I have two columns called "FirstName" and "LastName" in my dataframe; how can I concatenate these two columns into one?
|Id |FirstName|LastName|
| 1 | A | B |
| | | |
| | | |
I want to make it like this
|Id |FullName |
| 1 | AB |
| | |
| | |
My query looks like this, but it raises an error:
val kgt=spark.sql("""
Select Id,FirstName+' '+ContactLastName AS FullName from tblAA """)
kgt.createOrReplaceTempView("NameTable")
Here we go with the Spark SQL solution:
spark.sql("select Id, CONCAT(FirstName,' ',LastName) as FullName from NameTable").show(false)
OR
spark.sql( " select Id, FirstName || ' ' ||LastName as FullName from NameTable ").show(false)
from pyspark.sql import functions as F
df = df.withColumn('FullName', F.concat(F.col('FirstName'), F.col('LastName')))
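One detail worth noting (not raised in the answers above): concat returns NULL as soon as any input is NULL, whereas concat_ws skips NULL inputs, which is usually what you want for optional name parts. A small sketch in the Scala API:
import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws joins with a separator and skips NULL columns instead of nulling out the whole result.
val withFullName = df.withColumn("FullName", concat_ws(" ", col("FirstName"), col("LastName")))
withFullName.show(false)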
