Getting error in Spark SQL when trying to concat '}' character - apache-spark

I have a use case where I need to concat a '}' to a string using Spark SQL. The sample dataset is as below:
+-------------------------------------+-----+
|col_1                                |col_2|
+-------------------------------------+-----+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |
+-------------------------------------+-----+
root
|-- col_1: string (nullable = true)
|-- col_2: string (nullable = true)
I want to check the length of col_1 and, based on that, add the value of col_2 into the JSON-formatted string in col_1. I have written a Spark SQL query as below:
select *, case when length(col_1) = 2 then
concat(substring(col_1, 0, length(col_1) - 1), '"col_2":"',cast(col_2 as STRING), '"}')
else concat(substring(col_1, 0, length(col_1) - 1), ',"col_2":"', cast(col_2 as STRING), '"}')
end as mod_col_1
from df
Query parsing fails when it encounters the '}' character. Is there any way to add/escape this character in the query, or any other way to generate the desired string?
Expected output:
when col_1 = "{}"
+-----+-----+
|col_1|col_2|
+-----+-----+
|{}   |abcd |
+-----+-----+
output:
+-----+-----+------------------+
|col_1|col_2|mod_col_1         |
+-----+-----+------------------+
|{}   |abcd |{'col_2' : 'abcd'}|
+-----+-----+------------------+
when col_1 = {"key_1" : "val_1", "key_2" : "val_2"}
+-------------------------------------+-----+
|col_1                                |col_2|
+-------------------------------------+-----+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |
+-------------------------------------+-----+
output:
+-------------------------------------+-----+----------------------------------------------------+
|col_1                                |col_2|mod_col_1                                           |
+-------------------------------------+-----+----------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2","col_2":"abcd"}|
+-------------------------------------+-----+----------------------------------------------------+
Happy to share more details if required.

You can try the regexp_replace() function. Check this:
spark.sql(s"""
with t1 ( select '{"key_1" : "val_1","key_2" : "val_2"}' col_1, 'abcd' col_2
union all
select '{}', 'defg' )
select *, case
when col_1 = '{}' then "{ 'col_2' : '"||col_2|| "'}"
else regexp_replace(col_1,"[}]",":")||"'col_2' : '"|| col_2 || "'}"
end x from t1
""").show(50,false)
Output:
+-------------------------------------+-----+------------------------------------------------------+
|col_1 |col_2|x |
+-------------------------------------+-----+------------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2":'col_2' : 'abcd'}|
|{} |defg |{ 'col_2' : 'defg'} |
+-------------------------------------+-----+------------------------------------------------------+
Update:
To get the output in double quotes, wrap the string literals in single quotes:
spark.sql(s"""
with t1 ( select '{"key_1" : "val_1","key_2" : "val_2"}' col_1, 'abcd' col_2
union all
select '{}', 'defg' )
select *, case
when col_1 = '{}' then '{ "col_2" : "' ||col_2|| '"}'
else regexp_replace(col_1,"[}]",":")||'"col_2" : "'|| col_2 || '"}'
end x from t1
""").show(50,false)
+-------------------------------------+-----+------------------------------------------------------+
|col_1 |col_2|x |
+-------------------------------------+-----+------------------------------------------------------+
|{"key_1" : "val_1","key_2" : "val_2"}|abcd |{"key_1" : "val_1","key_2" : "val_2":"col_2" : "abcd"}|
|{} |defg |{ "col_2" : "defg"} |
+-------------------------------------+-----+------------------------------------------------------+
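For what it's worth, the original concat/substring approach also works directly through the DataFrame API, where the '}' character needs no escaping at all. A minimal PySpark sketch (assuming only an active spark session):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('{"key_1" : "val_1","key_2" : "val_2"}', "abcd"), ("{}", "defg")],
    ["col_1", "col_2"])

# drop the trailing '}' (substring is 1-based in Spark SQL)
body = F.expr("substring(col_1, 1, length(col_1) - 1)")

mod_col_1 = (F.when(F.length("col_1") == 2,
                    F.concat(body, F.lit('"col_2":"'), F.col("col_2"), F.lit('"}')))
              .otherwise(F.concat(body, F.lit(',"col_2":"'), F.col("col_2"), F.lit('"}'))))

df.withColumn("mod_col_1", mod_col_1).show(truncate=False)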

Related

In PySpark dataframe, Join send and receive rows in columns

I have a dataframe that has distinct 'send' and 'receive' rows. I need to combine these rows into a single row with send and receive columns, using PySpark. Note that the ID is the same for both lines and the action identifier is ACTION_CD:
Original dataframe:
+------------------------------------+------------------------+---------+--------------------+
|ID |MSG_DT |ACTION_CD|MESSAGE |
+------------------------------------+------------------------+---------+--------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|receive |Oi |
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.852Z|send |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|receive |4 |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.688Z|send |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|receive |1 |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.479Z|send |⭐️*Antes de você ir |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|receive |788884 |
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:57.435Z|send |Agora |
+------------------------------------+------------------------+---------+--------------------+
How I need:
+------------------------------------+------------------------+-------+-------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+------------------------+-------+-------------------+
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07T21:24:54.552Z|Oi |Olá! |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07T21:25:06.565Z|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07T21:25:30.408Z|1 |⭐️*Antes de você ir|
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07T21:25:52.798Z|788884 |Agora |
+------------------------------------+------------------------+-------+-------------------+
P.S.: The MSG_DT should be that of the earliest record.
You can construct RECEIVE and SEND by applying the first expression over computed columns that are derived from ACTION_CD.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.552Z", "receive", "Oi",),
("d2636151-b95e-4845-8014-0a113c381ff9", "2022-08-07T21:24:54.852Z", "send", "Olá!",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.565Z", "receive", "4",),
("4241224b-9ba5-4eda-8e16-7e3aeaacf164", "2022-08-07T21:25:06.688Z", "send", "Certo",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.408Z", "receive", "1",),
("bd46c6fb-1315-4418-9943-2e7d3151f788", "2022-08-07T21:25:30.479Z", "send", "️*Antes de você ir",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:52.798Z", "receive", "788884",),
("14da8519-6e4c-4edc-88ea-e33c14533dd9", "2022-08-07T21:25:57.435Z", "send", "Agora",), ]
df = spark.createDataFrame(data, ("ID", "MSG_DT", "ACTION_CD", "MESSAGE")).withColumn("MSG_DT", F.to_timestamp("MSG_DT"))
ws = W.partitionBy("ID").orderBy("MSG_DT")
first_rows = ws.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
action_column_selection = lambda action: F.first(F.when(F.col("ACTION_CD") == action, F.col("MESSAGE")), ignorenulls=True).over(first_rows)
(df.select("*",
action_column_selection("receive").alias("RECEIVE"),
action_column_selection("send").alias("SEND"),
F.row_number().over(ws).alias("rn"))
.where("rn = 1")
.drop("ACTION_CD", "MESSAGE", "rn")).show(truncate=False)
"""
+------------------------------------+-----------------------+-------+------------------+
|ID |MSG_DT |RECEIVE|SEND |
+------------------------------------+-----------------------+-------+------------------+
|14da8519-6e4c-4edc-88ea-e33c14533dd9|2022-08-07 23:25:52.798|788884 |Agora |
|4241224b-9ba5-4eda-8e16-7e3aeaacf164|2022-08-07 23:25:06.565|4 |Certo |
|bd46c6fb-1315-4418-9943-2e7d3151f788|2022-08-07 23:25:30.408|1 |️*Antes de você ir|
|d2636151-b95e-4845-8014-0a113c381ff9|2022-08-07 23:24:54.552|Oi |Olá! |
+------------------------------------+-----------------------+-------+------------------+
"""

How does Spark SQL implement the group by aggregate

How does Spark SQL implement the group by aggregate? I want to group by the name field and, based on the latest date, get the latest salary. How do I write the SQL?
The data is:
+----+------+-------+
|name|salary|date   |
+----+------+-------+
|AA  |  3000|2022-01|
|AA  |  4500|2022-02|
|BB  |  3500|2022-01|
|BB  |  4000|2022-02|
+----+------+-------+
The expected result is:
+----+------+
|name|salary|
+----+------+
|AA  |  4500|
|BB  |  4000|
+----+------+
Assuming that the dataframe is registered as a temporary view named tmp, first use the row_number window function to assign a row number (rn) within each group (name), ordered by date in descending order, and then take the rows with rn = 1.
sql = """
select name, salary from
(select *, row_number() over (partition by name order by date desc) as rn
from tmp)
where rn = 1
"""
df = spark.sql(sql)
df.show(truncate=False)
First convert your string to a date. Then convert the date to a Unix timestamp (a numeric representation of the date, so you can use max). Finally, use first as an aggregate function to retrieve a value from your aggregated results (it takes the first result, so if there is a date tie it could pull either one):
simpleData = [("James","Sales","NY",90000,34,'2022-02-01'),
("Michael","Sales","NY",86000,56,'2022-02-01'),
("Robert","Sales","CA",81000,30,'2022-02-01'),
("Maria","Finance","CA",90000,24,'2022-02-01'),
("Raman","Finance","CA",99000,40,'2022-03-01'),
("Scott","Finance","NY",83000,36,'2022-04-01'),
("Jen","Finance","NY",79000,53,'2022-04-01'),
("Jeff","Marketing","CA",80000,25,'2022-04-01'),
("Kumar","Marketing","NY",91000,50,'2022-05-01')
]
schema = ["employee_name","name","state","salary","age","updated"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
(df.withColumn(
     "dateUpdated",
     unix_timestamp(
         to_date(
             col("updated"),
             "yyyy-MM-dd"
         )
     )
 )
 .groupBy("name")
 .agg(
     max("dateUpdated"),
     first("salary").alias("Salary")
 )
 .show())
+---------+----------------+------+
| name|max(dateUpdated)|Salary|
+---------+----------------+------+
| Sales| 1643691600| 90000|
| Finance| 1648785600| 90000|
|Marketing| 1651377600| 80000|
+---------+----------------+------+
My usual trick is to "zip" date and salary together (depends on what do you want to sort first)
from pyspark.sql import functions as F
(df
.groupBy('name')
.agg(F.max(F.array('date', 'salary')).alias('max_date_salary'))
.withColumn('max_salary', F.col('max_date_salary')[1])
.show()
)
+----+---------------+----------+
|name|max_date_salary|max_salary|
+----+---------------+----------+
| AA|[2022-02, 4500]| 4500|
| BB|[2022-02, 4000]| 4000|
+----+---------------+----------+
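For completeness, a sketch assuming Spark 3.0+ and the tmp view from the first answer: the max_by aggregate picks the salary paired with the greatest date in a single pass.
spark.sql("""
  select name, max_by(salary, date) as salary
  from tmp
  group by name
""").show()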

How to filter text after some stop word?

I have a text. From each line I want to remove everything after any of a set of stop words. For example:
stop_words=['with','is', '/']
One of the rows is:
senior manager with experience
I want to remove everything after 'with' (including 'with'), so the output is:
senior manager
I have big data and am working with Spark in Python.
You can find the location of the stop words using instr, and get a substring up to that location.
import pyspark.sql.functions as F
stop_words = ['with', 'is', '/']
df = spark.createDataFrame([
['senior manager with experience'],
['is good'],
['xxx//'],
['other text']
]).toDF('col')
df.show(truncate=False)
+------------------------------+
|col |
+------------------------------+
|senior manager with experience|
|is good |
|xxx // |
|other text |
+------------------------------+
df2 = df.withColumn('idx',
F.coalesce(
# Get the smallest index of a stop word in the string
F.least(*[F.when(F.instr('col', s) != 0, F.instr('col', s)) for s in stop_words]),
# If no stop words found, get the whole string
F.length('col') + 1)
).selectExpr('trim(substring(col, 1, idx-1)) col')
df2.show()
+--------------+
| col|
+--------------+
|senior manager|
| |
| xxx|
| other text|
+--------------+
You can use a UDF to get the index of the first occurrence of a stop word in col, and then another UDF to take the substring of the col message.
val df = List("senior manager with experience", "is good", "xxx//", "other text").toDF("col")
val index_udf = udf ( (col_value :String ) => {val result = for (elem <- stop_words; if col_value.contains(elem)) yield col_value.indexOf(elem)
if (result.isEmpty) col_value.length else result.min } )
val substr_udf = udf((elem:String, index:Int) => elem.substring(0, index))
val df3 = df.withColumn("index", index_udf($"col")).withColumn("substr_message", substr_udf($"col", $"index")).select($"substr_message").withColumnRenamed("substr_message", "col")
df3.show()
+---------------+
| col|
+---------------+
|senior manager |
| |
| xxx|
| other text|
+---------------+
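Another option, using the df defined above and assuming the stop words can be combined into a single regular expression, is to split on that pattern and keep the first piece:
import re
import pyspark.sql.functions as F

stop_words = ['with', 'is', '/']
pattern = '|'.join(re.escape(w) for w in stop_words)   # "with|is|/"

df2 = df.withColumn('col', F.trim(F.split('col', pattern).getItem(0)))
df2.show()
Like the instr-based version, this matches the stop words anywhere in the string; add word boundaries to the pattern if you only want whole-word matches.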

how to achieve execute several functions on each column dynamically?

I am using spark-sql 2.4.1 with Java 8.
I have the following scenario:
val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).toDF("item1_value","item2_value","item3_value")
.withColumn("item1_value", $"item1_value".cast(DoubleType))
.withColumn("item2_value", $"item2_value".cast(DoubleType))
.withColumn("item3_value", $"item3_value".cast(DoubleType))
df.show(20)
I need expected output something like this:
-----------------------------------------------------------------------------------
col_name | sum_of_column | avg_of_column | vari_of_column
-----------------------------------------------------------------------------------
"item1_value" | sum("item1_value") | avg("item1_value") | variance("item1_value")
"item2_value" | sum("item2_value") | avg("item2_value") | variance("item2_value")
"item3_value" | sum("item3_value") | avg("item3_value") | variance("item3_value")
----------------------------------------------------------------------------------
How can I achieve this dynamically? Tomorrow I may have more columns.
Here is sample code that can achieve this. You can make the column list dynamic and add more functions if needed:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val df = Seq(
("0.9192019", "0.1992019", "0.9955999"),
("0.9292018", "0.2992019", "0.99662018"),
("0.9392017", "0.3992019", "0.99772000")).
toDF("item1_value","item2_value","item3_value").
withColumn("item1_value", $"item1_value".cast(DoubleType)).
withColumn("item2_value", $"item2_value".cast(DoubleType)).
withColumn("item3_value", $"item3_value".cast(DoubleType))
val aggregateColumns = Seq("item1_value","item2_value","item3_value")
val aggDFs = aggregateColumns.map( c => {
df.groupBy().agg(lit(c).as("col_name"),sum(c).as("sum_of_column"), avg(c).as("avg_of_column"), variance(c).as("var_of_column"))
})
val combinedDF = aggDFs.reduce(_ union _)
This returns the following output:
scala> df.show(10,false)
+-----------+-----------+-----------+
|item1_value|item2_value|item3_value|
+-----------+-----------+-----------+
|0.9192019 |0.1992019 |0.9955999 |
|0.9292018 |0.2992019 |0.99662018 |
|0.9392017 |0.3992019 |0.99772 |
+-----------+-----------+-----------+
scala> combinedDF.show(10,false)
+-----------+------------------+------------------+---------------------+
|col_name |sum_of_column |avg_of_column |var_of_column |
+-----------+------------------+------------------+---------------------+
|item1_value|2.7876054 |0.9292018 |9.999800000999957E-5 |
|item2_value|0.8976057000000001|0.2992019 |0.010000000000000002 |
|item3_value|2.9899400800000002|0.9966466933333334|1.1242332201333484E-6|
+-----------+------------------+------------------+---------------------+
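The same per-column aggregate-and-union idea rendered in PySpark, as a rough sketch (assuming a dataframe df with the same item columns as in the question):
from functools import reduce
from pyspark.sql import functions as F

aggregate_columns = ["item1_value", "item2_value", "item3_value"]

# one single-row dataframe of aggregates per column
agg_dfs = [df.groupBy().agg(F.lit(c).alias("col_name"),
                            F.sum(c).alias("sum_of_column"),
                            F.avg(c).alias("avg_of_column"),
                            F.variance(c).alias("var_of_column"))
           for c in aggregate_columns]

combined_df = reduce(lambda a, b: a.unionByName(b), agg_dfs)
combined_df.show(truncate=False)
unionByName matches columns by name rather than position, which keeps the result stable if the aggregate list grows.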

Updated data still persist in CQL table

I created a table with a SET column using CQL:
CREATE TABLE z_test.activity_follow (
activity_type_id text,
error_message text,
error_type text,
file text,
line_no text,
project_api_key text,
project_name text,
project_party_id text,
release_stage_id text,
stage_name text,
project_type_name text,
activity_type_name text,
account_id text,
created_at text,
secure_url text,
error_count text,
user_id set<text>,
PRIMARY KEY (activity_type_id,error_message,error_type,file,line_no,project_api_key,project_name,project_party_id,release_stage_id,stage_name,project_type_name,activity_type_name,account_id,created_at,secure_url)
);
Where z_test is my keyspace.
Then I added values to the set using the following queries:
UPDATE z_test.activity_follow SET user_id = user_id + {'46'} , error_count = '4'
WHERE activity_type_id = '1'
AND error_message = '1'
AND error_type = '1'
AND FILE = '1'
AND line_no = '1'
AND project_api_key = '1'
AND project_name = '1'
AND project_party_id = '1'
AND release_stage_id = '1'
AND stage_name = '1'
AND project_type_name = '1'
AND activity_type_name = '1'
AND account_id = '1'
AND secure_url = '1'
AND created_at = '1'
UPDATE z_test.activity_follow SET user_id = user_id + {'464'} , error_count = '4'
WHERE activity_type_id = '1'
AND error_message = '1'
AND error_type = '1'
AND FILE = '1'
AND line_no = '1'
AND project_api_key = '1'
AND project_name = '1'
AND project_party_id = '1'
AND release_stage_id = '1'
AND stage_name = '1'
AND project_type_name = '1'
AND activity_type_name = '1'
AND account_id = '1'
AND secure_url = '1'
AND created_at = '1'
The values were inserted successfully, and I used the following select statement:
SELECT * FROM z_test.activity_follow WHERE user_id CONTAINS '46';
And I got the following result:
activity_type_id | error_message | error_type | file | line_no | project_api_key | project_name | project_party_id | release_stage_id | stage_name | project_type_name | activity_type_name | account_id | created_at | secure_url | error_count | user_id
------------------+------------------------------------+----------------+--------------------------------------------------------------------+---------+--------------------------------------+--------------------------+------------------+------------------+-------------+-------------------+--------------------+------------+---------------------+-------------------------------------------------+-------------+---------
1 | alebvevcbvghhgrt123 is not defined | ReferenceError | http://localhost/ems-sdk/netspective_ems_js/example/automatic.html | 19 | 8aec5ce3-e924-3090-9bfe-57a440feba5f | Prescribewell-citrus-123 | 48 | 4 | Development | Php | exception | 47 | 2015-03-03 04:04:23 | PRE-EX-429c3daae9c108dffec32f113b9ca9cff1bb0468 | 1 | {'464'}
Then I removed one value ('46') from the set using:
UPDATE z_test.activity_follow SET user_id = user_id - {'46'} , error_count = '4'
WHERE activity_type_id = '1'
AND error_message = '1'
AND error_type = '1'
AND FILE = '1'
AND line_no = '1'
AND project_api_key = '1'
AND project_name = '1'
AND project_party_id = '1'
AND release_stage_id = '1'
AND stage_name = '1'
AND project_type_name = '1'
AND activity_type_name = '1'
AND account_id = '1'
AND secure_url = '1'
AND created_at = '1'
Now when I run the query again:
SELECT * FROM z_test.activity_follow WHERE user_id CONTAINS '46';
it still returns the row:
activity_type_id | error_message | error_type | file | line_no | project_api_key | project_name | project_party_id | release_stage_id | stage_name | project_type_name | activity_type_name | account_id | created_at | secure_url | error_count | user_id
------------------+------------------------------------+----------------+--------------------------------------------------------------------+---------+--------------------------------------+--------------------------+------------------+------------------+-------------+-------------------+--------------------+------------+---------------------+-------------------------------------------------+-------------+---------
1 | alebvevcbvghhgrt123 is not defined | ReferenceError | http://localhost/ems-sdk/netspective_ems_js/example/automatic.html | 19 | 8aec5ce3-e924-3090-9bfe-57a440feba5f | Prescribewell-citrus-123 | 48 | 4 | Development | Php | exception | 47 | 2015-03-03 04:04:23 | PRE-EX-429c3daae9c108dffec32f113b9ca9cff1bb0468 | 1 | {'464'}
Why am I getting this behavior? Is it expected in CQL? If I can remove this, how? I have given every value as '1' for the test; I have tried it with other values as well.
What client are you using to perform your CQL statements? Is this all done in cqlsh or something else?
This is just a shot-in-the-dark guess, but if you run two CQL statements matching the same primary key quickly one after another, it's possible that they are given the same writetime in Cassandra, which means one of the mutations will be ignored.
See: Cassandra: Writes after setting a column to null are lost randomly. Is this a bug, or I am doing something wrong?
If you are running Cassandra 2.1.2+, Cassandra will now break ties if there are writes/updates at the same millisecond (CASSANDRA-6123).
