replacing json attributes partially in spark dataframe - apache-spark

Environment: I'm using Databricks with Spark 3.3.0 and Python 3.
Problem I'm trying to solve: I'm trying to replace some of the attribute values of a JSON struct column. I have a dataframe that contains a struct-type column with the following JSON content structure:
ID | myCol
1  | {"att1": "abcde", "att2": "def", "att3": "defg", "att4": "defabc"}
2  | {"att1": "xyfp", "att2": "asdf", "att3": "ertyui", "att4": "asdfg"}
3  | {"att1": "fjhj", "att2": "zxcxzvc", "att3": "wtwert", "att4": "mjgkj"}
The dataframe contains thousands of records. I'm a bit new to Spark programming, so I've been having a hard time coming up with a way to replace the values of "att1" and "att3" in all rows of the dataframe, keeping only the first two characters and masking the rest, i.e. from the example above:
Expected Output:
ID | myCol
1  | {"att1": "ab---", "att2": "def", "att3": "de--", "att4": "defabc"}
2  | {"att1": "xy--", "att2": "asdf", "att3": "er----", "att4": "asdfg"}
3  | {"att1": "fj--", "att2": "zxcxzvc", "att3": "wt----", "att4": "mjgkj"}
I was looking into maybe using org.apache.spark.sql.functions.regexp_replace, but I don't know how to replace only part of the value, i.e. turn "abcde" into "ab---". I've looked at similar examples online, but every single one of them replaces the entire value with a value that is known beforehand, such as this one: https://stackoverflow.com/a/68899109/1994202. However, I need to keep the first two original characters, and the value is not static.
Any suggestions? Performance would also be important.

A lookbehind will help you select the correct characters to replace:
(?<=..).
The pattern above means "any character, if it is preceded by two characters". It will not match the first or second character, because they are preceded by none or only one character.
Therefore:
df.withColumn("myCol", col("myCol").withField("att1", regexp_replace(col("myCol.att1"), "(?<=..).", "-"))
.withField("att3", regexp_replace(col("myCol.att3"), "(?<=..).", "-"))).show()

A full-fledged example using create_map(), based on the lookbehind regular expression from Kombajn zbożowy's answer.
First, create a dataframe with the following schema:
from pyspark.sql.types import *

data = [
    (1, {"att1": "abcde", "att2": "def", "att3": "defg", "att4": "defabc"}),
    (2, {"att1": "xyfp", "att2": "asdf", "att3": "ertyui", "att4": "asdfg"}),
    (3, {"att1": "fjhj", "att2": "zxcxzvc", "att3": "wtwert", "att4": "mjgkj"})
]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("myCol", MapType(StringType(), StringType()), True)
])

df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)
Then use the create_map() function to update the att1 and att3 fields:
from pyspark.sql.functions import create_map, lit, col, regexp_replace

df.withColumn('myCol', create_map(
    lit('att1'), regexp_replace(col("myCol.att1"), "(?<=..).", "-"),
    lit('att2'), col("myCol.att2"),
    lit('att3'), regexp_replace(col("myCol.att3"), "(?<=..).", "-"),
    lit('att4'), col("myCol.att4")
)).show(truncate=False)
The resulting output shows myCol as a map with att1 and att3 masked after their first two characters, matching the expected output shown in the question.

Related

Cosmos DB json array that matches all the words

I need a query that can get me the document matching a list of words. For example, if I use
select c from c join (SELECT distinct VALUE c.id FROM c JOIN word IN c.words WHERE word in('word1','word2') and tag in('motorcycle')) ORDER BY c._ts desc
it returns both documents, but I want to retrieve only the first one because it matches the two words and not only one.
Document 1
{
  "c": {
    "id": "d0f1723c-0a55-454a-9cf8-3884f2d8d61a",
    "words": [
      "word1",
      "word2",
      "word3"
    ]
  }
}
Document 2
{
  "c": {
    "id": "d0f1723c-0a55-454a-9cf8-3884f2d8d61a",
    "words": [
      "word1",
      "word4",
      "word5"
    ]
  }
}
You should be able to cover this with two ARRAY_CONTAINS expressions in your WHERE clause (and no need for a JOIN):
SELECT c.id FROM c
WHERE ARRAY_CONTAINS(c.words, 'word1')
AND ARRAY_CONTAINS(c.words, 'word2')
This should return the id of your first document.
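If you need to run this from Python with the azure-cosmos SDK, a parameterized version could look like the sketch below; the endpoint, key, database, and container names are placeholders, not values from the question:
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

query = """
SELECT c.id FROM c
WHERE ARRAY_CONTAINS(c.words, @word1)
  AND ARRAY_CONTAINS(c.words, @word2)
"""

# Parameterize the words instead of concatenating them into the query string.
ids = list(container.query_items(
    query=query,
    parameters=[
        {"name": "@word1", "value": "word1"},
        {"name": "@word2", "value": "word2"},
    ],
    enable_cross_partition_query=True,
))
print(ids)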

find duplicate rows containing various types of lists (of lists) in pandas dataframe

Background
I have the following df that contains a mix of list types
import pandas as pd
df = pd.DataFrame({
    'Size': [[[['small', 'small', 'big', 'big']]], [['big', 'small', 'small']], ['big'], ['big']],
    'ID': [1, 2, 3, 3],
    'Animal': [['cat', 'dog', 'dog', 'cat'], ['dog', 'pig', 'dog'], ['pig'], ['pig']]
})
Which looks like this
Animal ID Size
0 [cat, dog, dog, cat] 1 [[[small, small, big, big]]]
1 [dog, pig, dog] 2 [[big, small, small]]
2 [pig] 3 [big]
3 [pig] 3 [big]
Problem
When I use the following
df.duplicated()
I get the following error, since my dataframe contains lists (at least I think this is why):
TypeError: unhashable type: 'list'
Question
How do I check for duplicate rows in a dataframe that contains multiple types of lists?
Cast the lists to strings first so that the rows become hashable, then use the de-duplicated index to select the original rows:
df.loc[df.astype(str).drop_duplicates().index]
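If you only need a boolean mask of which rows are duplicates (rather than the de-duplicated frame), the same string-cast trick works with duplicated(); a small sketch using the sample frame above:
import pandas as pd

df = pd.DataFrame({
    'Size': [[[['small', 'small', 'big', 'big']]], [['big', 'small', 'small']], ['big'], ['big']],
    'ID': [1, 2, 3, 3],
    'Animal': [['cat', 'dog', 'dog', 'cat'], ['dog', 'pig', 'dog'], ['pig'], ['pig']]
})

# Casting every cell to its string representation makes the rows hashable.
mask = df.astype(str).duplicated()
print(mask)           # only the last row is flagged as a duplicate
print(df.loc[~mask])  # same result as the drop_duplicates answer above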

Apache Cassandra Data modeling

I need some help with a data model to store smart meter data; I'm pretty new to working with Cassandra.
The data that has to be stored:
This is an example of one smart meter gateway:
{"logical_name": "smgw_123",
"ldevs":
[{"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
{"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
{"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]}]
}
So this is one smart meter gateway with the logical_name "smgw_123", and the ldevs array describes 3 smart meters with their values.
So the smart meter gateway has a relation to the 3 smart meters, and the smart meters in turn have their own data.
Questions
I don't know how to store this related data in a NoSQL database (in my case Cassandra).
Do I have to use two tables? For example smartmetergateway (logical name, smart meter 1, smart meter 2, smart meter 3)
and another smart meter table (logical name, capture time, unit, scaler, status, value)?
Another problem is that smart meter gateways can have different numbers of smart meters.
I hope I described my problem understandably.
Thanks
In Cassandra data modelling, the first thing you should do is determine your queries. You will model the partition keys and clustering columns of your tables according to those queries.
In your example, I assume you will query your smart meter gateways based on their logical names, i.e. your queries will look like:
select <some_columns>
from smart_meter_gateway
where smg_logical_name = <a_smg_logical_name>;
I also assume that each smart meter gateway's logical name is unique, and that each smart meter in the ldevs array has a unique logical name.
If this is the case, you should create a table with a partition key column of smg_logical_name and a clustering column of sm_logical_name. By doing this, you create a table where each smart meter gateway partition contains one row per smart meter:
create table smart_meter_gateway
(
smg_logical_name text,
sm_logical_name text,
capture_time int,
unit int,
scaler int,
status text,
value decimal,
primary key ((smg_logical_name), sm_logical_name)
);
You can insert into this table using the following statements:
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_1', 390600, 30, -3, '000', 152.361925);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_2', 390601, 33, -3, '000', 0.3208547253907171);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_3', 390602, 36, -3, '000', 162.636025);
And when you query the smart_meter_gateway table by smg_logical_name, you will get 3 rows in the result set:
select * from smart_meter_gateway where smg_logical_name = 'smgw_123';
The result of this query is:
smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
smgw_123         | sm_1            | 390600       | -3     | 000    | 30   | 152.361925
smgw_123         | sm_2            | 390601       | -3     | 000    | 33   | 0.3208547253907171
smgw_123         | sm_3            | 390602       | -3     | 000    | 36   | 162.636025
You can also add sm_logical_name as a filter to your query:
select *
from smart_meter_gateway
where smg_logical_name = 'smgw_123' and sm_logical_name = 'sm_1';
This time you will get only 1 row in the result set:
smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
smgw_123         | sm_1            | 390600       | -3     | 000    | 30   | 152.361925
Note that there are other ways to model your data. For example, you could use collection columns for the ldevs array; that approach has its own advantages and disadvantages. As I said at the beginning, it depends on your query needs.
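As an illustration of loading one gateway payload into the table above from Python, here is a minimal sketch using the cassandra-driver package; the contact point and the keyspace name ("metering") are assumptions, and the payload string is the JSON from the question:
import json
from decimal import Decimal
from cassandra.cluster import Cluster

payload = '''{"logical_name": "smgw_123", "ldevs": [
  {"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
  {"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
  {"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]}]}'''

cluster = Cluster(["127.0.0.1"])       # assumed contact point
session = cluster.connect("metering")  # assumed keyspace containing smart_meter_gateway

insert = session.prepare(
    "INSERT INTO smart_meter_gateway "
    "(smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

# One row per smart meter reading, keyed by the gateway and meter logical names.
gateway = json.loads(payload)
for ldev in gateway["ldevs"]:
    for obj in ldev["objects"]:
        session.execute(insert, (
            gateway["logical_name"],
            ldev["logical_name"],
            obj["capture_time"],
            obj["unit"],
            obj["scaler"],
            obj["status"],
            Decimal(str(obj["value"])),  # value is a CQL decimal
        ))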

Spark SQL secondary filtering and grouping

Problem: I have a data set A {field1, field2, field3...}, and I would like to first group A by, say, field1, and then within each of the resulting groups run a bunch of subqueries, for example count the number of rows that have field2 == true, or count the number of distinct field3 values that have field4 == "some_value" and field5 == false, etc.
Some alternatives I can think of: I could write a customized user-defined aggregate function that takes a function computing the condition for filtering, but that way I have to create an instance of it for every query condition. I've also looked at the countDistinct function, which can achieve some of the operations, but I can't figure out how to use it to implement the filter-distinct-count semantics.
In Pig, I can do:
FOREACH (GROUP A BY field1) {
    field_a = FILTER A BY field2 == TRUE;
    field_b = FILTER A BY field4 == 'some_value' AND field5 == FALSE;
    field_c = DISTINCT field_b.field3;
    GENERATE FLATTEN(group),
             COUNT(field_a) AS fa,
             COUNT(field_b) AS fb,
             COUNT(field_c) AS fc;
}
Is there a way to do this in Spark SQL?
Excluding the distinct count, this can be solved with a simple sum over a condition:
import org.apache.spark.sql.functions.{count, sum}
val df = sc.parallelize(Seq(
(1L, true, "x", "foo", true), (1L, true, "y", "bar", false),
(1L, true, "z", "foo", true), (2L, false, "y", "bar", false),
(2L, true, "x", "foo", false)
)).toDF("field1", "field2", "field3", "field4", "field5")
val left = df.groupBy($"field1").agg(
sum($"field2".cast("int")).alias("fa"),
sum(($"field4" === "foo" && ! $"field5").cast("int")).alias("fb")
)
left.show
// +------+---+---+
// |field1| fa| fb|
// +------+---+---+
// | 1| 3| 0|
// | 2| 1| 1|
// +------+---+---+
Unfortunately the distinct count is much more tricky. The GROUP BY clause in Spark SQL doesn't physically group the data, not to mention that finding distinct elements is quite expensive. Probably the best thing you can do is to compute the distinct counts separately and simply join the results:
val right = df.where($"field4" === "foo" && ! $"field5")
.select($"field1".alias("field1_"), $"field3")
.distinct
.groupBy($"field1_")
.agg(count("*").alias("fc"))
val joined = left
.join(right, $"field1" === $"field1_", "leftouter")
.na.fill(0)
Using a UDAF to count distinct values per condition is definitely an option, but an efficient implementation will be rather tricky. Converting from the internal representation is rather expensive, and implementing a fast UDAF with collection storage is not cheap either. If you can accept an approximate solution, you can use a Bloom filter there.
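For reference, a rough PySpark sketch of the same approach (conditional sums plus a separate distinct count joined back); it mirrors the Scala code above rather than adding anything new:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, True, "x", "foo", True), (1, True, "y", "bar", False),
     (1, True, "z", "foo", True), (2, False, "y", "bar", False),
     (2, True, "x", "foo", False)],
    ["field1", "field2", "field3", "field4", "field5"],
)

# Conditional counts as sums over 0/1 columns.
left = df.groupBy("field1").agg(
    F.sum(F.col("field2").cast("int")).alias("fa"),
    F.sum(((F.col("field4") == "foo") & ~F.col("field5")).cast("int")).alias("fb"),
)

# Distinct count computed separately and joined back.
right = (df.where((F.col("field4") == "foo") & ~F.col("field5"))
           .select(F.col("field1").alias("field1_"), "field3")
           .distinct()
           .groupBy("field1_")
           .agg(F.count(F.lit(1)).alias("fc")))

left.join(right, left["field1"] == right["field1_"], "left_outer").na.fill(0).show()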

JSONB Postgres 9.4

I have a jsonb column storing my order products:
CREATE TABLE configuration (
documentid text PRIMARY KEY
, data jsonb NOT NULL
);
Records:
(1, [{"itemid": "PROD001", "qty": 10}, {"itemid": "PROD002", "qty": 20}]),
(2, [{"itemid": "PROD001", "qty": 5}, {"itemid": "PROD003", "qty": 6}, {"itemid": "PROD004", "qty": 7}]),
(2, [{"itemid": "PROD002", "qty": 8}])
I already index the data using GIN.
How do I:
Select all sales that have PROD001
Select all sales that have itemid LIKE 'P%1'
Select all sales that have qty > 10
Get the total qty of each product
The Postgres documentation regarding JSON functionality is really clear and worth your attention. As for the use cases you provided, the answers may be the following:
-- Answer 1
SELECT sales_item
FROM configuration,jsonb_array_elements(data) AS sales_item
WHERE sales_item->>'itemid' = 'PROD001'; -- "->>" returns TEXT value
-- Answer 2
SELECT sales_item FROM configuration,jsonb_array_elements(data) AS sales_item
WHERE sales_item->>'itemid' LIKE 'P%1'; -- just use LIKE
-- Answer 3
SELECT sales_item FROM configuration,jsonb_array_elements(data) AS sales_item
WHERE (sales_item->>'qty')::INTEGER > 10; -- convert TEXT to INTEGER
For the last example you just need to use a Postgres window function:
-- Answer 4
SELECT DISTINCT documentid,sum((sales_item->>'qty')::INTEGER)
OVER (PARTITION BY documentid)
FROM configuration,jsonb_array_elements(data) as sales_item;
Examples 1 and 3 with the JsQuery extension:
SELECT * FROM
  (SELECT jsonb_array_elements(data) AS elem FROM configuration
   WHERE data @@ '#.itemid = "PROD001"') q
WHERE q.elem @@ 'itemid = "PROD001"';

SELECT * FROM
  (SELECT jsonb_array_elements(data) AS elem FROM configuration
   WHERE data @@ '#.qty > 10') q
WHERE q.elem @@ 'qty > 10';
The inner query's WHERE clause filters out rows that don't have any array element matching the requirements. Then the jsonb_array_elements function is applied only to the needed rows of the configuration table.
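If you are querying from Python, here is a minimal psycopg2 sketch for the per-document total (the query is answer 4 above; the connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT DISTINCT documentid,
               sum((sales_item->>'qty')::integer) OVER (PARTITION BY documentid) AS total_qty
        FROM configuration, jsonb_array_elements(data) AS sales_item
    """)
    for documentid, total_qty in cur.fetchall():
        print(documentid, total_qty)
conn.close()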
