Is there a way to split a JSON value containing an array into computed columns?
The credit_cards column is of type jsonb, with the sample data below:
[{"bank": "HDFC Bank", "cvv": "8253", "expiry": "2020-05-31T14:22:34.61426Z", "name": "18a81ea99250bf236a5e27a762a32d62", "number": "c4ca4238acf96a36"}, {"bank": "HDFC Bank", "cvv": "9214", "expiry": "2020-05-30T21:44:55.173339Z", "name": "6725a156df733ec2dd33b94f06ee2e06", "number": "c81e728dacf96a36"}, {"bank": "HDFC Bank", "cvv": "1161", "expiry": "2020-05-31T07:59:28.458905Z", "name": "eb102765424d07d8b713211c14e837b4", "number": "eccbc87eacf96a36"}]
I tried this, but it looks like it's not supported in a computed column.
alter table users_combined add column projected_number STRING[] AS (json_array_elements(credit_cards)->>'number') STORED;
ERROR: json_array_elements(): generator functions are not allowed in computed column
Another alternative which worked was:
alter table users_combined add column projected_number STRING[] AS (ARRAY[credit_cards->0->>'number',credit_cards->1->>'number',credit_cards->2->>'number']) STORED;
However, this has the problem that the user has to specify the indices of the credit_cards array. If there are more than 3 credit cards, we'll have to alter the column with new indices.
So is there a way to create a computed column without having to specify the indices?
There is no way to do this. But, inverted indexes are a way to get the same capabilities for looking up data in the table that I think you're going for here.
If you create an inverted index on the table, then you can search for rows that have a given number attribute efficiently:
demo@127.0.0.1:26257/defaultdb> create table users_combined (credit_cards jsonb);
CREATE TABLE
Time: 3ms total (execution 3ms / network 0ms)
demo@127.0.0.1:26257/defaultdb> create inverted index on users_combined(credit_cards);
CREATE INDEX
Time: 53ms total (execution 4ms / network 49ms)
demo@127.0.0.1:26257/defaultdb> explain select * from users_combined where credit_cards->'number' = '"c4ca4238acf96a36"';
                                          info
----------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ table: users_combined@primary
  │
  └── • scan
        table: users_combined@users_combined_credit_cards_idx
        spans: [/'{"number": "c4ca4238acf96a36"}' - /'{"number": "c4ca4238acf96a36"}']
(10 rows)
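To picture why an inverted index makes this kind of lookup cheap, here is a rough Python sketch. It is not how CockroachDB implements the index; build_inverted_index and lookup are hypothetical names, and the row ids are made up. The idea shown is only that every (key, value) pair found inside the JSON array points back to the rows that contain it, so no array position ever has to be named.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each (key, value) pair inside the JSON arrays to the ids of rows containing it."""
    index = defaultdict(set)
    for row_id, cards in rows.items():
        for card in cards:                      # every element of the JSON array
            for key, value in card.items():
                index[(key, value)].add(row_id)
    return index

def lookup(index, key, value):
    """Return ids of rows that have at least one array element with key == value."""
    return sorted(index.get((key, value), set()))

# Hypothetical row ids; card values are taken from the sample data above.
rows = {
    1: [{"bank": "HDFC Bank", "number": "c4ca4238acf96a36"},
        {"bank": "HDFC Bank", "number": "c81e728dacf96a36"}],
    2: [{"bank": "HDFC Bank", "number": "eccbc87eacf96a36"}],
}

index = build_inverted_index(rows)
print(lookup(index, "number", "c81e728dacf96a36"))  # [1]
print(lookup(index, "bank", "HDFC Bank"))           # [1, 2]
```

The lookup works the same whether a row holds 3 cards or 30, which is the property the fixed-index computed column lacks.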
name | contact | address
"max" | [{"email": "watson@commerce.gov", "phone": "650-333-3456"}, {"email": "emily@gmail.com", "phone": "238-111-7689"}] | {"city": "Baltimore", "state": "MD"}
"kyle" | [{"email": "johnsmith@yahoo.com", "phone": "425-231-8754"}] | {"city": "Barton", "state": "TN"}
I am working with a dataframe in Pyspark that has a few columns including the two mentioned above. I need to create columns dynamically based on the contact fields.
When I use the "." operator on contact, as in contact.email, I get a list of emails. I need to create a separate column for each of the emails.
contact.email0, contact.email1, etc.
I found this code online, which partially does what I want, but I don't completely understand it.
employee_data.select(
'name', *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]).show(truncate=False)
The range is static in this case, but my range could be dynamic. How can I get the size of list to loop through it? I tried size(col('contact.email')) or len(col('contact.email')) but got an error saying the col('column name') object is not iterable.
Desired output, something like:
name | contact.email0 | contact.email1
max | watson@commerce.gov | emily@gmail.com
kyle | johnsmith@yahoo.com | null
You can get the desired output by using the pivot function:
from pyspark.sql.functions import col, concat, expr, first, lit, posexplode_outer
# convert the contact array of structs to an array of emails with transform
# explode the array, keeping each element's position (pos)
# pivot the positions into columns
df.select("name", posexplode_outer(expr("transform(contact, c -> c.email)"))) \
    .withColumn("email", concat(lit("contact.email"), col("pos"))) \
    .groupBy("name").pivot("email").agg(first("col")) \
    .show(truncate=False)
+----+-------------------+---------------+
|name|contact.email0     |contact.email1 |
+----+-------------------+---------------+
|kyle|johnsmith@yahoo.com|null           |
|max |watson@commerce.gov|emily@gmail.com|
+----+-------------------+---------------+
To understand what the solution you found does, we can print the expression in a shell:
>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]
Basically, it creates two columns, one for the first element of the array contact.email and one for the second element. That's all there is to it.
SOLUTION 1
Keep this solution, but you need to find the maximum size of your array first:
from pyspark.sql import functions as F
# largest number of contacts in any row, collected to the driver as a Python int
max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select('name',
    *[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)])\
    .show(truncate=False)
SOLUTION 2
Use posexplode to generate one row per element of the array + a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.
df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
.withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
.groupBy('name')\
.pivot('pos')\
.agg(F.first('email'))\
.show()
Both solutions yield:
+----+-------------------+---------------+
|name|contact.email0     |contact.email1 |
+----+-------------------+---------------+
|max |watson@commerce.gov|emily@gmail.com|
|kyle|johnsmith@yahoo.com|null           |
+----+-------------------+---------------+
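If it helps to see what the two approaches do without Spark in the way, here is a plain-Python sketch of the same logic on lists of dicts. The function names columns_by_max_size and columns_by_pivot are made up for illustration, and this only mirrors the idea, not Spark's execution.

```python
def columns_by_max_size(rows, prefix="contact.email"):
    """Solution 1 idea: find the longest array first, then index every row up to that size."""
    max_size = max((len(r["contact"]) for r in rows), default=0)
    out = []
    for r in rows:
        emails = [c["email"] for c in r["contact"]]
        record = {"name": r["name"]}
        for i in range(max_size):
            # positions past the end of a shorter array become None (null in Spark)
            record[f"{prefix}{i}"] = emails[i] if i < len(emails) else None
        out.append(record)
    return out

def columns_by_pivot(rows, prefix="contact.email"):
    """Solution 2 idea: explode to (name, pos, email) triples, then pivot pos into columns."""
    exploded = [(r["name"], pos, c["email"])
                for r in rows for pos, c in enumerate(r["contact"])]
    positions = sorted({pos for _, pos, _ in exploded})
    pivoted = {}
    for name, pos, email in exploded:
        # start each name with every pivot column set to None, then fill what exists
        row = pivoted.setdefault(
            name, {"name": name, **{f"{prefix}{p}": None for p in positions}})
        row[f"{prefix}{pos}"] = email
    return list(pivoted.values())

rows = [
    {"name": "max", "contact": [{"email": "watson@commerce.gov"},
                                {"email": "emily@gmail.com"}]},
    {"name": "kyle", "contact": [{"email": "johnsmith@yahoo.com"}]},
]

for record in columns_by_max_size(rows):
    print(record)
print(columns_by_max_size(rows) == columns_by_pivot(rows))  # True
```

One difference to keep in mind: in this sketch a row with an empty contact list disappears from the pivot version but survives in the max-size version.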
You can use the size function to get the length of the list in the contact column. Note that size returns a Column expression, not a Python integer, so it cannot be passed to range directly; aggregate the maximum size and collect it to the driver first. Here's an example:
from pyspark.sql.functions import col, size, max as max_
# size() yields a Column; first()[0] brings the largest array length back as a Python int
contact_size = employee_data.select(max_(size(col('contact')))).first()[0]
employee_data.select(
'name', *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]).show(truncate=False)
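Or, using array_size (as far as I know, array_length is not available in pyspark.sql.functions; array_size is the closest equivalent and only exists in recent PySpark releases):
from pyspark.sql.functions import col, array_size, max as max_
# the result must still be collected to a Python int before it can be used in range()
contact_size = employee_data.select(max_(array_size(col('contact')))).first()[0]
employee_data.select(
'name', *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]).show(truncate=False)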
I'm trying to write a query that uses a JOIN to perform a geo-spatial match against locations in an array. I got it working, but added DISTINCT in order to de-duplicate (Query A):
SELECT DISTINCT VALUE
u
FROM
u
JOIN loc IN u.locations
WHERE
ST_WITHIN(
{'type':'Point','coordinates':[loc.longitude,loc.latitude]},
{'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
However, I then found that combining DISTINCT with continuation tokens isn't supported unless you also add ORDER BY:
System.ArgumentException: Distict query requires a matching order by in order to return a continuation token. If you would like to serve this query through continuation tokens, then please rewrite the query in the form 'SELECT DISTINCT VALUE c.blah FROM c ORDER BY c.blah' and please make sure that there is a range index on 'c.blah'.
So I tried adding ORDER BY like this (Query B):
SELECT DISTINCT VALUE
u
FROM
u
JOIN loc IN u.locations
WHERE
ST_WITHIN(
{'type':'Point','coordinates':[loc.longitude,loc.latitude]},
{'type':'Polygon','coordinates':[[[-108,-43],[-108,-40],[-110,-40],[-110,-43],[-108,-43]]]})
ORDER BY
u.created
The problem is, the DISTINCT no longer appears to be taking effect because it returns, for example, the same record twice.
To reproduce this, create a single document with this data:
{
"id": "b6dd3e9b-e6c5-4e5a-a257-371e386f1c2e",
"locations": [
{
"latitude": -42,
"longitude": -109
},
{
"latitude": -42,
"longitude": -109
}
],
"created": "2019-03-06T03:43:52.328Z"
}
Then run Query A above. You will get a single result, despite the fact that both locations match the predicate. If you remove the DISTINCT, you'll get the same document twice.
Now run Query B and you'll see it returns the same document twice, despite the DISTINCT clause.
What am I doing wrong here?
I reproduced your issue. Based on my research, it appears to be a defect in Cosmos DB's DISTINCT queries. Please refer to this feedback item: Provide support for DISTINCT.
This feature is broke in the data explorer. Because cosmos can only
return 100 results per page at a time, the distinct keyword will only
apply to a single page. So, if your result set contains more than 100
results, you may still get duplicates back - they will simply be on
separately paged result sets.
You could describe your own situation and vote up this feedback case.
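To illustrate the behaviour described in that feedback item, here is a small Python simulation. It is only a sketch of the quoted explanation (de-duplication applied per page), not of Cosmos DB internals; the function names and the page size of 2 are made up for the example.

```python
def paged_distinct(items, page_size):
    """De-duplicate separately within each page, as the quoted feedback describes."""
    pages = []
    for start in range(0, len(items), page_size):
        page = items[start:start + page_size]
        pages.append(list(dict.fromkeys(page)))   # order-preserving dedupe inside one page
    return pages

def distinct_across_pages(pages):
    """Client-side workaround: remember what was already seen while reading the pages."""
    seen = set()
    result = []
    for page in pages:
        for item in page:
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result

ids = ["a", "b", "a", "c", "b", "a"]
pages = paged_distinct(ids, page_size=2)
print(pages)                         # [['a', 'b'], ['a', 'c'], ['b', 'a']]
print(distinct_across_pages(pages))  # ['a', 'b', 'c']
```

Each page is free of duplicates on its own, yet the concatenated pages are not, so until the service handles this you would have to de-duplicate on the client (for example by document id).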
I need some help with a data model for storing smart meter data; I'm pretty new to working with Cassandra.
The data that has to be stored:
This is an example of 1 smart meter:
{"logical_name": "smgw_123",
"ldevs":
[{"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
{"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
{"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]}]
}
So this is 1 smart meter gateway with the logical_name "smgw_123".
The ldevs array contains 3 smart meters, each described with its values.
So the smart meter gateway has a relation to the 3 smart meters, and the smart meters in turn have their own data.
Questions
I don't know how to store data with relations like these in a NoSQL database (in my case Cassandra).
Do I then have to use 2 columns? Like smartmetergateway (logical name, smart meter 1, smart meter 2, smart meter 3)
and another with smart meter (logical name, capture time, unit, scaler, status, value)?
Another problem is that each smart meter gateway can have a different number of smart meters.
I hope I have described my problem clearly.
Thanks
In Cassandra data modelling, the first thing you should do is to determine your queries. You will model partition keys and clustering columns of your tables according to your queries.
In your example, I assume you will query your smart meter gateways based on their logical names. I mean, your queries will look like
select <some_columns>
from smart_meter_gateway
where smg_logical_name = <a_smg_logical_name>;
I also assume that each smart meter gateway's logical name is unique and that each smart meter in the ldevs array has a unique logical name.
If this is the case, you should create a table with a partition key column of smg_logical_name and clustering column of sm_logical_name. By doing this, you will create a table where each smart meter gateway partition will contain some number of rows of smart meters:
create table smart_meter_gateway
(
smg_logical_name text,
sm_logical_name text,
capture_time int,
unit int,
scaler int,
status text,
value decimal,
primary key ((smg_logical_name), sm_logical_name)
);
And you can insert into this table by using following statements:
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_1', 390600, 30, -3, '000', 152.361925);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_2', 390601, 33, -3, '000', 0.3208547253907171);
insert into smart_meter_gateway (smg_logical_name, sm_logical_name, capture_time, unit, scaler, status, value)
values ('smgw_123', 'sm_3', 390602, 36, -3, '000', 162.636025);
And when you query smart_meter_gateway table by smg_logical_name, you will get 3 rows in the result set:
select * from smart_meter_gateway where smg_logical_name = 'smgw_123';
The result of this query is:
smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
smgw_123 | sm_1 | 390600 | -3 | 000 | 30 | 152.361925
smgw_123 | sm_2 | 390601 | -3 | 000 | 33 | 0.3208547253907171
smgw_123 | sm_3 | 390602 | -3 | 000 | 36 | 162.636025
You can also add sm_logical_name as a filter to your query:
select *
from smart_meter_gateway
where smg_logical_name = 'smgw_123' and sm_logical_name = 'sm_1';
This time you will get only 1 row in the result set:
smg_logical_name | sm_logical_name | capture_time | scaler | status | unit | value
smgw_123 | sm_1 | 390600 | -3 | 000 | 30 | 152.361925
Note that there are other ways you can model your data. For example, you can use collection columns for ldevs array and this approach has some advantages and disadvantages. As I said in the beginning, it depends on your query needs.
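If the partition key / clustering column idea is new to you, a toy in-memory picture in Python may help. This is only a mental model, not how Cassandra stores data; SmartMeterGatewayTable and load_gateway are hypothetical names. Each partition key maps to its own group of rows, and rows inside a partition are kept per clustering key.

```python
from collections import defaultdict

class SmartMeterGatewayTable:
    """Toy model of the table: partition key -> {clustering key -> row columns}."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def insert(self, smg_logical_name, sm_logical_name, **columns):
        # Same primary key replaces the previous row, like an upsert in Cassandra.
        self._partitions[smg_logical_name][sm_logical_name] = columns

    def select(self, smg_logical_name, sm_logical_name=None):
        partition = self._partitions.get(smg_logical_name, {})
        if sm_logical_name is not None:
            row = partition.get(sm_logical_name)
            return [] if row is None else [{"sm_logical_name": sm_logical_name, **row}]
        # Rows of one partition come back ordered by the clustering column.
        return [{"sm_logical_name": key, **partition[key]} for key in sorted(partition)]

def load_gateway(table, message):
    """Flatten one gateway message into one row per smart meter."""
    for ldev in message["ldevs"]:
        for obj in ldev["objects"]:
            # With this key, a later reading for the same meter replaces the earlier one.
            table.insert(message["logical_name"], ldev["logical_name"], **obj)

message = {
    "logical_name": "smgw_123",
    "ldevs": [
        {"logical_name": "sm_1", "objects": [{"capture_time": 390600, "unit": 30, "scaler": -3, "status": "000", "value": 152.361925}]},
        {"logical_name": "sm_2", "objects": [{"capture_time": 390601, "unit": 33, "scaler": -3, "status": "000", "value": 0.3208547253907171}]},
        {"logical_name": "sm_3", "objects": [{"capture_time": 390602, "unit": 36, "scaler": -3, "status": "000", "value": 162.636025}]},
    ],
}

table = SmartMeterGatewayTable()
load_gateway(table, message)
print(len(table.select("smgw_123")))     # 3
print(table.select("smgw_123", "sm_1"))
```

A gateway with 2 meters or 20 just has a partition with 2 or 20 rows, so the varying number of smart meters is not a problem. If you need to keep a history of readings per meter, the key would have to include something like capture_time as well, since otherwise each new reading overwrites the previous one.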
This query costs 265 RUs:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c.StartFetchDateTimeUtc DESC
StartFetchDateTimeUtc is a string property, serialized by using the Cosmos API
This query costs 5 RUs:
SELECT top 1 * FROM c
WHERE c.CollectPackageId = 'd0613cbb-492b-4464-b66b-3634b5571826'
ORDER BY c._ts DESC
_ts is a built in field, a Unix-based numeric timestamp.
Example result (only including this field and _ts):
"StartFetchDateTimeUtc": "2017-08-08T03:35:04.1654152Z",
"_ts": 1502163306
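For reference, a short Python sketch of how the two fields relate. It assumes _ts is seconds since the Unix epoch in UTC; ts_to_iso is a made-up helper name, and the two extra strings in the sort example are hypothetical values. It also shows why a fixed-width ISO 8601 UTC string can be sorted as plain text.

```python
from datetime import datetime, timezone

def ts_to_iso(ts):
    """Convert a Unix timestamp in seconds (like _ts) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

print(ts_to_iso(1502163306))  # 2017-08-08T03:35:06Z

# Fixed-width ISO 8601 UTC strings sort lexicographically in chronological order,
# which is what allows a string range index to serve ORDER BY.
stamps = [
    "2017-08-08T03:35:04.1654152Z",   # value from the example document
    "2016-12-31T23:59:59.0000000Z",   # hypothetical
    "2017-08-08T03:35:04.1654151Z",   # hypothetical
]
print(sorted(stamps))
```

So _ts in the example is about two seconds after StartFetchDateTimeUtc, and both orderings should give nearly the same "latest" document for this data.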
The index is in place and follows the suggestions & tutorials how to configure a sortable string/timestamp. It looks like:
{
"path": "/StartFetchDateTimeUtc/?",
"indexes": [
{
"kind": "Range",
"dataType": "String",
"precision": -1
}
]
}
According to this article, the following variables affect the RU charge: item size, item property count, data consistency, indexed properties, document indexing, query patterns, and script usage.
So it is very strange that ordering by a different property costs a different number of RUs.
I also created a test demo on my side (with your index and the same document properties) and inserted 1000 records into DocumentDB. The two queries cost the same RU. I suggest you create a new collection and test again.
The result is like this:
Order by StartFetchDateTimeUtc
Order by _ts
Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products ( rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type)
PRIMARY KEY, prod_id varchar, class_id varchar, date varchar);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I’m trying to find a way to range query over these composite row keys, analogous to how we slice query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop keys I choose.
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea if something like this can be achieved WITHOUT using the order preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I’m trying to do is storing sales transaction data in a table which uses both composite row keys to encode date/time/place and composite columns to store information about the sold items:
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was putting every item in a separate row and let the application group items back together by transaction id. That’s working fine. But now I’m trying to leverage the structuring capabilities of composite columns to persist the nested item data within a representation (per item) like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: All items which were sold between the dates 20110301 and 20110310 between times 1200 and 1400 in stores 25 - 50. What I achieved so far with composite columns was using one wide row per store and putting all the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one row per partition, so it should be clear that RandomPartitioner will not give you ordered range queries.
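To see why, here is a rough Python sketch of what a hashing partitioner does to key order. It only approximates the idea (an MD5 digest read as an integer); the real token computation and the byte encoding of composite keys differ, and token and key_range_is_accepted are made-up names.

```python
import hashlib

def token(row_key: str) -> int:
    """Approximate a RandomPartitioner-style token: the MD5 digest of the key as an integer."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")

def key_range_is_accepted(start_key: str, stop_key: str) -> bool:
    """Mimic the check behind the error above: start token must not sort after stop token."""
    return token(start_key) <= token(stop_key)

keys = ["0:14:1:16", "3:3:3:3", "6:3:2:19", "6:6:6:6"]
by_key = sorted(keys)               # the order you would like to range over
by_token = sorted(keys, key=token)  # the order the cluster actually stores rows in

print(by_key)
print(by_token)   # generally unrelated to the key order
print(key_range_is_accepted("3:3:3:3", "6:6:6:6"))
```

Whether a given start/stop pair is accepted depends only on how the two hashes happen to compare, which matches the "sometimes it works" behaviour in the question. Even when the range is accepted, the rows in between are those whose hashes fall in the range, not those whose keys do.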
You can do ordered ranges within a partition, which is very common, e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
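As a sketch of what an ordered range inside one partition looks like, here is a toy wide row in Python with composite column names kept sorted. WideRow is a hypothetical class for illustration, not a client API; the first three columns come from the question, the last two are made-up data.

```python
from bisect import bisect_left

class WideRow:
    """Toy wide row: composite column names (tuples) kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted list of composite column names
        self._values = {}

    def put(self, name, value):
        if name not in self._values:
            self._names.insert(bisect_left(self._names, name), name)
        self._values[name] = value

    def slice(self, start, stop):
        """Return columns whose leading components lie between start and stop (inclusive)."""
        i = bisect_left(self._names, start)
        result = []
        while i < len(self._names) and self._names[i][:len(stop)] <= stop:
            result.append((self._names[i], self._values[self._names[i]]))
            i += 1
        return result

# One wide row per store; column name = (date, time, type, prod_id, transaction_id)
row = WideRow()
row.put(("20110303", "0947", "size", "43523", "23324827346"), "050")
row.put(("20110303", "0947", "color", "43523", "23324827346"), "123")
row.put(("20110303", "0947", "qty", "43523", "23324827346"), "1")
row.put(("20110305", "1230", "qty", "23984", "23324827350"), "3")   # hypothetical
row.put(("20110401", "1300", "qty", "64233", "23324827399"), "1")   # hypothetical

# Contiguous slice on the date component, then filter the time window per day.
in_dates = row.slice(("20110301",), ("20110310",))
in_window = [(name, value) for name, value in in_dates if "1200" <= name[1] <= "1400"]
print(len(in_dates))   # 4
print(in_window)
```

Note that a single slice is one contiguous range, so "between 1200 and 1400 on every day from the 1st to the 10th" is not one slice: either slice per day or slice by date and filter the time afterwards, as done here.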