Unnesting map values as individual columns in Athena / Presto

My question is somewhat similar to this one (Athena/Presto - UNNEST MAP to columns), but in my case I know what columns I need beforehand.
My use case is this:
I have a JSON blob that contains the following structure
{
  "reqId" : "1234",
  "clientId" : "client",
  "response" : [
    {
      "name" : "Susan",
      "projects" : [
        {
          "name" : "project1",
          "completed" : true
        },
        {
          "name" : "project2",
          "completed" : false
        }
      ]
    },
    {
      "name" : "Adams",
      "projects" : [
        {
          "name" : "project1",
          "completed" : true
        },
        {
          "name" : "project2",
          "completed" : false
        }
      ]
    }
  ]
}
I need to create a view that returns output something like this:
name  | project  | completed
------+----------+-----------
Susan | project1 | true
Susan | project2 | false
Adams | project1 | true
Adams | project2 | false
I tried the following approach, among others; this one was the closest I could get:
WITH dataset AS (
SELECT 'Susan' as name, transform(filter(CAST(json_extract('{
"projects": [{"name":"project1", "completed":false}, {"name":"project3", "completed":false},
{"name":"project2", "completed":true}]}', '$.projects') AS ARRAY<MAP<VARCHAR, VARCHAR>>), p -> (p['name'] != 'project1')), p -> ROW(map_values(p))) AS projects
)
SELECT * from dataset
CROSS JOIN UNNEST(projects)
This is the output I am getting
name projects _col2
1 Susan [{field0=[project3, false]}, {field0=[project2, true]}] {field0=[project3, false]}
2 Susan [{field0=[project3, false]}, {field0=[project2, true]}] {field0=[project2, true]}
I basically want to unnest the key-value pairs of my map as separate columns. How do I do this in presto / Athena?

Your JSON example seems to be invalid; it is missing a , after "name" : "Susan" and "name" : "Adams". Besides that, you can achieve your expected output with the query below. You need to UNNEST twice, and it also requires some casting:
with dataset as
(
select json_parse('{"reqId" : "1234","clientId" : "client","response" : [{"name" : "Susan","projects" : [{"name" : "project1","completed" : true},{"name" : "project2","completed" : false}]},{"name" : "Adams","projects" : [{"name" : "project1","completed" : true},{"name" : "project2","completed" : false}]}]}') as json_col
)
,unnest_response as
(
select *
from dataset
cross join UNNEST(cast(json_extract(json_col, '$.response') as array<JSON>)) as t (response)
)
select
json_extract_scalar(response, '$.name') name,
json_extract_scalar(project, '$.name') project_name,
json_extract_scalar(project, '$.completed') project_completed
from unnest_response
cross join UNNEST(cast(json_extract(response, '$.projects') as array<JSON>)) as t (project);
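Since the question asks for a view, the same query can be wrapped in CREATE OR REPLACE VIEW once the blob lives in a table. A minimal sketch, assuming a hypothetical table raw_events that stores the JSON blob in a varchar column json_col:
CREATE OR REPLACE VIEW project_status AS
WITH unnest_response AS (
  -- raw_events / json_col are hypothetical names for the source table and blob column
  SELECT response
  FROM raw_events
  CROSS JOIN UNNEST(CAST(json_extract(json_parse(json_col), '$.response') AS ARRAY<JSON>)) AS t (response)
)
SELECT
  json_extract_scalar(response, '$.name')     AS name,
  json_extract_scalar(project, '$.name')      AS project,
  json_extract_scalar(project, '$.completed') AS completed
FROM unnest_response
CROSS JOIN UNNEST(CAST(json_extract(response, '$.projects') AS ARRAY<JSON>)) AS t (project)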

Related

Arangodb AQL Joining, merging, embedding nested three collections or more

I have the following collections, based on the example in the ArangoDB docs here, but I have added a third collection called regions:
Users
{
"name" : {
"first" : "John",
"last" : "Doe"
},
"city" : "cities/2241300989",
"_id" : "users/2290649597",
"_rev" : "2290649597",
"_key" : "2290649597"
}
Cities
{
"population" : 1000,
"name" : "Metropolis",
"region" : "regions/2282300990",
"_id" : "cities/2241300989",
"_rev" : "2241300989",
"_key" : "2241300989"
}
Regions
{
"name" : "SomeRegion1",
"_id" : "regions/2282300990",
"_rev" : "2282300990",
"_key" : "2282300990"
}
I want to have a target result like this
[
{
"user" : {
"name" : {
"first" : "John",
"last" : "Doe"
},
"_id" : "users/2290649597",
"_rev" : "2290649597",
"_key" : "2290649597"
},
"city" : {
"population" : 1000,
"name" : "Metropolis",
"_id" : "cities/2241300989",
"_rev" : "2241300989",
"_key" : "2241300989",
"region" : {
"name" : "SomeRegion1",
"_id" : "regions/2282300990",
"_rev" : "2282300990",
"_key" : "2282300990"
}
}
}
]
The example in the ArangoDB docs here only has queries for two collections:
FOR u IN users
FOR c IN cities
FILTER u.city == c._id RETURN merge(u, {city: c})
// However I want to have more than two collections, e.g.
FOR u IN users
FOR c IN cities
FOR r IN regions
FILTER u.city == c._id AND c.region == r._id RETURN merge(????????)
How would you get the result with three collections joined as above? What happens if I want a fourth nested one?
When you store a document _id that references another collection, you can leverage the DOCUMENT() AQL function.
So your AQL query becomes a bit simpler, like this:
FOR u IN users
  LET city = DOCUMENT(u.city)
  LET city_with_region = MERGE(city, { region: DOCUMENT(city.region) })
  RETURN MERGE(u, { city: city_with_region })
This query could be collapsed even more, but I left it like this so it's more self-documenting.
What is cool about DOCUMENT is that you can return just a single attribute of a document, such as LET region_name = DOCUMENT(city.region).name.
I've also found that in most cases it's more performant than doing a subquery to locate the document.
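The fourth-level part of the question works the same way: each extra level is just one more DOCUMENT lookup. A minimal sketch, assuming regions in turn stored a hypothetical country reference:
FOR u IN users
  LET city = DOCUMENT(u.city)
  LET region = DOCUMENT(city.region)
  LET country = DOCUMENT(region.country)  /* "country" is a hypothetical fourth-level reference */
  RETURN MERGE(u, { city: MERGE(city, { region: MERGE(region, { country: country }) }) })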
Probably something like this:
FOR u IN users
FOR c IN cities
FOR r IN regions
FILTER u.city == c._id AND c.region == r._id
RETURN { user: u, city: MERGE(c, { region: r }) }
Is there a particular reason why you store ids instead of keys to refer to cities and regions? The _id is just a virtual field that consists of the _key prefixed by the collection name (plus a slash). So this would work just as well (I intentionally omit the internal _id and _rev fields):
Users
{
"name" : {
"first" : "John",
"last" : "Doe"
},
"city" : "2241300989",
"_key" : "2290649597"
}
Cities
{
"population" : 1000,
"name" : "Metropolis",
"region" : "2282300990",
"_key" : "2241300989"
}
Regions
{
"name" : "SomeRegion1",
"_key" : "2282300990"
}
FOR u IN users
FOR c IN cities
FOR r IN regions
FILTER u.city == c._key AND c.region == r._key
RETURN { user: u, city: MERGE(c, { region: r }) }

Accessing unknown levels of PSCustomObject nested object

I'm getting a response from an API with an unknown number of nested levels of properties; this is an example:
affects_rating : True
assets : {@{asset=xxxxxxxxxxxxx; identifier=; category=low; importance=0.0; is_ip=True}}
details : @{check_pass=; diligence_annotations=; geo_ip_location=NL; grade=GOOD; remediations=System.Object[]; vulnerabilities=System.Object[]; dest_port=443; rollup_end_date=2021-06-06;
rollup_start_date=2020-03-18}
evidence_key : xxxxxxxxx:xxxx
first_seen : 2020-03-18
last_seen : 2021-06-06
related_findings : {}
risk_category : Diligence
risk_vector : open_ports
risk_vector_label : Open Ports
rolledup_observation_id : xxxx-xxxx==
severity : 1.0
severity_category : minor
tags : {}
asset_overrides : {}
duration :
comments :
remaining_decay : 59
temporary_id : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
affects_rating : True
assets : {@{asset=xxxx.xxxx.com; identifier=; category=low; importance=0.0002340946; is_ip=False}, @{asset=xxxx.xxxx.com; identifier=; category=critical; importance=0.45131093; is_ip=False},
So far I've tried to access each value with a table, but sometimes a record contains a nested object, which gets written to the CSV file as System.Object[].
foreach ($item in $findings.results) {
$tabledata = [ordered]@{
temporary_id = $item.temporary_id
affects_rating = $item.affects_rating
asset = $item.assets.asset
asset_identifier = $item.assets.identifier
asset_category = $item.assets.category
asset_importance = $item.assets.importance
asset_is_ip = $item.assets.is_ip
modal_data = $item.details.diligence_annotations.modal_data
modal_tags = $item.details.diligence_annotations.modal_tags
server = $item.details.diligence_annotations.server
}
}
The type of the variable $findings is a PSCustomObject
PS C:\Users\bryanar> $findings.GetType()
IsPublic IsSerial Name BaseType
-------- -------- ---- --------
True False PSCustomObject System.Object
Any recommendations?
Let's say I have an object $result full of properties, lists, or other objects:
$result = [pscustomobject]@{Name='Foo';Level='1'
Object= [pscustomobject]@{Name='Bar';Level='2'
List= @(
[pscustomobject]@{User='John Smith';Dept='Accounting'},
[pscustomobject]@{User='Bob Smith';Dept='Accounting'}
)}}
$result | fl
Name : Foo
Level : 1
Object : @{Name=Bar; Level=2; List=System.Object[]}
If you just want to see the whole object for troubleshooting/exploring purposes, I find the easiest way is to convert it to JSON or XML:
$result | ConvertTo-Json -Depth 10
{
"Name": "Foo",
"Level": "1",
"Object": {
"Name": "Bar",
"Level": "2",
"List": [
{
"User": "John Smith",
"Dept": "Accounting"
},
{
"User": "Bob Smith",
"Dept": "Accounting"
}
]
}
}
If you want to save an object like this, use Export-CLIXML instead of CSV. It's very verbose, but great for when you need to re-use an object since it keeps the type information.
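For example (a minimal sketch, reusing the $result object from above), the nested structure survives a round trip through CLIXML:
# Export keeps nested objects and type information; CSV would flatten them to System.Object[].
$result | Export-Clixml -Path .\result.xml
# Re-import later and the nested properties are still navigable.
$restored = Import-Clixml -Path .\result.xml
$restored.Object.List[0].User   # -> John Smith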

Compare two Collections in MongoDB and show the differences

I'm trying to compare two collections in MongoDB. I have collection A and collection B, and I only want to show the differences. How is this done? I thought it could be done with the aggregation framework, but I did not get the expected values. I just want to see which documents in collection A are not the same as in collection B.
Collection: A
{
"_id" : ObjectId("x"),
"p" : [
{
"t" : 1,
"p" : 123
},
{
"t" : 2,
"p" : 123
}
]
},
{
"_id" : ObjectId("y"),
"p" : [
{
"t" : 1,
"p" : 234
},
{
"t" : 2,
"p" : 234
}
]
}
Collection: B
{
"_id" : ObjectId("x"),
"p" : [
{
"t" : 1,
"p" : 123
},
{
"t" : 2,
"p" : 538458 // OTHER VALUE HERE
}
]
},
{
"_id" : ObjectId("y"),
"p" : [
{
"t" : 1,
"p" : 234
},
{
"t" : 2,
"p" : 234
}
]
}
You could export each collection using mongoexport. This will create a file with all the documents, but make sure you omit the _id (the documents may be identical but will have different ids):
mongoexport --db db_name --collection collection_name | sed '/"_id":/s/"_id":[^,]*,//' > file_name.json
Then you can compare the two files using diff.
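Putting it together, a minimal sketch assuming a database named db_name with the two collections A and B:
# Export both collections with the _id field stripped, then diff the results.
mongoexport --db db_name --collection A | sed '/"_id":/s/"_id":[^,]*,//' > A.json
mongoexport --db db_name --collection B | sed '/"_id":/s/"_id":[^,]*,//' > B.json
diff A.json B.json
If the export order differs between the two collections, sort both files first (e.g. sort A.json > A.sorted.json) so diff only reports real content differences.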

Mongodb sort with case insensitive manner

I am stuck on a Node.js (Express) project with MongoDB as the database. When I fetch all the data using sort(), it comes back in the wrong order. Is there a way to get it in the order I expect, as shown below?
If we have three records in the DB:
id | Name   | age
---+--------+----
1  | atul   | 21
2  | Bhavik | 22
3  | Jay    | 25
What I am getting at present is the data in the order 2, 3, 1.
What I expect is 1, 2, 3.
In other words, I want the sort to ignore case. Is that possible without adding a new column?
You need to use collation here with locale: "en"
db.collection.find({}).collation({ locale: "en" }).sort({ name: 1 })
So for the documents below
{ "_id" : 1, "name" : "Bhavik" }
{ "_id" : 2, "name" : "Jay" }
{ "_id" : 3, "name" : "atul" }
You will get
{ "_id" : 3, "name" : "atul" }
{ "_id" : 1, "name" : "Bhavik" }
{ "_id" : 2, "name" : "Jay" }
Create the collection with a default collation; this way you can order by any property case-insensitively.
db.createCollection("collection_name", { collation: { locale: 'en_US', strength: 2 } } )
db.getCollection('collection_name').find({}).sort( { 'property_name': -1 } )
More info: https://docs.mongodb.com/manual/core/index-case-insensitive/
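From that page, the index-based variant looks roughly like this (a sketch; the query has to specify the same collation as the index for the index to be used):
// Case-insensitive index: strength 2 compares base characters and diacritics but ignores case.
db.collection_name.createIndex({ name: 1 }, { collation: { locale: 'en', strength: 2 } })
// The find must use the matching collation, otherwise the index is not selected.
db.collection_name.find({}).collation({ locale: 'en', strength: 2 }).sort({ name: 1 })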
You can pass collation: { locale: 'en' } directly in the options parameter of the find method:
db.collection.find({ ...query }, {
  sort: ...,
  limit: ...,
  collation: { locale: 'en' }
})
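Since the question mentions Node.js, the same collation option also works through the official driver; a minimal sketch with hypothetical database, collection, and connection names:
// Minimal sketch with the official mongodb Node.js driver; 'mydb' / 'users' are hypothetical.
const { MongoClient } = require('mongodb');

async function listUsersSorted() {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    return await client.db('mydb').collection('users')
      .find({})
      .collation({ locale: 'en', strength: 2 }) // case-insensitive comparison
      .sort({ name: 1 })
      .toArray();
  } finally {
    await client.close();
  }
}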

cassandra-cli 'list' in cassandra 3.0

I want to view the "rowkey" with its stored data in Cassandra 3.0. I know the deprecated cassandra-cli had the 'list' command. However, in Cassandra 3.0, I cannot find the replacement for the 'list' command. Does anyone know the new CLI command for 'list'?
You can use the sstabledump utility as @chris-lohfink suggested. How to use it? Create a keyspace and a table in it, and populate some data:
cqlsh> CREATE KEYSPACE IF NOT EXISTS minetest WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
cqlsh> CREATE TABLE object_coordinates (
... object_id int PRIMARY KEY,
... coordinate text
... );
cqlsh> use minetest;
cqlsh:minetest> insert into object_coordinates (object_id, coordinate) values (564682,'59.8505,34.0035');
cqlsh:minetest> insert into object_coordinates (object_id, coordinate) values (1235,'61.7814,40.3316');
cqlsh:minetest> select object_id, coordinate, writetime(coordinate) from object_coordinates;
object_id | coordinate | writetime(coordinate)
-----------+-----------------+-----------------------
1235 | 61.7814,40.3316 | 1480436931275615
564682 | 59.8505,34.0035 | 1480436927707627
(2 rows)
object_id is the primary (partition) key; coordinate is a regular column.
Flush changes to disk:
# nodetool flush
Find sstable on disk and analyze it:
# cd /var/lib/cassandra/data/minetest/object_coordinates-e19d4c40b65011e68563f1a7ec2d3d77
# ls
backups mc-1-big-CompressionInfo.db mc-1-big-Data.db mc-1-big-Digest.crc32 mc-1-big-Filter.db mc-1-big-Index.db mc-1-big-Statistics.db mc-1-big-Summary.db mc-1-big-TOC.txt
# sstabledump mc-1-big-Data.db
[
{
"partition" : {
"key" : [ "1235" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 18,
"liveness_info" : { "tstamp" : "2016-11-29T16:28:51.275615Z" },
"cells" : [
{ "name" : "coordinate", "value" : "61.7814,40.3316" }
]
}
]
},
{
"partition" : {
"key" : [ "564682" ],
"position" : 43
},
"rows" : [
{
"type" : "row",
"position" : 61,
"liveness_info" : { "tstamp" : "2016-11-29T16:28:47.707627Z" },
"cells" : [
{ "name" : "coordinate", "value" : "59.8505,34.0035" }
]
}
]
}
]
Or with -d flag:
# sstabledump mc-1-big-Data.db -d
[1235]@0 Row[info=[ts=1480436931275615] ]: | [coordinate=61.7814,40.3316 ts=1480436931275615]
[564682]@43 Row[info=[ts=1480436927707627] ]: | [coordinate=59.8505,34.0035 ts=1480436927707627]
The output shows the partitions 1235 and 564682 and the coordinates saved in those partitions.
Link to doc http://www.datastax.com/dev/blog/debugging-sstables-in-3-0-with-sstabledump
PS: sstabledump is provided by the cassandra-tools package on Ubuntu.
