Aggregate on inner object array - azure

I am trying to write a query by performing an aggregate on one of its properties that is an array of objects. As an example in the below json structure I want the country and the biggest airport as two columns in the output
[
{
"Country": "US",
"Airports": [
{
"Name": "Kodiak Airport",
"Area": "100"
},
{
"Name": "Homer Airport",
"Area": "87"
}
]
},
{
"Country": "Mexico",
"Airports": [
{
"Name": "Gulfport-Biloxi International Airport",
"Area": "94"
},
{
"Name": "El Paso International Airport",
"Area": "68"
}
]
}
]
so the reuslt will be 2 columns, country name and biggest airport's name as below:
Country Airport
US Kodiak Airport
Mexico Gulfport-Biloxi International Airport.
The following query returns country and the first airport's name in the array airports_s.
MyLogs_CL
| project country_s, Airports = todynamic(airports_s)
| project country_s, Airports[0].name
But I don't how to perform an aggregate on that array and return the object which has the highest area among them.

Related

Cosmos Db: How to query for the maximum value of a property in an array of arrays?

I'm not sure how to query when using CosmosDb as I'm used to SQL. My question is about how to get the maximum value of a property in an array of arrays. I've been trying subqueries so far but apparently I don't understand very well how they work.
In an structure such as the one below, how do I query the city with more population among all states using the Data Explorer in Azure:
{
"id": 1,
"states": [
{
"name": "New York",
"cities": [
{
"name": "New York",
"population": 8500000
},
{
"name": "Hempstead",
"population": 750000
},
{
"name": "Brookhaven",
"population": 500000
}
]
},
{
"name": "California",
"cities":[
{
"name": "Los Angeles",
"population": 4000000
},
{
"name": "San Diego",
"population": 1400000
},
{
"name": "San Jose",
"population": 1000000
}
]
}
]
}
This is currently not possible as far as I know.
It would look a bit like this:
SELECT TOP 1 state.name as stateName, city.name as cityName, city.population FROM c
join state in c.states
join city in state.cities
--order by city.population desc <-- this does not work in this case
You could write a user defined function that will allow you to write the query you probably expect, similar to this: CosmosDB sort results by a value into an array
The result could look like:
SELECT c.name, udf.OnlyMaxPop(c.states) FROM c
function OnlyMaxPop(states){
function compareStates(stateA,stateB){
stateB.cities[0].poplulation - stateA.cities[0].population;
}
onlywithOneCity = states.map(s => {
maxpop = Math.max.apply(Math, s.cities.map(o => o.population));
return {
name: s.name,
cities: s.cities.filter(x => x.population === maxpop)
}
});
return onlywithOneCity.sort(compareStates)[0];
}
You would probably need to adapt the function to your exact query needs, but I am not certain what your desired result would look like.

PySpark Dataframe to Json - grouping data

We are trying to create a json from a dataframe. Please find the dataframe below,
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
| CustId| TIN|EntityType| EntityAttributes|AddressPreference| AddressDetails|EmailPreference| EmailDetails|PhonePreference| PhoneDetails| MemberDetails|
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Alternate|[{Home, 460 M XXX...| Primary|[{Home, HEREBY...| Alternate|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Alternate|[{Home, 460 M XXX...| Primary|[{Home, HEREBY...| Primary|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Primary|[{Home, PO BOX 695020...| Primary|[{Home, HEREBY...| Alternate|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Primary|[{Home, PO BOX 695020...| Primary|[{Home, HEREBY...| Primary|[{Home, {88888888...|[{7777777, 999999...|
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
So the initial columns custid, TIN, Entitytype,EntityAttributes will be same for a particular customer, say 1234567890 in our example. But he might be having multiple addresses/phone/email. Could you please help us on how to group them under 1 json.
Expected Structure :
{
"CustId": 1234567890,
"TIN": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"EntityType": "Person",
"EntityAttributes": [
{
"FirstName": "PRINCESS",
"LastName": "XXXXXX",
"BirthDate": "xxxx-xx-xx",
"DeceasedFlag": "False"
}
],
"Address": [
{
"AddressPreference": "Alternate",
"AddressDetails": {
"AddressType": "Home",
"Address1": "460",
"City": "XXXX",
"State": "XXX",
"Zip": "XXXX"
}
},
{
"AddressPreference": "Primary",
"AddressDetails": {
"AddressType": "Home",
"Address1": "PO BOX 695020",
"City": "XXX",
"State": "XXXX",
"Zip": "695020",
}
}
],
"Phone": [
{
"PhonePreference": "Primary",
"PhoneDetails": {
"PhoneType": "Home",
"PhoneNumber": "xxxxx",
"FormatPhoneNumber": "xxxxxx"
}
},
{
"PhonePreference": "Alternate",
"PhoneDetails": {
"PhoneType": "Home",
"PhoneNumber": "xxxx",
"FormatPhoneNumber": "xxxxx"
}
},
{
],
"Email": [
{
"EmailPreference": "Primary",
"EmailDetails": {
"EmailType": "Home",
"EmailAddress": "xxxxxxx#GMAIL.COM"
}
}
],
}
]
}
UPDATE
Tried with the below recommended group by method, it ended up giving 1 customer details, but the email is repeated 4 times in the list. Ideally it should be having only 1 email. Also In the Address Preference Alternate has 1 address and primary has 1 address, but the Alternate shows 2 entries and primary shows 2. Could you please help with an ideal solution.
Probably this should work. id is like a custid in your example which has repeating values.
>>> df.show()
+----+------------+----------+
| id| address| email|
+----+------------+----------+
|1001| address-a| email-a|
|1001| address-b| email-b|
|1002|address-1002|email-1002|
|1003|address-1003|email-1002|
|1002| address-c| email-2|
+----+------------+----------+
Aggregate on those repeating columns and then convert to JSON
>>> results = df.groupBy("id").agg(collect_list("address").alias("address"),collect_list("email").alias("email")).toJSON().collect()
>>> for i in results: print(i)
...
{"id":"1003","address":["address-1003"],"email":["email-1002"]}
{"id":"1002","address":["address-1002","address-c"],"email":["email-1002","email-2"]}
{"id":"1001","address":["address-a","address-b"],"email":["email-a","email-b"]}

How to find common struct for all documents in collection?

I have an array of documents, that have more or less same structure. But I need find fields that present in all documents. Somethink like:
{
"name": "Jow",
"salary": 7000,
"age": 25,
"city": "Mumbai"
},
{
"name": "Mike",
"backname": "Brown",
"sex": "male",
"city": "Minks",
"age": 30
},
{
"name": "Piter",
"hobby": "footbol",
"age": 25,
"location": "USA"
},
{
"name": "Maria",
"age": 22,
"city": "Paris"
},
All docs have name and age. How to find them with ArangoDB?
You could do the following:
Retrieve the attribute names of each document
Get the intersection of those attributes
i.e.
LET attrs = (FOR item IN test RETURN ATTRIBUTES(item, true))
RETURN APPLY("INTERSECTION", attrs)
APPLY is necessary so each list of attributes in attrs can be passed as a separate parameter to INTERSECTION.
Documentation:
ATTRIBUTES: https://www.arangodb.com/docs/stable/aql/functions-document.html#attributes
INTERSECTION: https://www.arangodb.com/docs/stable/aql/functions-array.html#intersection
APPLY: https://www.arangodb.com/docs/stable/aql/functions-miscellaneous.html#apply

Return distinct and sorted query in AQL

So I have two collections, one with cities with an array of postal codes as a property and one with postal codes and their latitude & longitude.
I want to return the cities closest to a coordinate. This is easy enough with a geo index but the issue I'm having is the same city being returned multiple times and some times it can be the 1st and 3rd closest because the postal code that I'm searching in bordering another city.
cities example data:
[
{
"_key": "30936019",
"_id": "cities/30936019",
"_rev": "30936019",
"countryCode": "US",
"label": "Colorado Springs, CO",
"name": "Colorado Springs",
"postalCodes": [
"80904",
"80927"
],
"region": "CO"
},
{
"_key": "30983621",
"_id": "cities/30983621",
"_rev": "30983621",
"countryCode": "US",
"label": "Manitou Springs, CO",
"name": "Manitou Springs",
"postalCodes": [
"80829"
],
"region": "CO"
}
]
postalCodes example data:
[
{
"_key": "32132856",
"_id": "postalCodes/32132856",
"_rev": "32132856",
"countryCode": "US",
"location": [
38.9286,
-104.6583
],
"postalCode": "80927"
},
{
"_key": "32147422",
"_id": "postalCodes/32147422",
"_rev": "32147422",
"countryCode": "US",
"location": [
38.8533,
-104.8595
],
"postalCode": "80904"
},
{
"_key": "32172144",
"_id": "postalCodes/32172144",
"_rev": "32172144",
"countryCode": "US",
"location": [
38.855,
-104.9058
],
"postalCode": "80829"
}
]
The following query works but as an ArangoDB newbie I'm wondering if there's a more efficient way to do this:
FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
FOR c IN cities
FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
COLLECT close = c._id AGGREGATE distance = MIN(p.distance)
FOR c2 IN cities
FILTER c2._id == close
SORT distance
RETURN c2
The first FOR in the query will use the geo index and probably return few documents (just the postal codes around the specified location).
The second FOR will look up the city for each found postal code. This may be an issue, depending on whether there is an index present on cities.postalCodes and cities.countryCode. If not, then the second FOR has to do a full scan of the cities collection each time it is involved. This will be inefficient. It may therefore be create an index on the two attributes like this:
db.cities.ensureIndex({ type: "hash", fields: ["countryCode", "postalCodes[*]"] });
The third FOR can be removed entirely when not COLLECTing by c._id but by c:
FOR p IN WITHIN(postalCodes, 38.8609, -104.8734, 30000, 'distance')
FOR c IN cities
FILTER p.postalCode IN c.postalCodes AND c.countryCode == p.countryCode
COLLECT city = c AGGREGATE distance = MIN(p.distance)
SORT distance
RETURN city
This will shorten the query string, but it may not help efficiency much I think, as the third FOR will use the primary index to look up the city documents, which is O(1).
In general, when in doubt about a query using indexes, you can use db._explain(queryString) to show which indexes will be used by a query.

Update inner object in arangodb

I have an object stored in arangodb which has additional inner objects, my current use case requires that I update just one of the elements.
Store Object
{
"status": "Active",
"physicalCode": "99999",
"postalCode": "999999",
"tradingCurrency": "USD",
"taxRate": "14",
"priceVatInclusive": "No",
"type": "eCommerce",
"name": "John and Sons inc",
"description": "John and Sons inc",
"createdDate": "2015-05-25T11:04:14+0200",
"modifiedDate": "2015-05-25T11:04:14+0200",
"physicalAddress": "Corner moon and space 9 station",
"postalAddress": "PO Box 44757553",
"physicalCountry": "Mars Sector 9",
"postalCountry": "Mars Sector 9",
"createdBy": "john.doe",
"modifiedBy": "john.doe",
"users": [
{
"id": "577458630580",
"username": "john.doe"
}
],
"products": [
{
"sellingPrice": "95.00",
"inStock": "10",
"name": "School Shirt Green",
"code": "SKITO2939999995",
"warehouseId": "723468998682"
},
{
"sellingPrice": "95.00",
"inStock": "5",
"name": "School Shirt Red",
"code": "SKITO245454949495",
"warehouseId": "723468998682"
},
{
"sellingPrice": "95.00",
"inStock": "10",
"discount": "5%",
"name": "School Shirt Blue",
"code": "SKITO293949495",
"warehouseId": "723468998682"
}
]
}
I want to change just one of the products stock value
{
"sellingPrice": "95.00",
"inStock": "10",
"discount": "5%",
"name": "School Shirt Blue",
"code": "SKITO293949495",
"warehouseId": "723468998682"
}
Like update store product stock less 1 where store id = x, something to this effect
FOR store IN stores
FILTER store._key == "837108415472"
FOR product IN store.products
FILTER product.code == "SKITO293949495"
UPDATE product WITH { inStock: (product.inStock - 1) } IN store.products
Apart from the above possibly it makes sense to store product as a separate document in collection store_products. I believe in NOSQL that is the best approach to reduce document size.
Found answer
here arangodb-aql-update-single-object-in-embedded-array and there
arangodb-aql-update-for-internal-field-of-object
I however believe it is best to maintain separate documents and rather use joins when retrieving. Updates easily

Resources