I have a sample database as below:
SNO   | Name | Address
------+------+-----------
99123 | Mike | Texas
88124 | Tom  | California
I want to use my SNO as the Elasticsearch _id so that it is easier to update documents by SNO.
Python code to create an index:
abc = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 2
    }
}
es.indices.create(index='test', body=abc)
The request body I send from Postman looks like this:
{
    "_index": "test",
    "_id": "13",
    "_data": {
        "FirstName": "Sample4",
        "LastName": "ABCDEFG",
        "Designation": "ABCDEF",
        "Salary": "99",
        "DateOfJoining": "2020-05-05",
        "Address": "ABCDE",
        "Gender": "ABCDE",
        "Age": "21",
        "MaritalStatus": "ABCDE",
        "Interests": "ABCDEF",
        "timestamp": "2020-05-05T14:42:46.394115",
        "country": "Nepal"
    }
}
And the insert code in Python is below:
req_JSON = request.json
input_index = req_JSON['_index']
input_id = req_JSON['_id']
input_data = req_JSON['_data']
doc = input_data
res = es.index(index=input_index, body=doc)
I expected the _id to remain what I supplied, but Elasticsearch generated an automatic _id instead.
You can simply do it like this, passing the id explicitly:
res = es.index(index=input_index, body=doc, id=input_id)
The added id=input_id argument is what makes Elasticsearch use your value instead of generating one.
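For context, a minimal sketch of the whole insert handler with the id passed through might look like this (assuming the same Flask-style request object and an existing Elasticsearch client named es, as in the question):

req_JSON = request.json
input_index = req_JSON['_index']
input_id = req_JSON['_id']
input_data = req_JSON['_data']
# Passing id makes Elasticsearch use the supplied value instead of auto-generating one
res = es.index(index=input_index, id=input_id, body=input_data)
print(res['_id'])  # should now echo back the supplied id, e.g. "13"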
I have a dataframe that looks as below:
items_df
======================================================
| customer item_type brand price quantity |
|====================================================|
| 1 bread reems 20 10 |
| 2 butter spencers 10 21 |
| 3 jam niles 10 22 |
| 1 bread marks 16 18 |
| 1 butter jims 19 12 |
| 1 jam jills 16 6 |
| 2 bread marks 16 18 |
======================================================
I create an rdd that converts the above to a dict:
rdd = items_df.rdd.map(lambda row: row.asDict())
The result looks like:
[
{ "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
{ "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
{ "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
{ "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
{ "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
{ "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
{ "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]
I would like to group the above rows first by customer. Then I would like to introduce custom new keys "breads", "butters", "jams" and group all these rows for that customer. So my rdd reduces from 7 rows to 3 rows.
The output would look as below:
[
    {
        "customer": 1,
        "breads": [
            {"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
        ],
        "butters": [
            {"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
        ],
        "jams": [
            {"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
        ]
    },
    {
        "customer": 2,
        "breads": [
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
        ],
        "butters": [
            {"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
        ],
        "jams": []
    },
    {
        "customer": 3,
        "breads": [],
        "butters": [],
        "jams": [
            {"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
        ]
    }
]
Would anyone know how to achieve the above using PySpark? I would like to know if there is a solution using reduceByKey() or something similar. I am hoping to avoid the use of groupByKey() if possible.
First, add a column item_types to the dataframe to pivot on.
import pyspark.sql.functions as F

items_df = items_df.withColumn('item_types', F.concat(F.col('item_type'), F.lit('s')))
items_df.show()
+--------+---------+--------+-----+--------+----------+
|customer|item_type| brand|price|quantity|item_types|
+--------+---------+--------+-----+--------+----------+
| 1| bread| reems| 20| 10| breads|
| 2| butter|spencers| 10| 21| butters|
| 3| jam| niles| 10| 22| jams|
| 1| bread| marks| 16| 18| breads|
| 1| butter| jims| 19| 12| butters|
| 1| jam| jills| 16| 6| jams|
| 2| bread| marks| 16| 18| breads|
+--------+---------+--------+-----+--------+----------+
Then you can pivot the table on the customer group and use F.collect_list() to aggregate the other columns at the same time.
items_df = items_df.groupby(['customer']).pivot("item_types").agg(
    F.collect_list(F.struct(F.col("item_type"), F.col("brand"), F.col("price"), F.col("quantity")))
).sort('customer')
items_df.show()
+--------+--------------------+--------------------+--------------------+
|customer| breads| butters| jams|
+--------+--------------------+--------------------+--------------------+
| 1|[[bread, reems, 2...|[[butter, jims, 1...|[[jam, jills, 16,...|
| 2|[[bread, marks, 1...|[[butter, spencer...| []|
| 3| []| []|[[jam, niles, 10,...|
+--------+--------------------+--------------------+--------------------+
Finally, you need to set recursive=True to convert the nested Rows into dicts.
rdd = items_df.rdd.map(lambda row: row.asDict(recursive=True))
print(rdd.take(10))
[{'customer': 1,
'breads': [{'item_type': u'bread', 'brand': u'reems', 'price': 20, 'quantity': 10},
{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'jims', 'price': 19, 'quantity': 12}],
'jams': [{'item_type': u'jam', 'brand': u'jills', 'price': 16, 'quantity': 6}]},
{'customer': 2,
'breads': [{'item_type': u'bread', 'brand': u'marks', 'price': 16, 'quantity': 18}],
'butters': [{'item_type': u'butter', 'brand': u'spencers', 'price': 10, 'quantity': 21}],
'jams': []},
{'customer': 3,
'breads': [],
'butters': [],
'jams': [{'item_type': u'jam', 'brand': u'niles', 'price': 10, 'quantity': 22}]}]
I also used another approach, with reduceByKey() on the RDD. Given the dataframe items_df, first convert it to an RDD:
rdd = items_df.rdd.map(lambda row: row.asDict())
Transform each row into a tuple (customer, [row_obj]), where the row object is wrapped in a list:
rdd = rdd.map(lambda row: ( row["customer"], [row] ) )
Group by customer using reduceByKey, where the lists are concatenated for a given customer:
rdd = rdd.reduceByKey(lambda x,y: x+y)
Transform each tuple back into a dict, where the key is the customer and the value is the list of all rows associated with that customer:
rdd = rdd.map(lambda tup: { tup[0]: tup[1] } )
Since each customer's data is now all in one row, we can segregate it into breads, butters, and jams using a custom function:
def organize_items_in_customer(row):
    cust_id = list(row.keys())[0]
    items = row[cust_id]
    new_cust_obj = {"customer": cust_id, "breads": [], "butters": [], "jams": []}
    plurals = {"bread": "breads", "butter": "butters", "jam": "jams"}
    for item in items:
        item_type = item["item_type"]
        key = plurals[item_type]
        new_cust_obj[key].append(item)
    return new_cust_obj
Call the above function to transform the rdd:
rdd = rdd.map(organize_items_in_customer)
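Putting the steps above together, a consolidated sketch of the whole reduceByKey() pipeline (assuming items_df and organize_items_in_customer are defined as shown) looks like this:

result_rdd = (
    items_df.rdd
    .map(lambda row: row.asDict())
    .map(lambda row: (row["customer"], [row]))   # (customer, [row])
    .reduceByKey(lambda x, y: x + y)             # concatenate lists per customer
    .map(lambda tup: {tup[0]: tup[1]})           # back to {customer: rows}
    .map(organize_items_in_customer)             # split into breads/butters/jams
)
print(result_rdd.collect())  # 3 records, one per customer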
I have a JSON message coming from IoT Hub like:
{
    "deviceId": "abc",
    "topic": "data",
    "data": {
        "varname1": [
            {
                "t": "timestamp1",
                "v": "value1",
                "f": "respondFrame1"
            },
            {
                "t": "timestamp2",
                "v": "value2",
                "f": "respondFrame2"
            }
        ],
        "varname2": [
            {
                "t": "timestamp1",
                "v": "value1",
                "f": "respondFrame1"
            },
            {
                "t": "timestamp2",
                "v": "value2",
                "f": "respondFrame2"
            }
        ]
    }
}
and I want an Azure Stream Analytics job to store it into a Transact-SQL table like this:
ID | deviceId | varname | timestamp | respondFrame | value
-----+------------+-----------+-------------+----------------+--------
1 | abc | varname1 | timestamp1 | respondFrame1 | value1
2 | abc | varname1 | timestamp2 | respondFrame2 | value2
3 | abc | varname2 | timestamp1 | respondFrame1 | value1
4 | abc | varname2 | timestamp2 | respondFrame2 | value2
Does anybody know how to handle these nested iterations and combine them (cross apply)?
Something like this "phantomCode" (pseudocode):
deviceId = msg.deviceId
for d in msg.data:
    for key in d:
        varname = key.name
        timestamp = key[varname].t
        frame = key[varname].f
        value = key[varname].v
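In plain Python over the parsed message, that flattening would look roughly like this (a sketch only, with msg being the parsed JSON dict; the real job of course runs inside Stream Analytics):

rows = []
device_id = msg["deviceId"]
for varname, samples in msg["data"].items():   # e.g. "varname1", "varname2"
    for sample in samples:                     # each {"t": ..., "v": ..., "f": ...}
        rows.append({
            "deviceId": device_id,
            "varname": varname,
            "timestamp": sample["t"],
            "respondFrame": sample["f"],
            "value": sample["v"],
        })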
UPDATE regarding the answer from JS (Azure Stream Analytics):
with the code
WITH datalist AS
(
    SELECT
        iotHubAlias.deviceId,
        data.PropertyName as varname,
        data.PropertyValue as arrayData
    FROM [iotHub] as iotHubAlias
    CROSS APPLY GetRecordProperties(iotHubAlias.data) AS data
    WHERE iotHubAlias.topic = 'data'
)
SELECT
    datalist.deviceId,
    datalist.varname,
    arrayElement.ArrayValue.t as [timestamp],
    arrayElement.ArrayValue.f as respondFrame,
    arrayElement.ArrayValue.v as value
INTO [temporary]
FROM datalist
CROSS APPLY GetArrayElements(datalist.arrayData) AS arrayElement
I always get an error:
{
    "channels": "Operation",
    "correlationId": "f9d4437b-707e-4892-a37b-8ad721eb1bb2",
    "description": "",
    "eventDataId": "ef5a5f2b-8c2f-49c2-91f0-16213aaa959d",
    "eventName": {
        "value": "streamingNode0",
        "localizedValue": "streamingNode0"
    },
    "category": {
        "value": "Administrative",
        "localizedValue": "Administrative"
    },
    "eventTimestamp": "2018-08-21T18:23:39.1804989Z",
    "id": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest/events/ef5a5f2b-8c2f-49c2-91f0-16213aaa959d/ticks/636704726191804989",
    "level": "Error",
    "operationId": "7a38a957-1a51-4da1-a679-eae1c7e3a65b",
    "operationName": {
        "value": "Process Events: Processing events Runtime Error",
        "localizedValue": "Process Events: Processing events Runtime Error"
    },
    "resourceGroupName": "fieldtest",
    "resourceProviderName": {
        "value": "Microsoft.StreamAnalytics",
        "localizedValue": "Microsoft.StreamAnalytics"
    },
    "resourceType": {
        "value": "Microsoft.StreamAnalytics/streamingjobs",
        "localizedValue": "Microsoft.StreamAnalytics/streamingjobs"
    },
    "resourceId": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest",
    "status": {
        "value": "Failed",
        "localizedValue": "Failed"
    },
    "subStatus": {
        "value": "",
        "localizedValue": ""
    },
    "submissionTimestamp": "2018-08-21T18:24:34.0981187Z",
    "subscriptionId": "46cd2f8f-b46b-4428-8f7b-c7d942ff745d",
    "properties": {
        "Message Time": "2018-08-21 18:23:39Z",
        "Error": "- Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n",
        "Message": "Runtime exception occurred while processing events, - Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n, : OutputSourceAlias:temporary;",
        "Type": "SqlRuntimeError",
        "Correlation ID": "f9d4437b-707e-4892-a37b-8ad721eb1bb2"
    },
    "relatedEvents": []
}
and here is an example of a real JSON message coming from a device:
{
    "topic": "data",
    "data": {
        "ExternalFlowTemperatureSensor": [
            {
                "t": "2018-08-22T11:00:11.955381",
                "v": 16.64103,
                "f": "Q6ES8KJIN1NX2DRGH36RX1WDT"
            }
        ],
        "AdaStartsP2": [
            {
                "t": "2018-08-22T11:00:12.863383",
                "v": 382.363138,
                "f": "9IY7B4DFBAMOLH3GNKRUNUQNUX"
            },
            {
                "t": "2018-08-22T11:00:54.172501",
                "v": 104.0,
                "f": "IUJMP20CYQK60B"
            }
        ],
        "s_DriftData[4].c32_ZeitLetzterTest": [
            {
                "t": "2018-08-22T11:01:01.829568",
                "v": 348.2916,
                "f": "MMTPWQVLL02CA"
            }
        ]
    },
    "deviceId": "test_3c27db"
}
and (to have it complete) the creation code for the SQL table:
create table temporary (
    id int NOT NULL IDENTITY PRIMARY KEY,
    deviceId nvarchar(20) NOT NULL,
    timestamp datetime NOT NULL,
    varname nvarchar(100) NOT NULL,
    value float,
    respondFrame nvarchar(50)
)
The following query will give you the expected output:
WITH step1 AS
(
    SELECT
        event.deviceID,
        data.PropertyName as varname,
        data.PropertyValue as arrayData
    FROM blobtest as event
    CROSS APPLY GetRecordProperties(event.data) AS data
)
SELECT
    event.deviceId,
    event.varname,
    arrayElement.ArrayValue.t as [timestamp],
    arrayElement.ArrayValue.f as frame,
    arrayElement.ArrayValue.v as value
FROM step1 as event
CROSS APPLY GetArrayElements(event.arrayData) AS arrayElement
You can find more info about JSON parsing on our documentation page "Parse JSON and Avro data in Azure Stream Analytics".
Let me know if you have any other questions.
JS (Azure Stream Analytics)
We have 200k records. When I run a search query with size: 500 for the first time, I get results in the order doc-1, doc-2, doc-3. But when I run the same search query a second time, the order changes to doc-2, doc-1, etc. Why does the search result order vary each time the same query is run?
Query : {"explain":true,"size":500,"query":{"query_string":{"query":" ( (NAME:\"BANK AMERICA\")^50 OR (Names.Name:(BANK AMERICA))^30 OR (NAME_PAIR:\"BANK AMERICA\")^30 OR (NORMAL_NAME:(BANK AMERICA) AND CITY:\"\" ) ^40 OR (NORMAL_NAME:(BANK AMERICA))^30 OR (Styles.value:\"BS\")^5 OR (NORMAL_NAME:\"BANK AMERICA\")^5 OR (address.streetName:\"\" AND CITY:\"\")^30 OR (ZIP:\"\")^6 OR (address.streetName:\"\")^6 OR (address.streetNumber:\"\" AND address.streetName:\"\")^15 OR (telephones.telephone:\"\")^50 OR (mailAddresses.postbox:\"\")^6 ) "}},"sort":[{"_score":{"order":"desc"}},{"statusIndicator":{"order":"asc"}}],"aggs":{"NAME":{"filter":{"term":{"NAME":"ATLS"}}}}}
When running the above, the results are:
"hits": {
    "total": 106421,
    "max_score": null,
    "hits": [
        {
            "_shard": 0,
            "_node": "1",
            "_index": "allocation_e1",
            "_type": "my_type",
            "_id": "217600050_826_E1",
            "_score": 2.9569159,
            "_routing": "E1",
            "_source": {
                "sample_number": 217600050,
                "countryCode": 101,
                "state": "E1",
                "name": "BANK of AMERICA Plc",
When running the same query once again, the results are:
Query : {"explain":true,"size":500,"query":{"query_string":{"query":" ( (NAME:\"BANK AMERICA\")^50 OR (Names.Name:(BANK AMERICA))^30 OR (NAME_PAIR:\"BANK AMERICA\")^30 OR (NORMAL_NAME:(BANK AMERICA) AND CITY:\"\" ) ^40 OR (NORMAL_NAME:(BANK AMERICA))^30 OR (Styles.value:\"BS\")^5 OR (NORMAL_NAME:\"BANK AMERICA\")^5 OR (address.streetName:\"\" AND CITY:\"\")^30 OR (ZIP:\"\")^6 OR (address.streetName:\"\")^6 OR (address.streetNumber:\"\" AND address.streetName:\"\")^15 OR (telephones.telephone:\"\")^50 OR (mailAddresses.postbox:\"\")^6 ) "}},"sort":[{"_score":{"order":"desc"}},{"statusIndicator":{"order":"asc"}}],"aggs":{"NAME":{"filter":{"term":{"NAME":"ATLS"}}}}}
"hits": {
    "total": 106421,
    "max_score": null,
    "hits": [
        {
            "_shard": 0,
            "_node": "1",
            "_index": "allocation_e1",
            "_type": "my_type",
            "_id": "239958846_826_E1",
            "_score": 2.9571724,
            "_routing": "E1",
            "_source": {
                "sample_number": 239958846,
                "countryCode": 101,
                "state": "E1",
                "name": "BANK of AMERICA Plc",
When running the same query, the document order differs. Why does the document order change when the same query is run again?
Please help with this, thanks in advance.
Run your queries with an additional sort, say in descending order based on a UID, and you will get the same results every time.
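As a rough sketch of the idea (the query is abbreviated, es is assumed to be an Elasticsearch client, and sample_number is just taken from the question's documents; any unique field would do as the tiebreaker):

body = {
    "size": 500,
    "query": {"query_string": {"query": "NAME:\"BANK AMERICA\""}},  # abbreviated query
    "sort": [
        {"_score": {"order": "desc"}},
        {"statusIndicator": {"order": "asc"}},
        {"sample_number": {"order": "desc"}}  # deterministic tiebreaker on a unique field
    ]
}
res = es.search(index="allocation_e1", body=body)

With equal scores and no unique tiebreaker, the relative order of tied documents is not guaranteed to be stable between runs, which is why the results appear to shuffle.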
I am trying to sort the result of an edge iteration by the 'beginAt' attribute, but it doesn't work for me.
The AQL code follows:
FOR f IN TRAVERSAL(client, careerEdges, "client/100", "outbound", {paths: true})
    LET sorted = (
        FOR e IN f.path.edges
            FILTER e.order <= 3
            SORT e.beginAt DESC
            RETURN e
    )
    RETURN sorted
and the same happens with the 'order' attribute. It always returns the same sequence, like this:
[
    [],
    [
        {
            "_id": "careerEdges/240469605275",
            "_rev": "240469605275",
            "_key": "240469605275",
            "_from": "client/100",
            "_to": "careers/iniAlt",
            "order": 2,
            "$label": "noLonger",
            "beginAt": "2014-05-10 13:48:00",
            "endAt": "2014-07-20 13:48:00"
        }
    ],
    [
        {
            "_id": "careerEdges/240470064027",
            "_rev": "240470064027",
            "_key": "240470064027",
            "_from": "client/100",
            "_to": "careers/lidGru",
            "order": 3,
            "$label": "noLonger",
            "beginAt": "2014-07-20 13:48:00",
            "endAt": "2014-08-20 13:48:00"
        }
    ],
    [
        {
            "_id": "careerEdges/240469867419",
            "_rev": "240469867419",
            "_key": "240469867419",
            "_from": "client/100",
            "_to": "careers/iniEst",
            "endAt": null,
            "order": 1,
            "$label": "noLonger",
            "beginAt": "2014-06-10 13:48:00"
        }
    ]
]
Is my query correct?
Your query is producing a list of lists. The inner lists will be sorted by beginAt, but not the overall result.
If you want a flat list returned and sort it by some criterion, please try this instead:
FOR f IN TRAVERSAL(client, careerEdges, "client/100", "outbound", {paths: true})
    FOR e IN f.path.edges
        FILTER e.order <= 3
        SORT e.beginAt DESC
        RETURN e