Python: How to populate nested dict with attributes and values from parsed XML string? - python-3.x

I have a dict containing identifiers as keys, with an XML string as their respective values. I want to parse the attributes and values from the XML and automagically populate a dict with them, under their respective identifier keys.
import xml.etree.ElementTree as etree
employees = {
'employee_0': '<Person><Attribute name="name"><Value>Bill Johnson</Value></Attribute><Attribute name="city"><Value>New York</Value></Attribute><Attribute name="email"><Value>bill.johnson#email.com</Value></Attribute></Person>',
'employee_1': '<Person><Attribute name="name"><Value>Amanda Philips</Value></Attribute><Attribute name="city"><Value>Los Angeles</Value></Attribute><Attribute name="email"><Value>amanda.philips#email.com</Value></Attribute></Person>'
}
for identifier_key in employees:
    xml = etree.fromstring(employees[identifier_key])
    for key in xml:
        key_str = key.attrib["name"]
        for value in key:
            value_str = value.text
            employees[identifier_key][key_str] = value_str
I want the employees dict to result in this:
{
"employee_0": {
"name": "Bill Johnson",
"city": "New York",
"email": "bill.johnson#email.com"
},
"employee_1": {
"name": "Amanda Philips",
"city": "Los Angeles",
"email": "amanda.philips#email.com"
}
}
But in the code above, we get a TypeError: 'str' object does not support item assignment. My questions are:
Why do we get this error? It seems like this should be the proper way to populate the dict. If I instead use employees[identifier_key] = { key_str: value_str } it will overwrite the previous iteration. I have tried .update() too, without luck. How can this operation be accomplished?
How can the operation be accomplished in a nice and clean way, e.g. using dict comprehension? I'm having difficulty putting together the syntax for it.

Another method.
from simplified_scrapy import SimplifiedDoc
employees = {
'employee_0': '<Person><Attribute name="name"><Value>Bill Johnson</Value></Attribute><Attribute name="city"><Value>New York</Value></Attribute><Attribute name="email"><Value>bill.johnson#email.com</Value></Attribute></Person>',
'employee_1': '<Person><Attribute name="name"><Value>Amanda Philips</Value></Attribute><Attribute name="city"><Value>Los Angeles</Value></Attribute><Attribute name="email"><Value>amanda.philips#email.com</Value></Attribute></Person>'
}
for identifier_key in employees:
    dic = {}
    xml = SimplifiedDoc(employees[identifier_key])
    for attr in xml.Attributes:
        dic[attr['name']] = attr.text
    employees[identifier_key] = dic
print(employees)
Result:
{'employee_0': {'name': 'Bill Johnson', 'city': 'New York', 'email': 'bill.johnson#email.com'}, 'employee_1': {'name': 'Amanda Philips', 'city': 'Los Angeles', 'email': 'amanda.philips#email.com'}}
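The same result can also be had with only the standard library. The TypeError in the question arises because employees[identifier_key] is still the XML string when the item assignment runs, and strings do not support item assignment (nor do they have an .update() method). Building a fresh dict per employee avoids that. Below is a minimal sketch using a nested dict comprehension; the name parsed is an illustrative choice and the sample data is shortened.

```python
import xml.etree.ElementTree as etree

employees = {
    "employee_0": (
        '<Person><Attribute name="name"><Value>Bill Johnson</Value></Attribute>'
        '<Attribute name="city"><Value>New York</Value></Attribute></Person>'
    ),
    "employee_1": (
        '<Person><Attribute name="name"><Value>Amanda Philips</Value></Attribute>'
        '<Attribute name="city"><Value>Los Angeles</Value></Attribute></Person>'
    ),
}

# Outer comprehension: one entry per employee identifier.
# Inner comprehension: one entry per <Attribute> child of <Person>,
# keyed by its "name" attribute, with the text of its <Value> child.
parsed = {
    identifier: {
        attribute.attrib["name"]: attribute.findtext("Value")
        for attribute in etree.fromstring(xml_string)
    }
    for identifier, xml_string in employees.items()
}

print(parsed)
```

Writing into a new dict rather than back into employees also sidesteps any question about modifying a dict while iterating over it.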

Related

python regex usage: how to start with , least match , get content in middle [duplicate]

I wrote some code to get data from a web API. I was able to parse the JSON data from the API, but the result I get looks quite complex. Here is one example:
>>> my_json
{'name': 'ns1:timeSeriesResponseType', 'declaredType': 'org.cuahsi.waterml.TimeSeriesResponseType', 'scope': 'javax.xml.bind.JAXBElement$GlobalScope', 'value': {'queryInfo': {'creationTime': 1349724919000, 'queryURL': 'http://waterservices.usgs.gov/nwis/iv/', 'criteria': {'locationParam': '[ALL:103232434]', 'variableParam': '[00060, 00065]'}, 'note': [{'value': '[ALL:103232434]', 'title': 'filter:sites'}, {'value': '[mode=LATEST, modifiedSince=null]', 'title': 'filter:timeRange'}, {'value': 'sdas01', 'title': 'server'}]}}, 'nil': False, 'globalScope': True, 'typeSubstituted': False}
Looking through this data, I can see the specific data I want: the 1349724919000 value that is labelled as 'creationTime'.
How can I write code that directly gets this value?
I don't need any searching logic to find this value. I can see what I need when I look at the response; I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way. I read some tutorials, so I understand that I need to use [] to access elements of the nested lists and dictionaries; but I can't figure out exactly how it works for a complex case.
More generally, how can I figure out what the "path" is to the data, and write the code for it?
For reference, let's see what the original JSON would look like, with pretty formatting:
>>> print(json.dumps(my_json, indent=4))
{
    "name": "ns1:timeSeriesResponseType",
    "declaredType": "org.cuahsi.waterml.TimeSeriesResponseType",
    "scope": "javax.xml.bind.JAXBElement$GlobalScope",
    "value": {
        "queryInfo": {
            "creationTime": 1349724919000,
            "queryURL": "http://waterservices.usgs.gov/nwis/iv/",
            "criteria": {
                "locationParam": "[ALL:103232434]",
                "variableParam": "[00060, 00065]"
            },
            "note": [
                {
                    "value": "[ALL:103232434]",
                    "title": "filter:sites"
                },
                {
                    "value": "[mode=LATEST, modifiedSince=null]",
                    "title": "filter:timeRange"
                },
                {
                    "value": "sdas01",
                    "title": "server"
                }
            ]
        }
    },
    "nil": false,
    "globalScope": true,
    "typeSubstituted": false
}
That lets us see the structure of the data more clearly.
In the specific case, first we want to look at the corresponding value under the 'value' key in our parsed data. That is another dict; we can access the value of its 'queryInfo' key in the same way, and similarly the 'creationTime' from there.
To get the desired value, we simply put those accesses one after another:
my_json['value']['queryInfo']['creationTime'] # 1349724919000
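To make the "path" idea concrete, here is a small sketch that performs the same access one step at a time, and then reaches into the nested list as well. The data is a trimmed copy of the response above; the intermediate variable names are only illustrative.

```python
# Trimmed copy of the parsed response, keeping only the parts we walk through.
my_json = {
    "name": "ns1:timeSeriesResponseType",
    "value": {
        "queryInfo": {
            "creationTime": 1349724919000,
            "note": [
                {"value": "[ALL:103232434]", "title": "filter:sites"},
                {"value": "sdas01", "title": "server"},
            ],
        }
    },
}

# Each step peels off one level: dict keys use strings, list positions use integers.
value = my_json["value"]                   # a dict
query_info = value["queryInfo"]            # another dict
creation_time = query_info["creationTime"] # the integer we are after

# Lists are indexed by position, so the path can mix both kinds of access.
first_note_title = my_json["value"]["queryInfo"]["note"][0]["title"]

print(creation_time)
print(first_note_title)
```

Reading the pretty-printed output from the outside in gives the path: every opening brace means "index with a key next", every opening bracket means "index with a position next".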
I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way.
If you access the API again, the new data might not match the code's expectation. You may find it useful to add some error handling. For example, use .get() to access dictionaries in the data, rather than indexing:
name = my_json.get('name') # will return None if 'name' doesn't exist
Another way is to test for a key explicitly:
if 'name' in my_json:
    name = my_json['name']
else:
    pass  # handle the missing key here
However, these approaches may fail if further accesses are required. A placeholder result of None isn't a dictionary or a list, so attempts to access it that way will fail again (with TypeError). Since "Simple is better than complex" and "it's easier to ask for forgiveness than permission", the straightforward solution is to use exception handling:
try:
    creation_time = my_json['value']['queryInfo']['creationTime']
except (TypeError, KeyError):
    print("could not read the creation time!")
    # or substitute a placeholder, or raise a new exception, etc.
Here is an example of loading a single value from simple JSON data, and converting back and forth to JSON:
import json
# start with a plain Python dict
data = {"test1": "1", "test2": "2", "test3": "3"}
# serialize the dict to a JSON string
json_str = json.dumps(data)
# parse the JSON string back into a Python dict
resp = json.loads(json_str)
# print the parsed result
print(resp)
# extract a single value by its key
print(resp['test1'])
Try this.
Here, I fetch only statecode from the COVID API (a JSON array).
import requests
r = requests.get('https://api.covid19india.org/data.json')
x = r.json()['statewise']
for i in x:
    print(i['statecode'])
Try this:
from functools import reduce
import re

def deep_get_imps(data, key: str):
    split_keys = re.split("[\\[\\]]", key)
    out_data = data
    for split_key in split_keys:
        if split_key == "":
            return out_data
        elif isinstance(out_data, dict):
            out_data = out_data.get(split_key)
        elif isinstance(out_data, list):
            try:
                sub = int(split_key)
            except ValueError:
                return None
            else:
                length = len(out_data)
                out_data = out_data[sub] if -length <= sub < length else None
        else:
            return None
    return out_data

def deep_get(dictionary, keys):
    return reduce(deep_get_imps, keys.split("."), dictionary)
Then you can use it like below:
res = {
    "status": 200,
    "info": {
        "name": "Test",
        "date": "2021-06-12"
    },
    "result": [{
        "name": "test1",
        "value": 2.5
    }, {
        "name": "test2",
        "value": 1.9
    }, {
        "name": "test1",
        "value": 3.1
    }]
}
>>> deep_get(res, "info")
{'name': 'Test', 'date': '2021-06-12'}
>>> deep_get(res, "info.date")
'2021-06-12'
>>> deep_get(res, "result")
[{'name': 'test1', 'value': 2.5}, {'name': 'test2', 'value': 1.9}, {'name': 'test1', 'value': 3.1}]
>>> deep_get(res, "result[2]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[-1]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[2].name")
'test1'

How to insert another item programmatically into body?

I am trying to build a free/busy request body for the Google Calendar API using Python 3.8. However, when I try to insert a new item into the request body programmatically, I get a bad request error.
This code is working:
SUBJECTA = '3131313636#resource.calendar.google.com'
SUBJECTB = '34343334#resource.calendar.google.com'
body = {
"timeMin": now,
"timeMax": nownext,
"timeZone": 'America/New_York',
"items": [{'id': SUBJECTA},{"id": SUBJECTB} ]
}
Good Body result:
{'timeMin': '2019-11-05T11:42:21.354803Z',
'timeMax': '2019-11-05T12:42:21.354823Z',
'timeZone': 'America/New_York',
'items': [{'id': '131313636#resource.calendar.google.com'},
{'id': '343334#resource.calendar.google.com'}]}
However, when I use this code:
items = "{'ID': '1313636#resource.calendar.google.com'},{'ID': '3383137#resource.calendar.google.com'},{'ID': '383733#resource.calendar.google.com'}"
body = {
"timeMin": now,
"timeMax": nownext,
"timeZone": 'America/New_York',
"items": items
}
The Body results contain additional quotes at the start and end position, failing the request:
{'timeMin': '2019-11-05T12:04:41.189784Z',
'timeMax': '2019-11-05T13:04:41.189804Z',
'timeZone': 'America/New_York',
'items': ["{'ID': 13131313636#resource.calendar.google.com},{'ID':
53333383137#resource.calendar.google.com},{'ID':
831383733#resource.calendar.google.com},{'ID':
33339373237#resource.calendar.google.com},{'ID':
393935323035#resource.calendar.google.com}"]}
What is the proper way to handle it and send the item list in an accurate way?
As I understand your situation:
The value of items is the string "{'ID': '1313636#resource.calendar.google.com'},{'ID': '3383137#resource.calendar.google.com'},{'ID': '383733#resource.calendar.google.com'}".
You want to parse that string into Python objects.
The result you expect is [{'ID': '1313636#resource.calendar.google.com'}, {'ID': '3383137#resource.calendar.google.com'}, {'ID': '383733#resource.calendar.google.com'}].
You are already able to use the Calendar API.
If my understanding is correct, how about this answer? Please think of it as just one of several possible answers.
Sample script:
import json # Added
items = "{'ID': '1313636#resource.calendar.google.com'},{'ID': '3383137#resource.calendar.google.com'},{'ID': '383733#resource.calendar.google.com'}"
items = json.loads(("[" + items + "]").replace("\'", "\"")) # Added
body = {
"timeMin": now,
"timeMax": nownext,
"timeZone": 'America/New_York',
"items": items
}
print(body)
Result:
If the variables now and nownext hold the placeholder strings "now" and "nownext", respectively, the result is as follows.
{
"timeMin": "now",
"timeMax": "nownext",
"timeZone": "America/New_York",
"items": [
{
"ID": "1313636#resource.calendar.google.com"
},
{
"ID": "3383137#resource.calendar.google.com"
},
{
"ID": "383733#resource.calendar.google.com"
}
]
}
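If the string really is a sequence of Python-style dict literals, as in the question, ast.literal_eval from the standard library may be a sturdier alternative to swapping quote characters, since it is not confused by an apostrophe inside a value. This is only a sketch: the name items_text is illustrative, and the lowercase 'id' key is an assumption based on the working request shown in the question.

```python
import ast

items_text = (
    "{'ID': '1313636#resource.calendar.google.com'},"
    "{'ID': '3383137#resource.calendar.google.com'},"
    "{'ID': '383733#resource.calendar.google.com'}"
)

# Wrapping the text in brackets turns it into a single Python list literal,
# which literal_eval parses without executing arbitrary code.
raw_items = ast.literal_eval("[" + items_text + "]")

# Assumption: the request expects the lowercase 'id' key,
# matching the request body that already works in the question.
items = [{"id": entry["ID"]} for entry in raw_items]

print(items)
```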
Note:
If you can retrieve the IDs as a list of strings, I recommend building items directly, as in the following sample.
ids = ['1313636#resource.calendar.google.com', '3383137#resource.calendar.google.com', '383733#resource.calendar.google.com']
items = [{'ID': id} for id in ids]
If I misunderstood your question and this was not the result you want, I apologize.

How to iterate through indexed field to add field from another index

I'm rather new to Elasticsearch, so I'm coming here hoping to find some advice.
I have two indices in elastic from two different csv files.
The index_1 has this mapping:
{
    'settings': {
        'number_of_shards': 3
    },
    'mappings': {
        'properties': {
            'place': {'type': 'keyword'},
            'address': {'type': 'keyword'},
        }
    }
}
The file is about 400 000 documents long.
The index_2 with a much smaller file(about 50 documents) has this mapping:
{
    'settings': {
        "number_of_shards": 1
    },
    'mappings': {
        'properties': {
            'place': {'type': 'text'},
            'address': {'type': 'keyword'},
        }
    }
}
The field "place" in index_2 is all of the unique values from the field "place" in index_1.
In both indices the "address" fields are postcodes of datatype keyword with a structure: 0000AZ.
Based on the "place" field keyword in index_1 I want to assign the term of field "address" from index_2.
I have tried using the pandas library, but the index_1 file is too large. I have also tried creating modules based on pandas and elasticsearch, quite unsuccessfully, although I believe this is a promising direction. A good solution would stay within the elasticsearch library as much as possible, as these indices will later be used for further analysis.
If I understand correctly, it sounds like you want to use updateByQuery.
The request body should look a little like this:
{
'query': {'term': {'place': "placeToMatch"}},
'script': 'ctx._source.address = "updatedZipCode"'
}
This will update the address field of all documents with the matched place.
EDIT:
So what we want to do is use updateByQuery while iterating over all the documents in index2.
First step: get all the documents from index2. We will just do this using the basic search feature:
{
    "index": 'index2',
    "size": 100, // get all documents; once size is over 10,000 you'll have to paginate.
    "body": {"query": {"match_all": {}}}
}
Now we iterate over all the results and use updateByQuery for each of them:
// pseudocode
doc = response[i]
// update by query request.
{
    index: 'index1',
    body: {
        'query': {'term': {'place': doc._source.place}},
        'script': 'ctx._source.address = "`${doc._source.address}`"'
    }
}
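In Python, the loop above could be sketched roughly as follows, following the question's goal of matching on place and setting address. Several things here are assumptions rather than part of the original answer: the function names build_update_request and copy_addresses are made up for illustration, the client is assumed to be an elasticsearch-py client recent enough to accept query= and script= keyword arguments on search and update_by_query, and the address is passed through script params instead of being pasted into the script text.

```python
def build_update_request(place, address):
    """Build the query and script for one (place, address) pair."""
    return {
        "query": {"term": {"place": place}},
        "script": {
            # Passing the value as a parameter avoids quoting problems
            # that string interpolation into the script could cause.
            "source": "ctx._source.address = params.address",
            "params": {"address": address},
        },
    }


def copy_addresses(client, source_index="index_2", target_index="index_1"):
    """For every document in source_index, set its address on all
    documents in target_index that have the same place."""
    # The small index has about 50 documents, so one search page is enough.
    response = client.search(
        index=source_index, query={"match_all": {}}, size=100
    )
    for hit in response["hits"]["hits"]:
        doc = hit["_source"]
        request = build_update_request(doc["place"], doc["address"])
        client.update_by_query(
            index=target_index,
            query=request["query"],
            script=request["script"],
        )

# Hypothetical usage, assuming a reachable cluster:
#   from elasticsearch import Elasticsearch
#   copy_addresses(Elasticsearch("http://localhost:9200"))
```

This issues one update-by-query request per document in the small index, so roughly 50 requests in total, and leaves the heavy lifting on the 400 000 documents to Elasticsearch itself.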

How can I avoid a forest of apostrophes?

Using Python 3.7, I have this confusing-looking, nested dictionary:
dict = \
{
'HBL_Posts':
{'vNames':[ 'id_no', 'display_msg_no', 'thread', 'headline', 'category', 'author',
'auth_addr', 'author_pic_line', 'postbody',
'last_msg_no', 'mf_lnk', 'subject_header' ],
'data_fname':'_Posts_plain.htm', 'tpl_fname':'_Posts_tpl.htm', 'addrs_fname':'_addrs.csv' },
'MOTM':
{'vNames':[ 'work_month', 'zoom', 'zoom_id', 'headline', 'description', 'subject_header' ],
'data_fname':'_Posts_plain.htm', 'tpl_fname':'_Posts_tpl.htm', 'addrs_fname':'_addrs.csv'},
'MOTM recording':
{'vNames':[ 'topic', 'description', 'wDate', 'box', 'chat'],
'data_fname':'_Recording_data.htm', 'tpl_fname':'_Recording_tpl.htm', 'addrs_fname':'_addrs.csv'},
'Enticement':
{'vNames':[ 'enticing_post', 'headline', 'hb_preface', 'postscript'],
'data_fname':'_Entice_data.htm', 'tpl_fname':'_Entice_tpl.htm', 'addrs_fname':'_entice.csv'}
}
If I initially set each variable to its own name, like: HBL_Posts = 'HBL_Posts', I can substitute this, much clearer and less typo-prone, code:
dict = \
{
HBL_Posts:
{vNames:[ id_no, display_msg_no, thread, headline, category, author,
auth_addr, author_pic_line, postbody,
last_msg_no, mf_lnk, subject_header ],
data_fname:_Posts_plain.htm, tpl_fname:_Posts_tpl.htm, addrs_fname:_addrs.csv },
MOTM:
{vNames:[ work_month, zoom, zoom_id, headline, description, subject_header ],
data_fname:_Posts_plain.htm, tpl_fname:_Posts_tpl.htm, addrs_fname:_addrs.csv},
MOTM recording:
{vNames:[ topic, description, wDate, box, chat],
data_fname:_Recording_data.htm, tpl_fname:_Recording_tpl.htm, addrs_fname:_addrs.csv},
Enticement:
{vNames:[ enticing_post, headline, hb_preface, postscript],
data_fname:_Entice_data.htm, tpl_fname:_Entice_tpl.htm, addrs_fname:_entice.csv}
}
In fact I accomplished this by just doing all the required assignments, one at a time. But that is about as complicated as the original dictionary set up, with the apostrophes. What I'd like is a function that would enable me to do this neatly and economically.
def self_name(s):
[?????]
Then I could have a list of all the variables, vars_lst, and loop through it setting each to the literal version of itself:
for item in vars_lst:
item = self_name(item)
To avoid having to use apostrophes in setting up vars_lst, I would accept doing:
HBL_Posts = vNames = id_no = . . . = ''
After many, many hours of struggle, I have been unable to supply the needed code for the self_name function. How can I do that, or how can I find another way of avoiding so many apostrophes?
Indent it like JSON:
{
    "HBL_Posts": {
        "vNames": [
            "id_no",
            "display_msg_no",
            "thread",
            "headline",
            "category",
            "author",
            "auth_addr",
            "author_pic_line",
            "postbody",
            "last_msg_no",
            "mf_lnk",
            "subject_header"
        ],
        "data_fname": "_Posts_plain.htm",
        "tpl_fname": "_Posts_tpl.htm",
        "addrs_fname": "_addrs.csv"
    },
    "MOTM": {
        "vNames": [
            "work_month",
            "zoom",
            "zoom_id",
            "headline",
            "description",
            "subject_header"
        ],
        "data_fname": "_Posts_plain.htm",
        "tpl_fname": "_Posts_tpl.htm",
        "addrs_fname": "_addrs.csv"
    },
    "MOTM recording": {
        "vNames": [
            "topic",
            "description",
            "wDate",
            "box",
            "chat"
        ],
        "data_fname": "_Recording_data.htm",
        "tpl_fname": "_Recording_tpl.htm",
        "addrs_fname": "_addrs.csv"
    },
    "Enticement": {
        "vNames": [
            "enticing_post",
            "headline",
            "hb_preface",
            "postscript"
        ],
        "data_fname": "_Entice_data.htm",
        "tpl_fname": "_Entice_tpl.htm",
        "addrs_fname": "_entice.csv"
    }
}
or even store that in a .json file and load it via:
import json
with open('my_file.json', 'r') as f:
    my_dict = json.load(f)
JSON is easy for most people to read and the indentation is easy to see. Plus it is easy to save and read from a file so you don't have to clutter your code.
FYI:
You can pretty print a dictionary using:
import json
my_dict = ...
print(json.dumps(my_dict, indent=4))
which is how I printed your dictionary.
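If the dictionary has to stay in the Python source, another way to cut down on apostrophes is to write each list of names as one whitespace-separated string and split it, and to use the dict(key=value) form for keys that are valid identifiers. This is just a sketch of that idea; the helper name words and the variable name config are made up for illustration, and only two of the four entries are shown.

```python
def words(text):
    """Split a whitespace-separated string into a list of names."""
    return text.split()


config = {
    "HBL_Posts": dict(
        # One pair of quotes per line instead of one pair per name.
        vNames=words(
            "id_no display_msg_no thread headline category author "
            "auth_addr author_pic_line postbody "
            "last_msg_no mf_lnk subject_header"
        ),
        data_fname="_Posts_plain.htm",
        tpl_fname="_Posts_tpl.htm",
        addrs_fname="_addrs.csv",
    ),
    # Keys with spaces cannot be keyword arguments, so they stay quoted.
    "MOTM recording": dict(
        vNames=words("topic description wDate box chat"),
        data_fname="_Recording_data.htm",
        tpl_fname="_Recording_tpl.htm",
        addrs_fname="_addrs.csv",
    ),
}

print(config["MOTM recording"]["vNames"])
```

The file names still need quotes, since they are data rather than identifiers, but the long name lists no longer do.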

Access a dictionary value based on the list of keys

I have a nested dictionary with keys and values as shown below.
j = {
    "app": {
        "id": 0,
        "status": "valid",
        "Garden": {
            "Flowers": {
                "id": "1",
                "state": "fresh"
            },
            "Soil": {
                "id": "2",
                "state": "stale"
            }
        },
        "BackYard": {
            "Grass": {
                "id": "3",
                "state": "dry"
            },
            "Soil": {
                "id": "4",
                "state": "stale"
            }
        }
    }
}
Currently, I have a python method which returns me the route based on keys to get to a 'value'. For example, if I want to access the "1" value, the python method will return me a list of string with the route of the keys to get to "1". Thus it would return me, ["app","Garden", "Flowers"]
I am designing a service using flask and I want to be able to return a json output such as the following based on the route of the keys. Thus, I would return an output such as below.
{
"id": "1",
"state": "fresh"
}
The Problem:
I am unsure on how to output the result as shown above as I will need to parse the dictionary "j" in order to build it?
I tried something as the following.
def build_dictionary(key_chain):
d_temp = list(d.keys())[0]
...unsure on how to
#Here key_chain contains the ["app","Garden", "Flowers"] sent to from the method which parses the dictionary to store the key route to the value, in this case "1".
Can someone please help me to build the dictionary which I would send to the jsonify method. Any help would be appreciated.
Hope this is what you are asking:
def build_dictionary(key_chain, j):
    for k in key_chain:
        j = j.get(k)
    return j
kchain = ["app","Garden", "Flowers"]
>>> build_dictionary(kchain, j)
{'id': '1', 'state': 'fresh'}
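One thing to watch with the loop above: if a key in the chain is missing, .get() returns None and the next .get() call raises AttributeError. A variant that returns None for any broken path could look like the sketch below, which folds the key chain with functools.reduce. The function name get_by_path is an illustrative choice, and the sample data is a shortened copy of the question's dictionary.

```python
from functools import reduce
from operator import getitem


def get_by_path(data, key_chain):
    """Follow key_chain through nested dicts/lists; return None if any step fails."""
    try:
        # reduce applies data[key] for each key in turn, starting from data.
        return reduce(getitem, key_chain, data)
    except (KeyError, IndexError, TypeError):
        return None


j = {
    "app": {
        "id": 0,
        "status": "valid",
        "Garden": {
            "Flowers": {"id": "1", "state": "fresh"},
            "Soil": {"id": "2", "state": "stale"},
        },
    }
}

print(get_by_path(j, ["app", "Garden", "Flowers"]))
print(get_by_path(j, ["app", "Pond", "Fish"]))
```

The returned dict can then be handed to jsonify as it is; a None result is a natural place to return a 404 from the Flask view instead.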
