How to modify the span list while iterating over it - python-3.x

I am trying to create dummy data for an NER task by replacing person_name with some dummy names. But it's giving me weird results when the same entity occurs multiple times, as discussed here:
Strange result when removing item from a list while iterating over it
Modifying list while iterating
Input example spans:
{
 'text': "Mohan dob is 25th dec 1980. Mohan loves to play cricket.",
 'spans': [{'start': 0, 'end': 5, 'label': 'person_name', 'ngram': 'Mohan'},
           {'start': 28, 'end': 33, 'label': 'person_name', 'ngram': 'Mohan'},
           {'start': 13, 'end': 26, 'label': 'date', 'ngram': '25th dec 1980'}
          ]
}
The entity person_name occurs twice in this sample.
sample_names=['Jon', 'Sam']
I want to replace (0, 5, 'person_name') and (28, 33, 'person_name') with sample_names.
Dummy Examples Output:
[
 {'text': "Jon dob is 25th dec 1980. Jon loves to play cricket.",
  'spans': [{'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Jon'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
           ]
 },
 {'text': "Sam dob is 25th dec 1980. Sam loves to play cricket.",
  'spans': [{'start': 0, 'end': 3, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 26, 'end': 29, 'label': 'person_name', 'ngram': 'Sam'},
            {'start': 11, 'end': 24, 'label': 'date', 'ngram': '25th dec 1980'}
           ]
 }
]
The spans also get updated in the output.
target_entity='person_name'
names=sample_names
Code:
def generate(data, target_entity, names):
    text = data['text']
    spans = data['spans']
    new_sents = []
    if spans:
        spans = [(d['start'], d['end'], d['label']) for d in spans]
        spans.sort()
        labellist = [s[2] for s in spans]
        # get before_spans and after_spans around target entity
        for n in names:
            gap = 0
            for i, tup in enumerate(spans):
                lab = tup[2]
                if lab == target_entity:
                    new_spans = {"before": spans[:i], "after": spans[i+1:]}
                    print("the spans before and after :\n", new_spans)
                    start = tup[0]  # check this
                    end = tup[1]
                    ngram = text[start:end]
                    new_s = text[:start] + n + text[end:]
                    gap = len(n) - len(ngram)
                    before = new_spans["before"]
                    after = [(t[0] + gap, t[1] + gap, t[2]) for t in new_spans["after"]]
                    s_sp = before + [(start, start + len(n), target_entity)] + after
                    text = new_s
                    en = {"text": new_s,
                          "spans": [{"start": t[0], "end": t[1], "label": t[2],
                                     "ngram": new_s[t[0]:t[1]]} for t in s_sp]}
                    spans = s_sp
                    new_sents.append(en)

If all you seek to do is replace the placeholder with a new value, you can do something like this:
## --------------------
## Some example input from you
## --------------------
input_data = [
    (162, 171, 'pno'),
    (241, 254, 'person_name'),
    (373, 384, 'date'),
    (459, 477, 'date'),
    None,
    (772, 785, 'person_name'),
    (797, 806, 'pno')
]
## --------------------

## --------------------
## create an iterator out of our name list
## you will need to decide what happens if sample_names
## gets exhausted.
## --------------------
sample_names = [
    'Jon',
    'Sam'
]
sample_names_itter = iter(sample_names)
## --------------------

for row in input_data:
    if not row:
        continue
    start = row[0]
    end = row[1]
    name = row[2] if row[2] != "person_name" else next(sample_names_itter)
    print(f"{name} dob is 25th dec 1980. {name} loves to play cricket.")

Related

ID inserted automatically in a dictionary and write into a file

I want to create an ID for each element inserted in an empty dictionary, then write it to a file as shown below. But it doesn't work. Any help to fix it?
dict = {}
ids = 0
line_count = 0
fhand = input('Enter the file name:')
fname = open(fhand, 'a+')
for line in fname:
    if line.split() == []:
        ids = 1
    else:
        line_count += 1
        ids = line_count + 1
n = int(input('How many colors do you want to add?'))
for i in range(0, n):
    dict['ID:'] = ids + 1
    dict['Color:'] = input('Enter the color:')
    for key, value in dict.items():
        s = str(key) + ' ' + str(value) + '\n'
        fname.write(s)
fname.close()
print('Done!')
Output should be:
ID : 1
Color: red
ID : 2
Color : rose
ID : 3
Color : blue
Not sure if I got what you meant but...
A dictionary is made of <key, value> pairs.
Let's suppose you have a dictionary:
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
If you want to create an ID for each key (for a specific reason) you could use a for loop like:
for key in thisdict.keys():
    createIdFunction(key)
And have a createIdFunction which is going to assign an ID based on whatever you want.
Suggestion: Dictionaries can only hold unique keys, so maybe you could use their own keys as IDs.
However, if your dictionary is empty, there would be no reason to have an ID for that key, right?
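For instance, a minimal createIdFunction (purely illustrative, not from the original post) could hand out sequential integers and remember which one each key received:
import itertools

_ids = {}                      # key -> assigned ID
_counter = itertools.count(1)  # yields 1, 2, 3, ...

def createIdFunction(key):
    # assign the next sequential ID the first time we see a key,
    # and return the remembered ID on every later call
    if key not in _ids:
        _ids[key] = next(_counter)
    return _ids[key]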
You mean your ID is not increased? I think you did not reassign the variable "ids" in the loop; you could modify the code as below:
dict = {}
ids = 0
line_count = 0
fhand = input('Enter the file name:')
fname = open(fhand, 'a+')
for line in fname:
    if line.split() == []:
        ids = 1
    else:
        line_count += 1
        ids = line_count + 1
n = int(input('How many colors do you want to add?'))
for i in range(0, n):
    ids += 1  # modified
    dict['ID:'] = ids  # modified
    dict['Color:'] = input('Enter the color:')
    for key, value in dict.items():
        s = str(key) + ' ' + str(value) + '\n'
        fname.write(s)
fname.close()
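Note that because the same two keys ('ID:' and 'Color:') are reused, the dictionary only ever holds the most recent pair, and the code works only because it writes that pair out on every iteration. A hypothetical restructuring (my own sketch, reusing ids and fhand from above) that keeps every entry by keying on the ID itself:
colors = {}  # ID -> color, so every entry is retained
n = int(input('How many colors do you want to add?'))
for i in range(n):
    ids += 1
    colors[ids] = input('Enter the color:')

# write all pairs out once, in the requested "ID : n / Color : x" layout
with open(fhand, 'a+') as fname:
    for color_id, color in colors.items():
        fname.write(f'ID : {color_id}\nColor : {color}\n')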

Saving for loop output at particular range and update csv sheets

I have a data frame and I am running some code that extracts data based on the data frame's values, but the index count is 1.5 million, so extraction takes a long time; my server keeps stopping, the whole process gets stuck, and it has to start again from zero.
I want to save the extracted data to a new CSV file after every iteration, or after a defined number of rows.
def get_dsm_coverage(df):
    import math
    import mpmath
    list_2019 = []
    list_2020 = []
    list_2021 = []
    for z in df.index:
        lat, long = (df['LATITUDE'][z], df['LONGITUDE'][z])
        print(z)
        zoom = 21
        lat_rad = math.radians(lat)
        lon_rad = math.radians(long)
        n = 2**zoom
        xtile = str(int(n*((long+180)/360)))
        ytile = str(int(n*(1-(np.log(np.tan(lat_rad) + float(mpmath.sec(lat_rad))) / np.pi))/2))
        print(long, lat, xtile, ytile)
        for year in [2019, 2020, 2021]:
            url = 'https://api.gic.org/images/GetDSMTile/21/' + str(xtile) + '/' + str(ytile) + '/?layer=bluesky-ultra&year=' + str(year)
            r = requests.get(url, params={'AuthToken': token})
            if r.status_code != 200:
                print('got inside')
                url = 'https://api.gic.org/images/GetDSMTile/21/' + str(xtile) + '/' + str(ytile) + '/?layer=bluesky-ultra-g&year=' + str(year)
                r = requests.get(url, params={'AuthToken': token})
            try:
                content_type = r.headers['Content-type']
            except:
                content_type = 'application/json;charset=ISO-8859-1'
            if content_type == 'image/tiff':
                print(r.status_code)
                print(url)
                print(content_type)
                if year == 2019:
                    list_2019.append(1)
                elif year == 2020:
                    list_2020.append(1)
                else:
                    list_2021.append(1)
            else:
                print(content_type)
                if year == 2019:
                    list_2019.append(0)
                elif year == 2020:
                    list_2020.append(0)
                else:
                    list_2021.append(0)
    return list_2019, list_2020, list_2021

list_2019, list_2020, list_2021 = get_dsm_coverage(df)
df['dsm_2019'] = list_2019
df['dsm_2020'] = list_2020
df['dsm_2021'] = list_2021
The crucial part is that we are going to keep track of our calculations and regularly save progress to a temp file. Note this code does not actually hit the API, and when run the first time it will intentionally error out so that you can restart it and see it recover. Note that the existence of the temp file signals that there is prior work, and as such it should be cleaned up after a successful run.
import json
import math
import os
import random

import mpmath
import numpy
import pandas
import requests
import urllib3

#-------------------------------
# Don't build your own retry.
# Requests already supports it!
#-------------------------------
request_with_retry = requests.Session()
request_with_retry.mount("https://", requests.adapters.HTTPAdapter(
    max_retries=urllib3.util.retry.Retry(
        total=5,
        backoff_factor=1,
        status_forcelist=[408, 409, 429, 500, 502, 503, 504]
    )
))
#-------------------------------

#-------------------------------
# We can pull this out of the main loop to simplify it
#-------------------------------
def get_tiles(latitude, longitude, zoom_level):
    lat_rad = math.radians(latitude)
    #lon_rad = math.radians(longitude)
    n = 2 ** zoom_level
    xtile = n * (longitude + 180) / 360
    ytile = n * (1 - (numpy.log(numpy.tan(lat_rad) + float(mpmath.sec(lat_rad))) / numpy.pi)) / 2
    return int(xtile), int(ytile)
#-------------------------------

#-------------------------------
# Return a dictionary that contains prior saved work (or is blank)
# See: save_prior_work()
#-------------------------------
def fetch_prior_work(tmp_file_path):
    try:
        with open(tmp_file_path, "r", encoding="utf-8") as temp_in:
            return json.load(temp_in)
    except (FileNotFoundError, json.decoder.JSONDecodeError):
        return {}
#-------------------------------

#-------------------------------
# Save our work to a temp file.
# See: fetch_prior_work()
#-------------------------------
def save_prior_work(tmp_file_path, work_dictionary):
    with open(tmp_file_path, "w", encoding="utf-8", newline="") as temp_out:
        json.dump(work_dictionary, temp_out)
#-------------------------------

def get_dsm_coverage(token, df, tmp_file_path):
    api_url_params = {'AuthToken': token}
    api_url_template = "https://api.gic.org/images/GetDSMTile/21/{xtile}/{ytile}/?layer=bluesky-ultra&year={year}"
    api_zoom_level = 21
    years = ["2019", "2020", "2021"]
    note_every = 3  # How often do we print
    save_every = 6  # How often do we save progress (probably every 100 or 1000)

    ## -------------------
    ## if our tmp_file_path exists, it represents prior work we can skip
    ## -------------------
    year_lists = fetch_prior_work(tmp_file_path)
    ## -------------------

    ## -------------------
    ## make sure our year_lists results is properly initialized
    ## -------------------
    for year in years:
        year_lists.setdefault(year, [])
    ## -------------------

    ## -------------------
    ## Determine if there is any prior work we can skip
    ## -------------------
    rows_already_processed = len(year_lists[years[0]])
    if rows_already_processed:
        print(f"skipping first {rows_already_processed} rows")
    ## -------------------

    for z in df.index[rows_already_processed:]:
        ## ----------------------
        ## printing is expensive so let's only print every so often
        ## ----------------------
        if not z % note_every:
            print(f"Row: {z}")
        ## ----------------------

        ## ----------------------
        ## calculate the tile ids
        ## ----------------------
        xtile, ytile = get_tiles(df["LATITUDE"][z], df["LONGITUDE"][z], api_zoom_level)
        ## ----------------------

        for year in years:
            url = api_url_template.format_map({"xtile": xtile, "ytile": ytile, "year": year})

            ## ---------------------------
            ## TEST: We don't have a key....
            ## ---------------------------
            #response = request_with_retry.get(url, params=api_url_params)
            response = None
            ## ---------------------------

            try:
                content_type = response.headers['Content-type']
            except:
                content_type = "application/json;charset=ISO-8859-1"

            ## ---------------------------
            ## TEST: We don't have a key....
            ## ---------------------------
            content_type = random.choice(["image/tiff", content_type])
            ## ---------------------------

            if content_type == 'image/tiff':
                year_lists[year].append(1)
            else:
                year_lists[year].append(0)

        ## ----------------------
        ## Every so often, dump the work we have done to a temp file.
        ## ----------------------
        if z and not (z % save_every):
            print("\tSaving Temp File...")
            save_prior_work(tmp_file_path, year_lists)
        ## ----------------------

        ## ----------------------
        ## TEST: Force an error the first run so we can restart
        ## ----------------------
        if z == 10 and not rows_already_processed:
            raise Exception("Bummer")
        ## ----------------------

    return year_lists.values()

AUTH_TOKEN = ""
TMP_FILE_PATH = "./temp.json"

df = pandas.DataFrame([
    ("Ansonia", "CT", "USA", 41.346439, -73.084938),
    ("Walsenburg", "CO", "USA", 37.630322, -104.790543),
    ("Sterling", "CO", "USA", 40.626743, -103.217026),
    ("Steamboat Springs", "CO", "USA", 40.490429, -106.842384),
    ("Ouray", "CO", "USA", 38.025131, -107.675880),
    ("Leadville", "CO", "USA", 39.247478, -106.300194),
    ("Gunnison", "CO", "USA", 38.547871, -106.938622),
    ("Fort Morgan", "CO", "USA", 40.255306, -103.803062),
    ("Panama City", "FL", "USA", 30.193626, -85.683029),
    ("Miami Beach", "FL", "USA", 25.793449, -80.139198),
    ("Cripple Creek", "CO", "USA", 38.749077, -105.183060),
    ("Central City", "CO", "USA", 39.803318, -105.516830),
    ("Cañon City", "CO", "USA", 38.444931, -105.245720),
])
df.set_axis(["Name", "State", "Country", "LATITUDE", "LONGITUDE"], axis=1, inplace=True)

list_2019, list_2020, list_2021 = get_dsm_coverage(AUTH_TOKEN, df, TMP_FILE_PATH)

## ----------------------
## if we get here, TMP_FILE_PATH should/could be deleted...
## ----------------------
try:
    os.remove(TMP_FILE_PATH)
except OSError:
    pass
## ----------------------

df['dsm_2019'] = list_2019
df['dsm_2020'] = list_2020
df['dsm_2021'] = list_2021
print(df)
One should expect the first execution to give:
Row: 0
Row: 3
Row: 6
Saving Temp File...
Row: 9
Traceback (most recent call last):
File "test.py", line 167, in <module>
list_2019, list_2020, list_2021 = get_dsm_coverage(AUTH_TOKEN, df, TMP_FILE_PATH)
File "test.py", line 142, in get_dsm_coverage
raise Exception("Bummer")
Exception: Bummer
and a following execution to give something like:
skipping first 7 rows
Row: 9
Row: 12
Saving Temp File...
Name State Country LATITUDE LONGITUDE dsm_2019 dsm_2020 dsm_2021
0 Ansonia CT USA 41.346439 -73.084938 0 1 1
1 Walsenburg CO USA 37.630322 -104.790543 1 0 0
2 Sterling CO USA 40.626743 -103.217026 0 1 0
3 Steamboat Springs CO USA 40.490429 -106.842384 0 0 0
4 Ouray CO USA 38.025131 -107.675880 0 0 0
5 Leadville CO USA 39.247478 -106.300194 1 0 1
6 Gunnison CO USA 38.547871 -106.938622 0 1 1
7 Fort Morgan CO USA 40.255306 -103.803062 1 0 0
8 Panama City FL USA 30.193626 -85.683029 0 1 1
9 Miami Beach FL USA 25.793449 -80.139198 0 1 1
10 Cripple Creek CO USA 38.749077 -105.183060 1 0 1
11 Central City CO USA 39.803318 -105.516830 0 0 0
12 Cañon City CO USA 38.444931 -105.245720 1 0 0
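If you would rather checkpoint straight to a CSV file, as the question asks, the same pattern works: append a batch of finished rows every save_every iterations, and use the file's current length to decide where to resume. A minimal sketch (my own, with assumed column names, not part of the answer above):
import os
import pandas

def append_checkpoint(csv_path, rows):
    # append a batch of result rows, writing the header only on first creation
    frame = pandas.DataFrame(rows, columns=["index", "dsm_2019", "dsm_2020", "dsm_2021"])
    frame.to_csv(csv_path, mode="a", header=not os.path.exists(csv_path), index=False)

def rows_already_done(csv_path):
    # number of data rows previously checkpointed (0 if no file yet)
    try:
        return len(pandas.read_csv(csv_path))
    except FileNotFoundError:
        return 0
The trade-off versus the JSON temp file is that the CSV doubles as the final output, but partial runs leave a header plus however many rows completed, so a successful run should still verify the row count matches the data frame.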

Python Question: How to add the values per line

I'm new to Python and trying to get my head around this code.
We have to import a text file named line-items.txt; an excerpt of the txt is as follows, including its heading:
product name quantity unit price
product a 1 10.00
product b 5 19.70
product a 3 10.00
product b 7 19.70
We need to write code that will search for the product name and sum its quantity and unit price; the sales revenue formula would be "total unit price of the product" * "total quantity of the product". We have to create a new text file, and the output should be something like this:
product name sales volume sales revenue
product a 4 40.0
product b 12 236.39999999999998
My code below has found the quantities of product b, which are 5 and 7, and its unit price (I used a print statement to check the output, but in the code below I commented out the unit price for simplicity), but it's not adding the values it found:
def main():
    # opening file to read line-items.txt
    with open("line-items.txt", "r") as line_items:
        # to get the list of lines and reading the second line of the text
        prod_b = 0
        newtxt = line_items.readlines()[1:]
        for line in newtxt:
            text = line.strip().split()
            product_name = text[0:2]
            quantity = text[2]
            unit_price = text[3]
        if product_name == ['product', 'b']:
            prod_b += int(quantity)
            unit_price_b = float(unit_price)
            # print(unit_price_b)
            print(quantity)
        line_items.close()

if __name__ == '__main__':
    main()
The output of the code above is as follows; it's not adding 5 and 7; what am I doing wrong?
5
7
Thanks,
Rogue
While the answer provided by @JonSG is certainly more elegant, the problem with your code is quite simple: it is caused by an indentation error. You need to indent the if statement under the for loop, as shown below:
def main():
    # opening file to read line-items.txt
    with open("line-items.txt", "r") as line_items:
        # to get the list of lines and reading the second line of the text
        prod_b = 0
        newtxt = line_items.readlines()[1:]
        for line in newtxt:
            text = line.strip().split()
            product_name = text[0:2]
            quantity = text[2]
            unit_price = text[3]
            if product_name == ['product', 'b']:
                prod_b += int(quantity)
                unit_price_b = float(unit_price)
                # print(unit_price_b)
                print(quantity)
        line_items.close()
Using a nested collections.defaultdict makes this problem rather straightforward.
import collections
import json

results = collections.defaultdict(lambda: collections.defaultdict(float))

with open("line-items.txt", "r") as line_items:
    next(line_items)  ## skip first line
    for row in line_items.readlines():
        cells = row.split(" ")
        product_name = f"{cells[0]} {cells[1]}"
        quantity = int(cells[2])
        price = float(cells[3])
        results[product_name]["quantity"] += quantity
        results[product_name]["sales volume"] += quantity * price

print(json.dumps(results, indent=4))
results in:
{
    "product a": {
        "quantity": 4.0,
        "sales volume": 40.0
    },
    "product b": {
        "quantity": 12.0,
        "sales volume": 236.4
    }
}
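To produce the text file the assignment asks for, you could then dump results in the expected column layout; a short sketch (the output file name is my assumption):
with open("sales-summary.txt", "w") as out:
    out.write("product name sales volume sales revenue\n")
    for product, totals in results.items():
        # quantity accumulated as float above, so cast back to int for display
        out.write(f'{product} {int(totals["quantity"])} {totals["sales volume"]}\n')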

Python dictionary based on input file

I'm trying to create a dictionary object like the one below using the input file data shown. During conversion, the inner object is being replicated. Any advice on what fix is needed for the desired output?
input file data: /home/file1.txt
[student1]
fname : Harry
lname : Hoit
age : 22
[Student2]
fname : Adam
lname : Re
age : 25
expected output :
{'Student1' : {'fname' : 'Harry', 'lname' : 'Hoit', 'Age' : 22},
'Student2' : {'fname' : 'Adam', 'lname' : 'Re', 'Age' : 25}}
def dict_val():
    out = {}
    inn = {}
    path = '/home/file1.txt'
    with open(path, 'r') as f:
        for row in f:
            row = row.strip()
            if row.startswith("["):
                i = row[1:-1]
                # inn.clear() ## tried to clear the inner dict during the second section but it's not correct
            else:
                if len(row) < 2:
                    pass
                else:
                    key, value = row.split(':')
                    inn[key.strip()] = value.strip()
                    out[i] = inn
    return out

print(dict_val())
current output: getting duplicates during the second iteration
{'student1': {'fname': 'Adam', 'lname': 'Re', 'age': '25'},
'Student2': {'fname': 'Adam', 'lname': 'Re', 'age': '25'}}
With just a little change, you will get it. You were pretty close.
The modification includes checking for an empty line. When the line is empty, write the inn data to out and then reset inn.
def dict_val():
    out = {}
    inn = {}
    path = 'file.txt'
    with open(path, 'r') as f:
        for row in f:
            row = row.strip()
            if row.startswith("["):
                i = row[1:-1]
                continue
            # when the line is empty, write it to OUT dictionary
            # reset INN dictionary
            if len(row.strip()) == 0:
                if len(inn) > 0:
                    out[i] = inn
                    inn = {}
                continue
            key, value = row.split(':')
            inn[key.strip()] = value.strip()
        # if last line of the file is not an empty line and
        # the file reading is done, you can check if INN still
        # has data. If it does, write it out to OUT
        if len(inn) > 0:
            out[i] = inn
    return out

print(dict_val())
When you do out[i] = inn, you copy the reference/pointer to the inn dict. This means that when the inn dict is updated later in the loop, out['student1'] and out['Student2'] point to the same thing.
To solve this, you can create a deep copy of the inn object.
ref : Nested dictionaries copy() or deepcopy()?
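A quick demonstration of the aliasing and the fix (for a flat inner dict a shallow inn.copy() would also do; copy.deepcopy covers nested values as well):
import copy

inn = {'fname': 'Harry'}
out = {'student1': inn}            # stores a reference, not a copy
inn['fname'] = 'Adam'
print(out)                         # {'student1': {'fname': 'Adam'}} -- changed too

out = {'student1': copy.deepcopy(inn)}
inn['fname'] = 'Harry'
print(out)                         # {'student1': {'fname': 'Adam'}} -- now unaffected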
I would work the nested dictionary all at once since you're not going that deep.
def dict_val(file):
    inn = {}
    for row in open(file, 'r'):
        row = row.strip()
        if row.startswith("["):
            i = row[1:-1]
            inn[i] = {}  # start a fresh inner dict for each section header
        elif len(row) > 2:
            key, value = row.split(':')
            inn[i][key.strip()] = value.strip()
    return inn

print(dict_val('/home/file1.txt'))

Assigning multiple values to dictionary keys from a file in Python 3

I'm fairly new to Python but I haven't found the answer to this particular problem.
I am writing a simple recommendation program and I need to have a dictionary where cuisine is a key and name of a restaurant is a value. There are a few instances where I have to split a string of a few cuisine names and make sure all other restaurants (values) which have the same cuisine get assigned to the same cuisine (key). Here's a part of a file:
Georgie Porgie
87%
$$$
Canadian, Pub Food
Queen St. Cafe
82%
$
Malaysian, Thai
Mexican Grill
85%
$$
Mexican
Deep Fried Everything
52%
$
Pub Food
so it's just the first and the last one with the same cuisine but there are more later in the file.
And here is my code:
def new(file):
    file = "/.../Restaurants.txt"
    d = {}
    key = []
    with open(file) as file:
        lines = file.readlines()
        for i in range(len(lines)):
            if i % 5 == 0:
                if "," not in lines[i + 3]:
                    d[lines[i + 3].strip()] = [lines[i].strip()]
                else:
                    key += (lines[i + 3].strip().split(', '))
                    for j in key:
                        if j not in d:
                            d[j] = [lines[i].strip()]
                        else:
                            d[j].append(lines[i].strip())
    return d
It gets all the keys and values printed but it doesn't assign two values to the same key where it should. Also, with this last 'else' statement, the second restaurant is assigned to the wrong key as a second value. This should not happen. I would appreciate any comments or help.
In the case where there is only one category, you don't check if the key is in the dictionary. You should do this analogously to the multiple-category case, and then it works fine.
I don't know why you have file as an argument when it is immediately overwritten.
Additionally, you should create a new 'key' for each record rather than using += (which appends to the existing 'key').
When you check whether j is in the dictionary, a clean way is to check against the keys (d.keys()):
def new(file):
    file = "/.../Restaurants.txt"
    d = {}
    key = []
    with open(file) as file:
        lines = file.readlines()
        for i in range(len(lines)):
            if i % 5 == 0:
                if "," not in lines[i + 3]:
                    if lines[i + 3].strip() not in d.keys():
                        d[lines[i + 3].strip()] = [lines[i].strip()]
                    else:
                        d[lines[i + 3].strip()].append(lines[i].strip())
                else:
                    key = (lines[i + 3].strip().split(', '))
                    for j in key:
                        if j not in d.keys():
                            d[j] = [lines[i].strip()]
                        else:
                            d[j].append(lines[i].strip())
    return d
Normally, I find that if you use names for the dictionary keys, you may have an easier time handling them later.
In the example below, I return a series of dictionaries, one for each restaurant. I also wrap the functionality of processing the values in a method called add_value(), to keep the code more readable.
In my example, I'm using codecs to decode the value. Although not necessary, depending on the characters you are dealing with it may be useful. I'm also using itertools to read the file lines with an iterator. Again, not necessary depending on the case, but might be useful if you are dealing with really big files.
import copy, itertools, codecs

class RestaurantListParser(object):
    file_name = "restaurants.txt"
    base_item = {
        "_type": "undefined",
        "_fields": {
            "name": "undefined",
            "nationality": "undefined",
            "rating": "undefined",
            "pricing": "undefined",
        }
    }

    def add_value(self, formatted_item, field_name, field_value):
        if isinstance(field_value, str):
            # handle encoding, strip, process the values as you need.
            field_value = codecs.encode(field_value, 'utf-8').decode('utf-8').strip()
            formatted_item["_fields"][field_name] = field_value
        else:
            print('Error parsing field "%s", with value: %s' % (field_name, field_value))

    def generator(self, file_name):
        with open(file_name) as file:
            while True:
                lines = tuple(itertools.islice(file, 5))
                if not lines:
                    break
                # Initialize our dictionary for this item
                formatted_item = copy.deepcopy(self.base_item)
                if "," not in lines[3]:
                    formatted_item['_type'] = lines[3].strip()
                else:
                    formatted_item['_type'] = lines[3].split(',')[1].strip()
                    self.add_value(formatted_item, 'nationality', lines[3].split(',')[0])
                self.add_value(formatted_item, 'name', lines[0])
                self.add_value(formatted_item, 'rating', lines[1])
                self.add_value(formatted_item, 'pricing', lines[2])
                yield formatted_item

    def split_by_type(self):
        d = {}
        for restaurant in self.generator(self.file_name):
            if restaurant['_type'] not in d:
                d[restaurant['_type']] = [restaurant['_fields']]
            else:
                d[restaurant['_type']] += [restaurant['_fields']]
        return d
Then, if you run:
p = RestaurantListParser()
print(p.split_by_type())
You should get:
{
    'Mexican': [{
        'name': 'Mexican Grill',
        'nationality': 'undefined',
        'pricing': '$$',
        'rating': '85%'
    }],
    'Pub Food': [{
        'name': 'Georgie Porgie',
        'nationality': 'Canadian',
        'pricing': '$$$',
        'rating': '87%'
    }, {
        'name': 'Deep Fried Everything',
        'nationality': 'undefined',
        'pricing': '$',
        'rating': '52%'
    }],
    'Thai': [{
        'name': 'Queen St. Cafe',
        'nationality': 'Malaysian',
        'pricing': '$',
        'rating': '82%'
    }]
}
Your solution is simple, so it's ok. I'd just like to mention a couple of ideas that come to mind when I think about this kind of problem.
Here's another take, using defaultdict and split to simplify things.
from collections import defaultdict

record_keys = ['name', 'rating', 'price', 'cuisine']

def load(file):
    with open(file) as file:
        data = file.read()
    restaurants = []
    # chop up input on each blank line (2 newlines in a row)
    for record in data.split("\n\n"):
        fields = record.split("\n")
        # build a dictionary by zipping together the fixed set
        # of field names and the values from this particular record
        restaurant = dict(zip(record_keys, fields))
        # split chops apart the types of cuisine on commas, then _.strip()
        # removes any leading/trailing whitespace on each type of cuisine
        restaurant['cuisine'] = [_.strip() for _ in restaurant['cuisine'].split(",")]
        restaurants.append(restaurant)
    return restaurants

def build_index(database, key, value):
    index = defaultdict(set)
    for record in database:
        for v in record.get(key, []):
            # defaultdict will create a set if one is not present or add to it if one is
            index[v].add(record[value])
    return index

restaurant_db = load('/var/tmp/r')
print(restaurant_db)

by_type = build_index(restaurant_db, 'cuisine', 'name')
print(by_type)
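Once the index is built, lookups against it are plain set reads; with the sample records from the question, for example:
print(by_type['Pub Food'])   # {'Georgie Porgie', 'Deep Fried Everything'}
print(by_type['Thai'])       # {'Queen St. Cafe'}
The same build_index call also works for other groupings, e.g. build_index(restaurant_db, 'cuisine', 'price') to map each cuisine to the price tiers seen for it.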
