PyMongo: how to query a series and find the closest match - python-3.x

This is a simplified example of how my data is stored in MongoDB of a single athlete:
{ "_id" : ObjectId('5bd6eab25f74b70e5abb3326'),
"Result" : 12,
"Race" : [0.170, 4.234, 9.170]
"Painscore" : 68,
}
Now when this athlete has performed a race I want to search for the race that was MOST similar to the current one, and hence I want to compare both painscores.
IOT get the best 'match' I tried this:
query = [0.165, 4.031, 9.234]
closestBelow = db[athlete].find({'Race' : {"$lte": query}}, {"_id": 1, "Race": 1}).sort("Race", -1).limit(2)
for i in closestBelow:
print(i)
closestAbove = db[athlete].find({'Race' : {"$gte": query}}, {"_id": 1, "Race": 1}).sort("Race", 1).limit(2)
for i in closestAbove:
print(i)
This does not seem to work.
Question1: How can I give the mentioned query IOT find the race in Mongo that matches the best/closes?.. When taken in account that a race is almost never exactly the same.
Question2: How can i see a percentage of match per document so that an athlete knows how 'serious' he must interpreted the pain score?
Thank you.

Thanks to this website I found a solution: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
Step 1: find your query;
Step 2: make a first selection based on query and append the results into a list (for example average);
Step 3: use a for loop to compare every item in the list with your query. Use Euclidean distance for this;
Step 4: when you have your matching processed, define the best match into a variable.
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
Database = 'Firstclass'
def newSearch(Athlete):
# STEP 1
db = client[Database]
lastDoc = [i for i in db[Athlete].find({},{ '_id': 1, 'Race': 1, 'Avarage': 1}).sort('_id', -1).limit(1)]
query = { '$and': [ { 'Average' : {'$gte': lastDoc[0].get('Average')*0.9} }, { 'Average' : {'$lte': lastDoc[0].get('Average')*1.1} } ] }
funnel = [x for x in db[Athlete].find(query, {'_id': 1, 'Race': 1}).sort('_id', -1).limit(15)]
#STEP 2
compareListID = []
compareListRace = []
for x in funnel:
if lastDoc[0].get('_id') != x.get('_id'):
compareListID.append(x.get('_id'))
compareListRace.append(x.get('Race'))
#STEP 3
for y in compareListRace:
ED = euclidean_distance(lastDoc[0].get('Race'),y)
ESlist.append(ED)
#STEP 4
matchObjID = compareListID[numpy.argmax(ESlist)]
matchRace = compareListRace[numpy.argmax(ESlist)]
newSearch('Jim')

Related

Importation of method EF 3.0 - trouble with results

I wrote a script to import the characterization factors of the LCIA method EF 3.0 (adapated) on Brightway. I think it works fine as I see the right characterization factors on the Activity Browser (ex for the Climate Change method : but when I run calculations with the method, the results are not the same as on Simapro (where I got the CSV Import File from) : And for instance the result is 0 for the Climate Change method. Do you know what can be the issue ?
It seems that the units are different but it is the same for the other methods that are available on Brightway.
Besides, I saw on another question that there would be a method implemented to import the EF 3.0 method, is it available yet ?
Thank you very much for your help.
Code of the importation script :
import brightway2 as bw
import csv
import uuid
from bw2data import mapping
from bw2data.utils import recursive_str_to_unicode
class import_method_EF:
'''Class for importing the EF method from Simapro export CSV file to Brightway. '''
def __init__(
self,
project_name,
name_file,
):
self.project_name = project_name
self.name_file = name_file
self.db_biosphere = bw.Database('biosphere3')
#Definition of the dictionnary for the correspondance between the Simapro and the ecoinvent categories
self.dict_categories = {'high. pop.' : 'urban air close to ground',
'low. pop.' : 'low population density, long-term',
'river' : 'surface water',
'in water' : 'in water',
'(unspecified)' : '',
'ocean' : 'ocean',
'indoor' : 'indoor',
'stratosphere + troposphere' : 'lower stratosphere + upper troposphere',
'low. pop., long-term' : 'low population density, long-term',
'groundwater, long-term' : 'ground-, long-term',
'agricultural' : 'agricultural',
'industrial' : 'industrial',
}
#Definition of the dictionnary of the ecoinvent units abreviations
self.dict_units = {'kg' : 'kilogram',
'kWh' : 'kilowatt hour',
'MJ' : 'megajoule',
'p':'p',
'unit':'unit',
'km':'kilometer',
'my' : 'meter-year',
'tkm' : 'ton kilometer',
'm3' : 'cubic meter',
'm2' :'square meter',
'kBq' : 'kilo Becquerel',
'm2a' : 'm2a', #à modifier
}
def importation(self) :
"""
Makes the importation from the Simapro CSV file to Brightway.
"""
#Set the current project
bw.projects.set_current(self.project_name)
self.data = self.open_CSV(self.name_file, [])
list_methods = []
new_flows = []
for i in range(len(self.data)) :
#print(self.data[i])
if self.data[i] == ['Name'] :
name_method = self.data[i+1][0]
if self.data[i] == ['Impact category'] :
list_flows = []
j = 4
while len(self.data[i+j])>1 :
biosphere_code = self.get_biosphere_code(self.data[i+j][2],self.data[i+j][1],self.data[i+j][0].lower())
if biosphere_code == 0 :
if self.find_if_already_new_flow(i+j, new_flows)[0] :
code = self.find_if_already_new_flow(i+j, new_flows)[1]
list_flows.append((('biosphere3', code),float(self.data[i+j][4].replace(',','.'))))
else :
code = str(uuid.uuid4())
while (self.db_biosphere.name, code) in mapping:
code = str(uuid.uuid4())
new_flows.append({'amount' : float(self.data[i+j][4].replace(',','.')),
'CAS number' : self.data[i+j][3],
'categories' : (self.data[i+j][0].lower(), self.dict_categories[self.data[i+j][1]]),
'name' : self.data[i+j][2],
'unit' : self.dict_units[self.data[i+j][5]],
'type' : 'biosphere',
'code' : code})
list_flows.append((('biosphere3', code),float(self.data[i+j][4].replace(',','.'))))
else :
list_flows.append((('biosphere3', biosphere_code),float(self.data[i+j][4].replace(',','.'))))
j+=1
list_methods.append({'name' : self.data[i+1][0],
'unit' : self.data[i+1][1],
'flows' : list_flows})
new_flows = recursive_str_to_unicode(dict([self._format_flow(flow) for flow in new_flows]))
if new_flows :
print('new flows :',len(new_flows))
self.new_flows = new_flows
biosphere = bw.Database(self.db_biosphere.name)
biosphere_data = biosphere.load()
biosphere_data.update(new_flows)
biosphere.write(biosphere_data)
print('biosphere_data :',len(biosphere_data))
for i in range(len(list_methods)) :
method = bw.Method((name_method,list_methods[i]['name']))
method.register(**{'unit':list_methods[i]['unit'],
'description':''})
method.write(list_methods[i]['flows'])
print(method.metadata)
method.load()
def open_CSV(self, CSV_file_name, list_rows):
'''
Opens a CSV file and gets a list of the rows.
: param : CSV_file_name = str, name of the CSV file (must be in the working directory)
: param : list_rows = list, list to get the rows
: return : list_rows = list, list of the rows
'''
#Open the CSV file and read it
with open(CSV_file_name, 'rt') as csvfile:
data = csv.reader(csvfile, delimiter = ';')
#Write every row in the list
for row in data:
list_rows.append(row)
return list_rows
def get_biosphere_code(self, simapro_name, simapro_cat, type_biosphere):
"""
Gets the Brightway code of a biosphere process given in a Simapro format.
: param : simapro_name = str, name of the biosphere process in a Simapro format.
: param : simapro_cat = str, category of the biosphere process (ex : high. pop., river, etc)
: param : type_biosphere = str, type of the biosphere process (ex : Emissions to water, etc)
: return : 0 if the process is not found in biosphere, the code otherwise
"""
if 'GLO' in simapro_name or 'RER' in simapro_name :
simapro_name = simapro_name[:-5]
if '/m3' in simapro_name :
simapro_name = simapro_name[:-3]
#Search in the biosphere database, depending on the category
if simapro_cat == '' :
act_biosphere = self.db_biosphere.search(simapro_name, filter={'categories' : (type_biosphere,)})
else :
act_biosphere = self.db_biosphere.search(simapro_name, filter={'categories' : (type_biosphere, self.dict_categories[simapro_cat])})
#Pourquoi j'ai fait ça ? ...
for act in act_biosphere :
if simapro_cat == '' :
if act['categories'] == (type_biosphere, ):
return act['code']
else :
if act['categories'] == (type_biosphere, self.dict_categories[simapro_cat]):
return act['code']
return 0
def _format_flow(self, cf):
# TODO
return (self.db_biosphere.name, cf['code']), {
'exchanges': [],
'categories': cf['categories'],
'name': cf['name'],
'type': ("resource" if cf["categories"][0] == "resource"
else "emission"),
'unit': cf['unit'],
}
def find_if_already_new_flow(self, n, new_flows) :
"""
"""
for k in range(len(new_flows)) :
if new_flows[k]['name'] == self.data[n][2] :
return True, new_flows[k]['code']
return False, 0
Edit : I made a modification in the get_biosphere_code method and it works better (it was not finding some biosphere flows) but I still have important differences between the results I get on Brightway and the results I get on Simapro. My investigations led me to the following observations :
there are some differences in ecoinvent activities and especially in the lists of biosphere flows (should be a sink of differences in result), some are missing in Brightway and also in the ecoSpold data that was used for the importation compared to the data in Simapro
it seems that the LCA calculation doesn't work the same way as regards the subcategories : for example, the biosphere flow Carbon dioxide, fossil (air,) is in the list of caracterization factors for the Climate Change method and when looking at the inventory in the Simapro LCA results, it appears that all the Carbon dioxide, fossil flows to air participate in the Climate Change impact, no matter what their subcategory is. But Brightway does not work this way and only takes into account the flows that are exactly the same, so it leads to important differences in the results.
In LCA there's no agreement on elementary flows and archetypical emission scenarios / context (https://doi.org/10.1007/s11367-017-1354-3), and implementations of the impact assessment methods differ (https://www.lifecycleinitiative.org/portfolio_category/lcia/).
It is not unusual that the same activity and same impact assessment method returns different results in different software. There are some attempts to improve the current practices (see e.g , https://github.com/USEPA/LCIAformatter).

Python list add variables in rows

im trying to add variables to a list that i created. Got a result from a session.execute.
i´ve done this:
def machine_id(session, machine_serial):
stmt_raw = '''
SELECT
id
FROM
machine
WHERE
machine.serial = :machine_serial_arg
'''
utc_now = datetime.datetime.utcnow()
utc_now_iso = pytz.utc.localize(utc_now).isoformat()
utc_start = datetime.datetime.utcnow() - datetime.timedelta(days = 30)
utc_start_iso = pytz.utc.localize(utc_start).isoformat()
stmt_args = {
'machine_serial_arg': machine_serial,
}
stmt = text(stmt_raw).columns(
#ts_insert = ISODateTime
)
result = session.execute(stmt, stmt_args)
ts = utc_now_iso
ts_start = utc_start_iso
ID = []
for row in result:
ID.append({
'id': row[0],
'ts': ts,
'ts_start': ts_start,
})
return ID
In trying to get the result over api like this:
def form_response(response, session):
result_machine_id = machine_id(session, machine_serial)
if not result_machine_id:
response['Error'] = 'Seriennummer nicht vorhanden/gefunden'
return
response['id_timerange'] = result_machine_id
Output looks fine.
{
"id_timerange": [
{
"id": 1,
"ts": "2020-08-13T08:32:25.835055+00:00",
"ts_start": "2020-07-14T08:32:25.835089+00:00"
}
]
}
Now i only want the id from it as a parameter for another function. Problem is i think its not a list. I cant select the first element. result_machine_id[0] result is like the posted Output. I think in my first function i only add ts & ts_start to the first row? Is it possible to add emtpy rows and then add 'ts':ts as value?
Help would be nice
If I have understood your question correctly ...
Your output looks like dict. so access its id_timerange key which gives you a list. Access the first element which gives you another dict. On this dict you have an id key:
result_machine_id["id_timerange"][0]["id"]

How do I assign multiple values in a dictionary in python?

I am working with object variables based off ICD-9 codes and am trying to create a dictionary which identifies all ICD-9 codes between E880.xx and E888.xx.
I want to code all values between E880.xx and E888.xx as a 1, and all other values as a 0.
I attempted this:
fallinjury_Dictionary = {1 : 'E880', 1 : 'E881', 1 : 'E882' ...}
but the key gets overwritten every time and I only end up with one value in the dictionary (1 = E888)
I've also tried this:
fallinjury_Dictionary = {1 : {'E880' or 'E881' or 'E882' .... or 'E888'}
which just doesn't work.
Dictionaries can only have one value for each key. If you are looking to map the value 1 to E880 (instead of the other way around), you can do that, but otherwise you can't (uniqueness of keys is in an inherent part of dictionaries).
fallinjury_Dictionary = dict.fromkeys(['E880', 'E881', 'E882'], 1)
fallinjury_Dictionary.update(dict.fromkeys(['E880', 'E883'], 0)) # to update values
If i got you right it must be "0: 'E901'" for E901
fallinjury_Dictionary = {1 : 'E880', 1 : 'E881', 1 : 'E882', ... 0: 'E901'}
# Instead i would switch the key and the value for this case
fallinjury_Dictionary = {'E880': 1, 'E881': 1, 'E882':1, ... 'E901': 0}
# You could use boolean
fallinjury_Dictionary = {'E880': True, 'E881': True, 'E882': True, ... 'E901': False}
You could also think about creating classes.

Arango query, collection with Edge count

Really new to Arango and I'm experimenting with it, so please bear with me. I have a feed collection and every feed can be liked by a user
[user]---likes_feed--->[feed].
I'm trying to create a query that will return a feed by its author, and add the number of likes to the result. This is what I have so far and it seems to work, but it only returns feed that have at least 1 like (An edge exists between the feed and a user)
below is my query
FOR f IN feed
SORT f.creationDate
FILTER f.author == #user_key
LIMIT #start_index, #end_index
FOR x IN INBOUND CONCAT('feed','/',f._key) likes_feed
OPTIONS {bfs: true, uniqueVertices: 'global'}
COLLECT feed = f WITH COUNT INTO counter
RETURN {
'feed':feed,
likes: counter
}
this is an example of result
[
"feed":{
"_key":"string",
"_id":"users_feed/1680835",
"_rev":"_W8zRPqe--_",
"author":"author_a",
"creationDate":"98879845467787979",
"title":"some title",
"caption":"some caption'
},
"likes":1
]
If a feed has no likes, no edge inbound to that feed, how do I return the likes count as 0
Something like this?
[
"feed":{
"_key":"string",
"_id":"users_feed/1680835",
"_rev":"_W8zRPqe--_",
"author":"author_a",
"creationDate":"98879845467787979",
"title":"some title",
"caption":"some caption'
},
"likes":0
]
So finally I found the solution. I had to create a graph and traverse it. Final result below
FOR f IN users_feed
SORT f.creationDate
FILTER f.author == #author_id
LIMIT #start_index, #end_index
COLLECT feed = f
LET inboundEdges = LENGTH(FOR v IN 1..1 INBOUND feed GRAPH 'likes_graph' RETURN 1)
RETURN {
feed :feed
,likes: inboundEdges
}

Updating multiple line plots dynamically in callback in bokeh

I have a use case where I have multiple line plots (with legends), and I need to update the line plots based on a column condition. Below is an example of two data set, based on the country, the column data source changes. But the issue I am facing is, the number of columns is not fixed for the data source, and even the types can vary. So, when I update the data source based on a callback when there is a new country selected, I get this error:
Error: attempted to retrieve property array for nonexistent field 'pay_conv_7d.content'.
I am guessing because in the new data source, the pay_conv_7d.content column doesn't exist, but in my plot those lines were already there. I have been trying to fix this issue by various means (making common columns for all country selection - adding the missing column in the data source in callback, but still get issues.
Is there any clean way to have multiple line plots updating using callback, and not do a lot of hackish way? Any insights or help would be really appreciated. Thanks much in advance! :)
def setup_multiline_plots(x_axis, y_axis, title_text, data_source, plot):
num_categories = len(data_source.data['categories'])
legends_list = list(data_source.data['categories'])
colors_list = Spectral11[0:num_categories]
# xs = [data_source.data['%s.'%x_axis].values] * num_categories
# ys = [data_source.data[('%s.%s')%(y_axis,column)] for column in data_source.data['categories']]
# data_source.data['x_series'] = xs
# data_source.data['y_series'] = ys
# plot.multi_line('x_series', 'y_series', line_color=colors_list,legend='categories', line_width=3, source=data_source)
plot_list = []
for (colr, leg, column) in zip(colors_list, legends_list, data_source.data['categories']):
xs, ys = '%s.'%x_axis, ('%s.%s')%(y_axis,column)
plot.line(xs,ys, source=data_source, color=colr, legend=leg, line_width=3, name=ys)
plot_list.append(ys)
data_source.data['plot_names'] = data_source.data.get('plot_names',[]) + plot_list
plot.title.text = title_text
def update_plot(country, timeseries_df, timeseries_source,
aggregate_df, aggregate_source, category,
plot_pay_7d, plot_r_pay_90d):
aggregate_metrics = aggregate_df.loc[aggregate_df.country == country]
aggregate_metrics = aggregate_metrics.nlargest(10, 'cost')
category_types = list(aggregate_metrics[category].unique())
timeseries_df = timeseries_df[timeseries_df[category].isin(category_types)]
timeseries_multi_line_metrics = get_multiline_column_datasource(timeseries_df, category, country)
# len_series = len(timeseries_multi_line_metrics.data['time.'])
# previous_legends = timeseries_source.data['plot_names']
# current_legends = timeseries_multi_line_metrics.data.keys()
# common_legends = list(set(previous_legends) & set(current_legends))
# additional_legends_list = list(set(previous_legends) - set(current_legends))
# for legend in additional_legends_list:
# zeros = pd.Series(np.array([0] * len_series), name=legend)
# timeseries_multi_line_metrics.add(zeros, legend)
# timeseries_multi_line_metrics.data['plot_names'] = previous_legends
timeseries_source.data = timeseries_multi_line_metrics.data
aggregate_source.data = aggregate_source.from_df(aggregate_metrics)
def get_multiline_column_datasource(df, category, country):
df_country = df[df.country == country]
df_pivoted = pd.DataFrame(df_country.pivot_table(index='time', columns=category, aggfunc=np.sum).reset_index())
df_pivoted.columns = df_pivoted.columns.to_series().str.join('.')
categories = list(set([column.split('.')[1] for column in list(df_pivoted.columns)]))[1:]
data_source = ColumnDataSource(df_pivoted)
data_source.data['categories'] = categories
Recently I had to update data on a Multiline glyph. Check my question if you want to take a look at my algorithm.
I think you can update a ColumnDataSource in three ways at least:
You can create a dataframe to instantiate a new CDS
cds = ColumnDataSource(df_pivoted)
data_source.data = cds.data
You can create a dictionary and assign it to the data attribute directly
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}
data_source.data = d
Here if you need different sizes of columns or empty columns you can fill the gaps with NaN values in order to keep column sizes. And I think this is the solution to your question:
import numpy as np
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [np.nan], [np.nan]],
'ys1': [[179.0, 169.0], [np.nan], [np.nan]],
'xs2': [[np.nan], [np.nan], [np.nan]],
'ys2': [[np.nan], [np.nan], [np.nan]]
}
data_source.data = d
Or if you only need to modify a few values then you can use the method patch. Check the documentation here.
The following example shows how to patch entire column elements. In this case,
source = ColumnDataSource(data=dict(foo=[10, 20, 30], bar=[100, 200, 300]))
patches = {
'foo' : [ (slice(2), [11, 12]) ],
'bar' : [ (0, 101), (2, 301) ],
}
source.patch(patches)
After this operation, the value of the source.data will be:
dict(foo=[11, 22, 30], bar=[101, 200, 301])
NOTE: It is important to make the update in one go to avoid performance issues

Resources