How do I assign multiple values in a dictionary in Python? - python-3.x

I am working with object variables based on ICD-9 codes and am trying to create a dictionary that identifies all ICD-9 codes between E880.xx and E888.xx.
I want to code all values between E880.xx and E888.xx as a 1, and all other values as a 0.
I attempted this:
fallinjury_Dictionary = {1 : 'E880', 1 : 'E881', 1 : 'E882' ...}
but the key gets overwritten every time, so I only end up with one entry in the dictionary (1 : 'E888').
I've also tried this:
fallinjury_Dictionary = {1 : {'E880' or 'E881' or 'E882' .... or 'E888'}}
which just doesn't work.

Dictionaries can only have one value for each key. If you are looking to map 'E880' to the value 1 (instead of the other way around), you can do that, but otherwise you can't: uniqueness of keys is an inherent part of dictionaries.
fallinjury_Dictionary = dict.fromkeys(['E880', 'E881', 'E882'], 1)
fallinjury_Dictionary.update(dict.fromkeys(['E880', 'E883'], 0)) # to update values
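Since the codes form a contiguous range, here is a minimal sketch (assuming the plain three-digit E-code prefixes are what you key on) that builds the whole mapping and lets everything else default to 0:
fall_codes = ['E%d' % n for n in range(880, 889)]   # 'E880' ... 'E888'
fallinjury_Dictionary = dict.fromkeys(fall_codes, 1)
# .get with a default handles "all other values as a 0" without listing them:
fallinjury_Dictionary.get('E885', 0)   # -> 1
fallinjury_Dictionary.get('E901', 0)   # -> 0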

If I got you right, it would have to be 0 : 'E901' for E901:
fallinjury_Dictionary = {1 : 'E880', 1 : 'E881', 1 : 'E882', ... 0: 'E901'}
# Instead, I would switch the key and the value for this case
fallinjury_Dictionary = {'E880': 1, 'E881': 1, 'E882':1, ... 'E901': 0}
# You could also use booleans
fallinjury_Dictionary = {'E880': True, 'E881': True, 'E882': True, ... 'E901': False}
You could also think about creating classes.
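For example, a minimal sketch of the class idea (all names here are made up for illustration):
class FallInjuryCodes:
    # ICD-9 E-codes for falls, E880-E888, per the question
    codes = {'E%d' % n for n in range(880, 889)}

    @staticmethod
    def flag(icd9):
        # 1 for codes between E880.xx and E888.xx, 0 for everything else
        return 1 if icd9.split('.')[0] in FallInjuryCodes.codes else 0

FallInjuryCodes.flag('E884.2')   # -> 1
FallInjuryCodes.flag('E901')     # -> 0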

Related

PyMongo: how to query a series and find the closest match

This is a simplified example of how my data for a single athlete is stored in MongoDB:
{ "_id" : ObjectId('5bd6eab25f74b70e5abb3326'),
"Result" : 12,
"Race" : [0.170, 4.234, 9.170]
"Painscore" : 68,
}
Now, when this athlete has performed a race, I want to search for the stored race that is MOST similar to the current one, and then compare the two pain scores.
In order to get the best 'match' I tried this:
query = [0.165, 4.031, 9.234]
closestBelow = db[athlete].find({'Race': {"$lte": query}}, {"_id": 1, "Race": 1}).sort("Race", -1).limit(2)
for i in closestBelow:
    print(i)
closestAbove = db[athlete].find({'Race': {"$gte": query}}, {"_id": 1, "Race": 1}).sort("Race", 1).limit(2)
for i in closestAbove:
    print(i)
This does not seem to work.
Question 1: How can I use the above query to find the race in Mongo that matches best/closest, taking into account that a race is almost never exactly the same?
Question 2: How can I see a percentage of match per document, so that an athlete knows how 'seriously' he must interpret the pain score?
Thank you.
Thanks to this website I found a solution: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
Step 1: find your query;
Step 2: make a first selection based on the query and append the results to a list (for example, by filtering on the average);
Step 3: use a for loop to compare every item in the list with your query, using the Euclidean distance;
Step 4: when you have your matching processed, define the best match into a variable.
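The code below calls a euclidean_distance helper that the post never defines; a minimal version (my assumption, not part of the original answer) could be:
import numpy

def euclidean_distance(a, b):
    # straight-line distance between two equal-length numeric sequences
    return numpy.linalg.norm(numpy.array(a) - numpy.array(b))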
import numpy
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
Database = 'Firstclass'

def newSearch(Athlete):
    # STEP 1: take the athlete's most recent race and select candidates
    # whose average lies within +/-10% of it
    db = client[Database]
    lastDoc = [i for i in db[Athlete].find({}, {'_id': 1, 'Race': 1, 'Average': 1}).sort('_id', -1).limit(1)]
    query = {'$and': [{'Average': {'$gte': lastDoc[0].get('Average') * 0.9}},
                      {'Average': {'$lte': lastDoc[0].get('Average') * 1.1}}]}
    funnel = [x for x in db[Athlete].find(query, {'_id': 1, 'Race': 1}).sort('_id', -1).limit(15)]
    # STEP 2: collect the candidates, excluding the query race itself
    compareListID = []
    compareListRace = []
    for x in funnel:
        if lastDoc[0].get('_id') != x.get('_id'):
            compareListID.append(x.get('_id'))
            compareListRace.append(x.get('Race'))
    # STEP 3: Euclidean distance between the last race and every candidate
    ESlist = []
    for y in compareListRace:
        ED = euclidean_distance(lastDoc[0].get('Race'), y)
        ESlist.append(ED)
    # STEP 4: the best match is the candidate at the smallest distance, hence argmin
    matchObjID = compareListID[numpy.argmin(ESlist)]
    matchRace = compareListRace[numpy.argmin(ESlist)]

newSearch('Jim')

Updating multiple line plots dynamically in callback in bokeh

I have a use case with multiple line plots (with legends), and I need to update the line plots based on a column condition. Below is an example with two data sets; based on the country, the column data source changes. But the issue I am facing is that the number of columns is not fixed for the data source, and even the types can vary. So, when I update the data source in a callback after a new country is selected, I get this error:
Error: attempted to retrieve property array for nonexistent field 'pay_conv_7d.content'.
I am guessing this is because the pay_conv_7d.content column doesn't exist in the new data source, but those lines were already in my plot. I have been trying to fix this by various means (making common columns for all country selections, adding the missing columns to the data source in the callback), but I still get issues.
Is there any clean way to have multiple line plots update via a callback, without a lot of hackish workarounds? Any insights or help would be really appreciated. Thanks much in advance! :)
import numpy as np
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral11

def setup_multiline_plots(x_axis, y_axis, title_text, data_source, plot):
    num_categories = len(data_source.data['categories'])
    legends_list = list(data_source.data['categories'])
    colors_list = Spectral11[0:num_categories]
    # xs = [data_source.data['%s.' % x_axis].values] * num_categories
    # ys = [data_source.data['%s.%s' % (y_axis, column)] for column in data_source.data['categories']]
    # data_source.data['x_series'] = xs
    # data_source.data['y_series'] = ys
    # plot.multi_line('x_series', 'y_series', line_color=colors_list, legend='categories', line_width=3, source=data_source)
    plot_list = []
    for (colr, leg, column) in zip(colors_list, legends_list, data_source.data['categories']):
        xs, ys = '%s.' % x_axis, '%s.%s' % (y_axis, column)
        plot.line(xs, ys, source=data_source, color=colr, legend=leg, line_width=3, name=ys)
        plot_list.append(ys)
    data_source.data['plot_names'] = data_source.data.get('plot_names', []) + plot_list
    plot.title.text = title_text

def update_plot(country, timeseries_df, timeseries_source,
                aggregate_df, aggregate_source, category,
                plot_pay_7d, plot_r_pay_90d):
    aggregate_metrics = aggregate_df.loc[aggregate_df.country == country]
    aggregate_metrics = aggregate_metrics.nlargest(10, 'cost')
    category_types = list(aggregate_metrics[category].unique())
    timeseries_df = timeseries_df[timeseries_df[category].isin(category_types)]
    timeseries_multi_line_metrics = get_multiline_column_datasource(timeseries_df, category, country)
    # len_series = len(timeseries_multi_line_metrics.data['time.'])
    # previous_legends = timeseries_source.data['plot_names']
    # current_legends = timeseries_multi_line_metrics.data.keys()
    # common_legends = list(set(previous_legends) & set(current_legends))
    # additional_legends_list = list(set(previous_legends) - set(current_legends))
    # for legend in additional_legends_list:
    #     zeros = pd.Series(np.array([0] * len_series), name=legend)
    #     timeseries_multi_line_metrics.add(zeros, legend)
    # timeseries_multi_line_metrics.data['plot_names'] = previous_legends
    timeseries_source.data = timeseries_multi_line_metrics.data
    aggregate_source.data = aggregate_source.from_df(aggregate_metrics)

def get_multiline_column_datasource(df, category, country):
    df_country = df[df.country == country]
    df_pivoted = pd.DataFrame(df_country.pivot_table(index='time', columns=category, aggfunc=np.sum).reset_index())
    df_pivoted.columns = df_pivoted.columns.to_series().str.join('.')
    categories = list(set([column.split('.')[1] for column in list(df_pivoted.columns)]))[1:]
    data_source = ColumnDataSource(df_pivoted)
    data_source.data['categories'] = categories
    return data_source
Recently I had to update data on a Multiline glyph. Check my question if you want to take a look at my algorithm.
I think you can update a ColumnDataSource in three ways at least:
You can create a dataframe to instantiate a new CDS
cds = ColumnDataSource(df_pivoted)
data_source.data = cds.data
You can create a dictionary and assign it to the data attribute directly
d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}
data_source.data = d
Here, if you need columns of different sizes or empty columns, you can fill the gaps with NaN values in order to keep the column sizes consistent. I think this is the solution to your question:
import numpy as np
d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [np.nan], [np.nan]],
    'ys1': [[179.0, 169.0], [np.nan], [np.nan]],
    'xs2': [[np.nan], [np.nan], [np.nan]],
    'ys2': [[np.nan], [np.nan], [np.nan]]
}
data_source.data = d
Or if you only need to modify a few values then you can use the method patch. Check the documentation here.
The following example shows how to patch both a slice of one column and individual elements of another:
source = ColumnDataSource(data=dict(foo=[10, 20, 30], bar=[100, 200, 300]))
patches = {
    'foo' : [ (slice(2), [11, 12]) ],
    'bar' : [ (0, 101), (2, 301) ],
}
source.patch(patches)
After this operation, the value of the source.data will be:
dict(foo=[11, 12, 30], bar=[101, 200, 301])
NOTE: It is important to make the update in one go to avoid performance issues
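For instance (a sketch with made-up columns new_xs and new_ys), assigning columns one at a time fires one change event per assignment and can briefly leave the source with mismatched column lengths, whereas assigning a complete dict updates everything in a single event:
# slower: two separate change events, columns momentarily inconsistent
data_source.data['xs0'] = new_xs
data_source.data['ys0'] = new_ys

# better: build the full dict first, then assign once
data_source.data = {'xs0': new_xs, 'ys0': new_ys}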

using as.ppp on data frame to create marked process

I am using a data frame to create a marked point process with the as.ppp function, but I get an error: Error: is.numeric(x) is not TRUE. The data I am using is as follows:
dput(head(pointDataUTM[,1:2]))
structure(list(POINT_X = c(439845.0069, 450018.3603, 451873.2925,
446836.5498, 445040.8974, 442060.0477), POINT_Y = c(4624464.56,
4629024.646, 4624579.758, 4636291.222, 4614853.993, 4651264.579
)), .Names = c("POINT_X", "POINT_Y"), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I can see that the first two columns are numeric, so I do not know why it is a problem.
> str(pointDataUTM)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5028 obs. of 31 variables:
$ POINT_X : num 439845 450018 451873 446837 445041 ...
$ POINT_Y : num 4624465 4629025 4624580 4636291 4614854 ...
Then I also checked for NAs, which shows there are none:
> sum(is.na(pointDataUTM$POINT_X))
[1] 0
> sum(is.na(pointDataUTM$POINT_Y))
[1] 0
Even when I use only the first two columns of the data.frame, the error I get from as.ppp is this:
Error: is.numeric(x) is not TRUE
5.stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), ch), call. = FALSE, domain = NA)
4.stopifnot(is.numeric(x))
3.ppp(X[, 1], X[, 2], window = win, marks = marx, check = check)
2.as.ppp.data.frame(pointDataUTM[, 1:2], W = studyWindow)
1.as.ppp(pointDataUTM[, 1:2], W = studyWindow)
Could someone tell me what the mistake is here and why I get the non-numeric error?
Thank you.
The critical check is whether PointDataUTM[,1] is numeric, rather than PointDataUTM$POINT_X.
Since PointDataUTM is a tbl object, and tbl is a function from the dplyr package, what is probably happening is that the subset operator for the tbl class returns a data frame, not a numeric vector, when a single column is extracted, whereas the $ operator returns a numeric vector.
I suggest you convert your data to data.frame using as.data.frame() before calling as.ppp.
In the next version of spatstat we will make our code more robust against this kind of problem.
I'm on the phone so I can't check, but I think it happens because you have a tibble and not a data.frame. Please try converting to a data.frame using as.data.frame first.

identify smallest date in a dictionary within dictionary

My data is arranged in dictionaries within dictionaries, like so:
dict = {subdict1:{}, subdict2:{},...}
where
subdict1 = { subdict_a: {"date":A, "smallest_date":False}, subdict_b : {"date":B, "smallest_date": False},...}
I'd like to loop through the subdictionaries a,b,c... and identify which of the dates A, B, C... is the smallest in each subdictionary, and change the value of 'smallest_date' to True.
How should I approach this problem? I tried something like this, but couldn't quite finish it:
for subdict_number, values1 in dict.items():
    smallest_date = None
    for subdict_alphabet, values2 in values1.items():
        if smallest_date is None or smallest_date > values2["date"]:
            smallest_date = values2["date"]
            smallest_subdict = subdict_alphabet
And then, as the inner loop closes, some magic that sets
dict[subdict][smallest_subdict]["date"] = smallest_date
and then continues to the next subdict to do the same thing.
I can't finish this. Can you help me out? A completely different approach can be used, but as a beginner I couldn't think of one.
I've tried to keep the naming explanatory.
Given the input dictionary:
main_dict = {'subdict1': {'subdict_1a': {"date": 1, "smallest_date": False},
                          'subdict_1b': {"date": 2, "smallest_date": False}},
             'subdict2': {'subdict_2a': {"date": 3, "smallest_date": False},
                          'subdict_2b': {"date": 4, "smallest_date": False}}}
Iterate through the subdicts and declare variables:
for subdict in main_dict:
    min_date = 10000000
    min_date_subsubdict_name = None
Iterate through the subsubdicts and determine the minimum:
    for subsubdict in main_dict[subdict]:
        if main_dict[subdict][subsubdict]['date'] < min_date:
            min_date = main_dict[subdict][subsubdict]['date']
            min_date_subsubdict_name = subsubdict
Inside the first loop, but outside the second loop:
    main_dict[subdict][min_date_subsubdict_name]['smallest_date'] = True
This should produce the output main_dict:
{'subdict2': {'subdict_2a': {'date': 3, 'smallest_date': True}, 'subdict_2b': {'date': 4, 'smallest_date': False}}, 'subdict1': {'subdict_1a': {'date': 1, 'smallest_date': True}, 'subdict_1b': {'date': 2, 'smallest_date': False}}}
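As a side note, the same result can be written more compactly with min() and a key function (a sketch against the main_dict structure above):
for subdict in main_dict.values():
    # pick the inner dict whose "date" is smallest and flag it
    smallest = min(subdict.values(), key=lambda entry: entry["date"])
    smallest["smallest_date"] = True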

Mapping list to Map not working

I have a map
["name1":["field1":value1, "field2":value2, "field3":value3],
"name2":["field1":value4, "field2":value5, "field3":value6],
"name3":["field1":value7, "field2":value8, "field3":value9]]
and a list
[name1, name3]
I want the result to be:
["name1":["field1":value1, "field2":value2, "field3":value3],
"name3":["field1":value7, "field2":value8, "field3":value9]]
The code I used:
result = recomendationOffers.inject( [:] ) { m, v ->
    if( !m[ v ] ) {
        m[ v ] = []
    }
    m[ v ] << tariffRecMap[ v.toString() ]
    m
}
Now the datatype of name1 has changed from Varchar2(35) to Number(10).
I expected the same logic to work, but it does not, and I am getting the values
["name1":[null], "name3":[null]]
Also, a value such as 1000000959 is displayed as 1.000000959E9; is this making any difference?
Posting the original values. When I was handling strings, it looked as below:
["FBUN-WEB-VIRGIN-10-24-08":["FIXEDLN_ALLOWANCE":0.0,
"OFFER_VERSION_ID":1.000013082E9, "OFFER_TYPE_DESC":"Contract",
"OFFER_NAME":"PM_V 10 50+250 CA", "SMS_ALLOWANCE":250.0,
"VM_TARIFF_FLAG":"N", "IPHONE_IND":"N", "OFFER_MRC":10.5,
"ALLOWANCE08":0.0, "DATA_ALLOWANCE":524288.0, "BB_IND":"N",
"CONTRACT_TERM":24.0, "OFFER_CODE":"FBUN-WEB-VIRGIN-10-24-08",
"ONNET_TEXT_ALLOWANCE":0.0, "VOICE_ALLOWANCE":50.0,
"MMS_ALLOWANCE":0.0, "ONNET_ALLOWANCE":0.0],
Now, after the database datatype changed from varchar to number, it looks as below, where the value in the DB is 1000010315:
[1.000010315E9:["FIXEDLN_ALLOWANCE":0.0,
"OFFER_VERSION_ID":1.000010315E9, "OFFER_TYPE_DESC":"Sup Voice",
"OFFER_NAME":"VIP - 35 c", "SMS_ALLOWANCE":60000.0,
"VM_TARIFF_FLAG":"N", "IPHONE_IND":"N", "OFFER_MRC":35.0,
"ALLOWANCE08":45000.0, "DATA_ALLOWANCE":2.147483648E9, "BB_IND":"N",
"CONTRACT_TERM":24.0, "OFFER_CODE":"FBUN-MVP-WEB-VIRGIN-35-24-20",
"ONNET_TEXT_ALLOWANCE":0.0, "VOICE_ALLOWANCE":45000.0,
"MMS_ALLOWANCE":0.0, "ONNET_ALLOWANCE":0.0]
Now the datatype of name1 has changed from Varchar2(35) to Number(10) ... also a value such as 1000000959 is displayed as 1.000000959E9, is this making any difference?
Yes, all the difference in the world. That means you're converting a Double (most likely) to a String, and as the String "1000000959" is not equal to "1.000000959E9", you don't get a match.
Not sure from the question which bits are doubles and which bits are Strings... Maybe you could expand with an actual example?
Also, your inject method can be replaced with:
def result = tariffRecMap.subMap( recomendationOffers )
