I am building a relational DB using Python. So far I have two tables, as follows:
>>> df_Patient.columns
Index(['NgrNr', 'FamilieNr', 'DosNr', 'Geslacht', 'FamilieNaam', 'VoorNaam',
       'GeboorteDatum', 'PreBirth'],
      dtype='object')
>>> df_LaboRequest.columns
Index(['RequestId', 'IsComplete', 'NgrNr', 'Type', 'RequestDate', 'IntakeDate',
       'ReqMgtUnit'],
      dtype='object')
The two tables are quite big:
>>> df_Patient.shape
(386249, 8)
>>> df_LaboRequest.shape
(342225, 7)
The column NgrNr in df_LaboRequest is a foreign key (FK) and references the column of the same name in df_Patient. In order to avoid any integrity error, I need to make sure that all the values in df_LaboRequest['NgrNr'] are present in df_Patient['NgrNr'].
With a list comprehension I tried the following (to pick out the values that would throw an error):
[x for x in list(set(df_LaboRequest['NgrNr'])) if x not in list(set(df_Patient['NgrNr']))]
However, this is taking ages to complete. Would anyone recommend a faster method ("method" as a general word, a synonym for procedure, nothing to do with the Pythonic meaning of the term) for such a comparison?
One-liners aren't always better.
Don't check for membership in lists. Why on earth would you create a set (which is the recommended data structure for O(1) membership checks) and then convert it to a list, which has O(N) membership checks?
Build the set from df_Patient once, outside the list comprehension, and reuse it instead of rebuilding the set in every iteration:
patients = set(df_Patient['NgrNr'])
lab_requests = set(df_LaboRequest['NgrNr'])
result = [x for x in lab_requests if x not in patients]
Or, if you like to use set operations, simply find the difference of both sets:
result = lab_requests - patients
Alternatively, use pandas' isin() function on the columns themselves:
patients = df_Patient['NgrNr'].drop_duplicates()
lab_requests = df_LaboRequest['NgrNr'].drop_duplicates()
result = lab_requests[~lab_requests.isin(patients)]
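Applied back to the original frames (a sketch that only assumes the column names from the question), the same idea also gives the offending rows directly:
# Rows of df_LaboRequest whose NgrNr has no matching patient record.
missing = df_LaboRequest.loc[~df_LaboRequest['NgrNr'].isin(df_Patient['NgrNr'])]
print(missing[['RequestId', 'NgrNr']].head())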
Let's test how much faster these changes make the code:
import pandas as pd
import random
import timeit
# Make dummy dataframes of patients and lab_requests
randoms = [random.randint(1, 1000) for _ in range(10000)]
patients = pd.DataFrame("patient{0}".format(x) for x in randoms[:5000])[0]
lab_requests = pd.DataFrame("patient{0}".format(x) for x in randoms[2000:8000])[0]
# Do it your way
def fun1(pat, lr):
    return [x for x in list(set(lr)) if x not in list(set(pat))]

# Do it my way: set operations
def fun2(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return lr_s - pat_s

# Or explicitly iterate over the set
def fun3(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return [x for x in lr_s if x not in pat_s]

# Or using pandas
def fun4(pat, lr):
    pat = pat.drop_duplicates()
    lr = lr.drop_duplicates()
    return lr[~lr.isin(pat)]
# Make sure all 4 functions return the same thing
assert set(fun1(patients, lab_requests)) == set(fun2(patients, lab_requests)) == set(fun3(patients, lab_requests)) == set(fun4(patients, lab_requests))
# Time it
timeit.timeit('fun1(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun1', number=100)
# Output: 48.36615000000165
timeit.timeit('fun2(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun2', number=100)
# Output: 0.10799920000044949
timeit.timeit('fun3(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun3', number=100)
# Output: 0.11038020000069082
timeit.timeit('fun4(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun4', number=100)
# Output: 0.32021789999998873
Looks like we have a ~150x speedup with pandas and a ~450x speedup with set operations!
I don't have pandas installed right now to try this, but you could try removing the list(...) cast. I don't think it adds anything meaningful to the program, and sets are much faster for lookups (e.g. x in set(...)) than lists.
You could also try doing this with the pandas API rather than lists and sets; sometimes that is faster. Try searching for unique(). Then you could compare the sizes of the two columns and, if they are the same, sort them and do an equality check.
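A minimal sketch of what that could look like with unique() and NumPy's set difference, assuming the column names from the question:
import numpy as np

# Unique values per column, then the difference via numpy.
lab_ids = df_LaboRequest['NgrNr'].unique()
patient_ids = df_Patient['NgrNr'].unique()
orphans = np.setdiff1d(lab_ids, patient_ids)  # lab-request NgrNr values with no patient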
Related
# To generate possible combinations
from itertools import combinations
main_list=('a1','a2','a3','a4')
abc=combinations(main_list,3)
for i in list(abc):
    print(i)
# Creating a number of empty containers
n = 6
obj = {}
for i in range(n):
    obj['set'+str(i)] = ()
# I want to combine these, take list1 generated by combinations and store them down in set1.
I want to generate all possible combinations of the items in a list and store them in different lists. E.g. with main_list = ('a1','a2','a3') I want combination lists like set1 = ('a1',), set2 = ('a2',), set3 = ('a3',), set4 = ('a1','a2'), set5 = ('a1','a3'), set6 = ('a2','a3'), set7 = ('a1','a2','a3'). How do I access the lists set1, set2, ...?
If I understand correctly, you want to generate the power set - the set of all subsets of the list. Python's itertools documentation provides a nice recipe to generate this:
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
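For example, listing the result for a three-element input gives every subset, from the empty tuple up to the full tuple:
>>> list(powerset(['a1', 'a2', 'a3']))
[(), ('a1',), ('a2',), ('a3',), ('a1', 'a2'), ('a1', 'a3'), ('a2', 'a3'), ('a1', 'a2', 'a3')]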
While Nathan's answer helps with the first part of the question, my code will help you build a dictionary so you can access the sets like sets['set1'], as asked.
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

def make_dict(comb):
    # Store each combination under a key like 'set0', 'set1', ...
    sets = {}
    for i, subset in enumerate(comb):  # 'subset' avoids shadowing the built-in set
        sets['set'+str(i)] = subset
    return sets

if __name__ == '__main__':
    sets = make_dict(powerset(['a1','a2','a3','a4']))
    print(sets['set1'])
Output
('a1',)
I have defined my function (the partition function partfunc_E, a function of ionisation energy chiI and temperature T) and have to create a nested for loop: an outer loop over the values of ionisation energy (chiI, a vector) and an inner loop to evaluate the sum over all energy states for the ion (here I am using 5). The output should be a vector of partition-function values, one for each ionisation energy in the chiI array.
(From our instructions:) Using the ionisation energies of calcium and a temperature of 10,000 K, the output should look something like:
[ 1.45605581 1.45648718 1.45648849 1.45648849 1.45648849]
Note: g=1 for all states.
I don't know what I have done wrong, but I am not getting what they expect at all: I am getting an array of length 100 in which all the values are the same. The following code is my attempt.
import matplotlib.pylab as plt
import numpy as np
from astropy import units as u
from astropy import constants as const
%matplotlib inline
k = const.k_B.to('eV*K**-1')
#print(k)
def partfunc_E(chiI, T):
    return g*np.exp(-(chiI/(k*T)))

chiI = [6.1131554, 11.871719, 50.91316, 67.2732, 84.34]*u.eV
T = (10000)*u.K
g = 1
partition_Ca = []
for i in chiI:
    for j in range(0, 10, 1):
        function = partfunc_E(chiI, T)
        partition_Ca.append(sum(partfunc_E(chiI, T)))
You are doing something like this:
In [336]: x = np.arange(3)
     ...: alist = []
     ...: for i in x:
     ...:     for j in range(2):
     ...:         alist.append(np.exp(x).sum())
     ...:
In [337]: alist
Out[337]:
[11.107337927389695,
11.107337927389695,
11.107337927389695,
11.107337927389695,
11.107337927389695,
11.107337927389695]
This appends the same value to alist on every pass; the looping adds nothing new:
In [338]: np.exp(x).sum()
Out[338]: 11.107337927389695
You could instead use i, the iteration variable in the loop:
In [339]: x = np.arange(3)
     ...: alist = []
     ...: for i in x:
     ...:     alist.append(np.exp(i))
     ...:
In [340]: alist
Out[340]: [1.0, 2.718281828459045, 7.38905609893065]
But since np.exp works with an array, there's no need to iterate.
In [341]: np.exp(x)
Out[341]: array([1. , 2.71828183, 7.3890561 ])
As this demonstrates, you need to experiment with some basic Python loops and check the results in detail. Don't write a big script and then scratch your head when the results aren't right; test the pieces and make sure you understand each step.
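For the question's case, the key point is that a single vectorised call can produce one value per ionisation energy. A minimal sketch of the shape of that computation (the arrays below are placeholders, not the calcium data or the real formula):
import numpy as np

chi = np.linspace(1.0, 5.0, 5)   # stand-in for the 5 ionisation energies
states = np.arange(3)            # stand-in for the per-ion energy states

# Broadcasting builds a (5, 3) array of terms; summing over axis=1 collapses
# the states, leaving one partition-function-like value per ionisation energy.
terms = np.exp(-(chi[:, None] + states) / 10.0)
per_energy = terms.sum(axis=1)   # shape (5,)
print(per_energy)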
My co-worker and I have been setting up, configuring, and testing Dask for a week or so now, and everything is working great (can't speak highly enough about how easy, straightforward, and powerful it is), but now we are trying to leverage it for more than just testing and are running into an issue. We believe it's a fairly simple one related to syntax and an understanding gap. Any help to get it running is greatly appreciated. Any support in evolving our understanding of more optimal paths is also greatly appreciated.
We got fairly close with these two posts:
Dask: How would I parallelize my code with dask delayed?
Unpacking result of delayed function
High level flow:
Open data in pandas & clean it (we plan on moving this to a pipeline)
From there, convert the cleaned data set for regression into a dask data frame
Set the x & y variables and create all unique x combination sets
Create all unique formulas (y ~ x1 + x2 +0)
Run each individual formula set with the data through a linear lasso lars model to get the AIC for each formula for ranking
Current Issue:
Run each individual formula set (~1700 formulas) with the data (1 single data set which doesn’t vary with each run) on the dask cluster and get the results back
Optimize the calculation & return the final data
Code:
# In[]
# Imports:
import logging as log
import datetime as dat
from itertools import combinations
import numpy as np
import pandas as pd
from patsy import dmatrices
import sklearn as sk
from sklearn.linear_model import LogisticRegression, SGDClassifier, LinearRegression
import dask as dask
import dask.dataframe as dk
from dask.distributed import Client
# In[]
# logging, set the dask client, open & clean the data, pass into a dask dataframe
log.basicConfig(level=log.INFO,
                format='%(asctime)s %(message)s',
                datefmt="%m-%d %H:%M:%S")
c = Client('ip:port')
ST = dat.datetime.now()
data_pd = pd.read_csv('some.txt', sep="\t")
#fill some na/clean up the data a bit
data_pd['V9'] = data_pd.V9.fillna("Declined")
data_pd['y'] = data_pd.y.fillna(0)
data_pd['x1'] = data_pd.x1.fillna(0)
#output the clean data and re-import into dask; we could also use from_pandas to get to dask dataframes
data_pd.to_csv('clean_rr_cp.csv')
data = dk.read_csv(r'C:\path\*.csv', sep=",")
# set x & y variables - the below is truncated
y_var = "y"
x_var = ['x1',
         'x2',
         'x3',
         'x4', ......
#list of all variables
all_var = list(y_var) + x_var
#all unique combinations
x_var_combos = [combos for combos in combinations(x_var,2)]
#add single variables for testing as well
for i in x_var:
    x_var_combos.append((i, ""))
# create formulas from our y, x variables
def formula(y_var, combo):
    combo_len = len(combo)
    if combo_len == 2:
        formula = y_var + "~" + combo[0] + "+" + combo[1] + "+0"
    else:
        formula = y_var + "~" + combo[0] + "+0"
    return formula
#dask.delayed
def model_aic(dt, formula):
    k = 2
    y_df, x_df = dmatrices(formula, dt, return_type='dataframe')
    y_df = np.ravel(y_df)
    log.info('dmatrices successful')
    LL_model = sk.linear_model.LassoLarsIC(max_iter=100)
    AIC_Value = min(LL_model.fit(x_df, y_df).criterion_) + ((2*(k**2) + 2*k) / (len(x_df) - k - 1))
    log.info('AIC_Value: %s', AIC_Value)
    oup = [formula, AIC_Value, len(dt) - AIC_Value]
    return oup
# ----------------- here's where we're stuck ---------------------
# ----------------- we think this is correct ----------------------
# ----------------- create a list of all formula to execute -------
# In[]
out = []
for i in x_var_combos:
    var = model_aic(data, formula(y_var, i))
    out.append(var)
# ----------------- but we're stuck figuring out how to -----------
# ------------------make it compute & return the result -----------
ans = c.compute(*out)
ans2 = c.compute(out[1])
print (ans2)
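A minimal sketch of one way this is often wired up with dask.delayed and a distributed Client (an illustration under assumptions, not a tested solution for this code; in particular it passes the in-memory data_pd, since patsy's dmatrices generally expects a pandas frame):
import dask

# Wrap the model function so each call builds a lazy task
# (the #dask.delayed decorator above is commented out, so wrap explicitly here).
delayed_model_aic = dask.delayed(model_aic)

# One task per formula, all sharing the same data.
tasks = [delayed_model_aic(data_pd, formula(y_var, combo)) for combo in x_var_combos]

# Either compute everything through the scheduler attached to the Client...
results = dask.compute(*tasks)

# ...or submit to the cluster and gather the results explicitly.
futures = c.compute(tasks)   # list of Future objects
results = c.gather(futures)  # list of [formula, AIC_Value, len-AIC] results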
I have a set of unique ngrams (list called ngramlist) and ngram tokenized text (list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that is equal to that element of ngramlist. I wrote the following code that gives the correct output, but I wonder if there is a way to optimize it:
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
I imagine there is a function in nltk or elsewhere that does this faster but I am not sure which one.
Thanks!
Edit: for what it's worth, the ngrams are produced as the joined output of nltk.util.ngrams, and ngramlist is just a list made from the set of all found ngrams.
Edit2:
Here is reproducible code to test the freqlist line (the rest of the code is not really what I care about)
from nltk.util import ngrams
import wikipedia
import nltk
import pandas as pd
articles = ['New York City','Moscow','Beijing']
tokenizer = nltk.tokenize.TreebankWordTokenizer()
data = {'article': [], 'treebank_tokenizer': []}
for article in articles:
    data['article'].append(wikipedia.page(article).content)
    data['treebank_tokenizer'].append(tokenizer.tokenize(data['article'][-1]))
df = pd.DataFrame(data)
df['ngrams-3'] = df['treebank_tokenizer'].map(
    lambda x: [' '.join(t) for t in ngrams(x, 3)])
ngramlist = list(set([trigram for sublist in df['ngrams-3'].tolist() for trigram in sublist]))
df['freqlist'] = df['ngrams-3'].map(
    lambda ngrams_: [sum(int(ngram == ngram_candidate) for ngram_candidate in ngrams_) / len(ngrams_)
                     for ngram in ngramlist])
You can probably optimize this a bit by pre-computing some quantities and using a Counter. This will be especially useful if most of the elements in ngramlist are contained in ngrams.
freqlist = [
    sum(int(ngram == ngram_candidate)
        for ngram_candidate in ngrams) / len(ngrams)
    for ngram in ngramlist
]
You certainly don't need to iterate over ngrams every single time you check an ngram. One pass over ngrams makes this algorithm O(n) instead of the O(n²) one you have now. Remember, shorter code is not necessarily better or more efficient code:
from collections import Counter
...
counter = Counter(ngrams)
size = len(ngrams)
freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
To use this with your dataframe's map, you would have to write a proper def function instead of a lambda:
def count_ngrams(ngrams):
    counter = Counter(ngrams)
    size = len(ngrams)
    freqlist = [counter.get(ngram, 0) / size for ngram in ngramlist]
    return freqlist
df['freqlist'] = df['ngrams-3'].map(count_ngrams)
Firstly, don't pollute your namespace by overriding imported functions and using them as variables; keep the name ngrams for the function and use something else as the variable.
import time
from functools import partial
from itertools import chain
from collections import Counter
import wikipedia
import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams
Next, the steps before the line you're asking about in the original question might be a little inefficient. You can clean them up, make them easier to read, and time them as follows:
# Downloading the articles.
titles = ['New York City','Moscow','Beijing']
start = time.time()
df = pd.DataFrame({'article':[wikipedia.page(title).content for title in titles]})
end = time.time()
print('Downloading wikipedia articles took', end-start, 'seconds')
And then:
# Tokenizing the articles
start = time.time()
df['tokens'] = df['article'].apply(word_tokenize)
end = time.time()
print('Tokenizing articles took', end-start, 'seconds')
Then:
# Extracting trigrams.
trigrams = partial(ngrams, n=3)
start = time.time()
# There's no need to flatten them to strings; you could just use list()
df['trigrams'] = df['tokens'].apply(lambda x: list(trigrams(x)))
end = time.time()
print('Extracting trigrams took', end-start, 'seconds')
Finally, on to the last line you asked about:
# Instead of a set, we use a Counter here because
# we can use an intersection between Counter objects later.
# see https://stackoverflow.com/questions/44012479/intersection-of-two-counters
all_trigrams = Counter(chain(*df['trigrams']))
# More often than not, you don't need to keep all the
# zeros in the vectors (aka dense vector),
# you could actually get the non-zero sparse vectors
# as a dict as such
df['trigrams_count'] = df['trigrams'].apply(lambda x: Counter(x) & all_trigrams)
# Now to normalize the count, simply do:
def featurize(list_of_ngrams):
    nonzero_features = Counter(list_of_ngrams) & all_trigrams
    total = len(list_of_ngrams)
    return {ng: count/total for ng, count in nonzero_features.items()}
df['trigrams_count_normalize'] = df['trigrams'].apply(featurize)
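As a quick illustrative check, the sparse dictionary for one article can be inspected like this (just a peek at the column created above, e.g. its five most frequent normalised trigrams):
# Counter sorts the dict by its float values via most_common.
top5 = Counter(df['trigrams_count_normalize'][0]).most_common(5)
print(top5)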
I have 2 pandas Series that I know are the same length. Each element of each Series is a set. I want to figure out a computationally efficient way to get the element-wise union of these two Series' sets. I've created a simplified version of the code with fake, short Series to play with below. This implementation is a VERY inefficient way of doing it; there has GOT to be a faster way. My real Series are much longer and I have to do this operation hundreds of thousands of times.
import pandas as pd
set_series_1 = pd.Series([{1,2,3}, {'a','b'}, {2.3, 5.4}])
set_series_2 = pd.Series([{2,4,7}, {'a','f','g'}, {0.0, 15.6}])
n = set_series_1.shape[0]
for i in range(0, n):
    set_series_1[i] = set_series_1[i].union(set_series_2[i])
print(set_series_1)
>>> set_series_1
0 set([1, 2, 3, 4, 7])
1 set([a, b, g, f])
2 set([0.0, 2.3, 15.6, 5.4])
dtype: object
I've tried combining the Series into a data frame and using the apply function, but I get an error saying that sets are not supported as dataframe elements.
After testing several options, I finally came up with a good one... pir4 below.
Testing
def jed1(s1, s2):
    s = s1.copy()
    n = s1.shape[0]
    for i in range(n):
        s[i] = s2[i].union(s1[i])
    return s

def pir1(s1, s2):
    return pd.Series([item.union(s2[i]) for i, item in enumerate(s1.values)], s1.index)

def pir2(s1, s2):
    return pd.Series([item.union(s2[i]) for i, item in s1.iteritems()], s1.index)

def pir3(s1, s2):
    return s1.apply(list).add(s2.apply(list)).apply(set)

def pir4(s1, s2):
    return pd.Series([set.union(*z) for z in zip(s1, s2)])
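As a quick sanity check on the small Series from the question (a sketch, not a benchmark; note that pir2 relies on Series.iteritems, which newer pandas versions have renamed to items):
import pandas as pd

set_series_1 = pd.Series([{1, 2, 3}, {'a', 'b'}, {2.3, 5.4}])
set_series_2 = pd.Series([{2, 4, 7}, {'a', 'f', 'g'}, {0.0, 15.6}])

# The loop from the question serves as the reference result.
expected = list(jed1(set_series_1, set_series_2))

# pir1, pir3 and pir4 should agree with it element-wise.
for fn in (pir1, pir3, pir4):
    assert list(fn(set_series_1, set_series_2)) == expected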