Python/Pandas element wise union of 2 Series containing sets in each element - python-3.x

I have 2 pandas data Series that I know are the same length. Each Series contains sets() in each element. I want to figure out a computationally efficient way to get the element wise union of these two Series' sets. I've created a simplified version of the code with fake and short Series to play with below. This implementation is a VERY inefficient way of doing this. There has GOT to be a faster way to do this. My real Series are much longer and I have to do this operation hundreds of thousands of times.
import pandas as pd
set_series_1 = pd.Series([{1,2,3}, {'a','b'}, {2.3, 5.4}])
set_series_2 = pd.Series([{2,4,7}, {'a','f','g'}, {0.0, 15.6}])
n = set_series_1.shape[0]
for i in range(0,n):
    set_series_1[i] = set_series_1[i].union(set_series_2[i])
print(set_series_1)
>>> set_series_1
0 set([1, 2, 3, 4, 7])
1 set([a, b, g, f])
2 set([0.0, 2.3, 15.6, 5.4])
dtype: object
I've tried combining the Series into a data frame and using the apply function, but I get an error saying that sets are not supported as dataframe elements.

pir4
After testing several options, I finally came up with a good one... pir4 below.
Testing
def jed1(s1, s2):
    # the original loop, wrapped in a function (works on a copy)
    s = s1.copy()
    n = s1.shape[0]
    for i in range(n):
        s[i] = s2[i].union(s1[i])
    return s
def pir1(s1, s2):
    return pd.Series([item.union(s2[i]) for i, item in enumerate(s1.values)], s1.index)
def pir2(s1, s2):
    # Series.items() replaces iteritems(), which newer pandas versions no longer provide
    return pd.Series([item.union(s2[i]) for i, item in s1.items()], s1.index)
def pir3(s1, s2):
    return s1.apply(list).add(s2.apply(list)).apply(set)
def pir4(s1, s2):
    # pair the sets by position and union each pair
    return pd.Series([set.union(*z) for z in zip(s1, s2)])
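For intuition, the idea behind pir4 (pair the elements with zip, then union each pair) can be sketched without pandas. Plain lists stand in for the two Series here, and the name elementwise_union is made up for this illustration; it is not part of the answer above.

```python
def elementwise_union(left, right):
    """Return a list whose i-th item is left[i] | right[i]."""
    if len(left) != len(right):
        raise ValueError("both inputs must have the same length")
    # zip pairs the sets by position; | builds a new set, so the inputs are not modified
    return [a | b for a, b in zip(left, right)]


sets_1 = [{1, 2, 3}, {'a', 'b'}, {2.3, 5.4}]
sets_2 = [{2, 4, 7}, {'a', 'f', 'g'}, {0.0, 15.6}]
result = elementwise_union(sets_1, sets_2)
```

Wrapping the result as pd.Series(result, index=s1.index) would give back a Series that keeps the original index, if that matters for your data.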

Related

Find an index in a list of lists using an index inside one of the lists in python

I'm trying to determine if there is a way to access an index essentially by making a list of lists, where each inner list has a tuple that provides essentially grid coordinates, i.e:
example = [
['a', (0,0)], ['b',(0,1)], ['c', (0,2)],
['d', (1,0)], ['e',(1,1)], ['d', (1,2)],
.....
]
and so on.
So, If I have coordinates (0,1), I want to be able to return example[1][0], or at the very least example[1] since these coordinates correlate with example[1].
I tried using index(), but this doesn't go deep enough. I also looked into itertools, but I cannot find a tool that finds it and doesn't return a boolean.
Using a number pad as an example:
from itertools import chain
def pinpad_test():
    pad=[
        ['1',(0,0)],['2',(0,1)],['3',(0,2)],
        ['4',(1,0)],['5',(1,1)],['6',(1,2)],
        ['7',(2,0)],['8',(2,1)],['9',(2,2)],
        ['0',(3,1)]
    ]
    tester = '1234'
    print(tester)
    for dig in tester:
        print(dig)
        if dig in chain(*pad):
            print(f'Digit {dig} in pad')
        else:
            print('Failed')
    print('end of tester')
    new_test = pad.index((0,1) in chain(*pad))
    print(new_test)
if __name__ == '__main__':
    pinpad_test()
I get a ValueError at the line that assigns new_test.
You can call next() on a simple generator expression:
coords = (0, 1)
idx = next((sub_l[0] for sub_l in pad if sub_l[1] == coords), None)
print(idx)
2
You can create a function that will give you what you want:
def on_coordinates(coordinates: tuple, list_coordinates: list):
    return next(x for x in list_coordinates if x[1] == coordinates)
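If many lookups are needed, another option is to build a dictionary keyed by the coordinate tuple once, so that each lookup no longer scans the whole list. This is only a sketch: the names pad_by_coords and label_at are invented here, and it assumes each coordinate appears at most once in the list.

```python
pad = [
    ['1', (0, 0)], ['2', (0, 1)], ['3', (0, 2)],
    ['4', (1, 0)], ['5', (1, 1)], ['6', (1, 2)],
    ['7', (2, 0)], ['8', (2, 1)], ['9', (2, 2)],
    ['0', (3, 1)],
]

# Map each coordinate tuple to its label (tuples are hashable, so they work as keys)
pad_by_coords = {coords: label for label, coords in pad}


def label_at(coords):
    """Return the label stored at coords, or None when nothing is stored there."""
    return pad_by_coords.get(coords)


found = label_at((0, 1))
missing = label_at((9, 9))
```

The generator-with-next approach is fine for a one-off search; the dictionary mostly pays off when the same pad is queried repeatedly.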

How do I minimize the code needed to perform matrix row operations in a python jupyter notebook? (using SymPy)

Here is my code so far (edited screenshots into code cells)
from sympy import *
import copy
init_printing()
A = Matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
A
#Functions for row operations
def rowSwap(A, i, j):
    B = A.elementary_row_op(op='n<->m',row1=i-1, row2=j-1)
    return B
def rowMultiply(A, i , c):
    B = A.elementary_row_op(op='n->kn',row=i-1, k=c)
    return B
def rowAddSubtract(A, i, c, j):
    B = A.elementary_row_op(op='n->n+km',row=i-1, k=c, row2=j-1) #use negative symbol to subtract
    return B
I am creating a matrix row operation calculator using sympy in a jupyter notebook, for linear algebra students. The catch is, the students have to enter the row operations themselves. I've created functions that they can use to perform row operations, however, I need assistance in editing them so students can write less code. Specifically, I would like them to be able to just type:
rowSwap(A, 1, 3)
and have the resulting matrix be displayed and saved, as opposed to entering:
A = rowSwap(A, 1, 3)
A
*The above two line way of doing it is the only way I've been able to get the result matrix to actually save so more steps can be done (i.e. performing operations to get to rref).
My attempt so far at a solution looks like this:
def rowSwap(A, i, j):
    A = A.elementary_row_op(op='n<->m',row1=i-1, row2=j-1)
    return A
def rowMultiply(A, i, c):
    A = A.elementary_row_op(op='n->kn',row=i-1, k=c)
    return A
def rowAddSubtract(A, i, c, j):
    A = A.elementary_row_op(op='n->n+km',row=i-1, k=c, row2=j-1)
    return A
When calling a function it would look like this:
rowSwap(A, 1, 2)
rowMultiply(A, 1, 10)
This does reduce the code needed for a single operation, but it is functionally useless because it does not actually save and update the matrix A. When I call rowMultiply, it operates on the original matrix defined at the top, not on the result shown in the cell just before it.
I come from a C++ background, so dealing with objects in Python is a bit foreign at the moment.
Any assistance would be much appreciated.
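The behaviour described above comes from how Python handles names: assigning to a parameter inside a function only rebinds the local name, while modifying the object that was passed in is visible to the caller. The sketch below shows the difference with plain nested lists standing in for the matrix; swap_rebind and swap_in_place are hypothetical names, and no SymPy call is used.

```python
def swap_rebind(rows, i, j):
    # Builds a NEW list and rebinds only the local name `rows`;
    # the caller's variable still refers to the old object.
    rows = [list(r) for r in rows]
    rows[i - 1], rows[j - 1] = rows[j - 1], rows[i - 1]
    return rows


def swap_in_place(rows, i, j):
    # Mutates the object the caller passed in, so the change is visible outside.
    rows[i - 1], rows[j - 1] = rows[j - 1], rows[i - 1]
    return rows  # returned as well, so a notebook cell still displays it


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
swap_rebind(A, 1, 3)
after_rebind = [list(r) for r in A]    # A is unchanged here
swap_in_place(A, 1, 3)
after_in_place = [list(r) for r in A]  # rows 1 and 3 are now swapped
```

With a SymPy matrix the same idea would mean changing the matrix object inside the function instead of assigning the result of elementary_row_op to the local name. Whether that fits depends on the matrix being a mutable type, which is an assumption worth checking for your setup.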

faster method for comparing two lists element-wise

I am building a relational DB using python. So far I have two tables, as follows:
>>> df_Patient.columns
[1] Index(['NgrNr', 'FamilieNr', 'DosNr', 'Geslacht', 'FamilieNaam', 'VoorNaam',
'GeboorteDatum', 'PreBirth'],
dtype='object')
>>> df_LaboRequest.columns
[2] Index(['RequestId', 'IsComplete', 'NgrNr', 'Type', 'RequestDate', 'IntakeDate',
'ReqMgtUnit'],
dtype='object')
The two tables are quite big:
>>> df_Patient.shape
[3] (386249, 8)
>>> df_LaboRequest.shape
[4] (342225, 7)
Column NgrNr on df_LaboRequest is a foreign key (FK) and references the homonymous column on df_Patient. To avoid any integrity error, I need to make sure that all the values under df_LaboRequest['NgrNr'] are in df_Patient['NgrNr'].
With list comprehension I tried the following (to pick up the values that would throw an error):
[x for x in list(set(df_LaboRequest['NgrNr'])) if x not in list(set(df_Patient['NgrNr']))]
However, this is taking ages to complete. Would anyone recommend a faster method (method in the general sense, as a synonym for procedure, nothing to do with the Python meaning of method) for such a comparison?
One-liners aren't always better.
Don't check for membership in lists. Why on earth would you create a set (which is the recommended data structure for O(1) membership checks) and then cast it to a list which has O(N) membership checks?
Make the set of df_Patient once, outside the list comprehension, and use that instead of making the set in every iteration:
patients = set(df_Patient['NgrNr'])
lab_requests = set(df_LaboRequest['NgrNr'])
result = [x for x in lab_requests if x not in patients]
Or, if you like to use set operations, simply find the difference of both sets:
result = lab_requests - patients
Alternatively, use the pandas isin() method on the columns themselves (not on the sets built above):
patients = df_Patient['NgrNr'].drop_duplicates()
lab_requests = df_LaboRequest['NgrNr'].drop_duplicates()
result = lab_requests[~lab_requests.isin(patients)]
Let's test how much faster these changes make the code:
import pandas as pd
import random
import timeit
# Make dummy dataframes of patients and lab_requests
randoms = [random.randint(1, 1000) for _ in range(10000)]
patients = pd.DataFrame("patient{0}".format(x) for x in randoms[:5000])[0]
lab_requests = pd.DataFrame("patient{0}".format(x) for x in randoms[2000:8000])[0]
# Do it your way
def fun1(pat, lr):
    return [x for x in list(set(lr)) if x not in list(set(pat))]
# Do it my way: Set operations
def fun2(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return lr_s - pat_s
# Or explicitly iterate over the set
def fun3(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return [x for x in lr_s if x not in pat_s]
# Or using pandas
def fun4(pat, lr):
    pat = pat.drop_duplicates()
    lr = lr.drop_duplicates()
    return lr[~lr.isin(pat)]
# Make sure all 4 functions return the same thing
assert set(fun1(patients, lab_requests)) == set(fun2(patients, lab_requests)) == set(fun3(patients, lab_requests)) == set(fun4(patients, lab_requests))
# Time it
timeit.timeit('fun1(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun1', number=100)
# Output: 48.36615000000165
timeit.timeit('fun2(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun2', number=100)
# Output: 0.10799920000044949
timeit.timeit('fun3(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun3', number=100)
# Output: 0.11038020000069082
timeit.timeit('fun4(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun4', number=100)
# Output: 0.32021789999998873
Looks like we have a ~150x speedup with pandas and a ~450x speedup with set operations!
I don't have pandas installed right now to try this, but you could try removing the list(..) cast. I don't think it provides anything meaningful to the program, and sets are much faster for lookup, e.g. x in set(...), than lists.
You could also try doing this with the pandas API rather than lists and sets; sometimes this is faster. Try searching for unique. Then you could compare the size of the two columns and, if it is the same, sort them and do an equality check.

How to create a dataframe of a particular size containing both continuous and categorical values with a uniform random distribution

So, I'm trying to generate some fake random data of a given dimension size. Essentially, I want a dataframe in which the data has a uniform random distribution. The data consist of both continuous and categorical values. I've written the following code, but it doesn't work the way I want it to.
import random
import pandas as pd
import time
from datetime import datetime
# declare global variables
adv_name = ['soft toys', 'kitchenware', 'electronics',
'mobile phones', 'laptops']
adv_loc = ['location_1', 'location_2', 'location_3',
'location_4', 'location_5']
adv_prod = ['baby product', 'kitchenware', 'electronics',
'mobile phones', 'laptops']
adv_size = [1, 2, 3, 4, 10]
adv_layout = ['static', 'dynamic'] # advertisment layout type on website
# adv_date, start_time, end_time = []
num = 10 # the given dimension
# define function to generate random advert locations
def rand_shuf_loc(str_lst, num):
    lst = adv_loc
    # using list comprehension
    rand_shuf_str = [item for item in lst for i in range(num)]
    return(rand_shuf_str)
# define function to generate random advert names
def rand_shuf_prod(loc_list, num):
    rand_shuf_str = [item for item in loc_list for i in range(num)]
    random.shuffle(rand_shuf_str)
    return(rand_shuf_str)
# define function to generate random impression and click data
def rand_clic_impr(num):
    rand_impr_lst = []
    click_lst = []
    for i in range(num):
        rand_impr_lst.append(random.randint(0, 100))
        click_lst.append(random.randint(0, 100))
    return {'rand_impr_lst': rand_impr_lst, 'rand_click_lst': click_lst}
# define function to generate random product price and discount
def rand_prod_price_discount(num):
    prod_price_lst = [] # advertised product price
    prod_discnt_lst = [] # advertised product discount
    for i in range(num):
        prod_price_lst.append(random.randint(10, 100))
        prod_discnt_lst.append(random.randint(10, 100))
    return {'prod_price_lst': prod_price_lst, 'prod_discnt_lst': prod_discnt_lst}
def rand_prod_click_timestamp(stime, etime, num):
    prod_clik_tmstmp = []
    frmt = '%d-%m-%Y %H:%M:%S'
    for i in range(num):
        rtime = int(random.random()*86400)
        hours = int(rtime/3600)
        minutes = int((rtime - hours*3600)/60)
        seconds = rtime - hours*3600 - minutes*60
        time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)
        prod_clik_tmstmp.append(time_string)
    time_stmp = [item for item in prod_clik_tmstmp for i in range(num)]
    return {'prod_clik_tmstmp_lst':time_stmp}
def main():
    print('generating data...')
    # print('generating random geographic coordinates...')
    # get the impressions and click data
    impression = rand_clic_impr(num)
    clicks = rand_clic_impr(num)
    product_price = rand_prod_price_discount(num)
    product_discount = rand_prod_price_discount(num)
    prod_clik_tmstmp = rand_prod_click_timestamp("20-01-2018 13:30:00",
                                                 "23-01-2018 04:50:34",num)
    lst_dict = {"ad_loc": rand_shuf_loc(adv_loc, num),
                "prod": rand_shuf_prod(adv_prod, num),
                "imprsn": impression['rand_impr_lst'],
                "cliks": clicks['rand_click_lst'],
                "prod_price": product_price['prod_price_lst'],
                "prod_discnt": product_discount['prod_discnt_lst'],
                "prod_clik_stmp": prod_clik_tmstmp['prod_clik_tmstmp_lst']}
    fake_data = pd.DataFrame.from_dict(lst_dict, orient="index")
    res = fake_data.apply(lambda x: x.fillna(0)
                          if x.dtype.kind in 'biufc'
                          # where 'biufc' means boolean, signed integer,
                          # unsigned integer, float & complex data types
                          else x.fillna(random.randint(0, 100)
                                        )
                          )
    print(res.transpose())
    res.to_csv("fake_data.csv", sep=",")
# invoke the main function
if __name__ == "__main__":
    main()
Problem 1
When I execute the above code snippet, it prints fine, but when written to CSV the data is laid out horizontally. How do I lay it out vertically when writing to the CSV file? What I want is 7 columns (see the lst_dict variable above) with n rows.
Problem 2
I don't understand why the random date is generated for the first 50 columns while the remaining columns are filled with numerical values.
To answer your first question, replace
print(res.transpose())
with
res = res.transpose()
print(res)
To answer your second question, look at the length of the output of
rand_shuf_loc()
Like the other helper functions, it only produces a list of 50 items.
Building res with
fake_data.apply
replaces every NaN with a random number, so the columns that have no predefined values are filled with numbers as well.
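One way to avoid the padding problem altogether is to generate every column with the same number of rows from the start. The sketch below uses only the standard library; make_fake_rows and its seed parameter are hypothetical names, the category lists are borrowed from the question, and only four of the seven columns are shown to keep it short.

```python
import csv
import io
import random

adv_loc = ['location_1', 'location_2', 'location_3', 'location_4', 'location_5']
adv_prod = ['baby product', 'kitchenware', 'electronics', 'mobile phones', 'laptops']


def make_fake_rows(num, seed=None):
    """Return num dict rows; every column is drawn independently and uniformly."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num):
        rows.append({
            'ad_loc': rng.choice(adv_loc),                 # categorical, uniform over the list
            'prod': rng.choice(adv_prod),                  # categorical, uniform over the list
            'imprsn': rng.randint(0, 100),                 # integer, uniform on [0, 100]
            'prod_price': round(rng.uniform(10, 100), 2),  # continuous, uniform on [10, 100]
        })
    return rows


rows = make_fake_rows(10, seed=0)

# Write one record per line, i.e. the "vertical" layout (columns across, rows down)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Passing the same list of dicts to pd.DataFrame(rows) should give a frame with one column per key and num rows, so no transpose or fillna step is needed.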

Some python3 behavior I am unable to understand

I have used the following code.
from collections import defaultdict
from random import randint, randrange,choice, shuffle
def random_array(low, high, step, size):
    lst = []
    while len(lst)<size:
        nexts = randrange(low, high, step)
        if nexts in lst:continue
        lst.append(nexts)
    return lst
def find_pair_from_two_list(a, b, val):
    b_dict = defaultdict(int)
    for i,v in enumerate(b): b_dict[v] = i
    for v in a:
        if (val - v) in b_dict:
            return v, val-v
    return -1, -1
arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
val1 = choice(arr1)
val2 = choice(arr2)
val = val1 + val2
print(find_pair_from_two_list(arr1,arr2, val))
However, if I change the size value in
arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
it works instantly up to 99, but if I change either size value to 100 or more, it just seems to hang.
I am curious to know why this is happening. It works well up to 99, so what causes it to hang at 100?
Why is yours slow:
Your method can take a lot of time to draw the last missing numbers, because you draw new random values over and over and discard them when they are already in your result list:
while len(lst)<size:
    nexts = randrange(low, high, step)
    if nexts in lst:continue # discards numbers that are already in the list
    lst.append(nexts)
return lst
With inputs like this you essentially draw "all" possible numbers until done, and the more your result already contains, the longer it takes to draw another "fitting" one.
You even get an endless loop if range(low, high, step) has fewer values in total than your size demands. That is the case for arr1 = random_array(1, 100, 1, 100): range(1, 100, 1) holds only 99 values, so the 100th can never be drawn. Another example:
(1,100,5,100) # => only 20 in this range with this stepper -> endless loop
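One way to guard against that endless loop is to compare the requested size with len(range(low, high, step)) before drawing anything. The following is a sketch rather than a drop-in replacement; the name checked_random_array is made up here, and it also swaps the list membership test for a set, which is an addition not discussed above.

```python
from random import randrange


def checked_random_array(low, high, step, size):
    """Like the question's random_array, but refuses impossible requests."""
    available = len(range(low, high, step))  # number of distinct values that can be drawn
    if size > available:
        raise ValueError(f"asked for {size} unique values, only {available} exist")
    seen = set()   # set membership is O(1), unlike `in lst`
    result = []
    while len(result) < size:
        candidate = randrange(low, high, step)
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


values_available = len(range(1, 100, 1))
try:
    checked_random_array(1, 100, 1, 100)
    failed_fast = False
except ValueError:
    failed_fast = True
```

For the question's range(1, 100, 1) the count is 99, so a request for 100 values fails immediately instead of hanging.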
Possible simplification (not optimal)
You could simplify and speed up the code like this:
import random
def random_array(low, high, step, size):
    poss = list(range(low,high,step)) # this does not contain duplicates
    random.shuffle(poss) # shuffle it
    return poss[:size] # return size (or all) elements from it
print(random_array(1,100,1,10))
This code still returns when you pass it a "wrong" combination, but the resulting list is then shorter than the size you specified.
Even better
jonsharpe's suggestion is to use
random.sample(range(low,high,step),size)
like so:
def ra(low,high,step,size):
    return random.sample(range(low,high,step),size)
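One difference worth knowing: unlike the shuffle-and-slice version, random.sample raises ValueError when asked for more items than the range holds, rather than returning a shorter list. A quick check, with ra redefined so the snippet runs on its own:

```python
import random


def ra(low, high, step, size):
    return random.sample(range(low, high, step), size)


picked = ra(1, 100, 1, 10)   # 10 distinct values from 1..99

try:
    ra(1, 100, 5, 100)       # range(1, 100, 5) has only 20 values
    raised = False
except ValueError:
    raised = True
```

Which behaviour is preferable depends on the caller: an exception makes an impossible request obvious, while the shorter list silently gives back whatever is available.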
Performance test
Performance-wise, random.sample easily outperforms my version for big lists:
import random
def random_array(low, high, step, size):
    poss = list(range(low,high,step))
    random.shuffle(poss)
    return poss[:size]
def ra(low,high,step,size):
    return random.sample(range(low,high,step),size)
import timeit
if __name__ == '__main__':
    # create 10000 times 495 randoms of range (1,1000000,22)
    print(timeit.timeit("ra(1,1000000,22,495)", setup="from __main__ import ra",number = 10000))
    print(timeit.timeit("random_array(1,1000000,22,495)", setup="from __main__ import random_array",number = 10000))
Output:
1.1825043768664596 # random.sample(...) of range(...)
92.12594874871951 # mine
The reason is probably that I build and shuffle an actual list from the range on every call, whereas random.sample can pick items from the range by index without materializing it first.
Docs:
https://docs.python.org/3.1/library/random.html
https://docs.python.org/3/library/timeit.html

Resources