Issue with pd.DataFrame.apply with arguments - python-3.x

I want to create augmented data in a new dataframe for every row of an original dataframe.
So, I've defined an augment method, which I want to use in apply as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the 'apply' method, the prints and the display work fine, but the resultant dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is what the output data looks like after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?

Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.
In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side-effect result in tmp_df. Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.
Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:
Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.
But you're offering return None instead. Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result.
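As a minimal illustration of that contract (a toy frame, not your augment()):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# func returns a value, so .apply() has a result to store:
doubled = df.apply(lambda row: row['a'] * 2, axis=1)
print(doubled.tolist())   # [2, 4, 6]

# a func that returns None leaves .apply() nothing useful to store:
nothing = df.apply(lambda row: None, axis=1)
print(nothing.tolist())   # [None, None, None]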
Happy hunting!
Tiny little style nit.
You wrote
args=('binMap', tmp_df, 4, )
to offer a 3-tuple. Better to write
args=('binMap', tmp_df, 4)
As written it tends to suggest 1-tuple notation.
When is a trailing comma helpful?
in a 1-tuple it is essential: x = (7,)
in multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added:
fruits = [
    'apple',
    'banana',
]

This change worked for me -
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And I updated the call to apply as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)
Thank you @J_H.
If there is a better way to achieve what I'm doing, please feel free to suggest improvements.
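For what it's worth, one way to avoid the side effect entirely (an untested sketch reusing the names above; the image-processing internals are elided) is to have the augmentation return its new rows and build the frame once at the end:
import pandas as pd

def augment_rows(row, column_name, num_samples):
    # ... same image processing as in augment() above ...
    new_rows = []
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        # new_row[column_name] = ...squeezed original or augmented image...
        new_rows.append(new_row)
    return new_rows

augmented = [r for _, row in testDF.iterrows()
             for r in augment_rows(row, 'binMap', 4)]
tmp_df = pd.DataFrame(augmented).reset_index(drop=True)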

Related

Using Ipywidgets with Dataframes

I'm trying to learn how to create interactions with a dataframe. I have already read the ipywidgets docs, trying to get as much as I can from them, plus some how-to blogs, but when it comes to the real world somehow I can't reproduce the same results. I know this is common, though.
Anyway, so far I've tried doing this:
week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]
values = [value for value in canali_weekly.columns]

@widgets.interact(
    x = widgets.IntSlider(min=0, max=canali_weekly["Self"].max(), step=10, description='N. Self minime vendute:'),
    s = widgets.IntRangeSlider(
        value = [max(week)-1, max(week)],
        min=min(week),
        max=max(week),
        step=1,
        description='Confronto settimane:'
    ),
    m = widgets.SelectMultiple(
        options=values,
        #rows=10,
        description='Metriche:',
        disabled=False
    )
)
def filtra(x,s,m):
    return canali_weekly.loc[ (canali_weekly["Self"] > x,s), m]
Which is pretty good because I get the main goal, that is, allowing interactions. However, this code doesn't allow me to change the layout (for example, I would prefer displaying the widgets horizontally).
So I started with this new code:
week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]
s = widgets.IntRangeSlider(
    value = [max(week)-1, max(week)],
    min=min(week),
    max=max(week),
    step=1,
    description='Confronto settimane:'
)

def filtra(s):
    display(canali_weekly.loc[ (slice(None),s), ["Spending"]])

s.observe(filtra, 'value')
display(s)
The problem is that in this case the dataframe is not shown.
ADDENDUM:
Here is a reproduction of the dataframe:
import numpy as np
import pandas as pd

channel = ["A", "B", "C", "D"]
week = [1,2,3,4,5,6,7,8,9]
columns = ["col1","col2","col3"]
index = pd.MultiIndex.from_product([channel,week], names=('channel', 'week'))
df = pd.DataFrame(np.random.randn(36, 3), columns=columns, index=index)
UPDATE:
I tried with this new code, very similar to the second one I pasted here:
def crea_opzioni(array):
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, "ALL")
    return unique

dropdown_week = widgets.Dropdown(options = crea_opzioni(canali_weekly.index.get_level_values(level=1)))

def filtra_dati(change):
    if (change.new == "ALL"):
        display(canali_weekly)
    else:
        display(canali_weekly.loc[(slice(None), change.new), ["Spending"]])

dropdown_week.observe(filtra_dati, names='value')
display(dropdown_week)
Even though the dataframe is not rendered in the notebook, I can see it in the log tab.
Here is my solution:
output = widgets.Output()

def crea_opzioni(array):
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, "ALL")
    return unique

week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]
intRange_week = widgets.IntRangeSlider(value = [max(week)-1, max(week)], min=min(week), max=max(week), step=1, description='Confronto settimane:')
#widgets.Dropdown(options = crea_opzioni(canali_weekly.index.get_level_values(level=1)))
#dropdown_canali = widgets.Dropdown(options = crea_opzioni(canali_weekly.index.get_level_values(level=0)))

def filtra_dati(selezione):
    output.clear_output()
    df = canali_weekly.loc[(slice(None), slice(selezione[0],selezione[1])), ["Spending"]]
    with output:
        display(df)

def dropdown_week_eventhandler(change):
    filtra_dati(change.new)

intRange_week.observe(dropdown_week_eventhandler, names='value')
display(intRange_week)
display(output)  # the Output widget itself must be displayed for the dataframe to show
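For what it's worth, a shorter variant of the same idea (a sketch built against the toy df from the addendum, not the real canali_weekly) can lean on widgets.interactive_output, which routes the function's output through an Output widget for you:
import numpy as np
import pandas as pd
import ipywidgets as widgets
from IPython.display import display

channel = ["A", "B", "C", "D"]
week = [1, 2, 3, 4, 5, 6, 7, 8, 9]
index = pd.MultiIndex.from_product([channel, week], names=('channel', 'week'))
df = pd.DataFrame(np.random.randn(36, 3), columns=["col1", "col2", "col3"], index=index)

slider = widgets.IntRangeSlider(value=[8, 9], min=1, max=9, step=1, description='Settimane:')

def filtra(s):
    # display the rows whose week falls inside the selected range
    display(df.loc[(slice(None), slice(s[0], s[1])), ["col1"]])

out = widgets.interactive_output(filtra, {'s': slider})
display(slider, out)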

How to make Array inside For Loop globally accessible in Python

I have the following code-
data = [row for row in csv.reader(f)]
for i in range(1,145):
    for j in range(0,35):
        if j%2 == 0:
            x_data = data[i][j]
I want to access x_data in a different function, def compareCell, for comparison. How can I access the array in that function? Any help will be highly appreciated.
Updated:
Actually, the following is the case:
Diagram
Case1, Case2, .... are generated in real time and I need to compare them with the CSV file data, as shown in the diagram above.
Thanks!
You can define x_data at the beginning of your code:
x_data = []  # considering your data is an array

# ... some code (optional)

data = [row for row in csv.reader(f)]
for i in range(1,145):
    for j in range(0,35):
        if j%2 == 0:
            x_data.append(data[i][j])  # append, rather than overwrite, each matching cell

def compareCell(my_data):
    # some code
    pass

compareCell(x_data)  # for example
In Python, you can use global (inside a function, declare global x_data and then assign to it; note that global x_data = data[i][j] on one line is not valid syntax). However, I would suggest you call your function from inside the loop, as mentioned in the comments.
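A minimal sketch of that suggestion (compareCell's body and the 'data.csv' name are placeholders):
import csv

def compareCell(cell_value):
    # hypothetical comparison against the real-time case data
    pass

with open('data.csv') as f:
    data = [row for row in csv.reader(f)]

for i in range(1, 145):
    for j in range(0, 35, 2):   # stepping by 2 replaces the j % 2 == 0 test
        compareCell(data[i][j])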

Split list into randomised ordered sub lists

I would like to improve the below code, which splits a list of values into two sub-lists that are randomised and sorted. The code works, but I'm sure there is a better/cleaner way to do it.
import random
data = list(range(1, 61))
random.shuffle(data)
Intervention = data[:30]
Control = data[30:]
Intervention.sort()
Control.sort()
f = open('Randomised_Groups.txt', 'w')
f.write('Intervention Group = ' + str(Intervention) + '\n' + 'Control Group = ' + str(Control))
f.close()
The expected output is:
Intervention = [1,3,7,9]
Control = [2,4,5,6,8,10]
I think your code is short and clean already. Some changes you can make:
Call sorted() when you slice it.
Intervention = sorted(data[:30])
You can also define both Intervention and Control on one line:
Intervention, Control = data[:30], data[30:]
I would replace the 30 with a variable:
half = len(data)//2
It is safer to open a file with with. That closes the file automatically when the block ends.
with open('Randomised_Groups.txt', 'w') as f:
...
With the use of f-strings you can make the write statement shorter:
f.write(f'Intervention Group = {Intervention} \nControl Group = {Control}')
All combined:
import random
data = list(range(1, 61))
random.shuffle(data)
half = len(data)//2
Intervention, Control = sorted(data[:half]), sorted(data[half:])
with open('Randomised_Groups.txt', 'w') as f:
f.write(f'Intervention Group = {Intervention}\nControl Group = {Control}')
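As a further variant (my suggestion, not part of the answer above): random.sample can draw one group directly, so the full shuffle is not needed:
import random

data = list(range(1, 61))
half = len(data) // 2

# sample picks `half` distinct values; the rest form the other group
Intervention = sorted(random.sample(data, half))
Control = sorted(set(data) - set(Intervention))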
Something like this might be what you want:
import random
my_rng = [random.randint(0,1) for i in range(60)]
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
print(Control)
The idea is to create 60 random 1s or 0s to use as indicators for which list to put each number in. This will only work if you do not need the two lists to be the same length. To get the same length would require changing how my_rng is created in this example.
I have tinkered a bit further and got the lists of the same length:
import random
my_rng = [0 for i in range(30)]
my_rng.extend([1 for i in range(30)])
random.shuffle(my_rng)
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
Here, instead of adding randomly 1 or 0 to my_rng I get a list of 30 0s and 30 1s to shuffle, then continue like before.
Here is another, more dynamic solution that uses the built-in random functionality, only creates the lists needed (no extra memory), and works with lists containing any type of object (provided the objects can be sorted):
import random

def convert_to_random_list(data, num_list):
    """
    Takes in the data as one large list and converts it into
    [num_list] random sorted lists.
    """
    result_lists = [list() for _ in range(num_list)]  # one empty list per requested output list
    for x in data:
        # Using randint we pick which list to insert into
        result_lists[random.randint(0, num_list - 1)].append(x)
    # You could use list comprehension here with `sorted(...)` but it would take a little extra memory.
    for _list in result_lists:
        _list.sort()
    return result_lists
Can be tested with:
data = list(range(1, 61))
random.shuffle(data)
temp = convert_to_random_list(data, 3)
print(temp)

Retrieving dict value via hardcoded key, works. Retrieving via computed key doesn't. Why?

I'm generating a common list of IDs by comparing two sets of IDs (the ID sets are from a dictionary, {ID: XML "RECORD" element}). Once I have the common list, I want to iterate over it and retrieve the value corresponding to the ID from a dictionary (which I'll write to disc).
When I compute the common ID list using my diff_comm_checker function, I'm unable to retrieve the dict value the ID corresponds to. It doesn't, however, fail with a KeyError. I can also print the ID out.
When I hard code the ID in as the common_id value, I can retrieve the dict value.
I.e.
common_ids = diff_comm_checker( list_1, list_2, "text")
# does nothing - no failures
common_ids = ['0603599998140032MB']
#gives me:
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE788>
0603599998140032MB {'R': '0603599998140032MB'} <Element 'RECORD' at 0x04ACE3E0>
So I suspected there was some difference between the strings. I checked both the function output and compared it against the hard-coded values using:
print [(_id, type(_id), repr(_id)) for _id in common_ids][0]
I get exactly the same for both:
>>> ('0603599998140032MB', <type 'str'>, "'0603599998140032MB'")
I have also followed the advice of another question and used difflib.ndiff:
common_ids1 = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "text")
common_ids = ['0603599998140032MB']
print "\n".join(difflib.ndiff(common_ids1, common_ids))
>>> 0603599998140032MB
So again, doesn't appear that there's any difference between the two.
Here's a full, working example of the code:
from StringIO import StringIO
import xml.etree.cElementTree as ET
from itertools import chain, islice

def diff_comm_checker(list_1, list_2, text):
    """Checks 2 lists. If no difference, pass. Else return common set between two lists"""
    symm_diff = set(list_1).symmetric_difference(list_2)
    if not symm_diff:
        pass
    else:
        mismatches_in1_not2 = set(list_1).difference( set(list_2) )
        mismatches_in2_not1 = set(list_2).difference( set(list_1) )
        if mismatches_in1_not2:
            mismatch_logger(
                mismatches_in1_not2,"{}\n1: {}\n2: {}".format(text, list_1, list_2), 1, 2)
        if mismatches_in2_not1:
            mismatch_logger(
                mismatches_in2_not1,"{}\n2: {}\n1: {}".format(text, list_1, list_2), 2, 1)
    set_common = set(list_1).intersection( set(list_2) )
    if set_common:
        return sorted(set_common)
    else:
        return "no common set: {}\n".format(text)

def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def get_elements_iteratively(file):
    """Create unique ID out of image number and case number, return it along with corresponding xml element"""
    tag = "RECORD"
    tree = ET.iterparse(StringIO(file), events=("start","end"))
    context = iter(tree)
    _, root = next(context)
    for event, record in context:
        if event == 'end' and record.tag == tag:
            xml_element_2 = ''
            xml_element_1 = ''
            for child in record.getchildren():
                if child.tag == "IMAGE_NUMBER":
                    xml_element_1 = child.text
                if child.tag == "CASE_NUM":
                    xml_element_2 = child.text
            r_id = "{}{}".format(xml_element_1, xml_element_2)
            record.set("R", r_id)
            yield (r_id, record)
            root.clear()

def get_chunks(file, chunk_size):
    """Breaks XML into chunks, yields dict containing unique IDs and corresponding xml elements"""
    iterable = get_elements_iteratively(file)
    for chunk in chunks(iterable, chunk_size):
        ids_records = {}
        for k in chunk:
            ids_records[k[0]]=k[1]
        yield ids_records

def create_new_xml(xml_list):
    chunk = 5000
    chunk_rec_ids_1 = get_chunks(xml_list[0], chunk)
    chunk_rec_ids_2 = get_chunks(xml_list[1], chunk)
    to_write = [chunk_rec_ids_1, chunk_rec_ids_2]
    ######################################################################################
    ### WHAT'S GOING ON HERE ??? WHAT'S THE DIFFERENCE BETWEEN THE OUTPUTS OF THESE TWO ? ###
    common_ids = diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
    #common_ids = ['0603599998140032MB']
    ######################################################################################
    for _id in common_ids:
        print _id
        for gen_obj in to_write:
            for kv_pair in gen_obj:
                if kv_pair[_id]:
                    print _id, kv_pair[_id].attrib, kv_pair[_id]

if __name__ == '__main__':
    xml_1 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    xml_2 = """<?xml version="1.0"?><RECORDSET><RECORD><CASE_NUM>140032MB</CASE_NUM><IMAGE_NUMBER>0603599998</IMAGE_NUMBER></RECORD></RECORDSET>"""
    create_new_xml([xml_1, xml_2])
The problem is not in the type or value of common_ids returned from diff_comm_checker. The problem is that either the call to diff_comm_checker, or the construction of its arguments, destroys the values of to_write.
If you try this, you will see what I mean:
common_ids = ['0603599998140032MB']
diff_comm_checker( [x.keys() for x in to_write[0]][0], [x.keys() for x in to_write[1]][0], "create_new_xml - large - common_ids")
This will give the erroneous behavior without using the return value from diff_comm_checker()
This is because to_write is a generator and the call to diff_comm_checker exhausts that generator. The generator is then finished/empty when used in the if-statement in the loop. You can create a list from a generator by using list:
chunk_rec_ids_1 = list(get_chunks(xml_list[0], chunk))
chunk_rec_ids_2 = list(get_chunks(xml_list[1], chunk))
But this may have other implications (memory usage...)
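A tiny illustration of the underlying behaviour:
gen = (n * n for n in range(3))
print(list(gen))   # [0, 1, 4]
print(list(gen))   # [] -- the generator is already exhausted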
Also, what is the intention of this construct in diff_comm_checker?
if not symm_diff:
    pass
In my opinion nothing will happen regardless of whether symm_diff is empty or not.

Data filtering code in Pandas taking lot of time to run

I am executing the below code in Python. It's taking a long time to run. Is there something I am doing wrong?
Is there a better way to do the same thing?
y= list(word)
words = y
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
similarity_matrix['Unlist_Root']=similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
similarity_matrix['Unlist_Similar']=similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
similarity_matrix=similarity_matrix.drop(['Root_Word','Similar_Words'],1)
similarity_matrix.columns=['Root_Word','Similar_Words']
It is not possible to determine what is going on in the following line as there is not enough data provided (I do not know what model is):
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
The second line below does not seem necessary as you create a DataFrame similarity_matrix with only two columns:
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
# This below does not do anything
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
The apply method is not very fast. Try using vectorized methods already implemented in pandas, as shown below.
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].str.join(', ')
Similarly:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].str.join(', ')
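A quick toy check (made-up data) that the two spellings agree:
import pandas as pd

df = pd.DataFrame({'Similar_Words': [['a', 'b'], ['c']]})
via_apply = df['Similar_Words'].apply(lambda x: ', '.join(x))
via_str = df['Similar_Words'].str.join(', ')
print(via_apply.equals(via_str))   # True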
The rest of the code could not be made to run much faster.
If you provide more data/info, we can help you more than that...
