How to make Array inside For Loop globally accessible in Python - python-3.x

I have the following code-
data = [row for row in csv.reader(f)]
for i in range(1, 145):
    for j in range(0, 35):
        if j % 2 == 0:
            x_data = data[i][j]
I want to access x_data in a different function, def compareCell, for comparison. How can I access the array in that function? Any help will be highly appreciated.
Update: the actual situation is as follows:
[Diagram]
Case1, Case2, .... are generated in real time and I need to compare it with the CSV file data as shown in the diagram above.
Thanks!

You can define x_data at the beginning of your code:
x_data = []  # considering you want x_data to be a list

# ... some code (optional)

data = [row for row in csv.reader(f)]
for i in range(1, 145):
    for j in range(0, 35):
        if j % 2 == 0:
            x_data.append(data[i][j])  # collect the values instead of overwriting

def compareCell(my_data):
    # some code
    ...

compareCell(x_data)  # for example

In Python you can use the global statement (declare global x_data inside the function, then assign x_data = data[i][j]); note that global x_data = data[i][j] on a single line is a SyntaxError. However, I would suggest you call your function from inside the loop, as mentioned in the comments.
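As a sketch of that in-loop approach (the CSV contents, the comparison value, and the `compareCell` signature here are made up for illustration):

```python
import csv
import io

def compareCell(cell_value, case_value):
    # Placeholder comparison: True when the two values match.
    return cell_value == case_value

# Small in-memory CSV standing in for the real file.
f = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")
data = [row for row in csv.reader(f)]

matches = []
for i in range(1, len(data)):
    for j in range(0, len(data[i]), 2):  # a step of 2 replaces the j % 2 check
        x_data = data[i][j]
        if compareCell(x_data, "3"):     # compare while the value is in scope
            matches.append((i, j))
```

Calling `compareCell` while `x_data` is still in scope avoids the global entirely.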

Related

Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.
So, I've defined an augment method which I want to use in apply as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the apply method, the prints and the display work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is what the output data looks like after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition.
I am happy to be your rubber duck.
In test A, you (successfully) mess with
testDF.iloc[0] and [1],
using kind of a Fortran-style API
for augment(), leaving a side effect result in tmp_df.
Test B is carefully constructed to
be "the same" except for the .apply() call.
So let's see, what's different?
Hard to say.
Let's go examine the docs.
Oh, right!
We're using the .apply() API,
so we'd better follow it.
Down at the end it explains:
Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.
But you're offering return None instead.
Now, I'm not here to pass judgement on
whether it's best to have side effects
on a target df -- that's up to you.
But .apply() will be bent out of shape
until you give it something nice
to store as its own result.
Happy hunting!
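To make the point concrete, here is a minimal sketch of the contrast, with a hypothetical `side_effect_only` helper standing in for `augment`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

def side_effect_only(row, sink):
    sink.append(row["a"] * 10)  # the side effect, like writing to target_df
    return row                  # give .apply() something to store as its result

sink = []
result = df.apply(side_effect_only, args=(sink,), axis=1)
# sink collects [10, 20, 30]; result is a DataFrame shaped like df
```

Drop the `return row` and `result` becomes a column of Nones, which is essentially what the `<Error>` output above is showing.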
Tiny little style nit.
You wrote
args=('binMap', tmp_df, 4, )
to offer a 3-tuple. Better to write
args=('binMap', tmp_df, 4)
As written it tends to suggest 1-tuple notation.
When is a trailing comma helpful?
In a 1-tuple it is essential: x = (7,).
In multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added:
fruits = [
    'apple',
    'banana',
]
This change worked for me:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And the updated call to apply is as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)
Thank you @J_H.
If there is a better way to achieve what I'm doing, please feel free to suggest improvements.

How to use a loop to automate a repetitive task?

I have a function to create a final dataframe and need to apply this function to 10 different dataframes. Is there a way to use a loop to automate this instead of writing the dataframe names one by one?
def table_all(df, sales_date):
    df1 = df.loc[df['date'] == sales_date]
    ### more code ###
    df1.fillna(0, inplace=True)
    return df1
sales1 = table_all(df,sales_date1)
sales2 = table_all(df,sales_date2)
sales3 = table_all(df,sales_date3)
##### more code ####
sales10 = table_all(df,sales_date10)
I am hoping to use a loop like the one below, but it did not work. Any ideas would be greatly appreciated.
for x in range(1, 11):
    sales + str(x) = table_all(df, sales_date + str(x))  # SyntaxError: cannot assign to an expression
Prefer using a dict to store the results:
sales = {}
for x in range(1, 11):
    sales[x] = table_all(df, globals()["sales_date{}".format(x)])
Pretty much the same question here
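If you can store the dates themselves in a list rather than in ten separate sales_date1 ... sales_date10 variables, the globals() lookup disappears entirely. A sketch, with made-up data and a simplified table_all:

```python
import pandas as pd

def table_all(df, sales_date):
    # Simplified stand-in for the question's function.
    out = df.loc[df["date"] == sales_date].copy()
    return out.fillna(0)

df = pd.DataFrame({"date": ["d1", "d1", "d2"], "amount": [1, None, 3]})

sales_dates = ["d1", "d2"]  # in the real code, the ten dates
sales = {x: table_all(df, d) for x, d in enumerate(sales_dates, start=1)}
# sales[1] holds the frame for the first date, sales[2] the second, and so on.
```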

Dynamically generating an object's name in a panda column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then append all these sub-dataframes in a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframe, the resulting value gets overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formula names and expressions as a dict? This gives me the same problem as above.
Here is my formulas dict:
# ratios dict: all ratio names and functions
ratios = {
    "ratio": fuzz.ratio,
    "partial ratio": fuzz.partial_ratio,
    "token sort ratio": fuzz.token_sort_ratio,
    "partial token sort ratio": fuzz.partial_token_sort_ratio,
    "token set ratio": fuzz.token_set_ratio,
    "partial token set ratio": fuzz.partial_token_set_ratio,
}
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for the current loop turn
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
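One caveat: growing a dataframe with pd.concat inside the loop copies all the accumulated data on every iteration. A common alternative is to collect the per-ratio frames in a list and concatenate once at the end. A sketch with stand-in ratio functions (the real ones come from fuzzywuzzy, and the column names are made up):

```python
import pandas as pd

# Stand-ins for the fuzz functions: each takes two strings, returns a score.
ratios = {
    "ratio": lambda a, b: len(set(a) & set(b)),
    "partial ratio": lambda a, b: abs(len(a) - len(b)),
}

df_out = pd.DataFrame({"base": ["abc", "abd"], "target": ["abc", "xyz"]})

frames = []
for r, rn in ratios.items():
    part = df_out[["base", "target"]].copy()
    part["mesure"] = r
    part["valeur"] = df_out.apply(lambda row: rn(row["base"], row["target"]), axis=1)
    frames.append(part)

final_df = pd.concat(frames, ignore_index=True)
# final_df has one block of rows per ratio, each tagged in 'mesure'.
```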

Merging uneven Corresponding Elements from a method returning value in Dictionaries

How can I sort data that is stored in a global list after inserting it within a method, so that the elements are paired up before they are stacked into another list? Or is it bad practice, and does it complicate things, to store data in a single global list instead of separate ones per method and sort them afterwards?
Below is the example of the scenario
list = []  # note: this shadows the built-in list
dictionary = {}

def MethodA():  # returns title
    # searches for corresponding data using BeautifulSoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list

def MethodB():  # returns description
    # searches for corresponding data using BeautifulSoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
Example of Wanted output
MethodA():[title] #scraps(text.title) data from the web
MethodB():[description] #scraps(text.description) from the web
#print(list)
>>>list=[{title,description},{title.description},{title,description},{title.description}]
Actual output
MethodA():[title] #scraps(text.title) data from the web
MethodB():[description] #scraps(text.description) from the web
#print(list)
>>>list =[{title},{title},{description},{description}]
There are a few examples I've seen, such as using NumPy and sorting them in an array:
arraylist = np.array(list)
arraylist[:, 0]
# but I get a 'too many indices for array' error
# because I have too much data loading in, including that some of the items
# have no data and are replaced with `None`, so there's an imbalance of indexes.
I'm trying to keep it as modular as possible. I've tried the usual iteration approach, but it gets complicated because I have to nest more loops in it. I've tried NumPy and enumerate, but I can't work out how to go about it. Because it's an unbalanced list, meaning that some values are returned as None, I get the error: all the input array dimensions except for the concatenation axis must match exactly.
Example : ({'Toy Box','Has a toy inside'},{'Phone', None }, {'Crayons','Used for colouring'})
Update: code sample of MethodA
def MethodA(tableName, rowName, selectedLink):
    try:
        for table_tag in selectedLink.find_all(tableName, {'class': rowName}):
            topic_title = table_tag.find('a', href=True)
            if topic_title:
                def_dict1 = {
                    'Titles': topic_title.text.replace("\n", "")}
                global_list.append(def_dict1)
                return def_dict1
    except:
        def_dict1 = None
Assuming you have something of the form:
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]
you can do:
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
or
dictionary = {k.pop(): v.pop() for k, v in zip(x[::2], x[1::2])}
The second method will empty the sets in x.
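Stepping back, the index imbalance disappears if each scrape appends one dict carrying both fields, instead of two separate one-element sets. A sketch with made-up data and a hypothetical record helper:

```python
# Build one record per item so missing fields stay aligned as None.
global_list = []

def record(title, description):
    global_list.append({"title": title, "description": description})

record("Toy Box", "Has a toy inside")
record("Phone", None)  # missing description stays as None, no index imbalance
record("Crayons", "Used for colouring")
# global_list is now a list of complete {title, description} records.
```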

Data filtering code in Pandas taking lot of time to run

I am executing the code below in Python. It's taking some time to run. Is there something I am doing wrong?
Is there a better way to do the same thing?
y = list(word)
words = y
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
similarity_matrix = similarity_matrix.drop(['Root_Word', 'Similar_Words'], axis=1)
similarity_matrix.columns = ['Root_Word', 'Similar_Words']
It is not possible to determine what is going on in the following line, as there is not enough information provided (I do not know what model is):
similar = [[item[0] for item in model.wv.most_similar(word) if item[1] > 0.7] for word in words]
The second line below does not seem necessary as you create a DataFrame similarity_matrix with only two columns:
similarity_matrix = pd.DataFrame({'Root_Word': words, 'Similar_Words': similar})
# This below does not do anything
similarity_matrix = similarity_matrix[['Root_Word', 'Similar_Words']]
The apply method is not very fast. Try using vectorized methods already implemented in pandas as shown below. Here is a useful link about this topic.
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Root'] = similarity_matrix['Root_Word'].str.join(', ')
Similarly:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].apply(lambda x: ', '.join(x))
# will be faster like this:
similarity_matrix['Unlist_Similar'] = similarity_matrix['Similar_Words'].str.join(', ')
The rest of the code could not run much faster.
If you provided more data/info we could help you more than that...
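For a quick sanity check that the two spellings agree, on toy data:

```python
import pandas as pd

s = pd.Series([["a", "b"], ["c", "d", "e"]])
via_apply = s.apply(lambda x: ", ".join(x))
via_str = s.str.join(", ")
# Both produce the strings "a, b" and "c, d, e".
```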
