How to use a loop to automate a repetitive task? - python-3.x

I have a function to create a final dataframe and need to apply this function to 10 different dataframes. Is there a way to use a loop to automate this instead of writing the dataframe names one by one?
def table_all(df, sales_date):
    df1 = df.loc[df['date'] == sales_date]
    ### more code ###
    df1.fillna(0, inplace=True)
    return df1
sales1 = table_all(df,sales_date1)
sales2 = table_all(df,sales_date2)
sales3 = table_all(df,sales_date3)
##### more code ####
sales10 = table_all(df,sales_date10)
I am hoping to use a loop like the one below, but it did not work. Any ideas would be greatly appreciated.
for x in range(1, 11):
    sales+str(x) = table_all(df, sales_date+str(x))

Prefer using a dict to store the results:
sales = {}
for x in range(1, 11):
    sales[x] = table_all(df, globals()["sales_date{}".format(x)])
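If the dates can be collected up front, you can avoid globals() entirely. A minimal, self-contained sketch (the sample dataframe and dates below are made up for illustration):
import pandas as pd

# Hypothetical sample data; substitute your real dataframe and dates.
df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'], 'amount': [10, 20]})

def table_all(df, sales_date):
    df1 = df.loc[df['date'] == sales_date].copy()
    df1.fillna(0, inplace=True)
    return df1

# Keeping the dates in a list means no dynamic variable names are needed.
sales_dates = ['2021-01-01', '2021-01-02']
sales = {date: table_all(df, date) for date in sales_dates}
print(sales['2021-01-01'])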
Pretty much the same question here

Related

What are faster ways of reading a big data set and applying row-wise operations, other than pandas and dask?

I am working on code where I need to populate a set of data structures based on each row of a big table. Right now, I am using pandas to read the data and do some elementary data-validation preprocessing. However, when I get to the rest of the process and put the data into the corresponding data structures, it takes a considerably long time for the loop to complete and for my data structures to be populated. For example, in the following code I have a table with 15M records. The table has three columns, and I create a foo() object based on each row and add it to a list.
# Profile.csv
# Index    | Name | Family | DB
# ---------|------|--------|-----------
# 0        | Jane | Doe    | 08/23/1977
# ...
# 15000000 | John | Doe    | 01/01/2000
class foo():
    def __init__(self, name, last, bd):
        self.name = name
        self.last = last
        self.bd = bd

def populate(row, my_list):
    my_list.append(foo(*row))
# reading the csv file and formatting the date column
df = pd.read_csv('Profile.csv')
df['DB'] = pd.to_datetime(df['DB'], format='%m/%d/%Y')

# using apply to create a foo() object per row and add it to the list
my_list = []
df.apply(populate, axis=1, args=(my_list,))
So after using pandas to convert the string dates to date objects, I just need to iterate over the DataFrame to create my objects and add them to the list. This process is very time-consuming (in my real example it takes even longer, since my data structure is more complex and I have more columns). So, I am wondering what the best practice is in this case to improve my run time. Should I even use pandas to read my big tables and process them row by row?
It would simply be faster to use a file handle:
input_file = "profile.csv"
sep = ";"
my_list = []

with open(input_file) as fh:
    # map each column name to its position using the header line
    cols = {}
    for i, col in enumerate(fh.readline().strip().split(sep)):
        cols[col] = i
    for line in fh:
        line = line.strip().split(sep)
        # reorder MM/DD/YYYY into YYYY-MM-DD
        date = line[cols["DB"]].split("/")
        date = [date[2], date[0], date[1]]
        line[cols["DB"]] = "-".join(date)
        populate(line, my_list)
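A side note (my addition, not part of the original answer): the standard csv module handles quoting and odd separators more robustly than str.split, at little extra cost. A sketch under the same assumptions (semicolon-separated file, Name/Family/DB columns, the foo class from the question):
import csv

my_list = []
with open("profile.csv", newline="") as fh:
    reader = csv.DictReader(fh, delimiter=";")
    for rec in reader:
        # reorder MM/DD/YYYY into YYYY-MM-DD, as above
        m, d, y = rec["DB"].split("/")
        my_list.append(foo(rec["Name"], rec["Family"], f"{y}-{m}-{d}"))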
There are multiple approaches for this kind of situation; however, the fastest and most effective method is vectorization, if possible. The solution for the example I demonstrated in this post using vectorization could be as follows:
my_list = [foo(*args) for args in zip(df["Name"], df["Family"], df["DB"])]
If vectorization is not possible, converting the data frame to a dictionary can significantly improve performance. For the current example it would be something like:
my_list = []
dc = df.to_dict()
for i in dc["Name"]:  # iterate over the row index via one column's inner dict
    my_list.append(foo(dc["Name"][i], dc["Family"][i], dc["DB"][i]))
The last solution is particularly effective when the data structures and processing are more complex.
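One more option worth mentioning (my addition, not from the answers above): df.itertuples() is usually much faster than apply() or iterrows() for building objects row by row, assuming the same Name/Family/DB columns:
# each row comes back as a namedtuple with attribute access
my_list = [foo(row.Name, row.Family, row.DB) for row in df.itertuples(index=False)]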

Dynamically generating an object's name in a pandas column using a for loop (fuzzywuzzy)

Low-level python skills here (learned programming with SAS).
I am trying to apply a series of fuzzy string matching (fuzzywuzzy lib) formulas on pairs of strings, stored in a base dataframe. Now I'm conflicted about the way to go about it.
Should I write a loop that creates a specific dataframe for each formula and then appends all these sub-dataframes into a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-dataframes, the resulting values get overwritten at each turn of the loop.
Or should I create one dataframe in a single loop, taking my formula names and expressions as a dict? The trouble here is the same problem as above.
Here is my formulas dict:
# ratios dict: all ratio names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio}
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for the current iteration
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set ratio').
My overall problem is that I cannot understand if and how I can dynamically name dataframes, columns or values in a python loop (or if I'm even supposed to do it).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()
for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data=df_out, columns=[base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis=1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
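A slightly faster variant (my suggestion, building on the answer above): collect the per-ratio frames in a list and concatenate once at the end, which avoids rebuilding final_df on every iteration. This sketch assumes the df_out, base_column, and target_column names from the question:
frames = []
for r, rn in ratios.items():
    # keep only the two string columns, then attach the ratio name and scores
    df_out1 = df_out[[base_column, target_column]].copy()
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(lambda row: rn(row[base_column], row[target_column]), axis=1)
    frames.append(df_out1)
final_df = pd.concat(frames, ignore_index=True)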

Iterate over 4 pandas data frame columns and store them into 4 lists with one for loop instead of 4 for loops

I am currently working with pandas structures in Python. I wrote a function that extracts data from a pandas data frame and stores it in lists. The code is working, but I feel like there is a part that I could write with one for loop instead of four. I will give you an example below. The idea of this part of the code is to extract four columns from a pandas data frame into four lists. I did it with 4 separate for loops, but I want one loop that does the whole thing.
col1, col2, col3, col4 = [], [], [], []
for j in abc['col1']:
    col1.append(j)
for k in abc['col2']:
    col2.append(k)
for l in abc['col3']:
    col3.append(l)
for n in abc['col4']:
    col4.append(n)
And my idea is to write one for loop that does all of this. I tried something like the following, but it doesn't work.
col1, col2, col3, col4 = [], [], [], []
for j, k, l, n in abc[['col1', 'col2', 'col3', 'col4']]:
    col1.append(j)
    col2.append(k)
    col3.append(l)
    col4.append(n)
Can you help me wrap the four for loops into one? I would appreciate your help!
You don't need to use loops at all; you can just convert each column into a list directly.
list_1 = df["col"].to_list()
Have a look at this previous question.
Treating a pandas dataframe like a list usually works, but it is very bad for performance. I'd consider using the iterrows() function instead.
This would work as in the following example:
col1, col2, col3, col4 = [], [], [], []
for index, row in df.iterrows():
    col1.append(row['col1'])
    col2.append(row['col2'])
    col3.append(row['col3'])
    col4.append(row['col4'])
It's probably easier to use pandas .values and then numpy.ndarray.tolist():
cols = ['col1', 'col2', 'col3', 'col4']
data = [None] * len(cols)
for i in range(len(cols)):
    data[i] = df[cols[i]].values.tolist()
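For completeness, a one-pass sketch (my addition, assuming the abc frame from the question) that avoids explicit loops entirely:
# build one list per column in a single dict comprehension
lists = {c: abc[c].tolist() for c in ['col1', 'col2', 'col3', 'col4']}
col1, col2, col3, col4 = (lists[c] for c in ['col1', 'col2', 'col3', 'col4'])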

How to make Array inside For Loop globally accessible in Python

I have the following code-
data = [row for row in csv.reader(f)]
for i in range(1, 145):
    for j in range(0, 35):
        if j % 2 == 0:
            x_data = data[i][j]
I want to access x_data in a different function, def compareCell, for comparison. How can I access the array from inside that function? Any help will be highly appreciated.
Update: the actual situation is as follows.
[diagram]
Case1, Case2, ... are generated in real time, and I need to compare each with the CSV file data, as shown in the diagram above.
Thanks!
You can define x_data at the beginning of your code:
x_data = []  # considering your data is an array

# ... some code (optional)

data = [row for row in csv.reader(f)]
for i in range(1, 145):
    for j in range(0, 35):
        if j % 2 == 0:
            x_data.append(data[i][j])  # append, so the values are collected instead of overwritten

def compareCell(my_data):
    # some code
    ...

compareCell(x_data)  # for example
In Python, you can use the global statement; note that global x_data = data[i][j] is invalid syntax, since the declaration and the assignment must be separate statements. However, I would suggest you call your function from inside the loop, as mentioned in the comments.
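For reference, a minimal sketch of the valid global pattern (the function and variable names here are illustrative):
x_data = []

def fill_x_data(data):
    global x_data  # declaration only; the assignment is a separate statement
    x_data = [row[0] for row in data]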

How to save tuple output from a for loop to a DataFrame in Python

I have some data, 33k rows x 57 columns.
In some columns there is data which I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem with saving the tuple output from the for loop.
I am using tuples to build a good translation; .join and .append are not working in my case. I have tried many approaches, but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the values from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
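As a follow-up sketch (my addition, assuming slownik maps single characters as in the question), the whole column can also be built in one pass without iterrows():
# build the entire 'translated' column with a single Series.apply
data['translated'] = data['1st_service'].apply(
    lambda s: tuple(slownik.get(znak) for znak in s)
)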
I have managed it; working code below:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []      # column names elided in the original post
slownik = dict([ ])    # dictionary contents elided in the original post
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done? Maybe there is a faster way to do it?
I am doing this for 12 columns in one for loop, as shown above.
