Improvement of a Python script | Performance - python-3.x

I wrote some code, but it's very slow. The goal is to look for matches; they don't have to be exact one-to-one matches.
I have a DataFrame with about 3,600,000 entries --> "SingleDff"
I have a DataFrame with about 110,000 entries --> "dfnumbers"
Now the code tries to find out which of the 110,000 entries can be found among the 3,600,000 entries.
I added a counter to see how "fast" it runs. After 24 hours I had only processed about 11,000 entries, roughly 10% of the total.
I'm now looking for ways and/or ideas to improve the performance of the code.
The Code:
import os
import glob
import numpy as np
import pandas as pd

# Preparation: read all CSV input files into one big DataFrame
pathfiles = 'C:\\Python\\Data\\Input\\'
df_Files = glob.glob(pathfiles + "*.csv")
df_Files = [pd.read_csv(f, encoding='utf-8', sep=';', low_memory=False) for f in df_Files]
SingleDff = pd.concat(df_Files, ignore_index=True, sort=True)

dfnumbers = pd.read_excel('C:\\Python\\Data\\Input\\UniqueNumbers.xlsx')

# Output
outputDf = pd.DataFrame()
SingleDff['isRelevant'] = np.nan

count = 0
max_count = len(dfnumbers['Korrigierter Wert'])  # renamed: 'max' shadows the built-in
arrayVal = dfnumbers['Korrigierter Wert']

for txt in arrayVal:
    # one full scan of the 3.6M rows per number
    outputDf = outputDf.append(SingleDff[SingleDff['name'].str.contains(txt)], ignore_index=True)
    outputDf['isRelevant'] = np.where(outputDf['isRelevant'].isnull(), txt, outputDf['isRelevant'])
    count += 1

outputDf.to_csv('output_match.csv')
Edit:
Example of the data:
In the 110,000-entry DataFrame I have values like:
ABCD-12345-1245-T1
ACDB-98765-001 AHHX800.0-3
In the huge DataFrame I have entries like:
AHSG200-B0097小样图.dwg
MUDI-070097-0-05-00.dwg
ABCD-12345-1245.xlsx
ABCD-12345-1245.pdf
ABCD-12345.xlsx
Now I try to find matches, i.e. for which numbers we can find documents.
Thank you for your input.
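One direction that might help (a sketch, not a drop-in replacement, assuming the 'Korrigierter Wert' and 'name' columns from the code above): instead of scanning the 3.6M rows once for every one of the 110,000 numbers, build a single regex alternation from all the numbers and let pandas extract the matching number in one pass. If the combined pattern turns out to be too large or too slow, splitting the number list into chunks of a few thousand and concatenating the partial results is a middle ground.
import re
import pandas as pd

# Escape each number so characters like '.' or '-' are matched literally
patterns = [re.escape(str(v)) for v in dfnumbers['Korrigierter Wert'].dropna().unique()]

# One big alternation: (num1|num2|...); str.extract returns the first number
# found in each 'name', or NaN when none of them occurs
big_pattern = '(' + '|'.join(patterns) + ')'
SingleDff['isRelevant'] = SingleDff['name'].str.extract(big_pattern, expand=False)

# Keep only the rows for which a number was found
outputDf = SingleDff.dropna(subset=['isRelevant'])
outputDf.to_csv('output_match.csv')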

Related

Perform code on multiple files 1 by 1 pandas

Hi, I have written code to read a .csv file in a folder and add some required columns.
I now want to run this code on multiple files within the path folder, one by one, and save each as a separate DataFrame.
My current code is as follows:
import pandas as pd
import numpy as np
import glob
import os

path = r'C:\Users\jake.jennings.BRONCO\Desktop\GPS Reports\Games\Inputs\2022-03-27 Vs Cowboys\Test'  # use your path
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    frame = pd.read_csv(filename, index_col=None, skiprows=8)
    frame['filename'] = os.path.basename(filename)  # tag each row with its source file
    li.append(frame)

frame = pd.concat(li, axis=0, ignore_index=True)

# Add odometer change and turn all accel values to positive
frame['OdChange'] = frame['Odometer'].diff()
frame['accelpos'] = frame['Acceleration'].abs()

# Add column with OdChange where Velocity >= 5.5 m/s (numeric 0 so the rolling sum works)
frame["new1"] = np.where(frame.Velocity >= 5.5, frame["OdChange"], 0)

# Add column with accels/decels >= 2.5 m/s^2 for AccelDec/min
frame["new2"] = np.where(frame.accelpos >= 2.5, frame["accelpos"], 0)

# Flag rows where Acceleration >= 2.5 m/s^2
frame["new3"] = np.where(frame.Acceleration >= 2.5, 1, 0)
s = frame['new3'].astype(int)
frame['new4'] = s.diff().fillna(s).eq(1).astype(int)

# m/min peaks
frame['1minOD'] = frame['OdChange'].rolling(window=600).sum()
# HS m/min peaks
frame['1minHS'] = frame['new1'].rolling(window=600).sum()
# AccImpulse/min
frame['1minImp'] = frame['accelpos'].rolling(window=600).mean() * 60
# AccDec peak count
frame['1minAccCount'] = frame['new4'].rolling(window=600).sum()

print(frame)
I am not sure if this is even the best way to do what I am trying to do. Any help would be appreciated!
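Since the goal is one DataFrame per file, a sketch of one way to structure it (the function name process_file and the dictionary dfs are illustrative, not from the original post): wrap the column calculations in a function, call it once per CSV, and keep the results in a dictionary keyed by filename.
import glob
import os
import pandas as pd

def process_file(filename):
    # read one report and add the derived columns shown above
    frame = pd.read_csv(filename, index_col=None, skiprows=8)
    frame['OdChange'] = frame['Odometer'].diff()
    frame['accelpos'] = frame['Acceleration'].abs()
    # ... remaining column calculations from the code above go here ...
    return frame

path = r'C:\Users\jake.jennings.BRONCO\Desktop\GPS Reports\Games\Inputs\2022-03-27 Vs Cowboys\Test'
dfs = {}
for filename in glob.glob(path + "/*.csv"):
    key = os.path.splitext(os.path.basename(filename))[0]
    dfs[key] = process_file(filename)  # one separate DataFrame per file

# Each report is now available on its own, e.g. dfs[list(dfs)[0]].head()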

finding latest trip information from a large data frame

I have one requirement:
I have a dataframe "df_input" with about 20M rows of trip details; the columns are "vehicle-no", "geolocation", "start" and "end".
For each vehicle number there are multiple rows, each with a different geolocation for a different trip.
Now I want to create a new dataframe df_final which keeps only the first record for each vehicle-no. How can I do that efficiently?
I used something like the code below, which takes more than 5 hours to complete:
import pandas as pd
import dfply as dp
from dfply import X

output_df_columns = ["vehicle-no", "start", "end", "geolocations"]
df_final = pd.DataFrame(columns=output_df_columns)  # create empty dataframe
unique_vehicle_no = list(df_input["vehicle-no"].unique())
df_input.sort_values(["start"], inplace=True)

for each_vehicle in unique_vehicle_no:
    # item access, because the hyphen in "vehicle-no" breaks X.vehicle-no attribute access
    df_temp = (df_input >> dp.mask(X["vehicle-no"] == each_vehicle))
    df_final = df_final.append(df_temp.head(1), ignore_index=True, sort=False)
I think this will work out
import pandas as pd
import numpy as np
df_input=pd.DataFrame(np.random.randint(10,size=(1000,3)),columns=['Geolocation','start','end'])
df_input['vehicle_number']=np.random.randint(100,size=(1000))
print(df_input.shape)
print(df_input['vehicle_number'].nunique())
df_final=df_input.groupby('vehicle_number').apply(lambda x : x.head(1)).reset_index(drop=True)
print(df_final['vehicle_number'].nunique())
print(df_final.shape)
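For the original 20M-row df_input (with its hyphenated column name), a sketch of an equivalent approach that avoids any per-vehicle Python loop: sort once by "start", then drop duplicate vehicle numbers keeping the first row.
import pandas as pd

# Assumes df_input has columns "vehicle-no", "geolocation", "start", "end"
df_final = (df_input.sort_values("start")
                    .drop_duplicates(subset="vehicle-no", keep="first")
                    .reset_index(drop=True))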

Performing a calculation on every item in a DataFrame

I have a large pandas DataFrame with 1M rows. I want to perform a calculation on every item and create a new DataFrame from the results.
The way I'm currently doing it is extremely slow. Any thoughts on how I might improve the efficiency?
# Create some random data in a DataFrame
import pandas as pd
import numpy as np

dfData = pd.DataFrame(np.random.randint(0, 1000, size=(100, 10)), columns=list('ABCDEFGHIJ'))

# Key values
colTotals = dfData.sum(axis=0)
rowTotals = dfData.sum(axis=1)
total = dfData.values.sum()

dfIdx = pd.DataFrame()
for respId, row in dfData.iterrows():
    for scores in row.iteritems():
        colId = scores[0]
        score = scores[1]
        # Do the calculation
        idx = (score / colTotals[colId]) * (total / rowTotals[respId]) * 100
        dfIdx.loc[respId, colId] = idx
I think this is the logic of your code:
dfData.div(colTotals).mul((total / rowTotals) * 100, 0)
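As a quick sanity check on the small dfData above (assuming the loop has already filled dfIdx), the two routes can be compared element-wise:
import numpy as np

dfVec = dfData.div(colTotals).mul((total / rowTotals) * 100, 0)
# Both methods should agree up to floating-point error
print(np.allclose(dfIdx.loc[dfVec.index, dfVec.columns], dfVec))  # expected: True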

How to import files using a for loop with path names in dictionary in Python?

I want to create a dictionary which holds all the information needed to import the files, parse dates, etc. Then I want to use a for loop to import all of those files, but after the for loop finishes I'm left with only the last dataset from the dictionary, as if each import overwrites the previous one.
I execute the file in the path folder, so that's not the problem.
I tried creating a new dictionary to which I add each import, but that makes it much harder to reference them later; I want them as separate dataframes in the variable explorer.
Here's the code:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator # for time series visualisation
# Import data
#PATH = r"C:\Users\sherv\OneDrive\Documents\GitHub\Python-Projects\Research Project\Data"
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"],
"GDP": ["GDP.csv", "DATE"],
"UE": ["Unemployment_2004_Present_US(Grab-5-12-18).csv", "DATE"],
"SP500": ["S&P500.csv", "Date"],
"IR": ["InterestRate_2004-1-1_Present_US(Grab-5-12-18).csv", "DATE"],
"PPI": ["PPIACO.csv", "DATE"],
"PMI": ["ISM-MAN_PMI.csv", "Date"]}
for dataset in data.keys():
    dataset = pd.read_csv("%s" %(data[dataset][0]), index_col="%s" %(data[dataset][1]), parse_dates=["%s" %(data[dataset][1])])
    dataset = dataset.loc["2004-01-01":"2018-09-01"]
# Visualise
minor_locator = AutoMinorLocator(12)
# Investigating overall trends
def google_v_X(Data_col, yName, title):
    fig, ax1 = plt.subplots()
    google["Top5"].plot(ax=ax1, color='b').xaxis.set_minor_locator(minor_locator)
    ax1.set_xlabel('Date')
    ax1.set_ylabel('google (%)', color='b')
    ax1.tick_params('y', colors='b')
    plt.grid()
    ax2 = ax1.twinx()
    Data_col.plot(ax=ax2, color='r')
    ax2.set_ylabel('%s' %(yName), color='r')
    ax2.tick_params('y', colors='r')  # tick_params expects an axis name ('y'), not the label
    plt.title("Google vs %s trends" %(title))
# Google-CPI
google_v_X(CPI["CPI"], "CPI 1982-1985=100 (%)", "CPI")
# Google-RDPI
google_v_X(RDPI["DSPIC96"], "RDPI ($)", "RDPI")
# Google-GDP
google_v_X(GDP["GDP"], "GDP (B$)", "GDP")
# Google-UE
google_v_X(UE["Value"], "Unemployed persons", "Unemployment")
# Google-SP500
google_v_X(SP500["Close"], "SP500", "SP500")
# Google-PPI
google_v_X(PPI["PPI"], "PPI")
# Google-PMI
google_v_X(PMI["PMI"], "PMI", "PMI")
# Google-IR
google_v_X(IR["FEDFUNDS"], "Fed Funds Rate (%)", "Interest Rate")
I also tried creating a function to read and parse and then use that in a loop like:
def importdata(key, path, parseCol):
    key = pd.read_csv("%s" %(path), index_col="%s" %(parseCol), parse_dates=["%s" %(parseCol)])
    key = key.loc["2004-01-01":"2018-09-01"]

for dataset in data.keys():
    importdata(dataset, data[dataset][0], data[dataset][1])
But I get an error because it doesn't recognise the path as a string, and it says it's not defined.
How can I stop them overwriting each other, or how can I get Python to recognise the input to the function as a string? Any help is appreciated, thanks.
The for loop is referencing the same dataset variable, so each time the loop executes, the variable is replaced with the newly imported dataset. You need to store the result somewhere, whether that's in a new variable each time or in a dictionary. Try something like this:
googleObj = None
RDPIObj = None
CPIObj = None

data = {"google": [googleObj, "multiTimeline.csv", "Month"],
        "RDPI": [RDPIObj, "RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
        "CPI": [CPIObj, "CPI.csv", "DATE"]}

for dataset in data.keys():
    obj = pd.read_csv("%s" %(data[dataset][1]), index_col="%s" %(data[dataset][2]), parse_dates=["%s" %(data[dataset][2])])
    obj = obj.loc["2004-01-01":"2018-09-01"]
    data[dataset][0] = obj  # store the loaded frame back into the dict entry
This way you end up with a separate dataframe object for each of your datasets, stored back into its dict entry. The downside is that you have to define each variable by hand.
Another option is making a second dictionary like you mentioned, something like this:
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"]}
output_data = {}
for dataset_key in data.keys():
dataset = pd.read_csv("%s" %(data[dataset_key][0]), index_col="%s" %(data[dataset_key][1]), parse_dates=["%s" %(data[dataset_key][1])])
dataset = dataset.loc["2004-01-01":"2018-09-01"]
output_data[dataset_key] = dataset
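Each frame can then be pulled out of output_data by its key whenever it is needed, for example:
google = output_data["google"]  # the Google Trends frame
CPI = output_data["CPI"]
print(google.head())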
Reproducible example (however you should be very careful with using "exec"):
# Generating data
import os
import pandas as pd
os.chdir(r'C:\Windows\Temp')
df1 = pd.DataFrame([['a',1],['b',2]], index=[0,1], columns=['col1','col2'])
df2 = pd.DataFrame([['c',3],['d',4]], index=[2,3], columns=['col1','col2'])
# Exporting data
df1.to_csv('df1.csv', index_label='Month')
df2.to_csv('df2.csv', index_label='DATE')
# Definition of Loading metadata
loading_metadata = {
'df1_loaded':['df1.csv','Month'],
'df2_loaded':['df2.csv','DATE'],
}
# Importing in accordance with loading_metadata (note the indentation)
for dataset in loading_metadata.keys():
    print(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1])
    exec(
        """
{0} = pd.read_csv('{1}', index_col='{2}').rename_axis('')
""".format(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1])
    )
Exported data (df1.csv):
Month,col1,col2
0,a,1
1,b,2
Exported data (df2.csv):
DATE,col1,col2
2,c,3
3,d,4
Loaded data:
df1_loaded
  col1  col2
0    a     1
1    b     2

df2_loaded
  col1  col2
2    c     3
3    d     4

parallel write to different groups with h5py

I'm trying to use parallel h5py to create an independent group for each process and fill each group with some data. What happens is that only one group gets created and filled with data. This is the program:
from mpi4py import MPI
import h5py
rank = MPI.COMM_WORLD.Get_rank()
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)
data = range(1000)
dset = f.create_dataset(str(rank), data=data)
f.close()
Any thoughts on what is going wrong here?
Thanks a lot.
OK, so as mentioned in the comments, I had to create the datasets for every process and then fill them up. The following code writes data in parallel, as many times as the size of the communicator:
import random
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = [random.randint(1, 100) for x in range(4)]

# dataset creation is collective: every rank must create every dataset
f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=comm)
dset = []
for i in range(size):
    dset.append(f.create_dataset('test{0}'.format(i), (len(data),), dtype='i'))

# each rank then writes only into its own dataset
dset[rank][:] = data
f.close()
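For the parallel writes to actually happen on several ranks, the script has to be launched through an MPI launcher so that size is greater than 1, for example (assuming the script is saved as parallel_test.py):
mpiexec -n 4 python parallel_test.py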
