Dataframe creation from text file and regex: python code optimisation - python-3.x

I want to extract patterns from a text file and create a pandas dataframe.
Each line inside the text file looks like this:
2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89
I want to extract the following patterns:
+12-34, 1.11, a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89 where cols={id,res,a,b,p,f,r,pb,pbb,prr,du}.
I have written the following code to extract the patterns and create the dataframe. The file is around 500 MB and contains a huge number of rows.
import glob
import re
import pandas as pd
# path_torawfolder is the folder that holds the raw .txt files
files = glob.glob(path_torawfolder + "*.txt")
lines = []
for fle in files:
    with open(fle) as f:
        items = {}
        lines += f.readlines()
df = pd.DataFrame()
for l in lines:
    # text after the last "+" and after the first "= "
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})',feature_interest)[0]
    dct = {k:[v] for k,v in feature_dict.items()}
    series = pd.DataFrame(dct)
    #print(series)
    # one concat per line: this is the slow part
    df = pd.concat([df,series], ignore_index=True)
Any suggestions to optimize the code and reduce the processing time, please?
Thanks!

A bit of improvement: in the previous code, there were a few unnecessary conversions from dict to df.
dicts = []
def create_dataframe():
    df = pd.DataFrame()
    for l in lines:
        feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
        feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
        feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
        feature_dict["res"] = re.findall(r'(\d\.\d{2})',feature_interest)[0]
        # collect plain dicts and build the dataframe once at the end
        dicts.append(feature_dict)
    df = pd.DataFrame(dicts)
    return df
Line #      Hits         Time    Per Hit   % Time  Line Contents
     8                                             def create_dataframe():
     9         1        551.0      551.0      0.0      df = pd.DataFrame()
    10   1697339     727220.0        0.4      1.7      for l in lines:
    11   1697338    1706328.0        1.0      4.0          feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    12   1697338   20857891.0       12.3     49.1          feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    13
    14   1697338    1987874.0        1.2      4.7          feature_dict["ctry_provider"] = (l.split("+")[-1]).split(" =")[0]
    15   1697338    9142820.0        5.4     21.5          feature_dict["acpa_codes"] = re.findall(r'(\d\.\d{2})',feature_interest)[0]
    16   1697338    1039880.0        0.6      2.4          dicts.append(feature_dict)
    17
    18         1    7025303.0  7025303.0     16.5      df = pd.DataFrame(dicts)
    19         1          2.0        2.0      0.0      return df
This improvement reduced the computation time to a few minutes. Are there any more suggestions for optimizing it, for example with dask or parallel computing?
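Since the profile shows most of the time going into the two re.findall calls, one thing that might be worth trying is precompiling the patterns and scanning each line only once. The sketch below is a guess at that idea, not a measured result: the names KV_RE, HEAD_RE, parse_line and iter_records are made up for illustration, and it assumes every line has the "+id = res" shape of the sample line. It keeps the leading "+" in the id, as in the desired output of the question.

```python
import re

# Hypothetical names; both patterns are compiled once instead of on every line.
KV_RE = re.compile(r"(\S+)=(\w+)")              # key=value pairs, as in the question
HEAD_RE = re.compile(r"\+(\S+) = (\d+\.\d+)")   # "+12-34 = 1.11" -> id and res


def parse_line(line):
    """Return one record as a dict, or None if the line has no '+id = res' part."""
    head = HEAD_RE.search(line)
    if head is None:
        return None
    # Only scan the text after the id/res part for key=value pairs.
    record = dict(KV_RE.findall(line, head.end()))
    record["id"] = "+" + head.group(1)
    record["res"] = head.group(2)
    return record


def iter_records(paths):
    """Yield records file by file without loading all lines into memory first."""
    for path in paths:
        with open(path) as handle:
            for line in handle:
                record = parse_line(line)
                if record is not None:
                    yield record


sample = ("2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= "
          "p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89")
record = parse_line(sample)
print(record)
```

The records can then go into pd.DataFrame(list(iter_records(files))), the same single-construction step as in the improved version above. Because each line is parsed independently, the per-file work could in principle be spread over several processes, but whether that pays off depends on how many files there are.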

Related

Append lines of data to a Pandas Dataframe that is not associated with the existing dataframe

How to go about adding lines of text to an existing Pandas Dataframe?
I have saved a pandas dataframe via this command:
predictionsdf = pd.DataFrame(predictions, columns=['File_name', 'Actual_class', 'pred_class', 'Boom'])
The saved data looks like this:
I wanted to add lines of text like this:
Total # of Boom detection = 1 from 100 files
Percentage of Boom detection from plastic bag pop = 1.0 %
Time: 0.43 mins
at the bottom of the dataframe data.
Can you tell me how to go about appending these lines to the bottom of the dataframe?
Thanks!
I'm not sure I understand what you are trying to do here, but with the following toy dataframe and lines of text:
import pandas as pd
pd.set_option("max_colwidth", 100)
df = pd.DataFrame(
    {
        "File_name": [
            "15FT_LabCurtain_S9_pt5GAL_TCL.wav",
            "15FT_LabCurtain_S9_pt6GAL_TCL.wav",
            "15FT_LabCurtain_S9_pt7GAL_TCL.wav",
        ],
        "Actual_class": ["plastic_bag", "plastic_bag", "plastic_bag"],
        "pred_class": ["plastic_bag", "plastic_bag", "plastic_bag"],
        "Boom": [0, 0, 1],
    }
)
lines = (
    "Total # of Boom detection = 1 from 100 files",
    "Percentage of Boom detection from plastic bag pop = 1.0 %",
    "Time: 0.43 mins",
)
You could try this:
for line in lines:
    df.loc[df.shape[0] + 1, "File_name"] = line
df = df.fillna("")
print(df)
# Output
                                                   File_name  Actual_class   pred_class  Boom
0                          15FT_LabCurtain_S9_pt5GAL_TCL.wav   plastic_bag  plastic_bag   0.0
1                          15FT_LabCurtain_S9_pt6GAL_TCL.wav   plastic_bag  plastic_bag   0.0
2                          15FT_LabCurtain_S9_pt7GAL_TCL.wav   plastic_bag  plastic_bag   1.0
4               Total # of Boom detection = 1 from 100 files
5  Percentage of Boom detection from plastic bag pop = 1.0 %
6                                            Time: 0.43 mins
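Note that in the output above the Boom column is no longer an integer column once the text rows are added. If the goal is only to have the summary lines at the bottom of the saved file, another option might be to leave the dataframe untouched and write the lines after the table when saving. This is just a sketch under that assumption; the StringIO buffer stands in for a real file opened for writing, and the two-row dataframe is a shortened copy of the toy data.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "File_name": [
            "15FT_LabCurtain_S9_pt5GAL_TCL.wav",
            "15FT_LabCurtain_S9_pt6GAL_TCL.wav",
        ],
        "Actual_class": ["plastic_bag", "plastic_bag"],
        "pred_class": ["plastic_bag", "plastic_bag"],
        "Boom": [0, 1],
    }
)
footer = (
    "Total # of Boom detection = 1 from 100 files",
    "Percentage of Boom detection from plastic bag pop = 1.0 %",
    "Time: 0.43 mins",
)

# Assumption: an in-memory buffer is used here in place of an open text file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # the table itself, dtypes untouched
for line in footer:
    buffer.write(line + "\n")    # summary lines appended below the table

report = buffer.getvalue()
print(report)
```

The dataframe keeps its integer Boom column, and only the saved text carries the extra lines.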

Iterate through excel files' sheets and append if sheet names share common part in Python

Let's say we have many Excel files, each with multiple sheets as follows:
Sheet 1: 2021_q1_bj
   a  b   c  d
0  1  2  23  2
1  2  3  45  5
Sheet 2: 2021_q2_bj
   a  b   c  d
0  1  2  23  6
1  2  3  45  7
Sheet 3: 2019_q1_sh
   a  b   c
0  1  2  23
1  2  3  45
Sheet 4: 2019_q2_sh
   a  b   c
0  1  2  23
1  2  3  40
I wish to append all sheets into one if the last part of their names (split by _) is the same, across all Excel files. For example, sheet 1 will be appended to sheet 2 since both end in bj; if another Excel file also has sheets whose names end in bj, they will be appended to this one as well. The same logic applies to sheet 3 and sheet 4.
How could I achieve that in Pandas or other Python packages?
The expected result for current excel file would be:
bj:
   a  b   c  d
0  1  2  23  2
1  2  3  45  5
2  1  2  23  6
3  2  3  45  7
sh:
   a  b   c
0  1  2  23
1  2  3  45
2  1  2  23
3  2  3  40
Code for reference:
import os, glob
import pandas as pd
files = glob.glob("*.xlsx")
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    df_out = pd.concat(dfs.values(), keys=dfs.keys())
    for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
        g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx', index=False)
Update:
You may download test excel files and final expected result from this link.
Try:
dfs = pd.read_excel('Downloads/WS_1.xlsx', sheet_name=None, index_col=[0])
df_out = pd.concat(dfs.values(), keys=dfs.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'Out_{n}.xlsx')
Update
import os, glob
import pandas as pd
files = glob.glob("Downloads/test_data/*.xlsx")
writer = pd.ExcelWriter('Downloads/test_data/Output_file.xlsx', engine='xlsxwriter')
excel_dict = {}
for each in files:
    dfs = pd.read_excel(each, sheet_name=None, index_col=[0])
    # collect the sheets of every file; sheets with identical names overwrite each other
    excel_dict.update(dfs)
# concatenate the sheets collected from all files, not only those of the last file read
df_out = pd.concat(excel_dict.values(), keys=excel_dict.keys())
for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
    g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(writer, index=False, sheet_name=f'{n}')
# close() also saves the workbook; writer.save() was removed in pandas 2.0
writer.close()
I have completed the whole process and got the final expected result with the code below.
I would appreciate alternative and more concise solutions, or any advice, if possible:
import os, glob
import pandas as pd
from pandas import ExcelWriter
from datetime import datetime
def save_xls(dict_df, path):
    writer = ExcelWriter(path)
    for key in dict_df:
        dict_df[key].to_excel(writer, sheet_name=key, index=False)
    # close() also saves the workbook; writer.save() was removed in pandas 2.0
    writer.close()
root_dir = './original/'
for root, subFolders, files in os.walk(root_dir):
    # print(subFolders)
    for file in files:
        if '.xlsx' in file:
            file_path = os.path.join(root_dir, file)
            print(file)
            f = pd.ExcelFile(file_path)
            dict_dfs = {}
            for sheet_name in f.sheet_names:
                df_new = f.parse(sheet_name = sheet_name)
                print(sheet_name)
                ## get the year and quarter from the sheet name
                year, quarter, city = sheet_name.split("_")
                # year, quarter, city = sheet_name.split("_")
                df_new["year"] = year
                df_new["quarter"] = quarter
                df_new["city"] = city
                dict_dfs[sheet_name] = df_new
            save_xls(dict_df = dict_dfs, path = './add_columns_from_sheet_name/' + "new_" + file)
root_dir = './add_columns_from_sheet_name/'
list1 = []
df = pd.DataFrame()
for root, subFolders, files in os.walk(root_dir):
    # print(subFolders)
    for file in files:
        if '.xlsx' in file:
            # print(file)
            city = file.split('_')[0]
            # print(file)
            file_path = os.path.join(root_dir, file)
            # print(file_path)
            dfs = pd.read_excel(file_path, sheet_name=None)
            df_out = pd.concat(dfs.values(), keys=dfs.keys())
            for n, g in df_out.groupby(df_out.index.to_series().str[0].str.rsplit('_', n=1).str[-1]):
                print(n)
                timestr = datetime.utcnow().strftime('%Y%m%d-%H%M%S%f')[:-3]
                g.droplevel(level=0).dropna(how='all', axis=1).reset_index(drop=True).to_excel(f'./output/{n}_{timestr}.xlsx', index=False)
file_set = set()
file_dir = './output/'
file_list = os.listdir(file_dir)
for file in file_list:
    data_type = file.split('_')[0]
    file_set.add(data_type)
print(file_set)
file_dir = './output'
file_list = os.listdir(file_dir)
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df4 = pd.DataFrame()
file_set = set()
for file in file_list:
    if '.xlsx' in file:
        # print(file)
        df_temp = pd.read_excel(os.path.join(file_dir, file))
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        if 'bj' in file:
            df1 = pd.concat([df1, df_temp])
        elif 'sh' in file:
            df2 = pd.concat([df2, df_temp])
        elif 'gz' in file:
            df3 = pd.concat([df3, df_temp])
        elif 'sz' in file:
            df4 = pd.concat([df4, df_temp])
# function
def dfs_tabs(df_list, sheet_list, file_name):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0 , startcol=0, index=False)
    # close() also saves the workbook; writer.save() was removed in pandas 2.0
    writer.close()
# list of dataframes and sheet names
dfs = [df1, df2, df3, df4]
sheets = ['bj', 'sh', 'gz', 'sz']
# run function
dfs_tabs(dfs, sheets, './final/final_result.xlsx')
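A possibly more concise way to do the grouping step is to collect the sheets in a dictionary keyed by the last part of the sheet name and concatenate each group once. The sketch below only illustrates that idea: the sheets dict is built in memory as a stand-in for what pd.read_excel(path, sheet_name=None) returns, and the names groups and combined are made up for this example.

```python
from collections import defaultdict

import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None), which returns {sheet_name: DataFrame}.
sheets = {
    "2021_q1_bj": pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": [23, 45], "d": [2, 5]}),
    "2021_q2_bj": pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": [23, 45], "d": [6, 7]}),
    "2019_q1_sh": pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": [23, 45]}),
    "2019_q2_sh": pd.DataFrame({"a": [1, 2], "b": [2, 3], "c": [23, 40]}),
}

# Group the frames by the part of the sheet name after the last underscore.
groups = defaultdict(list)
for name, frame in sheets.items():
    suffix = name.rsplit("_", 1)[-1]
    groups[suffix].append(frame)

# One concat per group; ignore_index renumbers the rows 0..n-1.
combined = {
    suffix: pd.concat(frames, ignore_index=True)
    for suffix, frames in groups.items()
}

for suffix, frame in combined.items():
    print(suffix)
    print(frame)
```

With several Excel files, the same groups dict could be filled inside the loop over files before the final concat, and each combined frame written with to_excel(writer, sheet_name=suffix, index=False) as in the answer above.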

Python Pandas apply function not being applied to every row when using variables from a DataFrame

I have this weird pandas problem: when I use the apply function with values taken from a data frame, it only gets applied to the first row:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [10, 20]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
#variable data frame - pull values from this to edit main data frame
headerVariables = [['varA', 'varB']]
valuesVariables = [[2, 10]]
dfVariables = pd.DataFrame(valuesVariables, columns = headerVariables)
dfVariables.to_csv('Variables.csv', index=False)
readVariablesCSV = pd.read_csv('Variables.csv')
readVarA = readVariablesCSV['varA']
readVarB = readVariablesCSV['varB']
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
   dataA  dataB
0   50.0  100.0
1    NaN    NaN
But when I just use regular variables (not being called from a data frame), it functions just fine:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [20, 40]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
# variables
readVarA = 2
readVarB = 10
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
   dataA  dataB
0   50.0  100.0
1  100.0  200.0
Help please I'm pulling my hair out.
If you take readVarA and readVarB from the dataframe by selecting a column, each one is a pandas Series with its own index (here just the label 0). apply passes every column of readMainDataCSV to formula as a Series with the labels 0 and 1, and arithmetic between two Series aligns on the index labels, so row 1 has no matching label and becomes NaN.
You can take the first value from each Series to get a plain scalar, like this:
def formula(x):
    return (x / readVarA.iloc[0]) * readVarB.iloc[0]
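To see the index alignment in isolation, here is a small sketch that builds the two frames directly instead of going through CSV files (the names main, variables, misaligned and result are only for this example). It also shows that apply is not strictly needed once the variables are scalars.

```python
import pandas as pd

main = pd.DataFrame({"dataA": [10, 20], "dataB": [20, 40]})
variables = pd.DataFrame({"varA": [2], "varB": [10]})

# Series / Series aligns on index labels: label 1 exists only on the left,
# so the second row has nothing to divide by and becomes NaN.
misaligned = main["dataA"] / variables["varA"]
print(misaligned)

# Pull the values out as scalars; scalars broadcast to every row.
var_a = variables["varA"].iloc[0]
var_b = variables["varB"].iloc[0]
result = main / var_a * var_b
print(result)
```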

How For loop result concat by column in python

I have a dataframe that I split into 3 segments. I loop over the segments and calculate features for each one. I want to join the results of the 3 iterations into one row and rename the columns by adding 1, 2, 3 at the end of each feature column name.
splitt=np.array_split(df,3)
for x in splitt:
    x1=np.mean(x['A'])
    x1=pd.DataFrame([x1],columns=['mean'])
    x1['std']=np.std(x['B'])
Desired result:
   mean_A_1  std_B_1  mean_A_2  std_B_2  mean_A_3  std_B_3
You can use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[0,1,5,5,8,9,6],'B':[5,8,0,9,5,6,3]})
splitt=np.array_split(df,3)
# one dict with the numbered feature names of all segments
L = {k: v for i, x in enumerate(splitt, 1)
     for k, v in {f'mean_A_{i}': x['A'].mean(), f'std_B_{i}': np.std(x['B'])}.items()}
df1 = pd.DataFrame([L])
print (df1)
   mean_A_1   std_B_1  mean_A_2  std_B_2  mean_A_3  std_B_3
0       2.0  3.299832       6.5      2.0       7.5      1.5
Another solution with loop:
L = []
splitt=np.array_split(df,3)
for i, x in enumerate(splitt, 1):
    d = {f'mean_A_{i}': x['A'].mean(),
         f'std_B_{i}': np.std(x['B'])}
    L.append(pd.Series(d))
df1 = pd.concat(L).to_frame().T
print (df1)
   mean_A_1   std_B_1  mean_A_2  std_B_2  mean_A_3  std_B_3
0       2.0  3.299832       6.5      2.0       7.5      1.5
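A variant of the loop that might be a little more robust is to split the row positions rather than the dataframe itself and then select each segment with iloc, so the result does not depend on how NumPy handles a DataFrame argument. This is only a sketch on the same toy data; chunks, row and features are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 5, 5, 8, 9, 6], "B": [5, 8, 0, 9, 5, 6, 3]})

# Split the positions 0..len(df)-1 into 3 nearly equal chunks (sizes 3, 2, 2 here).
chunks = np.array_split(np.arange(len(df)), 3)

row = {}
for i, positions in enumerate(chunks, start=1):
    part = df.iloc[positions]
    row[f"mean_A_{i}"] = part["A"].mean()
    # ddof=0 gives the population standard deviation, matching np.std above.
    row[f"std_B_{i}"] = part["B"].std(ddof=0)

features = pd.DataFrame([row])
print(features)
```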

Reading a text file with multiple headers in Spark

I have a text file with multiple header lines, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?
STN--- WBAN YEARMODA TEMP
010010 99999 20060101 33.5 23
010010 99999 20060102 35.3 23
010010 99999 20060103 34.4 24
STN--- WBAN YEARMODA TEMP
010010 99999 20060120 35.2 22
010010 99999 20060121 32.2 21
010010 99999 20060122 33.0 22
1. Read the text file as a normal text file into an RDD.
2. Split each line on the separator used in the text file (let's assume it's a space).
3. Take the first line as the header, then remove every line that is equal to the header, which also drops the repeated headers.
4. Convert the RDD to a dataframe using .toDF(col_names).
Like this:
rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split(" ")) # step 1 & 2
headers = rdd.first() # Step 3
rdd2 = rdd.filter(lambda x: x != headers)
df = rdd2.toDF(headers) # Step 4
You can try this out. I have tried it in the console.
val x = sc.textFile("hdfs path of text file")
val header = x.first()
var y = x.filter(x=>(!x.contains("STN"))) // this removes every header line
var df = y.toDF(header)
Hope this works for you.
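One detail neither answer spells out is that the data rows have five fields (the temperature is followed by the number of recordings) while the header has four names. The per-line logic can be sketched in plain Python, outside Spark, to make that explicit. The assumptions here are that every header line starts with "STN", that fields are separated by runs of spaces, and that the extra field gets the made-up column name TEMP_COUNT.

```python
# Assumption: header lines start with "STN"; TEMP_COUNT is a hypothetical column name.
HEADER_PREFIX = "STN"
COLUMNS = ["STN", "WBAN", "YEARMODA", "TEMP", "TEMP_COUNT"]


def is_header(line):
    """True for the repeated header lines."""
    return line.lstrip().startswith(HEADER_PREFIX)


def parse_row(line):
    """Split one data line on whitespace and convert the numeric fields."""
    stn, wban, yearmoda, temp, count = line.split()
    return (stn, wban, yearmoda, float(temp), int(count))


sample = """STN--- WBAN YEARMODA TEMP
010010 99999 20060101 33.5 23
010010 99999 20060102 35.3 23
010010 99999 20060103 34.4 24
STN--- WBAN YEARMODA TEMP
010010 99999 20060120 35.2 22
010010 99999 20060121 32.2 21
010010 99999 20060122 33.0 22
"""

rows = [
    parse_row(line)
    for line in sample.splitlines()
    if line.strip() and not is_header(line)
]
for row in rows:
    print(dict(zip(COLUMNS, row)))
```

In PySpark, the same two functions could presumably be passed to rdd.filter and rdd.map on the RDD from sc.textFile before calling toDF(COLUMNS), but that part is not tested here.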

Resources