Convert a huge number of lists to a pandas DataFrame

User-defined function: my_fun(x) returns a list
XYZ = file with LOTS of lines
pandas_frame = pd.DataFrame()  # created empty DataFrame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, as in days. How do I speed it up?

I think you need to apply the function to each row with a list comprehension, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
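For context, a minimal self-contained sketch of the pattern, with a toy my_fun (a stand-in for the real user-defined function) that splits a line into numeric fields:

import pandas as pd

def my_fun(line):
    # toy stand-in for the real user-defined function
    return [float(x) for x in line.split(',')]

XYZ = ['1,2,3', '4,5,6', '7,8,9']  # stands in for the file's lines

L = [my_fun(line) for line in XYZ]  # plain Python list of row lists
df = pd.DataFrame(L)                # one constructor call at the end
print(df)

The reason this is so much faster: DataFrame.append copies all accumulated rows on every iteration, so the loop does quadratic work, while building a plain list and constructing the frame once is linear.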

Related

Subtracting values from groups in pandas

I have the following DataFrame:
df = pd.DataFrame()
df['I'] = [-1.922410e-11, -6.415227e-12, 1.347632e-11, 1.728460e-11,3.787953e-11]
df['V'] = [0,0,0,1,1]
off = df.groupby('V')['I'].mean()
I need to subtract the off values from the respective df['I'] values. In code, I want something like this:
for i in df['V'].unique():
    df['I'][df['V']==i] -= off.loc[i]
I want to know if there is another way of doing this without using loops.
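One loop-free approach is to broadcast each group's mean back onto its rows with transform (a sketch built on the df defined above):

# subtract each V-group's mean of I from the rows of that group
df['I'] -= df.groupby('V')['I'].transform('mean')

transform('mean') returns a Series aligned with df's index, so the subtraction is a single vectorised operation with no explicit loop.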

Convert files to DataFrame and then applying a function for multiple DataFrames

I have three files, well1.las, well2.las, and well3.las (a .las file is similar to a .txt), and I want to transform them into separate DataFrames (8x4), as I will apply some functions to them later.
Inside each file there is a string "~A" that I want to exclude, keeping just the log names ('DEPTH', 'GR', 'NPHI', 'RHOB') as column names.
I already managed to parse one las file into a DataFrame, but couldn't do this for all of them. How can I do that?
I think it would be best to put the DataFrames in a list or dict, as I will need to do some calculations with their values.
Each las file looks like this:
~A DEPTH GR NPHI RHOB
2869.6250 143.5306 0.1205 2.4523
2869.7500 143.9227 0.1221 2.4497
2869.8750 144.5697 0.1180 2.4564
2870.0000 145.3994 0.1128 2.4650
2870.1250 146.3611 0.1378 2.4239
2870.2500 147.3796 0.1535 2.3981
2870.3750 148.4387 0.1288 2.4387
2870.5000 149.5223 0.1195 2.4539
import os
import pandas as pd

path = r"C:\Users\laguiar\Desktop\LASfiles"
dl = os.listdir(path)
for filename in dl:
    if filename.endswith('.las'):
        with open(os.path.join(path, filename)) as f:
            for l in f:
                if l.startswith('~A'):
                    logs = l.split()[1:]
                    break
            # the handle is now positioned just past the ~A header line
            data = pd.read_csv(f, names=logs, sep='~A', engine='python')
            df = data['DEPTH'].str.split(expand=True)
            df = df.astype(float)
            df.columns = logs
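Building on that code, one way to handle all the files is to collect the frames in a dict keyed by file name. This sketch reads the numeric rows with whitespace as the separator, which avoids the sep='~A' and str.split round-trip:

import os
import pandas as pd

path = r"C:\Users\laguiar\Desktop\LASfiles"
dfs = {}
for filename in os.listdir(path):
    if filename.endswith('.las'):
        with open(os.path.join(path, filename)) as f:
            # consume lines up to and including the ~A header
            for l in f:
                if l.startswith('~A'):
                    logs = l.split()[1:]
                    break
            # the remaining lines are purely numeric rows
            dfs[filename] = pd.read_csv(f, names=logs, sep=r'\s+')

# e.g. dfs['well1.las'] is the 8x4 frame for the first well,
# and the dict can be iterated over for the later calculations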

Make operations on several pandas data frames in for loop and return one concatenated data frame

I have many similar data frames which have to be modified and then concatenated into one data frame. Is there a way to do everything in a for loop, instead of importing and operating on one data frame at a time?
This is what I had in mind:
c = '/disc/data/'
files = [c+'frames_A1.csv', c+'frames_A2.csv', c+'frames_A3.csv',
         c+'frames_B1.csv', c+'frames_B2.csv', c+'frames_B3.csv',
         c+'frames_A1_2.csv', c+'frames_A2_2.csv', c+'frames_A3_2.csv',
         c+'frames_B1_2.csv', c+'frames_B2_2.csv', c+'frames_B3_2.csv',
         c+'frames_B_96.csv', c+'frames_C_96.csv', c+'frames_D_96.csv',
         c+'frames_E_96.csv', c+'frames_F_96.csv', c+'frames_G_96.csv']
data_tot = []
for i in files:
    df = pd.read_csv(i, sep=';', encoding='unicode_escape')
    df1 = df[['a','b','c','d']]
    df2 = df1[df1['a'].str.contains(r'\btake\b')]
    data_tot.append(df2)
I believe I should not append to a list, but I cannot figure out how else to do it.
You could then do:
total_df = pd.concat(data_tot, ignore_index=True)
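Appending frames to a list and concatenating once at the end is in fact the idiomatic pattern. If you prefer, the loop body can be factored into a helper so the whole job becomes a single concat call (a sketch reusing the files list and the same filter as above):

import pandas as pd

def load_frame(path):
    # read one CSV, keep four columns, keep rows whose 'a' contains the word 'take'
    df = pd.read_csv(path, sep=';', encoding='unicode_escape')
    df = df[['a', 'b', 'c', 'd']]
    return df[df['a'].str.contains(r'\btake\b')]

# 'files' is the list of paths built above
total_df = pd.concat((load_frame(f) for f in files), ignore_index=True)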

Transfer cell values from different columns and sheets from multiple excel files with same structure into a single dataframe

I have a reporting sheet in Excel that contains a set of datapoints that I want to compile from multiple files with the same format into a master dataset.
The initial step I undertook was to extract the data points I need from multiple sheets into one pandas dataframe. See the steps below.
I initially imported the excel file and parsed it:
import pandas as pd
xl = pd.ExcelFile(r"C:\Users\Nicola\Desktop\ISP 2016-20 Ops-Technical Form.xlsm")
df = xl.parse("FSL, WASH, DRM") #name of sheet #1
Then I located the data points needed for synthesis
a=df.iloc[5:20,3:5]
a1=df.iloc[6:9,10:12]
b=df.iloc[31:35,3:5]
b1=df.iloc[31:35,10:12]
Then I concatenated and aligned the column positions to keep the whole list of values within the same columns:
dfcon = pd.concat([a, b])
dfcon2 = pd.concat([a1, b1])
new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])  # DataFrame.append was removed in pandas 2.0
And lastly I created a dataframe with the string of values I need:
master = pd.DataFrame(dfcont2)
finalmaster = master.transpose()
The next two steps I wish to pursue are:
1) Replicate the same code for 50 excel files
2) Compile all the strings of values from this set of excel files into one single pandas dataframe, without running this code over and over and compiling manually by exporting to Excel.
Any support would be greatly appreciated. Thanks
I believe you need to loop over the file names created by glob and concat at the end (all files have the same structure):
import glob
import pandas as pd

dfs = []
for f in glob.glob('*.xlsm'):
    df = pd.read_excel(io=f, sheet_name=1)
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[6:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    dfcon = pd.concat([a, b])
    dfcon2 = pd.concat([a1, b1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
out = pd.concat(dfs, ignore_index=True)
Found the solution that works for me, thank you for the input, jezrael.
To further explain:
1) Imported the files with the same structure from my Desktop directory, then parsed and selected the Excel sheet from which data can be extracted at different locations (iloc):
import glob
import pandas as pd

dfs = []
for f in glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm'):
    xl = pd.ExcelFile(f)  # ExcelFile takes no sheet_name; the sheet is chosen in parse()
    df = xl.parse("FSL, WASH, DRM")
    a = df.iloc[5:20, 3:5]
    a1 = df.iloc[7:9, 10:12]
    b = df.iloc[31:35, 3:5]
    b1 = df.iloc[31:35, 10:12]
    c = df.iloc[50:56, 3:5]
    c1 = df.iloc[38:39, 10:12]
    d = df.iloc[57:61, 3:5]
    e = df.iloc[63:71, 3:5]
2) Concatenated and repositioned the column order to compose the first version of the dataframe (still inside the loop):
    dfcon = pd.concat([a, b, c, d, e])
    dfcon2 = pd.concat([a1, b1, c1])
    new_cols = {x: y for x, y in zip(dfcon.columns, dfcon2.columns)}
    dfcont2 = pd.concat([dfcon2, dfcon.rename(columns=new_cols)])
    dfs.append(dfcont2.T)
3) The output presented each string of values twice [the same label and the form-specific entry], a result of the repeated data pull-outs tied to the iloc locations:
output = pd.concat(dfs, ignore_index=True)
4) This last snippet extracts the label only once and selects all the odd-numbered entries. With the last concatenation, I generated the dataframe I sought, ready to be processed analytically.
a = output[2:3]
b = output[1::2]
final = pd.concat([a, b], axis=0, ignore_index=True)
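For reuse across future batches of forms, the whole per-file extraction can be wrapped in a function, leaving the glob loop as a one-liner. A sketch using the same sheet name and iloc windows as the solution above:

import glob
import pandas as pd

def extract(path):
    # parse the reporting sheet of one workbook
    df = pd.ExcelFile(path).parse("FSL, WASH, DRM")
    left = pd.concat([df.iloc[5:20, 3:5], df.iloc[31:35, 3:5], df.iloc[50:56, 3:5],
                      df.iloc[57:61, 3:5], df.iloc[63:71, 3:5]])
    right = pd.concat([df.iloc[7:9, 10:12], df.iloc[31:35, 10:12], df.iloc[38:39, 10:12]])
    # align the left block's columns with the right block's before stacking
    new_cols = dict(zip(left.columns, right.columns))
    return pd.concat([right, left.rename(columns=new_cols)]).T

files = glob.glob('C:/Users/Nicola/Desktop/OPS Form/*.xlsm')
output = pd.concat([extract(f) for f in files], ignore_index=True)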

List iterations and regex: what is the better way to remove the text I don't need?

We handle data from volunteers; that data is entered into a form using ODK. When the data is downloaded, the header (column names) row contains a lot of 'stuff' we don't need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200), or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
Is the output reliable in terms of sort order? As of now, it appears that the results list is in the same order as the original list.
import re

y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
r = []
for x in y:
    m = re.findall(r'G\d+', x)
    r.append(m)
# the comprehension below flattens the result:
# findall returns a list per string (each with a single item here)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need, and replace them (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
                           'Group1/most common/G24',
                           'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract(r'(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit, selecting the last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
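Since the question also asks about making a copy rather than renaming in place, note that rename returns a new DataFrame and leaves the original untouched. A sketch (it assumes every column name contains a G-code, as in the data above):

import re

# build a copy whose column names are just the extracted G-codes
df2 = df.rename(columns=lambda c: re.search(r'G\d+', c).group(0))

All three approaches (str.extract, rsplit, and rename) map each existing name individually, so the original column order is preserved, which answers the sort-order question.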
