I have a .csv file that I'm reading from. I read only select columns from it and I need to further process this data before I save it into an excel sheet. The idea is to repeat this process for all the files in the folder and save the sheets with the same names as the original .csv.
As of now, I'm able to read the specific columns from the .csv and write the whole file to Excel. I have yet to figure out how to process these columns further before saving. The further processing involves:
Averaging rows 18000-20000 for each column separately.
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names.
My code is as follows. Need some help with this.
import pandas as pd
import os
from pathlib import Path
for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)  # splitting into file name and extension
    if file_ext == '.atf':
        # open the data file and get data only from specific columns.
        df = pd.read_csv(f, header = 9, index_col = 0, usecols = [0,55,59,63,67,71,75,79,83,87,91,95,99,103], encoding = "ISO-8859-1", sep = '\t', dtype = {'YFP.13':str,'YFP.14':str,'YFP.15':str,'YFP.16':str,'YFP.17':str,'YFP.18':str,'YFP.19':str,'YFP.20':str,'YFP.21':str,'YFP.22':str,'YFP.23':str,'YFP.24':str,'YFP.25':str,'Signals=':str})
        df.to_excel(file_name+'.xlsx',sheet_name=file_name, engine = 'xlsxwriter')  # writing into an excel file
Let's say your dataframe has a shape of 4 columns and 100 rows:
import numpy as np
data = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
Averaging rows 18000-20000 for each column separately.
To average over a subset of rows, build a boolean mask from inequalities on the index and call mean() on the selected rows. The result is a Series holding one mean per column; keep it in a variable because it is needed later. (In this 100-row example, the rows strictly between 60 and 80 stand in for rows 18000-20000.)
means = data[(60<data.index) & (80>data.index)].mean()
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names
For the last two steps, the code below speaks for itself:
cols = data.columns
for col in cols:
    data[col + "_calc"] = (data[col] - means[col]) / means[col]
Finally, you can export the dataframe "data" to Excel as you did earlier.
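Applied to the original question, the same idea could be wrapped in a small helper. This is only a sketch under some assumptions: the function name add_normalized_columns and the "_norm" suffix are made up for illustration, rows are selected by position with iloc (both ends inclusive), and the columns are converted to numbers first because the question reads them with dtype=str.

```python
import pandas as pd


def add_normalized_columns(df, first_row, last_row, suffix="_norm"):
    """Return a numeric copy of df with (value - baseline) / baseline columns appended.

    The baseline of each column is its mean over the positional rows
    first_row..last_row (inclusive). The suffix is a hypothetical naming choice.
    """
    # Convert string cells to numbers; cells that cannot be parsed become NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    # One mean per column, computed over the chosen block of rows.
    baseline = numeric.iloc[first_row:last_row + 1].mean()
    # DataFrame minus Series aligns on column names, so each column uses its own mean.
    normalized = (numeric - baseline) / baseline
    return numeric.join(normalized.add_suffix(suffix))


# Small stand-in for the real data; the question would pass 18000 and 20000.
demo = pd.DataFrame({"YFP.13": ["1", "2", "3", "4"], "YFP.14": ["10", "20", "30", "40"]})
result = add_normalized_columns(demo, first_row=1, last_row=2)
print(result)
```

Inside the file loop, the returned frame could then be written with to_excel in place of df.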
I've been following the answer to "How do I extract data from multiple text files to Excel" to merge multiple .txt files into one .xlsx spreadsheet, but that answer gives me the data in multiple worksheets with no distinct column names.
What I'm trying to do is extract the data of 3 .txt files (each file contains one column of data without a column name) into one spreadsheet, where each column has a distinct name ('col1', 'col2' and 'col3' respectively).
data1.txt:
ACC_NT/mc_pf_qer
ACC_NT/gsd
ACC_NT/hcv_efg_tdc
ACC_NT/ids_gc
ISQ_BX/oic_lkv
ISQ_BX/pfg_8c
data2.txt:
79.2%
53.9%
100.0%
50.0%
44.2%
0.0%
data3.txt:
ACC_NT/ACC_NT_mc_pf_qer.html
ACC_NT/ACC_NT_gsd.html
ACC_NT/ACC_NT_hcv_efg_tdc.html
ACC_NT/ACC_NT_ids_gc.html
ISQ_BX/ISQ_BX_oic_lkv.html
ISQ_BX/ISQ_BX_pfg_8c.html
Any help or guidance is appreciated, thanks!
IIUC:
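import pandas as pd

data = {}
for i, filename in enumerate(['data1.txt', 'data2.txt', 'data3.txt'], 1):
    # header=None: the files have no header row.
    # squeeze("columns") turns the one-column frame into a Series (the squeeze=True argument was removed in pandas 2.0).
    data[f"col{i}"] = pd.read_csv(filename, header=None).squeeze("columns")
pd.concat(data, axis=1).to_excel('output.xlsx')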
Update
I realized that the col2 data contains the percentage details and in the output.xlsx
data = {}
for i, filename in enumerate(['data1.txt', 'data2.txt', 'data3.txt'], 1):
    data[f"col{i}"] = pd.read_csv(filename, header=None).squeeze("columns")
df = pd.concat(data, axis=1)
# strip the trailing '%' and store the percentages as numbers
df['col2'] = df['col2'].str.strip('%').astype(float)
df.to_excel('output.xlsx', index=False)
I have about 50 Excel files with the '.xlsb' extension. I'd like to concatenate a specific worksheet from each of them into one pandas DataFrame (the worksheet name is the same in every file). The problem is that the column names are not exactly the same in each worksheet. I wrote some code using pandas, but it stacks values based on the column name. For example, sometimes I have a column called FgsNr and sometimes FgNr. The datatype and the meaning of both columns are exactly the same, and I would like them to end up in the same column of the DataFrame, but pandas creates two separate columns and stacks together only the values listed under the same name.
files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it copies and concatenates all values based on their column position in Excel, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
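from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
# ignore_index=True renumbers the rows; columns are still matched by name
large_df = pd.concat(dfs, ignore_index=True)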
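Note that pd.concat still aligns columns by name, so FgsNr and FgNr would remain separate columns. One possible way around this, assuming every worksheet has the same columns in the same order, is to overwrite the column labels with one common list before concatenating. The sketch below uses small in-memory frames in place of the Excel files; the helper name concat_by_position and all column names except FgsNr and FgNr are hypothetical.

```python
import pandas as pd

# Hypothetical target names; only 'FgsNr' comes from the question.
COLUMN_NAMES = ["FgsNr", "col_b", "col_c"]


def concat_by_position(frames, column_names):
    """Stack frames row-wise, matching columns by position rather than by name."""
    aligned = []
    for frame in frames:
        if frame.shape[1] != len(column_names):
            raise ValueError("every frame must have the same number of columns")
        renamed = frame.copy()
        renamed.columns = column_names  # overwrite whatever labels the sheet had
        aligned.append(renamed)
    return pd.concat(aligned, ignore_index=True)


# Stand-ins for two worksheets whose headers differ slightly.
first = pd.DataFrame({"FgsNr": [1, 2], "Name": ["a", "b"], "Qty": [5, 6]})
second = pd.DataFrame({"FgNr": [3], "Name": ["c"], "Quantity": [7]})
combined = concat_by_position([first, second], COLUMN_NAMES)
print(combined)
```

With the real files, each frame in the list would come from the pd.read_excel call shown above.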
Also see: "Creating an empty Pandas DataFrame, then filling it?"
I have 25 sheets in the excel file and I want the list of column names(top row/header) from each of the sheets.
Can you specify how you want the answers collected? Do you want all the column names from each sheet in the same list or dataframe?
Assuming you want the results in one DataFrame, where each row represents one sheet and each column represents one column name: the general idea is to call pd.read_excel() in a loop, specifying a different sheet each time.
import pandas as pd

n_sheets = 25
rows = []
for i in range(n_sheets):
    # header=None, nrows=1: read only the top row of sheet i as data
    sheet_i_col_names = pd.read_excel('file.xlsx', sheet_name=i, header=None, nrows=1)
    rows.append(sheet_i_col_names)
# DataFrame.append was removed in pandas 2.0, so collect the rows and concatenate once
df = pd.concat(rows, ignore_index=True)
The resulting DataFrame can be further manipulated based on your specific requirements.
Output from my example excel sheet, which only had 4 sheets
Alternatively, you can pass a list to the sheet_name argument. You then get back a dictionary of DataFrames keyed by sheet index, which I find less useful. Here int_sheet_names must be a list and not a numpy array.
n_sheets = 25
int_sheet_names = list(range(0, n_sheets))
# the keyword is header (not head); avoid naming the result dict, which shadows the builtin
sheets = pd.read_excel('file.xlsx', sheet_name=int_sheet_names, header=None, nrows=1)
Output as a dictionary when passing a list to sheet_name kwarg
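If the number of sheets is not known in advance, passing sheet_name=None to pd.read_excel returns every sheet in one dictionary keyed by sheet name. The sketch below shows one way to turn such a dictionary into a plain mapping of sheet name to header values. The helper name header_rows_by_sheet is made up, and the dictionary is built in memory here as a stand-in for the result of the read_excel call.

```python
import pandas as pd


def header_rows_by_sheet(sheets):
    """Map each sheet name to the list of values in its first row."""
    return {name: frame.iloc[0].tolist() for name, frame in sheets.items()}


# Stand-in for: pd.read_excel('file.xlsx', sheet_name=None, header=None, nrows=1)
sheets = {
    "Sheet1": pd.DataFrame([["id", "name"]]),
    "Sheet2": pd.DataFrame([["id", "price", "qty"]]),
}
headers = header_rows_by_sheet(sheets)
for name, columns in headers.items():
    print(name, columns)
```

A dictionary of lists copes with sheets that have different numbers of columns, which a single rectangular DataFrame would pad with NaN.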
I am new to Python and trying to automate some tasks. I have an Excel file with 8 sheets. Each sheet has some identifiers at the top, followed by tabular data with headers. The identifiers of interest and the tables are in the same location on every sheet.
What I want to do is to extract some data from the top of each sheet and insert them as columns, remove unwanted rows(after I have assigned some of them to columns) and columns and then merge into one CSV file as output.
The code I have written does the job. My code reads in each sheet, performs the operations on the sheet, then I start the same process for the next sheet (8 times) before using .concat to merge them.
import pandas as pd
import numpy as np
inputfile = "input.xlsx"
outputfile = "merged.csv"
##LN X: READ FIRST SHEET AND ASSIGN HEADER INFORMATION TO COLUMNS
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)
# Define cell locations of fields in the header area to be assigned to columns
# THESE CELL LOCATIONS ARE THE SAME ON ALL SHEETS
A = df1.iloc[3,4]
B = df1.iloc[2,9]
C = df1.iloc[3,9]
D = df1.iloc[5,9]
E = df1.iloc[4,9]
#Insert well header info as columns in data for worksheet1
df1.insert(0,"column_name", A)
df1.insert(1,"column_name", B)
df1.insert(4,"column_name", E)
# Rename the columns in worksheet1 DataFrame to reflect actual column headers
df1.rename(columns={'Unnamed: 0': 'Header1',
                    'Unnamed: 1': 'Header2'}, inplace=True)
df_merged = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8],
                      ignore_index=True, sort=False)
#LN Y: Remove non-numerical entries
df_merged = df_merged.replace(np.nan, 0)
##Write results to CSV file
df_merged.to_csv(outputfile, index=False)
Since this code will be used on other Excel files with varying numbers of sheets, I am looking for pointers on how to put the operations repeated for each sheet into a loop, basically repeating the steps between LN X and LN Y for each sheet (8 times!). I am struggling with how to write that loop. Thanks in advance for your assistance.
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index=0)
You should change the argument sheet_name to
sheet_name=None
Then df1 will be a dictionary of DataFrames keyed by sheet name. You can loop over its keys using
for df in df1:
    df1[df].insert(0,"column_name", A)
    ....
Perform your per-sheet operations inside the loop (reading values such as A from the current sheet), then concatenate the processed DataFrames into one final df.
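As a rough sketch of how that loop might look end to end, the per-sheet steps can go into one function that is applied to every value of the dictionary. The cell positions iloc[3, 4] and iloc[2, 9] are taken from the question; the function names and the column names field_a and field_b are placeholders, and small in-memory frames stand in for the sheets that pd.read_excel would return.

```python
import pandas as pd


def process_sheet(df):
    """Copy header-area cells into new columns of one sheet's DataFrame."""
    out = df.copy()
    # Cell positions from the question; they are the same on every sheet.
    a = df.iloc[3, 4]
    b = df.iloc[2, 9]
    # 'field_a' and 'field_b' are placeholder column names.
    out.insert(0, "field_a", a)
    out.insert(1, "field_b", b)
    return out


def merge_sheets(sheets):
    """Process every sheet in a {name: DataFrame} dict and stack the results."""
    processed = [process_sheet(df) for df in sheets.values()]
    return pd.concat(processed, ignore_index=True, sort=False).fillna(0)


# Stand-in for: pd.read_excel(inputfile, sheet_name=None, usecols="A:N")
sheets = {
    name: pd.DataFrame([[f"{name}-{r}-{c}" for c in range(10)] for r in range(5)])
    for name in ("s1", "s2")
}
merged = merge_sheets(sheets)
print(merged.head())
```

Because the function only sees one DataFrame at a time, the same code works for any number of sheets; dropping unwanted rows and renaming columns would go inside process_sheet as well.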
How do I split a dataframe (read from a csv) randomly in the ratio 4:1 and store the two parts in two different variables?
For example, if the dataframe has ten rows numbered 1 to 10, I want any 8 of them in variable 'a' and the remaining 2 rows in variable 'b'.
I've never done this on a random basis but the basic approach would be:
import pandas
read in your csv
drop empty/null columns(avoid issues with these)
create a new dataframe to put the split values into
assign names to your new columns
split values and combine the values (using apply/combine/lambda)
Code sample:
# importing pandas module
import pandas as pd
# read in csv file
data = pd.read_csv("https://mydata.csv")
# drop null values
data.dropna(inplace = True)
# create new data frame
new = data["ColumnName"].str.split(" ", n = 1, expand = True) #this 'split' code applies to splitting one column into two
# assign new name to first column
data["A"]= new[0] #8 concatenated values will go here
# assign new name to second column
data["B"]= new[1] #last two [combined] values go here
## other/different code required for concatenation of column values - look at this linked SO question##
# df display
data
Hope this helps
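For the random 4:1 row split the question asks about, one common approach is DataFrame.sample followed by drop. The sketch below is a minimal illustration that assumes the index labels are unique; the ten-row frame and the random_state value are only there to make the example self-contained and repeatable.

```python
import pandas as pd

data = pd.DataFrame({"value": range(1, 11)})  # ten rows, as in the question

# Draw 80% of the rows at random; random_state only makes the draw repeatable.
a = data.sample(frac=0.8, random_state=0)
# Every row that was not drawn into 'a' forms the remaining 20%.
b = data.drop(a.index)

print(len(a), len(b))
```

With a real file, data would come from pd.read_csv, and random_state can be omitted to get a different split on each run.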