merging multiple .txt files into one .xlsx having different column names - python-3.x

I've been following this answer: How do I extract data from multiple text files to Excel for merging multiple .txt files into one .xlsx spreadsheet, but that answer gives me the data in multiple worksheets with no distinct column names.
What I'm trying to do is extract the data from 3 .txt files (each file contains data in one column, without a column name) into one spreadsheet, where each column gets a distinct column name ('col1', 'col2' and 'col3') respectively.
data1.txt:
ACC_NT/mc_pf_qer
ACC_NT/gsd
ACC_NT/hcv_efg_tdc
ACC_NT/ids_gc
ISQ_BX/oic_lkv
ISQ_BX/pfg_8c
data2.txt:
79.2%
53.9%
100.0%
50.0%
44.2%
0.0%
data3.txt:
ACC_NT/ACC_NT_mc_pf_qer.html
ACC_NT/ACC_NT_gsd.html
ACC_NT/ACC_NT_hcv_efg_tdc.html
ACC_NT/ACC_NT_ids_gc.html
ISQ_BX/ISQ_BX_oic_lkv.html
ISQ_BX/ISQ_BX_pfg_8c.html
Any help or guidance is appreciated, thanks!

IIUC:
import pandas as pd

data = {}
for i, filename in enumerate(['data1.txt', 'data2.txt', 'data3.txt'], 1):
    # squeeze=True was removed in pandas 2.0; .squeeze("columns") is the replacement
    data[f"col{i}"] = pd.read_csv(filename, header=None).squeeze("columns")
pd.concat(data, axis=1).to_excel('output.xlsx', index=False)
Update
I realized that the col2 data contains percentage values; to store them as numbers in output.xlsx, strip the % sign and convert the column to float:
data = {}
for i, filename in enumerate(['data1.txt', 'data2.txt', 'data3.txt'], 1):
    data[f"col{i}"] = pd.read_csv(filename, header=None).squeeze("columns")
df = pd.concat(data, axis=1)
df['col2'] = df['col2'].str.strip('%').astype(float)
df.to_excel('output.xlsx', index=False)

Related

How to copy data from one excel file to second excel file using conditions?

I have two Excel files, one.xlsx and two.xlsx. The column names (id, mail, name, gender, age) are the same in both files, but the columns are in a different order in two.xlsx. Two columns (id and mail) have data in both files. I want to copy the data from one.xlsx into two.xlsx without disturbing the column order of two.xlsx: where id and mail match in both files, the remaining cell values should be copied into their respective columns in two.xlsx. Please see the reference pictures one.xlsx, two.xlsx and result_two.xlsx (the required result). I have searched the internet but did not find an approach.
I am able to copy data from one.xlsx to two.xlsx using the code below, but it disturbs the column positions of two.xlsx. How can I copy the data as shown in result_two.xlsx, placing the matched cell values into the matching columns of two.xlsx?
import pandas as pd
df1 = pd.read_excel('one.xlsx')
df2 = pd.read_excel('two.xlsx')
df = pd.concat([df1])
df.to_excel('two.xlsx', index=False)
You can merge them then rearrange the columns:
cols = df2.columns
result = df2[['id', 'mail']].merge(df1, how='left', on=['id', 'mail']).reindex(columns=cols)
result.to_excel('result.xlsx', index=False)
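A runnable sketch of that approach on toy frames (the column names come from the question; the values and the exact jumbled order are invented for illustration) shows that reindex restores two.xlsx's original column order after the merge:

```python
import pandas as pd

# Toy stand-ins for one.xlsx and two.xlsx; values are invented.
df1 = pd.DataFrame({"id": [1, 2], "mail": ["a@x", "b@x"],
                    "name": ["Ann", "Bob"], "age": [30, 40]})
# two.xlsx: same columns in a different (jumbled) order, data only in id/mail.
df2 = pd.DataFrame({"mail": ["a@x", "b@x"], "name": [None, None],
                    "age": [None, None], "id": [1, 2]})

cols = df2.columns  # remember two.xlsx's column order
result = (df2[["id", "mail"]]
          .merge(df1, how="left", on=["id", "mail"])
          .reindex(columns=cols))
```

Rows of one.xlsx whose id/mail pair has no match in two.xlsx are simply dropped by the left merge, and unmatched two.xlsx rows keep NaN in the copied columns.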

Pandas - concatenation of multiple excel files with different column names but the same data type

I have about 50 excel files with the .xlsb extension. I'd like to concatenate a specific worksheet of each into a pandas DataFrame (all worksheet names are the same). The problem is that the column names are not exactly the same in each worksheet. I wrote code using pandas, but it concatenates values into the same DataFrame column based on the column name. So, for example, sometimes I have a column called FgsNr and sometimes FgNr; the datatype and the meaning of both columns are exactly the same and I would like them in the same DataFrame column, but pandas creates two separate columns and stacks together only those values listed under the same name.
files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it copies and concatenates all values based on column position, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
from glob import glob

import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
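To also ignore the differing header spellings (FgsNr vs FgNr), one option is to overwrite each frame's column labels with positions before concatenating. The sketch below uses two in-memory frames as stand-ins for the sheets you would get from pd.read_excel; the values are invented:

```python
import pandas as pd

# Hypothetical frames standing in for two .xlsb sheets whose headers
# differ only in spelling (FgsNr vs FgNr).
df_a = pd.DataFrame({"FgsNr": [1, 2], "Val": [10, 20]})
df_b = pd.DataFrame({"FgNr": [3, 4], "Val": [30, 40]})

dfs = []
for df in (df_a, df_b):
    # Discard the original headers and align purely by column position.
    df = df.set_axis(range(df.shape[1]), axis=1)
    dfs.append(df)

large_df = pd.concat(dfs, ignore_index=True)
# Restore one canonical set of names afterwards.
large_df.columns = ["FgsNr", "Val"]
```

This only works because usecols='A:F' guarantees every sheet contributes the same columns in the same order.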
Also refer to the below :
Creating an empty Pandas DataFrame, then filling it?

Python: Loop through Excel sheets, assign header info to columns on each sheet, then merge to one file

I am new to Python and trying to automate some tasks. I have an Excel file with 8 sheets where each sheet has some identifier on top followed below that are tabular data with headers. Each sheet has the identifiers of interest and the tables in the same location.
What I want to do is extract some data from the top of each sheet and insert it as columns, remove unwanted rows (after I have assigned some of them to columns) and columns, and then merge everything into one CSV file as output.
The code I have written does the job. My code reads in each sheet and performs the operations on it, then I start the same process for the next sheet (8 times) before using .concat to merge them.
import pandas as pd
import numpy as np

inputfile = "input.xlsx"
outputfile = "merged.csv"

##LN X: READ FIRST SHEET AND ASSIGN HEADER INFORMATION TO COLUMNS
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index_col=0)

# Define cell locations of fields in the header area to be assigned to columns
# (these cell locations are the same on all sheets)
A = df1.iloc[3, 4]
B = df1.iloc[2, 9]
C = df1.iloc[3, 9]
D = df1.iloc[5, 9]
E = df1.iloc[4, 9]

# Insert well header info as columns in data for worksheet 1
df1.insert(0, "column_name", A)
df1.insert(1, "column_name", B)
df1.insert(4, "column_name", E)

# Rename the columns in the worksheet 1 DataFrame to reflect actual column headers
df1.rename(columns={'Unnamed: 0': 'Header1',
                    'Unnamed: 1': 'Header2'}, inplace=True)

df_merged = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8],
                      ignore_index=True, sort=False)

#LN Y: Remove non-numerical entries
df_merged = df_merged.replace(np.nan, 0)

##Write results to CSV file
df_merged.to_csv(outputfile, index=False)
Since this code will be used on other Excel files with varying numbers of sheets, I am looking for any pointers on how to include the repeating operations in each sheet in a loop. Basically repeating the steps between LN X to LN Y for each sheet (8 times!!). I am struggling with how to use a loop function Thanks in advance for your assistance.
df1 = pd.read_excel(inputfile, sheet_name=0, usecols="A:N", index_col=0)
You should change the argument sheet_name to
sheet_name=None
Then df1 will be a dictionary of DataFrames. Then you can loop over df1 using
for df in df1:
    df1[df].insert(0, "column_name", A)
    ....
Now perform your operations and merge the dfs. You can loop over them again and concatenate them to one final df.
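Putting that together, the loop could look like the sketch below. Since pd.read_excel(inputfile, sheet_name=None) returns a dict of DataFrames keyed by sheet name, the dict here stands in for it so the loop body is runnable without the workbook; the header-cell coordinates would come from the question's code:

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel("input.xlsx", sheet_name=None, usecols="A:N")
sheets = {
    "Sheet1": pd.DataFrame({"X": [1, 2], "Y": [3, 4]}),
    "Sheet2": pd.DataFrame({"X": [5, 6], "Y": [7, 8]}),
}

processed = []
for name, df in sheets.items():
    # In the real file you would read the fixed header cells here,
    # e.g. A = df.iloc[3, 4], and insert them as columns; for this
    # runnable sketch we tag each sheet by its name instead.
    df.insert(0, "Sheet", name)
    processed.append(df)

df_merged = pd.concat(processed, ignore_index=True, sort=False)
df_merged.to_csv("merged.csv", index=False)
```

Because the loop runs over whatever sheets the dict contains, the same code works unchanged for workbooks with a different number of sheets.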

Creating csv file from pandas dataframe as per the sheet names in file

I have an excel file which has multiple sheets (6) in it. I am writing a python script to convert each individual sheet into a csv file.
My input file looks like this, where for example the sheet name is class5:
Name ID
Mark 11
Tom 22
Jane 33
like this I have multiple sheets in the excel
I need to convert them in csv file having just 'Name' and class like this:
Mark,class5
Tom,class5
Jane,class5
This is one sheet; I have multiple sheets like this, so I am converting every sheet into a DataFrame like this:
xlsx = pd.ExcelFile('path_of_file.xlsx')
df1 = pd.read_excel(xlsx, 'Sheet1')
df2 = pd.read_excel(xlsx, 'Sheet2')
df3 = pd.read_excel(xlsx, 'Sheet3')
How can I make a csv file called 'class5.csv' with the output as above, and the same for every sheet, such as class6, 7 and 8?
So, assuming from your question, what you want is the contents of each sheet saved to a different csv, where each csv has the Name column and another column containing the name of the sheet it came from, without a header.
If that's what you're after, you could do:
xls = pd.read_excel('path_of_file', sheet_name=None)
for sheet_name, df in xls.items():
    df['sheet'] = sheet_name
    df[['Name', 'sheet']].to_csv(f'{sheet_name}.csv', header=False, index=False)
The key point is the sheet_name argument of read_excel. As the commenter on your question states, leave this as None and you will get a dictionary you can iterate through.

To average a subset of one column using pandas

I have a .csv file that I'm reading from. I read only select columns from it and I need to further process this data before I save it into an excel sheet. The idea is to repeat this process for all the files in the folder and save the sheets with the same names as the original .csv.
As of now, I'm able to read the specific columns from .csv and write the whole file into excel. I am yet to figure out how to further process these columns before I save to excel. Further processing involves
Averaging rows 18000-20000 for each column separately.
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names.
My code is as follows. Need some help with this.
import os

import pandas as pd

for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)  # splitting into file name and extension
    if file_ext == '.atf':
        # open the data file and get data only from specific columns
        df = pd.read_csv(f, header=9, index_col=0, usecols=[0,55,59,63,67,71,75,79,83,87,91,95,99,103], encoding="ISO-8859-1", sep='\t', dtype={'YFP.13': str, 'YFP.14': str, 'YFP.15': str, 'YFP.16': str, 'YFP.17': str, 'YFP.18': str, 'YFP.19': str, 'YFP.20': str, 'YFP.21': str, 'YFP.22': str, 'YFP.23': str, 'YFP.24': str, 'YFP.25': str, 'Signals=': str})
        df.to_excel(file_name + '.xlsx', sheet_name=file_name, engine='xlsxwriter')  # writing into an excel file
Let's say your dataframe has a shape of 100 rows and 4 columns:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
Averaging rows 18000-20000 for each column separately.
To perform the averaging on a subset, you define a logical mask based on inequalities over the index and apply the averaging function on the selected dataframe. The results are saved in a new dataframe as you want to use them later:
means = data[(60<data.index) & (80>data.index)].mean()
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names
For the last two steps, the code below speaks for itself:
cols = data.columns
for col in cols:
    data[col + "_calc"] = (data[col] - means[col]) / means[col]
In the end, you may export the dataframe "data" to an excel format as you did earlier.
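Scaled to the question's real range (rows 18000-20000), the same recipe looks like the sketch below, on synthetic data of a plausible size. Using .iloc slices by position, so it does not depend on the index labels; a nice sanity check is that each *_calc column averages to zero over the slice its mean was taken from:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the columns read from the .atf file.
data = pd.DataFrame(np.random.rand(25000, 4), columns=list("ABCD"))

# Average rows 18000-20000 of each column (positional slice).
means = data.iloc[18000:20000].mean()

# (Column value - Average) / Average, stored under new column names.
cols = data.columns
for col in cols:
    data[col + "_calc"] = (data[col] - means[col]) / means[col]
```

As in the answer above, data can then be written out with to_excel.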
