Python merging excel files in directory - python-3.x

I have thousands of files inside a directory, named with the pattern YYYYMMDDHHMM:
201901010000.xlsx
201901010001.xlsx,
201901010002.xlsx,
201801010000.xlsx,
201801010001.xlsx,
201801010002.xlsx,
I want to merge the files that begin with the same YYYY into one Excel file per year (2018 and 2019 as separate files), like below.
this is first file
201901010000.xlsx,
201901010001.xlsx,
201901010002.xlsx,
this is second file
201801010000.xlsx,
201801010001.xlsx,
201801010002.xlsx,

You will need to read each file and concatenate them with pandas:
import pandas as pd
import glob

my_path = "c:\\temp\\"
for year in ['2018', '2019']:
    buf = []
    year_files = glob.glob(my_path + year + "*.xlsx")
    for file in year_files:
        df = pd.read_excel(file)
        buf.append(df)
    year_df = pd.concat(buf)
    year_df.to_excel(year + ".xlsx", index=False)
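If the set of years isn't known in advance, the same approach can first group the filenames on their leading four characters; a minimal sketch of just the grouping step (the read/concat part stays as in the loop above, and the list here stands in for a real glob over the folder):

```python
# Stand-in for glob.glob(my_path + "*.xlsx") on a real folder
filenames = [
    "201901010000.xlsx", "201901010001.xlsx", "201901010002.xlsx",
    "201801010000.xlsx", "201801010001.xlsx", "201801010002.xlsx",
]

by_year = {}
for name in filenames:
    by_year.setdefault(name[:4], []).append(name)  # first four chars = YYYY
```

Each list in `by_year` can then be read and concatenated exactly as above, without hardcoding the years.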

Related

How to change data type of a same column in 7000 csv file and replace the files with the updated ones

I have 7000 csv files with 30000 records in each file; all the files have the same column headers and the same column count, i.e. 14.
I want to change the data type of columns 1, 3 and 4, which I can do with pandas for a single file, but my question is: how can I do it for all the files without loading them one by one? In other words, how can I do this in a loop so that each file is replaced with its updated version?
I tried the code below, but honestly I copied it from somewhere else, so I don't know where to give the path of my csv files folder or how it will replace the files.
import pandas as pd
import os
import glob

def main():
    path = os.getcwd()  # replace os.getcwd() with the path to your csv folder
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for f in csv_files:
        df = pd.read_csv(f)
        date_cols = ['LOAD DATE', 'DATE OF ISSUE', 'DATE OF DEPARTURE']
        df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce')
        df.to_csv(f, index=False)  # overwrites the original file

main()
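To preview what the conversion does before running it over all 7000 files, here is a toy frame (column name taken from the snippet above) run through pd.to_datetime with errors='coerce'; unparseable values become NaT rather than raising:

```python
import pandas as pd

# one valid date and one garbage value
df = pd.DataFrame({"LOAD DATE": ["2021-01-05", "not a date"]})
df["LOAD DATE"] = pd.to_datetime(df["LOAD DATE"], errors="coerce")
# the column is now datetime64; the bad row is NaT
```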

saving text files to .npy file

I have many text files in a directory with numerical extensions (for example: signal_data1.9995100000000001, signal_data1.99961, etc.).
The content of the files are as given below
signal_data1.9995100000000001
-1.710951390504200198e+00
5.720409824754981720e-01
2.730176313110273423e+00
signal_data1.99961
-6.710951390504200198e+01
2.720409824754981720e-01
6.730176313110273423e+05
I just want to arrange the above files into a single .npy file as
-1.710951390504200198e+00,5.720409824754981720e-01, 2.730176313110273423e+00
-6.710951390504200198e+01,2.720409824754981720e-01, 6.730176313110273423e+05
So, I want to implement the same procedure for many files of a directory.
I tried a loop as follows:
import numpy as np
import glob
for file in glob.glob('./signal_*'):
    np.savez('data', file)  # note: this saves the filename string, not the file's contents
However, it does not give what I want as depicted above. So here I need help. Thanks in advance.
Here is another way of achieving it (note that this writes plain text, so the result is not a real NumPy .npy file despite its extension):
import os

dirPath = './data/'  # folder where you store your data
with os.scandir(dirPath) as entries:
    output = ""
    for entry in entries:  # read each file in your folder
        dataFile = open(dirPath + entry.name, "r")
        dataLines = dataFile.readlines()
        dataFile.close()
        for line in dataLines:
            output += line.strip() + " "  # strip unnecessary characters & append
        output += '\n'  # break the line after each file

writeFile = open("a.npy", "w")  # save it
writeFile.write(output)
writeFile.close()
You can use np.loadtxt() and np.save():
import numpy as np
import glob

a = np.array([np.loadtxt(f) for f in sorted(glob.glob('./signal_*'))])
np.save('data.npy', a)
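A quick round trip in a temporary folder (with made-up three-value signal files; this assumes every file holds the same number of values, otherwise np.array cannot stack them into a 2-D array) confirms the row-per-file layout:

```python
import glob
import os
import tempfile
import numpy as np

folder = tempfile.mkdtemp()
# write two hypothetical signal files, one value per line
for i, values in enumerate([[-1.71, 0.572, 2.73], [-67.1, 0.272, 673017.6]]):
    with open(os.path.join(folder, "signal_data%d.txt" % i), "w") as fh:
        fh.write("\n".join(str(v) for v in values))

# one row per file, one column per value
a = np.array([np.loadtxt(f) for f in sorted(glob.glob(os.path.join(folder, "signal_*")))])
np.save(os.path.join(folder, "data.npy"), a)
b = np.load(os.path.join(folder, "data.npy"))
```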

Using filenames to create variable - PySpark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0))\
    .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
    .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
    .drop("year_week")
You can try the below:
import glob
from pyspark.sql.functions import lit

listfiles = glob.glob('my_folder/sales_report_*.csv')
for file in listfiles:
    weekyear = file.rsplit('.', 1)[0].split('_')[-2:]  # drop .csv, keep the last two parts
    year = weekyear[0]
    week = weekyear[1]
    df = spark.read.load(file, format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))
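The filename parsing itself is ordinary string/regex work and can be sanity-checked in plain Python before wiring it into Spark; a small sketch using the example filename from the question:

```python
import re

def year_week(path):
    """Extract (year, week) from a filename ending in YYYY_WW.csv."""
    m = re.search(r"(\d{4})_(\d{1,2})\.csv$", path)
    return m.group(1), m.group(2)

year, week = year_week("my_folder/sales_report_2019_12.csv")
```

Because only the `YYYY_WW.csv` suffix is matched, the prefix (sales_report, cash_flow, etc.) can vary freely.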

How to combine multiple csv files based on file name

I have more than 1000 csv files, and I want to combine the files whose first five filename digits are the same into one csv file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all the files into one csv file; I don't know how to make the selection based on the name.
So, from the above input:
output 1: combine the first three csv files into one csv file, because the first five digits of their filenames are the same.
output 2: combine the next four files into one csv file, because the first five digits of their filenames are the same.
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns a list of filenames in the current folder
files_of_interest = {}  # a dictionary that we will be using below
for filename in files:  # iterate over files in the folder
    if filename[-4:] == '.csv':  # check whether the file is of .csv format
        key = filename[:5]  # as you've mentioned in your question - the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if we don't have such a key, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list
for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))  # iterate over every filename for a specific key and append it to buff_df
    files_of_interest[key] = buff_df  # replace the list of files with a data frame
This code builds a dictionary of dataframes, where the keys are the unique first five characters of the .csv filenames.
Then you can iterate over keys of the dictionary to save every according dataframe as a .csv file.
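That saving step can look like the sketch below (writing into a temporary folder here; in practice you would write next to your source files). The small dictionary stands in for the one built in the loop above:

```python
import os
import tempfile
import pandas as pd

folder = tempfile.mkdtemp()
# stand-in for the dict of merged data frames built above
files_of_interest = {
    "10004": pd.DataFrame({"a": [1, 2, 3]}),
    "10003": pd.DataFrame({"a": [4, 5]}),
}
for key, df in files_of_interest.items():
    df.to_csv(os.path.join(folder, key + ".csv"), index=False)  # e.g. 10004.csv
```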
Hope my answer helped.

Parse filename information into multiple columns in the concatenated csv file

I have multiple csv files in a folder and each has a unique file name such as W10N1_RTO_T0_1294_TL_IV_Curve.csv. I would like to concatenate all files together and create multiple columns based on the filename information. For example, W10N1 is one column called DieID.
I am a beginner at programming and Python, and I couldn't figure out how to do this easily.
import os
import glob
import pandas as pd

os.chdir('filepath')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_csv.to_csv('combined_csv.csv', index=False)
import os
os.listdir("your_target_directory")
will return a list of all files and directories in "your_target_directory".
Then it is just string manipulation, e.g.:
>>> x = 'blue_red_green'
>>> x.split('_')
['blue', 'red', 'green']
>>> a, b, c = x.split('_')
>>> a
'blue'
>>> b
'red'
>>> c
'green'
Also split on "." first to remove the .csv extension.
Finally, create a CSV with whatever separator you want:
f = open("yourfancyname.csv", "w+")
f.write("DieID You_fancy_other_IDs also_if_u_want_variable_use_this_%d\r\n" % (i + 1))
f.close()
EZ as A B C
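Putting the pieces together for the filename in the question: only DieID is named there, so the other column names below are hypothetical placeholders for the remaining underscore-separated fields.

```python
import pandas as pd

def filename_columns(name):
    # drop the extension, then split on underscores
    parts = name.rsplit(".", 1)[0].split("_")
    # DieID comes from the question; Field2..Field4 are made-up names
    return {"DieID": parts[0], "Field2": parts[1], "Field3": parts[2], "Field4": parts[3]}

meta = filename_columns("W10N1_RTO_T0_1294_TL_IV_Curve.csv")
# tag a frame read from that file with its filename metadata
df = pd.DataFrame({"V": [0.1, 0.2]}).assign(**meta)
```

Applied inside the concat list comprehension above (`pd.read_csv(f).assign(**filename_columns(f))`), this would stamp each file's rows with its own filename fields before combining.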
