I have thousands of files inside a directory whose names follow the pattern YYYY/MM/DD/HH/MM:
I want to merge the files that begin with the same YYYY (2018 and 2019 as separate files) into one Excel file per year, as below:
this is first file
this is second file

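You will need to read each file and concatenate them with pandas:
import glob
import pandas as pd

my_path = "c:\\temp\\"
for year in ['2018', '2019']:
    buf = []  # one DataFrame per file for this year
    year_files = glob.glob(my_path + year + "*.xlsx")
    for file in year_files:
        buf.append(pd.read_excel(file))
    year_df = pd.concat(buf)  # all rows for this year
    year_df.to_excel(my_path + "merged_" + year + ".xlsx", index=False)  # output name is only an example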
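The grouping step can also be separated from the Excel I/O. The following is a minimal sketch, assuming the file names start with a four-digit year (names such as 201801010000.xlsx are hypothetical, as is the helper name group_by_year); it only sorts paths into per-year lists, which can then be read and concatenated as above.

```python
import os
from collections import defaultdict


def group_by_year(paths, extension=".xlsx"):
    """Map each four-digit year prefix to the list of matching paths."""
    groups = defaultdict(list)
    for path in paths:
        name = os.path.basename(path)
        year = name[:4]  # assumption: the name begins with YYYY
        if name.endswith(extension) and year.isdigit():
            groups[year].append(path)
    return dict(groups)
```

Files with another extension, or whose first four characters are not digits, are skipped.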


How to change the data type of the same columns in 7000 CSV files and replace the files with the updated ones

I have 7000 CSV files with 30,000 records each; all files have the same 14 column headers.
I want to change the data type of columns 1, 3 and 4. I can do that for one file with pandas, but how can I apply it to all the files in a loop, without loading them by hand one at a time, and overwrite each file with the updated columns?
I tried the code below, which I honestly copied from somewhere else, so I don't know where to put the path of my CSV folder or how it replaces the files.
import pandas as pd
import os
import glob

def main():
    path = os.getcwd()  # folder to process; here the current working directory
    csv_files = glob.glob(os.path.join(path, "*.csv"))
    for f in csv_files:
        df = pd.read_csv(f)
        df[['LOAD DATE','DATE OF ISSUE','DATE OF DEPARTURE']] = df[['LOAD DATE','DATE OF ISSUE','DATE OF DEPARTURE']].apply(pd.to_datetime, errors='coerce')
        df.to_csv(f, index=False)  # writes back to the same path, replacing the file

if __name__ == "__main__":
    main()
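One way to answer the "where do I give the path" part is to pass the folder in as an argument instead of relying on os.getcwd(). This is a minimal sketch under that assumption; the function name convert_folder is made up for illustration, and the column names are the ones from the code above.

```python
import glob
import os

import pandas as pd

# Column names taken from the question; adjust to your own headers.
DATE_COLUMNS = ["LOAD DATE", "DATE OF ISSUE", "DATE OF DEPARTURE"]


def convert_folder(folder, date_columns=DATE_COLUMNS):
    """Rewrite every CSV in `folder` with `date_columns` parsed as datetimes."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    for path in paths:
        df = pd.read_csv(path)
        for column in date_columns:
            # Unparseable values become NaT instead of raising an error.
            df[column] = pd.to_datetime(df[column], errors="coerce")
        df.to_csv(path, index=False)  # overwrites the original file
    return paths
```

Calling convert_folder with the path of the CSV folder processes every file in it. Note that a CSV file stores only text, so what persists after the rewrite is the normalized date text, not a pandas dtype; values that could not be parsed are written as empty cells.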

Saving text files to a .npy file

I have many text files in a directory with numerical extensions (for example: signal_data1.9995100000000001, signal_data1.99961, etc.).
The contents of the files are as given below.
I just want to arrange the above files into a single .npy file as
-1.710951390504200198e+00,5.720409824754981720e-01, 2.730176313110273423e+00
-6.710951390504200198e+01,2.720409824754981720e-01, 6.730176313110273423e+05
So, I want to apply the same procedure to many files in a directory.
I tried a loop as follows:
import numpy as np
import glob
for file in glob.glob('./signal_*'):
    np.savez('data', file)
However, it does not give what I want as shown above, so I need help here. Thanks in advance.
Here is another way of achieving it:
import os

dirPath = './data/'  # folder where you store your data
output = ""
with os.scandir(dirPath) as entries:
    for entry in entries:  # read each file in your folder
        with open(dirPath + entry.name, "r") as dataFile:
            dataLines = dataFile.readlines()
        for line in dataLines:
            output += line.strip() + " "  # strip surrounding whitespace & append
        output += '\n'  # after each file break line
with open("a.npy", "w") as writeFile:  # save it (plain text, not NumPy's binary .npy format)
    writeFile.write(output)
You can use np.loadtxt() and np.save():
import glob
import numpy as np

a = np.array([np.loadtxt(f) for f in sorted(glob.glob('./signal_*'))])
np.save('data.npy', a)
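Since sorted() orders file names as text, names with numeric suffixes can come out in a different order than their numeric values (for instance 10.5 sorts before 9.25). The sketch below orders the files by the number after the prefix instead. It assumes every file shares the prefix signal_data from the example names and holds the same count of numbers; the helper names are made up for illustration.

```python
import glob
import os

import numpy as np

PREFIX = "signal_data"  # assumed common prefix, taken from the example names


def numeric_suffix(path):
    """Return the number that follows PREFIX in the file name."""
    return float(os.path.basename(path)[len(PREFIX):])


def stack_signals(folder, out_path):
    """Load every PREFIX* file in `folder` and save them as one .npy array."""
    paths = sorted(glob.glob(os.path.join(folder, PREFIX + "*")), key=numeric_suffix)
    data = np.array([np.loadtxt(p) for p in paths])  # one row per file
    np.save(out_path, data)
    return data
```

The saved array can be read back with np.load(out_path).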

Using filenames to create variable - PySpark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of a substring function that counts from the right of the file name to parse out the 12 and the 2019. If I could parse the file name for these variables, I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv"), which would greatly simplify my code.
You can easily extract it from the file name using the input_file_name() column and string functions such as regexp_extract and substring_index:
from pyspark.sql.functions import col, input_file_name, regexp_extract, substring_index

df = spark.read.load('my_folder/*.csv', format="csv")
# group 1 captures the trailing YYYY_WW just before ".csv"
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"(\d{4}_\d{1,2})\.csv$", 1))\
    .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
    .withColumn("sales_week", substring_index(col("year_week"), "_", -1))
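You can try the following:
import glob
import os
from pyspark.sql.functions import lit

listfiles = glob.glob('my_folder/sales_report_*.csv')
dfs = []  # one DataFrame per file
for file in listfiles:
    stem = os.path.splitext(os.path.basename(file))[0]  # e.g. sales_report_2019_12
    year, week = stem.rsplit('_', 2)[-2:]  # last two underscore-separated fields
    df = spark.read.load(file, format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))
    dfs.append(df)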
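The "count from the right" parsing can be checked without Spark. This is a small sketch using Python's re module, assuming only what the question states: the name always ends in YYYY_WW.csv and the prefix can be anything. The function name parse_year_week is hypothetical.

```python
import os
import re

# Matches a trailing "YYYY_WW.csv"; the prefix before it may be anything.
SUFFIX_PATTERN = re.compile(r"(\d{4})_(\d{1,2})\.csv$")


def parse_year_week(path):
    """Return (year, week) as ints taken from the end of the file name."""
    match = SUFFIX_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"no YYYY_WW.csv suffix in {path!r}")
    return int(match.group(1)), int(match.group(2))
```

Because the pattern is anchored at the end of the name, underscores in the prefix or in the folder name do not affect the result.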

How to combine multiple csv files based on file name

I have more than 1000 CSV files, and I want to combine the ones whose file names share the same first five digits into one CSV file.
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    #file_df['file_name'] = name
    final = pd.concat([final, file_df_final])  # DataFrame.append was removed in pandas 2.0
I used the above code, but it merges all the files into one CSV file; I don't know how to select files based on their names.
So, from the above input:
output 1: combine the first three CSV files into one CSV file because the first five digits of their file names are the same.
output 2: combine the next four files into one CSV file because the first five digits of their file names are the same.
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns list of filenames in current folder
files_of_interest = {}  # maps the first five characters of a name to the files sharing them
for filename in files:  # iterate over files in a folder
    if filename[-4:] == '.csv':  # check whether a file is of .csv format
        key = filename[:5]  # as you mentioned in your question, the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if the key is missing, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list
for key in files_of_interest:
    frames = [pd.read_csv(filename) for filename in files_of_interest[key]]  # read every file for this key
    files_of_interest[key] = pd.concat(frames)  # replace the list of files with one data frame
This code will create a dictionary of DataFrames whose keys are the unique first five characters of the .csv file names.
Then you can iterate over the keys of the dictionary to save each corresponding DataFrame as a .csv file.
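Grouping and saving can also be done in one pass. The sketch below is only an illustration: the function name combine_by_prefix and the "<prefix>_combined.csv" output naming are assumptions, and the results go to a separate folder so they are not picked up as inputs on a later run.

```python
import os
from collections import defaultdict

import pandas as pd


def combine_by_prefix(src_dir, out_dir, prefix_len=5):
    """Merge CSV files in src_dir whose names share the first prefix_len characters."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".csv"):
            groups[name[:prefix_len]].append(os.path.join(src_dir, name))

    os.makedirs(out_dir, exist_ok=True)
    written = {}
    for prefix, paths in groups.items():
        combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
        out_path = os.path.join(out_dir, prefix + "_combined.csv")  # hypothetical naming
        combined.to_csv(out_path, index=False)
        written[prefix] = out_path
    return written
```

The returned dictionary maps each five-character prefix to the path of the combined file that was written for it.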
Hope my answer helped.

Parse filename information into multiple columns in the concatenated csv file

I have multiple csv files in a folder and each has a unique file name such as W10N1_RTO_T0_1294_TL_IV_Curve.csv. I would like to concatenate all files together and create multiple columns based on the filename information. For example, W10N1 is one column called DieID.
I am a beginner at programming and Python, and I couldn't figure out how to do it easily.
import os
import glob
import pandas as pd
import csv

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
import os
os.listdir("your_target_directory")
will return a list of all files and directories in "your_target_directory".
Then it is just string manipulation, e.g.
>>> x = 'blue_red_green'
>>> x.split("_")
['blue', 'red', 'green']
>>> a, b, c = x.split("_")
>>> a
'blue'
>>> b
'red'
>>> c
'green'
Also split on "." first to remove the .csv extension.
Finally, create a CSV using whatever separator you want:
f = open("yourfancyname.csv", "w+")
f.write("DieID You_fancy_other_IDs also_if_u_want_variable_use_this_%d\r\n" % (i+1))  # i: a counter from your own loop
f.close()
EZ as A B C
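Putting the pieces together with pandas, a rough sketch could look like this. Only the DieID column name comes from the question; the other column names are placeholders, and the helper names are made up for illustration.

```python
import glob
import os

import pandas as pd

# Only "DieID" comes from the question; the other column names are placeholders.
FIELD_NAMES = ["DieID", "field_2", "field_3", "field_4"]


def read_with_name_fields(path):
    """Read one CSV and add columns taken from its underscore-separated file name."""
    stem = os.path.splitext(os.path.basename(path))[0]  # drop folder and ".csv"
    parts = stem.split("_")
    df = pd.read_csv(path)
    for column, value in zip(FIELD_NAMES, parts):
        df[column] = value  # same value on every row of this file
    return df


def combine(folder):
    """Concatenate all CSV files in `folder`, each tagged with its file name fields."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    return pd.concat([read_with_name_fields(p) for p in paths], ignore_index=True)
```

For W10N1_RTO_T0_1294_TL_IV_Curve.csv this adds DieID = W10N1 to every row from that file; the values are kept as strings, and name fields beyond the listed columns are ignored.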
