python iterating on multiple files - python-3.x

I have
file_2000.dta, file_2001.dta, file_2002.dta and so on.
I also have
file1_2000.dta, file1_2001.dta, file1_2002.dta and so on.
I want to iterate on the file year.
Let (year) = 2000, 2001, 2002, etc
import file_(year) using pandas.
import file1_(year) using pandas.
file_(year)['name'] = file_(year).index
file1_(year)['name'] = file1_(year).index2
merged = pd.merge(file_(year), file1_(year), on='name')
write/export merged_(year).dta
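For reference, a minimal sketch of that loop using f-strings and pandas' read_stata/to_stata; the year range and treating the index as the merge key are assumptions taken from the pseudocode above:
import pandas as pd

for year in range(2000, 2003):  # extend the range as needed
    df = pd.read_stata(f"file_{year}.dta")
    df1 = pd.read_stata(f"file1_{year}.dta")
    df['name'] = df.index
    df1['name'] = df1.index  # stand-in for the "index2" in the pseudocode
    merged = pd.merge(df, df1, on='name')
    merged.to_stata(f"merged_{year}.dta")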

Based on your .dta extensions, you will want pandas' read_stata function to read the files in a loop, build a list of the separate dataframes so you can work with them individually, and then concatenate all of the dataframes into one.
Something like:
import pandas as pd

list_of_files = ['file_2000.dta', 'file_2001.dta', 'file_2002.dta']  # full paths here...
frames = []
for f in list_of_files:
    df = pd.read_stata(f)
    frames.append(df)
consolidated_df = pd.concat(frames, axis=0, ignore_index=True)
These questions might be relevant to your case:
How to Read multiple files in Python for Pandas separate dataframes
Pandas read_stata() with large .dta files

As far as I know, there is no 'Let' keyword in Python. To iterate over multiple files in a directory you can simply use a for loop with the os module, like the following:
import os

directory = r'C:\Users\admin'
for filename in os.listdir(directory):
    if filename.startswith("file_200") and filename.endswith(".dta"):
        pass  # do something with the matching file
    else:
        continue
Another approach is to use a regex to tell Python which file names to match during the iteration; the pattern could be pattern = r"file_20\d+", as in the sketch below.
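A minimal sketch of that regex approach with the re module (the directory and the exact pattern are assumptions based on the question):
import os
import re

directory = r'C:\Users\admin'
pattern = re.compile(r"file_20\d+\.dta$")
for filename in os.listdir(directory):
    if pattern.match(filename):
        print(filename)  # matching file, process it here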

Related

How do I import a list from a file that's named in a variable

I have tried code like this
x = input("Name of person")
from x import p_list
print(p_list['name'])
In this case I have multiple files, all formatted the same way, but I have no way to pull that data without combining all the lists into one file, which in my case is not wanted.
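A common way to import a module whose name is only known at runtime is importlib.import_module; a minimal sketch, assuming each person's name corresponds to a .py module on the import path that defines p_list:
import importlib

x = input("Name of person")          # e.g. "alice" for a file alice.py
module = importlib.import_module(x)  # imports the module named in x
print(module.p_list['name'])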

Converting multiple .pdf files with multiple pages into 1 single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, several people recommended converting it to .csv first in order to avoid errors.
So I wrote the code below, which gives me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
The error appears at the pd.concat command.
import tabula
import pandas as pd
import glob

path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print(all_files)

df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index=False)
Since this might be a common issue, I am posting the solution I found.
frames = []
for f1 in all_files:
    # read_pdf can return a list of tables, so reduce each file to a single DataFrame first
    frames.append(pd.concat(tabula.read_pdf(f1)))
df = pd.concat(frames, ignore_index=True)
Breaking the iteration into two steps works because each read_pdf result is concatenated into a single DataFrame before the final concatenation across files.

Using filenames to create variable - PySpark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0)) \
       .withColumn("sales_year", substring_index(col("year_week"), "_", 1)) \
       .withColumn("sales_week", substring_index(col("year_week"), "_", -1)) \
       .drop("year_week")
You can also try the below:
import glob
from pyspark.sql.functions import lit

listfiles = glob.glob('my_folder/sales_report_*.csv')
for f in listfiles:
    # the name always ends with YYYY_WW.csv, so split from the right
    weekyear = f[:-4].rsplit('_', 2)
    year = weekyear[-2]
    week = weekyear[-1]
    df = spark.read.load(f, format="csv") \
        .withColumn("sales_year", lit(year)) \
        .withColumn("sales_week", lit(week))

How to combine multiple csv files based on file name

I have more than 1000 csv files, and I want to combine the files whose filenames share the same first five digits into one csv file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob

path_1 = ''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1, index_col=False)
    # file_df_final['file_name'] = name_1
    final = final.append(file_df_final)
final.to_csv('', index=False)
I used the above code, but it merges all files into one csv file; I don't know how to make the selection based on the name.
So from the above input:
output 1: combine the first three csv files into one csv file, because the first five digits of their filenames are the same.
output 2: combine the next four files into one csv file, because the first five digits of their filenames are the same.
I would recommend you to approach the problem slightly differently.
Here's my solution:
import os
import pandas as pd

files = os.listdir('.')  # returns list of filenames in the current folder
files_of_interest = {}   # a dictionary that we will be using below

for filename in files:  # iterate over files in the folder
    if filename[-4:] == '.csv':  # check whether the file is of .csv format
        key = filename[:5]  # as mentioned in the question, the first five characters of the filename are of interest
        files_of_interest.setdefault(key, [])  # if we don't have such a key, .setdefault creates it and assigns an empty list to it
        files_of_interest[key].append(filename)  # append the new filename to the list

for key in files_of_interest:
    buff_df = pd.DataFrame()
    for filename in files_of_interest[key]:
        buff_df = buff_df.append(pd.read_csv(filename))  # read every file for this key and append it to buff_df
    files_of_interest[key] = buff_df  # replace the list of files by a dataframe
This code will create a dictionary of dataframes, keyed by the first five characters of the .csv filenames.
Then you can iterate over the keys of the dictionary to save each corresponding dataframe as a .csv file, as sketched below.
Hope my answer helped.
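A minimal sketch of that last saving step (the output filename pattern is just an assumption):
for key, df in files_of_interest.items():
    df.to_csv(f"{key}_combined.csv", index=False)  # e.g. 10004_combined.csv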

loop through read_csv files python

First time post, Python newbie, appreciate your help and patience. Looking for help iterating through read_csv() files and assigning each to its own new dataframe. The filenames get pulled in correctly; I'm just having issues making a unique dataframe for each csv imported.
import glob
import pandas as pd

# makes a variable list of filenames with common patterns via glob
filenames = sorted(glob.glob('617*.txt'))
# gets the number of files from the folder
count = len(filenames)

for i in range(0, count):
    file = pd.read_csv('C:/Users/fjehlik/Desktop/python\
data_frame_excel/' + filenames[i], sep='\t', header=0)
IIUC you want to create a dictionary of DataFrames:
dfs = {f:pd.read_csv(f, sep='\t') for f in filenames}
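For example, building on the question's own glob pattern, each imported csv then lives in its own dataframe keyed by filename (the printed shape is just to show access):
import glob
import pandas as pd

filenames = sorted(glob.glob('617*.txt'))
dfs = {f: pd.read_csv(f, sep='\t', header=0) for f in filenames}
for name, frame in dfs.items():
    print(name, frame.shape)  # each file is its own dataframe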
