Using filenames to create variable - PySpark - apache-spark

I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.

You can extract it from the file name using the input_file_name() column and string functions such as regexp_extract and substring_index:
from pyspark.sql.functions import col, input_file_name, regexp_extract, substring_index
df = spark.read.load('my_folder/*.csv', format="csv")
# regexp_extract needs a group index; 0 returns the whole match, e.g. "2019_12"
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0))\
    .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
    .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
    .drop("year_week")
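To check what such a pattern pulls out of a path before running it in Spark, the same idea can be sketched in plain Python. The helper name parse_year_week is hypothetical, and the pattern assumes every name ends in YYYY_WW.csv as the question describes.

```python
import re

# Assumption: file names end in YYYY_WW.csv, as stated in the question.
_YEAR_WEEK = re.compile(r"(\d{4})_(\d{1,2})\.csv$")

def parse_year_week(path):
    """Return (year, week) parsed from the end of a file name, or None if it does not match."""
    match = _YEAR_WEEK.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_year_week("my_folder/sales_report_2019_12.csv"))
print(parse_year_week("my_folder/cash_flow_2020_7.csv"))
```

Anchoring the pattern to the end of the name keeps digits elsewhere in the path (for example, in a folder name) from being picked up by mistake.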

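You can try the following:
import glob
import os
from pyspark.sql.functions import lit
listfiles = glob.glob('my_folder/sales_report_*.csv')
for path in listfiles:
    # The name ends in YYYY_WW.csv, so take the last two underscore-separated fields
    stem = os.path.splitext(os.path.basename(path))[0]
    year, week = stem.rsplit('_', 2)[-2:]
    # Note: df is overwritten on every iteration
    df = spark.read.load(path, format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))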

Related

I have one person in a dataframe that keeps showing up as \ufeff in my dataframe when I print to console

I have Python code that loads a group of exam results. Each exam is saved in its own CSV file.
files = glob.glob('Exam *.csv')
frame = []
files1 = glob.glob('Exam 1*.csv')
for file in files:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
for file in files1:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
For one person in the whole dataframe, the name column shows up as
\ufeffStudents Name
It happens for every single exam. I tried using the encoding argument but that's not fixing the issue. I am out of ideas. Anyone else have anything?
That character is the BOM, or "Byte Order Mark."
There are several ways to resolve it.
First, I suggest adding the engine parameter (for example, engine='python') to pd.read_csv() when reading the CSV files.
pd.read_csv(file, index_col=[0], engine='python', encoding='utf-8-sig')
Secondly, you can simply remove it by replacing it with an empty string ('').
df['student_name'] = df['student_name'].apply(lambda x: x.replace("\ufeff", ""))
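One possible explanation, offered as an assumption since the files themselves are not shown: the utf-8-sig codec strips a BOM only at the very start of the data, so a second BOM embedded later (for example, in front of one student's name) survives decoding. A minimal sketch with made-up bytes:

```python
# Made-up sample: one BOM at the start of the file and a second one before a name.
raw = b"\xef\xbb\xbfname,score\n\xef\xbb\xbfStudents Name,90\n"

text = raw.decode("utf-8-sig")  # removes only the leading BOM
print(repr(text))

cleaned = text.replace("\ufeff", "")  # removes any BOM left inside the data
print(repr(cleaned))
```

If that is what is happening, replacing the character after loading, as in the second suggestion, is the part that actually removes it.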

What is the appropriate way to take in files that have a filename with a timestamp in it?

What is the appropriate way to take in files that have a filename with a timestamp in it and read properly?
One way I'm thinking of so far is to take these filenames into one single text file to read all at once.
For example, filenames such as
1573449076_1570501819_file1.txt
1573449076_1570501819_file2.txt
1573449076_1570501819_file3.txt
Go into a file named filenames.txt
Then something like
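with open('/Documents/filenames.txt', 'r') as f:
    for item in f:
        item = item.strip()  # drop the trailing newline before using the name as a path
        if os.path.isfile(item):  # a str has no is_file() method
            file_stat = os.stat(item)
            # convert_times is my own helper (not shown here)
            print("Fetching {}".format(convert_times(file_stat)))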
My question is: how can I properly read the names in the text file, given that the timestamps are part of the actual names? Once I figure that out, I can convert them.
If you just want to get the timestamps from the file names, assuming that they all use the same naming convention, you can do so like this:
import glob
import os
from datetime import datetime
# Grab all .txt files in the specified directory
files = glob.glob("<path_to_dir>/*.txt")
for file in files:
    file = os.path.basename(file)
    # Check that it contains an underscore
    if '_' not in file:
        continue
    # Split the file name using the underscore as the delimiter
    stamps = file.split('_')
    # Convert the epoch to a legible string
    start = datetime.fromtimestamp(int(stamps[0])).strftime("%c")
    end = datetime.fromtimestamp(int(stamps[1])).strftime("%c")
    # Consume the data
    print(f"{start} - {end}")
...
You'll want to add some error checking and handling; for instance, if the first or second index in the stamps array isn't a parsable int, this will fail.
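The error checking mentioned above could look roughly like this. The helper name parse_stamps is hypothetical, and converting to UTC rather than local time is an assumption made here so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone

def parse_stamps(filename):
    """Return (start, end) datetimes from '<epoch>_<epoch>_<rest>', or None if malformed."""
    parts = filename.split("_")
    if len(parts) < 3:
        return None  # not enough underscore-separated fields
    try:
        # Assumption: both fields are Unix epochs in seconds; UTC is used for determinism.
        start = datetime.fromtimestamp(int(parts[0]), tz=timezone.utc)
        end = datetime.fromtimestamp(int(parts[1]), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return None  # a field was not a parsable or representable timestamp
    return start, end

print(parse_stamps("1573449076_1570501819_file1.txt"))
print(parse_stamps("abc_def_file.txt"))
```

Returning None instead of raising lets the calling loop skip files that do not follow the naming convention.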

Reading excel files in python using dynamic dates in the file and sheet name

I need to read an Excel file on a daily basis. The file name has the format file_name 08.20.2018 xyz.xlsx, and the date in it changes every day.
When reading the file, I also need to extract the data from a sheet whose name changes daily with the date. An example sheet name is sheet1-08.20.2020-data
How should I achieve this? I am using the following code but it does not work:
df = pd.read_Excel(r'file_name 08.20.2018 xyz.xlsx', sheet_name = 'sheet1-08.20.2020-data')
How do I update this line of code so that it picks up the data dynamically each day as new dates come in? To be clear, the dates are incremental with no gaps.
You could use pathlib and the datetime module to automate the process:
from pathlib import Path
from datetime import date
import pandas as pd
# assuming you have a directory of files:
folder = Path("<directory of files>")
# %m and %d zero-pad the month and day, matching names like 08.20.2018
today = date.today().strftime("%m.%d.%Y")
sheetname = f"sheet1-{today}-data"
date_string = f"file_name {today} xyz.xlsx"
xlsx_file = folder.glob(date_string)
# read in data
df = pd.read_excel(io=next(xlsx_file), sheet_name=sheetname)
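strftime zero-pads the month and day, so it can build the MM.DD.YYYY part of both names for any date. A small sketch, assuming the file and sheet use the same date and the naming shown in the question (the helper name build_names is hypothetical):

```python
from datetime import date

def build_names(day):
    """Return (file name, sheet name) for the given date, using the MM.DD.YYYY layout."""
    stamp = day.strftime("%m.%d.%Y")  # %m and %d are zero-padded
    return f"file_name {stamp} xyz.xlsx", f"sheet1-{stamp}-data"

print(build_names(date(2018, 8, 20)))
print(build_names(date(2018, 12, 5)))
```

Passing the date in as an argument also makes it easy to re-read a file from an earlier day.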

python iterating on multiple files

I have
file_2000.dta, file_2001.dta, file_2002.dta and so on.
I also have
file1_2000.dta, file1_2001.dta, file1_2002.dta and so on.
I want to iterate on the file year.
Let (year) = 2000, 2001, 2002, etc
import file_(year) using pandas.
import file1_(year) using pandas.
file_(year)['name'] = file_(year).index
file1_(year)['name'] = file1_(year).index2
merged = pd.merge(file_(year), file1_(year), on='name')
write/export merged_(year).dta
It seems to me that you need to use the read_stata function, based on your .dta extensions, to read the files in a loop, create a list of the separate dataframes to be able to work with them separately, and then concatenate all dataframes into one.
Something like:
list_of_files = ['file_2000.dta', 'file_2001.dta', 'file_2002.dta'] # full paths here...
frames = []
for f in list_of_files:
    df = pd.read_stata(f)
    frames.append(df)
consolidated_df = pd.concat(frames, axis=0, ignore_index=True)
These questions might be relevant to your case:
How to Read multiple files in Python for Pandas separate dataframes
Pandas read_stata() with large .dta files
As far as I know, there is no 'Let' keyword in Python. To iterate over multiple files in a directory, you can use a for loop with the os module, like the following:
import os
directory = r'C:\Users\admin'
for filename in os.listdir(directory):
    if filename.startswith("file_200") and filename.endswith(".dta"):
        pass  # do something with the file
    else:
        continue
Another approach is to use a regex to tell Python which file names to match during the iteration. The pattern could be: pattern = r"file_20\d+"
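Since the question is about handling file_<year> and file1_<year> together, a regex can also be used to pair the two files for each year. This is a minimal sketch; the pattern and the helper name pair_by_year are assumptions based on the file names in the question.

```python
import re
from collections import defaultdict

# Assumption: names look like file_<year>.dta and file1_<year>.dta, as in the question.
PATTERN = re.compile(r"^(file1?)_(\d{4})\.dta$")

def pair_by_year(filenames):
    """Map each year to {'file': name, 'file1': name} for the names that match."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            prefix, year = match.groups()
            pairs[int(year)][prefix] = name
    return dict(pairs)

names = ["file_2000.dta", "file1_2000.dta", "file_2001.dta", "file1_2001.dta", "readme.txt"]
for year, pair in sorted(pair_by_year(names).items()):
    print(year, pair["file"], pair["file1"])
```

Each pair could then be read with pd.read_stata and merged on 'name', writing one merged file per year as the question describes.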

How to combine multiple csv files based on file name

I have more than 1000 CSV files, and I want to combine the files whose names share the same first five digits into one CSV file.
input:
100044566.csv
100040457.csv
100041458.csv
100034566.csv
100030457.csv
100031458.csv
100031459.csv
import pandas as pd
import os
import glob
path_1 =''
all_files_final = glob.glob(os.path.join(path_1, "*.csv"))
names_1 = [os.path.basename(x1) for x1 in all_files_final]
final = pd.DataFrame()
for file_1, name_1 in zip(all_files_final, names_1):
    file_df_final = pd.read_csv(file_1,index_col=False)
    #file_df['file_name'] = name
    final = final.append(file_df_final)
final.to_csv('',index=False)
I used the above code, but it merges all files into one CSV file. I don't know how to select files based on the name.
So, from the above input:
output 1: combine the first three CSV files into one CSV file because the first five digits of their file names are the same.
output 2: combine the next four files into one CSV file because the first five digits of their file names are the same.
I would recommend approaching the problem slightly differently.
Here's my solution:
import os
import pandas as pd
files = os.listdir('.') # returns list of filenames in current folder
files_of_interest = {} # maps a five-character prefix to the list of matching filenames
for filename in files: # iterate over files in a folder
    if filename[-4:] == '.csv': # check whether a file is of .csv format
        key = filename[:5] # as mentioned in the question, the first five characters of the filename are of interest
        files_of_interest.setdefault(key, []) # if the key does not exist yet, .setdefault creates it with an empty list
        files_of_interest[key].append(filename) # append the new filename to the list
for key in files_of_interest:
    # read every file for this key and concatenate them (DataFrame.append was removed in pandas 2.0)
    buff_df = pd.concat([pd.read_csv(filename) for filename in files_of_interest[key]])
    files_of_interest[key] = buff_df # replace the list of files with a data frame
This code creates a dictionary of dataframes whose keys are the unique five-character prefixes of the .csv file names.
You can then iterate over the keys of the dictionary to save each dataframe as a .csv file.
Hope my answer helped.
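The grouping step can be tried on its own, without reading any files, using the sample names from the question. This is only a sketch: the helper name group_by_prefix and the combined_<prefix>.csv output names are hypothetical.

```python
from collections import defaultdict

def group_by_prefix(filenames, length=5):
    """Group .csv file names by their first `length` characters."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if name.endswith(".csv"):
            groups[name[:length]].append(name)
    return dict(groups)

names = [
    "100044566.csv", "100040457.csv", "100041458.csv",
    "100034566.csv", "100030457.csv", "100031458.csv", "100031459.csv",
]
for prefix, members in group_by_prefix(names).items():
    # Hypothetical output name for the combined file of each group
    print(f"combined_{prefix}.csv", "<-", members)
```

Each group's files could then be concatenated with pandas and written out with to_csv(..., index=False), giving one output file per prefix.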
