I have a list of XML files containing a timestamp in the filename. I need to conditionally load those files based on the timestamp value. For this I am using wildcards.
Here is the code I am using, which is not working:
spark.read \
.format("com.databricks.spark.xml") \
.load("/path/file_[1533804409548-1533873609934]*")
I don't think you can do this with wildcards, since you want to load only the files whose timestamps fall within a range. Because it's possible to load a DataFrame from multiple locations, you can instead build a list of file paths that fall within the time range and load those. Here is sample code I've tried:
import os

target_files = []
st = 1533804409548  # start of the timestamp range
et = 1533873609934  # end of the timestamp range
path = "<files_base_path>"
for file in os.listdir(path):
    try:
        # adjust the slice to where the timestamp sits in your filenames,
        # e.g. "file_<13-digit timestamp>..." -> characters 5 to 18
        ts = int(file[5:18])
        if st <= ts <= et:
            target_files.append(path + file)
    except ValueError:
        # filename does not contain a timestamp at that position; skip it
        continue

spark.read.parquet(*target_files)
Change the constants and the filename slice to match your input. Hopefully it helps.
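Since the question loads XML with the spark-xml reader rather than Parquet, the same list of paths can be passed to that reader directly. A minimal sketch, assuming the com.databricks.spark.xml package is on the classpath; the rowTag value is a placeholder for your files' actual element name:
# Load the filtered files with spark-xml; "row" is a placeholder rowTag
df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .load(target_files)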
I have Python code that loads a group of exam results. Each exam is saved in its own csv file.
files = glob.glob('Exam *.csv')
frame = []
files1 = glob.glob('Exam 1*.csv')
for file in files:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
for file in files1:
    frame.append(pd.read_csv(file, index_col=[0], encoding='utf-8-sig'))
There is one person in the whole dataframe whose entry in the name column shows up as
\ufeffStudents Name
It happens for every single exam. I tried using the encoding argument but that's not fixing the issue. I am out of ideas. Anyone else have anything?
That character is the BOM, or "Byte Order Mark."
There are several ways to resolve it.
First, I suggest adding the engine parameter (for example, engine='python') in pd.read_csv() when reading the csv files:
pd.read_csv(file, index_col=[0], engine='python', encoding='utf-8-sig')
Secondly, you can simply remove it by replacing it with an empty string ('').
df['student_name'] = df['student_name'].apply(lambda x: x.replace("\ufeff", ""))
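If the BOM turns out to be attached to the column header rather than the cell values (which is where a UTF-8 BOM usually lands, since it sits at the very start of the file), a small sketch for stripping it from the column names of a pandas DataFrame df:
# Remove a leading BOM from every column name, if present
df.columns = [c.lstrip("\ufeff") for c in df.columns]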
I am looking for a way to read a bunch of files from S3, but there is a potential for a path to not exist. I would just like to ignore the fact that the path does not exist and process all the information possible. For example, I want to read in these files:
files_to_read = []
for id in ids_to_process:
    for date in dates_to_process:
        files_to_read.append('s3://bucket/date=' + date + '/id=' + id + '/*.parquet')

sqlContext.read.parquet(*files_to_read)
The issue is that some ids may not have started until a certain date, and while I can figure that out, it's not very easy to do programmatically. What would be the easiest way to either a) ignore a file if the path does not exist, or b) check whether a path exists?
I have tried sqlContext.sql("spark.sql.files.ignoreMissingFiles=true"), which does not seem to work. Would there be any similar option that I am missing?
Here, "missing files" really means files deleted from a directory after the DataFrame has been constructed, so that option does not cover paths that never existed.
It is better to check in Python beforehand whether the target paths exist, instead of handing that over to Spark.
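For S3 paths, one way to do that check from PySpark is the Hadoop FileSystem API exposed through the JVM gateway. A sketch, assuming a SparkSession named spark and the files_to_read list from the question; note that _jsc and _jvm are internal attributes:
# Keep only the paths whose parent directory actually exists in S3
hadoop_conf = spark._jsc.hadoopConfiguration()
HadoopPath = spark._jvm.org.apache.hadoop.fs.Path

existing = []
for p in files_to_read:
    # Drop the trailing "*.parquet" glob and test the directory itself
    dir_path = HadoopPath(p.rsplit("/", 1)[0])
    fs = dir_path.getFileSystem(hadoop_conf)
    if fs.exists(dir_path):
        existing.append(p)

df = sqlContext.read.parquet(*existing)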
You could try something like this, catching the specific exception that is thrown when a path does not exist (in Scala it's an AnalysisException; PySpark exposes it as pyspark.sql.utils.AnalysisException):
from pyspark.sql.utils import AnalysisException

df = None
for path in paths_to_read:
    try:
        temp_df = sqlContext \
            .read \
            .parquet(path)
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)
    except AnalysisException:
        # Path cannot be read (e.g. it does not exist); ignore it
        # print("Path " + path + " cannot be read. Ignoring.")
        pass
I haven't seen anything in PySpark that can do that out of the box. I also faced this, and this is what I did:
Have a list of S3 addresses that you want to read.
addrs = ["s3a://abc", "s3a://xyz", ... ]
Test the links beforehand and keep only the ones that are accessible (removing items from a list while iterating over it skips elements, so build a new list instead):
valid_addrs = []
for add in addrs:
    try:
        spark.read.format("parquet").load(add)
        valid_addrs.append(add)
    except Exception:
        # Path is not accessible; report and skip it
        print(add)
Read the filtered list using spark
sdf_a = spark\
    .read\
    .format("parquet")\
    .load(valid_addrs)
Brand new to Python and programming. I have a function that extracts a file creation date from .csv files (the date is included in the file naming convention):
def get_filename_dates(self):
    """Extract date from filename and place it into a list"""
    for filename in self.file_list:
        try:
            date = re.search("([0-9]{2}[0-9]{2}[0-9]{2})",
                             filename).group(0)
            self.file_dates.append(date)
            self.file_dates.sort()
        except AttributeError:
            print("The following files have naming issues that prevented "
                  "date extraction:")
            print(f"\t{filename}")
    return self.file_dates
The data within these files are brought into a DataFrame:
def create_df(self):
    """Create DataFrame from list of files"""
    for i in range(0, len(self.file_dates)):
        self.agg_data = pd.read_csv(self.file_list[i])
        self.agg_data.insert(9, 'trade_date', self.file_dates[i],
                             allow_duplicates=False)
    return self.agg_data
As each file in file_list is worked with, I need to insert its corresponding date into a new column (trade_date).
As written here, the value of the last index in the list returned by get_filename_dates() is duplicated into every row of the trade_date column, presumably because each pass of the loop replaces self.agg_data before the next file is read.
My questions:
Is there an advantage to inserting data into the csv file using with open() vs. trying to match each file and corresponding date while iterating through files to create the DataFrame?
If there is no advantage to with open(), is there a different Pandas method that would allow me to manipulate the data as the DataFrame is created? In addition to the data insertion, there's other clean-up that I need to do. As it stands, I wrote a separate function for the clean-up; it's not complex and would be great to run everything in this one function, if possible.
Hope this makes sense -- thank you
You could grab each csv as an intermediate dataframe, do whatever cleaning you need to do, and use pd.concat() to concatenate them all together as you go. Something like this:
def create_df(self):
    """Create DataFrame from list of files"""
    self.agg_data = pd.DataFrame()
    for i, date in enumerate(self.file_dates):
        df_part = pd.read_csv(self.file_list[i])
        df_part['trade_date'] = date
        # --- Any other individual file level cleanup here ---
        self.agg_data = pd.concat([self.agg_data, df_part], axis=0)
    # --- Any aggregate-level cleanup here ---
    return self.agg_data
It makes sense to do as much of the preprocessing/cleanup as possible at the aggregate level.
I also took the liberty of converting the for loop to use the more Pythonic enumerate.
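If the file list is large, a variant (a sketch under the same assumptions, including that self.file_dates lines up index-for-index with self.file_list) that collects the per-file frames in a list and concatenates once at the end avoids repeatedly copying the growing frame:
def create_df(self):
    """Create DataFrame from list of files"""
    parts = []
    for filename, date in zip(self.file_list, self.file_dates):
        df_part = pd.read_csv(filename)
        df_part['trade_date'] = date
        # --- any other individual file level cleanup here ---
        parts.append(df_part)
    self.agg_data = pd.concat(parts, axis=0, ignore_index=True)
    # --- any aggregate-level cleanup here ---
    return self.agg_data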
I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report, cash_flow, etc.) but the last characters are always YYYY_WW.csv.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12 and 2019. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv") which would greatly simplify my code.
You can easily extract it from the filename using the input_file_name() column and some string functions like regexp_extract and substring_index:
from pyspark.sql.functions import input_file_name, regexp_extract, substring_index, col

df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), r"\d{4}_\d{1,2}", 0))\
    .withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
    .withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
    .drop("year_week")
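Note that the extracted values are strings; if you need them as numbers (an assumption about downstream use), you can cast them afterwards:
df = df.withColumn("sales_year", col("sales_year").cast("int"))\
    .withColumn("sales_week", col("sales_week").cast("int"))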
You can also try the below:
import glob
from pyspark.sql.functions import lit

listfiles = glob.glob('my_folder/sales_report_*.csv')
for file in listfiles:
    # strip ".csv", then take the last two underscore-separated parts: YYYY and WW
    weekyear = file[:-4].rsplit('_', 2)
    year = weekyear[1]
    week = weekyear[2]
    df = spark.read.load(file, format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))
I have a monitored directory containing a number of .csv files. I need to count the number of entries in each incoming .csv file. I want to do this in a PySpark Streaming context.
This is what I did:
my_DStream = ssc.textFileStream(monitor_Dir)
test = my_DStream.flatMap(process_file) # process_file function simply process my file. e.g line.split(";")
print(len(test.collect()))
This does not give me the result that I want. For example, file1.csv contains 10 entries, file2.csv contains 18 entries, etc. So I need to see the output
10
18
..
..
etc
I have no problem doing the same task with a single static file using RDD operations.
In case anyone is interested, this is what I did.
my_DStream = ssc.textFileStream(monitor_Dir)
DStream1 = my_DStream.flatMap(process_file)
DStream2 = DStream1.filter(lambda x: x[0])
lines_num = DStream2.count()
lines_num.pprint()
This gave the output I wanted.
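If you ever need the per-batch count as a plain Python number instead of pprint output (an assumption about how it will be used), one option is foreachRDD; a small sketch building on the code above:
def handle_count(rdd):
    # DStream.count() yields one single-element RDD per batch holding that batch's count
    counts = rdd.collect()
    if counts:
        print(counts[0])

lines_num.foreachRDD(handle_count)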