Reading only the unread files in PySpark - apache-spark

I have a root directory where additional directories are created daily or sometimes hourly, containing avro files, for example:
root/2021/12/01/file121.avro
root/2022/06/01/file611.avro
root/2022/06/01/file612.avro
root/2022/06/01/file613.avro
root/2022/06/03/file631.avro
root/2022/06/03/file632.avro
root/2022/06/05/file651.avro
root/2022/06/05/file652.avro
root/2022/06/05/file653.avro
Each time my PySpark code runs, it needs to read the files that have not been read before in any of the sub-directories of the root. I need to process one file per run of the code, and the code will be run about every 5 minutes.
How can this be accomplished in PySpark?
Any approach/strategy and code ideas would be much appreciated.
Best :)
Michael
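One possible approach (a sketch, not a definitive answer): keep a small ledger of the paths that earlier runs have already processed, list the avro files under the root on each run, and process the first one that is not yet in the ledger. Everything below is an assumption for illustration, not from the question: the root path, the ledger file name, and the use of glob, which presumes the root sits on a locally accessible filesystem (for HDFS or S3 you would swap the listing step for the Hadoop FileSystem API or boto3). Reading avro also requires the spark-avro package on the classpath.

import glob
import os

from pyspark.sql import SparkSession

ROOT = "root"                    # hypothetical root directory
LEDGER = "processed_files.txt"   # hypothetical ledger of already-read files

spark = SparkSession.builder.appName("read-one-unread-avro").getOrCreate()

# Paths processed by earlier runs (empty on the very first run).
processed = set()
if os.path.exists(LEDGER):
    with open(LEDGER) as fh:
        processed = {line.strip() for line in fh if line.strip()}

# List every avro file under the root and keep only the unread ones.
all_files = sorted(glob.glob(os.path.join(ROOT, "**", "*.avro"), recursive=True))
unread = [f for f in all_files if f not in processed]

if unread:
    target = unread[0]                          # one file per run
    df = spark.read.format("avro").load(target)
    df.show()                                   # replace with the real processing

    # Record the file as read only after processing succeeds.
    with open(LEDGER, "a") as fh:
        fh.write(target + "\n")

If a long-running job is acceptable instead of a cron-style run every 5 minutes, Structured Streaming's file source with a checkpoint location and maxFilesPerTrigger set to 1 hands the which-files-have-been-read bookkeeping over to Spark.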

Related

Why is Spark much faster at reading a directory compared to a list of filepaths?

I have a directory in S3 containing millions of small files. They are small (<10 MB) and gzipped, which I know is inefficient for Spark. I am running a simple batch job to convert these files to parquet format. I've tried two different ways:
spark.read.csv("s3://input_bucket_name/data/")
as well as
spark.read.csv("file1", "file2"..."file8million")
where each file given in the list is located in the same bucket and subfolder.
I notice that when I feed in the whole directory, there isn't as much delay at the beginning for the driver indexing files (it looks like around 20 minutes before the batch starts). In the UI for the single directory, there is 1 task after this 20 minutes, which looks like the conversion itself.
However, with individual filenames, this indexing time increases to 2+ hours, and the conversion job doesn't show up in the UI until then. For the list of files, there are 2 tasks: (1) the first is listing leaf files for the 8 million inputs, and then (2) a job that looks like the conversion itself.
I'm trying to understand why this is the case. Is there anything different about the underlying read API that would lead to this behaviour?
Spark assumes every path passed in is a directory, so when given a list of paths it has to do a list call on each one. For S3 that means 8M LIST calls against the S3 servers, which are rate limited to about 3,000/second (ignoring details like client thread count, HTTP connections, etc.). With LIST billed at $0.005 per 1,000 calls, 8M requests comes to about $40. And since each of those LISTs returns nothing, the client falls back to a HEAD request, which adds another S3 API call per file, roughly doubling execution time and adding another $32 to the query cost.
In contrast, listing a directory with 8M entries kicks off a single LIST request for the first 1,000 entries and 7,999 follow-ups. Recent s3a releases do async prefetch of the next page of results (faster, especially if the incremental list iterators are used): one thread fetches while another processes, and the whole listing will cost you about 4 cents.
The big directory listing is the more efficient and cost-effective strategy, even ignoring EC2 server costs.
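For completeness, the directory-based pattern recommended above looks roughly like the snippet below. This is only a sketch: the output bucket, the app name, and the overwrite mode are illustrative, and it assumes S3 credentials and the S3 connector are already configured on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gz-to-parquet").getOrCreate()

# One paginated listing of the whole prefix instead of a LIST (plus HEAD) per file.
df = spark.read.csv("s3://input_bucket_name/data/")

# The conversion itself: write the data back out as parquet.
df.write.mode("overwrite").parquet("s3://output_bucket_name/data_parquet/")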

How to make subsection of sbatch jobs run at a time

I have a bash script that essentially aligns every file in one directory to every file in another specified directory. The number of jobs gets quite large when 100 files are each aligned individually against another 100 files (10,000 jobs), all of which get submitted to Slurm individually.
I have been running them in batches manually, but I think there must be a way to handle this in the script so that, for example, only 50 jobs are running at a time.
I tried
$ sbatch --array [1-1000]%50
but it didn't work.

Dealing with large number of small json files using pyspark

I have around 376K JSON files under a directory in S3. The files are about 2.5 KB each and each contains a single record. When I tried to load the entire directory with the code below via a Glue ETL job with 20 workers:
spark.read.json("path")
It just didn't run; it timed out after 5 hours. So I wrote and ran a shell script to merge the records from these files into a single file, and when I tried to load that, it displayed only a single record. The merged file size is 980 MB. The same approach worked fine locally for 4 records: after merging those 4 records into a single file, it displayed 4 records as expected.
I used the below command to append the JSON records from different files under a single file:
for f in Agent/*.txt; do cat ${f} >> merged.json;done;
There is no nested JSON. I even tried the multiline option, but it didn't work. So, what can be done in this case? My guess is that after merging, the records are not being treated as separate records, which is causing the issue. I even tried head -n 10 to display the top 10 lines, but it seems to loop forever.
The problem was with the shell script I was using to merge the multiple small files. After the merge, the records weren't aligned properly, so they weren't treated as separate records.
Since I was dealing with a JSON dataset, I used the jq utility to process it. Below is the shell command that merges a large number of records into one file quickly:
find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
Later on, I was able to load the JSON records as expected with the below code:
spark.read.option("multiline","true").json("path")
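As a side note (not part of the original answer), an alternative to slurping everything into one big JSON array is to merge the records into JSON Lines, one object per line, which Spark's json reader handles without the multiline option. A minimal Python sketch, assuming each source file under Agent/ holds exactly one JSON record:

import glob
import json

# Write one JSON object per line (JSON Lines); the output file name is illustrative.
with open("merged.jsonl", "w") as out:
    for path in glob.glob("Agent/*.txt"):
        with open(path) as fh:
            record = json.load(fh)      # each input file holds a single JSON record
        out.write(json.dumps(record) + "\n")

# df = spark.read.json("merged.jsonl")  # no multiline option needed for JSON Lines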
I have run into trouble in the past working with thousands of small files; in my case they were CSV files, not JSON. One of the things I did to try and debug was to create a for loop that loaded smaller batches and then combined all the data frames together. During each iteration I would call an action to force execution, log the progress to get an idea of whether it was moving forward, and monitor how much it slowed down as the job progressed.
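A rough sketch of that batching idea; the file paths, batch size, and header option are illustrative rather than from the original answer, and it assumes all files share one schema:

from functools import reduce

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("batched-small-file-load").getOrCreate()

# Hypothetical list of small CSV files, e.g. collected with glob or an S3 listing.
files = [f"s3://my-bucket/data/part-{i:05d}.csv" for i in range(1000)]

BATCH = 100
frames = []
for start in range(0, len(files), BATCH):
    batch = files[start:start + BATCH]
    df = spark.read.csv(batch, header=True)
    df.count()                                                # action to force execution of this batch
    print(f"loaded files {start}..{start + len(batch) - 1}")  # progress logging
    frames.append(df)

combined = reduce(DataFrame.unionByName, frames)              # combine all the batches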

Does the file get changed in squeue if I modify after being sent into queue? [duplicate]

Say I want to run a job on the cluster: job1.m
Slurm handles the batch jobs and I'm loading Mathematica to save the output file job1.csv
I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters, and tell it to save data to job1_edited.csv. Then I re-submit job1.m.
Now I have two batch jobs in the queue.
What will happen to my output files? Will job1.csv be data from the original job1.m file? And will job1_edited.csv be data from the edited file? Or will job1.csv and job1_edited.csv be the same output?
:(
Thanks in advance!
I am assuming job1.m is a Mathematica job, run from inside a Bash submission script. In that case, job1.m is read when the job starts, so if it is modified after submission but before the job starts, the modified version will run. If it is modified after the job starts, the original version will run.
If job1.m is the submission script itself (so you run sbatch job1.m), that script is copied into a spool directory specific to the job, so even if it is modified after the job is submitted, the original version will still run.
In any case, for reproducibility and traceability, it is better to use a workflow manager such as Fireworks or Bosco.

Storing run time logs in a folder

I am running a shell script in a Linux environment that creates some logs (dynamic log files) as text files.
I want to store all the log files it creates in a single folder after a particular amount of time.
So how can I do that? Can anyone suggest some commands?
Thanks in advance.
In the script, you can define that directory as a variable and use it across the script.
#!/bin/bash
LOG_DIR=/tmp/logs
mkdir -p "$LOG_DIR"              ## make sure the log directory exists
LOG_FILE="$LOG_DIR/log_file.$$"  ## $$ creates a different log file for each and every run
## You can also build the file name from a timestamp using the date command.
<Your commands> >> "$LOG_FILE"
It really depends on your situation:
[Suggested if your log files are small in size]
You may want to back up your logs by adding a cron job that zips/tars them into another folder as a snapshot. Since the log files are small, even zipping/tarring everything would take many, many years to fill up your hard drive.
[Suggested if your log files are large]
In the script that generates the logs, you may want to rotate through a few indexed files, say log.0 to log.6, one for each weekday from Sunday to Saturday. You can then have another script back up yesterday's log (so there are no race conditions between the log producer and the log consumer, i.e. the log mover/copier). You can decide how many days of backups to keep and when older ones should be discarded.
Moving or copying yesterday's log can easily be done by a cron job.
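If the log producer happened to be a Python program rather than a shell script (an assumption; the question is about a shell script), the rotate-and-retain idea above is built into the standard library's logging.handlers.TimedRotatingFileHandler, which rolls the file over once a day and keeps a fixed number of old copies:

import logging
import os
from logging.handlers import TimedRotatingFileHandler

os.makedirs("/tmp/logs", exist_ok=True)   # the log directory must exist (compare LOG_DIR above)

# Rotate at midnight and keep the last 7 days, roughly matching the
# log.0 .. log.6 one-file-per-weekday scheme described above.
handler = TimedRotatingFileHandler("/tmp/logs/app.log", when="midnight", backupCount=7)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("run started")   # each day's records end up in their own dated file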
