Apache PySpark - Reading a directory without scanning the files - apache-spark

We have a growing data lake of logs we keep on google storage. The data is partitioned by dates (and other stuff such as env=production/staging). Imagine the path gs://bucket/data/env=*/date=*
We begin an app or an analysis by creating DataFrames that can be queried later on for processing. The problem is that creating the DFs takes a long time, even before we run any actions on them. In other words, the following command takes a long time because Spark seems to be scanning all the files inside (and, as I mentioned, the amount of data keeps growing).
df = spark.read.load("gs://bucket/data/", schema=data_schema, format="json")
Note that we provide the schema here. Also, after the data is loaded the partitioning works well; that is, if we filter by day we do get the speed-up that we expect. We don't want to read a specific partition from the get-go; we would like to have everything in one DF and read only what we need later on.
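One workaround, sketched below under the assumption that a Hive metastore is available, is to register the location once as an external partitioned table; creating the DataFrame then becomes a metastore lookup, and env/date filters are pruned against the metastore instead of triggering a full file listing. The table name logs and the message/level columns are illustrative placeholders for the real schema.

# Hedged sketch: register the location as an external partitioned table once.
# The table name and the non-partition columns are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs (message STRING, level STRING, env STRING, date STRING)
    USING json
    PARTITIONED BY (env, date)
    LOCATION 'gs://bucket/data/'
""")
# One-off discovery of the existing partitions; new ones can be added
# incrementally with ALTER TABLE logs ADD PARTITION (...) as data arrives.
spark.sql("MSCK REPAIR TABLE logs")

# From now on this is a metastore lookup rather than a scan of the bucket,
# and filtering on env/date prunes partitions before any files are touched.
df = spark.table("logs")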

Related

Best option for storage in spark

A third party is producing a complete daily snapshot of their database table (Authors) and is storing it as a Parquet file in S3. Currently there are around 55 million records, and this will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them, and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, filtering it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thoughts are that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job that will be deployed onto the same EMR cluster; it will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is going to be an initial goal. Later on, there will quite possibly be tens or even hundreds of other sources that I would need to match against. That means trillions of records that are blocked, and those blocks need to be stored somewhere. Would this option still be viable at that stage?
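A minimal sketch of one way the blocked output could be laid out so a later job only reads the blocks it needs; the blocking rule, column names, and paths below are hypothetical, and the target could just as well be S3 instead of HDFS.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical blocking rule: first three letters of the surname.
authors = spark.read.parquet("s3://bucket/authors-snapshot/")
blocked = authors.withColumn("block_key", F.lower(F.substring("surname", 1, 3)))

# Partitioning by the blocking key means the downstream job can prune to a
# single directory per block instead of scanning everything.
blocked.write.mode("overwrite").partitionBy("block_key").parquet("hdfs:///blocks/authors/")

# Later, read only the block relevant to an incoming event:
one_block = spark.read.parquet("hdfs:///blocks/authors/").where(F.col("block_key") == "smi")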

How can I extract information from parquet files with Spark/PySpark?

I have to read in N parquet files, sort all the data by a particular column, and then write out the sorted data in N parquet files. While I'm processing this data, I also have to produce an index that will later be used to optimize the access to the data in these files. The index will also be written as a parquet file.
For the sake of example, let's say that the data represents grocery store transactions and we want to create an index by product to transaction so that we can quickly know which transactions have cottage cheese, for example, without having to scan all N parquet files.
I'm pretty sure I know how to do the first part, but I'm struggling with how to extract and tally the data for the index while reading in the N parquet files.
For the moment, I'm using PySpark locally on my box, but this solution will eventually run on AWS, probably in AWS Glue.
Any suggestions on how to create the index would be greatly appreciated.
This is already built into Spark SQL. In SQL use DISTRIBUTE BY, or in PySpark use partitionBy before writing, and it will group the data as you wish on your behalf. Even if you don't use a partitioning strategy, Parquet has predicate pushdown that does lower-level filtering. (Actually, if you are using AWS, you likely don't want to use partitioning and should stick with large files that rely on predicate pushdown, specifically because S3 directory scanning is slow and should be avoided.)
Basically, great idea, but this is already in place.
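For concreteness, a hedged sketch of both ideas in PySpark; the paths, column names, and partition count are illustrative, and the explicit index step is only needed if predicate pushdown on the sorted files isn't enough on its own.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

tx = spark.read.parquet("s3://bucket/transactions/")  # illustrative input path

# Range-partition and sort by product so each output file covers a narrow
# product range; Parquet min/max statistics then let readers skip files and
# row groups when filtering on product.
(tx.repartitionByRange(200, "product")
   .sortWithinPartitions("product")
   .write.mode("overwrite")
   .parquet("s3://bucket/transactions_sorted/"))

# Optional explicit index: which output files contain which product.
index = (spark.read.parquet("s3://bucket/transactions_sorted/")
         .select("product", F.input_file_name().alias("file"))
         .distinct())
index.write.mode("overwrite").parquet("s3://bucket/transactions_index/")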

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to perform SQL queries on this data (typically fetching time ranges for a specific client/sensor, performing aggregations, etc.). I've tried loading it into Postgres and TimescaleDB, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the row-group metadata is not as optimized as if I used one single Parquet dataset containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many Parquet files as to have a single partitioned Parquet dataset? I know that a partitioned Parquet dataset is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the files.
The reason I am not able to do this is that I am continuously receiving new data and, from my understanding, I cannot append to a Parquet file due to the way the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to Parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with Kafka (as you are already using it) for real-time processing of your data and store the data in Parquet format. And yes, you can append data to Parquet files; use SaveMode.Append for that, for example:
df.write.mode('append').parquet(path)
You can even partition your data on an hourly basis, i.e. client/sensor/year/month/day/hour, which will give you a further performance improvement when querying.
You can create the hour partition based on system time or on a timestamp column, depending on the type of query you want to run on your data.
You can use watermarking to handle late records if you choose to partition based on the timestamp column.
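For example, a minimal sketch of the partitioned append write (assuming the DataFrame already carries client, sensor, and time columns with these illustrative names):

(df.write
   .mode("append")
   .partitionBy("client", "sensor", "year", "month", "day", "hour")
   .parquet("/data/sensors/"))  # illustrative output path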
Hope this helps!
I can share my experience and the technology stack used at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics is stored in Druid and ClickHouse. ClickHouse is used to hold real-time data for the last two days; Druid (0.9) wasn't able to manage that. Druid holds the rest of our data, which is populated daily via Hadoop.
Druid is a good candidate if you don't need row-level data but pre-aggregated data, on a daily or hourly basis.
I would suggest giving ClickHouse a chance; it lacks documentation and examples but is robust and fast.
Also, you might take a look at Apache Hudi.

How to get the number of partitions written by a DataFrameWriter

Let's assume we have the following code in Spark:
dataset.write.partitionBy("c1", "c2", "c3").parquet("myDir")
I have seen a couple of threads on SO explaining how to get the number of files or records written after the parquet method completes. However, what I would like to access is the names of the partition directories created, i.e. the directories myDir/c1=XX/c2=YY/c3=ZZ where XX, YY and ZZ are domain-specific values.
One reason I need these directory names is to perform data integrity checks after an ETL process, and I need to know which directories were created during the ETL (say 3-4 directories in my use case) among thousands of them.
Does anyone know if there is a way to retrieve this information (at the Spark API level)?
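One pragmatic workaround, sketched below rather than a writer-level API, is to derive the directory names from the partition columns of the dataset itself; it assumes dataset is still available after the write and that these values are exactly what ended up on disk.

# Collect the distinct partition-column combinations that were written.
written = dataset.select("c1", "c2", "c3").distinct().collect()
dirs = ["myDir/c1={}/c2={}/c3={}".format(r.c1, r.c2, r.c3) for r in written]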

Efficiently creating two RDDs in Spark

I'm investigating using Spark for a project with hundreds of GB of data being generated per hour. I'm struggling with getting the first step optimal, though it feels like it should be simple!
Suppose there is a daily process that splits the data into two parts, neither of which is small enough to cache in memory. Is there a way to do this in a single Spark job without the source data having to be loaded from HDFS and parsed twice?
Say I want to write all the "Dog" events into a new HDFS location, and all the "Cat" events into another. As soon as I specify the first "write to file" action (for the dogs), Spark will set off loading each file into RAM in turn and filtering out all the dogs. Then it will have to re-parse every file to find the cats, even though it could have done both at once: it could have loaded each partition of the raw data and written out the cats and dogs for that partition in one pass.
If the data would fit in memory I could filter down to both types of events, then cache, then do the two writes. But it doesn't, and I can't see how to make spark do this on a per-partition basis as it goes along. I've studied the API docs thinking I must be missing something...
Any advice would be greatly appreciated, thanks!
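A hedged sketch of one way to get the single-pass behaviour with the DataFrame API rather than raw RDDs: partitioning the output by the event type writes both subsets to separate directories in one scan of the source. The paths and the event_type column name are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = spark.read.json("hdfs:///raw/events/")  # illustrative input

# One pass over the data; output lands under .../event_type=Dog/ and
# .../event_type=Cat/, which can be read independently later.
(events.write
    .mode("overwrite")
    .partitionBy("event_type")
    .parquet("hdfs:///split/events/"))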
