How does Spark read text format files? - apache-spark

I have a dataset in S3 in text format (.gz) and I am using spark.read.csv to read the file into Spark.
This is about 100 GB of data, but it contains 150 columns. I only need 5 columns, so I reduce the breadth of the data by selecting just those 5.
For this kind of scenario, does Spark scan the complete 100 GB of data, or does it smartly read only those 5 columns without scanning all the columns (like in columnar formats)?
Any help on this would be appreciated.
imp_feed = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t') \
    .where(col('dayserial_numeric').between(start_date_imp, max_date_imp)) \
    .select("col1", "col2", "col3", "col4")

Make step 1 of your workflow the process of reading in the CSV files and saving them as snappy-compressed ORC or Parquet files.
Then go to whoever creates those files and tell them to stop it. At the very least, they should switch to Avro + Snappy, as that makes the initial parse and the migration to a columnar format easier to split up.
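A minimal PySpark sketch of that first step, reusing the schema and column names from the question; the output path and compression choice are only placeholders:

from pyspark.sql.functions import col

raw = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t')

(raw
 .where(col('dayserial_numeric').between(start_date_imp, max_date_imp))
 .select("col1", "col2", "col3", "col4")   # prune to the columns you actually use
 .write
 .option("compression", "snappy")          # snappy-compressed Parquet is splittable, unlike gzip
 .parquet('s3://mys3-loc/parquet/'))       # hypothetical output location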

Related

Pyspark dataframe parquet vs delta : different number of rows

I have data written in Delta on HDFS. From what I understand, Delta stores the data as Parquet; it just has an additional layer over it with advanced features.
But when reading the data with PySpark, I get a different result depending on whether the dataframe is read with spark.read.parquet() or spark.read.format('delta').load():
df = spark.read.format('delta').load("my_data")
df.count()
> 184511389
df = spark.read.parquet("my_data")
df.count()
> 369022778
As you can see the difference is quite big.
Is there something I misunderstood about delta vs parquet?
Pyspark version is 2.4.
The most probable explanation is that you wrote into the Delta table twice using the overwrite option. Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data, it just writes new files, and it doesn't remove the old files immediately; they are only marked as deleted in the manifest that Delta maintains. When you read through Delta, it knows which files are deleted and reads only the actual data. Actual deletion of the data files happens when you perform VACUUM on the Delta table.
But when you read the same directory as plain Parquet, the reader has no information about deleted files, so it reads everything in the directory, which is why you get twice as many rows.
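If that is what happened, a hedged sketch of cleaning up the stale files with the open-source Delta Lake Python API (this assumes the delta-core package and its Python bindings are available on the cluster; the default 7-day retention applies unless you pass a value):

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "my_data")
# Physically removes files that are no longer referenced by the current
# table version and are older than the retention threshold.
dt.vacuum()

After a VACUUM, a plain spark.read.parquet() of the directory should agree with the Delta count, because the superseded files are gone.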

Is it possible in Spark to read large S3 CSV files in parallel?

Typically Spark output is saved in multiple part files, allowing each worker to read a different file.
Is there a similar solution when working with a single file?
S3 provides the Select API, which should allow this kind of behaviour.
Spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but it appears to relate only to optimising queries, not to parallelising reads.
S3 Select is unrelated to your use case.
S3 Select: have the SQL select and project done in the S3 store, so that the client gets the prefiltered data. The result is returned as CSV with the header stripped, or as JSON. You cannot then have more than one worker target this (you could try, but each worker would have to read in and discard all the data in the run-up to its offset, and predicting the ranges each worker can process is essentially impossible).
You: have > 1 worker process different parts of a file which has been partitioned
Partitioning large files into smaller parts for parallel processing is exactly what Spark (and mapreduce, hive etc) do for any format where it makes sense.
CSV files are easily partitioned provided they are compressed with a splittable compression format (none or snappy, but not gzip).
All that's needed is to tell Spark what the split threshold is. For s3a, set fs.s3a.block.size to a value it can split on, and your queries against CSV, Avro, ORC, Parquet and similar will all be split up amongst workers.
Unless your workers are doing a lot of computation per row, there's a minimum block size before it's even worth doing this. Experiment.
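For example, in PySpark the s3a block size can be passed through the Hadoop configuration when the session is built (the 64 MB value below is purely illustrative; experiment as suggested above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-splits")
         # spark.hadoop.* keys are copied into the Hadoop configuration,
         # so this sets fs.s3a.block.size, the size s3a reports for computing splits
         .config("spark.hadoop.fs.s3a.block.size", str(64 * 1024 * 1024))
         .getOrCreate())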
Edit: this is now somewhat out of date and depends on the type of CSV. Some CSVs allow newlines within columns; these are unsplittable. CSVs that guarantee a newline only ever represents a new row can be split.
FYI, CSVs are inherently single-threaded to read. There is no extra information in a CSV file that tells the reader where any row starts without reading the whole file from the beginning.
If you want multiple readers on the same file, use a format like Parquet, which has row groups with an explicitly defined start position recorded in the footer that can be read by independent readers. When Spark reads the Parquet file, it will split the row groups out into separate tasks. Ultimately, having appropriately sized files is very important for Spark performance.
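A quick way to see how many input splits (and hence tasks) Spark actually created is to check the partition count of the DataFrame right after the read; the paths below are placeholders:

df_csv = spark.read.csv("s3a://my-bucket/big-file.csv.gz", header=True)
df_parquet = spark.read.parquet("s3a://my-bucket/big-table/")

# One partition per input split: a single gzipped CSV typically shows 1,
# while a large Parquet dataset is split across several partitions.
print(df_csv.rdd.getNumPartitions())
print(df_parquet.rdd.getNumPartitions())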

Would S3 Select speed up Spark analyses on Parquet files?

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the ec2 cluster to run the computation. This is really inefficient because we don't need all the last_name and country data to run this query.
If the data is stored as CSV files and you run the query with S3 select, then S3 will only transfer the data in the first_name column to run the query.
spark
.read
.format("s3select")
.schema(...)
.options(...)
.load("s3://bucket/filename")
.select("first_name")
.distinct()
.count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the ec2 cluster. Parquet is a columnar file format and this is one of the main advantages.
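For comparison, a minimal sketch of that plain Parquet read (bucket and path are placeholders); Spark's built-in column pruning means only the first_name column is fetched from S3:

peopleDF = spark.read.parquet("s3a://bucket/people-parquet/")
peopleDF.select("first_name").distinct().count()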
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I did write the S3 Select binding code in the hadoop-aws module. Amazon EMR has published some numbers, as has Databricks.
For CSV IO, yes, S3 Select will give a speedup when there is aggressive filtering of the source data, e.g. many GB of data scanned but not much coming back. Why? Although the read itself is slower, you save on the limited bandwidth to your VM.
For Parquet, though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like snappy is used), so more than one worker can work on the same file. They also read only a fraction of the data (== the bandwidth benefit is smaller), but they do seek around in that file (== you need to optimise the seek policy, or pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads done in the S3 store can beat a Spark cluster, provided there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a that means: seek policy, thread pool size, HTTP pool size) for performance too.
Like I said though: I'm not sure. Numbers are welcome.
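A sketch of those s3a client tunings in PySpark; the property names are real s3a options, but the values are only examples to experiment with:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # seek policy: "random" suits columnar formats that jump around within a file
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
         # thread pool used for s3a background operations
         .config("spark.hadoop.fs.s3a.threads.max", "64")
         # maximum number of pooled HTTP connections
         .config("spark.hadoop.fs.s3a.connection.maximum", "64")
         .getOrCreate())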
Came across this spark package for s3 select on parquet [1]
[1] https://github.com/minio/spark-select

Partitioning strategy in Parquet and Spark

I have a job that reads CSV files, converts them into data frames and writes them out in Parquet. I am using append mode while writing the data in Parquet. With this approach a separate Parquet file gets generated on each write. My questions are:
1) If a new file gets appended every time I write data to the Parquet schema, will it impact read performance (as the data is now distributed across Parquet files of varying sizes)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3
1) It will affect read performance if spark.sql.parquet.mergeSchema=true. In that case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter will create uneven output files, but is more performant. You also have the config spark.sql.files.maxRecordsPerFile, or the equivalent write option maxRecordsPerFile, to prevent files from getting too big (see the sketch below), but usually this is not an issue.
3) Yes, I think so; Spark has no built-in API to evenly distribute data by size. Column statistics and the SizeEstimator utility may help with this.
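A hedged sketch of point 2 in practice, capping rows per output file and controlling the number of writing tasks; the numbers and path are only examples:

(df
 .repartition(32)                         # how many tasks write in parallel
 .write
 .option("maxRecordsPerFile", 1000000)    # cap rows per output file (Spark 2.2+)
 .mode("append")
 .parquet("hdfs:///warehouse/my_table"))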

Saving in parquet format from multiple spark workers

I have a job that needs to save the result in parquet/avro format from all the worker nodes. Can I write a separate parquet file for each individual partition and read all the resulting files back as a single table? Or is there a better way of going about this?
Input is divided into 96 partitions and result needs to be saved on HDFS. When I tried to save it as a file it created over a million small files.
You can do a repartition (or coalesce if you always want fewer partitions) to the desired number of partitions just before you call write. Your data will then be written into the same number of files. When you want to read in the data, you simply point to the folder with the files rather than to a specific file. Like this:
sqlContext.read.parquet("s3://my-bucket/path/to/files/")
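And the corresponding write side, a sketch that repartitions just before writing so each partition becomes one reasonably sized file; the partition count and output path are only examples:

(result_df
 .repartition(96)            # pick a count that yields sensibly sized files
 .write
 .mode("append")
 .parquet("hdfs:///output/my_table/"))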
