I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?
Details: the file is a tab-delimited CSV.
Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration is that a gzip file is not splittable, so Spark has to read the whole file in a single task on a single core, which slows things down. After the read is done, the data can be repartitioned (shuffled) to increase parallelism.
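The read-then-shuffle step described above can be sketched as follows. This is a minimal sketch, not a tuned recipe: read_gzipped_tsv is a hypothetical helper name, spark is assumed to be an existing SparkSession, and the default of 8 partitions is an arbitrary assumption that should be adapted to the cluster.

```python
def read_gzipped_tsv(spark, path, num_partitions=8):
    """Read a gzip-compressed, tab-delimited file and spread its rows over partitions.

    `spark` is assumed to be an existing SparkSession; `num_partitions` is an
    illustrative default, not a recommendation.
    """
    # The .gz file is decompressed by a single task because gzip is not splittable.
    df = spark.read.csv(path, sep="\t")
    # repartition() shuffles the rows so that later stages can run in parallel.
    return df.repartition(num_partitions)

# Usage (assuming a running session):
# df = read_gzipped_tsv(spark, "file.csv.gz", num_partitions=16)
```

The shuffle has a cost of its own, so it tends to pay off only when the work after the read is heavy enough to benefit from the extra parallelism.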
For most of my files, when I read in delimited files and write them out to snappy parquet, spark is executing as I expected and creating multiple partitioned snappy parquet files.
That said, I have some large .out files that are pipe-separated (25GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy parquet files (20GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make querying in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.
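Check how many partitions the inputFile DataFrame has. It looks like it has a single partition.
You are just reading a CSV file and then writing it out as Parquet, so the output files mirror the input partitioning. Also check the size of your CSV file; it seems to be really large. Note that multiline='true' prevents Spark from splitting a CSV file, so a single large file is read as one partition.
inputFile.rdd.getNumPartitions()
If it returns 1, try repartitioning the DataFrame:
inputFile = inputFile.repartition(10)  # or
inputFile = inputFile.repartition("col_name")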
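To pick a partition count instead of hard-coding 10, one rough option is to derive it from the input size. The sketch below is only an estimate under stated assumptions: target_partitions is a hypothetical helper, and the 128 MiB target is an assumed value, not something from the question. Parquet output is usually smaller than the CSV input, so the resulting files will likely come out below the target.

```python
import math

def target_partitions(total_bytes, target_file_bytes=128 * 1024 ** 2):
    """Return how many partitions give pieces of roughly target_file_bytes each.

    The 128 MiB default is an assumption for illustration.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Example: a 25 GiB input split into pieces of about 128 MiB
n = target_partitions(25 * 1024 ** 3)
print(n)  # 200

# Usage with the DataFrame from the question (assuming a running session):
# inputFile.repartition(n).write.parquet(pathOut, mode="overwrite")
```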
I am trying to read multiple csv files using spark. I need to skip more than one line of header from each csv file.
I am able to achieve this with the code below.
rdd = df.rdd
schema = df.schema
rdd_without_header = rdd.zipWithIndex().filter(lambda pair: pair[1] > skip_header).keys()  # pair is (row, index); tuple-unpacking lambdas are not valid in Python 3
df = spark_session.createDataFrame(rdd_without_header, schema=schema)
This code works fine, but if I have multiple compressed files in gz format, the operation takes a very long time to complete.
Compressed files are roughly 10x slower than uncompressed files.
Since I want to skip multiple header lines in all the files, I am not able to use Spark's header option:
option("header", "true")
What would be the best-optimized way to handle this use case?
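One approach that may help, sketched here rather than tested on real data: drop the first lines of each partition with mapPartitions instead of using zipWithIndex. zipWithIndex has to run an extra Spark job to count the rows of each partition when there is more than one partition, which for gzip input can mean decompressing the files more than once, and its index is global across all files rather than per file. The sketch assumes that each .gz file becomes exactly one partition (plausible with sc.textFile because gzip is not splittable, but worth confirming with getNumPartitions()). drop_leading_lines and read_without_headers are hypothetical helper names, and sc is assumed to be an existing SparkContext.

```python
from itertools import islice

def drop_leading_lines(lines, n_skip):
    """Return an iterator over one partition's lines without the first n_skip."""
    return islice(lines, n_skip, None)

def read_without_headers(sc, path, n_skip):
    """Read text files and drop the first n_skip lines of every partition.

    Assumption: each input file maps to exactly one partition, so the
    first lines of a partition are the header lines of that file.
    """
    return sc.textFile(path).mapPartitions(
        lambda lines: drop_leading_lines(lines, n_skip)
    )

# Usage (assuming a running session; the path is illustrative):
# lines = read_without_headers(spark_session.sparkContext, "/data/*.gz", skip_header)
# df = spark_session.read.csv(lines, schema=schema)
```

This does not hold for uncompressed files that are large enough to be split into several partitions, since only the first split of each file contains the header lines.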
I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found any other way than to work in two steps:
1. Repartition the DataFrame according to my condition, using partitionBy(), and write the resulting partitions to parquet files.
2. Read those parquet files and convert them into TFRecord files with the tensorflow-connector plugin.
It is the second step that I'm unable to do efficiently. My idea was to read the individual parquet files on the executors and immediately write them out as TFRecord files. But this needs access to the SQLContext, which is only available on the driver (discussed here), so it cannot run in parallel. I would like to do something like this:
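# List all parquet files to be converted
import glob, os
from pyspark.sql import SparkSession
files = glob.glob('/path/*.parquet')
# parallelize() lives on the SparkContext, not on the SparkSession
sc = SparkSession.builder.getOrCreate().sparkContext
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))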
Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
I've also tried just using the wildcard when reading all the parquet files:
SQLContext(sc).read.parquet('/path/*.parquet')
This indeed reads all parquet files, but unfortunately not into individual partitions. It appears that the original structure gets lost, so it doesn't help me if I want the exact contents of the individual parquet files converted into TFrecord files.
Any other suggestions?
Try spark-tfrecord.
Spark-TFRecord is a tool similar to spark-tensorflow-connector, but it supports partitionBy. The following example shows how to partition a dataset.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col
// create a dataframe
val df = Seq((8, "bat"),(8, "abc"), (1, "xyz"), (2, "aaa")).toDF("number", "word")
val tf_output_dir = "/tmp/tfrecord-test"
// dump the tfrecords to files.
df.repartition(3, col("number")).write.mode(SaveMode.Overwrite).partitionBy("number").format("tfrecord").option("recordType", "Example").save(tf_output_dir)
More information can be found in the GitHub repo:
https://github.com/linkedin/spark-tfrecord
If I understood your question correctly, you want to write the partitions locally on the workers' disk.
If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.
This is the code that you are looking for (as stated in the documentation linked above):
myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.
I have a sequence of very large daily gzipped files. I'm trying to use PySpark to re-save all the files in S3 in Parquet format for later use.
If, for a single file (for example, 2012-06-01), I do:
dataframe = spark.read.csv('s3://mybucket/input/20120601.gz', schema=my_schema, header=True)
dataframe.write.parquet('s3://mybucket/output/20120601')
it works, but since gzip isn't splittable it runs on a single host and I get no benefit of using the cluster.
I tried reading in a chunk of files at once and using partitionBy to write the output to daily files, like this (for example, reading in a month):
dataframe = spark.read.csv('s3://mybucket/input/201206*.gz', schema=my_schema, header=True)
dataframe.write.partitionBy('dayColumn').parquet('s3://mybucket/output/')
This time, individual files are read on different executors like I want, but the executors later die and the process fails. I believe that, since the files are so large and partitionBy is somehow using unnecessary resources (a shuffle?), it is crashing the tasks.
I don't actually need to repartition my DataFrame since this is just a 1:1 mapping. Is there any way to make each individual task write to a separate, explicitly named parquet output file?
I was thinking something like
def write_file(date):
    # get input/output locations from date
    dataframe = spark.read.csv(input_location, schema=my_schema, header=True)
    dataframe.write.parquet(output_location)
spark.sparkContext.parallelize(my_dates).foreach(write_file)
except this doesn't work since you can't broadcast the spark session to the cluster. Any suggestions?
Writing input files to separate output files without repartitioning
TL;DR This is what your code is already doing.
"partitionBy is causing an unnecessary shuffle"
No. DataFrameWriter.partitionBy doesn't shuffle at all.
"it works, but since gzip isn't splittable"
You can:
Drop compression completely - Parquet uses internal compression.
Use splittable compression like bzip2.
Unpack the files to a temporary storage before submitting the job.
If you are concerned about the resources used by partitionBy (it might open a larger number of files for each executor thread), you can actually shuffle to improve performance (see "DataFrame partitionBy to a single Parquet file (per partition)"). A single file is probably too much, but
dataframe \
.repartition(n, 'dayColumn', 'someOtherColumn') \
.write.partitionBy('dayColumn') \
.save(...)
where someOtherColumn can be chosen to get reasonable cardinality, should improve things.
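For the 1:1 mapping asked about in the question, another option is to keep the SparkSession on the driver and submit one read/write job per date from a small thread pool, since Spark accepts jobs submitted concurrently from several driver threads. The following is a minimal sketch under assumptions: convert_all and make_converter are hypothetical helper names, the path patterns are placeholders, and max_workers=4 is arbitrary. Each gzip file is still read by a single task, but several files are processed at the same time and no partitionBy is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_all(dates, convert_one, max_workers=4):
    """Run convert_one(date) for every date from driver-side threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so an exception raised in any worker surfaces here.
        return list(pool.map(convert_one, dates))

def make_converter(spark, schema, in_pattern, out_pattern):
    """Build a per-date converter; `spark` is assumed to be an existing SparkSession."""
    def convert_one(date):
        df = spark.read.csv(in_pattern.format(date=date), schema=schema, header=True)
        df.write.parquet(out_pattern.format(date=date))
        return date
    return convert_one

# Usage (assuming a running session; patterns are illustrative):
# convert = make_converter(spark, my_schema,
#                          "s3://mybucket/input/{date}.gz",
#                          "s3://mybucket/output/{date}")
# convert_all(my_dates, convert, max_workers=8)
```

Threads are enough here because the driver only schedules the work; the reading and writing still happen on the executors.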
I have a dataset in S3 in text format (.gz), and I am using spark.read.csv to read the files into Spark.
This is about 100GB of data, and it contains 150 columns. I am using only 5 of those columns (so I reduce the breadth of the data), and I select only those columns.
In this kind of scenario, does Spark scan the complete 100GB of data, or does it smartly read only these 5 columns without scanning all the columns (as with columnar formats)?
Any help on this would be appreciated.
imp_feed = spark.read.csv('s3://mys3-loc/input/', schema=impressionFeedSchema, sep='\t').where(col('dayserial_numeric').between(start_date_imp,max_date_imp)).select("col1","col2","col3","col4")
Make step 1 of your workflow the process of reading in the CSV files and saving them as Snappy-compressed ORC or Parquet files.
Then go to whoever creates those files and tell them to stop it. At the very least, they should switch to Avro + Snappy, as that makes it easier to split up the initial parse and the migration to a columnar format.
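With gzipped text, Spark still has to read and decompress every byte; selecting a few columns only reduces what is kept after parsing. The one-time conversion suggested above could look roughly like this. It is a sketch under assumptions: csv_to_parquet and read_columns are hypothetical helper names, spark is assumed to be an existing SparkSession, and the paths in the usage lines are placeholders.

```python
def csv_to_parquet(spark, src, dst, schema):
    """One-time step: parse the tab-delimited text and store it as Snappy Parquet."""
    df = spark.read.csv(src, schema=schema, sep="\t")
    df.write.parquet(dst, mode="overwrite", compression="snappy")

def read_columns(spark, dst, columns):
    """Later reads load only the requested columns, since Parquet is columnar."""
    return spark.read.parquet(dst).select(*columns)

# Usage (assuming a running session; the output path is illustrative):
# csv_to_parquet(spark, "s3://mys3-loc/input/", "s3://mys3-loc/parquet/", impressionFeedSchema)
# imp_feed = read_columns(spark, "s3://mys3-loc/parquet/", ["col1", "col2", "col3", "col4"])
```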