I am new to Spark and need some help, please. I have created a parquet file in Spark like this:
recordDF.write.parquet("record.parquet")
I try to read the created parquet file like this, but I get the error below:
record_parquet_df=Spark.read.parquet("record.parquet")
Execution failed for task ':CustomerRecord.main()'.
Process 'command 'C:/Program Files/Java/jdk1.8.0_341/bin/java.exe'' finished with non-zero exit value 1
How can I go about resolving this?
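For reference, a minimal write/read round-trip in PySpark looks like the sketch below; the sample rows are made up, and it assumes the usual lowercase spark session object (a capitalized Spark, as in the snippet above, is not a predefined variable in a standard PySpark or spark-shell session).
from pyspark.sql import SparkSession

# Build (or reuse) a session; in pyspark/spark-shell this object already
# exists as the lowercase variable `spark`.
spark = SparkSession.builder.appName("record-roundtrip").getOrCreate()

# Hypothetical sample data standing in for recordDF.
recordDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

recordDF.write.mode("overwrite").parquet("record.parquet")
record_parquet_df = spark.read.parquet("record.parquet")
record_parquet_df.show()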
Related
I received this error (java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable) while executing a PySpark script that reads data from ORC files in a partitioned folder.
This input folder contains data that should be read, transformed, and written to a target folder that has an existing external table built on top of it (MSCK REPAIR will be run after writing data to this target folder).
Code sample (process):
Step 1:
Df = spark.read.orc("input_path")
Step 2:
-> apply transformations (no cast function used)
Step 3:
Transformed_Df.write \
    .partitionBy("columns") \
    .mode("overwrite") \
    .orc("output_path")
When I checked the logs, I saw this error occurring multiple times right after the partitions are read. I believe this happens before the transformations are applied and before the data is written to the target.
Attached is a picture of the log, please check.
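A Text-to-DateWritable cast error like this often points to a type mismatch between files or partitions (for example, a column written as string in some partitions and as date in others). One way to pin the types down is to read with an explicit schema instead of per-file inference; a minimal sketch, with hypothetical column names and the placeholder paths from the steps above:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: the disputed date column is declared explicitly so
# every partition is read with the same type.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_date", DateType(), True),
])

Df = spark.read.schema(schema).orc("input_path")
Transformed_Df = Df  # transformations would go here (no cast function used)
(Transformed_Df.write
    .partitionBy("event_date")
    .mode("overwrite")
    .orc("output_path"))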
I cannot seem to find any documentation, but I want to understand how I can do the following:
We have Spark pipelines that write data to S3 in the standard format where they write several part-... files and the _SUCCESS file to the folder.
We then have further Spark pipelines that read data from those S3 buckets.
We would like to have the pipelines automatically throw an exception (fail) if they try to read from a folder that does not have the _SUCCESS file.
We could write some sort of user-defined function to handle this check, but it seems so common that I figured there must be an easy Spark-native way to generate this exception when the file is not found.
Is there such a native Spark way to trigger that exception?
The only way I can think of is using:
boolean isExists = FileSystem.get(spark.sparkContext().hadoopConfiguration()).exists(new Path("location of _SUCCESS file"));
If this returns false, throw an exception.
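The same check, sketched in PySpark through Spark's JVM gateway to the Hadoop FileSystem API; the bucket path and helper name are placeholders, not an official Spark option:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def require_success_marker(folder):
    # Hypothetical helper: raise if <folder>/_SUCCESS does not exist.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    marker = jvm.org.apache.hadoop.fs.Path(folder + "/_SUCCESS")
    fs = marker.getFileSystem(conf)
    if not fs.exists(marker):
        raise FileNotFoundError("No _SUCCESS marker under " + folder)

require_success_marker("s3://bucket/output-folder")  # placeholder location
df = spark.read.parquet("s3://bucket/output-folder")
Note that the underscore-prefixed attributes are internal to PySpark, so this relies on implementation details rather than a public API.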
I am running a Python script that reads a lot of parquet files from an S3 bucket and inserts the dataframes into Redshift. However, the errors "S3 Query Exception file has an invalid version number" and "InvalidRange The requested range is not satisfiable" happen frequently; I say "frequently" because they do not occur on every execution, but on most of them.
For each insertion I commit the changes until the read process ends, and then I close the cursor and the connection. This started when I updated my script: the old version read all the files and then inserted into Redshift, while the new version reads one file at a time and inserts its data. What may be causing this problem?
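For context, the per-file flow described above, sketched with hypothetical connection details, bucket names, and table layout (the actual script may differ); it reads one parquet object at a time and commits after each insert:
import boto3
import pandas as pd
import psycopg2

# Hypothetical connection and bucket details, for illustration only.
conn = psycopg2.connect(host="redshift-host", dbname="db", user="user", password="pw")
cur = conn.cursor()
s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/")
for obj in resp.get("Contents", []):
    if not obj["Key"].endswith(".parquet"):
        continue
    # Read a single file (needs pyarrow/s3fs) and insert its rows,
    # committing per file as described above.
    df = pd.read_parquet("s3://my-bucket/" + obj["Key"])
    cur.executemany(
        "INSERT INTO my_table VALUES (%s, %s, %s)",  # hypothetical 3-column table
        list(df.itertuples(index=False, name=None)),
    )
    conn.commit()

cur.close()
conn.close()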
I am using an Oozie workflow to generate a parquet file. Occasionally, when I try to read the file using Spark, I get the following exception:
java.io.IOException: Could not read footer:
java.lang.RuntimeException:
hdfs://ip-10-1-2-243.ec2.internal:8020/path/to/file/_metadata is not a
Parquet file (too small)
After deleting the _metadata file, I can read the rest of the files normally. I would like to know what causes Spark to output an empty _metadata file, and how I can avoid it in the future.
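As a read-side workaround, the truncated _metadata summary file can be removed programmatically before reading, along the lines of this PySpark sketch (the HDFS path and helper name are placeholders). On the write side, older parquet-mr versions can usually be told not to emit summary files at all via the Hadoop property parquet.enable.summary-metadata=false, though that is worth verifying against the Parquet version used by the Oozie job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def drop_empty_metadata(directory):
    # Hypothetical helper: delete <directory>/_metadata if it exists and is
    # empty, so the footer read does not trip over a truncated summary file.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    meta = jvm.org.apache.hadoop.fs.Path(directory + "/_metadata")
    fs = meta.getFileSystem(conf)
    if fs.exists(meta) and fs.getFileStatus(meta).getLen() == 0:
        fs.delete(meta, False)

drop_empty_metadata("hdfs:///path/to/file")  # placeholder directory
df = spark.read.parquet("hdfs:///path/to/file")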
I have a local Spark test cluster, including R, based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When trying to read a parquet file with SparkR, the exception below occurs. Reading a parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same even for non-existent files.
However in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
My initial parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve: re-create the parquet files from the original sources.
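Before re-creating the files, a quick readability check of the directory can confirm the diagnosis; a sketch in current PySpark terms (the SparkR 1.x API above differs), with the mapped path reused as a placeholder:
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

path = "/mappedFolder/myFile.parquet"  # placeholder, matching the path above
try:
    # If the directory holds valid parquet part files, this prints the schema.
    print(spark.read.parquet(path).schema.simpleString())
except AnalysisException as err:
    # Same class of failure as above: the directory exists but contains no
    # usable parquet data or summary files, so the files must be re-created.
    print("Unreadable parquet directory:", err)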