Saving dataframe to local file system results in empty results - apache-spark

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to HDFS and then reading it back:
scala> df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local Parquet or CSV file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, with both CSV and Parquet, on two different EMR clusters: the same behavior is exhibited in every case.
Is this an EMR-specific bug? A more general EC2 bug? Something else? The same code works with Spark on macOS.
In case it matters - here is the versioning info:
Release label: emr-5.13.0
Hadoop distribution: Amazon 2.8.3
Applications: Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3

That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because there the driver and executors share the same file system).
A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Each executor therefore writes its own chunk to its own local file system.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but, depending on the commit algorithm, it might not even be finalized (moved out of the temporary directory).
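If the goal is a single file on the driver's local disk, one workaround (a minimal sketch, reusing the /tmp/topVendors path from the question) is to write to distributed storage first and then copy the committed output down with the Hadoop FileSystem API:
// Write to HDFS (or S3) so that every executor writes to shared storage.
df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")
// Then copy the committed output to the driver's local file system.
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(new Path("/tmp/topVendors"), new Path("file:///tmp/topVendors"))
// For a result this small (4067 rows), df.collect() on the driver is another option.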

This error usually occurs when you try to read an empty directory as Parquet.
You could check:
1. whether the DataFrame is empty, e.g. with outcome.rdd.isEmpty(), before writing it;
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are currently running in cluster mode.
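A minimal sketch of such a guard, assuming the df and paths from the question (note this only catches the empty-input case; it does not address the executor-local-write issue described in the previous answer):
// Only attempt the write when there is actually data to write.
if (df.rdd.isEmpty()) {
  println("DataFrame is empty, nothing to write")
} else {
  df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
}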

Related

Apache Beam AvroIO read large file OOM

Problem:
I am writing an Apache Beam pipeline to convert Avro files to Parquet files (with the Spark runner). Everything works well until I start converting large Avro files (15 GB).
The code used to read the Avro file and create the PCollection:
PCollection<GenericRecord> records =
    p.apply(FileIO.match().filepattern(s3BucketUrl + inputFilePattern))
     .apply(FileIO.readMatches())
     .apply(AvroIO.readFilesGenericRecords(inputSchema));
The error message from my entrypoint shell script is:
b'/app/entrypoint.sh: line 42: 8 Killed java -XX:MaxRAM=${MAX_RAM} -XX:MaxRAMFraction=1 -cp /usr/share/tink-analytics-avro-to-parquet/avro-to-parquet-deploy-task.jar
Hypothesis
After some investigation, I suspect that the AvroIO code above tries to load the whole Avro file as one partition, which causes the OOM issue.
One hypothesis I have is: if I can specify the number of partitions when reading the Avro file, let's say 100 partitions for example, then each partition will contain only about 150 MB of data, which should avoid the OOM issue.
My questions are:
Does this hypothesis lead me in the right direction?
If so, how can I specify the number of partitions while reading the Avro file?
Instead of setting the number of partitions, the Spark session has a property called spark.sql.files.maxPartitionBytes, which is set to 128 MB by default (see the Spark SQL configuration reference).
Spark uses this number to partition the input file(s) while reading them into memory.
I tested with a 50 GB Avro file and Spark partitioned it into 403 partitions. This Avro-to-Parquet conversion worked on a Spark cluster with 16 GB of memory and 4 cores.
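If you do want partitions smaller than the default, the property can be lowered on the Spark session that backs the runner; a minimal Scala sketch (the 64 MB value is only an illustration):
// Cap the amount of data Spark packs into a single input partition
// (the default is 128 MB, i.e. 134217728 bytes).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)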

Spark creates extra partitions inside partition

I have a Spark program that reads data from a text file as an RDD and converts it into a Parquet file using Spark SQL, partitioned by a single partition key. Once in a while, instead of creating one partition it creates two, that is, a partition inside a partition.
My data is partitioned on date and the output folder is in s3://datalake/intk/parquetdata.
After the Spark job is run I am seeing output as :
s3://datalake/intk/parquetdata/datekey=102018/a.parquet
s3://datalake/intk/parquetdata/datekey=102118/a.parquet
s3://datalake/intk/parquetdata/datekey=102218/datekey=102218/a.parquet
Code snippet:
val savePath = "s3://datalake/intk/parquetdata/"
val writeDF = InputDF.write
  .mode(savemode)
  .partitionBy(partitionKey)
  .parquet(savePath)
I am running my Spark job on an EMR cluster (release 5.16) with Spark 2.2 and Scala 2.11, and the output location is S3. I am not sure why this happens; it does not follow any pattern and the nested partition only occurs once in a while.

Accessing HDFS thru spark-scala program

I have written a simple program to join the orders and order_items files which are in HDFS.
My Code to read the data:
val orders = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/orders/part-00000")
val orderItems = sc.textFile ("hdfs://quickstart.cloudera:8022/user/root/retail_db/order_items/part-00000")
I got the below exception:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://quickstart.cloudera:8020/user/root/retail_db, expected: file:///
Can you please let me know the issue here? Thanks!!
You are currently using the Cloudera Quickstart VM, which most likely means you are running Spark 1.6, as that is the parcel that can be installed directly from Cloudera Manager and the default version for CDH 5.x.
If that is the case, Spark on YARN points to HDFS by default, so you don't need to specify the hdfs:// scheme.
Simply do this:
val orderItems = sc.textFile ("/user/cloudera/retail_db/order_items/part-00000")
Note that I also changed the path to /user/cloudera. Make sure your current user has permissions on it.
The hdfs:// scheme is only needed if you are using Spark standalone.
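If you are unsure which file system an unqualified path resolves to, you can check the default from the shell; a small sketch using the standard Hadoop property fs.defaultFS:
// Print the file system that unqualified paths are resolved against.
println(sc.hadoopConfiguration.get("fs.defaultFS"))
// If this prints hdfs://quickstart.cloudera:8020, then "/user/cloudera/..." already points at HDFS.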

Spark Standalone cluster cannot read the files in local filesystem

I have a Spark standalone cluster having 2 worker nodes and 1 master node.
Using spark-shell, I was able to read data from a file on the local filesystem, then did some transformations and saved the final RDD in /home/output (let's say).
The RDD was saved successfully, but only on one worker node; on the master node only the _SUCCESS file was there.
Now, when I try to read this output back from /home/output, I get no data: the read finds 0 records on the master, and I assume it does not check the other worker nodes for the data.
It would be great if someone could shed some light on why Spark is not reading from all the worker nodes, or on the mechanism Spark uses to read data from worker nodes.
scala> sc.textFile("/home/output/")
res7: org.apache.spark.rdd.RDD[(String, String)] = /home/output/ MapPartitionsRDD[5] at wholeTextFiles at <console>:25
scala> res7.count
res8: Long = 0
SparkContext, i.e. sc, by default picks up the configuration in HADOOP_CONF_DIR. The default file system there is generally hdfs://, which means that when you say sc.textFile("/home/output/") it searches for the file/dir as hdfs:///home/output, which in your case is not present on HDFS. file:// points to the local filesystem.
Try sc.textFile("file:///home/output"), thus explicitly telling Spark to read from the local filesystem.
You should also put the file on every worker machine with the same path and name.
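A more robust alternative (a sketch, assuming HDFS or other shared storage is reachable from the cluster, and finalRdd is the RDD produced by your transformations) is to save the result to shared storage so every node sees the same data:
// Save to shared storage instead of a node-local path.
finalRdd.saveAsTextFile("hdfs:///user/spark/output")
// Reading it back then returns the full data set from any node.
val readBack = sc.textFile("hdfs:///user/spark/output")
println(readBack.count())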

Why does reading from Hive fail with "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found"?

I use Spark v1.6.1 and Hive v1.2.x with Python v2.7
For Hive, I have some tables (ORC files) stored in HDFS and some stored in S3. When we try to join two tables, one in HDFS and the other in S3, a java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found is thrown.
For example, this works when querying against a Hive table in HDFS:
df1 = sqlContext.sql('select * from hdfs_db.tbl1')
This also works when querying against a Hive table in S3:
df2 = sqlContext.sql('select * from s3_db.tbl2')
The code below, however, throws the above RuntimeException:
sql = """
select *
from hdfs_db.tbl1 a
join s3_db.tbl2 b on a.id = b.id
"""
df3 = sqlContext.sql(sql)
We are migrating from HDFS to S3, which is why there is a difference in the storage backing the Hive tables (basically, ORC files in HDFS and in S3). One interesting thing is that if we use DBeaver or the beeline client to connect to Hive and issue the join query, it works. I can also use sqlalchemy to issue the join query and get results. The problem only shows up with Spark's sqlContext.
More information on execution and environment: this code is executed in a Jupyter notebook on an edge node (that already has Spark, Hadoop, Hive, Tez, etc. set up and configured). The Python environment is managed by conda for Python v2.7. Jupyter is started with pyspark as follows.
IPYTHON_OPTS="notebook --port 7005 --notebook-dir='~/' --ip='*' --no-browser" \
pyspark \
--queue default \
--master yarn-client
When I go to the Spark application UI under Environment, the Classpath Entries section shows the following:
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-api-jdo-3.2.6.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-core-3.2.10.jar
/usr/hdp/2.4.2.0-258/spark/lib/datanucleus-rdbms-3.2.9.jar
/usr/hdp/2.4.2.0-258/spark/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar
/usr/hdp/current/hadoop-client/conf/
/usr/hdp/current/spark-historyserver/conf/
The sun.boot.class.path has the following value: /usr/jdk64/jdk1.8.0_60/jre/lib/resources.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/rt.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/sunrsasign.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jsse.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jce.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/charsets.jar:/usr/jdk64/jdk1.8.0_60/jre/lib/jfr.jar:/usr/jdk64/jdk1.8.0_60/jre/classes.
The spark.executorEnv.PYTHONPATH has the following value: /usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip:/usr/hdp/2.4.2.0-258/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip.
The Hadoop distribution is via CDH: Hadoop 2.7.1.2.4.2.0-258
Quoting Steve Loughran (who, given his track record in Spark development, seems to be the source of truth on the topic of accessing S3 file systems) from SPARK-15965 "No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1":
This is being fixed with tests in my work in SPARK-7481; the manual
workaround is
Spark 1.6+
This needs my patch and a rebuild of the Spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop Spark from starting, unless you move up to Hadoop 2.7.3.
There are also some other sources where you can find workarounds:
https://issues.apache.org/jira/browse/SPARK-7442
https://community.mapr.com/thread/9635
How to access s3a:// files from Apache Spark?
https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html
https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html
I've got no environment (or experience) to give the above a shot, so after you try it, please report back so we all have a better understanding of the current situation regarding S3 support in Spark. Thanks.
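As a quick diagnostic (a sketch only, not a fix), you can check from a Scala shell running on the same classpath whether the S3A class is visible at all; if this throws ClassNotFoundException, the hadoop-aws JAR and its matching AWS SDK dependency are simply missing from the classpath:
// Throws ClassNotFoundException when hadoop-aws (and the matching AWS SDK)
// is not on the driver classpath; returns the Class object otherwise.
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")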
