Spark not using spark.sql.parquet.compression.codec - apache-spark

I'm comparing Spark's Parquet files with Apache Drill's.
Drill's Parquet files are far more lightweight than Spark's. Spark uses GZIP as the default compression codec; to experiment, I tried changing it to:
snappy: same size
uncompressed: same size
lzo: exception
I tried both ways:
sqlContext.sql("SET spark.sql.parquet.compression.codec=uncompressed")
sqlContext.setConf("spark.sql.parquet.compression.codec.", "uncompressed")
But it seems like it doesn't change the setting.

This worked for me in Spark 2.1.1:
df.write.option("compression","snappy").parquet(filename)

For Spark 1.3, the spark.sql.parquet.compression.codec parameter did not compress the output, but the one below did work:
sqlContext.sql("SET parquet.compression=SNAPPY")

Try this. It seems to work for me in 1.6.0:
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")

For Spark 1.6: you can use different compression codecs. Try:
sqlContext.setConf("spark.sql.parquet.compression.codec","gzip")
sqlContext.setConf("spark.sql.parquet.compression.codec","lzo")
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed")

Try:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
I see you already did this, but I'm unable to delete my answer on mobile. Try setting this before creating the SQLContext, as suggested in the comment.

When facing issues while storing into Hive via a HiveContext, use:
hc.sql("set parquet.compression=snappy")

Related

How to export a Datastax graph based on a specific traversal using DseGraphFrame

I would like to export a DSE graph via a Spark job, as per
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameExport.html
All of this works fine within the spark-shell, but I want to do it in Java using DseGraphFrame.
Unfortunately there is not much in the documentation.
I am able to pack a jar with the following code and do a spark-submit:
SparkSession spark = SparkSession
.builder()
.appName("Datastax Java example")
.getOrCreate();
DseGraphFrame dseGraphFrame = DseGraphFrameBuilder.dseGraph(args[0], spark);
DataFrameWriter dataFrameWriter = dseGraphFrame.V().df().write();
dataFrameWriter.csv("vertices");
The above works fine; what I want to do is use a specific traversal to filter what I export. That is, use something like this:
dseGraphFrame.V().hasLabel("label").df().write();
The above does not work, as dseGraphFrame.V().hasLabel("label") does not have .df().
Is this the correct way of doing things?
Any help would be appreciated.
A late answer to this question, perhaps still of use:
In Java, you need to cast the traversal to a DseGraphTraversal first. It can then be converted to a DataFrame with the .df() method:
((DseGraphTraversal)dseGraphFrame.V().hasLabel("label")).df().write();

Saving empty DataFrame with known schema (Spark 2.2.1)

Is it possible to save an empty DataFrame with a known schema such that the schema is written to the file, even though it has 0 records?
def example(spark: SparkSession, path: String, schema: StructType) = {
val dataframe = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
val dataframeWriter = dataframe.write.mode(SaveMode.Overwrite).format("parquet")
dataframeWriter.save(path)
spark.read.load(path) // ERROR!! No files to read, so schema unknown
}
This is the answer I received from Databricks Support:
This is actually a known issue in Spark. A fix has already been made in the open-source JIRA: https://issues.apache.org/jira/browse/SPARK-23271.
For more details on how this behavior will change from 2.4, please check this doc change:
https://github.com/apache/spark/pull/20525/files#diff-d8aa7a37d17a1227cba38c99f9f22511R1808
The behavior will change from Spark 2.4. Until then you need to use one of the following workarounds:
Save a DataFrame with at least one record to preserve its schema
Save the schema in a JSON file and use it later (see the sketch below)
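A minimal sketch of the second workaround, with hypothetical helper names; it relies on StructType.json and DataType.fromJson, which Spark provides:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

// Serialize the schema so it can be stored in a small text/JSON file.
def schemaToJson(schema: StructType): String = schema.json

// Later, rebuild the schema and read the (possibly empty) data with it.
// With an explicit schema Spark has nothing to infer, so an empty output
// directory no longer causes a failure.
def readWithSavedSchema(spark: SparkSession, path: String, schemaJson: String) = {
  val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
  spark.read.schema(schema).parquet(path)
}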
I had a similar problem with Spark 2.1.0. I solved it by repartitioning before writing:
df.repartition(1).write.parquet("my/path")

How to specify Hadoop Configuration when reading CSV

I am using Spark 2.0.2. How can I specify the Hadoop configuration item textinputformat.record.delimiter for the TextInputFormat class when reading a CSV file into a Dataset?
In Java I can write: spark.read().csv(<path>); However, there doesn't seem to be a way to provide a Hadoop configuration specific to the read.
It is possible to set the item using spark.sparkContext().hadoopConfiguration(), but that is global.
Thanks,
You cannot. The Data Source API uses its own configuration which, as of 2.0, is not even compatible with the Hadoop configuration.
If you want to use a custom input format or other Hadoop configuration, use SparkContext.hadoopFile, SparkContext.newAPIHadoopRDD, or related methods; see the sketch below.
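For the record-delimiter case specifically, here is a Scala sketch of that SparkContext-level route, assuming an existing SparkSession named spark (the path and the delimiter value are placeholders; the same calls exist on JavaSparkContext):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the global Hadoop configuration and override only the delimiter for this read.
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001")

val records = spark.sparkContext
  .newAPIHadoopFile("/hdfs/file/location", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString } // one element per custom-delimited record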
The delimiter can be set using option() in Spark 2.0:
var df = spark.read.option("header", "true").option("delimiter", "\t").csv("/hdfs/file/location")

Read multiple files with SparkSession in Spark 2.0

In Spark 1.6, to read multiple files, I used:
JavaSparkContext ctx;
ctx.textFile(filePaths);
where filePaths lists the paths to the files, for example:
/home/user/folderA/0.log,/home/user/folderB/0.log. Each path is separated by a comma.
But when I upgraded to Spark 2.0, the method
SparkSession sparkSession;
sparkSession.read().textFile(filePaths);
doesn't work. The code throws an exception: Path does not exist.
Question: is there any way to read multiple files from multiple paths in Spark 2.0, just like in Spark 1.6?
Edit: I tried calling the method as in Spark 1.6, using:
sparkSession.sparkContext().textFile(filePaths, 1).toJavaRDD();
That solved the problem. But is there another solution?
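One alternative in Spark 2.x, sketched below under the assumption that the paths stay in one comma-joined string as in the question: split the string and pass each path as a separate argument, since DataFrameReader.textFile accepts varargs paths (the same overload exists in the Java API).
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().appName("multi-path-read").getOrCreate()

// Split the comma-joined string and hand every path to textFile separately.
val filePaths = "/home/user/folderA/0.log,/home/user/folderB/0.log"
val ds = sparkSession.read.textFile(filePaths.split(","): _*)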

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it always throws the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but the data is not read yet. Spark only starts reading and processing the data when you ask for some kind of output (like df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one reason or another.
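A quick way to check this from the shell (a sketch; it only tells you about the driver classpath, while the error can also come from the executors):
scala> Class.forName("org.apache.avro.mapred.AvroWrapper") // throws ClassNotFoundException if avro-mapred is missing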
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.
