Read Avro data using Spark Dataset in Java - apache-spark

I am a newbie to Spark and am trying to load Avro data into a Spark Dataset (Spark 1.6) using Java. I see some examples in Scala but not in Java.
Any pointers to examples in Java would be helpful. I tried to create a JavaRDD and then convert it to a Dataset, but I believe there must be a more straightforward way.

First of all, you need to set hadoop.home.dir:
System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");
Then create a SparkSession; in my code I am using an embedded Spark environment:
SparkSession spark = SparkSession.builder()
    .master("local")
    .appName("ASH")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
    .getOrCreate();
Next, read the Avro file into a DataFrame:
// Creates a DataFrame from the specified Avro file
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.avro")
    .load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
Hope this helps.

Related

Write Spark DataFrame to Hive-accessible table in HDP 2.6

I know there are already lots of answers on writing to Hive from Spark, but none of them seem to work for me. So first, some background: this is an older cluster, running HDP 2.6, that's Hive 2 and Spark 2.1.
Here is an example program:
case class Record(key: Int, value: String)
val spark = SparkSession.builder()
.appName("Test App")
.config("spark.sql.warehouse.dir", "/app/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.saveAsTable("records_table")
If I log into the spark-shell and run that code, a new table called records_table shows up in Hive. However, if I deploy that code in a jar and submit it to the cluster using spark-submit, I see the table show up in the same HDFS location as all of the other Hive tables, but it's not accessible to Hive.
I know that in HDP 3.1 you have to use a HiveWarehouseConnector class, but I can't find any reference to that in HDP 2.6. Some people have mentioned the HiveContext class, while others say to just use the enableHiveSupport call on the SparkSession builder. I have tried both approaches, but neither seems to work. I have tried saveAsTable. I have tried insertInto. I have even tried creating a temp view and then hiveContext.sql("create table if not exists mytable as select * from tmptable"). With each attempt, I get a parquet file in hdfs:/apps/hive/warehouse, but I cannot access that table from Hive itself.
Based on the information provided, here is what I suggest you do.
Create a SparkSession; enableHiveSupport is required:
val spark = SparkSession.builder()
.appName("Test App")
.enableHiveSupport()
.getOrCreate()
Next, execute the DDL for the resultant table via spark.sql:
val ddlStr: String =
s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
|ROW FORMAT SERDE
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
|STORED AS INPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
|OUTPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
|LOCATION '$hdfsLocation'""".stripMargin
spark.sql(ddlStr)
Write the data as per your use case:
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.format("orc").insertInto("records_table")
Notes:
The behaviour will be the same for spark-shell and spark-submit.
Partitioning can be defined in the DDL, so do not use partitionBy while writing the DataFrame (see the sketch after these notes).
Bucketing/clustering is not supported.
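For example, a partitioned variant of the DDL above could look like the sketch below. This is only a sketch: the partition column event_date is an assumption (use whatever column you actually partition by), and it reuses the hdfsLocation placeholder from above.
val partitionedDdlStr: String =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
     |PARTITIONED BY (event_date string)
     |STORED AS ORC
     |LOCATION '$hdfsLocation'""".stripMargin
spark.sql(partitionedDdlStr)
When inserting into a partitioned table this way, you may also need to enable Hive dynamic partitioning (hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict).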
Hope this helps/ Cheers.

How to insert spark structured streaming DataFrame to Hive external table/location?

I have a query on Spark Structured Streaming integration with a Hive table.
I have tried out some examples of Spark Structured Streaming.
Here is my example:
val spark = SparkSession.builder().appName("StatsAnalyzer")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
.getOrCreate()
// Register the dataframe as a Hive table
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta")
csvDF.createOrReplaceTempView("updates")
val query= spark.sql("insert into table_abcd select * from updates")
query.writeStream.start()
As you can see, in the last step, while writing the data frame to the HDFS location, the data is not getting inserted into the existing directory (my existing directory has some old data partitioned by "age").
I am getting:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Can you help me understand why I am not able to insert data into the existing directory in the HDFS location? Or is there another way to do an "insert into" operation on a Hive table?
Looking for a solution
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.
scala> println(spark.version)
2.4.0
val sq = spark.readStream.format("rate").load
scala> :type sq
org.apache.spark.sql.DataFrame
scala> assert(sq.isStreaming)
scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
... 49 elided
If a target system (aka sink) is not supported, you could use the foreach and foreachBatch operations (highlighting mine):
The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
I think foreachBatch is your best bet.
import org.apache.spark.sql.DataFrame
sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
// do whatever you want with your input DataFrame
// incl. writing to Hive
// I simply decided to print out the rows to the console
ds.show
}.start
There is also the Apache Hive Warehouse Connector, which I've never worked with, but it seems like it may be of some help.
On HDP 3.1 with Spark 2.3.2 and Hive 3.1.0 we have used Hortonworks' spark-llap library to write a structured streaming DataFrame from Spark to Hive. On GitHub you will find some documentation on its usage.
The required library hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar is available on Maven and needs to be passed in on the spark-submit command. There are many more recent versions of that library, although I haven't had the chance to test them.
After creating the Hive table manually (e.g. through beeline/Hive shell) you could apply the following code:
import com.hortonworks.hwc.HiveWarehouseSession
val csvDF = spark.readStream.[...].load()
val query = csvDF.writeStream
.format(HiveWarehouseSession.STREAM_TO_STREAM)
.option("database", "database_name")
.option("table", "table_name")
.option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/path/to/checkpoint/dir")
.start()
query.awaitTermination()
Just in case someone actually tried the code from Jacek Laskowski: it does not really compile in Spark 2.4.0 (check my gist, tested on AWS EMR 5.20.0 and vanilla Spark). So I guess that was his idea of how it should work in some future Spark version.
The real code is:
scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset
scala> sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => batchDs.show).start
res0: org.apache.spark.sql.streaming.StreamingQuery =
org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ebc0bf5
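To actually get the data into Hive from inside foreachBatch, one option is a plain batch write per micro-batch. The snippet below is only a sketch: it assumes the target table table_abcd from the question already exists, that the SparkSession was built with enableHiveSupport, and that the table is compatible with Spark's ORC writer.
sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => {
  // append this micro-batch to the existing Hive table
  batchDs.toDF().write.mode("append").format("orc").saveAsTable("table_abcd")
}).start
If the table was created as a classic Hive SerDe table, replacing the saveAsTable call with insertInto("table_abcd") on a matching schema may be the better fit.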

Why can't we create an RDD using Spark session

We see that,
Spark context available as 'sc'.
Spark session available as 'spark'.
I read that SparkSession includes SparkContext, StreamingContext, HiveContext ... If so, then why are we not able to create an RDD by using a SparkSession instead of a SparkContext?
scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
val a = spark.textFile("Sample.txt")
As shown above, sc.textFile succeeds in creating an RDD but not spark.textFile.
In Spark 2+, Spark Context is available via Spark Session, so all you need to do is:
spark.sparkContext().textFile(yourFileOrURL)
see the documentation on this access method here.
Note that in PySpark this would become:
spark.sparkContext.textFile(yourFileOrURL)
see the documentation here.
In earlier versions of Spark, SparkContext was the entry point for Spark. As RDD was the main API, RDDs were created and manipulated using the context's APIs.
For every other API we needed to use a different context: for streaming we needed StreamingContext, for SQL SQLContext, and for Hive HiveContext.
But as the Dataset and DataFrame APIs became the new standard, Spark needed an entry point built for them. So in Spark 2.0, Spark has a new entry point for the Dataset and DataFrame APIs called SparkSession.
SparkSession is essentially a combination of SQLContext, HiveContext and a future StreamingContext.
All the APIs available on those contexts are also available on SparkSession. SparkSession internally has a SparkContext for the actual computation.
sparkContext still contains the methods it had in previous versions.
The methods of SparkSession can be found here.
A DataFrame can be created in the following way:
val a = spark.read.text("wc.txt")
This will create a DataFrame. If you want to convert it to an RDD, then use a.rdd.
Please refer to the link below on the Dataset API:
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html
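Putting the two routes together (a minimal sketch, reusing Sample.txt from the question; note the different element types):
// via the SparkContext held by the SparkSession: yields an RDD[String]
val rdd1 = spark.sparkContext.textFile("Sample.txt")
// via the DataFrame reader, then down to an RDD: yields an RDD[Row]
val rdd2 = spark.read.text("Sample.txt").rdd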

Spark - Reading JSON from Partitioned Folders using Firehose

Kinesis Firehose manages the persistence of files, in this case time series JSON, into a folder hierarchy that is partitioned by YYYY/MM/DD/HH (down to the hour, in 24-hour numbering)... great.
How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader?
My next goal is for this to be a streaming DF, where new files persisted by Firehose into S3 naturally become part of the streaming DataFrame using the new Structured Streaming in Spark 2.0. I know this is all experimental - hoping someone has used S3 as a streaming file source before, where the data is partitioned into folders as described above. Of course I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
NB: I am using Databricks, which mounts S3 into DBFS, but it could easily be EMR or another Spark provider, of course. It would be great to see a notebook too, if one is shareable, that gives an example.
Cheers!
Can I read nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an option to the DataFrame reader?
Yes, as your directory structure is regular (YYYY/MM/DD/HH), you can give the path down to the leaf nodes with wildcard characters, as below:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate
val jsonDf = spark.read.json("base/path/*/*/*/*/*.json")
// Here */*/*/*/*.json maps to YYYY/MM/DD/HH/filename.json
Of course, I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
I could see there is a library for Kinesis integration with Spark Streaming. So, you can read the streaming data directly and perform SQL operations on it without reading from S3.
groupId = org.apache.spark
artifactId = spark-streaming-kinesis-asl_2.11
version = 2.0.0
Sample code with Spark Streaming and SQL
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
val kinesisStream = KinesisUtils.createStream(
streamingContext, [Kinesis app name], [Kinesis stream name], [endpoint URL],
[region name], [initial position], [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)
kinesisStream.foreachRDD { rdd =>
// Get the singleton instance of SparkSession
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
// Convert RDD[String] to DataFrame
val jsonDf = rdd.toDF() // or rdd.toDF("specify schema/columns here")
// Create a temporary view with DataFrame
jsonDf.createOrReplaceTempView("json_data_tbl")
// As we have the DataFrame and SparkSession objects, we can perform most
// of the Spark SQL operations here, for example:
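// (illustrative only - the temporary view json_data_tbl is registered just above)
val totals = spark.sql("select count(*) as total from json_data_tbl")
totals.show()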
}
Full disclosure: I work for Databricks but I do not represent them on Stack Overflow.
How, using Spark 2.0, can I read these nested subfolders and create a static DataFrame from all the leaf JSON files? Is there an 'option' for the DataFrame reader?
DataFrameReader supports loading a sequence of paths. See the documentation for def json(paths: String*): DataFrame. You can specify the sequence explicitly, use a globbing pattern, or build it programmatically (recommended):
val inputPathSeq = Seq[String]("/mnt/myles/structured-streaming/2016/12/18/02", "/mnt/myles/structured-streaming/2016/12/18/03")
val inputPathGlob = "/mnt/myles/structured-streaming/2016/12/18/*"
val basePath = "/mnt/myles/structured-streaming/2016/12/18/0"
val inputPathList = (2 to 4).toList.map(basePath+_+"/*.json")
I know this is all experimental - hoping someone has used S3 as a streaming file source before, where the data is partitioned into folders as described above. Of course I would prefer to read straight off a Kinesis stream, but there is no date on this connector for 2.0, so Firehose->S3 is the interim.
Since you're using DBFS, I'm going to assume the S3 buckets where data are streaming from Firehose are already mounted to DBFS. Check out Databricks documentation if you need help mounting your S3 bucket to DBFS. Once you have your input path described above, you can simply load the files into a static or streaming dataframe:
Static
val staticInputDF =
spark
.read
.schema(jsonSchema)
.json(inputPathSeq : _*)
staticInputDF.isStreaming
res: Boolean = false
Streaming
val streamingInputDF =
spark
.readStream // `readStream` instead of `read` for creating streaming DataFrame
.schema(jsonSchema) // Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) // Treat a sequence of files as a stream by picking one file at a time
.json(inputPathSeq : _*)
streamingInputDF.isStreaming
res: Boolean = true
Most of this is taken straight from Databricks documentation on Structured Streaming. There is even a notebook example you can import into Databricks directly.

Converting CSV to ORC with Spark

I've seen this blog post by Hortonworks about support for ORC in Spark 1.2 through data sources.
It covers version 1.2 and it addresses the issue of creating the ORC file from objects, not conversion from CSV to ORC.
I have also seen ways, as intended, to do these conversions in Hive.
Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC, and then load it back as a data frame in Spark?
I'm going to omit the CSV reading part because that question has been answered quite a lot of times before, and plenty of tutorials are available on the web for that purpose; it would be overkill to write it all again. Check here if you want!
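For reference only, here is a minimal sketch of that step in Spark 1.6, assuming the Databricks spark-csv package (com.databricks:spark-csv) is on the classpath; the path and the options are placeholders:
// read a CSV file into a DataFrame with the spark-csv data source
val csvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // let Spark infer the column types
  .load("path/to/input.csv")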
ORC support:
Concerning ORCs, they are supported with the HiveContext.
HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC, window functions and other features depend on HiveContext, which reads its configuration from hive-site.xml on the classpath.
You can define a HiveContext as follows:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
If you are working with the spark-shell, you can directly use sqlContext for such purpose without creating a hiveContext since by default, sqlContext is created as a HiveContext.
Specifying stored as orc at the end of the SQL statement below ensures that the Hive table is stored in the ORC format, e.g.:
val df : DataFrame = ???
df.registerTempTable("csv_table")
hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
Saving as an ORC file
Let's persist the DataFrame as ORC files (you could also insert it into the Hive table created above with hiveContext.sql("insert into table orc_table select * from csv_table")).
df.write.format("orc").save("data_orc")
To store the output under the Hive warehouse rather than in the user directory, use the path /apps/hive/warehouse/data_orc instead (the Hive warehouse path from hive-default.xml).
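Finally, to load the ORC output back as a data frame, read the same path back (a minimal sketch; the path matches the save call above):
val orcDF = hiveContext.read.format("orc").load("data_orc")
orcDF.printSchema()
orcDF.show()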
