How to save/insert each DStream into a permanent table - apache-spark

I've been facing a problem with Spark Streaming concerning the insertion of the output DStream into a permanent SQL table. I'd like to insert every output DStream (coming from each single batch that Spark processes) into a unique table. I've been using Python with Spark version 1.6.2.
At this point of my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table, without losing any result from any processed batch.
rr = feature_and_label.join(result_zipped)\
.map(lambda x: (x[1][0][0], x[1][1]) )
Each DStream here contains tuples such as (4.0, 0).
I can't use SparkSQL because of the way Spark treats the 'table', that is, as a temporary table, which means the result is lost at every batch.
This is an example of output:
Time: 2016-09-23 00:57:00
(0.0, 2)
Time: 2016-09-23 00:57:01
(4.0, 0)
Time: 2016-09-23 00:57:02
(4.0, 0)
...
As shown above, each batch produces only one DStream. As I said before, I'd like to permanently store these results into a table saved somewhere, and possibly query it at a later time. So my question is:
is there a way to do it?
I'd appreciate it if somebody could help me out with this, but especially tell me whether it is possible or not.
Thank you.

Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results in a permanent table and query those results later is to use one of the various databases in the Spark database ecosystem. There are pros and cons to each, and your use case matters. I'll provide something close to a master list, segmented by:
type of data management, the form the data is stored in, and the connection to Spark.
Database, SQL, Integrated: SnappyData
Database, SQL, Connector: MemSQL, Hana, Kudu, FiloDB, DB2, SQLServer (JDBC), Oracle (JDBC), MySQL (JDBC)
Database, NoSQL, Connector: Cassandra, HBase, Druid, Ampool, Riak, Aerospike, Cloudant
Database, Document, Connector: MongoDB, Couchbase
Database, Graph, Connector: Neo4j, OrientDB
Search, Document, Connector: Elasticsearch, Solr
Data grid, SQL, Connector: Ignite
Data grid, NoSQL, Connector: Infinispan, Hazelcast, Redis
File System, Files, Integrated: HDFS
File System, Files, Connector: S3, Alluxio
Data warehouse, SQL, Connector: Redshift, Snowflake, BigQuery, Aster
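For example, the simplest of these options to sketch is a plain JDBC database such as MySQL. A minimal PySpark sketch (not spelled out in the list above; the JDBC URL, table name, and credentials are placeholders) that appends every micro-batch of the asker's rr DStream to a permanent table could look like this:
from pyspark.sql import SQLContext, Row

jdbc_url = "jdbc:mysql://dbhost:3306/mydb"                    # placeholder connection string
jdbc_props = {"user": "db_user", "password": "db_password"}   # placeholder credentials

def save_batch(time, rdd):
    if rdd.isEmpty():
        return
    sql_ctx = SQLContext.getOrCreate(rdd.context)
    df = sql_ctx.createDataFrame(rdd.map(lambda x: Row(label=x[0], count=x[1])))
    # Append so results from earlier batches are never overwritten
    df.write.jdbc(url=jdbc_url, table="results_table", mode="append", properties=jdbc_props)

rr.foreachRDD(save_batch)
Because the write uses mode="append", every processed batch adds rows to the same permanent table, which can be queried later from the database itself or from Spark.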

Instead of using external connectors, a better option is to go with Spark Structured Streaming.
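As a rough sketch of that suggestion (it assumes Spark 2.x; the source, sink path, checkpoint location, and app name below are placeholders), a Structured Streaming job can append each micro-batch to a durable Parquet sink:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-sink-demo").getOrCreate()

# Placeholder streaming source; any supported source (Kafka, files, sockets) works here
stream = (spark.readStream
          .format("rate")                 # built-in test source emitting rows per second
          .option("rowsPerSecond", 10)
          .load())

query = (stream.writeStream
         .format("parquet")                                         # durable file sink
         .option("path", "hdfs:///data/stream_results")             # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/demo")  # required for recovery
         .outputMode("append")
         .start())

query.awaitTermination()
The results written to the sink path survive across batches and restarts and can be queried later as ordinary Parquet files.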

Related

Can Apache Spark be used in place of Sqoop?

I have tried connecting Spark with JDBC connections to fetch data from MySQL / Teradata or similar RDBMS and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop?
Looking for your valuable answers and explanations.
There are two main things to know about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS no matter what you have, and you don't need to worry much about how your table is configured.
With Spark, using a JDBC connection is a little bit different in how you need to load the data. If your database doesn't have a column like a numeric ID or a timestamp, Spark will load ALL the data into one single partition, and only then try to process and save it. If you have a column you can use for partitioning, then Spark can sometimes be even faster than Sqoop.
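For example, a minimal PySpark sketch of such a partitioned JDBC read (the connection string, table, column, bounds, and partition count are placeholders; the bounds should match the real min/max of the chosen column):
from pyspark.sql import SQLContext

sql_ctx = SQLContext(sc)  # assumes an existing SparkContext `sc`

df = sql_ctx.read.jdbc(
    url="jdbc:mysql://dbhost:3306/mydb",   # placeholder
    table="orders",                        # placeholder
    column="order_id",                     # numeric column used to split the read
    lowerBound=1,
    upperBound=10000000,
    numPartitions=16,
    properties={"user": "db_user", "password": "db_password"})

# With numPartitions=16, Spark issues 16 parallel queries over disjoint ranges of order_id.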
I would recommend you take a look at the documentation for Spark's JDBC data source.
The conclusion is: if you are going to do a simple export that needs to be done daily with no transformation, I would recommend Sqoop, as it is simple to use and will not impact your database that much. Using Spark will work well IF your table is ready for that; otherwise, go with Sqoop.

Spark connecting to HDFS through Hive vs Spark connecting to HDFS directly with Hive on top of it?

Summary of the problem:
I have a particular use case where I need to write >10 GB of data per day to HDFS via Spark Streaming. We are currently in the design phase. We want to write the data to HDFS (a constraint) using Spark Streaming. The data is columnar.
We have 2 options (so far):
Naturally, I would like to use a Hive context to feed data to HDFS. The schema is defined and the data is fed in batches or row-wise.
There is another option. We can directly write data to HDFS thanks to the Spark Streaming API. We are also considering this because in this use case we can then query the data from HDFS through Hive. This will leave options open to use other technologies in the future for new use cases that may come up.
What is best?
Spark Streaming -> Hive -> HDFS -> Consumed by Hive.
VS
Spark Streaming -> HDFS -> Consumed by Hive , or other technologies.
Thanks.
So far I have not found a discussion on the topic; my research may have fallen short. If there is any article you can suggest, I would be most happy to read it.
I have a particular use case to write >10 GB of data per day and the data is columnar
That means you are storing day-wise data. If that's the case, give the Hive table a date partition column so that you can easily query the data for each day. You can query the raw data from BI tools like Looker or Presto, or any other BI tool. If you are querying from Spark, then you can use Hive features/properties. Moreover, if you store the data in a columnar format such as Parquet, Impala can query the data using the Hive metastore.
If your data is columnar, consider Parquet or ORC.
Regarding option 2:
If Hive is an option, there is no need to feed the data through Hive; write it to HDFS, create an external Hive table over that location, and access it from Hive.
Conclusion:
I feel both are much the same, but Hive is preferred considering direct querying of the raw data using BI tools or Spark. From HDFS we can also query the data using Spark. If it's there in formats like JSON, Parquet, or XML, there won't be any added advantage for option 2.
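As a rough sketch of the "write to HDFS, consume via Hive" route (paths, table name, and columns are made-up placeholders), the data can be written as date-partitioned Parquet and exposed through an external Hive table:
from pyspark.sql import HiveContext

hive_ctx = HiveContext(sc)  # assumes an existing SparkContext `sc`

# Placeholder columnar data with a day-wise partition column `dt`
df = hive_ctx.createDataFrame(
    [("u1", 1.5, "2016-09-23"), ("u2", 2.0, "2016-09-23")],
    ["user_id", "value", "dt"])

(df.write
   .mode("append")
   .partitionBy("dt")                 # one directory per day
   .parquet("hdfs:///data/events"))   # placeholder HDFS location

# External table over the same location, so Hive/BI tools can query the raw files
hive_ctx.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (user_id STRING, value DOUBLE)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/events'
""")
hive_ctx.sql("MSCK REPAIR TABLE events")  # register newly written partitions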
It depends on your final use cases. Please consider the two scenarios below while making a decision:
If you have an RT/NRT case and all your data is a full refresh, then I would suggest going with the second approach, Spark Streaming -> HDFS -> Consumed by Hive. It will be faster than your first approach, Spark Streaming -> Hive -> HDFS -> Consumed by Hive, since there is one less layer in it.
If your data is incremental and also has multiple update and delete operations, then it will be difficult to use HDFS, or Hive over HDFS, with Spark, since Spark does not allow updating or deleting data in HDFS. In that case, both your approaches will be difficult to implement. Either you can go with a Hive managed table and do updates/deletes using HQL (only supported in the Hortonworks Hive version), or you can go with a NoSQL database like HBase or Cassandra so that Spark can do upserts and deletes easily. From a programming perspective, it will also be easier compared to both your approaches.
If you dump the data into NoSQL, then you can use Hive over it for normal SQL or reporting purposes.
There are so many tools and approaches available, but go with the one that fits all your cases. :)

What role does Spark SQL play? An in-memory DB?

Recently I came across Spark SQL.
I read the Data Source API and am still confused about what role Spark SQL plays.
When I do SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? Does that mean Spark SQL is only an in-memory DB that works on data already loaded? Or does it scan locally every time?
I'd really welcome any answers.
Best Regards.
I read the Data Source API and am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over the data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table on which you are trying to run the query should be present in either the Hive metastore (i.e., the table should be present in Hive) or it should be a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a dataframe df and you want to run SQL queries over it, you can either use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I do SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be stored in memory and the operations will be considerably faster compared to the case where the data is fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled in by the action, which means that it will cache as many partitions as it has to read during that action. This is what makes your second call much faster than your first one.
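For example, a small sketch of caching a temporary view so that repeated SQL queries are served from memory rather than re-scanning the source (the source path and column name are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-cache-demo").getOrCreate()

df = spark.read.parquet("hdfs:///data/events")   # placeholder source
df.createOrReplaceTempView("temp_table")

spark.catalog.cacheTable("temp_table")                         # caching is lazy
spark.sql("SELECT COUNT(*) FROM temp_table").show()            # first action scans the source and fills the cache
spark.sql("SELECT * FROM temp_table WHERE value > 1").show()   # subsequent queries read from the cache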

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data (around 10 TB) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise (DSE) 5.0.
Is Sqoop still the recommended way to do this in DSE 5.0? As per the release notes (http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html):
Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)
So should I use Spark SQL with JDBC data source to load data from Sybase, and then save the data frame to a Cassandra table?
Is there a better way to do this? Any help/suggestions will be appreciated.
Edit: As per DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from spark is not supported.
The following Spark features and APIs are not supported:
Writing to blob columns from Spark
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.
Spark is preferred for the ETL of large data sets because it performs a distributed ingest. Oracle data can be loaded into Spark RDDs or DataFrames and then written out with saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation, Using Spark to Load Oracle Data into Cassandra by Jim Hatcher, which discusses this topic in depth and provides examples.
Sqoop is deprecated but should still work in DSE 5.0. If it's a one-time load and you're already comfortable with Sqoop, try that.
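A minimal sketch of the Spark SQL route (the Sybase jConnect URL format and driver class are assumptions that depend on your driver version, the table/keyspace names are placeholders, and the blob-column limitation from the question's edit still applies to the Cassandra write):
from pyspark.sql import SQLContext

sql_ctx = SQLContext(sc)  # assumes an existing SparkContext, e.g. from `dse pyspark`

src = sql_ctx.read.jdbc(
    url="jdbc:sybase:Tds:sybase-host:5000/mydb",   # placeholder jConnect URL
    table="source_table",                          # placeholder
    properties={"user": "db_user", "password": "db_password",
                "driver": "com.sybase.jdbc4.jdbc.SybDriver"})  # version-dependent driver class

(src.write
    .format("org.apache.spark.sql.cassandra")          # DataStax spark-cassandra-connector
    .options(keyspace="my_keyspace", table="target")   # placeholders
    .mode("append")
    .save())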

How to use Apache Spark as a query engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the JAR files using spark-submit to get the results. As the size of the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time just to get the results for different queries.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the JAR multiple times;
whether we could use Spark as a search engine / query engine;
whether we can load the data into a DataFrame once and query the DataFrame remotely using a REST API.
The current configuration of my Spark deployment is:
5-node cluster.
Runs on YARN RM.
I have tried to use spark-jobserver, but it also runs the job every time.
You might be interested in HiveThriftServer and Spark integration.
Basically you start a Hive Thrift Server and inject your HiveContext built from the SparkContext:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Build a HiveContext from the existing SparkContext and choose the Thrift port
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")

// Register the long-lived DataFrame and expose it through the Thrift server
dataFrame.registerTempTable("myTable")
HiveThriftServer2.startWithContext(sql)
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
Including CLI tool - beeline
Reference:
https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr
You can also use Spark + Kafka streaming integration; you just have to send your queries over Kafka for the streaming APIs to pick up. That's one design pattern that is quickly catching on in the market because of its simplicity.
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Get the SQL from your Kafka topic.
Execute the query over the already created Datasets.
This should take care of your use case.
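A rough PySpark sketch of that pattern (it assumes Spark 2.4+ for foreachBatch; the broker, topic, and data path are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-query-runner").getOrCreate()

# Long-lived lookup data, registered once
spark.read.parquet("hdfs:///data/lookup").createOrReplaceTempView("lookup")

queries = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
           .option("subscribe", "sql-queries")                 # placeholder topic
           .load()
           .selectExpr("CAST(value AS STRING) AS query"))

def run_queries(batch_df, batch_id):
    # Execute each incoming SQL string against the registered view
    for row in batch_df.collect():
        spark.sql(row["query"]).show()

queries.writeStream.foreachBatch(run_queries).start().awaitTermination()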
Hope this helps!
For the Spark search engine part: if you require full-text search capabilities and/or document-level scoring, and you do not have an Elasticsearch infrastructure, you can give Spark Search a try; it brings Apache Lucene support to Spark.
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD
  .searchList("(firstName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)
