How to use Apache Spark as a query engine? - apache-spark

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat-file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the JAR files using spark-submit to get the results. Since the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time just to get results for a different query.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the JAR multiple times;
a way to use Spark as a search engine / query engine;
a way to load the data into a DataFrame once and query it remotely, e.g. through a REST API.
> The current configuration of my Spark deployment is:
5-node cluster.
Runs on YARN RM.
I have tried spark-jobserver, but it also runs the job every time.

You might be interested in the HiveThriftServer and Spark integration.
Basically, you start a Hive Thrift Server and inject a HiveContext built from your SparkContext:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
...
// Build a HiveContext on the existing SparkContext and choose the Thrift port
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")
...
// Register the loaded DataFrame as a table and expose it through the Thrift server
dataFrame.registerTempTable("myTable")
HiveThriftServer2.startWithContext(sql)
...
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
These include the CLI tool beeline.
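For example, beeline can connect with beeline -u jdbc:hive2://localhost:10001. If you prefer to query programmatically, here is a minimal Scala sketch using the Hive JDBC driver (the driver must be on the classpath; host, port, credentials, and table name are placeholders matching the snippet above):
import java.sql.DriverManager

// Load the Hive JDBC driver and connect to the Thrift server started above
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://driver-host:10001", "user", "")
val stmt = conn.createStatement()
// Run a query remotely against the registered DataFrame without any spark-submit
val rs = stmt.executeQuery("SELECT COUNT(*) FROM myTable")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()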
Reference:
https://medium.com/#anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr

You can also use the Spark + Kafka streaming integration. You just have to send your queries over Kafka for the streaming API to pick up. That is a design pattern that is quickly gaining traction in the market because of its simplicity.
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Read the SQL from your Kafka topic.
Execute the query over the already created Datasets.
This should take care of your use case; a minimal sketch follows.
Hope this helps!
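A minimal sketch of this pattern, assuming Spark 2.4+ Structured Streaming with the spark-sql-kafka-0-10 package on the classpath, and assuming each Kafka message carries one SQL statement as its value; the lookup path, broker address, and topic name are placeholders:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("kafka-query-engine").getOrCreate()

// Load the lookup data once and register it as a view (path is a placeholder)
spark.read.parquet("/data/lookup").createOrReplaceTempView("myTable")

// Each incoming Kafka message is read as a SQL string
val queries = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "queries")
  .load()
  .selectExpr("CAST(value AS STRING) AS sql")

// Run every incoming query against the already registered view
queries.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.collect().foreach(row => spark.sql(row.getString(0)).show())
  }
  .start()
  .awaitTermination()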

For the Spark search engine part: if you require full-text search capabilities and/or document-level scoring, and you do not have an Elasticsearch infrastructure, you can give Spark Search a try - it brings Apache Lucene support to Spark.
// Index the RDD with Lucene and persist the index to the given path
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
// Reload the index and run a fuzzy Lucene query, keeping the top 10 hits per partition
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD.searchList("(firstName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)

Related

How to use Spark to read all rows from HBase and post them to Elasticsearch

In my current application, we are loading data into Elasticsearch in the form of JSON documents, and this data is generated by processing data taken from HBase.
Currently we are using MapReduce for this, which fetches each row from HBase, processes it, and posts the generated JSON document to Elasticsearch.
Our data consists of millions of documents, and this data loading takes too much time.
Is there a more time-efficient way to do the same using Spark?
You can replace your MapReduce job with Spark. You have to use the configuration below:
import org.apache.spark.SparkConf

// Let elasticsearch-hadoop create the target index automatically
val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
If you want to see the complete example, see Apache Spark with Elasticsearch; for Scala code with Elasticsearch, see Scala code for Elasticsearch.
If you want to read the rows from your HBase table, check out the spark-hbase-connector.
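To make the write path concrete, here is a minimal Scala sketch using the elasticsearch-spark (elasticsearch-hadoop) connector; the Elasticsearch host, the sample DataFrame, and the index name myindex/doc are placeholders:
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

val spark = SparkSession.builder
  .appName("hbase-to-es")
  .config("es.nodes", "es-host:9200")
  .config("es.index.auto.create", "true")
  .getOrCreate()
import spark.implicits._

// Stand-in for the DataFrame of documents built from the processed HBase rows
val processedDF = Seq(("row1", "value1"), ("row2", "value2")).toDF("rowkey", "payload")

// Write the whole DataFrame to Elasticsearch in parallel across the executors
processedDF.saveToEs("myindex/doc")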

Spark Streaming : source HBase

Is it possible to have a Spark Streaming job set up to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under the supported sources. But it seems to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from hbase using spark streaming context? Any help is appreciated.
Thanks!
The linked example does the following:
It reads the streaming data, converts it into HBase Puts, and then adds them to an HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats-calculation part, I think, is batch - it uses newAPIHadoopRDD. This method treats the data-reading part like reading files; in this case the files come from HBase, which is the reason for the following input formats:
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
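For context, the conf passed to newAPIHadoopRDD is an HBase configuration that names the table to scan; a minimal sketch (the table name is a placeholder):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the scan at the source table; add scan/column settings here as needed
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")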
If you want to read the updates in HBase as a stream, then you need a handle on the WALs (write-ahead logs) of HBase on the back end, and then perform your operations on those. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates on the back end and direct them to Solr as they arrive. Hope this helps.

Cassandra Loading Options

I have deployed a 9-node DataStax cluster in Google Cloud. I am new to Cassandra and not sure how people generally push data into Cassandra.
My requirement is to read data from flat files and RDBMS tables and load it into Cassandra, which is deployed in Google Cloud.
These are the options I see:
1. Spark and Kafka
2. SSTables
3. COPY command
4. Java batch
5. Dataflow (Google product)
Are there any other options, and which one is best?
Thanks,
For flat files you have the two most effective options:
Use Spark - it will load the data in parallel, but requires some coding (a minimal sketch follows this list).
Use DSBulk for batch loading of data from the command line. It supports loading from CSV and JSON, and it is very effective. DataStax's Academy blog has just started a series of blog posts on DSBulk, and the first post will give you enough information to get started. Also, if you have big files, consider splitting them into smaller ones, as this will allow DSBulk to perform a parallel load using all available threads.
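Here is a minimal sketch of the Spark option using the spark-cassandra-connector; the CSV path, connection host, keyspace, and table name are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("csv-to-cassandra")
  .config("spark.cassandra.connection.host", "cassandra-host")
  .getOrCreate()

// Read the flat file in parallel; header/schema options depend on your data
val csvDF = spark.read.option("header", "true").csv("/data/input.csv")

// Append the rows to an existing Cassandra table
csvDF.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .mode("append")
  .save()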
For loading data from an RDBMS, it depends on what you want to do - load the data once, or keep updating it as it changes in the database. For the first option you can use Spark with the JDBC source (it has some limitations too) and then save the data into DSE. For the second, you may need something like Debezium, which supports streaming change data from some databases into Kafka. From Kafka you can then use the DataStax Kafka Connector to write the data into DSE.
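For the one-off load, the read side could look like the sketch below, reusing the SparkSession from the previous sketch (the JDBC URL, table, and credentials are placeholders, and the JDBC driver JAR must be on the classpath); the write side is the same Cassandra write shown above:
// Pull the RDBMS table into a DataFrame over JDBC
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.source_table")
  .option("user", "db_user")
  .option("password", "db_password")
  .load()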
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I won't recommend to use it.
And never use CQL Batch for data loading, until you know how it works - it's very different from RDBMS world, and if it's used incorrectly it will really make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but it's different story).

How to save/insert each DStream into a permanent table

I've been facing a problem with Spark Streaming: inserting the output DStreams into a permanent SQL table. I'd like to insert every output DStream (coming from the single batch that Spark processes) into a unique table. I've been using Python with Spark 1.6.2.
At this part of my code, I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table without losing any result for any processed batch.
rr = feature_and_label.join(result_zipped)\
.map(lambda x: (x[1][0][0], x[1][1]) )
Each DStream here is represented, for instance, by a tuple like (4.0, 0).
I can't use Spark SQL because of the way Spark treats the 'table', that is, as a temporary table, therefore losing the result at every batch.
This is an example of output:
Time: 2016-09-23 00:57:00
(0.0, 2)
Time: 2016-09-23 00:57:01
(4.0, 0)
Time: 2016-09-23 00:57:02
(4.0, 0)
...
As shown above, each batch is made up of only one DStream. As I said before, I'd like to permanently store these results in a table saved somewhere, and possibly query it at a later time. So my question is:
is there a way to do it?
I'd appreciate it if somebody could help me out with this, and especially tell me whether it is possible or not.
Thank you.
Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results in a permanent table and query those results later is to use one of the various databases in the Spark database ecosystem. There are pros and cons to each, and your use case matters. I'll provide something close to a master list (a small example of the JDBC option follows the list). These are segmented by:
type of data management, form the data is stored in, and connection to Spark
Database, SQL, Integrated
SnappyData
Database, SQL, Connector
MemSQL
Hana
Kudu
FiloDB
DB2
SQLServer (JDBC)
Oracle (JDBC)
MySQL (JDBC)
Database, NoSQL, Connector
Cassandra
HBase
Druid
Ampool
Riak
Aerospike
Cloudant
Database, Document, Connector
MongoDB
Couchbase
Database, Graph, Connector
Neo4j
OrientDB
Search, Document, Connector
Elasticsearch
Solr
Data grid, SQL, Connector
Ignite
Data grid, NoSQL, Connector
Infinispan
Hazelcast
Redis
File System, Files, Integrated
HDFS
File System, Files, Connector
S3
Alluxio
Datawarehouse, SQL, Connector
Redshift
Snowflake
BigQuery
Aster
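As one illustration of the "Database, SQL, Connector (JDBC)" options above, here is a minimal Scala sketch that appends a batch of results to a permanent MySQL table; it uses the Spark 2.x SparkSession API, and the connection details, table name, and sample data are placeholders:
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("persist-batch").getOrCreate()
import spark.implicits._

// Stand-in for one micro-batch's (label, count) results
val resultDF = Seq((4.0, 0), (0.0, 2)).toDF("label", "count")

val props = new Properties()
props.setProperty("user", "db_user")
props.setProperty("password", "db_password")

// Append to a permanent table so each batch adds rows instead of replacing them
resultDF.write
  .mode("append")
  .jdbc("jdbc:mysql://db-host:3306/reports", "stream_results", props)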
Instead of using external connectors, it is better to go with Spark Structured Streaming.
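A minimal sketch of that approach, assuming Spark 2.x Structured Streaming; the socket source stands in for whatever streaming source you use, and the paths are placeholders:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("stream-to-table").getOrCreate()
import spark.implicits._

// Placeholder source: lines of "label,count" arriving on a socket
val parsed = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val parts = line.split(",")
    (parts(0).toDouble, parts(1).toInt)
  }
  .toDF("label", "count")

// Append every micro-batch to a permanent Parquet table; nothing is lost between batches
parsed.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/permanent_table")
  .option("checkpointLocation", "/data/checkpoints")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
  .awaitTermination()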

Baseline for measuring Apache Spark jobs execution times

I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use DataFrame .unionAll(), .join(), .count(), .map(), etc.
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=1g
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, I have set spark.io.compression.codec to lzf, and I have optimized the jobs as much as I can and as much as my knowledge allows.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that, because this is my first Spark project, I have no baseline to compare the job times against, so I have no idea whether the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action on N records should take?
You have to use Hive. On top of Hive you can put Spark. After doing this, create a temp table in Hive for the Cassandra table; you can then perform all types of aggregation and filtering on it. After that, use a Hive JDBC connection to get the results. It will return results quickly.
