How to use Spark to read all rows from HBase and post them to Elasticsearch

In my current application, we load data into Elasticsearch in the form of JSON documents, and this data is generated by processing data taken from HBase.
Currently we use MapReduce for this, which fetches each row from HBase, processes it, and posts the generated JSON document to Elasticsearch.
Our data consists of millions of documents, and this data loading takes too much time.
Is there a more time-efficient way to do the same thing using Spark?

You can replace your MapReduce job with Spark. You have to use the configuration below:
val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
If you want to see the complete documentation, see Apache Spark with Elasticsearch, and for Scala code with Elasticsearch see Scala code for Elasticsearch.
If you want to get the rows from your HBase table, check the spark-hbase-connector.
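For a concrete starting point, here is a minimal sketch using the elasticsearch-hadoop connector's saveToEs; the ES node address, index name, and sample RDD are illustrative, and in practice the RDD would be produced from your HBase scan:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._                 // brings saveToEs into scope for RDDs

val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.nodes", "es-host:9200")             // illustrative Elasticsearch endpoint
conf.set("es.index.auto.create", "true")
val sc = new SparkContext(conf)

// Each Map becomes one JSON document; replace this with the RDD built from HBase rows
val docs = sc.makeRDD(Seq(Map("rowkey" -> "r1", "value" -> 42)))
docs.saveToEs("hbase-docs/doc")                  // target index/type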

Related

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrames by adding Ignite on top of them. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
A SQL query (like select a, b, c from table where x) on the Ignite DataFrame works, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly). A query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query, and a query with the same "where" clause but a smaller result is processed faster. Overall, Ignite's DataFrame support seems to be a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in XML (because I need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, your test doesn't seem fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the Ignite DF case you don't use caching, so the data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
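For illustration, a minimal sketch of creating such an index through Ignite's SQL DDL from Scala; the cache name SQL_PUBLIC_PERSON, the table/column names, and reusing the CONFIG file from the question are assumptions:
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

// Start a node against the same configuration file the DataFrame reads with
val ignite = Ignition.start(CONFIG)

// Caches created through the DataFrame API are named SQL_PUBLIC_<TABLE> (per the question)
val cache = ignite.cache[Any, Any]("SQL_PUBLIC_PERSON")

// Index the column used in the WHERE clause so queries stop scanning the whole table
cache.query(new SqlFieldsQuery("CREATE INDEX IF NOT EXISTS idx_person_x ON person (x)")).getAll()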

Performance issues when querying from HBase using Spark to Elasticsearch

I have nearly a billion rows in an HBase database. I am writing a Spark job that pulls data from HBase efficiently based on a date range and pushes that data to Elasticsearch for indexing in batches. I am using the hbase-spark connector with JavaHBaseContext and Spark SQL with DataFrames to get the data. Later I push this data in batches to Elasticsearch for indexing.
I am having performance issues, first with getting the data from HBase and then with indexing and pushing the data to Elasticsearch. Please let me know how I should efficiently perform the above operations.
P.S.: HBase is backed by data in S3.
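For reference, a minimal sketch of the kind of date-ranged scan described above, assuming the hbase-spark HBaseContext API, an existing SparkContext sc, and an illustrative table name and time-range bounds (startMillis/endMillis):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.spark.HBaseContext

val hbaseConf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, hbaseConf)

val scan = new Scan()
scan.setCaching(500)                        // fetch more rows per RPC to cut round trips
scan.setTimeRange(startMillis, endMillis)   // restrict the scan to the requested date range

// RDD of (rowkey, Result) pairs for the requested range
val rows = hbaseContext.hbaseRDD(TableName.valueOf("events"), scan)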

Spark and JDBC: Iterating through large table and writing to hdfs

What would be the most memory-efficient way to copy the contents of a large relational table using Spark and then write it to a partitioned Hive table in Parquet format (without Sqoop)? I have a basic Spark app and I have done some other tuning with Spark's JDBC source, but the relational table is still 0.5 TB and 2 billion records. So although I can lazily load the full table, I'm trying to figure out how to efficiently partition by date and save to HDFS without running into memory issues. Since the JDBC load() from Spark will load everything into memory, I was thinking of looping through the dates in the database query, but I'm still not sure how to make sure I don't run out of memory.
If you need to use Spark, you can add a date parameter to your application for filtering the table by date and run your Spark application in a loop, once for each date. You can use bash or another scripting language for this loop.
This can look like:
foreach date in dates
spark-submit your application with date parameter
read DB table with spark.read.jdbc
filter by date using filter method
write result to HDFS with df.write.parquet("hdfs://path")
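As a rough illustration of one such iteration, assuming a created_date column, an existing SparkSession named spark, and illustrative connection details:
import java.util.Properties
import org.apache.spark.sql.functions.col

val date = args(0)                               // passed in by the driving script, e.g. "2020-01-01"

val props = new Properties()
props.setProperty("fetchsize", "10000")          // stream rows from the database instead of buffering them all

val df = spark.read
  .jdbc(jdbcUrl, "big_table", props)
  .filter(col("created_date") === date)          // simple predicates like this are pushed down to the database

df.write
  .mode("append")
  .partitionBy("created_date")
  .parquet("hdfs:///warehouse/big_table_parquet")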
Another option is to use a different technology, for example implementing a Scala application that uses JDBC and a DB cursor to iterate through the rows and save the result to HDFS. This is more complex, because you need to solve the problems related to writing the Parquet format and saving to HDFS from plain Scala. If you want, I can provide the Scala code responsible for writing to Parquet format.

Spark Streaming : source HBase

Is it possible to have a Spark Streaming job set up to keep track of an HBase table and read new/updated rows every batch? The blog here says that HDFS files come under supported sources. But they seem to be using the following static API:
sc.newAPIHadoopRDD(..)
I can't find any documentation around this. Is it possible to stream from HBase using a Spark streaming context? Any help is appreciated.
Thanks!
The link provided does the following:
It reads the streaming data, converts it into HBase Puts, and then adds them to an HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats-calculation part, I think, is batch; it uses newAPIHadoopRDD. This method treats the data-reading part as reading files, and in this case the "files" come from HBase, which is the reason for the following input formats:
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
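For context, a minimal sketch of the configuration that call expects; the table name is an assumption:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // HBase table to expose as an RDD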
If you want to read the updates to an HBase table as a stream, then you need a handle on HBase's WAL (write-ahead log) at the back end and to perform your operations on it. hbase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

How to use Apache Spark as a query engine?

I am using Apache Spark for big data processing. The data is loaded into DataFrames from a flat-file source or a JDBC source. The job is to search for specific records in the DataFrame using Spark SQL.
So I have to run the job again and again for new search terms, and every time I have to submit the jar files using spark-submit to get the results. As the size of the data is 40.5 GB, it becomes tedious to reload the same data into a DataFrame every time to get the results for different queries.
So what I need is:
a way to load the data into a DataFrame once and query it multiple times without submitting the jar multiple times;
to know whether Spark could be used as a search engine / query engine;
to know whether we can load the data into a DataFrame once and query it remotely using a REST API.
The current configuration of my Spark deployment is:
a 5-node cluster,
running on YARN as the resource manager.
I have tried spark-jobserver, but it also runs the job every time.
You might be interested in the HiveThriftServer and Spark integration.
Basically you start a Hive Thrift Server and inject your HiveContext built from the SparkContext:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
...
val sql = new HiveContext(sc)
sql.setConf("hive.server2.thrift.port", "10001")
...
dataFrame.registerTempTable("myTable")
HiveThriftServer2.startWithContext(sql)
...
There are several client libraries and tools to query the server:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
including the beeline CLI tool.
Reference:
https://medium.com/@anicolaspp/apache-spark-as-a-distributed-sql-engine-4373e254e0f9#.3ntbhdxvr
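As an illustration, a minimal Scala JDBC client could query the started Thrift server like this; the host, port, and table name are assumptions, and the Hive JDBC driver must be on the classpath:
import java.sql.DriverManager

// Connect to the Thrift server started above (port taken from hive.server2.thrift.port)
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001", "", "")
val rs = conn.createStatement().executeQuery("SELECT count(*) FROM myTable")
while (rs.next()) println(rs.getLong(1))
conn.close()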
You can also use the Spark + Kafka streaming integration. You just have to send your queries over Kafka for the streaming APIs to pick up. That's a design pattern that is picking up quickly in the market because of its simplicity.
Create Datasets over your lookup data.
Start a Spark streaming query over Kafka.
Get the SQL from your Kafka topic.
Execute the query over the already-created Datasets.
This should take care of your use case (see the sketch below).
Hope this helps!
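A minimal sketch of that pattern, assuming Structured Streaming, an existing SparkSession named spark, an illustrative broker, topic, and lookup path, and that each Kafka record carries one SQL query string:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Lookup data registered once as a temp view (the "already created Datasets")
spark.read.parquet("hdfs:///data/lookup").createOrReplaceTempView("lookup")

// Each Kafka record is read as one SQL query string
val queries = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "queries")
  .load()
  .select(col("value").cast("string").as("sql"))

// Run every received query against the registered view and print the result
val runBatch: (DataFrame, Long) => Unit = (batch, _) =>
  batch.collect().foreach(row => spark.sql(row.getString(0)).show())

queries.writeStream
  .foreachBatch(runBatch)
  .start()
  .awaitTermination()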
For the Spark search-engine question: if you require full-text search capabilities and/or document-level scoring, and you do not have an Elasticsearch infrastructure, you can give Spark Search a try. It brings Apache Lucene support to Spark.
df.rdd.searchRDD().save("/tmp/hdfs-pathname")
val restoredSearchRDD: SearchRDD[Person] = SearchRDD.load[Person](sc, "/tmp/hdfs-pathname")
restoredSearchRDD.searchList("(firstName:Mikey~0.8) OR (lastName:Wiliam~0.4) OR (lastName:jonh~0.2)",
    topKByPartition = 10)
  .map(doc => s"${doc.source.firstName}=${doc.score}")
  .foreach(println)
