Performance issues when querying from HBase using Spark to Elasticsearch - apache-spark

I have nearly a billion rows in an HBase database. I am writing a Spark job that pulls data from HBase efficiently based on a date range and pushes that data to Elasticsearch for indexing in batches. I am using the hbase-spark connector (JavaHBaseContext) together with Spark SQL and DataFrames to get the data; later I push this data to Elasticsearch for indexing in batches.
I am having performance issues, first with getting the data from HBase and then with indexing and pushing the data to Elasticsearch. Please let me know how I should perform the above operation efficiently.
P.S.: HBase is backed by data in S3.
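For reference, a minimal sketch of the pipeline described above, assuming the Apache hbase-spark SQL data source and the elasticsearch-hadoop (elasticsearch-spark) connector are on the classpath; the table name, column mapping, date range and index name are placeholders, not the actual setup:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-to-es")
  .config("es.nodes", "es-host:9200")            // Elasticsearch cluster
  .config("es.batch.size.entries", "5000")       // bulk size used per task
  .getOrCreate()

// Register an HBaseContext so the SQL data source can pick it up.
new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

// Read the HBase table as a DataFrame; the mapping ties DataFrame columns
// to the row key and a column family/qualifier.
val events = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "events")
  .option("hbase.columns.mapping",
    "rowKey STRING :key, eventDate STRING cf:date, payload STRING cf:json")
  .load()

// Restrict to the requested date range before shipping anything to ES. If the
// date is a prefix of the row key, the filter can be pushed down as a scan
// range instead of a full table scan over the S3-backed regions.
val ranged = events.filter(events("eventDate").between("2023-01-01", "2023-01-31"))

// Bulk-index into Elasticsearch; elasticsearch-hadoop batches the writes per
// partition according to the es.batch.* settings above.
ranged.write
  .format("org.elasticsearch.spark.sql")
  .mode("append")
  .save("events-index")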

Related

Is it possible to partition video data using Spark on Databricks?

I am on Databricks, but this question extends to Apache Spark.
When ingesting large structured datasets, it is possible to manually partition them during ingestion.
Is it possible to partition videos and other unstructured forms of data using Spark, with the purpose of leveraging the available cores to decrease the time taken?
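A hedged sketch of what that might look like, assuming Spark 3.x's built-in binaryFile source and a hypothetical /mnt/videos path: Spark cannot split a single video file the way it splits Parquet, but it can read each file as a binary blob and redistribute the files across cores with repartition.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("video-ingest").getOrCreate()

// Each row carries one file: path, modificationTime, length, content (bytes).
val videos = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.mp4")   // only pick up video files
  .load("/mnt/videos")

// At best one partition per file; repartition to spread the files over the
// available cores so downstream per-file processing runs in parallel.
val distributed = videos.repartition(spark.sparkContext.defaultParallelism)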

How to use Delta Table as sink for stateful Spark Structured Streaming

Recently I have been working with Spark stateful streaming (mapGroupsWithState and flatMapGroupsWithState). As I am working on Databricks, I am trying to write the results of the stateful streaming to a Delta table, but it is not possible in any of the output modes (complete, append, update). Ideally, I would like to keep all states in memory and save periodic snapshots to a Delta table. Do you have any idea how to achieve that?
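One commonly suggested workaround, sketched below under the assumption that Databricks with Delta Lake is available: instead of pointing the stateful stream straight at a Delta sink, route each micro-batch through foreachBatch and append the state snapshot to a Delta table there. The toy rate-source aggregation and the state_snapshots table name are placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

val spark = SparkSession.builder().appName("stateful-to-delta").getOrCreate()
import spark.implicits._

// Toy stateful aggregation: running count per key via mapGroupsWithState.
val counts = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .select(($"value" % 10).as("key")).as[Long]
  .groupByKey(identity)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (key: Long, rows: Iterator[Long], state: GroupState[Long]) =>
      val total = state.getOption.getOrElse(0L) + rows.size
      state.update(total)
      (key, total)
  }
  .toDF("key", "count")

// Append each micro-batch's state updates to a Delta table; a plain Delta
// sink would reject this query, but foreachBatch sidesteps that restriction.
def writeSnapshot(batch: DataFrame, batchId: Long): Unit =
  batch.write.format("delta").mode("append").saveAsTable("state_snapshots")

counts.writeStream
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime("1 minute"))   // periodic snapshots
  .option("checkpointLocation", "/tmp/checkpoints/state_snapshots")
  .foreachBatch(writeSnapshot _)
  .start()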

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to maybe 10 minutes or so, I might still end up with only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you not to use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table, then converts them into a daily Parquet-partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
I had exactly the same situation as you. I solved it as follows.
Let's assume that your newly arriving data are stored in a dataset: dataset1.
1- Partition the table with a good partition key; in my case I found that I can partition using a combination of keys to get around 100 MB per partition.
2- Save using Spark core, not Spark SQL:
a- Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure that the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d- Save the resulting dataset in "Overwrite" mode to replace the existing file.
A rough code sketch of these steps is shown below; if you need more details about any step, please reach out.
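A rough sketch of those steps, assuming a Parquet table partitioned by a hypothetical part_key column and Spark's dynamic partition overwrite setting, so only the partitions touched by the new batch are rewritten (names and paths are placeholders):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("small-file-compaction")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

def mergeIntoPartitions(newData: DataFrame, tablePath: String, partKey: String): Unit = {
  // (a) Load the existing rows of the partitions the new batch touches.
  val existing = spark.read.parquet(tablePath)
    .join(newData.select(partKey).distinct(), Seq(partKey), "leftsemi")

  // (b) Union new and existing rows, (c) collapse each partition value into a
  // single task (and hence a single output file).
  val merged = newData.unionByName(existing)
    .repartition(newData.col(partKey))
    .localCheckpoint()   // materialize before overwriting the path we just read

  // (d) Overwrite only the touched partitions with the merged data.
  merged.write
    .mode(SaveMode.Overwrite)
    .partitionBy(partKey)
    .parquet(tablePath)
}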

How to use Spark to read all rows from HBase and post them to Elasticsearch

In my current application, we are loading data into Elasticsearch in the form of JSON documents, and this data is generated by processing data taken from HBase.
Currently we are using MapReduce for this, which fetches each row from HBase, processes it, and posts the generated JSON document to Elasticsearch.
Our data consists of millions of documents, and this data loading takes too much time.
Is there a more time-efficient way to do the same using Spark?
You can replace your MapReduce job with Spark. You have to use the configuration below:
val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
If you want to see the complete document, see Apache Spark with Elasticsearch.
For Scala code with Elasticsearch, see Scala code for Elasticsearch.
If you want to get the rows from your HBase table, check the spark-hbase-connector.
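For illustration, a hedged sketch of that replacement, assuming the elasticsearch-spark and HBase client/mapreduce jars are on the classpath; the table name, column family and index name are placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("hbase-to-es")
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "es-host:9200")
val sc = new SparkContext(conf)

// Scan the HBase table with TableInputFormat, the same input format MapReduce uses.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "documents")

val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Turn each HBase Result into the document that was previously built in the reducer.
val docs = rows.map { case (key, result) =>
  Map(
    "id"   -> Bytes.toString(key.copyBytes()),
    "body" -> Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("json"))))
}

// Bulk-index into Elasticsearch in parallel, one bulk request batch per partition.
docs.saveToEs("documents-index")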

How to save/insert each DStream into a permanent table

I've been facing a problem with Spark Streaming concerning the insertion of an output DStream into a permanent SQL table. I'd like to insert every output DStream (coming from the single batch that Spark processes) into a unique table. I've been using Python with Spark version 1.6.2.
At this part of my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table without losing any result from any processed batch.
rr = feature_and_label.join(result_zipped)\
.map(lambda x: (x[1][0][0], x[1][1]) )
Each DStream here is represented, for instance, by tuples like: (4.0, 0).
I can't use Spark SQL because of the way Spark treats the 'table', that is, as a temporary table, therefore losing the result at every batch.
This is an example of output:
Time: 2016-09-23 00:57:00
(0.0, 2)
Time: 2016-09-23 00:57:01
(4.0, 0)
Time: 2016-09-23 00:57:02
(4.0, 0)
...
As shown above, each batch contains only one DStream. As I said before, I'd like to permanently store these results into a table saved somewhere, and possibly query it at a later time. So my question is:
is there a way to do it?
I'd appreciate it if somebody could help me out with this, and especially tell me whether it is possible or not.
Thank you.
Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results to a permanent table and query those results later is to use one of the various databases in the Spark Database Ecosystem. There are pros and cons to each and your use case matters. I'll provide something close to a master list. These are segmented by:
Type of data management, form the data is stored in, connection to Spark
Database, SQL, Integrated
SnappyData
Database, SQL, Connector
MemSQL
Hana
Kudu
FiloDB
DB2
SQLServer (JDBC)
Oracle (JDBC)
MySQL (JDBC)
Database, NoSQL, Connector
Cassandra
HBase
Druid
Ampool
Riak
Aerospike
Cloudant
Database, Document, Connector
MongoDB
Couchbase
Database, Graph, Connector
Neo4j
OrientDB
Search, Document, Connector
Elasticsearch
Solr
Data grid, SQL, Connector
Ignite
Data grid, NoSQL, Connector
Infinispan
Hazelcast
Redis
File System, Files, Integrated
HDFS
File System, Files, Connector
S3
Alluxio
Datawarehouse, SQL, Connector
Redshift
Snowflake
BigQuery
Aster
Instead of using external connectors, a better option is to go for Spark Structured Streaming.
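A minimal sketch of that suggestion, using a rate source as a stand-in for the real stream of (label, prediction) pairs and a hypothetical /data/events_table path: Structured Streaming appends every micro-batch to the same permanent table, which can then be queried later like any other Parquet dataset.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("stream-to-table").getOrCreate()

// Stand-in for the real stream; produces (label, prediction)-style pairs.
val pairs = spark.readStream
  .format("rate").option("rowsPerSecond", "5").load()
  .selectExpr("CAST(value % 5 AS DOUBLE) AS label", "CAST(value % 2 AS INT) AS prediction")

// Every micro-batch is appended to the same table path; the result survives
// across batches and can be read back with spark.read.parquet("/data/events_table").
pairs.writeStream
  .format("parquet")
  .option("path", "/data/events_table")
  .option("checkpointLocation", "/data/_checkpoints/events_table")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()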
