Is it possible to partition video data using Spark on Databricks? - databricks

I am on Databricks, but this question extends to Apache Spark.
When ingesting large structured datasets, it is possible to manually partition them during ingestion.
Is it possible to partition videos and other unstructured forms of data using Spark, with the purpose of leveraging the number of cores to decrease time taken?

Related

How to use Delta Table as sink for statefull spark structured streaming

recently I'm working with spark stateful streaming (mapGroupsWithState and flatMapGroupsWithState). As I'm working on Databricks I'm trying to write results from the stateful streaming do Delta Table, but it is not possible in any of output modes (complete, append, update). Ideally, I would like to store all states in memory and saves periodical snapshots in Delta table. Do you have any idea how to achieve that?

Can Apache Spark repartition the data received from single Kafka partition

We have a Kafka topic with multiple partitions and the load is uneven/skewed in each partition. We can't change the partitioning strategy for Kafka.
I am looking for some ways to repartition the data so that the load is equally processed by all the nodes present within the Apache Spark cluster.
Is it possible to repartition the data across all the nodes received from a single Kafka partition?
If yes, is there a way we can do it efficiently as we have to maintain the state stores also while aggregating the data

How do spark streaming on hive tables using sql in NON-real time?

We have some data (millions) in hive tables which comes everyday. Next day, once the over-night ingestion is complete different applications query us for data (using sql)
We take this sql and make a call on spark
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on spark driver, can we use spark streaming (or structured streaming), to stream the results in a piped fashion rather than collecting everything on driver and then sending to clients ?
We don't want to send out the data as soon it comes ( in typical streaming apps), but want to send a streaming data to clients when they ask (PULL) for data.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting into batches of Milliseconds to Seconds.
You can look over streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) provides you a very good functionality for Spark to write
Streaming processed output Sink in micro-batch manner.
Nevertheless Spark structured streaming don't have a standard JDBC source defined to read from.
Work out for an option to directly store Hive underlying files in compressed and structured manner, transfer them directly rather than selecting through spark.sql if every client needs same/similar data or partition them based on where condition of spark.sql query and transfer needed files further.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins(configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and if I increase the batch duration to maybe 10mins or so, then even I might end up getting only 2-3mb of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS
I have exactly the same situation as you. I solved it by:
Lets assume that your new coming data are stored in a dataset: dataset1
1- Partition the table with a good partition key, in my case I have found that I can partition using a combination of keys to have around 100MB per partition.
2- Save using spark core not using spark sql:
a- load the whole partition in you memory (inside a dataset: dataset2) when you want to save
b- Then apply dataset union function: dataset3 = dataset1.union(dataset2)
c- make sure that the resulted dataset is partitioned as you wish e.g: dataset3.repartition(1)
d - save the resulting dataset in "OverWrite" mode to replace the existing file
If you need more details about any step please reach out.

What is the best way to store incoming streaming data?

What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?
In my experience we have used Hbase as datastore for spark streaming data(we also has same scenario many writes and few reads), since we are using hadoop, hbase has native integration with hadoop and it went well..
Above we have used tostore hight rate of messages coming over from solace.
HBase is well suited for doing Range based scans. Casandra is known for availablity and many other things...
However, I can also observe one general trend in many projects, they are simply storing rawdata in hdfs (parquet + avro) in partitioned structure through spark streaming with spark dataframe(SaveMode.Append) and they are processing rawdata with Spark
Ex of partitioned structure in hdfs :
completion ofbusinessdate/environment/businesssubtype/message type etc....
in this case there is no need for going to Hbase or any other data store.
But one common issue in above approach is when you are getting small and tiny files, through streaming then you would need to repartion(1) or colelese or FileUtils.copymerge to meet block size requirements to single partitioned file. Apart from that above approach also would be fine.
Here is some thing called CAP theorm based on which decision can be taken.
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it
succeeded or failed).
Partition tolerance (the system continues to
operate despite arbitrary partitioning due to network failures)
Casandra supports AP.
Hbase supports CP.
Look at detailed analysis given here

Resources