Spark RDD: How to share data for a parallel operation - apache-spark

My understanding of Spark is that when you run a reduce operation on an RDD, it is executed in parallel by different nodes and the result is accumulated back by the master node. Since these operations run in parallel, the result is only available as a whole and we cannot rely on any update made in between during processing. For example, say I am designing a shared-cab app and I have an RDD which contains the start location of a trip and the current location of different cabs. I can easily run a Spark SQL query to get the distance of each cab from the trip start point. Once I have this, I need to pick the cab with the shortest distance and allocate it. Now there is a condition that a cab cannot take more than 4 trips. Since my analysis is running in parallel, I cannot be sure whether a cab is already at full capacity. So what is the best way to validate this? Can we have a shared variable, or should we store allocations in a database? Performance is key.

There is no such functionality in Spark itself; you can make use of Apache Ignite.
For more details refer to the link:
https://apacheignite.readme.io/docs
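As a rough illustration of that idea (not part of the original answer): one way to make the capacity check atomic across parallel tasks is to keep a per-cab trip counter in a shared Ignite cache and update it with a compare-and-swap. The cache name "cabTrips", the 4-trip limit and the assumption that each executor can reach an Ignite node are placeholders for illustration only.
import org.apache.ignite.Ignition

val ignite = Ignition.start()                                    // start or join an Ignite node
val trips  = ignite.getOrCreateCache[String, Integer]("cabTrips")

def tryAllocate(cabId: String, maxTrips: Int = 4): Boolean = {
  trips.putIfAbsent(cabId, Integer.valueOf(0))                   // make sure a counter exists
  var decided   = false
  var allocated = false
  while (!decided) {
    val current: Int = trips.get(cabId)
    if (current >= maxTrips) {
      decided = true                                             // cab already at capacity
    } else if (trips.replace(cabId, Integer.valueOf(current), Integer.valueOf(current + 1))) {
      decided = true; allocated = true                           // atomic compare-and-swap succeeded
    }
  }
  allocated
}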

Related

Why does the execution time of a Spark SQL query differ between the first and second execution?

I am using Spark SQL to run some aggregate queries on a parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My Spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the following command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time, I ran a similar command (only changing the columns):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(cou
nter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36 )+sum(counter_41)+sum(counter_46) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: Why are they different? I know it is not data caching because of the parquet format. Is it about reusing something like query planning?
I did another test. The first command is:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I changed the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(cou
nter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35 )+avg(counter_40)+avg(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: Why is the second command faster than the first command, yet the difference in execution time slightly smaller than in the first scenario?
Finally, I have a problem related to the above scenarios: there are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(counter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35)+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX, formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value2, ...) group by location
My third question is: Is there any way to optimize the performance (a query used once should be faster if it is used again in the future)? Does Spark optimize itself, or do I have to write some code or change the config?
It's called Exchange Reuse. When Spark runs shuffling (i.e. aggregation, join) it stores a copy of the shuffle data on the local worker nodes for potential reuse. This is an internally controlled behavior and cannot be directly influenced by the end user. If you find that you keep re-using a particular portion of data (or query outcome), you could consider explicitly caching it by using cache(). However, bear in mind that although this allows Spark to reuse the cached result for potentially faster query performance (if, and only if, the analyzed plan of your cached query matches your new query), over-using CACHE can cause a whole lot of different performance problems.
A bad example is when your dataset is very large: it may cause a disk spill problem, i.e. the dataset doesn't fit into your cluster's available memory and needs to be written to slower hard disks.
Another bad example is when your query only needs to access a subset of the cached data. Caching the entire dataset in memory forces Spark to perform a full in-memory table scan. Not only is that a waste of resources, it also results in slower query performance compared to not using the cache at all.
The best thing to do is trial and error with a few of your own example queries: look at the Spark UI and check for signs of disk spill or large amounts of input data being scanned.
Every query/data combination is unique, hence you'll need to experiment a bit to find the best performance-tuning method for your own workload.
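A minimal sketch of the explicit caching route mentioned above, reusing the parquet path from the question (CACHE TABLE is eager by default, so the data is materialized once and every later formula variant scans the cached copy; the shortened formula is just an example):
val counters = spark.read.parquet("/home/hungphan227/spark_data/counters")
counters.createOrReplaceTempView("counters")
spark.sql("CACHE TABLE counters")               // materialize once (memory, spilling to disk if needed)

spark.time(spark.sql(
  """SELECT location,
    |       sum(counter_1) + sum(counter_5) + sum(counter_10) AS formula1
    |  FROM counters
    | GROUP BY location""".stripMargin).show())

spark.sql("UNCACHE TABLE counters")             // release the memory when done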
When doing an aggregate, Spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files, which are stored locally on the workers' filesystem. Unfortunately you can't rely on them always being there, because eventually the file handles get GC'd. If you're going to run 10 queries on the same dataset, cache it or use Databricks.

How to export large Neo4j datasets for analysis in an automated fashion

I've run into a technical challenge around Neo4j usage that has had me stumped for a while. My organization uses Neo4j to model customer interaction patterns. The graph has grown to a size of around 2 million nodes and 7 million edges. All nodes and edges have between 5 and 10 metadata properties. Every day, we export data on all of our customers from Neo4j to a series of python processes that perform business logic.
Our original method of data export was to use paginated cypher queries to pull the data we needed. For each customer node, the cypher queries had to collect many types of surrounding nodes and edges so that the business logic could be performed with the necessary context. Unfortunately, as the size and density of the data grew, these paginated queries began to take too long to be practical.
Our current approach uses a custom Neo4j procedure to iterate over nodes, collect the necessary surrounding nodes and edges, serialize the data, and place it on a Kafka queue for downstream consumption. This method worked for some time but is now taking long enough so that it is also becoming impractical, especially considering that we expect the graph to grow an order of magnitude in size.
I have tried the cypher-for-apache-spark and neo4j-spark-connector projects, neither of which have been able to provide the query and data transfer speeds that we need.
We currently run on a single Neo4j instance with 32GB memory and 8 cores. Would a cluster help mitigate this issue?
Does anyone have any ideas or tips for how to perform this kind of data export? Any insight into the problem would be greatly appreciated!
As far as I remember, Neo4j doesn't support horizontal scaling and all data is stored on a single node. To use Spark you could try to store your graph on 2+ nodes and load parts of the dataset from these separate nodes to "simulate" the parallelization. I don't know whether that is supported by either of the connectors you mention.
But as mentioned in the comments on your question, maybe you could try an alternative approach. An idea:
Find a data structure representing everything you need to train your model.
Store such a "flattened" graph in some key-value store (Redis, Cassandra, DynamoDB...)
Now, if something changes in the graph, push a message to your Kafka topic.
Add consumers that update the graph and your key-value store directly afterwards (i.e. update only the graph branch impacted by the change, with no need to export the whole graph; changing the key-value store at the same moment would very probably lead to duplicating the logic). A rough sketch of this step follows below.
Make your model query the key-value store directly.
It also depends on how often your data changes, and on how deep and broad your graph is.
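A rough sketch, under heavy assumptions, of the consumer step above: a Kafka consumer reads graph-change events and refreshes only the affected "flattened" entry in a key-value store (Redis here). The topic name "graph-changes", the key layout and the updateFlattenedEntry helper are illustrative placeholders, not part of the original answer.
import java.time.Duration
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import redis.clients.jedis.Jedis

// Hypothetical helper: recompute the flattened representation for one customer.
def updateFlattenedEntry(customerId: String, changeEvent: String): String =
  s"""{"customerId":"$customerId","lastChange":$changeEvent}"""   // placeholder payload

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "graph-flattener")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("graph-changes"))
val redis = new Jedis("localhost")

while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  records.asScala.foreach { record =>
    // record.key() = customer id, record.value() = serialized change event;
    // recompute only that customer's flattened sub-graph and overwrite its key.
    val flattened = updateFlattenedEntry(record.key(), record.value())
    redis.set(s"customer:${record.key()}", flattened)
  }
}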
Neo4j Enterprise supports clustering; you could use the Causal Cluster feature and launch as many read replicas as needed, then run the queries in parallel on the read replicas. See this link: https://neo4j.com/docs/operations-manual/current/clustering/setup-new-cluster/#causal-clustering-add-read-replica

Records processed metric for intermediate datasets

I have created a Spark job using the Dataset API. There is a chain of operations performed until the final result, which is collected on HDFS.
But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (could be map, groupBy, etc.); I need to know how many records there were in each of the 5 intermediate datasets. Can anybody suggest how this can be obtained at the dataset level? I guess I can find this out at the task level (using listeners) but am not sure how to get it at the dataset level.
Thanks
The nearest thing in the Spark documentation related to metrics is Accumulators. However, they are reliable only for actions; the documentation mentions that accumulator updates performed inside transformations are not guaranteed.
You can still use count to get the latest counts after each operation. But keep in mind that it is an extra action like any other, and you need to decide whether the ingestion should be faster with fewer metrics or slower with all metrics.
Now, coming back to listeners: a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, when an RDD is unpersisted, or when environment properties change. All the information you can find about the health of Spark applications and the entire infrastructure is in the Web UI.
Your requirement is more of a custom implementation, and I am not sure you can achieve it exactly. Some info regarding exporting metrics is here.
All the metrics you can collect are at job start, job end, task start and task end. You can check the docs here; a minimal listener sketch follows.
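A hedged sketch of the listener route: Spark only exposes records-read numbers per task/stage, so this aggregates them per stage; mapping a stage back to "operation 3 of 5" still has to be done via the stage's description/call site in the Web UI.
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class RecordsReadListener extends SparkListener {
  // Listener events are delivered by a single listener-bus thread, so a plain map is fine here.
  val recordsPerStage = mutable.Map.empty[Int, Long].withDefaultValue(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      // records read from the input source plus records pulled over shuffle
      recordsPerStage(taskEnd.stageId) +=
        m.inputMetrics.recordsRead + m.shuffleReadMetrics.recordsRead
    }
  }
}

val listener = new RecordsReadListener
spark.sparkContext.addSparkListener(listener)
// ... run the chain of Dataset operations, then inspect listener.recordsPerStage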
Hope the above info guides you towards a better solution.

Cassandra: Batch write optimisation

I get a bulk write request from the client for, let's say, some 20 keys.
I can either write them to C* in one batch, or write them individually in an async way and wait on the futures for them to complete.
Writing in a batch does not seem to be a good option as per the documentation, since my insertion rate will be high and, if the keys belong to different partitions, the coordinators will have to do extra work.
Is there a way in the DataStax Java driver with which I can group keys that belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? That way I make fewer RPC calls to the server and, at the same time, the coordinator only has to write locally. I will be using the token-aware policy.
Your idea is right, but there is no built-in way; you usually do that manually.
The main rule here is to use TokenAwarePolicy, so some coordination happens on the driver side.
Then, you can group your requests by equality of the partition key; that would probably be enough, depending on your workload.
What I mean by 'grouping by equality of partition key' is, e.g., you have some data that looks like:
MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }
Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for all existing partitioningKey values, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
If you wish to go further and mimic Cassandra's hashing, you should look at the cluster metadata via the getMetadata method of the com.datastax.driver.core.Cluster class: there is a getTokenRanges method, whose result you can compare against Murmur3Partitioner.getToken or whichever partitioner you configured in cassandra.yaml. I've never tried that myself, though.
So, I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than without batches, let alone batches without grouping.
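A hedged sketch of what the grouping approach could look like with the DataStax Java driver 3.x (whose default load-balancing policy is already token-aware); the contact point, keyspace, table, prepared statement and MyData fields are placeholders for illustration.
import com.datastax.driver.core.{BatchStatement, Cluster}

case class MyData(partitioningKey: String, clusteringKey: String, otherValue: String)

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")
val insert  = session.prepare("INSERT INTO my_table (pk, ck, other) VALUES (?, ?, ?)")

def writeGrouped(records: Seq[MyData]): Unit = {
  // one unlogged, single-partition batch per distinct partition key
  val futures = records.groupBy(_.partitioningKey).map { case (_, samePartition) =>
    val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
    samePartition.foreach { r =>
      batch.add(insert.bind(r.partitioningKey, r.clusteringKey, r.otherValue))
    }
    session.executeAsync(batch)                 // one async call per partition
  }
  futures.foreach(_.getUninterruptibly())       // wait for all writes to complete
}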
Logged batches should be used carefully in Cassandra because they impose additional overhead. It also depends on the distribution of the partition keys. If your bulk write targets a single partition, then using an unlogged batch results in a single insert operation.
In general, writing them individually in an async manner seems to be a good approach, as pointed out here:
https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-the-nuanced-edition-dd78d61e9885
You can find sample code showing how to handle multiple async writes on the above site:
https://gist.github.com/rssvihla/26271f351bdd679553d55368171407be#file-bulkloader-java
https://gist.github.com/rssvihla/4b62b8e5625a805583c1ce39b1260ff4#file-bulkloader-java
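By contrast with grouped batches, here is a minimal self-contained sketch of the individual async-write pattern from the linked examples (DataStax Java driver 3.x); the contact point, keyspace, table and the 20 sample keys are placeholders.
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")
val insert  = session.prepare("INSERT INTO my_table (pk, ck, other) VALUES (?, ?, ?)")

// Fire one async insert per key, then wait on all futures,
// so the ~20 writes of a bulk request overlap instead of running one after another.
val futures = (1 to 20).map(i => session.executeAsync(insert.bind(s"key$i", "ck", "value")))
futures.foreach(_.getUninterruptibly())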
EDIT:
please read this also:
https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/#14
What does a single partition batch cost?
There’s no batch log written for single partition batches. The coordinator doesn’t have any extra work (as for multi partition writes) because everything goes into a single partition. Single partition batches are optimized: they are applied with a single RowMutation [10].
In a few words: single partition batches don’t put much more load on the server than normal writes.
What does a multi partition batch cost?
Let me just quote Christopher Batey, because he has summarized this very well in his post “Cassandra anti-pattern: Logged batches” [3]:
Cassandra [is first] writing all the statements to a batch log. That batch log is replicated to two other nodes in case the coordinator fails. If the coordinator fails then another replica for the batch log will take over. [..] The coordinator has to do a lot more work than any other node in the cluster.
Again, in bullets what has to be done:
serialize the batch statements
write the serialized batch to the batch log system table
replicate this serialized batch to 2 nodes
coordinate writes to nodes holding the different partitions
on success remove the serialized batch from the batch log (also on the 2 replicas)
Remember that unlogged batches for multiple partitions have been deprecated since Cassandra 2.1.6.

Spark: Importing Data

I currently have a Spark app that reads a couple of files, forms data frames out of them, and implements some logic on those data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double-check that, since all machines on the cluster can access the files (which is a requirement of Spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app, but since it only shows what actions were performed by which machines, and since "sc.textFile(filePath)" is not an action, I couldn't be sure which machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array, and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute it at all?
In other words, should I be doing the maximum amount of my work by transforming and performing actions on RDDs, rather than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
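A tiny illustration: sc.textFile is lazy and the read itself is split across the cluster; the optional second argument only sets a lower bound on the number of partitions. The path and the value 16 below are placeholders.
val lines = sc.textFile("/shared/data/*.txt", 16)
println(lines.getNumPartitions)   // at least 16; one task per partition when an action runs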
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data back to the driver (master). After that it's just a plain array. Indeed, the idea is that you should not use collect if you want to perform distributed computations.
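A hedged sketch of keeping the work distributed instead of collect()-ing: the logic stays on the DataFrame and only the small aggregated result comes back to the driver. The column names "height" and "species" are hypothetical stand-ins for whatever treeDF actually contains.
import org.apache.spark.sql.functions.col

val summaryDF = treeDF
  .filter(col("height") > 10)     // runs in parallel on the executors
  .groupBy(col("species"))
  .count()

val summary = summaryDF.collect() // only the aggregated rows are fetched to the driver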
