What's the best way to rate limit a spark application - apache-spark

I have an application does the following:
Reads URLs from a Hive table
Creates HTTP requests from those URLs, hits a server with them and parses the responses
Writes the parsed responses to another Hive table
I would like to rate-limit the URLs sent to the server. Currently, to solve the problem I have added a sleep time after every request is sent to the server. The sleep time is calculated as: (no. of executors) * (no. of cores available for each executor) / (RPS intended)
This for some reason does not do any rate limiting, so I am looking for alternatives. From what I have found from this post, it seems Spark Streaming could be a good alternative if I could use the input Hive table as a streaming source and rate limit the reading.
I have read the documents but can not figure out if a Hive table can be a streaming source. A file can be a streaming source, so I can always read the data from the hive table, store it in a file and then use that as a streaming source but I was wondering if it was possible to avoid this indirect route.

You aren't really using the right tool for the job here. Yes, spark reads from hive but so do a lot of other tools. Spark is made to do batch processing, weather it's steaming or processing. Rate control would require custom code.
You might look at other open source tools, like NIFI that know how to work with hive and also understand hive. Here's a good discussion on how to control rate flow with Nifi.
Or look at Nutch which was made to scrape the internet into hadoop.
If you wanted to abuse spark to do this, you might be able to do something with foreachPartitions and repartitioning the partitions up into smaller chunks, and reducing the number of cores/executors, so that the entire job took longer to process... but really your anti-optimizing at that point... again not really a good look. Possible but not really a good use of Spark.

Related

Why there is no JDBC Spark Streaming receiver?

I suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing Receiver doesn't look difficult.
Could you describe why such receiver doesn't exist (is this approach a bad idea?) or provide links to implementations.
I've found Stratio/datasource-receiver. But it reads all data in a DataFrame before processing by Spark Streaming.
Thanks!
First of all actual streaming source would require a reliable mechanism for monitoring updates, which is simply not a part of JDBC interface nor it is a standardized (if at all) feature of major RDBMs, not to mention other platforms, which can be accessed through JDBC. It means that streaming from a source like this typically requires using log replication or similar facilities and is highly resource dependent.
At the same what you describe
suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once
is really not an use case for streaming. Streaming deals with infinite streams of data, while you ask is simply as scenario for partitioning and such capabilities are already a part of the standard JDBC connector (either by range or by predicate).
Additionally receiver based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and wouldn't be even less appealing if data was bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
I don't think it is a bad idea since in some cases you have constraints that are outside your power,e.g. legacy systems to which you cannot apply strategies such as CDC but to which you still have to consume as a source of stream data.
On the other hand, Spark Structure Streaming engine, in micro-batch mode, requires the definition of an offset than can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDMDS are not the "streaming-friendly" as far as I know.
I have developed Jdbc2s which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need. Coordinates are in the documentation.

Read from Hbase + Convert to DF + Run SQLs

Edit
My use case is a Spark streaming app (spark 2.1.1 + Kafka 0.10.2.1), wherein I read from Kafka and for each message/trigger need to pull data from HBase. post the pull, I need to run some SQL statements on the data (so received from HBase)
Naturally, I intend to push the processing (read from HBase & SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a data frame (so that i can launch SQK statements) are failing. Another gent mentioned that it's not "allowed " since that part is running on executors. However, this is my conscious choice to run those pieces on worker nodes.
Is that sound thinking? If not, why not?
What's the recommendation on that? or on the overall idea?
For every streamed rec, reading from hbase and sql seems to be "too much happening in streaming app".
Anyways, you can create connection for every partition to hbase and get records and then compare. Not sure about sql. If its just another reading for every streaming record, again handle at partition level in spark.
But the above approach will be time consuming - just make sure you finish all stuff before the next batch starts.
You also mentioned converting "hbase to dataframe" and "parallel". Both seemed to be in opposite direction. Because you start with dataframe(may be reading from hbase once and then you parallelize. Hope I cleared some of your doubts

Cassandra Loading Options

I have deployed a 9 node DataStax Cluster in Google Cloud. I am new to Cassandra and not sure how generally people push the data to Cassandra.
My requirement is read the data from flatfiles and RDBMs table and load into Cassandra which is deployed in Google Cloud.
These are the options I see.
1. Use Spark and Kafka
2. SStables
3. Copy Command
4. Java Batch
5. Data Flow ( Google product )
Is there any other options and which one is best.
Thanks,
For flat files you have 2 most effective options:
Use Spark - it will load data in parallel, but requires some coding.
Use DSBulk for batch loading of data from command line. It supports loading from CSV and JSON, and very effective. DataStax's Academy blog just started a series of the blog posts on DSBulk, and first post will provide you enough information to start with it. Also, if you have big files, consider to split them into smaller ones, as it will allow DSBulk to perform parallel load using all available threads.
For loading data from RDBMS, it depends on what you want to do - load data once, or need to update data as they change in the DB. For first option you can use Spark with JDBC source (but it has some limitations too), and then saving data into DSE. For 2nd, you may need to use something like Debezium, that supports streaming of change data from some databases into Kafka. And then from Kafka you can use DataStax Kafka Connector for submitting data into DSE.
CQLSH's COPY command isn't as effective/flexible as DSBulk, so I won't recommend to use it.
And never use CQL Batch for data loading, until you know how it works - it's very different from RDBMS world, and if it's used incorrectly it will really make loading less effective than executing separate statements asynchronously. (DSBulk uses batches under the hood, but it's different story).

Spark Storm or Flink - Big Data analysis

Can anyone recommend me which technology can be explored if I am having a large data set in Cassandra table (3 node cluster) and I need to perform a sum operation on records received on daily basis. The count so calculated needs to be updated in a MySQL table.
Steps to perform -
1. Fetch Ids from MY SQL table
2. Run Sum operation from Cassandra table
3. Insert/update the calculated sum value in MYSQL table
Currently I am using plain Java to perform these tasks using SQL and CQL queries but its very slow and in future data will be growing exponentially.
Can anyone suggest technologies that can be explored to get this task accomplish in fastest possible way and lowest development time.
There's not much to recommend, it depends only on the task you have and your own preferences.
Apache Storm is a streaming engine, it would be good if you want to process stream of entries, not a batch of data like in your case.
Both Apache Spark and Apache Flink will allow you to perform batch job once a day or make a streaming application that will calculate results from one day.
I prefer Apache Spark, as it has unified API for batch and streaming jobs (so you can easily change code from batch to streaming) and strong community support. Apache Flink supports real time streaming, however it's not necessary in your case.
However, you should look and these two frameworks on your own and choose this framework, which looks better for you. In my opinion both of them will be ok

Parquet vs Cassandra using Spark and DataFrames

I have come to this dilemma that I cannot choose what solution is going to be better for me. I have a very large table (couple of 100GBs) and couple of smaller (couple of GBs). In order to create my data pipeline in Spark and use spark ML I need to join these tables and do couple of GroupBy (aggregate) operations. Those operations were really slow for me so I chose to do one of these two:
Use Cassandra and use indexing to speed the GoupBy operations.
Use Parquet and Partitioning based on the layout of the data.
I can say that Parquet partitioning works faster and more scalable with less memory overhead that Cassandra uses. So the question is this:
If developer infers and understands the data layout and the way it is going to be used, wouldn't it better for just use Parquet since you will have more control over it? Why should I pay the price for the overhead that Cassandra causes?
Cassandra is also a good solution for analytics use cases, but in another way. Before you model your keyspaces, you have to know how you need to read the data. You can also use where and range queries, but in a hard restricted way. Sometimes you will hate this restriction, but there are reasons for these restrictions. Cassandra is not like Mysql. In MySQL the performance is not a key feature. It's more about flexibility and consistency. Cassandra is a high performance write/read database. Better in write than in read. Cassandra has also a linear scalability.
Okay, a bit about your use case: Parquet is the better option for you. This is why:
You aggregate raw data on really large and not splitted datasets
Your Spark ML Job sounds like a scheduled, not long-running job. (onces a week, day?)
This fits more in the use cases of Parquet. Parquet is a solution for ad-hoc analysis, filter analysis stuff. Parquet is really nice if you need to run a query 1 or 2 times a month. Parquet is also a nice solution if a marketing guy wants to know one thing and the response time is not so important. Simply and short:
Use Cassandra if you know the queries.
Use Cassandra if a query will be used in a daily business
Use Cassandra if Realtime matters (I talk about a maximum of 30 seconds latency, from, customer makes an action and I can see the result in my dashboard)
Use Parquet if Realtime doesn't matter
Use Parquet if the query will not perform 100x a day.
Use Parquet if you want to do batch processing stuff
It depends on your usecase. Cassandra makes it much easier (also outside of Spark) to access your data with (limited) pseudo-SQL. That makes it a perfect fit for building online-applications on top (e.g. to display the data in an UI) of it.
Also Cassandra makes it easier if you have to deal with updates, that is not only the new data going to be ingested in your data pipeline(e.g. logs) but you also have to take care about updates (e.g. system has to handle corrections of data)
When your usecase is to do analytics with Spark (and you don't care about the topics mentioned above), it should be feasible and considerable cheaper to use Parquet/HDFS - as you've stated. With HDFS you also achieve data locality with Spark and you might have the advantage that your analytic Spark applications are even faster if you are reading large blocks of data.

Resources