Benchmarks on Spark + external Metastore (on AWS EMR)

How can we benchmark Spark when it is connected to an external metastore?
I tried spark-bench, but the bottleneck I hit there is that I can't use an external metastore as input. I can connect to HDFS and S3 and get benchmarks from those. Or am I doing something wrong?
Any help is highly appreciated.

Related

Spark and HDFS on Kubernetes data locality

I'm trying to run Spark on K8s and struggling a bit with data locality. I'm using the native Spark support, but just watched https://databricks.com/session/hdfs-on-kubernetes-lessons-learned. I've followed the steps there in setting up my HDFS cluster (namenode on the first K8s node, using host networking). Does anyone know whether the fix to the Spark driver presented there has been merged into the mainline Spark code?
I ask as I still see ANY locality in places I'd expect NODE_LOCAL.

The code has been part of version v2.2.0-kubernetes-0.4.0.

How to connect local spark to Hive in cluster in Scala IDE

Can you please let me know the steps to connect Scala IDE, which I use for developing Spark applications, to Hive? Currently the output goes to HDFS and I then create an external table on top of it. But because Spark Streaming creates small files, performance is getting bad. I want Spark to write directly to Hive, and I am not sure what configuration I need on my PC for that to work during development.

Spark need of HDFS

Hi, can anyone explain whether Apache Spark Standalone needs HDFS?
If it is required, how does Spark use the HDFS block size during application execution?
I am trying to understand what role HDFS plays during Spark application execution.
The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores.
Can anyone please help me understand?

Spark works fine without HDFS, and HDFS is certainly not required for core execution.
Some distributed storage (not necessarily HDFS) is required for checkpointing and is useful for saving results.
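
On the block-size part of the question: when the input does sit on HDFS, my understanding is that sc.textFile goes through Hadoop's FileInputFormat, which sizes each split as max(minSize, min(goalSize, blockSize)), and each split becomes one RDD partition. Below is a minimal sketch of that arithmetic in plain Python. The function and parameter names are my own for illustration (not Spark or Hadoop API), and it only estimates the count for a single non-empty file.

```python
SPLIT_SLOP = 1.1  # Hadoop lets the final split run up to 10% over the split size


def estimate_splits(file_bytes, block_bytes, min_partitions=2, min_split_bytes=1):
    """Estimate how many input splits (and so RDD partitions) one file yields.

    Assumption: mirrors the old mapred FileInputFormat logic for a single,
    non-empty, splittable file. Names here are hypothetical.
    """
    # Target size if the file were cut into exactly min_partitions pieces
    goal = file_bytes // max(min_partitions, 1)
    # The block size caps the split size; the minimum split size floors it
    split = max(min_split_bytes, min(goal, block_bytes))
    count = 0
    remaining = file_bytes
    while remaining / split > SPLIT_SLOP:
        count += 1
        remaining -= split
    if remaining > 0:
        count += 1  # whatever is left becomes the last split
    return count


MIB = 1024 ** 2
print(estimate_splits(1024 * MIB, 128 * MIB))  # 1 GiB file, 128 MiB blocks
```

So with the defaults assumed here, a 1 GiB file on 128 MiB blocks gives 8 partitions, while a file smaller than one block is still cut in two because of the minimum partition count. Without HDFS the same idea applies, only the "block size" comes from whatever filesystem is used.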

Can Spark access DynamoDB without EMR

I have a set of AWS instances where an Apache Hadoop distribution is set up along with Apache Spark.
I am trying to access DynamoDB through Spark Streaming to read from and write to a table.
While writing the Spark-DynamoDB code, I learned that emr-ddb-hadoop.jar is required to get the DynamoDB InputFormat and OutputFormat, and that it is present only on EMR clusters.
After checking a few blogs, it seems DynamoDB is accessible only from Spark on EMR.
Is that correct?
However, I used the standalone Java SDK to access DynamoDB, which worked fine.

I found a solution to the problem.
I downloaded the emr-ddb-hadoop.jar file from EMR and am using it in my environment.
Please note: to access DynamoDB, only the above jar is needed.

Currently we are building a reporting platform, and we used Shark as the data store. Since development of Shark has stopped, we are evaluating Spark SQL. Based on our use cases, we have a few questions.
1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo). We would like to know how we can get this data into Spark SQL. Is there any utility we can use? Does this utility support continuous refresh of data (syncing new adds/updates/deletes from the data store to Spark SQL)?
2) Is there a way to create multiple databases in Spark SQL?
3) For the reporting UI we use Jasper, and we would like to connect from Jasper to Spark SQL. In our initial search we learned that there is currently no support for consumers to connect to Spark SQL through JDBC, but that you would like to add it in future releases. We would like to know when Spark SQL will have a stable release with JDBC support. Meanwhile, we took the source code from https://github.com/amplab/shark/tree/sparkSql but had some difficulty setting it up locally and evaluating it. It would be great if you could help us with setup instructions. (I can share the issue we are facing; please let me know where I can post the error logs.)
4) We would also require a SQL prompt where we can execute queries. Currently the Spark shell provides a Scala prompt where Scala code can be executed, and from Scala code we can fire SQL queries. As with Shark, we would like to have a SQL prompt in Spark SQL. In our search we found that this would be added in a future release of Spark. It would be great if you could tell us which release of Spark will address this.

As for:
3) Spark 1.1 provides better support for the Spark SQL Thrift server interface, which you may want to use for JDBC access. Hive JDBC clients that support v0.12.0 can connect to and interact with such a server.
4) Spark 1.1 also provides a Spark SQL CLI that can be used for entering queries, in the same fashion as the Hive CLI or Impala Shell.
Please provide more details about what you are trying to achieve for 1 and 2.

I can answer (1):
Apache Sqoop was made specifically to solve this problem for the relational databases. The tool was made for HDFS, HBase, and Hive -- as such it can be used to make data available to Spark, via HDFS and the Hive metastore.
http://sqoop.apache.org/
I believe Cassandra is available to SparkContext via this connector from DataStax: https://github.com/datastax/spark-cassandra-connector -- which I have never used.
I'm not aware of any connector for MongoDB.

1) We have data from various sources (MySQL, Oracle, Cassandra, Mongo)
You have to use a different driver for each case. For Cassandra there is the DataStax driver (but I encountered some compatibility problems with Spark SQL). For any SQL system you can use JdbcRDD. The usage is straightforward; look at the Scala example:
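    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // Excerpt from a ScalaTest suite; `sc` is a SparkContext field of the suite.
    test("basic functionality") {
      sc = new SparkContext("local", "test")
      val rdd = new JdbcRDD(
        sc,
        // Connection factory, invoked for each partition
        () => { DriverManager.getConnection("jdbc:derby:target/JdbcRDDSuiteDb") },
        // The query needs two ? placeholders for the partition bounds
        "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
        1, 100, 3, // lowerBound, upperBound, numPartitions
        (r: ResultSet) => { r.getInt(1) } ).cache()
      assert(rdd.count === 100)
      assert(rdd.reduce(_+_) === 10100)
    }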
But note that it's just an RDD, so you should work with this data through the map-reduce API, not through SQLContext.
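
To make the 1, 100, 3 arguments in the example less opaque: they are the lower bound, upper bound, and number of partitions, and each partition runs the query with its own inclusive ID range bound to the two ? placeholders. Here is a small sketch in plain Python of how I understand JdbcRDD divides the range. The function name is made up for illustration and is not part of Spark.

```python
def jdbc_partition_bounds(lower_bound, upper_bound, num_partitions):
    """Return the inclusive (start, end) range each partition would query.

    Assumption: follows the even range split used by JdbcRDD; this helper
    is hypothetical and only illustrates the arithmetic.
    """
    length = 1 + upper_bound - lower_bound  # both bounds are inclusive
    bounds = []
    for i in range(num_partitions):
        start = lower_bound + (i * length) // num_partitions
        end = lower_bound + ((i + 1) * length) // num_partitions - 1
        bounds.append((start, end))
    return bounds


print(jdbc_partition_bounds(1, 100, 3))
# [(1, 33), (34, 66), (67, 100)]
```

So the three partitions would each issue the same SELECT with a different ID window, which is why the key column should be numeric and reasonably evenly distributed.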
Does there exist any utility which we can use?
There is the Apache Sqoop project, but it's in an active development state. The current stable version doesn't even save files in Parquet format.

Spark SQL is a capability of the Spark framework. It shouldn't be compared to Shark because Shark is a service. (Recall that with Shark, you run a ThriftServer that you can then connect to from your Thrift app or even ODBC.)
Can you elaborate on what you mean by "get this data into Spark SQL"?

There are a couple of Spark-MongoDB connectors:
- the MongoDB connector for Hadoop (which doesn't actually need Hadoop at all!): https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
- the Stratio MongoDB connector: https://github.com/Stratio/spark-mongodb

If your data is huge and you need to perform a lot of transformations, then Spark SQL can be used for ETL; otherwise Presto could solve your problems. Addressing your queries one by one:
Your data is in MySQL, Oracle, Cassandra, and Mongo; all of these can be integrated in Presto, as it has connectors (https://prestodb.github.io/docs/current/connector.html) for all these databases.
Once you install Presto in cluster mode, you can query all these databases together on one platform, which also lets you join a table from Cassandra with tables from Mongo; this flexibility is unparalleled.
Presto can be connected to Apache Superset (https://superset.incubator.apache.org/), which is open source and provides a full set of dashboarding features. Presto can also be connected to Tableau.
You can install MySQL Workbench with the Presto connection details, which provides a UI for all your databases in one place.
