How to use datahub to get spark transformation lineage? - apache-spark

I have set up DataHub and Spark in k8s in different namespaces. I can run Spark with the DataHub configuration following this guide: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/. My Spark application reads data from MinIO, performs some transformations including groupBy, pivot, and rename plus a Spark SQL query, then writes the result to a Cassandra database.
After the Spark execution finishes, I can see my Spark application in DataHub, but there is no information under "Tasks" or "Lineage" in DataHub. What should I do to get this data? There is very limited information in the DataHub documentation. Thanks!
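For reference, the guide linked above boils down to attaching DataHub's Spark listener through Spark configuration. A minimal PySpark sketch of that setup (the agent version, app name, and GMS service URL below are placeholders and assumptions; check the linked guide for the values that match your DataHub release and namespace):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-transform-job")  # placeholder app name; becomes the pipeline name in DataHub
    # pull the DataHub Spark lineage agent onto the classpath (version is a placeholder)
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:<version>")
    # register the listener that emits lineage events
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # assumed in-cluster DNS name of the DataHub GMS service; adjust to your namespace
    .config("spark.datahub.rest.server", "http://datahub-datahub-gms.datahub.svc.cluster.local:8080")
    .getOrCreate()
)

If the application shows up in DataHub but still has no Tasks or Lineage, a first check is the driver logs, to confirm the listener was actually registered and is emitting events for your reads and writes.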

Related

Tables not found in Spark SQL after migrating from EMR to AWS Glue

I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata.
I create Hive external tables, they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ...").
Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs are not using the Glue catalog for Spark SQL the same way Spark SQL does when running on EMR.
I can work around this by using Glue APIs and registering dataframes as temp views:
create_dynamic_frame_from_catalog(...).toDF().createOrReplaceTempView(...)
but is there a way to do this automatically?
This was a much-awaited feature request (using the Glue Data Catalog with Glue ETL jobs) which has been released recently.
When you create a new job, you'll find the following option:
Use Glue data catalog as the Hive metastore
You may also enable it for an existing job by editing the job and adding --enable-glue-datacatalog to the job parameters, providing no value.
Instead of using SparkContext.getOrCreate(), you should use SparkSession.builder().enableHiveSupport().getOrCreate(), with enableHiveSupport() being the important part that's missing. I think what's probably happening is that your Spark job is not actually creating your tables in Glue but rather is creating them in Spark's embedded Hive metastore, since you have not enabled Hive support.
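A minimal PySpark sketch of that suggestion (the table name is a placeholder; this assumes the job is configured to use the Glue Data Catalog as its Hive metastore):

from pyspark.sql import SparkSession

# enableHiveSupport() makes spark.sql() resolve tables through the Hive metastore
# (the Glue Data Catalog when the job is configured for it) rather than Spark's
# embedded metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.sql("select * from hive_table")  # hypothetical table name
df.show()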
Had the same problem. It was working on my dev endpoint but not in the actual ETL job. It was fixed by editing the job to change it from Spark 2.2 to Spark 2.4.
In my case:
SparkSession.builder
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .getOrCreate()

Hive queries in Spark cluster

I need to understand how a Hive query will get executed in a Spark cluster. Will it operate as a MapReduce job running in memory, or will it use the Spark architecture to run the Hive queries? Please clarify.
If you run Hive queries in Hive or Beeline, they will use MapReduce. But if you run Hive queries in the Spark REPL or a Spark program, the queries simply get converted into DataFrames: the logical and physical plans are created the same way as for a DataFrame and then executed. Hence they use all the power of Spark.
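A short PySpark illustration of that point (the table name is a placeholder; this assumes Hive support is enabled in the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The HiveQL string is parsed by Spark SQL into a DataFrame...
df = spark.sql("select count(*) from hive_table")  # hypothetical table
# ...so explain() prints a Spark logical/physical plan, not a MapReduce job
df.explain()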
Assuming that you have a Hadoop cluster with YARN and Spark configured: the Hive execution engine is controlled by the hive.execution.engine property. According to the docs it can be mr (the default), tez, or spark.

Spark SQL query history

Is there any way to get a list of Spark SQL queries executed by various users in a Hadoop cluster?
For example, is there any log file where a Spark application stores the query in string format?
There is a Spark History Server (port 18080 by default). If you have spark.eventLog.enabled and spark.eventLog.dir configured and the Spark History Server is running, you can check which Spark apps have been executed on your cluster. Each job there may have a SQL tab in the UI where you can see the SQL queries. But there is no single place or log file that stores them all.
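A minimal sketch of enabling the event log from a PySpark application (the log directory below is a placeholder; these settings can also be put cluster-wide in spark-defaults.conf):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # write event logs so the Spark History Server can replay the UI, including the SQL tab
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-logs")  # placeholder path
    .getOrCreate()
)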

Improve reading speed of Cassandra in Spark (Parallel reads implementation)

I am new to Spark and trying to combine Cassandra and Spark to do some analytical tasks.
From the Spark web UI I found that most of the time is consumed in the reading process.
When I dug into this particular task, I found that only a single executor is working on it.
Is it possible to improve the performance of this task via some tricks like parallelization?
p.s. I am using the pyspark cassandra connector (https://github.com/TargetHolding/pyspark-cassandra).
UPDATE: I am using a 3-node Spark cluster running Spark 1.6 and a 3-node Cassandra cluster running Cassandra 2.2.4.
And I am selecting data in the form of
"select * from tbl where partitionKey IN [pk_1,pk_2,....,pk_N] and clusteringKey > ck_1 and clusteringKey < ck_2"
UPDATE 2: I've read an article suggesting replacing the IN clause with parallel reads (https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/). How can this be achieved in Spark?
I will be able to answer precisely if you provide more details about your cluster, Spark and Cassandra versions, and related setup. Still, I will try to answer as per my understanding.
Make sure your RDDs (parallelized collections) are properly partitioned.
If your Spark job is running on only a single executor, please verify your spark-submit command; you can find more details about spark-submit options for your cluster manager in the Spark documentation.
To speed up Cassandra read operations, make use of proper indexing. I would recommend Solr, which will help you with fast data retrieval from Cassandra.
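Regarding the parallel-reads idea from UPDATE 2: one common approach is to issue one filtered read per partition key and union the results, so the scans spread across executors instead of going through a single IN query. A rough PySpark sketch, assuming the DataFrame reader of the Spark Cassandra connector (keyspace, table, column names, and key values below are placeholders; on Spark 1.6 you would use sqlContext.read and unionAll instead of SparkSession):

from functools import reduce
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

def read_partition(pk):
    # one pushed-down read per partition key instead of a single big IN query
    return (
        spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="ks", table="tbl")  # placeholder keyspace/table
        .load()
        .where(
            (col("partitionKey") == pk)
            & (col("clusteringKey") > "ck_1")  # placeholder bounds
            & (col("clusteringKey") < "ck_2")
        )
    )

partition_keys = ["pk_1", "pk_2"]  # placeholder partition keys
df = reduce(DataFrame.union, [read_partition(pk) for pk in partition_keys])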

connecting to spark data frames in tableau

We are trying to generate reports in Tableau via Spark SQL connectivity, but I found out that we are ultimately connecting to the Hive metastore.
If this is the case, what are the advantages of this new Spark SQL connection? Is there a way to connect from Tableau, using Spark SQL, to Spark DataFrames that are persisted?
The problem here is a Tableau problem more than a Spark problem. The Spark SQL connector launches a Spark job each time you connect to a database. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages, and each time you make a change or select on a graph, the refresh has to go a level deeper, through Spark, to the Hive metastore to get the data. That is how Tableau is designed. The only option here is to swap Tableau for Spotfire (or some other tool) where, by pre-caching the underlying Hive table, the Spark SQL connector can query it directly from Spark's distributed memory, skipping the load step.
Disclosure: I am in no way associated with Spotfire makers
