We are trying to build a scenario where, based on some selection parameters in a
reporting tool (let's say Tableau), a Spark program needs to be executed which performs some market basket analysis on a data set against the selection parameters. The result from the program then needs to be displayed in the reporting tool.
We are not able to figure out how to trigger the Spark program once the user enters
the selection parameters in the reporting tool (basically the linkage between the reporting tool and the Spark program). Any pointers in this regard would help a lot.
Thanks!
It sounds like you are looking for the steps to connect Spark SQL with Tableau. If you want to do any pre-processing, then you have to do it on the source side.
For example, take Hive as the source for Tableau: in that case you have to create the view or do the data massaging on the Hive side.
If you are using Tableau Server, you can use the Tableau JavaScript API to call a function you write when the user makes a selection. The API also has functions your code can call to refresh or display a viz.
I have an application written for Spark using the Scala language. My application code is more or less ready and the job runs for around 10-15 minutes.
There is an additional requirement to provide the status of the application execution while the Spark job is running. I know that Spark evaluates lazily and it is not nice to retrieve data back to the driver program during execution. Typically, I would be interested in providing status at regular intervals.
E.g. if there are 20 functional points configured in the Spark application, then I would like to provide the status of each of these functional points as and when they are executed or when their steps are over during Spark execution.
These incoming statuses of the functional points will then be fed to some custom user interface to display the status of the job.
Can someone give me some pointers on how this can be achieved?
There are a few things you can do on this front that I can think of.
If your job contains multiple actions, you can write a script to poll for the expected output of those actions. For example, imagine your job has 4 different DataFrame save calls. You could have your status script poll HDFS/S3 to see if the data has shown up in the expected output location yet. As another example, I have used Spark to index into Elasticsearch, and I have written status logging that polls for how many records are in the index in order to print periodic progress.
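A minimal sketch of that kind of polling script, assuming the job writes its four outputs under hypothetical HDFS paths and that a _SUCCESS marker means a save has finished:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Polls HDFS for the _SUCCESS markers of four hypothetical output directories
// and reports how many of the save steps have completed so far.
object OutputPoller {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val expected = Seq("/output/step1", "/output/step2", "/output/step3", "/output/step4")

    var done = 0
    while (done < expected.size) {
      done = expected.count(p => fs.exists(new Path(p, "_SUCCESS")))
      println(s"$done of ${expected.size} save steps completed")
      if (done < expected.size) Thread.sleep(30000) // check again in 30 seconds
    }
  }
}
```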
Another thing I have tried before is using Accumulators to keep a rough track of progress and how much data has been written. This works OK, but it is a little arbitrary when Spark updates the visible totals with information from the executors, so I haven't found it to be too helpful for this purpose in general.
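For illustration, a rough sketch of the accumulator approach, assuming a Spark 2.x SparkSession and an illustrative input path; the caveat above about when the visible totals refresh still applies:

```scala
import org.apache.spark.sql.SparkSession

// Counts processed records in an accumulator while a job runs and has the driver
// print the (approximate) running total from a background thread.
object AccumulatorProgress {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("accumulator-progress").getOrCreate()
    val processed = spark.sparkContext.longAccumulator("records processed")

    // Background thread on the driver that periodically prints the accumulator value.
    val monitor = new Thread(new Runnable {
      def run(): Unit = while (true) {
        println(s"Records processed so far: ${processed.value}")
        Thread.sleep(10000)
      }
    })
    monitor.setDaemon(true)
    monitor.start()

    val df = spark.read.parquet("/data/input") // hypothetical input path
    val counted = df.rdd.map { row =>
      processed.add(1) // executors increment the accumulator as they process rows
      row
    }
    counted.count() // an action triggers the actual execution

    println(s"Final total: ${processed.value}")
    spark.stop()
  }
}
```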
The other approach you could take is to poll Spark's status and metrics APIs directly. You will be able to pull all of the information backing the Spark UI into your code and do with it whatever you want. It won't necessarily tell you exactly where you are in your driver code, but if you manually figure out how your driver maps to stages you could work that out. For reference, here is the documentation on polling the status API:
https://spark.apache.org/docs/latest/monitoring.html#rest-api
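As a minimal sketch of hitting that REST API, assuming the driver UI is reachable at localhost:4040 (the host and port depend on how you run the job):

```scala
import scala.io.Source

// Fetches the application list from the driver's monitoring REST endpoint.
// Jobs and stages can be polled the same way once an application id is known.
object RestStatusPoller {
  def main(args: Array[String]): Unit = {
    val base = "http://localhost:4040/api/v1"

    // Returns a JSON array of running/completed applications, each with an "id" field.
    val applications = Source.fromURL(s"$base/applications").mkString
    println(applications)

    // With an application id parsed from the JSON above (use whatever JSON library
    // you prefer), job- and stage-level status is available at:
    //   s"$base/applications/<app-id>/jobs"
    //   s"$base/applications/<app-id>/stages"
  }
}
```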
I'm trying to figure out why activity that I know is occurring isn't showing up in the SQL tab of the Spark UI. I am using Spark 1.6.0.
For example, we have a load of activity occurring today between 11:06 and 13:17, and I know for certain that the code being executed uses the Spark DataFrames API.
Yet if I hop over to the SQL tab I don't see any activity between those times.
So I'm trying to figure out what influences whether or not activity appears in that SQL tab, because the information presented in that SQL tab is (arguably) the most useful information in the whole UI, and when there's activity occurring that isn't showing up it becomes kinda annoying. The only distinguishing characteristic seems to be that the jobs that are showing up in the SQL tab use actions that don't write any data (e.g. count()); the jobs that do write data don't seem to be showing up. I'm puzzled as to why.
Any pearls of wisdom?
Is there a way that I can back up a single table in Apache Cassandra with Java code? I want to run such code once every week using a scheduler. Can someone share links to such resources, if there are any?
Have a look at this answer.
Fetch all rows in cassandra
It's just a matter of adding code to export, let's say, every row to CSV or some similar format that works for you.
You will also have to write a script to load this data back, but those are just simple inserts.
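As a rough sketch of that export step, here is a Scala version using the DataStax Java driver (the equivalent Java calls are the same); the contact point, keyspace, table and file names are placeholders, and the CSV writing is deliberately naive (no escaping of commas or quotes):

```scala
import com.datastax.driver.core.Cluster
import java.io.PrintWriter
import scala.collection.JavaConverters._

// Dumps every row of a single Cassandra table to a CSV file.
object TableBackup {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("my_keyspace")
    val out = new PrintWriter("my_table_backup.csv")
    try {
      val rs = session.execute("SELECT * FROM my_table")
      val columns = rs.getColumnDefinitions.asScala.map(_.getName).toSeq
      out.println(columns.mkString(",")) // header row
      rs.iterator().asScala.foreach { row =>
        out.println(columns.map(c => String.valueOf(row.getObject(c))).mkString(","))
      }
    } finally {
      out.close()
      session.close()
      cluster.close()
    }
  }
}
```

For the weekly run, this can then be triggered from cron or from whatever scheduler you already use.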
In my application all real-time data is stored in a Cassandra table. I plan to analyze it using Apache Spark and put it into different tables which allow faster data fetches, and I want to know which design approach I should apply.
Analyze the real-time table over a time frame, then roll it up into an hourly table, later roll that up into a daily table, then weekly, etc. Then it is easy to retrieve data in a date range. Is my logic fine, or is there another approach with Cassandra and Spark?
I think your approach is good. It is similar to the Lambda Architecture designed by Nathan Marz. For more information, follow this link. Hope this will help you.
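To make the roll-up idea concrete, here is a sketch of the hourly aggregation step, assuming Spark 2.3+ with the spark-cassandra-connector on the classpath; the keyspace, table and column names (analytics, events, events_hourly, event_time, amount) are placeholders for your schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Reads the raw real-time table, aggregates it per hour, and appends the result
// to an hourly roll-up table. Daily/weekly jobs would read events_hourly instead
// and truncate to "day" or "week".
object HourlyRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("hourly-rollup").getOrCreate()

    val raw = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()

    val hourly = raw
      .withColumn("hour", date_trunc("hour", col("event_time")))
      .groupBy("hour")
      .agg(count(lit(1)).as("event_count"), sum("amount").as("total_amount"))

    hourly.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events_hourly"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```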
I have 2 data sources (db1, db2) and 2 datasets. The 2 datasets are stored procedures, one from each data source.
Dataset1 must run first to create a table for dataset2 to update and show (dataset1 will show results too).
Because the data of that table must be based on some tables in db1, the stored procedure creates a table in db2 by using a linked server.
I have searched online and tried "single transaction" on the data source, but it shows an error in dataset1 with no detail.
Is there any way to do it? I want to generate an Excel file with 2 sheets for this result.
Check out this post.
The default behavior of SSRS is to run the datasets at the same time. They are run in the order in which they are presented in your rdl (top down when looking at them in the report data area). Changing the behavior for a single data source with multiple datasets is as simple as clicking a checkbox in the data source dialog.
With multiple data sources it is a little bit more tricky!
Here is the explanation from the MSDN blog post linked above:
Serializing dataset executions when using multiple data sources:
Note that datasets using different data sources will still be executed in parallel; only datasets of the same data source are serialized when using the single transaction setting. If you need to chain dataset executions across different data sources, there are still other options to consider.
For example, if the source databases of your data sources all reside on the same SQL Server instance, you could use just one data source to connect (with single transaction turned on) and then use the three-part name (catalog.schema.object_name) to execute queries or invoke stored procedures in different databases.
Another option to consider is the linked server feature of SQL Server, and then use the four-part name (linked_server_name.catalog.schema.object_name) to execute queries. However, make sure to carefully read the documentation on linked servers to understand its performance and connection credential implications.
This is an interesting question, and while I think there might be another way of doing it, it would take a bit of time playing around with your datasets, plus more information on how your data sources are set up.
Hope this helps though.