Tableau performance with 500 concurrent users - apache-spark

We are planning to run Tableau Server on an 8-core machine with 64 GB of RAM for data visualization, as it has many rich features. Our idea is to use Spark SQL with Hive metadata as the data source, with mostly live queries on top of it.
Can anyone who has used the same architecture share their thoughts on how Tableau will behave with more than 500 concurrent users?

Related

Cassandra Cluster - Production (VMware)

I intend to create a Cassandra cluster with 10 nodes (16 vCPUs + 32 GB of RAM each).
For this cluster, I intend to use high-end storage (SSD only) with 320k IOPS. The nodes will be spread over 10 hosts with VMware 6.7 installed. Are there any contraindications in this case, even though this is a very high-performance architecture for any type of application or database?
The server side looks fine, but you also need to consider other things such as the network, the OS, and data modelling to get good performance from Cassandra.
You can take a look at the DataStax recommendations here:
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html

System Requirements for Spark in Production

Could someone please help me with the system requirements for running Spark in a production environment?
I am trying to set up an environment for batch processing of data coming from a Kafka producer.
The volume of data processed daily is in the terabyte range.
The data comes from HDFS, and the persistence layer is also HDFS.
The information I have so far:
4-8 disks per node, configured without RAID (just as separate mount points).
At least 8-16 cores per machine.
Allocate at most 75% of the memory to Spark, leaving the rest for the operating system and buffer cache.
A 10 Gigabit or faster network is the best way to make these applications faster.
Please share your knowledge if you have used Spark in production.
Thanks a ton.
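The memory guideline quoted in this question (at most 75% for Spark, the rest for the operating system and buffer cache) can be turned into a quick per-node calculation. This is only a sketch: the 64 GB node size is a hypothetical value, since the question does not state how much RAM each node has, and `split_node_memory` is a helper name made up for illustration.

```python
from typing import Tuple


def split_node_memory(total_gb: float, spark_fraction: float = 0.75) -> Tuple[float, float]:
    """Split a node's RAM into (spark_gb, os_and_cache_gb).

    spark_fraction defaults to the 75% upper bound quoted in the question.
    """
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    if not 0 < spark_fraction <= 1:
        raise ValueError("spark_fraction must be in (0, 1]")
    spark_gb = total_gb * spark_fraction
    return spark_gb, total_gb - spark_gb


# Hypothetical node size; replace with the real RAM per worker.
node_ram_gb = 64
spark_gb, os_gb = split_node_memory(node_ram_gb)
print(f"Spark: {spark_gb} GB, OS and buffer cache: {os_gb} GB")
```

The Spark share is an upper bound for everything Spark runs on that node, so executor memory settings would need to stay below it in total.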

POC on Cassandra and PowerBI Report server

1. I have been given the task of setting up hardware for a Cassandra DB (preferably on a VM). For now, Cassandra holds 100 GB of data, and data is ingested at 500 bytes every 2 seconds. What kind of hardware/VM should I use?
2. We need Power BI Report Server to connect to this DB, and I plan to use the CData ODBC Driver to establish the connection. Given the above configuration, will I face any performance or connection issues?
Thanks,
Karthik
To your first part:
Your incoming data rate is 250 bytes/s. Over a single year, this comes to about 8 GB of raw data, which is quite small and should easily fit into a virtual machine. Keep in mind that the storage used on disk will be higher than this, as there is overhead for internal structures as well as for replication (if you need high availability).
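The yearly figure follows directly from the ingestion rate given in the question (500 bytes every 2 seconds). A small sketch of the arithmetic is below; the replication factor of 3 is an assumed example value, not something stated in the question, and the overhead for Cassandra's internal structures is not modelled.

```python
BYTES_PER_WRITE = 500        # from the question
SECONDS_PER_WRITE = 2        # from the question
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

rate_bytes_per_s = BYTES_PER_WRITE / SECONDS_PER_WRITE
raw_gb_per_year = rate_bytes_per_s * SECONDS_PER_YEAR / 1e9  # decimal gigabytes


def replicated_gb(raw_gb: float, replication_factor: int) -> float:
    """Raw volume multiplied by the number of replicas kept in the cluster."""
    return raw_gb * replication_factor


print(f"Rate: {rate_bytes_per_s:.0f} bytes/s")
print(f"Raw data per year: {raw_gb_per_year:.2f} GB")
# Assumed replication factor of 3, purely for illustration.
print(f"With 3 replicas: {replicated_gb(raw_gb_per_year, 3):.2f} GB")
```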
But I don't recommend VMs for Cassandra as they often use shared storage for their images which can be a real performance killer due to noisy neighbours and latency. This issue can be less relevant when SSDs or NVMe storage is used.
For the second part: I don't know much about Power BI beyond its name. But there is/was an ODBC driver for Cassandra from DataStax:
https://www.datastax.com/dev/blog/using-the-datastax-odbc-driver-for-apache-cassandra
Maybe that helps.

Cassandra cluster planning

I would like to know about hardware limitations in cluster planning (in TBs) specific to my use case. I have read a few threads and documents on the topic, but some of the content seems to be over 5 years old, so I thought I would ask again:
Use case: Building a time-series Cassandra cluster that is bulk-loaded from time to time from data sources that are gigabytes in size. End users, however, will mostly be reading data from the cluster. Updates or deletes on rows will be quite rare.
I have an initial hardware configuration with me to setup Cassandra cluster:
2*12 Cores
128 GB RAM
HDD SAS 3.27 TB
This is the initial plan that I came up with:
Now that I am reconsidering the setup, and after reading the post:
should I split my nodes further into smaller ones with less RAM, fewer vCPUs, and less HDD capacity?
If yes, what would be a good fit for my case?

How to prevent Spark SQL + Power BI OOM

I'm currently testing Spark SQL as a query engine for Microsoft Power BI.
What I have:
A huge Cassandra table with data I need to analyze.
An Amazon server with 8 cores and 16 GB of RAM.
A Spark Thrift Server running on this server (Spark version 1.6.1).
A Hive table mapped to a huge Cassandra table.
create table data using org.apache.spark.sql.cassandra options (cluster 'Cluster', keyspace 'myspace', table 'data');
Everything was fine until I tried to connect Power BI to Spark. The problem is that Power BI tries to fetch all the data from the huge Cassandra table, and the Spark Thrift Server obviously crashes with an OOM error. I can't just add RAM to the Spark Thrift Server, because the Cassandra table with the raw data is really huge. I also can't rely on a custom initial query on the BI side, because the server would crash every time a user forgets to set that query.
The best approach I see is to automatically wrap all queries from BI in something like
SELECT * FROM (... BI select ...) LIMIT 1000000
That would be fine for our current use cases.
So, is this possible on the server side? How can I do it?
If not, how can I prevent the Spark Thrift Server from crashing? Is it possible to drop or cancel huge queries before they cause an OOM?
Thanks.
OK, I found a configuration option that solves my problem:
spark.sql.thriftServer.incrementalCollect=true
When this option is set, Spark splits the data fetched by a large query into chunks.
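As far as I understand, with incremental collection the Thrift server pulls results to the driver piece by piece instead of materializing the whole result set at once. The following is a conceptual Python sketch of that difference, not Spark's actual implementation; `load_partitions`, `collect_all`, and `collect_incrementally` are hypothetical names used only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, int]


def load_partitions(num_partitions: int, rows_per_partition: int) -> Iterator[List[Row]]:
    """Lazily produce one partition (a list of rows) at a time."""
    for p in range(num_partitions):
        yield [(p, i) for i in range(rows_per_partition)]


def collect_all(partitions: Iterable[List[Row]]) -> List[Row]:
    """Non-incremental: every row is held in memory before anything is returned."""
    return [row for part in partitions for row in part]


def collect_incrementally(partitions: Iterable[List[Row]]) -> Iterator[Row]:
    """Incremental: only the current partition is held while its rows are streamed out."""
    for part in partitions:
        yield from part


# Stream rows to a consumer without ever building the full result list.
rows_seen = 0
for _row in collect_incrementally(load_partitions(4, 3)):
    rows_seen += 1
print(rows_seen)
```

Note that the client still receives every row in both cases; what changes is the peak amount of data held on the server side at any one time. The option itself is an ordinary Spark configuration property, so it should be possible to pass it when starting the Thrift server, for example with `--conf spark.sql.thriftServer.incrementalCollect=true`.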
