Performance improvement for Spark streaming job - apache-spark

I have a Spark job which continuously streams data from a Kafka topic containing IP addresses. The volume is large, around 1M/sec.
This data needs to be correlated with 5 Postgres DB tables that contain subnets. So, each IP address coming from the input Kafka topic has to be validated against all 5 of these tables.
A little background on how subnets and IPs are related:
A subnet can contain 'n' IPs.
Ex: 103.22.238.173/30 belongs to the block 103.22.238.172 - 103.22.238.175, which has a usable host range of 103.22.238.173 - 103.22.238.174
So, we have the below inputs in my Spark app:
Data coming from Kafka at around 1M/sec
5 Database tables with approx. 5M records in each. These tables are cached in Spark for better performance.
How it is implemented currently:
Calculate the long value of the input IP and add it to the dataframe as another column, i.e. IPLong
Ex:
Take all 5 Postgres tables (imported as DataFrames), pass the subnet column of each to a UDF which returns the lowerLong and upperLong for the subnet, and add these as two new columns to the table DataFrames.
Now, I use Spark SQL to find whether the incoming IP belongs to any of the Postgres tables' subnets, like below:
SELECT A.device, A.office, A.subnet, B.* FROM DbtableDF A RIGHT JOIN kafkaRawDF B ON A.lowerLong <= B.IPLong AND A.upperLong >= B.IPLong
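The two conversions in the steps above can be sketched in plain Python with the standard ipaddress module. This is only an illustration, not the actual UDF from the job: the function names are hypothetical, and treating the bounds as the first and last usable host (to match the /30 example above) is an assumption. The final helper mirrors the intended range condition, lowerLong <= IPLong <= upperLong.

```python
import ipaddress

def ip_to_long(ip):
    """Return the IPv4 address as an integer (the IPLong column)."""
    return int(ipaddress.IPv4Address(ip))

def subnet_bounds(cidr):
    """Return (lowerLong, upperLong) for a CIDR subnet.

    Assumption: for prefixes shorter than /31 the network and broadcast
    addresses are excluded, so only usable hosts fall inside the bounds.
    """
    # strict=False accepts a host address such as 103.22.238.173/30
    net = ipaddress.ip_network(cidr, strict=False)
    lower, upper = int(net.network_address), int(net.broadcast_address)
    if net.prefixlen < 31:
        lower, upper = lower + 1, upper - 1
    return lower, upper

def in_subnet(ip, cidr):
    """True when the IP falls inside the subnet's [lower, upper] range."""
    lower, upper = subnet_bounds(cidr)
    return lower <= ip_to_long(ip) <= upper
```

In a Spark job these helpers would have to be wrapped as UDFs or replaced by column expressions; that wiring is left out here.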
This entire process works fine in terms of correlation, but as the input data arrives at 1M/sec and each of the Postgres tables has 5M+ records, the service becomes extremely slow and then blocks.
Note: Resources are not an issue; here is my executor config:
executor:
  cores: 2
  coreLimit: "2"
  instances: 22
  memory: "50g"
I am new to Apache Spark. Can someone please suggest a better way to do this?

Related

Determining the distribution of data within the cluster in Spark

I want to examine the distribution of my data within the cluster. I know how to find out what data is inside each partition. However, I haven't figured out how to find out the distribution of the data within the cluster.
Does a method exist in Spark to find out which rows or how many rows of a data frame are on a particular node within the cluster?
Or alternatively, is there a method to map from the partition ID to the executor ID?
Kind regards

2 million queries against a dataframe

I need to run 2 million queries against a three-column table t (s, p, o) with 10 billion rows. The data type of each column is string.
Only two types of queries:
select s, p, o from t where s = param
select s, p, o from t where o = param
With the table stored in a PostgreSQL database, this takes 6 hours using a Java ThreadPoolExecutor.
Do you think Spark can speed up the query processing even more?
What would be the best strategy? These are my ideas:
Load the table into a dataframe and launch the queries against the dataframe.
Save the table as Parquet and launch the queries against it.
Use Spark 2.4 to launch queries against the PostgreSQL database instead of querying it directly.
Use Spark 3.0 to launch queries against the database loaded into PG-Strom, an extension module of PostgreSQL with GPU support.
Thanks,
Using Apache Spark on top of the existing MySQL or PostgreSQL server(s) (without the need to export or even stream data to Spark or Hadoop) can increase query performance by more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives an additional performance increase for some queries. You can also use the Spark cache function to cache the whole MySQL query results table.
The idea is simple: Spark can read MySQL or PostgreSQL data via JDBC and can also execute SQL queries, so we can connect it directly to the databases and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries, it can be much faster because Spark is a massively parallel system. For example, MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes.
But I recommend you use NoSQL (HBase, Cassandra, ...) or NewSQL solutions for your analyses, because they perform better as the scale of your data increases.
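As a rough sketch of the JDBC approach described above, the snippet below builds the options for Spark's JDBC data source so that a table is read in parallel slices. The option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are standard Spark JDBC options; the connection URL, table name, column name, and bounds are made-up placeholders, the helper names are hypothetical, and credentials are left out.

```python
def jdbc_options(url, table, column, lower, upper, partitions):
    """Build options for Spark's JDBC source so the read is split by `column`."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,   # must be a numeric, date, or timestamp column
        "lowerBound": str(lower),    # bounds shape the split; they do not filter rows
        "upperBound": str(upper),
        "numPartitions": str(partitions),
    }

def read_table(spark, options):
    """Read the table through JDBC; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()

# Placeholder values for illustration only.
opts = jdbc_options("jdbc:postgresql://dbhost:5432/mydb", "events", "id", 1, 1000000, 16)
```

Calling read_table(spark, opts) would need a running Spark session and the PostgreSQL JDBC driver on the classpath.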
Static Data? Spark; Otherwise tune Postgres
If the 10 billion rows are static or rarely updated, your best bet is going to be using Spark with appropriate partitions. The magic happens with parallelization, so the more cores you have, the better. You want to aim for partitions that are about half a gig in size each.
Determine the size of the data by running SELECT pg_size_pretty( pg_total_relation_size('tablename')); Divide the result by a multiple of the number of cores available to Spark, increasing the multiple until each partition comes out between 1/8 and 3/4 gig.
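One way to read the sizing rule above is: pick the smallest multiple of the core count that keeps each partition at or below 3/4 gig. The sketch below does that arithmetic; the function name and the 1 TiB / 64 cores figures are illustrative assumptions, not values from the question.

```python
import math

GIB = 1024 ** 3

def choose_partitions(total_bytes, cores, max_partition_bytes=0.75 * GIB):
    """Return the smallest multiple of `cores` whose partitions fit the size cap."""
    if total_bytes <= 0 or cores <= 0:
        raise ValueError("total_bytes and cores must be positive")
    multiple = max(1, math.ceil(total_bytes / (cores * max_partition_bytes)))
    return cores * multiple

# Illustrative input: a 1 TiB table on a cluster with 64 cores.
partitions = choose_partitions(1024 * GIB, 64)
partition_gib = 1024 / partitions  # resulting partition size in GiB
```

For a small table the result is simply the core count, which may leave partitions under 1/8 gig; in that case fewer partitions than cores may be the better choice.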
Save as parquet if you really have static data or if you want to recover from a failure quickly.
If the source data are updated frequently, you're going to want to add indices in Postgres. It could be as straightforward as adding an index on each column. Partitioning in Postgres would also help.
Stick to Postgres. Newer databases are not appropriate for structured data such as yours. There are parallelization options. Aurora, if you're on AWS.
PG-Strom is not going to work for you here. You have simple data with few columns. Getting them into and out of a GPU is going to slow you down too much.

Mysql or Spark Processing of 400gb data

Will Spark be useful in my case, given how it splits work across blocks and cores?
I have 400 GB of data in a single MySQL table, User_events, with multiple columns. This table stores all user events from the application. Indexes exist on the required columns. I have a user interface where users can try different permutations and combinations of fields from user_events.
Currently I am facing performance issues: a query either takes 15-20 seconds or longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure if it can help here. My understanding of Spark is:
First, Spark has to bring all the data into memory. Bringing 100M records over the network will be a costly operation, and I will need a lot of memory for it. Is that right?
Once the data is in memory, Spark can distribute it among partitions based on the cores and the input data size, and then filter the data on each partition in parallel. Here Spark can be beneficial because it works in parallel, while MySQL will be sequential. Is that correct?
Is my understanding correct ?

Is there a way to control the distribution of spark partitions across nodes in a cluster?

I have an 8 node cluster and I load two dataframes from a jdbc source like this:
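# Assumes an existing SparkSession `spark`; connStr, positionsSQL, datesSQL and props are defined elsewhere.
positionsDf = spark.read.jdbc(
    url=connStr,
    table=positionsSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
positionsDf.cache()  # marks the DataFrame for caching; it is materialized by the first action

varDatesDf = spark.read.jdbc(
    url=connStr,
    table=datesSQL,
    column="PositionDate",
    lowerBound=41275,
    upperBound=42736,
    numPartitions=128 * 3,
    properties=props
)
varDatesDf.cache()

# The count is the action that triggers both reads and the join.
res = varDatesDf.join(positionsDf, on='PositionDate').count()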
I can see from the Storage tab of the application UI that the partitions are evenly distributed across the cluster nodes. However, what I can't tell is how they are placed across the nodes, i.e. which partition lives where. Ideally, both dataframes would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
In other words, will the positionsDf dataframe partition that contains records with PositionDate="01 Jan 2016" be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"? Will they be on the same node? Or is it just random?
Is there any way to see what partitions are on which node?
Does Spark distribute the partitions created using a column key like this in a deterministic way across nodes? Will they always be node/executor local?
will the positionsDf dataframe partition that contains records with PositionDate="01 Jan 2016" be located in the same executor memory space as the varDatesDf dataframe partition that contains records with PositionDate="01 Jan 2016"
In general, it won't be. Even if the data is co-partitioned (it is not here), that doesn't imply co-location.
Is there any way to see what partitions are on which node?
This relation doesn't have to be fixed over time. Tasks can, for example, be rescheduled. You can use various RDD tricks (TaskContext) or database logs, but neither is reliable.
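A minimal sketch of one such RDD trick, assuming PySpark: a function passed to mapPartitionsWithIndex reports, for each partition, the host that processed it and the number of rows it held. The function name is hypothetical, the host name identifies the node rather than the executor, and the result is only a snapshot of one run.

```python
import socket

def describe_partition(index, rows):
    """Yield one (partition id, host name, row count) tuple for a partition."""
    count = sum(1 for _ in rows)
    yield index, socket.gethostname(), count

# Assumed usage with an existing DataFrame `df` (needs a running Spark application):
#   placement = df.rdd.mapPartitionsWithIndex(describe_partition).collect()
#   for pid, host, n in sorted(placement):
#       print(pid, host, n)
```

Because tasks can be rescheduled, running the same job again may report a different host for the same partition ID.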
would be distributed in such a way that the joins are always local to the node, or even better local to the executors.
The scheduler has its own internal optimizations, and the low-level APIs allow you to set node preferences, but this kind of thing is not controllable in Spark SQL.

Cassandra Fast Read Configuration

I have 4 Cassandra nodes with 1 seed in a single data center. I have about 5M records, which Cassandra takes around 4 minutes to read, whereas MySQL takes only 17 seconds. So my guess is that there is something wrong in my configuration. Could anyone let me know which configuration attributes I should check in cassandra.yaml?
You may be doing an apples to oranges comparison if you are reading all 5M records from one client.
With MySQL all the data is local and optimized for reads since data is updated in place.
Cassandra is distributed and optimized for writes. Writes are simple appends, but reads are expensive since all the appends need to be read and merged to get the current value of each column.
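The read-side merge described above can be illustrated with a deliberately simplified model: each write is appended as a (column, timestamp, value) entry, and a read has to look through every segment and keep the newest entry per column. This is a toy sketch with hypothetical names and sample data, and it ignores deletes, in-memory tables, and the other details of Cassandra's real storage engine.

```python
def merge_versions(*segments):
    """Merge appended writes; for each column the newest timestamp wins."""
    latest = {}
    for segment in segments:
        for column, timestamp, value in segment:
            if column not in latest or timestamp > latest[column][0]:
                latest[column] = (timestamp, value)
    return {column: value for column, (_, value) in latest.items()}

# Two append-only segments holding writes for the same row.
older = [("email", 1, "a@example.com"), ("city", 1, "Pune")]
newer = [("email", 2, "b@example.com")]
row = merge_versions(older, newer)
```

The read touches both segments even though only one column changed, which is the cost the answer refers to.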
Since the data is distributed across multiple nodes, there is a lot of overhead of accessing and retrieving the data over the network.
If you were using Spark with Cassandra and loading the data into Spark workers in parallel without shuffling it across the network to a single client, then it would be a more similar comparison.
Cassandra is generally good at ingesting large amounts of data and then working on small slices of it (i.e. partitions) rather than doing table scan operations such as reading the entire table.

Resources