Possible ways of comparing large records from one table on a database and another table on another database - python-3.x

I am looking into ways of comparing records from the same table on different databases. I just need to compare them and find the missing records.
I tried out a few methods.
Loading the records into a pandas DataFrame with read_sql. It takes a lot of time and memory to complete the load, and if the record set is large I get a memory error.
Setting up a standalone Spark cluster and running the comparison there. It also throws a Java heap space error, and tuning the configuration has not helped either.
Please let me know if there are other ways to handle this huge record comparison.
--update
Is there a readily available tool for cross-data-source comparison?

If your data size is huge, you can use cloud services to run your Spark job and get the results. For example, AWS Glue is serverless and is billed on a pay-as-you-go basis.
Or, if your data is not especially large and this is a one-time job, you can use Google Colab, which is free, and run your comparison there.
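If only the missing records are needed, one option that avoids loading either table into memory is to stream the primary keys from both databases in sorted order and merge-compare them. The following is a minimal sketch, not a tested production approach: it uses two in-memory SQLite databases as stand-ins for the real ones, and the table name `records`, the key column `id`, and the helper names `iter_keys` and `missing_keys` are all assumptions made for the illustration. It also assumes the key is unique and that both databases sort it the same way.

```python
import sqlite3
from typing import Iterator, List


def iter_keys(conn: sqlite3.Connection, table: str, key: str,
              batch: int = 10_000) -> Iterator[int]:
    """Yield keys in ascending order, fetching `batch` rows at a time."""
    # Table and column names are trusted here; do not pass user input.
    cur = conn.execute(f"SELECT {key} FROM {table} ORDER BY {key}")
    while True:
        rows = cur.fetchmany(batch)
        if not rows:
            return
        for (value,) in rows:
            yield value


def missing_keys(source: Iterator[int], target: Iterator[int]) -> Iterator[int]:
    """Yield keys found in `source` but not in `target` (both sorted ascending)."""
    sentinel = object()
    t = next(target, sentinel)
    for s in source:
        # Advance the target stream until it catches up with the source key.
        while t is not sentinel and t < s:
            t = next(target, sentinel)
        if t is sentinel or t != s:
            yield s


def demo() -> List[int]:
    """Build two toy databases and list the ids missing from the second."""
    db_a = sqlite3.connect(":memory:")
    db_b = sqlite3.connect(":memory:")
    for db, ids in ((db_a, range(1, 11)), (db_b, [1, 2, 4, 5, 7, 8, 10])):
        db.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
        db.executemany("INSERT INTO records VALUES (?, ?)",
                       [(i, f"row-{i}") for i in ids])
    return list(missing_keys(iter_keys(db_a, "records", "id"),
                             iter_keys(db_b, "records", "id")))


if __name__ == "__main__":
    print(demo())
```

Memory use is bounded by the batch size rather than the table size, since only one batch of keys per side is held at a time. With a real driver, whether the result set is actually streamed depends on the driver and cursor type, so that would need checking.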

Related

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a dataframe. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause is going to be based on a particular column, and the name of that column is already known. I would like to know if some sort of optimization can be done to improve the performance of the filter query.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is. The SQL tab is a really great place to start figuring things out.
Here are some tricks to run faster in Spark that apply to most jobs:
Can you reframe the problem? Is the data you are using in a format that helps you answer the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it, so that it is stored in the best format for the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make it easier/faster to solve.
What format are you storing the (incoming) data in? Are you using Parquet/ORC? They offer a great payoff in disk space/compression and are worth using. They can also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to help the query do less work? Can you write the data with a partitioning scheme that would aid lookups?
How many files make up your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small files as input slows down the processing of data.
If the tempView query is of similar size every time, you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block size (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all the data or not. Try to fit this to some multiple of your executors so that they finish together and don't straggle. This is hard to get perfect, but you can make decent strides toward a good ratio.
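As a rough illustration of the partition-count tip above, the arithmetic could look like the sketch below. The function name `target_partitions`, the 128 MiB block size, and the 12 executor slots are assumptions chosen for the example, not values from the question. The result is the kind of number you might then pass to `DataFrame.repartition`.

```python
def target_partitions(total_bytes: int, block_bytes: int, executor_slots: int) -> int:
    """Aim for partitions of about one block each, rounded up to a
    multiple of the executor slots so that tasks finish in even waves."""
    if total_bytes <= 0:
        return executor_slots
    # Ceiling division: how many block-sized chunks cover the data.
    by_size = (total_bytes + block_bytes - 1) // block_bytes
    # Round up to the next multiple of the available executor slots.
    waves = (by_size + executor_slots - 1) // executor_slots
    return waves * executor_slots


if __name__ == "__main__":
    ten_gib = 10 * 1024**3
    block = 128 * 1024**2  # assumed HDFS block size
    print(target_partitions(ten_gib, block, executor_slots=12))
```

Here 10 GiB of data at 128 MiB per block gives 80 block-sized chunks, which rounds up to 84 partitions, or seven full waves across 12 slots.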
There is no need to optimize filter conditions with Spark. Spark is already smart enough to optimize the conditions in the WHERE clause so that it fetches the minimum number of rows first. The best you can do, I guess, is to persist your temp view if you are querying the same view again and again.

What timeseries database to select for large number of records?

I have a scenario where I need to store about 100,000 input records per second. The records are time series data.
I need to run aggregations, other analytics, and also some machine learning algorithms over the data continuously. Performance is the key factor here, as I am looking for near real-time results.
What would you recommend as database engine?
Take a look at the ClickHouse analytical database. It can accept millions of rows per second. It can scan billions of rows per second on a single computer. It scales horizontally to multiple nodes. It fits time series workloads.
If you still need a time series database, then try VictoriaMetrics. It is built on ClickHouse ideas, so it is fast and resource-efficient.
I am adding my own solution...
ClickHouse is definitely a killer option. But I am now evaluating the open source GPU database OmniSci for a new project. Its open source version is limited to a single GPU node (up to 16 GPU devices; with OEM Tesla cards at 64GB per device you can get 1TB of VRAM, though of course not as cheaply as with ClickHouse). It is simply a SQL database on steroids (a JDBC driver exists) with a Kafka data source.
OmniSci also has a cross-dashboarding solution, which is commercially licensed, but with it you can have real-time dashboarding over, let's say, 20-50 billion time series records (8-16 GPUs) and multi-dashboard real-time analytics without any kind of pre-aggregation.
But it will cost money...
If you want to go purely open source, my second candidate is NVIDIA's RAPIDS framework, which implements cuDF (CUDA DataFrame, a data structure like Spark's); you can use it to keep your data window (append new, delete obsolete). It also offers cuxfilter, which is similar to OmniSci but is more of a framework; with a skilled frontend coder you can achieve something very similar to, or the same as, OmniSci.
Of course you can go and implement your own on top of Cassandra with an appropriate data model for your use case. This will maybe get you the best results tailored to your needs.
You could look at KairosDB (https://kairosdb.github.io/), which is a time series database on top of Apache Cassandra; I got 50k writes per second on a medium-sized single (but bare metal) node.
It's quite well documented (https://kairosdb.github.io/docs/build/html/CassandraSchema.html) and it has aggregators out of the box (https://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html).
OpenTSDB was slower in my tests. Influx looks promising, but I have no experience with it myself: https://github.com/influxdata/influxdb

Use Cases for Spark

We have an application which the clients use to track their procurement cycle. We need to build a solution which will help the users to pull any column from any table in a particular subject area, and they should be able to see all the rows of the result of joining the tables from which the columns have been pulled. It needs to be similar to a Salesforce kind of reporting solution. We are looking at HDFS and Spark in Azure HDInsight to support this kind of querying capability. We would like to know if this is a valid use case for Spark. The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
Please let me know if this is something that can be done using Spark.
As per my understanding, Spark is mostly used for batch processing. If your use case is directly user-facing, then I am doubtful about using Spark because there may be better solutions (or alternate architectures). Joining 500 million rows in real time sounds crazy!
The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
This is another thing that puzzled me. Pulling all 500 million rows into the RAM of a single Java process doesn't sound right, for obvious reasons.
Updated
Just using Spark to process huge data will not be effective for real-time solutions (like your use case). But Spark will be very effective if you pre-process your data, cache the results using some other system, and prepare views from the results that can be served to your users. This is more or less similar to the Lambda Architecture.
For example: Spark on a YARN cluster to periodically process the data and generate/update the different views, a distributed storage system (preferably a columnar one) to cache the views, and a REST API to serve the views to users.
Late reply to the question, but in case someone else reads this in the future: AWS Redshift does exactly this.

What would be the proper way to tune Apache Spark for responsive web applications?

I have previously used Apache Spark for streaming applications where it does a wonderful job for ETL pipelines and predictions using Machine Learning.
However, Spark for EDA may not be as fast as one may want. For example, if you would like to do basic mathematical operations on data coming from Postgres or ElasticSearch using the data frames in Spark, the time it takes to fetch data from the host system and do the analysis is much higher than that taken by the SQL query on Postgres to run.
Even simple aggregations such as sum, average, and count can be done much faster using SQL than doing them on top of Spark-SQL.
From what I understand, this is not because of latency in fetching the data from the host system. If you call the show method on a data frame, you can quickly get the top rows of the data set. However, if you limit the response in SQL and then call collect, the time taken is huge.
This means that the data is there, but the processing being done while calling collect is taking time.
Regardless of the data source (CSV file, JSON file, ElasticSearch, Parquet, etc.), the behavior remains the same.
What is the reason for this latency on collect and is there any way to reduce it to the point where it can work with responsive applications to make real-time or near real-time queries?

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populate another handful of tables (essentially, custom secondary index tables). Is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
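The point about secondary indexes being local to each node can be pictured with a toy model. This is only a sketch under heavy simplifying assumptions: `ToyCluster`, `owner`, and the byte-sum placement rule are invented for the illustration and do not reflect how Cassandra actually assigns tokens or stores indexes. It only shows why a primary key read touches one node while an indexed-value read has to ask every node, and why every write carries an extra index update.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NODES = 3  # matches the 3-node cluster in the question; replication is ignored


def owner(partition_key: str) -> int:
    """Toy placement rule: map a key to a node (not Cassandra's real partitioner)."""
    return sum(partition_key.encode()) % NODES


class ToyCluster:
    def __init__(self) -> None:
        # Per node: primary storage and a node-local index on `feature`.
        self.rows: List[Dict[str, dict]] = [dict() for _ in range(NODES)]
        self.local_index = [defaultdict(set) for _ in range(NODES)]

    def insert(self, pk: str, feature: str) -> None:
        node = owner(pk)
        old = self.rows[node].get(pk)
        if old is not None:
            # Keep the index consistent when a row is overwritten.
            self.local_index[node][old["feature"]].discard(pk)
        self.rows[node][pk] = {"user_id": pk, "feature": feature}
        # The extra index write happens on the same node as the row.
        self.local_index[node][feature].add(pk)

    def get_by_pk(self, pk: str) -> Tuple[Optional[dict], int]:
        """Return the row and the number of nodes contacted (always 1)."""
        return self.rows[owner(pk)].get(pk), 1

    def get_by_feature(self, feature: str) -> Tuple[List[dict], int]:
        """Return matching rows and the number of nodes contacted (all of them)."""
        hits: List[dict] = []
        for node in range(NODES):
            for pk in self.local_index[node].get(feature, ()):
                hits.append(self.rows[node][pk])
        return hits, NODES


if __name__ == "__main__":
    cluster = ToyCluster()
    for pk, feat in [("u1", "login"), ("u2", "search"), ("u3", "login")]:
        cluster.insert(pk, feat)
    print(cluster.get_by_pk("u2"))
    print(cluster.get_by_feature("login"))
```

In this model the cost of an indexed read grows with the number of nodes regardless of how many rows match, which is the read-side impact the linked post describes.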
