DF.write.mode("append") at scale - apache-spark

Is the Spark SQL family of APIs for writing to a database, like this:
DF.write.mode("append").jdbc(url, table, prop)
able to work at scale?
Or is there a point at which Sqoop should be used instead?

In general, writing over JDBC will typically be limited by the capabilities of the destination system. JDBC connectors are not designed for bulk data migrations, and the majority of vendors have their own platform-specific bulk insert tools.
A specific write mode like append has little or no impact here.
And as always - if you have questions about the performance implications of a specific choice, it's best to benchmark it yourself on the platform you use, with data that reflects the properties of the real input, and with resources comparable to the ones you have at your disposal in production.
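For reference, a minimal sketch of what a tuned JDBC append can look like in Scala is shown below; the connection URL, table name, credentials and option values are placeholders, and the useful settings depend entirely on the target database:

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("jdbc-append").getOrCreate()
    val df = spark.read.parquet("/path/to/input")          // placeholder input

    val props = new Properties()
    props.setProperty("user", "etl_user")                  // placeholder credentials
    props.setProperty("password", "secret")
    props.setProperty("driver", "org.postgresql.Driver")

    df.repartition(8)                                      // caps the number of concurrent connections to the target
      .write
      .mode(SaveMode.Append)
      .option("batchsize", "10000")                        // rows per JDBC batch insert (default is 1000)
      .option("isolationLevel", "READ_COMMITTED")          // transaction isolation used by the writer
      .jdbc("jdbc:postgresql://dbhost:5432/warehouse", "public.events", props)

Even with these knobs it is still batched INSERTs over JDBC, so for very large one-off loads the vendor's bulk loader (or Sqoop) will usually be faster.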

Related

Mapping Dataflow vs SQL Stored Procedure in ADF pipeline

I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic is at times complex: I will have to join multiple tables, write subqueries, use window functions, nested CASE statements, etc.
All of my business requirements could easily be implemented through an SP, but there is a slight inclination towards Mapping Data Flow, considering that it runs Spark underneath and can scale up as required.
Does ADF Mapping Data Flow have an upper hand over SQL stored procedures when used in an ADF pipeline?
Some of the concerns that I have with Mapping Data Flow are as below:
The time taken to implement complex logic using data flows is much more than with a stored procedure.
The execution time for a mapping data flow is much higher, considering the time it takes to spin up the Spark cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question, which doesn't tend to do well on Stack Overflow, but the fact that you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you consider that Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries, etc. Azure SQL DB can easily be scaled up and down (e.g. via its REST API) - that seemed to be one of your points.
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is low-code with powerful transforms, then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL; advanced analytics, hard maths, and complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. This is low-code in name only to me, with high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage of using SPs in your pipeline is that your SP will run directly against the database server. So if you have any other queries, transactions, or jobs running against the DB at the same time your SP is executing, you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SPs in our organization instead of Mapping Data Flows. The cluster spin-up time was an issue for us as we scaled up. To address the issue I mentioned previously with SPs, we stagger our workload and schedule jobs to run during off-peak hours.

Web real-time analytics dashboard: which technologies should we use? (Node/Django, Cassandra/MongoDB...)

We want to develop a dashboard to analyze geospatial data.
This is close to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used (the frontend will be D3.js, DC.js, Leaflet.js...).
Between Django and Node.js, we think we will use Node.js, because we've read that it's faster than Django for this kind of task. But we are not sure and are open to ideas.
But between Mongo and Cassandra, we are quite confused. Our data is mostly structured, so storing it in tables as Cassandra does would make it easy to manage, and Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
What suggestions can you give us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will also include images, GPS arrays, IoT sensor readings, and geographical data (vector polygons & rasters).
The databases will receive a high write load coming from the sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most of the math will be calculated in the client's browser; the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (pointm). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, supports multi-column indexes, etc.
Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics. I have had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
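For illustration, getting the 0.11.0 behaviour is mostly a producer configuration change; the broker address, topic and payload below are made-up placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")     // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all")                           // required when idempotence is enabled
    props.put("enable.idempotence", "true")            // Kafka 0.11.0+: broker deduplicates producer retries

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("log-lines", "client-1", "some log entry"))
    producer.close()

This covers producer-to-broker duplicates; true end-to-end exactly-once also needs the 0.11 transactions API and a consumer that only reads committed messages.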
Overall, think of Kafka as a lower-level solution with logical message domains and queues, and, from what I skimmed, Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
what's the required latency? That is, what's the maximum time it should take an update to propagate through the system? The answer to this question influences question 1.
how much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have a different 'maximum altitude'.
and what format? Do you have text files, or are you pulling from relational DBs?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3 (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch after this list.
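To make the deduplication point concrete, this is roughly what it looks like in Spark (Scala); the input path and column names are invented for the example:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
    val customers = spark.read.parquet("s3://my-bucket/raw/customers")   // placeholder input

    // Simple case: keep one row per business key
    val distinctById = customers.dropDuplicates("customer_id")

    // Keep only the most recent row per key when duplicates conflict
    val latestPerId = customers
      .withColumn("rn", row_number().over(
        Window.partitionBy("customer_id").orderBy(col("updated_at").desc)))
      .filter(col("rn") === 1)
      .drop("rn")

Both variants shuffle the data by key under the hood, which is the sort-like cost mentioned above.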
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex comes with Apache Malhar, a library of operators that can help you build an application quickly: load data using the JDBC input operator and then either store it in your cloud storage (maybe S3) or do de-duplication before storing it in any sink. It also provides a Dedup operator for this kind of operation. But as mentioned in the previous reply, Apex does need Hadoop underneath to function.

Migrating from Oracle to VoltDB

We are migrating from Oracle to VoltDB. Should we put all the business logic (the migrated stored procedures) inside the database? Is that the best practice for maximum performance?
I work at VoltDB. There isn't one right answer to your question; it would depend on the particular schema and procedures, but I can explain a little about stored procedures in VoltDB and the differences from Oracle.
First, VoltDB is not a general-purpose database like Oracle; it was specially designed to provide high performance and scalability for OLTP and "Fast Data" workloads. Typically these workloads involve discrete transactions on small sets of records, but which arrive at rates of thousands to millions per second. The use cases range from providing real-time analytics on fast-moving data sets, to transforming and enriching streaming data, to providing low-latency responses, often involving a data-driven decision, to high-scale interactive applications.
The procedures in VoltDB are typically focused on applying atomic changes to small sets of records, and they are often used to make event-driven changes in real time vs. running batch processes on data in bulk as you often see in Oracle. VoltDB automatically generates CRUD-style procedures for each table in the schema, now including UPSERT. You can declare single-SQL-statement procedures in the DDL. Procedures that include multiple SQL statements and control flow logic are written as simple Java classes that run on the database. VoltDB also supports ad-hoc SQL statements (ANSI SQL-92 compatible) sent directly from a client, using either a native language client library, the JDBC or ODBC drivers, or the embedded HTTP-JSON interface.
If the stored procedures in Oracle are for OLTP operations then they may translate somewhat directly into VoltDB procedures. If they are performing batch operations on data in bulk, then often these processes may be redesigned to be event-driven real-time processes that would produce the same result incrementally. If they still have to be done as a long-running batch, typically they would be broken into separate more discrete procedures driven by a client process.

Is it a good idea to perform data migration with generic language?

There are two kinds of migration. One is to update the database schema during the development period. The other is to migrate existing data into a new system (with a different schema).
There are a lot of tools available for the former scenario, such as Flyway and Liquibase. However, I am not aware of tools for the latter purpose.
We are currently using PL/SQL to do the migration. However, not all our Java developers have a DBA background. I wonder if anyone has experience using general-purpose languages (Java, Scala, C#, etc.) with database access libraries (Hibernate, NHibernate, etc.) to perform the migration.
I'm unsure what the question is, but if I understand you correctly:
Sure, you can develop an application in a(ny) language that reads data from a data source and puts it into a data target.
A data migration between data sources does not have to be SQL-to-SQL only (in the case where source and target are relational databases).
In fact, it often makes sense to have an application between the source and target if there's logic that needs to handle or transform data between various structures or various data sources.
For example, when migrating data from an ERP system into an e-commerce system.
Another advantage of doing it via an application is that you can often include more tools/features for reporting and error handling.
Especially if the integration/migration will run often, such error handling/reporting to verify the data movement is beneficial.
Also, if the data source and data target are located in different areas or on different servers, it can be easier to do the migration via an application, to avoid needlessly opening up access between servers and linking them together.
So basically - such an application (Java, C# ... anything) would read data from a data source, transform the data into the data structure of the target and then store it in the data target.
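A minimal sketch of that read-transform-write loop using plain JDBC from Scala; the connection strings, tables and columns are made up for illustration, and a real job would add chunked batching, logging and error handling:

    import java.sql.DriverManager

    // placeholder source and target connections
    val source = DriverManager.getConnection(
      "jdbc:oracle:thin:@//legacy-host:1521/LEGACY", "src_user", "src_pw")
    val target = DriverManager.getConnection(
      "jdbc:postgresql://new-host:5432/shop", "tgt_user", "tgt_pw")

    val rows = source.createStatement()
      .executeQuery("SELECT cust_no, cust_name, country_cd FROM legacy_customers")

    val insert = target.prepareStatement(
      "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)")

    while (rows.next()) {
      insert.setLong(1, rows.getLong("cust_no"))
      insert.setString(2, rows.getString("cust_name").trim)           // transformation step
      insert.setString(3, rows.getString("country_cd").toUpperCase)   // map legacy code to target format
      insert.addBatch()
    }
    insert.executeBatch()    // for large tables, execute the batch every few thousand rows instead

    source.close()
    target.close()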
Making an application to do things, is just another tool in a developers toolbox.
However, if the data migration is basically a 1:1 movement of data from one structure to a duplicate of that structure, and no transformation is needed, then it would be faster/easier to handle it directly in SQL or with a data-sync program.
Even without a "DBA background": not many developers have one, but that shouldn't prevent people from learning SQL, as it's also just another language. You don't need to be a DBA to be able to write SQL effectively.
So, in conclusion: yes, you can write an application, and yes, it can be a good idea. But as with almost everything in our field, it's a case-by-case, situation-by-situation evaluation of whether or not it is the "better" way.
