My requirement is to commit 4 million records to an Oracle DB.
For that I have developed a Java program that starts 10 threads in a producer/consumer design. The producer pushes a range of data (e.g. 1-1000) to every consumer (Java thread). Each consumer consumes its range and commits those records from a table in one DB to a table in a different DB.
In the consumer, whenever a thread commits data to the DB, I log the range and the total number of records committed.
Everything runs smoothly, but after a few minutes the console shows 90,000 records as committed, while the count in the DB is only 40,000.
The Java program reports the data as committed, so why is the count in Oracle lower?
After that, I came to know that frequent commits slow down the process. I am using proper connection pooling and batch processing.
I cannot create a DB link or use another approach for this task; it has to be done with a Java technology - plain JDBC, Hibernate, or anything similar.
Please help me resolve this issue. Any help will be highly appreciated.
Thanks.
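For reference, each consumer does roughly the following (a simplified sketch, not my real code; the table and column names src_table/dest_table/id/payload are placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.DataSource;

    // Hypothetical consumer: copies one ID range from the source table to the target table
    // and logs the count only after the transaction has actually been committed.
    public class RangeConsumer implements Runnable {
        private final DataSource sourceDs;   // pool for the source DB
        private final DataSource targetDs;   // pool for the target Oracle DB
        private final long from, to;         // range handed over by the producer, e.g. 1-1000

        RangeConsumer(DataSource sourceDs, DataSource targetDs, long from, long to) {
            this.sourceDs = sourceDs;
            this.targetDs = targetDs;
            this.from = from;
            this.to = to;
        }

        @Override
        public void run() {
            String select = "SELECT id, payload FROM src_table WHERE id BETWEEN ? AND ?";
            String insert = "INSERT INTO dest_table (id, payload) VALUES (?, ?)";
            try (Connection src = sourceDs.getConnection();
                 Connection dst = targetDs.getConnection();
                 PreparedStatement sel = src.prepareStatement(select);
                 PreparedStatement ins = dst.prepareStatement(insert)) {

                dst.setAutoCommit(false);               // one explicit commit per range
                sel.setLong(1, from);
                sel.setLong(2, to);

                int batched = 0;
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        ins.setLong(1, rs.getLong("id"));
                        ins.setString(2, rs.getString("payload"));
                        ins.addBatch();
                        batched++;
                    }
                }
                ins.executeBatch();
                dst.commit();                           // count the range only after this returns
                System.out.printf("Committed range %d-%d (%d rows)%n", from, to, batched);
            } catch (Exception e) {
                // on any failure the range is not counted as committed
                e.printStackTrace();
            }
        }
    }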
I have an existing application that uses Hazelcast for tracking cluster membership and for distributed task execution. I'm thinking that Jet could be useful for adding analytics on top of the existing application, and I'm trying to figure out how best to layer Jet on top of what we already have.
So my first question is: how should I run Jet on top of our existing Hazelcast configuration? Do I have to run Jet separately, or should I replace our existing Hazelcast configuration with Jet (since Jet does expose the HazelcastInstance)?
My second question is: I see lots of examples using IMap and IList, but I'm not seeing anything that uses topics as a source (I also don't see this as an option in the Sources builder). My initial thought on using Jet was to emit events (I/O perf data, HTTP request data) from our existing code to a topic, have Jet process that topic and generate analytics from that data, and then push the results to an IMap. Is this the wrong approach? Should I be using some other structure to push these events into Jet? I saw that I can make my own custom Source where I could do this, but I felt I must be going down the wrong path, given that one isn't already provided by the library for this specific purpose.
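For context, the emitting side I had in mind is just something like this (a simplified sketch with Hazelcast IMDG 3.x imports; the topic name and event type are made up):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ITopic;

    import java.io.Serializable;

    public class EventEmitter {
        // made-up event type for the sketch
        static class PerfEvent implements Serializable {
            final String source;
            final long durationMs;
            PerfEvent(String source, long durationMs) {
                this.source = source;
                this.durationMs = durationMs;
            }
        }

        public static void main(String[] args) {
            // in reality this would be the instance we already use for membership/task execution
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            ITopic<PerfEvent> topic = hz.getTopic("perf-events"); // topic name is a placeholder
            topic.publish(new PerfEvent("http-handler", 42));
        }
    }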
You can upgrade your current Hazelcast IMDG cluster to a Jet cluster and run your legacy application alongside Jet jobs; this setup is simpler to deploy and operate. Starting an extra cluster just for Jet is also perfectly fine; the advantage there is isolation (cluster lifecycle, failures, etc.). Just be aware that you can't combine an IMDG 3.x cluster with a Jet 4.x cluster.
Use an IMap with the Event Journal to connect two jobs or to ingest data into the cluster. It's the simplest fault-tolerant option that works out of the box. Jet's data source must be replayable: if the job fails, it goes back to the last state snapshot and rewinds the data source offset accordingly.
A topic can be used (via the Source Builder), but it won't be fault-tolerant (some messages might get lost). Jet achieves fault tolerance by snapshotting the job regularly; in case of failure, the latest snapshot is restored and the data following the snapshot is replayed. Unlike the journal, a topic consumer can't replay the data from an offset.
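For illustration, a minimal map-journal pipeline could look roughly like this (a sketch only, against the Jet 4.x API; the "events" and "analytics" map names and the per-minute count are made up, and the event journal has to be enabled for the source map in the cluster config):

    import com.hazelcast.jet.Jet;
    import com.hazelcast.jet.JetInstance;
    import com.hazelcast.jet.Util;
    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.JournalInitialPosition;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;
    import com.hazelcast.jet.pipeline.WindowDefinition;

    public class EventAnalyticsJob {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(Sources.mapJournal("events", JournalInitialPosition.START_FROM_OLDEST))
             .withIngestionTimestamps()
             .window(WindowDefinition.tumbling(60_000))          // 1-minute tumbling windows
             .aggregate(AggregateOperations.counting())          // events per window
             .map(wr -> Util.entry(wr.end(), wr.result()))       // window end timestamp -> count
             .writeTo(Sinks.map("analytics"));                   // results land in an IMap

            JetInstance jet = Jet.bootstrappedInstance();
            jet.newJob(p).join();
        }
    }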
What I have?
I have a Spark Streaming application (on Kafka streams) on a Hadoop cluster that aggregates users' clicks and some other actions done on a website every 5 minutes and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is filled using logical log streaming replication via Kafka. The table size is 100 million users.
What I want?
I want to join the Spark streams with the static data from GreenPlum every 1 or 5 minutes and then aggregate the data using, for example, the user age from the static table.
Notes
I definitely don't need to read all the records from the users table. There is a rather stable core segment, plus a number of new users registering each minute.
Currently I use PySpark 2.1.0
My solutions
1. Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes, add new files for new users. Once a day, reload all files.
2. Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from that DB and use the built-in Spark Streaming joins.
3. Read data from GreenPlum into a Spark cache. Join the stream data with the cache.
4. Every 5 minutes, save/append new user data to a file and ignore old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the website during the last 2 weeks. Then join this file with the stream.
Questions
Which of these solutions is more suitable for an MVP? For production?
Are there any better solutions/best practices for this sort of problem? Any literature?
Spark Streaming reading data from a cache like Apache Geode makes this better. I used this approach in a real-time fraud use case. In a nutshell, I have features generated on Greenplum Database using historical data. The feature data and some decision-making lookup data are pushed into Geode. Features are periodically regenerated (at a 10-minute interval) and then refreshed in Geode. The Spark streaming scoring job constantly scores the transactions as they come in without reading from Greenplum. The Spark streaming job also puts the score in Geode, which is synced back to Greenplum using a different thread. I had Spark Streaming running on Cloud Foundry using Kubernetes. This is very high level, but it should give you an idea.
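Very roughly, the lookup side of that setup could look like this (a sketch only; the locator address and the "features"/"scores" region names are placeholders, and the feature/score types are simplified):

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.client.ClientCache;
    import org.apache.geode.cache.client.ClientCacheFactory;
    import org.apache.geode.cache.client.ClientRegionShortcut;

    public class GeodeFeatureLookup {
        public static void main(String[] args) {
            ClientCache cache = new ClientCacheFactory()
                    .addPoolLocator("geode-locator", 10334)
                    .create();

            // features are refreshed from Greenplum into this region every ~10 minutes
            Region<String, String> features = cache
                    .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                    .create("features");
            // scores written here are synced back to Greenplum by a separate thread
            Region<String, Double> scores = cache
                    .<String, Double>createClientRegionFactory(ClientRegionShortcut.PROXY)
                    .create("scores");

            // inside the streaming job, per incoming transaction:
            String userId = "user-42";
            String featureVector = features.get(userId);  // no round trip to Greenplum
            double score = score(featureVector);          // placeholder for the model call
            scores.put(userId, score);

            cache.close();
        }

        private static double score(String featureVector) {
            return 0.0; // dummy model
        }
    }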
You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC connection to the master.
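As a rough sketch (shown with the Java API; the exact format and option names depend on the connector version, so check the docs above - the URLs, credentials, table names and partition column here are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class GreenplumJoinSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("gpdb-join").getOrCreate();

            // Read the users table in parallel from the GPDB segments via the connector.
            Dataset<Row> users = spark.read()
                    .format("greenplum")
                    .option("url", "jdbc:postgresql://gpdb-master:5432/analytics")
                    .option("dbtable", "users")
                    .option("user", "spark_user")
                    .option("password", "secret")
                    .option("partitionColumn", "user_id")
                    .load();

            // ... join `users` with the streaming click data and aggregate, e.g. by age ...
            Dataset<Row> metrics = users.groupBy("age").count(); // placeholder aggregation

            // Writing back goes through plain JDBC to the master, as noted above.
            metrics.write()
                    .format("jdbc")
                    .option("url", "jdbc:postgresql://gpdb-master:5432/analytics")
                    .option("dbtable", "user_metrics")
                    .option("user", "spark_user")
                    .option("password", "secret")
                    .mode(SaveMode.Append)
                    .save();
        }
    }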
I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some sources suggest, I can run the updated application in parallel with the old application until it catches up in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets to some location in addition to the checkpoint directory, for example ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
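Concretely, I imagine it would look roughly like this (just a sketch; saveOffsetsExternally/loadOffsetsExternally stand in for the ZooKeeper/HDFS part, and the broker/topic names are placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQueryListener;

    public class OffsetTrackingSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("offset-tracking").getOrCreate();

            // 1) Keep writing the latest processed Kafka offsets to an external store,
            //    in addition to the checkpoint directory (assumes a single Kafka source).
            spark.streams().addListener(new StreamingQueryListener() {
                @Override public void onQueryStarted(QueryStartedEvent event) { }
                @Override public void onQueryTerminated(QueryTerminatedEvent event) { }
                @Override public void onQueryProgress(QueryProgressEvent event) {
                    // endOffset() is a JSON string like {"events":{"0":1234,"1":5678}}
                    String offsetsJson = event.progress().sources()[0].endOffset();
                    saveOffsetsExternally(offsetsJson);
                }
            });

            // 2) On redeploy with a NEW checkpoint dir, seed the source from the saved offsets.
            String startingOffsets = loadOffsetsExternally(); // same JSON format as above
            Dataset<Row> stream = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "events")
                    .option("startingOffsets", startingOffsets) // only applies to a fresh query
                    .load();
            // ... stateful transformations and writeStream with the new checkpoint location ...
        }

        private static void saveOffsetsExternally(String offsetsJson) { /* placeholder */ }

        private static String loadOffsetsExternally() { return "earliest"; /* placeholder */ }
    }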
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate detecting whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.
We have a Spark cluster running under MemSQL, with different pipelines running. The ETL setup is as below.
Extract: Spark reads messages from the Kafka cluster (using MemSQL Kafka-ZooKeeper).
Transform: We have a custom JAR deployed for this step.
Load: Data from the transform stage is loaded into a columnstore table.
I have the following doubts:
What happens to a message polled from Kafka if the job fails in the transform stage?
- Does MemSQL take care of loading that message again?
- Or is the data lost?
If the data gets lost, how can I solve this problem? Are there any configuration changes that need to be made for this?
As it stands, at-least-once semantics are not available in MemSQL Ops. It is on the roadmap and will be present in one of the future releases of Ops.
If you haven't yet, you should check out MemSQL 5.5 Pipelines.
http://blog.memsql.com/pipelines/
This one isn't based on Spark (and transforms are done a bit differently, so you might have to rewrite your code), but we have native Kafka streams now.
The way we get exactly-once with the native version is simple: store the offsets in the database in the same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets won't be committed either, so we'll naturally and automatically retry that partition offset range.
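In JDBC terms the idea is roughly this (a sketch; the table and column names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class ExactlyOnceLoader {

        // Load one batch of rows and record the Kafka offset in the SAME transaction.
        static void loadBatch(String jdbcUrl, String topic, int partitionId,
                              long lastOffset, Iterable<String[]> rows) throws Exception {
            try (Connection con = DriverManager.getConnection(jdbcUrl)) {
                con.setAutoCommit(false);
                try (PreparedStatement insertRow = con.prepareStatement(
                            "INSERT INTO clicks_columnstore (user_id, action) VALUES (?, ?)");
                     PreparedStatement saveOffset = con.prepareStatement(
                            "REPLACE INTO kafka_offsets (topic, partition_id, last_offset) VALUES (?, ?, ?)")) {

                    for (String[] row : rows) {
                        insertRow.setString(1, row[0]);
                        insertRow.setString(2, row[1]);
                        insertRow.addBatch();
                    }
                    insertRow.executeBatch();

                    saveOffset.setString(1, topic);
                    saveOffset.setInt(2, partitionId);
                    saveOffset.setLong(3, lastOffset);
                    saveOffset.executeUpdate();

                    con.commit();   // data and offset become visible atomically
                } catch (Exception e) {
                    con.rollback(); // neither data nor offset is committed -> the range gets retried
                    throw e;
                }
            }
        }
    }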
I have a Hadoop job running on HDInsight whose source data comes from Azure DocumentDB. This job runs once a day, and as new data comes into DocumentDB every day, my Hadoop job filters out the old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? How does the throttling mechanism in DocumentDB play a role here?
While the Hadoop job is running and new records come in, I don't know what happens to them. Are they fed to the running job or not?
The answer to this depends on what phase or step the Hadoop job is in. Data gets pulled once at the beginning. Documents added while data is getting pulled will be included in the Hadoop job results. Documents added after data has finished getting pulled will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, as the Hadoop job simply follows the continuation token when paging through query results.
"How does the throttling mechanism in DocumentDB play a role here?"
The DocumentDB Hadoop connector will automatically retry when throttled.
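If you ever need to drive the incremental query yourself outside the connector, the pattern is roughly this (a sketch using the DocumentDB Java SDK; the endpoint, key, collection link and the stored timestamp are placeholders, and the exact SDK calls may differ between SDK versions):

    import com.microsoft.azure.documentdb.ConnectionPolicy;
    import com.microsoft.azure.documentdb.ConsistencyLevel;
    import com.microsoft.azure.documentdb.Document;
    import com.microsoft.azure.documentdb.DocumentClient;
    import com.microsoft.azure.documentdb.FeedOptions;

    public class IncrementalDocDbRead {
        public static void main(String[] args) {
            DocumentClient client = new DocumentClient(
                    "https://myaccount.documents.azure.com:443/", "<master-key>",
                    ConnectionPolicy.GetDefault(), ConsistencyLevel.Session);

            long lastRunTs = 1500000000L; // timestamp persisted by the previous daily run
            // ORDER BY c._ts keeps paging consistent while following continuation tokens
            String query = "SELECT * FROM c WHERE c._ts > " + lastRunTs + " ORDER BY c._ts";

            FeedOptions options = new FeedOptions();
            options.setEnableCrossPartitionQuery(true);

            for (Document doc : client.queryDocuments(
                    "dbs/mydb/colls/mycoll", query, options).getQueryIterable()) {
                // documents created after the pull finishes show up in the next day's run
                System.out.println(doc.getId());
            }
        }
    }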