What happens when HDInsight sources data from Azure DocumentDB - azure-hdinsight

I have a Hadoop job running on HDInsight that sources its data from Azure DocumentDB. The job runs once a day, and since new data arrives in DocumentDB every day, my Hadoop job filters out old records and only processes the new ones (this is done by storing a timestamp somewhere). However, if new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not? How does the throttling mechanism in DocumentDB play a role here?
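The timestamp filter described above can be sketched in plain Python. This is only an illustration under stated assumptions: documents are modeled as dicts carrying DocumentDB's system `_ts` property as a number, and `select_new_documents` is a hypothetical helper, not part of the Hadoop connector.

```python
def select_new_documents(documents, last_ts):
    """Return documents newer than the stored watermark, plus the new watermark.

    Assumes each document is a dict with a numeric `_ts` field.
    """
    # Keep only documents modified after the previous run, ordered by _ts.
    fresh = sorted(
        (doc for doc in documents if doc["_ts"] > last_ts),
        key=lambda doc: doc["_ts"],
    )
    # The watermark only moves forward when something new was seen.
    new_watermark = fresh[-1]["_ts"] if fresh else last_ts
    return fresh, new_watermark
```

The returned watermark would be persisted after a successful run and passed back in as `last_ts` on the next one.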

"If new records come in while the Hadoop job is running, I don't know what happens to them. Are they fed to the running job or not?"
The answer depends on which phase the Hadoop job is in. Data is pulled once, at the beginning of the job. Documents added while the data is being pulled will be included in the Hadoop job results. Documents added after the pull has finished will not be included in the Hadoop job results.
Note: ORDER BY _ts is needed for consistent behavior, because the Hadoop job simply follows the continuation token when paging through query results.
"How does the throttling mechanism in DocumentDB play a role here?"
The DocumentDB Hadoop connector will automatically retry when throttled.
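For illustration, retry-on-throttle behavior of this kind can be sketched as below. This is not the connector's actual code: `ThrottledError` and `call_with_retry` are hypothetical names, and the sketch assumes the service signals throttling together with a suggested wait time in milliseconds.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttled response carrying a suggested wait."""

    def __init__(self, retry_after_ms):
        super().__init__("request rate too large")
        self.retry_after_ms = retry_after_ms


def call_with_retry(request, max_attempts=5, sleep=time.sleep):
    """Call `request()` and retry after the suggested delay when throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ThrottledError as err:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(err.retry_after_ms / 1000.0)
```

The `sleep` parameter is injectable only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.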

Related

Is it possible to know the resources used by a specific Spark job?

I'm exploring the idea of using a multi-tenant Spark cluster. The cluster executes jobs on demand for a specific tenant.
Is it possible to "know" the specific resources used by a specific job (for billing purposes)? E.g. if a job requires that several nodes in Kubernetes are automatically allocated, is it then possible to track which Spark jobs (and ultimately which tenant) initiated these resource allocations? Or are jobs always spread evenly across the allocated resources?
I tried to find information on the Apache Spark site and elsewhere on the internet, without success.
See https://spark.apache.org/docs/latest/monitoring.html
You can save the data from the Spark History Server as JSON and then write your own resource calculation on top of it.
Note that what you are describing is a Spark application rather than a job.
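A minimal sketch of such a calculation, assuming the JSON has already been fetched and parsed from the monitoring REST API's per-application executor list (`/api/v1/applications/[app-id]/executors`). The field names `id`, `totalCores` and `totalDuration` are used as I understand them from that API and should be checked against the Spark version in use; `summarize_executors` is a hypothetical helper, and the billing model itself is left open.

```python
def summarize_executors(executors):
    """Aggregate usage for one application from a list of executor summaries.

    `executors` is the parsed JSON list; each entry is a dict.
    """
    # The driver appears in the list too; count only worker executors here.
    workers = [e for e in executors if e.get("id") != "driver"]
    return {
        "executors": len(workers),
        "cores": sum(e.get("totalCores", 0) for e in workers),
        # totalDuration is assumed to be task time in milliseconds.
        "task_seconds": sum(e.get("totalDuration", 0) for e in workers) / 1000.0,
    }
```

Mapping the application back to a tenant would need a convention of your own, for example encoding the tenant in the application name at submit time.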

Spark Streaming join with GreenPlum/Postgres Data. Approach

What I have?
I have a Spark Streaming application (on Kafka streams) on a Hadoop cluster that, every 5 minutes, aggregates users' clicks and other actions on a web site and converts them into metrics.
I also have a table in GreenPlum (on its own cluster) with user data that may get updated. This table is filled using Logical Log Streaming Replication via Kafka. The table holds 100 million users.
What I want?
I want to join the Spark streams with static data from GreenPlum every 1 or 5 minutes and then aggregate the data using, e.g., user age from the static table.
Notes
I definitely don't need to read all records from the users table. There is a rather stable core segment plus a number of new users registering each minute.
Currently I use PySpark 2.1.0
My solutions
1. Copy data from the GreenPlum cluster to the Hadoop cluster and save it as ORC/Parquet files. Every 5 minutes add new files for new users. Once a day reload all files.
2. Create a new DB on Hadoop and set up log replication via Kafka, as is done for GreenPlum. Read data from the DB and use the built-in Spark Streaming joins.
3. Read data from GreenPlum into a cache in Spark. Join the stream data with the cache.
4. Every 5 minutes save/append new user data to a file, ignoring old user data. Store an extra column, e.g. last_action, to truncate this file if a user wasn't active on the web site during the last 2 weeks. Then join this file with the stream.
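The logic of the fourth option can be sketched in plain Python, independent of Spark. Everything here is an assumption made for illustration: users are kept in a dict keyed by user id, `last_action` is a `datetime`, and `refresh_users` and `enrich` are hypothetical helper names.

```python
from datetime import datetime, timedelta

INACTIVITY_LIMIT = timedelta(weeks=2)


def refresh_users(cache, new_users, now):
    """Merge newly seen users into the cache and drop users inactive for 2 weeks."""
    merged = {**cache, **new_users}  # newer records win on key collisions
    cutoff = now - INACTIVITY_LIMIT
    return {uid: user for uid, user in merged.items() if user["last_action"] >= cutoff}


def enrich(clicks, cache):
    """Inner-join one micro-batch of click records with the cached user attributes."""
    return [
        dict(click, age=cache[click["user_id"]]["age"])
        for click in clicks
        if click["user_id"] in cache
    ]
```

In Spark the same idea would be a periodically refreshed lookup dataset joined against each micro-batch; users evicted from the cache simply drop out of the join until they are reloaded.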
Questions
Which of these solutions is more suitable for an MVP? For production?
Are there any better solutions or best practices for this sort of problem? Any literature?
Having Spark Streaming read the data from a cache such as Apache Geode makes this better. I used this approach in a real-time fraud use case. In a nutshell, I have features generated on Greenplum Database from historical data. The feature data and some decision-making lookup data are pushed into Geode. Features are recomputed periodically (at 10-minute intervals) and then refreshed in Geode. The Spark Streaming scoring job constantly scores the transactions as they come in, without reading from Greenplum. The job also puts the scores in Geode, which is synced to Greenplum by a different thread. I had Spark Streaming running on Cloud Foundry using k8s. This is very high level, but it should give you an idea.
You might want to check out the GPDB Spark Connector --
http://greenplum-spark-connector.readthedocs.io/en/latest/
https://greenplum-spark.docs.pivotal.io/130/index.html
You can load data directly from the segments into Spark.
Currently, if you want to write back to GPDB, you need to use a standard JDBC connection to the master.

Searching SQL data using Spark

I am creating a Spark job that searches for records (SQL rows) relevant to a keyword using a TF-IDF model. For testing, I currently spark-submit the job to get results. Ideally, however, I want to expose this job as a web service so that external users can search records through a REST API. This may generate a number of concurrent requests to run the job, as multiple users search for their own keywords through the API.
I wonder whether I should support this scenario with Spark Job Server so that users can submit jobs via an API, or whether you have any other suggestion for this particular case based on your past experience. Thanks.
This would be an inappropriate use of Spark. Spark is for analytics jobs. Those take time (maybe less time than old-school MapReduce but time nonetheless), and REST clients demand immediate results.
You are on the right track, though. As data comes in, you can use, for example, Spark Streaming and MLlib to process records according to your TF-IDF model and then store the indexed results in your SQL database. Your REST clients will then simply query that data, as in all the conventional web-with-SQL-backend applications our ancestors once built.
I suppose you could also look into giving admins the ability to start analytics jobs via a REST client too.
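The precompute-then-query pattern suggested above can be sketched without Spark, to show its shape. This is a toy under stated assumptions: whitespace tokenization, a plain `tf * idf` score, SQLite standing in for the SQL database, and hypothetical names `build_index`, `search` and the `tfidf` table.

```python
import math
import sqlite3
from collections import Counter


def build_index(conn, docs):
    """Precompute a TF-IDF score per (term, document) and store it in SQL.

    `docs` maps a document id to its text.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tfidf (term TEXT, doc_id INTEGER, score REAL)"
    )
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n_docs = len(tokenized)
    rows = []
    for doc_id, tokens in tokenized.items():
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])
            rows.append((term, doc_id, tf * idf))
    conn.executemany("INSERT INTO tfidf VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn, keyword, limit=10):
    """What a REST handler would run: an ordinary indexed SQL lookup."""
    cur = conn.execute(
        "SELECT doc_id, score FROM tfidf WHERE term = ? ORDER BY score DESC LIMIT ?",
        (keyword.lower(), limit),
    )
    return cur.fetchall()
```

The expensive part (`build_index`) is the piece that would run as a batch or streaming Spark job, while `search` stays cheap enough to serve concurrent API requests.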

Memsql Spark-Kafka Transform Failure

We have a Spark cluster running under MemSQL, with several pipelines running. The ETL setup is as follows.
Extract: Spark reads messages from the Kafka cluster (using MemSQL Kafka-Zookeeper)
Transform: We have a custom jar deployed for this step
Load: Data from the Transform stage is loaded into a columnstore
I have the following questions:
What happens to a message polled from Kafka if the job fails in the Transform stage?
- Does MemSQL take care of loading that message again?
- Or is the data lost?
If the data is lost, how can I solve this problem? Are there any configuration changes that need to be made for this?
As it stands, at-least-once semantics are not available in MemSQL Ops. This is on the roadmap and will be present in one of the future releases of Ops.
If you haven't yet, you should check out MemSQL 5.5 Pipelines.
http://blog.memsql.com/pipelines/
This one isn't based on Spark (and transforms are done a bit differently, so you might have to rewrite your code), but we have native Kafka streams now.
The way we get exactly-once with the native version is simple: store the offsets in the database in the same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets won't be committed either, so we'll naturally and automatically retry that partition-offset range.

Big Data Analytics using Redshift vs Spark, Oozie Workflow Scheduler with Redshift Analytics

We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time).
Currently, it seems that all our analytics can be done through Redshift queries (and hence no distributed processing might be required on our end), but we are not sure whether that will remain the case in the future.
In order to build a generic system that can cater to our future needs as well, we are looking to use Apache Spark for data analytics.
I know that data can be read into Spark RDDs from HDFS, HBase and S3, but does Spark support reading data from Redshift directly?
If not, we can look to transfer our data to S3 and then read it in Spark RDDs.
My question is whether we should carry out our data analytics through Redshift queries directly, or go with the approach above and do the analytics through Apache Spark (the problem here is that data locality optimization might not be available).
In case we do analytics through Redshift queries directly, can anyone please suggest a good Workflow Scheduler to write our Analytics jobs with. Our requirement is to be able to execute jobs as a DAG (Job2 should execute only if Job1 succeeds, etc) and be able to schedule our workflows through the proposed Workflow Engine.
Oozie seems like a good fit for our requirements but it turns out that Oozie cannot be used without Hadoop. Does it make sense to set up Hadoop on our machines and then use Oozie Workflow Scheduler to schedule our Data Analysis jobs through Redshift queries?
You cannot access data stored on Redshift nodes directly (e.g. via Spark), only via SQL queries submitted to the cluster as a whole.
My suggestion would be to use Redshift as long as possible and only take on the complexity of Spark/Hadoop when you absolutely need it.
If, in the future, you move to Hadoop then Cascading Lingual gives you the option of running your existing Redshift analytics more or less unchanged.
Regarding workflow, Oozie is not a good fit for Redshift. I would suggest you look at Azkaban (true DAG) or Luigi (uses a Python DSL).
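The DAG requirement from the question (Job2 runs only if Job1 succeeds) can be sketched in plain Python to make the semantics concrete. This is not the Azkaban or Luigi API: `run_dag` is a hypothetical helper, each job is assumed to be a callable returning `True` on success, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_dag(jobs, deps):
    """Run jobs in dependency order, skipping any job whose upstream did not succeed.

    `jobs` maps a job name to a callable; `deps` maps a job name to the set of
    job names it depends on.
    """
    graph = {name: deps.get(name, set()) for name in jobs}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        if all(status.get(up) == "ok" for up in deps.get(name, ())):
            try:
                status[name] = "ok" if jobs[name]() else "failed"
            except Exception:
                status[name] = "failed"
        else:
            status[name] = "skipped"
    return status
```

In practice each callable would submit a Redshift query; a dedicated scheduler adds what this sketch leaves out, such as scheduling, retries and persistence of run state.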

Resources