Functional tests of an application running on Spark Streaming with Kafka

I'm setting up functional tests for applications running with Spark Streaming and Kafka. The steps to be done are:
Start zookeeper server
Start kafka server
Start message producer to feed kafka with necessary data
Start Spark Streaming application
Wait for 5 minutes
Stop message producer
Stop Spark Streaming application
Stop kafka server
Stop zookeeper server
Validate output
What is the professional way to do this, other than a simple bash script?
I think this is quite a general question, not related strictly to Spark Streaming and Kafka. Maybe there are testing frameworks which support setting up the environment, running multiple processes in parallel, and data validation/assertions.

Unfortunately, there is no all-in-one framework out there.
One-line answer would be: use docker-compose with the simplest unit-testing or gherkin-based framework of your choice.
Considering the steps above as:
Start the env
Generate Kafka messages / Validate
Shut down the env
Docker Compose would be the best choice for steps #1 and #3.
version: '2'
services:
  kafka:
    # this container already has ZooKeeper built in
    image: spotify/kafka
    ports:
      - 2181:2181
      - 9092:9092
  # it's just a mock Spark container; you'll have to replace it with a
  # Docker container that can host your Spark app
  spark:
    image: epahomov/docker-spark:lightweighted
    depends_on:
      - kafka
The idea of the compose file is that you can start your env with one command:
docker-compose up
And the environment setup will be pretty much portable across dev machines and build servers.
For step #2, any test framework will do.
The scenario would look like this (a minimal pytest sketch of the flow follows at the end of this answer):
Start the environment / make sure it's started
Start generating messages
Make assertions / sleep, my sweet thread
Shut down the env
Talking about frameworks:
Scala: ScalaTest. There you get a good spectrum of async assertions and parallel execution.
Python: Behave (be careful with multiprocessing there) or a unit-testing framework such as pytest.
Do not let the name "unit-testing framework" confuse you.
Only the test environment determines whether a test is a unit, module, system or integration test, not the tool.
If a person uses a unit-testing framework and writes MyZookeeperConnect("192.168.99.100:2181") in it, it's not a unit test anymore; even a unit-testing framework can't help with that :)
To glue steps #1, #2 and #3 together, simple bash would be my choice.
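For reference, here is a minimal sketch of that scenario with pytest and the kafka-python client. The compose file above is assumed to be in the working directory; the topic names, payloads and timings are placeholders you would adapt to your Spark app.

import subprocess
import time

import pytest
from kafka import KafkaConsumer, KafkaProducer


@pytest.fixture(scope="session")
def environment():
    # Step #1: start the env
    subprocess.run(["docker-compose", "up", "-d"], check=True)
    time.sleep(30)  # crude "make sure it's started"; polling the broker port is nicer
    yield
    # Step #3: shut down the env
    subprocess.run(["docker-compose", "down"], check=True)


def test_streaming_pipeline(environment):
    # Step #2: generate messages for the Spark Streaming app to consume...
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(100):
        producer.send("input-topic", value=f"event-{i}".encode())
    producer.flush()

    time.sleep(300)  # "sleep my sweet thread" while the app processes

    # ...then validate whatever the app wrote back to Kafka
    consumer = KafkaConsumer(
        "output-topic",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,
    )
    results = [record.value for record in consumer]
    assert len(results) > 0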

Consider using the Citrus test framework (http://citrusframework.org/), which could be the all-in-one test framework for you.
Zookeeper access: check
Docker integration: check
Kafka integration via Apache Camel: check
Waiting for x period of time: check
Validating outcome: check
Also consider using the Fabric8 Docker Maven plugin (https://github.com/fabric8io/docker-maven-plugin) to set up the Docker test environment before the Citrus tests are executed within the same build run.
Here is an example for both tools working together for automated integration testing: https://github.com/christophd/citrus-samples/tree/master/sample-docker

Related

Migrating nodejs jobs to Airflow

I am looking at migrating several Node.js jobs to Apache Airflow.
These jobs log to standard output. I am new to Airflow and have set it up running in Docker. Ideally, we would update these jobs to use connections provided by Airflow, but I'm not sure that will be possible.
We have succeeded in running the job by installing Node.js and invoking it from a BashOperator:
t1 = BashOperator(
    task_id='task_1',
    bash_command='/usr/bin/nodejs /usr/local/airflow/dags/test.js',
    dag=dag)
Would this be a good approach? Or would writing a nodejs operator be a better approach?
I also thought of putting the Node.js code behind an HTTP service, which would be my preferred approach, but then we lose the logs.
Any thoughts on how best to architect this in Airflow?
The bash approach is feasible, but it is going to be very hard to maintain the nodejs dependencies.
I would migrate the code to containers and use docker_operator / KubernetesPodOperator afterwards.
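For illustration, a hedged sketch of that container-based approach: the image name and command are placeholders (an image with Node.js and the job script baked in), and the DockerOperator import path differs between Airflow 1.x (airflow.operators.docker_operator) and the 2.x provider package used here.

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="node_jobs",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_node_job = DockerOperator(
        task_id="task_1",
        image="my-node-jobs:latest",   # hypothetical image containing node and test.js
        command="node /app/test.js",   # container stdout/stderr end up in the task log
    )

This keeps the Node.js dependencies inside the image, so the Airflow workers only need Docker.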

Automated Spark testing environment

My problem: I am developing a Spark extension and I would like to run tests and performance benchmarks at scale before making the changes public. Currently such tests are a bit too manual: I compile & package my libraries, copy the jar files to a cluster where I have a private Spark deployment, restart Spark, then fire tests and benchmarks by hand. After each test I manually inspect logs and console output.
Could someone with more experience offer hints on how to make this more automatic? I am particularly interested in:
Ability to integrate with Github & Jenkins. Ideally I would only have to push a commit to the GitHub repo, then Jenkins would automatically pull and build, add the new libraries to a Spark environment, start Spark & trigger the tests and benchmarks, and finally collect & make output files available.
How to run and manage the Spark cluster. I see a number of options:
a) continue with having a single Spark installation: The test framework would update my jar files, restart Spark so the new libraries are picked up and then run the tests/benchmarks. The advantage would be that I only have to set up Spark (and maybe HDFS for sharing data & application binaries, YARN as the resource manager, etc) once.
b) run Spark in containers: My cluster would run a container management system (like Kubernetes). The test framework would create/update the Spark container image, fire up & configure a number of containers to start Spark, submit the test/benchmarks and collect results. The big advantage of this is that multiple developers can run tests in parallel and that I can test various versions of Spark & Hadoop.
Create a Docker container that has your entire solution, including the tests, push it to GitHub, and have Drone CI or Travis CI build it and listen for updates. It works great for me. 😀
There are many Spark Docker images on GitHub or Docker Hub; I use this one:
https://github.com/jupyter/docker-stacks/tree/master/all-spark-notebook
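As a small, concrete example of what a containerized test can look like, here is a pytest-style smoke test sketch. It assumes pyspark is installed in the image; the jar path for your extension is a hypothetical placeholder, and real benchmarks would run against the cluster rather than local mode.

from pyspark.sql import SparkSession


def test_extension_smoke():
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("extension-smoke-test")
        # .config("spark.jars", "/opt/my-extension.jar")  # hypothetical path to the built jar
        .getOrCreate()
    )
    try:
        # trivial job: just prove the session starts and can run a count
        assert spark.range(1000).count() == 1000
    finally:
        spark.stop()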

How to use docker to setup kafka and spark-streaming on a Mac?

I just want to learn Kafka and Spark Streaming on my local machine (macOS Sierra).
Maybe Docker is a good idea?
Seems like what you need is described here:
"If you've always wanted to try Spark Streaming, but never found a time to give it a shot, this post provides you with easy steps on how to get development setup with Spark and Kafka using Docker"
Example application here

Running two versions of Apache Spark in cluster mode

I want to be able to run Spark 2.0 and Spark 1.6.1 in cluster mode on a single cluster so that they can share resources. What are the best practices for doing this? The reason is that I want to shield a certain set of applications that rely on 1.6.1 from code changes, while others rely on Spark 2.0.
Basically the cluster could rely on dynamic allocation for Spark 2.0 but maybe not for 1.6.1 - this is flexible.
This is possible by using Docker: you can run various versions of Spark applications, since Docker runs each application in isolation.
Docker is an open platform for developing, shipping, and running applications. With Docker you can separate your applications from your infrastructure and treat your infrastructure like a managed application.
The industry is adopting Docker since it provides the flexibility to run applications of various versions on a single host, among other benefits.
Mesos also allows you to run Docker containers using Marathon.
For more information, please refer to:
https://www.docker.com/
https://mesosphere.github.io/marathon/docs/native-docker.html
Hope this helps!

Apache Spark application deployment best practices

I have a couple of use cases for Apache Spark applications/scripts, generally of the following form:
General ETL use case -
more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.
Streaming use case -
realtime analysis of the events as they arrive in the system.
For (1), I'll need to kick off the Spark application periodically.
For (2), just kick off the long running Spark Streaming process at boot time and let it go.
(Note - I'm using Spark Standalone as the cluster manager, so no yarn or mesos)
I'm trying to figure out the most common / best practice deployment strategies for Spark applications.
So far the options I can see are:
Deploying my program as a jar, and running the various tasks with spark-submit - which seems to be the way recommended in the spark docs. Some thoughts about this strategy:
how do you start/stop tasks - just using simple bash scripts?
how is scheduling managed? - simply use cron?
any resilience? (e.g. Who schedules the jobs to run if the driver server dies?)
Creating a separate webapp as the driver program.
creates a spark context programmatically to talk to the spark cluster
allowing users to kick off tasks through the http interface
using Quartz (for example) to manage scheduling
could use cluster with zookeeper election for resilience
Spark job server (https://github.com/ooyala/spark-jobserver)
I don't think there's much benefit over (2) for me, as I don't (yet) have many teams and projects talking to Spark, and would still need some app to talk to job server anyway
no scheduling built in as far as I can see
I'd like to understand the general consensus w.r.t. a simple but robust deployment strategy - I haven't been able to determine one by trawling the web as of yet.
Thanks very much!
Even though you are not using Mesos for Spark, you could have a look at:
Chronos, offering a distributed and fault-tolerant cron
Marathon, a Mesos framework for long-running applications
Note that this doesn't mean you have to move your Spark deployment to Mesos; e.g., you could just use Chronos to trigger the spark-submit.
I hope I understood your problem correctly and this helps you a bit!
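As a rough illustration of the Chronos suggestion, here is a sketch that registers a recurring spark-submit job through Chronos' REST API. The host, port, endpoint path (it differs between Chronos versions), schedule, class and jar names are all placeholders to adapt to your setup.

import requests

# Sketch under assumptions: chronos-host:4400 and the job fields below are
# placeholders; check your Chronos version's API docs for the exact endpoint.
job = {
    "name": "spark-etl-aggregation",
    "owner": "me@example.com",
    # ISO 8601 repeating interval: run every hour, indefinitely
    "schedule": "R/2016-01-01T00:00:00Z/PT1H",
    "command": (
        "/opt/spark/bin/spark-submit "
        "--master spark://spark-master:7077 "
        "--class com.example.EtlJob /opt/jobs/etl.jar"
    ),
    "cpus": 0.5,
    "mem": 512,
}

response = requests.post("http://chronos-host:4400/scheduler/iso8601", json=job)
response.raise_for_status()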
