Big Data integration testing best practice [closed] - apache-spark

I am looking for resources on best practices for an AWS-based data ingestion pipeline that uses Kafka, Storm, and Spark (streaming and batch), reads from and writes to HBase, and exposes the data layer through various microservices. For my local env I am thinking of creating either Docker or Vagrant images that will let me interact with the environment. My issue is how to stand up a functional end-to-end environment that is closer to prod; the brute-force way would be to have an always-on environment, but that gets expensive. Along the same lines, for a perf environment it seems I might have to punt and give a few service accounts the run of the whole cluster, while limiting other accounts' compute resources so they don't overwhelm it.
I am curious how others have handled the same problem and if I am thinking of this backwards.

AWS also provides a managed Docker service via EC2 Container Service. If your local deployment using Docker images works out, check out AWS ECS (https://aws.amazon.com/ecs/).
Also check out storm-docker (https://github.com/wurstmeister/storm-docker), which provides easy-to-use Dockerfiles for deploying Storm clusters.
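For the local env piece, here is a minimal docker-compose sketch of what such a stack could look like. The image names, ports, and settings are illustrative only, not a vetted setup; pin versions that match prod:

    version: '2'
    services:
      zookeeper:
        image: wurstmeister/zookeeper        # example image
        ports:
          - "2181:2181"
      kafka:
        image: wurstmeister/kafka            # example image
        ports:
          - "9092:9092"
        environment:
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
          KAFKA_ADVERTISED_HOST_NAME: localhost
        depends_on:
          - zookeeper
      hbase:
        image: harisekhon/hbase              # example standalone HBase image
        ports:
          - "16010:16010"                    # HBase master UI

You would point your Storm/Spark jobs at this stack locally, then swap endpoints when moving to the shared environment.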

Try Hadoop mini clusters; they have support for most of the tools you are using:
Mini Cluster

Related

For a small production environment, is it better to use only k8s masters or a mini k8s solution? [closed]

I have a small air-gapped production environment with only three Linux servers (CentOS or RHEL).
I want to deploy a small k8s cluster on them.
I have two approaches for now:
1. Installing a pure k8s cluster with only master nodes and removing their NoSchedule taint so that all pods run on them (see the sketch below).
2. Installing a lightweight distribution such as k3s, k0s, or MicroK8s and configuring every node as both master and worker.
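For reference, the untainting in the first approach is typically a one-liner; the taint key varies by Kubernetes version (newer releases use node-role.kubernetes.io/control-plane instead of .../master):

    # allow regular workloads to schedule on the master nodes
    kubectl taint nodes --all node-role.kubernetes.io/master-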
If I use the first approach (I know it's bad practice), is untainting the correct way to run pods on the masters?
If I go with the second, which option is the best and easiest to install and maintain across different air-gapped environments? (I have used k8s and OKD 3 in production, but not these.)
Lastly, which of the two approaches do you think is best, or is there a better one for my scenario?
Thanks in advance for the help

Cluster mode in nodejs using PM2 [closed]

We are planning to use Kubernetes to deploy a Node application in a clustered AWS environment. Is it good practice to use the Node.js cluster module (via PM2) for a distributed deployment in AWS, or is a single process per container the better fit?
It's really not about "good" or "bad".
Using PM2 would mean you'd ask Kubernetes for multiple CPUs for your pod.
Not using PM2 would mean you'd ask Kubernetes for one CPU (or less) per pod, which is easier for Kubernetes to schedule (possibly across multiple nodes).
Having one fat pod on one node is less reliable than having multiple smaller pods distributed across multiple nodes.
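A minimal sketch of the "multiple smaller pods" option (names, image, and values are illustrative): a Deployment with several single-process replicas, each requesting one CPU, which Kubernetes can spread across nodes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: node-app                         # hypothetical name
    spec:
      replicas: 3                            # several small pods instead of one fat PM2 pod
      selector:
        matchLabels:
          app: node-app
      template:
        metadata:
          labels:
            app: node-app
        spec:
          containers:
          - name: node-app
            image: example/node-app:latest   # placeholder image
            resources:
              requests:
                cpu: "1"                     # one CPU per single-process pod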
Hope this helps!

Cron bigquery jobs [closed]

Which is the best way to schedule BigQuery jobs?
BigQuery doesn't offer a direct approach, and the best I found from searching is the App Engine cron service, but from what I understand I would have to create a web application to use it.
My use case is to run some aggregations over clicks and impressions, daily or weekly, and use them in our admin portal.
I used Hive as a data warehouse before and Oozie as our scheduler.
Is there a way to accomplish the same logic with BigQuery?
Unfortunately, there is no built-in scheduler within BigQuery, although the engineering team takes feature requests (link).
However, there are a few interesting alternatives.
As you mentioned, using the cron service from App Engine would absolutely work: you could write a small, simple web service that invokes the query you want on a regular cadence. The service would not be web-facing, so the charges should remain extremely small.
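As a sketch, the App Engine side is mostly configuration: a cron.yaml that hits a handler you write, on a schedule (the URL and schedule below are placeholders):

    cron:
    - description: daily click/impression aggregation
      url: /tasks/run-aggregation            # handler you would implement to run the query
      schedule: every 24 hours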
Apache Airflow is a tool I have been playing around with that is very promising; it lets you define more complex data-manipulation tasks across a variety of cloud services in Python and execute them on whatever cadence you choose. Very handy.
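For example, here is a minimal Airflow DAG sketch that shells out to the bq CLI once a day; the dataset, table, and query are made up, and operator import paths vary across Airflow versions:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Hypothetical daily aggregation; replace the SQL and table names with your own.
    dag = DAG(
        dag_id="daily_click_aggregation",
        schedule_interval="@daily",
        start_date=datetime(2017, 1, 1),
    )

    aggregate = BashOperator(
        task_id="aggregate_clicks",
        bash_command=(
            "bq query --use_legacy_sql=false "
            "--destination_table=analytics.daily_clicks "
            "'SELECT DATE(ts) AS day, COUNT(*) AS clicks "
            "FROM analytics.click_events GROUP BY day'"
        ),
        dag=dag,
    )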
Regular cron - if you have a server available to you, you could just set up a basic cron job that uses the bq command-line tool to execute whatever queries you want and save the results to tables in BQ.
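That crontab entry could look something like this (schedule, table, and query are placeholders):

    # run a daily aggregation at 02:00 and write the result to a BQ table
    0 2 * * * bq query --use_legacy_sql=false --destination_table=analytics.daily_clicks 'SELECT ...'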
Hope that helps! I'm positive there are other options as well, just wanted to give you a few.

Which cloud to use for RabbitMQ? [closed]

We are looking for a cloud-based solution for our message queue. We chose RabbitMQ and already have a few apps using it, hosted locally. That was fine for testing, but now that the business is growing we want a centralized RabbitMQ with HA, so we are looking into a cloud solution.
My question is: which service would you recommend for RabbitMQ?
The options we've found are:
cloudamqp.com/
https://addons.heroku.com/rabbitmq-bigwig
https://bitnami.com and use Azure
or
host it in Azure and manage it ourselves - but we would like to avoid this as much as possible; we don't have enough people to look after it.
What would you recommend?
My suggestion is http://cloudamqp.com - I use them for just about all of my RabbitMQ hosting needs, for production web apps.
It's a fully managed RabbitMQ hosting service: you don't have to worry about much, and you can scale as large as you need, from very small and cheap to enterprise-level hosting with clustering, etc.

Multiple deployments of virtual machines/instances [closed]

Which one is better to go with: Juju, or Puppet/Chef? Why?
I want to start multiple deployments at the same time, to avoid repeating the same server setup again and again.
Thanks
It depends on what you need; every piece of software has its own strengths and weaknesses:
Juju encapsulates services - a charm defines all the ways the service needs to expose or consume config data to/from other services. How a charm does that is the charm's business: it can use any tool, from shell scripts to Chef in solo mode.
Juju orchestrates provisioning - juju keeps track of the resources it has available to it, and can add or remove them as needed.
Juju makes sharing easy - anyone can contribute a charm to the Juju Charm Store; these charms are vetted and peer reviewed by the Juju community.
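To make those points concrete, the basic Juju workflow is a handful of CLI commands (the charm names here are just examples):

    juju bootstrap                      # provision a controller on your cloud
    juju deploy mysql                   # deploy charms from the Charm Store
    juju deploy wordpress
    juju add-relation wordpress mysql   # the charms exchange config over the relation
    juju expose wordpress               # make the service reachable from outside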
My recommendation is to go with none of them: this is Docker's age, and it is a simple tool that manages all of your resources in an easy, fast, and reliable way. It is also supported by all major cloud providers, so you can simply launch a Docker VM on Azure and play with it the way you want.
http://www.docker.com/
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-docker-machine/
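Following the second link, a minimal sketch of provisioning a Docker host on Azure with docker-machine (the subscription ID and names are placeholders):

    # create an Azure VM provisioned as a Docker host
    docker-machine create --driver azure --azure-subscription-id <SUBSCRIPTION_ID> docker-vm
    # point the local Docker client at it and run a container
    eval $(docker-machine env docker-vm)
    docker run -d -p 80:80 nginx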
