Cron BigQuery jobs [closed] - cron

What is the best way to schedule BigQuery jobs?
BigQuery doesn't offer a scheduler of its own, and the best option I found while searching is the App Engine cron service, but from what I understood I would have to create a web application to use it.
My use case is to run some aggregations over clicks and impressions, daily or weekly, and use them in our admin portal.
I used Hive as a data warehouse before and Oozie as our scheduler.
Is there a way to accomplish the same logic with BigQuery?

Unfortunately, there is no built-in scheduler within BigQuery, although the engineering team takes feature requests (link).
However, there are a few interesting alternatives.
As you mentioned, using the cron service from App Engine would absolutely work: you could write a small, simple web service that invokes the query you want on a regular cadence. This service would not be web-facing, so the charges should remain extremely small.
Apache Airflow is a workflow scheduler I have been playing around with that is very promising; it allows you to define more complex data-manipulation tasks across a variety of cloud services in Python and execute them on whatever cadence you choose. Very handy.
Regular cron - if you have a server available, you could simply set up a basic cron job that uses the 'bq' command-line tool to execute whatever queries you want and save the results to tables in BigQuery.
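For example, a minimal sketch of that last option, assuming the Cloud SDK is installed and authenticated on the server, and using made-up dataset and table names (logs.click_events, reporting.daily_clicks) purely for illustration:

    # Hypothetical crontab entry: every day at 02:00, aggregate clicks per day
    # and overwrite a reporting table (all names below are placeholders).
    0 2 * * * bq query --use_legacy_sql=false --replace --destination_table=reporting.daily_clicks 'SELECT DATE(event_time) AS day, COUNT(*) AS clicks FROM logs.click_events GROUP BY day'

The same query could equally be issued from an App Engine handler or an Airflow task; only the scheduler around it changes.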
Hope that helps! I'm positive there are other options as well, just wanted to give you a few.

Related

Which load testing method is better? API testing or full website testing [closed]

I have an application that we implemented with a sort of microservices-type architecture. The application consists of 6 services (6 Docker containers). I need to load test this application, and as I don't have much experience in the testing field, I'm not sure which method to use.
Right now I have used the Gatling load-testing tool. I record the test script by starting the recorder and wandering around my application to capture all the routes. I went through most of the routes in that single recording in order to mimic a realistic user, reasoning that this is how users normally use the application and that I can load test at 1000 times that by editing the number of threads/users.
Later I read about API testing, which focuses on the APIs themselves, putting each API under heavy load. So I'm confused about which testing method I should use. If we go for API testing, it will only tell how far that particular API can scale, right? (Not sure.)
Is there any issue with my method of load testing?
It depends entirely on what you hope to achieve...
If you're looking to validate that your entire application (code + production infrastructure) can handle a given load, then driving as though going through the full website is the right path.
However, if you're looking to see how a particular API scales, or want to help developers explore the ramifications of changes, then you will probably want to drive that API directly to avoid other limitations your system may have.
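If it helps to see the difference in scope, here is a deliberately tiny single-endpoint driver in plain Python (not a replacement for Gatling; the endpoint URL, request count, and concurrency are hypothetical) that illustrates what "driving one API directly" means, as opposed to replaying a full recorded user journey:

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://localhost:8080/api/items"  # hypothetical endpoint of one service
    N_REQUESTS = 1000
    CONCURRENCY = 50

    def call_once(_):
        # One request against a single API, timed individually.
        start = time.perf_counter()
        resp = requests.get(URL, timeout=10)
        return resp.status_code, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(call_once, range(N_REQUESTS)))

    ok = sum(1 for status, _ in results if status == 200)
    latencies = sorted(lat for _, lat in results)
    print("successful:", ok, "of", len(results))
    print("p95 latency: %.3fs" % latencies[int(0.95 * len(latencies))])

In Gatling terms, this corresponds to a scenario with a single request injected at a chosen rate, rather than the recorded multi-step journey.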

Reading SQL Server log files (LDF) with Spark [closed]

This is probably far-fetched, but... can Spark - or any advanced "ETL" technology you know - connect directly to SQL Server's log file (the .ldf) and extract its data?
The goal is to get at SQL Server's real-time operational data without replicating the whole database first (and without selecting directly from it).
Appreciate your thoughts!
Rea
To answer your question: I have never heard of any technology that can read an LDF directly, but there are several products on the market that can "link-clone" a database almost instantly by using some internal tricks. Keep in mind that the data is not copied by these tools, yet they allow near-instant access for use cases like yours.
There may be some free ways to do this, especially using cloud offerings or the linked-clone features that virtual machines offer, but the options I know of at this time are paid products such as Dell EMC's, Redgate's and Windocks.
The easiest ones to try that are not in the cloud are:
Red Gate SQL Clone, with a 14-day free trial:
Red Gate SQL Clone Link
Windocks.com (this is free for some cases, but harder to get started with)
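Once a clone exists, Spark never touches the LDF itself; it simply queries the cloned database over JDBC like any other SQL Server instance. A minimal PySpark sketch, where the driver version, host, database, table and credentials are all hypothetical placeholders:

    from pyspark.sql import SparkSession

    # The SQL Server JDBC driver must be available to Spark; the package
    # coordinates below are illustrative.
    spark = (SparkSession.builder
             .appName("read-cloned-sqlserver")
             .config("spark.jars.packages",
                     "com.microsoft.sqlserver:mssql-jdbc:8.4.1.jre8")
             .getOrCreate())

    # Hypothetical connection details for the link-cloned instance.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:sqlserver://clone-host:1433;databaseName=SalesClone")
              .option("dbtable", "dbo.Orders")
              .option("user", "reader")
              .option("password", "change-me")
              .load())

    orders.show(10)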

When to use Bots, FaaS, Runbooks and Logic Apps [closed]

All of these Azure technologies (Bots, FaaS, Logic Apps and Runbooks) can be used to run scheduled jobs.
I don't know when we should use each of them, or which scenarios they are each suited to.
YMMV, but here are some pretty good rules of thumb:
Are you doing PowerShell based Automation work? If Yes, consider Azure Automation Runbooks.
Are you building a bot? If Yes, consider the Azure Bot Framework service.
Are you building a workflow that executes on a timer, especially one that integrates with other services? If Yes, consider Logic Apps.
Are you writing generic application code? If Yes, consider Azure Functions.
If none of those fit, I'd be surprised, but you might try starting with Azure Functions, since we're kind of an "everything as a service". Still, there is a reason we have the different products - they specialize to enable better productivity within their specialty (Bots, Automation, and Integration).
Note: I'm one of the PMs on the Azure Functions team here at Microsoft.
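For the scheduled-job part of the question specifically, a timer-triggered Azure Function is often the quickest start. A minimal sketch using the Python worker (v1 programming model); the schedule, names and log message are just examples:

    # function.json - declares a timer trigger firing at 06:00 UTC every day
    # {
    #   "scriptFile": "__init__.py",
    #   "bindings": [
    #     { "name": "mytimer", "type": "timerTrigger",
    #       "direction": "in", "schedule": "0 0 6 * * *" }
    #   ]
    # }

    # __init__.py
    import logging

    import azure.functions as func

    def main(mytimer: func.TimerRequest) -> None:
        # Whatever the scheduled job needs to do goes here.
        logging.info("Daily job fired (past due: %s)", mytimer.past_due)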

Google Dataflow vs Apache Spark Streaming (either on Google Cloud or with Google Dataproc) [closed]

I am new to cloud and big data, but I have a lot of interest in them, and I have significant experience in Java programming. I am currently working on a university project comparing the performance of Apache Spark Streaming with Google Cloud Dataflow. I have read a number of articles, including the comparison done here.
I understand that the programming models of Spark and Dataflow are different; however, because my knowledge in this area is limited and new, I am trying to understand whether this performance comparison can still be done.
What type of use case would be appropriate for it, and what performance parameters should be considered for a streaming application?
While reading about Dataflow and Spark, I also came across Dataproc, and I am wondering whether it is better to compare Dataflow vs Spark on Dataproc, or Dataflow vs Spark on Google Cloud generally.
Any advice on this would be appreciated, as I am not finding a clear direction.
The best way to compare performance is with real end-to-end data processing pipelines. So you first need to answer your own question "what type of use case would be correct for this?" as there are a nearly unlimited variety.
You might find some inspiration in the included examples.
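One practical way to keep such a comparison apples-to-apples is to write the pipeline once with Apache Beam (the programming model behind Dataflow) and run it both on the Dataflow runner and on Beam's Spark runner. A minimal batch-style sketch in Python; the bucket paths are hypothetical, and a streaming comparison would swap in an unbounded source such as Pub/Sub or Kafka:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner (DirectRunner, DataflowRunner, SparkRunner, ...) is picked via
    # pipeline options, so the same code can target either engine, e.g.
    # --runner=DataflowRunner --project=... on the command line.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "PairWithOne" >> beam.Map(lambda word: (word, 1))
         | "CountPerWord" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
         | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcounts"))

Useful performance parameters to record for either engine include end-to-end latency, sustained throughput, and resource cost for the same input volume.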

Big Data integration testing best practice [closed]

I am looking around for resources on best practices for an AWS-based data ingestion pipeline that uses Kafka, Storm, and Spark (streaming and batch), reads from and writes to HBase, and exposes the data layer through various microservices. For my local environment I am thinking of creating either Docker or Vagrant images that will allow me to interact with the environment. My issue is how to stand up a functional end-to-end environment that is closer to prod; the drop-dead simple way would be to have an always-on environment, but that gets expensive. Along the same lines, for a perf environment it seems like I might have to punt and give some service accounts the 'run of the world' while other accounts are limited in compute resources so they don't overwhelm the cluster.
I am curious how others have handled the same problem and if I am thinking of this backwards.
AWS also provides a Docker service via EC2 Container Service. If your local deployment using Docker images is successful, you can check out AWS EC2 Container Service (https://aws.amazon.com/ecs/).
Also check out storm-docker (https://github.com/wurstmeister/storm-docker), which provides easy-to-use Dockerfiles for deploying Storm clusters.
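As a concrete starting point for the local setup, a docker-compose file can bring up the messaging layer with a couple of services. The snippet below follows the wurstmeister images mentioned above, but it is only a sketch for local testing, not a production configuration:

    # docker-compose.yml - minimal ZooKeeper + Kafka for local development only
    version: "2"
    services:
      zookeeper:
        image: wurstmeister/zookeeper
        ports:
          - "2181:2181"
      kafka:
        image: wurstmeister/kafka
        ports:
          - "9092:9092"
        environment:
          KAFKA_ADVERTISED_HOST_NAME: localhost
          KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
        depends_on:
          - zookeeper

Storm, Spark and HBase containers can be layered into the same file as the environment grows, which keeps the always-on cost limited to your local machine.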
Try Hadoop mini clusters; they have support for most of the tools you are using.
Mini Cluster
