Google Cloud BigQuery Library Error - python-3.x

I am receiving this error:
Cannot set destination table in jobs with DDL statements
when I try to resubmit a job built from the job._build_resource() function in the google.cloud.bigquery library.
It seems that the destination table is set to something like this after that function call.
'destinationTable': {'projectId': 'xxx', 'datasetId': 'xxx', 'tableId': 'xxx'},
Am I doing something wrong here? Thanks to anyone that can give me any guidance here.
EDIT:
The job is initially being triggered by this
query = bq.query(sql_rendered)
We store the job id and use it later to check the status.
We get the job like this
job = bq.get_job(job_id=job_id)
If it meets a condition (in this case, the job failed due to rate limiting), we retry the job.
We retry the job like this
di = job._build_resource()
jo = bigquery.Client(project=self.project_client).job_from_resource(di)
jo._begin()
I think that's pretty much all of the code you need, but happy to provide more if needed.

You are seeing this error because you have a DDL statement in your query. What is happening is that the job_config is changing some values after the execution of the first query, particularly job_config.destination. To work around this, you could reset the value of job_config.destination to None after each job submission, or use a different job_config for every query.
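As a minimal sketch of the second suggestion (a fresh job_config per query), reusing the client and the sql_rendered string from the question, with run_query as a hypothetical helper:
from google.cloud import bigquery

client = bigquery.Client(project='xxx')  # project id elided, as in the question

def run_query(sql):
    # Build a brand-new QueryJobConfig for every submission so that no stale
    # destination table (or any other field) is carried over from a prior job.
    job_config = bigquery.QueryJobConfig()
    return client.query(sql, job_config=job_config)

# On retry, re-run the original SQL instead of rebuilding the failed job's
# resource, which is what drags the populated destinationTable along with it.
retry_job = run_query(sql_rendered)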

Related

Incremental data load from Redshift to S3 using Pyspark and Glue Jobs

I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, tablename):
    table = str(schema) + str(".") + str(tablename)
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df
Where getConnection is a different method under a separate class that handles all the Redshift-related details. Later on, I used this method to create a data frame and wrote the results into S3, which worked like a charm.
Now, I want to load the incremental data. Will enabling the Job Bookmarks Glue option help me? Or is there any other way to do it? I followed this official documentation, but it was of no help for my problem statement. If I run it for the first time it will load the complete data, but if I rerun it, will it be able to load only the newly arrived records?
You are right. It can be achieved with job bookmarks, but it can be a bit tricky.
Please refer to this doc: https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
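As a rough illustration of what a bookmark-enabled Glue job can look like (the catalog database, table, bookmark key, and S3 path below are placeholder names, and this assumes the Redshift table is registered in the Glue Data Catalog):
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # bookmark state is tracked per JOB_NAME

# transformation_ctx plus jobBookmarkKeys let the bookmark remember how far it
# has read; on reruns only rows with a higher key value are picked up.
datasource = glueContext.create_dynamic_frame.from_catalog(
    database='my_catalog_db',
    table_name='redshift_table',
    transformation_ctx='datasource0',
    additional_options={'jobBookmarkKeys': ['id'], 'jobBookmarkKeysSortOrder': 'asc'}
)

glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type='s3',
    connection_options={'path': 's3://my-bucket/target/'},
    format='parquet',
    transformation_ctx='datasink0'
)

job.commit()  # commits the bookmark; without this, nothing is remembered between runs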

AWS CloudWatch Logs getQueryResults returns empty when tried with boto3

Using boto3 for AWS, I am trying to run start_query and get the results using the query ID, but it doesn't work as expected in my Python script. It returns the expected JSON output for start_query and I am able to fetch the queryId. But if I try to fetch the query results using that queryId, it returns an empty JSON.
import boto3

client = boto3.client('logs')
executeQuery = client.start_query(
    logGroupName='LOGGROUPNAME',
    startTime=STARTDATE,
    endTime=ENDDATE,
    queryString='fields status',
    limit=10000
)
getQueryId = executeQuery.get('queryId')
getQueryResults = client.get_query_results(
    queryId=getQueryId
)
It returns the response of get_query_results as
{'results': [], 'statistics': {'recordsMatched': 0.0, 'recordsScanned': 0.0, 'bytesScanned': 0.0}, 'status': 'Running',
But if I try using the AWS CLI with the queryId generated from the script, it returns the JSON output as expected.
Can anyone tell me why it didn't work from the boto3 Python script but worked in the CLI?
Thank you.
The query status is Running in your example; it's not in Complete status yet.
Running queries is not instantaneous. You have to wait a bit for the query to complete before you can get the results.
You can use describe_queries to check whether your query has completed or not. You can also check whether the Logs service has dedicated waiters in boto3 for the results; they would save you from polling the describe_queries API in a loop while waiting for your queries to finish.
When you do this in the CLI, there is probably more time between when you start the query and when you ask for the results.
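For example, a minimal polling sketch along those lines, reusing client and getQueryId from the question (the two-second interval and the set of terminal statuses checked are my own choices):
import time

# Poll get_query_results until CloudWatch Logs reports the query has finished.
while True:
    getQueryResults = client.get_query_results(queryId=getQueryId)
    if getQueryResults['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(2)  # give the query time to run before asking again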
The other issue you might be encountering is that the syntax for the queryString in the API is different from a query you would type into the CloudWatch console.
Console query syntax example:
{ $.foo = "bar" && $.baz > 0 }
API syntax for same query:
filter foo = "bar" and baz > 0
Source: careful reading and extrapolation from the official documentation plus some trial-and-error.
My logs are in JSON format. YMMV.
Not sure if this problem is resolved. I was facing the same issue with the AWS Java SDK, but when I terminate the thread performing the executeQuery call and perform get_query_results on a new thread with the old queryId, it seems to work fine.
Adding a sleep will work here. If the query takes longer than the sleep time, it will again show Running status. You can write a loop that checks for the Completed status; if the status is Running, sleep again for a few seconds and retry. You can give it a retry count.
Sample Pseudocode:
define a sleep function (say, SleepFunc())
loop until the retry count is reached:
    check if the status is Completed
    if yes, break
    else call SleepFunc() and check again

Right way to delete and then reindex ES documents

I have a python3 script that attempts to reindex certain documents in an existing ElasticSearch index. I can't update the documents because I'm changing from an autogenerated id to an explicitly assigned id.
I'm currently attempting to do this by deleting existing documents using delete_by_query and then indexing once the delete is complete:
self.elasticsearch.delete_by_query(
    index='%s_*' % base_index_name,
    doc_type='type_a',
    conflicts='proceed',
    wait_for_completion=True,
    refresh=True,
    body={}
)
However, the index is massive, and so the delete can take several hours to finish. I'm currently getting a ReadTimeoutError, which is causing the script to crash:
WARNING:elasticsearch:Connection <Urllib3HttpConnection: X> has failed for 2 times in a row, putting on 120 second timeout.
WARNING:elasticsearch:POST X:9200/base_index_name_*/type_a/_delete_by_query?conflicts=proceed&wait_for_completion=true&refresh=true [status:N/A request:140.117s]
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='X', port=9200): Read timed out. (read timeout=140)
Is my approach correct? If so, how can I make my script wait long enough for the delete_by_query to complete? There are two timeout parameters that can be passed to delete_by_query (search_timeout and timeout), but search_timeout defaults to no timeout (which I think is what I want), and timeout doesn't seem to do what I want. Is there some other parameter I can pass to delete_by_query to make it wait as long as it takes for the delete to finish? Or do I need to make my script wait some other way?
Or is there some better way to do this using the ElasticSearch API?
You should set wait_for_completion to False. In this case you'll get task details and will be able to track task progress using corresponding API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete-by-query.html#docs-delete-by-query-task-api
Just to explain more, in code form, what Random described, for newbies to ES/Python like me:
ES = Elasticsearch(['http://localhost:9200'])
query = {'query': {'match_all': dict()}}
response = ES.delete_by_query(index='index_name', doc_type='sample_doc', wait_for_completion=False, body=query, ignore=[400, 404])
task_id = response['task']  # with wait_for_completion=False the response carries the task id
response_task = ES.tasks.get(task_id=task_id)  # check if the task is completed
isCompleted = response_task["completed"]  # if the "completed" key is true, the task is completed
One can write a custom function that checks in a loop, at some interval, whether the task has completed.
I have used Python 3.x and Elasticsearch 6.x.
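A minimal sketch of such a check, reusing the ES client and task_id from above (the five-second interval is an arbitrary choice):
import time

def wait_for_task(es_client, task_id, interval=5):
    # Poll the tasks API until the delete_by_query task reports completion.
    while True:
        task_info = es_client.tasks.get(task_id=task_id)
        if task_info.get('completed'):
            return task_info
        time.sleep(interval)

result = wait_for_task(ES, task_id)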
You can use the 'request_timeout' global param. This will override the connection's timeout settings, as mentioned here.
For example -
es.delete_by_query(index=<index_name>, body=<query>, request_timeout=300)
Or set it at the connection level, for example
es = Elasticsearch(**(get_es_connection_parms()), timeout=60)

Azure Stream Analytics: "Stream Analytics job has validation errors: The given key was not present in the dictionary."

I burned a couple of hours on a problem today and thought I would share.
I tried to start up a previously-working Azure Stream Analytics job and was greeted by a quick failure:
Failed to start Streaming Job 'shayward10ProcessLogs'.
I looked at the JSON log and found nothing helpful whatsoever. The only description of the problem was:
Stream Analytics job has validation errors: The given key was not present in the dictionary.
Given the error and some changes to our database, I tried the following to no effect:
Deleting and Recreating all Inputs
Deleting and Recreating all Outputs
Running tests against the data (coming from Event Hub) and the output looked good
My query looked as follows:
SELECT
    dateTimeUtc,
    context.tenantId AS tenantId,
    context.userId AS userId,
    context.deviceId AS deviceId,
    changeType,
    dataType,
    changeStatus,
    failureReason,
    ipAddress,
    UDF.JsonToString(details) AS details
INTO
    [MyOutput]
FROM
    [MyInput]
WHERE
    logType = 'MyLogType';
Nothing made sense so I started deconstructing my query. I took it down to a single field and it succeeded. I went field by field, trying to figure out which field (if any) was the cause.
See my answer below.
The answer was simple (yet frustrating). When I got to the final field, that's where the failure was:
UDF.JsonToString(details) AS details
This was the only field that used a user-defined function. After futzing around, I noticed that the Function Editor showed the title of the function as:
udf.JsonToString
It was a casing issue. I had UDF in UPPERCASE and Azure Stream Analytics expected it in lowercase. I changed my final field to:
udf.JsonToString(details) AS details
It worked.
The strange thing is, it was previously working. Microsoft may have made a change to Azure Stream Analytics to make it case-sensitive in a place where it seemingly wasn't before.
It makes sense, though. JavaScript is case-sensitive. Every JavaScript object is basically a dictionary of members. Consider the error:
Stream Analytics job has validation errors: The given key was not present in the dictionary.
The "udf" object had a dictionary member with my function in it. The UDF object would be undefined. Undefined doesn't have my function as a member.
I hope my 2-hour head-banging session helps someone else.

Determine if a cucumber scenario has pending steps

I would like to retrieve the scenario state in the "After" scenario hook. I noticed that the .failed? method does not consider pending steps as failed steps.
So how can I determine that a scenario did not execute completely, either because it failed or because some steps were not implemented/defined?
You can use the status method. The default value of status is :skipped, a failed step is :failed, and a passed step is :passed. So you can write something like this:
do_something if step.status != :passed
Also, if you use !step.passed? it does the same thing, because it only checks for the :passed status.
http://cukes.info/api/cucumber/ruby/yardoc/Cucumber/Ast/Scenario.html#failed%3F-instance_method
On that subject, you can also take a look at this post about demoing your feature specs to your customers: http://multifaceted.io/2013/demo-feature-tests/
LiohAu, you can use the 'status' method on a scenario itself rather than on individual steps. Try this: In hooks, add
After do |scenario|
p scenario.status
end
This will give the statuses as follows:
If any step is not implemented/defined, it'll give you :undefined
If the scenario fails (when all steps are defined): :failed
If the scenario passes: :passed
Using the same hook, it'll give you the status for a scenario outline, but per example row (since each example row is an individual scenario). So if you want the result of the entire outline, you'll need to capture the results for all example rows and compute the final result accordingly.
Hope this helps.
