I have an issue where the Dataflow job itself runs fine, but it does not produce any output until the job is manually drained.
With the following code I assumed it would produce windowed output, effectively triggering after each window.
lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    | beam.WindowInto(
        beam.window.FixedWindows(5 * 60),
        trigger=beam.trigger.Repeatedly(beam.trigger.AfterProcessingTime(5 * 60)),
        accumulation_mode=beam.trigger.AccumulationMode.DISCARDING)
    | "write" >> sink
)
What I want is that if events have been received in a window, output is produced after that window in any case. The source is Cloud Pub/Sub, with approximately 100 events/minute.
These are the parameters I use to start the job:
python main.py \
--region $(REGION) \
--service_account_email $(SERVICE_ACCOUNT_EMAIL_TEST) \
--staging_location gs://$(BUCKET_NAME_TEST)/beam_stage/ \
--project $(TEST_PROJECT_ID) \
--inputTopic $(TOPIC_TEST) \
--outputLocation gs://$(BUCKET_NAME_TEST)/beam_output/ \
--streaming \
--runner DataflowRunner \
--temp_location gs://$(BUCKET_NAME_TEST)/beam_temp/ \
--experiments=allow_non_updatable_job \
--disk_size_gb=200 \
--machine_type=n1-standard-2 \
--job_name $(DATAFLOW_JOB_NAME)
Any ideas on how to fix this? I'm using the apache-beam 2.22 SDK with Python 3.7.
Excuse me if you are indeed referring to 2.22, because "apache-beam 1.22" seems old; especially since you are using Python 3.7, you might want to try a newer SDK version such as 2.22.0.
What I want is that if events have been received in a window, output is produced after that window in any case. The source is Cloud Pub/Sub, with approximately 100 events/minute.
If you just need one pane fired per window with fixed 5-minute windows, you can simply go with:
beam.WindowInto(beam.window.FixedWindows(5*60))
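As a sketch, assuming the same source, decode/parse functions, and sink as in your pipeline, the windowed step would simply become:
lines = (
    p
    | "read" >> source
    | "decode" >> beam.Map(decode_message)
    | "Parse" >> beam.Map(parse_json)
    # No explicit trigger: the default trigger emits one pane per window
    # once the watermark passes the end of the 5-minute window.
    | "window" >> beam.WindowInto(beam.window.FixedWindows(5 * 60))
    | "write" >> sink
)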
If you want to customize triggers, you can take a look at the streaming-102 document.
Here is a streaming example with visualization of windowed outputs.
from datetime import timedelta
import apache_beam as beam
from apache_beam.runners.interactive import interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

# Capture 30 seconds of streaming data for interactive inspection.
ib.options.capture_duration = timedelta(seconds=30)
ib.evict_captured_data()

# `options` and `topic` are assumed to be defined elsewhere (pipeline options and the Pub/Sub topic).
pstreaming = beam.Pipeline(InteractiveRunner(), options=options)
words = (pstreaming
         | 'Read' >> beam.io.ReadFromPubSub(topic=topic)
         | 'Window' >> beam.WindowInto(beam.window.FixedWindows(5)))
ib.show(words, visualize_data=True, include_window_info=True)
If you run this code in a notebook environment such as JupyterLab, you get to debug the streaming pipeline with visualized windowed output. Note that the windows are visualized: for a capture period of 30 seconds, we get 6 windows since the fixed window size is set to 5 seconds. You can bin the data by window to see which data came in which window.
You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.
I used HaplotypeCaller (GATK 4.2.6.1) for variant calling on a WES picard.sorted.MarkedDup.bam file, using the standard HaplotypeCaller command line.
Apparently everything worked well and I received a standard .vcf file, but the number of identified variants is far too high for a WES result: close to one million variants for a single sample!
Did I perform something wrong?
What solution do you recommend?
Any help would be appreciated.
The command line I used was as follows:
gatk --java-options -Xmx8g HaplotypeCaller \
  -R $refFile \
  -I ${base}.picard.sorted.markedDup.bam \
  --dont-use-soft-clipped-bases -stand-call-conf 20.0 \
  --emit-ref-confidence GVCF \
  -O ${base}.rrrrealigned.vcf
My bash script retrieves Lambda functions and their tags.
It runs OK and does what it needs to do; however, I need the output written to a .txt or .csv file in a readable format.
Below is the script I have:
#!/bin/bash
while read -r name; do
aws lambda list-functions | jq -r ".Functions[].FunctionArn" | xargs -I {} aws lambda list-tags --resource {} --query '{"{}":Tags}' --output text
done
Below is what a returned value looks like after the script runs:
ARN:AWS:LAMBDA:EU-WEST-1:1939999:FUNCTION:example-lambda EXX dev example-lambda False release-1.1.9 False True
I need to get all the items returned and lined up neatly in a txt or csv file. Any help would be appreciated.
I would recommend using the resourcegroupstaggingapi API to solve this problem. This API allows you to get all resources of a specific type and their tags.
To get all your Lambda functions for your default region and their tags you can run the following command:
aws resourcegroupstaggingapi get-resources --resource-type-filters "lambda"
The output of this command can now be parsed with jq. The great thing about jq is that you can manipulate the output to be CSV.
To get CSV output with two columns (ARN, Tags) you can run the following command:
aws \
resourcegroupstaggingapi \
get-resources \
--resource-type-filters "lambda" \
| jq -r '.ResourceTagMappingList[] | [.ResourceARN, ((.Tags | map([.Key, .Value] | join("="))) | join(","))] | @csv'
The advantage of this approach is that you only have a single HTTP call making it relatively fast. The disadvantage is that you only get the ARN and the tags.
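To save that CSV straight to a file, you can simply redirect the output (lambda_tags.csv is just a hypothetical file name):
aws resourcegroupstaggingapi get-resources --resource-type-filters "lambda" \
  | jq -r '.ResourceTagMappingList[] | [.ResourceARN, ((.Tags | map([.Key, .Value] | join("="))) | join(","))] | @csv' \
  > lambda_tags.csv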
As shimo mentioned in a comment on the question, one way to save the output of a command to a file is to use the > operator.
The > operator replaces the existing content of the file with the output of the command. If you want to append the output of multiple commands to the same file, use the >> operator instead.
You can also use a pipe and the tee command: the output is printed to your screen and written to a file (with tee -a it is appended to the end of the file).
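A minimal sketch, using the list-functions call from your script and a hypothetical lambdas.txt output file:
# Replace the file's contents with the command's output:
aws lambda list-functions > lambdas.txt
# Append to the end of the file instead of overwriting:
aws lambda list-functions >> lambdas.txt
# Print to the screen and append to the file at the same time:
aws lambda list-functions | tee -a lambdas.txt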
I found this tutorial helpful. Based on what you have written, you could pipe the output of your command into a CSV file, or concatenate the values into an array and then write it to a file with a newline character at the end of each line.
I am trying to fully automate the build, test, and release of a database project using Azure Pipeline.
I already have a Visual Studio solution which consists of three database projects. The first project is the database, which contains the tables, stored procedures, functions, data, etc. The second project is the tSQLt framework (v 1.0.5873.27393 if anyone is interested). And finally, the third project is the tSQLt tests.
My goal here is to check the solution into source control, and the pipeline will automatically build the solution, deploy the dacpacs to a build server (Docker in this case), run the tSQLt tests, and publish the results back to the pipeline.
My pipeline works like this:
Build the Visual Studio solution
Publish the artifacts
Set up a Docker container running Ubuntu & SQL Server
Install SQLPackage
Deploy the dacpacs to the SQL instance
Run the tSQLt tests
Publish the test results
Everything up to publishing the results is working, but on this step I got the following error:
[warning]Failed to read /home/vsts/work/1/Results.xml. Error : Data at the root level is invalid. Line 1, position 1.
I added another step in the pipeline to display the content of the Results.xml file. It appears like this:
XML_F52E2B61-18A1-11d1-B105-00805F49916B
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<testsuites><testsuite id="1" name="MyNewTestClassOne" tests="1" errors="0" failures="0" timestamp="2021-02-01T10:40:31" time="0.000" hostname="f6a05d4a3932" package="tSQLt"><properties/><testcase classname="MyNewTestClassOne" name="TestNumberOne" time="0.
I'm not sure if the column name and dashes should be in the file, but I'm guessing not. I added another step to the pipeline to remove them, leaving me with just the XML. But this then gave me a different error to deal with:
##[warning]Failed to read /home/vsts/work/1/Results.xml. Error : There is an unclosed literal string. Line 2, position 1.
This one is a little easier to spot, because as you can see above, the XML is incomplete.
Here is the part of my pipeline which runs the tSQLt tests and outputs the results to Results.xml:
- script: |
    sqlcmd -S 127.0.0.1,1433 -U SA -P Password.1! -d StagingDB -Q 'EXEC tSQLt.RunAll;'
  displayName: 'tSQLt - Run All Tests'

- script: |
    cd $(Pipeline.Workspace)
    sqlcmd -S 127.0.0.1,1433 -U SA -P Password.1! -d StagingDB -Q 'SET NOCOUNT ON; EXEC tSQLt.XmlResultFormatter;' -o 'tSQLt_Results.xml'
  displayName: 'tSQLt - Output Results'
I've researched so many blogs and articles on this, and most people are doing the same. Some people use PowerShell instead of sqlcmd, but given I'm using an Ubuntu machine, this isn't an option here.
I am all out of options, so I am looking for a little help on this.
You are dealing with two problems here: there is noise in your result set that is not XML, and your XML result is truncated after 256 characters. I can help you with both.
What I am doing is basically this:
/opt/mssql-tools/bin/sqlcmd \
-S "localhost, 31114" -U sa \
-P "password" \
-d dbname \
-y0 \
-Q "BEGIN TRY EXEC tSQLt.RunAll END TRY BEGIN CATCH END CATCH; EXEC tSQLt.XmlResultFormatter" \
| grep -w "<testsuites>" \
| tee "resultfile.xml"
A few things to note:
-y0 is important. It sets the display width for the XML result to unlimited, up from the default of 256 characters.
grep with a pattern makes sure you only get the XML and not the noise around it.
If you want to run only a subset of your tests, you need to amend the SQL query being passed in, but other than that, this is a catch-all one-liner to run all tests and get the results in XML format, readable by Azure DevOps.
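For reference, here is a rough sketch of how this could be wired into the pipeline as a single step (the server address, credentials, database name and Results.xml path are taken from the question and may need adjusting):
- script: |
    cd $(Pipeline.Workspace)
    sqlcmd -S 127.0.0.1,1433 -U SA -P 'Password.1!' -d StagingDB -y0 \
      -Q "BEGIN TRY EXEC tSQLt.RunAll END TRY BEGIN CATCH END CATCH; EXEC tSQLt.XmlResultFormatter" \
      | grep -w "<testsuites>" \
      | tee Results.xml
  displayName: 'tSQLt - Run Tests and Output XML Results'
The TRY/CATCH around tSQLt.RunAll keeps a failing test from aborting the batch before the XML formatter runs, so the publish step can still report the failures.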
I am running a .py script with arguments, but the code asks for input and I cannot enter anything, as seen in the picture.
Updated: Colab now supports input prompts. Try running things again, and you should see a prompt like so:
If you know beforehand what inputs you want to enter then you can use:
! printf 'y\ny\ny\n' | python run.py --task 1 --gpu -1 --data "data/"
In the above case, if the terminal prompts for input three times, it will enter first y, then y, then y; \n is just for the newline.
For example, if you need to enter only two inputs, say q followed by d, then it should look like:
! printf 'q\nd\n' | python run.py --task 1 --gpu -1 --data "data/"
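If every prompt expects the same answer (say y), the standard yes command is another option; it keeps repeating its argument (y by default) for as long as the script reads input:
! yes | python run.py --task 1 --gpu -1 --data "data/"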
I downloaded and installed SyntaxNet following the official documentation on GitHub. Following the documentation (annotating a corpus), I tried to have SyntaxNet read a .conll file named wj.conll and write the results to wj-tagged.conll, but I could not. My questions are:
Does SyntaxNet always read .conll files (not .txt files)? I got a bit confused: I know SyntaxNet reads .conll files for the training and testing process, but I suspect it is necessary to convert a .txt file to a .conll file in order to get its part-of-speech tags and dependency parse.
How can I make SyntaxNet read from files? (I tried all the possible ways explained in the GitHub documentation for SyntaxNet and it didn't work for me.)
Add these declaration lines to "context.pbtxt" at the end of the file. Here "inp" and "out" are text files present in the root directory of syntaxnet.
input {
  name: 'inp_file'
  record_format: 'english-text'
  Part {
    file_pattern: 'inp'
  }
}
input {
  name: 'out_file'
  record_format: 'english-text'
  Part {
    file_pattern: 'out'
  }
}
Add the sentences you want tagged to the "inp" file, and refer to these inputs in the shell the next time you run syntaxnet, using the --input and --output flags.
Just to help you a bit more, here is an example shell command:
bazel-bin/syntaxnet/parser_eval \
--input inp_file \
--output stdout-conll \
--model syntaxnet/models/parsey_mcparseface/tagger-params \
--task_context syntaxnet/models/parsey_mcparseface/context.pbtxt \
--hidden_layer_sizes 64 \
--arg_prefix brain_tagger \
--graph_builder structured \
--slim_model \
--batch_size 1024 | bazel-bin/syntaxnet/parser_eval \
--input stdout-conll \
--output out_file \
--hidden_layer_sizes 512,512 \
--arg_prefix brain_parser \
--graph_builder structured \
--task_context syntaxnet/models/parsey_mcparseface/context.pbtxt \
--model_path syntaxnet/models/parsey_mcparseface/parser-params \
--slim_model --batch_size 1024
In the above script, the output (POS tagging) of the first shell command is used as the input for the second shell command; the two shell commands are separated by "|".
Just a quick tip if you want to save the output of the demo to a .txt file, try:
echo "open file X with application Y" | ./demo.sh > output.txt
This writes the sentence tree to output.txt in the current directory.