Monitoring WriteToBigQuery - python-3.x

In my pipeline I use WriteToBigQuery something like this:
  | beam.io.WriteToBigQuery(
      'thijs:thijsset.thijstable',
      schema=table_schema,
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
This returns a Dict as described in the documentation as follows:
The beam.io.WriteToBigQuery PTransform returns a dictionary whose
BigQueryWriteFn.FAILED_ROWS entry contains a PCollection of all the
rows that failed to be written.
How do I print this dict and turn it into a PCollection, or how do I just print the FAILED_ROWS?
If I do: | "print" >> beam.Map(print)
Then I get: AttributeError: 'dict' object has no attribute 'pipeline'
I must have read a hundred pipelines but never have I seen anything after the WriteToBigQuery.
[edit]
When I finish the pipeline and store the results in a variable I have the following:
{'FailedRows': <PCollection[WriteToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn).FailedRows] at 0x7f0e0cdcfed0>}
But I do not know how to use this result in the pipeline like this:
  | beam.io.WriteToBigQuery(
      'thijs:thijsset.thijstable',
      schema=table_schema,
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
  | ['FailedRows'] from previous step
  | "print" >> beam.Map(print)

Dead-letter patterns for handling invalid inputs are a common Beam/Dataflow usage and work with both the Java and Python SDKs, but there are not many examples for the latter.
Imagine that we have some dummy input data with 10 good lines and a bad row that does not conform to the table schema:
schema = "index:INTEGER,event:STRING"
data = ['{0},good_line_{1}'.format(i + 1, i + 1) for i in range(10)]
data.append('this is a bad row')
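The CsvToDictFn used in the pipeline below is not included here; a minimal sketch that is consistent with the failed-row output shown further down might look like this (my reconstruction, not the original code):
class CsvToDictFn(beam.DoFn):
    """Parse 'index,event' CSV lines into dicts keyed by the schema fields."""
    def process(self, element):
        fields = ["index", "event"]
        # A malformed line such as 'this is a bad row' yields {'index': 'this is a bad row'},
        # which doesn't match the INTEGER/STRING schema and ends up in FAILED_ROWS.
        yield dict(zip(fields, element.split(",")))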
Then, what I do is name the write result (events in this case):
events = (p
    | "Create data" >> beam.Create(data)
    | "CSV to dict" >> beam.ParDo(CsvToDictFn())
    | "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
        "{0}:dataflow_test.good_lines".format(PROJECT),
        schema=schema,
    )
)
and then access the FAILED_ROWS side output:
(events[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
    | "Bad lines" >> beam.io.textio.WriteToText("error_log.txt"))
This works well with the DirectRunner: the good lines are written to BigQuery and the bad one to a local file:
$ cat error_log.txt-00000-of-00001
('PROJECT_ID:dataflow_test.good_lines', {'index': 'this is a bad row'})
If you run it with the DataflowRunner you'll need some additional flags. If you face the TypeError: 'PDone' object has no attribute '__getitem__' error you'll need to add --experiments=use_beam_bq_sink to use the new BigQuery sink.
If you get a KeyError: 'FailedRows' it's because the new sink defaults to BigQuery load jobs for batch pipelines:
STREAMING_INSERTS, FILE_LOADS, or DEFAULT. An introduction on loading
data to BigQuery: https://cloud.google.com/bigquery/docs/loading-data.
DEFAULT will use STREAMING_INSERTS on Streaming pipelines and
FILE_LOADS on Batch pipelines.
You can override the behavior by specifying method='STREAMING_INSERTS' in WriteToBigQuery:
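For example, the write step from the pipeline above becomes:
| "Write to BigQuery" >> beam.io.gcp.bigquery.WriteToBigQuery(
    "{0}:dataflow_test.good_lines".format(PROJECT),
    schema=schema,
    # Force streaming inserts so the FAILED_ROWS output is populated
    # even when the pipeline runs in batch mode.
    method="STREAMING_INSERTS",
)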
Full code for both DirectRunner and DataflowRunner here.

Related

Apache-beam hanging on groupbykey after windowing - not triggering

TLDR;
How to correctly trigger count windows with the Python SDK?
Problem
I'm trying to make a pipeline for transforming and indexing a Wikipedia dump.
The objective is:
Read from a compressed file - just one process and in a streaming fashion as the file doesn't fit in RAM
Process each element in parallel (ParDo)
Group these elements in a count window (GroupBy on just one key to go from streaming to batch) in just one process, to save them in a DB.
Development
For that, I created a simple source class that returns a tuple in the form (index,data, counting):
class CountingSource(beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, offset_range_tracker):
        # timestamp = datetime.now()
        k = 0
        with gzip.open(file_name, "rt", encoding="utf-8", errors="strict") as f:
            line = f.readline()
            while line:
                # Structure: index, page, index, page,...
                line = f.readline()
                yield line, f.readline(), k
                k += 1
And I made the pipeline:
_beam_pipeline_args = [
    "--runner=DirectRunner",
    "--streaming",
    # "--direct_num_workers=5",
    # "--direct_running_mode=multi_processing",
]

with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "With timestamps" >> beam.Map(lambda data: beam.window.TimestampedValue(data, data[-1]))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        # * not working, keeps stuck at group - not triggering the window
        | "window" >> beam.WindowInto(
            beam.window.GlobalWindows(),
            trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(10)),
            accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING,
        )
        | "Map to tuple" >> beam.Map(lambda data: (None, data))
        # | "Print" >> beam.Map(lambda data: print(data))
        | "Group all per window" >> beam.GroupByKey()
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )
If I remove the window and pass directly from "Filter nones" to "Index data", the pipeline works, but it indexes the elements individually. Also, if I uncomment the print step I can see that I still have data after the "Map to tuple" step, but it hangs on "Group all per window" without any log. I tried with timed triggering too, changing the window to
>> beam.WindowInto(
    beam.window.FixedWindows(10))
but this changed nothing (which I expected to behave the same way, since I create a "count timestamp" during data extraction).
Am I misunderstanding something about the windowing? The objective is simply to index the data in batches.
Alternative
I can "hack" this last step using a custom do.Fn like:
class BatchIndexing(beam.DoFn):
    def __init__(self, connection_string, batch_size=50000):
        self._connection_string = connection_string
        self._batch_size = batch_size
        self._total = 0

    def setup(self):
        from sqlalchemy import create_engine
        from sqlalchemy.orm import sessionmaker
        from scripts.wikipedia.wikipedia_articles.beam_module.documents import Base

        engine = create_engine(self._connection_string, echo=False)
        self.session = sessionmaker(bind=engine)(autocommit=False, autoflush=False)
        Base.metadata.create_all(engine)

    def start_bundle(self):
        # buffer for string of lines
        self._lines = []

    def process(self, element):
        # Input element is the processed pair
        self._lines.append(element)
        if len(self._lines) >= self._batch_size:
            self._total += len(self._lines)
            self._flush_batch()

    def finish_bundle(self):
        # takes care of the unflushed buffer before finishing
        if self._lines:
            self._flush_batch()

    def _flush_batch(self):
        self.index_data(self._lines)
        # Clear the buffer.
        self._lines = []

    def index_data(self, entries_to_index):
        """
        Index batch of data.
        """
        print(f"Indexed {self._total} entries")
        self.session.add_all(entries_to_index)
        self.session.commit()
and change the pipeline to:
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    pipeline = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Unroll" >> beam.FlatMap(lambda data: data)
        | "Index data" >> beam.ParDo(BatchIndexing(connection_string, batch_size=10000))
    )
Which "works" but do the last step in parallel (thus, overwhelming de database or generating locked database problems with sqlite) and I would like to have just one Sink to communicate with the database.
Triggering in Beam is not a hard requirement; my guess would be that the trigger does not manage to fire before the input ends. The early trigger of 10 elements means the runner is allowed to trigger after 10 elements, but it does not have to (this relates to how Beam splits inputs into bundles).
FixedWindows(10) is fixed on a 10-second interval and your data will all have the same timestamp, so that is not going to help either.
If your goal is to group data into batches, there is a very handy transform for that: GroupIntoBatches, which should work for your use case and has additional features such as limiting how long a record can wait in a batch before being processed.
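For reference, a rough sketch of how the alternative pipeline could look with GroupIntoBatches in place of the BatchIndexing DoFn (reusing the names from your question; untested):
with beam.Pipeline(options=PipelineOptions(_beam_pipeline_args)) as pipeline:
    _ = (
        pipeline
        | "Read dump" >> beam.io.Read(CountingSource(dump_path))
        | "Drop timestamp" >> beam.Map(lambda data: (data[0], data[1]))
        | "Process element" >> beam.ParDo(ProcessPage())
        | "Filter nones" >> beam.Filter(lambda data: data != [])
        | "Unroll" >> beam.FlatMap(lambda data: data)
        # Key everything with None so all records reach the same batcher.
        | "Key by None" >> beam.Map(lambda data: (None, data))
        # Emit (None, [up to 10000 records]) batches; a max buffering duration
        # can also be set to bound how long a partial batch waits.
        | "Batch" >> beam.GroupIntoBatches(10000)
        | "Discard key" >> beam.Values()
        | "Index data" >> beam.Map(index_data)
    )
Because every record shares the single None key, the batches are formed in one place and handed to index_data one list at a time, rather than one write per element.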

How to forward the value of a variable created in the script in Nextflow to a value output channel?

I have a process that generates a value. I want to forward this value into a value output channel, but I cannot seem to get it working in one "go" - I always have to write a file to the output and then define a new channel from the first:
process calculate {
    input:
    file div from json_ch.collect()
    path "metadata.csv" from meta_ch

    output:
    file "dir/file.txt" into inter_ch

    script:
    """
    echo ${div} > alljsons.txt
    mkdir dir
    python3 $baseDir/scripts/calculate.py alljsons.txt metadata.csv dir/
    """
}
ch = inter_ch.map{file(it).text}
ch.view()
How do I fix this?
Thanks!
Best, t.
If your script performs a non-trivial calculation, writing the result to a file like you've done is absolutely fine - there's nothing really wrong with this approach. However, since the 'inter_ch' channel already emits files (or paths), you could simply use:
ch = inter_ch.map { it.text }
It's not entirely clear what the objective is here. If the desire is to reduce the number of channels created, consider instead switching to the new DSL 2. This won't let you avoid writing your calculated result to a file, but it might mean you can avoid an intermediary channel.
On the other hand, if your Python script actually does something rather trivial and can be refactored away, it might be possible to assign a (global) variable (below the script: keyword) such that it can be referenced in your output declaration, like the line x = ... in the example below:
Valid output values are value literals, input value identifiers, variables accessible in the process scope and value expressions. For example:
process foo {
    input:
    file fasta from 'dummy'

    output:
    val x into var_channel
    val 'BB11' into str_channel
    val "${fasta.baseName}.out" into exp_channel

    script:
    x = fasta.name
    """
    cat $x > file
    """
}
Other than that, your options are limited. You might have considered using the env output qualifier, but this just adds some syntactic sugar to your shell script at runtime, such that an output file is still created:
Contents of test.nf:
process test {
    output:
    env myval into out_ch

    script:
    '''
    myval=$(calc.py)
    '''
}
out_ch.view()
Contents of bin/calc.py (chmod +x):
#!/usr/bin/env python
print('foobarbaz')
Run with:
$ nextflow run test.nf
N E X T F L O W ~ version 21.04.3
Launching `test.nf` [magical_bassi] - revision: ba61633d9d
executor > local (1)
[bf/48815a] process > test [100%] 1 of 1 ✔
foobarbaz
$ cat work/bf/48815aeefecdac110ef464928f0471/.command.sh
#!/bin/bash -ue
myval=$(calc.py)
# capture process environment
set +u
echo myval=$myval > .command.env

Issues creating a virtual HANA table

I am trying to create a virtual table in HANA based on a remote system table view.
If I run it at the command line using hdbsql
hdbsql H00=> create virtual table HanaIndexTable at "SYSRDL#CG_SOURCE"."<NULL>"."dbo"."sysiqvindex"
0 rows affected (overall time 305.661 msec; server time 215.870 msec)
I am able to select from HanaIndexTable and get results and see my index.
When I code it in python, I use the following command:
cursor.execute("""create virtual table HanaIndexTable1 at SYSRDL#CG_source.\<NULL\>.dbo.sysiqvindex""")
I think there is a problem with the NULL, and I can see in the output that the escape character is doubled.
self = <hdbcli.dbapi.Cursor object at 0x7f02d61f43d0>
operation = 'create virtual table HanaIndexTable1 at SYSRDL#CG_source.\\<NULL\\>.dbo.sysiqvindex'
parameters = None

    def __execute(self, operation, parameters = None):
        # parameters is already checked as None or Tuple type.
>       ret = self.__cursor.execute(operation, parameters=parameters, scrollable=self._scrollable)
E       hdbcli.dbapi.ProgrammingError: (257, 'sql syntax error: incorrect syntax near "\\": line 1 col 58 (at pos 58)')

/usr/local/lib/python3.7/site-packages/hdbcli/dbapi.py:69: ProgrammingError
I have tried to run the command without the <> but get the following error.
hdbcli.dbapi.ProgrammingError: (257, 'sql syntax error: incorrect syntax near "NULL": line 1 col 58 (at pos 58)')
I have tried upper case, lower case and escaping. Is what I am trying to do impossible?
There was an issue with capitalization between HANA and my remote source. I also needed more escaping rather than less.
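The exact statement is not shown, but a sketch based on the hdbsql command that did work would double-quote every identifier (including the literal <NULL> schema) and match the remote source's capitalization, something like:
# Hypothetical reconstruction: quote the identifiers instead of backslash-escaping
# the angle brackets, and use the capitalization the remote source expects.
cursor.execute(
    'create virtual table HanaIndexTable1 '
    'at "SYSRDL#CG_SOURCE"."<NULL>"."dbo"."sysiqvindex"'
)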

MafftCommandline and io.StringIO

I've been trying to use the Mafft alignment tool from Bio.Align.Applications. Currently, I've had success writing my sequence information out to temporary text files that are then read by MafftCommandline(). However, I'd like to avoid redundant steps as much as possible, so I've been trying to write to a memory file instead using io.StringIO(). This is where I've been having problems. I can't get MafftCommandline() to read internal files made by io.StringIO(). I've confirmed that the internal files are compatible with functions such as AlignIO.read(). The following is my test code:
from Bio.Align.Applications import MafftCommandline
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import io
from Bio import AlignIO
sequences1 = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences1)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences1] #padded sequences used to test compatibilty with AlignIO
ioSeq = ''
for items in padded_sequences:
ioSeq += '>unknown\n'
ioSeq += items + '\n'
newC = io.StringIO(ioSeq)
cLoc = str(newC).strip()
cLocEdit = cLoc[:len(cLoc)] #create string to remove < and >
test1Handle = AlignIO.read(newC, "fasta")
#test1HandleString = AlignIO.read(cLocEdit, "fasta") #fails to interpret cLocEdit string
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
test1Handle1 = AlignIO.read("msa_example.fasta", "fasta") #alignIO same for both #demonstrates working AlignIO
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
mafft_cline1 = MafftCommandline(mafft_exe, input=cLocEdit) #fails to read string (same as AlignIO)
mafft_cline2 = MafftCommandline(mafft_exe, input=newC)
stdout, stderr = mafft_cline()
print(stdout) #corresponds to MafftCommandline with input file
stdout1, stderr1 = mafft_cline1()
print(stdout1) #corresponds to MafftCommandline with internal file
I get the following error messages:
ApplicationError: Non-zero return code 2 from '/usr/local/bin/mafft <_io.StringIO object at 0x10f439798>', message "/bin/sh: -c: line 0: syntax error near unexpected token `newline'"
I believe this results from the angle brackets ('<' and '>') present in the file path.
ApplicationError: Non-zero return code 1 from '/usr/local/bin/mafft "_io.StringIO object at 0x10f439af8"', message '/usr/local/bin/mafft: Cannot open _io.StringIO object at 0x10f439af8.'
Attempting to remove the arrows by converting the file path to a string and indexing resulted in the above error.
Ultimately my goal is to reduce computation time. I hope to accomplish this by keeping the data in memory instead of writing it out to a separate text file. Any advice or feedback regarding my goal is much appreciated. Thanks in advance.
I can't get MafftCommandline() to read internal files made by io.StringIO().
This is not surprising, for a couple of reasons:
As you're aware, Biopython doesn't implement Mafft; it simply provides a convenient interface to set up a call to mafft in /usr/local/bin. The mafft executable runs as a separate process that does not have access to your Python program's internal memory, including your StringIO file.
The mafft program only works with an input file; it doesn't even allow stdin as a data source. (Though it does allow stdout as a data sink.) So ultimately, there must be a file in the file system for mafft to open. Thus the need for your temporary file.
Perhaps tempfile.NamedTemporaryFile() or tempfile.mkstemp() might be a reasonable compromise.
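A minimal sketch of that compromise, reusing the ioSeq string and mafft_exe path from your question (untested):
import os
import tempfile

from Bio.Align.Applications import MafftCommandline

# Write the in-memory FASTA text to a named temporary file so that mafft
# (a separate process) can open it by path.
with tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False) as tmp:
    tmp.write(ioSeq)
    tmp_path = tmp.name

try:
    mafft_cline = MafftCommandline(mafft_exe, input=tmp_path)
    stdout, stderr = mafft_cline()
    print(stdout)
finally:
    os.remove(tmp_path)  # clean up the temporary file
The data still touches the file system, but only through a short-lived temporary file that Python creates and removes for you.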

Getting error when passing parameters from Where block in Groovy-Spock code

I have written the following code for my application.
def "Test for file type #FileFormat"() {
given:
HttpURLConnection connection = getHandlerURL('endpoint')
connection.setDoOutput(true)
connection.setRequestMethod("POST")
connection.setRequestProperty(HttpHeaders.CONTENT_TYPE, "RdfFormat."+RDFFileFormat+".toMIME()")
rdfStatement = ModelFactory.createDefaultModel().read(new ByteArrayInputStream(readRDFfromfile(Filename).bytes), null, "RdfFormat."+RDFFileFormat.toString()).listStatements().nextStatement()
when:
connection.getOutputStream().write(readRDFfromfile(Filename).bytes)
then:
connection.getResponseCode() == HTTP_CREATED
where:
FileFormat | Filename | RDFFileFormat
'N-TRIPLES' | 'n-triples.nt' | "NTRIPLES"
}
When I run my code I get the error SampleTest.Test for file type #FileFormat:37 » Riot on the last line of the given: clause.
The test passes if I use RdfFormat.NTRIPLES.toString() instead of the RDFFileFormat parameter passed from the where: clause.
I tried assigning def format1 = "RdfFormat."+RDFFileFormat+".toString()" and using format1, but got the same error.
Is there any way I can make it work?
I think you probably want:
connection.setRequestProperty(HttpHeaders.CONTENT_TYPE, RdfFormat."$RDFFileFormat".toMIME())
