Snowflake PUT Command status in Python - python-3.x

I am using the Snowflake PUT command from Python to move files from my local system to Snowflake staging.
I have 400 files (40 MB each), so I am using a command like -> put file:///Path/file_name*
It is working and loading all the files, but it takes around 30 minutes.
I want to know the progress so that I can be sure it is progressing. Is there a way to print a log line after each file is loaded (file 1 is moved to staging, file 2 is moved to staging, etc.)?

Is there a way to print logs after each file is loaded?
While the statement execution is non-interactive when driven from a library, the Snowflake Python connector does support logging its execution work.
Here's a shortened snippet that incorporates the example from the link above:
# Assumes a 'con' object pre-exists and is connected to Snowflake already
import logging
for logger_name in ['snowflake.connector', 'botocore', 'boto3']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(funcName)s() - %(message)s'))
    logger.addHandler(ch)
con.cursor().execute("put file:///Path/file_name* @stage_name")
# Optional, custom app log:
# logging.info("put command completed execution, exiting")
con.close()
Watching the output (to stderr) while this program runs will yield the following (filtered for just upload messages):
~> python3 your_logging_script.py 2>&1 | grep -F "upload_one_file()"
[…]
2020-06-24 04:57:06,495 - upload_one_file() - done: status=ResultStatus.UPLOADED, file=/Path/file_name1, (…)
2020-06-24 04:57:07,312 - upload_one_file() - done: status=ResultStatus.UPLOADED, file=/Path/file_name2, (…)
2020-06-24 04:57:09,121 - upload_one_file() - done: status=ResultStatus.UPLOADED, file=/Path/file_name3, (…)
[…]
You can also configure the Python logger to use a file and tail that file, instead of relying on the stderr output (from logging.StreamHandler) used for simplicity above.
If you need to filter the logging down to only specific messages, Python's logging module supports attaching your own filters, which decide on each record emitted. The following filters for just the upload_one_file() function call messages (use the record.message field to filter on the log message instead of on the function name used in the example below):
class UploadFilter(logging.Filter):
    def filter(self, record):
        # Only tests one condition, but you could chain conditions here
        return "upload_one_file" in record.funcName.lower()

for logger_name in ['snowflake.connector', 'botocore', 'boto3']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(funcName)s() - %(message)s'))
    ch.addFilter(UploadFilter())
    # ch.addFilter(AnyOtherFilterClass())
    logger.addHandler(ch)
Note: If you are changing your handlers (stream to file), ensure you add the filter to the actual new handler too, and the handler to the logger. You can read the tutorial on logging in Python to understand its mechanism better.
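For example, here is a minimal sketch of the stream-to-file variant with the filter re-attached; the file name snowflake_put.log is an arbitrary choice:
import logging

class UploadFilter(logging.Filter):
    def filter(self, record):
        # Keep only the per-file upload completion messages
        return "upload_one_file" in record.funcName.lower()

fh = logging.FileHandler('snowflake_put.log')  # arbitrary log file name
fh.setLevel(logging.DEBUG)
fh.setFormatter(logging.Formatter('%(asctime)s - %(funcName)s() - %(message)s'))
fh.addFilter(UploadFilter())  # the filter must go on the new handler as well

for logger_name in ['snowflake.connector', 'botocore', 'boto3']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(fh)
Running tail -f snowflake_put.log in another terminal then prints a line as each file finishes uploading.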

Related

How can I pass and receive information dynamically within a subprocess?

I'm developing a Python code that can run two applications and exchange information between them during their run time.
The basic scheme is something like:
start a subprocess with the 1st application
start a subprocess with the 2nd application
1st application performs some calculation, writes a file A, and waits for input
2nd application reads file A, performs some calculation, writes a file B, and waits for input
1st application reads file B, performs some calculation, writes a file C, and waits for input
...and so on until some condition is met
I know how to start one Python subprocess, and now I'm learning how to pass/receive information during run time.
I'm testing my Python code using a super-simple application that just reads a file, makes a plot, closes the plot, and returns 0.
I was able to pass an input to a subprocess using subprocess.communicate(), and I could tell that the subprocess used that information (the plot opens and closes), but here the problems started.
I can only send an input string once. After the first subprocess.communicate() call in my code below, the subprocess just hangs there. I suspect I might have to use subprocess.stdin.write() instead, since I read that subprocess.communicate() waits for the end of the process, while I want to send different inputs multiple times during the application run. But I also read that the use of stdin.write() and stdout.read() is discouraged. I tried this second alternative (see #alternative in the code below), but in this case the application doesn't seem to receive the inputs, i.e. it doesn't do anything and the code ends.
Debugging is complicated because I haven't found a neat way to output what the subprocess is receiving as input and giving as output. (I tried to implement the solutions described here, but I must have done something wrong: Python: How to read stdout of subprocess in a nonblocking way, A non-blocking read on a subprocess.PIPE in Python)
Here is my working example. Any help is appreciated!
import os
import subprocess
from subprocess import PIPE
# Set application name
app_folder = 'my_folder_path'
full_name_app = os.path.join(app_folder, 'test_subprocess.exe')
# Start process
out_app = subprocess.Popen([full_name_app], stdin=PIPE, stdout=PIPE)
# Pass argument to process
N = 5
for n in range(N):
str_to_communicate = f'{{\'test_{n+1}.mat\', {{\'t\', \'y\'}}}}' # funny looking string - but this how it needs to be passed
bytes_to_communicate = str_to_communicate.encode()
output_communication = out_app.communicate(bytes_to_communicate)
# output_communication = out_app.stdin.write(bytes_to_communicate) # alternative
print(f'Communication command #{n+1} sent')
# Terminate process
out_app.terminate()
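For reference, here is a minimal sketch of the stdin.write() alternative mentioned above. It assumes (purely as an assumption about test_subprocess.exe) that the application reads one line per request and answers with one line on stdout; the key points are appending a newline and flushing after every write:
import os
import subprocess
from subprocess import PIPE

app_folder = 'my_folder_path'
full_name_app = os.path.join(app_folder, 'test_subprocess.exe')

# text=True and bufsize=1 give line-buffered text pipes
proc = subprocess.Popen([full_name_app], stdin=PIPE, stdout=PIPE, text=True, bufsize=1)

N = 5
for n in range(N):
    request = f'{{\'test_{n+1}.mat\', {{\'t\', \'y\'}}}}'
    proc.stdin.write(request + '\n')  # newline so the application sees a complete line
    proc.stdin.flush()                # without flush, data may sit in the pipe buffer
    reply = proc.stdout.readline()    # blocks until the application writes one line back
    print(f'Reply #{n+1}: {reply.strip()}')

proc.stdin.close()  # signal end of input, then wait for the process to exit
proc.wait()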

Writing unique parquet file per windows with Apache Beam Python

I am trying to stream messages from a Kafka consumer to Google Cloud Storage with 30-second windows using Apache Beam. I used beam_nuggets.io to read from a Kafka topic. However, I wasn't able to write a unique parquet file to GCS for each window.
You can see my code below:
import apache_beam as beam
from apache_beam.transforms.trigger import AfterAny, AfterCount, AfterProcessingTime, AfterWatermark, Repeatedly
from apache_beam.portability.api.beam_runner_api_pb2 import AccumulationMode
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
import json
from datetime import datetime
import pandas as pd
import config as conf
import apache_beam.transforms.window as window
consumer_config = {"topic": "Uswrite",
"bootstrap_servers": "*.*.*.*:9092",
"group_id": "notification_consumer_group_33"}
folder_name = datetime.now().strftime('%Y-%m-%d')
def format_result(consume_message):
data = json.loads(consume_message[1])
file_name = datetime.now().strftime("%Y_%m_%d-%I_%M_%S")
df = pd.DataFrame(data).T #, orient='index'
df.to_parquet(f'gs://{conf.gcs}/{folder_name}/{file_name}.parquet',
storage_options={"token": "gcp.json"}, engine='fastparquet')
print(consume_message)
with beam.Pipeline(options=PipelineOptions()) as p:
consumer_message = (p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
| 'Windowing' >> beam.WindowInto(window.FixedWindows(30),
trigger=AfterProcessingTime(30),
allowed_lateness=900,
accumulation_mode=AccumulationMode.ACCUMULATING)
| 'CombineGlobally' >> beam.Map(format_result))
# window.FixedWindows(30),trigger=beam.transforms.trigger.AfterProcessingTime(30),
# accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING
# allowed_lateness=100,CombineGlobally(format_result).without_defaults() allowed_lateness=30,
Using the code above, a new parquet file is generated for each message. What I would like to do is group messages into 30-second windows and generate one parquet file per window.
I tried different configurations below with no success:
beam.CombineGlobally(format_result).without_defaults()) instead of beam.Map(format_result))
beam.ParDo(format_result))
In addition, I have a few more questions:
Even though I set the offset with "auto.offset.reset": "earliest", the Kafka consumer starts reading from the last message even if I change the consumer group, and I can't figure out why.
Also, I am puzzled by the usage of trigger, allowed_lateness, and accumulation_mode. I am not sure if I need them for this task. As you can see in the code block above, I also tried using these parameters, but it didn't help.
I searched everywhere but couldn't find a single example that explains this use case.
Here are some changes you should make to your pipeline to get this result:
Remove your trigger if you want a single output per window. Triggers are only needed for getting multiple results per window.
Add a GroupByKey or Combine operation to aggregate the elements. Without such an operation, the windowing has no effect.
I recommend using parquetio from the Beam project itself to ensure you get scalable exactly-once behavior (see the pydoc for the 2.33.0 release); a sketch of these changes follows below.
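A minimal, untested sketch of those changes, keeping the beam_nuggets KafkaConsume source and the pandas/fastparquet writer from the question (the bucket name, token file, and the constant None key are placeholder assumptions; apache_beam.io.parquetio.WriteToParquet would be the more scalable, exactly-once option):
import json
from datetime import datetime

import apache_beam as beam
import apache_beam.transforms.window as window
import pandas as pd
from beam_nuggets.io import kafkaio

consumer_config = {"topic": "Uswrite",  # same settings as in the question
                   "bootstrap_servers": "*.*.*.*:9092",
                   "group_id": "notification_consumer_group_33"}

def write_window_to_parquet(keyed_messages):
    # keyed_messages is (key, iterable of Kafka (key, value) tuples), emitted by
    # GroupByKey once per 30-second window
    _, messages = keyed_messages
    records = [json.loads(value) for _, value in messages]
    file_name = datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    pd.DataFrame(records).to_parquet(
        f'gs://your-bucket/{file_name}.parquet',  # placeholder bucket
        storage_options={"token": "gcp.json"}, engine='fastparquet')

with beam.Pipeline() as p:
    _ = (p
         | 'Read from Kafka' >> kafkaio.KafkaConsume(consumer_config=consumer_config)
         | '30s windows' >> beam.WindowInto(window.FixedWindows(30))  # no trigger: one pane per window
         | 'Constant key' >> beam.Map(lambda msg: (None, msg))
         | 'Group per window' >> beam.GroupByKey()
         | 'One file per window' >> beam.Map(write_window_to_parquet))
The GroupByKey is what makes the window take effect: each window's worth of messages arrives as one grouped element, so exactly one parquet file is written per window.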
I took a look at the GroupByKey example in the Python documentation. Messages I read from KafkaConsumer (I used kafkaio from beam_nuggets.io) have a type of tuple, and in order to use GroupByKey, I tried to create a list in the convert_to_list function by appending the tuples I got from the Kafka consumer. However, GroupByKey still produces no output.
import apache_beam as beam
from beam_nuggets.io import kafkaio
new_list = []
def convert_to_list(consume_message):
    new_list.append(consume_message)
    return new_list

with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | "Reading messages from Kafka" >> kafkaio.KafkaConsume(consumer_config=consumer_config)
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'consume message added list' >> beam.ParDo(convert_to_list)
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I also tried a similar pipeline, but this time I created a list of tuples with beam.Create() instead of reading from Kafka, and it works successfully. You can view this pipeline below:
import apache_beam as beam
from beam_nuggets.io import kafkaio
with beam.Pipeline() as pipeline:
    dofn_params = (
        pipeline
        | 'Created Pipeline' >> beam.Create([(None, '{"userId": "921", "xx": "123"}'),
                                             (None, '{"userId": "92111", "yy": "123"}')])
        | 'Fixed 30sec windows' >> beam.WindowInto(beam.window.FixedWindows(30))
        | 'GroupBykey' >> beam.GroupByKey()
        | 'print' >> beam.Map(print))
I assume the issue in the first approach is related to generating an external list instead of a PCollection, but I am not sure. Can you guide me on how to proceed?
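For reference, a DoFn contributes to its output PCollection by yielding elements (or returning an iterable), not by mutating a module-level list; a minimal sketch of a keying DoFn that GroupByKey can consume (the constant None key is an arbitrary choice):
import apache_beam as beam

class KeyMessage(beam.DoFn):
    def process(self, consume_message):
        # Each yielded value becomes an element of the output PCollection;
        # keying everything identically lets GroupByKey collect a whole window.
        yield (None, consume_message)

# usage in place of the convert_to_list step above:
#   | 'Key messages' >> beam.ParDo(KeyMessage())
#   | 'GroupBykey' >> beam.GroupByKey()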
Another thing I tried is to use the ReadFromKafka function from the apache_beam.io.kafka module. But this time I got the following error:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'user_directory/.apache_beam/cache/jars\\beam-sdks-java-io-expansion-service-2.33.0.jar', '59627']
Java version 11.0.12 is installed on my computer and the 'java' command is available.

Python logging; [1] cannot set log file dirpath; and [2] datetime formatting problems

I am trying to learn how to use the logging module.
I want to log information to both console and to file.
I confess that I have not completed studying both https://docs.python.org/3/library/logging.html#logging.basicConfig and https://docs.python.org/3/howto/logging.html
It's a little daunting for a novice like me to learn all of it, but I am working on it.
I am trying to use a modified version of the “Logging to multiple destinations” program from https://docs.python.org/3/howto/logging-cookbook.html, to which I refer as “Cookbook_Code”.
The Cookbook_Code appears at that URL under the title "Logging to multiple destinations".
But I have two problems:
The Cookbook Code saves to a file named:
"E:\Zmani\Logging\Logging_to_multiple_destinations_python.org_aaa.py.txt",
and I cannot figure out:
A. Why the Cookbook Code does that, nor
B. How to make the logging module save instead to the following filepath (which I stored in a var, "logfile_fullname"): "e:\zmani\Logging\2020-10-14_14_14_os.walk_script.log"
I cannot figure out how to have the log file use the following datetime format:
"YYYY-MM-DD_HH-MM-SS - INFO: Sample info."
instead of the following datetime format: "10/14/2020 03:00:22 PM - INFO: Sample info."
I would like the console output to include the same datetime prefix:
"YYYY-MM-DD_HH-MM-SS -"
Any suggestions would be much appreciated.
Thank you,
Marc
Here’s the code I have been running:
log_file_fullname = "e:\zmani\Logging\2020-10-14_14_14_os.walk_script.log"
# https://docs.python.org/3/howto/logging-cookbook.html#logging-cookbook
import logging
# set up logging to file - see previous section for more details
logging.basicConfig(level=logging.DEBUG,
format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
datefmt='%Y-%m-%d_%H-%M-%S',
filename=log_file_fullname,
filemode='w')
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)
logging.info('Sample info.')
logger1 = logging.getLogger('myapp.area1')
logger2 = logging.getLogger('myapp.area2')
logger1.debug('Quick zephyrs blow, vexing daft Jim.')
logger1.info('How quickly daft jumping zebras vex.')
logger2.warning('Jail zesty vixen who grabbed pay from quack.')
logger2.error('The five boxing wizards jump quickly.')
A quick run of your code shows that it already handles 1.B and 2 of your problems.
Neither of the URLs you provided shows where Logging_to_multiple_destinations_python.org_aaa.py.txt is used. It doesn't matter anyway: it is just a path to a text file, provided that its parent folders exist. So 1.A is merely a demonstration.
If you add %(asctime)s to the console's formatter, that will give you 3:
formatter = logging.Formatter('%(asctime)s %(name)-12s: %(levelname)-8s %(message)s', datefmt='%Y-%m-%d_%H-%M-%S')
You should only use basicConfig if you don't need to add any loggers or do any complicated setup.
basicConfig only has an effect if there are no handlers on the root logger. If you use the filename argument, it creates a FileHandler; if you use the stream argument, it creates a StreamHandler. You cannot use both arguments at once.
Since you need to output to both a file and the console, just create a handler for each of them.
import logging
log_file_fullname = "2020-10-14_14_14_os.walk_script.log"
# config file handler
file_handler = logging.FileHandler(log_file_fullname)
file_handler.setLevel(logging.DEBUG)
fmt_1 = logging.Formatter('%(asctime)s %(name)-12s: %(levelname)-8s %(message)s',
                          datefmt='%Y-%m-%d_%H-%M-%S')
file_handler.setFormatter(fmt_1)
# config console handler
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)
fmt_2 = fmt_1 # you could add console only formatter here
console_handler.setFormatter(fmt_2)
# retrieve a root logger and add handlers
root_logger = logging.getLogger()
root_logger.setLevel(logging.NOTSET) # change from default WARNING to NOTSET,
root_logger.addHandler(file_handler) # this allows root_logger to take all propagated
root_logger.addHandler(console_handler) # messages from other child loggers.
# this line use root logger by default
logging.info('Sample info.')
# create module loggers
logger1 = logging.getLogger('myapp.area1')
logger2 = logging.getLogger('myapp.area2')
# logging by module loggers
logger1.debug('Quick zephyrs blow, vexing daft Jim.')
logger1.info('How quickly daft jumping zebras vex.')
logger2.warning('Jail zesty vixen who grabbed pay from quack.')
logger2.error('The five boxing wizards jump quickly.')

Handling logs and writing to a file in python?

I have a module named acms, and inside it there are a number of Python files. The main.py calls the other Python files. I have added logging in those files, and the logs are displayed on the console, but I also want to write these logs to a file called all.log. I tried setting log levels and a logger in a file called log.py, but I didn't get the expected format. Since I am new to Python, I am having difficulty handling logs.
Use the logging module and use logger = logging.getLogger(__name__). Then it will use the correct logger with the options that you have set up.
See the thinkpad-scripts project for its logging. Also the logging cookbook has a section for logging to multiple locations.
We use the following to log to the console and the syslog:
import logging
import logging.handlers
import os

syslog_format = '%(name)s: %(message)s'  # placeholder; the project defines its own format string

kwargs = {}
dev_log = '/dev/log'
if os.path.exists(dev_log):
    kwargs['address'] = dev_log
syslog = logging.handlers.SysLogHandler(**kwargs)
syslog.setLevel(logging.DEBUG)
formatter = logging.Formatter(syslog_format)
syslog.setFormatter(formatter)
logging.getLogger('').addHandler(syslog)
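To also get every module's logs into all.log, as asked above, a minimal sketch in the same style (the file location and format string here are assumptions):
import logging

root = logging.getLogger('')  # the root logger receives records propagated from module loggers
root.setLevel(logging.DEBUG)

file_handler = logging.FileHandler('all.log')
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s'))
root.addHandler(file_handler)

# in each module of the acms package:
logger = logging.getLogger(__name__)
logger.info('this ends up in all.log and on any other root handlers')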

Writing/Reading stream of output from external command in realtime

I was asked to run a Python script in a Jenkins job, which calls external commands via the subprocess package. In the Jenkins console, the output of the external commands should be printed in realtime and/or written to a log file.
I found an SO post about printing in realtime, something like this:
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=1)
while True:
    line = p.stdout.readline()
    print(line)
    if not line and p.poll() is not None:
        break
So I created a function for executing external commands:
def execute_cmd(command, splitlines=True, timeout=None, output_stream=None):
    pipe = Popen(command, stdout=PIPE, stderr=PIPE)
    utf8_output = []
    while True:
        line = pipe.stdout.readline().decode('utf-8')  # decode so the joined result and any text stream receive str
        utf8_output.append(line)
        # Add to stream here
        if output_stream:
            output_stream.write(line)
        if not line and pipe.poll() is not None:
            break
    return ''.join(utf8_output)
As you see, I didn't use print. That's because a requirement tells me to use a stream object, which could either stream to a file or output everything to the Jenkins console in realtime.
So if I want to print the external output to the Jenkins console, I wanted to do something like this in my job:
from io import TextIOBase
my_output_stream = TextIOBase()
func1(output_stream=my_output_stream)
Where func1 is a function that calls an external command via the execute_cmd method.
In my understanding this should write the output of the external command to my_output_stream.
Now my question is, how can I output the data written to that stream in realtime? Don't I need some kind of asynchronous execution? Because I can't just add a loop that reads lines of the stream after the call of func1, as this would be executed after the function has finished, not in realtime.
Sorry for the kind of weird description of my problem, if something is still unclear, please comment and I will update my question with further explanation.
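For what it's worth, since execute_cmd() writes each line inside its read loop, any writable text stream that flushes promptly will show output as it arrives; a bare io.TextIOBase instance doesn't implement write(), so it can't be used directly. A minimal sketch (the command and file name are placeholders, and it assumes the decoded-line version of execute_cmd above):
import sys

# 1) Straight to the Jenkins console: sys.stdout is already a real-time text stream
execute_cmd(['some_command'], output_stream=sys.stdout)

# 2) To a log file, line-buffered so `tail -f` keeps up while the command runs
with open('external_cmd.log', 'w', buffering=1) as log_stream:
    execute_cmd(['some_command'], output_stream=log_stream)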
