Writing to a file on disk using Airflow is not working - python-3.x

I am using a Windows machine and have created a container for Airflow.
I am able to read data on the local filesystem through a DAG, but I am unable to write data to a file. I have also tried giving the full path, and tried different operators (Python and Bash), but it still doesn't work.
The DAG succeeds; there aren't any failures to show.
Note: /opt/airflow is the $AIRFLOW_HOME path.
What may be the reason?
A snippet of the code:
from airflow import DAG
from datetime import datetime
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def pre_process():
    f = open("/opt/airflow/write.txt", "w")
    f.write("world")
    f.close()

with DAG(dag_id="test_data", start_date=datetime(2021, 11, 24), schedule_interval='@daily') as dag:
    check_file = BashOperator(
        task_id="check_file",
        bash_command="echo Hi > /opt/airflow/hi.txt "
    )
    pre_processing = PythonOperator(
        task_id="pre_process",
        python_callable=pre_process
    )
    check_file >> pre_processing

The file most likely is written, but inside the container that is running Airflow.
You need to understand how containers work. They provide isolation, but this also means that unless you set up some data sharing, whatever you create in the container stays in the container, and you do not see it outside of it (that's what container isolation is all about).
You can usually enter the container via the docker exec command (https://docs.docker.com/engine/reference/commandline/exec/), or you can, for example, mount a folder from your host into your container and write your files there (as far as I know, some folders are mounted for you by default on Windows, but you need to check the Docker documentation for that).

In your pre_process code, add os.chdir('your/path') before writing your data to a file.
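A minimal sketch of that suggestion, assuming the target directory (here /opt/airflow, the $AIRFLOW_HOME from the question) is writable inside the container:

import os

def pre_process():
    # change the working directory to a location assumed to be writable inside the container
    os.chdir("/opt/airflow")
    # the relative path now resolves against the new working directory
    with open("write.txt", "w") as f:
        f.write("world")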

Related

Airflow on Docker: Can't Write to Volume (Permission Denied)

Goal
I'm trying to run a simple DAG which creates a pandas DataFrame and writes to a file. The DAG is being run in a Docker container with Airflow, and the file is being written to a named volume.
Problem
When I start the container, I get the error:
Broken DAG: [/usr/local/airflow/dags/simple_datatest.py] [Errno 13] Permission denied: '/usr/local/airflow/data/local_data_input.csv'
Question
Why am I getting this error? And how can I fix this so that it writes properly?
Context
I am loosely following a tutorial here, but I've modified the DAG. I'm using the puckel/docker-airflow image from Docker Hub. I've attached a volume pointing to the appropriate DAG, and I've created another volume to contain the data written within the DAG (created by running docker volume create airflow-data).
The run command is:
docker run -d -p 8080:8080 \
-v /path/to/local/airflow/dags:/usr/local/airflow/dags \
-v airflow-data:/usr/local/airflow/data:Z \
puckel/docker-airflow \
webserver
The DAG located at the /usr/local/airflow/dags path on the container is defined as follows:
import airflow
from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta
import pandas as pd

# Following are defaults which can be overridden later on
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': datetime(2021, 12, 31),
    'email': ['me@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('datafile', default_args=default_args)

def task_make_local_dataset():
    print("task_make_local_dataset")
    local_data_create = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')

t1 = BashOperator(
    task_id='write_local_dataset',
    python_callable=task_make_local_dataset(),
    bash_command='python3 ~/airflow/dags/datatest.py',
    dag=dag)
The error in the DAG appears to be in the line
local_data_create.to_csv('/usr/local/airflow/data/local_data_input.csv')
I don't have permission to write to this location.
Attempts
I've tried changing the location of the data directory on the container, but Airflow can't access it. Do I have to change permissions? It seems that this is a really simple thing that most people would want to be able to do: write to a container. I'm guessing I'm just missing something.
Don't use the Puckel Docker image. It hasn't been maintained for years, and Airflow 1.10 reached end of life in June 2021. You should only look at Airflow 2, and Airflow has an official reference image you can use instead.
Airflow 2 also has Quick-Start guides based on the official image and Docker Compose: https://airflow.apache.org/docs/apache-airflow/stable/start/index.html
It also has a Helm Chart that can be used to productionize your setup: https://airflow.apache.org/docs/helm-chart/stable/index.html
Don't waste your (and others') time on Puckel and Airflow 1.10.
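For comparison, on an Airflow 2 image the writing task from the question could be expressed roughly like this. This is only a sketch: the /usr/local/airflow/data path is carried over from the question and still has to be a volume the Airflow user can write to.

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def make_local_dataset():
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    # the target directory must be a mounted volume the airflow user can write to
    df.to_csv('/usr/local/airflow/data/local_data_input.csv')

with DAG(dag_id='datafile', start_date=datetime(2021, 12, 31), schedule_interval=None) as dag:
    write_local_dataset = PythonOperator(
        task_id='write_local_dataset',
        python_callable=make_local_dataset,  # pass the function itself, do not call it here
    )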

How to get logs from inside the container executed using DockerOperator?(Airflow)

I'm facing logging issues with DockerOperator.
I'm running a Python script inside a Docker container using DockerOperator, and I need Airflow to spit out the logs from the script running inside the container. Airflow is marking the job as a success, but the script inside the container is failing, and I have no clue what is going on because I cannot see the logs properly. Is there a way to set up logging for DockerOperator apart from setting the tty option to True, as suggested in the docs?
It looks like you can have logs pushed to XCom, but it's off by default. First, you need to pass xcom_push=True for it to start sending at least the last line of output to XCom. Additionally, you can pass xcom_all=True to send all output to XCom, not just the last line.
Perhaps not the most convenient place to put debug information, but it's fairly accessible in the UI: either in the XCom tab when you click into a task, or on the page where you can list and filter XComs (under Browse).
Source: https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L112-L117 and https://github.com/apache/airflow/blob/1.10.10/airflow/operators/docker_operator.py#L248-L250
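A minimal sketch of that setup for Airflow 1.10 (the image name and command are placeholders, and the operator's import path differs in Airflow 2):

from airflow.operators.docker_operator import DockerOperator  # Airflow 1.10 import path

run_script = DockerOperator(
    task_id='run_script',
    image='__container__:latest',  # placeholder image name
    command='python test.py',
    xcom_push=True,   # push stdout to XCom (by default only the last line)
    xcom_all=True,    # push all lines of stdout, not just the last one
    dag=dag,          # assumes an existing `dag` object
)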
Instead of DockerOperator you can use client.containers.run and then do the following:
import docker
from airflow import DAG
from airflow.decorators import task

# default_args is assumed to be defined elsewhere
with DAG(dag_id='dag_1',
         default_args=default_args,
         schedule_interval=None,
         tags=['my_dags']) as dag:

    @task(task_id='task_1')
    def start_task(**kwargs):
        # get the docker params from the environment
        client = docker.from_env()
        # run the container
        response = client.containers.run(
            # The container you wish to call
            image='__container__:latest',
            # The command to run inside the container
            command="python test.py",
            version='auto',
            auto_remove=True,
            stdout=True,
            stderr=True,
            tty=True,
            detach=True,
            remove=True,
            ipc_mode='host',
            network_mode='bridge',
            # Passing the GPU access
            device_requests=[
                docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
            ],
            # Give the proper system volume mount point
            volumes=[
                'src:/src',
            ],
            working_dir='/src'
        )
        output = response.attach(stdout=True, stream=True, logs=True)
        for line in output:
            print(line.decode())
        return str(response)

    test = start_task()
Then in your test.py script (in the docker container) you have to do the logging using the standard Python logging module:
import logging
logger = logging.getLogger("airflow.task")
logger.info("Log something.")
Reference: here

User program failed with ValueError: ZIP does not support timestamps before 1980

Running the pipeline failed with the following error:
User program failed with ValueError: ZIP does not support timestamps before 1980
I created an Azure ML Pipeline that calls several child runs. See the attached code.
from azureml.core import Run, ScriptRunConfig, Workspace, Environment

# start parent Run
run = Run.get_context()
workspace = run.experiment.workspace

runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"

# Submit the runs
for i in range(10):
    print("child run ...")
    run.submit_child(runconfig)
It seems the timestamp of the Python script (simple-for-bug-check.py) is invalid.
My Python SDK version is 1.0.83.
Is there any workaround for this?
Regards,
Keita
One workaround to the issue is setting the source_directory_data_store to a datastore pointing to a file share. Every workspace comes with a datastore pointing to a file share by default, so you can change the parent run submission code to:
# workspacefilestore is the datastore that is created with every workspace that points to a file share
run_config.source_directory_data_store = 'workspacefilestore'
That is if you are using a RunConfiguration directly; if you are using an Estimator, you can do the following:
datastore = Datastore(workspace, 'workspacefilestore')
est = Estimator(..., source_directory_data_store=datastore, ...)
The cause of the issue is that the current working directory in a run is a blobfuse-mounted directory, and in the current (1.2.4) as well as prior versions of blobfuse, the last modified date of every directory is set to the Unix epoch (1970/01/01). By changing source_directory_data_store to a file share, the current working directory becomes a CIFS-mounted file share, which has the correct last modified times for directories and thus does not have this issue.
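Applied to the parent run submission code from the question, the workaround might look roughly like this (a sketch that reuses the names from the question):

from azureml.core import Run, ScriptRunConfig

# start parent Run
run = Run.get_context()
workspace = run.experiment.workspace

runconfig = ScriptRunConfig(source_directory=".", script="simple-for-bug-check.py")
runconfig.run_config.target = "cpu-cluster"
# point the child runs at the workspace file share instead of the default blobfuse-backed store
runconfig.run_config.source_directory_data_store = "workspacefilestore"

# Submit the runs
for i in range(10):
    print("child run ...")
    run.submit_child(runconfig)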

Python read csv error using external script

I am very new to external scripts and Python, and was trying this with very simple code.
I am trying to print the data from a csv file.
execute sp_execute_external_script
@language = N'Python',
@script = N'
import pandas as pd
import csv
data = open("C:/Users/xxxxxx/Desktop/xxxxxx/Python/Pandas/olympics - Copy.csv")
data = csv.reader(data)
print(data)'
But I get the error below:
"FileNotFoundError: [Errno 2] No such file or directory: "
When I run the same code in a Jupyter notebook, it runs fine.
import pandas as pd
oo=pd.read_csv('C:/Users/xxxxxx/Desktop/xxxxxx/Python/Pandas/olympics - Copy.csv')
oo.head()
What am I missing in SQL? Can anyone please help me with the syntax?
Also, are there any good resources where I can learn more about using Python in SQL Server 2017?
The SQL Server you are calling when executing sp_execute_external_script (SPEES), where is it installed: on your machine, or somewhere else?
Don't forget that when you execute SPEES it runs on the SQL Server box, so unless that is your machine, it won't work. Even if it is on your machine, it may not have permission to the directory your file is in.
If SQL Server is installed on your box, I suggest you create a new directory which you give EVERYONE access to and try with that directory.
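For illustration, once the CSV sits in such a directory, the inner Python script could be reduced to something like this (the C:/PythonShare path is a hypothetical example, not a real location):

import pandas as pd

# hypothetical folder that the SQL Server service account has been given access to
oo = pd.read_csv("C:/PythonShare/olympics - Copy.csv")
print(oo.head())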

Not getting real file update in vm

I've been playing with Docker for a while. Recently, I encountered a "bug" whose cause I cannot identify.
I'm currently on Windows 8.1 and have Docker Toolbox installed, which includes Docker 1.8.2, docker-machine 0.4.1, and VirtualBox 5.0.4 (these are the important ones, presumably). I used to use plain boot2docker.
I'm not really sure what is going on, so the description may be vague and unhelpful; please ask me for clarification if you need any. Here we go:
When I write to files that are located in the shared folders, the VM only picks up the file length update, but not the new content.
Let's use my app.py as an example (I've been playing with Flask).
app.py:
from flask import Flask
from flask.ext.sqlalchemy import SQLAlchemy
from werkzeug.contrib.fixers import LighttpdCGIRootFix
import os

app = Flask(__name__)
app.config.from_object(os.getenv('APP_SETTINGS'))
app.wsgi_app = LighttpdCGIRootFix(app.wsgi_app)
db = SQLAlchemy(app)

@app.route('/')
def hello():
    return "My bio!"

if __name__ == '__main__':
    app.run(host='0.0.0.0')
and when I cat it in the VM:
Now, let's update it to the following; notice the extra exclamation marks:
from flask import Flask
from flask.ext.sqlalchemy import SQLAlchemy
from werkzeug.contrib.fixers import LighttpdCGIRootFix
import os

app = Flask(__name__)
app.config.from_object(os.getenv('APP_SETTINGS'))
app.wsgi_app = LighttpdCGIRootFix(app.wsgi_app)
db = SQLAlchemy(app)

@app.route('/')
def hello():
    return "My bio!!!!!!!"

if __name__ == '__main__':
    app.run(host='0.0.0.0')
And when I cat it again:
Notice two things:
the extra exclamation marks are not there;
the EOF marker has moved, and the number of spaces that appear in front of it is exactly the number of exclamation marks added.
I suspect that the OS somehow picked up the change in file size, but failed to pick up the new content. When I delete characters from the file, the EOF marker also moves, and the cat output is cut off by exactly as many characters as I deleted.
It's not only cat that fails to pick up the change; all programs in the VM do. Hence I cannot develop anything when this happens: the changes I make simply don't affect anything, and I have to kill the VM and spin it up again for any change to take effect, which is not very efficient.
Any help will be greatly appreciated! Thank you for reading the long question!
Looks like this is a known issue.
https://github.com/gliderlabs/pagebuilder/issues/2
which links to
https://forums.virtualbox.org/viewtopic.php?f=3&t=33201
Thanks to Matt Aitchison for replying to my github issue at gliderlabs/docker-alpine
sync; echo 3 > /proc/sys/vm/drop_caches is the temporary fix.
A permanent fix doesn't seem to be coming any time soon...
I assume that you mounted app.py as a file, using something like
-v /host/path/to/app.py:/container/path/to/app.py
Sadly, the container will not recognize changes to a file mounted that way.
Try putting the file in a folder and mounting the folder instead. Then changes to that file will be visible in the container.
Assuming app.py is located in $(pwd)/work, try running the container with
-v $(pwd)/work:/work
and adjust the command being run to reference your code as /work/app.py.
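If you happen to start the container from Python with the Docker SDK (as in the DockerOperator thread above) rather than from the docker CLI, the same folder-level mount could be sketched like this; the image name and paths here are placeholders, not taken from the question:

import os
import docker

client = docker.from_env()
# mount the whole work folder (not the single file) so edits on the host are picked up
container = client.containers.run(
    image="python:3.9",               # placeholder image
    command="python /work/app.py",
    volumes={os.path.join(os.getcwd(), "work"): {"bind": "/work", "mode": "rw"}},
    detach=True,
)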
