Airflow job not Scheduling after manual Trigger - python-3.x

I'm using Airflow version 2.2.3 (the latest version). My Airflow DAG is scheduled to run "@daily".
I manually triggered the DAG yesterday (2022-02-08, 09:19:22) and expected it to be scheduled today at 2022-02-09, 00:00:00, but the DAG was not scheduled.
From the Airflow UI: the last run was on 2022-02-08, 09:19:22 and the next run is scheduled for 2022-02-09, 00:00:00. But that time has already passed, and it is now 2022-02-09 06:22 UTC.
DAG parameters:
{'catchup': False,
'catchup_by_default': False,
'depends_on_past': False,
'provide_context': True,
'retries': 1,
'retry_delay': datetime.timedelta(0, 300),
'start_date': DateTime(2021, 12, 31, 15, 50, 0, tzinfo=Timezone('UTC'))}
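For context, the DAG is defined roughly along these lines (a minimal sketch reconstructed from the parameters above; the DAG id and task are placeholders):
from datetime import timedelta
import pendulum
from airflow import DAG
from airflow.operators.dummy import DummyOperator

default_args = {
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="my_daily_dag",  # placeholder name
    schedule_interval="@daily",
    start_date=pendulum.datetime(2021, 12, 31, 15, 50, tz="UTC"),
    catchup=False,
    default_args=default_args,
) as dag:
    DummyOperator(task_id="placeholder_task")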
Do I need to change any parameters or config values? Please suggest.

Related

How to submit an operation not related to data processing to a Spark cluster so that it will be executed on every node

I've got a big Spark DataFrame loaded from a list of around 10K files. I want to do something with the files (for example, copy them from one place to another). Currently I'm using a ThreadPoolExecutor to run some tasks concurrently: I slice the list of files into chunks and process every chunk in parallel as follows:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = []
    for idx, files_chunk in enumerate(files_in_chunks):
        futures.append(executor.submit(process_chunk, chunk_index=idx, chunk_of_files=files_chunk))
    concurrent.futures.as_completed(futures)
The function process_chunk does the job I need.
But this, as I understand it, executes the logic on only one worker node (or even on the master node), so the parallel executions won't use all the available worker nodes in the Spark cluster.
How is it possible to submit process_chunk to the Spark cluster so that it can be executed on every available worker node, for example using the files from a new RDD created from the list of files?
I'm using PySpark.
An RDD has methods such as:
map
mapPartitions
mapPartitionsWithIndex
The function they accept is executed on the cluster and can do any work you like, such as copying files. It's also possible to use a ThreadPoolExecutor inside those functions, which can improve performance as well (a sketch of that is at the end of this answer).
To run the function, you need to call an action such as .collect() afterwards. For example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

def process_chunks(idx, chunk_items):
    # runs once per partition, on the workers
    return [{"partition_idx": idx, "items": len(list(chunk_items))}]

items = range(1000)
rdd = spark.sparkContext.parallelize(items, numSlices=10)
result = rdd.mapPartitionsWithIndex(process_chunks).collect()
print(result)
It will return:
[{'partition_idx': 0, 'items': 100}, {'partition_idx': 1, 'items': 100}, {'partition_idx': 2, 'items': 100}, {'partition_idx': 3, 'items': 100}, {'partition_idx': 4, 'items': 100}, {'partition_idx': 5, 'items': 100}, {'partition_idx': 6, 'items': 100}, {'partition_idx': 7, 'items': 100}, {'partition_idx': 8, 'items': 100}, {'partition_idx': 9, 'items': 100}]
You can read more here: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
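For the file-copying case discussed above, a rough sketch of combining mapPartitions with a ThreadPoolExecutor inside each partition could look like this (the copy target directory and the file paths are hypothetical):
import concurrent.futures
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def copy_file(path):
    # hypothetical copy step; replace with your real source/target logic
    shutil.copy(path, "/tmp/target_dir")
    return path

def copy_partition(paths):
    # runs on a worker; fans out the I/O-bound copies with threads
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        return list(executor.map(copy_file, paths))

files = ["/data/file_{}.csv".format(i) for i in range(10000)]  # hypothetical paths
copied = spark.sparkContext.parallelize(files, numSlices=100).mapPartitions(copy_partition).collect()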

python get unix timestamp given a timezone and datetime

Is there a simplified way to get a Unix timestamp given a timezone and datetime, using datetime and pytz?
I searched some answers but did not find exactly this case.
Given the moment datetime.datetime(2020, 3, 15, 22, 0, 0) in the timezone Asia/Tokyo, the goal is to get the Unix timestamp of that moment.
You will need to use a third-party library to convert 'Asia/Tokyo' into a proper timezone. The most commonly used one is the pytz library, as follows:
import datetime
import pytz
naive_datetime = datetime.datetime(2020, 3, 15, 22, 0, 0)
new_datetime = pytz.timezone('Asia/Tokyo').localize(naive_datetime)
new_datetime will now yield datetime.datetime(2020, 3, 15, 22, 0, tzinfo=<DstTzInfo 'Asia/Tokyo' JST+9:00:00 STD>). (Note that calling astimezone on a naive datetime would first interpret it in the machine's local timezone, which is not what we want here.)
Then you use the timestamp method on the datetime to get the Unix timestamp (which is relative to UTC):
>>> new_datetime.timestamp()
1584277200.0
>>> datetime.datetime.utcfromtimestamp(1584277200.0)
datetime.datetime(2020, 3, 15, 13, 0)
You could also look into the pendulum library, which offers an all-around friendly experience dealing with times and dates.
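As an aside, on Python 3.9+ the standard-library zoneinfo module can attach the timezone directly, without pytz (a minimal sketch):
import datetime
from zoneinfo import ZoneInfo

dt = datetime.datetime(2020, 3, 15, 22, 0, 0, tzinfo=ZoneInfo("Asia/Tokyo"))
print(dt.timestamp())  # 1584277200.0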

Force AWS Lambda to use UTC when listing S3 objects using boto3

When retrieving a list of objects on AWS Lambda using Python 3.6 and boto3, the objects' LastModified attribute comes back as 'LastModified': datetime.datetime(2018, 8, 17, 1, 51, 31, tzinfo=tzlocal()).
When I run my program locally, this attribute comes back as 'LastModified': datetime.datetime(2018, 8, 17, 1, 51, 31, tzinfo=tzutc()), which is what I want.
Why is this happening? Is there a workaround that will allow me to specify UTC as part of the request? Alternatively, is there a simple way to convert these datetimes to UTC after they're returned from S3?
Running this code:
from datetime import datetime
from dateutil import tz
from dateutil.tz import tzlocal
d_local = datetime(2018, 8, 17, 1, 51, 31, tzinfo=tzlocal())
d_utc = d_local.astimezone(tz.tzutc())
The result is that d_utc is:
datetime.datetime(2018, 8, 16, 15, 51, 31, tzinfo=tzutc())
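Applied to the S3 listing itself, the conversion might look like this (a sketch; the bucket name is a placeholder):
import boto3
from dateutil import tz

s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-bucket')  # placeholder bucket name
for obj in response.get('Contents', []):
    last_modified_utc = obj['LastModified'].astimezone(tz.tzutc())
    print(obj['Key'], last_modified_utc)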

Spark performance in single node

I'm trying to execute some of the sample Python/scikit-learn scripts on Spark on a single node (my desktop: a Mac with 8 GB of RAM). Here is my configuration in the spark-env.sh file:
SPARK_MASTER_HOST='IP'
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_CORES=2
I'm starting the workers with:
./sbin/start-slave.sh spark://IP
The Workers table in the UI (http://localhost:8080/) shows 3 workers running with 2 cores each.
My script file, which I took from https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html:
from sklearn import svm, grid_search, datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # spark_sklearn's distributed GridSearchCV
from spark_sklearn.util import createLocalSparkSession
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"],
              "n_estimators": [10, 20, 40, 80]}
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
Submitting the script
spark-submit --master spark://IP traingrid.py
However, I do not see any significant improvement in execution time.
Are there any other configurations required to make it more parallel, or should I add another node to improve it?

How to execute Hive Query in Spark with more speed

Executing Hive queries in Tez and Spark:
I have a Hive query that runs in Tez in about 10 minutes.
The same query executed in Spark via hiveContext.sql takes 13-14 minutes.
Hardware info:
Spark 1.5.2
Cluster: YARN (client mode)
Nodes: 11, memory: 192 GB, cores: 24
To achieve:
I'd like to execute it in Spark in less than 5 minutes without changing my query.
My Findings
executors  driverMemory  executorMemory  cores  start     end       difference
4          6             15              6      17:12:15  17:26:03  0:13:48
4          6             15              10     17:37:55  17:49:24  0:11:29
4          6             7               10     17:53:40  18:07:12  0:13:32
1          6             7               4      21:54:10  22:08:16  0:14:06
1          6             7               6      23:44:15  23:57:25  0:13:10
3          N/A           60              5      11:12:49  11:28:58  0:16:09
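For reference, a row such as the first one corresponds roughly to a submission along these lines (a sketch, assuming YARN client mode as stated above and that the memory values are in GB; the script name is a placeholder):
spark-submit --master yarn --deploy-mode client --num-executors 4 --driver-memory 6G --executor-memory 15G --executor-cores 6 my_hive_query.py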
However I configure it, the execution time is not reduced.
Can someone please let me know how to tune it?
This is because Spark 1.5.2 did not use the Tungsten execution engine.
Check out the discussion here.
