Pyspark Inferring Timezone by location - apache-spark

I'm trying to infer the timezone in PySpark from the longitude and latitude of an event. I came across the timezonefinder library, which works locally, so I wrapped it in a user-defined function (UDF) to use it as the timezone inferrer.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
def get_timezone(longitude, latitude):
    # Import inside the function so it is resolved on the executor
    from timezonefinder import TimezoneFinder
    tzf = TimezoneFinder()
    return tzf.timezone_at(lng=longitude, lat=latitude)
udf_timezone = F.udf(get_timezone, StringType())
df = sqlContext.read.parquet(INPUT)
df.withColumn("local_timezone", udf_timezone(df.longitude, df.latitude))\
    .write.parquet(OUTPUT)
When I run on a single node, this code works. However, when running in parallel, I get the following error:
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
return lambda *a: f(*a)
File "/tmp/c95422912bfb4079b64b88427991552a/enrich_data.py", line 64, in get_timezone
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/__init__.py", line 3, in <module>
from .timezonefinder import TimezoneFinder
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/timezonefinder.py", line 59, in <module>
from .helpers_numba import coord2int, int2coord, distance_to_polygon_exact, distance_to_polygon, inside_polygon, \
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py", line 17, in <module>
@jit(b1(i4, i4, i4[:, :]), nopython=True, cache=True)
File "/opt/conda/lib/python2.7/site-packages/numba/decorators.py", line 191, in wrapper
disp.enable_caching()
File "/opt/conda/lib/python2.7/site-packages/numba/dispatcher.py", line 529, in enable_caching
self._cache = FunctionCache(self.py_func)
File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 614, in __init__
self._impl = self._impl_class(py_func)
File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 349, in __init__
"for file %r" % (qualname, source_path))
RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py'
I can import the library locally on the nodes where I got the error.
Any solution along these lines would be appreciated:
Is there a native Spark way to do the task?
Is there another way to load the library?
Is there a way to avoid the caching that numba does?
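On the last question, a workaround that is sometimes suggested for this numba error is to point numba's on-disk cache at a directory the worker can write to, through the NUMBA_CACHE_DIR environment variable. The sketch below is an assumption rather than something confirmed in this thread: ensure_numba_cache_dir is a hypothetical helper, and older numba releases may not honor the variable. It has to run before numba is first imported in the worker process.

```python
import os
import tempfile


def ensure_numba_cache_dir(env=os.environ):
    """Point numba's on-disk cache at a writable directory if none is set.

    Hypothetical helper; NUMBA_CACHE_DIR support depends on the numba version.
    """
    if not env.get("NUMBA_CACHE_DIR"):
        env["NUMBA_CACHE_DIR"] = tempfile.mkdtemp(prefix="numba_cache_")
    return env["NUMBA_CACHE_DIR"]


def get_timezone(longitude, latitude):
    # Set the variable before timezonefinder (and therefore numba) is
    # imported for the first time in this worker process.
    ensure_numba_cache_dir()
    from timezonefinder import TimezoneFinder  # third-party, imported lazily
    return TimezoneFinder().timezone_at(lng=longitude, lat=latitude)
```

The get_timezone function can then be wrapped with F.udf exactly as in the question. Whether this removes the error on a given cluster still needs to be checked there.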

Eventually this was solved by abandoning timezonefinder completely and instead using the geospatial timezone dataset from timezone-boundary-builder, queried with magellan, a geospatial SQL query library for Spark.
One caveat was that Point and the other objects in the library were not wrapped for Python. I ended up writing my own Scala function for timezone matching and dropped the magellan objects before returning the DataFrame.
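To make the idea of matching a point against timezone boundary polygons concrete, here is a minimal pure-Python sketch of a point-in-polygon lookup using ray casting. It is not the magellan or Scala code from this answer: point_in_polygon, lookup_zone and the TOY_ZONES boxes are hypothetical, and real boundaries from timezone-boundary-builder are far more detailed (multipolygons, holes).

```python
def point_in_polygon(lng, lat, polygon):
    """Ray-casting test; polygon is a list of (lng, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


# Hypothetical toy zones (simple boxes), NOT real timezone boundaries.
TOY_ZONES = {
    "Zone/West": [(-10.0, 0.0), (0.0, 0.0), (0.0, 10.0), (-10.0, 10.0)],
    "Zone/East": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
}


def lookup_zone(lng, lat, zones=TOY_ZONES):
    """Return the name of the first zone containing the point, else None."""
    for name, polygon in zones.items():
        if point_in_polygon(lng, lat, polygon):
            return name
    return None
```

A linear scan like this is only for illustration; a spatial join or index is what makes the approach practical at Spark scale.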

I encountered this error when running timezonefinder on a Spark cluster:
RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/disk-1/hadoop/yarn/local/usercache/timezonefinder1.zip/timezonefinder/helpers_numba.py'
The issue was that the numpy version on the cluster differed from the one in the timezonefinder package that we shipped to Spark.
The cluster had numpy 1.13.3, whereas the numpy in timezonefinder.zip was 1.17.2.
To overcome the version mismatch, we created a custom conda environment with timezonefinder and numpy 1.17.2 and submitted the Spark job using that environment.
Creating a custom conda environment with the timezonefinder package installed:
conda create --name timezone-conda python timezonefinder
source activate timezone-conda
conda install -y conda-pack
conda pack -o timezonecondaevnv.tar.gz -d ./MY_CONDA_ENV
https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands
Submitting the Spark job with the custom conda environment:
!spark-submit --name app_name \
--master yarn \
--deploy-mode cluster \
--driver-memory 1024m \
--executor-memory 1GB \
--executor-cores 5 \
--num-executors 10 \
--queue QUEUE_NAME \
--archives ./timezonecondaevnv.tar.gz#MY_CONDA_ENV \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
./main.py
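Since the root cause here was a version mismatch, it can help to confirm which interpreter and numpy version the executors actually pick up after submitting with the packed environment. The helper below is a hypothetical sketch (describe_python_env is not part of Spark or conda); it only reports what the current Python process sees.

```python
import sys


def describe_python_env(package="numpy"):
    """Return (interpreter path, package version or None) for this process."""
    try:
        module = __import__(package)
        version = getattr(module, "__version__", None)
    except ImportError:
        # The package is not importable in this interpreter.
        version = None
    return sys.executable, version


# Inside a running job with a SparkContext `sc`, something like the following
# would collect the distinct values seen on the executors:
#   sc.parallelize(range(100), 10).map(lambda _: describe_python_env()).distinct().collect()
```

If the environment was shipped correctly, the interpreter path reported by the executors should sit under ./MY_CONDA_ENV and the numpy version should match the one in the packed environment.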

Related

Airflow SparkSubmitOperator - How to spark-submit in another server

I am new to Airflow and Spark and I am struggling with the SparkSubmitOperator.
Our airflow scheduler and our hadoop cluster are not set up on the same machine (first question: is it a good practice?).
We have many automatic procedures that need to call pyspark scripts. Those pyspark scripts are stored in the hadoop cluster (10.70.1.35). The airflow dags are stored in the airflow machine (10.70.1.22).
Currently, when we want to spark-submit a pyspark script with airflow, we use a simple BashOperator as follows:
cmd = "ssh hadoop@10.70.1.35 spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--executor-cores 2 \
/home/hadoop/pyspark_script/script.py"
t = BashOperator(task_id='Spark_datamodel',bash_command=cmd,dag=dag)
It works perfectly fine. But we would like to start using SparkSubmitOperator to spark submit our pyspark scripts.
I tried this:
from airflow import DAG
from datetime import timedelta, datetime
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
dag = DAG('SPARK_SUBMIT_TEST', start_date=datetime(2018, 12, 10),
          schedule_interval='@daily')
sleep = BashOperator(task_id='sleep', bash_command='sleep 10', dag=dag)
_config = {'application': 'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py',
           'master': 'yarn',
           'deploy-mode': 'cluster',
           'executor_cores': 1,
           'EXECUTORS_MEM': '2G'
           }
spark_submit_operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    dag=dag,
    **_config)
sleep.set_downstream(spark_submit_operator)
The syntax should be ok as the dag does not show up as broken. But when it runs it gives me the following error:
[2018-12-14 03:26:42,600] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,600] {base_hook.py:83} INFO - Using connection to: yarn
[2018-12-14 03:26:42,974] {logging_mixin.py:95} INFO - [2018-12-14 03:26:42,973] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: ['spark-submit', '--master', 'yarn', '--executor-cores', '1', '--name', 'airflow-spark', '--queue', 'root.default', 'hadoop@10.70.1.35:/home/hadoop/pyspark_script/test_spark_submit.py']
[2018-12-14 03:26:42,977] {models.py:1760} ERROR - [Errno 2] No such file or directory: 'spark-submit'
Traceback (most recent call last):
File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/models.py", line 1659, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 168, in execute
self._hook.submit(self._application)
File "/home/dataetl/anaconda3/lib/python3.6/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 330, in submit
**kwargs)
File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 707, in __init__
restore_signals, start_new_session)
File "/home/dataetl/anaconda3/lib/python3.6/subprocess.py", line 1326, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit'
Here are my questions:
Should I install spark hadoop on my airflow machine? I'm asking because in this topic I read that I need to copy hdfs-site.xml and hive-site.xml. But as you can imagine, I have neither /etc/hadoop/ nor /etc/hive/ directories on my airflow machine.
a) If no, where exactly should I copy hdfs-site.xml and hive-site.xml on my airflow machine?
b) If yes, does it mean that I need to configure my airflow machine as a client? A kind of edge node that does not participate in jobs but can be used to submit actions?
Then, will I be able to spark-submit from my airflow machine? If yes, then I don't need to create a connection on Airflow like I do for a mysql database for example, right?
Oh, and the cherry on the cake: will I be able to store my pyspark scripts on my airflow machine and spark-submit them from this same airflow machine? It would be amazing!
Any comment would be very useful, even if you're not able to answer all my questions...
Thanks in advance anyway! :)
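To answer your first question: yes, it is good practice.
For how to use SparkSubmitOperator, please refer to my answer at https://stackoverflow.com/a/53344713/5691525
Should you install Spark on the Airflow machine? Yes, you need the Spark binaries on the Airflow machine.
a) Not applicable, since the answer above is yes.
b) Yes.
Can you skip the Airflow connection? No. You still need a connection to tell Airflow where you have installed the Spark binaries, similar to https://stackoverflow.com/a/50541640/5691525
Storing the PySpark scripts on the Airflow machine and submitting them from there should work.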
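The FileNotFoundError in the question means the process running the task could not find a spark-submit executable. Before wiring up the operator, a quick check like the sketch below can confirm whether the binary is visible from the Airflow machine. find_spark_submit is a hypothetical helper, and the assumption that the binary lives under SPARK_HOME/bin follows the usual Spark layout rather than anything stated in this thread.

```python
import os
import shutil


def find_spark_submit(spark_home=None, path=None):
    """Return the full path to spark-submit, or None if it cannot be found."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if spark_home:
        candidate = os.path.join(spark_home, "bin", "spark-submit")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to searching PATH (or the explicit search path given).
    return shutil.which("spark-submit", path=path)
```

Running find_spark_submit() as the same user and environment that the Airflow worker uses is the point of the check; a None result matches the error shown in the question.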

Pyspark in Docker based on Hortonworks 2.6.1 is throwing error with EnableHiveSupport()

I am trying to build an edge node using Docker with HDP 2.6.1. Everything is available and running except Spark support. I was able to install and run pyspark, but only when I comment out enableHiveSupport(). I have also copied hive-site.xml from Ambari to /etc/spark2/conf, and all the Spark confs match the cluster settings. But I still get this error:
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 187, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
>>> spark.createDataFrame([(1,'a'), (2,'b')], ['id', 'nm'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'spark' is not defined
I have tried searching for this error, but all the results I get are Windows errors related to permissions or a missing hive-site.xml. However, I am building on centos:7.3.1611 and installing the following:
RUN wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.1.0/hdp.repo
RUN cp hdp.repo /etc/yum.repos.d
RUN yum -y install hadoop sqoop spark2_2_6_1_0_129-master spark2_2_6_1_0_129-python hive-hcatalog
So the solution to the above problem is that the hive-site.xml needs to only contain the property for hive.metastore.uris and NOTHING ELSE. (Reference: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_spark-component-guide/content/spark-config-hive.html). Once you take the other properties out, it works like a charm!
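For illustration, a hive-site.xml that holds only that one property can be generated as in the sketch below. minimal_hive_site is a hypothetical helper, and the thrift address is a placeholder, not a value from this setup; the real URI should come from the cluster's own Hive configuration.

```python
import xml.etree.ElementTree as ET


def minimal_hive_site(metastore_uris):
    """Build a hive-site.xml document holding only hive.metastore.uris."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uris
    return ET.tostring(configuration, encoding="unicode")


# Placeholder metastore address; replace it with the cluster's actual value.
xml_text = minimal_hive_site("thrift://metastore-host.example:9083")
print(xml_text)
```

The resulting text would be saved as /etc/spark2/conf/hive-site.xml in the image, replacing the full file copied from Ambari.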

pyspark 1.6.0 write to parquet gives "path exists" error

I read Parquet files from several folders, for example all of February this year (one folder = one day):
indata = sqlContext.read.parquet('/data/myfolder/201602*')
then do some very simple grouping and aggregation:
outdata = indata.groupby(...).agg()
and want to store the result again:
outdata.write.parquet(outloc)
Here is how I run the script from bash:
spark-submit \
--master yarn-cluster \
--num-executors 16 \
--executor-cores 4 \
--driver-memory 8g \
--executor-memory 16g \
--files /etc/hive/conf/hive-site.xml \
--driver-java-options -XX:MaxPermSize=512m \
spark_script.py
This generates multiple jobs (is that the right term?). First job runs successfully. Subsequent jobs fail with the following error:
Traceback (most recent call last):
File "spark_generate_maps.py", line 184, in <module>
outdata.write.parquet(outloc)
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 471, in parquet
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'path OBFUSCATED_PATH_THAT_I_CLEANED_BEFORE_SUBMIT already exists.;'
When I give only one folder as input, this works fine.
So it seems the first job creates the folder and all subsequent jobs fail to write into it. Why?
Just in case this helps anybody, here are the imports:
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf, collect_list, countDistinct, count
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
import numpy as np
import sys
import math
config:
conf = SparkConf().setAppName('spark-compute-maps').setMaster('yarn-cluster')
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
Your question is: "Why does Spark iterate over the input folders but apply the default write mode, which does not make sense in that context?"
Quoting the Spark 1.6 Python API:
mode(saveMode)
Specifies the behavior when data or table already exists.
Options include:
append Append contents of this DataFrame to existing data.
overwrite Overwrite existing data.
error Throw an exception if data already exists.
ignore Silently ignore this operation if data already exists.
I think outdata.write.mode('append').parquet(outloc) is worth a try.
You should add the mode option to your code:
outdata.write.mode('append').parquet(outloc)

Spark on EMR Yarn - EOF Error

We are running some PySpark processes on YARN. When the datasets increase in size, we get this error in the YARN log:
Traceback (most recent call last):
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
raise EOFError
java.net.SocketException: Socket is closed
at java.net.Socket.shutdownOutput(Socket.java:1496)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply$mcV$sp(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.tryLog(Utils.scala:1785)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
We are running on an EMR setup of three m3.xlarge instances, each with 4 vCPUs, 15 GiB and 2x40 GB.
The job is executed with the following shell script:
export SPARK_HOME=/home/hadoop/spark
JARS="/home/hadoop/avro-1.7.7.jar,/home/hadoop/spark-avro-master/target/scala-2.10/spark-avro_2.10-1.0.0.jar"
$SPARK_HOME/bin/spark-submit --master yarn-cluster --py-files deploy.zip --jars $JARS main.py
where deploy.zip contains some utility methods and lambda functions.
No other configuration changes were made to the cluster.
Looking at the UI, it seems that all the jobs finish with a SUCCESS status. Nevertheless, we would like to get rid of this issue, or at the very least understand what is causing it.
Do you have any idea what the origin of the error might be?
Thanks!

Spark not correctly installed with HDP 2.3 via Ambari?

I am attempting to install HDFS, YARN, Spark, etc. on a local cluster of CentOS 6.6 machines using Ambari 2.1.0 and HDP 2.3. I had already botched the upgrade from HDP 2.2, so I erased all the HDP 2.2 packages and Ambari before starting over. I can get through most of the Cluster Install Wizard without a problem, but in the "Install, Start and Test" phase I receive the following error message:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 38, in <module>
AfterInstallHook().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 218, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/hook.py", line 35, in hook
link_configs(self.stroutfile)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/shared_initialization.py", line 91, in link_configs
_link_configs(k, json_version, v)
File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/after-INSTALL/scripts/shared_initialization.py", line 156, in _link_configs
conf_select.select("HDP", package, version)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/conf_select.py", line 241, in select
shell.checked_call(get_cmd("set-conf-dir", package, version), logoutput=False, quiet=False, sudo=True)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'conf-select set-conf-dir --package spark --stack-version 2.3.0.0-2557 --conf-version 0' returned 1. spark not installed or incorrect package name
The check script appears to be looking for spark in /usr/hdp/2.3.0.0-2557. This is what I see in that directory:
ls /usr/hdp/2.3.0.0-2557/
etc hadoop hadoop-hdfs hadoop-mapreduce hadoop-yarn ranger-hdfs-plugin ranger-yarn-plugin usr zookeeper
On one of the slave machines that complains, it appears that spark has been "installed":
# yum list installed | grep spark
spark_2_3_0_0_2557.noarch
spark_2_3_0_0_2557-master.noarch
spark_2_3_0_0_2557-python.noarch
spark_2_3_0_0_2557-worker.noarch
Any ideas on how to resolve this issue?
Some components are at level 2.3.0.0-2557 and others are at 2.3.2.0-2621.
In my case I had an issue with ZooKeeper, so in /usr/hdp/current I created these links:
zookeeper-client -> /usr/hdp/2.3.0.0-2557/zookeeper
zookeeper-server -> /usr/hdp/2.3.0.0-2557/zookeeper
This fixed the install issue, but the services still would not start.
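When components end up at mixed version levels like this, some links under /usr/hdp/current may point at version directories that do not exist. The sketch below lists such dangling links so they can be fixed by hand. find_broken_links is a hypothetical helper, not part of Ambari or HDP, and it only reports links without changing anything.

```python
import os


def find_broken_links(directory):
    """Return {link name: target} for symlinks whose target does not exist."""
    broken = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists follows the link, so it is False for a dangling one.
        if os.path.islink(path) and not os.path.exists(path):
            broken[name] = os.readlink(path)
    return broken
```

Calling find_broken_links('/usr/hdp/current') on an affected node would show which components need to be relinked.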
