Read elasticsearch index using Pyspark - apache-spark

I am trying to read an Elasticsearch index using PySpark (v1.6.3), but I am getting the error below.
I am using the following snippet to read/load the index:
es_reader = sql_context.read.format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "x.x.x.x,y.y.y.y,z.z.z.z") \
    .option("es.port", "9200") \
    .option("es.net.ssl", "true") \
    .option("es.net.http.auth.user", "*****") \
    .option("es.net.http.auth.pass", "*****")
sde_df = es_reader.load("index_name/doc_type")
Error:
22_0452/container_1547624497922_0452_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1547624497922_0452/container_1547624497922_0452_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/mnt/yarn/usercache/root/appcache/application_1547624497922_0452/container_1547624497922_0452_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/StreamSinkProvider
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.StreamSinkProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 49 more
Environment:
Spark v1.6.3
Elasticsearch 5.4.3
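For reference, here is the snippet condensed into a self-contained sketch; the SparkContext/SQLContext construction and app name are illustrative additions, while the node addresses, credentials, and index/doc-type placeholders are copied from above:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("es_read_example")  # illustrative app name
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)

# same reader options as in the snippet above; values are placeholders
es_reader = (sql_context.read.format("org.elasticsearch.spark.sql")
             .option("es.nodes", "x.x.x.x,y.y.y.y,z.z.z.z")
             .option("es.port", "9200")
             .option("es.net.ssl", "true")
             .option("es.net.http.auth.user", "*****")
             .option("es.net.http.auth.pass", "*****"))
sde_df = es_reader.load("index_name/doc_type")
sde_df.printSchema()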

Related

py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv

I'm new to PySpark and I'm running it on my local machine. I'm trying to write a PySpark DataFrame to a CSV file, so I wrote the following code:
dataframe.write.mode('append').csv(outputPath)
But I'm getting the following error message:
Traceback (most recent call last):
File "D:\PycharmProjects\pythonProject\org\spark\weblog\SparkWebLogsAnalysis.py", line 71, in <module>
weblog_sessionIds.write.mode('append').csv(outputPath)
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\readwriter.py", line 1372, in csv
self._jwrite.csv(path)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
File "C:\spark-3.1.2-bin-hadoop3.2\python\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\spark-3.1.2-bin-hadoop3.2\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o49.csv.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:560)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:587)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:586)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:559)
at org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:705)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.setupJob(FileOutputCommitter.java:354)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:178)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
Can you suggest how to rectify this error?
The problem was resolved by deleting the hadoop.dll file from the winutils folder and using a lower version of Spark.
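For anyone hitting the same UnsatisfiedLinkError, a minimal sketch of the same kind of write on Windows, assuming HADOOP_HOME points at the winutils folder mentioned above (all paths and data below are placeholders, not taken from the question):
import os

# Hypothetical winutils location; %HADOOP_HOME%\bin is expected to hold winutils.exe
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["PATH"] + os.pathsep + r"C:\hadoop\bin"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("csv_write_check").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("append").csv("output_path")  # same call pattern that failed above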

Tracker issue with XGBoost in PySpark

I am using XGBoost in PySpark by placing these two jars, xgboost4j and xgboost4j-spark, in the $SPARK_HOME/jars folder.
When I try to fit the XGBoostClassifier model, I get an error with the following message:
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
I looked for the tracker in the trace and noticed that it is not binding to localhost. This is the tracker info:
Tracker started, with env={}
I am using a Mac, so I checked the /etc/hosts file:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost.localdomain localhost
255.255.255.255 broadcasthost
::1 localhost
127.0.0.1 myusername
Everything looks fine in the hosts file.
Any idea why the tracker is failing to initialise properly?
Error trace
Tracker started, with env={}
2019-01-07 12:50:19 ERROR RabitTracker:91 - Uncaught exception thrown by worker:
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4$$anon$1.run(XGBoost.scala:233)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/myusername/Downloads/ml_project/ml_project/features/variable_selection.py", line 130, in fit
self.ttv.fit(target_col, X, test=None, validation=None)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 636, in fit
upper_bounds, self.curr_model_num_iter, is_integer_variable)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 253, in model_tuner
num_iter, is_integer_variable, random_state=42)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 200, in aml_forest_maximize
is_integer_variable, random_state=random_state)
File "/Users/myusername/Downloads/ml_project/ml_project/models/hyperparam_optimizers.py", line 179, in aml_forest_minimize
return forest_minimize(objective_calculator, space, n_calls=num_iter, random_state=random_state, n_random_starts=n_random_starts, base_estimator="RF", n_jobs=-1)
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/forest.py", line 161, in forest_minimize
callback=callback, acq_optimizer="sampling")
File "/usr/local/lib/python3.7/site-packages/skopt/optimizer/base.py", line 248, in base_minimize
next_y = func(next_x)
File "/Users/myusername/Downloads/ml_project/ml_project/models/train_test_validator.py", line 487, in objective_calculator
model_fit = init_model.fit(train) # fit model
File "/Users/myusername/Downloads/spark/python/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 288, in _fit
java_model = self._fit_java(dataset)
File "/Users/myusername/Downloads/spark/python/pyspark/ml/wrapper.py", line 285, in _fit_java
return self._java_obj.fit(dataset._jdf)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/Users/myusername/Downloads/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/myusername/Downloads/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o413.fit.
: ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:283)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:240)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributed$4.apply(XGBoost.scala:222)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributed(XGBoost.scala:221)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:191)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.train(XGBoostClassifier.scala:48)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Try adding an xgboost-tracker.properties file in the folder with your jar files, with the following content:
host-ip=0.0.0.0
XGBoost github
Another option is to unzip the xgboost4j jar file using the command:
jar xf xgboost4j-0.72.jar
You can then modify the tracker.py file manually and add the corrected file back to the jar using:
jar uf xgboost4j-0.72.jar tracker.py
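Before repacking the jar, a quick and purely illustrative way to check what the local hostname resolves to (the tracker binds to an address derived from this kind of lookup; this is not the tracker's exact code):
import socket

# On macOS a surprising /etc/hosts entry can make the hostname resolve to
# something other than the loopback or LAN address the tracker expects
hostname = socket.getfqdn()
print(hostname, socket.gethostbyname(hostname))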

KinesisUtils.createStream error in Spark streaming + Kinesis

I'm trying to stream data from AWS Kinesis using Spark Streaming + Kinesis Integration
My code looks like:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext('local[*]', 'app_name')
ssc = StreamingContext(sc, 10)
kinesisStream = KinesisUtils.createStream(
    ssc,
    kinesisAppName='kinesis_app_name',
    streamName='kinesis_stream_name',
    endpointUrl='https://kinesis.ap-southeast-2.amazonaws.com',
    regionName='ap-southeast-2',
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10)
The command to run the script: spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.2.0 script.py. I'm using Spark 2.2.0 with PySpark.
The error I got:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1035, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "kinesis_to_s3.py", line 63, in
checkpointInterval=streaming_interval)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/pyspark/streaming/kinesis.py", line 92, in createStream
stsSessionName, stsExternalId)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/ubuntu/transformer/env/lib/python3.5/site-packages/py4j/protocol.py", line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o27.createStream
Exception in thread "Thread-2" java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/clientlibrary/lib/worker/InitialPositionInStream
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more

Reading from hdfs and writing to oracle 12

Hi. I am trying to read from HDFS and write to Oracle using PySpark, but I get an error. I attach the code that I am using and the error that I get:
pyspark --driver-class-path "/opt/oracle/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
lines = sc.textFile("hdfs://bigdatalite.localdomain:8020/user/oracle/ACTIVITY/part-m-00000")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=p[1]))
schemaPeople = sqlContext.createDataFrame(people)
url = "jdbc:oracle:thin#localhost:1521/orcl"
properties = {
    "user": "MOVIEDEMO",
    "password": "welcome1",
    "driver": "oracle.jdbc.driver.OracleDriver"
}
schemaPeople.write.jdbc(url=url, table="ACTIVITY", mode="append", properties=properties)
...and the error shown is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 530, in jdbc
self._jwrite.mode(mode).jdbc(url, table, jprop)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.jdbc.
: java.sql.SQLException: Invalid Oracle URL specified
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:453)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:61)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:278)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:748)
PS: I am using Spark 1.6.0.
The URL should be specified in the "service" format, i.e.
jdbc:oracle:thin:@//myhost:1521/orcl
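With that change, the write from the question would look like this (a sketch reusing the same table and properties; the host and service name are placeholders):
url = "jdbc:oracle:thin:@//myhost:1521/orcl"  # service-name format
properties = {
    "user": "MOVIEDEMO",
    "password": "welcome1",
    "driver": "oracle.jdbc.driver.OracleDriver"
}
schemaPeople.write.jdbc(url=url, table="ACTIVITY", mode="append", properties=properties)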

Spark 2.0 with Zeppelin 0.6.1 - SQLContext not available

I am running Spark 2.0 and zeppelin-0.6.1-bin-all on a Linux server. The default Spark notebook runs just fine, but when I try to create and run a new notebook in PySpark using sqlContext, I get the error "py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist".
I tried running a simple piece of code:
%pyspark
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordsDF.show()
print type(wordsDF)
wordsDF.printSchema()
I get the error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 266, in
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7635635698598314374.py", line 259, in
exec(code)
File "", line 1, in
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/context.py", line 299, in createDataFrame
return self.sparkSession.createDataFrame(data, schema, samplingRatio)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/spark/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o48.createDataFrame. Trace:
py4j.Py4JException: Method createDataFrame([class java.util.ArrayList, class java.util.ArrayList, null]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
When I try the same code with "sqlContext = SQLContext(sc)" it works just fine.
I have tried setting the interpreter configuration "zeppelin.spark.useHiveContext false", but it did not work.
I must obviously be missing something, since this is such a simple operation. Please advise if there is any other configuration to be set or what I am missing.
I tested the same piece of code with Zeppelin 0.6.0 and it is working fine.
SparkSession is the default entry point for Spark 2.0.0, and it is mapped to spark in Zeppelin 0.6.1 (as it is in the Spark shell). Have you tried spark.createDataFrame(...)?
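For example, the snippet from the question rewritten against the SparkSession entry point (a sketch; only the entry point changes, the data and column name are copied from the question):
%pyspark
wordsDF = spark.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat',)], ['word'])
wordsDF.show()
print(type(wordsDF))
wordsDF.printSchema()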
