Py4JJavaError in an Azure Databricks notebook pipeline - apache-spark

I have a curious issue, when launching a databricks notebook from a caller notebook through dbutils.notebook.run (I am working in Azure Databricks).
One interesting thing I noticed is that when manually launching the inner notebook, everything goes smoothly.
I am also positive that at least one run had been successful even when called by the outer notebook in the exact same conditions. It is likely it never worked when called from outside, see explaination of the issue below.
What is weird is that when I get to view the inner notebook run, I have a pandas related exception (KeyError: "None of [Index(['address'], dtype='object')] are in the [columns]"). But I really don't think that it is related to my code as, like mentioned above, the code works when the inner notebook is run directly. For what it helps, the inner notebook has some heavy pandas computation.
The full visible java stack in the outer notebook is:
Py4JJavaError: An error occurred while calling o1141._run.
: com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAILED
at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:71)
at com.databricks.dbutils_v1.impl.NotebookUtilsImpl.run(NotebookUtilsImpl.scala:122)
at com.databricks.dbutils_v1.impl.NotebookUtilsImpl._run(NotebookUtilsImpl.scala:89)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.databricks.NotebookExecutionException: FAILED
at com.databricks.workflow.WorkflowDriver.run0(WorkflowDriver.scala:117)
at com.databricks.workflow.WorkflowDriver.run(WorkflowDriver.scala:66)
... 13 more
Any help would be welcome, thanks!

Thanks to #AlexOtt, I identified the origin of my issue.
The main takeaway I would like to share is to double check job parameters passing between the notebooks (and especially the "type cast" that happen with the standard way of passing arguments)
In my specific case, I wanted to pass an integer to the inner notebook but it was converted to string in the process, and was incorrectly taken into account afterwards.
In the outer notebook:
# set up the parameter dict
jobs_params = {
...
'max_accounts': 0, # set to 0 to parse all the accounts sent
}
# call the inner notebook
dbutils.notebook.run("./01_JSON_Processing", 1800, jobs_params)
In the inner notebook:
arg_list = [
...
'max_accounts',
]
v = dict()
for arg in arg_list:
dbutils.widgets.text(arg, "", "")
v[arg] = dbutils.widgets.get(arg)
Checking the type of v['max_accounts'] showed that it had been converted to a string in the process (and further computation resulted in the KeyError exception).
I did not identify the issue as when debugging the inner notebook, I just copy/pasted the job_params values in the inner notebook, but this did not reproduce the casting of max_accounts as a string in the process.

Related

Azure Databricks sentiment analysis doesn't work

I'm using Azure Databricks and I want to do a text sentiment analysis with the following code:
from pyspark.sql.functions import col
import synapse.ml
from synapse.ml.cognitive import *
# Create a dataframe that's tied to it's column names
df_sentences = spark.createDataFrame([
("I am so happy today, its sunny!", "en-US"),
("this is a dog", "en-US"),s
("I am frustrated by this rush hour traffic!", "en-US")
], ["text", "language"])
# Run the Text Analytics service with options
sentiment = (TextSentiment()
.setTextCol("text")
.setLocation("eastasia") # Set the location of your cognitive service
.setSubscriptionKey(cognitive_service_key)
.setOutputCol("sentiment")
.setErrorCol("error")
.setLanguageCol("language"))
# Show the results of your text query in a table format
display(sentiment.transform(df_sentences).select("text", col("sentiment")[0].getItem("sentiment").alias("sentiment")))
This doesn't work.. Here is the full error with the details:
java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.azure.synapse.ml.param.ServiceParamJsonProtocol$
Py4JJavaError: An error occurred while calling None.com.microsoft.azure.synapse.ml.cognitive.TextSentiment.
: java.lang.NoClassDefFoundError: Could not initialize class com.microsoft.azure.synapse.ml.param.ServiceParamJsonProtocol$
at com.microsoft.azure.synapse.ml.param.ServiceParam.<init>(JsonEncodableParam.scala:62)
at com.microsoft.azure.synapse.ml.cognitive.HasSubscriptionKey.$init$(CognitiveServiceBase.scala:130)
at com.microsoft.azure.synapse.ml.cognitive.CognitiveServicesBaseNoHandler.<init>(CognitiveServiceBase.scala:306)
at com.microsoft.azure.synapse.ml.cognitive.TextAnalyticsBase.<init>(TextAnalytics.scala:53)
at com.microsoft.azure.synapse.ml.cognitive.TextSentiment.<init>(TextAnalytics.scala:288)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:250)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
I'm running 10.4 (includes Apache Spark 3.2.1, Scala 2.12) cluster and have synapseml_2.12:0.10.0 installed. Does someone know what goes wrong?
Generally this error is caused by incompatible versions of libraries with spark and Scala.
To get rid of this error you need to Uninstall synapseml_2.12:0.10.0 and install synapseml_2.12:0.10.1
Maven coordinates - com.microsoft.azure:synapseml_2.12:0.10.1

Spark magic output committer settings not recognized

I'm trying to play around with different Spark output committer settings for s3, and wanted to try out the magic committer. So far I didn't manage to get my jobs to use the magic committer, and they always seem to fall back on the file output committer.
The Spark job I'm running is a simple PySpark test job that runs a simple query, repartitions the data and outputs parquet to s3:
df = spark.sql("select * from some_table where some_condition")
df.write \
.partitionBy("some_column") \
.parquet("s3://some-bucket/some-folder", mode="overwrite")
The relevant spark settings are (taken from the Spark UI, job's environment tab):
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.staging.tmp.path tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
mapreduce.output.fileoutputformat.compress false
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type RECORD
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
mapreduce.fileoutputcommitter.algorithm.version 1
mapreduce.fileoutputcommitter.task.cleanup.enabled false
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
Hadoop properties:
fs.s3a.committer.magic.enabled true
fs.s3a.committer.name magic
(Let me know if any other settings are relevant)
I'm basing the observation of file committer being used instead of magic committer on a couple of things:
Different log lines produced by the spark job seem to indicate the file output committer being used:
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:601","func":"commitTask","message":"Saved output of task 'attempt_2021...' to s3://some-bucket/some-folder/_temporary/0/
task_2021..."
"class":"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat","file_line":"ParquetFileFormat.scala:54","message":"U
sing user defined output committer for Parquet: org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:141","func":"<init>","message":"File Outpu
t Committer Algorithm version is 1"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:156","func":"<init>","message":"FileOutput
Committer skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false"
When setting the file committer's algo to an invalid number, like so:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version -7
an exception is raised from the file committer's constructor saying the value is invalid - implicating that the file committer was initialized instead of the magic committer.
I'm not seeing any logs indicating usage of the magic committer, or any failure to initialize a committer which could explain falling back on the file committer.
Spark version is 3.1.2 using this spark-hadoop-cloud JAR. Let me know if there's any other officially published JAR I can try or if there are any other log indications that may be relevant.
Any thoughts?
===== EDIT:
Below is the stack trace I see when setting the file committer algo to an invalid value. It seems that the call to org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter ends up calling org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter which in turn initializes the incorrect type org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter instead of the configured type org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
Py4JJavaError: An error occurred while calling o259.parquet.
: java.io.IOException: Only 1 or 2 algorithm version is supported
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:143)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:117)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createFileOutputCommitter(PathOutputCommitterFactory.java:134)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter(FileOutputCommitterFactory.java:35)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createCommitter(PathOutputCommitterFactory.java:201)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:88)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:49)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:177)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:874)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Mystery solved - the failure to initialize the magic committer was due to a mismatch between the committer factory scheme setting to the scheme of the actual destination URL. Consider this:
The committer factory configuration was set using the key: spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a - meaning that the setting is made for s3a protocol URLs.
While th URL sent to the write method was: s3://some-bucket/some-folder - using s3 protocol instead of s3a.
The PathOutputCommitterFactory hadoop class searches for a config key with pattern mapreduce.outputcommitter.factory.scheme.%s to recognize which factory to use for the given output URL. In case the pattern set in the config key (in this case s3a) does not match the pattern in the destination URL (in this case s3) - the committer factory setting will not be recognized and the factory type will fall back on FileOutputCommitter.
Solution - make sure the outputcommitter.factory.scheme.<protocol> setting matches the protocol in the destination URL. I've successfully tested using both s3 and s3a in the URL & config key.
this does sound like a binding problem but I cannot see immediately where it is. At a glance you have all the right settings.
The easiest way to check that an S3 a committee is being used is to look at the _SUCCESS file . If it is a piece of JSON then a new committer was used… The text inside will then tell you more about the committer.
a 0 byte file means that the classic file output committer was still used

Can you translate (or alias) s3:// to s3a:// in Spark/ Hadoop?

We have some code that we run on Amazon's servers that loads parquet using the s3:// scheme as advised by Amazon. However, some developers want to run code locally using a spark installation on Windows, but stubbornly spark insists on using the s3a:// scheme.
We can read files just fine using s3a, but we get an java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException.
SparkSession available as 'spark'.
>>> spark.read.parquet('s3a://bucket/key')
DataFrame[********************************************]
>>> spark.read.parquet('s3://bucket/key')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 316, in parquet
return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "C:\spark\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.parquet.
: java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:99)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:644)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: org.jets3t.service.S3ServiceException
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 24 more
Is there a way to get hadoop or spark or pyspark to "translate" the URI scheme from s3 to s3a via some sort of magic configuration? Changing the code is not an option we entertain as it would involve quite a lot of testing.
The local environment is windows 10, pyspark2.4.4 with hadoop2.7 (prebuilt), python3.7.5, and the right aws libs installed.
EDIT: One hack I used - since we're not supposed to use s3:// paths is to just convert them to s3a:// in pyspark.
I've added the following function in readwriter.py and just invoked it wherever there was a call out to the jvm with paths. Works fine, but would be nice if this was a config option.
def massage_paths(paths):
if isinstance(paths, basestring):
return 's3a' + x[2:] if x.startswith('s3:') else x
if isinstance(paths, list):
t = list
else:
t = tuple
return t(['s3a' + x[2:] if x.startswith('s3:') else x for x in paths])
cricket007 is correct.
spark.hadoop.fs.s3.impl org.apache.fs.s3a.S3AFileSystem
There's some code in org.apache.hadoop.FileSystem which looks up from a schema "s3" to an implementation class, loads it and instantiates it with the full URL.
Warning There's no specific code in the core S3A FS which looks for an FS schema being s3a, but you will encounter problems if you use the DynamoDB consistency layer "S3Guard" -that's probably a bit of overkill someone could fix
Ideally, you could refactor the code to detect the runtime environment, or externalize the paths to a config file that could be used in the respective areas.
Otherwise, you would need to edit the hdfs-site.xml to configure the fs.s3a.impl key to rename s3a to s3, and you might be able to keep the value the same. That change would need done for all Spark workers
You probably won't be able to configure Spark to help you "translate".
Instead, this is more like a design issue. The code should be made configurable to choose different protocol for different environment(that was what I did for a similar situation). If you insist on working locally, some code refactoring may not be avoidable...

loading parquet file into vertica database using spark

How to load a parquet file into vertica database using spark???
link (http://www.sparkexpert.com/2015/04/17/save-apache-spark-dataframe-to-database/)
I tried to load data frame(parquet files) using the above link into mysql it worked. But when i tried to load it into vertica database this is the error i am facing.The error below is because vertica db doesn’t support the datatypes(String) which is in the data frames(parquet file). I do not wanted to type cast the columns since its going to be a performance issue. we are looking to load around 280 million rows. Could you please suggest the best way to load the data into vertica db.
Exception in thread “main” java.sql.SQLSyntaxErrorException: [Vertica][VJDBC](5108) ERROR: Type “TEXT” does not exist
at com.vertica.util.ServerErrorData.buildException(Unknown Source)
at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepareImpl(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.(Unknown Source)
at com.vertica.jdbc.jdbc4.S4PreparedStatement.(Unknown Source)
at com.vertica.jdbc.VerticaJdbc4PreparedStatementImpl.(Unknown Source)
at com.vertica.jdbc.VJDBCObjectFactory.createPreparedStatement(Unknown Source)
at com.vertica.jdbc.common.SConnection.prepareStatement(Unknown Source)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
at org.apache.spark.sql.DataFrame.createJDBCTable(DataFrame.scala:1611)
at com.sparkread.SparkVertica.JdbctoVertica.main(JdbctoVertica.java:51)
Caused by: com.vertica.support.exceptions.SyntaxErrorException: [Vertica][VJDBC](5108) ERROR: Type “TEXT” does not exist
… 13 more
Since you are getting the error on the createJDBCTable, you could just create the table yourself and use insertIntoJDBC instead.
Another idea would be to try and set spark.sql.dialect to Postgres since I noticed registerDialect(PostgresDialect) in spark. That said, I don't know how to do this other than to use jdbc:postgresql, but if you use that driver you would not get any advantage of a optimal insert that Vertica's JDBC driver would give you. You might need to modify here to allow it to use that dialect for jdbc:vertica. If for some reason that doesn't work you'd need to add in a new dialect.
Personally I think the first option is simpler.
When the Vertica table exists with the same column names as the dataFrame (and the corresponding types, VARCHAR) the following has worked for me (while keeping vertica's jdbc):
myDataFrame.write().mode(SaveMode.Append).jdbc(url, "MY_VERTICA_TABLE", new Properties());

Pig into Cassandra - pass list objects using python UDF and CqlStorage

I am working on a dataflow including some aggregating steps in Pig and storing steps into Cassandra. I have been able to pass relatively simple data types such as integer, long or dates but can't find how to pass some sort of list, set or tuple from Pig to Cassandra using CqlStorage.
I use Pig 0.9.2, so I can't use the FLATTEN methods.
Question
How do I fill a Cassandra table carrying complex data types such as sets or lists from Pig 0.9.2?
Overview of my specific application:
I created the corresponding Cassandra table respecting the description:
CREATE TABLE mycassandracf (
my_id int,
date timestamp,
my_count bigint,
grouped_ids list<bigint>,
PRIMARY KEY (my_id, date));
and a STORE instruction carrying prepared statement:
STORE CassandraAggregate
INTO 'cql://test/mycassandracf?output_query=UPDATE+test.mycassandracf+set+my_count+%3D+%3F%2C+grouped_ids+%3D+%3F'
USING CqlStorage;
From a 'GROUP BY' relation, I 'generate' a relation in a cql-friendly format (e.g. in tuples), that I want to store into Cassandra.
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), $1.grouped_id);
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,{(60128490006325819),(62726281032786005)}))
(((my_id,30165),(date,1357084800000)),(1,{(60128411174143024)}))
(((my_id,30376),(date,1357084800000)),(4,{(60128411146211875),(63645100121476995),(60128411146211875),(63645100121476995)}))
Unsurprisingly, using the STORE instruction on this relation raises the exception:
java.lang.ClassCastException: org.apache.pig.data.DefaultDataBag cannot be cast to org.apache.pig.data.DataByteArray
I thus add a UDF written in python to apply some flattening on the grouped_id bag:
#outputSchema("flat_bag:bag{}")
def flattenBag(bag):
return tuple([long(item) for tup in bag for item in tup])
I use tuple because using python sets as well as python lists ends up in getting casting errors.
Adding it to my pipeline, I have:
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,(60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(60128411146211875,63645100121476995,6012841114621187563645100121476995)))
Using the STORE instruction on this last relation raises the exception with error stack:
java.io.IOException: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:428)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:408)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:262)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Caused by: java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:248)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1724)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1709)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:232)
I tested the exact same workflow with simple data types and is working perfectly. What I am really looking for is the way to fill a cassandra table with complex types such as sets or lists from Pig.
Many thanks
After further investigation, I found the solution here:
https://issues.apache.org/jira/browse/CASSANDRA-5867
Basically, CqlStorage supports complex types. For that, the type should represented by a tuple in the tuples, carrying as first element the very data type as a string. For list, this is how one does this:
# python
#outputSchema("flat_bag:bag{}")
def flattenBag(bag):
return ('list',) + tuple([long(item) for tup in bag for item in tup])
Thus, in grunt:
# pig
CassandraAggregate = FOREACH GroupedRelation
GENERATE TOTUPLE(TOTUPLE('my_id', $0.my_id),
TOTUPLE('date', ISOToUnix($0.createdAt))),
TOTUPLE(COUNT($1), py_f.flattenBag($1.grouped_id));
DUMP CassandraAggregate;
(((my_id,30021),(date,1357084800000)),(2,(list, 60128490006325819,62726281032786005)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411174143024)))
(((my_id,31120),(date,1357084800000)),(1,(list, 60128411146211875,63645100121476995,6012841114621187563645100121476995)))
This is then stored into cassandra using classic encoded prepared statement.
Hope this will of some help.

Resources