Custom delimited text file from dataframe - apache-spark

I'm using Spark 1.6 and trying to create a delimited file from the dataframe.
The field delimiter is '|^', so I'm concatenating the columns from the dataframe while selecting from the temp table.
Now the code below fails every time with this error:
ERROR scheduler.TaskSetManager: Task 172 in stage 9.0 failed 4 times; aborting job
19/03/01 09:10:15 ERROR datasources.InsertIntoHadoopFsRelation: Aborting job.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 172 in stage 9.0 failed 4 times, most recent failure: Lost task 172.3 in stage 9.0 (TID 1397, tplhc01d104.iuser.iroot.adidom.com, executor 7): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272
The code I'm using is this:
tempDF.registerTempTable("BNUC_TEMP")
context.sql("select concat('VALID','|^', RECORD_ID,'|^', DATA_COL1,'|^', DATA_COL2,'|^','P','|^', DATA_COL4,'|^', DATA_COL5,'|^', DATA_COL6,'GBP','|^',from_unixtime(unix_timestamp( ACTION_DATE)),'|^',from_unixtime(unix_timestamp( UPDATED_DATE))) from BNUC_TEMP")
.write.mode("overwrite")
.text("/user/USERNAME/landing/staging/BNU/temp/")

Related

Pyspark failed to save df to S3

I want to save a PySpark dataframe of ~14 million rows into 6 different files.
After cleaning the data:
clean_data.repartition(6).write.option("sep", "\t").option("header", "true").csv("s3_path", mode="overwrite")
I got this error:
An error was encountered:
An error occurred while calling o258.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:195)
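The visible trace stops at the generic "Job aborted" wrapper; the real cause is usually further down, in the "Caused by:" lines or in the failed task's executor log, so that is the first place to look. Independently of the error, here is a hedged sketch of the same write that still produces at most 6 files but avoids the full shuffle of repartition by using coalesce (clean_data and the S3 path are stand-ins for the question's own objects):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the question's clean_data DataFrame.
clean_data = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(clean_data
    .coalesce(6)                  # at most 6 output partitions -> at most 6 files, no full shuffle
    .write
    .option("sep", "\t")
    .option("header", "true")
    .mode("overwrite")
    .csv("s3://my-bucket/path/"))  # placeholder path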

Spark 2.4 to Spark 3.0 DateTime question

I am on a new environment upgraded from Spark 2.4 to Spark 3.0, and I am receiving these errors:
ERROR 1
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd hh:mm:ss aa' pattern in the DateTimeFormatter
Lines causing this –
from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa')+ (timezoneoffset*60),'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
ERROR 2
DataSource.Error: ODBC: ERROR [42000] [Microsoft][Hardy] (80) Syntax or semantic analysis error thrown in server while executing query. Error message from server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 94.0 failed 4 times, most recent failure: Lost task 0.3 in stage 94.0 (TID 1203) (10.139.64.43 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse ' 01/19/2022' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:150)
at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:141)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:86)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
Lines causing this –
cast ( to_date ( TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp) as ttvalidfrom
My code is Python with SQL in the middle of it.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
query_view_create = '''
CREATE OR REPLACE VIEW {0}.{1} as
SELECT
customername
,
cast ( to_date ( TT_VALID_FROM_TEXT, 'MM/dd/yyyy') as timestamp)
as ttvalidfrom
, from_unixtime(unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss aa')+ (timezoneoffset*60),'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
from {0}.{2}
'''.format(DATABASE_NAME,VIEW_NAME_10,TABLE_NAME_12,ENVIRONMENT)
print(query_view_create)
The timeParserPolicy setting was added to fix datetime issues we see when using Spark 3.0 with Power BI that don't appear in Spark 2.4.
spark.sql(query_view_create)
The error still comes from Power BI when I import the table. I'm not sure what I can do to make this work and not display these errors.
@James Khan, thanks for finding the source of the problem. Posting your discussion as an answer to help other community members.
To set the legacy timeParserPolicy, the code below may work.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
OR
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
If you are still getting the same error after this, please check this similar SO thread.
Reference:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/parameters/legacy_time_parser_policy
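Because the view's SQL is re-evaluated by whichever session Power BI connects through, a session-level spark.conf.set may not reach it, which would explain why the error persists in Power BI. A hedged alternative is to keep the default CORRECTED behaviour and fix the patterns themselves: 'aa' becomes 'a' (the Spark 3.0 pattern letter for AM/PM) and the leading space in values like ' 01/19/2022' is trimmed before to_date(). The sketch below mirrors the question's view definition; the database, view, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholders for the question's DATABASE_NAME, VIEW_NAME_10 and TABLE_NAME_12.
DATABASE_NAME = "my_db"
VIEW_NAME_10 = "my_view"
TABLE_NAME_12 = "my_table"

query_view_create = '''
CREATE OR REPLACE VIEW {0}.{1} AS
SELECT
    customername,
    cast(to_date(trim(TT_VALID_FROM_TEXT), 'MM/dd/yyyy') as timestamp) as ttvalidfrom,
    from_unixtime(
        unix_timestamp(powerappcapturetime_local, 'yyyy-MM-dd hh:mm:ss a')
        + (timezoneoffset * 60),
        'yyyy-MM-dd HH:mm:ss') as powerappcapturetime
FROM {0}.{2}
'''.format(DATABASE_NAME, VIEW_NAME_10, TABLE_NAME_12)

spark.sql(query_view_create)
If the stored data itself still fails to parse under CORRECTED, setting the policy at the cluster level (so that the session Power BI uses also sees it) is another option to consider.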

Pyspark mllib + count or collect method throws ArrayIndexOutOfBounds exception

I'm learning PySpark and MLlib.
After predicting the test data using an RF model, I'm assigning the result to a variable called 'predictions', which is an RDD.
If I call predictions.count() or predictions.collect(), it fails with the following exception.
Can you please share your thoughts? I've already spent quite some time on this but haven't found what is missing.
predictions = predict(training_data, test_data)
File "/mp5/part_d_poc.py", line 36, in predict
print(predictions.count())
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 28, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 7
I constructed the training data in the following way.
raw_training_data.map(lambda row: LabeledPoint(row.split(',')[-1], Vectors.dense(row.split(',')[0:-1])))
It seems this error is caused by a mismatch between the schema and the data. Please refer to these:
ArrayIndexOutOfBoundsException with Spark, Spark-Avro and Google Analytics Data
https://github.com/Azure/spark-cdm-connector/issues/46#issuecomment-717543025
https://forums.couchbase.com/t/arrayindexoutofboundsexception/10311/3
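To make that mismatch concrete: ArrayIndexOutOfBoundsException typically surfaces here when some input lines do not split into the expected number of fields, and the failure only shows up once count() or collect() forces evaluation. A minimal pre-check sketch, assuming comma-separated text with the label in the last column as in the question (the HDFS path is a placeholder):
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="labeled-point-check")
raw_training_data = sc.textFile("hdfs:///path/to/training.csv")  # placeholder path

# Use the first line to decide how many fields a well-formed row should have.
expected_fields = len(raw_training_data.first().split(","))

# Rows with a different field count are the ones that break indexing later.
bad_rows = raw_training_data.filter(lambda row: len(row.split(",")) != expected_fields)
print("malformed rows: %d" % bad_rows.count())

# Build LabeledPoints only from well-formed rows, converting every field explicitly to float.
clean_rows = raw_training_data.filter(lambda row: len(row.split(",")) == expected_fields)
training_points = clean_rows.map(lambda row: LabeledPoint(
    float(row.split(",")[-1]),
    Vectors.dense([float(x) for x in row.split(",")[:-1]])))
print(training_points.count())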

PySpark groupBy count fails with show method

I have a problem with my df, running Spark 2.1.0. It has several string columns created via an SQL query from a Hive DB, and it gives this .summary():
DataFrame[summary: string, visitorid: string, eventtype: string, ..., target: string].
If I only run df.groupBy("eventtype").count(), it works and I get DataFrame[eventtype: string, count: bigint]
When I run it with show, df.groupBy('eventtype').count().show(), I keep getting:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-9040214714346906648.py", line 265, in <module>
exec(code)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4636.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
I have no clue what is wrong with the show method (none of the other columns work either, not even the column target which I created). The cluster admin could not help me either.
Many thanks for any pointers.
There is a known issue when your DataFrame contains a limit; if that is your case, you probably ran into https://issues.apache.org/jira/browse/SPARK-18528
That means you must upgrade your Spark version to 2.1.1, or you can use repartition as a workaround to avoid this problem.
As @AssafMendelson said, count() only creates a new DataFrame; it doesn't start the calculation. Performing show (or, for example, head) will start the calculation.
If the Jira ticket and the upgrade don't help you, please post the worker logs.
When you run
df.groupBy("eventtype").count()
you are actually defining a lazy transformation describing HOW to calculate the result. This returns a new dataframe almost immediately, regardless of the data size. When you call show you are performing an action; this is when the actual calculation begins.
If you look at the bottom of your error log:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 633.0 failed 4 times, most recent failure: Lost task 0.3 in stage 633.0 (TID 19944, ip-172-31-28-173.eu-west-1.compute.internal, executor 440): java.lang.NullPointerException
You can see that one of the tasks failed due to a null pointer exception. I would check the definition of df to see what happened before (maybe even check whether simply doing df.count() causes the exception).
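A hedged way to combine the two answers when debugging: first force an action on df itself to see whether the NullPointerException is really specific to the aggregation, then try the repartition workaround mentioned above for SPARK-18528 (relevant when the DataFrame was built with a limit on Spark < 2.1.1). The table name below is a placeholder for the question's Hive query, and 10 is an arbitrary partition count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder for the Hive query that builds df in the question.
df = spark.sql("SELECT * FROM some_hive_table")

# If this already fails, the NullPointerException comes from how df is built,
# not from the groupBy/count.
print(df.count())

# SPARK-18528 workaround: repartition before the aggregation.
counts = df.repartition(10).groupBy("eventtype").count()
counts.show()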

Reading multiple avro files into RDD from a nested directory structure

Suppose I have a directory which contains a bunch of Avro files and I want to read them all in one shot. This code works fine:
val path = "hdfs:///path/to/your/avro/folder"
val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
However, if the folder contains subfolders and the Avro files are in the subfolders, then I get an error:
5/10/30 14:57:47 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 6,
hadoop1): java.io.FileNotFoundException: Path is not a file: /folder/subfolder
Is there any way I can read all the Avro files (even in subdirectories) into an RDD?
All the Avro files have the same schema, and I am on Spark 1.3.0.
Edit:
Based on the suggestion below, I executed this line in my Spark shell:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
and this solved the problem... but now my code is very, very slow, and I don't understand what a MapReduce setting has to do with Spark.
