I have a Hive external table stored as TEXTFILE:
hive> SHOW CREATE TABLE customers;
CREATE EXTERNAL TABLE `customers`(
`id` int,
`name` string,
`age` int,
`address` string,
`salary` double)
COMMENT 'Customer Details'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'=',','line.delim'='\n','serialization.format'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/shopping.db/customers'
TBLPROPERTIES ('skip.header.line.count'='1','transient_lastDdlTime'='1605818870')
and a Spark session as follows:
val warehouseLocation: String = "hdfs://localhost:9000/user/hive/warehouse/"
val spark = SparkSession.builder()
  .appName("Spark Extract")
  .enableHiveSupport()
  .master("local[*]")
  .getOrCreate()
and another line that extracts a table from Hive using:
val df: DataFrame = spark.sql("SELECT * FROM database.table")
Extracting the table with spark.sql("QUERY") gives the following exception:
java.lang.RuntimeException: hdfs://localhost:9000/user/hive/warehouse/shopping.db/customers/customers.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [46, 48, 48, 10]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:445)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:404)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:345)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried various approaches, including:
hive> alter table customers SET FILEFORMAT PARQUET;
Please assist.
I tried various approaches, including:
hive> alter table customers SET FILEFORMAT PARQUET;
You need to either remove the customers.txt file from the table location, since it is not Parquet, or redefine the table with a text file format so that the metadata matches the data.
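In other words, the ALTER TABLE ... SET FILEFORMAT PARQUET you ran tells the metastore the data is Parquet while the file on HDFS is still a comma-delimited text file. The durable fix is to put the table definition back to TEXTFILE in Hive (or drop and re-create it with the original DDL shown above). As a minimal sketch of a Spark-side workaround, you can also bypass the mismatched table metadata and read the underlying text file directly; the path, delimiter, header skip and column types below are copied from the SHOW CREATE TABLE output, nothing new is assumed beyond that:

import org.apache.spark.sql.types._

// Sketch only: read the raw CSV text file behind the table, mirroring the
// delimiter (','), header skip (skip.header.line.count = 1) and columns from the DDL above.
val customersSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("address", StringType),
  StructField("salary", DoubleType)))

val customersDf = spark.read
  .option("sep", ",")
  .option("header", "true")
  .schema(customersSchema)
  .csv("hdfs://localhost:9000/user/hive/warehouse/shopping.db/customers")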
Related
I have a partitioned data structure on S3, as below, which stores Parquet files:
date=100000000000
date=111620200621
date=111620202258
The S3 key will look like s3://bucket-name/master/date={a numeric value}
I am reading the data in Spark code as below:
Dataset<Row> df = spark.read().parquet("s3://bucket-name/master/");
df.createOrReplaceTempView("master"); // This will lead to duplicates as NUM_VALUE can be repetitive in each S3 partition
The Spark DataFrame looks like below, with duplicate NUM_VALUE:
NAME   date          NUM_VALUE
name1  100000000000  1
name2  111620200621  2
name3  111620202258  2
Expected unique output:
NAME   date          NUM_VALUE
name1  100000000000  1
name3  111620202258  2
I am trying to get the unique latest data as below:
Dataset<Row> finalDf = spark.sql("SELECT NAME, date, NUM_VALUE FROM (SELECT rank() OVER (PARTITION BY NAME ORDER BY date DESC) rank, * FROM master) temp WHERE (rank = 1)");
finalDf.show();
But I am getting the below error when the above query is invoked:
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 40, date), LongType) AS date#472L
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:292)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:594)
at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:594)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of bigint
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_20$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
You can keep reading the data the way you already are:
Dataset<Row> df = spark.read().parquet("s3://bucket-name/master/")
To find duplicates, use groupBy() along with count() (which returns the count of rows for each group); filtering with gt(1) will get you all rows that have duplicates.
val dups = df.groupBy("NAME", "date", "NUM_VALUE").count.filter(col("count").gt(1))
println(dups.count() + " rows in dups")
And if you want to remove duplicates, use dropDuplicates:
val uniqueRowsDF = df.dropDuplicates("NAME", "date", "NUM_VALUE")
println(uniqueRowsDF.count() + " rows after dropDuplicates")
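Note that dropDuplicates keeps an arbitrary row per key, so if you specifically want the latest row per NAME (as in the expected output above), a window function is closer to the rank() query in the question. A minimal sketch, assuming df is the DataFrame read from s3://bucket-name/master/ and that the date column is comparable as stored:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep only the most recent row per NAME; row_number avoids keeping ties.
val byNameLatest = Window.partitionBy("NAME").orderBy(col("date").desc)

val latestPerName = df
  .withColumn("rn", row_number().over(byNameLatest))
  .filter(col("rn") === 1)
  .drop("rn")

latestPerName.show()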
I want to schedule a Databricks notebook that merges small ORC files into one bigger ORC file on a daily basis for a particular Hive table. I'm looking to implement this using Spark, but am currently stuck with the error shown below.
My Databricks runtime: 6.3 (includes Apache Spark 2.4.4, Scala 2.11)
Any pointers would be great.
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
val spark = SparkSession.builder().appName("Hive small ORC files merge").enableHiveSupport().getOrCreate()
spark.sql("ALTER TABLE [TABLE_NAME] CONCATENATE")
Error:
Operation not allowed: ALTER TABLE CONCATENATE (line 1, pos 0)

== SQL ==
ALTER TABLE CONCATENATE
^^^
at org.apache.spark.sql.catalyst.parser.ParserUtils$.operationNotAllowed(ParserUtils.scala:43)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1135)
at org.apache.spark.sql.execution.SparkSqlAstBuilder$$anonfun$visitFailNativeCommand$1.apply(SparkSqlParser.scala:1126)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:110)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:1126)
at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitFailNativeCommand(SparkSqlParser.scala:62)
at org.apache.spark.sql.catalyst.parser.SqlBaseParser$FailNativeCommandContext.accept(SqlBaseParser.java:831)
at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:74)
at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:74)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:110)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:73)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:70)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:69)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:100)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:55)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:69)
at com.databricks.sql.parser.DatabricksSqlParser$$anonfun$parsePlan$1.apply(DatabricksSqlParser.scala:64)
at com.databricks.sql.parser.DatabricksSqlParser$$anonfun$parsePlan$1.apply(DatabricksSqlParser.scala:61)
at com.databricks.sql.parser.DatabricksSqlParser.parse(DatabricksSqlParser.scala:84)
at com.databricks.sql.parser.DatabricksSqlParser.parsePlan(DatabricksSqlParser.scala:61)
at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:694)
at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:694)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:693)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:716)
at line341589a136f246f788b6b288061c96ae31.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-4297307810790143:4)
at line341589a136f246f788b6b288061c96ae31.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-4297307810790143:50)
at line341589a136f246f788b6b288061c96ae31.$read$$iw$$iw$$iw$$iw.<init>(command-4297307810790143:52)
at line341589a136f246f788b6b288061c96ae31.$read$$iw$$iw$$iw.<init>(command-4297307810790143:54)
at line341589a136f246f788b6b288061c96ae31.$read$$iw$$iw.<init>(command-4297307810790143:56)
at line341589a136f246f788b6b288061c96ae31.$read$$iw.<init>(command-4297307810790143:58)
at line341589a136f246f788b6b288061c96ae31.$read.<init>(command-4297307810790143:60)
at line341589a136f246f788b6b288061c96ae31.$read$.<init>(command-4297307810790143:64)
at line341589a136f246f788b6b288061c96ae31.$read$.<clinit>(command-4297307810790143)
at line341589a136f246f788b6b288061c96ae31.$eval$.$print$lzycompute(<notebook>:7)
at line341589a136f246f788b6b288061c96ae31.$eval$.$print(<notebook>:6)
at line341589a136f246f788b6b288061c96ae31.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:699)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:652)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:385)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:362)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:251)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:246)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:288)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:362)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
As per the Spark source code, the ALTER TABLE ... CONCATENATE option is not implemented or supported as of now. Please check the links below for more information.
Spark SQL Parser
Spark Unsupported Hive Native Commands
This command works from Hive only:
ALTER TABLE <table_name> CONCATENATE;
It does not work from Spark yet.
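Since CONCATENATE cannot be issued through Spark, one possible workaround (a sketch only, with placeholder names you would adapt) is to compact the small ORC files by rewriting the table's data from Spark itself:

// Read the existing table and rewrite its data as fewer, larger ORC files.
// db_name.table_name and the partition count of 8 are placeholders.
val src = spark.table("db_name.table_name")

src.repartition(8)                       // aim for a handful of large files
  .write
  .mode("overwrite")
  .format("orc")
  .saveAsTable("db_name.table_name_compacted")

// Spark cannot overwrite a table while reading from it in the same job,
// so the compacted copy would be validated and then swapped in for the
// original (e.g. via DROP/RENAME from Hive).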
I created Avro data files using Spark 2 and then defined a Hive table pointing to the Avro data files.
val trades = spark.read.option("compression", "gzip").csv("file:///data/nyse_all/nyse_data")
  .select(
    $"_c0".as("stockticker"),
    $"_c1".as("tradedate").cast(IntegerType),
    $"_c2".as("openprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c3".as("highprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c4".as("lowprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c5".as("closeprice").cast(DataTypes.createDecimalType(10, 2)),
    $"_c6".as("volume").cast(LongType))

trades.repartition(4, $"tradedate", $"volume")
  .sortWithinPartitions($"tradedate".asc, $"volume".desc)
  .write.format("com.databricks.spark.avro")
  .save("/user/pawinder/spark_practice/problem6/data/nyse_data_avro")

spark.sql("""create external table pawinder.nyse_data_avro(
  stockticker string, tradedate int, openprice decimal(10,2), highprice decimal(10,2),
  lowprice decimal(10,2), closeprice decimal(10,2), volume bigint)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  location '/user/pawinder/spark_practice/problem6/data/nyse_data_avro'""")
Querying the hive table fails with the following error:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable#178270b2
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable org.apache.hadoop.hive.serde2.avro.AvroGenericRecordWritable#178270b2
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:563)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
... 8 more
Caused by: org.apache.avro.AvroTypeException: Found string, expecting union
On debugging, I found that the columns defined as Decimal(10,2) are marked as string in the Avro data files:
[pawinder#gw02 ~]$ hdfs dfs -cat /user/pawinder/spark_practice/problem6/data/nyse_data_avro/part-00003-f1ca3b0a-f0b4-4aa8-bc26-ca50a0a16fe3-c000.avro |more
Objavro.schema {"type":"record","name":"topLevelRecord","fields":[{"name":"stockticker","type":["string","null"]},{"name":"tradedate","type":["int","null"]},{"name":"openprice","type":["string","null"]},{"name":"highprice","type":["string","null"]},{"name":"lowprice","type":["string","null"]},{"name":"closeprice","type":["string","null"]},{"name":"volume","type":["long","null"]}]}
I am able to query the same Hive table in spark-shell. Is Spark SQL's DecimalType not recognised by the Avro SerDe? I am using Spark 2.3.
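The schema dump above shows that this spark-avro writer stored the Decimal(10,2) columns as plain string, while the Hive DDL declares them as decimal(10,2), so the SerDe finds a string where it expects a decimal. One hedged workaround, a sketch only, is to make the two schemas agree by writing the price columns as double (and declaring them DOUBLE in the Hive table); the output path suffix below is made up for illustration:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast the price columns from DecimalType to DoubleType before writing,
// so the Avro schema carries a numeric type the Hive table can be matched to.
val priceCols = Seq("openprice", "highprice", "lowprice", "closeprice")
val tradesAsDouble = priceCols.foldLeft(trades) { (df, c) =>
  df.withColumn(c, col(c).cast(DoubleType))
}

tradesAsDouble.write
  .format("com.databricks.spark.avro")
  .save("/user/pawinder/spark_practice/problem6/data/nyse_data_avro_double")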
I am reading a table from MapR DB with Spark, but the timestamp column is inferred as InvalidType. There is also no option to set a schema when you read data from MapR DB.
root
|-- Name: string (nullable = true)
|-- dt: struct (nullable = true)
| |-- InvalidType: string (nullable = true)
I tried to cast the column to timestamp, but got the below exception.
val df = spark.loadFromMapRDB("path")
df.withColumn("dt1", $"dt" ("InvalidType").cast(TimestampType))
.drop("dt")
df.show(5, false)
com.mapr.db.spark.exceptions.SchemaMappingException: Schema cannot be inferred for the column {dt}
at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertField(MapRSqlUtils.scala:250)
at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertObject(MapRSqlUtils.scala:64)
at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertRootField(MapRSqlUtils.scala:48)
at com.mapr.db.spark.sql.utils.MapRSqlUtils$$anonfun$documentsToRow$1.apply(MapRSqlUtils.scala:34)
at com.mapr.db.spark.sql.utils.MapRSqlUtils$$anonfun$documentsToRow$1.apply(MapRSqlUtils.scala:33)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Any help will be appreciated.
If you know the schema of the table, you can create your own case class defining that schema and then load the table using the case class.
Go through this link: Loading Data from MapR Database as an Apache Spark Dataset.
Also check the table in MapR DB to see whether that particular column actually has a valid schema.
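A rough sketch of the case-class approach described above (the field names and the timestamp type are assumptions, and the exact connector API can differ by MapR connector version, so treat this as an outline rather than the definitive call):

import com.mapr.db.spark.sql._            // provides loadFromMapRDB on SparkSession
import spark.implicits._

// Assumed shape of the documents; adjust names and types to match the real table.
case class Record(Name: String, dt: java.sql.Timestamp)

val ds = spark.loadFromMapRDB[Record]("path").as[Record]
ds.show(5, false)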
First, create a dataset with the spark-sql command:
spark.sql("select id ,a.userid,regexp_replace(b.tradeno,',','|') as TradeNo
,Amount ,TradeType ,TxTypeId
,regexp_replace(title,',','|') as title
,status ,tradetime ,TradeStatus
,regexp_replace(otherside,',','') as otherside
from
(
select userid
from tableA
where daykey='2018-10-30'
group by userid
) a
left join tableb b
on a.userid=b.userid
where b.userid is not null")
The result is:
dataset: org.apache.spark.sql.DataFrame = [id: bigint, userid: int ... 9 more fields]
Then, export the dataset as CSV with the command:
dataset.coalesce(40).write.option("delimiter", ",").option("charset", "utf-8").csv("/binlog_test/mycsv.excel")
As the Spark task runs, the following error occurs:
Driver stacktrace:
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1430)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1417)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:797)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1645)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1600)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1589)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1943)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 69 more
Caused by: java.lang.IllegalArgumentException: Field "id" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:289)
at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:96)
at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcFileFormat.scala:308)
at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:140)
at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
But when I execute the join directly in Hive, create a new table with the join result, and then export that table's data with the spark-sql command above, everything works fine.
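A small debugging sketch (paths are placeholders): the "Field \"id\" does not exist" error is thrown while Spark prunes columns in the underlying ORC files of the source table, so it is worth comparing the column names the metastore reports with the names actually stored inside the ORC files. One common cause is Hive-written ORC files that only carry positional names such as _col0, _col1, ..., which would also explain why routing the data through a new Hive-created table works.

// Compare the metastore schema with the physical ORC file schema.
spark.table("tableb").printSchema()

// Placeholder path: point this at tableb's actual warehouse location.
spark.read.orc("/path/to/warehouse/tableb").printSchema()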