Read Excel in Spark error: InputStream of class ZipArchiveInputStream is not implementing InputStreamStatistics

I am trying to read Excel files from COS via Spark, like this:
def readExcelData(filePath: String, spark: SparkSession): DataFrame =
  spark.read
    .format("com.crealytics.spark.excel")
    .option("path", filePath)
    .option("useHeader", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "False")
    .option("addColorColumns", "False")
    .load()
def readAllFiles: DataFrame = {
  val objLst = ... // contains the list of file paths
  val schema = StructType(
    StructField("col1", StringType, true) ::
      StructField("col2", StringType, true) ::
      StructField("col3", StringType, true) ::
      StructField("col4", StringType, true) :: Nil
  )
  var initialDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  for (file <- objLst) {
    initialDF = initialDF.union(
      readExcelData(file, spark).select($"col1", $"col2", $"col3", $"col4"))
  }
  initialDF
}
In this code, I am creating an empty DataFrame first, then reading all the Excel files (by iterating over the file paths) and merging the data via a union operation.
It is throwing an error like this:
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
The spark-excel version is 0.10.2.

Try removing the .show() from your original statement and converting to a DataFrame first:
def readExcel(file: String): DataFrame = spark.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "False")
  .option("addColorColumns", "False")
  .load(file)
val data = readExcel("path to your excel file")
data.show()
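Applied to the multi-file case in the question, a minimal sketch (assuming objLst is a non-empty Seq[String] of COS paths and that every file has the same four columns) would build all the per-file DataFrames first and only call .show() or a write on the final union:

import org.apache.spark.sql.DataFrame

// Read each path with the readExcelData helper from the question, select the
// shared columns, and fold the per-file DataFrames into one with union.
// objLst is assumed to be a non-empty Seq[String] of file paths.
def readAllExcel(objLst: Seq[String]): DataFrame =
  objLst
    .map(path => readExcelData(path, spark).select("col1", "col2", "col3", "col4"))
    .reduce(_ union _)

// Materialise only once everything is a single DataFrame:
// readAllExcel(objLst).show()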

Related

How to extract urls from HYPERLINKS in excel file when reading into scala spark dataframe

I have an Excel file with Column A containing HYPERLINKS like this:
=HYPERLINK("https://google.com","View Link")
I can load the Excel file into a Scala Spark DataFrame using the com.crealytics.spark.excel library, but only with the 'View Link' text, which does NOT contain the URL.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object Tut {
  def main(args: Array[String]): Unit = {
    println("started")
    val spark = SparkSession
      .builder()
      .appName("MySpark")
      .config("spark.master", "local")
      .getOrCreate()

    val customSchema = StructType(Array(
      StructField("A", StringType, nullable = false),
      StructField("B", IntegerType, nullable = false)))

    val df = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true").schema(customSchema)
      .option("dataAddress", "A1")
      .load("/MY_PATH/src/main/resources/SampFile.xlsx")

    df.printSchema()
    df.show()
  }
}
My goal is to load the entire content of the HYPERLINK as a string:
=HYPERLINK("https://google.com","View Link")
and then extract the url
https://google.com.
Do you know if there is a way to do this using the com.crealytics.spark.excel library or any other Spark library? Thanks in advance!
About the other question link you provided in the comments: they're trying to read the column as BinaryType and cast it directly to StringType. Such a thing is not possible (even in Scala itself), since you need to know how to read the bytes and represent them as a human-readable string, for instance the encoding, etc.
Now we know that we need to define a custom approach. I used a sample in-code DataFrame, and this approach worked:
scala> import spark.implicits._
import spark.implicits._
scala> val df = Seq(
| ("ddd".getBytes, 1)
| ).toDF("A", "B")
df: org.apache.spark.sql.DataFrame = [A: binary, B: int]
scala> val btos: Array[Byte] => String = bytes => new String(bytes) // short for bytes-to-string
btos: Array[Byte] => String = $Lambda$2322/665683021#738f6e44
scala> spark.udf.register("btos", btos)
res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2322/665683021#738f6e44,StringType,List(Some(class[value[0]: binary])),Some(btos),true,true)
scala> df.withColumn("C", expr("btos(A)")).show
+----------+---+---+
| A| B| C|
+----------+---+---+
|[64 64 64]| 1|ddd|
+----------+---+---+
Hope this works for you.
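If the formula text does end up in the DataFrame as a plain string (the question's stated goal; whether spark-excel returns the formula or the evaluated value depends on version and options, so treat this as an assumption), the URL can be pulled out with regexp_extract. A minimal sketch, with df as loaded in the question:

import org.apache.spark.sql.functions.{col, regexp_extract}

// Assumes column "A" holds the raw formula text, e.g.
//   =HYPERLINK("https://google.com","View Link")
// The regex captures the first quoted argument of HYPERLINK as the URL.
val withUrl = df.withColumn(
  "url",
  regexp_extract(col("A"), "HYPERLINK\\(\"([^\"]+)\"", 1)
)
withUrl.select("A", "url").show(false)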

spark structured streaming batch data refresh issue (partition by clause)

I came across a problem while joining a Spark Structured Streaming data frame with a batch data frame. My scenario: I have an S3 stream which needs to do a left anti join with history data, which returns records not present in history (i.e. figures out new records), and I write these records to history as a new append (partitioned by columns on disk, not in memory).
When I refresh my history data frame, which is partitioned, it doesn't get updated.
Below are two code snippets: one which works, the other which doesn't.
The only difference between the working and the non-working code is the partitionBy clause.
Working Code:- (history gets refreshed)
import spark.implicits._

val inputSchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)
val historySchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)

val source = spark.readStream
  .schema(inputSchema)
  .option("header", "false")
  .csv("src/main/resources/Input/")

val history = spark.read
  .schema(inputSchema)
  .option("header", "true")
  .csv("src/main/resources/history/")
  .withColumnRenamed("spark_id", "spark_id_2")
  .withColumnRenamed("account_id", "account_id_2")
  .withColumnRenamed("run_dt", "run_dt_2")
  .withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
  .withColumnRenamed("trxn_dt", "trxn_dt_2")
  .withColumnRenamed("trxn_amt", "trxn_amt_2")

val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")

val recordsNotPresentInHist = source
  .join(
    history,
    source.col("account_id") === history.col("account_id_2") &&
      source.col("run_dt") === history.col("run_dt_2") &&
      source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
      source.col("trxn_dt") === history.col("trxn_dt_2") &&
      source.col("trxn_amt") === history.col("trxn_amt_2"),
    "leftanti"
  )

recordsNotPresentInHist.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode(SaveMode.Append)
      //.partitionBy("spark_id", "account_id", "run_dt")
      .csv("src/main/resources/history/")
    val lkpChacheFileDf1 = spark.read
      .schema(inputSchema)
      .parquet("src/main/resources/history")
    val lkpChacheFileDf = lkpChacheFileDf1
    lkpChacheFileDf.unpersist(true)
    val histLkpPersist = lkpChacheFileDf.persist()
    histLkpPersist.createOrReplaceTempView("hist")
  }
  .start()

println("This is the kafka dataset:")
source
  .withColumn("Input", lit("Input-source"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

recordsNotPresentInHist
  .withColumn("reject", lit("recordsNotPresentInHist"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()
Doesn't Work:- (history is not getting refreshed)
import spark.implicits._

val inputSchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)
val historySchema = StructType(
  Array(
    StructField("spark_id", StringType),
    StructField("account_id", StringType),
    StructField("run_dt", StringType),
    StructField("trxn_ref_id", StringType),
    StructField("trxn_dt", StringType),
    StructField("trxn_amt", StringType)
  )
)

val source = spark.readStream
  .schema(inputSchema)
  .option("header", "false")
  .csv("src/main/resources/Input/")

val history = spark.read
  .schema(inputSchema)
  .option("header", "true")
  .csv("src/main/resources/history/")
  .withColumnRenamed("spark_id", "spark_id_2")
  .withColumnRenamed("account_id", "account_id_2")
  .withColumnRenamed("run_dt", "run_dt_2")
  .withColumnRenamed("trxn_ref_id", "trxn_ref_id_2")
  .withColumnRenamed("trxn_dt", "trxn_dt_2")
  .withColumnRenamed("trxn_amt", "trxn_amt_2")

val readFilePersisted = history.persist()
readFilePersisted.createOrReplaceTempView("hist")

val recordsNotPresentInHist = source
  .join(
    history,
    source.col("account_id") === history.col("account_id_2") &&
      source.col("run_dt") === history.col("run_dt_2") &&
      source.col("trxn_ref_id") === history.col("trxn_ref_id_2") &&
      source.col("trxn_dt") === history.col("trxn_dt_2") &&
      source.col("trxn_amt") === history.col("trxn_amt_2"),
    "leftanti"
  )

recordsNotPresentInHist.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode(SaveMode.Append)
      .partitionBy("spark_id", "account_id", "run_dt")
      .csv("src/main/resources/history/")
    val lkpChacheFileDf1 = spark.read
      .schema(inputSchema)
      .parquet("src/main/resources/history")
    val lkpChacheFileDf = lkpChacheFileDf1
    lkpChacheFileDf.unpersist(true)
    val histLkpPersist = lkpChacheFileDf.persist()
    histLkpPersist.createOrReplaceTempView("hist")
  }
  .start()

println("This is the kafka dataset:")
source
  .withColumn("Input", lit("Input-source"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

recordsNotPresentInHist
  .withColumn("reject", lit("recordsNotPresentInHist"))
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

spark.streams.awaitAnyTermination()
Thanks
Sri
I resolved this problem by using the unionByName function instead of reading the refreshed data from disk.
Step 1: Read history from S3.
Step 2: Read Kafka and look up history.
Step 3: Write the processed data to disk and append it to the data frame created in step 1 using the unionByName Spark function (see the sketch after step 2 below).
Step 1 Code (Reading History Data Frame):-
val acctHistDF = sparkSession.read
  .schema(schema)
  .parquet(S3path)
val acctHistDFPersisted = acctHistDF.persist()
acctHistDFPersisted.createOrReplaceTempView("acctHist")
Step 2 (Refreshing History Data Frame with stream data):-
val history = sparkSession.table("acctHist")
// unionByName returns a new DataFrame, so capture it before re-registering the view
val refreshedHistory = history.unionByName(stream)
refreshedHistory.createOrReplaceTempView("acctHist")
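Step 3 is only described above, not shown. A minimal sketch of it, assuming it runs inside foreachBatch of the streaming query and that streamingNewRecords is the stream of new records (a hypothetical name, not from the original post), could look like this:

import org.apache.spark.sql.{DataFrame, SaveMode}

streamingNewRecords.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Persist the new records to disk (same S3path and format as step 1).
    batchDF.write
      .mode(SaveMode.Append)
      .parquet(S3path)
    // Refresh the in-memory history without re-reading it from disk:
    // append the micro-batch to the cached "acctHist" view via unionByName.
    val refreshed = sparkSession.table("acctHist").unionByName(batchDF)
    refreshed.persist().createOrReplaceTempView("acctHist")
  }
  .start()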
Thanks
Sri

How can I overcome the FileNotFoundException

I am trying to read multiple Excel files which are under one directory, but I am encountering an error: java.io.FileNotFoundException: File path/** does not exist
object example {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Excel to DataFrame").master("local[2]").getOrCreate()
    val path = "C:\\excel\\files"
    val df = spark.read.format("com.crealytics.spark.excel")
      .option("location", "true")
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns", "true")
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
      .load("path")
  }
}
Try this:
def readExcel(file: String): DataFrame = sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("location", file)
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("addColorColumns", "False")
  .load()
val data = readExcel("path to your excel file")
data.show(false)
If you want to read a particular sheet:
.option("sheetName", "Sheet2")
EDIT: To read multiple Excel files into one DataFrame (provided the columns in the Excel files are consistent):
For this I have used the spark-excel package. It can be added to the build.sbt file as:
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"
The code is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val spark = SparkSession.builder().getOrCreate()

// Function to read an xlsx file using spark-excel.
// This code format with "trailing dots" can be sent to the Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("path to the directory with your excel files")
val excelFiles = dir.listFiles.sorted.map(f => f.toString) // Array[String]
val dfs = excelFiles.map(f => readExcel(f))                // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))                          // DataFrame

ppdf.count()
ppdf.show(5)
Hope this helps. Good luck.

Structured streaming debugging input

Is there a way for me to print out the incoming data? For example, I have a readStream on a folder looking for JSON files; however, there seems to be an issue, as I am seeing nulls in the aggregation output.
val schema = StructType(
  StructField("id", LongType, false) ::
    StructField("sid", IntegerType, true) ::
    StructField("data", ArrayType(IntegerType, false), true) :: Nil)

val lines = spark.
  readStream.
  schema(schema).
  json("in/*.json")

val top1 = lines.groupBy("id").count()

val query = top1.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
To print the data you can add a queryName to the write stream and use the memory sink; the stream is then registered as an in-memory table under that name, which you can query.
In your example:
val query = top1.writeStream
  .outputMode("complete")
  .queryName("xyz")
  .format("memory") // the memory sink registers the query as an in-memory table named "xyz"
  .option("truncate", "false")
  .start()
Run this and you can display the data with a SQL query:
%sql select * from xyz
or you can create a DataFrame:
val df = spark.sql("select * from xyz")
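Since the question is also about seeing the incoming records themselves, another option (sketched here; nulls under a user-supplied schema typically mean the JSON field names or types don't match it) is to start a second query that writes the raw lines DataFrame to the console alongside the aggregation:

// A second streaming query on the un-aggregated input, next to the existing
// aggregation query, so you can see exactly what is parsed from in/*.json.
val rawQuery = lines.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()

// Keep both the raw and the aggregated queries running.
spark.streams.awaitAnyTermination()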

Join files in Apache Spark

I have a file like this. code_count.csv
code,count,year
AE,2,2008
AE,3,2008
BX,1,2005
CD,4,2004
HU,1,2003
BX,8,2004
Another file like this. details.csv
code,exp_code
AE,Aerogon international
BX,Bloomberg Xtern
CD,Classic Divide
HU,Honololu
I want the total sum for each code, but in the final output I want the exp_code, like this:
Aerogon international,5
Bloomberg Xtern,4
Classic Divide,4
Here is my code:
var countData = sc.textFile("C:\\path\\to\\code_count.csv")
var countDataKV = countData.map(x => x.split(",")).map(x => (x(0), 1))
var sum = countDataKV.foldByKey(0)((acc, ele) => acc + ele)
sum.take(2)
gives
Array[(String, Int)] = Array((AE,5), (BX,9))
Here sum is RDD[(String, Int)]. I am kind of confused about how to pull the exp_code from the other file. Please guide.
You need to calculate the sum after a groupBy on code and then join with the other DataFrame. Below is a similar example.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df1 = spark.sparkContext.parallelize(Seq(("AE", 2, 2008), ("AE", 3, 2008), ("BX", 1, 2005),
  ("CD", 4, 2004), ("HU", 1, 2003), ("BX", 8, 2004))).toDF("code", "count", "year")
val df2 = spark.sparkContext.parallelize(Seq(("AE", "Aerogon international"),
  ("BX", "Bloomberg Xtern"), ("CD", "Classic Divide"), ("HU", "Honololu"))).toDF("code", "exp_code")

val sumdf1 = df1.select("code", "count").groupBy("code").agg(sum("count"))
val finalDF = sumdf1.join(df2, "code").drop("code")
finalDF.show()
If you are using Spark 2.0 or later, you can use the following code directly; CSV support (formerly the com.databricks.spark.csv package) is built into Spark 2.0.
val codeDF = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://pathTo/code_count.csv")

val detailsDF = spark
  .read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://pathTo/details.csv")

import org.apache.spark.sql.functions._

val resDF = codeDF
  .join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))
If you are using Spark 1.6 or earlier, you can use the following code. You can follow this link to use com.databricks.spark.csv:
https://github.com/databricks/spark-csv
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val codeDF = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("hdfs://pathTo/code_count.csv")

val detailsDF = hiveContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("hdfs://pathTo/details.csv")

import org.apache.spark.sql.functions._

val resDF = codeDF
  .join(detailsDF, codeDF.col("code") === detailsDF.col("code"))
  .groupBy(codeDF.col("code"), detailsDF.col("exp_code"))
  .agg(sum("count").alias("cnt"))
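To end up with just the exp_code and the total, as asked in the question, a final projection over resDF from the snippet above might look like:

// Keep only the expanded code and the summed count, then display.
resDF.select("exp_code", "cnt").show(false)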
