Read Excel files from S3 using Scala, Spark and org.apache.poi - excel

I'm looking for a way to open and process an Excel file (*.xlsx) in a Spark job.
I'm quite new to the Scala/Spark stack, so I'm trying to do it in a Pythonic way :)
Without Spark it's simple:
val f = new File("src/worksheets.xlsx")
val workbook = WorkbookFactory.create(f)
val sheet = workbook.getSheetAt(0)
But Spark needs some streaming input. I've configured Hadoop for S3 (in my case - MinIO)
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set(
"fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.access.key", params.minioAccessKey.get)
hadoopConf.set("fs.s3a.secret.key", params.minioSecretKey.get)
hadoopConf.set(
"fs.s3a.connection.ssl.enabled",
params.minioSSL.get.toString
)
hadoopConf.set("fs.s3a.endpoint", params.minioUrl.get)
val FilterDF = sparkSession.read
.format("com.crealytics.spark.excel")
.option("recursiveFileLookup", "true")
.option("modifiedBefore", "2020-07-01T05:30:00")
.option("modifiedAfter", "2020-06-01T05:30:00")
.option("header", "true")
.load("s3a://first/");
println(FilterDF)
So the question is: how to configure DataFrame (or, maybe some other solution) to filter and gather files in some time range from S3 bucket and make it suitable to work with Apache POI? Its Workbook can process general file objects as well as InputStream (so this might be the point of conversion)
Thanks in advance
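One possible direction, offered as a sketch rather than a confirmed answer: since POI accepts an InputStream, the files can be listed and opened through the Hadoop FileSystem API that the s3a settings above already configure. The time window is applied to each file's modification time. The object and method names (ExcelFromS3, xlsxInRange, withWorkbook, firstSheetNames) are hypothetical, and treating the timestamps as UTC is an assumption. I'm not certain whether the spark-excel source honors modifiedBefore/modifiedAfter, so this version filters by hand.

```scala
import java.net.URI
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.poi.ss.usermodel.{Workbook, WorkbookFactory}

object ExcelFromS3 {
  // Parse a zone-less ISO timestamp; interpreting it as UTC is an assumption.
  def toMillis(iso: String): Long =
    LocalDateTime.parse(iso).toInstant(ZoneOffset.UTC).toEpochMilli

  // Half-open interval: from <= modified < to.
  def inRange(modifiedMillis: Long, fromMillis: Long, toMillis: Long): Boolean =
    modifiedMillis >= fromMillis && modifiedMillis < toMillis

  // Recursively list *.xlsx files whose modification time falls in the window.
  def xlsxInRange(fs: FileSystem, dir: Path, from: Long, to: Long): Seq[Path] = {
    val it = fs.listFiles(dir, true)
    val found = Seq.newBuilder[Path]
    while (it.hasNext) {
      val status = it.next()
      if (status.getPath.getName.endsWith(".xlsx") &&
          inRange(status.getModificationTime, from, to)) {
        found += status.getPath
      }
    }
    found.result()
  }

  // Open one file as a stream, hand the POI Workbook to f, then close both.
  def withWorkbook[A](fs: FileSystem, p: Path)(f: Workbook => A): A = {
    val in = fs.open(p)
    try {
      val workbook = WorkbookFactory.create(in)
      try f(workbook) finally workbook.close()
    } finally in.close()
  }

  // Example use: the name of the first sheet of every matching file.
  def firstSheetNames(conf: Configuration, root: String,
                      fromIso: String, toIso: String): Seq[(String, String)] = {
    val fs = FileSystem.get(new URI(root), conf)
    xlsxInRange(fs, new Path(root), toMillis(fromIso), toMillis(toIso)).map { p =>
      p.toString -> withWorkbook(fs, p)(_.getSheetAt(0).getSheetName)
    }
  }
}
```

With the configuration from the question this would be called as ExcelFromS3.firstSheetNames(hadoopConf, "s3a://first/", "2020-06-01T05:30:00", "2020-07-01T05:30:00"). Everything here runs on the driver; for many or large workbooks, sparkContext.binaryFiles could hand each file to an executor as a stream instead.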

Related

How to do a fast insertion of the data in a Kafka topic inside a Hive Table?

I have a Kafka topic in which I have received around 500k events.
Currently, I need to insert those events into a Hive table.
Since events are time-driven, I decided to use the following strategy:
1) Define a route inside HDFS, which I call users. Inside of this route, there will be several Parquet files, each one corresponding to a certain date. E.g.: 20180412, 20180413, 20180414, etc. (Format YYYYMMDD).
2) Create a Hive table and use the date in the format YYYYMMDD as a partition. The idea is to use each of the files inside the users HDFS directory as a partition of the table, by simply adding the corresponding parquet file through the command:
ALTER TABLE users DROP IF EXISTS PARTITION
(fecha='20180412') ;
ALTER TABLE users ADD PARTITION
(fecha='20180412') LOCATION '/users/20180412';
3) Read the data from the Kafka topic by iterating from the earliest event, get the date value in the event (inside the parameter dateCliente), and given that date value, insert the value into the corresponding Parquet file.
4) In order to accomplish the point 3, I read each event and saved it inside a temporary HDFS file, from which I used Spark to read the file. After that, I used Spark to convert the temporary file contents into a Data Frame.
5) Using Spark, I managed to insert the DataFrame values into the Parquet File.
The code follows this approach:
val conf = ConfigFactory.parseResources("properties.conf")
val brokersip = conf.getString("enrichment.brokers.value")
val topics_in = conf.getString("enrichment.topics_in.value")
val spark = SparkSession
.builder()
.master("yarn")
.appName("ParaTiUserXY")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val properties = new Properties
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("bootstrap.servers", brokersip)
properties.put("auto.offset.reset", "earliest")
properties.put("group.id", "UserXYZ2")
//Schema para transformar los valores del topico de Kafka a JSON
val my_schema = new StructType()
.add("longitudCliente", StringType)
.add("latitudCliente", StringType)
.add("dni", StringType)
.add("alias", StringType)
.add("segmentoCliente", StringType)
.add("timestampCliente", StringType)
.add("dateCliente", StringType)
.add("timeCliente", StringType)
.add("tokenCliente", StringType)
.add("telefonoCliente", StringType)
val consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe( util.Collections.singletonList("geoevents") )
val fs = {
val conf = new Configuration()
FileSystem.get(conf)
}
val temp_path:Path = new Path("hdfs:///tmp/tmpstgtopics")
if( fs.exists(temp_path)){
fs.delete(temp_path, true)
}
while(true)
{
val records=consumer.poll(100)
for (record<-records.asScala){
val data = record.value.toString
val dataos: FSDataOutputStream = fs.create(temp_path)
val bw: BufferedWriter = new BufferedWriter( new OutputStreamWriter(dataos, "UTF-8"))
bw.append(data)
bw.close
val data_schema = spark.read.schema(my_schema).json("hdfs:///tmp/tmpstgtopics")
val fechaCliente = data_schema.select("dateCliente").first.getString(0)
if( fechaCliente < date){
data_schema.select("longitudCliente", "latitudCliente","dni", "alias",
"segmentoCliente", "timestampCliente", "dateCliente", "timeCliente",
"tokenCliente", "telefonoCliente").coalesce(1).write.mode(SaveMode.Append)
.parquet("/desa/landing/parati/xyusers/" + fechaCliente)
}
else{
break
}
}
}
consumer.close()
However, this method is taking around 1 second to process each record in my cluster. So far, it would mean I will take around 6 days to process all the events I have.
Is this the optimal way to insert the whole amount of events inside a Kafka topic into a Hive table?
What other alternatives exist or which upgrades could I do to my code in order to speed it up?
Other than the fact that you're not using Spark Streaming correctly to poll from Kafka (you wrote a vanilla Scala Kafka consumer with a while loop) and coalesce(1) will always be a bottleneck as it forces one executor to collect the records, I'll just say you're really reinventing the wheel here.
What other alternatives exist
These are the ones I know of, and all are open source:
Gobblin (replaces Camus) by LinkedIn
Kafka Connect w/ HDFS Sink Connector (built into Confluent Platform, but also builds from source on Github)
Streamsets
Apache NiFi
Secor by Pinterest
From those listed, it would be beneficial for you to have JSON or Avro encoded Kafka messages, and not a flat string. That way, you can drop the files as is into a Hive serde, and not parse them while consuming them. If you cannot edit the producer code, make a separate Kafka Streams job taking the raw string data, parse it, then write to a new topic of Avro or JSON.
If you choose Avro (which you really should for Hive support), you can use the Confluent Schema Registry. Or if you're running Hortonworks, they offer a similar Registry.
Hive on Avro operates far better than text or JSON. Avro can easily be transformed into Parquet, and I believe each of the above options offers at least Parquet support, while some can also write ORC (Kafka Connect doesn't do ORC at this time).
Each of the above also supports some level of automatic Hive partition generation based on the Kafka record time.
You can improve the parallelism by increasing the partitions of the kafka topic and having one or more consumer groups with multiple consumers consuming one-to-one with each partition.
As cricket_007 mentioned, you can use one of the open-source frameworks, or you can have more consumer groups consuming the same topic to off-load the data.
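If you would rather stay in Spark than adopt one of the tools listed above, here is a rough sketch of the non-reinvented version: let Spark's Kafka source read the whole topic as one batch, parse the JSON with the schema from the question, and write Parquet partitioned by date in a single job. It assumes the spark-sql-kafka connector is on the classpath; the object and method names (KafkaToParquet, parseEvents, run) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object KafkaToParquet {
  // Same field names as my_schema in the question, all strings.
  val eventSchema: StructType = Seq(
    "longitudCliente", "latitudCliente", "dni", "alias", "segmentoCliente",
    "timestampCliente", "dateCliente", "timeCliente", "tokenCliente", "telefonoCliente"
  ).foldLeft(new StructType())((schema, name) => schema.add(name, StringType))

  // Turn the Kafka "value" column into one column per JSON field.
  def parseEvents(raw: DataFrame): DataFrame =
    raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json(col("json"), eventSchema).as("e"))
      .select("e.*")

  // Batch-read everything currently in the topic and append it as Parquet,
  // one sub-directory per dateCliente value.
  def run(spark: SparkSession, brokers: String, topic: String, outDir: String): Unit = {
    val raw = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", brokers)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
    parseEvents(raw)
      .write
      .mode(SaveMode.Append)
      .partitionBy("dateCliente")
      .parquet(outDir)
  }
}
```

Note that partitionBy writes directories named dateCliente=20180412 rather than plain 20180412, so the ALTER TABLE ... LOCATION statements from the question would need to point at those paths (or the Hive partition column be named to match). Swapping read/write for readStream/writeStream gives the continuous variant.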

Reading excel files in a streaming fashion in spark 2.0.0

I have a set of Excel files that need to be read from Spark (2.0.0) as and when an Excel file is loaded into a local directory. The Scala version used here is 2.11.8.
I've tried using the readStream method of SparkSession, but I'm not able to read in a streaming way. I'm able to read Excel files statically as:
val df = spark.read.format("com.crealytics.spark.excel").option("sheetName", "Data").option("useHeader", "true").load("Sample.xlsx")
Is there any other way of reading excel files in streaming way from a local directory?
Any answers would be helpful.
Thanks
Changes done:
val spark = SparkSession.builder().master("local[*]").config("spark.sql.warehouse.dir","file:///D:/pooja").appName("Spark SQL Example").getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val dataFrame = spark.readStream.format("csv").option("inferSchema",true).option("header", true).load("file:///D:/pooja/sample.csv")
dataFrame.writeStream.format("console").start()
dataFrame.show()
Updated code:
val spark = SparkSession.builder().master("local[*]").appName("Spark SQL Example").getOrCreate()
spark.conf.set("spark.sql.streaming.schemaInference", true)
import spark.implicits._
val df = spark.readStream.format("com.crealytics.spark.excel").option("header", true).load("file:///filepath/*.xlsx")
df.writeStream.format("memory").queryName("tab").start().awaitTermination()
val res = spark.sql("select * from tab")
res.show()
Error:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source com.crealytics.spark.excel does not support streamed reading
Can anyone help me resolve this issue?
For a streaming DataFrame you have to provide Schema and currently, DataStreamReader does not support option("inferSchema", true|false). You can set SQLConf setting spark.sql.streaming.schemaInference, which needs to be set at session level.
You can refer here
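To make the schema point concrete, here is a minimal sketch of a file-based stream with an explicit schema. It uses CSV, not Excel, because the error above says the Excel source does not support streamed reading. The column names (id, name) and the object name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

object CsvDirectoryStream {
  // Hypothetical columns; replace with the real layout of the files.
  val schema: StructType = new StructType()
    .add("id", StringType)
    .add("name", StringType)

  // Watch a directory: every new CSV file dropped into it becomes new rows.
  def open(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("csv-directory-stream")
      .getOrCreate()
    // A streaming DataFrame cannot be inspected with show(); start a query instead.
    val query = open(spark, args(0)).writeStream.format("console").start()
    query.awaitTermination()
  }
}
```

Two things differ from the attempts in the question: the path is a directory rather than a single file, and the output goes through writeStream only, since calling show() on a streaming DataFrame fails.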

Spark Session read multiple files instead of using pattern

I'm trying to read a couple of CSV files using SparkSession from a folder on HDFS (i.e. I don't want to read all the files in the folder).
I get the following error while running (code at the end):
Path does not exist:
file:/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv,
/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv
I don't want to use the pattern while reading, like /home/temp/*.csv, reason being in future I have logic to pick only one or two files in the folder out of 100 CSV files
Please advise
SparkSession sparkSession = SparkSession
.builder()
.appName(SparkCSVProcessors.class.getName())
.master(master).getOrCreate();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
Set<String> fileSet = Files.list(Paths.get("/home/cloudera/works/JavaKafkaSparkStream/input/"))
.filter(name -> name.toString().endsWith(".csv"))
.map(name -> name.toString())
.collect(Collectors.toSet());
SQLContext sqlCtx = sparkSession.sqlContext();
Dataset<Row> rawDataset = sparkSession.read()
.option("inferSchema", "true")
.option("header", "true")
.format("com.databricks.spark.csv")
.option("delimiter", ",")
//.load(String.join(" , ", fileSet));
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv, " +
"/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv");
UPDATE
I can iterate the files and do an union as below. Please recommend if there is a better way ...
Dataset<Row> unifiedDataset = null;
for (String fileName : fileSet) {
Dataset<Row> tempDataset = sparkSession.read()
.option("inferSchema", "true")
.option("header", "true")
.format("csv")
.option("delimiter", ",")
.load(fileName);
if (unifiedDataset != null) {
unifiedDataset= unifiedDataset.unionAll(tempDataset);
} else {
unifiedDataset = tempDataset;
}
}
Your problem is that you are creating a String with the value:
"/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv,
/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv"
instead of passing two filenames as parameters, which should be done by:
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv",
"/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv");
The comma has to be outside the string, and you should have two values instead of one String.
From my understanding, you want to read multiple files from HDFS without using a pattern like "/path/*.csv". What you are missing is that each path needs to be quoted separately, with the paths separated by ",".
You can read using code as below, ensure that you have added SPARK CSV library :
sqlContext.read.format("csv").load("/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv","/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv")
Pattern can be helpful as well.
You want to select two files at a time.
If they are sequential, then you could do something like
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_[1-2].csv")
if more files then just do input_[1-5].csv
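Since the set of files is computed at runtime, the point made in the answers above (each path is its own argument) can also be written with a collection expanded into varargs. A small Scala sketch; the object and method names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadSelectedCsv {
  // Read exactly the given files as one DataFrame; `paths: _*` expands the
  // sequence into separate arguments instead of one comma-joined string.
  def read(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(paths: _*)
}
```

In Java the equivalent is to pass an array to the varargs method, e.g. .load(fileSet.toArray(new String[0])), which also removes the need for the loop with unionAll.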

Rename written CSV file Spark

I'm running spark 2.1 and I want to write a csv with results into Amazon S3.
After repartitioning, the CSV file has a long, cryptic name, and I want to change that to a specific filename.
I'm using the databricks lib for writing into S3.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and haven't found much.
Thanks
You can use below to rename the output file.
dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("folder/dataframe/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "folder/dataframe/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"file.csv"))
The code as you mentioned here returns a Unit. You would need to confirm when your Spark application has completed its run (assuming this is a batch case) and then rename
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
You can rename the part files to any specific name using the dbutils command. The code below renames the generated part-* CSV files; it works fine in PySpark.
x = 'dbfs:mnt/source_path' # your source path
y = 'dbfs:mnt/destination_path' # your destination path
Files = dbutils.fs.ls(x)
# move or rename each part-000 CSV file to a specific name
i = 0
for file in Files:
    print(file.name)
    i = i+1
    if file.name[-4:] == '.csv': # you can use any file extension like parquet, JSON, etc.
        dbutils.fs.mv(x + '/' + file.name, y + '/OutputData-' + str(i) + '.csv') # you can provide any specific name here
dbutils.fs.rm(x, True) # later remove the source path after renaming all the part-generated files if you want

How to read only n rows of large CSV file on HDFS using spark-csv package?

I have a big distributed file on HDFS and each time I use sqlContext with spark-csv package, it first loads the entire file which takes quite some time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path")
now as I just want to do some quick check at times, all I need is few/ any n rows of the entire file.
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").take(n)
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("file_path").head(n)
but all these run after the file load is done. Can't I just restrict the number of rows while reading the file itself ? I am referring to n_rows equivalent of pandas in spark-csv, like:
pd_df = pandas.read_csv("file_path", nrows=20)
Or it might be the case that spark does not actually load the file, the first step, but in this case, why is my file load step taking too much time then?
I want
df.count()
to give me only n and not all rows, is it possible ?
You can use limit(n).
sqlContext.read.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true').load("file_path").limit(20)
This will just load 20 rows.
My understanding is that reading just a few lines is not supported by spark-csv module directly, and as a workaround you could just read the file as a text file, take as many lines as you want and save it to some temporary location. With the lines saved, you could use spark-csv to read the lines, including inferSchema option (that you may want to use given you are in exploration mode).
val numberOfLines = ...
spark.
read.
text("myfile.csv").
limit(numberOfLines).
write.
text(s"myfile-$numberOfLines.csv")
val justFewLines = spark.
read.
option("inferSchema", true). // <-- you are in exploration mode, aren't you?
csv(s"myfile-$numberOfLines.csv")
Not inferring schema and using limit(n) worked for me, in all aspects.
f_schema = StructType([
StructField("col1",LongType(),True),
StructField("col2",IntegerType(),True),
StructField("col3",DoubleType(),True)
...
])
df_n = sqlContext.read.format('com.databricks.spark.csv').options(header='true').schema(f_schema).load(data_path).limit(10)
Note: If we use inferschema='true', it takes the same time again, presumably because the whole file is still read as before.
But if we don't know the schema, Jacek Laskowski's solution works well too. :)
The solution given by Jacek Laskowski works well. Presenting an in-memory variation below.
I recently ran into this problem. I was using databricks and had a huge csv directory (200 files of 200MB each)
I originally had
val df = spark.read.format("csv")
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.load("dbfs:/huge/csv/files/in/this/directory/")
display(df)
which took a lot of time (10+ minutes), but then I changed it to the code below and it ran almost instantly (2 seconds)
val lines = spark.read.text("dbfs:/huge/csv/files/in/this/directory/").as[String].take(1000)
val df = spark.read
.option("header", true)
.option("sep", ",")
.option("inferSchema", true)
.csv(spark.createDataset(lines))
display(df)
Inferring schema for text formats is hard and it can be done this way for the csv and json (but not if it's a multi-line json) formats.
Since PySpark 2.3 you can simply load data as text, limit, and apply csv reader on the result:
(spark
.read
.options(inferSchema="true", header="true")
.csv(
spark.read.text("/path/to/file")
.limit(20) # Apply limit
.rdd.flatMap(lambda x: x))) # Convert to RDD[str]
Scala counterpart is available since Spark 2.2:
spark
.read
.options(Map("inferSchema" -> "true", "header" -> "true"))
.csv(spark.read.text("/path/to/file").limit(20).as[String])
In Spark 3.0.0 or later one can also apply limit and use from_csv function, but it requires a schema, so it probably won't fit your requirements.
Since I didn't see that solution in the answers, the pure SQL-approach is working for me:
df = spark.sql("SELECT * FROM csv.`/path/to/file` LIMIT 10000")
If there is no header the columns will be named _c0, _c1, etc. No schema required.
Maybe this will be helpful for those working in Java.
Applying limit will not help to reduce the time. You have to collect the n rows from the file.
DataFrameReader frameReader = spark
.read()
.format("csv")
.option("inferSchema", "true");
//set framereader options, delimiters etc
List<String> dataset = spark.read().textFile(filePath).limit(MAX_FILE_READ_SIZE).collectAsList();
return frameReader.csv(spark.createDataset(dataset, Encoders.STRING()));

Resources