SparkSession: read multiple files instead of using a pattern - apache-spark

I'm trying to read a couple of CSV files using SparkSession from a folder on HDFS (i.e. I don't want to read all the files in the folder).
I get the following error while running (code at the end):
Path does not exist:
file:/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv,
/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv
I don't want to use a pattern while reading, like /home/temp/*.csv, the reason being that in future I will have logic to pick only one or two files out of 100 CSV files in the folder.
Please advise
SparkSession sparkSession = SparkSession
.builder()
.appName(SparkCSVProcessors.class.getName())
.master(master).getOrCreate();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");
Set<String> fileSet = Files.list(Paths.get("/home/cloudera/works/JavaKafkaSparkStream/input/"))
.filter(name -> name.toString().endsWith(".csv"))
.map(name -> name.toString())
.collect(Collectors.toSet());
SQLContext sqlCtx = sparkSession.sqlContext();
Dataset<Row> rawDataset = sparkSession.read()
.option("inferSchema", "true")
.option("header", "true")
.format("com.databricks.spark.csv")
.option("delimiter", ",")
//.load(String.join(" , ", fileSet));
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv, " +
"/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv");
UPDATE
I can iterate over the files and do a union as below. Please recommend if there is a better way ...
Dataset<Row> unifiedDataset = null;
for (String fileName : fileSet) {
    Dataset<Row> tempDataset = sparkSession.read()
            .option("inferSchema", "true")
            .option("header", "true")
            .format("csv")
            .option("delimiter", ",")
            .load(fileName);
    if (unifiedDataset != null) {
        unifiedDataset = unifiedDataset.unionAll(tempDataset);
    } else {
        unifiedDataset = tempDataset;
    }
}

Your problem is that you are creating a String with the value:
"/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv,
/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv"
You should instead pass the two filenames as separate parameters, like this:
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv",
"/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv");
The comma has to be outside the string, and you should have two values instead of one String.
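If you already have the file names collected, as with the fileSet in the question, the same idea in a minimal Java sketch is to hand the whole set to load at once, since DataFrameReader.load(String... paths) accepts any number of paths:
// Pass every collected file name as a separate path argument,
// instead of joining them into one comma-separated String
Dataset<Row> rawDataset = sparkSession.read()
        .option("inferSchema", "true")
        .option("header", "true")
        .option("delimiter", ",")
        .format("csv")
        .load(fileSet.toArray(new String[0]));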

From my understanding, you want to read multiple files from HDFS without using a glob pattern like "/path/*.csv". What you are missing is that each path needs to be passed separately, in its own quotes, separated by ",".
You can read using the code below; ensure that you have added the Spark CSV library:
sqlContext.read.format("csv").load("/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv","/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv")
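As a side note, on Spark 2.x the built-in csv reader also takes several paths directly; a small Java sketch matching the question's code:
Dataset<Row> rawDataset = sparkSession.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/home/cloudera/works/JavaKafkaSparkStream/input/input_1.csv",
             "/home/cloudera/works/JavaKafkaSparkStream/input/input_2.csv");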

A pattern can be helpful as well.
You want to select two files at a time.
If they are sequential, then you could do something like
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_[1-2].csv")
If there are more files, then just do input_[1-5].csv.
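If the files you want are not in a contiguous numeric range, a Hadoop-style brace glob can also list them explicitly; a sketch using the same two files from the question:
// {a,b} matches exactly the listed alternatives, so arbitrary
// (non-sequential) file names can be picked with one path string
.load("/home/cloudera/works/JavaKafkaSparkStream/input/input_{1,2}.csv")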

Related

spark.read.excel - Not reading all Excel rows when using custom schema

I am trying to read a Spark DataFrame from an 'excel' file. I used the crealytics dependency.
Without any predefined schema, all rows are read correctly, but only as string type columns.
To prevent that, I am using my own schema (where I mentioned certain columns to be Integer type), but in this case, most of the rows are dropped when the file is being read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
.format("com.crealytics.spark.excel")
.option("useHeader" , "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("treatEmptyValuesAsNulls" , "false")
.option("keepUndefinedRows", "true")
.option("maxRowsInMey", 2000)
.schema(templateSchema)
.load(inputLocation)
This gives me only 450 rows.
Is there a way to prevent that? Thanks in advance!
As of now, I haven't found a fix to this problem, but I tried solving it in a different way by manually type-casting. To keep the number of lines of code down, I used a for loop. My solution is as follows:
Step 1: Create my own schema of type 'StructType':
val requiredSchema = new StructType()
.add("ID", IntegerType, true)
.add("Vendor", StringType, true)
.add("Brand", StringType, true)
.add("Product Name", StringType, true)
.add("Net Quantity", StringType, true)
Step 2: Type-cast the DataFrame AFTER it has been read from the Excel file (WITHOUT the custom schema), instead of applying the schema while reading the data:
def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame = {
  var schemaDf: DataFrame = inputDF
  for (i <- inputDF.columns.indices) {
    if (inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName) {
      schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema.apply(i).dataType))
    }
  }
  schemaDf
}
This might not be an efficient solution, but it is better than typing out too many lines of code to typecast multiple columns.
I am still searching for a solution to my original question.
This solution is just in case someone wants to try it and is in immediate need of a quick fix.
Here's a workaround using PySpark, with a schema file that consists of "fieldname" and "dataType" pairs:
# 1st load the dataframe with StringType for all columns
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql.functions import col
input_df = spark.read.format("com.crealytics.spark.excel") \
.option("header", isHeaderOn) \
.option("treatEmptyValuesAsNulls", "true") \
.option("dataAddress", xlsxAddress1) \
.option("setErrorCellsToFallbackValues", "true") \
.option("ignoreLeadingWhiteSpace", "true") \
.option("ignoreTrailingWhiteSpace", "true") \
.load(inputfile)
# 2nd Modify the datatypes within the dataframe using a file containing column names and the expected data type.
dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))

How to read csv with second line as header in pyspark dataframe

I am trying to load a CSV and make the second line the header. How can I achieve this? Please let me know. Thanks.
file_location = "/mnt/test/raw/data.csv"
file_type = "csv"
infer_schema = "true"
delimiter = ","
data = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", "false") \
.option("sep", delimiter) \
.load(file_location)
First read the data as an RDD and then pass this RDD to spark.read.csv():
# Read the file as an RDD of lines, drop the original first line,
# then let spark.read.csv treat the new first row (the original second line) as the header
data = sc.textFile('/mnt/test/raw/data.csv')
firstRow = data.first()
data = data.filter(lambda row: row != firstRow)
df = spark.read.csv(data, header=True)
For a reference of DataFrame functions, use the link below. This would serve as a bible for all of the DataFrame operations you need; for a specific version of Spark, replace "latest" in the URL with whatever version you want:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

How to add multidimensional array to an existing Spark DataFrame

If I understand correctly, ArrayType can be added as Spark DataFrame columns. I am trying to add a multidimensional array to an existing Spark DataFrame by using the withColumn method. My idea is to have this array available with each DataFrame row in order to use it to send back information from the map function.
The error I get says that the withColumn function is looking for a Column type but it is getting an array. Are there any other functions that will allow adding an ArrayType?
object TestDataFrameWithMultiDimArray {

  val nrRows = 1400
  val nrCols = 500

  /** Our main function where the action happens */
  def main(args: Array[String]) {
    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "TestDataFrameWithMultiDimArray")
    val sqlContext = new SQLContext(sc)

    val PropertiesDF = sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("location", "C:/Users/tjoha/Desktop/Properties.xlsx")
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns", "False")
      .option("sheetName", "Sheet1")
      .load()

    PropertiesDF.show()
    PropertiesDF.printSchema()

    // This is the line that fails: withColumn expects a Column, not a plain array
    val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", Array.ofDim[Any](nrRows, nrCols))
  }
}
Thanks for your help.
Kind regards,
Johann
There are 2 problems in your code:
The 2nd argument to withColumn needs to be a Column. You can wrap a constant value with the function lit.
Spark can't take Any as its column type; you need to use a specific supported type.
val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", lit(Array.ofDim[Int](nrRows,nrCols)))
will do the trick

Change the filename of the spark streaming output

The simple program below reads from a Kafka stream and writes to a CSV file every 5 minutes using Spark Structured Streaming. It generates files with the naming convention part-00000-f90bbc78-b847-41d4-9938-bdae89adb8eb.csv. Is there a way I can change the name to include a "DATETIMESTAMP" + GUID?
Please advise. Thanks.
I was able to find the list of options for DataStreamReader, but nothing for DataStreamWriter:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/DataStreamReader.html#csv-java.lang.String-
public static void main(String[] args) throws Exception {
    if (args.length == 0)
        throw new Exception("Usage program configFilename");

    String configFilename = args[0];
    addShutdownHook();
    ConfigLoader.loadConfig(configFilename);

    sparkSession = SparkSession
            .builder()
            .appName(TestKafka.class.getName())
            .master(ConfigLoader.getValue("master")).getOrCreate();
    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel(ConfigLoader.getValue("logLevel"));
    SQLContext sqlCtx = sparkSession.sqlContext();
    System.out.println("Spark context established");

    DataStreamReader kafkaDataStreamReader = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", ConfigLoader.getValue("brokers"))
            .option("group.id", ConfigLoader.getValue("groupId"))
            .option("subscribe", ConfigLoader.getValue("topics"))
            .option("failOnDataLoss", false);
    Dataset<Row> rawDataSet = kafkaDataStreamReader.load();
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("rawEventView1");

    rawDataSet = rawDataSet.withColumn("rawEventValue", rawDataSet.col("value").cast("string"));
    rawDataSet.printSchema();
    rawDataSet.createOrReplaceTempView("eventView1");

    sqlCtx.sql("select * from eventView1")
            .writeStream()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "~")
            .option("checkpointLocation", ConfigLoader.getValue("checkpointPath"))
            .option("path", ConfigLoader.getValue("recordsPath"))
            .outputMode(OutputMode.Append())
            .trigger(ProcessingTime.create(Integer.parseInt(ConfigLoader.getValue("kafkaProcessingTime")), TimeUnit.SECONDS))
            .start()
            .awaitTermination();
}
There isn't a provision for changing the naming of part files in Structured Streaming, which uses ManifestFileCommitProtocol to track the list of valid files the job writes to. The target part file's name is a combination of split, UUID and extension, and this is done to avoid collisions.
Source: https://github.com/apache/spark/blob/20adf9aa1f42353432d356117e655e799ea1290b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala#L87
1) There is no direct support in the saveAsTextFile method to control the output file name. You can try using saveAsHadoopDataset to control the output file basename.
e.g.: instead of part-00000, you can get yourCustomName-00000.
Keep in mind that you cannot control the suffix 00000 using this method. It is something Spark automatically assigns to each partition while writing, so that each partition writes to a unique file.
In order to control that too, as mentioned above in the comments, you have to write your own custom OutputFormat.
SparkConf conf = new SparkConf();
conf.setMaster("local").setAppName("yello");
JavaSparkContext sc = new JavaSparkContext(conf);
JobConf jobConf = new JobConf();
jobConf.set("mapreduce.output.basename", "customName");
jobConf.set("mapred.output.dir", "outputPath");
// saveAsHadoopDataset is defined on pair RDDs and needs the output format and
// key/value classes on the JobConf (NullWritable/Text from org.apache.hadoop.io,
// TextOutputFormat from org.apache.hadoop.mapred)
jobConf.setOutputKeyClass(NullWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);
JavaRDD<String> input = sc.textFile("inputDir");
input.mapToPair(line -> new Tuple2<NullWritable, Text>(NullWritable.get(), new Text(line)))
        .saveAsHadoopDataset(jobConf);
2) A workaround would be to write the output as-is to your output location and use the Hadoop FileUtil.copyMerge function to form a merged file.
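A minimal sketch of that copyMerge step (the paths are illustrative placeholders; FileUtil.copyMerge is available in Hadoop 2.x but was removed in Hadoop 3):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge every part file under the output directory into one target file
Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);
FileUtil.copyMerge(
        fs, new Path("/output/records"),       // directory the stream wrote part files to
        fs, new Path("/output/records.csv"),   // single merged output file
        false,                                 // keep the source part files
        hadoopConf,
        null);                                 // nothing appended between merged files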

Spark dataframe to csv file saving as part

Hi, I have this code that saves a dataframe to CSV locally on the system, and I keep getting a directory named myfile.csv/ containing part-0000.gz, part-0001.gz, ... I just want a single CSV file. Here is my code:
String current = LocalDateTime.now().format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
groupedMessages.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save("/finance_reports/myfile.csv");
}
