How to check path existence in spark? - apache-spark

I have datetime paths as follow:
/data/2020-04-01/data.csv
/data/2020-04-02/data.csv
For some days, the file may not exist and If I call spark.read.load() function it fails. Is there a way to avoid this?

You can check like below,
import os.path
from os import path
if path.exists('/data/2020-04-01/data.csv'):
spark.read.load('/data/2020-04-01/data.csv')

In databricks you can use dbutils:
dbutils.fs.ls(path)
Using this function, you will get all the valid paths that exist. You can also use following hadoop library to get valid paths from hdfs:
org.apache.hadoop.fs

I assume you have a list of data paths and want to load data for the paths which exists on HDFS. You can pass your path to the get method in FileSystem. If the path doesn't exist it will throw IOException, and then you can proceed with the next set of paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// method to get files which exists
public static List<String> listFilesForBasePath(String path, Configuration conf) throws IOException {
FileSystem fs = FileSystem.get(new Path(path).toUri(), conf);
FileStatus[] items = fs.listStatus(new Path(path));
List<String> files = new ArrayList<>();
for (FileStatus item : items) {
files.add(item.getPath().toString());
}
return files;
}
// read files as :
Configuration configuration = sparkSession.sparkContext().hadoopConfiguration();
List<String> inputPaths = listFilesForBasePath(basePath, configuration)
Dataset<String> dataSet = sparkSession.read().textFile(inputPaths.toArray(new String[0]));

To fix issue in Simple way, Define your schema & apply that before loading. It will work without files in specified directory.
Check below code.
scala> import sys.process._
import sys.process._
scala> "ls -ltr D:\\tmp\\data\\customers".! // I don't have files inside directory.
total 0
Defining schema
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val schema = DataType.fromJson("""{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"integer","nullable":false,"metadata":{}}]}""").asInstanceOf[StructType] // Definining schema
Reading files from empty directory without schema will fail.
scala> spark.read.load("D:\\tmp\\data\\customers") // Reading file without schema will fail.
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 61 elided
Reading files from empty directory with schema.
scala> spark.read.schema(schema).load("D:\\tmp\\data\\customers") // Reading files from directory with predefined schema.
res80: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> val df = spark.read.schema(schema).load("D:\\tmp\\data\\customers") // Reading files from directory with predefined schema.
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.count
res81: Long = 0

Related

Is there a way to list all files in all folders and sub-folders in data lake?

I am trying to list all files in all folders and sub folders. I'm trying to get everything into a RDD or a dataframe (I don't think it matters because it's just a list of file names and paths). I found some code online that looks promising, but it doesn't seem to do anything. I'm pretty new to Scala though, so maybe I just missed something simple.
First code sample:
import org.apache.spark.sql.functions.input_file_name
val inputPath: String = "mnt/rawdata/2019/01/01/corp/*.gz"
val df = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)] // Optionally convert to Dataset
.rdd // or RDD
Second code sample:
import java.io.File
def getListOfFiles(dir: String):List[File] = {
val d = new File(dir)
if (d.exists && d.isDirectory) {
d.listFiles.filter(_.isFile).toList
} else {
List[File]()
}
}
val files = getListOfFiles("mnt/rawdata/2019/01/01/corp/")
There's useful Files.walk method for recursive tree traversal in nio package.
import java.nio.file._
import scala.collection.JavaConverters._
val files = Files.walk(FileSystems.getDefault.getPath("mnt/rawdata/2019/01/01/corp")).iterator.asScala.toList
Just note it returns both files and directories, so you need to filter in case you only need files.

Spark Avro write RDD to multiple directories by key

I need to split an RDD by first letters (A-Z) and write the files into directories respectively.
The simple solution is to filter the RDD for each letter, but this requires 26 passes.
There is a response to a similar question for writing to text files here, but I cannot figure out how to do this for Avro files.
Has anyone been able to do this?
You can use multipleoutputformat to do this
It is a two step task :-
First you need the multiple output format for avro. Below is the code for that:
package avro
import org.apache.hadoop.mapred.lib.MultipleOutputFormat
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.util.Progressable
import org.apache.avro.mapred.AvroOutputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD
import org.apache.hadoop.mapred.RecordWriter
class MultipleAvroFileOutputFormat[K] extends MultipleOutputFormat[AvroWrapper[K], NullWritable] {
val outputFormat = new AvroOutputFormat[K]
override def generateFileNameForKeyValue(key: AvroWrapper[K], value: NullWritable, name: String) = {
val name = key.datum().asInstanceOf[String].substring(0, 1)
name + "/" + name
}
override def getBaseRecordWriter(fs: FileSystem,
job: JobConf,
name: String,
arg3: Progressable) = {
outputFormat.getRecordWriter(fs, job, name, arg3).asInstanceOf[RecordWriter[AvroWrapper[K], NullWritable]]
}
}
In your driver code you have to mention that you want to use the Above given output format. You also need to mention the output schema for avro data. Below is sample driver code which stores a RDD of string in avro format with schema {"type":"string"}
package avro
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.JobConf
import org.apache.avro.mapred.AvroJob
import org.apache.avro.mapred.AvroWrapper
object AvroDemo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf
conf.setAppName(args(0));
conf.setMaster("local[2]");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[AvroWrapper[String]]))
val sc = new SparkContext(conf);
val input = sc.parallelize(Seq("one", "two", "three", "four"), 1);
val pairRDD = input.map(x => (new AvroWrapper(x), null));
val job = new JobConf(sc.hadoopConfiguration)
val schema = "{\"type\":\"string\"}"
job.set(AvroJob.OUTPUT_SCHEMA, schema) //set schema for avro output
pairRDD.partitionBy(new HashPartitioner(26)).saveAsHadoopFile(args(1), classOf[AvroWrapper[String]], classOf[NullWritable], classOf[MultipleAvroFileOutputFormat[String]], job, None);
sc.stop()
}
}
I hope you get a better answer than mine...
I've been in a similar situation myself, except with "ORC" instead of Avro. I basically threw up my hands and ended up calling the ORC file classes directly to write the files myself.
In your case, my approach would entail partitioning the data via "partitionBy" into 26 partitions, one for each first letter A-Z. Then call "mapPartitionsWithIndex", passing a function that outputs the i-th partition to an Avro file at the appropriate path. Finally, to convince Spark to actually do something, have mapPartitionsWithIndex return, say, a List containing the single boolean value "true"; and then call "count" on the RDD returned by mapPartitionsWithIndex to get Spark to start the show.
I found an example of writing an Avro file here: http://www.myhadoopexamples.com/2015/06/19/merging-small-files-into-avro-file-2/

Xml processing in Spark

Scenario:
My Input will be multiple small XMLs and am Supposed to read these XMLs as RDDs. Perform join with another dataset and form an RDD and send the output as an XML.
Is it possible to read XML using spark, load the data as RDD? If it is possible how will the XML be read.
Sample XML:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
How will this be loaded into the RDD?
Yes it possible but details will differ depending on an approach you take.
If files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads data as RDD[(String, String)] where the the first element is path and the second file content. Then you parse each file individually like in a local mode.
For larger files you can use Hadoop input formats.
If structure is simple you can split records using textinputformat.record.delimiter. You can find a simple example here. Input is not a XML but it you should give you and idea how to proceed
Otherwise Mahout provides XmlInputFormat
Finally it is possible to read file using SparkContext.textFile and adjust later for record spanning between partitions. Conceptually it means something similar to creating sliding window or partitioning records into groups of fixed size:
use mapPartitionsWithIndex partitions to identify records broken between partitions, collect broken records
use second mapPartitionsWithIndex to repair broken records
Edit:
There is also relatively new spark-xml package which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's the way to perform it using HadoopInputFormats to read XML data in spark as explained by #zero323.
Input data:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
Code for reading XML Input:
You will get some jars at this link
Imports:
//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml_loader_import
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
case class User(account: String, name: String, number: String)
def main(args: Array[String]): Unit = {
val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
val sparkMasterUrl = "spark://SYSTEMX:7077"
var jars = new Array[String](3)
jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"
val conf = new SparkConf().setAppName("XML Reading")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setSparkHome(sparkHome)
.set("spark.executor.memory", "512m")
.set("spark.default.deployCores", "12")
.set("spark.cores.max", "12")
.setJars(jars)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// ---- loading user from XML
// calling function 1.1
val pages = readFile("src/input_data", "<user>", "<\\user>", sc)
val xmlUserDF = pages.map { tuple =>
{
val account = extractField(tuple, "account")
val name = extractField(tuple, "name")
val number = extractField(tuple, "number")
User(account, name, number)
}
}.toDF()
println(xmlUserDF.count())
xmlUserDF.show()
}
Functions:
def readFile(path: String, start_tag: String, end_tag: String,
sc: SparkContext) = {
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
val rawXmls = sc.newAPIHadoopFile(
path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
rawXmls.map(p => p._2.toString)
}
def extractField(tuple: String, tag: String) = {
var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
if (value.contains("<" + tag + ">") &&
value.contains("</" + tag + ">")) {
value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
}
value
}
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result obtained is in dataframes you can convert them to RDD as per your requirement like this->
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
(x.get(0).toString(),x.get(1).toString(),x.get(2).toString()) }
Please evaluate it, if it could help you some how.
This will help you.
package packagename;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import com.databricks.spark.xml.XmlReader;
public class XmlreaderSpark {
public static void main(String arr[]){
String localxml="file path";
String booksFileTag = "user";
String warehouseLocation = "file:" + System.getProperty("user.dir") + "spark-warehouse";
System.out.println("warehouseLocation" + warehouseLocation);
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("Java Spark SQL Example")
.config("spark.some.config.option", "some-value").config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().config("set spark.sql.crossJoin.enabled", "true")
.getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
df.show();
}
}
You need to add this dependency in your POM.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
and your input file is not in proper format.
Thanks.
There are two good options for simple cases:
wholeTextFiles. Use map method with your XML parser which could be Scala XML pull parser (quicker to code) or the SAX Pull Parser (better performance).
Hadoop streaming XMLInputFormat which you must define the start and end tag <user> </user> to process it, however, it creates one partition per user tag
spark-xml package is a good option too.
With all options you are limited to only process simple XMLs which can be interpreted as dataset with rows and columns.
However, if we make it a little complex, those options won’t be useful.
For example, if you have one more entity there:
<root>
<users>
<user>...</users>
<companies>
<company>...</companies>
</root>
Now you need to generate 2 RDDs and change your parser to recognise the <company> tag.
This is just a simple case, but the XML could be much more complex and you would need to include more and more changes.
To solve this complexity we’ve built Flexter on top of Apache Spark to take the pain out of processing XML files on Spark. I also recommend to read about converting XML on Spark to Parquet. The latter post also includes some code samples that show how the output can be queried with SparkSQL.
Disclaimer: I work for Sonra

Merge parquet file on standalone spark

Is there a simple way how to save DataFrame into a single parquet file or merge the directory containing metadata and parts of this parquet file produced by sqlContext.saveAsParquetFile() into a single file stored on NFS without using HDFS and hadoop?
To save only one file, rather than many, you can call coalesce(1) / repartition(1) on the RDD/Dataframe before the data is saved.
If you already have a directory with small files, you could create a Compacter process which would read in the exiting files and save them to one new file. E.g.
val rows = parquetFile(...).coalesce(1)
rows.saveAsParquetFile(...)
You can store to a local file system using saveAsParquetFile. e.g.
rows.saveAsParquetFile("/tmp/onefile/")
I was able to use this method to compress parquet files using snappy format with Spark 1.6.1. I used overwrite so that I could repeat the process if needed. Here is the code.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
object CompressApp {
val serverPort = "hdfs://myserver:8020/"
val inputUri = serverPort + "input"
val outputUri = serverPort + "output"
val config = new SparkConf()
.setAppName("compress-app")
.setMaster("local[*]")
val sc = SparkContext.getOrCreate(config)
val sqlContext = SQLContext.getOrCreate(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
import sqlContext.implicits._
def main(args: Array[String]) {
println("Compressing Parquet...")
val df = sqlContext.read.parquet(inputUri).coalesce(1)
df.write.mode(SaveMode.Overwrite).parquet(outputUri)
println("Done.")
}
}
coalesce(N) has saved me so far. If your table is partitioned, then use repartition("partition key") as well.

Spark: Reading files using different delimiter than new line

I'm using Apache Spark 1.0.1. I have many files delimited with UTF8 \u0001 and not with the usual new line \n. How can I read such files in Spark? Meaning, the default delimiter of sc.textfile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.
You can use textinputformat.record.delimiter to set the delimiter for TextInputFormat, E.g.,
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "X")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val lines = input.map { case (_, text) => text.toString}
println(lines.collect)
For example, my input is a file containing one line aXbXcXd. The above code will output
Array(a, b, c, d)
In Spark shell, I extracted data according to Setting textinputformat.record.delimiter in spark:
$ spark-shell
...
scala> import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.LongWritable
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
scala> val conf = new Configuration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
scala> conf.set("textinputformat.record.delimiter", "\u0001")
scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
data: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:19
sc.newAPIHadoopFile("mydata.txt", ...) is a RDD[(LongWritable, Text)], where the first part of the elements is the starting character index, and the second part is the actual text delimited by "\u0001".
In python this could be achieved using:
rdd = sc.newAPIHadoopFile(YOUR_FILE, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
"org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
conf={"textinputformat.record.delimiter": YOUR_DELIMITER}).map(lambda l:l[1])
Here is a ready-to-use version of Chad's and #zsxwing's answers for Scala users, which can be used this way:
sc.textFile("some/path.txt", "\u0001")
The following snippet creates an additional textFile method implicitly attached to the SparkContext using an implicit class (in order to replicate SparkContext's default textFile method):
package com.whatever
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
object Spark {
implicit class ContextExtensions(val sc: SparkContext) extends AnyVal {
def textFile(
path: String,
delimiter: String,
maxRecordLength: String = "1000000"
): RDD[String] = {
val conf = new Configuration(sc.hadoopConfiguration)
// This configuration sets the record delimiter:
conf.set("textinputformat.record.delimiter", delimiter)
// and this one limits the size of one record:
conf.set("mapreduce.input.linerecordreader.line.maxlength", maxRecordLength)
sc.newAPIHadoopFile(
path,
classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
conf
)
.map { case (_, text) => text.toString }
}
}
}
which can be used this way:
import com.whatever.Spark.ContextExtensions
sc.textFile("some/path.txt", "\u0001")
Note the additional setting mapreduce.input.linerecordreader.line.maxlength which limits the maximum size of a record. This comes in handy when reading from a corrupted file for which a record could be too long to fit in memory (more chances of it happening when playing with the record delimiter).
With this setting, when reading a corrupted file, an exception (java.io.IOException - thus catchable) will be thrown rather than getting a messy out of memory which will stop the SparkContext.
If you are using spark-context, the below code helped me
sc.hadoopConfiguration.set("textinputformat.record.delimiter","delimeter")

Resources