How to display string in spark - apache-spark

I am new to Spark with Scala. This is a simple program in which I read a .csv file with three columns. I use the map and split transformations to split each line, but I am not able to display the result, even after calling mkString(). I do not want to call mkString in the last line inside collect.foreach(). Please find the code below and suggest how I can display the string values.
package test
import org.apache.spark.SparkContext
import org.apache.log4j._
object practice2 {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    val sc = new SparkContext("local[2]", "sampleApp")
    val data = sc.textFile("C:/Hadoop/Materials/Module-5_Spark/Spark/TotalSpentByCustomer/customer-orders.csv")
    val rec = data.map(x => x.split(","))
    val rec1 = rec.collect.mkString(",")
    // rec.collect.foreach(array => println(array.mkString(",")))
    rec1.foreach(print)
  }
}

Please mention your Spark version.
I suggest telling Spark that it is reading a CSV file. SparkContext has no read method, so this needs a SparkSession (here called spark, available from Spark 2.0):
val data = spark.read.csv(path)
// or
val data = spark.read.format("csv").load(path)
There are options to take the column names from the CSV header (.option("header", "true")), or you can set them yourself with data.toDF("col1", "col2", "col3").
I'm not sure why you want to display strings, but the best way for a demo is to use .show(), like so:
data.show(10)

You could try something like this
rec.collect.foreach(array => println("%s".format(array.toList)))
Hope this helps.
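To see why the original rec.collect.mkString(",") does not print the field values, here is a minimal sketch in plain Scala with no Spark involved. The rows value stands in for the result of rec.collect, and its sample values are made up for illustration. Calling mkString on the outer array converts each inner array with Array.toString, which gives a JVM identifier rather than the contents, so the fields of each row have to be joined first.

```scala
// Stand-in for rec.collect: each CSV line already split into fields.
// The sample values are hypothetical.
val rows: Array[Array[String]] = Array(
  Array("44", "8602", "37.19"),
  Array("35", "5368", "65.89")
)

// mkString on the outer array stringifies each inner array with
// Array.toString, producing text like [Ljava.lang.String;@1b2c3d
val broken: String = rows.mkString(",")

// Join the fields of each row first, then join the rows with newlines.
val lines: Array[String] = rows.map(_.mkString(","))
val text: String = lines.mkString("\n")

println(text)
```

Note that rec1 in the question is a String, so rec1.foreach(print) iterates over its characters; println(rec1) would print the same text in one call.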

Related

How can I save a single column of a pyspark dataframe in multiple json files?

I have a dataframe that looks a bit like this:
| key 1 | key 2 | key 3 | body |
I want to save this dataframe in 1 json-file per partition, where a partition is a unique combination of keys 1 to 3. I have the following requirements:
The paths of the files should be /key 1/key 2/key 3.json.gz
The files should be compressed
The contents of the files should be values of body (this column contains a json string), one json-string per line.
I've tried multiple things, but no luck.
Method 1: Using native dataframe.write
I've tried using the native write method to save the data. Something like this:
df.write \
    .partitionBy("key 1", "key 2", "key 3") \
    .mode('overwrite') \
    .format('json') \
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec") \
    .save(
        path=path,
        compression="gzip"
    )
This solution doesn't store the files in the correct path and with the correct name, but this can be fixed by moving them afterwards. However, the biggest problem is that this is writing the complete dataframe, while I only want to write the values of the body column. But I need the other columns to partition the data.
Method 2: Using the Hadoop filesystem
It's possible to directly call the Hadoop filesystem java library using this: sc._gateway.jvm.org.apache.hadoop.fs.FileSystem. With access to this filesystem it's possible to create files myself, giving me more control over the path, the filename and the contents. However, in order to make this code scale I'm doing this per partition, so:
def save_partition(items):
    # Store the items of this partition here
    pass
df.foreachPartition(save_partition)
However, I can't get this to work because the save_partition function is executed on the workers, which don't have access to the SparkSession and the SparkContext (which are needed to reach the Hadoop filesystem JVM libraries). I could solve this by pulling all the data to the driver using collect() and saving it from there, but that won't scale.
So, quite a story, but I prefer to be complete here. What am I missing? Is it impossible to do what I want, or am I missing something obvious? Or is it difficult? Or maybe it's only possible from Scala/Java? I would love to get some help on this.
It may be slightly tricky to do in pure PySpark. Creating too many partitions is not recommended. From what you have explained, I think you are using partitions only to get one JSON body per file. You may need a bit of Scala here, but your Spark job can still remain a PySpark job.
Spark internally defines data source interfaces through which you can define how to read and write data. JSON is one such data source. You can try to extend the default JsonFileFormat class and create your own JsonFileFormatV2. You will also need to define a JsonOutputWriterV2 class extending the default JsonOutputWriter. The output writer has a write function that gives you access to the individual rows and the paths passed on from the Spark program. You can modify the write function to meet your needs.
Here is a sample of how I customized JSON writes for my use case of writing a fixed number of JSON entries per file. You can use it as a reference for implementing your own JSON writing strategy.
class JsonFileFormatV2 extends JsonFileFormat {
override val shortName: String = "jsonV2"
override def prepareWrite(
sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
val conf = job.getConfiguration
val fileLineCount = options.get("filelinecount").map(_.toInt).getOrElse(1)
val parsedOptions = new JSONOptions(
options,
sparkSession.sessionState.conf.sessionLocalTimeZone,
sparkSession.sessionState.conf.columnNameOfCorruptRecord)
parsedOptions.compressionCodec.foreach { codec =>
CompressionCodecs.setCodecConfiguration(conf, codec)
}
new OutputWriterFactory {
override def newInstance(
path: String,
dataSchema: StructType,
context: TaskAttemptContext): OutputWriter = {
new JsonOutputWriterV2(path, parsedOptions, dataSchema, context, fileLineCount)
}
override def getFileExtension(context: TaskAttemptContext): String = {
".json" + CodecStreams.getCompressionExtension(context)
}
}
}
}
private[json] class JsonOutputWriterV2(
path: String,
options: JSONOptions,
dataSchema: StructType,
context: TaskAttemptContext,
maxFileLineCount: Int) extends JsonOutputWriter(
path,
options,
dataSchema,
context) {
private val encoding = options.encoding match {
case Some(charsetName) => Charset.forName(charsetName)
case None => StandardCharsets.UTF_8
}
var recordCounter = 0
var filecounter = 0
private val maxEntriesPerFile = maxFileLineCount
private var writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
private[this] var gen = new JacksonGenerator(dataSchema, writer, options)
private def modifiedPath(path:String): String = {
val np = s"$path-filecount-$filecounter"
np
}
override def write(row: InternalRow): Unit = {
gen.write(row)
gen.writeLineEnding()
recordCounter += 1
if(recordCounter >= maxEntriesPerFile){
gen.close()
writer.close()
filecounter+=1
recordCounter = 0
writer = CodecStreams.createOutputStreamWriter(
context, new Path(modifiedPath(path)), encoding)
gen = new JacksonGenerator(dataSchema, writer, options)
}
}
override def close(): Unit = {
if(recordCounter<maxEntriesPerFile){
gen.close()
writer.close()
}
}
}
You can add the jar containing this custom data source to the Spark classpath and then invoke it from PySpark as follows:
df.write.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormatV2").option("filelinecount","5").mode("overwrite").save("path-to-save")

When to use map function on spark in transforming values

I'm new to Spark. Previously I worked with Python and pandas; pandas has a map function that is often used to apply transformations to columns. I found out that Spark also has a map function, but so far I haven't used it at all except for extracting values, like this: df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._
val df3 = df2.map(row => {
  val util = new Util()
  val fullName = row.getString(0) + row.getString(1) + row.getString(2)
  (fullName, row.getString(3), row.getInt(5))
})
val df3Map = df3.toDF("fullName", "id", "salary")
My questions are:
Is it common to use the map function to transform DataFrame columns?
Is it common to use map as in the block of code above (taken from sparkbyexamples)?
When do people usually use map?

Is there a way to list all files in all folders and sub-folders in data lake?

I am trying to list all files in all folders and subfolders. I'm trying to get everything into an RDD or a DataFrame (I don't think it matters which, because it's just a list of file names and paths). I found some code online that looks promising, but it doesn't seem to do anything. I'm pretty new to Scala, though, so maybe I just missed something simple.
First code sample:
import org.apache.spark.sql.functions.input_file_name
val inputPath: String = "mnt/rawdata/2019/01/01/corp/*.gz"
val df = spark.read.text(inputPath)
  .select(input_file_name(), $"value")
  .as[(String, String)] // Optionally convert to Dataset
  .rdd // or RDD
Second code sample:
import java.io.File
def getListOfFiles(dir: String): List[File] = {
  val d = new File(dir)
  if (d.exists && d.isDirectory) {
    d.listFiles.filter(_.isFile).toList
  } else {
    List[File]()
  }
}
val files = getListOfFiles("mnt/rawdata/2019/01/01/corp/")
There is a useful Files.walk method in the java.nio.file package for recursive tree traversal.
import java.nio.file._
import scala.collection.JavaConverters._
val files = Files.walk(FileSystems.getDefault.getPath("mnt/rawdata/2019/01/01/corp")).iterator.asScala.toList
Just note it returns both files and directories, so you need to filter in case you only need files.
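A minimal sketch of that filtering step, assuming Scala 2.13 or later (on 2.12 the import would be scala.collection.JavaConverters._ as above). The listFiles helper and the temporary demo directory are hypothetical names used only for illustration, and the approach only sees paths that are visible on the local filesystem of the machine running the code.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._ // Scala 2.13+ (assumption)

// Recursively list regular files (not directories) under `root`.
def listFiles(root: Path): List[Path] = {
  val stream = Files.walk(root)
  try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList
  finally stream.close() // Files.walk keeps directories open until closed
}

// Small demonstration on a throwaway directory tree (names are made up).
val demoRoot = Files.createTempDirectory("walk-demo")
val nested = Files.createDirectories(demoRoot.resolve("2019/01"))
Files.createFile(demoRoot.resolve("a.gz"))
Files.createFile(nested.resolve("b.gz"))

// Paths relative to the root, with forward slashes on every platform.
val found = listFiles(demoRoot)
  .map(p => demoRoot.relativize(p).toString.replace('\\', '/'))
  .sorted

println(found.mkString("\n"))
```

The stream is closed in a finally block because Files.walk holds open directory handles until it is closed.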

How to convert DataFrame or RDD[object] to Array[Object] in spark?

I am currently using spark streaming and spark sql for my current project. Is there a way to convert Array[Object] to either RDD[object] or DataFrame? I am doing something as below:
val myData = myDf.distinct()
  .collect()
  .map { row =>
    new myObject(row.getAs[String]("id"), row.getAs[String]("name"))
  }
In the code snippet above, myData will be an Array[myObject]. How do I turn it into an RDD[myObject], or directly into a DataFrame, for the next step?
import org.apache.spark.sql.Row
case class myObject(id:String, name:String)
val myData = myDf.distinct.map {
  case Row(id: String, name: String) => myObject(id, name)
}
I think I managed to convert it to RDD[myObject]. I hope this is the right way to do it.
val myData = myDf.distinct()
  .collect()
  .map { row =>
    new myObject(row.getAs[String]("id"), row.getAs[String]("name"))
  }
val myDataRDD = rdd.sparkContext.parallelize(myData) // since this code snippet is inside a foreachRDD clause.

Xml processing in Spark

Scenario:
My input will be multiple small XML files, which I am supposed to read as RDDs, join with another dataset to form an RDD, and send the output as XML.
Is it possible to read XML using Spark and load the data as an RDD? If so, how will the XML be read?
Sample XML:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
How will this be loaded into the RDD?
Yes, it is possible, but the details will differ depending on the approach you take.
If the files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads the data as RDD[(String, String)], where the first element is the file path and the second is the file content. You then parse each file individually, just as you would in a local program.
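As a rough sketch of the per-file parsing step, the function below turns the content of one file into records using the DOM parser that ships with the JDK. The User case class, the parseUsers name, and the choice of parser are assumptions for illustration, not part of the answer. The sample uses well-formed closing tags such as </user>, not the <\user> form shown in the question, which an XML parser would reject.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical record type matching the fields in the sample XML.
case class User(account: String, name: String, number: String)

// Parse one whole XML document (the second element of a wholeTextFiles
// pair) into User records.
def parseUsers(xml: String): Seq[User] = {
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  val doc = builder.parse(
    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
  val nodes = doc.getElementsByTagName("user")
  (0 until nodes.getLength).map { i =>
    val user = nodes.item(i).asInstanceOf[Element]
    // Text of the first child element with the given tag name.
    def field(tag: String): String =
      user.getElementsByTagName(tag).item(0).getTextContent.trim
    User(field("account"), field("name"), field("number"))
  }
}

val sample =
  """<root><users>
    |  <user><account>1234</account><name>name_1</name><number>34233</number></user>
    |  <user><account>58789</account><name>name_2</name><number>54697</number></user>
    |</users></root>""".stripMargin

val users = parseUsers(sample)
users.foreach(println)

// With Spark this would be applied once per file, for example:
//   sc.wholeTextFiles(dir).flatMap { case (_, content) => parseUsers(content) }
```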
For larger files you can use Hadoop input formats.
If the structure is simple, you can split records using textinputformat.record.delimiter. You can find a simple example here. The input is not XML, but it should give you an idea of how to proceed.
Otherwise, Mahout provides XmlInputFormat.
Finally, it is possible to read the file using SparkContext.textFile and adjust later for records that span partition boundaries. Conceptually, this is similar to creating a sliding window or partitioning records into groups of fixed size:
use mapPartitionsWithIndex to identify records broken between partitions, and collect the broken records
use a second mapPartitionsWithIndex to repair the broken records
Edit:
There is also the relatively new spark-xml package, which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's a way to do it using Hadoop input formats to read XML data in Spark, as explained by @zero323.
Input data:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
Code for reading XML Input:
You will get some jars at this link
Imports:
//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml_loader_import
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
case class User(account: String, name: String, number: String)
def main(args: Array[String]): Unit = {
val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
val sparkMasterUrl = "spark://SYSTEMX:7077"
var jars = new Array[String](3)
jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"
val conf = new SparkConf().setAppName("XML Reading")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setSparkHome(sparkHome)
.set("spark.executor.memory", "512m")
.set("spark.default.deployCores", "12")
.set("spark.cores.max", "12")
.setJars(jars)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// ---- loading user from XML
// calling function 1.1
val pages = readFile("src/input_data", "<user>", "<\\user>", sc)
val xmlUserDF = pages.map { tuple =>
{
val account = extractField(tuple, "account")
val name = extractField(tuple, "name")
val number = extractField(tuple, "number")
User(account, name, number)
}
}.toDF()
println(xmlUserDF.count())
xmlUserDF.show()
}
Functions:
def readFile(path: String, start_tag: String, end_tag: String,
sc: SparkContext) = {
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
val rawXmls = sc.newAPIHadoopFile(
path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
rawXmls.map(p => p._2.toString)
}
def extractField(tuple: String, tag: String) = {
var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
if (value.contains("<" + tag + ">") &&
value.contains("</" + tag + ">")) {
value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
}
value
}
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result is a DataFrame; you can convert it to an RDD as required, like this:
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
(x.get(0).toString(),x.get(1).toString(),x.get(2).toString()) }
Please try it and see whether it helps.
This will help you.
package packagename;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import com.databricks.spark.xml.XmlReader;
public class XmlreaderSpark {
public static void main(String arr[]){
String localxml="file path";
String booksFileTag = "user";
String warehouseLocation = "file:" + System.getProperty("user.dir") + "spark-warehouse";
System.out.println("warehouseLocation" + warehouseLocation);
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("Java Spark SQL Example")
.config("spark.some.config.option", "some-value").config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().config("spark.sql.crossJoin.enabled", "true")
.getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
df.show();
}
}
You need to add this dependency to your pom.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
Also, your input file is not well-formed XML: closing tags must be written like </account>, not <\account>.
Thanks.
There are two good options for simple cases:
wholeTextFiles. Use the map method with your XML parser, which could be the Scala XML pull parser (quicker to code) or the SAX pull parser (better performance).
Hadoop streaming XMLInputFormat, for which you must define the start and end tags (<user> and </user>) to process the file; however, it creates one partition per user tag.
The spark-xml package is a good option too.
With all of these options you are limited to processing simple XML that can be interpreted as a dataset with rows and columns.
However, if we make it a little more complex, those options won’t be useful.
For example, if you have one more entity there:
<root>
<users>
<user>...</users>
<companies>
<company>...</companies>
</root>
Now you need to generate 2 RDDs and change your parser to recognise the <company> tag.
This is just a simple case, but the XML could be much more complex and you would need to include more and more changes.
To solve this complexity we’ve built Flexter on top of Apache Spark to take the pain out of processing XML files on Spark. I also recommend reading about converting XML on Spark to Parquet. The latter post also includes some code samples that show how the output can be queried with Spark SQL.
Disclaimer: I work for Sonra
