DataFrame to HDFS in spark scala - apache-spark

I have a Spark DataFrame of the form org.apache.spark.sql.DataFrame = [user_key: string, field1: string]. When I use saveAsTextFile to save it to HDFS, the results look like [12345,xxxxx]. I don't want the opening and closing brackets written to the output file. If I use .rdd to convert it into an RDD, the brackets are still present.
Thanks

Just concatenate the values and store strings:
import org.apache.spark.sql.functions.{concat_ws, col}
val expr = concat_ws(",", df.columns.map(col): _*)
// Go through .rdd so that saveAsTextFile is available (a Dataset has no such method)
df.select(expr).rdd.map(_.getString(0)).saveAsTextFile("some_path")
Or, even better, use spark-csv:
df.write
.format("com.databricks.spark.csv")
.option("header", "false")
.save("some_path")
Another approach is to simply map:
df.rdd.map(_.toSeq.map(_.toString).mkString(","))
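and save the result afterwards, e.g. with saveAsTextFile. Note that _.toString throws on null values, so this assumes the DataFrame has none.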

Related

How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession?

I'm trying to read the CSV files stored on HDFS using sparkSession and count the number of lines and print the value on the console. However, I'm constantly getting NullPointerException while calculating the count. Below is the code snippet,
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting an NPE exactly at the .filter condition. If I remove .filter from the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains multiple CSV files. Each CSV file has two columns: one holds the Id and the other the name of the employee. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.
Try using the isin function.
import spark.implicits._
val validEmployeeIds = List("12345", "6789")
val df = spark.read.csv(inputPath) // read the CSV as in the question
df.filter('_c0.isin(validEmployeeIds:_*)).distinct().count()

Spark RDD: data type changes after splitting the data; how can I split without changing the data type?

I have loaded data from a text file into a Spark RDD. After splitting the data, the data type has changed. How can I split without changing the data type, or how can I convert the split data back to its original type?
My code
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Movie")
sc = SparkContext(conf = conf)
movies = sc.textFile("file:///SaprkCourse/movie/movies.txt")
data=movies.map(lambda x: x.split(","))
data.collect()
My input is like
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
After splitting, all of my data is converted to strings.
I need the output to have the same data types as the input text file, i.e. IntegerType, IntegerType, IntegerType, IntegerType.
When Spark reads a text file, every column is typed as StringType, so if you want to treat your columns as IntegerType you need to cast them.
It seems that your data is CSV,
so you should use a SparkSession, read the data with the CSV reader, and define your schema.
Scala code:
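import org.apache.spark.sql.types.{DoubleType, IntegerType, LongType, StructType}
val schema = new StructType()
.add("userId", IntegerType)
.add("movieId", IntegerType)
.add("rating", DoubleType) // values such as 4.0 are not valid integers
.add("timestamp", LongType) // epoch seconds such as 964982703; convert to a timestamp afterwards if needed
spark.read.option("header", "true").schema(schema).csv("file:///SaprkCourse/movie/movies.txt")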
If you want to keep reading the columns as strings, you can cast each one:
Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, LongType, TimestampType}
val df = data
.select(
col("userId").cast(IntegerType),
col("movieId").cast(IntegerType),
col("rating").cast(IntegerType),
// a string of epoch seconds does not cast to a timestamp directly, so go through a long first
col("timestamp").cast(LongType).cast(TimestampType)
)
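Since the question uses PySpark, the same per-field conversion can also be sketched on the RDD side in plain Python. This is only an illustration under stated assumptions: Rating and parse_line are hypothetical names, the rating is assumed to be a float (the sample shows 4.0), and the timestamp is assumed to be epoch seconds kept as an int. The sample lines stand in for the RDD so the snippet runs without Spark.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed record (hypothetical name, not part of the question)."""
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


HEADER = "userId,movieId,rating,timestamp"


def parse_line(line: str) -> Rating:
    """Split one CSV line and convert each field to its intended type."""
    user_id, movie_id, rating, timestamp = line.split(",")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Sample lines copied from the question; they stand in for the text file.
lines = [HEADER, "1,1,4.0,964982703", "1,3,4.0,964981247"]

# Skip the header row, then convert every remaining line.
parsed = [parse_line(line) for line in lines if line != HEADER]
```

With the RDD from the question, the same function would be applied as movies.filter(lambda l: l != HEADER).map(parse_line), so each element keeps numeric types instead of strings.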

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet, which has a complex data structure containing maps and arrays. After processing it I get my final result, but while writing that result to CSV I get an error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
val conf=new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr=udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString(), but I am still facing the same issue.
How can I write that result to CSV?
Spark version: 2.1.1
The Spark CSV source supports only atomic types. You cannot store any columns that are non-atomic.
I think the best option is to create JSON for the column that has map<string,bigint> as its data type and save it in the CSV, as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, as the udf function and all the aggregation would be in vain if you do so.
So I think you want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to save it in a new dataframe variable and use that variable to save.
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update: starting from Spark 2.2.x, there is finally a proper way to do it using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
An example that works for me on Spark 1.6.0 and spark-csv_2.10-1.4.0 is below:
import com.databricks.spark.csv.CsvParser
import org.apache.spark.sql.DataFrame
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g. scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You should get some inspiration from the creations of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]

How to convert cassandraRow into Row (apache spark)?

I am trying to create a DataFrame from an RDD[CassandraRow], but I can't, because createDataFrame(RDD[Row], schema: StructType) needs an RDD[Row], not an RDD[CassandraRow].
How can I achieve this?
Also, one of the answers to the question
How to convert rdd object to dataframe in spark
suggests calling toDF() on an RDD[Row] to get a DataFrame from the RDD, but that is not working for me. I tried it with an RDD[Row] in another example.
It is also unclear to me how a DataFrame method (toDF()) can be called on an instance of an RDD (RDD[Row]).
I am using Scala.
If you really need this you can always map your data to Spark rows:
sqlContext.createDataFrame(
rdd.map(r => org.apache.spark.sql.Row.fromSeq(r.columnValues)),
schema
)
but if you want DataFrames it is better to import data directly:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
