Cast datatype from array to String for multiple columns in Spark throwing an error - apache-spark

I have a dataframe df that contains three columns of type array. I am trying to save the output to CSV, so I converted the data type to string.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("Total", col("total").cast("string"),
    "BOOKID", col("BOOKID").cast("string"),
    "PublisherID", col("PublisherID").cast("string"))
  .write
  .csv(path="D:/pennymac/SOLUTION1/OUTPUT")
But I am getting an error: "Cannot resolve symbol write".
Spark 2.2, Scala

It is not possible to add multiple columns inside a single withColumn call; withColumn takes one column name and one Column expression per call. Try the code below.
val df2 = df
  .withColumn("Total", col("total").cast("string"))
  .withColumn("BOOKID", col("BOOKID").cast("string"))
  .withColumn("PublisherID", col("PublisherID").cast("string"))

df2.write.csv("D:/pennymac/SOLUTION1/OUTPUT")
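If there are more array columns to convert, a select-based variant avoids repeating withColumn. This is only a minimal sketch assuming the same column names as above (Spark resolves "total" vs "Total" case-insensitively by default):

import org.apache.spark.sql.functions.col

// Cast each listed array column to string in one select, passing the other columns through unchanged.
val arrayCols = Set("total", "bookid", "publisherid")
val casted = df.select(df.columns.map { c =>
  if (arrayCols.contains(c.toLowerCase)) col(c).cast("string").as(c) else col(c)
}: _*)

casted.write.csv("D:/pennymac/SOLUTION1/OUTPUT")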

Related

How to handle NullPointerException while reading, filtering and counting the lines of CSV files using SparkSession?

I'm trying to read the CSV files stored on HDFS using sparkSession and count the number of lines and print the value on the console. However, I'm constantly getting NullPointerException while calculating the count. Below is the code snippet,
val validEmployeeIds = Set("12345", "6789")
val count = sparkSession
.read
.option("escape", "\"")
.option("quote", "\"")
.csv(inputPath)
.filter(row => validEmployeeIds.contains(row.getString(0)))
.distinct()
.count()
println(count)
I'm getting an NPE exactly at the .filter condition. If I remove the .filter from the code, it runs fine and prints the count. How can I handle this NPE?
The inputPath is a folder that contains multiple CSV files. Each CSV file has two columns: one represents the Id and the other the name of the employee. A sample CSV extract is below:
12345,Employee1
AA888,Employee2
I'm using Spark version 2.3.1.
Try using the isin function instead of looking the values up in a Scala Set inside the filter lambda.
import spark.implicits._
val validEmployeeIds = List("12345", "6789")
val df = // Read CSV
df.filter('_c0.isin(validEmployeeIds:_*)).distinct().count()
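Putting it together with the read options from the question (a sketch based on the snippet above; _c0 is Spark's default name for the first column when no header option is set):

import spark.implicits._

val validEmployeeIds = List("12345", "6789")

val count = sparkSession
  .read
  .option("escape", "\"")
  .option("quote", "\"")
  .csv(inputPath)
  .filter($"_c0".isin(validEmployeeIds: _*)) // column expression, no driver-side closure over the row
  .distinct()
  .count()

println(count)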

Spark RDD: after splitting the data the data type is changed. How can I split without changing the data type?

I have loaded data from a text file into a Spark RDD. After splitting the data, the data type is changed. How can I split without changing the data type, or how can I convert the split data back to the original data types?
My code
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Movie")
sc = SparkContext(conf = conf)
movies = sc.textFile("file:///SaprkCourse/movie/movies.txt")
data=movies.map(lambda x: x.split(","))
data.collect()
My input is like
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
After splitting, my complete data is changed to String type.
I require the output to have the same data types as in the input text file: IntegerType, IntegerType, IntegerType, IntegerType.
When reading a text file, Spark assigns StringType to all columns, so if you want to treat your columns as IntegerType you need to cast them.
It seems that your data is CSV, so you should use the SparkSession CSV reader and define your schema.
Scala code:
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("userId", IntegerType)
  .add("movieId", IntegerType)
  .add("rating", IntegerType)
  .add("timestamp", TimestampType)

spark.read
  .schema(schema)
  .option("header", "true") // the first line of the file is a header row
  .csv("file:///SaprkCourse/movie/movies.txt")
If you want to keep reading the file as text, you can cast every column.
Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType,TimestampType}
val df = data
.select(
col("userId").cast(IntegerType),
col("movieId").cast(IntegerType),
col("rating").cast(IntegerType),
col("timestamp").cast(TimestampType)
)
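The cast version assumes data is already a DataFrame of string columns. A minimal bridge sketch (an assumption on my part, using the column names from the header line in the question) to get there from the raw text file:

import spark.implicits._

// Read as text, drop the header row, split each line on commas and name the columns;
// everything is a string at this point, so the casts above can then be applied.
val data = spark.read
  .textFile("file:///SaprkCourse/movie/movies.txt")
  .filter(line => !line.startsWith("userId"))
  .map { line =>
    val a = line.split(",")
    (a(0), a(1), a(2), a(3))
  }
  .toDF("userId", "movieId", "rating", "timestamp")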

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet, with a complex data structure (a map with an array inside it). After processing it I got the final result. While writing that result to CSV I am getting an error saying
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{col, udf}

val conf = new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr = udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString() but I am still facing the same issue.
How can I write that result to CSV?
Spark version 2.1.1
The Spark CSV source supports only atomic types; you cannot store any column that is non-atomic.
I think the best approach is to create JSON for the column that has map<string,bigint> as its datatype and save it in the CSV as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
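A slightly more general sketch (my own variant, not from the original answer): detect every map-typed column from the schema and apply the same to_json(struct(...)) trick to each of them before writing.

import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.types.MapType

// Replace each map-typed column with its JSON representation so the CSV writer accepts the dataframe.
val csvSafe = datadf.schema.fields.foldLeft(datadf) { (df, field) =>
  field.dataType match {
    case _: MapType => df.withColumn(field.name, to_json(struct(col(field.name))))
    case _          => df
  }
}

csvSafe.write.option("header", "true").csv("outputpath")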
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, as the udf and all the aggregation you did would go to waste if you do so.
I think you actually want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
so you need to save it in a new dataframe variable and use that variable to write the CSV.
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.
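Note that finalDF only contains neid and the numeric sum column produced by the udf, so the map<string,bigint> restriction no longer applies and the plain CSV write goes through.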

Apply a function to a single column of a csv in Spark

Using Spark I'm reading a CSV and want to apply a function to one of its columns. I have some code that works but it's very hacky. What is the proper way to do this?
My code
import sys
from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

SparkContext().addPyFile("myfile.py")

spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()

from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True,
                    mode="DROPMALFORMED",)

a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use User Defined Functions (udf) combined with a withColumn :
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
udf_myFunction = udf(myFunction, IntegerType()) # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3"))  # replace "_3" with the name of the column you want to consider; since the CSV is read with header=True, the columns keep their header names
This will add a new column "message" to the dataframe df containing the result of myFunction(line[3]).
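Compared with the rdd.map approach in the question, the udf keeps everything in the DataFrame API: the other columns keep their names and types, and there is no need to rebuild each Row by hand.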

createDataFrame() returning a list instead of DataFrame in Spark

I am running Spark 1.5.1. On startup I have a HiveContext available as sqlContext, but I set
sqlContext2 = SQLContext(sc)
I create a pipelined RDD by parsing a list of strings to JSON
data = points.map(lambda line: json.loads(line))
I then try to convert this into a dataframe using
DF = sqlContext2.createDataFrame(data).collect()
This runs perfectly, but when I run type(DF) it says that it is a list.
How is this possible? How is a list coming out of createDataFrame()?
That's because when you call collect() on a DataFrame, it returns a list that contains all of the elements (Rows) of that DataFrame.
If you want just a DataFrame, df = sqlContext.createDataFrame(data) is enough.
There is no need for sqlContext2 here.
