Spark RDD: data type changes after splitting, how can I split without changing the data type? - apache-spark

I have loaded data from a text file into a Spark RDD, and after splitting the data its data type is changed. How can I split without changing the data type, or how can I convert the split data back to its original data type?
My code
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("Movie")
sc = SparkContext(conf = conf)
movies = sc.textFile("file:///SaprkCourse/movie/movies.txt")
data=movies.map(lambda x: x.split(","))
data.collect()
My input looks like this:
userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
After splitting, my whole dataset is changed to String type.
I want the output to have the same data types as in the input text file, i.e. IntegerType, IntegerType, IntegerType, IntegerType.

When reading a text file, Spark assigns StringType to all columns, so if you want to treat your columns as IntegerType you need to cast them.
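For the RDD approach in the question, the cast can be done right inside the map. A minimal PySpark sketch (assuming the header row shown in the sample should be skipped, and keeping rating as a float since the sample values look like 4.0):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Movie")
sc = SparkContext(conf=conf)

movies = sc.textFile("file:///SaprkCourse/movie/movies.txt")
header = movies.first()  # "userId,movieId,rating,timestamp"

# Skip the header, split each line, then cast every field back to a numeric type.
data = (movies.filter(lambda line: line != header)
              .map(lambda line: line.split(","))
              .map(lambda f: (int(f[0]), int(f[1]), float(f[2]), int(f[3]))))
data.collect()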

It seems that your data is CSV, though, so a better approach is to use a SparkSession, read the data with the CSV reader, and define your schema.
Scala code:
import org.apache.spark.sql.types.{StructType, IntegerType, TimestampType}

val schema = new StructType()
  .add("userId", IntegerType)
  .add("movieId", IntegerType)
  .add("rating", IntegerType)
  .add("timestamp", TimestampType)

spark.read.schema(schema).csv("file:///SaprkCourse/movie/movies.txt")
If you want to keep reading the file as text, you can cast every column instead:
Scala:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, TimestampType}

val df = data
  .select(
    col("userId").cast(IntegerType),
    col("movieId").cast(IntegerType),
    col("rating").cast(IntegerType),
    col("timestamp").cast(TimestampType)
  )
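Since the question itself is PySpark, a rough PySpark equivalent of the schema approach could look like the sketch below. Note that DoubleType for rating and LongType for timestamp are my own assumptions based on the sample data (4.0 ratings and epoch-second timestamps), not part of the answer above:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, LongType

spark = SparkSession.builder.master("local").appName("Movie").getOrCreate()

schema = StructType([
    StructField("userId", IntegerType()),
    StructField("movieId", IntegerType()),
    StructField("rating", DoubleType()),   # assumption: ratings such as 4.0 are not integers
    StructField("timestamp", LongType()),  # assumption: epoch seconds stored as a long
])

df = (spark.read
      .schema(schema)
      .option("header", "true")
      .csv("file:///SaprkCourse/movie/movies.txt"))
df.printSchema()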

Related

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema results in the stops field spilling over into the next columns and messing up the dataframe.
If I give my own schema like:
from pyspark.sql.types import (StructType, StructField, StringType, TimestampType,
                               BooleanType, ArrayType, DoubleType)

schema = StructType([
    StructField('name', StringType()),
    StructField('date', TimestampType()),
    StructField('win', BooleanType()),
    StructField('stops', ArrayType(StringType())),
    StructField('cost', DoubleType())])
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how would I properly read the CSV without this failure?
Since CSV doesn't support arrays, you need to first read the column as a string, then convert it:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# You need to set the escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps

Cast datatype from array to String for multiple columns in Spark throwing an error

I have a dataframe df that contains three columns of type array. I am trying to save the output to CSV, so I converted the data type to string.
import org.apache.spark.sql.functions._
val df2 = df.withColumn("Total", col("total").cast("string")),
("BOOKID", col("BOOKID").cast("string"),
"PublisherID", col("PublisherID").cast("string")
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")
But I am getting an error: "Cannot resolve symbol write".
Spark 2.2
Scala
Try the code below. It's not possible to add multiple columns inside a single withColumn call.
val df2 = df
.withColumn("Total", col("total").cast("string"))
.withColumn("BOOKID", col("BOOKID").cast("string"))
.withColumn("PublisherID", col("PublisherID").cast("string"))
.write
.csv(path="D:/pennymac/SOLUTION1/OUTPUT")

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet. It has a complex data structure, with a map and an array inside it. After processing it I got the final result, but while writing that result to CSV I am getting an error saying:
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

val conf = new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")

def sumaggr = udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)

datadf.select(col("neid"), sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString() but I am still facing the same issue.
How can I write that result to CSV?
Spark version: 2.1.1
The Spark CSV source supports only atomic types; you cannot store any columns that are non-atomic.
I think the best approach is to create a JSON string for the column that has map<string,bigint> as its datatype and save that to CSV, as below.
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
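For reference, the same to_json idea in PySpark might look roughly like this (a sketch only; column_name_with_map_type and outputpath are the placeholders used in the answer above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local").appName("student-example").getOrCreate()
datadf = spark.read.parquet("C:\\file1.snappy.parquet")

# Serialize the map column to a JSON string so the CSV writer only sees atomic types.
(datadf
 .withColumn("column_name_with_map_type",
             F.to_json(F.struct("column_name_with_map_type")))
 .write.option("header", "true").csv("outputpath"))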
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, as the udf function and all the aggregation done would go to waste if you do so.
So I think you want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to save it into a new dataframe variable and use that variable to write the output:
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to a dataframe. Is this possible?
Update: starting from Spark 2.2.x there is finally a proper way to do it, using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
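For what it's worth, a rough PySpark sketch of the same Spark 2.2+ approach; in PySpark, DataFrameReader.csv also accepts an RDD of CSV lines, so the string can be parallelized directly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("CsvExample").getOrCreate()

csv_string = """id, date, timedump
1, "2014/01/01 23:00:01",1499959917383
2, "2014/11/31 12:40:32",1198138008843"""

# DataFrameReader.csv accepts an RDD of strings storing CSV rows.
lines = spark.sparkContext.parallelize(csv_string.splitlines())
frame = spark.read.option("header", True).option("inferSchema", True).csv(lines)
frame.show()
frame.printSchema()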
Old Spark versions
Actually you can, though it's using library internals and is not widely advertised. Just create and use your own CsvParser instance.
An example that works for me on Spark 1.6.0 and spark-csv_2.10-1.4.0 is below:
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string as CSV using, e.g., scala-csv:
val myCSVdata: Array[List[String]] =
  myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD: RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You can get some inspiration from the creation of the Person records in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
import java.io.InputStream

import scala.io.Source
import spark.implicits._

val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(csvList.toDS())
  .as[SomeCaseClass]

DataFrame to HDFS in spark scala

I have a Spark data frame of the format org.apache.spark.sql.DataFrame = [user_key: string, field1: string]. When I use saveAsTextFile to save the file in HDFS, the results look like [12345,xxxxx]. I don't want the opening and closing brackets written to the output file. If I use .rdd to convert it into an RDD, the brackets are still present in the RDD.
Thanks
Just concatenate the values and store strings:
import org.apache.spark.sql.functions.{concat_ws, col}
import org.apache.spark.sql.Row
val expr = concat_ws(",", df.columns.map(col): _*)
df.select(expr).map(_.getString(0)).saveAsTextFile("some_path")
Or even better use spark-csv:
selectedData.write
.format("com.databricks.spark.csv")
.option("header", "false")
.save("some_path")
Another approach is to simply map:
df.rdd.map(_.toSeq.map(_.toString).mkString(","))
and save afterwards.
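In PySpark the same map-and-save idea might look like the sketch below (the sample rows and app name are placeholders of mine, standing in for the two-column DataFrame from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ToHdfs").getOrCreate()

# Hypothetical stand-in for the [user_key: string, field1: string] DataFrame.
df = spark.createDataFrame([("12345", "xxxxx")], ["user_key", "field1"])

# Join each row's values with commas and write plain text lines, so no brackets appear.
df.rdd.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("some_path")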
