How to convert CassandraRow into Row (Apache Spark)?

I am trying to create a DataFrame from an RDD[CassandraRow], but I can't, because createDataFrame(RDD[Row], schema: StructType) needs an RDD[Row], not an RDD[CassandraRow].
How can I achieve this?
Also, one of the answers to the question "How to convert rdd object to dataframe in spark" suggests calling toDF() on an RDD[Row] to get a DataFrame from the RDD. That is not working for me; I tried toDF() on an RDD[Row] in another example.
I also don't understand how a DataFrame method (toDF()) can be called on an RDD instance (RDD[Row]).
I am using Scala.

If you really need this you can always map your data to Spark rows:
sqlContext.createDataFrame(
rdd.map(r => org.apache.spark.sql.Row.fromSeq(r.columnValues)),
schema
)
but if you want DataFrames it is better to import data directly:
val df = sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> table, "keyspace" -> keyspace))
.load()
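As a rough illustration of the first approach, the sketch below builds an RDD[Row] with Row.fromSeq and an explicit schema. It assumes Spark 2.x or later with a local SparkSession, and it uses plain Seq values as a stand-in for CassandraRow.columnValues so that it runs without a Cassandra cluster; the object name and the columns id and name are hypothetical. It also touches on the toDF() question: toDF() is added by the implicits import and is meant for RDDs of tuples or case classes, not for RDD[Row].

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowsToDataFrame {
  // Hypothetical schema; with Cassandra it has to match the table's columns.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // Stand-in for rdd.map(r => Row.fromSeq(r.columnValues)):
  // each inner Seq plays the role of one row's column values.
  def toDataFrame(spark: SparkSession, values: Seq[Seq[Any]]): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(values).map(v => Row.fromSeq(v))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RowsToDataFrame")
      .getOrCreate()
    import spark.implicits._

    // Explicit schema plus RDD[Row].
    toDataFrame(spark, Seq(Seq(1, "a"), Seq(2, "b"))).show()

    // toDF() comes from spark.implicits._ and works on tuples or case classes.
    spark.sparkContext.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name").show()

    spark.stop()
  }
}
```

With the Cassandra connector on the classpath, the same createDataFrame call should accept the mapped RDD from the answer in place of the stand-in values.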

Related

how to insert dataframe having map column in hive table

I have a DataFrame with multiple columns, one of which is of type map(string,string). I can print this DataFrame, and the map column shows data such as Map("PUN" -> "Pune"). I want to write this DataFrame to a Hive table (stored as Avro) that has the same column with type map.
Df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
  .show(false)
//table - created external hive table..stored as avro..with avro schema
After removing this map type column I'm able to save the DataFrame to the Hive Avro table.
How I save to the Hive table:
spark.save - saving avro file
spark.sql - creating partition on hive table with avro file location
See this test case from the Spark test suite as an example:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option, you can try saveAsTable as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy is the way to create partitions.
You can achieve that with saveAsTable
Example:
Df\
.write\
.saveAsTable(name='tableName',
format='com.databricks.spark.avro',
mode='append',
path='avroFileLocation')
Change the mode option to whatever suits you

How to add multidimensional array to an existing Spark DataFrame

If I understand correctly, ArrayType can be added as Spark DataFrame columns. I am trying to add a multidimensional array to an existing Spark DataFrame by using the withColumn method. My idea is to have this array available with each DataFrame row in order to use it to send back information from the map function.
The error I get says that the withColumn function is looking for a Column type but it is getting an array. Are there any other functions that will allow adding an ArrayType?
object TestDataFrameWithMultiDimArray {
val nrRows = 1400
val nrCols = 500
/** Our main function where the action happens */
def main(args: Array[String]) {
// Create a SparkContext using every core of the local machine
val sc = new SparkContext("local[*]", "TestDataFrameWithMultiDimArray")
val sqlContext = new SQLContext(sc)
val PropertiesDF = sqlContext.read
.format("com.crealytics.spark.excel")
.option("location", "C:/Users/tjoha/Desktop/Properties.xlsx")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "False")
.option("sheetName", "Sheet1")
.load()
PropertiesDF.show()
PropertiesDF.printSchema()
val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", Array.ofDim[Any](nrRows,nrCols))
}
}
Thanks for your help.
Kind regards,
Johann
There are two problems in your code:
The second argument to withColumn needs to be a Column. You can wrap a constant value with the lit function.
Spark can't take Any as a column type; you need to use a specific supported type.
val PropertiesDFPlusMultiDimArray = PropertiesDF.withColumn("ArrayCol", lit(Array.ofDim[Int](nrRows,nrCols)))
will do the trick
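A small sketch of the same idea on a toy DataFrame, assuming Spark 2.2 or later. It uses typedLit with nested Seq values instead of lit with a nested Array, which is a swapped-in variant rather than the answer's exact code; the object name, the toy DataFrame and the 2 x 3 size are hypothetical. Note that a literal column repeats the same array in every row, so a 1400 x 500 array may be costly.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.typedLit

object AddNestedArrayColumn {
  // Adds the same constant two-dimensional array (filled with zeros) to every row.
  def withGrid(df: DataFrame, nrRows: Int, nrCols: Int): DataFrame = {
    val grid: Seq[Seq[Int]] = Seq.fill(nrRows)(Seq.fill(nrCols)(0))
    df.withColumn("ArrayCol", typedLit(grid))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("AddNestedArrayColumn")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input standing in for PropertiesDF.
    val df = Seq("a", "b").toDF("name")
    val result = withGrid(df, 2, 3)
    result.printSchema() // ArrayCol should be an array of arrays of integers
    result.show(false)

    spark.stop()
  }
}
```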

Converting CassandraRow obtained from joinWithCassandraTable to DataFrame

case class SourcePartition(id: String, host:String ,bucket: Int)
val joinedRDDs = partitions.joinWithCassandraTable("db_name", "table_name")
joinedRDDs.values.foreach(println)
I have to use joinWithCassandraTable. How do I convert the resulting CassandraRow into a DataFrame? Or is there an equivalent of joinWithCassandraTable for DataFrames?
I have to read a lot of partitions in one go. I'm aware of the DataStax Cassandra connector's predicate pushdown, but it seems to allow pulling only one partition at a time (it doesn't seem to allow the IN operator; only = seems to be supported).
val spark: SparkSession = SparkSession.builder().master("local[4]").appName("RDD2DF").getOrCreate()
val sc: SparkContext = spark.sparkContext
import spark.implicits._
val internalJoinRDD = spark.sparkContext.cassandraTable("test", "test_table_1").joinWithCassandraTable("test", "table_table_2")
internalJoinRDD.toDebugString
internalJoinRDD.toDF()
Can you try the code snippet above?
If you have a schema for your data, you can use
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame

how can i add a timestamp as an extra column to my dataframe

Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a DataFrame.
I have tried adding the value to the DataFrame using withColumn(), but it returns this error:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
I have learned that DataFrames cannot be altered because they are immutable, but RDDs are immutable as well.
So what is the best way to do it?
How can I add a value to the RDD (that is, add a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant such as a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble getting the timestamp as a string. Here is a way to do that using the Spark 3 datetime format:
import org.apache.spark.sql.functions._
val d = dataframe
.withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
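The answers above give columns of different types, which may matter when the DataFrame is written out. The sketch below, assuming Spark 2.x or later and a hypothetical toy DataFrame, contrasts current_timestamp() (a TimestampType column filled in by Spark) with lit(System.currentTimeMillis()) (a LongType column holding the driver's clock in milliseconds, captured once when the column is defined). The object and method names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{current_timestamp, lit}

object AddTimestampColumn {
  // current_timestamp() is evaluated by Spark and yields a TimestampType column.
  def withSparkTimestamp(df: DataFrame): DataFrame =
    df.withColumn("timeStamp_column", current_timestamp())

  // lit(...) wraps a value computed on the driver; here a Long in epoch milliseconds.
  def withDriverMillis(df: DataFrame): DataFrame =
    df.withColumn("timeStamp_column", lit(System.currentTimeMillis()))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("AddTimestampColumn")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the DataFrame read from Kafka.
    val df = Seq(("click", "phone"), ("view", "tablet")).toDF("event_name", "device_type")

    withSparkTimestamp(df).printSchema() // timeStamp_column: timestamp
    withDriverMillis(df).printSchema()   // timeStamp_column: long

    spark.stop()
  }
}
```

Neither variant needs a column named "now" in the DataFrame, which is what dataframe.col("now") was looking for in the question.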

DataFrame to HDFS in spark scala

I have a Spark DataFrame of the format org.apache.spark.sql.DataFrame = [user_key: string, field1: string]. When I use saveAsTextFile to save it to HDFS, the results look like [12345,xxxxx]. I don't want the opening and closing brackets written to the output file. Even if I use .rdd to convert it into an RDD, the brackets are still present.
Thanks
Just concatenate the values and store strings:
import org.apache.spark.sql.functions.{concat_ws, col}
import org.apache.spark.sql.Row
val expr = concat_ws(",", df.columns.map(col): _*)
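// Going through .rdd keeps this working on Spark 2.x and later, where df.map returns a Dataset
df.select(expr).rdd.map(_.getString(0)).saveAsTextFile("some_path")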
Or even better use spark-csv:
df.write
.format("com.databricks.spark.csv")
.option("header", "false")
.save("some_path")
Another approach is to simply map:
df.rdd.map(_.toSeq.map(_.toString).mkString(","))
and save afterwards.
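The mapping approach can be sketched end to end as follows, assuming Spark 2.x or later. It uses Row.mkString instead of toSeq.map(_.toString), a swapped-in variant that renders null values as the text null rather than throwing; the sample data, the object name and the temporary output path are hypothetical.

```scala
import java.nio.file.Files

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowsToLines {
  // Turns each Row into one comma-separated line, without the surrounding brackets.
  def toLines(df: DataFrame): RDD[String] =
    df.rdd.map(_.mkString(","))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RowsToLines")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data with the same columns as in the question.
    val df = Seq(("12345", "xxxxx"), ("67890", "yyyyy")).toDF("user_key", "field1")

    // saveAsTextFile needs a path that does not exist yet.
    val out = Files.createTempDirectory("rows-to-lines").resolve("out").toString
    toLines(df).saveAsTextFile(out)

    spark.stop()
  }
}
```

Values that themselves contain commas are not quoted or escaped here, so a CSV writer is probably the safer choice for such data.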
