I have been trying to add a timestamp to my data frames and persist them in to hive.
But here is the problem: as I cannot use timestamps as a data type in hive version 0.13 I want to convert current_timestamp() of the timestamp data type to string so that I can load it in to my hive table.
Here is my timestamp column:
[2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278] [2017-01-12 12:55:55.278]
I have tried this but with no luck:
val ts = current_timestamp()
val df:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val date:String = df.format(ts.toLong)
Any way to convert the timestamp to string in Scala??
It's quite easy than i expected,
I have been appending a timestamp to my data frame like this,
val NewDF = oldDF.withColumn("newColumn_name",current_timestamp())
And casted the timestamp in to string like this,
val NewDF = oldDF.withColumn("newColumn_name",current_timestamp().cast("String"))
Hoe this helps some one.
Thanks.
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
format.format(new java.util.Date())
Related
Creating external table with partitions from spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
val spark = SparkSession.builder().master("local[*]").appName("splitInput").enableHiveSupport().getOrCreate()
val sparkDf = spark.read.option("header","true").option("inferSchema","true").csv("input/candidate/event=ABCD/CandidateScheduleData_3007_2018.csv")
var newDf = sparkDf
for(col <- sparkDf.columns){ newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_")) }
newDf.write.mode(SaveMode.Overwrite).option("path","/output/candidate/event=ABCD/").partitionBy("CenterCode","ExamDate").saveAsTable("abc.candidatelist")
Everything works fine except the partition column ExamDate format created as
ExamDate=30%2F07%2F2018 instead of ExamDate=30-07-2018
How to replace%2F with - in ExamDate format.
%2F is percent encoded /. This means that data is exactly in 30/07/2018 format. You can either:
Parse it to_date using specified format.
Manually format columns with required format.
I read data from Oracle via spark JDBC connection to a DataFrame. I have a column which is obviously StringType in dataframe.
Now I want to persist this in Hive, but as datatype Varchar(5). I know the string would be truncated but it is ok.
I tried using UDFs which didn't work since dataframe does not have varchar or char types. I also created a temporary view in Hive using:
val tv = df.createOrReplaceTempView("t_name")
val df = spark.sql("select cast(col_name as varchar(5)) from tv")
But then when i printSchema, i still see a string type.
How can I make I save it as a varchar column in Hive table ?
Try creating Hive table("dbName.tableName") with required schema (varchar(5) in this case) and insert into the table directly from Dataframe like below.
df.write.insertInto("dbName.tableName" ,overwrite = False)
*Hi all,
I have an easy question for you all.
I have an RDD, created from kafka streaming using createStream method.
Now i want to add a timestamp as a value to this rdd before converting in to dataframe.
I have tried doing to add a value to the dataframe using with withColumn() but returning this error*
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As i came to know that DataFrames cannot be altered as they are immutable, but RDDs are immutable as well.
Then what is the best way to do it.
How to a value to the RDD(adding timestamp to an RDD dynamically).
Try current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
For add a new column with a constant like timestamp, you can use litfunction:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
See my output
I see in comments that some folks are having trouble getting the timestamp to string. Here is a way to do that using spark 3 datetime format
import org.apache.spark.sql.functions._
val d =dataframe.
.withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
In Spark SQL, a dataframe can be queried as a table using this:
sqlContext.registerDataFrameAsTable(df, "mytable")
Assuming what I have is mytable, how can I get or access this as a DataFrame?
The cleanest way:
df = sqlContext.table("mytable")
Documentation
Well you can query it and save the result into a variable. Check that SQLContext's method sql returns a DataFrame.
df = sqlContext.sql("SELECT * FROM mytable")
I am using spark 1.6 and I aim to create external hive table like what I do in hive script. To do this, I first read in the partitioned avro file and get the schema of this file. Now I stopped here, I get no idea how to apply this schema to my creating table. I use scala. Need help guys.
finally, I make it myself with old-fashioned way. With the help of code below:
val rawSchema = sqlContext.read.avro("Path").schema
val schemaString = rawSchema.fields.map(field => field.name.replaceAll("""^_""", "").concat(" ").concat(field.dataType.typeName match {
case "integer" => "int"
case smt => smt
})).mkString(",\n")
val ddl =
s"""
|Create external table $tablename ($schemaString) \n
|partitioned by (y int, m int, d int, hh int, mm int) \n
|Stored As Avro \n
|-- inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' \n
| -- outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' \n
| Location 'hdfs://$path'
""".stripMargin
take care no column name can start with _ and hive can't parse integer. I would like to say that this way is not flexible but work. if anyone get better idea, plz comment.
I didn't see a way to automatically infer schema for external tables. So I created case for the string type. You could add case for your data type. But I'm not sure how many columns you have. I apologize as this might not be a clean approach.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SaveMode};
import org.apache.spark.sql.types.{StructType,StructField,StringType};
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.read.format("com.databricks.spark.avro").load("people.avro")
val schema = results.schema.map( x => x.name.concat(" ").concat( x.dataType.toString() match { case "StringType" => "STRING"} ) ).mkString(",")
val hive_sql = "CREATE EXTERNAL TABLE people_and_age (" + schema + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/ravi/people_age'"
hiveContext.sql(hive_sql)
results.saveAsTable("people_age",SaveMode.Overwrite)
hiveContext.sql("select * from people_age").show()
Try the below code.
val htctx= new HiveContext(sc)
htctx.sql(create extetnal table tablename schema partitioned by attribute row format serde serde.jar field terminated by value location path)