Spark to SparkSQL equivalent syntax - apache-spark

I have these two lines in Spark and I want to get the equivalent in Spark SQL (I'm working in a Python environment):
df = spark_df.filter(F.lower(F.col("col_name")).rlike("[0-9]{9}$")).count()
spark_df = spark_df.withColumn(columnname, F.to_date(F.col(columnname), "yyyyMMdd"))

For Spark SQL, first register the DataFrame as a temp view, then run the SQL against it.
Example:
spark_df.createOrReplaceTempView("tmp")
df = spark.sql("""select count(*) from tmp where lower(col_name) rlike '[0-9]{9}$' """).collect()[0][0]
spark_df = spark.sql("""select *, to_date(columnname, 'yyyyMMdd') as columnname from tmp""")

Related

How to insert a dataframe having a map column into a Hive table

I have a dataframe with multiple columns, one of which is of type map(string,string). I'm able to print this dataframe, and the map column shows data such as Map("PUN" -> "Pune"). I want to write this dataframe to a Hive table (stored as Avro) which has the same column with type map.
Df.withColumn("cname", lit("Pune"))
  .withColumn("city_code_name", map(lit("PUN"), col("cname")))
  .show(false)
// table - created as an external Hive table, stored as Avro, with an Avro schema
After removing this map-type column I'm able to save the dataframe to the Hive Avro table.
The way I save to the Hive table:
spark.save - saving the Avro files
spark.sql - creating a partition on the Hive table pointing at the Avro file location (a rough sketch of this two-step flow is shown below)
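A rough PySpark sketch of that two-step flow (not the poster's exact code: the database, table, paths, and partition column are illustrative, the external Avro table is assumed to already exist and be partitioned by load_date, and the Avro format string depends on your Spark version):
# 1) Write the dataframe (map column included) as Avro files.
#    'com.databricks.spark.avro' is the format on Spark < 2.4; on 2.4+ use 'avro'
#    with the spark-avro package on the classpath.
(df.write
    .format('com.databricks.spark.avro')
    .mode('overwrite')
    .save('/data/city_codes/load_date=20200101'))

# 2) Point a partition of the pre-existing external Avro table at that location.
spark.sql("""
    ALTER TABLE mydb.city_codes
    ADD IF NOT EXISTS PARTITION (load_date='20200101')
    LOCATION '/data/city_codes/load_date=20200101'
""")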
See this test case as an example from the Spark tests:
test("Insert MapType.valueContainsNull == false") {
val schema = StructType(Seq(
StructField("m", MapType(StringType, StringType, valueContainsNull = false))))
val rowRDD = spark.sparkContext.parallelize(
(1 to 100).map(i => Row(Map(s"key$i" -> s"value$i"))))
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("tableWithMapValue")
sql("CREATE TABLE hiveTableWithMapValue(m Map <STRING, STRING>)")
sql("INSERT OVERWRITE TABLE hiveTableWithMapValue SELECT m FROM tableWithMapValue")
checkAnswer(
sql("SELECT * FROM hiveTableWithMapValue"),
rowRDD.collect().toSeq)
sql("DROP TABLE hiveTableWithMapValue")
}
Also, if you want the save option then you can try saveAsTable, as shown here:
Seq(9 -> "x").toDF("i", "j")
.write.format("hive").mode(SaveMode.Overwrite).option("fileFormat", "avro").saveAsTable("t")
yourdataframewithmapcolumn.write.partitionBy(...) is the way to create partitions; a sketch combining it with saveAsTable follows the example below.
You can achieve that with saveAsTable
Example:
Df.write.saveAsTable(name='tableName',
                     format='com.databricks.spark.avro',
                     mode='append',
                     path='avroFileLocation')
Change the mode option to whatever suits you.
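Putting the partitionBy note and the saveAsTable example together, a minimal PySpark sketch (the table, partition column, and path are illustrative, and the Avro format string depends on your Spark version):
(yourdataframewithmapcolumn.write
    .format('com.databricks.spark.avro')       # 'avro' on Spark 2.4+ with spark-avro available
    .partitionBy('load_date')                  # hypothetical partition column
    .mode('append')
    .option('path', '/data/avroFileLocation')  # external table location (illustrative)
    .saveAsTable('mydb.tableName'))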

Spark SQL 'Show table extended from db like table' gives a different result from Hive

spark.sql("SHOW TABLE EXTENDED IN DB LIKE 'TABLE'")
Beeline >>SHOW TABLE EXTENDED IN DB LIKE 'TABLE';
Both queries give different results.
If I run the same query in Spark it gives a different result than in Hive: Format and lastUpdatedTime are missing in Spark SQL.
If anyone has an idea, please let me know how to see the lastUpdatedTime of a Hive table from Spark SQL.
Try this -
scala> val df = spark.sql(s"describe extended ${db}.${table_name}").select("data_type").where("col_name == 'Table Properties'")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_type: string]
scala> df.map(r => r.getString(0).split(",")(1).trim).collect
res39: Array[String] = Array(last_modified_time=1539848078)
scala> df.map(r => r.getString(0).split(",")(1).trim.split("=")(1)).collect.mkString
res41: String = 1539848078
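A rough PySpark equivalent of the Scala snippet above (treat it as a sketch: the layout of the Table Properties string can vary by Spark/Hive version, and the database/table names are placeholders):
import re

# pull the "Table Properties" row out of DESCRIBE EXTENDED
props = (spark.sql("DESCRIBE EXTENDED mydb.mytable")
         .where("col_name = 'Table Properties'")
         .select("data_type")
         .collect())

if props:
    # extract the epoch-seconds value of last_modified_time, if present
    m = re.search(r"last_modified_time=(\d+)", props[0][0])
    print(m.group(1) if m else None)   # e.g. 1539848078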

How do I use SparkSQL and its execution engine to query Hive databases and tables without invoking any part of the Hive execution engine?

I've created select and join statements that I can run from the Hive CLI and/or the beeline CLI and/or Spark (2.3.1) WITH enableHiveSupport=TRUE. (Note: I'm using SparkR for my API)
The join and write using beeline takes 30 minutes, but the join and write using Spark with enableHiveSupport=TRUE takes 3.5 HOURS. This either means Spark and its connectors are crap, or I'm not using spark the way I should be... and everything I read about Spark's 'best thing since sliced bread' commentary means I'm probably not using it right.
I want to read from Hive tables, but I don't want Hive to do anything. I'd like to run joins over monthly data, run a regression on each record's monthly delta, then output my final slopes/betas to an output table in parquet that is readable from Hive, if necessary... preferably partitioned the same way that I have partitioned the tables I'm using as input data from Hive.
Here's some code, as requested... but I don't think you're going to learn anything from it. You're not going to get reproducible results with Big Data queries.
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark2-client")
sessionInfo()
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.stop()
Sys.setenv(SPARKR_SUBMIT_ARGS="--master yarn sparkr-shell") #--master yarn-client sparkr-shell
Sys.setenv(LOCAL_DIRS="/tmp")
config = list()
config$spark.cores.max <- 144L
config$spark.executor.cores <- 2L
config$spark.executor.memory <- '8g'
config$spark.driver.cores <- 6L
config$spark.driver.maxResultSize <-"0"
config$spark.driver.memory <- "32g"
config$spark.shuffle.service.enabled<-TRUE
config$spark.dynamicAllocation.enabled <-FALSE
config$spark.scheduler.mode <- 'FIFO'
config$spark.ui.port<-4044L
sparkR.session(master = "yarn",
sparkHome = Sys.getenv("SPARK_HOME"),
sparkConfig = config,
enableHiveSupport = TRUE)
print("Connected!")
############ SET HIVE CONFIG
collect(sql("SET hive.exec.dynamic.partition") )
sql("SET hive.exec.dynamic.partition=true")
collect(sql("SET hive.exec.dynamic.partition.mode"))
sql("SET hive.exec.dynamic.partition.mode=nonstrict")
##
start_time <- Sys.time()
############### READ IN DATA {FROM HIVE}
sql('use historicdata')
data_tables<-collect(sql('show tables'))
exporttabs <- grep(pattern = 'export_historic_archive_records',x = data_tables$tableName,value = TRUE)
jointabs<-sort(exporttabs)[length(exporttabs)-(nMonths-1):0]
currenttab<-jointabs[6]
############### CREATE TABLE AND INSERT SCRIPTS
sql(paste0('use ',hivedb))
sql(paste0('DROP TABLE IF EXISTS histdata_regression',tab_suffix))
sSelect<-paste0("Insert Into TABLE histdata_regression",tab_suffix," partition (scf) SELECT a.idkey01, a.ssn7")
sCreateQuery<-paste0("CREATE TABLE histdata_regression",tab_suffix," (idkey01 string, ssn7 string")
sFrom<-paste0("FROM historicdata.",jointabs[nMonths]," a")
sAlias<-letters[nMonths:1]
DT <- gsub(pattern = "export_historic_archive_records_",replacement = "",jointabs)
DT<-paste0(DT)
for (i in nMonths:1) {
sSelect<-paste0(sSelect,", ",sAlias[i],".",hdAttr," as ",hdAttr,"_",i,", ",sAlias[i],".recordid as recordid_",DT[i])
sCreateQuery<-paste0(sCreateQuery,", ",hdAttr,"_",i," int, recordid_",DT[i]," int")
if (i==1) sCreateQuery<-paste0(sCreateQuery,') PARTITIONED BY (scf string) STORED AS ORC')
if (i==1) sSelect<-paste0(sSelect,", a.scf")
if (i!=nMonths) sFrom<-paste0(sFrom," inner join historicdata.",jointabs[i]," ",sAlias[i]," on ",
paste(paste0(paste0("a.",c("scf","idkey01","ssn7")),"=",
paste0(sAlias[i],".",c("scf","idkey01","ssn7"))),collapse=" AND "))
}
system(paste0('beeline -u "jdbc:hive2://myserver1.com,myserver2.com,myserver3.com,myserver4.com,myserver5.com/work;\
serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -e "',sCreateQuery,'"'))
system(paste0("beeline -u \"jdbc:hive2://myserver1.com,myserver2.com,myserver3.com,myserver4.com,myserver5.com/work;\
serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2\" -e \"",sSelect," ",sFrom,"\""))

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temp table like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like max_id of a column id
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a str (string):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or an even better alternative:
from pyspark.sql import functions as f

(sqlContext
    .table("myTempTable")
    .select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
    .coalesce(1)
    .write.format("text").mode("append").save("/tmp/fooo"))
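Since the question already builds the test string, one more option (not from the answers above, just a minimal sketch that should also work on Spark 1.6 without spark-csv) is to wrap that string in a one-row DataFrame and append it as text:
from pyspark.sql import Row

# wrap the already-built "date,min_id,max_id" string in a single-row DataFrame
row_df = sqlContext.createDataFrame([Row(value=test)])

# append it as a text file under the target HDFS directory
row_df.coalesce(1).write.format("text").mode("append").save("/tmp/fooo")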

How to convert a table into a Spark Dataframe

In Spark SQL, a dataframe can be queried as a table using this:
sqlContext.registerDataFrameAsTable(df, "mytable")
Assuming what I have is mytable, how can I get or access this as a DataFrame?
The cleanest way:
df = sqlContext.table("mytable")
Documentation
Well, you can query it and save the result into a variable, since SQLContext's sql method returns a DataFrame:
df = sqlContext.sql("SELECT * FROM mytable")
