I have these two lines in Spark (I'm working in a Python environment) and I want to get the equivalent in SparkSQL:
df = spark_df.filter(F.lower(F.col("col_name")).rlike("[0-9]{9}$")).count()
spark_df = spark_df.withColumn(columnname, F.to_date(F.col(columnname), "yyyyMMdd"))
For Spark SQL, first register the DataFrame as a temporary view, then run the SQL query against it.
Example:
spark_df.createOrReplaceTempView("tmp")
df = spark.sql("""select count(*) from tmp where lower(col_name) rlike '[0-9]{9}$' """).collect()[0][0]
spark_df = spark.sql("""select *, to_date(columnname, 'yyyyMMdd') as columnname from tmp """)
I have val test = sql("select * from table1"), which returns a DataFrame. I want to convert it to a Dataset, but it is not working:
test.toDS is throwing an error.
Please provide more detail about the error.
If you want to convert a DataFrame into a Dataset, use the code below:
case class MyClass(field1: Int, field2: Long) // for example
val df = sql("select * from table1")
val ds: Dataset[MyClass] = df.as[MyClass]
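If the error is that toDS is not a member of DataFrame, that is because a DataFrame is already a Dataset[Row]; the typed conversion goes through as[...], which needs an encoder for your case class in scope. A minimal, self-contained sketch (the table name and the case class fields here are placeholders for your actual schema):

import org.apache.spark.sql.{Dataset, SparkSession}

// Placeholder case class: its fields must match the column names/types of table1
case class MyClass(field1: Int, field2: Long)

val spark = SparkSession.builder().appName("df-to-ds").enableHiveSupport().getOrCreate()
import spark.implicits._  // brings the Encoder[MyClass] needed by .as[MyClass] into scope

val df = spark.sql("select field1, field2 from table1")  // DataFrame, i.e. Dataset[Row]
val ds: Dataset[MyClass] = df.as[MyClass]                // typed Dataset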
I have a table in Cassandra, tfm.foehis, that already contains data.
When I did the initial load of data from Spark into Cassandra, I used this set of commands:
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val wkdir="/home/adminbigdata/tablas/"
val fileIn= "originales/22_FOEHIS2.csv"
val fileOut= "22_FOEHIS_PRE2"
val fileCQL= "22_FOEHISCQL"
val data = sc.textFile(wkdir + fileIn)
  .filter(!_.contains("----"))
  .map(_.trim.replaceAll(" +", ""))
  .map(_.dropRight(1))
  .map(_.drop(1))
  .map(_.replaceAll(",", ""))
  .filter(array => array(6) != "MOBIDI")
  .filter(array => array(17) != "")
  .saveAsTextFile(wkdir + fileOut)
val firstDF = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", "DROPMALFORMED")
  .option("delimiter", "|")
  .load(wkdir + fileOut)
val columns: Array[String] = firstDF.columns
val reorderedColumnNames: Array[String] = Array("hoclic","hodtac","hohrac","hotpac","honrac","hocdan","hocdrs","hocdsl","hocol","hocpny","hodesf","hodtcl","hodtcm","hodtea","hodtra","hodtrc","hodtto","hodtua","hohrcl","hohrcm","hohrea","hohrra","hohrrc","hohrua","holinh","holinr","honumr","hoobs","hooe","hotdsc","hotour","housca","houscl","houscm","housea","houser","housra","housrc")
val secondDF= firstDF.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
secondDF.write.cassandraFormat("foehis", "tfm").save()
But when I load new data using the same script, I get errors and I don't know what's wrong.
This is the message:
java.lang.UnsupportedOperationException: 'SaveMode is set to ErrorIfExists and Table
tfm.foehis already exists and contains data.
Perhaps you meant to set the DataFrame write mode to Append?
Example: df.write.format.options.mode(SaveMode.Append).save()'
The error message clearly tells you that you need to use Append mode and shows how to do it. In your case this happens because the destination table already exists and the write mode is set to "error if exists". If you still want to write the data, the code should be the following:
import org.apache.spark.sql.SaveMode
secondDF.write.cassandraFormat("foehis", "tfm").mode(SaveMode.Append).save()
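If you actually meant to replace the existing contents instead of appending, the connector also accepts SaveMode.Overwrite, but it will refuse to truncate the table unless you confirm it explicitly. A sketch of that variant (only use it if deleting the rows already in tfm.foehis is acceptable):

import org.apache.spark.sql.SaveMode

// Overwrite truncates tfm.foehis before writing; the connector requires
// confirm.truncate=true, otherwise it raises an error instead of dropping data.
secondDF.write
  .cassandraFormat("foehis", "tfm")
  .option("confirm.truncate", "true")
  .mode(SaveMode.Overwrite)
  .save()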
Will Spark reuse an RDD in the DAG for a single action?
Case 1
val df1 = spark.sql("select id, value from table")
val df2 = spark.sql("select id, value from table")
df1.join(df2, "id").show()
Case 2
val df1 = spark.sql("select id, value from table")
val df2 = df1.filter($"value" > 0)
df1.join(df2, "id").show()
Questions
In case 1, will the query select id, value from table be executed only once?
In case 2, will the query be executed only once?
If not, how can I optimize the code so that the query is executed only once? The query may be very slow.
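A common way to guarantee that the source is scanned only once is to cache (or otherwise materialize) the shared DataFrame before reusing it; a minimal sketch, assuming the cached result fits in memory:

// Cache df1 so the "select id, value from table" scan runs once and
// both sides of the join read from the cached result.
val df1 = spark.sql("select id, value from table").cache()
val df2 = df1.filter($"value" > 0)

df1.join(df2, "id").show()

df1.unpersist()  // release the cached data once it is no longer needed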
My Spark version is 1.5.0 and I use Spark SQL in spark-shell to do some ETL. Here is my code:
import com.databricks.spark.avro._
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
import java.security.MessageDigest
val dfGoods = sqlContext.read.avro("hdfs:///user/data/date=*")
dfGoods.registerTempTable("goodsinfo")
val dfGoodsLmt=sqlContext.sql("SELECT * FROM (SELECT goodsid, etype, goodsattribute, row_number() over (partition by goodsid order by runid DESC) rank_num FROM goodsinfo) tmp WHERE rank_num =1")
I use dfGoodsLmt.count() to check the row count. The first time, the result is always wrong, but after that, when I rerun dfGoodsLmt.count(), the result is right. I have tried this many times and I don't know why.
Here is a demo; the result should be 1000, but I need to run it more than once to get the right answer.
case class data(id: Int, name: Int)
val tmp = (1 to 1000) zip (1 to 1000)
tmp.map(x => data(x._1, x._2)).toDF.registerTempTable("test_table")
sqlContext.sql("select * from (select *, row_number() over (partition by id order by id DESC) rank from test_table) tmp where rank = 1").count