Too many arguments for method filter - apache-spark

What I am trying to do here is filter the rows where log_type = "1".
This is my code:
val sc1Rdd=parDf.select(parDf("token"),parDf("log_type")).rdd
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
but it fails with this error:
parDf: org.apache.spark.sql.DataFrame = [action_time: bigint,
action_type: bigint ... 21 more fields] sc1Rdd:
org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] =
MapPartitionsRDD[865] at rdd at :186 :188: error:
too many arguments for method filter: (f: org.apache.spark.sql.Row =>
Boolean)org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
Any help will be appreciated.

You don't need to convert the DataFrame to an RDD.
You can simply use filter or where:
val result = parDf.select(parDf("token"),parDf("log_type")).filter(parDf("log_type")===1)
parDf.select(parDf("token"),parDf("log_type")).where(parDf("log_type")===1)
Hope this helps!
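For reference, the compile error in the question comes from the comma in the lambda: the compiler reads it as two arguments to filter, which accepts a single function. If an RDD is really needed, a minimal sketch of a corrected predicate follows. It assumes log_type is stored as a string; the object name and the sample rows are made up for illustration and are not from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object FilterLogType {
  // RDD version: a single function argument, reading the field by name from each Row.
  def viaRdd(parDf: DataFrame): RDD[Row] =
    parDf.select(parDf("token"), parDf("log_type")).rdd
      .filter(row => row.getAs[String]("log_type") == "1")

  // DataFrame version: no conversion to RDD needed.
  def viaDataFrame(parDf: DataFrame): DataFrame =
    parDf.select(parDf("token"), parDf("log_type"))
      .filter(parDf("log_type") === "1")

  // Hypothetical sample data (assumption), only for trying the two methods locally.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("t1", "1"), ("t2", "2"), ("t3", "1")).toDF("token", "log_type")
  }
}
```

Both methods should keep the same rows; the DataFrame version is usually preferable because it stays within the Spark SQL optimizer.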

Related

Null Pointer Exception accessing Dataframe

val opt2 :Option[DataFrame]= None
val result:Option[DataFrame] =opt2.getOrElse(None)
if (!result.isEmpty) {
result.show()
}
If I use get instead of getOrElse, I get a null pointer exception.
If I use getOrElse, I get a type mismatch on result. How can I fix this? Also, is there a method for creating an empty DataFrame?
Spark provides a predefined method for creating an empty DataFrame.
Note that in the code below, spark is a SparkSession.
scala> import org.apache.spark.sql.SparkSession
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.show()
++
||
++
++
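On the Option part of the question: getOrElse returns the wrapped DataFrame rather than an Option, so the result should be typed as DataFrame and the fallback should be a DataFrame as well. A minimal sketch, assuming an empty DataFrame is an acceptable default (the object and method names are illustrative, not from the question):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object OptionalFrame {
  // Unwrap the Option, falling back to an empty DataFrame when it is None.
  def orEmpty(opt: Option[DataFrame])(implicit spark: SparkSession): DataFrame =
    opt.getOrElse(spark.emptyDataFrame)

  // Alternative: only act when a DataFrame is actually present.
  def showIfPresent(opt: Option[DataFrame]): Unit =
    opt.foreach(_.show())
}
```

With this shape there is no need to call get at all, so the None case never throws.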

Spark load files collection in batch and find the line from each file with additional info from file level

I have a collection of files specified as a comma-separated list, like:
hdfs://user/cloudera/date=2018-01-15,hdfs://user/cloudera/date=2018-01-16,hdfs://user/cloudera/date=2018-01-17,hdfs://user/cloudera/date=2018-01-18,hdfs://user/cloudera/date=2018-01-19,hdfs://user/cloudera/date=2018-01-20,hdfs://user/cloudera/date=2018-01-21,hdfs://user/cloudera/date=2018-01-22
and I'm loading the files with Apache Spark, all at once, with:
val input = sc.textFile(files)
Also, I have additional information associated with each file - the unique ID, for example:
File                                 | ID
--------------------------------------------------
hdfs://user/cloudera/date=2018-01-15 | 12345
hdfs://user/cloudera/date=2018-01-16 | 09245
hdfs://user/cloudera/date=2018-01-17 | 345hqw4
and so on
As output, I need a DataFrame in which each row carries the ID of the file that the line was read from.
Is it possible to pass this information to Spark in some way so that it can be associated with the lines?
A core Spark SQL approach with a UDF (you can achieve the same thing with a join if you represent the File -> ID mapping as a DataFrame):
import org.apache.spark.sql.functions
// read all files and record the source file of every line
val inputDf = sparkSession.read.text(".../src/test/resources/test")
  .withColumn("fileName", functions.input_file_name())
// UDF that looks up the ID for a file name; unmapped files yield null
def withId(mapping: Map[String, String]) = functions.udf(
  (file: String) => mapping.get(file)
)
val mapping = Map(
  "file:///.../src/test/resources/test/test1.txt" -> "id1",
  "file:///.../src/test/resources/test/test2.txt" -> "id2"
)
val resultDf = inputDf.withColumn("id", withId(mapping)(inputDf("fileName")))
resultDf.show(false)
Result:
+-----+---------------------------------------------+---+
|value|fileName |id |
+-----+---------------------------------------------+---+
|row1 |file:///.../src/test/resources/test/test1.txt|id1|
|row11|file:///.../src/test/resources/test/test1.txt|id1|
|row2 |file:///.../src/test/resources/test/test2.txt|id2|
|row22|file:///.../src/test/resources/test/test2.txt|id2|
+-----+---------------------------------------------+---+
test1.txt:
row1
row11
test2.txt:
row2
row22
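The join variant mentioned at the start of this answer can be sketched as follows. It assumes the input DataFrame already carries a fileName column (for example, added with input_file_name() as above); the object name, the method name, and the choice of a broadcast left join are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object FileIdJoin {
  // inputDf is expected to have a "fileName" column identifying the source file of each line.
  def withIds(inputDf: DataFrame, mapping: Map[String, String])
             (implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Represent the File -> ID mapping as a small two-column DataFrame.
    val mappingDf = mapping.toSeq.toDF("fileName", "id")
    // Left join keeps lines from files without a mapping; their id is null.
    inputDf.join(broadcast(mappingDf), Seq("fileName"), "left")
  }
}
```

Broadcasting the mapping is only a hint and only sensible while the mapping stays small; for a large mapping a plain join would be the safer default.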
This could help (not tested)
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// read single text file into DataFrame and add 'id' column
def readOneFile(filePath: String, fileId: String)(implicit spark: SparkSession): DataFrame = {
  val dfOriginal: DataFrame = spark.read.text(filePath)
  val dfWithIdColumn: DataFrame = dfOriginal.withColumn("id", lit(fileId))
  dfWithIdColumn
}
// read all text files into DataFrame
def readAllFiles(filePathIdsSeq: Seq[(String, String)])(implicit spark: SparkSession): DataFrame = {
  // create empty DataFrame with expected schema
  val emptyDfSchema: StructType = StructType(List(
    StructField("value", StringType, false),
    StructField("id", StringType, false)
  ))
  val emptyDf: DataFrame = spark.createDataFrame(
    rowRDD = spark.sparkContext.emptyRDD[Row],
    schema = emptyDfSchema
  )
  val unionDf: DataFrame = filePathIdsSeq.foldLeft(emptyDf) { (intermediateDf: DataFrame, filePathIdTuple: (String, String)) =>
    intermediateDf.union(readOneFile(filePathIdTuple._1, filePathIdTuple._2))
  }
  unionDf
}

Syntax for using toEpochDate with a Dataframe with Spark Scala - elegantly

The following is nice and easy with an RDD in terms of epochDate derivation:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
All fields in the RDD are of type String. This gives the desired result, for example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Doing the same when the value is a String column in a DataFrame appears too tricky for me, and I would like to see something elegant, if possible. Something like this, and variations of it, does not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
I get:
notebook:50: error: value LocalDate is not a member of org.apache.spark.sql.ColumnName
I understand the error, but what is an elegant way of doing the conversion?
You can define to_epoch_day as datediff since the beginning of the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly on a Column:
df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format complies with ISO 8601 you could even skip the explicit date conversion (it will be done implicitly by datediff):
df.withColumn("s", to_epoch_day($"start"))
$"start" is of type ColumnName not String.
One way around this is to define a UDF.
Example below:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])
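Both answers should agree with the RDD version from the question. The following plain-Scala check (no Spark involved; the object name is illustrative) compares LocalDate.toEpochDay with a count of whole days since 1970-01-01, which is the quantity the datediff-based definition expresses:

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object EpochDayCheck {
  private val epoch = LocalDate.of(1970, 1, 1)

  // What the UDF answer computes.
  def viaToEpochDay(s: String): Long = LocalDate.parse(s).toEpochDay

  // Whole days between the epoch and the given date, as in datediff(date, 1970-01-01).
  def viaDayDifference(s: String): Long =
    ChronoUnit.DAYS.between(epoch, LocalDate.parse(s))
}
```

For "2018-09-25" both methods return 17799, matching the RDD output shown in the question.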

How to pass dataframe in ISIN operator in spark dataframe

I want to pass dataframe which has set of values to new query but it fails.
1) Here I am selecting particular column so that I can pass under ISIN in next query
scala> val managerIdDf=finalEmployeesDf.filter($"manager_id"!==0).select($"manager_id").distinct
managerIdDf: org.apache.spark.sql.DataFrame = [manager_id: bigint]
2) My sample data:
scala> managerIdDf.show
+----------+
|manager_id|
+----------+
| 67832|
| 65646|
| 5646|
| 67858|
| 69062|
| 68319|
| 66928|
+----------+
3) When I execute final query it fails:
scala> finalEmployeesDf.filter($"emp_id".isin(managerIdDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.DataFrame [manager_id: bigint]
I also tried converting to List and Seq, but that only produces another error. For example, when I convert to Seq and rerun the query, it throws:
scala> val seqDf=managerIdDf.collect.toSeq
seqDf: Seq[org.apache.spark.sql.Row] = WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
scala> finalEmployeesDf.filter($"emp_id".isin(seqDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
I also referred to this post, but in vain. I am trying this type of query as a way to solve subqueries with Spark DataFrames. Can anyone help?
An alternative approach uses DataFrames, temp views, and free-form Spark SQL. Don't worry about the logic of the example data; it only illustrates the pattern, and any of the following variants should work equally well:
val df2 = Seq(
  ("Peter", "Doe", Seq(("New York", "A000000"), ("Warsaw", null))),
  ("Bob", "Smith", Seq(("Berlin", null))),
  ("John", "Jones", Seq(("Paris", null)))
).toDF("firstname", "lastname", "cities")
df2.createOrReplaceTempView("persons")
val res = spark.sql("""select *
                       from persons
                       where firstname
                       not in (select firstname
                               from persons
                               where lastname <> 'Doe')""")
res.show
or
val list = List("Bob", "Daisy", "Peter")
val res2 = spark.sql("select firstname, lastname from persons")
  .filter($"firstname".isin(list:_*))
res2.show
or
val query = s"select * from persons where firstname in (${list.map ( x => "'" + x + "'").mkString(",") })"
val res3 = spark.sql(query)
res3.show
or
df2.filter($"firstname".isin(list: _*)).show
or
val list2 = df2.select($"firstname").rdd.map(r => r(0).asInstanceOf[String]).collect.toList
df2.filter($"firstname".isin(list2: _*)).show
In your case specifically:
val seqDf=managerIdDf.rdd.map(r => r(0).asInstanceOf[Long]).collect.toList
finalEmployeesDf.filter($"emp_id".isin(seqDf: _*)).select("*").show
Yes, you cannot pass a DataFrame to isin. isin requires the values that it will filter against.
If you want an example, you can check my answer here
As per question update, you can make the following change,
.isin(seqDf)
to
.isin(seqDf: _*)
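The reason the `: _*` ascription matters is that Column.isin is declared with a varargs parameter (list: Any*). Without the ascription, the whole Seq is passed as one value, which Spark then cannot turn into a literal. A plain-Scala sketch of the same mechanics, where countArgs is a hypothetical stand-in for a varargs method such as isin:

```scala
object VarargsDemo {
  // Hypothetical stand-in for a method declared like Column.isin(list: Any*).
  def countArgs(list: Any*): Int = list.length

  val ids: Seq[Long] = Seq(67832L, 65646L, 5646L)

  // The Seq itself becomes the single argument.
  val withoutExpansion: Int = countArgs(ids)

  // `: _*` expands the Seq into individual arguments.
  val withExpansion: Int = countArgs(ids: _*)
}
```

Note that the expanded elements must themselves be plain values such as Long, not Row objects, which is why the other answer maps each Row to its first field before collecting.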

how can i add a timestamp as an extra column to my dataframe

Hi all,
I have an easy question for you all.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a DataFrame.
I have tried adding the value to the DataFrame using withColumn(), but it returns the error below:
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d =dataframe.withColumn("timeStamp_column",dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
I understand that DataFrames cannot be altered because they are immutable, but RDDs are immutable as well.
So what is the best way to do this?
How can I add a value to the RDD (adding a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant value such as a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble converting the timestamp to a string. Here is a way to do that using the Spark 3 datetime format:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
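One difference between the answers above is worth spelling out: lit(System.currentTimeMillis) yields a LongType column holding epoch milliseconds captured on the driver, while current_timestamp() yields a TimestampType column. A minimal sketch that puts both side by side so the resulting types can be inspected (the object name, column names, and sample DataFrame are assumptions for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{current_timestamp, lit}

object TimestampColumns {
  // Adds both variants so their column types can be compared via the schema.
  def addBoth(df: DataFrame): DataFrame =
    df.withColumn("millis_literal", lit(System.currentTimeMillis))
      .withColumn("loaded_at", current_timestamp())

  // Hypothetical one-column input (assumption) for local experiments.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("a", "b").toDF("event_name")
  }
}
```

Calling printSchema on the result shows which variant fits a given sink; a timestamp-typed column is usually easier to work with downstream than raw milliseconds.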

Resources