How to use isin function with values from text file? - apache-spark

I'd like to filter a dataframe using an external file.
This is how I use the filter now:
val Insert = Append_Ot.filter(
  col("Name2").equalTo("brazil") ||
  col("Name2").equalTo("france") ||
  col("Name2").equalTo("algeria") ||
  col("Name2").equalTo("tunisia") ||
  col("Name2").equalTo("egypte"))
Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.
So I create this file:
val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
  .map(_.split(" ")(1))
  .collect
This gives me:
filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)
And then I use the isin function on the Name2 column.
val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))
But this gives me an empty dataframe. Why?

I am just adding some information to philantrovert's answer on filtering a dataframe from an external file.
His answer is correct, but the values may not match in case, so you will have to check for case mismatch as well.
tl;dr Make sure the values use consistent case, i.e. they are all upper case or all lower case. Simply use the upper or lower standard functions.
Let's say you have an input file like this:
1 Algeria
2 tunisia
3 brazil
4 Egypt
You read the text file and change all the countries to lowercase:
val countries = sc.textFile("path to input file").map(_.split(" ")(1).trim)
  .collect.toSeq
val array = Array(countries.map(_.toLowerCase) : _*)
Then you have your dataframe
val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")
where you apply the following condition:
import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))
You should get output like this:
+-------+
|Name2 |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+
An empty dataframe might also be caused by a spelling mismatch.
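A quick way to rule out both causes is to normalize both sides the same way before comparing. A minimal sketch, assuming the values read from the file may carry stray whitespace as well as mixed case (trim and lower are the same standard functions used above, and filter_numfile is the array from the question):
import org.apache.spark.sql.functions.{lower, trim}
// normalize the values read from the text file
val normalized = filter_numfile.map(_.trim.toLowerCase)
// apply the same normalization to the column before isin
val Insert = Append_Ot.where(lower(trim($"Name2")).isin(normalized: _*))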

Related

Spark SQL - Get Column Names of a Hive Table in a String

I'm trying to get the column names of a Hive table in a comma-separated String. This is what I'm doing:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.mkString(", ")
And the output I get is
res0: String = [col_1], [col_2], [col_3]
But what I want is col_1, col_2, col_3. I can remove [ and ] from the String, but I'm curious as to whether we can get the column names without the brackets in the first place.
Edit: The column names in the Hive table don't contain [ ]
Instead of show columns, try the approach below, as it is faster than yours:
val colNameDF = spark.sql("select * from hive_table").limit(0)
Or
val colNameDF = spark.table("hive_table").limit(0)
val colNameStr = colNameDF.columns.mkString(", ")
collect returns an array of Row objects, each of which internally wraps its values, so you need to extract the value like this:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.map(r=>r.getString(0)).mkString(", ")
Building on @Srinivas' answer above, here is the equivalent Python code. It is very fast:
colNameStr = ",".join(spark.table(hive_table).limit(0).columns)

Scala How to Find All Unique Values from a Specific Column in a CSV?

I am using Scala to read from a csv file. The file is formatted to have 3 columns, each separated by a \t character. The first 2 columns are unimportant and the third column contains a list of comma-separated identifiers stored as strings. Below is a sample of what the input csv would look like:
0002ba73 US 6o7,6on,6qc,6qj,6nw,6ov,6oj,6oi,15me,6pb,6p9
002f50e4 US 6om,6pb,6p8,15m9,6ok,6ov,6qc,6oo,15me
004b5edc US 6oj,6nz,6on,6om,6qc,6ql,6p6,15me
005cc990 US 6pb,6qf,15me,6og,6nx,6qc,6om,6ok
005fe1ea US 15me,6p0,6ql,6ok,6ox,6ol,6o5,6qj
00777555 US 6pb,15me,6nw,6rk,6qc,6ov,6qj,6o0,6oj,6ok,6on,6p6,6nx,15m9
00cbcc7d US 6oj,6qc,6qg,6pb,6ol,6p6,6ov,15me
010254a6 US 6qc,6pb,6nw,6nx,15me,6o0,6ok,6p8
011b905c US 6oj,6nw,6ov,15me,6qc,6ow,6ql,6on,6qi,6qe
011fffa6 US 15me,6ok,6oj,6p6,6pb,6on,6qc,6ov,6oo,6nw,6oc
I want to read in the csv, get rid of the first two columns, and create a List that contains one instance of each unique identifier code found in the third column, so running the code on the above data should return the result List(6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6p8, 15m9, 6ok, 6oo, 6nz, 6om, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)
I have the following code which returns a List containing every distinct value found anywhere in the csv file:
val in_file = new File("input_file.csv")
val source = scala.io.Source.fromFile(in_file, "utf-8")
val labels = try source.getLines.mkString("\t") finally source.close()
val labelsList: List[String] = labels.split("[,\t]").map(_.trim).toList.distinct
Using the above input, my code returns labelsList with a value of List(0002ba73-e60c-4ffb-9131-c1612b904658, US, 6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 002f50e4-48cc-4b14-bb80-0502068b6161, 6om, 6p8, 15m9, 6ok, 6oo, 004b5edc-c0cc-4ffd-bef3-980bd92b92e6, 6nz, 6ql, 6p6, 005cc990-83dc-4e63-a4b6-58f38241e8fd, 6qf, 6og, 6nx, 005fe1ea-b918-48a3-a495-1f8ac12935ba, 6p0, 6ox, 6ol, 6o5, 00777555-83d4-401e-861b-5892f3aa3e1c, 6rk, 6o0, 00cbcc7d-1b48-4c5c-8141-8fc8f62b7b07, 6qg, 010254a6-2ef0-4a24-aa4d-3cc6656a55de, 011b905c-fbf3-441a-8912-a94cc0fe8a1d, 6ow, 6qi, 6qe, 011fffa6-0b9f-4d88-8ced-ce1cc864984f, 6oc)
How can I get my code to run properly and ignore anything contained within the first 2 columns of the csv?
You can ignore the first two columns and then split the third by the comma.
Finally a toSet will get rid of the duplicate identifiers.
import scala.io.Source

val f = Source.fromFile("input_file.csv")
val lastColumns = f.getLines().map(_.split("\t")(2))
val uniques = lastColumns.flatMap(_.split(",")).toSet
uniques foreach println
Using Scala 2.13 resource management:
util.Using(io.Source.fromFile("input_file.csv")) {
  _.getLines()
   .foldLeft(Array.empty[String]) {
     _ ++ _.split("\t")(2).split(",")
   }.distinct.toList
}
//res0: scala.util.Try[List[String]] =
// Success(List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc))
The .toList can be dropped if an Array result is acceptable.
This is what you can do. I am demonstrating on a sample DataFrame; you can replace it with yours:
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val finalDf = Df.select(reqCols map Df.columns map col: _*)
finalDf.show
Note: this uses a 0-based index, so pass 2 to get the third column.
If you want distinct values from your desired column, you can use distinct along with mkString:
val Df = Seq(("7369", "SMITH" , "2010-12-17", "800.00"), ("7499", "ALLEN","2011-02-20", "1600.00"), ("7499", "ALLEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val reqCols = Seq(2)
val distinctValues = Df.select(reqCols map Df.columns map col: _*).distinct.collect.mkString(",").filterNot("[]".toSet)
println(distinctValues)
The dates are duplicated in the sample; the code above removes the duplicates.
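Note that the question also needs the comma-separated identifiers inside that column split out and de-duplicated. A minimal sketch of doing that entirely in Spark, assuming the file is tab-delimited with no header, so the columns come in under the default names _c0, _c1, _c2:
import org.apache.spark.sql.functions.{col, explode, split}
// read the tab-separated file; _c2 holds the comma-separated identifiers
val csvDf = spark.read.option("sep", "\t").csv("input_file.csv")
// split the third column on commas, flatten with explode, and keep unique values
val uniqueIds = csvDf.select(explode(split(col("_c2"), ",")).as("id")).distinct()
uniqueIds.show(false)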
Another method using a regex:
val data = scala.io.Source.fromFile("source.txt").getLines()
data.toList.flatMap {
  line => """\S+\s+\S+\s+(\S+)""".r.findAllMatchIn(line).map( x => x.group(1).split(",").toList)
}.flatten.distinct
// res0: List[String] = List(6o7, 6on, 6qc, 6qj, 6nw, 6ov, 6oj, 6oi, 15me, 6pb, 6p9, 6om, 6p8, 15m9, 6ok, 6oo, 6nz, 6ql, 6p6, 6qf, 6og, 6nx, 6p0, 6ox, 6ol, 6o5, 6rk, 6o0, 6qg, 6ow, 6qi, 6qe, 6oc)

PATINDEX in spark sql

I have this statement in SQL:
Case WHEN AAAA is not null then AAAA
Else RTRIM(LEFT(BBBB, PATINDEX('%[0-9]%', BBBB) - 1))
END as NAME.
I need to convert this to Spark SQL. I tried using indexOf, but it doesn't accept the string '%[0-9]%'. How do I convert the above statement to Spark SQL? Please help.
Thanks!
Here is my code to do this in Scala Spark. I used a UDF.
Edit: this assumes the string needs to be cut at the first occurrence of a number.
import spark.implicits._
val df = Seq(("SOUTH TEXAS SYNDICATE 454C"),
("SANDERS 34-27 #3TF"),
("K. R. BRACKEN B 3H"))
.toDF("name")
df.createOrReplaceTempView("temp")
val getIndexOfFirstNumber = (s: String) => {
  val str = s.split("\\D+").filter(_.nonEmpty).toList
  s.indexOf(str(0))
}
spark.udf.register("getIndexOfFirstNumber", getIndexOfFirstNumber)
spark.sql("""
select name,substr(name, 0, getIndexOfFirstNumber(name) -1) as final_name
from temp
""").show(20,false)
Result:
+------------------------------------+----------------------+
|name |final_name |
+------------------------------------+----------------------+
|SOUTH TEXAS SYNDICATE 454C |SOUTH TEXAS SYNDICATE |
|SANDERS 34-27 #3TF |SANDERS |
|K. R. BRACKEN B 3H |K. R. BRACKEN B |
|ALEXANDER-WESSENDORFF 1 (SA) A5 A 5H|ALEXANDER-WESSENDORFF |
|USZYNSKI-FURLOW (SA) B 3H |USZYNSKI-FURLOW (SA) B|
+------------------------------------+----------------------+
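For completeness, here is a hedged, UDF-free sketch of the same idea using only built-in functions; it assumes the goal is simply everything before the first digit, with the trailing space trimmed (unlike the UDF output above, which keeps it):
import org.apache.spark.sql.functions.{col, regexp_extract, trim}
// regexp_extract keeps the leading run of non-digit characters; trim drops the trailing space
val noUdf = df.withColumn("final_name", trim(regexp_extract(col("name"), "^[^0-9]*", 0)))
noUdf.show(false)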
Based on Manish's answer I built this; it's more generic and was built in Python. You can use it in Spark SQL as well.
The example is not for numbers but for the string DATE:
import re

def PATINDEX(string, s):
    if s:
        match = re.search(string, s)
        if match:
            return match.start() + 1
        else:
            return 0
    else:
        return 0

spark.udf.register("PATINDEX", PATINDEX)
PATINDEX('DATE', 'a2aDATEs2s')
You can use the below method to remove the leading zeroes using Databricks or Spark SQL.
REPLACE(LTRIM(REPLACE('0000123045','0',' ')),' ','0')
EXPLANATION:
The inner replace function replaces each zero with a space.
Example: '    123 45'
The LTRIM function removes the spaces from the left.
Example: '123 45'
Then the outer replace function replaces the remaining spaces with zeros.
Example: '123045'
Similarly, you can use the function with RTRIM for removing the trailing zeroes accordingly.
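As a sketch only, assuming the same REPLACE and RTRIM built-ins are available in Spark SQL, the trailing-zero variant looks like this:
// mirror-image expression for trailing zeroes, run through spark.sql
spark.sql("SELECT REPLACE(RTRIM(REPLACE('1230450000', '0', ' ')), ' ', '0') AS trimmed").show()
// expected value: 123045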

Sparksql: Json array maintaining order of elements

I have some json data where one of the elements is an array. Here is a sample dataset:
{"name":"Michael", "schools":[{"sname":"stanford", "year":2010}{"sname":"berkeley", "year":2012}, {"sname":"mit", "year":2016}]}
{"name":"Andy", "schools":[{"sname":"ucsb", "year":2011}, {"sname":"ucsd", "year":2015}]]}
I want to use name as key and for a given name, I want to combine all the school names in the order they are present in the array.
Here is the desired output:
michael, "stanford berkeley mit"
Andy "ucsb ucsd"
Here is my code:
val people = sqlContext.read.json("test.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
val schools = flattened.select("name", "schools_flat.sname")
scala> schools.show()
+-------+--------+
| name| sname|
+-------+--------+
|Michael|stanford|
|Michael|berkeley|
|Michael| mit|
+-------+--------+
Unfortunately, when I group this by key, I am not sure if the order will be retained (most likely not). I don't want the school names for Michael to be reordered; they should appear as they were in the original JSON array. Any help with this will be great.
Why explode and group instead of select?
people.select("name", "schools.sname")
It will preserve order as you want.
The following code does what was asked in the question.
import org.apache.spark.sql.functions.{col, udf}

val people = sqlContext.read.json("test.json")
val test = people.select("name", "schools.sname")
val getConcatenated = udf( (first: Seq[String]) => { first.mkString(" ") } )
val test_cat = test.withColumn("sname_concat", getConcatenated(col("sname"))).select("name", "sname_concat")
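A minimal UDF-free sketch of the same step, under the assumption that Spark's built-in concat_ws is used (it accepts an array column and preserves element order):
import org.apache.spark.sql.functions.{col, concat_ws}
// join each array of school names with a space, keeping the original array order
val test_cat2 = people.select(col("name"), concat_ws(" ", col("schools.sname")).as("sname_concat"))
test_cat2.show(false)
// expected: Michael -> "stanford berkeley mit", Andy -> "ucsb ucsd"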

How to loop through each row of dataFrame in pyspark

E.g.:
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints the entire table on the terminal, but I want to access each row in that table using a for or while loop to perform further calculations.
You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed using dedicated higher-order functions and/or SQL methods.
You can of course collect:
for row in df.rdd.collect():
    do_something(row)
or convert with toLocalIterator:
for row in df.rdd.toLocalIterator():
    do_something(row)
and iterate locally as shown above, but it defeats the whole purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
    return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be an RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
             for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
    print("{} is a {} year old from {}".format(
        row["name"],
        row["age"],
        row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
To loop through the rows in the Date column:
rows = df.select('Date').collect()
final_list = []
for i in rows:
    final_list.append(i[0])
print(final_list)
Give it a try like this:
result = spark.createDataFrame([('SpeciesId', 'int'), ('SpeciesName', 'string')], ["col_name", "data_type"])
for f in result.collect():
    print(f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
In the answer above,
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
because name, age, and city are not variables but simply keys of the dictionary.
