Join RDD using python conditions - python-3.x

I have two RDDs. The first one contains information related to IP addresses (see column c_ip):
[Row(unic_key=1608422, idx=18, s_date='2016-12-31', s_time='15:00:07', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
Row(unic_key=1608423, idx=19, s_date='2016-12-31', s_time='15:00:08', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
]
And another RDD which is IP geolocation.
network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
1.0.0.0/24,2077456,2077456,,0,0,,-33.4940,143.2104,1000
1.0.1.0/24,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.2.0/23,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.4.0/22,2077456,2077456,,0,0,,-33.4940,143.2104,1000
I would like to match these two, but the problem is that there is no strict equality between a column in one RDD and a column in the other.
I would like to use the Python3 Package ipaddress and do a check like this:
> import ipaddress
> ipaddress.IPv4Address('1.0.0.5') in ipaddress.ip_network('1.0.0.0/24')
True
Is it possible to use a python function to perform the join (left outer join to not exclude any lines from my first RDD)? How can I do that?

When using Apache Spark 1.6, you can still use a UDF as a predicate in a join. After generating some test data:
import ipaddress
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructField, StructType, BooleanType, ArrayType, IntegerType
sessions = sc.parallelize([(1608422,'119.228.181.78'),(1608423, '119.228.181.78')]).toDF(['unic_key','c_ip'])
geo_ip = sc.parallelize([('1.0.0.0/24',2077456,2077456),
('1.0.1.0/24',1810821,1814991),
('1.0.2.0/23',1810821,1814991),
('1.0.4.0/22',2077456,2077456)]).toDF(['network','geoname_id','registered_country_geoname_id'])
You can create the UDF predicate as follows:
def ip_range(ip, network_range):
    # On Python 3, str is already text; on Python 2 with the ipaddress backport, wrap both arguments in unicode()
    return ipaddress.IPv4Address(ip) in ipaddress.ip_network(network_range)
pred = udf(ip_range, BooleanType())
And then you can use the UDF in the following join:
sessions.join(geo_ip).where(pred(sessions.c_ip, geo_ip.network))
Unfortunately this currently doesn't work in Spark 2.x, see https://issues.apache.org/jira/browse/SPARK-19728
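The containment test that the UDF wraps can be tried without Spark. The sketch below is plain Python, not a Spark join: match_network is a hypothetical helper that returns the first network containing an address, or None when nothing matches, which mimics what a left outer join would give for an unmatched row. It scans the list linearly, so it is only meant to illustrate the logic on a small table.

```python
import ipaddress
from typing import Iterable, Optional


def match_network(ip: str, networks: Iterable[str]) -> Optional[str]:
    """Return the first CIDR block that contains ip, or None if no block matches."""
    address = ipaddress.ip_address(ip)
    for cidr in networks:
        if address in ipaddress.ip_network(cidr):
            return cidr
    return None


# Networks taken from the geolocation sample in the question
networks = ['1.0.0.0/24', '1.0.1.0/24', '1.0.2.0/23', '1.0.4.0/22']

for ip in ['1.0.0.5', '1.0.3.7', '119.228.181.78']:
    # Unmatched addresses keep their row and get None, like a left outer join
    print(ip, match_network(ip, networks))
```

With the four sample networks, the session address 119.228.181.78 from the question has no match, so a left outer join is indeed needed to keep those rows.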

Related

Keep only the newest data with Spark structured streaming

I am streaming data like this: time, id, value
I want to keep only one record for each id, with the newest value. What is the best way to deal with this problem?
I would prefer to use PySpark.
from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id
# Order each id's rows newest first; 'tiebreak' separates rows that share the same time
window = Window.partitionBy("id").orderBy(col("time").desc(), "tiebreak")
(df_s
    .withColumn('tiebreak', monotonically_increasing_id())
    .withColumn('rank', rank().over(window))
    .filter(col('rank') == 1).drop('rank', 'tiebreak')
    .show())
Rank and tiebreaks are added to remove duplicates or ties across and within window partitions.
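To make the intended result concrete, here is the same keep-one-row-per-id idea in plain Python rather than Spark. It is only a sketch of the logic: newest_per_id is a hypothetical helper, and the (time, id, value) tuples are made-up sample records in the order given in the question.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, float]  # (time, id, value)


def newest_per_id(records: Iterable[Record]) -> Dict[str, Record]:
    """Keep, for every id, the record with the largest time."""
    newest: Dict[str, Record] = {}
    for record in records:
        time, key, _value = record
        current = newest.get(key)
        # Strict '>' keeps the first-seen record when two records share the same time
        if current is None or time > current[0]:
            newest[key] = record
    return newest


records = [(1, 'a', 10.0), (3, 'a', 30.0), (2, 'a', 20.0), (5, 'b', 1.0)]
latest = newest_per_id(records)
print(latest)
```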

Is there an inverse function for pyspark's expr?

I know there's a function called expr that turns your spark sql into a spark column with that expression:
>>> from pyspark.sql import functions as F
>>> F.expr("length(name)")
Column<b'length(name)'>
Is there a function that does the opposite - turn your Column into a pyspark's sql string? Something like:
>>> F.inverse_expr(F.length(F.col('name')))
'length(name)'
I found out that Column's __repr__ gives you an idea of what the column expression is (like Column<b'length(name)'>), but it doesn't seem to be usable programmatically without some hacky parsing and string replacing.
In Scala, we can use Column#expr to get the SQL-style expression, as below:
length($"entities").expr.toString()
// length('entities)
In pyspark-
print(F.length("name")._jc.expr.container)
# length(name)
I tried the accepted answer by @Som in Spark 2.4.2 and Spark 3.2.1 and it didn't work.
The following approach worked for me in pyspark:
import re
import pyspark
from pyspark.sql import Column
def inverse_expr(c: Column) -> str:
    """Convert a column from `Column` type to an equivalent SQL column expression (string)"""
    from packaging import version
    sql_expression = c._jc.expr().sql()
    if version.parse(pyspark.__version__) < version.parse('3.2.0'):
        # prior to Spark 3.2.0 f.col('a.b') would be converted to `a.b` instead of the correct `a`.`b`
        # this workaround is used to fix this issue
        sql_expression = re.sub(
            r'''(`[^"'\s]+\.[^"'\s]+?`)''',
            lambda x: x.group(0).replace('.', '`.`'),
            sql_expression,
            flags=re.MULTILINE
        )
    return sql_expression
>>> from pyspark.sql import functions as F
>>> inverse_expr(F.length(F.col('name')))
'length(`name`)'
>>> inverse_expr(F.length(F.lit('name')))
"length('name')"
>>> inverse_expr(F.length(F.col('table.name')))
'length(`table`.`name`)'
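The pre-3.2.0 workaround in inverse_expr is plain string rewriting, so it can be tried without a Spark session. The sketch below isolates that regular expression in a hypothetical helper, quote_dotted_identifiers, and feeds it hand-written strings of the shape the code comment describes; the inputs are assumptions, not output captured from an older Spark version.

```python
import re


def quote_dotted_identifiers(sql_expression: str) -> str:
    """Rewrite a back-quoted `a.b` into `a`.`b`, using the regex from inverse_expr."""
    return re.sub(
        r'''(`[^"'\s]+\.[^"'\s]+?`)''',
        lambda match: match.group(0).replace('.', '`.`'),
        sql_expression,
        flags=re.MULTILINE,
    )


# A back-quoted identifier containing a dot is split into two quoted parts
print(quote_dotted_identifiers('length(`table.name`)'))
# An identifier without a dot is left untouched
print(quote_dotted_identifiers('length(`name`)'))
```

Because the pattern excludes quotes and whitespace, string literals such as 'a.b' are not rewritten.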

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing DataFrame using PySpark.
I used below code snippet
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime=hashNonKeyData.withColumn("RECORD_START_DATE_TIME",
lit(strRecordStartTime).cast("timestamp")
)
It gives me following error
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointers are appreciated.
Try using Python's native datetime with lit. Sorry, I don't have access to a machine to test this right now.
import datetime
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add one new column to the existing dataframe:
from pyspark.sql.functions import lit
import dateutil.parser
yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # to print the dataframe
You can validate your schema by using the command below.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))

Apply a function to a single column of a csv in Spark

Using Spark I'm reading a csv and want to apply a function to a column on the csv. I have some code that works but it's very hacky. What is the proper way to do this?
My code
SparkContext().addPyFile("myfile.py")
spark = SparkSession\
.builder\
.appName("myApp")\
.getOrCreate()
from myfile import myFunction
df = spark.read.csv(sys.argv[1], header=True,
mode="DROPMALFORMED",)
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use User Defined Functions (udf) combined with a withColumn :
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
udf_myFunction = udf(myFunction, IntegerType()) # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3")) #"_3" being the column name of the column you want to consider
This will add a column named message to the dataframe df (replacing any existing column with that name), containing the result of applying myFunction to the selected column.

Convert Hive Sql to Spark Sql

I want to convert my Hive SQL to Spark SQL to test the performance of the query. Here is my Hive SQL. Can anyone suggest how to convert it to Spark SQL?
SELECT split(DTD.TRAN_RMKS,'/')[0] AS TRAB_RMK1,
split(DTD.TRAN_RMKS,'/')[1] AS ATM_ID,
DTD.ACID,
G.FORACID,
DTD.REF_NUM,
DTD.TRAN_ID,
DTD.TRAN_DATE,
DTD.VALUE_DATE,
DTD.TRAN_PARTICULAR,
DTD.TRAN_RMKS,
DTD.TRAN_AMT,
SYSDATE_ORA(),
DTD.PSTD_DATE,
DTD.PSTD_FLG,
G.CUSTID,
NULL AS PROC_FLG,
DTD.PSTD_USER_ID,
DTD.ENTRY_USER_ID,
G.schemecode as SCODE
FROM DAILY_TRAN_DETAIL_TABLE2 DTD
JOIN ods_gam G
ON DTD.ACID = G.ACID
where substr(DTD.TRAN_PARTICULAR,1,3) rlike '(PUR|POS).*'
AND DTD.PART_TRAN_TYPE = 'D'
AND DTD.DEL_FLG <> 'Y'
AND DTD.PSTD_FLG = 'Y'
AND G.schemecode IN ('SBPRV','SBPRS','WSSTF','BGFRN','NREPV','NROPV','BSNRE','BSNRO')
AND (SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,6) IN ('405997','406228','406229','415527','415528','417917','417918','418210','421539','421572','432198','435736','450502','450503','450504','468805','469190','469191','469192','474856','478286','478287','486292','490222','490223','490254','512932','512932','514833','522346','522352','524458','526106','526701','527114','527479','529608','529615','529616','532731','532734','533102','534680','536132','536610','536621','539149','539158','549751','557654','607118','607407','607445','607529','652189','652190','652157') OR SUBSTR(split(DTD.TRAN_RMKS,'/')[0],1,8) IN ('53270200','53270201','53270202','60757401','60757402') )
limit 50;
The query is too lengthy to translate in full, so I won't attempt to write all the code here, but I would suggest the DataFrame approach.
It has the flexibility to implement the above query using DataFrame and Column operations
such as filter, withColumn (if you want to convert a Hive UDF to a Scala function/UDF), cast for casting data types, etc.
I did this recently and it performs well.
Below is the pseudocode in Scala:
val df1 = hivecontext.sql("select * from ods_gam").as("G")
val df2 = hivecontext.sql("select * from DAILY_TRAN_DETAIL_TABLE2").as("DTD")
Now, join using your dataframes
val joinedDF = df1.join(df2, df1("G.ACID") === df2("DTD.ACID"), "inner")
// now apply your string functions here...
// joinedDF.withColumn(...), filter(...), when(...).otherwise(...) and so on
Note : I think in your case udfs are not required, simple string functions would suffice.
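To illustrate why simple string functions are enough, here is a plain-Python sketch of two of the query's string conditions: the PUR/POS check on TRAN_PARTICULAR and the prefix check on the first '/'-separated field of TRAN_RMKS. The helper name matches_filter is hypothetical, the prefix sets are shortened to a few values from the query, and NULL handling is ignored.

```python
# Abbreviated prefix lists copied from the query (the full lists are much longer)
SIX_DIGIT_PREFIXES = {'405997', '406228', '406229'}
EIGHT_DIGIT_PREFIXES = {'53270200', '53270201', '53270202'}


def matches_filter(tran_particular: str, tran_rmks: str) -> bool:
    """Mirror two of the WHERE conditions using only slicing and split."""
    # substr(TRAN_PARTICULAR, 1, 3) rlike '(PUR|POS).*'
    if tran_particular[:3] not in ('PUR', 'POS'):
        return False
    # split(TRAN_RMKS, '/')[0]
    first_field = tran_rmks.split('/')[0]
    # SUBSTR(..., 1, 6) IN (...) OR SUBSTR(..., 1, 8) IN (...)
    return (first_field[:6] in SIX_DIGIT_PREFIXES
            or first_field[:8] in EIGHT_DIGIT_PREFIXES)


print(matches_filter('POS PURCHASE', '405997123456/ATM01'))
print(matches_filter('ATM WDL', '405997123456/ATM01'))
```

The same slicing and splitting exist as built-in column functions in Spark (substring, split), which is why no UDF should be needed for this part of the query.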
Also have a look at DataFrameJoinSuite.scala, which could be very useful for you.
For further details, refer to the docs (Spark 1.5):
DataFrame.html
All the DataFrame column operations: Column.html
If you are looking for sample UDF code, below is a code snippet.
Construct Dummy Data
import util.Random
import org.apache.spark.sql.Row
implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}
val students = Seq("John", "Mike","Matt")
val subjects = Seq("Math", "Sci", "Geography", "History")
val random = new Random(1)
val data =(students cross subjects).map{x => Row(x._1, x._2,random.nextInt(100))}.toSeq
// Create Schema Object
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(Array(
  StructField("student", StringType, nullable=false),
  StructField("subject", StringType, nullable=false),
  StructField("score", IntegerType, nullable=false)
))
// Create DataFrame
import org.apache.spark.sql.hive.HiveContext
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)
// Define udf
import org.apache.spark.sql.functions.udf
def udfScoreToCategory = udf((score: Int) => {
  score match {
    case t if t >= 80 => "A"
    case t if t >= 60 => "B"
    case t if t >= 35 => "C"
    case _ => "D"
  }
})
df.withColumn("category", udfScoreToCategory(df("score"))).show(10)
Just try running the query as it is. If you previously ran it with Hive on MapReduce, you should benefit right away. From there, if you still need better results, you can analyze the query plan and optimize further, for example by using partitioning. Spark uses memory more heavily and, beyond simple transformations, is generally faster than MapReduce. Spark SQL also uses the Catalyst optimizer, so your query benefits from that too.
Regarding your comment about "using spark functions like Map, Filter etc": map() just transforms data, and since your query only uses string functions, I don't think you will gain anything by rewriting them with .map(...); Spark will do the transformations for you. As for filter(), if you can filter the input data, you can simply rewrite the query using subqueries and other SQL capabilities.
