I'm new to Spark and still learning it. I previously worked with Python and pandas, where the map function is often used to apply a transformation to a column. I found out that Spark has a map function as well, but until now I haven't used it at all except for extracting values, like this: df.select("id").map(r => r.getString(0)).collect.toList
import spark.implicits._

val df3 = df2.map(row => {
  val util = new Util()
  val fullName = row.getString(0) + row.getString(1) + row.getString(2)
  (fullName, row.getString(3), row.getInt(5))
})
val df3Map = df3.toDF("fullName", "id", "salary")
My questions are:
Is it common to use the map function to transform DataFrame columns?
Is it common to use map the way the block of code above does? (Source: sparkbyexamples.)
When do people usually use map?
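For comparison, here is a minimal PySpark sketch of the same fullName transformation written with built-in column expressions rather than map; the column names first_name, middle_name and last_name are assumed, since the Scala version addresses columns by position:

from pyspark.sql import functions as F

# Concatenate the three name columns and keep id and salary, without a row-level map.
df3 = df2.select(
    F.concat("first_name", "middle_name", "last_name").alias("fullName"),
    F.col("id"),
    F.col("salary"),
)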
I have a use case where I need to map the elements of a PySpark column based on a condition.
Going through the documentation for the PySpark Column, I could not find a map function on a PySpark column.
So I tried to use the PySpark DataFrame map function, but I am not able to convert the PySpark column to a DataFrame.
Note: The reason I am using a PySpark column is that I get it as input from a library (Great Expectations) that I use.
@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return column.isin([3])
    # need to replace the above logic with a map function
    # like column.map(lambda x: __valid_date(x))
The _spark function's arguments are passed in by the library.
What I have:
A PySpark column with timestamp strings.
What I require:
A PySpark column with a boolean (True/False) for each element, based on validating the timestamp format.
Example for a DataFrame:
df.rdd.map(lambda x: __valid_date(x)).toDF()
The __valid_date function returns True/False.
So I either need to convert the PySpark column into a DataFrame to use the above map function, or is there a map function available on the PySpark column itself?
Looks like you need to return a column object that the framework will use for validation.
I have not used Great Expectations, but maybe you can define a UDF to transform your column. Something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T

valid_date_udf = F.udf(lambda x: __valid_date(x), T.BooleanType())

@column_condition_partial(engine=SparkDFExecutionEngine)
def _spark(cls, column, ts_formats, **kwargs):
    return valid_date_udf(column)
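The __valid_date function itself is not shown in the question; a minimal sketch of what it might look like, assuming ts_formats is a list of strptime format strings (that is an assumption), is:

from datetime import datetime

def __valid_date(value, ts_formats=("%Y-%m-%d %H:%M:%S",)):
    # Hypothetical helper: True if the string parses with any of the given formats.
    if value is None:
        return False
    for fmt in ts_formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False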
I know there's a function called expr that turns a Spark SQL expression string into a Spark Column:
>>> from pyspark.sql import functions as F
>>> F.expr("length(name)")
Column<b'length(name)'>
Is there a function that does the opposite - turn a Column back into a PySpark SQL string? Something like:
>>> F.inverse_expr(F.length(F.col('name')))
'length(name)'
I found out that Column's __repr__ gives you an idea of what the column expression is (like Column<b'length(name)'>), but it doesn't seem to be usable programmatically without some hacky parsing and string replacing.
In Scala, we can use Column#expr to get the SQL-style expression, as below:
length($"entities").expr.toString()
// length('entities)
In PySpark:
print(F.length("name")._jc.expr.container)
# length(name)
I tried the accepted answer by @Som in Spark 2.4.2 and Spark 3.2.1, and it didn't work.
The following approach worked for me in pyspark:
import re

import pyspark
from pyspark.sql import Column

def inverse_expr(c: Column) -> str:
    """Convert a column from `Column` type to an equivalent SQL column expression (string)"""
    from packaging import version

    sql_expression = c._jc.expr().sql()
    if version.parse(pyspark.__version__) < version.parse('3.2.0'):
        # prior to Spark 3.2.0, F.col('a.b') would be converted to `a.b` instead of the correct `a`.`b`
        # this workaround is used to fix this issue
        sql_expression = re.sub(
            r'''(`[^"'\s]+\.[^"'\s]+?`)''',
            lambda x: x.group(0).replace('.', '`.`'),
            sql_expression,
            flags=re.MULTILINE,
        )
    return sql_expression
>>> from pyspark.sql import functions as F
>>> inverse_expr(F.length(F.col('name')))
'length(`name`)'
>>> inverse_expr(F.length(F.lit('name')))
"length('name')"
>>> inverse_expr(F.length(F.col('table.name')))
'length(`table`.`name`)'
I have two RDDs. The first one contains information related to IP addresses (see the c_ip column):
[Row(unic_key=1608422, idx=18, s_date='2016-12-31', s_time='15:00:07', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
Row(unic_key=1608423, idx=19, s_date='2016-12-31', s_time='15:00:08', c_ip='119.228.181.78', c_session='3hyj0tb434o23uxegpnmvzr0', origine_file='inFile', process_date='2017-03-13'),
]
And another RDD with IP geolocation data:
network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
1.0.0.0/24,2077456,2077456,,0,0,,-33.4940,143.2104,1000
1.0.1.0/24,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.2.0/23,1810821,1814991,,0,0,,26.0614,119.3061,50
1.0.4.0/22,2077456,2077456,,0,0,,-33.4940,143.2104,1000
I would like to match these two, but the problem is that there is no exact equivalence between the columns of the two RDDs.
I would like to use the Python 3 ipaddress package and do a check like this:
> import ipaddress
> ipaddress.IPv4Address('1.0.0.5') in ipaddress.ip_network('1.0.0.0/24')
True
Is it possible to use a Python function to perform the join (a left outer join, so as not to exclude any lines from my first RDD)? How can I do that?
When using Apache Spark 1.6, you can still use a UDF as a predicate in a join. After generating some test data:
import ipaddress

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

sessions = sc.parallelize([(1608422, '119.228.181.78'),
                           (1608423, '119.228.181.78')]).toDF(['unic_key', 'c_ip'])

geo_ip = sc.parallelize([('1.0.0.0/24', 2077456, 2077456),
                         ('1.0.1.0/24', 1810821, 1814991),
                         ('1.0.2.0/23', 1810821, 1814991),
                         ('1.0.4.0/22', 2077456, 2077456)]).toDF(['network', 'geoname_id', 'registered_country_geoname_id'])
You can create the UDF predicate as follows:
def ip_range(ip, network_range):
    # unicode() is needed on Python 2 (the ipaddress backport expects text, not bytes);
    # on Python 3 it can simply be dropped.
    return ipaddress.IPv4Address(unicode(ip)) in ipaddress.ip_network(unicode(network_range))

pred = udf(ip_range, BooleanType())
And then you can use the UDF in the following join:
sessions.join(geo_ip).where(pred(sessions.c_ip, geo_ip.network))
Unfortunately this currently doesn't work in Spark 2.x, see https://issues.apache.org/jira/browse/SPARK-19728
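If that limitation bites on Spark 2.x, one possible workaround (a sketch only, not part of the original answer; the helper names ip_to_long, net_start and net_end are made up) is to keep the Python code out of the join condition: precompute numeric bounds for each CIDR block and a numeric value for each IP, then join on a plain range condition:

import ipaddress

from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Hypothetical helper UDFs: IPv4 address -> integer, CIDR network -> integer bounds.
ip_to_long = udf(lambda ip: int(ipaddress.IPv4Address(ip)), LongType())
net_start = udf(lambda net: int(ipaddress.ip_network(net).network_address), LongType())
net_end = udf(lambda net: int(ipaddress.ip_network(net).broadcast_address), LongType())

geo_ip_bounds = (geo_ip
                 .withColumn("net_start", net_start(col("network")))
                 .withColumn("net_end", net_end(col("network"))))

matched = (sessions
           .withColumn("ip_long", ip_to_long(col("c_ip")))
           .join(geo_ip_bounds,
                 (col("ip_long") >= col("net_start")) & (col("ip_long") <= col("net_end")),
                 "left_outer"))

Because the UDFs are evaluated in withColumn rather than inside the join condition, the join itself is expressed purely with column comparisons.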
Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this?
My code:
import sys

from pyspark import SparkContext
from pyspark.sql import Row, SparkSession

SparkContext().addPyFile("myfile.py")

spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()

from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")

a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2], message=myFunction(line[3]))).toDF()
I would like to be able to just call the function on the column name instead of mapping each row to line and then calling the function on line[index].
I'm using Spark version 2.0.1
You can simply use a user-defined function (udf) combined with withColumn:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

udf_myFunction = udf(myFunction, IntegerType())  # if the function returns an int
df = df.withColumn("message", udf_myFunction("_3"))  # "_3" being the name of the column you want to consider
This will add a column to the DataFrame df (replacing it if a column with that name already exists) containing the result of myFunction(line[3]).
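Since the CSV in the question is read with header=True, the column can also be referenced by its header name; a hedged end-to-end sketch (assuming the relevant column is called message, as in the Row built by the rdd.map version, and that myFunction returns a string) would be:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("myApp").getOrCreate()
spark.sparkContext.addPyFile("myfile.py")  # ship the module to the executors, as in the question

from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")

# Wrap the plain Python function once; adjust the return type to whatever myFunction returns.
myFunction_udf = udf(myFunction, StringType())

df = df.withColumn("message", myFunction_udf(df["message"]))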