I have two dataframes:
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["name", "age"])
df2 = spark.createDataFrame([("joe", 88), ("luisa", 99)], ["name", "age"])
I want to update the age when the names match. So I thought using a when() would work.
df.withColumn("age", F.when(df.name == df2.name, df2.age).otherwise(df.age))
but this results in this error:
AnalysisException: Resolved attribute(s) name#181,age#182L missing from name#177,age#178L in operator !Project [name#177, CASE WHEN (name#177 = name#181) THEN age#182L END AS age#724L]. Attribute(s) with the same name appear in the operation: name,age. Please check if the right attribute(s) are used.;
How do I resolve this? When I print the when expression, I see this:
Column<'CASE WHEN (name = name) THEN age ELSE age END'>
By "update age", I assume you want latest/greatest age:
df \
.join(df2, how="inner", on="name") \
.withColumn("updated_age", F.greatest(df2.age, df.age)) \
.select("name", F.col("updated_age").alias("age"))
[Out]:
+-----+---+
| name|age|
+-----+---+
| joe| 88|
|luisa| 99|
+-----+---+
You need a join:
df.join(df2, how="left", on="name").withColumn("age", F.coalesce(df2.age, df.age))
I'm trying to create a dataframe from a list
Can someone let me know why I'm getting the error:
java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
with the following
from pyspark.sql.types import *
test_list = ['green', 'peter']
df = spark.createDataFrame(test_list,StringType()).toDF("color", "name")
Thanks
test_list should be a list of rows, where each row is a tuple or a list, like
test_list = [('green', 'peter')]
or
test_list = [['green', 'peter']]
With more than one row it looks like
test_list = [('green', 'peter'), ('red', 'brialle')]
df = spark.createDataFrame(test_list, schema=["color", "name"])
df.show()
Results in
+-----+-------+
|color|   name|
+-----+-------+
|green|  peter|
|  red|brialle|
+-----+-------+
Reference: CreateDataFrame
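For contrast, the original call builds a single-column DataFrame with one string per row, which is why the rename to two columns fails. A minimal sketch:
from pyspark.sql.types import StringType

# Each list element becomes its own row in a single 'value' column,
# so .toDF("color", "name") cannot map one column onto two names
spark.createDataFrame(['green', 'peter'], StringType()).show()
# +-----+
# |value|
# +-----+
# |green|
# |peter|
# +-----+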
required_columns = ['id', 'first_name', 'last_name', 'email', 'phone_numbers', 'courses']
target_column_names = ['user_id', 'user_first_name', 'user_last_name', 'user_email', 'user_phone_numbers', 'enrolled_courses']
users_df. \
select(required_columns). \
toDF(*target_column_names). \
show()
If I have a dataframe created as follows:
df = spark.table("tblName")
Is there any way that I can get back tblName from df?
You can extract it from the plan:
df.logicalPlan().argString().replace("`","")
We can extract the table name from a DataFrame by parsing its unresolved logical plan.
Please follow the method below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
import org.apache.spark.sql.execution.datasources.LogicalRelation

def getTableName(df: DataFrame): String = {
  Seq(df.queryExecution.logical, df.queryExecution.optimizedPlan).flatMap {
    _.collect {
      case LogicalRelation(_, _, catalogTable: Option[CatalogTable], _) =>
        if (catalogTable.isDefined) Some(catalogTable.get.identifier.toString()) else None
      case hive: HiveTableRelation => Some(hive.tableMeta.identifier.toString())
    }
  }.flatten.head
}
scala> val df = spark.table("db.table")
scala> getTableName(df)
res: String = `db`.`table`
The following utility function may be helpful to determine the table name from a given DataFrame.
import re
import typing

import pyspark.sql


def get_dataframe_tablename(df: pyspark.sql.DataFrame) -> typing.Optional[str]:
    """
    If the dataframe was created from an underlying table (e.g. spark.table('dual') or
    spark.sql("select * from dual")), this function will return the
    fully qualified table name (e.g. `default`.`dual`) as output; otherwise it will return None.

    Tested on Python 3.7 and Spark 3.0.1, but it should work with Spark >= 2.x and Python >= 3.4 too.

    Examples:
        >>> get_dataframe_tablename(spark.table('dual'))
        `default`.`dual`
        >>> get_dataframe_tablename(spark.sql("select * from dual"))
        `default`.`dual`

    It inspects the output of `df.explain()` to determine whether the df was created from a table or not.

    :param df: input dataframe whose underlying table name will be returned
    :return: table name or None
    """
    def _explain(_df: pyspark.sql.DataFrame) -> str:
        # df.explain() takes no parameter to capture its output; it prints the plan to stdout
        # by default, so redirect stdout into a StringIO buffer instead
        import contextlib
        import io
        with contextlib.redirect_stdout(io.StringIO()) as f:
            _df.explain()
        f.seek(0)  # Rewind stream position
        explanation = f.readlines()[1]  # Ignore first output line (== Physical Plan ==)
        return explanation

    pattern = re.compile("Scan hive (.+), HiveTableRelation (.+?), (.+)")
    output = _explain(df)
    match = pattern.search(output)
    return match.group(2) if match else None
The three lines of code below will give the table and database name:
import org.apache.spark.sql.execution.FileSourceScanExec

val df = session.table("dealer")
df.queryExecution.sparkPlan.asInstanceOf[FileSourceScanExec].tableIdentifier
Any answer on this one yet? I found a way, but it's probably not the prettiest. You can access the table name by retrieving the physical execution plan and then doing some string splitting magic on it.
Let's say you have a table from database_name.tblName. The following should work:
execution_plan = df._jdf.queryExecution().simpleString()
table_name = execution_plan.split('FileScan')[1].split('[')[0].split('.')[1]
The first line will return your execution plan in a string format. That will look similar to this:
== Physical Plan ==\n*(1) ColumnarToRow\n+- FileScan parquet database_name.tblName[column1#2880,column2ban#2881] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:/mnt/lake/database_name/table_name], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<column1:string,column2:string...\n\n'
After that you can run some string splitting to access the relevant information. The first split on 'FileScan' gives you everything after that keyword; within it, the part before the '[' (the column list) is what you want, and a final split on '.' returns tblName as the second element. See the step-by-step sketch below.
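To make the splitting easier to follow, here is the same logic broken into intermediate steps (a sketch assuming the plan string has the "FileScan <format> <database>.<table>[<columns>..." shape shown above):
# Step-by-step version of the one-liner above, using the execution_plan string
after_filescan = execution_plan.split('FileScan')[1]    # ' parquet database_name.tblName[column1#2880,...'
before_columns = after_filescan.split('[')[0].strip()   # 'parquet database_name.tblName'
table_name = before_columns.split('.')[1]               # 'tblName'
database_name = before_columns.split('.')[0].split(' ')[-1]  # 'database_name'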
You can create a table (temporary view) from df. But if the name is already taken by a local or global temporary view, you should drop it (sqlContext.dropTempTable) before creating another view with the same name, or use the create-or-replace variants (df.createOrReplaceGlobalTempView or df.createOrReplaceTempView; see the short sketch after the transcript below). If the view was registered with registerTempTable, you can register it again under the same name without an error, because registerTempTable replaces the existing view.
#Create data frame
>>> d = [('Alice', 1)]
>>> test_df = spark.createDataFrame(sc.parallelize(d), ['name','age'])
>>> test_df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tables
>>> test_df.createTempView("tbl1")
>>> test_df.registerTempTable("tbl2")
>>> sqlContext.tables().show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | tbl1| true|
| | tbl2| true|
+--------+---------+-----------+
#create data frame from tbl1
>>> df = spark.table("tbl1")
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tbl1 again using the df data frame. It will raise an error
>>> df.createTempView("tbl1")
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Temporary view 'tbl1' already exists;"
#drop and create again
>>> sqlContext.dropTempTable('tbl1')
>>> df.createTempView("tbl1")
>>> spark.sql('select * from tbl1').show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create data frame from tbl2 and replace name value
>>> df = spark.table("tbl2")
>>> df = df.replace('Alice', 'Bob')
>>> df.show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
#create tbl2 again using the df data frame
>>> df.registerTempTable("tbl2")
>>> spark.sql('select * from tbl2').show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
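As noted above, the drop step can be skipped by using the create-or-replace variant instead. A minimal sketch, reusing test_df from the transcript:
# createOrReplaceTempView silently overwrites an existing view with the same name,
# so this does not raise even though "tbl1" already exists
test_df.createOrReplaceTempView("tbl1")
spark.sql('select * from tbl1').show()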
I have a Spark dataframe and one of its fields is an array of Row structures. I need to expand them into their own columns. One of the problems is that, in the array, a field is sometimes missing.
The following is an example:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as udf
spark = SparkSession.builder.getOrCreate()
# data
rows = [{'status':'active','member_since':1990,'info':[Row(tag='name',value='John'),Row(tag='age',value='50'),Row(tag='phone',value='1234567')]},
{'status':'inactive','member_since':2000,'info':[Row(tag='name',value='Tom'),Row(tag='phone',value='1234567')]},
{'status':'active','member_since':2015,'info':[Row(tag='name',value='Steve'),Row(tag='age',value='28')]}]
# create dataframe
df = spark.createDataFrame(rows)
# transform info to dict
to_dict = udf.UserDefinedFunction(lambda s:dict(s),MapType(StringType(),StringType()))
df = df.withColumn("info_dict",to_dict("info"))
# extract name, NA if not exists
extract_name = udf.UserDefinedFunction(lambda s:s.get("name","NA"))
df = df.withColumn("name",extract_name("info_dict"))
# extract age, NA if not exists
extract_age = udf.UserDefinedFunction(lambda s:s.get("age","NA"))
df = df.withColumn("age",extract_age("info_dict"))
# extract phone, NA if not exists
extract_phone = udf.UserDefinedFunction(lambda s:s.get("phone","NA"))
df = df.withColumn("phone",extract_phone("info_dict"))
df.show()
You can see that for 'Tom', 'age' is missing; for 'Steve', 'phone' is missing. As in the code snippet above, my current solution is to first transform the array into a dict and then parse each individual field into its own column. The result looks like this:
+--------------------+------------+--------+--------------------+-----+---+-------+
| info|member_since| status| info_dict| name|age| phone|
+--------------------+------------+--------+--------------------+-----+---+-------+
|[[name, John], [a...| 1990| active|[name -> John, ph...| John| 50|1234567|
|[[name, Tom], [ph...| 2000|inactive|[name -> Tom, pho...| Tom| NA|1234567|
|[[name, Steve], [...| 2015| active|[name -> Steve, a...|Steve| 28| NA|
+--------------------+------------+--------+--------------------+-----+---+-------+
I really just want the columns 'status', 'member_since', 'name', 'age' and 'phone'. This solution works but is rather slow because of the UDFs. Are there any faster alternatives? Thanks
I can think of 2 ways to do this using DataFrame functions. I believe the first one should be faster, but the code is much less elegant. The second is more compact, but probably slower.
Method 1: Create Map Dynamically
The heart of this method is to turn your Row into a MapType(). This can be achieved using pyspark.sql.functions.create_map() and some magic using functools.reduce() and operator.add().
from functools import reduce
from operator import add

import pyspark.sql.functions as f
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(3)]
)
)
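To see what the reduce(add, ...) is doing, here is a tiny standalone sketch with plain Python lists standing in for the column expressions:
from functools import reduce
from operator import add

# reduce(add, ...) concatenates the per-index [key, value] pairs into one flat,
# alternating list, which is exactly the argument shape f.create_map() expects
pairs = [['k0', 'v0'], ['k1', 'v1'], ['k2', 'v2']]
print(reduce(add, pairs))  # ['k0', 'v0', 'k1', 'v1', 'k2', 'v2']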
The problem is that there isn't a way (AFAIK) to dynamically determine the length of the WrappedArray or iterate through it in an easy way. If a value is missing, this will cause an error because map keys cannot be null. However, since we know that the list contains either 1, 2, or 3 elements, we can just test for each of these cases.
keys = ['name', 'age', 'phone']  # the tags that can appear in 'info'

df.withColumn(
'map',
f.when(f.size(f.col('info')) == 1,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(1)]
)
)
).otherwise(
f.when(f.size(f.col('info')) == 2,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(2)]
)
)
).otherwise(
f.when(f.size(f.col('info')) == 3,
f.create_map(
*reduce(
add,
[[f.col('info')['tag'].getItem(k), f.col('info')['value'].getItem(k)]
for k in range(3)]
)
)
)))
).select(
['member_since', 'status'] + [f.col("map").getItem(k).alias(k) for k in keys]
).show(truncate=False)
The last step turns the 'map' keys into columns using the method described in this answer.
This produces the following output:
+------------+--------+-----+----+-------+
|member_since|status |name |age |phone |
+------------+--------+-----+----+-------+
|1990 |active |John |50 |1234567|
|2000 |inactive|Tom |null|1234567|
|2015 |active |Steve|28 |null |
+------------+--------+-----+----+-------+
Method 2: Use explode, groupBy and pivot
First use pyspark.sql.functions.explode() on the column 'info', and then use the 'tag' and 'value' columns as arguments to create_map():
df.withColumn('id', f.monotonically_increasing_id())\
.withColumn('exploded', f.explode(f.col('info')))\
.withColumn(
'map',
f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
)\
.select('id', 'member_since', 'status', 'map')\
.show(truncate=False)
#+------------+------------+--------+---------------------+
#|id |member_since|status |map |
#+------------+------------+--------+---------------------+
#|85899345920 |1990 |active |Map(name -> John) |
#|85899345920 |1990 |active |Map(age -> 50) |
#|85899345920 |1990 |active |Map(phone -> 1234567)|
#|180388626432|2000 |inactive|Map(name -> Tom) |
#|180388626432|2000 |inactive|Map(phone -> 1234567)|
#|266287972352|2015 |active |Map(name -> Steve) |
#|266287972352|2015 |active |Map(age -> 28) |
#+------------+------------+--------+---------------------+
I also added a column 'id' using pyspark.sql.functions.monotonically_increasing_id() to make sure we can keep track of which rows belong to the same record.
Now we can explode the map column, groupBy(), and pivot(). We can use pyspark.sql.functions.first() as the aggregate function for the groupBy() because we know there will only be one 'value' in each group.
df.withColumn('id', f.monotonically_increasing_id())\
.withColumn('exploded', f.explode(f.col('info')))\
.withColumn(
'map',
f.create_map(*[f.col('exploded')['tag'], f.col('exploded')['value']]).alias('map')
)\
.select('id', 'member_since', 'status', f.explode('map'))\
.groupBy('id', 'member_since', 'status').pivot('key').agg(f.first('value'))\
.select('member_since', 'status', 'age', 'name', 'phone')\
.show()
#+------------+--------+----+-----+-------+
#|member_since| status| age| name| phone|
#+------------+--------+----+-----+-------+
#| 1990| active| 50| John|1234567|
#| 2000|inactive|null| Tom|1234567|
#| 2015| active| 28|Steve| null|
#+------------+--------+----+-----+-------+
Suppose you try to extract a substring from a column of a dataframe. regexp_extract() returns a null if the field itself is null, but returns an empty string if the field is not null and the expression is not found. How can you return a null value for the latter case?
df = spark.createDataFrame([(None),('foo'),('foo_bar')], StringType())
df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted')).show()
# +---------+
# |extracted|
# +---------+
# | null|
# | |
# | bar|
# +---------+
I'm not sure if regexp_extract() could ever return None for a String type. One thing you could do is replace empty strings with None using a user defined function:
from pyspark.sql.functions import regexp_extract, udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(None),('foo'),('foo_bar')], StringType())
toNoneUDF = udf(lambda val: None if val == "" else val, StringType())
new_df = df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted'))
new_df.withColumn("extracted", toNoneUDF(new_df.extracted)).show()
This should work:
from pyspark.sql.functions import regexp_extract, when, col, lit
from pyspark.sql.types import StringType

df = spark.createDataFrame([(None), ('foo'), ('foo_bar')], StringType())
df = df.select(regexp_extract('value', r'_(.+)', 1).alias('extracted'))
df.withColumn(
    'extracted',
    when(col('extracted') != '', col('extracted')).otherwise(lit(None))
).show()
In Spark SQL, I've found a solution to count the number of regex occurrences, ignoring null values:
SELECT COUNT(CASE WHEN rlike(col, "_(.+)") THEN 1 END)
FROM VALUES (NULL), ("foo"), ("foo_bar"), ("") AS tab(col);
Result:
1
I hope this will help some of you.
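For completeness, the same count can be written with the DataFrame API in PySpark. This is a sketch that builds a small DataFrame mirroring the VALUES clause above:
import pyspark.sql.functions as F

df = spark.createDataFrame([(None,), ("foo",), ("foo_bar",), ("",)], "col string")
df.select(
    # count() skips nulls, and when() without otherwise() yields null for non-matches,
    # so only rows matching the regex are counted
    F.count(F.when(F.col("col").rlike("_(.+)"), 1)).alias("matches")
).show()
# +-------+
# |matches|
# +-------+
# |      1|
# +-------+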
Hi,
I have a dataset that groups all the products by a Transaction_ID, and I want to exclude the Transaction_IDs that have fewer than two products. For that I'm using this:
val edges = df.groupBy(col("Transaction_ID"))
  .agg(collect_list(col("Product_ID")) as "Product_ID")
  .withColumn("Product_ID", concat_ws(",", col("Product_ID")))
  .count()
  .filter("count >= 2")
But when I execute this I'm getting this error:
<console>:37: error: value filter is not a member of Long
How can I solve this problem?
Many thanks!
The error occurs because calling .count() on the DataFrame is an action that returns a Long (the total number of rows), so there is nothing left to call filter on. Keep the count as an aggregated column instead. You can try it like below.
val df = Seq(("tx-1", "aaa"), ("tx-2", "bbb"), ("tx-1", "ccc"),("tx-4", "ccc")).toDF("Transaction_ID", "Product_ID")
df.show
+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
| tx-1| aaa|
| tx-2| bbb|
| tx-1| ccc|
| tx-4| ccc|
+--------------+----------+
If you want only the Transaction_ID, then you can use:
val df4 =df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
df4.show
If you want both Transaction_ID and Product_ID, then:
val df1 = df.groupBy(col("Transaction_ID")).count().filter(col("count") >= 2)
val df2 = df.groupBy(col("Transaction_ID")).agg(collect_list(col("Product_ID")) as "Product_ID").withColumn("Product_ID", concat_ws(",", col("Product_ID")))
val df3 = df1.join(df2, df1("Transaction_ID") === df2("Transaction_ID"), "inner").select(df2("Transaction_ID"),df2("Product_ID"))
df3.show
+--------------+----------+
|Transaction_ID|Product_ID|
+--------------+----------+
| tx-1| aaa,ccc|
+--------------+----------+