What's the difference between Dataset.col() and functions.col() in Spark? - apache-spark

Here's a statement from this answer: https://stackoverflow.com/a/45600938/4164722
Dataset.col returns a resolved column, while col returns an unresolved column.
Can someone provide more details? When should I use Dataset.col() and when functions.col()?
Thanks.

In the majority of contexts there is no practical difference. For example:
val df: Dataset[Row] = ???
df.select(df.col("foo"))
df.select(col("foo"))
are equivalent, and the same applies to:
df.where(df.col("foo") > 0)
df.where(col("foo") > 0)
The difference becomes important when provenance matters, for example in joins:
val df1: Dataset[Row] = ???
val df2: Dataset[Row] = ???
df1.join(df2, Seq("id")).select(df1.col("foo") =!= df2.col("foo"))
Because Dataset.col is resolved and bound to a specific DataFrame, it allows you to unambiguously select a column descending from a particular parent. That wouldn't be possible with col.
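To make the provenance point concrete, here is a minimal PySpark sketch of the same idea (assuming an active SparkSession named spark; the data is purely illustrative):
from pyspark.sql.functions import col

df1 = spark.createDataFrame([(1, 10)], ["id", "foo"])
df2 = spark.createDataFrame([(1, 20)], ["id", "foo"])

joined = df1.join(df2, ["id"])

# Bound columns: each one knows which parent DataFrame it came from
joined.select(df1["foo"] != df2["foo"]).show()

# Unbound column: Spark cannot tell which "foo" is meant and raises
# an AnalysisException about an ambiguous reference
# joined.select(col("foo")).show()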

EXPLANATION:
At times you may want to programmatically pre-create column expressions ahead of time, for later use, before the related DataFrame(s) actually exist. In that use case, col(expression) can be useful. Generically illustrated using PySpark syntax:
>>> cX = col('col0') # Define an unresolved column.
>>> cY = col('myCol') # Define another unresolved column.
>>> cX,cY # Show that these are naked column names.
(Column<b'col0'>, Column<b'myCol'>)
These are called unresolved columns because they are not yet associated with any DataFrame, so there is no way to know whether those column names actually exist anywhere. However, you can apply them in a DataFrame context later on, after having prepared them:
>>> df = spark_sesn.createDataFrame([Row(col0=10, col1='Ten', col2=10.0),])
>>> df
DataFrame[col0: bigint, col1: string, col2: double]
>>> df.select(cX).collect()
[Row(col0=10)] # cX is successfully resolved.
>>> df.select(cY).collect()
Traceback (most recent call last): # Oh dear! cY, which represents
[ ... snip ... ] # 'myCol' is truly unresolved here.
# BUT maybe later on it won't be, say,
# after a join() or something else.
CONCLUSION:
col(expression) can help programmatically decouple the DEFINITION of a column specification from the APPLICATION of it against DataFrame(s) later on. Note that expr(aString), which also returns a column specification, provides a generalization of col('xyz'), where whole expressions can be DEFINED and later APPLIED:
>>> cZ = expr('col0 + 10') # Creates a column specification / expression.
>>> cZ
Column<b'(col0 + 10)'>
>>> df.select(cZ).collect() # Applying that expression later on.
[Row((col0 + 10)=20)]
I hope this alternative use case helps.

Related

Pyspark - Looking to create a normalized version of a Double column

As the title states, I'd like to create a normalized version of an existing Double column.
As I'm quite new to pyspark, this was my attempt at solving this:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans') / (F.max(F.col('count_trans'))))
When I do this, I get the following error:
"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.
Any help would be much appreciated.
You need to specify an empty window if you want to get the maximum of count_trans in df2:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
Or if you prefer PySpark syntax:
from pyspark.sql import functions as F, Window
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))
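If pulling all rows into a single unpartitioned window is a concern, a hedged alternative (assuming Spark 2.1+ for crossJoin, and the same df2 as above) is to compute the maximum once and join it back:
from pyspark.sql import functions as F

max_df = df2.agg(F.max('count_trans').alias('max_count'))   # one-row frame holding the maximum
df3 = (df2.crossJoin(max_df)
          .withColumn('count_trans_norm', F.col('count_trans') / F.col('max_count'))
          .drop('max_count'))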

Data Frame Error: UndefinedVariableError: name is not defined

I'm working in a Jupyter notebook and am trying to create objects for two different answers in a column, Yes and No, in order to see the similarities among all of the 'yes' responses, and the same for the 'no' responses.
When I use the following code, I get an error that states: UndefinedVariableError: name 'No' is not defined
df_yes=df.query('No-show == \"Yes\"')
df_no=df.query('No-show == \"No\"')
Since the same error occurs even when I only include df_yes, I figured it has to have something to do with the column name "No-show." So I tried it with different columns and, sure enough, it works.
So can someone enlighten me as to what I'm doing wrong with this code block so I won't do it again? Thanks!
Observe this example:
>>> import pandas as pd
>>> d = {'col1': ['Yes','No'], 'col2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col1 == \"Yes\"')
  col1 col2
0  Yes   No
>>> df.query('col2 == \"Yes\"')
Empty DataFrame
Columns: [col1, col2]
Index: []
>>>
Everything seems to work as expected. But, if I change col1 and col2 to col-1 and col-2, respectively:
>>> d = {'col-1': ['Yes','No'], 'col-2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col-1 == \"Yes\"')
...
pandas.core.computation.ops.UndefinedVariableError: name 'col' is not defined
As you can see, the problem is the minus (-) you use in your column name. As a matter of fact, you were even more unlucky because No in your error message refers to No-show and not to the value No of your columns.
So, the best solution (and best practice in general) is to name your columns differently; think of them as variables: you cannot have a minus in the name of a variable, at least in Python. For example, No_show. If this data frame is not created by you (e.g. you read your data from a CSV file), it's common practice to rename the columns appropriately.
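For example, a minimal sketch of the rename approach (the frame and values are just illustrative):
import pandas as pd

df = pd.DataFrame({'No-show': ['Yes', 'No'], 'Age': [30, 40]})

# Rename the problematic column to a valid Python identifier
df = df.rename(columns={'No-show': 'No_show'})

df_yes = df.query('No_show == "Yes"')
df_no = df.query('No_show == "No"')
Alternatively, newer pandas versions (0.25+) let query reference such columns with backticks, e.g. df.query('`No-show` == "Yes"'), if renaming is not an option.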

Pandas - Behaviour/Issue of reindex vs loc for setting values

I've been trying to use reindex instead of loc in pandas as from 0.24 there is a warning about reindexing with lists.
The issue I have is that I use loc to change the values of my dataframes.
Now if I use reindex I lose this, and if I try to be smart I even run into an error.
Contemplate the following case:
df = pd.DataFrame(data=pd.np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1]*3)
I want to change a subset of values (while retaining the others), so df keeps the same shape.
So this is the original behaviour which works (and changes the values in a subset of df['a'] to 1)
df.loc[range(3), 'a'] = ds
But when I'm using the reindex I fail to change anything:
df.reindex(range(3)).loc['a'] = ds
Now when I try something like this:
df.loc[:, 'a'].reindex(range(3)) = ds
I get a SyntaxError: can't assign to function call error message.
For reference I am using pandas 0.24 and python 3.6.8
The quick answer from @coldspeed was the easiest, though the behaviour suggested by the warning is misleading.
So reindex returns a copy, whereas loc doesn't.
From the pandas docs:
A new object is produced unless the new index is equivalent to the current one and copy=False.
So saying that reindex is an alternative to loc, as the warning suggests, is actually misleading.
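A minimal sketch of that difference, using the frame from the question (behaviour as I understand it for pandas 0.24):
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1] * 3)

# loc writes into the original frame in place
df.loc[range(3), 'a'] = ds
print(df['a'].tolist())          # [1.0, 1.0, 1.0, 0.0]

# reindex hands back a new object, so this assignment only touches a
# temporary copy (possibly with a SettingWithCopyWarning); df is unchanged
df.reindex(range(3))['a'] = ds
print(df['a'].tolist())          # still [1.0, 1.0, 1.0, 0.0]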
Hope this helps people who face the same situation.

How to group by multiple columns and collect in list in PySpark?

Here is my problem:
I've got this RDD:
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd= sc.parallelize (a)
Then I try :
rdd.map(lambda x: (x[0], x[1], x[2], list(x[3:]))) \
   .toDF(["col1", "col2", "col3", "col4"]) \
   .groupBy("col1", "col2", "col3") \
   .agg(collect_list("col4")).show()
Finally I should find this:
[col1,col2,col3,col4]=[u'PNR1',u'TKT1',u'TEST',[[u'a2',u'a3'],[u'a5',u'a6'],[u'a8',u'a9']]]
But the problem is that I can't collect a list.
If anyone can help me I will appreciate it
I finally found a solution; it is not the best way, but it lets me continue working...
from pyspark.sql.functions import udf
from pyspark.sql.functions import *
from pyspark.sql.types import LongType

def example(lista):
    # split each 'a2#a3' string back into a list of its parts
    d = [[] for x in range(len(lista))]
    for index, elem in enumerate(lista):
        d[index] = elem.split("#")
    return d

# (the UDF itself is not used below; example is called directly in the RDD map)
example_udf = udf(example, LongType())
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd= sc.parallelize (a)
df = rdd.toDF(["col1","col2","col3","col4","col5"])
df2 = (df.withColumn('col6', concat(col('col4'), lit('#'), col('col5')))
         .drop(col("col4"))
         .drop(col("col5"))
         .groupBy([col("col1"), col("col2"), col("col3")])
         .agg(collect_set(col("col6")).alias("col6")))
df2.map(lambda x: (x[0],x[1],x[2],example(x[3]))).collect()
And it gives:
[(u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']])]
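As a side note, if you ever wanted to apply the splitting logic as a proper UDF on df2 instead of in an RDD map, the declared return type would need to match the nested lists you return (an array of arrays of strings rather than LongType). A hedged sketch, reusing the df2 from the snippet above:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

split_udf = udf(lambda xs: [x.split("#") for x in xs],
                ArrayType(ArrayType(StringType())))

df3 = df2.withColumn("col6", split_udf(col("col6")))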
Hope this solution can help someone else.
Thanks for all your answers.
This might do your job (or give you some ideas to proceed further)...
One idea is to convert your col4 to a primitive data type, i.e. a string:
from pyspark.sql.functions import collect_list
import pandas as pd
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
df = rdd.map(lambda x: (x[0],x[1],x[2], '(' + ' '.join(str(e) for e in x[3:]) + ')')).toDF(["col1","col2","col3","col4"])
df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
#[u'PNR1', u'TKT1', u'TEST', [u'(a2 a3)', u'(a5 a6)', u'(a8 a9)']]
UPDATE (after your own answer):
I really thought the point I had reached above was enough for you to adapt further according to your needs, and I didn't have time at the moment to do it myself; so, here it is (after modifying my df definition to get rid of the parentheses, it is just a matter of a single list comprehension):
df = rdd.map(lambda x: (x[0],x[1],x[2], ' '.join(str(e) for e in x[3:]))).toDF(["col1","col2","col3","col4"])
# temp list:
ff = df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
ff
# [u'PNR1', u'TKT1', u'TEST', [u'a2 a3', u'a5 a6', u'a8 a9']]
# final list of lists:
ll = ff[:-1] + [[x.split(' ') for x in ff[-1]]]
ll
which gives your initially requested result:
[u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']]] # requested output
This approach has certain advantages compared with the one provided in your own answer:
It avoids PySpark UDFs, which are known to be slow.
All the processing is done on the final (and hopefully much smaller) aggregated data, instead of adding and removing columns and performing map functions and UDFs on the initial (presumably much bigger) data.
Since you cannot update to 2.x, your only option is the RDD API. Replace your current code with:
rdd.map(lambda x: ((x[0], x[1], x[2]), list(x[3:]))).groupByKey().toDF()
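Note that groupByKey yields (key, iterable-of-values) pairs, so depending on the version you may need to flatten the key back into columns and materialise the values before calling toDF; a hedged expansion of the same idea:
result = (rdd
    .map(lambda x: ((x[0], x[1], x[2]), list(x[3:])))
    .groupByKey()                                   # -> (key tuple, ResultIterable of value lists)
    .map(lambda kv: (kv[0][0], kv[0][1], kv[0][2],
                     [list(v) for v in kv[1]]))     # flatten the key, materialise the values
    .toDF(["col1", "col2", "col3", "col4"]))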

How to remove rows in DataFrame on column based on another DataFrame?

I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(name='Alice', age=2),
Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't include name='Bob' I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
Row(name='Alice'),
Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
Row(zage=2, name='Alice'),
Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks like it's related to the same problem as SPARK-6231) and a JIRA ticket looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
    .join(ref, ref.name == df1.name, "leftouter")  # keep every row from df1
    .filter(F.isnull("ref.name"))                  # keep only rows with no match in ref
    .drop(F.col("ref.name")))                      # drop the helper column
