As the title states, I'd like to create a normalized version of an existing Double column.
As I'm quite new to pyspark, this was my attempt at solving this:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df2 = df2.withColumn('count_trans_norm', F.col('count_trans') / (F.max(F.col('count_trans'))))
When I do this, I get the following error:
"grouping expressions sequence is empty, and '`movie_id`' is not an aggregate function.
Any help would be much appreciated.
You need to specify an empty window if you want to get the maximum of count_trans in df2:
df2 = df.groupBy('id').count().toDF(*['id','count_trans'])
df3 = df2.selectExpr('*', 'count_trans / max(count_trans) over () as count_trans_norm')
Or if you prefer pyspark syntax:
from pyspark.sql import functions as F, Window
df3 = df2.withColumn('count_trans_norm', F.col('count_trans') / F.max(F.col('count_trans')).over(Window.orderBy()))
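As a quick sanity check, here is a minimal sketch on made-up toy data (the spark session name and the values are assumptions, not from the question):
from pyspark.sql import functions as F, Window

# hypothetical toy data: three transactions across two ids
df = spark.createDataFrame([('a', 1), ('a', 2), ('b', 3)], ['id', 'trans'])
df2 = df.groupBy('id').count().toDF('id', 'count_trans')
df3 = df2.withColumn(
    'count_trans_norm',
    F.col('count_trans') / F.max('count_trans').over(Window.orderBy()))
df3.show()
# id 'a' has 2 transactions and 'b' has 1, so count_trans_norm is 1.0 and 0.5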
I'm working in a Jupyter notebook and am trying to create objects for the two different answers in a column, Yes and No, in order to see the similarities among all of the 'yes' responses and, likewise, among the 'no' responses.
When I use the following code, I get an error that states: UndefinedVariableError: name 'No' is not defined
df_yes = df.query('No-show == "Yes"')
df_no = df.query('No-show == "No"')
Since the same error occurs even when I only include df_yes, I figured it has to have something to do with the column name "No-show." So I tried it with different columns and, sure enough, it works.
So can someone enlighten me as to what I'm doing wrong with this code block so I won't do it again? Thanks!
Observe this example:
>>> import pandas as pd
>>> d = {'col1': ['Yes','No'], 'col2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col1 == "Yes"')
col1 col2
0 Yes No
>>> df.query('col2 == "Yes"')
Empty DataFrame
Columns: [col1, col2]
Index: []
>>>
Everything seems to work as expected. But, if I change col1 and col2 to col-1 and col-2, respectively:
>>> d = {'col-1': ['Yes','No'], 'col-2': ['No','No']}
>>> df = pd.DataFrame(data=d)
>>> df.query('col-1 == "Yes"')
...
pandas.core.computation.ops.UndefinedVariableError: name 'col' is not defined
As you can see, the problem is the minus (-) you use in your column name. As a matter of fact, you were even more unlucky, because the No in your error message refers to No-show and not to the value No in your columns.
So, the best solution (and best practice in general) is to name your columns differently (think of them as variables; you cannot have a minus in a variable name, at least in Python). For example, No_show. If this data frame is not created by you (e.g. you read your data from a csv file), it's common practice to rename the columns appropriately.
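For example, a minimal rename-then-query sketch (the toy frame below is made up, not your data):
import pandas as pd

df = pd.DataFrame({'No-show': ['Yes', 'No'], 'Age': [30, 40]})  # hypothetical data
df = df.rename(columns={'No-show': 'No_show'})  # rename once, up front
df_yes = df.query('No_show == "Yes"')
df_no = df.query('No_show == "No"')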
I've been trying to use reindex instead of loc in pandas, since as of 0.24 there is a warning about indexing with lists (it suggests using reindex instead).
The issue I have is that I use loc to change the values of my dataframes.
Now if I use reindex I lose this, and if I try to be smart I even get a bug.
Contemplate the following case:
df = pd.DataFrame(data=pd.np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series(data=[1]*3)
I want to change a subset of values (while retaining the others), so df keeps the same shape.
So this is the original behaviour, which works (and changes the values in a subset of df['a'] to 1):
df.loc[range(3), 'a'] = ds
But when I use reindex I fail to change anything:
df.reindex(range(3)).loc['a'] = ds
Now when I try something like this:
df.loc[:, 'a'].reindex(range(3)) = ds
I get a SyntaxError: can't assign to function call error message.
For reference I am using pandas 0.24 and python 3.6.8
The quick answer from @coldspeed was the easiest, though the alternative suggested by the warning is misleading.
So reindex returns a copy, whereas loc doesn't.
From the pandas docs:
A new object is produced unless the new index is equivalent to the current one and copy=False.
So saying that reindex is an alternative to loc, as the warning does, is actually misleading.
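A quick sketch of that copy behaviour on toy data (variable names here are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 2)), columns=['a', 'b'])
ds = pd.Series([1] * 3)

df.loc[range(3), 'a'] = ds      # loc writes into the original frame
print(df['a'].tolist())         # [1.0, 1.0, 1.0, 0.0]

subset = df.reindex(range(3))   # reindex hands back a new object
subset['a'] = 0                 # so this only touches the copy
print(df['a'].tolist())         # still [1.0, 1.0, 1.0, 0.0]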
Hope this helps people who face the same situation.
Here is my problem:
I've got this RDD:
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
Then I try :
from pyspark.sql.functions import collect_list

(rdd.map(lambda x: (x[0], x[1], x[2], list(x[3:])))
    .toDF(["col1", "col2", "col3", "col4"])
    .groupBy("col1", "col2", "col3")
    .agg(collect_list("col4"))
    .show())
Finally I should find this:
[col1,col2,col3,col4]=[u'PNR1',u'TKT1',u'TEST',[[u'a2',u'a3'],[u'a5',u'a6'],[u'a8',u'a9']]]
But the problem is that I can't collect a list.
If anyone can help me, I will appreciate it.
I finally found a solution; it is not the best way, but I can continue working...
from pyspark.sql.functions import udf
from pyspark.sql.functions import *
from pyspark.sql.types import LongType

def example(lista):
    d = [[] for x in range(len(lista))]
    for index, elem in enumerate(lista):
        d[index] = elem.split("#")
    return d

example_udf = udf(example, LongType())  # note: defined but not actually used below
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
df = rdd.toDF(["col1","col2","col3","col4","col5"])
df2 = (df.withColumn('col6', concat(col('col4'), lit('#'), col('col5')))
         .drop(col("col4"))
         .drop(col("col5"))
         .groupBy([col("col1"), col("col2"), col("col3")])
         .agg(collect_set(col("col6")).alias("col6")))
df2.map(lambda x: (x[0],x[1],x[2],example(x[3]))).collect()
And it gives:
[(u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']])]
Hope this solution can help someone else.
Thanks for all your answers.
This might do your job (or give you some ideas to proceed further)...
One idea is to convert your col4 to a primitive data type, i.e. a string:
from pyspark.sql.functions import collect_list
import pandas as pd
a = [[u'PNR1',u'TKT1',u'TEST',u'a2',u'a3'],[u'PNR1',u'TKT1',u'TEST',u'a5',u'a6'],[u'PNR1',u'TKT1',u'TEST',u'a8',u'a9']]
rdd = sc.parallelize(a)
df = rdd.map(lambda x: (x[0],x[1],x[2], '(' + ' '.join(str(e) for e in x[3:]) + ')')).toDF(["col1","col2","col3","col4"])
df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
#[u'PNR1', u'TKT1', u'TEST', [u'(a2 a3)', u'(a5 a6)', u'(a8 a9)']]
UPDATE (after your own answer):
I really thought the point I had reached above was enough for you to adapt it further to your needs, and I didn't have time at the moment to do it myself; so, here it is (after modifying my df definition to get rid of the parentheses, it is just a matter of a single list comprehension):
df = rdd.map(lambda x: (x[0],x[1],x[2], ' '.join(str(e) for e in x[3:]))).toDF(["col1","col2","col3","col4"])
# temp list:
ff = df.groupBy("col1","col2","col3").agg(collect_list("col4")).toPandas().values.tolist()[0]
ff
# [u'PNR1', u'TKT1', u'TEST', [u'a2 a3', u'a5 a6', u'a8 a9']]
# final list of lists:
ll = ff[:-1] + [[x.split(' ') for x in ff[-1]]]
ll
which gives your initially requested result:
[u'PNR1', u'TKT1', u'TEST', [[u'a2', u'a3'], [u'a5', u'a6'], [u'a8', u'a9']]] # requested output
This approach has certain advantages compared with the one provided in your own answer:
It avoids PySpark UDFs, which are known to be slow
All the processing is done on the final (and hopefully much smaller) aggregated data, instead of adding and removing columns and performing map functions and UDFs on the initial (presumably much bigger) data
Since you cannot update to 2.x, your only option is the RDD API. Replace your current code with:
rdd.map(lambda x: ((x[0], x[1], x[2]), list(x[3:]))).groupByKey().toDF()
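If you also need the grouped values materialized as plain lists and the key flattened back into columns, a possible sketch along the same lines (the mapValues(list) call and the tuple flattening are my additions, not part of the one-liner above):
result = (rdd
          .map(lambda x: ((x[0], x[1], x[2]), list(x[3:])))
          .groupByKey()
          .mapValues(list)                   # turn the grouped iterable into a plain list of lists
          .map(lambda kv: kv[0] + (kv[1],))  # flatten ((k1, k2, k3), v) into (k1, k2, k3, v)
          .toDF(["col1", "col2", "col3", "col4"]))
result.show(truncate=False)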
I'm trying to use SQLContext.subtract() in Spark 1.6.1 to remove rows from a dataframe based on a column from another dataframe. Let's use an example:
from pyspark.sql import Row
df1 = sqlContext.createDataFrame([
Row(name='Alice', age=2),
Row(name='Bob', age=1),
]).alias('df1')
df2 = sqlContext.createDataFrame([
Row(name='Bob'),
])
df1_with_df2 = df1.join(df2, 'name').select('df1.*')
df1_without_df2 = df1.subtract(df1_with_df2)
Since I want all rows from df1 which don't have name='Bob', I expect Row(age=2, name='Alice'). But I also retrieve Bob:
print(df1_without_df2.collect())
# [Row(age='1', name='Bob'), Row(age='2', name='Alice')]
After various experiments to get down to this MCVE, I found out that the issue is with the age key. If I omit it:
df1_noage = sqlContext.createDataFrame([
Row(name='Alice'),
Row(name='Bob'),
]).alias('df1_noage')
df1_noage_with_df2 = df1_noage.join(df2, 'name').select('df1_noage.*')
df1_noage_without_df2 = df1_noage.subtract(df1_noage_with_df2)
print(df1_noage_without_df2.collect())
# [Row(name='Alice')]
Then I only get Alice as expected. The weirdest observation I made is that it's possible to add keys, as long as they're after (in the lexicographical order sense) the key I use in the join:
df1_zage = sqlContext.createDataFrame([
Row(zage=2, name='Alice'),
Row(zage=1, name='Bob'),
]).alias('df1_zage')
df1_zage_with_df2 = df1_zage.join(df2, 'name').select('df1_zage.*')
df1_zage_without_df2 = df1_zage.subtract(df1_zage_with_df2)
print(df1_zage_without_df2.collect())
# [Row(name='Alice', zage=2)]
I correctly get Alice (with her zage)! In my real examples, I'm interested in all columns, not only the ones that are after name.
Well, there are some bugs here (the first issue looks like it's related to the same problem as SPARK-6231) and filing a JIRA ticket looks like a good idea, but SUBTRACT / EXCEPT is not the right choice for partial matches.
Instead, as of Spark 2.0, you can use an anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
In 1.6 you can do pretty much the same thing with a standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
.join(ref, ref.name == df1.name, "leftouter")
.filter(F.isnull("ref.name"))
.drop(F.col("ref.name")))
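On the toy data above, either variant should keep only the Alice row; a quick sanity check might look like this (a sketch, the variable name kept is just illustrative):
# Spark 2.0+ anti-join:
print(df1.join(df1_with_df2, ["name"], "leftanti").collect())
# expected: only the Alice row (Bob is removed)

# Spark 1.6 outer-join workaround:
kept = (df1
        .join(ref, ref.name == df1.name, "leftouter")
        .filter(F.isnull("ref.name"))
        .drop(F.col("ref.name")))
print(kept.collect())
# expected: again only the Alice row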