I have a dataframe like below
df.show(2,False)
col1
----------
[[1, 2], [3, 4]]
I want to add some static value to each inner array, like this:
col2
----------
[[1, 2, "Value"], [3, 4, "Value"]]
Please suggest a way to achieve this.
Explode the array, use the concat function to append the value to each inner array, and finally use collect_list to rebuild the nested array.
from pyspark.sql.functions import *
df.withColumn("spark_parti_id",spark_partition_id()).\
withColumn("col2",explode(col("col1"))).\
withColumn("col2",concat(col("col2"),array(lit(2)))).\
groupBy("spark_parti_id").\
agg(collect_list(col("col2")).alias("col2")).\
drop("spark_parti_id").\
show(10,False)
#+----------------------+
#|col2 |
#+----------------------+
#|[[1, 2, 2], [3, 4, 2]]|
#+----------------------+
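For what it's worth, on Spark 2.4+ the same result can be obtained without the explode/groupBy round trip (and without relying on partition ids) by using the transform higher-order function. A minimal sketch, appending the same literal 2 as above:
from pyspark.sql.functions import expr

# Append the literal 2 to every inner array in place, row by row.
df.withColumn("col2", expr("transform(col1, x -> concat(x, array(2)))")).show(10, False)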
I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
Which is something I can work with. However, I can't do the same with the includes column, as it has {} enclosing the [].
Is there a way for me to deal with this using PySpark?
EDIT:
If you look at the includes section in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally, in the first table in my question, I'd want includes to be users, or at least to be able to drill down to users.
As your includes column is a map/struct keyed by "users", you can use .getItem() to pull the array out by that key and then explode it, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
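Equivalently, a dot path reaches the nested array directly; a small sketch, assuming the same F alias for pyspark.sql.functions as above:
# Explode the users array inside the includes struct, then flatten the user fields.
df3 = df.withColumn("user", F.explode(F.col("includes.users"))).select("user.*")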
Please find the input and expected output below. For each store_id and period_id, 11 items should be present; if any item is missing, add it and fill that row with 0, without using a loop.
Any help is highly appreciated.
Input
Expected output
You can do this:
Sample df:
import pandas as pd

df = pd.DataFrame({'store_id': [1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962, 1160962],
                   'period_id': [1025, 1025, 1025, 1025, 1025, 1025, 1026, 1026, 1026, 1026, 1026],
                   'item_x': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7],
                   'z': [1, 4, 5, 6, 7, 8, 1, 2, 5, 6, 7]})
Solution:
num = range(1, 12)

def f(x):
    # Reindex each (store_id, period_id) group on the full item range 1..11,
    # filling missing rows with 0, then restore the group's own ids.
    return x.reindex(num, fill_value=0)\
        .assign(store_id=x['store_id'].mode()[0], period_id=x['period_id'].mode()[0])

df.set_index('item_x').groupby(['store_id', 'period_id'], group_keys=False).apply(f).reset_index()
You can do:
import pandas as pd
from itertools import product

# Cross every observed (store_id, period_id) group with items 1..11.
pdindex = product(df.groupby(["store_id", "period_id"]).groups, range(1, 12))
pdindex = pd.MultiIndex.from_tuples(map(lambda x: (*x[0], x[1]), pdindex), names=["store_id", "period_id", "Item"])

df = df.set_index(["store_id", "period_id", "Item"])
# Empty frame over the full index, then copy in the known rows and fill the rest with 0.
res = pd.DataFrame(index=pdindex, columns=df.columns)
res.loc[df.index, df.columns] = df
res = res.fillna(0).reset_index()
Note that this will only work assuming you don't have any Item outside the range [1, 11].
This is a simplification of #GrzegorzSkibinski's correct answer.
This answer does not modify the original DataFrame. It uses fewer variables to store intermediate data structures and employs a list comprehension to simplify a use of map.
I'm also using reindex() rather than creating a new DataFrame using the generated index and populating it with the original data.
import pandas as pd
import itertools
df.set_index(
    ["store_id", "period_id", "Item_x"]
).reindex(
    pd.MultiIndex.from_tuples(
        [
            group + (item,)
            for group, item in itertools.product(
                df.groupby(["store_id", "period_id"]).groups,
                range(1, 12),
            )
        ],
        names=["store_id", "period_id", "Item_x"],
    ),
    fill_value=0,
).reset_index()
In testing, output matched what you listed as expected.
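For reference, a merge-based sketch of the same fill-the-grid idea (column names follow the sample df above, so treat them as assumptions): build the full (store_id, period_id, item_x) grid, left-merge the original data onto it, and fill the gaps with 0.
import pandas as pd
from itertools import product

# Every observed (store_id, period_id) pair crossed with items 1..11.
pairs = df[['store_id', 'period_id']].drop_duplicates().itertuples(index=False, name=None)
full = pd.DataFrame([(s, p, i) for (s, p), i in product(pairs, range(1, 12))],
                    columns=['store_id', 'period_id', 'item_x'])
# Missing (store_id, period_id, item_x) rows appear as NaN after the merge; fill them with 0.
out = full.merge(df, on=['store_id', 'period_id', 'item_x'], how='left').fillna(0)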
I have an RDD and I want to find distinct values for multiple columns.
Example:
Row(col1=a, col2=b, col3=1), Row(col1=b, col2=2, col3=10), Row(col1=a1, col2=4, col3=10)
I would like to have a map:
col1=[a,b,a1]
col2=[b,2,4]
col3=[1,10]
Can a dataframe help compute it faster or more simply?
Update:
My solution with RDD was:
def to_uniq_vals(row):
    return [(k, v) for k, v in row.items()]
rdd.flatMap(to_uniq_vals).distinct().collect()
Thanks
I hope I understand your question correctly;
You can try the following:
import org.apache.spark.sql.{functions => F}
val df = Seq(("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)).toDF()
df.select(F.collect_set("_1"), F.collect_set("_2"), F.collect_set("_3")).show
Results:
+---------------+---------------+---------------+
|collect_set(_1)|collect_set(_2)|collect_set(_3)|
+---------------+---------------+---------------+
| [a1, b, a]| [1, 2, 4]| [1, 10]|
+---------------+---------------+---------------+
The code above should be more efficient than the proposed select-distinct, column-by-column approach for several reasons:
Fewer worker-to-driver round trips.
De-duping is done locally on each worker before the inter-worker de-duplication.
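For a PySpark DataFrame, a sketch of the same approach (the column names and the spark session are assumptions):
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1, 1), ("b", 2, 10), ("a1", 4, 10)], ["col1", "col2", "col3"])
# One pass over the data; each column is de-duplicated inside its own collect_set aggregate.
df.select([F.collect_set(c).alias(c) for c in df.columns]).show()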
Hope it helps!
You can use dropDuplicates and then select the same columns. It might not be the most efficient way, but it is still a decent option:
df.dropDuplicates("col1","col2", .... "colN").select("col1","col2", .... "colN").toJSON
(This works well in Scala.)
I am trying to create a new dataframe column (b) by removing the last character from column (a).
Column a is a string with varying lengths, so I am trying the following code:
from pyspark.sql.functions import *
df.select(substring('a', 1, length('a') -1 ) ).show()
I get a TypeError: 'Column' object is not callable
It seems to be due to using multiple functions, but I can't understand why, as these work on their own.
If I hardcode the column length, this works:
df.select(substring('a', 1, 10 ) ).show()
Or if I use length on its own, it works:
df.select(length('a') ).show()
Why can I not use multiple functions?
Is there an easier method of removing the last character from all rows in a column?
Using substr
df.select(col('a').substr(lit(0), length(col('a')) - 1))
or using regexp_extract:
df.select(regexp_extract(col('a'), '(.*).$', 1))
The substring function does not work here because its pos and len parameters need to be plain integers, not columns:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
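If you prefer to keep the substring spelling, the SQL form of the function accepts column expressions for pos and len, so an expr-based sketch also works:
from pyspark.sql.functions import expr

# The SQL substring evaluates length(a) - 1 per row, so no hardcoded width is needed.
df.select(expr("substring(a, 1, length(a) - 1)").alias("b")).show()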
Your code is almost correct; you just need to pass a plain Python integer for the length. Note that len('dummy') here is Python's len of the column-name string (which is 5), so this only works when every value has that fixed width.
df = spark.createDataFrame([('abcde',)],['dummy'])
from pyspark.sql.functions import substring
df.select('dummy',substring('dummy', 1, len('dummy') -1).alias('substr_dummy')).show()
#+-----+------------+
#|dummy|substr_dummy|
#+-----+------------+
#|abcde| abcd|
#+-----+------------+
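If the values have varying lengths, a regexp_replace sketch strips the final character regardless of width:
from pyspark.sql.functions import regexp_replace

# Remove the last character of every value, whatever its length.
df.select('dummy', regexp_replace('dummy', '.$', '').alias('trimmed_dummy')).show()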
I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. When I try to get a distinct count of event_date, the result is an integer variable, but when I try to get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of event_date as a variable.
Code that results in an integer type:
loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)
Code that results in dataframe type:
last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)
Edited to add a reproducible example:
from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("event_date", TimestampType(), True)])
df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),), (datetime(2015, 8, 10, 3, 44, 15),)], schema)
Code that returns a dataframe:
last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)
Code that returns a variable:
loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt)
You cannot directly access the values in a dataframe; collecting it gives you Row objects. A Row can, however, be converted into a Python dictionary. Go through the following example, where I calculate the average word count:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of the count column and apply the collect() operation on it. Remember that collect() returns a list; here the list contains only one element.
averageCount = wordCountsDF.groupBy().avg('count').collect()
The result looks something like this:
[Row(avg(count)=1.6666666666666667)]
You cannot directly access the average value as a plain Python variable. You have to convert the Row into a dictionary to access it.
results = {}
for i in averageCount:
    results.update(i.asDict())
print(results)
The final result looks like this:
{'avg(count)': 1.6666666666666667}
Finally, you can access the average value using:
print(results['avg(count)'])
1.66666666667
I'm pretty sure df.select([max('event_date')]) returns a DataFrame because select is a transformation and always produces another DataFrame, even though the aggregate here yields only a single row.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
If you want code to get the max event_date and store it as a variable, try the following:
max_date = df.select([max('event_date')]).distinct().collect()
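Since collect() returns a list of Row objects, a small sketch for pulling the scalar itself out might look like:
from pyspark.sql import functions as F

# collect() yields [Row(max(event_date)=...)]; index into it for the bare value.
max_date = df.select(F.max('event_date')).collect()[0][0]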
Using collect()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).collect()[0][0]
Using first()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).first()[0]
last_processed_dt=df.select([max('event_date')])
To get the max date as a value, we should try something like:
last_processed_dt=df.select([max('event_date').alias("max_date")]).collect()[0]
last_processed_dt["max_date"]
Based on Sujit's example: we can actually get the value out of [Row(avg(count)=1.6666666666666667)] without iterating or looping, by indexing averageCount[0][0].
Note: we do not need to loop because the result contains only one value.
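In code, that direct indexing looks like:
# averageCount == [Row(avg(count)=1.6666666666666667)], so no loop is needed.
print(averageCount[0][0])             # positional access
print(averageCount[0]['avg(count)'])  # access by column name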
Try this:
distinct_dates = test1.select('event_date').distinct()
loop_cnt = distinct_dates.count()
var = distinct_dates.collect()[0]
Hope this helps
What you can try is accessing the aggregated value through collect(), for example:
trainDF.fillna({'Age': trainDF.select('Age').agg(avg('Age')).collect()[0][0]})
You can also compute the distinct count as an aggregation and pull the value out with collect():
from pyspark.sql.functions import countDistinct
loop_cnt = test1.select(countDistinct('event_date')).collect()[0][0]
print(loop_cnt)