DataFrame values Changing after adding columns using withColumn - apache-spark

I have created a dataframe by reading the data from db2 and dataframe looks like below.
df1.show()
Table_Name | Source_count | Target_Count
----------------------------------------
Test_tab | 12750 | 12750
After that, I added 4 columns with hardcoded values using withColumn and lit(). After adding these columns, the count changed.
df2 = df1.withColumn("batch", lit(-1)) \
    .withColumn("source", lit("p1")) \
    .withColumn("test_type", lit("count")) \
    .withColumn("Record_ts", lit("2022-05-12 20:20:15"))
df2.show()
Table_Name | Source_count | Target_Count | batch | source | test_type | Record_ts
---------------------------------------------------------------------------------
Test_tab | 12600 | 12750 | -1 | p1 | count | 2022-05-12 20:20:15
I don't understand why this happens; df2 was created immediately after df1.
Can someone explain the possible reasons for this change?
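A note on one likely possibility (hedged, since it depends on your source): Spark evaluates lazily, so each action such as show() can re-read the DB2 table, and Source_count will differ if the underlying data changed between the two reads. Caching df1 before deriving df2 is one way to test this:
df1.cache()  # pin the first read so later actions reuse the same data
df1.count()  # run an action to materialize the cache before deriving df2
If the counts then agree, the difference came from re-reading a changing source, not from withColumn itself.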

Related

pivoting a single row dataframe where groupBy can not be applied

I have a dataframe like this:
inputRecordSetCount | inputRecordCount | suspenseRecordCount
------------------------------------------------------------
166 | 1216 | 10
I am trying to make it look like
operation | value
---------------------------
inputRecordSetCount | 166
inputRecordCount | 1216
suspenseRecordCount | 10
I tried pivot, but it needs a groupBy field and I don't have one. I found some references to stack in Scala, but I'm not sure how to use it in PySpark. Any help would be appreciated. Thank you.
You can use the stack() operation as mentioned in this tutorial.
Since there are 3 columns, pass the count followed by a pair of label and column name for each:
stack(3, "inputRecordSetCount", inputRecordSetCount, "inputRecordCount", inputRecordCount, "suspenseRecordCount", suspenseRecordCount) as (operation, value)
Full example:
df = spark.createDataFrame(data=[[166, 1216, 10]], schema=['inputRecordSetCount', 'inputRecordCount', 'suspenseRecordCount'])
cols = [f'"{c}", {c}' for c in df.columns]  # one '"label", column' pair per column
exprs = f"stack({len(cols)}, {', '.join(cols)}) as (operation, value)"
df = df.selectExpr(exprs)
df.show()
+-------------------+-----+
| operation|value|
+-------------------+-----+
|inputRecordSetCount| 166|
| inputRecordCount| 1216|
|suspenseRecordCount| 10|
+-------------------+-----+
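Side note, assuming you are on Spark 3.4 or later: the built-in unpivot (alias melt) API does the same reshape without hand-building a stack() expression. A sketch, with wide_df standing for the original 3-column dataframe:
long_df = wide_df.unpivot(ids=[], values=wide_df.columns, variableColumnName="operation", valueColumnName="value")
long_df.show()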

Optimize Join of two large pyspark dataframes

I have two large pyspark dataframes df1 and df2 containing GBs of data.
The columns in first dataframe are id1, col1.
The columns in second dataframe are id2, col2.
The dataframes have equal number of rows.
Also all values of id1 and id2 are unique.
Also all values of id1 correspond to exactly one value id2.
For example, the first few entries of df1 and df2 are as follows:
df1:
id1 | col1
12 | john
23 | chris
35 | david
df2:
id2 | col2
23 | lewis
35 | boon
12 | cena
So I need to join the two dataframes on key id1 and id2.
df = df1.join(df2, df1.id1 == df2.id2)
I am afraid this may suffer from shuffling.
How can I optimize the join operation for this special case?
To avoid shuffling at join time, repartition the data on your id column beforehand.
The repartition is itself a full shuffle, but it will pay off if you perform more than one join on the same key.
df1 = df1.repartition('id1')
df2 = df2.repartition('id2')
Another way to avoid shuffles at join is to leverage bucketing.
Save both dataframes with a bucketBy clause on the id column; when you later read them back, rows with the same id land in the same partitions, so the join avoids a shuffle.
To get the benefit of bucketing, though, you need a Hive metastore, because the bucketing metadata is stored there.
This approach also adds the extra steps of writing the bucketed tables and reading them back.
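A minimal sketch of the bucketing approach (the table names and bucket count here are illustrative assumptions):
df1.write.bucketBy(200, "id1").sortBy("id1").mode("overwrite").saveAsTable("bucketed_df1")
df2.write.bucketBy(200, "id2").sortBy("id2").mode("overwrite").saveAsTable("bucketed_df2")
b1 = spark.table("bucketed_df1")  # bucketing metadata comes from the metastore
b2 = spark.table("bucketed_df2")
df = b1.join(b2, b1.id1 == b2.id2)  # no shuffle when both sides share the bucket count
Both tables must be bucketed into the same number of buckets on the join keys for the shuffle to be skipped.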

Grouping by name and then adding up the number of another column [duplicate]

I am using pyspark to read a parquet file like below:
my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**')
Then when I do my_df.take(5), it shows [Row(...)] instead of a table format like a pandas dataframe.
Is it possible to display the data frame in a table format like pandas data frame? Thanks!
The show method does what you're looking for.
For example, given the following dataframe of 3 rows, I can print just the first two rows like this:
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)
which yields:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
+---+---+
only showing top 2 rows
As mentioned by @Brent in the comment of @maxymoo's answer, you can try
df.limit(10).toPandas()
to get a prettier table in Jupyter. This can take some time to run if you are not caching the Spark dataframe, and .limit() will not preserve the order of the original dataframe.
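For example, with the df defined above, caching avoids recomputing the source on every display, and an explicit orderBy makes the selected rows deterministic (a small sketch, not a requirement):
df.cache()  # avoid recomputing the source each time it is displayed
df.orderBy('k').limit(10).toPandas()  # explicit ordering, since limit() gives no guarantee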
Let's say we have the following Spark DataFrame:
df = sqlContext.createDataFrame(
    [
        (1, "Mark", "Brown"),
        (2, "Tom", "Anderson"),
        (3, "Joshua", "Peterson")
    ],
    ('id', 'firstName', 'lastName')
)
There are typically three ways to print the content of a dataframe:
Print Spark DataFrame
The most common way is to use show() function:
>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Print Spark DataFrame vertically
Say that you have a fairly large number of columns and your dataframe doesn't fit on the screen. You can print the rows vertically. For example, the following command prints the top two rows vertically, without any truncation.
>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
id | 1
firstName | Mark
lastName | Brown
-RECORD 1-------------
id | 2
firstName | Tom
lastName | Anderson
only showing top 2 rows
Convert to Pandas and print Pandas DataFrame
Alternatively, you can convert your Spark DataFrame into a Pandas DataFrame using .toPandas() and finally print() it.
>>> df_pd = df.toPandas()
>>> print(df_pd)
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
Note that this is not recommended when you have to deal with fairly large dataframes, as Pandas needs to load all the data into memory. If this is the case, the following configuration will help when converting a large spark dataframe to a pandas one:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
For more details you can refer to my blog post Speeding up the conversion between PySpark and Pandas DataFrames
Yes: call the toPandas method on your dataframe and you'll get an actual pandas dataframe!
By default, show() prints the first 20 rows of a DataFrame. You can pass an argument to print a different number of rows. Since you may not know in advance how many rows a DataFrame will have, you can pass df.count() as the argument to print all of them:
df.show()            # prints 20 rows by default
df.show(30)          # prints 30 rows
df.show(df.count())  # gets the total row count and prints every row
If you are using Jupyter, this is what worked for me:
[1]
df= spark.read.parquet("s3://df/*")
[2]
dsp = df
[3]
%%display
dsp
This shows a well-formatted HTML table; you can also draw some simple charts on it straight away. For more documentation of %%display, type %%help.
Maybe something like this is a tad more elegant, in notebook environments such as Databricks that provide display():
df.display()
# OR
df.select('column1').display()

How to fill out nulls according to another dataframe pyspark

I recently started using pyspark. I have a two-column dataframe with one column containing some nulls, e.g.
df1
A B
1a3b 7
0d4s 12
6w2r null
6w2r null
1p4e null
and another dataframe has the correct mapping, i.e.
df2
A B
1a3b 7
0d4s 12
6w2r 0
1p4e 3
so I want to fill out the nulls in df1 using df2 s.t. the result is:
A B
1a3b 7
0d4s 12
6w2r 0
6w2r 0
1p4e 3
In pandas, I would first create a lookup dictionary from df2 and then use apply on df1 to populate the nulls. But I'm not sure which functions to use in pyspark; most of the null replacement I've seen is based on simple conditions, such as filling all the nulls in a column with a single constant value.
What I have tried is:
from pyspark.sql.functions import when, col
df1.withColumn('B', when(df.B.isNull(), df2.where(df2.B== df1.B).select('A')))
although I was getting AttributeError: 'DataFrame' object has no attribute '_get_object_id'. The logic was to first filter out the nulls and then replace them with column B's value from df2, but I think df.B.isNull() evaluates the whole column rather than a single value, which is probably not the right way to do it. Any suggestions?
A left join on the common column A, followed by selecting the appropriate columns, should get you your desired output:
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, df2.B).show(truncate=False)
which should give you
+----+---+
|A |B |
+----+---+
|6w2r|0 |
|6w2r|0 |
|1a3b|7 |
|1p4e|3 |
|0d4s|12 |
+----+---+
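If df2 ever lacked a mapping for some key, a hedged variant is to keep df1's value where present and fall back to df2's; coalesce is a standard PySpark function, and the rest mirrors the answer above:
from pyspark.sql.functions import coalesce
df1.join(df2, df1.A == df2.A, 'left').select(df1.A, coalesce(df1.B, df2.B).alias('B')).show(truncate=False)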

spark groupby on several columns at same time

I’m using Spark2.0
I have a dataframe having several columns like id, latitude, longitude, time.
I want to do a groupBy and keep ["latitude", "longitude"] always together.
Could I do the following?
df.groupBy('id', ["latitude", "longitude"], 'time')
I want to count the records for each user, at each different time, with each different location ["latitude", "longitude"].
You can combine the "latitude" and "longitude" columns and then use groupBy. The sample below uses Scala.
val df = Seq(("1","33.33","35.35","8:00"),("2","31.33","39.35","9:00"),("1","33.33","35.35","8:00")).toDF("id","latitude","longitude","time")
df.show()
val df1 = df.withColumn("lat-long",array($"latitude",$"longitude"))
df1.show()
val df2 = df1.groupBy("id","lat-long","time").count()
df2.show()
Output will be like below.
+---+--------------+----+-----+
| id| lat-long|time|count|
+---+--------------+----+-----+
| 2|[31.33, 39.35]|9:00| 1|
| 1|[33.33, 35.35]|8:00| 2|
+---+--------------+----+-----+
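If you need the same in PySpark, here is an equivalent sketch (a direct translation of the Scala above, using pyspark.sql.functions.array):
from pyspark.sql.functions import array, col
df1 = df.withColumn("lat-long", array(col("latitude"), col("longitude")))
df2 = df1.groupBy("id", "lat-long", "time").count()
df2.show()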
You can just use:
df.groupBy('id', 'latitude', 'longitude', 'time').agg(...)
This will work as expected without any additional steps.
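For the record-count use case in the question, a minimal concrete form (using count() since the agg(...) above is left open):
df.groupBy('id', 'latitude', 'longitude', 'time').count().show()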
