I have two dataframes df1 and df2 somewhat like this:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("someAppname").getOrCreate()
df1 = spark.createDataFrame(pd.DataFrame({"entity_nm": ["Joe B", "Donald", "Barack Obama"]}))
df2 = spark.createDataFrame(pd.DataFrame({"aliases": ["Joe Biden; Biden Joe", "Donald Trump; Donald J. Trump", "Barack Obama", "Joe Burrow"], "id": [1, 2, 3, 4]}))
I want to left-join df2 onto df1 based on a string contains match. It works when I do it like this:
df_joined = df1.join(df2, df2.aliases.contains(df1.entity_nm), how="left")
That join gives me my desired result:
+------------+--------------------+---+
|   entity_nm|             aliases| id|
+------------+--------------------+---+
|       Joe B|Joe Biden; Biden Joe|  1|
|       Joe B|          Joe Burrow|  4|
|      Donald|Donald Trump; Don...|  2|
|Barack Obama|        Barack Obama|  3|
+------------+--------------------+---+
The problem: with about 60k entity names in df1 and around 6 million aliases in df2, this approach runs for a very long time and at some point my Spark session crashes with memory errors. I'm pretty sure my approach is naive and far from optimized.
I've read a blog post that suggests using a UDF, but I don't have any Scala knowledge and struggle to understand and recreate it in PySpark.
Any suggestions or help on how to optimize my approach? I need to do tasks like this a lot, so any help would be greatly appreciated.
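One variation I'm considering (a minimal sketch, assuming df1 with its 60k rows is small enough to broadcast; I haven't verified it at full scale) is to hint the small side so Spark plans a broadcast nested-loop join instead of shuffling the 6 million alias rows. It doesn't reduce the number of string comparisons, only the shuffle:
from pyspark.sql import functions as F

# sketch: broadcast the small entity table and keep the contains() condition;
# an inner join is shown for simplicity, so entities without any matching alias
# would be dropped, unlike the original left join
df_joined = df2.join(
    F.broadcast(df1),
    F.col("aliases").contains(F.col("entity_nm")),
    "inner",
)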
I have the below Spark dataset:
Column_1
255678.05345
1111000.00002
It's a string column, and I have to round the non-decimal part to the nearest ten thousand; the output should be in decimal(10,2) format.
Expected output:
Column_1
260000.00
1110000.00
How can I achieve this in Spark? I tried using the round() method but am not sure how to apply it here.
As per my understanding, I tried to come up with a solution, hoping it will work. The following sample code works fine in my Databricks notebook.
Step 1: Sample data creation
%scala
val df = spark.createDataFrame(
Seq((0, 255678.05345), (1, 1111000.00002))
).toDF("id", "Salary")
df.show()
Output:
+---+-------------+
| id| Salary|
+---+-------------+
| 0| 255678.05345|
| 1|1111000.00002|
+---+-------------+
Step 2: Implementing the logic using round
%scala
import org.apache.spark.sql.functions._
df.withColumn("NewSalary", round(col("Salary")/ 1000000, 2) * 1000000).show()
Output:
+---+-------------+---------+
| id| Salary|NewSalary|
+---+-------------+---------+
| 0| 255678.05345| 260000.0|
| 1|1111000.00002|1110000.0|
+---+-------------+---------+
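If the result really has to be decimal(10,2) as the question asks, a possible follow-up in PySpark (a sketch only; column names mirror the Scala example above) is to cast after rounding to the nearest ten thousand:
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# sketch: round the string Salary column to the nearest ten thousand and cast to decimal(10,2)
df_py = spark.createDataFrame([(0, "255678.05345"), (1, "1111000.00002")], ["id", "Salary"])
df_py.withColumn(
    "NewSalary",
    (F.round(F.col("Salary").cast("double") / 10000) * 10000).cast(DecimalType(10, 2)),
).show()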
Just in case it is helpful, please vote for it.
I'm trying to split my dataframe depending on the number of nodes in my cluster;
my dataframe looks like this:
If I had nodes = 2 and dataframe.count() = 7,
then, to apply an iterative approach, the result of the split would be:
My question is: how can I do this?
You can do that (have a look at the code below) with one of the RDD partition functions, but I don't recommend it
unless you are fully aware of what you are doing and why you are doing it. In general (or rather, for most use cases) it is better to let Spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math
#creating a random dataframe
l = [(x,x+2) for x in range(1009)]
columns = ['one', 'two']
df = spark.createDataFrame(l, columns)
#coalesce into one partition so we can assign a partition key to each row
df = df.coalesce(1)
#number of nodes (==partitions)
pCount = 5
#creating a list of partition keys
#basically it repeats range(pCount) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))
#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()
#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+
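Once the rows are distributed this way, the per-partition ("iterative") processing mentioned in the question would typically go through mapPartitions. A hypothetical sketch (process_partition is a made-up placeholder for whatever per-partition logic is needed):
#hypothetical sketch: run some logic per partition; here we only yield the row count
def process_partition(rows):
    rows = list(rows)
    # ...replace with the real per-partition logic...
    yield len(rows)

print(df.rdd.mapPartitions(process_partition).collect())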
I have a PySpark dataframe (the original dataframe) with the data below (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe; the user just passes me the name of the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id Value Value1
1 "Text " "Avb"
2 1504 " Test"
3 1 2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all of its columns trimmed? Data after trimming all the columns should look like this:
id Value Value1
1 "Text" "Avb"
2 1504 "Test"
3 1 2
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
Input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func
for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from @osbon123's answer:
from pyspark.sql.functions import col, trim

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid calling withColumn in a loop, because each call creates a new projection on the DataFrame, which becomes time-consuming for very large or very wide dataframes. I created the following function based on this solution; it works with any dataframe, even one that mixes string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
This is the cleanest (and most computationally efficient) way I've seen to remove spaces from all column names. If you want underscores in place of the spaces, simply replace "" with "_".
# Standardize column names by removing spaces
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then use the trim function on all the string columns to trim their values, as sketched below.
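A rough sketch of that idea (assuming the input dataframe is called df, which the answer does not specify):
import pyspark.sql.functions as F

# use dtypes to find the string columns, then trim only those
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
df = df.select([
    F.trim(F.col(c)).alias(c) if c in string_cols else F.col(c)
    for c in df.columns
])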
In Spark I want to be able to parallelise over multiple dataframes.
The method I am trying is to nest dataframes inside a parent dataframe, but I am not sure of the syntax or whether it is possible at all.
For example I have the following 2 dataframes:
df1:
+-----------+---------+--------------------+------+
|         id| asset_id|                date|  text|
+-----------+---------+--------------------+------+
|20160629025|       A1|2016-06-30 11:41:...|aaa...|
|20160423007|       A1|2016-04-23 19:40:...|bbb...|
|20160312012|       A2|2016-03-12 19:41:...|ccc...|
|20160617006|       A2|2016-06-17 10:36:...|ddd...|
|20160624001|       A2|2016-06-24 04:39:...|eee...|
+-----------+---------+--------------------+------+
df2:
+--------+--------------------+--------------+
|asset_id|      best_date_time|  Other_fields|
+--------+--------------------+--------------+
|      A1|2016-09-28 11:33:...|           abc|
|      A1|2016-06-24 00:00:...|           edf|
|      A1|2016-08-12 00:00:...|           hij|
|      A2|2016-07-01 00:00:...|           klm|
|      A2|2016-07-10 00:00:...|           nop|
+--------+--------------------+--------------+
So I want to combine these to produce something like this:
+--------+--------------------+-------------------+
|asset_id|                 df1|                df2|
+--------+--------------------+-------------------+
|      A1| [df1 - rows for A1]|[df2 - rows for A1]|
|      A2| [df1 - rows for A2]|[df2 - rows for A2]|
+--------+--------------------+-------------------+
Note, I don't want to join or union them as that would be very sparse (I actually have about 30 dataframes and thousands of assets each with thousands of rows).
I then plan to do a groupByKey on this so that I get something like this that I can call a function on:
[('A1', <pyspark.resultiterable.ResultIterable object at 0x2534310>), ('A2', <pyspark.resultiterable.ResultIterable object at 0x25d2310>)]
I'm new to Spark, so any help is greatly appreciated.
TL;DR It is not possible to nest DataFrames but you can use complex types.
In this case you could for example (Spark 2.0 or later):
from pyspark.sql.functions import collect_list, struct
df1_grouped = (df1
    .groupBy("asset_id")
    .agg(collect_list(struct("id", "date", "text"))))
df2_grouped = (df2
    .groupBy("asset_id")
    .agg(collect_list(struct("best_date_time", "Other_fields"))))
df1_grouped.join(df2_grouped, ["asset_id"], "fullouter")
but you have to be aware that:
It is quite expensive.
It has limited applications. In general nested structures are cumbersome to use and require complex and expensive (especially in PySpark) UDFs.
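As a usage note, here is a sketch of the same grouping with explicit aliases (the snippet above does not alias the collected columns); simple built-in functions such as size() can then be applied to the nested columns without any UDF:
from pyspark.sql.functions import collect_list, struct, size, col

# same grouping as above, but with aliases so the nested columns are easy to reference
df1_grouped = (df1
    .groupBy("asset_id")
    .agg(collect_list(struct("id", "date", "text")).alias("df1_rows")))
df2_grouped = (df2
    .groupBy("asset_id")
    .agg(collect_list(struct("best_date_time", "Other_fields")).alias("df2_rows")))

combined = df1_grouped.join(df2_grouped, ["asset_id"], "fullouter")
# e.g. count the collected rows per asset without a UDF
combined.select("asset_id", size(col("df1_rows")), size(col("df2_rows"))).show()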
I have a Hive table that contains text data and some metadata associated with each document. It looks like this:
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
df = sc.parallelize([
    ("1", "doc_1", "fruit is good for you"),
    ("2", "doc_2", "you should eat fruit and veggies"),
    ("2", "doc_3", "kids eat fruit but not veggies")
]).toDF(["month", "doc_id", "text"])
+-----+------+--------------------+
|month|doc_id| text|
+-----+------+--------------------+
| 1| doc_1|fruit is good for...|
| 2| doc_2|you should eat fr...|
| 2| doc_3|kids eat fruit bu...|
+-----+------+--------------------+
I want to count words by month.
So far I've taken a CountVectorizer approach:
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
tokenized = tokenizer.transform(df)
cvModel = CountVectorizer().setInputCol("words").setOutputCol("features").fit(tokenized)
counted = cvModel.transform(tokenized)
+-----+------+--------------------+--------------------+--------------------+
|month|doc_id| text| words| features|
+-----+------+--------------------+--------------------+--------------------+
| 1| doc_1|fruit is good for...|[fruit, is, good,...|(12,[0,3,4,7,8],[...|
| 2| doc_2|you should eat fr...|[you, should, eat...|(12,[0,1,2,3,9,11...|
| 2| doc_3|kids eat fruit bu...|[kids, eat, fruit...|(12,[0,1,2,5,6,10...|
+-----+------+--------------------+--------------------+--------------------+
Now I want to group by month and return something that looks like:
month word count
1 fruit 1
1 is 1
...
2 fruit 2
2 kids 1
2 eat 2
...
How could I do that?
There is no built-in mechanism for Vector* aggregation but you don't need one here. Once you have tokenized data you can just explode and aggregate:
from pyspark.sql.functions import explode
(counted
    .select("month", explode("words").alias("word"))
    .groupBy("month", "word")
    .count())
If you prefer to limit the results to the vocabulary just add a filter:
from pyspark.sql.functions import col
(counted
    .select("month", explode("words").alias("word"))
    .where(col("word").isin(cvModel.vocabulary))
    .groupBy("month", "word")
    .count())
* Since Spark 2.4 we have access to Summarizer but it won't be useful here.