In pandas, I can do something like this:
import numpy as np
import pandas as pd
data = {"col1": [np.random.randint(10) for x in range(1, 10)],
        "col2": [np.random.randint(100) for x in range(1, 10)]}
mypd = pd.DataFrame(data)
mypd
and it gives me a DataFrame with the two columns.
Is there a similar way to create a Spark DataFrame in PySpark?
The answer shared by Steven is brilliant.
Additionally, if you are comfortable with pandas, you can pass your pandas DataFrame directly to createDataFrame.
Spark >= 2.x
data = {
    "col1": [np.random.randint(10) for x in range(1, 10)],
    "col2": [np.random.randint(100) for x in range(1, 10)],
}
mypd = pd.DataFrame(data)
# `sql` is assumed to be an existing SparkSession (or SQLContext)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
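If pandas is not otherwise needed, the same column-oriented dictionary can be transposed into row tuples, which createDataFrame also accepts together with a list of column names. A minimal sketch, assuming `spark` is an existing SparkSession; the helper name `to_spark_df` is made up for illustration, and the standard-library `random` module stands in for NumPy:

```python
import random

# Column-oriented data, as in the pandas example (random replaces np.random here).
data = {
    "col1": [random.randint(0, 9) for _ in range(9)],
    "col2": [random.randint(0, 99) for _ in range(9)],
}

# Transpose the columns into one tuple per row.
columns = list(data.keys())
rows = list(zip(*data.values()))


def to_spark_df(spark):
    """Build a Spark DataFrame from the row tuples (needs a live SparkSession)."""
    return spark.createDataFrame(rows, columns)
```

Calling `to_spark_df(spark).show()` should then print a two-column table like the one above, with different random values.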
Say I have a dataframe:
+---+---+---+
| id|foo|bar|
+---+---+---+
|  1|baz|  0|
|  2|baz|  0|
|  3|333|  2|
|  4|444|  1|
+---+---+---+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X',
else if bar is 1: set the value of foo for that row to 'Y'
And if neither condition is met, leave the foo value as it is.
pyspark.sql.functions.when seems like the closest method, but it doesn't seem to work based on another column's value.
when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
     .when(F.col('bar') == 1, 'Y')
     .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
We can solve this using either when or a UDF in Spark to set the column based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using When:
from pyspark.sql.functions import when
df2 = df.withColumn("foo", when(df.bar == 2,"X")
.when(df.bar == 1,"Y")
.otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
def executeRule(foo, bar):
    if bar == 2:
        return 'X'
    elif bar == 1:
        return 'Y'
    else:
        # keep the existing foo value (returning bar here would overwrite foo)
        return foo
# Converting function to UDF
ruleUDF = F.udf(executeRule, StringType())
df3 = df.withColumn("foo", ruleUDF("foo", "bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
|  1|baz|  0|
|  2|baz|  0|
|  3|  X|  2|
|  4|  Y|  1|
+---+---+---+
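One practical point about the UDF route is that the rule is an ordinary Python function, so it can be checked against the sample rows before any Spark job runs. The sketch below uses a hypothetical two-argument variant named `execute_rule` (not a name from the answer) whose fallback branch returns the current foo value, which is what the question asks for:

```python
def execute_rule(foo, bar):
    """Return the new value of foo for one row."""
    if bar == 2:
        return "X"
    if bar == 1:
        return "Y"
    return foo  # fallback keeps the existing foo, not bar


# Same sample rows as the DataFrame above: (id, foo, bar).
sample = [(1, "baz", 0), (2, "baz", 0), (3, "333", 2), (4, "444", 1)]
result = [(row_id, execute_rule(foo, bar), bar) for row_id, foo, bar in sample]
print(result)
```

Wrapped as `F.udf(execute_rule, StringType())` and called with both column names, this should give the same table as the when version, though the built-in when is generally the cheaper option because it avoids the Python round trip.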
I want to calculate the Jaro Winkler distance between two columns of a PySpark DataFrame. Jaro Winkler distance is available through pyjarowinkler package on all nodes.
pyjarowinkler works as follows:
from pyjarowinkler import distance
distance.get_jaro_distance("A", "A", winkler=True, scaling=0.1)
Output:
1.0
I am trying to write a Pandas UDF to pass two columns as Series and calculate the distance using lambda function.
Here's how I am doing it:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    import pandas as pd
    distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
    distance_df['distance'] = distance_df.apply(lambda x: distance.get_jaro_distance(str(distance_df['column_A']), str(distance_df['column_B']), winkler = True, scaling = 0.1))
    return distance_df['distance']
temp = temp.withColumn('jaro_distance', get_distance(temp.x, temp.x))
I should be able to pass any two string columns in the above function.
I am getting the following output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| null|
| B| 3| 4| null|
| C| 5| 6| null|
| D| 7| 8| null|
+---+---+---+-------------+
Expected Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 1.0|
| C| 5| 6| 1.0|
| D| 7| 8| 1.0|
+---+---+---+-------------+
I suspect this might be because str(distance_df['column_A']) is not correct. It contains the concatenated string of all row values.
While this code works for me:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col):
    return col.apply(lambda x: distance.get_jaro_distance(x, "A", winkler = True, scaling = 0.1))
temp = temp.withColumn('jaro_distance', get_distance(temp.x))
Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 0.0|
| C| 5| 6| 0.0|
| D| 7| 8| 0.0|
+---+---+---+-------------+
Is there a way to do this with Pandas UDF? I'm dealing with millions of records so UDF will be expensive but still acceptable if it works. Thanks.
The error comes from the function you pass to df.apply: it must read the values of the current row (x) and be applied row-wise with axis=1. Adjusting it as follows should fix it:
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    import pandas as pd
    distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
    distance_df['distance'] = distance_df.apply(lambda x: distance.get_jaro_distance(x['column_A'], x['column_B'], winkler = True, scaling = 0.1), axis=1)
    return distance_df['distance']
However, the pandas df.apply method is not vectorised, which defeats the purpose of using pandas_udf over udf in PySpark. A faster solution with less overhead is to use a list comprehension to build the returned pd.Series (check this link for more discussion about pandas df.apply and its alternatives):
from pandas import Series
from pyjarowinkler import distance
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
    # pair the i-th value of col1 with the i-th value of col2
    return Series([ distance.get_jaro_distance(c1, c2, winkler=True, scaling=0.1) for c1,c2 in zip(col1, col2) ])
df.withColumn('jaro_distance', get_distance('x', 'y')).show()
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| AB| 1B| 2| 0.67|
| BB| BB| 4| 1.0|
| CB| 5D| 6| 0.0|
| DB|B7F| 8| 0.61|
+---+---+---+-------------+
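The row-by-row pairing that the list comprehension relies on can be tried without Spark or pyjarowinkler. In this sketch the metric is `difflib.SequenceMatcher.ratio` from the standard library, used purely as a stand-in (it is not Jaro-Winkler, so the scores differ from the table above), and the function names are illustrative:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in metric from the standard library; NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def pairwise(col1, col2, metric=similarity):
    """Apply metric to the i-th element of col1 and of col2, row by row."""
    return [metric(str(a), str(b)) for a, b in zip(col1, col2)]


# Same x and y values as in the table above.
x = ["AB", "BB", "CB", "DB"]
y = ["1B", "BB", "5D", "B7F"]
scores = pairwise(x, y)
print(scores)
```

Because zip walks both columns in step, each score depends only on the two values of its own row, which is the property the original str(distance_df['column_A']) call was missing.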
You can union all the data frames first, partition by the same partition key once the partitions have been shuffled and distributed to the worker nodes, and restore them before the pandas computation. Please see the example in SparkyPandas, a small toolkit I wrote for this scenario.
I am trying to output a dataframe containing only the columns whose values differ after comparing two dataframes. I am having difficulty identifying an approach.
**Code:**
df_a = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"pears","tiger","onion"),("c", 8,"jackfruit","elephant","raddish"),("c", 9,"watermelon","giraffe","tomato")], ["name", "id","fruit","animal","veggie"])
df_b = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"banana","tiger","onion"),("c", 8,"jackfruit","camel","raddish")], ["name", "id","fruit","animal","veggie"])
df_a = df_a.alias('df_a')
df_b = df_b.alias('df_b')
df = df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name),'leftanti').select('df_a.*').show()
Trying to match based on the ids (id,name) between dataframe1 & dataframe2
Dataframe 1:
+----+---+----------+--------+-------+
|name| id| fruit| animal| veggie|
+----+---+----------+--------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| pears| tiger| onion|
| c| 8| jackfruit|elephant|raddish|
| c| 9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+
Dataframe 2:
+----+---+---------+------+-------+
|name| id| fruit|animal| veggie|
+----+---+---------+------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| banana| tiger| onion|
| c| 8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
Expected dataframe
+----+---+----------+--------+
|name| id| fruit| animal|
+----+---+----------+--------+
| c| 7| pears| tiger|
| c| 8| jackfruit|elephant|
| c| 9|watermelon| giraffe|
+----+---+----------+--------+
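One possible way to reason about this comparison, sketched in plain Python on the same rows: first keep the rows of dataframe 1 that have no identical row in dataframe 2, then list which columns differ for each (name, id) key. The variable and function names are illustrative. In PySpark the first step would presumably correspond to `df_a.subtract(df_b)`, which compares whole rows, whereas the left anti join on (id, name) above only keeps rows whose key is missing from dataframe 2.

```python
columns = ["name", "id", "fruit", "animal", "veggie"]
rows_a = [
    ("a", 3, "apple", "bear", "carrot"),
    ("b", 5, "orange", "lion", "cabbage"),
    ("c", 7, "pears", "tiger", "onion"),
    ("c", 8, "jackfruit", "elephant", "raddish"),
    ("c", 9, "watermelon", "giraffe", "tomato"),
]
rows_b = [
    ("a", 3, "apple", "bear", "carrot"),
    ("b", 5, "orange", "lion", "cabbage"),
    ("c", 7, "banana", "tiger", "onion"),
    ("c", 8, "jackfruit", "camel", "raddish"),
]

# Step 1: rows of A with no identical row in B (a whole-row "except").
seen_in_b = set(rows_b)
only_in_a = [row for row in rows_a if row not in seen_in_b]

# Step 2: for each such row, find the columns that differ from the
# B row sharing the same (name, id) key.
b_by_key = {(row[0], row[1]): row for row in rows_b}


def changed_columns(row):
    other = b_by_key.get((row[0], row[1]))
    if other is None:
        return None  # the key does not exist in B at all
    return [name for name, a, b in zip(columns, row, other) if a != b]


report = [(row[0], row[1], changed_columns(row)) for row in only_in_a]
print(report)
```

Only fruit and animal ever differ here, which is consistent with the expected dataframe leaving out the veggie column.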
I'm performing computations based on 3 different PySpark DataFrames.
This script works in the sense that it performs the computation as it should, however, I struggle with working properly with the results of said computation.
import sys
import numpy as np
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType
sc = SparkContext("local")
sqlContext = SQLContext(sc)
# Dummy Data
df = sqlContext.createDataFrame([[0,1,0,0,0],[1,1,0,0,1],[0,0,1,0,1],[1,0,1,1,0],[1,1,0,0,0]], ['p1', 'p2', 'p3', 'p4', 'p5'])
df.show()
+---+---+---+---+---+
| p1| p2| p3| p4| p5|
+---+---+---+---+---+
| 0| 1| 0| 0| 0|
| 1| 1| 0| 0| 1|
| 0| 0| 1| 0| 1|
| 1| 0| 1| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
# Values
values = sqlContext.createDataFrame([(0,1,'p1'),(None,1,'p2'),(0,0,'p3'),(None,0, 'p4'),(1,None,'p5')], ('f1', 'f2','index'))
values.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 0| 1| p1|
|null| 1| p2|
| 0| 0| p3|
|null| 0| p4|
| 1|null| p5|
+----+----+-----+
# Weights
weights = sqlContext.createDataFrame([(4,3,'p1'),(None,1,'p2'),(2,2,'p3'),(None, 3, 'p4'),(3,None,'p5')], ('f1', 'f2','index'))
weights.show()
+----+----+-----+
| f1| f2|index|
+----+----+-----+
| 4| 3| p1|
|null| 1| p2|
| 2| 2| p3|
|null| 3| p4|
| 3|null| p5|
+----+----+-----+
# Function: sums the entries of the vector W at the positions where Row equals V, then divides by the length of V.
# If there are no matches between Row and V, it outputs 0.
def W_sum(row, v, w):
    if len(w[row == v]) > 0:
        return float(np.sum(w[row == v]) / len(w))
    else:
        return 0.0
For each of the columns and for each row in Data, the above function is applied.
# We iterate over the columns of Values (except the last one called index)
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we select only the useful columns
    df_select = df.select(defined_col)
    # we retrieve the reference value and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    df_select.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in df_select.columns))))
This gives :
+---+---+---+---+---+---+
| p1| p2| p3| p4| p5| f1|
+---+---+---+---+---+---+
| 0| 1| 0| 0| 0|2.0|
| 1| 1| 0| 0| 1|1.0|
| 0| 0| 1| 0| 1|2.0|
| 1| 0| 1| 1| 0|0.0|
| 1| 1| 0| 0| 0|0.0|
+---+---+---+---+---+---+
It added the column to the sliced DataFrame as I asked it to. The problem is that I would rather collect the data into a new one that I could access at the end to consult the results.
Is it possible to grow (somewhat efficiently) a DataFrame in PySpark as I would with pandas?
Edit to make my goal clearer:
Ideally I would get a DataFrame with just the computed columns, like this:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
There are some issues with your question...
First, your for loop will produce an error, since df_select in the last line is nowhere defined; there is also no assignment at the end (what does it produce?).
Assuming that df_select is actually your subsubsample dataframe, defined some lines before, and that your last line is something like
new_df = subsubsample.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in subsubsample.columns))))
then your problem starts getting more clear. Since
values.columns[:-1]
# ['f1', 'f2']
the result of the whole loop would be just
+---+---+---+---+---+
| p1| p2| p3| p4| f2|
+---+---+---+---+---+
| 0| 1| 0| 0|1.0|
| 1| 1| 0| 0|2.0|
| 0| 0| 1| 0|0.0|
| 1| 0| 1| 1|0.0|
| 1| 1| 0| 0|2.0|
+---+---+---+---+---+
i.e. with only the column f2 included (natural, since the results with f1 are simply overwritten).
Now, as I said, assuming that the situation is like this, and that your problem is actually how to have both columns f1 & f2 together rather than in different dataframes, you can just forget subsubsample and append columns to your initial df, possibly dropping the unwanted ones afterwards:
init_cols = df.columns
init_cols
# ['p1', 'p2', 'p3', 'p4', 'p5']
new_df = df
for val in values.columns[:-1]:
    # we filter the data to work only with the columns that are defined for the selected Value
    defined_col = [i[0] for i in values.where(F.col(val) >= 0).select(values.index).collect()]
    # we retrieve the reference value and weights
    V = np.array(values.where(values.index.isin(defined_col)).select(val).collect()).flatten()
    W = np.array(weights.where(weights.index.isin(defined_col)).select(val).collect()).flatten()
    W_sum_udf = F.udf(lambda row: W_sum(row, V, W), FloatType())
    new_df = new_df.withColumn(val, W_sum_udf(F.array(*(F.col(x) for x in defined_col))))  # change here
# drop initial columns:
for i in init_cols:
    new_df = new_df.drop(i)
The resulting new_df will be:
+---+---+
| f1| f2|
+---+---+
|2.0|1.0|
|1.0|2.0|
|2.0|0.0|
|0.0|0.0|
|0.0|2.0|
+---+---+
UPDATE (after comment): To force the division in your W_sum function to be a float, use:
from __future__ import division
new_df now will be:
+---------+----+
| f1| f2|
+---------+----+
| 2.0| 1.5|
|1.6666666|2.25|
|2.3333333|0.75|
| 0.0|0.75|
|0.6666667|2.25|
+---------+----+
with f2 exactly as it should be according to your comment.
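The f1 column can be checked by hand with a plain-Python version of the same arithmetic. This is a sketch, not the code from the answer: `w_sum` is a list-based rewrite of W_sum, and it assumes the defined columns for f1 come back in the order p1, p3, p5 (p2 and p4 are null in values).

```python
def w_sum(row, v, w):
    """Sum the weights where row matches v, divided by the number of weights."""
    matched = [wi for ri, vi, wi in zip(row, v, w) if ri == vi]
    return sum(matched) / len(w) if matched else 0.0


# Reference values and weights of f1 for p1, p3, p5.
V = [0, 0, 1]
W = [4, 2, 3]

# The p1, p3, p5 entries of each of the five data rows.
rows = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]]

f1 = [w_sum(r, V, W) for r in rows]
print(f1)  # 6/3, 5/3, 7/3, 0 and 2/3
```

In Python 3 the `/` operator is already true division, so the `from __future__ import division` line is only needed under Python 2.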
I have a DataFrame and I want to add a new column that is not based on an existing column. What should I do?
This is my dataframe:
+----+
|time|
+----+
| 1|
| 4|
| 3|
| 2|
| 5|
| 7|
| 3|
| 5|
+----+
This is my expected result:
+----+-----+
|time|index|
+----+-----+
| 1| 1|
| 4| 2|
| 3| 3|
| 2| 4|
| 5| 5|
| 7| 6|
| 3| 7|
| 5| 8|
+----+-----+
Using the RDD's zipWithIndex may be what you want.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val newRdd = yourDF.rdd.zipWithIndex.map{case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id)}
val schema = StructType(Array(StructField("time", IntegerType, nullable = true), StructField("index", LongType, nullable = true)))
val newDF = spark.createDataFrame(newRdd, schema)
newDF.show
+----+-----+
|time|index|
+----+-----+
| 1| 0|
| 4| 1|
| 3| 2|
| 2| 3|
| 5| 4|
| 7| 5|
| 3| 6|
| 5| 7|
+----+-----+
I assume your time column is IntegerType here.
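Note that zipWithIndex numbers the rows from 0, while the expected result starts at 1. The plain-Python sketch below mimics the pairing with enumerate and shifts the index by one; `add_index` is a hypothetical PySpark counterpart that assumes a live DataFrame and is not executed here.

```python
times = [1, 4, 3, 2, 5, 7, 3, 5]

# zipWithIndex pairs each element with its 0-based position: (value, index).
zipped = [(t, i) for i, t in enumerate(times)]

# Shift by one to get the 1-based index of the expected result.
indexed = [(t, i + 1) for t, i in zipped]
print(indexed)


def add_index(df):
    """Assumed PySpark counterpart: append a 1-based "index" column."""
    return (
        df.rdd.zipWithIndex()
        .map(lambda pair: (*pair[0], pair[1] + 1))
        .toDF(df.columns + ["index"])
    )
```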
Using a Window function, or converting to an RDD and using zipWithIndex, is slower; instead you can use the built-in function monotonically_increasing_id:
import org.apache.spark.sql.functions._
df.withColumn("index", monotonically_increasing_id())
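One caveat: the ids from monotonically_increasing_id are unique and increasing but not necessarily consecutive. Per the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python simulation below (the function name is illustrative) shows what that layout implies for one partition versus two:

```python
def simulated_ids(partition_sizes):
    """Mimic the documented id layout: partition index in the high bits,
    row offset within the partition in the low 33 bits."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        for offset in range(size):
            ids.append((partition_index << 33) + offset)
    return ids


one_partition = simulated_ids([8])      # all 8 rows in a single partition
two_partitions = simulated_ids([4, 4])  # the same rows split 4 + 4

print(one_partition)
print(two_partitions)
```

With a single partition the ids run 0 to 7, but with two partitions the second half starts at 8589934592 (1 << 33), so this approach only yields a gapless index when the data sits in one partition.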
Hope this helps!