I hope you can help.
I have this dataframe, and I want to select, for example, the count where prediction == 4.
Code:
the_counts=df.select('prediction').groupby('prediction').count()
the_counts.show()
+----------+-----+
|prediction|count|
+----------+-----+
| 1| 8|
| 6| 14|
| 5| 5|
| 4| 8|
| 8| 5|
| 0| 6|
+----------+-----+
I want to assign that value to a variable, since this will be inside a loop that runs many iterations.
I managed to do it, but only by creating a separate dataframe and then converting that dataframe to a number:
dfva = the_counts.select('count').filter(the_counts.prediction ==6)
dfva.show()
+-----+
|count|
+-----+
| 14|
+-----+
Is there a way to access the number directly, without so many steps? What is the most efficient way?
This is Python 3.x and Spark 2.1.
Thank you very much
You can use the first() method to take the value directly:
>>> dfva = the_counts.filter(the_counts['prediction'] == 6).first()['count']
>>> type(dfva)
<class 'int'>
>>> print(dfva)
14
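Since this lookup will run inside a loop, a possible refinement (a sketch, not part of the original answer) is to collect the small counts table once into a local dict, so each iteration looks the value up in Python instead of triggering another Spark job:
# collect the counts once, then look values up locally inside the loop
counts_map = {row['prediction']: row['count'] for row in the_counts.collect()}
print(counts_map.get(4))  # 8 in the example above
print(counts_map.get(6))  # 14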
Related
I tried to use a window function to calculate the current value based on the previous value in a dynamic way.
rowID | value
------------------
1 | 5
2 | 7
3 | 6
Logic:
If value > previous value, then use the previous value (otherwise keep the value).
So in row 2, since 7 > 5, the value becomes 5.
The final result should be
rowID | value
------------------
1 | 5
2 | 5
3 | 5
However, using lag().over(w) gave the following result:
rowID | value
------------------
1 | 5
2 | 5
3 | 6
It compares the third row's value 6 against the original "7", not the updated value "5".
Any suggestion on how to achieve this?
df.show()
# example dataframe
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 7|
| 3| 6|
| 4| 9|
| 5| 4|
| 6| 3|
+-----+-----+
Your required logic is too dynamic for window functions, so we have to update the values row by row. One solution is to apply a normal Python UDF to a collected list and then explode once the UDF has been applied. If you have relatively small data, this should be fine. (Spark 2.4 only, because of arrays_zip.)
from pyspark.sql import functions as F
from pyspark.sql.types import *
def add_one(a):
    # cap each value at the running previous value, left to right
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            a[i] = a[i-1]
    return a
udf1 = F.udf(add_one, ArrayType(IntegerType()))
df.agg(F.collect_list("rowID").alias("rowID"),F.collect_list("value").alias("value"))\
.withColumn("value", udf1("value"))\
.withColumn("zipped", F.explode(F.arrays_zip("rowID","value"))).select("zipped.*").show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
UPDATE:
Better yet, since you have groups of 5000, using a Pandas vectorized UDF (GROUPED_MAP) should help a lot with processing, and you do not have to collect_list 5000 integers and explode, or use pivot. I think this should be the optimal solution. Pandas grouped-map UDFs are available for Spark 2.3+.
The groupby() below is empty, but you can add your grouping column there (see the usage sketch after the output).
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    for i in range(1, len(df1)):
        if df1.loc[i, 'value'] > df1.loc[i-1, 'value']:
            df1.loc[i, 'value'] = df1.loc[i-1, 'value']
    return df1
df.groupby().apply(grouped_map).show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
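For illustration, a hedged usage sketch with a grouping column; the column name group_id is hypothetical and assumes it is already part of df (so df.schema, passed to pandas_udf above, includes it):
# hypothetical grouping column "group_id"; each group of rows is processed
# independently by the grouped-map UDF defined above
df.groupby("group_id").apply(grouped_map).show()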
I want to calculate the Jaro-Winkler distance between two columns of a PySpark DataFrame. Jaro-Winkler distance is available through the pyjarowinkler package on all nodes.
pyjarowinkler works as follows:
from pyjarowinkler import distance
distance.get_jaro_distance("A", "A", winkler=True, scaling=0.1)
Output:
1.0
I am trying to write a Pandas UDF to pass two columns as Series and calculate the distance using lambda function.
Here's how I am doing it:
#pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
import pandas as pd
distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
distance_df['distance'] = distance_df.apply(lambda x: distance.get_jaro_distance(str(distance_df['column_A']), str(distance_df['column_B']), winkler = True, scaling = 0.1))
return distance_df['distance']
temp = temp.withColumn('jaro_distance', get_distance(temp.x, temp.x))
I should be able to pass any two string columns in the above function.
I am getting the following output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| null|
| B| 3| 4| null|
| C| 5| 6| null|
| D| 7| 8| null|
+---+---+---+-------------+
Expected Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 1.0|
| C| 5| 6| 1.0|
| D| 7| 8| 1.0|
+---+---+---+-------------+
I suspect this might be because str(distance_df['column_A']) is not correct. It contains the concatenated string of all row values.
While this code works for me:
#pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col):
return col.apply(lambda x: distance.get_jaro_distance(x, "A", winkler = True, scaling = 0.1))
temp = temp.withColumn('jaro_distance', get_distance(temp.x))
Output:
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| A| 1| 2| 1.0|
| B| 3| 4| 0.0|
| C| 5| 6| 0.0|
| D| 7| 8| 0.0|
+---+---+---+-------------+
Is there a way to do this with Pandas UDF? I'm dealing with millions of records so UDF will be expensive but still acceptable if it works. Thanks.
The error comes from the function in your df.apply call; adjusting it as follows should fix it:
#pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
import pandas as pd
distance_df = pd.DataFrame({'column_A': col1, 'column_B': col2})
distance_df['distance'] = distance_df.apply(lambda x: distance.get_jaro_distance(x['column_A'], x['column_B'], winkler = True, scaling = 0.1), axis=1)
return distance_df['distance']
However, Pandas' df.apply method is not vectorised, which defeats the purpose of using pandas_udf over a plain udf in PySpark. A faster, lower-overhead solution is to use a list comprehension to create the returned pd.Series (check this link for more discussion about Pandas df.apply and its alternatives):
from pandas import Series
#pandas_udf("float", PandasUDFType.SCALAR)
def get_distance(col1, col2):
return Series([ distance.get_jaro_distance(c1, c2, winkler=True, scaling=0.1) for c1,c2 in zip(col1, col2) ])
df.withColumn('jaro_distance', get_distance('x', 'y')).show()
+---+---+---+-------------+
| x| y| z|jaro_distance|
+---+---+---+-------------+
| AB| 1B| 2| 0.67|
| BB| BB| 4| 1.0|
| CB| 5D| 6| 0.0|
| DB|B7F| 8| 0.61|
+---+---+---+-------------+
You can union all the data frames first, partition by the same partition key so the rows are shuffled and distributed to the worker nodes, and restore them before the pandas computation; see the sketch below. Please check the example where I wrote a small toolkit for this scenario: SparkyPandas
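A loose sketch of that union-and-repartition step, assuming hypothetical names (df_a, df_b, partition_key) and a tag column so the frames can be told apart again inside the pandas computation:
from pyspark.sql import functions as F

# tag each frame, union them, and co-locate rows by the shared key
union_df = df_a.withColumn("src", F.lit("a")) \
    .unionByName(df_b.withColumn("src", F.lit("b"))) \
    .repartition("partition_key")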
I have a dataset that I want to partition by a particular key (clientID), but some clients produce far, far more data than others. There's a feature in Hive called "list bucketing", invoked with "SKEWED BY", specifically to deal with this situation.
However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.
Is there a Spark feature that is the equivalent? Or, does Spark have some other set of features by which this behavior can be replicated?
(As a bonus - and a requirement for my actual use case - does your suggested method work with Amazon Athena?)
As far as I know, there is no such out-of-the-box tool in Spark. In the case of skewed data, it is very common to add an artificial column to further bucketize the data.
Let's say you want to partition by column "y", but the data is very skewed like in this toy example (1 partition with 5 rows, the others with only one row):
val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id| y|
+---+---+
| 0| 0|
| 1| 0|
| 2| 0|
| 3| 0|
| 4| 0|
| 5| 5|
| 6| 6|
| 7| 7|
+---+---+
Now let's add an artificial random column and write the dataframe.
val maxNbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * maxNbOfBuckets))
part_df.show
+---+---+---+
| id| y| r|
+---+---+---+
| 0| 0| 2|
| 1| 0| 2|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 1|
| 5| 5| 2|
| 6| 6| 2|
| 7| 7| 1|
+---+---+---+
// and writing. We divided the partition with 5 elements into 3 partitions.
part_df.write.partitionBy("y", "r").csv("...")
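Since the rest of this thread is PySpark, here is a rough Python equivalent of the same salting idea (a sketch, not part of the original answer):
from pyspark.sql import functions as F

max_nb_of_buckets = 3
part_df = df.withColumn("r", F.floor(F.rand() * max_nb_of_buckets))
# the skewed "y" value is now split across up to 3 sub-partitions
part_df.write.partitionBy("y", "r").csv("...")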
I am currently working on a data migration assignment, trying to compare two dataframes from two different databases using PySpark, find the differences between them, and record the results in a CSV file as part of data validation. I am looking for a performance-efficient solution, for two reasons: the dataframes are large, and the table keys are unknown.
#Approach 1 - Not sure about the performance and it is case-sensitive
df1.subtract(df2)
#Approach 2 - Creating a row hash for each row in the dataframe
from pyspark.sql import Row
piperdd = df1.rdd.map(lambda x: hash(x))
r = Row("h_cd")
df1_new = piperdd.map(r).toDF()
The problem I am facing with approach 2 is that the final dataframe (df1_new) contains only the hash column (h_cd), but I need all the columns of df1 along with the hash column, since I need to report the row differences in a CSV file. Please help.
Try it with DataFrames; it should be more concise.
df1 = spark.createDataFrame([(a, a*2, a+3) for a in range(10)], "A B C".split(' '))
#df1.show()
from pyspark.sql.functions import hash
df1.withColumn('hash_value', hash('A','B', 'C')).show()
+---+---+---+-----------+
| A| B| C| hash_value|
+---+---+---+-----------+
| 0| 0| 3| 1074520899|
| 1| 2| 4|-2073566230|
| 2| 4| 5| 2060637564|
| 3| 6| 6|-1286214988|
| 4| 8| 7|-1485932991|
| 5| 10| 8| 2099126539|
| 6| 12| 9| -558961891|
| 7| 14| 10| 1692668950|
| 8| 16| 11| 708810699|
| 9| 18| 12| -11251958|
+---+---+---+-----------+
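To go from the hash column to the actual row differences, one hedged sketch (assuming both frames share the same A, B, C columns) is to add the hash to both sides and keep the rows of df1 whose hash has no match in df2, then write them out:
# sketch, not part of the original answer: rows of df1 that do not appear in df2
df1_h = df1.withColumn('hash_value', hash('A', 'B', 'C'))
df2_h = df2.withColumn('hash_value', hash('A', 'B', 'C'))
diff = df1_h.join(df2_h.select('hash_value'), on='hash_value', how='left_anti')
diff.write.csv('path/to/differences', header=True)  # placeholder output path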
I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).
It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes the members of the sample, not their order.
You can order DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid.
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborate example:
import pandas as pd
import pyspark.sql.functions as F
# Example: create a DataFrame for the example
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]),columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+
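If the shuffled order needs to be reproducible, rand() also accepts a seed; given the same input partitioning, the ordering is then stable:
df.select("*").orderBy(F.rand(seed=42)).show()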