Looking for better performance in PySpark

I'm trying to build a model in PySpark and I'm clearly doing many things wrong.
What I have to do:
I have a list of products, and I have to compare each product against a corpus that returns the most similar ones, and then filter those by the product's type.
Here is an example with one product:
1st step:
product_id = 'HZM-1914'
type = get_type(product_id) #takes 8s. I'll show this function below
similar_list = [(product_id, 1.0)] + model.wv.most_similar(positive=product_id, topn=5) #takes 0.04s
#the similar list shows the product_id and the similarity, it looks like this:
[('HZM-1914', 1.0), ('COL-8430', 0.9951900243759155), ('D23-2178', 0.9946870803833008), ('J96-0611', 0.9943861365318298), ('COL-7719', 0.9930003881454468), ('HZM-1912', 0.9926838874816895)]
2nd step:
#I want to filter by type, so I transform the list into a dataframe, and this is what takes the longest to run (and is probably what is wrong)
rdd = sc.parallelize([(id, get_type(id), similarity) for (id, similarity) in similar_list]) #takes 55s
products = rdd.map(lambda x: Row(name=str(x[0]), type=str(x[1]), similarity=float(x[2]))) #takes 0.02s
df_recs = sqlContext.createDataFrame(products) #takes 0.02s
df_recs.show() #takes 0.43s
+--------+----------------+------------------+
| name| type| similarity |
+--------+----------------+------------------+
|HZM-1914| Chuteiras| 1.0|
|COL-8430| Chuteiras|0.9951900243759155|
|D23-2178| Bolas|0.9946870803833008|
|J96-0611|Luvas de Goleiro|0.9943861365318298|
|COL-7719| Bolas|0.9930003881454468|
|HZM-1912| Chuteiras|0.9926838874816895|
+--------+----------------+------------------+
3rd step:
#Comes the filter:
df_recs = df_recs.filter(df_recs.type == type) #takes 0.09s
df_recs.show() #takes 0.5s
+--------+---------+------------------+
| name| type| similarity |
+--------+---------+------------------+
|HZM-1914|Chuteiras| 1.0|
|COL-8430|Chuteiras|0.9951900243759155|
|HZM-1912|Chuteiras|0.9926838874816895|
+--------+---------+------------------+
The get_type() function is:
def get_type(product_id):
    return df.filter(col("ID") == product_id).select("TYPE").collect()[0]["TYPE"]
And the DataFrame that get_type() looks up ID and TYPE in is:
+----------+--------------------+--------------------+
|        ID|                NAME|                TYPE|
+----------+--------------------+--------------------+
|      7983|         SNEAKERS 01|            Sneakers|
|      7034|            SHIRT 13|               Shirt|
|      3360|           SHORTS 15|               Short|
+----------+--------------------+--------------------+
The get_type() function and creating the DataFrame are the main bottlenecks. If you have any idea how to make this run faster, it would be really helpful. I come from plain Python and I'm struggling a lot with PySpark. Thank you very much in advance.
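A note on where the time goes: every get_type() call runs a filter plus collect() against df, so the list comprehension in step 2 launches one Spark job per product (on top of the 8 s call in step 1). Below is a minimal sketch of an alternative that resolves all types with a single join; it assumes a SparkSession named spark, and only the ID/TYPE column names are taken from the question:

import pyspark.sql.functions as F

# build a tiny DataFrame from the similarity list (no per-ID get_type() calls)
recs_df = spark.createDataFrame(similar_list, ["name", "similarity"])

# resolve every type at once with a single join; broadcast the tiny side
df_recs = F.broadcast(recs_df).join(
    df.select(F.col("ID").alias("name"), F.col("TYPE").alias("type")),
    on="name",
)

# the seed product is in the list itself (similarity 1.0), so its type can be
# read from the joined result instead of a separate get_type() call
seed_type = df_recs.filter(F.col("name") == product_id).first()["type"]
df_recs = df_recs.filter(F.col("type") == seed_type)
df_recs.show()

Caching df_recs (df_recs.cache()) before the two actions avoids recomputing the join, and if the product table itself is small it could be broadcast instead of recs_df.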

Related

PySpark split using regex doesn't work on a dataframe column with string type

I have a PySpark data frame with a string column (URL), and all records look like the following:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I want to basically extract the number after conversations/ from URL column using regex into another column.
I tried the following code but it doesn't give me any results.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like f.split, regexp_extract, and regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
'CONV_ID',
F.split('URL', '(?<=conversations/)')[1] # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
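As a side note, if anything could ever follow the ID in the URL, restricting the capture group to digits keeps the extraction strict; the pattern below is an assumption, not part of the original answers:

import pyspark.sql.functions as F

# capture only the digits that follow 'conversations/'
df1 = df.withColumn('CONV_ID', F.regexp_extract('URL', r'conversations/(\d+)', 1))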

Trouble splitting a column into more columns in PySpark

I'm having trouble splitting a dataframe's column into more columns in PySpark:
I have a list of lists and I want to transform it into a dataframe, each value in one column.
What I have tried:
I created a dataframe from this list:
[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
Using this code:
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
The result I got is:
+----+--------------------+
|col1| col2|
+----+--------------------+
| 0|[COL-4560, COL-96...|
| 1|[DLL-7760, NAT-98...|
+----+--------------------+
I want to separate the values by comma into columns, so I tried:
from pyspark.sql import functions as F
df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2'))
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
df_result = df2.select('col1', *[df2['col2'][i] for i in range(nb_columns)])
df_result.show()
But I get an error on this line df2 = df_from_list.select('col1', F.split('col2', ', ').alias('col2')):
AnalysisException: cannot resolve 'split(`col2`, ', ', -1)' due to data type mismatch: argument 1 requires string type, however, '`col2`' is of array<string> type.;;
My ideal final output would be like this:
+----------+----------+----------+----------+----------+
| SKU | REC_01 | REC_02 | REC_03 | REC_04 |
+----------+----------+----------+----------+----------+
| COL-4560 | COL-9655 | NWG-0610 | D81-3754 | null |
| DLL-7760 | NAT-9885 | PED-0550 | MAR-0004 | LLL-5554 |
+----------+----------+----------+----------+----------+
Some rows may have four values, but some may have more or fewer; I don't know the exact number of columns the final dataframe will have.
Does anyone have any idea of what is happening? Thank you very much in advance.
The col2 column of df_from_list is already of array type, so there is no need to split (split works on string type; here we have array type).
Here are the steps that will work for you.
recs_list=[['COL-4560', 'COL-9655', 'NWG-0610', 'D81-3754'],
['DLL-7760', 'NAT-9885', 'PED-0550', 'MAR-0004', 'LLL-5554']]
from pyspark.sql import Row
R = Row('col1', 'col2')
# use enumerate to add the ID column
df_from_list = spark.createDataFrame([R(i, x) for i, x in enumerate(recs_list)])
from pyspark.sql import functions as F
df2 = df_from_list
# If you don't know the number of columns:
df_sizes = df2.select(F.size('col2').alias('col2'))
df_max = df_sizes.agg(F.max('col2'))
nb_columns = df_max.collect()[0][0]
cols=['SKU','REC_01','REC_02','REC_03','REC_04']
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()
#+--------+--------+--------+--------+--------+
#| SKU| REC_01| REC_02| REC_03| REC_04|
#+--------+--------+--------+--------+--------+
#|COL-4560|COL-9655|NWG-0610|D81-3754| null|
#|DLL-7760|NAT-9885|PED-0550|MAR-0004|LLL-5554|
#+--------+--------+--------+--------+--------+
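If the maximum list length isn't known in advance, the header list can also be generated from nb_columns instead of being hard-coded; a small sketch along the lines of the answer above (the REC_xx naming is just an assumption):

# first element is the SKU, the rest are numbered recommendation columns
cols = ['SKU'] + ['REC_{:02d}'.format(i) for i in range(1, nb_columns)]
df_result = df2.select(*[df2['col2'][i] for i in range(nb_columns)]).toDF(*cols)
df_result.show()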

How to convert a normal DataFrame into a MultiIndexed one based on a certain condition

After a long while I visited SO's pandas section and found a question that wasn't framed very clearly, so I thought I'd post it here explicitly, since I'm in a similar situation myself :-)
Below is the data frame construct:
>>> df
            measure      Pend Job       Run Job       Time
cls
ABC  [inter, batch]     [101, 93]   [302, 1327]  [56, 131]
DEF  [inter, batch]  [24279, 421]  [4935, 5452]  [75, 300]
Desired output would be as sketched below.
I tried hard but didn't get any solution, so I've sketched it here to show roughly what I would like to achieve.
----------------------------------------------------------------------------------
| |Pend Job | Run Job | Time |
cls | measure |-----------------------------------------------------------
| |inter | batch| |inter | batch| |inter | batch |
----|-----------------|------|------|-------|------|------|-----|------|----------
ABC |inter, batch |101 |93 | |302 |1327 | |56 |131 |
----|-----------------|-------------|-------|------|------|-----|------|---------|
DEF |inter, batch |24279 |421 | |4935 |5452 | |75 |300 |
----------------------------------------------------------------------------------
That is, I want my DataFrame turned into a MultiIndex DataFrame where Pend Job, Run Job, and Time are on top, as above.
Edit:
cls is not in the columns
This is my approach; you can modify it to your needs:
import numpy as np

s = (df.drop('measure', axis=1)            # remove the measure column
       .set_index(df['measure'].apply(', '.join),
                  append=True)             # make `measure` the second index level
       .stack().explode().to_frame()       # concatenate all the values
     )

# assign the `inter` and `batch` labels to each new cell
new_lvl = np.array(['inter', 'batch'])[s.groupby(level=(0, 1, 2)).cumcount()]
# or
# new_lvl = np.tile(['inter', 'batch'], len(s)//2)

(s.set_index(new_lvl, append=True)[0]
  .unstack(level=(-2, -1))
  .reset_index()
)
Output:
  cls       measure  Pend Job
                        inter  batch
0 ABC  inter, batch       101     93
1 DEF  inter, batch     24279    421
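An alternative, assuming the cells really are Python lists aligned with the measure column, is to expand each value column separately and concatenate with keys, which yields the MultiIndex columns directly; a minimal sketch (the frame construction below just mirrors the data in the question):

import pandas as pd

# reconstruct the frame from the question (cls is the index, not a column)
df = pd.DataFrame({
    'measure':  [['inter', 'batch'], ['inter', 'batch']],
    'Pend Job': [[101, 93],          [24279, 421]],
    'Run Job':  [[302, 1327],        [4935, 5452]],
    'Time':     [[56, 131],          [75, 300]],
}, index=pd.Index(['ABC', 'DEF'], name='cls'))

value_cols = ['Pend Job', 'Run Job', 'Time']

# expand each list column into an inter/batch pair, then concat with keys
out = pd.concat(
    {c: pd.DataFrame(df[c].tolist(), index=df.index, columns=['inter', 'batch'])
     for c in value_cols},
    axis=1,
)
# out now has MultiIndex columns: (Pend Job, inter), (Pend Job, batch), ...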

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table:
df=spark.sql('select * from <table_name>')
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
When I tried the following, I got an error:
df_dict = dict(zip(df['name'],df['url']))
"TypeError: zip argument #1 must support iteration."
type(df.name) is 'pyspark.sql.column.Column'
How do I create a dictionary like the following, which can be iterated over later on?
{'person1':'google','msn','yahoo'}
{'person2':'fb.com','airbnb','wired.com'}
{'person3':'fb.com','google.com'}
Appreciate your thoughts and help.
I think you can try row.asDict(); this code runs directly on the executors, and you don't have to collect the data on the driver.
Something like:
df.rdd.map(lambda row: row.asDict())
How about using the pyspark Row.asDict() method? This is part of the DataFrame API (which I understand is the "recommended" API at the time of writing) and would not require you to use the RDD API at all.
df_list_of_dict = [row.asDict() for row in df.collect()]
type(df_list_of_dict), type(df_list_of_dict[0])
#(<class 'list'>, <class 'dict'>)
df_list_of_dict
#[{'Name': 'person1', 'URL visited': ['google', 'msn', 'yahoo']},
# {'Name': 'person2', 'URL visited': ['fb.com', 'airbnb', 'wired.com']},
# {'Name': 'person3', 'URL visited': ['fb.com', 'google.com']}]
If you wanted your results in a python dictionary, you could use collect()¹ to bring the data into local memory and then massage the output as desired.
First collect the data:
df_dict = df.collect()
#[Row(Name=u'person1', URL visited=[u'google', u'msn', u'yahoo']),
# Row(Name=u'person2', URL visited=[u'fb.com', u'airbnb', u'wired.com']),
# Row(Name=u'person3', URL visited=[u'fb.com', u'google.com'])]
This returns a list of pyspark.sql.Row objects. You can easily convert this to a list of dicts:
df_dict = [{r['Name']: r['URL visited']} for r in df_dict]
#[{u'person1': [u'google', u'msn', u'yahoo']},
# {u'person2': [u'fb.com', u'airbnb', u'wired.com']},
# {u'person3': [u'fb.com', u'google.com']}]
¹ Be advised that for large data sets this operation can be slow and may fail with an out-of-memory error. Consider whether this is really what you want to do first, as you will lose the parallelization benefits of Spark by bringing the data into local memory.
Given:
+++++++++++++++++++++++++++++++++++++++++++
| Name | URL visited |
+++++++++++++++++++++++++++++++++++++++++++
| person1 | [google,msn,yahoo] |
| person2 | [fb.com,airbnb,wired.com] |
| person3 | [fb.com,google.com] |
+++++++++++++++++++++++++++++++++++++++++++
This should work:
df_dict = df \
.rdd \
.map(lambda row: {row[0]: row[1]}) \
.collect()
df_dict
#[{'person1': ['google','msn','yahoo']},
# {'person2': ['fb.com','airbnb','wired.com']},
# {'person3': ['fb.com','google.com']}]
This way you just collect after processing.
Please, let me know if that works for you :)
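If what is ultimately wanted is a single lookup dictionary keyed by name rather than a list of one-entry dicts, a dict comprehension over the collected rows does it; a small sketch, with the column names taken from the question:

# one dict mapping each name to its list of URLs (collects everything to the driver)
name_to_urls = {row['Name']: row['URL visited'] for row in df.collect()}
# {'person1': ['google', 'msn', 'yahoo'], 'person2': [...], 'person3': [...]}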

udf that sorts list in pyspark

I have a dataframe where one of the columns, called stopped, is:
+--------------------+
| stopped|
+--------------------+
|[nintendo, dsi, l...|
|[nintendo, dsi, l...|
| [xl, honda, 500]|
|[black, swan, green]|
|[black, swan, green]|
|[pin, stripe, sui...|
| [shooting, braces]|
| [haus, geltow]|
|[60, cm, electric...|
| [yamaha, yl1, yl2]|
|[landwirtschaft, ...|
| [wingbar, 9581]|
| [gummi, 16mm]|
|[brillen, lupe, c...|
|[man, city, v, ba...|
|[one, plus, one, ...|
| [kapplocheisen]|
|[tractor, door, m...|
|[pro, nano, flat,...|
|[kaleidoscope, to...|
+--------------------+
I would like to create another column that contains the same list but where the keywords are ordered.
As I understand it, I need to create a udf that takes and returns a list:
udf_sort = udf(lambda x: x.sort(), ArrayType(StringType()))
ps_clean.select("*", udf_sort(ps_clean["stopped"])).show(5, False)
and I get:
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|client_id|kw_id |keyword |max_click_dt|tokenized |stopped |<lambda>(stopped)|
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
|710 |4304414582|nintendo dsi lite new|2017-01-06 |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null |
|705 |4304414582|nintendo dsi lite new|2017-03-25 |[nintendo, dsi, lite, new]|[nintendo, dsi, lite, new]|null |
|707 |647507047 |xl honda 500 s |2016-10-26 |[xl, honda, 500, s] |[xl, honda, 500] |null |
|710 |26308464 |black swan green |2016-01-01 |[black, swan, green] |[black, swan, green] |null |
|705 |26308464 |black swan green |2016-07-13 |[black, swan, green] |[black, swan, green] |null |
+---------+----------+---------------------+------------+--------------------------+--------------------------+-----------------+
Why is the sorting not being applied?
x.sort() sorts the list in place (though I suspect it won't do that in a pyspark dataframe) and returns None. That is the reason your column labeled <lambda>(stopped) has all null values. sorted(x) will sort the list and return a new sorted copy. So, replacing your udf with
udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))
should solve your problem.
Alternatively, you can use the built-in function sort_array instead of defining your own udf.
from pyspark.sql.functions import sort_array
ps_clean.select("*", sort_array(ps_clean["stopped"])).show(5, False)
This method is a little cleaner, and you can actually expect to get some performance gains because pyspark doesn't have to serialize your udf.
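As a side note, sort_array also takes an asc flag, so a descending sort doesn't need a udf either; a one-line sketch using the same column name:

from pyspark.sql.functions import sort_array

# descending sort of the array column, still without a udf
ps_clean.select("*", sort_array(ps_clean["stopped"], asc=False)).show(5, False)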
Change your udf to:
udf_sort = udf(lambda x: sorted(x), ArrayType(StringType()))
On the differences between list.sort() and sorted(), read:
What is the difference between `sorted(list)` vs `list.sort()`? (Python)
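For illustration, the difference in plain Python (not from the original answer):

keywords = ['nintendo', 'dsi', 'lite', 'new']
print(keywords.sort())     # None -- sorts in place, returns nothing
print(sorted(keywords))    # ['dsi', 'lite', 'new', 'nintendo'] -- returns a new sorted list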
