Prep TabPy Script issue - Python - python-3.x

My question is similar to this one: https://community.tableau.com/s/question/0D54T00000C5RcySAF/tableau-prep-builder-20193-including-multiple-functions-from-python-or-r-to-script-step?t=1655319750718&searchQuery
which I found helpful, so I decided to also nest my functions within each other.
But my script isn't working. I'm not very good at Python, so I think some syntax might be missing - can anyone provide some guidance?
I also find it very hard to test these things with Prep...
import pandas as pd

pd.options.mode.chained_assignment = None  # default='warn'

def cc(df):
    def cuecompare(df):
        print('splitting cue points')
        for merlinCue, mpxCue in df.groupby('media.mediaGuid'):
            df['merlinCue'].str.split('|')
            df['mpxCue'].str.split('|')
            print('createNew cue columns')
            df['merlinCue_series'] = merlinCue
            df['mpxCue_series'] = mpxCue
        print('replace nulls/nan with 0')
        df['mpxCue_series'] = df['mpxCue_series'].fillna(0)
        df['merlinCue_series'] = df['merlinCue_series'].fillna(0)
        return df

    def custom_compare_eq(series, other):
        print('comparing cue data')
        length = len(series.values)
        for i in range(length):
            r1 = eval(str(series.values[i]))
            r2 = eval(str(other.values[i]))
            if type(r1) != type(r2):
                yield False
            else:
                if type(r1) == int:
                    yield r1 == r2
                elif type(r1) == list:
                    yield set(r1) == set(r2)
        return series, other

    df = cuecompare(df)
    df = custom_compare_eq(df)
    print('moving to output schema')
    return df

def get_output_schema():
    print('Getting output schema...\n')
    return pd.DataFrame({
        'mediaGuid': prep_string(),
        'match': prep_string()
    })
The error that TabPy throws is:
2022-06-15,14:52:37 [ERROR] (base_handler.py:base_handler:115): Responding with status=500, message="Error processing script", info="ValueError : Expected a 1D array, got an array with shape (7, 3)", <<call ID: 3e84a296-08a3-459c-aecc-7dce6e3fad3d>>
2022-06-15,14:52:37 [ERROR] (web.py:web:2239): 500 POST /evaluate (127.0.0.1) 14.99ms
Here's a DF input sample:
,media.mediaGuid,merlinCue,mpxCue
0,abc123,1703.0|1144.0|2172.0|735.0,
1,A5EK6URwtnybJwE9iDZBW_HD_1080p_SDR,1120.0|637.0|2026.0|1599.0,
2,A5EK6URwtnybJwDG8wF3Q_HD_1080p_SDR,2663.0|859.0|2281.0|1487.0,
3,A5EK6URwtnyc26LFN6JG1_HD_1080p_SDR,1407.0|1987.0|696.0,
4,A5EK6URwtnyc26LeBYq7U_HD_1080p_SDR,1392.0|1971.0|552.0,
5,A5EK6URwtnyc26KMrR6xN_HD_1080p_SDR,1025.0|1566.0|581.0,
6,A5EK3BMkhJv97MAWFDsnw_HD_1080p_SDR,768.24,768.24
7,A5EK3BMkhJv97MCKo9JDW_HD_1080p_SDR,527,527
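For reference, here is a minimal sketch of how a Prep script of this shape could be flattened so that it returns a DataFrame matching get_output_schema. This is an assumption about the intent - that the '|'-delimited cue strings should be compared as unordered sets per row; the column names mirror the question, and prep_string() is the type declaration Prep already uses above:

import pandas as pd

def cc(df):
    def to_set(value):
        # turn '1703.0|1144.0' into {'1703.0', '1144.0'}; treat NaN/empty as an empty set
        if pd.isna(value) or value == '':
            return set()
        return set(str(value).split('|'))

    out = pd.DataFrame()
    out['mediaGuid'] = df['media.mediaGuid']
    out['match'] = [
        str(to_set(m) == to_set(p))
        for m, p in zip(df['merlinCue'], df['mpxCue'])
    ]
    return out

def get_output_schema():
    return pd.DataFrame({
        'mediaGuid': prep_string(),
        'match': prep_string()
    })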

Related

Concat two dataframes from two different sources. Unable to handle the situation if a dataframe is missing

I would like to concat two dataframes from two sources; both are identical in structure and differ only in their rows. The problem I am facing is handling the cases where the input might sometimes be a single dataframe, sometimes none, and sometimes both.
My function is like this:
def CombiTwoDf(df1, df2):
    if isinstance(df1, pd.DataFrame) and isinstance(df2, pd.DataFrame):
        frames = [df1, df2]
        result = pd.concat(frames)
        result.drop_duplicates(subset=None, keep='first', inplace=True)
        return result
    if isinstance(df1, pd.DataFrame) and df2 is None:
        return df1
    if df2 is None and isinstance(df2, pd.DataFrame):
        return df2
    if df1 is None and df2 is None:
        return None
You can use a list comprehension to build a list of your two dataframes while filtering out non-dataframe objects. Then, if you have two frames, you can concat and post-process them; if you have one frame in your list, just return that frame; and if you have zero, return None.
def CombiTwoDf(df1, df2):
    frames = [df for df in (df1, df2) if isinstance(df, pd.DataFrame)]
    if len(frames) == 2:
        out = pd.concat(frames).drop_duplicates(keep="first")
    elif len(frames) == 1:
        out = frames[0]
    else:
        out = None
    return out
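For example, a quick check of the three cases (the two small frames here are made up for illustration):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [2, 3]})

print(CombiTwoDf(a, b))        # concatenated, rows duplicated across all columns dropped
print(CombiTwoDf(a, None))     # returns a unchanged
print(CombiTwoDf(None, None))  # returns None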
When the only inputs are dataframes or None (and not strings or scalars or whatever), this works:
def combine_two_dataframes(df1, df2):
    if isinstance(df1, pd.DataFrame) or isinstance(df2, pd.DataFrame):
        return pd.concat([df1, df2])
    else:
        return None
You could also use a try / except:
def combine_two_dataframes(df1, df2):
    try:
        return pd.concat([df1, df2])
    except:
        return None
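This relies on pandas' documented behaviour of silently dropping None entries and raising a ValueError only when every object passed is None; a quick illustration of what the bare except is catching:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

print(pd.concat([df, None]))   # the None entry is dropped, df's rows come back
try:
    pd.concat([None, None])    # all inputs None -> ValueError, which the except turns into None
except ValueError as e:
    print(e)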

pyspark modify class attributes using spark.sql.rdd.foreach()

The main task is to connect to Hive and read data using a Spark RDD.
I have tried the code below. Connecting and reading both succeed, but when I try to modify the value of self.jobUserProfile, it fails. I print this value in three positions (marked #1, #2 and #3). In the first position the value is valid, but in the second and third positions the dict is empty. It seems the modification never gets assigned to the class attribute.
I have tried response = spark.sql('select userid, logtime from hive.dwd_log_login_i_d limit 10').collect() and iterating over the result, but when the data volume is large the performance may suffer.
When I change response.rdd.foreach(lambda x: self.readLoginFunction(x)) to response.rdd.map(lambda x: self.readLoginFunction(x)), the target value is empty in all three positions.
I'm a newbie in Spark. Any advice would be helpful. Thanks in advance.
from analysis.common.db.hive.connectHive import *
import collections

class OperateHive():
    def __init__(self):
        self.jobUserProfile = collections.defaultdict(dict)

    def readLoginFunction(self, e):
        dic = collections.defaultdict()
        dic['userid'] = e[0]
        dic['logtime'] = e[1]
        self.jobUserProfile[e[0]] = dic
        print(self.jobUserProfile)  # 1

    def readLogin(self, spark):
        response = spark.sql('select userid, logtime from hive.dwd_log_login_i_d limit 10')
        response.rdd.foreach(lambda x: self.readLoginFunction(x))
        print(self.jobUserProfile)  # 2

if __name__ == '__main__':
    spark = connectHive(['conf/hdfs-site.xml', 'conf/hive-site.xml'], 'utf-8')
    operateHive = OperateHive()
    operateHive.readLogin(spark)
    print(operateHive.jobUserProfile)  # 3
Finally the code below works.
from analysis.common.db.hive.connectHive import *
import collections

class OperateHive():
    def readLoginFunction(self, e, jobUserProfile, devAppProfile):
        dic = collections.defaultdict()
        dic['userid'] = e[0]
        dic['logtime'] = e[1]
        jobUserProfile[e[0]] = dic
        devAppProfile[e[0]] = dic
        print(jobUserProfile)
        return jobUserProfile, devAppProfile

    def readLogin(self, spark, jobUserProfile, devAppProfile):
        response = spark.sql('select userid, logtime from hive.dwd_log_login_i_d limit 10')
        rdd1 = response.rdd.map(lambda x: self.readLoginFunction(x, jobUserProfile, devAppProfile))
        return rdd1.top(1)[0][0]

if __name__ == '__main__':
    spark = connectHive(['conf/hdfs-site.xml', 'conf/hive-site.xml'], 'utf-8')
    jobUserProfile = collections.defaultdict(dict)
    devAppProfile = collections.defaultdict(dict)
    operateHive = OperateHive()
    jobUserProfile = operateHive.readLogin(spark, jobUserProfile, devAppProfile)
    print(jobUserProfile)
But when I remove devAppProfile, the code looks like this:
from analysis.common.db.hive.connectHive import *
import collections

class OperateHive():
    def readLoginFunction(self, e, jobUserProfile, devAppProfile):
        dic = collections.defaultdict()
        dic['userid'] = e[0]
        dic['logtime'] = e[1]
        jobUserProfile[e[0]] = dic
        devAppProfile[e[0]] = dic
        print(jobUserProfile)
        return jobUserProfile

    def readLogin(self, spark, jobUserProfile, devAppProfile):
        response = spark.sql('select userid, logtime from hive.dwd_log_login_i_d limit 10')
        response.rdd.map(lambda x: self.readLoginFunction(x, jobUserProfile, devAppProfile))

if __name__ == '__main__':
    spark = connectHive(['conf/hdfs-site.xml', 'conf/hive-site.xml'], 'utf-8')
    jobUserProfile = collections.defaultdict(dict)
    devAppProfile = collections.defaultdict(dict)
    operateHive = OperateHive()
    operateHive.readLogin(spark, jobUserProfile, devAppProfile)
Here the rdd.map() never actually runs; nothing is printed by print(jobUserProfile).
Then I changed the code as below, and it works again.
from analysis.common.db.hive.connectHive import *
import collections

class OperateHive():
    def readLoginFunction(self, e, jobUserProfile, devAppProfile):
        dic = collections.defaultdict()
        dic['userid'] = e[0]
        dic['logtime'] = e[1]
        jobUserProfile[e[0]] = dic
        devAppProfile[e[0]] = dic
        print(jobUserProfile)
        return jobUserProfile

    def readLogin(self, spark, jobUserProfile, devAppProfile):
        response = spark.sql('select userid, logtime from hive.dwd_log_login_i_d limit 10')
        rdd1 = response.rdd.map(lambda x: self.readLoginFunction(x, jobUserProfile, devAppProfile))
        return rdd1.collect()[-1]

if __name__ == '__main__':
    spark = connectHive(['conf/hdfs-site.xml', 'conf/hive-site.xml'], 'utf-8')
    jobUserProfile = collections.defaultdict(dict)
    devAppProfile = collections.defaultdict(dict)
    operateHive = OperateHive()
    jobUserProfile = operateHive.readLogin(spark, jobUserProfile, devAppProfile)
    print(jobUserProfile)
The problem in the original post is about closures, but I can't work out why the three versions in this answer behave differently.
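The difference comes down to two Spark rules: transformations such as map() are lazy (nothing runs until an action like collect() or top() is called), and closures are serialized out to executors, so a function running inside foreach() or map() mutates a copy of driver-side state rather than the original. That is why the versions ending in rdd1.top(1)[0][0] or rdd1.collect()[-1] "work" (the action forces the map to run and the mutated copies come back as the map's results), while the bare rdd.map() version prints nothing. A minimal local-mode sketch illustrating both points (standalone, not the Hive code from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[1]').appName('closure-demo').getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])

state = {}                                   # lives on the driver
rdd.foreach(lambda x: state.update({x: x}))  # runs on executors against a serialized copy
print(state)                                 # {} - the driver-side dict is untouched

doubled = rdd.map(lambda x: x * 2)           # lazy: no work happens yet
print(doubled.collect())                     # [2, 4, 6] - collect() is the action that triggers the map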

nltk wordnet lemmatization with POS tag on pyspark dataframe

I'm processing text data in a PySpark dataframe. I have so far managed to tokenize the data as a column of arrays and produce the table below:
print(df.schema)
StructType(List(StructField(_c0,IntegerType,true),StructField(pageid,IntegerType,true),StructField(title,StringType,true),StructField(text,ArrayType(StringType,true),true)))
df.show(5)
+---+------+-------------------+--------------------+
|_c0|pageid| title| text|
+---+------+-------------------+--------------------+
| 0|137277| Sutton, Vermont|[sutton, is, town...|
| 1|137278| Walden, Vermont|[walden, is, town...|
| 2|137279| Waterford, Vermont|[waterford, is, t...|
| 3|137280|West Burke, Vermont|[west, burke, is,...|
| 4|137281| Wheelock, Vermont|[wheelock, is, to...|
+---+------+-------------------+--------------------+
only showing top 5 rows
Then I tried to lemmatize it with UDF functions:
def get_wordnet_pos(treebank_tag):
    """
    return WORDNET POS compliance to WORDENT lemmatization (a,n,r,v)
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN

def postagger(p):
    import nltk
    x = list(nltk.pos_tag(p))
    return x

sparkPosTagger = udf(lambda z: postagger(z), ArrayType(StringType()))

def lemmer(postags):
    import nltk
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    x = [lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag)) for [word, pos_tag] in nltk.pos_tag(postags)]
    return x

sparkLemmer = udf(lambda z: lemmer(z), ArrayType(StringType()))

#df = df.select('_c0','pageid','title','text', sparkPosTagger("text").alias('lemm'))
df = df.select('_c0', 'pageid', 'title', 'text', sparkLemmer("lemm").alias('lems'))
which returns this error:
PicklingError: args[0] from __newobj__ args has the wrong class
I believe the error primarily comes from an incompatibility with the object that nltk.pos_tag(postags) produces. Normally, when given a list of tokens, nltk.pos_tag() produces a list of tuples.
I am stuck on working out a workaround, though. As you can see from the code, I tried to split up the process beforehand by POS-tagging separately, only to receive the same error.
Is there a way to make this work?
Contrary to what I suspected, the problem was actually due to the initial function:
def get_wordnet_pos(treebank_tag):
    """
    return WORDNET POS compliance to WORDENT lemmatization (a,n,r,v)
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # As default pos in lemmatization is Noun
        return wordnet.NOUN
which works fine in regular Python. In PySpark, however, importing nltk causes trouble, and calling on wordnet is therefore problematic. Others have hit similar issues when attempting to import stopwords:
pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python
Whilst I haven't solved the root cause, I have redesigned the code, based on what I saw online, as a practical workaround that removes the references to wordnet (they were unnecessary anyway):
def get_wordnet_pos(treebank_tag):
    """
    return WORDNET POS compliance to WORDENT lemmatization (a,n,r,v)
    """
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        # As default pos in lemmatization is Noun
        return 'n'

def lemmatize1(data_str):
    # expects a string
    list_pos = 0
    cleaned_str = ''
    lmtzr = WordNetLemmatizer()
    #text = data_str.split()
    tagged_words = nltk.pos_tag(data_str)
    for word in tagged_words:
        lemma = lmtzr.lemmatize(word[0], get_wordnet_pos(word[1]))
        if list_pos == 0:
            cleaned_str = lemma
        else:
            cleaned_str = cleaned_str + ' ' + lemma
        list_pos += 1
    return cleaned_str

sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())
Nice answer by Saleem Khan! I'd just add that it is good to have the lemmatized output like this (array format):
sparkLemmer1 = udf(lambda x: lemmatize1(x), ArrayType(StringType()))
instead of this:
sparkLemmer1 = udf(lambda x: lemmatize1(x), StringType())
so that you can create e.g. n-grams and do further preprocessing in PySpark.
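Note that if the UDF is declared with ArrayType(StringType()), the Python function has to return a list of lemmas rather than a joined string. A minimal sketch of such a variant (an assumption, not the original answer's code; it reuses get_wordnet_pos from above and assumes the NLTK corpora are available on the workers):

import nltk
from nltk.stem import WordNetLemmatizer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def lemmatize_tokens(tokens):
    # expects a list of tokens; returns a list of lemmas instead of one joined string
    lmtzr = WordNetLemmatizer()
    return [lmtzr.lemmatize(word, get_wordnet_pos(tag))
            for word, tag in nltk.pos_tag(tokens)]

sparkLemmer1 = udf(lemmatize_tokens, ArrayType(StringType()))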

Error message on certain text inputs using Python

I'm pretty new to Python and I'm currently working on an assignment to implement a movie recommendation system. I have a .csv file that contains various descriptions of each movie's attributes. I ask the user for a movie title and then the system returns similar movies.
The dataset is named movie_dataset.csv, from this folder on GitHub: https://github.com/codeheroku/Introduction-to-Machine-Learning/tree/master/Building%20a%20Movie%20Recommendation%20Engine
The problem I am encountering is that when I ask the user to enter a movie title, the program only works for certain titles.
The code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#helper functions#
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

df = pd.read_csv("movie_dataset.csv")
#print (df.columns)
features = ['keywords', 'cast', 'genres', 'director']
for feature in features:
    df[feature] = df[feature].fillna('')

def combine_features(row):
    return row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']

df["combine_features"] = df.apply(combine_features, axis=1)
#print (df["combine_features"].head())
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combine_features"])
#MTitle = input("Type in a movie title: ")
cosine_sim = cosine_similarity(count_matrix)
movie_user_likes = 'Avatar'  #MTitle
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)
i = 0
for movie in sorted_similar_movies:
    print(get_title_from_index(movie[0]))
    i = i + 1
    if i > 10:
        break
When I enter "Batman" the program runs fine. But when I run "Harry Potter" I get:
IndexError Traceback (most recent call last)
<ipython-input-51-687ddb420709> in <module>
30 movie_user_likes = MTitle
31
---> 32 movie_index = get_index_from_title(movie_user_likes)
33
34 similar_movies = list(enumerate(cosine_sim[movie_index]))
<ipython-input-51-687ddb420709> in get_index_from_title(title)
8
9 def get_index_from_title(title):
---> 10 return df[df.title == title]["index"].values[0]
11
12 df = pd.read_csv("movie_dataset.csv")
IndexError: index 0 is out of bounds for axis 0 with size 0
There's simply no entry in the database for the movie "Harry Potter".
You should add some testing for these cases such as:
def get_index_from_title(title):
    try:
        return df[df.title == title]["index"].values[0]
    except IndexError:
        return None
Then of course in the calling code you'll have to test if you got a None from the function and act accordingly.
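For example, a hypothetical continuation of the calling code from the question:

movie_index = get_index_from_title(movie_user_likes)
if movie_index is None:
    print("No entry found for '{}' in the dataset.".format(movie_user_likes))
else:
    similar_movies = list(enumerate(cosine_sim[movie_index]))
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)
    for movie in sorted_similar_movies[:11]:
        print(get_title_from_index(movie[0]))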

I have a total of two errors, return outside function as well as error with defining variables

I am having trouble coding this as I am quite new. My code is supposed to take data, return the average, make a chart of the data, and plot the peaks/valleys of the data.
I don't know why it isn't returning, and that fault is what makes the code fail towards the end; the only part that shows an error is the code that is supposed to find the peak/valley values. It also shows invalid syntax when I try to define a variable (more specifically 'original_data').
EDIT: Thanks to Jono and Ken, I have fixed some of my code, but I checked the values of my lists and they only have one value stored in each, so it's not printing all the peaks/valleys of the dataset I had. I am getting KeyError: -331 and I can't find results as to how to fix it.
# My Favorite Function
import os
clear = lambda: os.system('cls')
clear()

#import modules
import pandas as pd
import matplotlib.pyplot as plt

#key variables
data_set = pd.read_csv('C:/Users/sanderj/Documents/Work/Work_Experience/Day4.csv')
data = data_set["Data"]
peaks = []
valleys = []

#loop functions
for x in data:
    if data[x] == data[0] and data[x] > data[x+1]:
        peaks.append(x)
    elif data[x] > data[x+1] and data[x] > data[x-1]:
        peaks.append(x)
    else:
        continue

for x in data:
    if data[x] == data[0] and data[x] < data[x+1]:
        valleys.append(x)
    elif data[x] < data[x+1] and data[x] < data[x-1]:
        valleys.append(x)
    else:
        continue

#establishing points
a = peaks
b = valleys
plt.plot(a, b, 'ro')
plt.axis([0, 1024, -1000, 1000])
plt.title("Peaks and Valleys")

#final
clear()
plt.show()
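The KeyError: -331 suggests the loops are iterating over the data values (for x in data yields the values, e.g. -331) and then using those values as index labels in data[x]. A minimal sketch of a position-based version of the peak/valley loops (an assumption about the intent; boundary handling is simplified and data is assumed to be a numeric Series):

peaks = []
valleys = []
for i in range(1, len(data) - 1):
    # compare each point with its neighbours by position, not by value
    if data.iloc[i] > data.iloc[i - 1] and data.iloc[i] > data.iloc[i + 1]:
        peaks.append(i)
    elif data.iloc[i] < data.iloc[i - 1] and data.iloc[i] < data.iloc[i + 1]:
        valleys.append(i)

plt.plot(peaks, data.iloc[peaks], 'ro')  # x = positions, y = the data values at those positions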
