How to use the select() transformation in Apache Spark?

I am following the Intro to Spark course on edX. However, I can't understand a few things; the following is a lab assignment. FYI, I am not looking for the solution.
I am not able to understand why I am receiving the error
TypeError: 'Column' object is not callable
Following is the code
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """
    Args:
        column (Column): A Column containing a sentence.
    """
    # The following line is giving the error. I believe I am selecting all the rows
    # from the dataframe 'column' where the attribute is named 'sentence'
    result = column.select('sentence')
    return result

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)

(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
Can you elaborate a little? TIA.

The column parameter is not a DataFrame object and, therefore, does not have access to the select method. You'll need to use other functions to solve this problem.
Hint: Look at the import statement.
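For illustration only (this is a pattern sketch, not the assignment's solution): each of the imported functions takes a Column and returns a new Column expression, so they can be nested and the result passed straight to select. The toy expression below just lowercases and trims, reusing the sentenceDF from the question.

from pyspark.sql.functions import trim, lower, col

# Each call returns a new Column expression; nothing is computed
# until an action such as show() runs.
cleaned = lower(trim(col('sentence')))

# A Column expression is exactly what select() expects as an argument.
sentenceDF.select(cleaned.alias('cleaned_sentence')).show(truncate=False)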

Related

Python lambda function error when used with brackets

I have a pandas data frame with a 'Datetime' column containing timestamp information. I want to extract the hour from the 'Datetime' column and add it to an 'hour' column of the data frame.
I am confused, since my code works if I write the lambda function without the brackets
df['Datetime'].apply(lambda x: x.hour)
but, when I try this code instead
df['Datetime'].apply(lambda x: x.hour())
I get the error "TypeError: 'int' object is not callable".
On the other hand, when I use the split function with a lambda expression, it works completely fine with the brackets -
df['Reasons'] = df['title'].apply(lambda x: x.split(':')[0])
As mentioned by @Dani Mesejo, hour is an attribute of the datetime object, hence it works fine without brackets. Once you add brackets, x.hour first evaluates to an integer and Python then tries to call that integer as a function, so you get TypeError: 'int' object is not callable.
You can read more about the datetime object in its documentation.
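A minimal sketch of the difference, assuming a small DataFrame with a parsed 'Datetime' column:

import pandas as pd

df = pd.DataFrame({'Datetime': pd.to_datetime(['2021-01-01 08:30', '2021-01-01 17:45'])})

# hour is an attribute of each Timestamp, so no brackets:
df['hour'] = df['Datetime'].apply(lambda x: x.hour)

# split is a method of each string, so it is called with brackets:
s = pd.Series(['EMS: BACK PAINS', 'Fire: GAS LEAK'])
reasons = s.apply(lambda x: x.split(':')[0])

# The vectorized .dt accessor avoids apply entirely:
df['hour_dt'] = df['Datetime'].dt.hour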

Python: how can I get the mode from a month column that I extracted from a datetime column?

I'm new at this! Doing my first Python project. :)
My tasks are:
convert df['Start Time'] from string to datetime
create a month column from df['Start Time']
get the mode of that month.
I used a few different ways to do all 3 of the steps, but trying to get the mode always returns TypeError: tuple indices must be integers or slices, not str. This happens even if I try converting the "tuple" into a list or NumPy array.
Ways I tried to extract month from Start Time:
df['extracted_month'] = pd.DatetimeIndex(df['Start Time']).month
df['extracted_month'] = np.asarray(df['extracted_month'])
df['extracted_month'] = df['Start Time'].dt.month
Ways I've tried to get the mode:
print(df['extracted_month'].mode())
print(df['extracted_month'].mode()[0])
print(stat.mode(df['extracted_month']))
Trying to get the index with df.columns.get_loc("extracted_month") then replacing it in the mode code gives me the SAME error (TypeError: tuple indices must be integers or slices, not str).
I think I should convert df['extracted_month'] into a different... something. What is it?
Note: My extracted_month column is a STRING, but you should still be able to get the mode from a string variable! I'm not changing it, that would be giving up.
Edit: using the following code still results in the same error
extracted_month = pd.Index(df['extracted_month'])
print(extracted_month.value_counts())
The error is most likely caused by the way you are creating your dataframe.
If the dataframe is created in another function, and that function returns other things along with the dataframe but you assign the whole result to the variable df, then df will be a tuple that contains the actual dataframe, not the dataframe itself.
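A hedged sketch of how that happens; the loader function, its second return value, and the sample data here are hypothetical, only the error and the fix (unpacking) are the point:

import pandas as pd

def load_data():
    df = pd.DataFrame({'Start Time': ['2017-01-01 09:07:57', '2017-01-02 09:07:57']})
    city = 'chicago'
    return df, city          # returns a tuple, not just the DataFrame

df = load_data()             # df is now the tuple (DataFrame, 'chicago')
# df['Start Time']           # -> TypeError: tuple indices must be integers or slices, not str

df, city = load_data()       # unpack instead, so df is the actual DataFrame
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['extracted_month'] = df['Start Time'].dt.month
print(df['extracted_month'].mode()[0])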

Python: translate a column with multiple languages to English

I have a dataset with multiple comment columns in multiple languages, and I want to translate these columns into English and create new columns with all the English translations.
Accountability_COMMENT is the column which has comments in a different language in every row. I want to create a new column and translate all such comments to English.
I have tried the following code :
from googletrans import Translator
from textblob import TextBlob
translator = Translator()
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(
    lambda x: TextBlob(x).translate(to='en'))
The error that I am getting is :
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
My column has object dtype, which is correct.
You most probably have some comments that consist only of a float (i.e. a decimal number), which, even though the column is dtype object according to pandas, are still interpreted as float by TextBlob. This leads to the error:
TypeError: The text argument passed to __init__(text) must be a string, not <class 'float'>
One solution is to make sure that the input x of TextBlob(x) is a string. You could do this by modifying the apply line like:
data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(lambda x: TextBlob(str(x)).translate(to='en'))
Unfortunately this will probably also raise an error like:
raise NotTranslated('Translation API returned the input string unchanged.')
textblob.exceptions.NotTranslated: Translation API returned the input string unchanged.
This is due to the fact that when translating a number, the translation and the original text will be exactly the same, and apparently TextBlob doesn't like that.
What you can do to avoid this is to catch that exception NotTranslated and just return the untranslated TextBlob, like this:
from textblob import TextBlob
from textblob.exceptions import NotTranslated

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

data_merge['Accountability_COMMENT'] = data_merge['Accountability_COMMENT'].apply(translate_comment)
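Side note: since the original question asked for a new column with the translations, you could instead keep the originals and write the results to a separate column; the column name below is just an example:

# Store the translations next to the originals instead of overwriting them
data_merge['Accountability_COMMENT_EN'] = data_merge['Accountability_COMMENT'].apply(translate_comment)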
EDIT:
If you get the HTTP error Too Many Requests it's probably because you are being kicked out by the Google Translate API. Instead of using apply, you can make your translation "extra-slow" by using a for loop with some sleep in-between cycles. In this case you should import another package (time) and substitute the last line:
from time import sleep
from textblob import TextBlob
from textblob.exceptions import NotTranslated

def translate_comment(x):
    try:
        # Try to translate the string version of the comment
        return TextBlob(str(x)).translate(to='en')
    except NotTranslated:
        # If the output is the same as the input, just return the TextBlob version of the input
        return TextBlob(str(x))

for i in range(len(data_merge['Accountability_COMMENT'])):
    # Translate one comment at a time
    data_merge['Accountability_COMMENT'].iloc[i] = translate_comment(data_merge['Accountability_COMMENT'].iloc[i])
    # Sleep for a quarter of a second
    sleep(0.25)
You can then experiment with different values for the sleep function. Of course, the longer the sleep, the slower the translation! N.B. the sleep argument is in seconds.

Find Max Value in a field of a shapefile

I have a shapefile (mich_co.shp) in which I am trying to find the county with the maximum population. My idea was to use the max() function, but it doesn't work. Here is my code so far:
from osgeo import ogr
import os

shapefile = "C:/Users/root/Python/mich_co.shp"
driver = ogr.GetDriverByName("ESRI Shapefile")
dataSource = driver.Open(shapefile, 0)
layer = dataSource.GetLayer()

for feature in layer:
    print(feature.GetField("pop"))
layer.ResetReading()
The code above, however, only prints all the values of the "pop" field, like this:
10635.0
9541.0
112039.0
29234.0
23406.0
15477.0
8683.0
58990.0
106935.0
17465.0
156067.0
43868.0
135099.0
I tried:
print(max(feature.GetField("pop")))
but it returns TypeError: 'float' object is not iterable. For this, I've also tried:
for feature in range(layer):
and it returns TypeError: 'Layer' object cannot be interpreted as an integer.
Any help or hints would be much appreciated.
Thank you!
max() needs an iterable, such as a list. Try to build a list:
pops = [ feature.GetField("pop") for feature in layer ]
print(max(pops))
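If you also want to know which county has that maximum, one option (sketched here; the county-name field "NAME" is an assumption about your shapefile's schema) is to pass a key function to max() over the features themselves:

layer.ResetReading()
# max() with a key returns the whole feature that has the largest "pop" value
biggest = max(layer, key=lambda feature: feature.GetField("pop"))
print(biggest.GetField("NAME"), biggest.GetField("pop"))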

Return all columns + a few more in a UDF used by the map function

I'm using a map function to generate a new column whose value depends on the value of a column that already exists in the dataframe.
def computeTechFields(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return (row.col1, row.col2, row.col3, tech1)

delta2rdd = delta.map(computeTechFields)
The problem is that my main dataframe has more than 150 columns that I have to return with the map function, so in the end I have something like this:
return (row.col1, row.col2, row.col3, row.col4, row.col5, row.col6, row.col7, row.col8, row.col9, row.col10, row.col11, row.col12, row.col13, row.col14, row.col15, row.col16, row.col17, row.col18 ..... row.col149, row.col150, row.col151, tech1)
As you can see, it is really long to write and difficult to read. So I tried to do something like this :
return (row.*, tech1)
But of course it did not work.
I know that the "withColumn" function exists but I don't know much about its performance and could not make it work anyway.
Edit (what happened with the withColumn function):

def computeTech1(row):
    if row.col1 != VALUE_TO_COMPARE:
        tech1 = 0
    else:
        tech1 = 1
    return tech1

delta2 = delta.withColumn("tech1", computeTech1)
And it gave me this error:
AssertionError: col should be Column
I tried to do something like this:
return col(tech1)
The error was the same
I also tried:
delta2 = delta.withColumn("tech1", col(computeTech1))
This time, the error was:
AttributeError: 'function' object has no attribute '_get_object_id'
End of the edit
So my question is, how can I return all the columns + a few more within my UDF used by the map function?
Thanks !
I'm not super firm with Python, so people might correct me on the syntax here, but the general idea is to make your function a UDF with a column as input, then call that inside withColumn. I used a lambda here, but with some fiddling it should also work with a function.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

computeTech1UDF = udf(
    lambda c: 0 if c != VALUE_TO_COMPARE else 1, IntegerType())
delta2 = delta.withColumn("tech1", computeTech1UDF(col("col1")))
What you tried did not work since you did not provide withColumn with a column expression (see http://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn). Using the UDF wrapper achieves exactly that.
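As a side note, for a simple comparison like this a built-in column expression avoids the Python UDF overhead entirely; a minimal sketch using when/otherwise (an alternative, not what the answer above used):

from pyspark.sql.functions import when, col

# 1 when col1 equals the reference value, 0 otherwise - no Python UDF involved
delta2 = delta.withColumn("tech1", when(col("col1") == VALUE_TO_COMPARE, 1).otherwise(0))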
