I am trying to create a new dataframe column (b) by removing the last character from (a).
Column a is a string with different lengths, so I am trying the following code -
from pyspark.sql.functions import *
df.select(substring('a', 1, length('a') -1 ) ).show()
I get a TypeError: 'Column' object is not callable.
It seems to be due to using multiple functions, but I can't understand why, as these work on their own -
If I hardcode the column length, this works:
df.select(substring('a', 1, 10 ) ).show()
Or if I use length on its own, it works:
df.select(length('a') ).show()
Why can I not use multiple functions?
Is there an easier method of removing the last character from all rows in a column?
Using substr
df.select(col('a').substr(lit(0), length(col('a')) - 1))
or using regexp_extract:
df.select(regexp_extract(col('a'), '(.*).$', 1))
The function substring does not work here because its parameters pos and len need to be integers, not columns:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
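If you prefer to keep the substring/length style from the question, one more option (a minimal sketch, assuming the column is named a as above) is to push the whole expression down to Spark SQL with expr(), where the length can be computed per row:
from pyspark.sql.functions import expr

# substring and length are evaluated by Spark SQL, so the length argument can vary per row
df.select(expr("substring(a, 1, length(a) - 1)").alias("b")).show()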
Your code is almost correct. You just need to use the Python len function.
df = spark.createDataFrame([('abcde',)],['dummy'])
from pyspark.sql.functions import substring
df.select('dummy',substring('dummy', 1, len('dummy') -1).alias('substr_dummy')).show()
#+-----+------------+
#|dummy|substr_dummy|
#+-----+------------+
#|abcde| abcd|
#+-----+------------+
Related
I have a PySpark UDF which returns a list of weeks. yw_list contains a list of weeks like 202001, 202002, ... 202048, etc.
def Week_generator(week, no_of_weeks):
    end_index = yw_list.index(week)
    start_index = end_index - no_of_weeks + 1
    return yw_list[start_index:end_index + 1]
spark.udf.register("Week_generator", Week_generator)
When I call this UDF in my Spark SQL dataframe, instead of being stored as a list, the result is stored as a string. Because of this I'm not able to iterate over the values in the list.
spark.sql(""" select Week_generator('some week column', 4) as col1 from xyz""")
Output Schema: col1:String
Any idea or suggestion on how to resolve this ?
As pointed out by Suresh, I had missed adding the return datatype.
spark.udf.register("Week_generator", Week_generator,ArrayType(StringType()))
This solved my issue.
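For reference, a minimal sketch of the full fix (yw_list is assumed to be defined as in the question, and the week column name week_col is made up for illustration):
from pyspark.sql.types import ArrayType, StringType

def Week_generator(week, no_of_weeks):
    end_index = yw_list.index(week)
    start_index = end_index - no_of_weeks + 1
    return yw_list[start_index:end_index + 1]

# The third argument declares the return type, so the result stays an array
spark.udf.register("Week_generator", Week_generator, ArrayType(StringType()))

# col1 now comes back as array<string> instead of string
spark.sql("select Week_generator(week_col, 4) as col1 from xyz").printSchema()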
I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array containing only distinct values. See the example below:
Ex: [24,23,27,23] should get converted to [24, 23, 27]
Code:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def uniq_array(col_array):
    x = np.unique(col_array)
    return x

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))
Df3 = Df2.withColumn("age_array_unique", uniq_array_udf(Df2.age_array))
In the above code, Df2.age_array is the array on which I am applying the UDF to get a different column "age_array_unique" which should contain only unique values in the array.
However, as soon as I run the command Df3.show(), I get the error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can anyone please let me know why this is happening?
Thanks!
The source of the problem is that the object returned from the UDF doesn't conform to the declared type. np.unique not only returns a numpy.ndarray but also converts the numerics to the corresponding NumPy types, which are not compatible with the DataFrame API. You can try something like this:
udf(lambda x: list(set(x)), ArrayType(IntegerType()))
or this (to keep order)
from collections import OrderedDict

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
    ArrayType(IntegerType()))
instead.
If you really want np.unique you have to convert the output:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
You need to convert the final value to a Python list. Implement the function as follows:
def uniq_array(col_array):
    x = np.unique(col_array)
    return list(x)
This is because Spark doesn't understand the numpy array format. In order to feed a python object that Spark DataFrames understand as an ArrayType, you need to convert the output to a python list before returning it.
I also got this error when my UDF returned a float but I forgot to cast it as a float. I needed to do this:
retval = 0.5
return float(retval)
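For instance, a minimal sketch of that situation, reusing the age_array column from the question (the mean computation and the new column name are made up for illustration):
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def mean_of_array(xs):
    retval = np.mean(xs)   # numpy.float64, not a plain Python float
    return float(retval)   # cast so it matches the declared FloatType

mean_udf = udf(mean_of_array, FloatType())
df_with_mean = Df2.withColumn("age_mean", mean_udf(Df2.age_array))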
As of PySpark version 2.4, you can use the built-in array_distinct function:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
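A minimal sketch with the column names from the question; no UDF is needed:
from pyspark.sql.functions import array_distinct

# array_distinct removes duplicate elements natively (Spark >= 2.4)
Df3 = Df2.withColumn("age_array_unique", array_distinct(Df2.age_array))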
The below works fine for me:
udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))
[x.item() for x in <any numpy array>]
converts it to plain python.
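Applied to the UDF from the question, a minimal sketch might look like this:
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def uniq_array(col_array):
    # .item() converts each NumPy scalar to a plain Python int
    return [x.item() for x in np.unique(col_array)]

uniq_array_udf = udf(uniq_array, ArrayType(IntegerType()))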
I am new to Python 3 and am trying to do chi-squared tests on columns in a pandas dataframe. My columns come in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2, and so on. I would like to write a loop to run all column pairs at once.
I succeed in doing this if I specify column index integers or column names manually.
This works:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all; I get no error, but no output either.
I was wondering why this is happening and how I can approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from a list of tuples, then assigning column names as indicators for observed and expected frequencies (subsetting even/odd columns with indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES, THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0, 11, 2)],
                              columns=['chi_sq_stat', 'p_value'])
                 .assign(obs_freq=df.columns[::2],
                         exp_freq=df.columns[1::2])
                 )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that loops don't print, I managed to find a solution, although not as elegant as their solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan
I want to generate a column based on the value of an existing column. Wherever there is a plus sign, we want to split, pick up the 2nd part of the column, and trim any surrounding spaces.
df = spark.sql("select '10/35/70/25% T4Max-300 + 20/45/80/25% T4Max-400' as col1")
df1 = df.withColumn("newcol",col('col1').split("+")[1].strip())
I am getting the error TypeError: 'Column' object is not callable.
Expected output is 20/45/80/25% T4Max-400
The code col('col1') returns a pyspark.sql.Column object for the column named "col1" in your DataFrame.
You are getting the error:
TypeError: 'Column' object is not callable
because you are trying to call split (and trim) as methods on this column, but no such methods exist.
Instead you want to call the functions pyspark.sql.functions.split() and pyspark.sql.functions.trim() with the Column passed in as an argument.
For instance:
import pyspark.sql.functions as f

df1 = df.withColumn(
    "newcol",
    f.trim(
        f.split(f.col('col1'), r"\+")[1]
    )
)
df1.show(truncate=False)
#+-----------------------------------------------+----------------------+
#|col1 |newcol |
#+-----------------------------------------------+----------------------+
#|10/35/70/25% T4Max-300 + 20/45/80/25% T4Max-400|20/45/80/25% T4Max-400|
#+-----------------------------------------------+----------------------+
The second argument to split() is treated as a regular expression pattern, so the + has to be escaped.
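As a quick sanity check (reusing col1 from the example above; this is only a sketch), the escaped pattern splits as expected, whereas a bare "+" is a dangling quantifier in Java regex and would fail at execution time with a PatternSyntaxException:
import pyspark.sql.functions as f

# r"\+" escapes the regex quantifier; an unescaped "+" is not a valid pattern
df.select(f.split(f.col("col1"), r"\+").alias("parts")).show(truncate=False)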
I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType column. When I try to get a distinct count of event_date, the result is an integer variable, but when I try to get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of the event_date column as a variable.
Code that results in an integer type:
loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)
Code that results in dataframe type:
last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)
Edited to add a reproducible example:
schema = StructType([StructField("event_date", TimestampType(), True)])
df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),),(datetime(2015, 8, 10, 3, 44, 15),)], schema)
Code that returns a dataframe:
last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)
Code that returns a variable:
loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt)
You cannot directly access the values in a dataframe; a dataframe returns Row objects. Instead, a dataframe gives you the option to convert them into Python dictionaries. Go through the following example, where I calculate the average word count:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of the count column by applying the collect() operation on it. Remember that collect() returns a list; here the list contains only one element.
averageCount = wordCountsDF.groupBy().avg('count').collect()
The result looks something like this:
[Row(avg(count)=1.6666666666666667)]
You cannot directly access the average value through a Python variable here. You have to convert it into a dictionary to access it:
results = {}
for i in averageCount:
    results.update(i.asDict())
print(results)
Our final result looks like this:
{'avg(count)': 1.6666666666666667}
Finally, you can access the average value using:
print(results['avg(count)'])
1.66666666667
I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case no two rows may have the same value in that column, but it is easy to imagine a case where more than one row can have the same max event_date.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
If you want code to get the max event_date and store it as a variable, try the following:
max_date = df.select([max('event_date')]).distinct().collect()
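collect() still returns a list of Row objects, though, so to get the value itself into a Python variable you can index into the first row; a minimal sketch following the question's column name:
from pyspark.sql.functions import max as max_

# collect() gives [Row(max(event_date)=...)]; [0][0] pulls out the scalar value
max_date = df.select(max_('event_date')).collect()[0][0]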
Using collect()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).collect()[0][0]
Using first()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).first()[0]
last_processed_dt=df.select([max('event_date')])
To get the max of the date, we should try something like:
last_processed_dt=df.select([max('event_date').alias("max_date")]).collect()[0]
last_processed_dt["max_date"]
Based on Sujit's example: we can actually print the value from [Row(avg(count)=1.6666666666666667)] without iterating/looping, by using averageCount[0][0].
Note: we do not need to go through the loop, because it is going to return only one value.
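In other words, reusing the averageCount list collected in Sujit's answer above:
# averageCount == [Row(avg(count)=1.6666666666666667)], a one-element list of Rows
print(averageCount[0][0])   # 1.6666666666666667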
Try this:
distinct_dates = test1.select('event_date').distinct()
loop_cnt = distinct_dates.count()
var = distinct_dates.collect()[0]
Hope this helps.
What you can try is accessing the value through collect(), for example:
trainDF.fillna({'Age': trainDF.select('Age').agg(avg('Age')).collect()[0][0]})
Using countDistinct, you can also do the following:
from pyspark.sql.functions import countDistinct

loop_cnt = test1.select(countDistinct('event_date')).collect()[0][0]
print(loop_cnt)