How do I collect a single column in Spark? - apache-spark

I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object, so it cannot be collected.
Here is an example:
df = sqlContext.createDataFrame([Row(array=[1,2,3])])
df['array'].collect()
This produces the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
How can I use the collect() function on a single column?

Spark >= 2.0
Starting from Spark 2.0.0 you need to explicitly specify .rdd in order to use flatMap
df.select("array").rdd.flatMap(lambda x: x).collect()
Spark < 2.0
Just select and flatMap:
df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]]
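Equivalently, you can collect the rows and pull the field out in Python. This is a small sketch on the same example DataFrame, not part of the original answer:
[row.array for row in df.select("array").collect()]
## [[1, 2, 3]]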

Related

How to select a row in a pandas DataFrame datetime index using a datetime variable?

I am not a professional programmer and am slowly accumulating experience in Python.
This is the issue I encountered.
On my dev machine I had Python 3.7 installed with pandas version 0.24.4,
and the following sequence worked perfectly fine.
>>> import pandas as pd
>>> df = pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
>>> df
2000-01-01 0
2000-01-02 1
2000-01-03 2
Freq: D, dtype: int64
>>> import datetime
>>> D = datetime.date(2000,1,1)
>>> df[D]
0
In the production environment the pandas version is 1.1.4, and the sequence described above no longer works.
>>> df[D]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/core/series.py", line 989, in _get_value
loc = self.index.get_loc(label)
File "/home/ec2-user/.local/lib/python3.7/site-packages/pandas/core/indexes/datetimes.py", line 622, in get_loc
raise KeyError(key)
KeyError: datetime.date(2000, 1, 1)
Then, unexpectedly, after converting D to a string, the following command did work:
>>> df[str(D)]
0
Any idea why this behaviour has changed between versions?
Is this behaviour a bug, or will it be permanent over time?
Should I convert all the selections by datetime variables in the code to strings, or is there a more robust way to do this?
It depends on the version. For a more robust solution, use datetime.datetime (rather than datetime.date) to match the DatetimeIndex:
import datetime
D = datetime.datetime(2000,1,1)
print (df[D])
0
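A pd.Timestamp also matches a DatetimeIndex label, so the following works as well in recent pandas versions (a small sketch on the same series, not part of the original answer):
import pandas as pd
D = pd.Timestamp(2000, 1, 1)   # or pd.Timestamp("2000-01-01")
print(df[D])
0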

load jalali date from string in pyspark

I need to load a Jalali date from a string and then return it as a Gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/d"
    gre = jdatetime.datetime.strptime(col, format=format).togregorian()
    return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
It throws the error ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d' at me.
The string in the column does match the format; I have tested it. The problem is in how Spark is sending the column value to my function. Is there any way to solve it?
To test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/20|
|1399/01/01|2020/03/21|
+----------+----------+
Instead, I get:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d'
Your problem is that you're trying to apply the function to the column object itself, not to the values inside the column.
The code you used, spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()), registers your function for use in Spark SQL (via spark.sql(...)), not in the PySpark DataFrame API.
To get a function that you can use inside withColumn, select, etc., you need to create a wrapper with the udf function and use that wrapper in withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See the documentation for more details.
You also have an error in the time format: instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
P.S. If you're running on Spark 3.x, I recommend looking at vectorized UDFs (aka Pandas UDFs); they are much faster than regular UDFs and will give better performance if you have a lot of data.
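For illustration, a minimal Pandas UDF version of the same conversion might look like this. This is a sketch assuming Spark 3.x with type-hinted pandas_udf; the name jalali_to_gregorian_pudf is made up here:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd
import jdatetime

@pandas_udf(StringType())
def jalali_to_gregorian_pudf(s: pd.Series) -> pd.Series:
    fmt = "%Y/%m/%d"
    # convert each Jalali string in the batch to a Gregorian string of the same format
    return s.map(lambda v: jdatetime.datetime.strptime(v, fmt).togregorian().strftime(fmt))

df = df.withColumn("gre", jalali_to_gregorian_pudf(df.jalali))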

How to convert string array to numpy array and pass it to UDF in Pyspark?

I've stored a NumPy array as a string array in a CSV file (didn't know any other way). Now there are two problems I am facing:
1) I need to read the CSV file, convert the string array to a NumPy array, and pass it to the UDF.
2) Why am I not able to use the DF.withColumn method? It throws the error below:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 1989, in withColumn
assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column
My code snippet -
def wantNumpyArr(array):
    try:
        # some code
    except Exception:
        pass
    else:
        return float_var
spark.udf.register("wantNumpyArr", wantNumpyArr, FloatType())
#Read from csv file
read_data = spark.read.format("csv").load("/path/to/file/part-*.csv", header="true")
rdd = read_data.rdd
convert_data = rdd.map(lambda x: (x[0], x[1], wantNumpyArr(x[2])))
When I print the convert_data RDD, the third column value is always None, which means the flow in the UDF always goes into the except block.
Sample data -
[Row(Id='ABCD505936', some_string='XCOYNZGAE', array='[0, 2, 5, 6, 8, 10, 12, 13, 14, 15]')]
The schema of the DF is -
print (read_data.schema)
StructType(List(StructField(col1,StringType,true),StructField(col2,StringType,true),StructField(col3,StringType,true)))
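A minimal sketch of one way to make this work, assuming the bracketed string should be parsed into a NumPy array and reduced to a single float. The column name col3 is taken from the schema above; the helper and UDF names are illustrative, not from the original post:
import ast
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def want_numpy_arr(array_str):
    try:
        arr = np.array(ast.literal_eval(array_str), dtype=float)  # "[0, 2, 5]" -> ndarray
        return float(arr.mean())  # example reduction returning a plain Python float
    except (ValueError, SyntaxError, TypeError):
        return None

want_numpy_arr_udf = udf(want_numpy_arr, FloatType())  # wrap so it can be used with withColumn
converted = read_data.withColumn("parsed", want_numpy_arr_udf(read_data["col3"]))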

get columns post group by in pyspark with dataframes

I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet:
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
Remove it, like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action (.show()) on that DataFrame and assigned its result to the df_agg variable; that's why your variable is NoneType (in Python) or Unit (in Scala).
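Putting it together, the corrected snippet would look like this (a sketch based on the code above, calling .show() separately if you still want to inspect the aggregation):
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending=False)
df_agg.show()  # .show() is an action that returns None, so keep it off the assignment
joinedDF = df.join(df_agg, "company")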

how to call a different attribute of a Pandas dataframe using a variable?

Consider df, a pandas DataFrame with 10 different columns and 500 rows. The user is asked to pick a column name, which will be stored in var1.
I am trying to access the column corresponding to var1 and change its data type, but I see an error.
Is there any way to solve this problem?
Regards,
var1=input('Enter the file name:').lower().capitalize()
df[var1]=df.var1.astype(float)
error:
'DataFrame' object has no attribute 'file_name'
The current approach you're taking - using df.var1 to reference the var1 column - has pandas searching literally for a column/attribute named var1. A correct way of accessing this column is df[var1], which will look up whatever name is contained in var1. See the example below for more detail:
>>> import pandas as pd
>>> var1 = 'hello'
>>> df = pd.DataFrame({'hello': [1]})
>>> df
hello
0 1
>>> df.var1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/blacksite/Documents/envs/dsenv/lib/python3.6/site-packages/pandas/core/generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'var1'
>>> df.hello
0 1
Name: hello, dtype: int64
>>> df[var1]
0 1
Name: hello, dtype: int64
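Applied to the original snippet, the fix is to index with the variable instead of attribute access (assuming var1 ends up holding a valid column name):
var1 = input('Enter the file name:').lower().capitalize()
df[var1] = df[var1].astype(float)  # bracket indexing looks up the name stored in var1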
