load jalali date from string in pyspark - apache-spark

I need to parse a Jalali date from a string and then return it as a Gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/d"
    gre = jdatetime.datetime.strptime(col, format=format).togregorian()
    return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
It throws ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d' at me.
The string in the column matches the format; I have tested it. The problem seems to be in how Spark passes the column value to my function. Is there any way to solve it?
to test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/20|
|1399/01/01|2020/03/21|
+----------+----------+
Instead, I get:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d'

Your problem is that you're trying to apply the function to the Column object itself, not to the values inside the column.
The code you used, spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()), registers your function for use in Spark SQL (via spark.sql(...)), not in the PySpark DataFrame API.
To get a function that you can use inside withColumn, select, etc., you need to create a wrapper with the udf function and use that wrapper in withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See documentation for more details.
You also have an error in the time format: instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
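Putting the two fixes together (the udf wrapper and the corrected format string), a complete sketch might look like this; the jdatetime import is assumed to be available on the workers:
import jdatetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/%d"  # corrected: %d instead of d
    # parse the Jalali string and format its Gregorian equivalent
    gre = jdatetime.datetime.strptime(col, format).togregorian()
    return gre.strftime(format)

jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))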
P.S. If you're running on Spark 3.x, I recommend looking into vectorized UDFs (aka Pandas UDFs); they are much faster than regular UDFs and will give better performance if you have a lot of data.
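For example, a minimal sketch of the same conversion as a Pandas UDF (assuming Spark 3.x with pandas and jdatetime installed on the workers; the column name and format are the ones from the question):
import jdatetime
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def jalali_to_gregorian_pd(dates: pd.Series) -> pd.Series:
    # convert each Jalali date string in the batch to its Gregorian equivalent
    return dates.map(
        lambda d: jdatetime.datetime.strptime(d, "%Y/%m/%d").togregorian().strftime("%Y/%m/%d")
    )

df = df.withColumn("gre", jalali_to_gregorian_pd(df.jalali))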

Related

I am trying to read a CSV file with pandas and then search for a string in the first column, to use the whole row for calculations

I am reading a CSV file with pandas, and then I try to find a word like "Net income" in the first column. Then I want to use the whole row, which has this structure: string/number/number/number/..., to do some calculations with the numbers.
The problem is that find is not working.
data = pd.read_csv(name)
data.str.find('Net income')
I am using CSV files from here: Income Statement for Deutsche Lufthansa AG (DLAKF) from Morningstar.com
I found this: Python | Pandas Series.str.find() - GeeksforGeeks
This is the full traceback:
Traceback (most recent call last):
File "C:\Users\thoma\Desktop\python programme\manage.py", line 16, in <module>
data.str.find('Net income')
File "C:\Users\thoma\AppData\Roaming\Python\Python37\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'str'
So, it works now. But I still have a question. After using the describe function with pandas I get this:
<bound method NDFrame.describe of 2014-12 615
2015-12 612
2016-12 636
2017-12 713
2018-12 736
Name: Goodwill, dtype: object>
I have problems using the data. So how can I, e.g., use the second column here? I tried to create a new table:
new_Table['Goodwill'] = data1['Goodwill'].describe
but this does not work.
I would also like to add more of these "second" columns to new_Table.
You should select the column first, like df['col name'].str.find(x); the .str accessor requires a Series, not a DataFrame.
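For instance, a minimal sketch on the question's DataFrame (assuming the labels such as "Net income" sit in the first column; str.contains is used here as a convenient variant of str.find):
first_col = data.iloc[:, 0]  # the label column as a Series
net_income = data[first_col.str.contains('Net income', na=False)]  # the whole row(s) for that label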
I recommend setting your header row if pandas isn't recognizing named rows in your CSV file.
Something like:
new_header = data.iloc[0] #grab the first row for the header
data = data[1:] #take the data less the header row
data.columns = new_header
From there you can summarize each column by name:
data['Net Income'].describe()
edit: I looked at the CSV file; I recommend reshaping the data first before analyzing columns. Something like...
data = data.transpose()
So in summation:
data = pd.read_csv(name)
data = data.transpose()  # flip the columns/rows
new_header = data.iloc[0]  # grab the first row for the header
data = data[1:]  # take the data less the header row
data.columns = new_header
data['Net Income'].describe()  # analyze
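Regarding the follow-up about describe: the <bound method ...> output appears because describe was referenced without parentheses. A short sketch of using the summarized column (pd.to_numeric is an assumption, since the values were read with object dtype):
goodwill = pd.to_numeric(data['Goodwill'])  # the values come in as strings, convert before doing math
print(goodwill.describe())  # note the parentheses: this actually calls the method
new_Table = pd.DataFrame({'Goodwill': goodwill})  # build the new table column by column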

ValueError: Cannot convert column into bool

I'm trying to build a new column on a dataframe as below:
l = [(2, 1), (1,1)]
df = spark.createDataFrame(l)

def calc_dif(x,y):
    if (x>y) and (x==1):
        return x-y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()
But, I get:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
File "<stdin>", line 38, in <module>
File "<stdin>", line 36, in calc_dif
File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 426, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Why does it happen? How can I fix it?
Either use udf:
from pyspark.sql.functions import udf

@udf("integer")
def calc_dif(x,y):
    if (x>y) and (x==1):
        return x-y
or case when (recommended):
from pyspark.sql.functions import when

def calc_dif(x,y):
    return when((x > y) & (x == 1), x - y)

The first one computes on Python objects, the second one on Spark Columns.
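For reference, a minimal sketch of how the when() version plugs straight into withColumn on the question's sample DataFrame (it builds a Column expression, so no udf is needed):
from pyspark.sql.functions import when

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

# the comparison and the subtraction are evaluated by Spark, not by Python
dfNew = df.withColumn("calc", when((df["_1"] > df["_2"]) & (df["_1"] == 1), df["_1"] - df["_2"]))
dfNew.show()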
It is complaining because you give your calc_dif function the whole Column objects, not the actual data of the respective rows. You need to use a udf to wrap your calc_dif function:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

l = [(2, 1), (1,1)]
df = spark.createDataFrame(l)

def calc_dif(x,y):
    # using the udf, calc_dif is called for every row in the dataframe
    # x and y are the values of the two columns
    if (x>y) and (x==1):
        return x-y

udf_calc = udf(calc_dif, IntegerType())
dfNew = df.withColumn("calc", udf_calc("_1", "_2"))
dfNew.show()
# neither row satisfies both x > y and x == 1, so calc_dif returns None
+---+---+----+
| _1| _2|calc|
+---+---+----+
| 2| 1|null|
| 1| 1|null|
+---+---+----+
For anyone who has a similar error: I was trying to pass an RDD when I needed a Pandas object and got the same error. Obviously, I could simply solve it with .toPandas().
For anyone who faces the same error message: check your brackets. Sometimes the boolean expression needs explicit parentheses, like:
DF_New = df1.withColumn('EventStatus',
                        F.when((F.col("Adjusted_Timestamp") < F.col("Event_Finish")) &
                               (F.col("Adjusted_Timestamp") > F.col("Event_Start")), 1)
                         .otherwise(0))

read content of Column<COLUMN-NAME> in pyspark

I am using spark 1.5.0
I have a data frame created like below, and I am trying to read a column from it:
>>> words = tokenizer.transform(sentenceData)
>>> words
DataFrame[label: bigint, sentence: string, words: array<string>]
>>> words['words']
Column<words>
I want to read all the words (vocab) from the sentences. How can I read this?
Edit 1: Error Still Persists
I now ran this in spark 2.0.0 and getting this error
>>> wordsData.show()
+--------------------+--------------------+
| desc| words|
+--------------------+--------------------+
|Virat is good bat...|[virat, is, good,...|
| sachin was good| [sachin, was, good]|
|but modi sucks bi...|[but, modi, sucks...|
| I love the formulas|[i, love, the, fo...|
+--------------------+--------------------+
>>> wordsData
DataFrame[desc: string, words: array<string>]
>>> vocab = wordsData.select(explode('words')).rdd.flatMap(lambda x: x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 305, in flatMap
return self.mapPartitionsWithIndex(func, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 330, in mapPartitionsWithIndex
return PipelinedRDD(self, f, preservesPartitioning)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2383, in __init__
self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
Resolution for Edit - 1 - Link
You can:
from pyspark.sql.functions import explode
words.select(explode('words')).rdd.flatMap(lambda x: x)
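For example, a minimal sketch of materializing the vocabulary on the driver (the final collect() and the optional distinct() are assumptions about what "read all the words" should produce):
from pyspark.sql.functions import explode

# one word per row, then pull everything back to the driver as a flat Python list
vocab = words.select(explode('words').alias('word')).rdd.flatMap(lambda x: x).collect()
# add .distinct() before .rdd if you only want unique words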

python pandas merging excel sheets not working

I'm trying to merge two Excel sheets using the common field Serial, but it throws some errors. My program is as below:
(user1_env)root#ubuntu:~/user1/test/compare_files# cat compare.py
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = source1_df.join(source2_df, on='Serial')
joined_df.to_excel('/root/user1/test/compare_files/result.xlsx')
getting error as below :
(user1_env)root#ubuntu:~/user1/test/compare_files# python3.5 compare.py
Traceback (most recent call last):
File "compare.py", line 5, in <module>
joined_df = source1_df.join(source2_df, on='Serial')
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4385, in join
rsuffix=rsuffix, sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/frame.py", line 4399, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/tools/merge.py", line 223, in get_result
rdata.items, rsuf)
File "/home/user1/miniconda3/envs/user1_env/lib/python3.5/site-packages/pandas/core/internals.py", line 4445, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index(['Serial'], dtype='object')
I'm referring below SO link for the issue :
python compare two excel sheet and append correct record
A small modification worked for me:
import pandas as pd
source1_df = pd.read_excel('a.xlsx', sheetname='source1')
source2_df = pd.read_excel('a.xlsx', sheetname='source2')
joined_df = pd.merge(source1_df,source2_df,on='Serial',how='outer')
joined_df.to_excel('/home/gk/test/result.xlsx')
It is because of the overlapping column names after the join. You can either set your index to Serial before joining, or specify an rsuffix= or lsuffix= value in your join call so that the suffix is appended to the overlapping column names.
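For completeness, a short sketch of both options, reusing the DataFrames from the question:
# option 1: use 'Serial' as the index on both sides, then join index-to-index
joined_df = source1_df.set_index('Serial').join(source2_df.set_index('Serial'), how='outer')

# option 2: keep join(on='Serial') and suffix the overlapping right-hand column
# (note: this joins source1's 'Serial' column against source2's index)
joined_df = source1_df.join(source2_df, on='Serial', rsuffix='_source2')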

How do I collect a single column in Spark?

I would like to perform an action on a single column.
Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot be collected.
Here is an example:
df = sqlContext.createDataFrame([Row(array=[1,2,3])])
df['array'].collect()
This produces the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Column' object is not callable
How can I use the collect() function on a single column?
Spark >= 2.0
Starting from Spark 2.0.0, you need to explicitly call .rdd in order to use flatMap:
df.select("array").rdd.flatMap(lambda x: x).collect()
Spark < 2.0
Just select and flatMap:
df.select("array").flatMap(lambda x: x).collect()
## [[1, 2, 3]]
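Alternatively (not part of the original answer), you can collect the rows and pull the field out in plain Python:
# each collected row is a Row object; .array accesses the selected column
values = [row.array for row in df.select('array').collect()]
## [[1, 2, 3]]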
