get columns post group by in pyspark with dataframes - apache-spark

I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet
df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")

On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
remove it like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action on that DataFrame and assigned its result to the df_agg variable; that's why your variable is NoneType (in Python) or Unit (in Scala). .show() only prints the DataFrame and returns nothing.
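For reference, a minimal sketch of the full corrected flow (same column names as in the question; df is assumed to be the CSV DataFrame loaded above):
import pyspark.sql.functions as func

# Keep the aggregated DataFrame, join it back, and only call .show() at the end
df_agg = (
    df.groupby("company")
      .agg(func.sum("raisedAmt").alias("TotalRaised"))
      .orderBy("TotalRaised", ascending=False)
)
joinedDF = df.join(df_agg, "company")
joinedDF.show()  # .show() only prints and returns None, so don't assign its result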

Related

Delta lake stream-stream left outer join using watermarks

My aim here is to:
Read two delta lake tables as streams.
Add a timestamp column to the data frames.
Create a watermark using the timestamp columns.
Use the watermarks for a left outer join.
Write the join as a stream to a delta lake table.
Here is my code:
from delta import *
from pyspark.sql.functions import *
import pyspark
my_working_dir = "/working"
spark = (
    pyspark.sql.SparkSession.builder.appName("test_delta_stream").master("yarn").getOrCreate()
)
df_stream_1 = spark.readStream.format("delta").load(f"{my_working_dir}/table_1").alias("table_1")
df_stream_2 = spark.readStream.format("delta").load(f"{my_working_dir}/table_2").alias("table_2")
df_stream_1 = df_stream_1.withColumn("inserted_stream_1", current_timestamp())
df_stream_2 = df_stream_2.withColumn("inserted_stream_2", current_timestamp())
stream_1_watermark = df_stream_1.selectExpr(
    "ID AS STREAM_1_ID", "inserted_stream_1"
).withWatermark("inserted_stream_1", "10 seconds")
stream_2_watermark = df_stream_2.selectExpr(
    "ID AS STREAM_2_ID", "inserted_stream_2"
).withWatermark("inserted_stream_2", "10 seconds")
left_join_df = stream_1_watermark.join(
    stream_2_watermark,
    expr(
        """
        STREAM_1_ID = STREAM_2_ID AND
        inserted_stream_1 <= inserted_stream_2 + interval 1 hour
        """
    ),
    "leftOuter",
).select("*")
left_join_df.writeStream.format("delta").outputMode("append").option(
    "checkpointLocation", f"{my_working_dir}/output_delta_table/mycheckpoint"
).start(f"{my_working_dir}/output_delta_table")
Unfortunately, I get the error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/some-dir/python/pyspark/sql/streaming.py", line 1493, in start
return self._sq(self._jwrite.start(path))
File "/some-dir/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/some-dir/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;
It seems that even when I create watermarks, I still cannot do a left outer join. Any ideas why I am getting that error even though I clearly have defined a watermark on the joining columns?
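For comparison, the stream-stream left outer join example in the Structured Streaming programming guide puts a two-sided time-range condition on the event-time column of the nullable (right) side. A hedged sketch adapted to the column names above, not a verified fix for this exact pipeline:
from pyspark.sql.functions import expr

# Sketch: the range condition bounds the right-hand (nullable) side's
# event-time column on both ends, mirroring the pattern in the guide
left_join_df = stream_1_watermark.join(
    stream_2_watermark,
    expr(
        """
        STREAM_1_ID = STREAM_2_ID AND
        inserted_stream_2 >= inserted_stream_1 AND
        inserted_stream_2 <= inserted_stream_1 + interval 1 hour
        """
    ),
    "leftOuter",
)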

load jalali date from string in pyspark

I need to load jalali date from string and then, return it as gregorian date string. I'm using the following code:
def jalali_to_gregorian(col, format=None):
    if not format:
        format = "%Y/%m/d"
    gre = jdatetime.datetime.strptime(col, format=format).togregorian()
    return gre.strftime(format=format)
# register the function
spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType())
# load the date and show it:)
df = df.withColumn("financial_date", jalali_to_gregorian(df.PersianCreateDate))
df.select(['PersianCreateDate', 'financial_date']).show()
It throws ValueError: time data 'Column<PersianCreateDate>' does not match format '%Y/%m/%d' at me.
The strings in the column do match the format; I have tested that. The problem is in how Spark is sending the column value to my function. Is there any way to solve it?
to test:
df=spark.createDataFrame([('1399/01/02',),('1399/01/01',)],['jalali'])
df = df.withColumn("gre", jalali_to_gregorian(df.jalali))
df.show()
should result in
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/20|
|1399/01/01|2020/03/21|
+----------+----------+
Instead, I get:
Fail to execute line 2: df = df.withColumn("financial_date", jalali_to_gregorian(df.jalali))
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6468469233020961307.py", line 375, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "<stdin>", line 7, in jalali_to_gregorian
File "/usr/local/lib/python2.7/dist-packages/jdatetime/__init__.py", line 929, in strptime
(date_string, format))
ValueError: time data 'Column<jalali>' does not match format '%Y/%m/%d''%Y/%m/%d'
Your problem is that you're trying to apply the function to the column object itself, not to the values inside the column.
The code that you have used, spark.udf.register("jalali_to_gregorian", jalali_to_gregorian, StringType()), registers your function for use in Spark SQL (via spark.sql(...)), not in the PySpark DataFrame API.
To get a function that you can use inside withColumn, select, etc., you need to create a wrapper with the udf function and use that wrapper in withColumn:
from pyspark.sql.functions import udf
jalali_to_gregorian_udf = udf(jalali_to_gregorian, StringType())
df = df.withColumn("gre", jalali_to_gregorian_udf(df.jalali))
>>> df.show()
+----------+----------+
| jalali| gre|
+----------+----------+
|1399/01/02|2020/03/21|
|1399/01/01|2020/03/20|
+----------+----------+
See documentation for more details.
You also have an error in the time format: instead of format = "%Y/%m/d" it should be format = "%Y/%m/%d".
P.S. If you're running on Spark 3.x, I recommend looking at vectorized UDFs (a.k.a. Pandas UDFs); they are much faster than regular UDFs and will give better performance if you have a lot of data.
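As a follow-up to the P.S., here is a minimal sketch of a Pandas UDF version of the same conversion; it assumes Spark 3.x and the jdatetime package, and reuses the test DataFrame from the question:
import pandas as pd
import jdatetime
from pyspark.sql.functions import pandas_udf

# Vectorized variant: operates on a whole pandas Series per batch
@pandas_udf("string")
def jalali_to_gregorian_pudf(dates: pd.Series) -> pd.Series:
    return dates.map(
        lambda v: jdatetime.datetime.strptime(v, "%Y/%m/%d")
        .togregorian()
        .strftime("%Y/%m/%d")
    )

df = df.withColumn("gre", jalali_to_gregorian_pudf(df.jalali))
df.show()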

Error message when appending data to pandas dataframe

Can someone give me a hand with this:
I created a loop to append successive intervals of historical price data from Coinbase.
My loop iterates successfully a few times then crashes.
Error message (under data_temp code line):
"ValueError: If using all scalar values, you must pass an index"
days = 10
end = datetime.now().replace(microsecond=0)
start = end - timedelta(days=days)
data_price = pd.DataFrame()
for i in range(1,50):
    print(start)
    print(end)
    data_temp = pd.DataFrame(public_client.get_product_historic_rates(product_id='BTC-USD', granularity=3600, start=start, end=end))
    data_price = data_price.append(data_temp)
    end = start
    start = end - timedelta(days=days)
Would love to understand how to fix this and why this is happening in the first place.
Thank you!
Here's the full trace:
Traceback (most recent call last):
File "\coinbase_bot.py", line 46, in
data_temp = pd.DataFrame(public_client.get_product_historic_rates(product_id='BTC-USD', granularity=3600, start=start, end=end))
File "D:\Program Files\Python37\lib\site-packages\pandas\core\frame.py", line 411, in init
mgr = init_dict(data, index, columns, dtype=dtype)
File "D:\Program Files\Python37\lib\site-packages\pandas\core\internals\construction.py", line 257, in init_dict
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "D:\Program Files\Python37\lib\site-packages\pandas\core\internals\construction.py", line 77, in arrays_to_mgr
index = extract_index(arrays)
File "D:\Program Files\Python37\lib\site-packages\pandas\core\internals\construction.py", line 358, in extract_index
raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index
Here's the JSON returned via a simple URL call:
[[1454716800,370.05,384.54,384.44,375.44,6276.66473729],[1454630400,382.99,389.36,387.99,384.5,7443.92933224],[1454544000,368.74,390.63,368.87,387.99,8887.7572324],[1454457600,365.63,373.01,372.93,368.87,7147.95657328],[1454371200,371.17,374.41,371.33,372.93,6856.21815799],[1454284800,366.26,379,367.89,371.33,7931.22922922],[1454198400,365,382.5,378.46,367.95,5506.77681302]]
Very similar to this user's issue but cannot put my finger on it:
When attempting to merge multiple dataframes, how to resolve "ValueError: If using all scalar values, you must pass an index"
Hi DashOfProgramming,
Your problem is that data_temp is initialised with only a single row, and pandas requires you to provide an index for that.
The following snippet should resolve this. I replaced your API call with a simple dictionary that resembles what I would expect the API to return, and used i as the index for the dataframe (this has the advantage that you can keep track of the iterations as well):
import pandas as pd
from datetime import datetime, timedelta
days = 10
end = datetime.now().replace(microsecond=0)
start = end - timedelta(days=days)
data_price = pd.DataFrame()
temp_dict = {'start': '2019-09-30', 'end': '2019-10-01', 'price': '-111.0928',
             'currency': 'USD'}
for i in range(1,50):
    print(start)
    print(end)
    data_temp = pd.DataFrame(temp_dict, index=[i])
    data_price = data_price.append(data_temp)
    end = start
    start = end - timedelta(days=days)
print(data_price)
EDIT
Just saw that your API output is a nested list. pd.DataFrame() thinks the list is only one row, because it's nested. I suggest you store your columns in a separate variable and then do this:
cols = ['ts', 'low', 'high', 'open', 'close', 'sth_else']
v = [[...], [...], [...]] # your list of lists
data_temp = pd.DataFrame.from_records(v, columns=cols)
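Applied to the sample JSON from the question, that approach looks roughly like this (the column names are the same placeholders as above, not Coinbase's official field names):
import pandas as pd

# Two candles taken from the sample response in the question
v = [
    [1454716800, 370.05, 384.54, 384.44, 375.44, 6276.66473729],
    [1454630400, 382.99, 389.36, 387.99, 384.5, 7443.92933224],
]
cols = ['ts', 'low', 'high', 'open', 'close', 'sth_else']
data_temp = pd.DataFrame.from_records(v, columns=cols)
print(data_temp)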

"NameError: name 'datetime' is not defined" with datetime imported

I know there are a lot of "datetime is not defined" posts, but they all seem to forget the obvious import of datetime. I can't figure out why I'm getting this error. When I do each step in IPython it works well, but the method doesn't.
import requests
import datetime
def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    datetime.datetime.fromtimestamp()
    return df
This code produces this error:
Traceback (most recent call last):
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-29-4f015e05113f>", line 1, in <module>
rv.get_prices(30, 'ETH')
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 161, in get_prices
for symbol in symbols:
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 50, in daily_price_historical
df = pd.DataFrame(data)
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'time'
df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
I think that line is the problem.
Your DataFrame df doesn't have the attribute .time that is referenced at the end of that line.
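One way to confirm that diagnosis (a hedged sketch, not part of the original answer) is to inspect what the API response actually parsed into before touching df.time:
import datetime
import pandas as pd
import requests

# Fetch the same endpoint as in the question, then check for the 'time' column
url = ('https://min-api.cryptocompare.com/data/histoday'
       '?fsym=BTC&tsym=ETH&limit=1&aggregate=1&allData=true')
page = requests.get(url)
data = page.json().get('Data', [])
df = pd.DataFrame(data)
if 'time' not in df.columns:
    # The response was probably an error payload rather than the expected records
    raise ValueError("Unexpected response, columns: %s" % list(df.columns))
df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df['time']]
print(df.head())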
For what it's worth I'm on Python 3.6.0 and this runs perfectly for me:
import requests
import datetime
import pandas as pd
def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    #I don't have the following function, but it's not needed to run this
    #datetime.datetime.fromtimestamp()
    return df
df = daily_price_historical('BTC', 'ETH')
print(df)
Note, I commented out the line that calls an external function that I do not have. Perhaps you have a global variable causing a problem?
Update as per the comments:
I'd use join instead to make the URL:
url = "".join(["https://min-api.cryptocompare.com/data/histoday?fsym=", str(symbol.upper()), "&tsym=", str(comparison_symbol.upper()), "&limit=", str(limit), "&aggregate=", str(aggregate), "&allData=", str(allData)])

Creating a DataFrame from Row results in 'infer schema issue'

When I began learning PySpark, I used a list to create a dataframe. Now that inferring the schema from a list has been deprecated, I got a warning suggesting I use pyspark.sql.Row instead. However, when I try to create one using Row, I run into an infer-schema issue. This is my code:
>>> row = Row(name='Severin', age=33)
>>> df = spark.createDataFrame(row)
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
File "/spark2-client/python/pyspark/sql/session.py", line 322, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
File "/spark2-client/python/pyspark/sql/types.py", line 992, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <type 'int'>
So I created a schema
>>> schema = StructType([StructField('name', StringType()),
... StructField('age',IntegerType())])
>>> df = spark.createDataFrame(row, schema)
but then, this error gets thrown.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/spark2-client/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/spark2-client/python/pyspark/sql/session.py", line 387, in _createFromLocal
data = list(data)
File "/spark2-client/python/pyspark/sql/session.py", line 509, in prepare
verify_func(obj, schema)
File "/spark2-client/python/pyspark/sql/types.py", line 1366, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 33 in type <type 'int'>
The createDataFrame function takes a list of Rows (among other options) plus the schema, so the correct code would be something like:
from pyspark.sql.types import *
from pyspark.sql import Row
schema = StructType([StructField('name', StringType()), StructField('age',IntegerType())])
rows = [Row(name='Severin', age=33), Row(name='John', age=48)]
df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show()
Out:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
+-------+---+
| name|age|
+-------+---+
|Severin| 33|
| John| 48|
+-------+---+
In the pyspark docs (link) you can find more details about the createDataFrame function.
You need to create a list of Row objects and pass that list, together with the schema, to your createDataFrame() method. Sample example:
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
department1 = Row(id='AAAAAAAAAAAAAA', type='XXXXX',cost='2')
department2 = Row(id='AAAAAAAAAAAAAA', type='YYYYY',cost='32')
department3 = Row(id='BBBBBBBBBBBBBB', type='XXXXX',cost='42')
department4 = Row(id='BBBBBBBBBBBBBB', type='YYYYY',cost='142')
department5 = Row(id='BBBBBBBBBBBBBB', type='ZZZZZ',cost='149')
department6 = Row(id='CCCCCCCCCCCCCC', type='XXXXX',cost='15')
department7 = Row(id='CCCCCCCCCCCCCC', type='YYYYY',cost='23')
department8 = Row(id='CCCCCCCCCCCCCC', type='ZZZZZ',cost='10')
schema = StructType([StructField('id', StringType()), StructField('type',StringType()),StructField('cost', StringType())])
rows = [department1,department2,department3,department4,department5,department6,department7,department8 ]
df = spark.createDataFrame(rows, schema)
If you're just making a pandas dataframe, you can convert each Row to a dict and then rely on pandas' type inference, if that's good enough for your needs. This worked for me:
import pandas as pd
sample = output.head(5) #this returns a list of Row objects
df = pd.DataFrame([x.asDict() for x in sample])
I have had a similar problem recently and the answers here helped me understand the problem better.
My code:
row = Row(name="Alice", age=11)
spark.createDataFrame(row).show()
resulted in a very similar error:
An error was encountered:
Can not infer schema for type: <class 'int'>
Traceback ...
The cause of the problem:
createDataFrame expects an array of rows. So if you only have one row and don't want to invent more, simply make it an array: [row]
row = Row(name="Alice", age=11)
spark.createDataFrame([row]).show()
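For completeness, a minimal sketch combining the explicit schema from the question with the list fix above (assumes an existing SparkSession named spark and a recent PySpark where Row preserves the field order you give it):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType()),
    StructField('age', IntegerType()),
])
row = Row(name='Severin', age=33)

# Wrap the single Row in a list so createDataFrame receives an iterable of rows
df = spark.createDataFrame([row], schema)
df.show()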
