What is wrong with the following DataFrame collect action?
version = spark.sql("select max(version) as latest_version from (DESCRIBE HISTORY {})".format(table_name)).collect()[0][0]
It throws the following exception:
ParseException Traceback (most recent call last)
<command-2270449224428357> in <module>
5 # table_name = database_name+"."+(all_params.get(param_dict)).get('dim_table_name')
6 table_name = "hive_metastore"+"."+database_name+"."+(all_params.get(param_dict)).get('dim_table_name')
----> 7 version = spark.sql("select max(version) as latest_version from (DESCRIBE HISTORY {})".format(table_name)).collect()[0][0]
8 version_dict.update({table_name: version})
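A possible fix, as a minimal sketch (assuming a Databricks/Delta setup where DESCRIBE HISTORY cannot be embedded as a subquery of a SELECT, which is what the ParseException suggests), is to run DESCRIBE HISTORY as its own statement and aggregate the resulting DataFrame:
from pyspark.sql import functions as F
# DESCRIBE HISTORY returns a DataFrame with a "version" column,
# so the maximum can be taken on the DataFrame instead of in a nested SQL query.
history_df = spark.sql("DESCRIBE HISTORY {}".format(table_name))
version = history_df.agg(F.max("version").alias("latest_version")).collect()[0][0]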
While doing a merge with a Delta table I am getting the below error.
'DeltaMergeBuilder' object has no attribute 'WhenNotMatchedInsert'
from delta.tables import *
delta_df = DeltaTable.forPath(spark, "dbfs:/user/hive/warehouse/FileStore/tables/stream_write2")
delta_df.alias("t").merge(
    df.alias("s"),
    "target.empid=source.empid").whenMatchedUpdate(set =
    {
        "name": "source.name",
        "city": "source.city",
        "country": "source.country",
        "contactno": "source.contactno"
    }
).WhenNotMatchedInsert(Values =
    {
        "empid": "source.empid",
        "name": "source.name",
        "city": "source.city",
        "country": "source.country",
        "contactno": "source.contactno"
    }
).execute()
error:
AttributeError: 'DeltaMergeBuilder' object has no attribute 'WhenNotMatchedInsert'
AttributeError Traceback (most recent call last)
<command-3810732791373279> in <cell line: 1>()
1 delta_df.alias("t").merge(
2 df.alias("s"),
3 "target.empid=source.empid").whenMatchedUpdate(set =
4 {
5 "name":"source.name",
AttributeError: 'DeltaMergeBuilder' object has no attribute 'WhenNotMatchedInsert'
I am working on an upsert into a Delta table but am getting the above error.
The error is simple: your method name uses an upper-case W, but it should be lower-case: whenNotMatchedInsert
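For reference, a minimal sketch of the corrected merge (note that the keyword argument should also be lower-case values; the aliases are written here as "target" and "source", as an assumption, so that they match the names used in the merge condition):
from delta.tables import DeltaTable

delta_df = DeltaTable.forPath(spark, "dbfs:/user/hive/warehouse/FileStore/tables/stream_write2")

(delta_df.alias("target")
    .merge(df.alias("source"), "target.empid = source.empid")
    .whenMatchedUpdate(set = {
        "name": "source.name",
        "city": "source.city",
        "country": "source.country",
        "contactno": "source.contactno"
    })
    .whenNotMatchedInsert(values = {   # lower-case method name and keyword
        "empid": "source.empid",
        "name": "source.name",
        "city": "source.city",
        "country": "source.country",
        "contactno": "source.contactno"
    })
    .execute())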
I see a couple of posts, post1 and post2, which are relevant to my question. However, while following the post1 solution I am running into the error below.
joinedDF = df.join(df_agg, "company")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/dataframe.py", line 1050, in join
jdf = self._jdf.join(other._jdf, on, how)
AttributeError: 'NoneType' object has no attribute '_jdf'
Entire code snippet
from pyspark.sql import functions as func

df = spark.read.format("csv").option("header", "true").load("/home/ec2-user/techcrunch/TechCrunchcontinentalUSA.csv")
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
joinedDF = df.join(df_agg, "company")
On the second line you have .show() at the end:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False).show()
remove it like this:
df_agg = df.groupby("company").agg(func.sum("raisedAmt").alias("TotalRaised")).orderBy("TotalRaised", ascending = False)
and your code should work.
You called an action on that DataFrame and assigned the result to the df_agg variable; that is why your variable is NoneType (in Python) or Unit (in Scala).
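If you still want to preview the aggregated data, a minimal sketch is to keep the DataFrame in the variable and call show() on a separate line, since show() is an action that returns None:
from pyspark.sql import functions as func

df_agg = (df.groupby("company")
            .agg(func.sum("raisedAmt").alias("TotalRaised"))
            .orderBy("TotalRaised", ascending=False))
df_agg.show()   # prints the rows; df_agg itself is still a DataFrame
joinedDF = df.join(df_agg, "company")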
I am trying to run this code in a Jupyter notebook but I am getting the following error from the fuzzyset package. I am using fuzzyset version 0.0.9. Does anybody know how to convert these dictionary values to a list?
import json

agrovocSimple = []
with open('agrovocLabels.json') as data_file:
    agrovoc = json.load(data_file)

results = agrovoc["results"]["bindings"]
for entry in results:
    uri = entry["uri"]["value"]
    label = entry["label"]["value"]
    #agrovocSimple.append({"uri": uri , "name": label})
    agrovocSimple.append(label)

#### instantiation of the fuzzyset for the mappings
# allocate the FuzzySet object
a = FuzzySet()
for e in agrovocSimple:
    a.add(e)
TypeError Traceback (most recent call last)
<ipython-input-19-bc4871be0e65> in <module>()
17 a=FuzzySet()
18 for e in agrovocSimple:
---> 19 a.add(e)
fuzzyset\cfuzzyset.pyx in cfuzzyset.cFuzzySet.add()
fuzzyset\cfuzzyset.pyx in cfuzzyset.cFuzzySet._add()
TypeError: Expected list, got dict_values
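As for the literal question about dictionary values: in Python 3, dict.values() returns a dict_values view rather than a list, and wrapping it in list() converts it. The snippet below is a general sketch with a toy dictionary; whether this resolves the fuzzyset error depends on where the dict_values object is actually being passed.
d = {"a": 1, "b": 2}
vals = d.values()        # dict_values view, not a list
vals_list = list(vals)   # plain list: [1, 2]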
I am new to Spark and facing an error while converting a .csv file to a DataFrame. I am using the pyspark_csv module for the conversion, but it gives an error saying "module 'pyspark_csv' has no attribute 'csvToDataframe'".
Here is my code:
import findspark
findspark.init()
findspark.find()
import pyspark

sc = pyspark.SparkContext(appName="myAppName")
sqlCtx = pyspark.SQLContext

# csv to dataframe
sc.addPyFile('/usr/spark-1.5.0/python/pyspark_csv.py')
sc.addPyFile('https://raw.githubusercontent.com/seahboonsiew/pyspark-csv/master/pyspark_csv.py')
import pyspark_csv as pycsv

# skipping the header
def skip_header(idx, iterator):
    if (idx == 0):
        next(iterator)
    return iterator

# loading the dataset
data = sc.textFile('gdeltdata/20160427.CSV')
data_header = data.first()
data_body = data.mapPartitionsWithIndex(skip_header)
data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError Traceback (most recent call last)
<ipython-input-10-8e47cd9759e6> in <module>()
----> 1 data_df = pycsv.csvToDataframe(sqlctx, data_body, sep=",", columns=data_header.split('\t'))
AttributeError: module 'pyspark_csv' has no attribute 'csvToDataframe'
As mentioned in https://github.com/seahboonsiew/pyspark-csv
Please try using the following command:
csvToDataFrame
that is, Frame with a capital F instead of frame.
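A minimal sketch of the corrected call, assuming sqlCtx is an instantiated SQLContext (pyspark.SQLContext(sc)) rather than a reference to the class itself:
sqlCtx = pyspark.SQLContext(sc)   # instantiate the context, not just reference the class
data_df = pycsv.csvToDataFrame(sqlCtx, data_body, sep=",", columns=data_header.split('\t'))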
I've installed DataStax Community Edition and added the DataStax ODBC connector. Now I try to access the database via pyodbc:
import pyodbc

connection = pyodbc.connect('Driver=DataStax Cassandra ODBC Driver;Host=127.0.0.1',
                            autocommit=True)
cursor = connection.cursor()
cursor.execute('CREATE TABLE Test (id INT PRIMARY KEY)')
cursor.execute('INSERT INTO Test (id) VALUES (1)')
for row in cursor.execute('SELECT * FROM Test'):
    print row
It works fine and returns
>>> (1, )
However when I try
cursor.execute('INSERT INTO Test (id) VALUES (:id)', {'id': 2})
I get
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "test.py", line 11, in <module>
cursor.execute('INSERT INTO Test (id) VALUES (:id)', {'id': 2})
pyodbc.ProgrammingError: ('The SQL contains 0 parameter markers, but 1 parameters were supplied', 'HY000')
The alternatives do not work either:
cursor.execute('INSERT INTO Test (id) VALUES (:1)', (2))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "test.py", line 11, in <module>
cursor.execute('INSERT INTO Test (id) VALUES (?)', (2))
pyodbc.ProgrammingError: ('The SQL contains 0 parameter markers, but 1 parameters were supplied', 'HY000')
and
cursor.execute('INSERT INTO Test (id) VALUES (?)', (2))
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
pyodbc.Error: ('HY000', "[HY000] [DataStax][CassandraODBC] (15) Error while preparing a query in Cassandra: [33562624] : line 1:31 no viable alternative at input '1' (...Test (id) VALUES (:[1]...) (15) (SQLPrepare)")
My Cassandra version is 2.2.3; the ODBC driver is from https://downloads.datastax.com/odbc-cql/1.0.1.1002/
According to the pyodbc documentation, your query should be:
cursor.execute('INSERT INTO Test (id) VALUES (?)', 2)
More details on pyodbc Insert
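For reference, a minimal sketch of the qmark parameter style that pyodbc recognizes (pyodbc only scans for ? markers, not :id or :1; whether the DataStax driver then prepares the statement correctly is a separate issue, see the bug link below):
# pyodbc uses '?' placeholders; a single parameter may be passed bare or as a one-element tuple.
cursor.execute('INSERT INTO Test (id) VALUES (?)', 2)
cursor.execute('INSERT INTO Test (id) VALUES (?)', (2,))   # note that (2) is just the int 2, not a tuple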
As per the comments, there is a thread which says it is an open bug in pyodbc:
BUG