I got an error in PySpark:
AnalysisException: u'Resolved attribute(s) week#5230 missing from
longitude#4976,address#4982,minute#4986,azimuth#4977,province#4979,
action_type#4972,user_id#4969,week#2548,month#4989,postcode#4983,location#4981
in operator !Aggregate [user_id#4969, week#5230], [user_id#4969,
week#5230, count(distinct day#4987) AS days_per_week#3605L].
Attribute(s) with the same name appear in the operation: week.
Please check if the right attribute(s) are used
This seems to come from a snippet of code where the agg function is used:
df_rs = (df_n.groupBy('user_id', 'week')
         .agg(countDistinct('day').alias('days_per_week'))
         .where('days_per_week >= 1')
         .groupBy('user_id')
         .agg(count('week').alias('weeks_per_user'))
         .where('weeks_per_user >= 5').cache())
However, I do not see the issue here, and I have previously run this same code on the same data many times.
EDIT: I have been looking through the code and this type of error seems to come from joins of this sort:
df = df1.join(df2, 'user_id', 'inner')
df3 = df4.join(df1, 'user_id', 'left_anti')
but I still have not solved the problem.
EDIT2: Unfortunately the suggested question is not similar to mine, as this is not a question of column name ambiguity but of a missing attribute, which does not appear to be missing when I inspect the actual dataframes.
I faced the same problem and solved it by renaming the columns listed in the "Resolved attribute(s) missing" error to some temporary name before the join. It's a workaround for me; I hope it helps you too. I don't know the real reason behind this issue, but it has been around since Spark 1.6: SPARK-10925.
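For illustration, a minimal sketch of that workaround, reusing the df1/df2 join from the question (the temporary name week_tmp is just an example):
# rename the offending column to a temporary name before the join ...
df1_tmp = df1.withColumnRenamed('week', 'week_tmp')
df = df1_tmp.join(df2, 'user_id', 'inner')
# ... then rename it back afterwards so downstream code keeps working
df = df.withColumnRenamed('week_tmp', 'week')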
I also faced this issue multiple times and came across this; there it's mentioned that it is a Spark-related bug.
Based on that article I came up with the code below, which resolved my issue.
The code can handle LEFT, RIGHT, INNER and OUTER joins, though OUTER behaves as FULL OUTER here.
def join_spark_dfs_sqlbased(sparkSession, left_table_sdf, right_table_sdf, common_join_cols_list=[], join_type="LEFT"):
    temp_join_afix = "_tempjoincolrenames"
    join_type = join_type.upper()
    left = left_table_sdf.select(left_table_sdf.columns)
    right = right_table_sdf.select(right_table_sdf.columns)
    # determine the join columns and tag them with the temporary suffix
    if len(common_join_cols_list) > 0:
        common_join_cols_list = [col + temp_join_afix for col in common_join_cols_list]
    else:
        common_join_cols_list = list(set(left.columns).intersection(right.columns))
        common_join_cols_list = [col + temp_join_afix for col in common_join_cols_list]
    # rename every column on both sides so the analyzer sees fresh attribute IDs
    for col in left.columns:
        left = left.withColumnRenamed(col, col + temp_join_afix)
    left.createOrReplaceTempView('left')
    for col in right.columns:
        right = right.withColumnRenamed(col, col + temp_join_afix)
    right.createOrReplaceTempView('right')
    non_common_cols_left_list = list(set(left.columns) - set(common_join_cols_list))
    non_common_cols_right_list = list(set(right.columns) - set(common_join_cols_list))
    # columns that exist on both sides but were not listed as join columns
    unidentified_common_cols = list(set(non_common_cols_left_list).intersection(non_common_cols_right_list))
    if join_type in ['LEFT', 'INNER', 'OUTER']:
        non_common_cols_right_list = list(set(non_common_cols_right_list) - set(unidentified_common_cols))
        common_join_cols_list_with_table = ['a.' + col + ' as ' + col for col in common_join_cols_list]
    else:
        non_common_cols_left_list = list(set(non_common_cols_left_list) - set(unidentified_common_cols))
        common_join_cols_list_with_table = ['b.' + col + ' as ' + col for col in common_join_cols_list]
    non_common_cols_left_list_with_table = ['a.' + col + ' as ' + col for col in non_common_cols_left_list]
    non_common_cols_right_list_with_table = ['b.' + col + ' as ' + col for col in non_common_cols_right_list]
    non_common_cols_list_with_table = non_common_cols_left_list_with_table + non_common_cols_right_list_with_table
    if join_type == "OUTER":
        join_type = "FULL OUTER"
    join_type = join_type + " JOIN"
    select_cols = common_join_cols_list_with_table + non_common_cols_list_with_table
    common_join_cols_list_with_table_join_query = ['a.' + col + '=' + 'b.' + col for col in common_join_cols_list]
    query = ("SELECT " + ",".join(select_cols) + " FROM left a " + join_type + " right b ON "
             + " AND ".join(common_join_cols_list_with_table_join_query))
    print("query:", query)
    joined_sdf = sparkSession.sql(query)
    # strip the temporary suffix from the result columns
    for col in joined_sdf.columns:
        if temp_join_afix in col:
            joined_sdf = joined_sdf.withColumnRenamed(col, col.replace(temp_join_afix, ''))
    return joined_sdf
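Hypothetical usage, assuming a SparkSession named spark and the two DataFrames df1/df2 from the question:
joined_sdf = join_spark_dfs_sqlbased(spark, df1, df2,
                                     common_join_cols_list=['user_id'],
                                     join_type="INNER")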
Related
Very basic question. I'm currently trying to code a hangman game. I've copied the code verbatim from the book, or so I thought. I keep running into a syntax error and it's pointing me to the colon after the 1. Can someone point out what I'm doing wrong? Appreciate any help given.
# Check if player has guessed too many times and lost
if len(missedLetters == len(HANGMANPICS) - 1:
    displayBoard(HANGMANPICS, missedLetters, correctLetters, secretWord)
    print('You have run out of guesses!\nAfter ' + str(len(missedLetters)) + ' missed guesses and ' + str(len(correctLetters)) + ' correct guesses, and the word was "' + secretWord + '"')
    gameIsDone = True
You're missing a closing parenthesis in your if statement.
This is what you have:
if len(missedLetters == len(HANGMANPICS) - 1:
Should be this to fix the syntax error:
if len(missedLetters) == len(HANGMANPICS) - 1:
I set up an API where, on the client side, the user can calculate a route between two points. However, I have trouble with the SQL query, which works in Postgres, but when I use the same query in Node.js I get an error.
I am using Node.js, Express and Postgres. If I run the query below in pgAdmin 4, I get the expected output.
SELECT b.gid, b.the_geom, b.cost_s, b.length_m FROM pgr_dijkstra('SELECT gid::bigint as id, source::bigint,
target::bigint, cost_s::double precision as cost,
reverse_cost_s::double precision as reverse_cost FROM ways
WHERE the_geom && ST_Expand((SELECT ST_Collect(the_geom)
FROM ways_vertices_pgr WHERE id IN(589143, 581050)), 0.01)',
589143, 581050) a LEFT JOIN ways b ON (a.edge = b.gid);
But when I use the same query in Node.js (see below), I get an error message saying error: syntax error at or near "&&". What am I doing wrong?
const start = parseInt(request.params.start)
const end = parseInt(request.params.end)
const sql2 =
"SELECT b.gid, b.the_geom, b.cost_s, b.length_m FROM pgr_dijkstra('SELECT gid::bigint as id, source::bigint,"+
"target::bigint, cost_s::double precision as cost," +
"reverse_cost_s::double precision as reverse_cost FROM ways" +
"WHERE the_geom && ST_Expand((SELECT ST_Collect(the_geom)" +
"FROM ways_vertices_pgr WHERE id IN(" + start + "," + end + ")), 0.01)'," +
start + "," + end + ") a LEFT JOIN ways b ON (a.edge = b.gid);"
Got the answer. Instead of concatenating start and end into the string, I used arguments with the $1 ... syntax and added the data type of the arguments, in my case integer.
SELECT b.gid, b.the_geom, b.cost_s, b.length_m FROM pgr_dijkstra('SELECT gid::bigint as id, source::bigint,
target::bigint, cost_s::double precision as cost,
reverse_cost_s::double precision as reverse_cost FROM ways
WHERE the_geom && ST_Expand((SELECT ST_Collect(the_geom)
FROM ways_vertices_pgr WHERE id IN(' || $1::integer || ',' || $2::integer ||')), 0.01)',
$1::integer, $2::integer) a LEFT JOIN ways b ON (a.edge = b.gid);
Basically I'm trying to create a multiple choice test that uses information stored inside of lists to change the questions/answers by location.
So far I have this:
import random
DATASETS = [["You first enter the car", "You start the car","You reverse","You turn",
"Coming to a yellow light","You get cut off","You run over a person","You have to stop short",
"in a high speed chase","in a stolen car","A light is broken","The car next to you breaks down",
"You get a text message","You get a call","Your out of gas","Late for work","Driving angry",
"Someone flips you the bird","Your speedometer stops working","Drinking"],
["Put on seat belt","Check your mirrors","Look over your shoulder","Use your turn signal",
"Slow to a safe stop","Relax and dont get upset","Call 911", "Thank your brakes for working",
"Pull over and give up","Ask to get out","Get it fixed","Offer help","Ignore it","Ignore it",
"Get gas... duh","Drive the speed limit","Don't do it","Smile and wave","Get it fixed","Don't do it"],
[''] * 20,
['B','D','A','A','C','A','B','A','C','D','B','C','D','A','D','C','C','B','D','A'],
[''] * 20]
def main():
    questions(0)
    answers(1)

def questions(pos):
    for words in range(len(DATASETS[0])):
        DATASETS[2][words] = input("\n" + str(words + 1) + ".)What is the proper procedure when %s" % DATASETS[0][words] +
                                   '\nA.)' + random.choice(DATASETS[1]) + '\nB.)%s' % DATASETS[1][words] + '\nC.)'
                                   + random.choice(DATASETS[1]) + '\nD.)' + random.choice(DATASETS[1]) +
                                   "\nChoose your answer carefully: ")

def answers(pos):
    for words in range(len(DATASETS[0])):
        DATASETS[4] = list(x is y for x, y in zip(DATASETS[2], DATASETS[3]))
    print(DATASETS)
I apologize if the code is crude to some... I'm in my first year of classes and this is my first bout of programming.
List 3 is my key for the right answers. I want my code in questions() to change the position of the correct answer so that it correlates to the key provided.
I've tried for loops, if statements and while loops but just can't get it to do what I envision. Any help is greatly appreciated.
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" % DATASETS[0][words] + '\nA.)'
if DATASETS[3][words] == 'A':               # if the answer key is A
    tmp = tmp + DATASETS[1][words]          # append the first choice as the correct choice
else:
    tmp = tmp + random.choice(DATASETS[1])  # if not, randomise the choice
Do similar if-else for 'B', 'C', and 'D'
Once your question is formulated, then you can use it:
DATASETS[2][words] = input(tmp)
This is a bit long but I am not sure if any shorter way exists.
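If you want it shorter, here is a hedged sketch that folds the four cases into a loop inside questions() (same variable names as above; untested against the full program):
tmp = "\n" + str(words + 1) + ".)What is the proper procedure when %s" % DATASETS[0][words]
for letter in ('A', 'B', 'C', 'D'):
    if DATASETS[3][words] == letter:    # the answer key marks this slot as the correct one
        tmp += '\n' + letter + '.)' + DATASETS[1][words]
    else:                               # otherwise fill the slot with a random distractor
        tmp += '\n' + letter + '.)' + random.choice(DATASETS[1])
DATASETS[2][words] = input(tmp + "\nChoose your answer carefully: ")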
I'd like to write a script to run several SQL commands in a for/while loop construct. Everything works fine so far, except for deletes.
Script:
#!bin/python3.2
# script to remove batches of obsolete stuff from the tracking DB
#
import sys
import getpass
import platform
import cx_Oracle
# print some infos
print("Python version")
print(("Python version: " + platform.python_version()))
print("cx_Oracle version: " + cx_Oracle.version)
print("Oracle client: " + str(cx_Oracle.clientversion()).replace(', ','.'))
dbconn = cx_Oracle.connect('xxxx','yyyy', '1.2.3.4:1521/xxxRAC')
print ("Oracle DB version: " + dbconn.version)
print ("Oracle client encoding: " + dbconn.encoding)
cleanupAdTaKvpQuery = "delete from TABLE1 where TABLE2_ID < 320745354908598 and rownum <= 5"
getOldRowsQuery = "select count(*) from TABLE2 where ID < 320745354908598"
dbconn.begin()
cursor = dbconn.cursor()
cursor.execute(getOldRowsQuery)
rowCnt = cursor.fetchall()
print("# rows (select before delete): " + str(rowCnt))
try:
    cursor.execute(cleanupAdTaKvpQuery)
    rows = cursor.rowcount
except:
    print("Cleanup Failed.")
cursor.execute(getOldRowsQuery)
rowCnt = cursor.fetchall()
print("# rows (select after delete): " + str(rowCnt))
try:
    dbconn.commit
    print("Success!")
except:
    print("Commit failed " + arg)
dbconn.close
print("# of affected rows:" + str(rows))
As you can see in the output, the script runs fine, the results (see rowCnt) are valid and make sense, and no error or exception is raised.
Output:
Python version
Python version: 3.2.3
cx_Oracle version: 5.2
Oracle client: (11.2.0.3.0)
Oracle DB version: 11.2.0.3.0
Oracle client encoding: US-ASCII
# rows (select before delete): [(198865,)]
# rows (select after delete): [(198860,)] <--- the result above decreased by 5!
Success!
# of rows:5
(ayemac_ora_cleanup)marcel#mw-ws:~/scripts/python/virt-envs/ayemac_ora_cleanup$
What am I missing or doing wrong? I tried to debug it with several additional select statements, trying to catch exceptions, etc...
Any help is appreciated! Thank you!
UPDATE:
Fixed, thanks for the hint with the missing brackets!
You are missing the parentheses in
dbconn.commit()
Without them the statement does not raise an exception; it simply does nothing, because the method is only referenced, never called. The same goes for dbconn.close().
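Applied to the end of your script, the fix looks like this (a sketch; it also catches cx_Oracle.DatabaseError instead of using a bare except so the error text is available):
try:
    dbconn.commit()   # the parentheses make this an actual call
    print("Success!")
except cx_Oracle.DatabaseError as e:
    print("Commit failed " + str(e))
dbconn.close()        # same here: call close(), don't just reference it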
I would like to get as much information as possible from within the executor, while it is executing, but can't seem to find any information on how to accomplish that other than by using the Web UI. For example, it would be useful to know which file is being processed by which executor, and when.
I need this flexibility for debugging, but cannot find any information about it.
Thank you
One of the ways to accomplish this is to use mapPartitionsWithContext.
Example code:
import org.apache.spark.TaskContext
val a = sc.parallelize(1 to 9, 3)
def myfunc(tc: TaskContext, iter: Iterator[Int]): Iterator[Int] = {
  tc.addOnCompleteCallback(() => println(
    "Partition: " + tc.partitionId +
    ", AttemptID: " + tc.attemptId
  ))
  iter.toList.filter(_ % 2 == 0).iterator
}
a.mapPartitionsWithContext(myfunc)
a.collect
API: https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.TaskContext
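For comparison, a rough PySpark equivalent (just a sketch, assuming Spark 2.2+ where pyspark exposes TaskContext; the prints go to the executors' stdout, not the driver):
from pyspark import TaskContext

def myfunc(iterator):
    tc = TaskContext.get()   # task-level context, available inside the executor
    print("Partition: %d, Attempt: %d" % (tc.partitionId(), tc.attemptNumber()))
    return (x for x in iterator if x % 2 == 0)

a = sc.parallelize(range(1, 10), 3)
a.mapPartitions(myfunc).collect()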
However, this does not answer the question about how to see which file was processed, and when.