Can we update a row in a Hive table using Spark SQL?

I tried this:
HiveContext hiveql = new org.apache.spark.sql.hive.HiveContext(ctx);
hiveql.sql("UPDATE sparkexamples.employee SET empname='Sreeharsha' WHERE empid='1210'");
Submitting the job:
./bin/spark-submit \
--class com.spark.examples.SparkUpdateHiveContext \
--master local[4] /home/hadoop/SparkHIveUpdate.jar
I get the following error. Any suggestions, please?
16/07/01 11:45:38 INFO parse.ParseDriver: Parse Completed
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unsupported language features in query:
UPDATE sparkexamples.employee SET empname='Sreeharsha' WHERE empid='1210'
TOK_UPDATE_TABLE 1, 0,16, 7
TOK_TABNAME 1, 2,4, 7
sparkexamples 1, 2,2, 7
employee 1, 4,4, 21
TOK_SET_COLUMNS_CLAUSE 1, 6,10, 41
= 1, 8,10, 41
TOK_TABLE_OR_COL 1, 8,8, 34
empname 1, 8,8, 34
'Sreeharsha' 1, 10,10, 42
TOK_WHERE 1, 12,16, 66
= 1, 14,16, 66
TOK_TABLE_OR_COL 1, 14,14, 61
empid 1, 14,14, 61
'1210' 1, 16,16, 67
scala.NotImplementedError: No parse rules for TOK_UPDATE_TABLE:
TOK_UPDATE_TABLE 1, 0,16, 7
TOK_TABNAME 1, 2,4, 7
sparkexamples 1, 2,2, 7
employee 1, 4,4, 21
TOK_SET_COLUMNS_CLAUSE 1, 6,10, 41
= 1, 8,10, 41
TOK_TABLE_OR_COL 1, 8,8, 34
empname 1, 8,8, 34
'Sreeharsha' 1, 10,10, 42
TOK_WHERE 1, 12,16, 66
= 1, 14,16, 66
TOK_TABLE_OR_COL 1, 14,14, 61
empid 1, 14,14, 61
'1210' 1, 16,16, 67
org.apache.spark.sql.hive.HiveQl$.nodeToPlan(HiveQl.scala:1086)

Related

Creating a TXT file and seeking a position in Python

I have given the following variables:
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
I want to create a .TXT file which would look like this with tab separated values:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
500 1000 5 8 9 0
1000 1500 6 7 11 19
1500 2000 1 4 5 12
2000 2500 -5 8 9 0
2500 3000 -6 7 11 19
3000 3500 1 4 5 12
3500 4000 -5 8 9 0
4000 4500 -6 7 11 19
I have written the following code:
#!/usr/bin/env python3
import os
from datetime import datetime
import time
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
filename = f"{datetime.now().strftime('%Y%m%d_%H%M%S')}_{signal1}_results.TXT"
with open(filename, 'w') as f:
    # write the bin1 range
    f.write('\n\n\n')
    f.write('\t\t\t\t')
    f.write(signal1 + '>=')
    for bin in bins1[:-1]:
        f.write('\t' + str(bin))
    f.write('\n')
    f.write('\t\t\t\t')
    f.write(signal1 + '<')
    for bin in bins1[1:]:
        f.write('\t' + str(bin))
    f.write('\n')
    # write the bin2 range
    f.write('\t\t')
    f.write(signal2 + '>=' + '\t' + signal2 + '<' + '\n')
    f.write('\t\t')
    # store the cursor position from where hist result will be written line by line
    track_cursor_pos = []
    curr = bins2[0]
    for next in bins2[1:]:
        f.write(str(curr) + '\t' + str(next))
        track_cursor_pos.append(f.tell())
        f.write('\n\t\t')
        curr = next
    f.write('\n')
    print(track_cursor_pos)
    i = 0
    # Everything is fine until here
    # Code below doesn't work as expected!?
    for result in hist_result:
        f.seek(track_cursor_pos[i], os.SEEK_SET)
        for r in result:
            f.write('\t' + str(r))
        f.write('\n')
        i += 1
But, this is producing the TXT file whose contents look like this:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
0 -5 8 9 0
00 -6 7 11 19
1 4 5 12
00 -5 8 9 0
00 -6 7 11 19
1 4 5 12
00 -5 8 9 0
00 -6 7 11 19
I think I am not using the f.seek() properly. Any suggestion would be appreciated. Thanks in advance.

You don't have to seek inside the file to print your data:
signal1 = 'speed'
bins1 = [0, 10, 20, 30, 40]
signal2 = 'rpm'
bins2 = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]
hist_result = [ [1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
[1, 4, 5, 12],
[-5, 8, 9, 0],
[-6, 7, 11, 19],
]
with open('data.txt', 'w') as f_out:
    # header rows: lower and upper edges of the signal1 bins
    print('\t{signal1}>=\t{bins}'.format(signal1=signal1, bins='\t'.join(map(str, bins1[:-1]))), file=f_out)
    print('\t{signal1}<\t{bins}'.format(signal1=signal1, bins='\t'.join(map(str, bins1[1:]))), file=f_out)
    # header for the two signal2 bin-edge columns
    print('{signal2}>=\t{signal2}<'.format(signal2=signal2), file=f_out)
    # one row per signal2 bin: lower edge, upper edge, then the histogram counts
    for a, b, data in zip(bins2[:-1], bins2[1:], hist_result):
        print(a, b, *data, sep='\t', file=f_out)
Creates data.txt:
speed>= 0 10 20 30
speed< 10 20 30 40
rpm>= rpm<
0 500 1 4 5 12
500 1000 -5 8 9 0
1000 1500 -6 7 11 19
1500 2000 1 4 5 12
2000 2500 -5 8 9 0
2500 3000 -6 7 11 19
3000 3500 1 4 5 12
3500 4000 -5 8 9 0
4000 4500 -6 7 11 19
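
As for why the seek-based attempt in the question garbled the rows: writing after f.seek() overwrites whatever is already stored at that position; it does not insert. A minimal sketch of that behavior, assuming an in-memory io.StringIO buffer stands in for the real file (the two-row content is made up for illustration):

```python
import io
import os

buf = io.StringIO()             # stand-in for the text file opened with 'w'
buf.write('0\t500')
end_of_first_row = buf.tell()   # position right after "0<TAB>500"
buf.write('\n500\t1000\n')      # the next row is already in the buffer

# Go back and try to "append" results to the first row.
buf.seek(end_of_first_row, os.SEEK_SET)
buf.write('\t1\t4')             # replaces the 4 characters stored there

overwritten = buf.getvalue()
print(repr(overwritten))
```

The newline and the next row's lower bin edge are replaced by the written values, which matches the damaged output shown in the question. Building each line completely before writing it avoids the problem.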

transform integer value patterns in a column to a group

DataFrame
df=pd.DataFrame({'occurance':[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],'value':[45, 3, 2, 12, 14, 32, 1, 1, 6, 4, 9, 32, 78, 96, 12, 6, 3]})
df
Expected output
df=pd.DataFrame({'occurance':[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0],'value':[45, 3, 2, 12, 14, 32, 1, 1, 6, 4, 9, 32, 78, 96, 12, 6, 3],'group':[1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4, 100, 5, 5, 5, 5]})
df
I need to transform the DataFrame into the expected output. I am after a rule where a 1 marks the start of a new group, and a group consists of a single 1 followed by n zeroes. If the group criteria are not met, the row should be assigned group 100.
I tried something along these lines:
bs=df[df.occurance.eq(1).any(1)&df.occurance.shift(-1).eq(0).any(1)].squeeze()
bs
Even when broken down, this could only boolean-select the start of a group and nothing more.
Any help?

Create a mask that flags each 1 immediately followed by another 1, select occurance for all rows outside that mask, take the cumulative sum with Series.cumsum, and finally assign 100 to the masked rows with Series.reindex:
m = df.occurance.eq(1) & df.occurance.shift(-1).eq(1)
df['group'] = df.loc[~m, 'occurance'].cumsum().reindex(df.index, fill_value=100)
print (df)
occurance value group
0 1 45 1
1 0 3 1
2 0 2 1
3 0 12 1
4 1 14 2
5 0 32 2
6 0 1 2
7 0 1 2
8 0 6 2
9 0 4 2
10 1 9 3
11 0 32 3
12 1 78 100
13 1 96 4
14 0 12 4
15 0 6 4
16 0 3 4
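
The mask-and-cumsum logic above can also be spelled out row by row. The sketch below is a plain-Python restatement rather than the pandas code itself; the function name group_ids and the fill parameter are made up for illustration.

```python
def group_ids(occurrence, fill=100):
    """Label rows: each 1 starts a new group unless the next value is also 1."""
    groups = []
    counter = 0
    for i, value in enumerate(occurrence):
        next_is_one = i + 1 < len(occurrence) and occurrence[i + 1] == 1
        if value == 1 and next_is_one:
            # A 1 directly followed by another 1 has no zeroes after it,
            # so it gets the fill value and does not advance the counter.
            groups.append(fill)
        else:
            counter += value  # running sum, like cumsum over the kept rows
            groups.append(counter)
    return groups


occurrence = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
labels = group_ids(occurrence)
print(labels)
```

For the sample column this gives the same labels as the group column printed above.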

DataFrameGroupBy.agg NamedAgg on same column errors out on custom function, but works on built-in function

Setup
np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
*np.random.randint(1, 100, 20).reshape(-1,10)),
columns=['A','B', 'C'])
Out[127]:
A B C
0 1 45 71
1 1 48 89
2 2 65 89
3 2 68 13
4 2 68 59
5 3 10 66
6 7 84 40
7 7 22 88
8 9 37 47
9 10 88 89
f = lambda x: x.max()
NamedAgg on built-in function works fine
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', 'max'), C_max=('C', 'max'))
Out[133]:
B_min B_max C_max
A
1 45 48 89
2 65 68 89
3 10 10 66
7 22 84 88
9 37 37 47
10 88 88 89
NamedAgg on custom function f errors out
df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', f), C_max=('C', 'max'))
KeyError: "[('B', '<lambda>')] not in index"
Is there any explanation for this error? Is it an intentional restriction?

The issue is caused by _mangle_lambda_list, which gets called at some point during the aggregation. There seems to be a mismatch: the resulting aggregation gets renamed, but the list of output columns, order, which is then used to index the result, doesn't get changed. Since that function specifically checks if com.get_callable_name(aggfunc) == "<lambda>", any name other than '<lambda>' will work without issue:
Sample data
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(zip([1, 1, 2, 2, 2, 3, 7, 7, 9, 10],
*np.random.randint(1, 100, 20).reshape(-1,10)),
columns=['A','B', 'C'])
f = lambda x: x.max()
kwargs = {'B_min': ('B', 'min'), 'B_max':('B', f), 'C_max':('C', 'max')}
Here are the most relevant major steps that get called when you aggregate, and we can see where the KeyError comes from.
func, columns, order = pd.core.groupby.generic._normalize_keyword_aggregation(kwargs)
print(order)
#[('B', 'min'), ('B', '<lambda>'), ('C', 'max')]
func = pd.core.groupby.generic._maybe_mangle_lambdas(func)
df.groupby('A')._aggregate(func)
# B C
# min <lambda_0> max # _0 ruins indexing with ('B', '<lambda>')
#A
#1 45 48 89
#2 65 68 89
#3 10 10 66
#7 22 84 88
#9 37 37 47
#10 88 88 89
Because _mangle_lambda_list is only called when there are multiple aggregations for the same column, you can get away with the '<lambda>' name, so long as it is the only aggregation for that column.
df.groupby('A').agg(A_min=('A', 'min'), B_max=('B', f))
# A_min B_max
#A
#1 1 48
#2 2 68
#3 3 10
#7 7 84
#9 9 37
#10 10 88
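
Following the point above that any name other than '<lambda>' avoids the problem, one workaround is to pass a named function instead of a lambda. The sketch below only shows the name difference that the check keys on. callable_name is a made-up, simplified stand-in for com.get_callable_name (assumed here to come down to reading __name__), and col_max is a hypothetical replacement for f.

```python
def callable_name(func):
    # Simplified stand-in for pandas' internal name lookup (assumption).
    return getattr(func, '__name__', None)


f = lambda x: x.max()  # anonymous: its name is '<lambda>'


def col_max(x):  # named equivalent of f
    return x.max()


names = [callable_name(f), callable_name(col_max)]
print(names)

# Per the answer above, the named version would then be passed as:
# df.groupby('A').agg(B_min=('B', 'min'), B_max=('B', col_max), C_max=('C', 'max'))
```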

spark sql 1.x. NotImplementedError: no parse rules

When I run a query through Spark SQL (HiveContext), it fails with the error below.
Which part of the syntax causes this?
I am using Spark 1.6 and Hive 1.2.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Unsupported language features in query: SELECT i.*,
from_unixtime(unix_timestamp('20170221','yyyyMMdd'),"yyyy-MM-dd'T'HH:mm:ssZ") bounce_date
FROM
(SELECT country,
device_id,
os_name,
app_ver
FROM jpl_band_orc
WHERE yyyymmdd='20170221'
AND scene_id='app_intro'
AND action_id='scene_enter'
AND classifier='app_intro'
GROUP BY country, device_id, os_name, app_ver ) i
LEFT JOIN
(SELECT device_id
FROM jpl_band_orc
WHERE yyyymmdd='20170221'
AND scene_id='band_list'
AND action_id='scene_enter'
AND device_id IN
(SELECT DISTINCT device_id
FROM jpl_band_orc x
WHERE yyyymmdd='20170221'
AND scene_id='app_intro'
AND action_id='scene_enter'
AND classifier='app_intro' ) ) s
ON i.device_id = s.device_id
WHERE s.device_id is null
TOK_QUERY 8, 0,425, 10
TOK_FROM 8, 28,412, 10
TOK_LEFTOUTERJOIN 8, 36,412, 10
TOK_SUBQUERY 8, 36,186, 10
TOK_QUERY 8, 37,182, 10
TOK_FROM 8, 91,93, 10
TOK_TABREF 8, 93,93, 10
TOK_TABNAME 8, 93,93, 10
jpl_band_orc 8, 93,93, 10
TOK_INSERT 0, -1,182, 0
TOK_DESTINATION 0, -1,-1, 0
TOK_DIR 0, -1,-1, 0
TOK_TMP_FILE 0, -1,-1, 0
TOK_SELECT 4, 37,84, 13
TOK_SELEXPR 4, 39,39, 13
TOK_TABLE_OR_COL 4, 39,39, 13
country 4, 39,39, 13
TOK_SELEXPR 5, 54,54, 12
TOK_TABLE_OR_COL 5, 54,54, 12
device_id 5, 54,54, 12
TOK_SELEXPR 6, 69,69, 12
TOK_TABLE_OR_COL 6, 69,69, 12
os_name 6, 69,69, 12
TOK_SELEXPR 7, 84,84, 12
TOK_TABLE_OR_COL 7, 84,84, 12
app_ver 7, 84,84, 12
TOK_WHERE 12, 100,161, 13
AND 12, 102,161, 13
AND 11, 138,138, 13
AND 10, 119,119, 13
= 9, 102,104, 19
TOK_TABLE_OR_COL 9, 102,102, 11
yyyymmdd 9, 102,102, 11
'20170221' 9, 104,104, 20
= 10, 121,123, 25
TOK_TABLE_OR_COL 10, 121,121, 17
scene_id 10, 121,121, 17
'app_intro' 10, 123,123, 26
= 11, 140,142, 26
TOK_TABLE_OR_COL 11, 140,140, 17
action_id 11, 140,140, 17
'scene_enter' 11, 142,142, 27
= 12, 159,161, 27
TOK_TABLE_OR_COL 12, 159,159, 17
classifier 12, 159,159, 17
'app_intro' 12, 161,161, 28
TOK_GROUPBY 13, 168,182, 15
TOK_TABLE_OR_COL 13, 173,173, 15
country 13, 173,173, 15
TOK_TABLE_OR_COL 13, 176,176, 24
device_id 13, 176,176, 24
TOK_TABLE_OR_COL 13, 179,179, 35
os_name 13, 179,179, 35
TOK_TABLE_OR_COL 13, 182,182, 44
app_ver 13, 182,182, 44
i 13, 186,186, 54
TOK_SUBQUERY 16, 201,391, 10
TOK_QUERY 16, 202,387, 10
TOK_FROM 16, 211,213, 10
TOK_TABREF 16, 213,213, 10
TOK_TABNAME 16, 213,213, 10
jpl_band_orc 16, 213,213, 10
TOK_INSERT 0, -1,387, 0
TOK_DESTINATION 0, -1,-1, 0
TOK_DIR 0, -1,-1, 0
TOK_TMP_FILE 0, -1,-1, 0
TOK_SELECT 15, 202,204, 13
TOK_SELEXPR 15, 204,204, 13
TOK_TABLE_OR_COL 15, 204,204, 13
device_id 15, 204,204, 13
TOK_WHERE 20, 220,387, 13
AND 20, 222,387, 13
AND 19, 258,258, 13
AND 18, 239,239, 13
= 17, 222,224, 19
TOK_TABLE_OR_COL 17, 222,222, 11
yyyymmdd 17, 222,222, 11
'20170221' 17, 224,224, 20
= 18, 241,243, 25
TOK_TABLE_OR_COL 18, 241,241, 17
scene_id 18, 241,241, 17
'band_list' 18, 243,243, 26
= 19, 260,262, 26
TOK_TABLE_OR_COL 19, 260,260, 17
action_id 19, 260,260, 17
'scene_enter' 19, 262,262, 27
TOK_SUBQUERY_EXPR 20, 279,387, 27
TOK_SUBQUERY_OP 20, 281,281, 27
IN 20, 281,281, 27
TOK_QUERY 22, 291,387, 12
TOK_FROM 22, 305,309, 12
TOK_TABREF 22, 307,309, 12
TOK_TABNAME 22, 307,307, 12
jpl_band_orc 22, 307,307, 12
x 22, 309,309, 25
TOK_INSERT 0, -1,385, 0
TOK_DESTINATION 0, -1,-1, 0
TOK_DIR 0, -1,-1, 0
TOK_TMP_FILE 0, -1,-1, 0
TOK_SELECTDI 21, 292,296, 24
TOK_SELEXPR 21, 296,296, 24
TOK_TABLE_OR_COL 21, 296,296, 24
device_id 21, 296,296, 24
TOK_WHERE 26, 318,385, 15
AND 26, 320,385, 15
AND 25, 360,360, 15
AND 24, 339,339, 15
= 23, 320,322, 21
TOK_TABLE_OR_COL 23, 320,320, 13
yyyymmdd 23, 320,320, 13
'20170221' 23, 322,322, 22
= 24, 341,343, 27
TOK_TABLE_OR_COL 24, 341,341, 19
scene_id 24, 341,341, 19
'app_intro' 24, 343,343, 28
= 25, 362,364, 28
TOK_TABLE_OR_COL 25, 362,362, 19
action_id 25, 362,362, 19
'scene_enter' 25, 364,364, 29
= 26, 383,385, 29
TOK_TABLE_OR_COL 26, 383,383, 19
classifier 26, 383,383, 19
'app_intro' 26, 385,385, 30
TOK_TABLE_OR_COL 20, 279,279, 17
device_id 20, 279,279, 17
s 26, 391,391, 46
= 27, 404,412, 24
. 27, 404,406, 13
TOK_TABLE_OR_COL 27, 404,404, 12
i 27, 404,404, 12
device_id 27, 406,406, 14
. 27, 410,412, 27
TOK_TABLE_OR_COL 27, 410,410, 26
s 27, 410,410, 26
device_id 27, 412,412, 28
TOK_INSERT 0, -1,425, 0
TOK_DESTINATION 0, -1,-1, 0
TOK_DIR 0, -1,-1, 0
TOK_TMP_FILE 0, -1,-1, 0
TOK_SELECT 1, 0,23, 7
TOK_SELEXPR 1, 2,4, 7
TOK_ALLCOLREF 1, 2,4, 7
TOK_TABNAME 1, 2,2, 7
i 1, 2,2, 7
TOK_SELEXPR 2, 11,23, 3
TOK_FUNCTION 2, 11,21, 3
from_unixtime 2, 11,11, 3
TOK_FUNCTION 2, 13,18, 17
unix_timestamp 2, 13,13, 17
'20170221' 2, 15,15, 32
'yyyyMMdd' 2, 17,17, 43
"yyyy-MM-dd'T'HH:mm:ssZ" 2, 20,20, 55
bounce_date 2, 23,23, 81
TOK_WHERE 0, 417,425, 0
TOK_FUNCTION 0, 419,425, 0
TOK_ISNULL 0, 425,425, 0
. 28, 419,421, 10
TOK_TABLE_OR_COL 28, 419,419, 9
s 28, 419,419, 9
device_id 28, 421,421, 11
scala.NotImplementedError: No parse rules for ASTNode type: 864, text: TOK_SUBQUERY_EXPR :
" +

Subqueries in the WHERE clause are not supported in Spark 1.6; they are supported as of Spark 2.0.
Reference:
https://issues.apache.org/jira/browse/SPARK-4226

Can I create a dataframe from a few 1d arrays as columns?

Is it possible to create a DataFrame from a few 1-D arrays and place them as columns?
If I create a DataFrame from a single 1-D array, everything is fine:
arr1 = np.array([11, 12, 13, 14, 15])
arr1_arr2_df = pd.DataFrame(data=arr1, index=None, columns=None)
arr1_arr2_df
Out:
0
0 11
1 12
2 13
3 14
4 15
But if I make a DataFrame from 2 arrays, they are placed as rows:
arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])
arr1_arr2_df = pd.DataFrame(data=(arr1,arr2), index=None, columns=None)
arr1_arr2_df
Out:
0 1 2 3 4
0 11 12 13 14 15
1 21 22 23 24 25
I know that I can achieve it by using transpose:
arr1_arr2_df = arr1_arr2_df.transpose()
arr1_arr2_df
Out:
0 1
0 11 21
1 12 22
2 13 23
3 14 24
4 15 25
But is it possible to get it from the start?

You can use a dictionary:
arr1_arr2_df = pd.DataFrame(data={0:arr1,1:arr2})
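
To see why the tuple of arrays came out as rows while the dictionary gives columns, it can help to look at the two orientations without pandas. This is a plain-Python illustration with lists standing in for the NumPy arrays; it does not call pandas itself.

```python
arr1 = [11, 12, 13, 14, 15]
arr2 = [21, 22, 23, 24, 25]

# A sequence of arrays reads as a sequence of rows: 2 rows of 5 values.
as_rows = [arr1, arr2]

# A mapping of label -> array describes columns; zipping the values
# together yields the records that such a column layout contains.
as_columns = {0: arr1, 1: arr2}
records = list(zip(*as_columns.values()))

print(as_rows[0])   # first row: all of arr1
print(records[0])   # first record: one value from each array
```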
