"INSERT INTO ..." with SparkSQL HiveContext - apache-spark

I'm trying to run an insert statement with my HiveContext, like this:
hiveContext.sql('insert into my_table (id, score) values (1, 10)')
The 1.5.2 Spark SQL Documentation doesn't explicitly state whether this is supported or not, although it does support "dynamic partition insertion".
This leads to a stack trace like
AnalysisException:
Unsupported language features in query: insert into my_table (id, score) values (1, 10)
TOK_QUERY 0, 0,20, 0
TOK_FROM 0, -1,20, 0
TOK_VIRTUAL_TABLE 0, -1,20, 0
TOK_VIRTUAL_TABREF 0, -1,-1, 0
TOK_ANONYMOUS 0, -1,-1, 0
TOK_VALUES_TABLE 1, 13,20, 41
TOK_VALUE_ROW 1, 15,20, 41
1 1, 16,16, 41
10 1, 19,19, 44
TOK_INSERT 1, 0,-1, 12
TOK_INSERT_INTO 1, 0,11, 12
TOK_TAB 1, 4,4, 12
TOK_TABNAME 1, 4,4, 12
my_table 1, 4,4, 12
TOK_TABCOLNAME 1, 7,10, 22
id 1, 7,7, 22
score 1, 10,10, 26
TOK_SELECT 0, -1,-1, 0
TOK_SELEXPR 0, -1,-1, 0
TOK_ALLCOLREF 0, -1,-1, 0
scala.NotImplementedError: No parse rules for:
TOK_VIRTUAL_TABLE 0, -1,20, 0
TOK_VIRTUAL_TABREF 0, -1,-1, 0
TOK_ANONYMOUS 0, -1,-1, 0
TOK_VALUES_TABLE 1, 13,20, 41
TOK_VALUE_ROW 1, 15,20, 41
1 1, 16,16, 41
10 1, 19,19, 44
Is there any other way to insert to a Hive table that is supported?

Data can be appended to a Hive table using the append mode on the DataFrameWriter.
data = hc.sql("select 1 as id, 10 as score")
data.write.mode("append").saveAsTable("my_table")
This gives the same result as an insert.
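If you prefer not to go through a SELECT, here is a minimal sketch of the same append built with createDataFrame instead (this assumes hc is an existing HiveContext and that my_table already exists with columns id and score):
from pyspark.sql import Row

row = hc.createDataFrame([Row(id=1, score=10)])  # one-row DataFrame built in Python
row.write.mode("append").saveAsTable("my_table")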

I've had the same problem (Spark 1.5.1) and tried different variations of the statement.
Given
sqlContext.sql("create table my_table(id int, score int)")
The only variants that worked looked like this:
sqlContext.sql("insert into table my_table select t.* from (select 1, 10) t")
sqlContext.sql("insert into my_table select t.* from (select 2, 20) t")

The accepted answer's saveAsTable fails for me with an AnalysisException (I don't understand why). What works for me instead is:
data = hc.sql("select 1 as id, 10 as score")
data.write.mode("append").insertInto("my_table")
I'm using Spark v2.1.0.
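For reference, a minimal Spark 2.x sketch of the same approach, assuming a SparkSession named spark with Hive support enabled and an existing table my_table(id int, score int); note that insertInto matches columns by position, not by name:
data = spark.createDataFrame([(1, 10)], ["id", "score"])  # one-row DataFrame
data.write.mode("append").insertInto("my_table")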

You tried to perform something that the data file format cannot support, hence the Unsupported language features in query exception.
Many data file formats are write-once and do not support ACID operations.
Apache ORC supports ACID operations if you need them.
Instead, you can use partitioning to split your data into folders (/data/year=2017/month=10....); that way you can append/insert data into your data lake.
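A minimal sketch of that folder-partitioning idea; the path and partition columns here are illustrative, not taken from the question:
# append new rows into /data/year=.../month=... folders
df.write.mode("append").partitionBy("year", "month").parquet("/data")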

Try this: hiveContext.sql("insert into table my_table select 1, 10")
If you haven't changed your dynamic partition mode to nonstrict, you have to do this first: hiveCtx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
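Putting those two together, a minimal sketch of an insert into a partitioned Hive table with dynamic partitioning enabled (assuming hiveContext is your HiveContext; the table name my_partitioned_table and partition column dt are made up for illustration):
# enable dynamic partitioning before the insert
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
# in a dynamic partition insert the partition column (dt) must come last in the select
hiveContext.sql("insert into table my_partitioned_table partition (dt) select 1, 10, '2015-11-23'")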

When you do this for the first time
data.write.mode("append").saveAsTable("my_table")
you should replace "append" with "overwrite". Then you can use "append".
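A minimal sketch of that sequence, assuming data is an existing DataFrame and my_table does not exist yet:
data.write.mode("overwrite").saveAsTable("my_table")   # first run: creates the table
data.write.mode("append").saveAsTable("my_table")      # later runs: append to it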

Related

Understanding Rolling Windows in Python Pandas

I am trying to get the 6th element value as True but get NaN instead. I have an example based on Excel. When I try a rolling window of 6, I get NaN for the 6th record, but I should get False instead. However, when I try a rolling window of 5, everything seems to work. I want to understand what is actually happening and what the best way is to express "sum product of 6 elements" as a rolling window of 6 instead of 5.
Objective : Six points in a row, all increasing or all decreasing
Code I am trying
def condition(x):
    if x.tolist()[-1] != 0:
        if (sum(x.tolist()) >= 5 or sum(x.tolist()) <= -5):
            return 1
        else:
            return 0
    else:
        return 0

df_in['I GET'] = df_in[['lead_one']].rolling(window=6).apply(condition, raw=False)
The Tag column shows what is expected.
When you use a rolling window of 6, it takes the current value + the previous 5 values. Then you try to sum those 6 values. I say try, because if there's any nan value in there, ordinary python summing will also give you an na value.
That's also why .rolling(window=5) works: it gets the current value + 4 previous values and since they don't contain any nan values, you actually get a summed value one row earlier
You could use a different kind of summing: np.nansum()
Or use pandas summing where you specify to skip the na's, something like: df['column'].sum(skipna=True)
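For example, a hedged variant of the original condition() that uses np.nansum() so NaNs in the window are ignored (the rest of the logic is unchanged):
import numpy as np

def condition(x):
    total = np.nansum(x)  # NaN-safe sum of the rolling window
    if x.tolist()[-1] != 0 and (total >= 5 or total <= -5):
        return 1
    return 0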
However, looking at your code, I think it could be improved so you don't get the na's in the first place. Here's an example using np.select() and np.where():
import numpy as np
import pandas as pd

# create example dataframe
df = pd.DataFrame(
    data=[10, 10, 12, 13, 14, 15, 16, 17, 17, 10, 9],
    columns=['value']
)

# create an if/then using np.select: 1 if increasing, 0 if equal, -1 if decreasing
df['n > n+1'] = np.select(
    [df['value'] > df['value'].shift(1),
     df['value'] == df['value'].shift(1),
     df['value'] < df['value'].shift(1)],
    [1, 0, -1]
)

# take the absolute value of the rolling sum of the last 6 values and check if >= 5
df['I GET'] = np.where(
    np.abs(df['n > n+1'].rolling(window=6).sum()) >= 5, 1, 0)

Structural Question Regarding pandas .drop method

df2=df.drop(df[df['issue']=="prob"].index)
df2.head()
The code immediately below works fine.
But why is there a need to type df[df[ rather than the below?
df2=df.drop((df['issue']=="prob").index)
df2.head()
I know that the immediately above won't work while the former does. I would like to understand why or know what exactly I should google.
Also ~ any advice on a more relevant title would be appreciated.
Thanks!
Option 1: df[df['issue']=="prob"] produces a DataFrame with a subset of values.
Option 2: df['issue']=="prob" produces a pandas.Series with a Boolean for every row.
.drop works for Option 1, because it knows to just drop the selected indices, vs. all of the indices returned from Option 2.
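A quick way to see the difference for yourself (this assumes a DataFrame df with an 'issue' column, as in the question):
mask = df['issue'] == "prob"   # boolean Series; its index covers every row of df
subset = df[mask]              # DataFrame containing only the matching rows

print(type(mask), len(mask.index))      # Series, as many entries as rows in df
print(type(subset), len(subset.index))  # DataFrame, only the "prob" rows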
I would use the following methods to remove rows.
Use ~ (not) to select the opposite of the Boolean selection.
df = df[~(df.treatment == 'Yes')]
Select rows with only the desired value
df = df[(df.treatment == 'No')]
import pandas as pd
import numpy as np
import random
from datetime import datetime

# sample dataframe
np.random.seed(365)
random.seed(365)
rows = 25
data = {'a': np.random.randint(10, size=(rows)),
        'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)],
        'treatment': [random.choice(['Yes', 'No']) for _ in range(rows)],
        'date': pd.bdate_range(datetime.today(), freq='d', periods=rows).tolist()}
df = pd.DataFrame(data)
df[df.treatment == 'Yes'].index
Produces just the indices where treatment is 'Yes', therefore df.drop(df[df.treatment == 'Yes'].index) only drops the indices in the list.
df[df.treatment == 'Yes'].index
[out]:
Int64Index([0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, 15, 19, 21], dtype='int64')
df.drop(df[df.treatment == 'Yes'].index)
[out]:
a groups treatment date
3 5 6-25 No 2020-08-15
5 2 500-1000 No 2020-08-17
9 0 500-1000 No 2020-08-21
10 3 100-500 No 2020-08-22
16 8 1-5 No 2020-08-28
17 4 1-5 No 2020-08-29
18 3 1-5 No 2020-08-30
20 6 500-1000 No 2020-09-01
22 6 6-25 No 2020-09-03
23 8 100-500 No 2020-09-04
24 9 26-100 No 2020-09-05
(df.treatment == 'Yes').index
Produces all of the indices, therefore df.drop((df.treatment == 'Yes').index) drops all of the indices, leaving an empty dataframe.
(df.treatment == 'Yes').index
[out]:
RangeIndex(start=0, stop=25, step=1)
df.drop((df.treatment == 'Yes').index)
[out]:
Empty DataFrame
Columns: [a, groups, treatment, date]
Index: []

How to filter a time series if data exists at least every 6 hours?

I'd like to verify if there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
essentially a filter: "if ID's data not at least every 6h, drop id from dataframe"
I tried to use the same method as for filtering once per day, but I am having trouble adapting the code.
# add day column from datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count per ID per day. Result is per ID data of non-zero
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.merge(df, on="id", how='right')
I cannot figure out how to adapt this for the following 6hr periods each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.
Group by ID, integer-divide the hour by 6, and count the unique values. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours and each day has 4 unique bins, i.e.
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
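A small sketch to sanity-check the mask; the column names id and date are assumed from the question:
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'b'],
    'date': pd.to_datetime(['2020-01-01 01:00', '2020-01-01 07:00',
                            '2020-01-01 13:00', '2020-01-01 19:00',
                            '2020-01-01 01:00'])})
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
print(df[mask])  # keeps only id 'a', which has data in all four 6-hour bins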
I propose to use pivot_table with resample, which allows changing to arbitrary frequencies. Please see the comments for further explanations.
from datetime import datetime
import pandas as pd

# build test data. I need a dummy column to use pivot_table later. Any column with numerical values will suffice
data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1],
        ]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')
# We need a helper dataframe df_tmp.
# Transform id entries to columns. resample with 6h = 360 minutes = 360T.
# Take mean() because it will produce nan values
# WARNING: It will only work if you have at least one id with observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()
# Drop MultiColumnHierarchy and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)
# Filter values in the original dataframe to the ids that have data in every 6h bin
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]
I kept your requirements on timestamps but I believe you want to use the commented lines in my solution.
import pandas as pd

period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])

id_with_data = set(df['ID'])
for k in range(4):  # for a day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift

df = df.loc[df['ID'].isin(list(id_with_data))]

Pandas groupby dataframe then return single value result (sum, total)

Dears,
Please help me, I am stuck. I guess it should not be difficult, but I feel overwhelmed.
I need to make an ageing of receivables, so they must be separated into different buckets.
Suppose we have only 3 groups: current, above_10Days and above_20Days, and the following table:
d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
I need to group the amounts to a total sum, depending on the Ageing group.
Example:
Current: 4499573, etc.
For that purpose, I tried to group the receivables with such code:
above_10Days = df.groupby((df['Days'] > 0) & (df['Days'] <= 10))
above10sum = above_10Days.Amount.sum().iloc[1]
It works perfectly, but only when there are actual amounts in this group.
When there is no such A/R, it throws an exception and stops executing. I tried to use a function or to convert the 'None' value to 0, but with no success.
Hopefully someone knows the solution.
Thanks in advance
IIUC:
import pandas as pd
import numpy as np

d = {'Cust': ['Dfg', 'Ers', 'Dac', 'Vds', 'Mhf', 'Kld', 'Xsd', 'Hun'],
     'Amount': [10000, 100000, 4000, 5411, 756000, 524058, 4444785, 54788],
     'Days': [150, 21, 30, 231, 48, 15, -4, -14]}
df = pd.DataFrame(d)

# Updated to assign to output dataframe
df_out = (df.groupby(pd.cut(df.Days,
                            [-np.inf, 10, 20, np.inf],
                            labels=['Current', 'Above 10 Days', 'Above 20 Days']))['Amount']
            .sum())
Output:
Days
Current 4499573
Above 10 Days 524058
Above 20 Days 875411
Name: Amount, dtype: int64
Variable assignment using .loc:
varCurrent = df_out.loc['Current']
var10 = df_out.loc['Above 10 Days']
var20 = df_out.loc['Above 20 Days']
print(varCurrent,var10,var20)
Output:
4499573 524058 875411

Why do all data end up in one partition after reduceByKey?

I have Spark application which contains the following segment:
val repartitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate)
val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)
When I run this with some logging this is what I see:
repartitioned ======> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ======> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
My logging is done using these two lines:
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ======> [${sizes.collect().toList}]")
My question is why does my data end up in one partition after the reduceByKey? After the filter it can be seen that the data is evenly distributed, but the reduceByKey results in data in only one partition.
I am guessing all your processing times are the same.
Alternatively, their hashCodes (from the DateTime class) are the same. Is that a custom class?
I will answer my own question, since I figured it out. My DateTimes were all without seconds and milliseconds, since I wanted to group data belonging to the same minute. The difference between the hashCode() values of Joda DateTimes that are one minute apart is a constant:
scala> val now = DateTime.now
now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
res42: Int = 60000
As can be seen by this example, if the hashCode values are similarly spaced, they can end up in the same partition:
scala> val nums = for(i <- 0 to 1000000) yield ((i*20 % 1000), i)
nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
scala> val rddNum = sc.parallelize(nums)
rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val reducedNum = rddNum.reduceByKey(_+_)
reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at <console>:25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
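The arithmetic behind both examples: Spark's default HashPartitioner sends a key to partition key.hashCode % numPartitions (made non-negative), so hash codes spaced by a multiple of the partition count all collapse into a single partition. A plain-Python sketch (not Spark code) for the 60000 spacing and 16 partitions above:
num_partitions = 16
hash_codes = [12345 + i * 60000 for i in range(10)]  # minute-apart keys; 12345 is an arbitrary base
print({h % num_partitions for h in hash_codes})      # one partition id for every key: {9}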
To distribute my data more evenly across the partitions I created my own custom Partitioner:
import org.apache.spark.Partitioner
import org.joda.time.DateTime

class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
  def numPartitions: Int = rddNumPartitions

  def getPartition(key: Any): Int = {
    key match {
      case dateTime: DateTime =>
        val sum = dateTime.getYear + dateTime.getMonthOfYear + dateTime.getDayOfMonth + dateTime.getMinuteOfDay + dateTime.getSecondOfDay
        sum % numPartitions
      case _ => 0
    }
  }
}
