Spark 2.1 Hive Partition Adding Issue ORC Format - apache-spark

I am using pyspark 2.1 to create partitions dynamically from table A to table B. Below are the DDLs:
create table A (
objid bigint,
occur_date timestamp)
STORED AS ORC;
create table B (
objid bigint,
occur_date timestamp)
PARTITIONED BY (
occur_date_pt date)
STORED AS ORC;
I am then using pyspark code to determine which partitions need to be merged. Below is the portion of code where I am actually doing that:
for row in incremental_df.select(partitioned_column).distinct().collect():
    path = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)

refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
sqlContext.sql(refresh_metadata_sql)
On execution of the code I am able to see the partitions in HDFS:
Found 3 items
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03
But when I try to access the table from Spark I get an IndexOutOfBoundsException:
>>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
Any help or pointers to resolve the issue would be appreciated.

Posting the comment as an answer for easier reference:
Please ensure that the partition column is not included in the dataframe that is written out.
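In other words, drop the partition column before writing into the partition directory: the partition value is already encoded in the path, and an extra column makes the ORC file schema disagree with the table schema. A minimal sketch of that change, reusing the variable names from the question:
# Drop the partition column so the ORC files contain only objid and occur_date,
# matching the non-partition columns of table B.
output_df.drop(partitioned_column) \
    .write.option("compression", "none") \
    .mode("overwrite").format("orc").save(path)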

Related

spark bucketing same partitions on same executor

I have two tables and I am converting them into dataframes; df1 has 3 partitions and df2 has 3 partitions.
df1=spark.table("table1")
df2=spark.table("table2")
df1=[[1,2,3],[3,2,1],[2,3,1]]
df2=[[3,2,1],[2,3,1],[1,2,3]]
I am applying bucketing on these two dataframes, so the partitions will look like this:
Dataframe 1
partition1 :[1,1,1]
partition2 :[2,2,2]
partition3 :[3,3,3]
Dataframe 2
partition1 :[1,1,1]
partition2 :[2,2,2]
partition3 :[3,3,3]
How does Spark launch partition1 of df1 and partition1 of df2 on the same executor?
Likewise, how are partition2 and partition3 of df1 and partition2 and partition3 of df2 placed on the same executor?
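For reference, a minimal sketch of how two tables can be written with matching bucketing so that a later join on the bucket column needs no shuffle (table and column names here are illustrative, not taken from the question):
# Hypothetical example: both tables are bucketed by the join key into the
# same number of buckets and saved as metastore tables.
spark.range(100).withColumnRenamed("id", "key") \
    .write.bucketBy(3, "key").sortBy("key") \
    .mode("overwrite").saveAsTable("table1_bucketed")
spark.range(100).withColumnRenamed("id", "key") \
    .write.bucketBy(3, "key").sortBy("key") \
    .mode("overwrite").saveAsTable("table2_bucketed")
# With matching bucket counts, a single task reads the corresponding bucket
# from both tables, so the join is planned without an exchange step.
joined = spark.table("table1_bucketed").join(spark.table("table2_bucketed"), "key")
joined.explain()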

What does "PER PARTITION LIMIT" means in cql query in cassandra?

I have a scylla table as shown below:
cqlsh:sampleks> describe table test;
CREATE TABLE test (
client_id int,
when timestamp,
process_ids list<int>,
md text,
PRIMARY KEY (client_id, when) ) WITH CLUSTERING ORDER BY (when DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 172800
AND max_index_interval = 1024
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
And I see this is how we are querying it. It has been a long time since I worked on Cassandra, so this PER PARTITION LIMIT is a new thing to me (it looks like it was added recently). Can someone explain what it does, with an example, in layman's terms? I couldn't find any good documentation that explains it simply.
SELECT * FROM test WHERE client_id IN ? PER PARTITION LIMIT 1;
The PER PARTITION LIMIT clause can be helpful in a "wide partition" scenario: it caps the number of rows returned from each partition, rather than from the result set as a whole.
Take this query:
aploetz#cqlsh:stackoverflow> SELECT client_id,when,md
FROM test PER PARTITION LIMIT 2 ;
Considering the PRIMARY KEY definition of (client_id, when), that query will iterate over each client_id. Cassandra will then return only the first two rows (clustered by when) from that partition, regardless of how many occurrences of when may be present.
In this case, I inserted 7 rows into your test table, using two different client_ids (2 partitions total). Using a PER PARTITION LIMIT of 2, I get 4 rows returned (2 client_ids x PER PARTITION LIMIT 2 = 4 rows).
client_id | when | md
-----------+---------------------------------+-----
1 | 2020-05-06 12:00:00.000000+0000 | md1
1 | 2020-05-05 22:00:00.000000+0000 | md1
2 | 2020-05-06 19:00:00.000000+0000 | md2
2 | 2020-05-06 01:00:00.000000+0000 | md2
(4 rows)

Cassandra Partition key duplicates?

I am new to Cassandra, so I have a few quick questions. Suppose I do this:
CREATE TABLE my_keyspace.my_table (
id bigint,
year int,
datetime timestamp,
field1 int,
field2 int,
PRIMARY KEY ((id, year), datetime))
I imagine Cassandra as something like Map<PartitionKey, SortedMap<ColKey, ColVal>>.
My question is: when querying Cassandra using a WHERE clause, it will be something like:
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4,
This could return 2 or more records, so how does this fit in with the data model of Cassandra?
If it really is a big HashMap, how come duplicate records for a partition key are allowed?
Thanks!
Each row is stored as a batch of entries in the SortedMap<ColKey, ColVal>, relying on its sorted nature.
To build on your mental model: while there is only 1 partition key for id = 1 AND year = 4, there are multiple cells:
(id, year) | ColKey | ColVal
------------------------------------------
1, 4 | datetime(1):field1 | 1 \ Row1
1, 4 | datetime(1):field2 | 2 /
1, 4 | datetime(5):field1 | 1 \
1, 4 | datetime(5):field2 | 2 / Row2
...
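To make that mental model concrete, here is a toy Python sketch (purely illustrative; it is not how Cassandra stores data on disk):
from datetime import datetime
# One partition key maps to many rows, keyed by the clustering column
# (the "when" timestamp). Cassandra keeps these sorted; a plain dict
# stands in for that here.
table = {
    (1, 4): {                                              # partition key (id, year)
        datetime(2021, 1, 1): {"field1": 1, "field2": 2},  # row 1
        datetime(2021, 1, 5): {"field1": 1, "field2": 2},  # row 2
    }
}
# "SELECT * ... WHERE id = 1 AND year = 4" returns every clustered row
# stored under that single partition key.
rows = table[(1, 4)]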

Populating pandas column based on moving date range (efficiently)

I have 2 pandas dataframes, one of them contains dates with measurements, and the other contains dates with an event ID.
df1
from datetime import datetime as dt
from datetime import timedelta
import pandas as pd
import numpy as np
today = dt.now()
ndays = 10
df1 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays))})
df1.Date = df1.Date.dt.date
Date measurement
2018-01-10 8
2018-01-11 2
2018-01-12 7
2018-01-13 3
2018-01-14 1
2018-01-15 1
2018-01-16 6
2018-01-17 9
2018-01-18 8
2018-01-19 4
df2
df2 = pd.DataFrame({'Date': ['2018-01-11', '2018-01-14', '2018-01-16', '2018-01-19'], 'event_id': ['event_a', 'event_b', 'event_c', 'event_d']})
df2.Date = pd.to_datetime(df2.Date, format = '%Y-%m-%d')
df2.Date = df2.Date.dt.date
Date event_id
2018-01-11 event_a
2018-01-14 event_b
2018-01-16 event_c
2018-01-19 event_d
I want to give each date in df1 an event_id from df2, but only if it falls between two event dates. The resulting dataframe would look something like:
df3
today = dt.now()
ndays = 10
df3 = pd.DataFrame({'Date': [today + timedelta(days = x) for x in range(ndays)], 'measurement': pd.Series(np.random.randint(1, high = 10, size = ndays)), 'event_id': ['event_a', 'event_a', 'event_b', 'event_b', 'event_b', 'event_c', 'event_c', 'event_d', 'event_d', 'event_d']})
df3.Date = df3.Date.dt.date
Date event_id measurement
2018-01-10 event_a 4
2018-01-11 event_a 2
2018-01-12 event_b 1
2018-01-13 event_b 5
2018-01-14 event_b 5
2018-01-15 event_c 4
2018-01-16 event_c 6
2018-01-17 event_d 6
2018-01-18 event_d 9
2018-01-19 event_d 6
The code I use to achieve this is:
n = 1
while n <= len(list(df2.Date)) - 1:
    for date in list(df1.Date):
        if date <= df2.iloc[n].Date and (date > df2.iloc[n-1].Date):
            df1.loc[df1.Date == date, 'event_id'] = df2.iloc[n].event_id
    n += 1
The dataset that I am working with is significantly larger than this (a few million rows) and this method runs far too long. Is there a more efficient way to accomplish this?
So there are quite a few things to improve performance.
The first question I have is: does it have to be a pandas frame to begin with? That is, can't df1 and df2 just be lists of tuples or lists of lists?
The thing is that pandas adds significant overhead when accessing items, and especially when setting values individually.
Pandas excels when it comes to vectorized operations but I don't see an efficient alternative right now (maybe someone comes up with such an answer, that would be ideal).
Now what I'd do is:
Convert your df1 and df2 to records, e.g. d1 = df1.to_records(index=False); what you get is an array of tuples with basically the same structure as the dataframe (index=False keeps the field positions aligned with the columns).
Now run your algorithm, but instead of operating on pandas dataframes you operate on the arrays of tuples d1 and d2.
Use a third list of tuples, d3, to store the newly created data (each tuple is a row).
Now if you want you can convert d3 back to a pandas dataframe:
df3 = pd.DataFrame.from_records(d3, **myKwArgs)
This will speed up your code significantly, I'd assume by more than 100-1000%. It does increase memory usage, though, so if you are low on memory try to avoid the pandas dataframes altogether, or dereference the unused frames df1 and df2 once you have used them to create the records (and if you run into problems, call gc manually).
EDIT: Here is a version of your code using the procedure above:
d3 = []
n = 1
while n < len(d2):
    for i in range(len(d1)):
        date = d1[i][0]
        if date <= d2[n][0] and date > d2[n-1][0]:
            d3.append((date, d2[n][1], d1[i][1]))
    n += 1
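Pulling the surrounding steps together, a minimal sketch of the conversions around that loop (column names are assumptions based on the question):
import pandas as pd
# Before the loop: plain record arrays, index dropped so that position 0 is
# the Date and position 1 the measurement / event_id.
d1 = df1.to_records(index=False)
d2 = df2.to_records(index=False)
# ... run the loop above to fill d3 ...
# After the loop: back to a DataFrame, naming the columns explicitly.
df3 = pd.DataFrame.from_records(d3, columns=['Date', 'event_id', 'measurement'])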
You can try the df.apply() method to achieve this; refer to pandas.DataFrame.apply. I think my code will run faster than yours.
My approach:
Merge the two dataframes df1 and df2 and create a new one, df3, with:
df3 = pd.merge(df1, df2, on='Date', how='outer')
Sort df3 by date to make it easy to traverse:
df3['Date'] = pd.to_datetime(df3.Date)
df3 = df3.sort_values(by='Date')
Create a set_event_date() function to apply to each row in df3:
new_event_id = np.nan
def set_event_date(df3):
    global new_event_id
    if df3.event_id is not np.nan:
        new_event_id = df3.event_id
    return new_event_id
Apply set_event_date() to each row in df3:
df3['new_event_id'] = df3.apply(set_event_date,axis=1)
Final Output will be:
Date Measurement New_event_id
0 2018-01-11 2 event_a
1 2018-01-12 1 event_a
2 2018-01-13 3 event_a
3 2018-01-14 6 event_b
4 2018-01-15 3 event_b
5 2018-01-16 5 event_c
6 2018-01-17 7 event_c
7 2018-01-18 9 event_c
8 2018-01-19 7 event_d
9 2018-01-20 4 event_d
Let me know once you have tried my solution and whether it runs faster than yours.
Thanks.
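For completeness, a vectorized alternative that neither answer uses: pandas.merge_asof with direction='forward' assigns each measurement date the first event whose date is greater than or equal to it, which matches the desired df3. A sketch, assuming the column names from the question and that both frames fit in memory:
import pandas as pd
# merge_asof needs datetime64 keys and both sides sorted on the key.
left = df1.assign(Date=pd.to_datetime(df1.Date)).sort_values('Date')
right = df2.assign(Date=pd.to_datetime(df2.Date)).sort_values('Date')
# For every measurement date, pick the first event dated on or after it.
df3 = pd.merge_asof(left, right, on='Date', direction='forward')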

Using sc.parallelize inside map() or any other solution?

I have the following issue: I need to find all combinations of the values in column B for each id from column A and return the results as a DataFrame.
In example below of the input DataFrame
A B
0 5 10
1 1 20
2 1 15
3 3 50
4 5 14
5 1 30
6 1 15
7 3 33
I need to get the following output DataFrame (it is for GraphX/GraphFrames):
src dist A
0 10 14 5
1 50 33 3
2 20 15 1
3 30 15 1
4 20 30 1
The only solution I have come up with so far is:
df_result = df.drop_duplicates().\
map(lambda (A,B):(A,[B])).\
reduceByKey(lambda p, q: p + q).\
map(lambda (A,B_values_array):(A,[k for k in itertools.combinations(B_values_array,2)]))
print df_result.take(3)
output: [(1, [(20,15),(30,20),(30,15)]),(5,[(10,14)]),(3,[(50,33)])]
And here I'm stuck :( How can I get this back into the DataFrame that I need? One idea was to use parallelize:
import spark_sc
edges = df_result.map(lambda (A,B_pairs): spark_sc.sc.parallelize([(k[0],k[1],A) for k in B_pairs]))
For spark_sc I have another file named spark_sc.py:
from pyspark import SparkContext
from pyspark.sql import SQLContext

def init():
    global sc
    global sqlContext
    # conf is assumed to be defined elsewhere in this module
    sc = SparkContext(conf=conf,
                      appName="blablabla",
                      pyFiles=['my_file_with_code.py'])
    sqlContext = SQLContext(sc)
but my code failed:
AttributeError: 'module' object has no attribute 'sc'
If I use spark_sc.sc outside of map() it works.
Any idea what I am missing in the last step? Is it possible at all to use parallelize() here, or do I need a completely different solution?
Thanks!
You definitely need another solution which could be as simple as:
from pyspark.sql.functions import greatest, least, col
df.alias("x").join(df.alias("y"), ["A"]).select(
least("x.B", "y.B").alias("src"), greatest("x.B", "y.B").alias("dst"), "A"
).where(col("src") != col("dst")).distinct()
where:
df.alias("x").join(df.alias("y"), ["A"])
joins table with itself by A,
least("x.B", "y.B").alias("src")
and
greatest("x.B", "y.B")
choose the lower value as the source and the higher value as the destination. Finally:
where(col("src") != col("dst"))
drops self loops.
In general it is not possible to use SparkContext from an action or a transformation (not that it would make any sense to do this in your case).
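As a follow-up, since the question mentions GraphFrames: the result of that self-join can be used directly as the edge DataFrame, because it already has src and dst columns. A hypothetical sketch, assuming the graphframes package is available:
from pyspark.sql.functions import greatest, least, col
from graphframes import GraphFrame
# GraphFrames expects vertices with an "id" column and edges with "src"/"dst".
edges = df.alias("x").join(df.alias("y"), ["A"]).select(
    least("x.B", "y.B").alias("src"), greatest("x.B", "y.B").alias("dst"), "A"
).where(col("src") != col("dst")).distinct()
vertices = df.select(col("B").alias("id")).distinct()
g = GraphFrame(vertices, edges)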
