This statement outputs the partition ID and the number of records in each partition:
data_frame.toDF().withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().orderBy(asc("count")).show()
+-----------+-----+
|partitionId|count|
+-----------+-----+
| 3| 22|
+-----------+-----+
This statement outputs number of partitions:
logger.warning('Num partitions: %s', data_frame.toDF().rdd.getNumPartitions())
WARNING:root:Num partitions: 4
Shouldn't both report the same number of partitions? The first result shows only one partition, while the second says there are 4.
Spark actually created 4 partitions, but 3 of them are empty, so they contribute no rows to the groupBy count. You can see the structure with glom():
logger.warning("Partitions structure: {}".format(dynamic_frame.toDF().rdd.glom().collect()))
Partitions structure: [[Row(.....), Row(...)], [], [], []]
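If the empty partitions are a problem, you can redistribute the rows with repartition(). A minimal sketch, assuming data_frame is the same Glue DynamicFrame as above and that spark_partition_id and asc come from pyspark.sql.functions:
from pyspark.sql.functions import spark_partition_id, asc

# shuffle the data into 4 roughly equal partitions
df = data_frame.toDF().repartition(4)

# each partition should now report a similar row count
df.withColumn("partitionId", spark_partition_id()) \
    .groupBy("partitionId").count() \
    .orderBy(asc("count")) \
    .show()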
I have 2 PySpark DataFrames.
I am looking for a way to join df1 with df2: a left join that keeps only the first row from df2 for each ID.
df1:
ID string
1 sfafsda
2 trwe
3 gfdgsd
df2
ID address state
1 Montreal Quebec
1 Quebec Quebec
2 Trichy TN
2 Madurai TN
3 Bangalore KN
3 Mysore KN
3 Hosur KN
Expected output from join:
ID string address state
1 sfafsda Montreal Quebec
2 trwe Trichy TN
3 gfdgsd Bangalore KN
As I am working on Databricks, please let me know whether it's easier to implement this first-row left join in PySpark, or whether a SQL join can achieve the expected output. Thanks.
Yes, it's possible using PySpark, but you need to add an index column to df2 so that "first row" is well defined. See the code below:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# add an index so the "first" row per ID is deterministic
df2 = df2.withColumn('index', F.monotonically_increasing_id())

df1.join(df2, 'ID', 'left') \
    .select('*', F.first(F.array('address', 'state')).over(Window.partitionBy('ID').orderBy('index')).alias('array')) \
    .select('ID', 'string', F.col('array')[0].alias('address'), F.col('array')[1].alias('state')) \
    .groupBy('ID', 'string') \
    .agg(F.first('address').alias('address'), F.first('state').alias('state')) \
    .orderBy('ID') \
    .show()
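An alternative sketch that avoids the window-plus-groupBy combination: reduce df2 to its first row per ID before joining, using row_number() over the same ordering (this assumes the index column added above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('ID').orderBy('index')

# keep only the first row per ID in df2, then do a plain left join
df2_first = df2.withColumn('rn', F.row_number().over(w)) \
    .filter(F.col('rn') == 1) \
    .drop('rn', 'index')

df1.join(df2_first, 'ID', 'left').orderBy('ID').show()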
Looking to find the number of rows in df2 whose ID does not appear in df1.
df1:
ID
0 7878aa
1 6565dd
2 9899ad
3 4158hf
4 4568fb
5 6877gh
df2:
ID
0 4568fb <-is in df1
1 9899ad <-is in df1
2 6877gh <-is in df1
3 9874ad <-not in df1
4 8745ag <-not in df1
desired output:
2
My code:
len(df1['ID'].isin(df2['ID']) == False)
My code ends up showing the total length of the DataFrame, which is 6. How do I count only the rows of df2 that are not in df1?
Thanks!
Use isin with negation, then sum the resulting boolean mask:
(~df2['ID'].isin(df1['ID'])).sum()
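A minimal reproducible sketch of the same idea, using the values from the question:
import pandas as pd

df1 = pd.DataFrame({'ID': ['7878aa', '6565dd', '9899ad', '4158hf', '4568fb', '6877gh']})
df2 = pd.DataFrame({'ID': ['4568fb', '9899ad', '6877gh', '9874ad', '8745ag']})

# boolean mask of df2 rows whose ID is NOT in df1, summed to get the count
count = (~df2['ID'].isin(df1['ID'])).sum()
print(count)  # 2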
I have the following dataframes:
> df1
id begin conditional confidence discoveryTechnique
0 278 56 false 0.0 1
1 421 18 false 0.0 1
> df2
concept
0 A
1 B
How do I merge on the indices to get:
id begin conditional confidence discoveryTechnique concept
0 278 56 false 0.0 1 A
1 421 18 false 0.0 1 B
I ask because my understanding is that merge(), i.e. df1.merge(df2), uses columns to do the matching. In fact, doing this I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 4618, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 58, in merge
copy=copy, indicator=indicator)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 491, in __init__
self._validate_specification()
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 812, in _validate_specification
raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on
Is it bad practice to merge on index? Is it impossible? If so, how can I shift the index into a new column called "index"?
Use merge, which is an inner join by default:
pd.merge(df1, df2, left_index=True, right_index=True)
Or join, which is a left join by default:
df1.join(df2)
Or concat, which is an outer join by default:
pd.concat([df1, df2], axis=1)
Samples:
import pandas as pd

df1 = pd.DataFrame({'a': range(6),
                    'b': [5, 3, 6, 9, 2, 4]}, index=list('abcdef'))
print (df1)
a b
a 0 5
b 1 3
c 2 6
d 3 9
e 4 2
f 5 4
df2 = pd.DataFrame({'c': range(4),
                    'd': [10, 20, 30, 40]}, index=list('abhi'))
print (df2)
c d
a 0 10
b 1 20
h 2 30
i 3 40
# Default inner join
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print (df3)
a b c d
a 0 5 0 10
b 1 3 1 20
# Default left join
df4 = df1.join(df2)
print (df4)
a b c d
a 0 5 0.0 10.0
b 1 3 1.0 20.0
c 2 6 NaN NaN
d 3 9 NaN NaN
e 4 2 NaN NaN
f 5 4 NaN NaN
# Default outer join
df5 = pd.concat([df1, df2], axis=1)
print (df5)
a b c d
a 0.0 5.0 0.0 10.0
b 1.0 3.0 1.0 20.0
c 2.0 6.0 NaN NaN
d 3.0 9.0 NaN NaN
e 4.0 2.0 NaN NaN
f 5.0 4.0 NaN NaN
h NaN NaN 2.0 30.0
i NaN NaN 3.0 40.0
You can use pd.concat([df1, df2, ...], axis=1) in order to concatenate two or more DataFrames aligned by their indexes:
pd.concat([df1, df2, df3, ...], axis=1)
Or merge for concatenating by custom fields / indexes:
# join by _common_ columns: `col1`, `col3`
pd.merge(df1, df2, on=['col1','col3'])
# join by: `df1.col1 == df2.index`
pd.merge(df1, df2, left_on='col1', right_index=True)
or join for joining by index:
df1.join(df2)
By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes an iterable of DataFrames, so it cannot take DataFrames directly (use [df1, df2])
dimensions of the DataFrames should match along the concatenation axis
join and pd.merge:
can take DataFrame arguments directly
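For example, joining more than two frames at once; a small self-contained sketch with hypothetical frames:
import pandas as pd

# hypothetical frames sharing part of their index
df1 = pd.DataFrame({'a': [1, 2]}, index=list('xy'))
df2 = pd.DataFrame({'b': [3, 4]}, index=list('xy'))
df3 = pd.DataFrame({'c': [5, 6]}, index=list('xz'))

# concat needs an iterable of DataFrames (outer join on index by default)
wide = pd.concat([df1, df2, df3], axis=1)

# join accepts a single DataFrame or a list of them (left join on df1's index)
wide2 = df1.join([df2, df3])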
This question has been answered for a while, and all the available options are already out there. However, in this answer I'll attempt to shed a bit more light on those options, to help you understand when to use what.
This post will go through the following topics:
Merging with index under different conditions
options for index-based joins: merge, join, concat
merging on indexes
merging on index of one, column of other
effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
DataFrame.join (joins on index)
pd.concat (joins on index)
merge
    PROS: supports inner/left/right/full joins; supports column-column, index-column, and index-index joins
    CONS: can only join two frames at a time
join
    PROS: supports inner/left (default)/right/full joins; can join multiple DataFrames at a time
    CONS: only supports index-index joins
concat
    PROS: specializes in joining multiple DataFrames at a time; very fast (concatenation is linear time)
    CONS: only supports inner/full (default) joins; only supports index-index joins
Index to index joins
Typically, an inner join on index would look like this:
left.merge(right, left_index=True, right_index=True)
Other types of joins (left, right, outer) follow similar syntax (and can be controlled using how=...).
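For example, the same call with how= produces a left join instead:
left.merge(right, left_index=True, right_index=True, how='left')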
Notable Alternatives
DataFrame.join defaults to a left outer join on the index.
left.join(right, how='inner')
If you happen to get ValueError: columns overlap but no suffix specified, you will need to pass the lsuffix and rsuffix arguments to resolve it. Since the column names are the same, a differentiating suffix is required.
pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default.
pd.concat([left, right], axis=1, sort=False)
For more information on concat, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
left.merge(right, left_index=True, right_on='key')
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple levels/columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
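A minimal sketch of that pre-step, assuming right has a column named 'key' (the same name used in the merge example above) that should align with left's index:
# move 'key' into the index of right so that join can align on it
result = left.join(right.set_index('key'), how='inner')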
This post is an abridged version of my work in Pandas Merging 101. Please follow this link for more examples and other topics on merging.
A silly bug that got me: the joins failed because index dtypes differed. This was not obvious as both tables were pivot tables of the same original table. After reset_index, the indices looked identical in Jupyter. It only came to light when saving to Excel...
I fixed it with: df1[['key']] = df1[['key']].apply(pd.to_numeric)
Hopefully this saves somebody an hour!
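A quick way to spot this kind of mismatch before merging; the column name 'key' here is just illustrative:
import pandas as pd

# compare the dtypes of the join keys on both sides
print(df1.index.dtype, df2.index.dtype)
print(df1['key'].dtype, df2['key'].dtype)

# coerce both sides to a common numeric dtype if they differ
df1['key'] = pd.to_numeric(df1['key'])
df2['key'] = pd.to_numeric(df2['key'])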
If you want to join two DataFrames in pandas, you can simply use the available methods like merge or concat.
For example, if I have two DataFrames df1 and df2, I can join them by:
newdataframe = pd.merge(df1, df2, left_index=True, right_index=True)
You can try these few ways to merge/join your DataFrames.
merge (inner join by default)
df = pd.merge(df1, df2, left_index=True, right_index=True)
join (left join by default)
df = df1.join(df2)
concat (outer join by default)
df = pd.concat([df1, df2], axis=1)
I am using PySpark 2.1 to create partitions dynamically from table A to table B. Below are the DDLs:
create table A (
objid bigint,
occur_date timestamp)
STORED AS ORC;
create table B (
objid bigint,
occur_date timestamp)
PARTITIONED BY (
occur_date_pt date)
STORED AS ORC;
I am then using PySpark code in which I try to determine the partitions that need to be merged; below is the portion of code where I am actually doing that:
for row in incremental_df.select(partitioned_column).distinct().collect():
    path = '/apps/hive/warehouse/B/' + partitioned_column + '=' + format(row[0])
    old_df = merge_df.where(col(partitioned_column).isin(format(row[0])))
    new_df = incremental_df.where(col(partitioned_column).isin(format(row[0])))
    output_df = old_df.subtract(new_df)
    output_df = output_df.unionAll(new_df)
    output_df.write.option("compression", "none").mode("overwrite").format("orc").save(path)

refresh_metadata_sql = 'MSCK REPAIR TABLE ' + table_name
sqlContext.sql(refresh_metadata_sql)
On execution of the code I am able to see the partitions in HDFS:
Found 3 items
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-01
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-02
drwx------ - 307010265 hdfs 0 2017-06-01 10:31 /apps/hive/warehouse/B/occur_date_pt=2017-06-03
But when I try to access the table inside Spark, I get an array index out of bounds error:
>> merge_df = sqlContext.sql('select * from B')
DataFrame[]
>>> merge_df.show()
17/06/01 10:33:13 ERROR Executor: Exception in task 0.0 in stage 200.0 (TID 4827)
java.lang.IndexOutOfBoundsException: toIndex = 3
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:252)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:251)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
Any help or pointers to resolve the issue would be appreciated.
Posting the comment as an answer for easier reference:
Please ensure that the partition column is not included in the DataFrame you write out.
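A minimal sketch of that fix, applied to the write inside the loop from the question (it simply drops the partition column, referenced here via partitioned_column, before saving each partition directory):
# data written under .../occur_date_pt=<value>/ must not itself contain the partition column
output_df.drop(partitioned_column) \
    .write.option("compression", "none") \
    .mode("overwrite").format("orc").save(path)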