I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1 GB of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9,397,995 rows × 30 columns
df2 size = 1,500,000 rows × 30 columns
All 30 columns are of dtype string.
Both dataframes are being loaded from a database through a JDBC connection.
Both dataframes have the same column names, in the same order.
You could use a WHERE clause on the dataset to filter out rows you don't need. For example, through the PK: if you know that df1 has PKs ranging from 1 to 100, you just filter df2 for those PKs (obviously after a union).
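A minimal PySpark sketch of that idea, assuming both dataframes have a primary-key column called pk and that the overlapping key range is known (both are placeholders here):

from pyspark.sql import functions as F

# Hypothetical example: restrict both sides to the PK range that can actually overlap,
# so the expensive set difference runs on far fewer rows.
pk_min, pk_max = 1, 100   # placeholder bounds known from the source tables

df1_small = df1.where(F.col("pk").between(pk_min, pk_max))
df2_small = df2.where(F.col("pk").between(pk_min, pk_max))

new_df1 = df1_small.subtract(df2_small)   # df1 - df2 on the reduced data
new_df2 = df2_small.subtract(df1_small)   # df2 - df1 on the reduced data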
In a pandas DataFrame df, one can extract a subset of rows and store it in another pandas DataFrame, for example df1 = df[10:20]. Can we do something similar with a Spark DataFrame?
Since we're in Spark, we're dealing with large datasets that pandas (and Python) are still catching up with. I'm trying to stress that the reason you may have considered PySpark a better fit for your data processing problem is exactly the amount of data: too large for pandas to handle nicely.
With that said, you simply cannot treat a huge dataset as something to "rank" by row position, since no single machine could handle it (for lack of RAM or time).
In order to answer your question:
one can extract a subset of rows and store it in another pandas data frame.
Think of filter or where, which you use to filter out the rows you don't want to include in the result dataset.
That could be as follows (using Scala API):
val cdf: DataFrame = ...
val result: DataFrame = cdf.where("here comes your filter expression")
Use the result DataFrame however you wish; that's what you wanted to work with, and it is now available. That's sort of the "Spark way".
@chlebek, since your answer works for me, I corrected a typo and posted it here as an answer.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

b = cdf.withColumn("id", row_number().over(Window.orderBy("INTERVAL_END_DATETIME")))
b = b.where(b.id >= 10)
b = b.where(b.id <= 20)
You could try to use row_number; it will add an increasing row-number column. The data will be sorted by the column used in the orderBy clause, and then you can just select the rows you need.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val new_df = df.withColumn("id", row_number().over(Window.orderBy(col("someColumnFromDf")))).where(col("id") >= 10 && col("id") <= 20)
I have a DataFrame which contains 55,000 rows and 3 columns.
I want to return every row of this big DataFrame as a DataFrame, to use it as a parameter for a different function.
My idea was to iterate over the big DataFrame with iterrows() or iloc, but I can't get each row as a DataFrame; it shows up as a Series. How could I solve this?
I think it is obviously not necessary, because the index of the Series is the same as the columns of the DataFrame.
But it is possible by:
df1 = s.to_frame().T
Or:
df1 = pd.DataFrame([s.to_numpy()], columns=s.index)
Also, try to avoid iterrows, because it is obviously very slow.
I suspect you're doing something suboptimal if you need what you describe. That said, if you need each row as a DataFrame:
# df.iloc[[i]] (double brackets) keeps each row as a one-row DataFrame instead of a Series
l = [df.iloc[[i]] for i in range(len(df))]
This makes a list containing a single-row DataFrame for each row in df.
I need to merge two DataFrames, a and b. a has about 2.5 million rows and b has about 500k rows. a and b are read directly from MongoDB and converted to DataFrames via a list; the code is:
unique_b = b[['id', 'name']]
unique_b = unique_b.drop_duplicates()  # assign back: drop_duplicates returns a new DataFrame
a = pd.merge(a, unique_b[['id', 'name']], how='left', on='id')
Now the merge is not only causing a MemoryError, but also taking a very long time (it never finishes) even when there is enough memory. I am wondering how to optimise this pandas DataFrame merge in terms of memory usage and time.
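A minimal sketch of one way to cut both memory and time, assuming id becomes unique in unique_b after deduplication and that you only need to attach the name column (otherwise a real merge is still required):

import pandas as pd

# Deduplicate the lookup side first so the join side is as small as possible.
unique_b = b[['id', 'name']].drop_duplicates(subset='id')

# Mapping a single column through a Series avoids building merge indexers for
# the full 2.5M x 500k join and is usually lighter on memory than pd.merge.
name_by_id = unique_b.set_index('id')['name']
a['name'] = a['id'].map(name_by_id)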
I have 3 dataframes df1, df2 and df3.
Each dataframe has approximately 3 million rows. df1 and df3 have approximately 8 columns; df2 has only 3 columns.
(source text file of df1 is approx 600MB size)
These are the operations performed:
df_new=df1 left join df2 ->group by df1 columns->select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
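A rough PySpark sketch of that pipeline, purely to make the steps concrete (the join key "key", the filter column "flag", and the conditions below are placeholders, not my real columns):

from pyspark.sql import functions as F

# df1 left join df2, group by df1's columns, keep first() of df2's columns
df_new = (df1.join(df2, on="key", how="left")
              .groupBy([F.col(c) for c in df1.columns])
              .agg(*[F.first(c).alias(c) for c in df2.columns if c != "key"]))

# outer join with df3, then split by two filter conditions
df_final = df_new.join(df3, on="key", how="outer")
df_split1 = df_final.filter(F.col("flag") == 1)    # condition1 (placeholder)
df_split2 = df_final.filter(F.col("flag") == 0)    # condition2 (placeholder)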
This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value = 10, executor memory = 30G, driver memory = 10G, and I have used cache() wherever necessary.
But the same work done as Hive queries takes barely 5 minutes. Is there any particular reason why my dataframe operations are slow, and is there any way I can improve the performance?
You should be careful with the use of JOIN.
Joins in Spark can be really expensive, especially between two large dataframes. You can avoid part of the cost by repartitioning the two dataframes on the join column, or by using the same partitioner for both.
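A minimal sketch of that suggestion, assuming the join key column is called id (a placeholder name) and using an arbitrary partition count:

# Repartition both sides on the join key so matching rows end up in the same
# partitions and the join itself avoids an extra full shuffle.
df1_part = df1.repartition(200, "id")   # 200 is just an example value
df2_part = df2.repartition(200, "id")

joined = df1_part.join(df2_part, on="id", how="left")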
I have a large RDD, call it RDD1, that is approximately 300 million rows after an initial filter. What I would like to do is take the ids from RDD1 and find all other instances of them in another large dataset, call it RDD2, that is approximately 3 billion rows. RDD2 is created by querying a parquet table stored in Hive, as is RDD1. The number of unique ids from RDD1 is approximately 10 million.
My current approach is to collect the ids, broadcast them, and then filter RDD2.
My question is - is there a more efficient way to do this? Or is this best practice?
I have the following code -
hiveContext = HiveContext(sc)
RDD1 = hiveContext.sql("select * from table_1")
RDD2 = hiveContext.sql("select * from table_2")
ids = RDD1.map(lambda x: x[0]).distinct() # This is approximately 10 million ids
ids = sc.broadcast(set(ids.collect()))
RDD2_filter = RDD2.rdd.filter(lambda x: x[0] in ids.value)
I think it would be better to just use a single SQL statement to do the join:
RDD2_filter = hiveContext.sql("""select distinct t2.*
from table_1 t1
join table_2 t2 on t1.id = t2.id""")
What I would do is take the 300 million ids from RDD1, construct a Bloom filter from them, and use it as a broadcast variable to filter RDD2. You will get RDD2Partial, which contains all key-value pairs for keys that are in RDD1, plus some false positives. If you expect the result to be on the order of millions, you will then be able to use normal operations like join, cogroup, etc. on RDD1 and RDD2Partial to obtain the exact result without any problem.
This way you greatly reduce the time of the join operation if you expect the result to be of reasonable size, since the complexity remains the same. You might get some reasonable speedups (e.g. 2-10x) even if the result is on the order of hundreds of millions.
EDIT
The bloom filter can be collected efficiently since you can combine the bits set by one element with the bits set by another element with OR, which is associative and commutative.
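A minimal sketch of that approach with a tiny hand-rolled Bloom filter rather than any particular library (the sizing constants and names below are arbitrary examples; in practice you would derive the bit-array size and hash count from the expected number of ids and the false-positive rate you can tolerate):

import hashlib

# --- tiny hand-rolled Bloom filter built from plain data structures ---
SIZE_BITS = 1 << 27     # ~134M bits -> a 16 MB bit array (arbitrary example sizing)
NUM_HASHES = 5          # arbitrary example hash count

def _positions(item):
    # k bit positions for an item, derived from seeded md5 hashes
    for seed in range(NUM_HASHES):
        digest = hashlib.md5(("%d:%s" % (seed, item)).encode()).hexdigest()
        yield int(digest, 16) % SIZE_BITS

def bloom_add(bits, item):
    for pos in _positions(item):
        bits[pos // 8] |= 1 << (pos % 8)

def bloom_might_contain(bits, item):
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(item))

# Build the filter on the driver from RDD1's distinct ids, then broadcast it.
bits = bytearray(SIZE_BITS // 8)
for i in RDD1.rdd.map(lambda x: x[0]).distinct().collect():
    bloom_add(bits, i)
bloom_bc = sc.broadcast(bytes(bits))

# Keep only the RDD2 rows whose key might be in RDD1 (plus some false positives).
RDD2_partial = RDD2.rdd.filter(lambda x: bloom_might_contain(bloom_bc.value, x[0]))

The broadcast stays at a fixed 16 MB regardless of how many ids it represents, which is the main advantage over broadcasting the raw set of ids.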