pandas MemoryError when left-merging two dataframes - python-3.x

I need to merge two dataframes, a and b. a has about 2.5 million rows and b has about 500k rows. Both are read directly from MongoDB and converted to dataframes via a list. The code is as follows:
unique_b = b[['id', 'name']]
unique_b = unique_b.drop_duplicates()  # drop_duplicates() returns a new frame; assign it back
a = pd.merge(a, unique_b[['id', 'name']], how='left', on='id')
The merge is not only causing a MemoryError; even when there is enough memory, it takes a very long time (it never finishes). I am wondering how to optimise this pandas dataframe merge in terms of both memory usage and time.
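One common workaround (my addition, not from the original question) when the merge only needs to bring a single column over: deduplicate b first and use Series.map on an id-indexed lookup, which avoids the intermediates a full merge builds.
import pandas as pd

# Lighter-weight alternative to a left merge that only brings 'name' into a.
unique_b = b[['id', 'name']].drop_duplicates(subset='id')   # one row per id
lookup = unique_b.set_index('id')['name']                   # ~500k-entry lookup Series
a['name'] = a['id'].map(lookup)                             # map by id instead of merging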

Related

Find difference between two dataframes of large data

I need to get the difference in data between two dataframes. I'm using subtract() for this.
# Dataframes that need to be compared
df1
df2
#df1-df2
new_df1 = df1.subtract(df2)
#df2-df1
new_df2 = df2.subtract(df1)
It works fine and the output is what I needed, but my only issue is with the performance.
Even for comparing 1 GB of data, it takes around 50 minutes, which is far from ideal.
Is there any other optimised method to perform the same operation?
Following are some of the details regarding the dataframes:
df1 size = 9397995 * 30
df2 size = 1500000 * 30
All 30 columns are of dtype string.
Both dataframes are loaded from a database through a JDBC connection.
Both dataframes have the same column names, in the same order.
You could use a WHERE clause on the dataset to filter out the rows you don't need, for example through the PK: if you know that df1 has PKs ranging from 1 to 100, you just filter df2 for all those PKs (obviously after a union). A rough sketch follows.
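A rough PySpark sketch of that idea (my addition; the JDBC URL, table names, and pk column are placeholders, not from the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
opts = {"url": "jdbc:postgresql://host:5432/db", "user": "user", "password": "pw"}

# Push the PK filter down to the database so only the rows you care about are
# loaded, instead of reading everything and subtracting afterwards.
df1 = (spark.read.format("jdbc")
       .options(dbtable="(SELECT * FROM table1 WHERE pk BETWEEN 1 AND 100) t", **opts)
       .load())
df2 = (spark.read.format("jdbc")
       .options(dbtable="(SELECT * FROM table2 WHERE pk BETWEEN 1 AND 100) t", **opts)
       .load())

new_df1 = df1.subtract(df2)   # df1 - df2, now over far fewer rows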

Best way to merge two pandas dataframes having 150k records each; merging should happen on one common column like ID

What is the best way to merge two pandas dataframes having 150k records each? The merge should happen on one common column, like ID.
df1 (1st dataframe, 150k records)
df2 (2nd dataframe, 150k records)
df3 = pd.merge(df1,df2,on='ID',how='left')
Is this the correct way to merge?
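For what it's worth (my addition, not part of the original thread): pd.merge(df1, df2, on='ID', how='left') is the standard call for this. The same merge with the validate and indicator arguments, which are built-in pd.merge parameters, can confirm it behaves as expected:
import pandas as pd

df3 = pd.merge(
    df1, df2,
    on='ID',
    how='left',
    validate='many_to_one',   # fail fast if ID is not unique in df2
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
print(df3['_merge'].value_counts())   # how many IDs actually matched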

How to extract certain rows from spark dataframe to create another spark dataframe?

In a pandas dataframe df, one can extract a subset of rows and store it in another pandas dataframe. For example, df1 = df[10:20]. Can we do something similar with a Spark dataframe?
Since we're in Spark, we're considering large datasets that pandas (and Python) are still catching up on. I want to stress that the reason you may have considered PySpark a better fit for your data-processing problem is exactly the amount of data: too large for pandas to handle nicely.
With that said, you simply cannot think of a huge dataset as something to "rank" by position, as no single computer could handle it (for lack of either RAM or time).
To answer your question:
one can extract a subset of rows and store it in another pandas data frame.
Think of filter or where, which you use to filter out the rows you don't want in the result dataset.
That could be as follows (using Scala API):
val cdf: DataFrame = ...
val result: DataFrame = cdf.where("here comes your filter expression")
Use the result data frame however you wish; that is the subset you wanted to work with, and it is now available. That's a sort of "Spark way".
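For reference (my addition), the same filter/where idea in PySpark; the dataframe, column name, and range are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cdf = spark.range(100).withColumnRenamed("id", "some_column")           # stand-in dataframe

result = cdf.where("some_column BETWEEN 10 AND 20")                     # SQL expression form
result = cdf.where((cdf.some_column >= 10) & (cdf.some_column <= 20))   # equivalent column-API form
result.show()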
@chlebek, since your answer works for me, I corrected a typo and posted it here as an answer.
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

b = cdf.withColumn("id", row_number().over(Window.orderBy("INTERVAL_END_DATETIME")))
b = b.where(b.id >= 10)
b = b.where(b.id <= 20)
You could try to use row_number; it will add an increasing row-number column. The data will be sorted by the column used in the .orderBy clause, and then you can just select the rows you need.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// the 'symbol column syntax below also needs: import spark.implicits._
val new_df = df.withColumn("id", row_number.over(Window.orderBy('someColumnFromDf))).where('id <= 20 and 'id >= 10)

split,operate and union dataframe in spark

How can we split a dataframe, operate on each individual split, and union all the individual results back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel and operate on each individual split, which adds a new column called bucket; then I need to union the results back.
account,channel,number_of_views
groupBy only allows simple aggregated operations. On each split dataframe I need to do feature extraction.
Currently, all feature transformers in spark-mllib support only a single dataframe.
You can randomly split like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6, 0.2, 0.2))
This will create three dataframes. Then do what you want to do with each, and afterwards you can join or union them back:
val finalDF = df1.join(df2, df1.col("col_name") === df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
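As a concrete illustration of the split-by-channel, transform, and union pattern asked about (my sketch; the channel values and the bucket rule are invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a1", "web", 10), ("a2", "mobile", 300), ("a3", "web", 25)],
    ["account", "channel", "number_of_views"],
)

def add_bucket(split_df):
    # stand-in for whatever per-split feature extraction you need
    return split_df.withColumn(
        "bucket", F.when(F.col("number_of_views") > 100, "high").otherwise("low"))

# Split on channel, operate on each piece, then union the results back.
channels = [r["channel"] for r in df.select("channel").distinct().collect()]
parts = [add_bucket(df.where(F.col("channel") == c)) for c in channels]

result = parts[0]
for part in parts[1:]:
    result = result.unionByName(part)   # stitch the per-channel results back together
result.show()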

Pyspark Operations are Slower than Hive

I have 3 dataframes df1, df2 and df3.
Each dataframe has approximately 3 million rows. df1 and df3 have approx. 8 columns; df2 has only 3 columns.
(The source text file of df1 is approx. 600 MB in size.)
These are the operations performed:
df_new = df1 left join df2 -> group by df1 columns -> select df1 columns, first(df2 columns)
df_final = df_new outer join df3
df_split1 = df_final filtered using condition1
df_split2 = df_final filtered using condition2
write df_split1,df_split2 into a single table after performing different operations on both dataframes
This entire process takes 15 minutes in PySpark 1.3.1, with the default partition value = 10, executor memory = 30 GB, driver memory = 10 GB, and I have used cache() wherever necessary.
But when I use Hive queries, it hardly takes 5 minutes. Is there any particular reason why my dataframe operations are slow, and is there any way I can improve the performance?
You should be careful with the use of JOIN.
JOIN in Spark can be really expensive, especially when it is between two dataframes. You can avoid the expensive operations by repartitioning the two dataframes on the same column, or by using the same partitioner. A rough sketch follows.
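A rough PySpark sketch of that advice (my addition; df1/df2 are stand-ins built inline, "key" is a placeholder join column, and repartition-by-column requires a newer Spark than the 1.3.1 mentioned in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(1000000).withColumnRenamed("id", "key")
df2 = spark.range(10000).withColumnRenamed("id", "key")

# Hash-partition both sides on the join column with the same partition count,
# so the join can reuse that partitioning instead of shuffling both sides again.
df1 = df1.repartition(200, "key")
df2 = df2.repartition(200, "key")

joined = df1.join(df2, on="key", how="left")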
