Pandas DataFrame, Iterate through groups is very slow - python-3.x

I have a dataframe df with ~300,000 rows and plenty of columns:
| IDX   | COL_A | ... | COL_B | COL_C |
|-------|-------|-----|-------|-------|
| 'AAA' | 'A1'  | ... | 'B1'  | 0     |
| 'AAB' | 'A1'  | ... | 'B2'  | 2     |
| 'AAC' | 'A1'  | ... | 'B3'  | 1     |
| 'AAD' | 'A2'  | ... | 'B3'  | 0     |
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA','B1'], ['AAC','B3'], ['AAB','B2']]
This is what I do:
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works fine, but it's horribly slow (Core i7, 16 GB RAM). It already takes ~5 minutes even when I'm not doing anything with the values. Do you know a faster way?
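One possible speedup (a sketch, not code from the question): sort the whole frame by COL_C once up front. groupby preserves the row order within each group, so the per-group sort_values call disappears:

```python
import pandas as pd

# Small frame shaped like the question's data (IDX as index)
df = pd.DataFrame(
    {"COL_A": ["A1", "A1", "A1", "A2"],
     "COL_B": ["B1", "B2", "B3", "B3"],
     "COL_C": [0, 2, 1, 0]},
    index=["AAA", "AAB", "AAC", "AAD"])

# Sort once by COL_C; groupby keeps this order inside each group
ordered = df.sort_values("COL_C")
result = {a: [[idx, b] for idx, b in zip(g.index, g["COL_B"])]
          for a, g in ordered.groupby("COL_A")}
# result["A1"] -> [['AAA', 'B1'], ['AAC', 'B3'], ['AAB', 'B2']]
```

The single global sort is O(n log n) once, instead of one sort per group, which is usually where the time goes with many groups.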

Related

groupby Sum on columns of multiple dataframes in Pandas

I have 2 dataframes (df1 and df2) below. Based on the values of df1.col1 and df2.col1, I want the values in df1.col2 and df2.col2 to be added. This sum should then go against each row in df1 as a 3rd column. For example: wherever we have A01 in col1, the values in col2 should be summed, so 1+7+5=13. Same for all other col1 values.
To give further clarity, I have given a snapshot of how df1 should look at the end (End Result)
df1
| col1 | col2 |
|------|------|
| A01  | 1    |
| A02  | 0    |
| A03  | 0    |
| A01  | 7    |
| A02  | 1    |
df2
| col1 | col2 | col3 |
|------|------|------|
| A01  | 5    | x    |
| A02  | 0    | y    |
| A06  | 0    | asa  |
| A07  | 1    | asa  |
| A02  | 4    | st   |
End result:
df1
| col1 | col2 | col3 |
|------|------|------|
| A01  | 1    | 13   |
| A02  | 0    | 5    |
| A03  | 0    | 0    |
| A01  | 7    | 13   |
| A02  | 1    | 5    |
I found a solution based on merge and groupby operations.
First I grouped df1 and df2 by 'col1', then calculated the sums, and finally merged the 'sum' back onto df1.
import pandas as pd

dic_df1 = {'col1': ['A01', 'A02', 'A03', 'A01', 'A02'], 'col2': [1, 0, 0, 7, 1]}
df1 = pd.DataFrame(dic_df1)
dic_df2 = {'col1': ['A01', 'A02', 'A06', 'A07', 'A02'], 'col2': [5, 0, 0, 1, 4], 'col3': ['x', 'y', 'asa', 'asa', 'st']}
df2 = pd.DataFrame(dic_df2)
print(df1), print(df2)
Then perform a groupby on 'col1' to calculate each part of the sum; when merging the two grouped frames, we get both partial sums side by side and can add them into a 'sum' column:
df2=df2.groupby(['col1'], as_index=False)['col2'].sum()
merged=(df1.groupby(['col1'], as_index=False)['col2'].sum()).merge(df2, left_on='col1', right_on='col1', how='left')
merged['sum']=merged['col2_x']+merged['col2_y']
Finally, merge df1 with merged:
df1=df1.merge(merged[['col1', 'sum']], left_on='col1', right_on='col1', how='left')
Final output of df1:
There are many ways to fill the NaN value in the final output df1 (here for A03, which has no match in df2).
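One end-to-end sketch (the NaN handling is an assumption: the missing right-hand sum for A03 is treated as 0 before adding):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A01', 'A02', 'A03', 'A01', 'A02'],
                    'col2': [1, 0, 0, 7, 1]})
df2 = pd.DataFrame({'col1': ['A01', 'A02', 'A06', 'A07', 'A02'],
                    'col2': [5, 0, 0, 1, 4]})

# Per-key sums of both frames, joined side by side
sums = (df1.groupby('col1', as_index=False)['col2'].sum()
        .merge(df2.groupby('col1', as_index=False)['col2'].sum(),
               on='col1', how='left'))
# A03 has no match in df2, so its right-hand sum is NaN -> fill with 0
sums['sum'] = sums['col2_x'] + sums['col2_y'].fillna(0)

df1 = df1.merge(sums[['col1', 'sum']], on='col1', how='left')
# df1['sum'] -> [13.0, 5.0, 0.0, 13.0, 5.0]
```

Filling before the addition (rather than after) keeps the per-key totals correct for keys that exist only in df1.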

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column, since the rows contain sub-rows. Then I group by ID to count equal values between the Shareholder - Last name and DM\nCognome columns. However, I can't get it to work. In this case the result should be 0 for row 1, 0 for row 2, 1 for row 3 and 2 for row 4.
Note that row 4 consists of 3 sub-rows and row 3 of 2 sub-rows.
I have 2 questions:
What is the best way to read an unorganised excel file like above and do lots of comparisons, replacing values etc.
How can I achieve the results that I mentioned earlier.
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())
First, read as input the table (keeping the ID as string instead of float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID':str})
df = df.drop("Unnamed: 0", axis=1) #drop this column since it is not useful
Fill the ID and, if a shareholder is missing, replace NaN with "missing":
df['ID'] = df['ID'].fillna(method='ffill')
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert to lowercase the surnames:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
Custom function to count how many householders occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count
And finally get the count for each ID:
df.groupby("ID",sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(lambda x: f(x))
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
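The approach can be checked without the spreadsheet on an in-memory frame (a hand-built, hypothetical copy of the table above, with IDs already forward-filled):

```python
import pandas as pd

# Hypothetical in-memory version of the spreadsheet (ID already forward-filled)
df = pd.DataFrame({
    "ID": ["01287560153", "01287560153", "05562881002", "04113870655",
           "04113870655", "01419190846", "01419190846", "01419190846"],
    "Shareholder - Last name": ["MASSIRONI", "CAGNACCI", None, "SABATO",
                                "VILLARI", "SALMERI", "MICALIZZI", "LIPARI"],
    "DM\nCognome": ["Bocapine Ardaya", None, "Directors", "Sabato",
                    None, "Salmeri", "Lipari", None],
})
df["Shareholder - Last name"] = (
    df["Shareholder - Last name"].fillna("missing").str.lower())

def f(group):
    # Count how many shareholder surnames occur in the DM\nCognome column
    s = group["DM\nCognome"].str.lower()
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count

out = df.groupby("ID", sort=False)[
    ["Shareholder - Last name", "DM\nCognome"]].apply(f)
# out.tolist() -> [0.0, 0.0, 1.0, 2.0]
```

Note that `str.count` treats the surname as a regex pattern; for plain surnames this is harmless, but names containing regex metacharacters would need escaping.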

How to divide two cells based on match?

In table 1, I have:
+---+---+----+
| | A | B |
+---+---+----+
| 1 | A | 30 |
| 2 | B | 20 |
| 3 | C | 15 |
+---+---+----+
In table 2, I have:
+---+---+---+----+
| | A | B | C |
+---+---+---+----+
| 1 | A | 2 | 15 |
| 2 | A | 5 | 6 |
| 3 | B | 4 | 5 |
+---+---+---+----+
For each row of table 2, I want the matching number from table 1 (based on the value in the first column) divided by the number in the second column, with the result in the third column. What is the formula that I must apply in the third column of table 2?
Please help me on this. Thanks in advance.
You can use a VLOOKUP() formula to fetch the dividend (assuming table 1 is on Sheet1 and table 2 on Sheet2, where we enter this formula):
=VLOOKUP(A1,Sheet1!A:B, 2, FALSE)/Sheet2!B1
Since you mention tables with structured references (though it seems you are not applying those here):
=VLOOKUP([#Column1],Table1[#All],2,0)/[#Column2]

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns whose content I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often each occurs, and the name of the column it came from.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns):
i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple way (i.e. three columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values. Which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col
melt(
df.select([col(c).cast("string") for c in df.columns]),
id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:
a | b | c |
1 | 2 | 4 |
0 | null | null|
null | 3 | 4 |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a | b | c |
1 | 2 | 4 |
0 | 0 | null|
0 | 3 | 4 |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:2]).fillna(0)
There is a parameter named subset to choose the columns (available since Spark 1.3.1):
df2.fillna(0, subset=['a', 'b'])
Use a dictionary to fill values of certain columns:
df2.fillna({'a': 0, 'b': 0})
