Add columns based on a changing number of rows below - python-3.x

I'm trying to solve a machine learning problem for a university project. As input I got an Excel table.
I need to access the information below specific rows (condition: df['c1'] != 0) and create new columns from it, but the number of rows below each such row is not fixed.
I tried various pandas approaches (e.g. while loops combined with iloc, iterrows), but nothing seemed to work. Now I wonder if I need to write a function that creates a new df for every group below each top element; I assume there must be a better option. I use Python 3.6 and pandas 0.25.0.
This is the result I'm trying to get.
Input:
| name | c1 | c2 |
|------|-------|--------------|
| ab | 1 | info |
| tz | 0 | more info |
| ka | 0 | more info |
| cd | 2 | info |
| zz | 0 | more info |
The output should look like this:
Output:
| name | c1 | c2 | tz3 | ka4 | zz5 |
|------|-------|--------------|-----------|-----------|------------|
| ab | 1 | info | more info | more info | |
| tz | 0 | more info | | | |
| ka | 0 | more info | | | |
| cd | 2 | info | | | more info |
| zz | 0 | more info | | | |

You can do this as follows:
# make sure c1 is of type int (if it isn't already)
# if it is string, just change the comparison further below
df['c1']= df['c1'].astype('int32')
# create two temporary aux columns in the original dataframe
# the first contains 1 for each row where c1 is nonzero
df['nonzero']= (df['c1'] != 0).astype('int')
# the second contains a "group index" to give
# all rows that belong together the same number
df['group']= df['nonzero'].cumsum()
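# e.g. for the sample above, c1 = [1, 0, 0, 2, 0] gives
#   nonzero = [1, 0, 0, 1, 0]
#   group   = [1, 1, 1, 2, 2]
# so each "top" row and its followers share one group number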
# create a working copy from the original dataframe
df2= df[['c1', 'c2', 'group']].copy()
# add another column which contains the name of the
# column under which the text should appear
df2['col']= df['name'].where(df['nonzero']==0, 'c2')
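# for the sample data this gives col = ['c2', 'tz', 'ka', 'c2', 'zz']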
# add a dummy column with all ones
# (needed to merge the original dataframe
# with the "transposed" dataframe later)
df2['nonzero']= 1
# now the main part
# use the prepared copy and index it on
# group, nonzero(1) and col
df3= df2[['group', 'nonzero', 'col', 'c2']].set_index(['group', 'nonzero', 'col'])
# unstack it, meaning col is "split off" to create a new column
# level (like pivoting), the rest remains in the index
df3= df3.unstack()
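# for illustration (not part of the pipeline), a tiny standalone
# example of what unstack does with such an index:
#   pd.DataFrame({'group': [1, 1, 2], 'col': ['x', 'y', 'x'],
#                 'val': ['a', 'b', 'c']}
#                ).set_index(['group', 'col']).unstack()
# moves 'col' into the columns, yielding the multilevel header
# ('val', 'x'), ('val', 'y') with NaN where a group has no entry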
# now df3 has a multilevel column index
# to get rid of it and have regular column names
# just rename the columns and remove c2 which
# we get from the original dataframe
df3_names= ['{1}'.format(*tup) for tup in df3.columns]
df3.columns= df3_names
df3.drop(['c2'], axis='columns', inplace=True)
# df3 now contains the "transposed" c2 infos, which should
# appear in the rows for which 'nonzero' contains 1
# to get this, use merge
result= df.merge(df3, left_on=['group', 'nonzero'], right_index=True, how='left')
# if you don't like the NaN values (for the rows with nonzero=0), use fillna
result.fillna('', inplace=True)
# finally remove the aux columns; the c2 text for the "top" rows
# already comes from the original dataframe (the duplicated c2
# column of df3 was dropped above)
result.drop(['group', 'nonzero'], axis='columns', inplace=True)
The result looks like this:
Out[191]:
  name  c1              c2              ka         tz         zz
0   ab   1            info  even more info  more info
1   tz   0       more info
2   ka   0  even more info
3   cd   2            info                             more info
4   zz   0       more info
For this input data:
Out[166]:
  name  c1              c2
0   ab   1            info
1   tz   0       more info
2   ka   0  even more info
3   cd   2            info
4   zz   0       more info
# created by the following code:
import io
import pandas as pd

raw= """ name c1 c2
0 ab 1 info
1 tz 0 more_info
2 ka 0 even_more_info
3 cd 2 info
4 zz 0 more_info"""
df= pd.read_csv(io.StringIO(raw), sep=r'\s+', index_col=0)
df['c2']= df['c2'].str.replace('_', ' ')

Related

Trying to substitute values of a column of a dataframe with the values of another dataframe when common lines of a column are present, but no result

I have two dataframes:
db
index| Line Item | Creative Size
1 | AA | Size1
2 | BB | Unknown
3 | CC | Unknown
df1
index| Line Item | Size
1 | BB | Size2
2 | CC | Size3
When Line Item code is the same, I want to substitute the value of Creative Size with the value of Size.
My expected output is:
db:
index| Line Item | Creative Size
1 | AA | Size1
2 | BB | Size2
3 | CC | Size3
Note that df1 has fewer rows than db.
I came up with:
rename_dict = df1.set_index('Line Item').to_dict()['Size']
db['Creative Size'] = db['Creative Size'].replace(rename_dict)
But for some reason it doesn't work.
Any help? Any other way?
You are close: you need Series.map on column Line Item, and then replace the non-matched values with the originals via Series.fillna:
db['Creative Size'] = db['Line Item'].map(rename_dict).fillna(db['Creative Size'])
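To see the pattern end to end, here is a minimal self-contained sketch (the frame contents are transcribed from the tables above):
import pandas as pd

db = pd.DataFrame({'Line Item': ['AA', 'BB', 'CC'],
                   'Creative Size': ['Size1', 'Unknown', 'Unknown']})
df1 = pd.DataFrame({'Line Item': ['BB', 'CC'],
                    'Size': ['Size2', 'Size3']})

rename_dict = df1.set_index('Line Item')['Size'].to_dict()
# map yields NaN for 'AA' (no match in rename_dict);
# fillna then restores the original 'Size1'
db['Creative Size'] = db['Line Item'].map(rename_dict).fillna(db['Creative Size'])
print(db)
#   Line Item Creative Size
# 0        AA         Size1
# 1        BB         Size2
# 2        CC         Size3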

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column, since there are subrows. Then I group by ID to count matching values between the Shareholder - Last name and DM\nCognome columns, but I can't get it to work. For this example the result should be 0 for row 1, 0 for row 2, 1 for row 3 and 2 for row 4.
It should be noted that row 4 consists of 3 subrows and row 3 of 2 subrows.
I have 2 questions:
What is the best way to read an unorganised Excel file like the one above and do lots of comparisons, value replacements, etc.?
How can I achieve the result mentioned above?
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())
First, read in the table (keeping the ID as string instead of float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID':str})
df = df.drop("Unnamed: 0", axis=1) #drop this column since it is not useful
Fill the ID column downwards, and if a shareholder is missing replace NaN with "missing":
df['ID'] = df['ID'].fillna(method='ffill')
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert the surnames to lowercase:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
Custom function to count how many of the shareholder surnames occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count
And finally get the count for each ID:
df.groupby("ID",sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(lambda x: f(x))
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
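For reference, here is a self-contained sketch of the same steps that builds the frame inline instead of reading Workbook1.xlsx (the values are transcribed from the table above):
import pandas as pd

df = pd.DataFrame({
    'ID': ['01287560153', None, '05562881002', '04113870655',
           None, '01419190846', None, None],
    'Shareholder - Last name': ['MASSIRONI', 'CAGNACCI', None, 'SABATO',
                                'VILLARI', 'SALMERI', 'MICALIZZI', 'LIPARI'],
    'DM\nCognome': ['Bocapine Ardaya', None, 'Directors', 'Sabato',
                    None, 'Salmeri', 'Lipari', None],
})

df['ID'] = df['ID'].fillna(method='ffill')
df['Shareholder - Last name'] = df['Shareholder - Last name'].fillna('missing').str.lower()

def f(group):
    s = group['DM\nCognome'].str.lower()
    count = 0
    for surname in group['Shareholder - Last name']:
        count += s.str.count(surname).sum()
    return count

print(df.groupby('ID', sort=False).apply(f))
# prints the same counts as above: 0.0, 0.0, 1.0, 2.0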

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a | b | c    |
1 | 2 | 4    |
0 | 0 | null |
0 | 3 | 4    |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset for choosing the columns; it is available as long as your Spark version is not lower than 1.3.1.
Use a dictionary to fill values of certain columns:
df.fillna({'a': 0, 'b': 0})
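Either variant can be checked against the sample df2 built above; both fill the nulls in columns a and b only and leave c untouched, matching the expected output in the question:
# both calls leave column "c" as-is
df2.fillna(0, subset=['a', 'b']).show()
df2.fillna({'a': 0, 'b': 0}).show()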

Pandas DataFrame, Iterate through groups is very slow

I have a dataframe df with ~300,000 rows and plenty of columns:
| COL_A | ... | COL_B | COL_C |
-----+--------+-...--+--------+--------+
IDX
-----+--------+-...--+--------+--------+
'AAA'| 'A1' | ... | 'B1' | 0 |
-----+--------+-...--+--------+--------+
'AAB'| 'A1' | ... | 'B2' | 2 |
-----+--------+-...--+--------+--------+
'AAC'| 'A1' | ... | 'B3' | 1 |
-----+--------+-...--+--------+--------+
'AAD'| 'A2' | ... | 'B3' | 0 |
-----+--------+-...--+--------+--------+
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA','B1'], ['AAC','B3'], ['AAB','B2']]
This is what I do.
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works fine, but it's horribly slow (Core i7, 16 GB RAM). It already takes ~5 minutes even when I'm not doing anything with the values. Do you know a faster way?
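One plausible speedup (a sketch under the column layout shown above, not a tested answer): sort the whole frame once by COL_A and COL_C, then iterate over the groups of the pre-sorted frame, so the per-group sort_values call disappears:
# sort once globally; groupby preserves the row order inside each
# group, so every group already arrives ordered by COL_C
df_sorted = self.df.sort_values(by=[COL_A, COL_C], ascending=True)
for col_A, group in df_sorted.groupby(COL_A, sort=False):
    pairs = list(zip(group.index, group[COL_B]))
    # for 'A1': [('AAA', 'B1'), ('AAC', 'B3'), ('AAB', 'B2')]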

Is there a way to create RDBMS-behavior in Excel for row column lookups?

Is it possible to look up values in Excel in the following way:
Table 1
ID | Code
-----------------
1 | I
1 | J
1 | K
2 | I
2 | J
2 | L
Table 2
ID | I | J | K | L
----------------------------------------------
1 | 14.40 | 12.33 | 9.21 |
2 | 13.99 | 11.28 | | 32.33
The lookup would add the column values from Table 2 next to the Code column in Table 1, so Table 1 would change to:
Table 1
ID | Code | Amount
-------------------------
1 | I | 14.40
1 | J | 12.33
1 | K | 9.21
1 | L |
2 | I | 13.99
2 | J | 11.28
2 | K |
2 | L | 32.33
As a reminder, this is a project being run in Microsoft Excel 2003.
Update
I believe I can use a VLOOKUP on the first column, and since I know the placement of the code fields I could go this route. The issue is that I cannot copy and paste this formula across an entire column, because the order in which the codes appear can vary (and is not the same from ID to ID).
You can use INDEX and MATCH:
=INDEX($C$4:$E$6,MATCH(H3,$B$4:$B$6,0),MATCH(I3,$C$3:$E$3,0))
MATCH finds the position of your ID and code in the Table 2 row and column headers; INDEX uses those positions to return the value at the intersection of that row and column. For ID 1 and code I, for instance, this returns 14.40.
Assuming Table 1 is in cells A1:B7 and Table 2 is in A10:E12, you can put this formula in C2 and copy it down to C7. It's an array formula, so you need to press Ctrl+Shift+Enter after you enter it.
=SUM(IF($A$11:$A$12=A2,IF($B$10:$E$10=B2,$B$11:$E$12,0)))
