I read a PDF file with PDFMiner and I get a string with the following structure:
text
text
text
col1
1
2
3
4
5
col2
(1)
(2)
(3)
(7)
(4)
col3
name1
name2
name3
name4
name5
col4
name
5
45
7
87
8
col5
FAE
EFD
SDE
FEF
RGE
col6
name
45
7
54
4
130
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
7
1
8
text1
text1
text1
col1
6
7
8
9
10
col2
(1)
(2)
(3)
(7)
(4)
col3
name6
name7
name8
name9
name10
col4
name
54
4
78
8
86
col5
SDE
FFF
EEF
GFE
JHG
col6
name
6
65
65
45
78
# col7
16
18
22
17
25
col8
col9
55
30
60
1
185
col10
name
1
4
1
54
I have 10 columns named: col1, col2, col3, col4 name, col5, col6 name, # col7, col8, col9, col10 name.
But since those 10 columns appear on each page, the structure is repeated. The names are always the same on each page. I am not sure how to pull it all into the same dataframe.
For example, for col1 I would have the following in the dataframe:
1
2
3
4
5
6
7
8
9
10
I also have some empty columns (col8 in my example) and I am not sure how to deal with them.
Any ideas? Thanks!
You can use a regex to parse the document (regex101), for example (txt is your string from the question):
import re
import pandas as pd
d = {}
# capture each column header (optionally prefixed with "# " and/or followed by
# a "name" sub-line) together with the block of values that follows it
for col_name, cols in re.findall(r'\n^((?:#\s)?col\d+(?:\n\s*name\n+)?)(.*?)(?=\n\n|^(?:#\s)?col\d+|\Z)', txt, flags=re.M|re.S):
    d.setdefault(col_name.strip(), []).extend(cols.strip().split('\n'))
df = pd.DataFrame.from_dict(d, orient='index').T
print(df)
Prints:
  col1 col2    col3 col4\n name col5 col6\n name # col7  col8 col9 col10\nname
0    1  (1)   name1          5  FAE          45     16         55           1
1    2  (2)   name2         45  EFD           7     18         30           7
2    3  (3)   name3          7  SDE          54     22  None   60           1
3    4  (7)   name4         87  FEF           4     17  None    1           8
4    5  (4)   name5          8  RGE         130     25  None  185           1
5    6  (1)   name6         54  SDE           6     16  None   55           4
6    7  (2)   name7          4  FFF          65     18  None   30           1
7    8  (3)   name8         78  EEF          65     22  None   60          54
8    9  (7)   name9          8  GFE          45     17  None    1        None
9   10  (4)  name10         86  JHG          78     25  None  185        None
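The empty col8 is captured as empty strings on the first page and None elsewhere. If you would rather have a proper all-missing column (and numeric columns as numbers instead of strings), a small post-processing sketch, assuming pandas >= 1.0 for pd.NA:
# normalize the placeholders produced for the empty col8 column
df = df.replace('', pd.NA)
# the values are parsed from text, so convert numeric columns explicitly
df['col1'] = pd.to_numeric(df['col1'])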
I am trying to convert an Excel IF/ELSE condition into Python dataframe columns; can anyone help me out with this?
Input: df
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue
0 John1 John2 John3 John4 10 3 5 7 10
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9
The Name values may not always end with 1/2/3 etc.; they may be different names as well.
Output: how do I calculate the Final_Name column?
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue Final_Name
0 John1 John2 John3 John4 10 3 5 7 10 John1
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12 Sony2
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13 Mark2
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44 Biky4
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9 Rose3
In Excel, we can write something like this:
=IF(I2=H2,D2,IF(I2=G2,C2,IF(I2=F2,B2,IF(I2=E2,A2,""))))
You can first filter the df into two parts, then use the position of the maximum value to locate the Name:
v = df.filter(regex='^Value')    # the Value columns
name = df.filter(regex='^Name')  # the Name columns
# for each row, pick the Name column at the same position as the max Value
df['out'] = name.values[df.index, v.columns.get_indexer(v.idxmax(1))]
df
Out[188]:
Name1 Name2 Name3 Name4 Value1 Value2 Value3 Value4 MaxValue out
0 John1 John2 John3 John4 10 3 5 7 10 John1
1 Sony1 Sony2 Sony3 Sony4 2 12 4 8 12 Sony2
2 Mark1 Mark2 Mark3 Mark4 5 13 0 3 13 Mark2
3 Biky1 Biky2 Biky3 Biky4 7 7 5 44 44 Biky4
4 Rose1 Rose2 Rose3 Rose4 7 0 9 7 9 Rose3
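For comparison, the nested Excel IF from the question can be translated almost literally with np.select, testing the Value columns in the same right-to-left order as the formula. A sketch, assuming the columns really are named Name1..Name4 and Value1..Value4:
import numpy as np

# mirror =IF(I2=H2,D2,IF(I2=G2,C2,...)): test Value4 first, then Value3, ...
conds = [df['MaxValue'].eq(df[f'Value{i}']) for i in (4, 3, 2, 1)]
names = [df[f'Name{i}'] for i in (4, 3, 2, 1)]
df['Final_Name'] = np.select(conds, names, default='')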
You can first create a column that shows which 'Name' column you should return, using idxmax(). Then you can stack() your 'Name' columns and merge this result with the created column based on index and 'Name':
# Create a helper column
v_c = [c for c in df if c.startswith('Value')]
df['id_col'] = df[v_c].idxmax(axis=1).str.replace('Value','Name')
# Merge the helper column with your stacked 'Name' columns
n_c = df.filter(like='Name').columns
res = pd.merge(df[n_c].stack().reset_index(),
               df[['id_col']].reset_index(),
               left_on=['level_0', 'level_1'],
               right_on=['index', 'id_col'])[0]
# Assign as a column
df['Final_Name'] = res
Prints:
Name1 Name2 Name3 Name4 ... Value4 MaxValue id_col Final_Name
0 John1 John2 John3 John4 ... 7 10 Name1 John1
1 Sony1 Sony2 Sony3 Sony4 ... 8 12 Name2 Sony2
2 Mark1 Mark2 Mark3 Mark4 ... 3 13 Name2 Mark2
3 Biky1 Biky2 Biky3 Biky4 ... 44 44 Name4 Biky4
4 Rose1 Rose2 Rose3 Rose4 ... 7 9 Name3 Rose3
[5 rows x 11 columns]
Hello, I have data as follows:
Col1 Col2 col3
A 2020-01-08 25
A 2020-01-11 26
B 2020-01-06 32
B 2020-01-08 45
I want to create another column (col4) which, for each category in Col1, holds the col3 value from 2 months prior, as below:
Col1 Col2 col3 col4
A 2020-01-08 25 NaN
A 2020-01-10 56 25
A 2020-01-11 26 NaN
B 2020-01-06 32 NaN
B 2020-01-08 45 32
I tried pd.shift, but it does not work when there are missing months in the data. Can anyone please help?
Use np.where to conditionally fill only the rows where the consecutive difference within each group is greater than or equal to 60 days:
df['col4'] = np.where(df.groupby('Col1')['Col2'].diff().dt.days.ge(60),
                      df['col3'].shift(), np.nan)
Col1 Col2 col3 col4
0 A 2020-08-01 25 NaN
1 A 2020-10-01 56 25.0
2 A 2020-11-01 26 NaN
3 B 2020-06-01 32 NaN
4 B 2020-08-01 45 32.0
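For completeness, a minimal self-contained sketch of the same idea, using the dates exactly as they appear in the output above and assuming Col2 still needs to be parsed to datetime:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B'],
                   'Col2': ['2020-08-01', '2020-10-01', '2020-11-01',
                            '2020-06-01', '2020-08-01'],
                   'col3': [25, 56, 26, 32, 45]})
df['Col2'] = pd.to_datetime(df['Col2'])  # .diff().dt.days needs datetimes

# take the previous col3 value only where the gap within the Col1 group
# is at least 60 days (roughly two months)
df['col4'] = np.where(df.groupby('Col1')['Col2'].diff().dt.days.ge(60),
                      df['col3'].shift(), np.nan)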
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'col3':np.random.randint(1,10,5),'col1':np.random.randint(30,80,5)})
df2 = pd.DataFrame({'col4':np.random.randint(30,80,5),'col5':np.random.randint(100,130,5)})
df3 = pd.DataFrame({'col9':np.random.randint(1,10,5),'col8':np.random.randint(30,80,5)})
x1 = pd.concat([df1,df2,df3],axis=1,sort=False)
x1.columns = pd.MultiIndex.from_product([['I2'],x1.columns])
x2 = pd.concat([df1,df2,df3],axis=1,sort=False)
x2.columns = pd.MultiIndex.from_product([['I3'],x2.columns])
x3 = pd.concat([df1,df2,df3],axis=1,sort=False)
x3.columns = pd.MultiIndex.from_product([['I1'],x3.columns])
pd.concat([x1,x2,x3],axis=0,sort=False)
I was trying to get an aggregated dataframe with exactly the same column order as that of x1, x2 and x3 (which are already identical), as figure 1 shows below:
Figure 1: I was trying to get this
But actually the above codes created a dataframe presented in figure 2 below:
Figure 2: The code actually created this
I am wondering why the sort=False parameter did not prevent the sorting in either the first or the second level of the columns in the pandas.concat() function.
Is there any other way that I can get the dataframe that I want?
Many thanks for your time and insight!
You could use join instead of concat:
x1.join(x2, how='outer').join(x3, how='outer')
Result:
I2 I3 I1
col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8 col3 col1 col4 col5 col9 col8
0 7 54 42 128 8 79 7 54 42 128 8 79 7 54 42 128 8 79
1 1 56 56 102 1 77 1 56 56 102 1 77 1 56 56 102 1 77
2 9 34 52 108 4 68 9 34 52 108 4 68 9 34 52 108 4 68
3 3 42 51 108 8 75 3 42 51 108 8 75 3 42 51 108 8 75
4 3 34 70 100 5 78 3 34 70 100 5 78 3 34 70 100 5 78
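If you do need the row-wise (axis=0) concat from the question, another option is to let concat sort the columns and restore the original order afterwards; a sketch, relying on Index.append concatenating labels without sorting:
out = pd.concat([x1, x2, x3], axis=0, sort=False)
# put the columns back in the original x1, x2, x3 order
out = out[x1.columns.append([x2.columns, x3.columns])]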
I have a df that looks something like the below
Index Col1 Col2 Col3 Col4 Col5
0 12 121 346 abc 747
1 156 121 146 68 75967
2 234 121 346 567
3 gj 161 646
4 214 171
5 fhg
....
.....
I want the columns that contain null values to shift their data down to the bottom of the dataframe.
Eg it should look like:
Index Col1 Col2 Col3 Col4 Col5
0 12
1 156 121
2 234 121 346
3 gj 121 146 abc
4 214 161 346 68 747
5 fhg 171 646 567 75967
I have thought along the lines of shift and/or justify. However, I am not sure how it can be accomplished most efficiently for a large dataframe.
You can use a slightly modified justify function that also works with non-numeric values:
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
"""
Justifies a 2D array
Parameters
----------
a : ndarray
Input array to be justified
axis : int
Axis along which justification is to be made
side : str
Direction of justification. It could be 'left', 'right', 'up', 'down'
It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
"""
if invalid_val is np.nan:
mask = pd.notnull(a)
else:
mask = a!=invalid_val
justified_mask = np.sort(mask,axis=axis)
if (side=='up') | (side=='left'):
justified_mask = np.flip(justified_mask,axis=axis)
out = np.full(a.shape, invalid_val, dtype=object)
if axis==1:
out[justified_mask] = a[mask]
else:
out.T[justified_mask.T] = a.T[mask.T]
return out
arr = justify(df.values, invalid_val=np.nan, side='down', axis=0)
df = pd.DataFrame(arr, columns=df.columns, index=df.index).astype(df.dtypes)
print (df)
  Col1 Col2 Col3 Col4   Col5
0   12  NaN  NaN  NaN    NaN
1  156  121  NaN  NaN    NaN
2  234  121  346  NaN    NaN
3   gj  121  146  abc    NaN
4  214  161  346   68    747
5  fhg  171  646  567  75967
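Note that this approach is fully vectorized (one np.sort over the boolean mask plus a single fancy-indexing assignment), so it should scale well to the large dataframes the question mentions.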
I tried this,
# shift each column down by its count of missing values
t = df.isnull().sum()
for val in zip(t.index.values, t.values):
    df[val[0]] = df[val[0]].shift(val[1])
print(df)
Output:
Index Col1 Col2 Col3 Col4 Col5
0 0 12 NaN NaN NaN NaN
1 1 156 121.0 NaN NaN NaN
2 2 234 121.0 346.0 NaN NaN
3 3 gj 121.0 146.0 abc NaN
4 4 214 161.0 346.0 68 747.0
5 5 fhg 171.0 646.0 567 75967.0
Note: I have used a loop here, which may not be the best solution, but it will give you an idea of how to solve this.
Using Python/Pandas I am trying to transform a dataframe by creating two new columns (A and B) conditional on values from different lines (from column ID3), but from within the same group (as determined by ID1).
For each ID1 group, I want to take the ID2 value where ID3 is equal to 31 and put this value in a new column called A conditional on ID3 being a 1 or a 2. Similarly, I want to take the ID2 value where ID3 is equal to 41 and put this value in a new column called B, again conditional on ID3 being a 1 or a 2.
Assuming I have a dataframe in the following format:
import pandas as pd
df = pd.DataFrame({'ID1': (1, 1, 1, 1, 2, 2, 2), 'ID2': (151, 152, 153, 154, 261, 262, 263), 'ID3': (1, 2, 31, 41, 1, 2, 41), 'ID4': (2, 2, 1, 2, 1, 1, 2)})
print(df)
ID1 ID2 ID3 ID4
0 1 151 1 2
1 1 152 2 2
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1
5 2 262 2 1
6 2 263 41 2
Post-transformation format should look like what is shown below. Where columns A and B are populated with values from ID2, conditional on values within ID3.
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1
5 2 262 2 1 263
6 2 263 41 2 263
I have attempted what is shown below, but transform retains the same number of values as the original dataset. This poses a problem for the lines in which ID3 is 31 or 41. Also, it returns the original ID2 value by default if there is no row with ID3 == 31 within the group.
df['A'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 31])
df['B'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 41])
Result:
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1 153 154
3 1 154 41 2 153 154
4 2 261 1 1 261 263
5 2 262 2 1 262 263
6 2 263 41 2 263 263
Any suggestions? Thank you in advance!
In no way do I think this is the best solution, but it is a solution.
You can replace .loc with .where, which returns NaN wherever the condition is not true. Then backfill the NaNs, and finally filter again with .where on ID3 being 1 or 2:
df['A'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 31).fillna(method='bfill').where(df.ID3.isin([1, 2])))
df['B'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 41).fillna(method='bfill').where(df.ID3.isin([1, 2])))
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153.0 154.0
1 1 152 2 2 153.0 154.0
2 1 153 31 1 NaN NaN
3 1 154 41 2 NaN NaN
4 2 261 1 1 NaN 263.0
5 2 262 2 1 NaN 263.0
6 2 263 41 2 NaN NaN
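A possibly simpler alternative, sketched with a hypothetical helper named lookup: build one ID1 -> ID2 mapping per target ID3 value and map it onto the rows where ID3 is 1 or 2. This assumes at most one ID3 == 31 (or 41) row per ID1 group:
def lookup(df, id3_value):
    # the ID2 value of the row where ID3 == id3_value, keyed by ID1
    mapping = df.loc[df['ID3'] == id3_value].set_index('ID1')['ID2']
    # broadcast that value across each group, then keep it only on rows
    # where ID3 is 1 or 2
    return df['ID1'].map(mapping).where(df['ID3'].isin([1, 2]))

df['A'] = lookup(df, 31)
df['B'] = lookup(df, 41)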