Python - group by + transform + substring - python-3.x

I'm trying to extract values from a string in a pandas data frame, using two other columns as indices.
My data looks like this:
Address second_dot third_dot
0 1.273.1735.0 5 10
1 1.263.48.0 5 8
2 1.273.1341.0 5 10
3 1.273.1527.0 5 10
4 1.273.1379.0 5 10
5 1.273.1094.0 5 10
6 1.273.845.0 5 9
7 1.273.1393.0 5 10
8 1.275.988.0 5 9
9 1.273.973.0 5 9
In the columns second_dot and third_dot I've stored the positions of the '.' characters within the 'Address' column. What I would like to do is to extract, from each row, all characters between the second and third dots.
The result should be like this:
Result
273
263
273
273
273
273
273
273
275
273
I've already managed to do it by using apply on axis 1 with a custom function, but it takes too long (I've got millions of records in my data frame). Considering that the addresses are repeated across rows, I'm trying to group the calculation by address, hoping to speed things up.
This is my last attempt, but it does not work:
df.groupby(['Address']).transform(lambda x:
    x['Address'].str[x['first_dot']:x['second_dot']])
I get the error: KeyError: ('Address', 'occurred at index MachineIdentifier').
MachineIdentifier is the first column of my df (not the index, a normal column).
Thanks a lot in advance
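One vectorized approach (a sketch, not from the original post, assuming every Address has the dotted form shown in the sample): given the expected output above (273, 263, ...), the wanted piece is the second dot-separated field, so str.split avoids apply entirely.
import pandas as pd

df = pd.DataFrame({'Address': ['1.273.1735.0', '1.263.48.0', '1.275.988.0']})

# Split each address on '.' and take the second field; this matches the
# expected output (273, 263, 275) without any per-row Python function.
df['Result'] = df['Address'].str.split('.').str[1]
print(df['Result'])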

Related

How to get the median of different intervals of dataframe based on label name? [duplicate]

This question already has answers here:
How to groupby consecutive values in pandas DataFrame
(4 answers)
Closed 3 years ago.
So I have a DataFrame with two columns, one with label names (df['Labels']) and the other with int values (df['Volume']).
df = pd.DataFrame({'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
                   'Volume': [10,40,20,20,50,60,40,50,50,60,10,10,10,10,20,20,10,20,80,90,90,80,100]})
I would like to identify the intervals where my labels change and then calculate the median of the 'Volume' column for each of these intervals. Then I want to replace every value in 'Volume' with the respective median of its interval.
In the case of label A, I would like a separate median for each of its two intervals.
Here is how my DataFrame should look:
df2 = pd.DataFrame({'Labels': ['A','A','A','A','B','B','B','B','B','B','A','A','A','A','A','A','A','A','C','C','C','C','C'],
                    'Volume': [20,20,20,20,50,50,50,50,50,50,10,10,10,10,10,10,10,10,90,90,90,90,90]})
You want to group by the blocks of consecutive labels and transform with the median:
blocks = df['Labels'].ne(df['Labels'].shift()).cumsum()
df['group_median'] = df['Volume'].groupby(blocks).transform('median')
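For intuition, here is what blocks evaluates to on the sample data: it increments every time the label changes, so each run of equal labels gets its own group id.
print(blocks.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]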
Alternatively, in one line: use Series.ne with Series.shift and Series.cumsum to create the groups, then groupby and transform:
df['Volume']=df.groupby(df['Labels'].ne(df['Labels'].shift()).cumsum())['Volume'].transform('median')
print(df)
Labels Volume
0 A 20
1 A 20
2 A 20
3 A 20
4 B 50
5 B 50
6 B 50
7 B 50
8 B 50
9 B 50
10 A 10
11 A 10
12 A 10
13 A 10
14 A 10
15 A 10
16 A 10
17 A 10
18 C 90
19 C 90
20 C 90
21 C 90
22 C 90

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked; in the third column I'm not really sure, but probably 1, 50, 4, 2...
Your question involves two problems as I see it.
Printing the highlighting depends on the output target you're trying to reach, be it STDOUT, a file, or some specific program.
Identification of outliers based on the column data. It's hard to tell whether you want outliers relative to the entire dataset, versus relative to the preceding data in the column, like a rolling outlier, i.e. the previous data is used to decide whether the next value is out of whack.
In the instance below I provide a method that attacks the data with standard deviation / z-scoring against the mean of the entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies to this concept and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling approach at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few environments.
To get at the original item, identifying the outliers in your data, you could use something like the code below, which flags values based on standard deviation and z-score.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)

for col in original:
    col_std = col + "_std"
    col_zscore = col + "_zscore"
    # Store the column's (population) standard deviation, then z-score
    # each value against the column mean.
    df[col_std] = df[col].std(ddof=0)
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    # Print the values whose z-score falls outside the chosen thresholds.
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -0.5)])

print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with the std and zscore of each item based on its column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
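For the coloring half of the question, a minimal sketch using the pandas styling API linked above (assuming a reasonably recent pandas and an HTML-capable target such as a notebook), reusing the same z-score rule:
import pandas as pd

df = pd.read_csv("full.txt")

def highlight_outliers(col):
    # Color the cells whose z-score against their column is outside the thresholds.
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if (v > 1.5 or v < -0.5) else '' for v in z]

styled = df.style.apply(highlight_outliers, axis=0)
styled.to_html("highlighted.html")  # or display `styled` directly in a notebook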

Assigning integers to dataframe fields `OverflowError: Python int too large to convert to C unsigned long`

I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the values of the var and val columns, converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
maximum = 0
for column in df['var'].unique():
    for value in df['val'].unique():
        if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
            maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
        print(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique. I don't understand why this happens, but I suddenly have 3000 unique ids, while there are only 92 unique var/val combinations.
I don't know why exactly. Maybe the lambda defaults to int64 instead of Python's int?
I have a workaround that maybe is useful for you.
Convert the result to a string (object):
df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
edit:
After reading the last link, I assume that when you use a loop you are working with the Python int type, not the int type pandas uses (which comes from numpy). So when you work with a DataFrame you are using the types that numpy provides...
numpy's integer types are fixed-width, so I think the correct way to work with such large integers is to use the object dtype.
That is my conclusion, but maybe I am wrong.
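A quick illustration of that boundary (my own sketch): Python ints are arbitrary precision, while numpy's fixed-width types are not, so anything past the uint64 range above overflows.
import numpy as np

# A 16-character string encodes to a 128-bit integer, far beyond uint64.
print(int.from_bytes(b'clump_thickness5', 'little'))

np.array([2**64 - 1], dtype=np.uint64)  # fine: the largest value uint64 can hold
# np.array([2**64], dtype=np.uint64)    # raises OverflowError: too large for uint64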
Edit (second question):
A simple example works:
d2 = {'val': [2, 1, 1, 2],
      'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis = 1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507
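If the full data still produces more ids than var/val combinations, a debugging suggestion (mine, not from the original answer): compare the number of distinct formatted inputs with the number of distinct ids; a mismatch usually means the strings differ invisibly, e.g. stray whitespace or mixed types in var or val.
combos = df['var'].astype(str).str.strip() + df['val'].astype(str).str.strip()
print(combos.nunique(), df['id'].nunique())  # these two counts should match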

Excel Rank Multiple Columns

I'm facing an issue with ranking in Excel, particularly in regards to tie breaking. I tried several options but I guess they don't fit my issue. It's quite simple really, I'll explain:
The Data:
1 2 3 4 5 6 7 8 9 10
87 83 74 95 69 90 73 0 74 85
121 121 96 121 121 121 121 83 121 121
As you can see, it's easy for me to rank the first line (I'm working in columns instead of rows for the data). The RANK function gives the following result:
3 5 6 1 9 2 8 10 6 4
Which is correct.
The problem arises in the second line. There are ties because all of them reach the maximum of 121:
1 1 9 1 1 1 1 10 1 1
What I would like to do is use the first row as a tie breaker. So even if there is a tie, the first line, which was originally text but is now a sequence from 1 to 10, could serve as a secondary criterion to order the rank, thus giving the following ranking line:
1 2 9 3 4 5 6 10 7 9
Could one achieve this result?
Thank you very much in advance.
You need a helper row to break the tie. You can add a fraction of the first row to the second row to create a new row, and use that new row to rank:
A4 = A3+(A2/(MAX($A$2:$J$2)+1))
Using MAX I ensure the fraction is less than 1, which is adequate to break the ties in this case: each tied 121 becomes 121 plus a distinct fraction, so RANK can separate them.
A6 = RANK(A4,$A$4:$J$4)
You can hide the helper row if you don't want to show it.

Check field position in an Excel

How do I check field positions in Excel? I have to check the record length of an ASCII file and the field positions. I have checked the length but am not sure how to check the positions of the fields.
Example I have:
Account Number Len Institution Len Cost Center Len
830226579 9 268 3 8924 4
830168953 9 268 3 8904 4
830255130 9 268 3 8904 4
830065638 9 268 3 8924 4
830065620 9 268 3 8924 4
Thank you.
You can choose the cell boundaries with the Text Import Wizard.
