Keep rows if the difference between two values in column 2 exceeds a certain amount, for each category given in column 1

I want to subsample a file, keeping as many entries as possible such that, for each name in column 1, the kept values in column 2 are at least 500 units apart. The full file is ~200,000 lines long, sorted by column 1 then column 2, tab-separated, and looks something like this:
name1 107
name1 110
name1 472
name1 509
name1 599
name1 679
name1 710
name2 36
name2 179
name2 391
name2 696
name2 1427
name2 1583
name2 1722
name2 2090
name2 2136
name2 2235
name3 687
name3 933
name4 43
name4 207
name4 384
name4 439
name4 447
name4 603
name4 774
name4 802
name4 876
name4 988
I would like an output that looks like this:
name1 107
name1 679
name2 36
name2 696
name2 1427
name2 2090
name3 687
name4 43
name4 603
I think one way to do it is to keep the first entry for each name, then keep the next entry for that name that is at least 500 units larger, then the next entry that is at least 500 larger than that, and so on. Then repeat for each name. It would also be fine to work in reverse, starting with the last entry for each name, or to start elsewhere, as long as it maximizes the number of entries retained for each name that are at least 500 units apart.
However, I have no idea how to code this, as I am a novice! Thank you for your help!

I chose to do it in Python, which is turning into the lingua franca of bioinformatics.
(Learn enough Python for your biology needs here: http://learnpythonthehardway.org/book/)
Copy the following into a file and run it with: python script_name.py input_textfile.txt
(If you do not know enough Python to do that, chapters 0 and 1 of the book linked above will help.)
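import sys

name_column = 0
number_column = 1
min_difference = 500
last_name = None    # name of the group currently being processed
last_number = None  # last value that was kept for that group

with open(sys.argv[1], 'r') as input_file:
    for line in input_file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[name_column]
        number = int(fields[number_column])
        # Always keep the first entry of a new name; otherwise keep the entry
        # only if it is at least min_difference above the last one kept.
        if name != last_name or number - last_number >= min_difference:
            print(line.strip())
            last_name = name
            last_number = number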
Output using data above:
name1 107
name1 679
name2 36
name2 696
name2 1427
name2 2090
name3 687
name4 43
name4 603
If you want the output in a file, use python script_name.py input_textfile.txt > output_file
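The same greedy rule can also be written as a small function that is easier to test in isolation. This is only a sketch: the function name `thin_rows` and the in-memory list of `(name, value)` pairs are assumptions made for illustration, and the input is assumed to be sorted by name and then by value, as in the question.

```python
from itertools import groupby


def thin_rows(rows, min_difference=500):
    """Greedily keep rows whose values are at least min_difference apart.

    rows: iterable of (name, value) pairs, sorted by name and then by value.
    """
    kept = []
    for name, group in groupby(rows, key=lambda row: row[0]):
        last_kept = None  # last value kept for the current name
        for _, value in group:
            # Keep the first value of each name, then every value that is
            # far enough above the last one that was kept.
            if last_kept is None or value - last_kept >= min_difference:
                kept.append((name, value))
                last_kept = value
    return kept


# A subset of the sample data from the question.
sample = [
    ("name1", 107), ("name1", 110), ("name1", 472), ("name1", 509),
    ("name1", 599), ("name1", 679), ("name1", 710),
    ("name3", 687), ("name3", 933),
]
print(thin_rows(sample))
# [('name1', 107), ('name1', 679), ('name3', 687)]
```

Because the values are sorted, always taking the earliest value that qualifies leaves the most room for later picks, which is why starting from the first entry of each name should retain as many entries as any other starting point.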

INDEX / MATCH / VLOOKUP Assistance

The below table is the first 29 rows in an XLSX I'm working on, which basically aims to calculate the costs of exported call data.
The data in the table below is the result of population from a PowerShell script, which combines Rate Data from a CSV (to calculate call charges) with Call Data from the customer's daily call stats in another CSV.
Rate Data:
Column F [Destination] contains every known Country Code.
Column E [Rate] contains a Rate value for each Country Code in column F, which will be used to calculate the call cost at the end.
Call Export Data:
Column C [Callee Number] contains the original phone number that was called (Callee).
Column H [Callee Country Code] takes the first few digits of the number in Column C for the next step.
Required goal that I'm quite frankly stuck on:
Column I is what I'm working on.
I need a formula that effectively looks for the dialled country code in Column H and looks for the country code that's the MOST SIMILAR to it (Doesn't need to be exact) in Column F. Once found, I need it to return the value on the same row, in Column E [Rate].
Column I should then be populated with the correct Rate for the Number in Column C / H.
Formulas I've tried:
=INDEX($A$2:$K$100000,VLOOKUP(H2,$F$2:$F$100000,5,TRUE))
=INDEX($G$2:$G$100000,MATCH(H2,$F$2:$F$100000,0))
I'm not great with Excel, and I'm sure using 100000 to select the whole column is poor practice.
Thanks for any help :)
| Start time | Customer | Callee Number | Country | Rate | Destination | Duration (Minutes) | Callee Country Code | Rate for call | Cost of call |
|---|---|---|---|---|---|---|---|---|---|
| 2020-09-01T07:25:30.5190000Z | Name1 | +44*** | AFGHANISTAN | 1.415 | 93 | 0 | 44 | | |
| 2020-09-01T08:05:52.6250000Z | Name2 | +442476****** | AFGHANISTAN | 1.415 | 9320 | 0.383333333 | 442 | | |
| 2020-09-01T08:33:49.6530000Z | Name3 | +441509****** | AFGHANISTAN | 1.415 | 9321 | 0.7 | 441 | | |
| 2020-09-01T08:35:18.5300000Z | Name4 | +441509****** | AFGHANISTAN | 1.415 | 9322 | 0.766666667 | 441 | | |
| 2020-09-01T08:43:45.3300000Z | Name5 | +447976****** | AFGHANISTAN | 1.415 | 9323 | 1.85 | 447 | | |
| 2020-09-01T08:47:29.9630000Z | Name6 | +442476****** | AFGHANISTAN | 1.415 | 9324 | 2.533333333 | 442 | | |
| 2020-09-01T08:57:43.2680000Z | Name7 | +447875****** | AFGHANISTAN | 1.415 | 9325 | 3.633333333 | 447 | | |
| 2020-09-01T09:04:42.8230000Z | Name8 | +441212****** | AFGHANISTAN | 1.415 | 9326 | 4.916666667 | 441 | | |
| 2020-09-01T09:15:32.7220000Z | Name9 | +441923****** | AFGHANISTAN | 1.415 | 9327 | 1.9 | 441 | | |
| 2020-09-01T09:30:36.4750000Z | Name10 | +441923****** | AFGHANISTAN | 1.415 | 9328 | 5.8 | 441 | | |
| 2020-09-01T09:58:12.8380000Z | Name11 | +442476****** | AFGHANISTAN | 1.415 | 9370 | 0.516666667 | 442 | | |
| 2020-09-01T10:03:04.1270000Z | Name12 | +442476****** | AFGHANISTAN | 1.415 | 9375 | 13.51666667 | 442 | | |
| 2020-09-01T10:27:49.6090000Z | Name13 | +442476****** | AFGHANISTAN | 1.415 | 9377 | 2.716666667 | 442 | | |
| 2020-09-01T11:04:21.7850000Z | Name14 | +442476****** | AFGHANISTAN | 1.415 | 9378 | 1.6 | 442 | | |
| 2020-09-01T11:13:31.9810000Z | Name15 | +442070****** | AFGHANISTAN | 1.415 | 9379 | 9.816666667 | 442 | | |
| 2020-09-01T11:46:53.4730000Z | Name16 | +442476****** | ALAND ISLANDS | | 247 | 0.283333333 | 442 | | |
| 2020-09-01T11:47:14.9110000Z | Name17 | +442476****** | ALBANIA | 0.537 | 355 | 0.866666667 | 442 | | |
| 2020-09-01T12:30:38.4380000Z | Name18 | +442476****** | ALBANIA | 0.537 | 3554 | 0.25 | 442 | | |
| 2020-09-01T12:30:59.5190000Z | Name19 | +442476****** | ALBANIA | 0.537 | 35567 | 0.283333333 | 442 | | |
| 2020-09-01T12:31:34.3300000Z | Name20 | +442476****** | ALBANIA | 0.537 | 35568 | 0.283333333 | 442 | | |
| 2020-09-01T12:35:20.8430000Z | Name21 | +442476****** | ALBANIA | 0.537 | 35569 | 0.3 | 442 | | |
| 2020-09-01T12:37:36.5550000Z | Name22 | +442476****** | ALGERIA | 0.537 | 213 | 1.366666667 | 442 | | |
| 2020-09-01T12:42:07.9660000Z | Name23 | +447723****** | ALGERIA | 0.537 | 21321 | 1.466666667 | 447 | | |
| 2020-09-01T13:13:37.7610000Z | Name24 | +441926****** | ALGERIA | 0.537 | 21355 | 3.283333333 | 441 | | |
| 2020-09-01T13:44:57.3190000Z | Name25 | +442476****** | ALGERIA | 0.537 | 21356 | 0.15 | 442 | | |
| 2020-09-01T13:46:39.2640000Z | Name26 | +442476****** | ALGERIA | 0.537 | 21366 | 0.15 | 442 | | |
| 2020-09-01T13:58:14.1340000Z | Name27 | +442476****** | ALGERIA | 0.537 | 21369 | 6.2 | 442 | | |
| 2020-09-01T13:58:30.5560000Z | Name28 | +442476****** | ALGERIA | 0.537 | 21377 | 0.583333333 | 442 | | |
I copied the table to a workbook and came up with this formula:
=IF(LEFT($C1;4)=$H1;INDEX($A$1:$I$28;ROW($H1);5);"")
The formula checks whether the first 4 characters of the phone number in column C equal the value in column H. If they do, it takes the row number of the H cell and returns the value from column E (the 5th column of the range) on that row. If they do not, it returns an empty string.
I don't know if you mistyped the question, but I see no similarity between columns H and F. You wrote:
"I need a formula that effectively looks for the dialled country code in Column H and looks for the country code that's the MOST SIMILAR to it (Doesn't need to be exact) in Column F."
So I wrote the formula for columns C and H instead.
There is one downside to this formula: column H must contain exactly the first 4 characters of the number, so it has to look like +442, +447, etc.
Feel free to change the ranges and the column and row references to match your Excel table.
Just a tip: if you want to select entire columns, write A:A, B:B, etc. in the formula, or just click the column header and the reference is inserted automatically. The same works for rows.
A couple of suggestions to improve your Excel:
To work well with Excel, I suggest using helper columns for the intermediate calculations, since nobody can follow lengthy formulas. Better yet, use small lookup tables for reference.
You need two helper columns: one for the country code (extracted from the phone number) and one for the rate.
So I would suggest adding a table with "Country Code", "Country Name" and "Rate". It will be a small table, and you can look your data up from there.
Please note the following:
A phone number is a string, not a number. Look at the cells in "Callee Number": are they formatted as numbers or as text? It's important to make sure they are formatted as text. Once they are, you can start manipulating them properly.
You're right, and as was noted, if you want to search an entire column, just write F:F rather than a start row and end row when there isn't really one. That's also a good reason to use a small lookup table: Excel needs to search less data to find the required information.
Test your results to make sure that VLOOKUP or INDEX/MATCH (or XLOOKUP, the new formula on the block) is doing what you expect :)
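One way to read "most similar" in the question is as a longest-prefix match: take the dialled number and find the longest destination code that it starts with. That reading is an assumption, not something the question states. The sketch below illustrates the logic in Python rather than as an Excel formula. The function name `rate_for_number` and the rates in the example dictionary are made up for illustration and are not taken from the table.

```python
def rate_for_number(number, rates):
    """Return the rate of the longest destination code that prefixes number.

    number: dialled number as text, e.g. "+35567123456"
    rates:  dict mapping destination code (text) to rate
    Returns None when no destination code matches.
    """
    digits = number.lstrip("+")
    # Try the longest candidate prefix first, then shorten it step by step.
    for length in range(len(digits), 0, -1):
        prefix = digits[:length]
        if prefix in rates:
            return rates[prefix]
    return None


# Hypothetical rates, chosen so the two codes give different results.
example_rates = {"355": 0.5, "35567": 0.9}

print(rate_for_number("+35567123456", example_rates))  # 0.9 (matches 35567)
print(rate_for_number("+35512345678", example_rates))  # 0.5 (falls back to 355)
print(rate_for_number("+999", example_rates))          # None (no match)
```

Keeping the codes as text, as the answer above recommends for phone numbers, is what makes the prefix comparison work.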

How to transfer a list of elements to a pandas dataframe by every three elements?

I have a list of people's info and I want to transfer it to a pandas dataframe.
My list:
My_lst = ['Name1','Title1','Company1','Name2','Title2','Company2','Name3','Title3','Company3',
'Name4','Title4','Company4','Name5','Title5','Company5','Name6','Title6','Company6'...]
Expected outputs:
NAME TITLE COMPANY
Name1 Title1 Company1
Name2 Title2 Company2
Name3 Title3 Company3
...
How do I do that in Python? Thank you for the help!
If I understand correctly, you can reshape the list with NumPy:
import numpy as np
import pandas as pd
pd.DataFrame(np.array(My_lst).reshape((-1,3)),columns=['name','title','company'])
name title company
0 Name1 Title1 Company1
1 Name2 Title2 Company2
2 Name3 Title3 Company3
3 Name4 Title4 Company4
4 Name5 Title5 Company5
5 Name6 Title6 Company6
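The same result can be reached without NumPy by slicing the list into chunks of three. This is a minimal sketch with a shortened example list. The variable names are illustrative, and the length check is an added assumption about how incomplete records should be handled.

```python
import pandas as pd

my_lst = ['Name1', 'Title1', 'Company1',
          'Name2', 'Title2', 'Company2']

# Each record is assumed to consist of exactly three consecutive items.
if len(my_lst) % 3 != 0:
    raise ValueError("list length is not a multiple of 3")

# Slice the flat list into consecutive chunks of three.
rows = [my_lst[i:i + 3] for i in range(0, len(my_lst), 3)]
df = pd.DataFrame(rows, columns=['NAME', 'TITLE', 'COMPANY'])
print(df)
```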

Pandas: Better way to combine rows for 'wide' dataset?

I'm trying to make a 'wide' dataset, with one record per game, rather than one record per team, per game. Here's a small example of what I have, first, and then what I'd like to have.
GAME-ID TEAM SCORE
0 123 Cleveland 95
1 123 Orlando 101
2 124 New York 104
3 124 Detroit 98
GAME-ID TEAM1 TEAM2 SCORE1 SCORE2
0 123 Cleveland Orlando 95 101
1 124 New York Detroit 104 98
I can set a flag for game id count (see below), then later use a for loop to iterate through and set values conditionally, but thought there may be an easier way.
import pandas as pd
dict1 = {'GAME-ID':[123, 123, 124, 124],
         'TEAM':['Cleveland', 'Orlando', 'New York', 'Detroit'],
         'SCORE':[95, 101, 104, 98]}
df = pd.DataFrame(dict1)
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1
print(df)
Result from code above:
GAME-ID TEAM SCORE GAME_ID_CT
0 123 Cleveland 95 1
1 123 Orlando 101 2
2 124 New York 104 1
3 124 Detroit 98 2
If there's a way to do this by column rather than a bunch of loops, it would be great.
You can try pivot:
new_df = df.pivot(index='GAME-ID',columns='GAME_ID_CT')
# rename
new_df.columns = [f'{a}{b}' for a,b in new_df.columns]
Output:
TEAM1 TEAM2 SCORE1 SCORE2
GAME-ID
123 Cleveland Orlando 95 101
124 New York Detroit 104 98
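For reference, here is a self-contained sketch of the pivot approach, starting from the question's sample data. Passing `values` explicitly and calling `reset_index()` at the end are additions made here so that GAME-ID comes back as an ordinary column. The name `wide` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'GAME-ID': [123, 123, 124, 124],
    'TEAM': ['Cleveland', 'Orlando', 'New York', 'Detroit'],
    'SCORE': [95, 101, 104, 98],
})

# Number the rows within each game: 1 for the first team, 2 for the second.
df['GAME_ID_CT'] = df.groupby('GAME-ID').cumcount() + 1

# One row per game, with one column per (field, team number) pair.
wide = df.pivot(index='GAME-ID', columns='GAME_ID_CT',
                values=['TEAM', 'SCORE'])

# Flatten the two-level column labels, e.g. ('TEAM', 1) -> 'TEAM1'.
wide.columns = [f'{field}{number}' for field, number in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because TEAM and SCORE are pivoted together, the score columns may come back with object dtype. If that matters, converting them afterwards with `astype(int)` is one option.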
I think this actually worked best for me. It's simple and accommodates lots more variables.
df1 = df[df['GAME_ID_CT'] == 1]
df2 = df[df['GAME_ID_CT'] == 2]
new_df = pd.merge(df1, df2, on='GAME-ID', suffixes=['1', '2'])
print(new_df)
GAME-ID TEAM1 SCORE1 GAME_ID_CT1 TEAM2 SCORE2 GAME_ID_CT2
0 123 Cleveland 95 1 Orlando 101 2
1 124 New York 104 1 Detroit 98 2

duplicate rows and swap columns based on condition

I have the following table:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 2 0 AB 105
2 name2 1 2 BB 110
3 name3 1 0 AN 120
and I want to do the following.
For every row whose type does not consist of the same 2 letters, I want to duplicate the row, swapping the a0 and a1 columns and the letters of the type column. So, my result will be:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 0 1 BA 100
2 name1 2 0 AB 105
3 name1 0 2 BA 105
4 name2 1 2 BB 110
5 name3 1 0 AN 120
6 name3 0 1 NA 120
Note that for example for the same name we can have the same type but different a0 and a1 (and hence val).
So, we can have name1 and type AB as in the first two lines of the original table.
I tried:
df1 = pd.DataFrame({'name':['name1', 'name1', 'name2', 'name3'], 'a0':[1, 2, 1, 1], 'a1':[0, 0, 2, 0], 'type':['AB', 'AB', 'BB', 'AN'], 'val':[100,105, 110, 120]})
for idx in df1.index:
    a1 = df1.loc[idx, 'a0']
    a0 = df1.loc[idx, 'a1']
    val = df1.loc[idx, 'val']
    name = df1.loc[idx, 'name']
    if df1.loc[idx, 'type'] == 'AB':
        new_type = 'BA'
    elif df1.loc[idx, 'type'] == 'AN':
        new_type = 'NA'
    row = pd.DataFrame({'name':name, 'a0':a0 , 'a1':a1 , 'type':new_type, 'val':val}, index=np.arange(idx))
    df1 = df1.append(row, ignore_index=False)
df1 = df1.sort_index().reset_index(drop=True)
but it gives me:
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 2 0 BA 105
2 name1 0 2 BA 105
3 name1 2 0 BA 105
4 name1 0 2 BA 105
5 name1 2 0 BA 105
6 name1 0 2 BA 105
7 name1 2 0 AB 105
8 name2 1 2 BB 110
9 name3 1 0 AN 120
First create a mask that identifies types made of 2 different letters, then use DataFrame.assign to create a new DataFrame with the letters reversed, swap the values in columns a0 and a1, concatenate it with the original, sort by index, and finally reset to a default index:
mask = df['type'].apply(set).str.len() == 2
df1 = df[mask].assign(type=lambda x: x['type'].str[1] + x['type'].str[0])
df1[['a0','a1']] = df1[['a1','a0']].to_numpy()
# in older pandas without to_numpy (before 0.24), use .values instead:
#df1[['a0','a1']] = df1[['a1','a0']].values
df = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print (df)
name a0 a1 type val
0 name1 1 0 AB 100
1 name1 0 1 BA 100
2 name1 2 0 AB 105
3 name1 0 2 BA 105
4 name2 1 2 BB 110
5 name3 1 0 AN 120
6 name3 0 1 NA 120
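The steps can also be wrapped in a function and checked against the expected result from the question. This is a sketch under a few assumptions: the helper name `add_swapped_rows` is made up, every type is taken to be exactly two characters long, and a stable sort (`kind="mergesort"`) is used so that each original row stays ahead of its swapped copy.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["name1", "name1", "name2", "name3"],
    "a0": [1, 2, 1, 1],
    "a1": [0, 0, 2, 0],
    "type": ["AB", "AB", "BB", "AN"],
    "val": [100, 105, 110, 120],
})


def add_swapped_rows(frame):
    """Append a swapped copy of every row whose two type letters differ."""
    mask = frame["type"].str[0] != frame["type"].str[1]
    swapped = frame[mask].copy()
    swapped["type"] = swapped["type"].str[::-1]                # 'AB' -> 'BA'
    swapped[["a0", "a1"]] = swapped[["a1", "a0"]].to_numpy()   # swap values
    # The copies share index labels with their originals, so a stable sort
    # on the index places each copy directly after its original row.
    combined = pd.concat([frame, swapped]).sort_index(kind="mergesort")
    return combined.reset_index(drop=True)


result = add_swapped_rows(df)
print(result)
```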
You can use:
def myfunc(x):
    x['type']=x['type'][::-1]
    x[['a0','a1']]=x[['a1','a0']].values
    return x
m=df.type.apply(set).str.len().gt(1)
pd.concat([df,df.loc[m].apply(myfunc,axis=1)],ignore_index=True).sort_values(['name','val'])
name a0 a1 type val
0 name1 1 0 AB 100
4 name1 0 1 BA 100
1 name1 2 0 AB 105
5 name1 0 2 BA 105
2 name2 1 2 BB 110
3 name3 1 0 AN 120
6 name3 0 1 NA 120

Perform computation on a value in one row and update another row's column with that value

I have a dataframe that looks somewhat like :
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
ABC XYZ 3523 454 4354 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I would like to apply some simple math on a column with a condition and assign it to a different row.
For instance
if Month == 2018-03 & Categor_2 == 'XYZ', perform Numeric_3*2 and assign it to Numeric_3 under month 2018-02.
So the output would be something like :
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3_ Adj Numeric_col4 Month
ABC XYZ 3523 454 246 565 2018-02
ABC XYZ 333 444 123 565 2018-03
qww ggg 3222 568 123 483976 2018-03
I was thinking of taking out the necessary columns, doing a pivot, applying the math, and then reshaping it back to the original layout.
However, if there is a quicker way, I would be grateful to know.
It depends on how many rows the filter matches. Here the filtered Series has a single element, so you can extract it as a scalar with next and iter, which also lets you supply a default value if no row matches the condition:
mask = (df.Month == '2018-03') & (df.Categor_2 == 'XYZ')
print (df.loc[mask, 'Numeric_3'] * 2)
1 246
Name: Numeric_3, dtype: int64
# get the first value of the Series; return 0 if the Series is empty
a = next(iter(df.loc[mask, 'Numeric_3'] * 2), 0)
print (a)
246
df.loc[df.Month == '2018-02', 'Numeric_3'] = a
print (df)
Categor_1 Categor_2 Numeric_1 Numeric_2 Numeric_3 Numeric_col4 Month
0 ABC XYZ 3523 454 246 565 2018-02
1 ABC XYZ 333 444 123 565 2018-03
2 qww ggg 3222 568 123 483976 2018-03
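A self-contained variant is sketched below, using only the columns that matter and the factor of 2 from the question. It makes two assumptions that are not in the original: the 2018-02 target is also restricted to Categor_2 == 'XYZ', and nothing is written when no source row matches. The names `source`, `target` and `new_value` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Categor_1": ["ABC", "ABC", "qww"],
    "Categor_2": ["XYZ", "XYZ", "ggg"],
    "Numeric_3": [4354, 123, 123],
    "Month": ["2018-02", "2018-03", "2018-03"],
})

# Row(s) to read from and row(s) to write to.
source = (df["Month"] == "2018-03") & (df["Categor_2"] == "XYZ")
target = (df["Month"] == "2018-02") & (df["Categor_2"] == "XYZ")

# First matching value times 2, or None when nothing matches.
new_value = next(iter(df.loc[source, "Numeric_3"] * 2), None)
if new_value is not None:
    df.loc[target, "Numeric_3"] = new_value

print(df)
```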
