Delete duplicate rows but keep the first 2 instances

I have a column that contains duplicate rows, and I would like to delete the duplicates but keep the first 2 instances.
In other words, remove duplicate lines that are repeated more than 2 times.
Example input:
i 10
i 10
a 12
a 12
b 12
b 12
c 14
c 14
x 14
x 14
y 14
y 14
a 14
a 14
n 13
n 13
m 13
m 13
x 13
x 13
Desired output:
i 10
i 10
a 12
a 12
c 14
c 14
n 13
n 13
I tried
awk '!a[$2]++' file
Appreciate your help

I think the problem with your command is that it checks whether a line is the first occurrence of its key instead of checking whether it is one of the first two. Something like this should work:
awk 'a[$2]++<2' file
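For anyone who wants to see the counting logic spelled out, here is a minimal Python sketch of the same idea. The function name keep_first_n and its parameters are made up for illustration; the key is the second whitespace-separated field, as with $2 in awk. A line is kept only while the count seen so far for its key is below n, and the count is incremented afterwards, which mirrors the post-increment in a[$2]++.

```python
from collections import defaultdict

def keep_first_n(lines, n=2, key_field=1):
    """Yield each line only while its key has been seen fewer than n times."""
    seen = defaultdict(int)  # key -> how many lines with that key were seen so far
    for line in lines:
        fields = line.split()
        # Like awk, treat a missing field as an empty key
        key = fields[key_field] if len(fields) > key_field else ""
        if seen[key] < n:    # test the count *before* incrementing
            yield line
        seen[key] += 1

sample = ["i 10", "i 10", "a 12", "a 12", "b 12", "b 12", "c 14", "c 14", "x 14", "x 14"]
for line in keep_first_n(sample):
    print(line)
```

Note that the key is only the second column, so "b 12" is dropped because "a 12" has already used up both slots for 12, which matches the desired output in the question.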

Related

How to drop columns of csv data in J

I have a lot of CSV files from which I have to drop the date column.
I have a J line that reads a CSV file into a numeric array:
rdtabfile =: (0&".;.2#:(TAB&,)#:}:);._2) # ReadFile #<

If you know the column number of the date column, I would just apply a mask to each row of the array, using the dyadic verb # (copy).
[ t =: i. 4 5
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
mask=: ~: [: i. # NB. x would be the column to be dropped, y is the numeric matrix
delcol=: (mask # ])"1
1 delcol t
0 2 3 4
5 7 8 9
10 12 13 14
15 17 18 19
delcola=: ((~: [: i. #) # ])"1 NB. can be done in one line
2 delcola t
0 1 3 4
5 6 8 9
10 11 13 14
15 16 18 19
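For readers less familiar with J, the same mask-and-copy idea can be sketched in Python. This is only an illustration of the technique, not a replacement for the J verbs above; the name delcol is reused for readability, and the matrix is assumed to be a plain list of rows.

```python
def delcol(col, matrix):
    """Return a copy of matrix (a list of rows) without the column at index col."""
    result = []
    for row in matrix:
        # Boolean mask, False only at the dropped column (like  col ~: i. # row)
        mask = [j != col for j in range(len(row))]
        # Keep the values where the mask is True (like  mask # row)
        result.append([value for value, keep in zip(row, mask) if keep])
    return result

# Same values as  i. 4 5
t = [list(range(r * 5, r * 5 + 5)) for r in range(4)]
for row in delcol(1, t):
    print(row)
```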

pandas random shuffling dataframe with constraints

I have a dataframe that I need to randomise in a very specific way with a particular rule, and I'm a bit lost. A simplified version is here:
idx type time
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 a 3
9 a 3
10 b 4
11 b 4
12 b 4
13 a 5
14 a 5
15 a 5
16 b 6
17 b 6
18 b 6
19 a 7
20 a 7
21 a 7
If we consider this as containing seven "bunches", I'd like to randomly shuffle by those bunches, i.e. retaining the time column. However, the constraint is that after shuffling, a particular bunch type (a or b in this case) cannot appear more than n (e.g. 2) times in a row. So an example correct result looks like this:
idx type time
21 a 7
20 a 7
19 a 7
7 a 3
8 a 3
9 a 3
17 b 6
16 b 6
18 b 6
6 b 2
5 b 2
4 b 2
2 a 1
3 a 1
1 a 1
14 a 5
13 a 5
15 a 5
12 b 4
11 b 4
10 b 4
I was thinking I could create a separate "order" array from 1 to 7 and np.random.shuffle() it, then sort the dataframe by time in that order, which will probably work - I can think of ways to do that part, but I'm especially struggling with the rule of restricting the number of repeats.
I know roughly that I should use a while loop, shuffle it in that way, loop over the frame and track the number of consecutive types, if it exceeds my n then break out and start the while loop again until it completes without breaking out, in which case set a value to end the while loop. But this got so messy and didn't work.
Any ideas?

See if this works.
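import pandas as pd
import numpy as np

# Small sample frame: three bunches, identified by the time column
rows = [['a',1],['a',1],['a',1],
        ['b',2],['b',2],['b',2],
        ['a',3],['a',3],['a',3]]
df = pd.DataFrame(rows, columns=['type','time'])
print(df)

# Shuffle the unique time values to get a random bunch order
order = np.unique(np.array(df['time']))
print("Before Shuffling", order)
np.random.shuffle(order)
print("Shuffled", order)

# Print the first n rows of each bunch in the shuffled order
n = 2
for i in order:
    print(df[df['time'] == i].iloc[0:n])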
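The question also asks for a limit on how many bunches of the same type may appear in a row. Here is a rough sketch of the retry loop described there, using only the standard library so it runs on its own. The names max_run and shuffle_bunches are made up for illustration, and the mapping from time value to type is assumed to have been extracted from the dataframe beforehand.

```python
import random
from itertools import groupby

def max_run(types):
    """Length of the longest run of equal consecutive values."""
    return max((sum(1 for _ in group) for _, group in groupby(types)), default=0)

def shuffle_bunches(bunch_types, n=2, max_tries=10_000, rng=random):
    """Return the bunch keys in random order, with no type more than n times in a row.

    bunch_types maps each bunch key (the time value) to its type.
    """
    keys = list(bunch_types)
    for _ in range(max_tries):
        rng.shuffle(keys)
        # Accept the shuffle only if the longest run of one type is within the limit
        if max_run([bunch_types[k] for k in keys]) <= n:
            return list(keys)
    raise RuntimeError("no valid order found; the constraint may be unsatisfiable")

# The seven bunches from the question: time value -> type
bunch_types = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'a', 6: 'b', 7: 'a'}
order = shuffle_bunches(bunch_types, n=2)
print(order)
```

With pandas, the resulting order could then be applied with something like pd.concat([df[df['time'] == t].sample(frac=1) for t in order]), which also shuffles the rows inside each bunch. Simple retrying is fine for small inputs like this one; for many bunches or a tight limit it may need many attempts, hence the max_tries guard.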

Concatenating columns from different files, while skipping the blank lines

I know it's likely possible to do this with awk, but I have no idea how to do it.
Suppose I have the following 2 tab separated files, where there are blank lines that only contain \n:
file1:
A 1 4
B 2 5
C 3 6
D 7 10
E 8 11
A 9 12
file2:
E 13 16
F 14 17
G 15 18
H 19 22
I 20 23
J 21 24
I want to generate a new file which corresponds to the concatenation of the first 2 columns from file 1 with the third column from file 2, and then the third column from file 1:
final file:
A 1 16 4
B 2 17 5
C 3 18 6
D 7 22 10
E 8 23 11
A 9 24 12
Note that, in the final file, it's important that the blank lines should be kept blank, and no tabs should be inserted in there.

Simple paste + awk combination:
paste file1 file2 | awk '!NF{ print "" }NF{ print $1,$2,$6,$3 }'
The output:
A 1 16 4
B 2 17 5
C 3 18 6
D 7 22 10
E 8 23 11
A 9 24 12

awk 'NR==FNR{a[NR]=$3;next} NF{$3=a[FNR] OFS $3} 1' file2 file1
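If it helps to see the line-by-line logic outside awk, here is a small Python sketch of the same merge. It assumes that both files have blank lines at the same positions and that every non-blank line has three tab-separated fields; the function name merge_columns is made up for illustration.

```python
def merge_columns(lines1, lines2, sep="\t"):
    """Combine columns 1-2 of file1, column 3 of file2, then column 3 of file1."""
    merged = []
    for left, right in zip(lines1, lines2):
        left, right = left.rstrip("\n"), right.rstrip("\n")
        if not left.strip():
            # Keep blank lines blank, without inserting any separators
            merged.append("")
            continue
        a, b = left.split(sep), right.split(sep)
        merged.append(sep.join([a[0], a[1], b[2], a[2]]))
    return merged

file1 = ["A\t1\t4\n", "B\t2\t5\n", "\n", "D\t7\t10\n"]
file2 = ["E\t13\t16\n", "F\t14\t17\n", "\n", "H\t19\t22\n"]
print("\n".join(merge_columns(file1, file2)))
```

Unlike the paste + awk one-liner, which prints with awk's default space separator, this sketch joins the fields with tabs.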

How do I get each column of the same row in one line?

I have these columns in Excel:
A B C D E F
Nima1 1 2 3 4 5
Nima2 6 7 8 9 10
Nima3 11 12 13 14 15
Nima4 16 17 18 19 20
and I want to show them like this:
Nima1 1
Nima1 2
Nima1 3
Nima1 4
Nima1 5
Nima2 6
…
Nima4 20
So far I have come up with nothing; every formula that I write fails.
If anyone knows how to do it, please guide me through it.

In any unused cell to the right, put in this formula:
'for systems that use a comma as a list separator
=INDEX(A:E,(ROW(1:1)-1)/5+1,IF(COLUMN(A:A)=1,1,MOD(ROW(1:1)-1,5)+1))
'for systems that use a semi-colon as a list separator
=INDEX(A:E;(ROW(1:1)-1)/5+1;IF(COLUMN(A:A)=1;1;MOD(ROW(1:1)-1;5)+1))
Fill right one column, then fill both cells down until you get zeroes.
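The index arithmetic in that formula can be easier to follow outside the spreadsheet. Here is a short Python sketch of the same row and column calculation, assuming each source row is a name followed by five values; the function name unpivot and the width parameter are made up for illustration.

```python
def unpivot(rows, width=5):
    """Turn rows of [name, v1, ..., v_width] into [name, value] pairs."""
    pairs = []
    k = 1  # output row number, playing the role of ROW(1:1) as it fills down
    while (k - 1) // width < len(rows):
        r = (k - 1) // width     # source row: same idea as (ROW(1:1)-1)/5+1, but 0-based
        c = (k - 1) % width + 1  # value position: cycles 1..width, skipping the name at index 0
        pairs.append([rows[r][0], rows[r][c]])
        k += 1
    return pairs

rows = [["Nima1", 1, 2, 3, 4, 5], ["Nima2", 6, 7, 8, 9, 10]]
for name, value in unpivot(rows):
    print(name, value)
```

Every group of five output rows maps back to one source row (integer division), and the position inside the group picks the value column (remainder).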

Why does this not work EXCEL 2010

a 1 Offset 4
b 2 message h e l l o
c 3 Value 8 5 12 12 15
d 4 Encrypted value 12 9 16 16 19
e 5 Encrypted letter #N/A #N/A #N/A #N/A #N/A
f 6
g 7
h 8
i 9
j 10
k 11
l 12
m 13
n 14
o 15
p 16
q 17
r 18
s 19
t 20
u 21
v 22
w 23
x 24
y 25
z 26
The task is to take the value of each letter, add the offset to it, and then put the encrypted letters in the "Encrypted letter" cells. But I keep getting this #N/A error. The formula I used is:
=VLOOKUP(F4,$A$1:$B$26,2,FALSE)
link to screenshot: http://i1370.photobucket.com/albums/ag263/dylanevs/excel_zps3f61ed50.jpg

I believe you need to swap your columns.
VLOOKUP looks for the value in the first column of the range and returns a cell to its right (in your formula, the second column).
The #N/A appears because it is looking for an exact match for a number in a column that contains only letters.
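To illustrate why the lookup direction matters, here is a minimal Python sketch of the same exercise. The table that maps numbers back to letters has to be keyed by the number, which is what swapping the columns achieves for VLOOKUP. The names number_to_letter, letter_to_number and encrypt are made up for illustration, and wrap-around past z is not handled because the question does not cover it.

```python
import string

# Lookup keyed by number (the "swapped" table): 1 -> 'a', ..., 26 -> 'z'
number_to_letter = {i: ch for i, ch in enumerate(string.ascii_lowercase, start=1)}
# Lookup keyed by letter, used to get each letter's value
letter_to_number = {ch: i for i, ch in number_to_letter.items()}

def encrypt(message, offset):
    """Shift each letter by offset, going through its numeric value."""
    values = [letter_to_number[ch] for ch in message]
    shifted = [v + offset for v in values]
    # Looking up a number requires the table keyed by number
    return "".join(number_to_letter[v] for v in shifted)

print(encrypt("hello", 4))  # values 8 5 12 12 15 -> 12 9 16 16 19 -> "lipps"
```

A shifted value above 26 raises KeyError here, much as VLOOKUP would return #N/A for a value that is missing from the table.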
