Pattern decoding - linux

I need a little help with the following. I have this kind of data file:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

0 0 # <--- Group 6
There are several cases that need to be answered:
The example contains groups; a group is a block of lines separated from the others by empty lines, so in this case we have 6 groups. We have to determine the following:
Get the ordinal number of the group (the counter starts, for example, from 1)
if the 1st column = 0 and the 2nd column = 0 and the next line is empty
So the desired output according to the above example would be
1
5
6
if the first column = 0 and the 2nd column can vary and the next line is empty
So the desired output according to the above example would be
3
... etc. How can this be generalized so that we can set at the beginning which case we would like to get?
There might be many cases according to the values of the columns in a group.
We can imagine this if we think about something like this: the first column numbers the houses in a street, and the second column numbers the rooms in a house. Now I would like to find all possible kinds of streets in a city, which means for example
let us pick those streets in which there are two houses with different numbers of rooms: 3 rooms in the first house and 2 rooms in the second. So we have to get the output 2, because this group in the file fulfills the requirement:
0 0
0 1
0 2
1 0
1 1
Important: 0 0 means there is one house with one room
Correction: if there is only one house, then it always has just one room, as in Group 1, Group 5, and Group 6. Remember that the second column is the room counter: 0 means "1 room", 1 means "2 rooms", etc. It is just a counter that starts from 0 instead of 1; sorry if this is a little confusing.

I don't know what your expected output would be; however, I have converted/decoded your number pattern into a meaningful group/house/rooms format. Any further "query" could be done on this content.
See below:
kent$ cat file
0 0

0 0
0 1
0 2
1 0
1 1

0 0
0 1
0 2

0 0
1 0
2 0
3 0

0 0

0 0
awk:
kent$ awk 'BEGIN{RS=""}
{ print "\ngroup "++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "House#: %s , Room(s): %s \n", x, a[x]; }' file
we get output:
group 1
House#: 0 , Room(s): 1
group 2
House#: 0 , Room(s): 3
House#: 1 , Room(s): 2
group 3
House#: 0 , Room(s): 3
group 4
House#: 0 , Room(s): 1
House#: 1 , Room(s): 1
House#: 2 , Room(s): 1
House#: 3 , Room(s): 1
group 5
House#: 0 , Room(s): 1
group 6
House#: 0 , Room(s): 1
note that the generated format could be changed to fit your "filter" or "query"
UPDATE
OP's comment:
I need to know, the number of the group(s) which have/has for example
1 house with one room. The output would be in the above case: 1, 5 ,6
As I said, based on your query criteria, we could adjust the awk output for the next step. Now I change the awk above to:
awk 'BEGIN{RS=""}
{print ""; gid=++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "%s %s %s\n", gid,x, a[x]; }' file
this will output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 1
The format is groupIdx houseIdx numberOfRooms, and there is a blank line between groups. We save the text above to a file named decoded.txt,
so your query can be run on this text:
kent$ awk 'BEGIN{RS="\n\n"}{if (NF==3 && $3==1)print $1}' decoded.txt
1
5
6
The last awk line above means: print the group number if the group block contains only one line (NF==3) and its room count ($3) is 1.

I would first define a House class and a Group class:
class House:
    def __init__(self, rooms):
        self.rooms = rooms

class Group:
    def __init__(self, index, houses):
        self.index = index
        # houses.values() is a list with number of rooms for each house.
        self.houses = [House(houses[house_nr]) for house_nr in sorted(houses)]

    def __str__(self):
        return 'Group {}'.format(self.index)

    def __repr__(self):
        return 'Group {}'.format(self.index)
Then parse the data into this hierarchical structure:
import collections

with open('in.txt') as f:
    groups = []
    # Variable to accumulate current group.
    group = collections.defaultdict(int)
    i = 1
    for line in f:
        if not line.strip():
            # Empty line found, create a new group.
            groups.append(Group(i, group))
            # Reset accumulator.
            group = collections.defaultdict(int)
            i += 1
            continue
        house_nr, room_nr = line.split()
        group[house_nr] += 1
    # Create the last group at EOF
    groups.append(Group(i, group))
Then you can do stuff like this:
found = filter(
    lambda g:
        len(g.houses) == 1 and  # Group contains one house
        g.houses[0].rooms == 1,  # First house contains one room
    groups)
print(list(found))  # Prints [Group 1, Group 5, Group 6]

found = filter(
    lambda g:
        len(g.houses) == 2 and  # Group contains two houses
        g.houses[0].rooms == 3 and  # First house contains three rooms
        g.houses[1].rooms == 2,  # Second house contains two rooms
    groups)
print(list(found))  # Prints [Group 2]
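The question also asks how to choose the case up front. One possible generalization of the two filters above is to reduce every group to a tuple of room counts (one entry per house, in house order) and compare it with a wanted tuple. This is only a minimal self-contained sketch, not part of the code above: `parse_groups`, `find_groups` and the inline `DATA` string are hypothetical names introduced here, and it assumes groups are separated by single blank lines.

```python
from collections import Counter

# Sample input from the question; blank lines separate groups.
DATA = """0 0

0 0
0 1
0 2
1 0
1 1

0 0
0 1
0 2

0 0
1 0
2 0
3 0

0 0

0 0
"""


def parse_groups(text):
    """Return one tuple of room counts per group, in file order."""
    groups = []
    for block in text.strip().split("\n\n"):
        # Count how many lines (rooms) each house index has.
        rooms = Counter(int(line.split()[0]) for line in block.splitlines())
        groups.append(tuple(rooms[house] for house in sorted(rooms)))
    return groups


def find_groups(groups, pattern):
    """Return the 1-based numbers of groups whose room counts equal pattern."""
    return [i for i, g in enumerate(groups, start=1) if g == tuple(pattern)]


groups = parse_groups(DATA)
print(find_groups(groups, [1]))     # one house with one room
print(find_groups(groups, [3, 2]))  # two houses with 3 and 2 rooms
```

With this representation a new case is just another tuple, for example `[1, 1, 1, 1]` for four single-room houses.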

Perl solution. It converts the input into this format:
1|0
2|1 2
3|2
4|0 0 0 0
5|0
6|0
The first column is the group number; the second column lists the number of rooms (minus one) of each of its houses, sorted. To search for groups with two houses that have 2 and 3 rooms, you can just grep '|1 2$'; to search for groups with just one house with one room, grep '|0$'.
#!/usr/bin/perl
#-*- cperl -*-
#use Data::Dumper;
use warnings;
use strict;

sub report {
    print join ' ', sort {$a <=> $b} @_;
    print "\n";
}

my $group = 1;
my @last = (0);
print '1|';
my @houses = ();
while (<>) {
    if (/^$/) { # group end
        report(@houses, $last[1]);
        undef @houses;
        print ++$group, '|';
        @last = (0);
    } else {
        my @tuple = split;
        if ($tuple[0] != $last[0]) { # new house
            push @houses, $last[1];
        }
        @last = @tuple;
    }
}
report(@houses, $last[1]);
It is based on the fact that for each house, only the last line is important.
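The same observation (only the last line of each house matters) can be sketched in Python for readers who prefer it. This is an illustrative translation, not the script above: it assumes rooms are numbered from 0 and that a blank line ends a group, and the function name `signatures` is made up for the example.

```python
def signatures(lines):
    """Yield 'group|sorted last-room indices' strings, one per group."""
    group = 1
    last_room = {}  # house index -> last room index seen for that house
    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line ends the current group.
            yield "{}|{}".format(group, " ".join(map(str, sorted(last_room.values()))))
            group += 1
            last_room = {}
        else:
            house, room = map(int, fields)
            last_room[house] = room  # later lines overwrite earlier ones
    # Emit the final group at end of input.
    yield "{}|{}".format(group, " ".join(map(str, sorted(last_room.values()))))


sample = ["0 0", "", "0 0", "0 1", "0 2", "1 0", "1 1", "", "0 0", "1 0"]
for sig in signatures(sample):
    print(sig)
```

A group can then be selected with an ordinary string test such as `sig.endswith("|1 2")`, which plays the role of the grep patterns mentioned above.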

Related

Pandas remove group if difference between first and last row in group exceeds value

I have a dataframe df:
import pandas as pd
df = pd.DataFrame({})
df['X'] = [3,8,11,6,7,8]
df['name'] = [1,1,1,2,2,2]
X name
0 3 1
1 8 1
2 11 1
3 6 2
4 7 2
5 8 2
For each group within 'name', I want to remove the group if the absolute difference between its first and last row is smaller than a specified value d_dif.
For example, when d_dif= 5, I want to get:
X name
0 3 1
1 8 1
2 11 1
If X is increasing within each group, you can use groupby().transform() with np.ptp:
import numpy as np
threshold = 5
ranges = df.groupby('name')['X'].transform(np.ptp)
df[ranges > threshold]
If you only care about first and last, then transform just first and last:
threshold = 5
groups = df.groupby('name')['X']
ranges = groups.transform('last') - groups.transform('first')
df[ranges.abs() > threshold]
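Another way to drop whole groups in one step is `groupby().filter()`. The sketch below is an alternative, not the method shown above, and it rests on one assumption: a group is kept when the absolute difference between its first and last `X` is at least `d_dif`, which is one reading of "remove if smaller than d_dif".

```python
import pandas as pd

df = pd.DataFrame({"X": [3, 8, 11, 6, 7, 8], "name": [1, 1, 1, 2, 2, 2]})
d_dif = 5

# Keep only the groups whose |last - first| is at least d_dif.
kept = df.groupby("name").filter(
    lambda g: abs(g["X"].iloc[-1] - g["X"].iloc[0]) >= d_dif
)
print(kept)
```

For the sample data, group 1 has a difference of 8 and is kept, while group 2 has a difference of 2 and is removed.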

limit number of words in a column in a DataFrame

My dataframe looks like
Abc XYZ
0 Hello How are you doing today
1 Good This is a
2 Bye See you
3 Books Read chapter 1 to 5 only
max_size = 3
I want to truncate column XYZ to at most max_size (3) words. Rows with fewer than max_size words should be left as they are.
Desired output:
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1
Use split with limit, remove last value and then join lists together:
max_size = 3
df['XYZ'] = df['XYZ'].str.split(n=max_size).str[:max_size].str.join(' ')
print (df)
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1
Another solution with lambda function:
df['XYZ'] = df['XYZ'].apply(lambda x: ' '.join(x.split(maxsplit=max_size)[:max_size]))
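To see why the slice is still needed after a limited split, it may help to look at a single string in plain Python. A small sketch (the helper name `truncate_words` is made up for illustration):

```python
def truncate_words(text, max_size):
    """Keep at most max_size whitespace-separated words of text."""
    # maxsplit=max_size leaves any remainder as one extra trailing element,
    # which the slice then drops.
    parts = text.split(maxsplit=max_size)
    return " ".join(parts[:max_size])


print("Read chapter 1 to 5 only".split(maxsplit=3))
print(truncate_words("Read chapter 1 to 5 only", 3))
print(truncate_words("See you", 3))
```

The limited split only avoids splitting the tail that is going to be discarded anyway; a plain `split()` followed by the same slice gives the same words.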

connect 4 check for win by using lists and matrix

I am currently writing a code that takes a connect 4 board from a text file and transfers it into a list. I split each row into a separate list and then added it to a matrix. My goal here is to check if there is a winner. I only need to check for horizontal and vertical wins. I am thinking of checking each element of the matrix to see if there is a winner for four in a row. I know that this is tedious and there is probably a more efficient way. This is what the text file contains:
0 0 0 0 0 0 2
0 0 0 0 0 2 1
2 1 0 2 2 1 2
2 1 0 1 1 2 2
1 1 2 2 2 1 2
1 1 1 2 1 2 1
I see the win in the second column but how would I check for everything to see if there is a win?
This is the code I have so far:
file1=open("file1.txt","r")
matrix=[]
for line in file1:
    connect=line.split(" ")
    matrix.append(connect)
print(matrix)
if matrix[0][0]==matrix[0][1]==matrix[0][2]==matrix[0][3]: #this is only temporary, supposed to check for every element
    if matrix[0][0]==1:
        print("player 1 wins!")
    elif matrix[0][0]==2:
        print("player 2 wins!")
else:
    print("no winner")
if matrix[0][0]==matrix[1][0]==matrix[2][0]==matrix[0][0]: #check for vertical matches
    if matrix[0][0]==1:
        print("player 1 wins!")
    elif matrix[0][0]==2:
        print("player 2 wins!")
else:
    print("no winner")
You can rely on the builtin substring match to do all of the heavy lifting.
# read data
board = list(
    map(
        str.split,
        """0 0 0 0 0 0 2
0 0 0 0 0 2 1
2 1 0 2 2 1 2
2 1 0 1 1 2 2
1 1 2 2 2 1 2
1 1 1 2 1 2 1""".split(
            "\n"
        ),
    )
)
# contains ['1111', '2222']
winning_strings = [c * 4 for c in "12"]
def has_winner(board):
    # get rows, e.g. rows[0] == '0000002'
    # (lists, because a map object can only be iterated once)
    rows = list(map("".join, board))
    # get cols, e.g. cols[0] == '002211'
    # zip(*...) is a common idiom for a transpose
    cols = list(map("".join, zip(*board)))
    # check if a winning sequence occurs as a substring
    # of any row or column; rows and columns are checked
    # separately because their counts differ (6 vs 7)
    return any(
        seq in line
        for seq in winning_strings
        for line in rows + cols
    )
assert has_winner(board)
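If the caller also needs to know which player won, as in the question's print statements, the same substring idea can return the player instead of a boolean. This is a hedged sketch: it assumes the board is a list of rows holding the strings '0', '1' and '2', it only looks at horizontal and vertical lines, and `find_winner` and `BOARD_TEXT` are hypothetical names.

```python
BOARD_TEXT = """0 0 0 0 0 0 2
0 0 0 0 0 2 1
2 1 0 2 2 1 2
2 1 0 1 1 2 2
1 1 2 2 2 1 2
1 1 1 2 1 2 1"""


def find_winner(board):
    """Return '1' or '2' if that player has four in a row or column, else None."""
    rows = ["".join(row) for row in board]
    cols = ["".join(col) for col in zip(*board)]  # transpose
    for player in "12":
        if any(player * 4 in line for line in rows + cols):
            return player
    return None


board = [line.split() for line in BOARD_TEXT.splitlines()]
winner = find_winner(board)
print("player {} wins!".format(winner) if winner else "no winner")
```

Note that the cells stay strings here, so they are compared with `'1'` and `'2'` rather than with the integers 1 and 2.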

How to delete the first subset of each set of column in a data file?

I have a data file with more than 40000 columns. In the header, each column name begins with c1, c2, ..., cn, and each set of c has one or several subsets; for example, c1 has 2 subsets. I need to delete the first column (subset) of each set of c. For example, if the input looks like:
input:
c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444
1 1 0 1 0 0 0 1
2 0 1 0 0 1 0 1
3 0 1 0 0 1 1 0
4 1 0 1 0 0 1 0
5 1 0 1 0 0 1 0
6 1 0 1 0 0 1 0
I need the output be like:
c1.31012 c2.87634 c2.22233 c3.44444
1 0 0 0 1
2 1 0 1 1
3 1 0 1 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 1 0 0 0
Any suggestion please?
Update: what should I do if there are no spaces between the digits in a row (which is the real situation in my data set)? I mean that my real data looks like this:
input:
c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444
1 1010001
2 0100101
3 0100110
4 1010010
5 1010010
6 1010010
and output:
c1.31012 c2.87634 c2.22233 c3.44444
1 0001
2 1011
3 1010
4 0000
5 0000
6 0000
7 1000
Perl solution: It first reads the header line, uses a regex to extract the column name before a dot, and keeps a list of column numbers to keep. It then uses the indices to print only the wanted columns from the header and remaining lines.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @header = split ' ', <>;
my $last = q();
my @keep;
for my $i (0 .. $#header) {
    my ($prefix) = $header[$i] =~ /(.*)\./;
    if ($prefix eq $last) {
        push @keep, $i + 1;
    }
    $last = $prefix;
}
unshift @header, q();
say join "\t", @header[@keep];
while (<>) {
    my @columns = split;
    say join "\t", @columns[@keep];
}
Update:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @header = split ' ', <>;
my $last = q();
my @keep;
for my $i (0 .. $#header) {
    my ($prefix) = $header[$i] =~ /(.*)\./;
    if ($prefix eq $last) {
        push @keep, $i;
    }
    $last = $prefix;
}
say join "\t", @header[@keep];
while (<>) {
    my ($line_number, $all_digits) = split;
    my @digits = split //, $all_digits;
    say join "\t", $line_number, join q(), @digits[@keep];
}
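The index selection is the core of both scripts: a column is kept only when its prefix (the text before the dot) equals the prefix of the previous column. For illustration, the same step can be sketched in Python; `columns_to_keep` is a hypothetical helper, and the sample header and digit string are taken from the question.

```python
def columns_to_keep(header):
    """Return indices of columns that are not the first of their prefix group."""
    keep = []
    last_prefix = None
    for i, name in enumerate(header):
        prefix = name.rsplit(".", 1)[0]  # text before the last dot
        if prefix == last_prefix:
            keep.append(i)
        last_prefix = prefix
    return keep


header = "c1.20022 c1.31012 c2.44444 c2.87634 c2.22233 c3.00444 c3.44444".split()
keep = columns_to_keep(header)
print([header[i] for i in keep])
print("".join("1010001"[i] for i in keep))
```

Because the indices are computed once from the header, applying them to each data line is a plain lookup, which matters with more than 40000 columns.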

Pattern decoding II [duplicate]

Possible Duplicate:
Pattern decoding
I have a new question related to the previous post about pattern decoding:
I have almost the same data file, BUT there are double empty (blank) lines, which have to be taken into account in the decoding.
So, the double empty lines mean that there was a street/group (for definitions see the previous post: Pattern decoding) with zero (0) houses, but we have to count these kinds of patterns too. (Yes, you may think that this is an absolutely wrong statement, because there is no street without at least one house, but this is just an analogy, so please just accept it as it is.)
Here is the new data file, with the double lines:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7

# <--- Group 8 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 9

0 0 # <--- Group 10
I need to convert this in an elegant way, as was done before, but in this case we have to take these new groups into account too and indicate them, following Kent's format for example (groupIdx houseIdx numberOfRooms), with houseIdx = 0 and numberOfRooms = 0. So, I need to get this kind of output for example:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 1
8 0 0
9 0 1
10 0 1
Can we tune the previous code in this way?
UPDATE: the new second empty line indicates a new group. If there was an additional empty new line after the empty line, as in this case
0 0 # <--- Group 5
# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7
# <--- Group 8 << ---- THIS IS THE NEW GROUP
we just treat the new empty line (the second of the 2 blank lines) as a new group, and indicate it as group_index 0 0. See the desired output above!
Try:
$ cat houses.awk
BEGIN{max=1;group=1}
NF==0{
    empty++
    if (empty==1) group++
    next
}
{   max = ($1 > max) ? $1 : max
    if (empty<=1){
        a[group,$1]++
    } else {
        a[group,$1]=-1
    }
    empty=0
}
END{for (i=1;i<=group;i++){
        for (j=0;j<=max;j++){
            if (a[i,j]>=1)
                print i , j , a[i,j]
            if (a[i,j]==-1)
                print i, j, 0
        }
        printf "\n"
    }
}
Command:
awk -f houses.awk houses
Output:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
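For comparison, here is a Python sketch of one reading of the requirement: every blank line closes the current group, so two consecutive blank lines produce a group with no lines, reported as `groupIdx 0 0`. It is an illustration under that assumption, not a translation of the awk script above, and `decode` and the sample list are hypothetical.

```python
from collections import Counter


def decode(lines):
    """Return (group, house, rooms) rows; an empty group becomes (group, 0, 0)."""
    rows = []
    rooms = Counter()  # house index -> number of lines (rooms) seen
    group = 1
    # A sentinel blank line makes the loop close the final group as well.
    for line in list(lines) + [""]:
        fields = line.split()
        if fields:
            rooms[int(fields[0])] += 1
            continue
        # Blank line: close the current group, even if it has no lines.
        if rooms:
            rows.extend((group, house, rooms[house]) for house in sorted(rooms))
        else:
            rows.append((group, 0, 0))
        group += 1
        rooms = Counter()
    return rows


sample = ["0 0", "", "", "0 0", "", "0 0", "0 1"]
for row in decode(sample):
    print(*row)
```

The sketch expects the input without trailing blank lines; a blank line at the very end would be counted as one more empty group.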

Resources