Pattern decoding II [duplicate] - linux

Possible Duplicate:
Pattern decoding
I have a new question following on from my previous post about pattern decoding:
I have almost the same data file, BUT there are double empty (blank) lines, which have to be taken into account in the decoding.
So, the double empty lines mean that there was a street/group (for definitions see the previous post: Pattern decoding) in which there were zero (0) houses, but we have to count these kinds of patterns too. (Yes, you may think that this is an absolutely wrong statement, because there is no street without at least one house, but this is just an analogy, so please just accept it as it is.)
Here is the new data file, with the double lines:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7

# <--- Group 8 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 9

0 0 # <--- Group 10
I need to convert this in an elegant way, as was done before, but in this case we have to take these new groups into account too and indicate them in the following way, using Kent's format for example: groupIdx houseIdx numberOfRooms, where for the new groups houseIdx = 0 and numberOfRooms = 0. So, I need to get this kind of output, for example:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 1
8 0 0
9 0 1
10 0 1
Can we tune the previous code in this way?
UPDATE: the second empty line indicates a new group. If there is an additional empty line after the first empty line, as in this case
0 0 # <--- Group 5
# <--- Group 6 << ---- THIS IS THE NEW GROUP
0 0 # <--- Group 7
# <--- Group 8 << ---- THIS IS THE NEW GROUP
we just treat the additional empty line (the second of the 2 blank lines) as a new group, and indicate it as group_index 0 0. See the desired output above!

Try:
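$ cat houses.awk
BEGIN { max = 1; group = 1 }
# Blank line: only the first blank line in a run starts a new group.
NF == 0 {
    empty++
    if (empty == 1) group++
    next
}
{
    max = ($1 > max) ? $1 : max
    if (empty <= 1) {
        a[group, $1]++        # count the rooms of this house
    } else {
        a[group, $1] = -1     # data after a double blank line: mark with -1
    }
    empty = 0
}
END {
    for (i = 1; i <= group; i++) {
        for (j = 0; j <= max; j++) {
            if (a[i, j] >= 1)
                print i, j, a[i, j]
            if (a[i, j] == -1)
                print i, j, 0   # marked entries are printed with 0 rooms
        }
        printf "\n"
    }
}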
Command:
awk -f houses.awk houses
Output:
1 0 1
2 0 3
2 1 2
3 0 3
4 0 1
4 1 1
4 2 1
4 3 1
5 0 1
6 0 0
7 0 0
8 0 1
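The desired output in the question lists ten groups. As a minimal Python sketch of one way to get there (not a translation of the awk script above), the function below treats the first blank line after data as the end of the current group and every further consecutive blank line as an empty group reported as `group 0 0`. The name `decode_groups` is hypothetical, and treating everything after `#` as a comment is an assumption, not something the original post states.

```python
def decode_groups(lines):
    """Return (groupIdx, houseIdx, numberOfRooms) triples.

    Assumptions (not from the original post): text after '#' is a comment,
    and each blank line beyond the first in a run is one empty group.
    """
    rows = []
    group = 1
    counts = {}   # house index -> number of rooms in the current group
    blanks = 0    # consecutive blank lines seen so far

    for line in lines:
        fields = line.split("#")[0].split()
        if fields:
            blanks = 0
            house = int(fields[0])
            counts[house] = counts.get(house, 0) + 1
            continue
        blanks += 1
        if blanks == 1:
            # First blank line: close the current group.
            rows.extend((group, h, counts[h]) for h in sorted(counts))
            counts = {}
        else:
            # Additional blank line: an empty group, reported as "group 0 0".
            rows.append((group, 0, 0))
        group += 1

    # Close the last group at end of input.
    rows.extend((group, h, counts[h]) for h in sorted(counts))
    return rows


if __name__ == "__main__":
    sample = ["0 0", "", "", "0 0", "", "0 0"]
    for row in decode_groups(sample):
        print(*row)
```

With a real file, something like `decode_groups(open("houses"))` would feed the lines in; the file name is the one used in the awk command above.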

Related

How to recognize [1,X,X,X,1] repeating pattern in pandas Series

I have a boolean column in a csv file for example:
1 1
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 1
11 0
12 0
13 1
14 0
15 1
You can see here that 1 is repeating every 5 lines.
I want to recognize this repeating pattern [1,0,0,0] as soon as the number of repetitions is above 10 in Python (I have ~20,000 rows/file).
The pattern can start at any position.
How could I manage this in Python while avoiding if ...
import numpy as np
import pandas as pd

# Generate 20000 random 0s and 1s
data = pd.Series(np.random.randint(0, 2, 20000))
# Keep the indices of the 1s
idx = data[data > 0].index
# Check whether the distance from each index to the next one is 4,
# e.g. if positions 2 and 6 both hold a 1, then 6 - 2 = 4
found = []
for i, v in enumerate(idx):
    if i == len(idx) - 1:
        break
    next_value = idx[i + 1]
    if (next_value - v) == 4:
        found.append(v)
print(found)
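The snippet above finds single pairs of 1s that are four positions apart. To report only stretches where the whole block [1,0,0,0] repeats back to back often enough, a plain-Python scan along the following lines may be a starting point. This is a sketch, not the answerer's method: the name `find_pattern_runs` and the `min_repeats` parameter are hypothetical, and "above 10" is interpreted here as "at least `min_repeats`".

```python
def find_pattern_runs(values, pattern=(1, 0, 0, 0), min_repeats=10):
    """Return (start_index, repeat_count) for runs of back-to-back copies of pattern."""
    values = list(values)
    pattern = list(pattern)
    k = len(pattern)
    runs = []
    i = 0
    while i + k <= len(values):
        # Count how many consecutive copies of the pattern start at position i.
        count = 0
        j = i
        while values[j:j + k] == pattern:
            count += 1
            j += k
        if count >= min_repeats:
            runs.append((i, count))
            i = j          # continue after the run
        else:
            i += 1         # too short (or no match): try the next position
    return runs


if __name__ == "__main__":
    data = [0, 0] + [1, 0, 0, 0] * 12 + [1, 1, 0]
    print(find_pattern_runs(data))
```

A pandas column can be passed as `find_pattern_runs(series.tolist())`; the scan itself does not depend on pandas.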

Gnuplot draw logical gate output in time

I am working on a school project, which is a simulation of logical gates. I can implement and run the simulation with ease, but I need help with showing the output.
Right now, I print everything to the console, like this:
sample frequency: 50
###############################################
IN NOT(1) OUT
IN1:0 IN1:3 IN1:5
IN2:0 IN2:0 IN2:0
OUT:3 OUT:5 OUT:0
0 1 -1 -1
50 1 -1 -1
100 1 0 0
150 0 0 0
200 1 1 1
250 1 0 0
300 1 0 0
350 1 0 0 (IN = 1, delay is 1 so we can see
400 0 0 0 the correct output of NOT element in line 400 <-> 350 + 1*50)
450 1 1 1
500 1 0 0
550 1 0 0
600 1 0 0
650 0 0 0
700 0 1 1
750 1 1 1
800 1 0 0
850 1 0 0
900 1 0 0
950 1 0 0
1000 1 0 0
On the left, there is the simulation time (step). In each step, the values are printed out and a new set of inputs is generated.
Where there is -1, this means undefined output.
The 3rd row ( IN NOT(1) OUT ) means that there are 3 elements, 1 input, 1 NOT gate and an output. The value in brackets means the delay of the element, so an element with delay value of X will show the correct output after X*sample_freq (excluding the 0 time).
The rows after mean:
IN1 - the index of the node that is read as input 1
IN2 - the index of the node that is read as input 2
OUT - the index of the output node
In this situation, the IN is giving its output to node #3. The NOT element reads its input from node #3 and gives some output to node #5. The overall output of this system is the OUT element, which reads from #5.
Here is the file that specifies the topology:
3 (number of elems)
IN 0 0 3 (no inputs for input element obviously)
NOT 3 0 5 (reads from #3 and outputs to #5)
OUT 5 0 0 (reads from #5 and this is the end point of the system)
There can obviously be more elements, INs and OUTs, but let's stick to this for the sake of simplicity.
What I want to see as the result is: the x-axis shows the simulation time (0 - 1000, step is 50), the y-axis shows the output value of each element in the system, and the elements draw their output one above the other; see this picture as an example.
Can you tell me how to create this kind of gnuplot script, that transforms the output of my application into the desired plot?
Thank you!
OK, I have found a solution myself; here it is:
First, I had to transform the output of the app a bit, so that it looks like this:
0 1 2 4
49 1 2 4
50 1 2 4
99 1 2 4
100 0 2 4
149 0 2 4
150 0 3 5
199 0 3 5
200 1 3 5
249 1 3 5
250 1 2 4
299 1 2 4
300 0 2 5
349 0 2 5
350 1 3 5
399 1 3 5
400 0 2 4
449 0 2 4
450 1 3 5
499 1 3 5
The extra sim time steps make the edges look almost square. I also offset each successive column by 2 (added 0 to column #2, 2 to column #3, 4 to column #4 and so on), so that the signals are drawn one above the other. The simple command to plot this is:
plot 'out.txt' using 1:2 with lines, 'out.txt' using 1:3 with lines, 'out.txt' using 1:4 with lines
plus some set xtics, set ytics and other cosmetic stuff.
Now I have to deal with labeling the lines with the names of the elements, and voilà.
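The transformation described above (duplicate every sample just before the next step, and shift each signal column upward) could be sketched in Python as follows. The function name `to_step_rows` and the parameters `step` and `spacing` are hypothetical; the sketch assumes the signal values are already 0 or 1 and does not decide how an undefined (-1) value should be drawn.

```python
def to_step_rows(rows, step=50, spacing=2):
    """Turn (time, v1, v2, ...) samples into rows that plot as square waves.

    Each sample is emitted twice, at time t and at t + step - 1, so that
    'with lines' draws a near-vertical edge at every step boundary.
    Column k (0-based among the signals) is shifted up by spacing * k.
    """
    out = []
    for t, *values in rows:
        shifted = [v + spacing * k for k, v in enumerate(values)]
        out.append((t, *shifted))
        out.append((t + step - 1, *shifted))
    return out


if __name__ == "__main__":
    samples = [(0, 1, 0, 0), (50, 1, 0, 0), (100, 0, 0, 0)]
    for row in to_step_rows(samples):
        print(*row)
```

Writing the printed rows to out.txt gives a file in the shape expected by the plot command above.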

Combining pairs in a string (Matlab)

I have a string:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'
How can I combine pairs into strings, where the last character of one pair is the first character of the following pair? The new strings must contain all of the characters 'A', 'B', 'C', 'D', 'E', 'F', 'G' that appear in the sup_pairs string.
The expected output should be:
S1 = 'BAEFCGD' % because BA will be followed by AE in sup_pairs string, so we combine BAE, and so on...we continue the rule to generate S1
S2 = 'DFCEABG'
If I have AB, BC and BD, both strings should be generated: ABC and ABD.
If there is any repeated character in the pairs, like AB BC CA CE, we skip the second A and get ABCE.
This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.
First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);
The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:
>> full(A)
ans =
0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
As you can see, the edge BA, which translates to subscript (3,2) is equal to 1.
Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:
treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');
Output:
S1 = BAEFCGD
Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.
function full_paths = dft_all(A, current_path)
% A - adjacency matrix of graph
% current_path - initially just the start node (root)
% full_paths - cell array containing all paths from initial root to a leaf
    n = size(A, 1); % number of nodes in graph
    full_paths = cell(1,0); % return cell array
    unvisited_mask = ones(1, n);
    unvisited_mask(current_path) = 0; % mask off already visited nodes (path)
    % multiply mask by array of nodes accessible from last node in path
    unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);
    % add restriction on length of paths to keep (numel == n)
    if isempty(unvisited_nodes) && (numel(current_path) == n)
        full_paths = {current_path}; % we've found a leaf node
        return;
    end
    % otherwise, still more nodes to search
    for node = unvisited_nodes
        new_path = dft_all(A, [current_path node]); % add new node and search
        if ~isempty(new_path) % if this produces a new path...
            full_paths = {full_paths{1,:}, new_path{1,:}}; % add it to output
        end
    end
end
This is a normal depth-first traversal except for the added condition on the length of the path:
if isempty(unvisited_nodes) && (numel(current_path) == n)
The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.
I took a shortcut here because n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. But there are no edges into or out of node 1 because I was apparently planning on using a trick that I never got around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:
A =
0 1 1 1 1 1 1 1
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.
Here's the output of the function using the updated matrix:
>> dft_all(A, 1)
ans =
{
[1,1] =
1 2 3 8 5 7 4 6
[1,2] =
1 3 2 6 7 4 8 5
[1,3] =
1 3 8 5 2 6 7 4
[1,4] =
1 3 8 5 7 4 6 2
[1,5] =
1 4 6 2 3 8 5 7
[1,6] =
1 5 7 4 6 2 3 8
[1,7] =
1 6 2 3 8 5 7 4
[1,8] =
1 6 7 4 8 5 2 3
[1,9] =
1 7 4 6 2 3 8 5
[1,10] =
1 8 5 7 4 6 2 3
}
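For readers without MATLAB or Octave, the same idea can be sketched in Python. This is an illustrative equivalent, not the answer's code: it starts a depth-first search from every letter instead of using the dummy node, and the function name `hamiltonian_strings` is hypothetical.

```python
def hamiltonian_strings(sup_pairs):
    """Return every string that chains the pairs and uses each letter exactly once."""
    pairs = sup_pairs.split()
    letters = sorted(set("".join(pairs)))

    # Adjacency list: letter -> letters that may follow it.
    successors = {c: [] for c in letters}
    for first, second in pairs:
        if second not in successors[first]:
            successors[first].append(second)

    results = []

    def extend(path):
        # A path that already contains every letter is a complete string.
        if len(path) == len(letters):
            results.append("".join(path))
            return
        for nxt in successors[path[-1]]:
            if nxt not in path:          # never revisit a letter
                extend(path + [nxt])

    for start in letters:
        extend([start])
    return results


if __name__ == "__main__":
    for s in hamiltonian_strings("BA CE DF EF AE FC GD DA CG EA AB BG"):
        print(s)
```

As with the recursive MATLAB version, the running time grows very quickly with the number of letters, because the search enumerates Hamiltonian paths.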

replace the first N dots of a string

I hope to replace the first 14 dots of my.string with 14 zeroes when region = 2. All other dots should be kept the way they are.
df.1 = read.table(text = "
city county state region my.string reg1 reg2
1 1 1 1 123456789012345678901234567890 1 0
1 2 1 1 ...................34567890098 1 0
1 1 2 1 112233..............0099887766 1 0
1 2 2 1 ..............2020202020202020 1 0
1 1 1 2 ..............00.............. 0 1
1 2 1 2 ..............0987654321123456 0 1
1 1 2 2 ..............9999988888777776 0 1
1 2 2 2 ..................555555555555 0 1
", sep = "", header = TRUE, stringsAsFactors = FALSE)
df.1
I do not think this question has been asked here. Sorry if it has. Sorry also not to have spent more time looking for the solution. A quick Google search did not turn up an answer. I did ask a similar question here earlier: R: removing the last three dots from a string Thank you for any help.
I should clarify that I only want to replace 14 consecutive dots at the far left of the string. If a string begins with a number that is followed by 14 dots, then those 14 dots should remain the way they are.
Here is how my.string would look:
123456789012345678901234567890
...................34567890098
112233..............0099887766
..............2020202020202020
0000000000000000..............
000000000000000987654321123456
000000000000009999988888777776
00000000000000....555555555555
Have you tried:
sub("^\\.{14}", "00000000000000", df.1$my.string )
For conditional replacement try:
> df.1[ df.1$region ==2, "mystring"] <-
sub("^\\.{14}", "00000000000000", df.1$my.string[ df.1$region==2] )
> df.1
city county state region my.string reg1 reg2
1 1 1 1 1 123456789012345678901234567890 1 0
2 1 2 1 1 ...................34567890098 1 0
3 1 1 2 1 112233..............0099887766 1 0
4 1 2 2 1 ..............2020202020202020 1 0
5 1 1 1 2 ..............00.............. 0 1
6 1 2 1 2 ..............0987654321123456 0 1
7 1 1 2 2 ..............9999988888777776 0 1
8 1 2 2 2 ..................555555555555 0 1
mystring
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 0000000000000000..............
6 000000000000000987654321123456
7 000000000000009999988888777776
8 00000000000000....555555555555
gsub('^[.]{14,14}',paste(rep(0,14),collapse=''),df.1$my.string)
[1] "123456789012345678901234567890" "00000000000000.....34567890098" "112233..............0099887766"
[4] "000000000000002020202020202020" "0000000000000000.............." "000000000000000987654321123456"
[7] "000000000000009999988888777776" "00000000000000....555555555555"
dwin's answer is awesome. Here's one that's easy to understand but not nearly as spiffy:
# restrict the substitution to only region == 2..
# then replace the 'my.string' column with..
df.1[ df.1$region == 2 , 'my.string' ] <-
    # substitute.. (only the first instance!)
    # (use gsub for multiple instances)
    sub(
        # fourteen dots
        '..............' ,
        # with fourteen zeroes
        '00000000000000' ,
        # in the same object (also restricted to region == 2)
        df.1[ df.1$region == 2 , 'my.string' ] ,
        # and don't use regex or anything special.
        # just exactly 14 dots.
        fixed = TRUE
    )
A data.table solution:
require(data.table)
dt <- data.table(df.1)
# solution:
dt[, mystring := ifelse(region == 2, sub("^[.]{14}",
paste(rep(0,14), collapse=""), my.string),
my.string), by=1:nrow(dt)]
# city county state region my.string reg1 reg2 mystring
# 1: 1 1 1 1 123456789012345678901234567890 1 0 123456789012345678901234567890
# 2: 1 2 1 1 ...................34567890098 1 0 ...................34567890098
# 3: 1 1 2 1 112233..............0099887766 1 0 112233..............0099887766
# 4: 1 2 2 1 ..............2020202020202020 1 0 ..............2020202020202020
# 5: 1 1 1 2 ..............00.............. 0 1 0000000000000000..............
# 6: 1 2 1 2 ..............0987654321123456 0 1 000000000000000987654321123456
# 7: 1 1 2 2 ..............9999988888777776 0 1 000000000000009999988888777776
# 8: 1 2 2 2 ..................555555555555 0 1 00000000000000....555555555555
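The anchored pattern used in the answers above (`^` followed by exactly 14 literal dots) behaves the same way in other regex engines. As a small Python sketch, with the hypothetical helper names `zero_leading_dots` and `fix_rows`, and with rows represented as `(region, my_string)` tuples purely for illustration:

```python
import re


def zero_leading_dots(text, n=14):
    """Replace exactly n dots at the very start of text with n zeroes."""
    # '^' anchors the match to the start, so dots later in the string are untouched.
    return re.sub(r"^\.{%d}" % n, "0" * n, text)


def fix_rows(rows, region=2, n=14):
    """Apply the replacement only to rows whose region matches."""
    return [zero_leading_dots(s, n) if r == region else s for r, s in rows]


if __name__ == "__main__":
    dots = "." * 14
    rows = [(1, dots + "2020202020202020"), (2, dots + "0987654321123456")]
    for s in fix_rows(rows):
        print(s)
```

A string that starts with fewer than 14 dots, or with digits followed by dots, is returned unchanged because the anchored match fails.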

Pattern decoding

I need a little help in the following. I have this kind of datafile:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

0 0 # <--- Group 6
There are some cases that have to be answered:
There are groups in the example; a group is a block of lines separated from the others by empty lines, so in this case we have 6 groups. We have to determine the following:
Get the ordinal number of the group (the counter starts, for example, from 1)
if the 1st column = 0 and the 2nd column = 0 and the next line is empty.
So the desired output according to the above example would be
1
5
6
if the first column = 0 and the 2nd column can vary and the next line is empty
So the desired output according to the above example would be
3
... etc. How can this be generalized so that we can set at the beginning which case we would like to get?
There might be many cases according to the values of the columns in a group.
We can imagine this if we think about something like this: the first column means the number of houses in a street, and the second column means the number of rooms in a house. Now I would like to find all possible kinds of streets in a city, which means for example
let us pick those streets in which there are two houses with different numbers of rooms: in the first house there are 3 rooms, and in the second house there are 2 rooms. So we have to get the output 2, because this group in the file fulfills the requirement:
0 0
0 1
0 2
1 0
1 1
Important: 0 0 means there is one house with one room
Correction: if there is one house only, then it always has just one room! Like in the cases Group 1, Group 5, and Group 6. Remember that the second column is the number of rooms, and 0 means "1 room", 1 means "2 rooms", ... etc. This is just a counter which starts from 0 instead of 1, sorry if it is a little bit confusing...
I don't know what would be your expected output, however I have converted/decoded your number pattern to a meaningful group/house/rooms format. any further "query" could be done on this content.
see below:
kent$ cat file
0 0

0 0
0 1
0 2
1 0
1 1

0 0
0 1
0 2

0 0
1 0
2 0
3 0

0 0

0 0
awk:
kent$ awk 'BEGIN{RS=""}
{ print "\ngroup "++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "House#: %s , Room(s): %s \n", x, a[x]; }' file
we get output:
group 1
House#: 0 , Room(s): 1
group 2
House#: 0 , Room(s): 3
House#: 1 , Room(s): 2
group 3
House#: 0 , Room(s): 3
group 4
House#: 0 , Room(s): 1
House#: 1 , Room(s): 1
House#: 2 , Room(s): 1
House#: 3 , Room(s): 1
group 5
House#: 0 , Room(s): 1
group 6
House#: 0 , Room(s): 1
note that the generated format could be changed to fit your "filter" or "query"
UPDATE
OP's comment:
I need to know, the number of the group(s) which have/has for example
1 house with one room. The output would be in the above case: 1, 5 ,6
as I said, based on your query criteria, we could adjust the awk output for the next step. now I change the awk above to:
awk 'BEGIN{RS=""}
{print ""; gid=++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "%s %s %s\n", gid,x, a[x]; }' file
this will output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 1
the format is groupIdx houseIdx numberOfRooms and there is a blank line between groups. we save the text above to a file named decoded.txt
so your query could be done on this text:
kent$ awk 'BEGIN{RS="\n\n"}{if (NF==3 && $3==1)print $1}' decoded.txt
1
5
6
the last awk line above means, print the group number, if room number ($3) = 1 and there is only one line in the group block.
I would first define a House class and a Group class:
class House:
    def __init__(self, rooms):
        self.rooms = rooms

class Group:
    def __init__(self, index, houses):
        self.index = index
        # houses.values() is a list with number of rooms for each house.
        self.houses = [House(houses[house_nr]) for house_nr in sorted(houses)]

    def __str__(self):
        return 'Group {}'.format(self.index)

    def __repr__(self):
        return 'Group {}'.format(self.index)
Then parse the data into this hierarchical structure:
import collections

with open('in.txt') as f:
    groups = []
    # Variable to accumulate current group.
    group = collections.defaultdict(int)
    i = 1
    for line in f:
        if not line.strip():
            # Empty line found, create a new group.
            groups.append(Group(i, group))
            # Reset accumulator.
            group = collections.defaultdict(int)
            i += 1
            continue
        house_nr, room_nr = line.split()
        group[house_nr] += 1
    # Create the last group at EOF
    groups.append(Group(i, group))
Then you can do stuff like this:
found = filter(
    lambda g:
        len(g.houses) == 1 and # Group contains one house
        g.houses[0].rooms == 1, # First house contains one room
    groups)
print(list(found)) # Prints [Group 1, Group 5, Group 6]
found = filter(
    lambda g:
        len(g.houses) == 2 and # Group contains two houses
        g.houses[0].rooms == 3 and # First house contains three rooms
        g.houses[1].rooms == 2, # Second house contains two rooms
    groups)
print(list(found)) # Prints [Group 2]
Perl solution. It converts the input into this format:
1|0
2|1 2
3|2
4|0 0 0 0
5|0
6|0
The first column is the group number; the second column holds the number of rooms (minus one) of all its houses, sorted. To search for groups with two different houses with 2 and 3 rooms, you can just grep '|1 2$'; to search for groups with just one house with one room, grep '|0$'.
#!/usr/bin/perl
#-*- cperl -*-
#use Data::Dumper;
use warnings;
use strict;
sub report {
    print join ' ', sort {$a <=> $b} @_;
    print "\n";
}
my $group = 1;
my @last = (0);
print '1|';
my @houses = ();
while (<>) {
    if (/^$/) { # group end
        report(@houses, $last[1]);
        undef @houses;
        print ++$group, '|';
        @last = (0);
    } else {
        my @tuple = split;
        if ($tuple[0] != $last[0]) { # new house
            push @houses, $last[1];
        }
        @last = @tuple;
    }
}
report(@houses, $last[1]);
It is based on the fact that for each house, only the last line is important.
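The "only the last line of each house matters" observation can also be sketched in Python. This is an illustrative rewrite rather than a line-by-line port of the Perl script; the function name `group_signatures` is hypothetical, and it assumes groups are separated by single blank lines as in the original data file.

```python
def group_signatures(lines):
    """Return 'group|sorted last-room indices' strings, one per group."""
    signatures = []
    last_room = {}   # house index -> last room index seen in the current group
    group = 1

    def signature():
        rooms = " ".join(str(r) for r in sorted(last_room.values()))
        return "{}|{}".format(group, rooms)

    for line in lines:
        fields = line.split()
        if not fields:
            # Blank line: the current group is complete.
            signatures.append(signature())
            last_room = {}
            group += 1
            continue
        # Later lines of the same house overwrite earlier ones,
        # so only the last room index of each house is kept.
        last_room[int(fields[0])] = int(fields[1])

    signatures.append(signature())   # last group at end of input
    return signatures


if __name__ == "__main__":
    sample = ["0 0", "", "0 0", "0 1", "0 2", "1 0", "1 1"]
    print("\n".join(group_signatures(sample)))
```

Filtering the returned strings with `s.endswith("|1 2")` or `s.endswith("|0")` plays the role of the grep commands mentioned above.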

Resources