How to count how many times each number appears in a file? - python-3.x

My file contains one number per line, each between 0 and 256:
20
123
125
109
175
108
210
74
127
86
172
128
187
131
183
230
132
77
30
177
64
60
211
112
79
45
I would like to count how many times each number is repeated in this file. Here is my attempt:
with open('file', 'w') as f_sb_out:
    for i in range(256):
        total = sum(str(i))
        print('Number of repetitions of ' + str(i) + ' is', str(total))

This is certainly not the best way of doing it in terms of error handling, but it meets your basic requirements.
# Declare dict for counts
numCount = dict()
with open('file.txt', 'r') as f_sb_out:  # Need to change the mode to read, not write
    line = f_sb_out.readline()  # Read the first line
    while line:  # Loop until end of file
        stripped = line.strip()  # Strip any whitespace on either end
        if stripped != '':  # Skip any whitespace-only lines
            if stripped in numCount:
                numCount[stripped] = numCount[stripped] + 1  # Increment the counter
            else:
                numCount[stripped] = 1  # Start a new counter
        line = f_sb_out.readline()  # Read the next line at the end, so the first line is not skipped
# Print the counts
for num in numCount:
    print("The number " + num + " appeared " + str(numCount[num]) + " time(s).")
Given your file, it produces:
The number 20 appeared 1 time(s).
The number 123 appeared 1 time(s).
The number 125 appeared 1 time(s).
The number 109 appeared 1 time(s).
The number 175 appeared 1 time(s).
The number 108 appeared 1 time(s).
The number 210 appeared 1 time(s).
The number 74 appeared 1 time(s).
The number 127 appeared 1 time(s).
The number 86 appeared 1 time(s).
The number 172 appeared 1 time(s).
The number 128 appeared 1 time(s).
The number 187 appeared 1 time(s).
The number 131 appeared 1 time(s).
The number 183 appeared 1 time(s).
The number 230 appeared 1 time(s).
The number 132 appeared 1 time(s).
The number 77 appeared 1 time(s).
The number 30 appeared 1 time(s).
The number 177 appeared 1 time(s).
The number 64 appeared 1 time(s).
The number 60 appeared 1 time(s).
The number 211 appeared 1 time(s).
The number 112 appeared 1 time(s).
The number 79 appeared 1 time(s).
The number 45 appeared 1 time(s).

Create a counter dictionary and count the lines as you read over them. Once you have finished, you can access the key/value pairs using counter.items():
from collections import Counter

cntr = Counter()
with open('numbers.txt', 'r') as file_in:
    for line in file_in:
        cntr[line.rstrip()] += 1
for k, v in cntr.items():
    print('Number of repetitions of %s: %s' % (k, v))
> Number of repetitions of 1: 4
> Number of repetitions of 3: 3
> Number of repetitions of 2: 2
The input numbers.txt file contains:
1
2
3
2
3
3
1
1
1
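As an aside, Counter will consume any iterable directly, so the explicit loop can be collapsed into a single call; a minimal variant of the same idea:
from collections import Counter

with open('numbers.txt', 'r') as file_in:
    cntr = Counter(line.rstrip() for line in file_in)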

Checking the counts of your numbers is super easy if you use the collections module. In it, there is a class named Counter that can count any hashable object you throw at it. In this case, you can give it strings or numbers without a problem. You will need Python 3.6 or newer for the f-strings demonstrated here:
#! /usr/bin/env python3
import collections


def main():
    with open('numbers.txt', 'rt') as file:
        counts = collections.Counter(map(int, file))
    for key, value in sorted(counts.items()):
        print(f'{key!r} repeats {value!s} times.')


if __name__ == '__main__':
    main()
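Run against the numbers.txt file shown in the previous answer, this should print:
1 repeats 4 times.
2 repeats 2 times.
3 repeats 3 times.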

Related

VBA string to byte array conversion giving incorrect results for certain values only

I have a routine to take a string and make it into an array of numbers. This is in VBA running in Excel as part of Office Professional 2019.
The code below is a demo version that encapsulates the original code and illustrates the problem.
I need to display the numerical equivalent of each char in the string, so I am using CStr(by) elsewhere in the code.
Public Sub TestByteFromString()
    '### vars
    Dim ss As String, i As Integer
    Dim arrBytes() As Byte
    Dim by As Byte
    '###
    ss = ""
    For i = 0 To 127 Step 1
        ss = ss & Chr(Val(i + 126))
    Next i
    arrBytes = ss
    '###
    For i = LBound(arrBytes) To UBound(arrBytes) Step 2
        by = arrBytes(i)
        Debug.Print "Index " & CStr(i) & " Byte " & CStr(by) & " Original " & CStr((i / 2 + 126)) & " Difference = " & CStr(((i / 2 + 126)) - CInt(by))
    Next i
    '###
End Sub
It seems to work fine except for certain values greater than 126, some of which are shown by the demo above.
I am getting these results and cannot see an explanation or a consistent pattern. Does anyone see what is wrong?
Index 0 Byte 126 Original 126 Difference = 0
Index 2 Byte 127 Original 127 Difference = 0
Index 4 Byte 172 Original 128 Difference = -44
Index 6 Byte 129 Original 129 Difference = 0
Index 8 Byte 26 Original 130 Difference = 104
Index 10 Byte 146 Original 131 Difference = -15
Index 12 Byte 30 Original 132 Difference = 102
Index 14 Byte 38 Original 133 Difference = 95
Index 16 Byte 32 Original 134 Difference = 102
Index 18 Byte 33 Original 135 Difference = 102
Index 20 Byte 198 Original 136 Difference = -62
Index 22 Byte 48 Original 137 Difference = 89
Index 24 Byte 96 Original 138 Difference = 42
Index 26 Byte 57 Original 139 Difference = 82
Index 28 Byte 82 Original 140 Difference = 58
Index 30 Byte 141 Original 141 Difference = 0
Index 32 Byte 125 Original 142 Difference = 17
Index 34 Byte 143 Original 143 Difference = 0
Index 36 Byte 144 Original 144 Difference = 0
Index 38 Byte 24 Original 145 Difference = 121
Index 40 Byte 25 Original 146 Difference = 121
Index 42 Byte 28 Original 147 Difference = 119
Index 44 Byte 29 Original 148 Difference = 119
Index 46 Byte 34 Original 149 Difference = 115
Index 48 Byte 19 Original 150 Difference = 131
Index 50 Byte 20 Original 151 Difference = 131
Index 52 Byte 220 Original 152 Difference = -68
Index 54 Byte 34 Original 153 Difference = 119
Index 56 Byte 97 Original 154 Difference = 57
Index 58 Byte 58 Original 155 Difference = 97
Index 60 Byte 83 Original 156 Difference = 73
Index 62 Byte 157 Original 157 Difference = 0
Index 64 Byte 126 Original 158 Difference = 32
Index 66 Byte 120 Original 159 Difference = 39
Index 68 Byte 160 Original 160 Difference = 0
It seems fine for everything from 160 upwards and below 128.
I don't think it is the CStr() function. If I multiply the byte value by 2 and use CStr() I get this kind of result, suggesting the byte's numerical value itself is the problem.
Index 66 Byte 120 Original 159 Difference = 39
Index 66 2*Byte 240
Other causes I investigated, but which don't seem to explain it:
-two-byte storage of characters in strings.
-the ASCII character set.
-bytes being decoded as negative numbers if the MSB is set, but that is unlikely as 160 onwards is correct.
There may be much better ways to get the array, and they would be very useful, but if possible I would also like to know what has gone wrong so I, and anyone reading, will not make the same mistake again.
Thanks for any assistance, R.
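One plausible explanation, consistent with the numbers above: VBA strings hold UTF-16 code units, and Chr(n) maps n through the system ANSI codepage, so for 128 to 159 on a Windows-1252 system the resulting Unicode code point's low byte no longer equals n, while 160 to 255 map to identical code points and survive. A minimal Python sketch of that hypothesis (assuming the codepage is Windows-1252):
# Map through Windows-1252 to UTF-16LE, as VBA's Chr() and String-to-Byte() cast would.
for n in (128, 130, 131, 145):
    ch = bytes([n]).decode('cp1252')   # what Chr(n) yields on a cp1252 system
    lo, hi = ch.encode('utf-16-le')    # the two bytes VBA stores per character
    print(n, '->', 'U+%04X' % ord(ch), 'low byte =', lo)
# 128 -> U+20AC low byte = 172  (matches "Index 4 Byte 172 Original 128")
# 130 -> U+201A low byte = 26   (matches "Index 8 Byte 26 Original 130")
# 131 -> U+0192 low byte = 146  (matches "Index 10 Byte 146 Original 131")
# 145 -> U+2018 low byte = 24   (matches "Index 38 Byte 24 Original 145")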

Use part of a column in one file as search term in other file

I have two files. The output file I am searching has earthquake locations and has the following format:
19090212 1323 30.12 36 19.41 103 28.24 7.29 0.00 4 149 25.8 0.02 5.7 9.8 D - 0
19090216 1828 49.61 36 13.27 101 35.38 10.94 0.00 13 54 38.5 0.07 0.3 0.7 B 0
19090711 2114 54.11 35 1.07 99 56.42 7.00 0.00 7 177 18.7 4.00 63.3 53.2 D # 0
I want to use the last 6 digits of the first column (i.e. '090418' out of '19090418') with the first 3 digits of the second column (i.e. '072' out of '0728') as my search term. The file I am searching has the following format:
SC17 P 090212132329.89
X25A P 090212132330.50
AMTX P 090216182814.12
X29A P 090216182813.70
Y28A P 090216182822.36
MSTX P 090216182826.80
Y27A P 090216182831.43
After I search the second file for the term, I need to figure out how many lines are in that section. So for this example, if I were searching the terms shown for the second file above, I want to know there are 2 lines for 090212132 and 5 lines for 090216182.
This is my first post, so please let me know how I can improve clarity or conciseness in my posts. Thanks for the help!
awk to the rescue!
$ awk 'NR==FNR{a[substr($1,3) substr($2,1,3)]; next}
{k=substr($3,1,9)}
k in a{a[k]++}
END{for(k in a) if(a[k]>0) print k,a[k]}' file1 file2
With your sample input files this prints 090212132 2 and 090216182 5; keys with no matches (like 090711211) produce no output.
The answer karakfa suggested worked! My output looks like this:
100224194 7
100117172 18
091004005 11
090520220 10
090526143 21
090122033 20
Thanks for the help!
karakfa's answer, with explanation:
awk 'NR==FNR {                 # For first file
    $1 = substr($1, 3);        # Get last 6 characters of first col
    $2 = substr($2, 1, 3);     # Get first 3 characters of second col
    a[$1 $2];                  # Add the combined key to array a
    next }                     # Move to next record in first file
# Start processing second file
{k = substr($3, 1, 9)}         # Get first 9 characters of third col
k in a {a[k]++}                # If the key is in a, increment its count
END {
    for (k in a)               # Iterate over the array
        if (a[k] > 0)          # If the pattern was matched
            print k, a[k]      # print the key and number of occurrences
}'
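If you'd rather do this in Python, here is a rough equivalent of the awk logic, assuming the two files are named file1 and file2 as above:
from collections import Counter

keys = set()
with open('file1') as f1:  # the earthquake-location file
    for line in f1:
        cols = line.split()
        if len(cols) >= 2:
            keys.add(cols[0][2:] + cols[1][:3])  # last 6 of col 1 + first 3 of col 2

counts = Counter()
with open('file2') as f2:  # the file being searched
    for line in f2:
        cols = line.split()
        if len(cols) >= 3 and cols[2][:9] in keys:
            counts[cols[2][:9]] += 1

for k, v in counts.items():
    print(k, v)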

Problems combining awk scripts

I am trying to use awk to parse a tab-delimited table; there are several duplicate entries in the first column, and I need to remove the duplicate rows that have the smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't work as is, but the first part removes the second duplicate, and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something like [remove whichever duplicate has the smaller sum].
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {
print;
next
} {
s = $2+$3+$4+$5
} s >= sum[$1] {
sum[$1] = s;
if (!($1 in rows))
a[++n] = $1;
rows[$1] = $0
} END {
for(i=1; i<=n; i++)
print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
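If awk isn't a hard requirement, the same keep-the-larger-sum idea is easy to sketch in Python (hypothetical file name, and assuming a header row as in the output above):
rows = {}    # gene id -> (sum of the four count columns, original line)
order = []   # first-seen order of gene ids

with open('infile') as f:
    header = next(f)                    # pass the header row through untouched
    for line in f:
        cols = line.split()
        s = sum(int(c) for c in cols[1:5])
        if cols[0] not in rows:
            order.append(cols[0])
            rows[cols[0]] = (s, line)
        elif s >= rows[cols[0]][0]:     # keep the duplicate with the larger sum
            rows[cols[0]] = (s, line)

print(header, end='')
for gene in order:
    print(rows[gene][1], end='')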

How can I swap numbers inside a data block of repeating format using Linux commands?

I have a huge data file, and I hope to swap some numbers in the 2nd column only, in the following format. The file has 25,000,000 datasets, 8768 lines each.
%% Edited: shorter 10-line example. Sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the top and 1 trailing line at the end of the dataset, all beginning with #. As a result, each data block has 7 header lines, 8768 data lines, and 1 trailing line, 8776 lines in total. The trailing line contains only a single '#'.
I want to swap some numbers in 2nd columns only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I hope to perform this replacement for every repeated dataset.
I tried this with Python, but the data is too big and it caused a memory error. How can I perform this swap with Linux commands like sed or awk?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as the script relies on the gensub() function and $0 reassignment.
Put the following into an executable awk file (like script.awk):
#!/usr/bin/awk -f
BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}
function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}
/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }
47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings to the new values.
create a user-defined function swap to provide values for the 2nd column from the a array, or the original value when there is no mapping. The c2 element is passed in, while val is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), use gensub to replace the first occurrence of the first number pattern with itself concatenated with a space and the return from swap (the action). In this case, gensub's replacement text preserves the first column data. The second column is passed to swap using the field identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true, providing the default action of printing $0, which for data lines may have been modified. Any line that wasn't "data" is printed here without modification.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
 1 1 don't change the for matting
 2 2 4 23242.223 data
 3 3 data that's formatted
 4 4 7 that's formatted
 5 5 data that's formatted
 6 6 data that's formatted
 7 7 data that's formatted
 8 8 data that's formatted
 9 9 data that's formatted
 10 10 data that's formatted
 11 11 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
 1 6 don't change the for matting
 2 3 4 23242.223 data
 3 2 data that's formatted
 4 2 7 that's formatted
 5 2 data that's formatted
 6 3 data that's formatted
 7 3 data that's formatted
 8 3 data that's formatted
 9 6 data that's formatted
 10 6 data that's formatted
 11 6 data that's formatted
 12 12 data that's formatted
 13 13 data that's formatted
 14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file; 26328 lines is exactly three 8776-line blocks:
head -n 26328 data | ./script.awk - > tempfile
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
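As for the original Python attempt: the memory error most likely came from reading the whole file into memory at once. A line-by-line version runs in constant memory; a minimal sketch (hypothetical file names, and note it collapses the column alignment, which the gawk version preserves):
# Direct mapping; the two-step 666/333/222 detour in the question is only
# needed to avoid cascading replacements, which a dict lookup avoids anyway.
mapping = {'1': '6', '9': '6', '10': '6', '11': '6',
           '2': '3', '6': '3', '7': '3', '8': '3',
           '3': '2', '4': '2', '5': '2'}

with open('data') as fin, open('data.out', 'w') as fout:
    for line in fin:
        cols = line.split()
        if not line.startswith('#') and len(cols) >= 2 and cols[1] in mapping:
            cols[1] = mapping[cols[1]]
            line = ' '.join(cols) + '\n'
        fout.write(line)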

R: How to filter through a string of characters in the header of a table

I have a table, here's the start:
TargetID SM_H1462 SM_H1463 SM_K1566 SM_X1567 SM_V1568 SM_K1534 SM_K1570 SM_K1571
ENSG00000000419.8 290 270 314 364 240 386 430 329
ENSG00000000457.8 252 230 242 220 106 234 343 321
ENSG00000000460.11 154 158 162 136 64 152 206 432
ENSG00000000938.7 20106 18664 19764 15640 19024 18508 45590 32113
I want to write code that will filter through the names of each column (the SM_... ones) and only look at the fourth character in each name. There are 4 different letters that can appear as the 4th character: H, K, X or V, as can be seen in the table above, e.g. SM_H1462, SM_K1571 etc. Names that have H or K as the 4th character are the Control, and names that have X or V as the 4th character are the Case.
I want the code to separate the column names based on the 4th letter and group them into two groups: Case and Control.
Essentially, we can ignore the data for now, I just want to work with the col names first.
You could try checking the fourth character and getting case and control as two separate data frames, if that helps you:
my.df <- data.frame(matrix(rep(seq(1,8),3), ncol = 8))
colnames(my.df) <- c('SM_H1462','SM_H1463','SM_K1566','SM_X1567', 'SM_V1568', 'SM_K1534', 'SM_K1570','SM_K1571')
my.df
control = my.df[,(substr(colnames(my.df),4,4) == 'H' | substr(colnames(my.df),4,4) == 'K')]
case = my.df[,(substr(colnames(my.df),4,4) == 'X' | substr(colnames(my.df),4,4) == 'V')]
