python3: Counting repeated occurrence in a list - python-3.x

Each line contains a special timestamp, the caller number, the receiver number, the duration of the call in seconds, and the rate per minute in cents at which the call was charged, all separated by ";". The file contains thousands of calls and looks like the sample below. I created a list instead of a dictionary to access the elements, but I'm not sure how to count the number of calls originating from the phone in question.
timestamp;caller;receiver;duration;rate per minute
1419121426;7808907654;7807890123;184;0.34
1419122593;7803214567;7801236789;46;0.37
1419122890;7808907654;7809876543;225;0.31
1419122967;7801234567;7808907654;419;0.34
1419123462;7804922860;7809876543;782;0.29
1419123914;7804321098;7801234567;919;0.34
1419125766;7807890123;7808907654;176;0.41
1419127316;7809876543;7804321098;471;0.31
|Phone number  || # |Duration |  Due   |
+--------------+-----------------------
|(780) 123 4567||384|55h07m53s|$ 876.97|
|(780) 123 6789||132|17h53m19s|$ 288.81|
|(780) 321 4567||363|49h52m12s|$ 827.48|
|(780) 432 1098||112|16h05m09s|$ 259.66|
|(780) 492 2860||502|69h27m48s|$1160.52|
|(780) 789 0123||259|35h56m10s|$ 596.94|
|(780) 876 5432||129|17h22m32s|$ 288.56|
|(780) 890 7654||245|33h48m46s|$ 539.41|
|(780) 987 6543||374|52h50m11s|$ 883.72|
list =[i.strip().split(";") for i in open("calls.txt", "r")]
print(list)

I have a very simple solution for your issue:
First of all, use with when opening the file. It's a handy shortcut that closes the file for you, much like wrapping the call in try...finally. Consider this:
lines = []
with open("test.txt", "r") as f:
    for line in f.readlines():
        lines.append(line.strip().split(";"))
print(lines)
counters = {}
# you browse through lists and later through numbers inside lists
for line in lines:
    for number in line:
        # very basic way to count occurrences
        if number not in counters:
            counters[number] = 1
        else:
            counters[number] += 1
# in this condition you can tell what number of digits you accept
counters = {elem: counters[elem] for elem in counters.keys() if len(elem) > 5}
print(counters)
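The loop above tallies every field longer than five characters, so timestamps and receiver numbers end up in the counts as well. If only originating calls matter, a minimal sketch along the following lines may help. It assumes the five-field layout shown in the question; summarize_calls and the sample rows are illustrative names and data, not part of the original code.

```python
from collections import Counter, defaultdict


def summarize_calls(lines):
    """Return (calls per caller, total seconds per caller) for ';'-separated records."""
    counts = Counter()
    seconds = defaultdict(int)
    for line in lines:
        fields = line.strip().split(";")
        # Skip blank lines and the header row (its first field is not numeric).
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        _timestamp, caller, _receiver, duration, _rate = fields
        counts[caller] += 1
        seconds[caller] += int(duration)
    return counts, dict(seconds)


# First rows of the sample from the question, header included.
sample = [
    "timestamp;caller;receiver;duration;rate per minute",
    "1419121426;7808907654;7807890123;184;0.34",
    "1419122593;7803214567;7801236789;46;0.37",
    "1419122890;7808907654;7809876543;225;0.31",
]
counts, seconds = summarize_calls(sample)
print(counts["7808907654"], seconds["7808907654"])  # 2 409
```

Because the function only iterates over lines, an open file object can be passed in directly instead of a list.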

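This should get you started:
import csv
import collections
Call = collections.namedtuple("Call", "duration rate time")
calls = {}
with open('path/to/file', newline='') as infile:
    # the fields are separated by ';', not the csv default ','
    reader = csv.reader(infile, delimiter=';')
    next(reader)  # assumes the first line is the header row shown in the question
    for time, nofrom, noto, dur, rate in reader:
        # setdefault stores the new dict/list; get() would throw it away
        calls.setdefault(nofrom, {}).setdefault(noto, []).append(Call(dur, rate, time))
for nofrom, logs in calls.items():
    for noto, callist in logs.items():
        print(nofrom, "called", noto, len(callist), "times")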
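The desired report in the question shows phone numbers as "(780) 123 4567" and durations as "55h07m53s". As a rough sketch of that formatting step only, the two helpers below (format_phone and format_duration are hypothetical names) assume 10-digit number strings and durations given as whole seconds.

```python
def format_phone(number):
    """Format a 10-digit string such as '7801234567' as '(780) 123 4567'."""
    return "({}) {} {}".format(number[:3], number[3:6], number[6:])


def format_duration(total_seconds):
    """Format a number of seconds as hours, minutes and seconds, e.g. '55h07m53s'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}h{:02d}m{:02d}s".format(hours, minutes, seconds)


print(format_phone("7801234567"), format_duration(198473))
# (780) 123 4567 55h07m53s
```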

Related

Parsing a http response that is in bytes format

requests.get() receives a response which is of the type bytes. It looks like:
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
Please note that while the actual string is much longer, it is always a long string of shorter strings separated by '\r\n', ignoring the final word "END". You can see how similarly-structured these short strings are:
for i in response.text.split('\r\n')[:-1]: print(i, '\n\n')
{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\/Date(1583530800000)\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}
{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\/Date(1583531100000)\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}
{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\/Date(1583531400000)\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}
{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\/Date(1583531700000)\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}
{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\/Date(1583532000000)\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}
Goal: parse a few of the fields and save them in a dataframe with the field "TimeStamp" as the dataframe's index.
What I have done:
import ast
import datetime
import pandas as pd

response_text = response.text
df = pd.DataFrame(columns = [ 'o', 'h', 'l', 'c', 'v'])
for i in response_text.split('\r\n')[:-1]:
    i_dict = ast.literal_eval(i)
    epoch_in_milliseconds = int(i_dict['TimeStamp'].split('(')[1].split(')')[0])
    time_stamp = datetime.datetime.fromtimestamp(float(epoch_in_milliseconds)/1000.)
    o = i_dict['Open']
    h = i_dict['High']
    l = i_dict['Low']
    c = i_dict['Close']
    v = i_dict['TotalVolume']
    temp_df = pd.DataFrame({'o':o, 'h':h, 'l':l, 'c':c, 'v':v}, index = [time_stamp])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, temp_df])
which gets me:
In [546]: df
Out[546]:
                              o           h           l           c    v
2020-03-06 16:40:00  8496.75000  8508.25000  8495.00000  8506.25000  469
2020-03-06 16:45:00  8506.00000  8509.50000  8502.00000  8503.00000  345
2020-03-06 16:50:00  8503.25000  8505.75000  8492.75000  8494.00000  346
2020-03-06 16:55:00  8493.75000  8500.25000  8492.25000  8499.00000  402
2020-03-06 17:00:00  8498.50000  8508.25000  8495.75000  8501.75000  510
which is exactly what I need.
Issue: this method feels clumsy to me, like patchwork, and prone to breaking due to possible slight differences in the response text.
Is there any more robust and faster way of extracting this information from the original bytes? (When the server response is in JSON format, I have none of this headache)
This is somewhat cleaner, I believe:
ts = """
b'{"Close":8506.25,"DownTicks":164,"DownVolume":207,"High":8508.25,"Low":8495.00,"Open":8496.75,"Status":13,"TimeStamp":"\\/Date(1583530800000)\\/","TotalTicks":371,"TotalVolume":469,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":207,"UpVolume":262,"OpenInterest":0}\r\n{"Close":8503.00,"DownTicks":152,"DownVolume":203,"High":8509.50,"Low":8502.00,"Open":8506.00,"Status":13,"TimeStamp":"\\/Date(1583531100000)\\/","TotalTicks":282,"TotalVolume":345,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":130,"UpVolume":142,"OpenInterest":0}\r\n{"Close":8494.00,"DownTicks":160,"DownVolume":206,"High":8505.75,"Low":8492.75,"Open":8503.25,"Status":13,"TimeStamp":"\\/Date(1583531400000)\\/","TotalTicks":275,"TotalVolume":346,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":115,"UpVolume":140,"OpenInterest":0}\r\n{"Close":8499.00,"DownTicks":136,"DownVolume":192,"High":8500.25,"Low":8492.25,"Open":8493.75,"Status":13,"TimeStamp":"\\/Date(1583531700000)\\/","TotalTicks":299,"TotalVolume":402,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":163,"UpVolume":210,"OpenInterest":0}\r\n{"Close":8501.75,"DownTicks":176,"DownVolume":314,"High":8508.25,"Low":8495.75,"Open":8498.50,"Status":536870941,"TimeStamp":"\\/Date(1583532000000)\\/","TotalTicks":340,"TotalVolume":510,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":164,"UpVolume":196,"OpenInterest":0}\r\nEND'
"""
import pandas as pd
from datetime import datetime
import json
data = []
tss = ts.replace("b'","").replace("\r\nEND'","")
tss2 = tss.strip().split("\r\n")
for t in tss2:
    item = json.loads(t)
    epo = int(item['TimeStamp'].split('(')[1].split(')')[0])
    eims = datetime.fromtimestamp(epo/1000)
    item.update(TimeStamp=eims)
    data.append(item)
pd.DataFrame(data)
Output:
Close DownTicks DownVolume High Low Open Status TimeStamp TotalTicks TotalVolume UnchangedTicks UnchangedVolume UpTicks UpVolume OpenInterest
0 8506.25 164 207 8508.25 8495.00 8496.75 13 2020-03-06 16:40:00 371 469 0 0 207 262 0
1 8503.00 152 203 8509.50 8502.00 8506.00 13 2020-03-06 16:45:00 282 345 0 0 130 142 0
etc. You can drop unwanted columns, change column names and so on.
That's almost JSON format. Or more precisely, it's a series of lines, each of which contains a JSON-formatted object. (Except for the last one.) So that suggests that the optimal solution in some way uses the json module.
json.load reads a single JSON document from a file object, so it doesn't handle a file consisting of a series of lines. However, you're not limited to json.load. You can construct a JSONDecoder object, whose decode method parses a string (json.loads is a thin convenience wrapper around it), and you can use the decode method of the bytes object to construct a string from the input. (You need to know the encoding to do that, but I strongly suspect that in this case all the characters are ASCII, so either 'ascii' or the default UTF-8 encoding will work.)
Unless your input is gigabytes, you can just use the strategy in your question: split the input into lines, discard the END line, and pass the rest into a JSONDecoder:
import json
decoder = json.JSONDecoder()
# Using splitlines seemed more robust than counting on a specific line-end
# (response.content holds the raw bytes; response.text is already a str)
for line in response.content.decode().splitlines():
    # Alternative: use a try/except around the call to decoder.decode
    if line == 'END': break
    line_dict = decoder.decode(line)
    # Handle the Timestamp member and create the dataframe item
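To make the line-by-line strategy concrete, here is a self-contained sketch that uses only the standard library. It assumes the payload layout from the question (one JSON object per line, then an END line); parse_bars is a hypothetical helper name, and the sample bytes are the first two records from the question trimmed to a few fields. Timestamps are converted to UTC here, which is a choice made for this sketch; the question's code used local time instead.

```python
import json
import re
from datetime import datetime, timezone

# After JSON unescaping, "\/Date(1583530800000)\/" becomes "/Date(1583530800000)/".
DATE_RE = re.compile(r"/Date\((\d+)\)/")


def parse_bars(payload):
    """Parse newline-delimited JSON records from bytes, stopping at the END line."""
    rows = []
    for line in payload.decode("utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "END":
            break
        item = json.loads(line)
        match = DATE_RE.fullmatch(item["TimeStamp"])
        if match is None:
            raise ValueError("unexpected TimeStamp: {!r}".format(item["TimeStamp"]))
        millis = int(match.group(1))
        item["TimeStamp"] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        rows.append(item)
    return rows


# Trimmed sample: the first two records from the question, fewer fields.
sample = (
    b'{"Close":8506.25,"High":8508.25,"Low":8495.00,"Open":8496.75,'
    b'"TimeStamp":"\\/Date(1583530800000)\\/","TotalVolume":469}\r\n'
    b'{"Close":8503.00,"High":8509.50,"Low":8502.00,"Open":8506.00,'
    b'"TimeStamp":"\\/Date(1583531100000)\\/","TotalVolume":345}\r\n'
    b'END'
)
rows = parse_bars(sample)
print(len(rows), rows[0]["Open"], rows[0]["TimeStamp"].isoformat())
```

The resulting list of dicts can be handed to pd.DataFrame(rows).set_index("TimeStamp") if a dataframe indexed by timestamp is wanted. Raising on an unexpected TimeStamp, rather than silently splitting on parentheses, is one way to notice when the server changes its format.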

Printing list in different columns

I am quite new to Python and I am struggling with printing my lists in columns. Everything prints in one column only, but I want it printed under 4 different titles. I know I am missing something but can't seem to figure it out. Any advice would be really appreciated!
def createMyList():
    myAgegroup = ['20 - 39','40 - 59','60 - 79']
    mygroupTitle = ['Age','Underweight','Healthy','Overweight',]
    myStatistics = [['Less than 21%','21 - 33','Greater than 33%',],['Less than 23%','23 - 35','Greater than 35%',],['Less than 25%','25 - 38','Greater than 38%',]]
    printmyLists(myAgegroup,mygroupTitle,myStatistics)
    return
def printmyLists(myAgegroup,mygroupTitle,myStatistics):
    print(': Age : Underweight : Healthy : Overweight :')
    for count in range(0, len(myAgegroup)):
        print(myAgegroup[count])
    for count in range(0, len(mygroupTitle)):
        print(mygroupTitle[count])
    for count in range(0, len(myStatistics)):
        print(myStatistics[0][count])
    return
createMyList()
To print data in neat columns, it helps to know the Format Specification Mini-Language (doc). Also, to group data together, look at the zip() builtin function (doc).
Example:
def createMyList():
    myAgegroup = ['20 - 39','40 - 59','60 - 79']
    mygroupTitle = ['Age', 'Underweight','Healthy','Overweight',]
    myStatistics = [['Less than 21%','21 - 33','Greater than 33%',],['Less than 23%','23 - 35','Greater than 35%',],['Less than 25%','25 - 38','Greater than 38%',]]
    printmyLists(myAgegroup,mygroupTitle,myStatistics)
def printmyLists(myAgegroup,mygroupTitle,myStatistics):
    # print the header:
    for title in mygroupTitle:
        print('{:^20}'.format(title), end='')
    print()
    # print the columns:
    for age, stats in zip(myAgegroup, myStatistics):
        print('{:^20}'.format(age), end='')
        for stat in stats:
            print('{:^20}'.format(stat), end='')
        print()
createMyList()
Prints:
        Age             Underweight           Healthy            Overweight
      20 - 39          Less than 21%          21 - 33         Greater than 33%
      40 - 59          Less than 23%          23 - 35         Greater than 35%
      60 - 79          Less than 25%          25 - 38         Greater than 38%
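The fixed width of 20 works for this data, but it has to be chosen by hand. As a variation on the same idea, the sketch below sizes each column to its widest cell and left-aligns the text instead of centring it; format_table is a hypothetical helper, not something from the answer above.

```python
def format_table(headers, rows, gap=2):
    """Return table lines with every column padded to its widest cell."""
    table = [list(headers)] + [list(row) for row in rows]
    # zip(*table) walks the table column by column.
    widths = [max(len(str(cell)) for cell in column) for column in zip(*table)]
    lines = []
    for row in table:
        cells = [str(cell).ljust(width) for cell, width in zip(row, widths)]
        lines.append((" " * gap).join(cells).rstrip())
    return lines


ages = ['20 - 39', '40 - 59', '60 - 79']
titles = ['Age', 'Underweight', 'Healthy', 'Overweight']
stats = [
    ['Less than 21%', '21 - 33', 'Greater than 33%'],
    ['Less than 23%', '23 - 35', 'Greater than 35%'],
    ['Less than 25%', '25 - 38', 'Greater than 38%'],
]
# Pair each age group with its row of statistics.
rows = [[age] + stat for age, stat in zip(ages, stats)]
lines = format_table(titles, rows)
for line in lines:
    print(line)
```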

After a separator there is a key in a loop. How do I keep it?

---------------------------
CompanyID: 000000000000
Pizza: 2 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
---------------------------
CompanyID: 000000000001
burger: 1 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
--------------------------
I'm doing a for loop over a list of lines. Every line is an item of the list. I need to keep the CompanyID while looking for a product given as user input.
The code below sets the variable x=True when the product is found, but I can't get hold of the company ID to print it.
a='-'
for line in lines:
    if a in line:
        companyID= next(line)
    if product in line:
        x=True
TypeError: 'str' object is not an iterator
You can use your line separator to identify when new data starts. Once you see a line with "----", you can start collecting info in a new dictionary. For each line, take its key and value by splitting on ":" and create the entry in the dictionary.
When you see the next "----" line, you know that's the end of the data for this company, so then do your check to see if they have the product and, if so, print the company ID from the dictionary.
line_seperator_char = '-'
company_data = {}
product = 'burger'
with open('data.dat') as lines:
    for line in lines:
        line = line.rstrip()
        if line.startswith(line_seperator_char):
            if product in company_data:
                print(f'{company_data["CompanyID"]} contains the product {product}')
            company_data = {}
        else:
            # split once on ':' and strip so keys and values carry no padding
            key, value = line.split(':', 1)
            company_data[key.strip()] = value.strip()
OUTPUT
000000000001 contains the product burger
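An alternative that stays closer to the loop in the question is to remember the most recent CompanyID in a variable and reuse it when a matching product line turns up. This is only a sketch under the assumption that every block starts with a "CompanyID:" line as in the sample; companies_with_product is a hypothetical name.

```python
def companies_with_product(lines, product):
    """Return the CompanyID of every block that lists `product`."""
    matches = []
    company_id = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-"):
            continue  # separator lines carry no data
        key, _, value = line.partition(":")
        if key == "CompanyID":
            company_id = value.strip()  # remember it for the lines that follow
        elif key == product and company_id not in matches:
            matches.append(company_id)
    return matches


# The sample data from the question.
sample = """---------------------------
CompanyID: 000000000000
Pizza: 2 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
---------------------------
CompanyID: 000000000001
burger: 1 3.15 6.30
spaghetti: 1 7 7
ribye: 2 40 80
--------------------------
""".splitlines()

print(companies_with_product(sample, "burger"))  # ['000000000001']
```

Comparing the key exactly, rather than testing `product in line`, avoids false matches when one product name is a substring of another.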
No, it doesn't run. Could you explain what the "[1]" means in split()[1]?
Another attempt that doesn't run is:
y=[]
y=lines[1].split(' ')
for line in lines:
    y=line.split(' ')
    if len(y[1])==10:
        companyID=y[1]
    if product in line:
        x=True
Thanks for the answers. Something that finally worked in my case was this:
for line in lines:
    if line.startswith("CompanyID:"):
        y=line.split(' ')
        companyID=y[1]
    if product in line:
        x=True

Python3: How to increment a string value within a "for" loop

I have a tabular.text file (Named "xfile"). An example of its contents is attached below.
Scaffold2_1 WP_017805071.1 26.71 161 97
Scaffold2_1 WP_006995572.1 26.36 129 83
Scaffold2_1 WP_005723576.1 26.92 130 81
Scaffold3_1 WP_009894856.1 25.77 245 43
Scaffold8_1 WP_017805071.1 38.31 248 145
Scaffold8_1 WP_006995572.1 38.55 249 140
Scaffold8_1 WP_005723576.1 34.88 258 139
Scaffold9_1 WP_005645255.1 42.54 446 144
Note how each line begins with Scaffold(y)_1, with y being a number. I have written the following code to print each line beginning with the following terms, Scaffold2 and Scaffold8.
with open("xfile", 'r') as data:
    for line in data.readlines():
        if "Scaffold2" in line:
            a = line
            print(a)
        elif "Scaffold8" in line:
            b = line
            print(b)
I was wondering, is there a way you would recommend incrementing the (y) portion of Scaffold() in the if and elif statements?
The idea would be to allow the script to search for each line containing "Scaffold(y)" and storing each line with a specific number (y) in its own variable to be then printed. This would obviously be much faster than entering in each number manually.
You can try this; it is simpler than using a regex. If this isn't what you expect, let me know and I'll change the code.
with open("xfile", 'r') as data:
    for line in data.readlines():
        if line[0:8] == "Scaffold" and line[8].isdigit():
            print(line)
I'm just checking the 9th character of your line (index 8). If it's a digit, I print the line. As you said, I print whenever your "y" is a digit; I'm not incrementing anything. The incrementing is already done by your for loop.
OK, it seems like you want to get something in a format like:
entries = {y1: ['Scaffold(y1)_...', 'Scaffold(y1)_...'], y2: ['Scaffold(y2)_...', 'Scaffold(y2)_...'], ...}
Then you can do something like this (I assume all of your lines start in the same manner as you have shown, so the y value is always a single digit at index 8 of the string):
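entries = dict()
with open("xfile", 'r') as data:
    for line in data.readlines():
        line = line.rstrip("\n")  # drop the trailing newline from each stored line
        if not line[8] in entries.keys():
            entries.update({line[8]: [line]})
        else:
            entries[line[8]].append(line)
print(entries)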
This way you will have a dictionary in the format I have shown you above - output:
{'2': ['Scaffold2_1 WP_017805071.1 26.71 161 97', 'Scaffold2_1 WP_006995572.1 26.36 129 83', 'Scaffold2_1 WP_005723576.1 26.92 130 81'], '3': ['Scaffold3_1 WP_009894856.1 25.77 245 43'], '8': ['Scaffold8_1 WP_017805071.1 38.31 248 145', 'Scaffold8_1 WP_006995572.1 38.55 249 140', 'Scaffold8_1 WP_005723576.1 34.88 258 139'], '9': ['Scaffold9_1 WP_005645255.1 42.54 446 144']}
EDIT: tbh I still don't fully understand why would you need that tho.
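Both answers read a single character at index 8, which stops working once y has two or more digits (Scaffold12_1, for example). A regex-based sketch that groups lines by the full number could look like this; group_by_scaffold is a hypothetical name, and the Scaffold12_1 sample line is made up to illustrate the multi-digit case.

```python
import re
from collections import defaultdict

# Captures the number between "Scaffold" and the following underscore.
SCAFFOLD_RE = re.compile(r"Scaffold(\d+)_")


def group_by_scaffold(lines):
    """Group lines by the integer y in a leading 'Scaffold<y>_' prefix."""
    groups = defaultdict(list)
    for line in lines:
        match = SCAFFOLD_RE.match(line)
        if match:
            groups[int(match.group(1))].append(line.rstrip("\n"))
    return dict(groups)


sample = [
    "Scaffold2_1 WP_017805071.1 26.71 161 97\n",
    "Scaffold2_1 WP_006995572.1 26.36 129 83\n",
    "Scaffold8_1 WP_017805071.1 38.31 248 145\n",
    "Scaffold12_1 WP_000000000.1 30.00 100 50\n",  # made-up line with a two-digit y
]
groups = group_by_scaffold(sample)
for y in sorted(groups):
    print(y, len(groups[y]))
```

Keeping the lines in a dictionary keyed by y replaces the separate variables a, b, and so on; groups[8] then holds every Scaffold8 line.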

svm train output file has less lines than that of the input file

I am currently building a binary classification model and have created an input file for svm-train (svm_input.txt). This input file has 453 lines, 4 features and 2 classes [0,1].
i.e.
0 1:15.0 2:40.0 3:30.0 4:15.0
1 1:22.73 2:40.91 3:36.36 4:0.0
1 1:31.82 2:27.27 3:22.73 4:18.18
0 1:22.73 2:13.64 3:36.36 4:27.27
1 1:30.43 2:39.13 3:13.04 4:17.39 ......................
My problem is that when I count the number of lines in the output model generated by svm-train (svm_train_model.txt), it has 12 fewer data lines than the input file. The line count shows 450, but 9 of those lines are the header at the beginning showing the various parameters generated, which leaves 441 data lines,
i.e.
svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 441
rho -0.156449
label 0 1
nr_sv 228 213
SV
Therefore 12 lines in total from the original input of 453 have gone. I am new to svm and was hoping that someone could shed some light on why this might have happened?
Thanks in advance
Update:
I now believe that in generating the model, it has removed lines where the labels and all the feature values are exactly the same.
To explain: my input is a set of miRNAs which have been classified as 1 or 0 depending on whether or not they are involved in a particular process (i.e. 1=Yes & 0=No). The input file looks something like:
0 1:22 2:30 3:14 4:16
1 1:26 2:15 3:17 4:25
0 1:22 2:30 3:14 4:16
Here, lines one and three are exactly the same and as a result will be removed from the output model. My question is both why the output model would do this and how I can get around it (whilst using the same features).
Whilst SOME of the labels and their corresponding feature values are identical within the input file, these are still different miRNAs.
NOTE: The input file does not have a feature for the miRNA name (which would clearly show the difference between the lines). However, in terms of the features used (i.e. nucleotide percentage content), some of the miRNAs do have exactly the same percentage content of A, U, G & C. As a result they seem to be viewed as duplicates and removed from the output model even though they are not (hence there are fewer lines in the output model).
The format of the input file is as follows:
Column 0 - label (i.e 1 or 0): 1=Yes & 0=No
Column 1 - Feature 1 = Percentage Content "A"
Column 2 - Feature 2 = Percentage Content "U"
Column 3 - Feature 3 = Percentage Content "G"
Column 4 - Feature 4 = Percentage Content "C"
The input file actually looks something like this (see the first two lines below: they appear identical, but each line represents a different miRNA):
1 1:23 2:36 3:23 4:18
1 1:23 2:36 3:23 4:18
0 1:36 2:32 3:5 4:27
1 1:14 2:41 3:36 4:9
1 1:18 2:50 3:18 4:14
0 1:36 2:23 3:23 4:18
0 1:15 2:40 3:30 4:15
In terms of software, I am using libsvm-3.22 and python 2.7.5
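The numbers quoted above add up in one particular way: 9 header lines plus the total_sv value of 441 gives the 450 lines counted, which is what one would expect if the model file lists one support vector per line after the SV marker rather than one line per training example. A small sketch for checking this on a model file follows. It assumes the text layout shown in the question; split_model is a hypothetical helper, and the two support-vector lines in the sample are made up.

```python
def split_model(lines):
    """Split LIBSVM-style model text into header fields and the lines after 'SV'."""
    lines = [line.rstrip("\n") for line in lines]
    header = {}
    for index, line in enumerate(lines):
        if line.strip() == "SV":
            body = [entry for entry in lines[index + 1:] if entry.strip()]
            # index + 1 counts the header lines, including the SV marker itself.
            return header, index + 1, body
        key, _, value = line.partition(" ")
        header[key] = value
    raise ValueError("no SV marker found")


# Header copied from the question's layout; total_sv, nr_sv and the SV lines are made up.
model_text = """svm_type c_svc
kernel_type rbf
gamma 1
nr_class 2
total_sv 2
rho -0.156449
label 0 1
nr_sv 1 1
SV
1 1:15.0 2:40.0 3:30.0 4:15.0
-1 1:22.73 2:40.91 3:36.36 4:0.0
"""
header, header_lines, sv_lines = split_model(model_text.splitlines())
print(header_lines, header["total_sv"], len(sv_lines))  # 9 2 2
```

Running the same function over svm_train_model.txt and comparing len(sv_lines) with header["total_sv"] would show whether the body is a list of support vectors.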
My first observation: format your input file properly. The libsvm code doesn't look for exactly 4 features. It identifies them by the string literals you have provided separating the features from the labels. I suggest manually converting your input file to create the desired input.
Try running the following Python code.
Requirements: h5py, if your input is from MATLAB (.mat file).
pip install h5py
import h5py
import numpy as np

f = h5py.File('traininglabel.mat', 'r')  # give label.mat file for training
variables = f.items()
labels = []
c = []
for var in variables:
    data = var[1]
    # data[()] reads the whole dataset; the older .value was removed in h5py 3.0
    lables = (data[()][0])
trainlabels = []
for i in lables:
    trainlabels.append(str(i))
finaltrain = []
trainlabels = np.array(trainlabels)
for i in range(0, len(trainlabels)):
    if trainlabels[i] == '0.0':
        trainlabels[i] = '0'
    if trainlabels[i] == '1.0':
        trainlabels[i] = '1'
    print(trainlabels[i])
f = h5py.File('training_features.mat', 'r')  # give features here
variables = f.items()
lables = []
file = open('traindata.txt', 'w+')
for var in variables:
    data = var[1]
    lables = data[()]
for i in range(0, 1000):  # no of training samples in file features.mat
    file.write(str(trainlabels[i]))
    file.write(' ')
    for j in range(0, 49):
        # libsvm expects index:value pairs, with indices starting at 1
        file.write('%d:%s' % (j + 1, lables[j][i]))
        file.write(' ')
    file.write('\n')
file.close()
