Binning values in Python3

Say I have the following txt file
Distances Counts
1 5
2 5
3 9
4 10
9 10
10 10
11 5
14 5
20 1
21 1
23 2
I would like a way to bin according to the first column and sum the second column.
The correct output if you use a bin of 5 would be
0-5 29
5-10 20
10-15 10
15-20 1
20-25 3
or just
5 29
10 20
15 10
20 1
25 3
I tried:
binfile = open('distances.txt', 'r')
binsize = 5
summar = 0
binsize2 = binsize
for line in binfile:
    line = line.strip().split('\t')
    distance = int(line[0])
    counts = int(line[1])
    if distance <= binsize2:
        summar += counts
    else:
        print(str(binsize2)+'\t'+str(summar))
        binsize2 = binsize2 + binsize
        summar = counts
but it doesn't print the last bin. Any suggestions?
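A minimal sketch of one fix, assuming the file starts with the Distances Counts header shown above: flush the final bin after the loop ends, and advance the bin edge in a while loop so empty bins are not silently skipped.

binsize = 5
with open('distances.txt') as binfile:
    next(binfile)                      # skip the "Distances Counts" header row
    upper = binsize                    # upper edge of the current bin
    summar = 0
    for line in binfile:
        if not line.strip():           # skip blank lines
            continue
        distance, counts = map(int, line.split())
        while distance > upper:        # emit every finished bin, empty ones included
            print(str(upper) + '\t' + str(summar))
            upper += binsize
            summar = 0
        summar += counts
    print(str(upper) + '\t' + str(summar))  # flush the last bin

For the sample data this prints the 5/10/15/20/25 rows, including the final 25 3 line.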

Related

How to list the number of words in the row with the most words?

I'm trying to print the number of words in the longest line. I was able to print the number of words in each line, but I can't print the maximum. The max() function does not work. Can anyone help me?
import os
import sys
import numpy as np
with open('demofile.txt') as f:
    lines = f.readlines()
for index, value in enumerate(lines):
    number_of_words = len(value.split())
    print(number_of_words)
demofile.txt
<=4 1 2 3 4 5 6 7 8 9 10 11
<=4 1 2 3 4 5 6 7 8 9
<=4 1 2 3 4 5 6 7 8 9 10 11 sdad adada affg
<=4 1 2 3 4 5 6 7 8 9 10 11
Output:
12
10
15
12
0
0
0
0
0
0
0
0
0
0
0
I also don't understand why it prints word counts for the lines after the data, where there are no words.
If I understood correctly, the max() function doesn't work because you are taking the max of strings, so you need to convert them to ints (or floats).
lines = [int(x) for x in lines.split(" ")]  # converts to ints
maximum = max(lines)  # should work now
UPD:
Edited per the comment below.
Before:
int(x) for x in lines
Now:
int(x) for x in lines.split(" ")
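For what it's worth, a minimal sketch that computes the maximum word count directly, skipping blank lines (trailing empty lines in demofile.txt would explain the zeros in the output above):

with open('demofile.txt') as f:
    counts = [len(line.split()) for line in f if line.strip()]
print(max(counts))  # 15 for the sample file above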

Replacing the first column values according to the second column pattern

How can I use a regex to replace values in a DataFrame, here the 5th column, according to a pattern in the 1st column? Column 5 consists only of ones for now. However, I would like to start changing this column whenever the pattern 34444 appears in the 1st column. The program is then supposed to replace the ones with 11111, 22222, 33333 etc., each time the pattern appears, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way. But I doubt it would really be more efficient than a for-loop.
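Both snippets below assume the sample has already been read into a DataFrame named df. A hedged sketch of one way to load it (the file name sample.txt is assumed):

import pandas as pd

# the header row has one field fewer than the data rows, so pandas
# uses the first data column (11..25) as the index
df = pd.read_csv('sample.txt', sep=r'\s+')
print(df.columns.tolist())  # ['0', '5', '1', '2', '3', '4']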
1. re.finditer
import pandas as pd
import numpy as np
import re

# present col1 as number-strings
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)

# when a pattern is found, increase latter elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1

# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer, no type casting is involved, but an explicit for-loop is written. The same output is obtained. Please benchmark for yourself if efficiency becomes relevant, say, if tens of millions of rows were involved.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)  # note: len(arr1), not len(str1)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: fewer than 5 elements remain
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if the pattern is found, increase all later elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
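As a quick sanity check, either version can be compared against the expected column from the question:

# expected col 5: ten 1s then five 2s, per the expected result above
assert df['5'].tolist() == [1] * 10 + [2] * 5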

Pandas - Fill N rows for a specific column with an integer value and increment the integer thereafter

I have a dataframe to which I added, say, a column named col_1. I want to fill that column with integer values, starting from the first row, that increment after every 4th row. The resulting column should have values like this:
col_1
1
1
1
1
2
2
2
2
The current approach I have is a very brute force one:
for x in range(len(df)):
    if x <= 3:
        df['col_1'][x] = 1
    if x > 3 and x <= 7:
        df['col_1'][x] = 2
This might work for something small, but for anything larger it will chew up a lot of time.
If there is a default RangeIndex, you can use integer division and add 1:
df['col_1'] = df.index // 4 + 1
Or, for a general solution, use a helper array based on the length of the DataFrame:
df['col_1'] = np.arange(len(df)) // 4 + 1
For a repeating 1-and-2 pattern, also apply modulo 2:
df = pd.DataFrame({'a':range(20, 40)})
df['col_1'] = (np.arange(len(df)) // 4) % 2 + 1
print (df)
a col_1
0 20 1
1 21 1
2 22 1
3 23 1
4 24 2
5 25 2
6 26 2
7 27 2
8 28 1
9 29 1
10 30 1
11 31 1
12 32 2
13 33 2
14 34 2
15 35 2
16 36 1
17 37 1
18 38 1
19 39 1
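A quick hedged check that the np.arange variant also works when the index is not a default RangeIndex (the letter index here is only for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(8)}, index=list('abcdefgh'))  # non-default index
df['col_1'] = np.arange(len(df)) // 4 + 1
print(df['col_1'].tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]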

How to convert multi-indexed datetime index into integer?

I have a multi-indexed DataFrame (a groupby object), the result of a groupby on 'id' and 'date'.
x y
id date
abc 3/1/1994 100 7
9/1/1994 90 8
3/1/1995 80 9
bka 5/1/1993 50 8
7/1/1993 40 9
I'd like to convert those dates into integer-like labels, such as
x y
id date
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
I thought it would be simple, but I couldn't get there easily. Is there a simple way to do this?
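To make the answers below reproducible, a minimal sketch that rebuilds the sample frame (construction assumed; dates kept as strings for brevity):

import pandas as pd

df = pd.DataFrame({
    'id':   ['abc', 'abc', 'abc', 'bka', 'bka'],
    'date': ['3/1/1994', '9/1/1994', '3/1/1995', '5/1/1993', '7/1/1993'],
    'x':    [100, 90, 80, 50, 40],
    'y':    [7, 8, 9, 8, 9],
}).set_index(['id', 'date'])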
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
x y
id
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples((x,y) for x,y in zip(df.index.get_level_values('id'), lvl1) )
output:
x y
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9

Finding sum of values in RDD

I have a sample file with numbers in it separated by spaces. I need to find the sum of those numbers.
Here is what the file looks like:
10 20 30 40 50 60 70
1 2 3 4 5 6 7 8 9 10
10 20 30 40 50 60 70
I tried loading the file with textFile, then applied flatMap to split the numbers on spaces, but I cannot find a sum function to sum the elements.
Here is the code:
val rdd = sc.textFile("/tmp/numbers.txt")
val numRdd = rdd.flatMap(lines => lines.split(" "))
You could chain a map after the flatMap to convert each element to Int, and then use sum() on the RDD.
import org.apache.spark.rdd.RDD

val data = Seq("10 20 30 40 50 60 70 1 2 3 4 5 6 7 8 9 10 10 20 30 40 50 60 70")
val rdd: RDD[String] = sc.parallelize(data)
val dataSplit = rdd.flatMap(x => x.split(" ")).map(x => x.toInt)
val sumData = dataSplit.sum()
println("Total sum " + sumData)
