Print sorted data under a heading in python 3.7 - python-3.x

I want to sort data in a txt file that has the following format:
name number1 number2 number3 number4
nick 3 2 66 40
Anna 3 1 33 19
kathrine 4 4 100 258
based on the fourth column (number3), but it doesn't seem to work when number3 is a three-digit number.
The program simply asks three times for a name. For each name the user enters numbers until 0 is entered,
and it prints the name, how many numbers were given (number1), how many of them are greater than 10 (number2), the percentage of those (number3) and the sum of all numbers given (number4).
I would also like to print the data aligned under each heading. The headings don't need to be stored
in the file.
The code is the following:
def how_to_sort_a_file():
    from operator import itemgetter
    with open('myfile2.txt') as f:
        lines = [line.split(' ') for line in f]
    output = open('myfile2(sorted).txt', 'w')
    for line in sorted(lines, key=itemgetter(3), reverse=True):
        output.write(' '.join(line))
    output.close()
    print('')
    with open('myfile2(sorted).txt') as f:
        # prints an empty line between lines
        for line in f:
            print(line)
## end function
##################################################################
##################
## main program ##
##################
file = open('myfile2.txt', 'w')
file.close()
for i in range(3):
    name = input('insert your name: ')
    number = int(input('insert number, 0 to terminate: '))
    given_numbers = 0
    numbers_greater_10 = 0
    Sum = 0
    while number != 0:
        Sum = Sum + number
        given_numbers = given_numbers + 1
        if number > 10:
            numbers_greater_10 = numbers_greater_10 + 1
        number = int(input('insert number, 0 to terminate: '))
    percentage = int(numbers_greater_10 * 100 / given_numbers)
    with open('myfile2.txt', 'a') as saveFile:
        saveFile.write(name + ' ' + str(given_numbers) + ' ' + str(numbers_greater_10) + ' ' + str(percentage) + ' ' + str(Sum) + "\n")
how_to_sort_a_file()
I'm totally inexperienced with Python and I would appreciate any help.
Thanks a lot.

You can try doing this using Python's pandas library.
Read the text file into a dataframe (after import pandas as pd) like below:
In [1397]: df = pd.read_fwf('myfile2.txt')
In [1398]: df
Out[1398]:
name number1 number2 number3 number4
0 nick 3 2 66 40
1 Anna 3 1 33 19
2 kathrine 4 4 100 258
Now, you can simply sort on the column number3 in ascending order:
In [1401]: df = df.sort_values('number3')
In [1402]: df
Out[1402]:
name number1 number2 number3 number4
1 Anna 3 1 33 19
0 nick 3 2 66 40
2 kathrine 4 4 100 258
You can see above that the rows are sorted by number3. Now, simply write this to a text file:
In [1403]: df.to_csv('my_output.txt', index=False)
mayankp@mayank:~/Desktop$ cat my_output.txt
name,number1,number2,number3,number4
Anna,3,1,33,19
nick,3,2,66,40
kathrine,4,4,100,258
The good part about this is that you don't have to write complex code for parsing the file.
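If you'd rather stay with the standard library, the underlying bug in the original code is that itemgetter(3) compares the fields as strings, so '100' sorts before '66' lexicographically and three-digit numbers end up in the wrong place. Converting the field to int in the sort key fixes the sort, and format width specifiers give aligned columns under headings. A minimal sketch with the sample rows inlined (the file handling from the question is omitted here):

```python
# Sort rows numerically by the 4th field (index 3) and print them
# aligned under headings that are not stored in the file.
rows = [
    ['nick', '3', '2', '66', '40'],
    ['Anna', '3', '1', '33', '19'],
    ['kathrine', '4', '4', '100', '258'],
]

# key=int makes '100' compare as 100, unlike a plain string sort
rows.sort(key=lambda parts: int(parts[3]), reverse=True)

# fixed-width columns: name left-aligned, numbers right-aligned
print('{:<10}{:>8}{:>8}{:>8}{:>8}'.format(
    'name', 'number1', 'number2', 'number3', 'number4'))
for parts in rows:
    print('{:<10}{:>8}{:>8}{:>8}{:>8}'.format(*parts))
```

The same lambda key can be dropped straight into the question's sorted(...) call in place of itemgetter(3).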

Related

How to print out the cell value from excel using pandas?

Below is the code I'm using to diff two dataframes, but I'm not sure how I can get the locations of the mismatched cells.
files = ['random1.csv', 'random2.csv']
li = []
for filename in files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
df1 = li[0]
df2 = li[1]
print("{} comparing with {}".format(files[0], files[1]))
df3 = df2[df1.ne(df2).any(axis=1)]
print(df3)
print("{} comparing with {}".format(files[1], files[0]))
df4 = df1[df2.ne(df1).any(axis=1)]
print(df4)
output
random1.csv comparing with random2.csv
name age address
1 2 22 2
4 5 6 3
9 10 89 10
random2.csv comparing with random1.csv
name age address
1 2 22 1
4 5 6 2
9 10 89 11
Kindly help with this!
P.S.: I'm a newbie :)
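This related question has no answer in the thread, but one common pattern for recovering the exact (row, column) location of each mismatch is to stack the boolean mask that ne produces. A sketch with small made-up frames standing in for the asker's CSVs:

```python
import pandas as pd

# two small frames standing in for random1.csv / random2.csv
df1 = pd.DataFrame({'name': [2, 5, 10], 'age': [22, 6, 89], 'address': [1, 2, 11]})
df2 = pd.DataFrame({'name': [2, 5, 10], 'age': [22, 6, 89], 'address': [2, 3, 10]})

# cell-wise boolean mask of differences; stacking turns it into a Series
# whose MultiIndex is (row label, column name)
diffs = df1.ne(df2).stack()

# keep only the True entries: these are the mismatched cell locations
locations = diffs[diffs].index.tolist()
print(locations)  # [(0, 'address'), (1, 'address'), (2, 'address')]
```

From each (row, column) pair you can then look up the two conflicting values with df1.at[row, col] and df2.at[row, col].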

limit number of words in a column in a DataFrame

My dataframe looks like
Abc XYZ
0 Hello How are you doing today
1 Good This is a
2 Bye See you
3 Books Read chapter 1 to 5 only
With max_size = 3, I want to truncate the column XYZ to a maximum of max_size words. Rows with fewer than max_size words should be left as they are.
Desired output:
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1
Use split with a limit, keep only the first max_size parts, and then join the lists back together:
max_size = 3
df['XYZ'] = df['XYZ'].str.split(n=max_size).str[:max_size].str.join(' ')
print (df)
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1
Another solution with lambda function:
df['XYZ'] = df['XYZ'].apply(lambda x: ' '.join(x.split(maxsplit=max_size)[:max_size]))

How to generate pyramid of numbers (using only 1-3) using Python?

I'm wondering how to create a pyramid using only the elements (1, 2, 3), regardless of how many rows.
For example, rows = 7:
1
22
333
1111
22222
333333
1111111
I have tried creating a normal pyramid where the numbers match the row number, e.g.
1
22
333
4444
55555
666666
Code I tried for the normal pyramid:
n = int(input("Enter the number of rows:"))
for rows in range(1, n + 1):
    for times in range(rows):
        print(rows, end=" ")
    print("\n")
You need to adjust your ranges and use the modulo operator % - it gives you the remainder of any number divided by some other number. Modulo 3 returns 0, 1 or 2; add 1 to get your desired range of values:
1 % 3 = 1
2 % 3 = 2 # 2 "remain" as 2 // 3 = 0 - so remainder is: 2 - (2//3)*3 = 2 - 0 = 2
3 % 3 = 0 # no remainder, as 3 // 3 = 1 - so remainder is: 3 - (3//3)*3 = 3 - 1*3 = 0
Full code:
n = int(input("Enter the number of rows: "))
print()
for rows in range(0, n):           # start at 0
    for times in range(rows + 1):  # start at 0
        print(rows % 3 + 1, end=" ")  # prints 0 % 3 + 1, 1 % 3 + 1, ..., etc.
    print("")
Output:
Enter the number of rows: 6
1
2 2
3 3 3
1 1 1 1
2 2 2 2 2
3 3 3 3 3 3
See:
Modulo operator in Python
What is the result of % in Python?
binary-arithmetic-operations
A one-liner (just for the record):
>>> n = 7
>>> s = "\n".join(["".join([str(1+i%3)]*(1+i)) for i in range(n)])
>>> s
'1\n22\n333\n1111\n22222\n333333\n1111111'
>>> print(s)
1
22
333
1111
22222
333333
1111111
Nothing special: you have to use the modulo operator to cycle the values.
"".join([str(1+i%3)]*(1+i)) builds the (i+1)-th line: 1+i copies of 1+i%3 (that is 1 if i=0, 2 if i=1, 3 if i=2, 1 if i=3, ...).
Repeat for i=0..n-1 and join with an end-of-line char.
Using cycle from itertools, i.e. a generator:
from itertools import cycle
n = int(input("Enter the number of rows:"))
a = cycle((1, 2, 3))
for x, y in zip(range(1, n + 1), a):
    print(str(y) * x)
(update) Rewritten as a two-liner:
from itertools import cycle
n = int(input("Enter the number of rows:"))
print(*[str(y) * x for x, y in zip(range(1, n + 1), cycle((1, 2, 3)))], sep="\n")

Assigning integers to dataframe fields ` OverflowError: Python int too large to convert to C unsigned long`

I have a dataframe df that looks like this:
var val
0 clump_thickness 5
1 unif_cell_size 1
2 unif_cell_shape 1
3 marg_adhesion 1
4 single_epith_cell_size 2
5 bare_nuclei 1
6 bland_chrom 3
7 norm_nucleoli 1
8 mitoses 1
9 class 2
11 unif_cell_size 4
12 unif_cell_shape 4
13 marg_adhesion 5
14 single_epith_cell_size 7
15 bare_nuclei 10
17 norm_nucleoli 2
20 clump_thickness 3
25 bare_nuclei 2
30 clump_thickness 6
31 unif_cell_size 8
32 unif_cell_shape 8
34 single_epith_cell_size 3
35 bare_nuclei 4
37 norm_nucleoli 7
40 clump_thickness 4
43 marg_adhesion 3
50 clump_thickness 8
51 unif_cell_size 10
52 unif_cell_shape 10
53 marg_adhesion 8
... ... ...
204 single_epith_cell_size 5
211 unif_cell_size 5
215 bare_nuclei 7
216 bland_chrom 7
217 norm_nucleoli 10
235 bare_nuclei -99999
257 norm_nucleoli 6
324 single_epith_cell_size 8
I want to create a new column that holds the values of the var and val columns, converted to a number. I wrote the following code:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
When I run this code I get the following error:
df['id'] = df.apply(lambda row: int.from_bytes('{}{}'.format(row.var, row.val).encode(), 'little'), axis = 1)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4384, in _apply_standard
result = Series(results)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 205, in __init__
default=np.nan)
File "pandas/_libs/src/inference.pyx", line 1701, in pandas._libs.lib.fast_multiget (pandas/_libs/lib.c:68371)
File "pandas/_libs/src/inference.pyx", line 1165, in pandas._libs.lib.maybe_convert_objects (pandas/_libs/lib.c:58498)
OverflowError: Python int too large to convert to C unsigned long
I don't understand why. If I run
maximum = 0
for column in df['var'].unique():
    for value in df['val'].unique():
        if int.from_bytes('{}{}'.format(column, value).encode(), 'little') > maximum:
            maximum = int.from_bytes('{}{}'.format(column, value).encode(), 'little')
        print(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
print()
print(maximum)
I get the following result:
65731626445514392434127804442952952931
67060854441299308307031611503233297507
65731626445514392434127804442952952931
68390082437084224179935418563513642083
69719310432869140052839225623793986659
73706994420223887671550646804635020387
16399285238650560638676108961167827102819
67060854441299308307031611503233297507
72377766424438971798646839744354675811
75036222416008803544454453864915364963
69719310432869140052839225623793986659
16399285238650560638676108961167827102819
76365450411793719417358260925195709539
68390082437084224179935418563513642083
76365450411793719417358260925195709539
73706994420223887671550646804635020387
83632281929131549175300318205721294812263623257187
71048538428654055925743032684074331235
75036222416008803544454453864915364963
72377766424438971798646839744354675811
277249955343544548646026928445812341
256480767909405238131904943128931957
266865361626474893388965935787372149
287634549060614203903087921104252533
64059424565585367137514643836585471605
261673064767940065760435439458152053
282442252202079376274557424775032437
.....
60968996531299
69179002195346541894528099
58769973275747
62068508159075
59869484903523
6026341019714892551838472781928948268513458935618931750446847388019
Based on these results I would say that the conversion to integers works fine. Furthermore, the largest created integer is not so big that it should cause problems when being inserted into the dataframe right?
Question: How can I successfully create a new column with the newly created integers? What am I doing wrong here?
Edit: Although bws's solution
str(int.from_bytes('{}{}'.format(column, value).encode(), 'little'))
solves the error, I now have a new problem: the ids are all unique. I don't understand why this happens, but I suddenly have 3000 unique ids while there are only 92 unique var/val combinations.
I don't know exactly why. Maybe apply coerces the lambda's result to a fixed-size int (int64) by default?
I have a workaround that may be useful for you.
Convert the result to a string (object):
df['id'] = df.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
This is interesting to know: https://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html
uint64 Unsigned integer (0 to 18446744073709551615)
edit:
After reading the link above, I assume that in a plain loop you are using the Python int type, not the fixed-size int that pandas uses (which comes from numpy). So when you work with a DataFrame you are limited to the ranges of the types numpy provides; integers larger than uint64's maximum can only be stored with the object dtype.
That is my conclusion, but maybe I am wrong.
Edit, second question:
A simple example works:
d2 = {'val': [2, 1, 1, 2],
      'var': ['clump_thickness', 'unif_cell_size', 'unif_cell_size', 'clump_thickness']}
df2 = pd.DataFrame(data=d2)
df2['id'] = df2.apply(lambda row: str(int.from_bytes('{}{}'.format(row["var"], row["val"]).encode(), 'little')), axis=1)
Result of df2:
print (df2)
val var id
0 2 clump_thickness 67060854441299308307031611503233297507
1 1 unif_cell_size 256480767909405238131904943128931957
2 1 unif_cell_size 256480767909405238131904943128931957
3 2 clump_thickness 67060854441299308307031611503233297507
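The uint64 limit quoted in the answer is easy to demonstrate: Python ints are arbitrary precision, but numpy's fixed-size integers are not, so a value that is perfectly fine as a Python int overflows when pandas/numpy try to store it in a fixed-size column. A small sketch of that boundary (not the asker's data):

```python
import numpy as np

big = 2 ** 70  # fits in a Python int, far beyond uint64's max of 2**64 - 1

# a Python int of any size works on its own
print(big * big)

# but forcing it into a fixed-size numpy integer fails
try:
    np.uint64(big)
except OverflowError as e:
    print('overflow:', e)

# storing it with dtype=object keeps the arbitrary-precision Python int
arr = np.array([big], dtype=object)
print(arr[0] == 2 ** 70)  # True
```

This is consistent with the workaround above: converting the huge int to str (or keeping dtype object) sidesteps the fixed-size conversion entirely.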

Linux join column files of different lengths

I've seen a lot of similar questions to this but I haven't found an answer. I have several text files, each with two columns, but the columns in each file are different lengths, e.g.
file1:
type val
1 2
2 4
3 2
file2:
type val
1 9
2 8
3 9
4 7
I want:
type val type val
1 2 1 9
2 4 2 8
3 2 3 9
4 7
'join' gives something like this:
type val type val
1 2 1 9
2 4 2 8
3 2 3 9
4 7
I could write a script but I'm wondering if there is a simple command.
Thanks,
OK, I couldn't wait for an answer, so I wrote a Python script. Here it is in case it's useful to anyone.
import sys
import os

# joins all the tab-delimited column files in a folder into one file with multiple columns
# usage: joincolfiles.py /folder_with_files outputfile num_cols
folder = sys.argv[1]     # working folder, put all the files to be joined in here
outfile = sys.argv[2]    # output file
cols = int(sys.argv[3])  # number of columns, only works if each file has the same number

g = open(outfile, 'w')
a = []
for files in os.listdir(folder):
    f = open(folder + "/" + files, 'r')
    b = []
    t = 0
    for line in f:
        t = t + 1
        if t == 1:
            b.append(str(files) + line.rstrip('\n'))
        else:
            b.append(line.rstrip('\n'))  # list of lines
    a.append(b)  # list of lists of lines
    f.close()
print("num files", len(a))
x = []
for i in a:
    x.append(len(i))
maxl = max(x)  # max length of the files
print('max len', maxl)
for k in range(0, maxl):  # row number
    for j in a:
        if k < len(j):
            g.write(j[k] + "\t")
        else:
            g.write("\t" * cols)
    g.write("\n")
g.close()
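For the record, the simple command the question asks about does exist: paste from coreutils joins corresponding lines of its input files with tabs, and when one file is shorter it just emits empty fields, which matches the desired output above. A sketch using the two example files:

```shell
# recreate the two example files from the question
printf 'type val\n1 2\n2 4\n3 2\n' > file1
printf 'type val\n1 9\n2 8\n3 9\n4 7\n' > file2

# paste joins line N of file1 with line N of file2, tab-separated;
# file1's missing 5th line simply becomes an empty field
paste file1 file2
```

For more than two inputs, paste accepts any number of files (e.g. paste folder/*), so it covers the multi-file case the script above handles as well.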
