Linux join column files of different lengths - text

I've seen a lot of similar questions to this but I haven't found an answer. I have several text files, each with two columns, but the columns in each file are different lengths, e.g.
file1:
type val
1 2
2 4
3 2
file2:
type val
1 9
2 8
3 9
4 7
I want:
type val type val
1 2 1 9
2 4 2 8
3 2 3 9
         4 7
'join' gives something like this:
type val val
1 2 9
2 4 8
3 2 9
i.e. it merges the two files on the first field and, by default, drops the unpaired "4 7" line.
I could write a script but I'm wondering if there is a simple command.
Thanks,

Ok, couldn't wait for an answer, so I wrote a Python script. Here it is in case it's useful to anyone.
import sys
import os

# Joins all the tab-delimited column files in a folder into one file
# with multiple columns.
# Usage: joincolfiles.py folder_with_files outputfile num_columns

folder = sys.argv[1]     # working folder; put all the files to be joined in here
outfile = sys.argv[2]    # output file
cols = int(sys.argv[3])  # number of columns; only works if each file has the same number

g = open(outfile, 'w')
a = []  # list of lists of lines, one inner list per file
for name in sorted(os.listdir(folder)):  # sorted for a stable column order
    f = open(folder + "/" + name, 'r')
    b = []
    for t, line in enumerate(f):
        if t == 0:
            b.append(name + line.rstrip('\n'))  # prefix the header with the file name
        else:
            b.append(line.rstrip('\n'))
    a.append(b)
    f.close()

print("num files", len(a))
maxl = max(len(i) for i in a)  # length of the longest file
print("max len", maxl)

for k in range(maxl):  # row number
    for j in a:
        if k < len(j):
            g.write(j[k] + "\t")
        else:
            g.write("\t" * cols)  # pad rows missing from shorter files
    g.write("\n")
g.close()

Related

How can I delete useless strings by index from a Pandas DataFrame defining a function?

I have a DataFrame, namely 'traj', as follow:
x y z
0 5 3 4
1 4 2 8
2 1 1 7
3 Some string here
4 This is spam
5 5 7 8
6 9 9 7
... # continues for a long while, repeating the same strings as at indexes 3 and 4
79 4 3 3
80 Some string here
I'm defining a function in order to delete useless strings at certain indexes from the DataFrame. Here is what I'm trying:
def spam(names, df):  # names is a list composed, for instance, of "Some" and "This" in 'traj'
    return df.drop(index=([traj[(traj.iloc[:, 0] == n)].index for n in names]))
But when I call it it returns the error:
traj_clean = spam(my_list_of_names, traj)
...
KeyError: '[(3,4,...80)] not found in axis'
If I try alone:
traj.drop(index = ([traj[(traj.iloc[:,0] == 'Some')].index for n in names]))
it works.
I solved it in a different way:
df = traj[~traj[:].isin(names)].dropna()
where names is a list of the terms you wish to delete; df will then contain only the rows without those terms.
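For reference, the trick works because isin marks the junk cells, the mask turns them into NaN, and dropna then removes the affected rows. A small self-contained sketch (toy data standing in for 'traj'):

```python
import pandas as pd

# Toy frame standing in for 'traj': numeric rows mixed with junk string rows.
traj = pd.DataFrame({
    "x": [5, 4, "Some", 5],
    "y": [3, 2, "string", 7],
    "z": [4, 8, "here", 8],
})
names = ["Some", "This"]

# Cells matching `names` become NaN under the mask; dropna removes those rows.
clean = traj[~traj.isin(names)].dropna()
print(clean)
```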

Printing Pattern in Python

1. The Problem
Given a positive integer n. Print the pattern as shown in sample outputs.
A code has already been provided. You have to understand the logic of the code on your own and try and make changes to the code so that it gives correct output.
1.1 The Specifics
Input: A positive integer n, 1<= n <=9
Output: Pattern as shown in examples below
Sample input:
4
Sample output:
4444444
4333334
4322234
4321234
4322234
4333334
4444444
Sample input:
5
Sample output:
555555555
544444445
543333345
543222345
543212345
543222345
543333345
544444445
555555555
2. My Answer
2.1 My Code
n = int(input())
answer = [[1]]
for i in range(2, n+1):
    t = [i]*((2*i)-3)
    answer.insert(0, t)
    answer.append(t)
    for a in answer:
        a.insert(0, i)
        a.append(i)
print(answer)
outlst = [' '.join([str(c) for c in lst]) for lst in answer]
for a in outlst:
    print(a)
2.2 My Output
Input: 4
4 4 4 4 4 4 4 4 4
4 4 3 3 3 3 3 3 3 4 4
4 4 3 3 2 2 2 2 2 3 3 4 4
4 3 2 1 2 3 4
4 4 3 3 2 2 2 2 2 3 3 4 4
4 4 3 3 3 3 3 3 3 4 4
4 4 4 4 4 4 4 4 4
2.3 Desired Output
4444444
4333334
4322234
4321234
4322234
4333334
4444444
Your answer isn't as expected because you add the same object t to the answer list twice:
answer.insert(0, t)
answer.append(t)
More specifically, when you assign t = [i]*(2*i - 3), a new data structure is created, [i, ..., i], and t just points to that data structure. Then you put the pointer t in the answer list twice.
In the for a in answer loop, when you use a.insert(0, i) and a.append(i), you update the data structure a is pointing to. Since you call insert(0, i) and append(i) on both pointers that point to the same data structure, you effectively insert and append i to that data structure twice. That's why you end up with more digits than you need.
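The aliasing can be demonstrated in isolation:

```python
t = [2]
answer = [t, t]      # the same list object stored twice
answer[0].append(9)  # mutate through one reference...
print(answer)        # ...and both entries show the change: [[2, 9], [2, 9]]
```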
Instead, you could run the for a in answer loop over only the top half of the rows in the answer list (plus the middle row, which was created without a pair), e.g. for a in answer[:(len(answer)+1)//2].
Other things you could do:
using literals as the arguments instead of reusing the reference, e.g. append([i]*(2*i-3)). The literal expression will create a new data structure every time.
using a copy in one of the calls, e.g. append(t.copy()). The copy method creates a new list object with a "shallow" copy of the data structure.
Also, your output digits are space-separated, because you used a non-empty string in ' '.join(...). You should use the empty string: ''.join(...).
n = 5
answer = [[1]]
for i in range(2, n+1):
    t = [i]*((2*i)-3)
    answer.insert(0, t)
    answer.append(t.copy())
    for a in answer:
        a.insert(0, i)
        a.append(i)
answerfinal = []
for a in answer:
    answerfinal.append(str(a).replace(' ', '').replace(',', '').replace(']', '').replace('[', ''))
for a in answerfinal:
    print(a)
n = int(input())
for i in range(1, n*2):
    for j in range(1, n*2):
        if i <= j <= n*2-i: print(n-i+1, end='')
        elif i > n and i >= j >= n*2-i: print(i-n+1, end='')
        elif j <= n: print(n-j+1, end='')
        else: print(j-n+1, end='')
    print()
n = int(input())
k = 2*n - 1
for i in range(k):
    for j in range(k):
        a = i if i < j else j
        a = a if a < k-i else k-i-1
        a = a if a < k-j else k-j-1
        print(n-a, end='')
    print()
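The last approach, taking the minimum distance to any border, also collapses to a single comprehension (shown here with n fixed instead of read from input):

```python
n = 4
k = 2 * n - 1  # grid is (2n-1) x (2n-1)
# Each cell's digit is n minus its minimum distance to any of the four borders.
rows = ["".join(str(n - min(i, j, k - 1 - i, k - 1 - j)) for j in range(k))
        for i in range(k)]
print("\n".join(rows))
```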

delete columns based on index name string operation [duplicate]

This question already has answers here:
Drop columns whose name contains a specific string from pandas DataFrame
(11 answers)
Closed 3 years ago.
I have a large dataframe with a lot of columns and want to delete some based on string operations on the column names.
Consider the following example:
df_tmp = pd.DataFrame(data=[(1,2,3, "foo"), ("bar", 4,5,6), (7,"baz", 8,9)],
columns=["test", "anothertest", "egg", "spam"])
Now, I would like to delete all columns where the column name contains test; I have tried to adapt answers given here (string operations on column content) and here (on addressing the name) to no avail.
df_tmp = df_tmp[~df_tmp.index.str.contains("test")]
# AttributeError: Can only use .str accessor with string values!
df_tmp[~df_tmp.name.str.contains("test")]
# AttributeError: 'DataFrame' object has no attribute 'name'
Can someone point me in the right direction?
Thanks a ton in advance. :)
This is better done with df.filter() or a boolean mask on df.columns:
>>> df_tmp
test anothertest egg spam
0 1 2 3 foo
1 bar 4 5 6
2 7 baz 8 9
Result:
1-
>>> df_tmp.loc[:,~df_tmp.columns.str.contains("test")]
egg spam
0 3 foo
1 5 6
2 8 9
2-
>>> df_tmp.drop(df_tmp.filter(like='test').columns, axis=1)
egg spam
0 3 foo
1 5 6
2 8 9
3-
>>> df_tmp.drop(df_tmp.filter(regex='test').columns, axis=1)
egg spam
0 3 foo
1 5 6
2 8 9
4-
>>> df_tmp.filter(regex='^((?!test).)*$')
egg spam
0 3 foo
1 5 6
2 8 9
Regex explanation
'^((?!test).)*$'
^ #Start matching from the beginning of the string.
(?!test) #This position must not be followed by the string "test".
. #Matches any character except line breaks (it will include those in single-line mode).
$ #Match all the way until the end of the string.
Nice explanation about regex negative lookahead
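All four variants can be checked against the question's own frame; e.g. the boolean-mask form:

```python
import pandas as pd

df_tmp = pd.DataFrame(data=[(1, 2, 3, "foo"), ("bar", 4, 5, 6), (7, "baz", 8, 9)],
                      columns=["test", "anothertest", "egg", "spam"])

# Keep only the columns whose name does not contain "test".
kept = df_tmp.loc[:, ~df_tmp.columns.str.contains("test")]
print(list(kept.columns))  # ['egg', 'spam']
```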

Print sorted data under a heading in python 3.7

I want to sort data in a txt file that has the following format:
name number1 number2 number3 number4
nick 3 2 66 40
Anna 3 1 33 19
kathrine 4 4 100 258
based on the fourth column (number3), but it doesn't seem to work when number3 is a three-digit number.
The program simply asks three times for a name. For each name, numbers other than 0 are entered,
and it prints the name, how many numbers were given (number1), how many of them are greater than 10 (number2), their percentage (number3), and the sum of all numbers given (number4).
I would also like to print the data aligned under each heading. The headings don't need to be stored
in the file.
The code is the following
def how_to_sort_a_file():
    from operator import itemgetter
    with open('myfile2.txt') as f:
        lines = [line.split(' ') for line in f]
    output = open('myfile2(sorted).txt', 'w')
    for line in sorted(lines, key=itemgetter(3), reverse=True):
        output.write(' '.join(line))
    output.close()
    print('')
    with open('myfile2(sorted).txt') as f:
        ##prints an empty line between lines
        for line in f:
            print(line)
##end function
##################################################################
##################
## main program ##
##################
file = open('myfile2.txt', 'w')
file.close()
for i in range(3):
    name = input('insert your name: ')
    number = int(input('insert number, 0 to terminate: '))
    given_numbers = 0
    numbers_greater_10 = 0
    Sum = 0
    while number != 0:
        Sum = Sum + number
        given_numbers = given_numbers + 1
        if number > 10:
            numbers_greater_10 = numbers_greater_10 + 1
        number = int(input('insert number, 0 to terminate: '))
    percentage = int(numbers_greater_10*100/given_numbers)
    with open('myfile2.txt', 'a') as saveFile:
        saveFile.write(name + ' ' + str(given_numbers) + ' ' + str(numbers_greater_10) + ' ' + str(percentage) + ' ' + str(Sum) + "\n")
how_to_sort_a_file()
I'm totally inexperienced with Python and I would appreciate any help.
Thanks a lot.
You can try doing this using Python's library pandas:
Read the text file into a dataframe like below:
In [1397]: df = pd.read_fwf('myfile2.txt')
In [1398]: df
Out[1398]:
name number1 number2 number3 number4
0 nick 3 2 66 40
1 Anna 3 1 33 19
2 kathrine 4 4 100 258
Now you can simply sort on the column number3 in ascending order:
In [1401]: df = df.sort_values('number3')
In [1402]: df
Out[1402]:
name number1 number2 number3 number4
1 Anna 3 1 33 19
0 nick 3 2 66 40
2 kathrine 4 4 100 258
You can see above that the rows are sorted by number3. Now, simply write this to a text file:
In [1403]: df.to_csv('my_output.txt', index=False)
mayankp#mayank:~/Desktop$ cat my_output.txt
name,number1,number2,number3,number4
Anna,3,1,33,19
nick,3,2,66,40
kathrine,4,4,100,258
The good part about this is that you don't have to write complex code for parsing the file.
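For completeness, the root cause of the original failure is that key=itemgetter(3) compares the fields as strings, and lexicographically '100' sorts before '33'. Converting the field to int in the sort key fixes the pure-Python version without pandas (toy rows matching the question's file):

```python
lines = [
    ["nick", "3", "2", "66", "40"],
    ["Anna", "3", "1", "33", "19"],
    ["kathrine", "4", "4", "100", "258"],
]
# Compare number3 numerically, not as text.
lines.sort(key=lambda parts: int(parts[3]), reverse=True)
print([parts[0] for parts in lines])  # ['kathrine', 'nick', 'Anna']
```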

printing only even columns

I am trying to create a program in python that, once it opens a .csv file, would print (or better, create a new file) with only the even columns.
For example, if my file contains:
A B C D E
1 2 3 4 5
6 7 8 9 0
The new file would have only:
B D
2 4
7 9
So far I have this:
import csv
ifile = open('Example.csv', 'r')
reader = csv.reader(ifile)
ofile = open('Example2.csv', 'w')
writer = csv.writer(ofile, delimiter=',')
for row in reader:
    writer.writerow(row[1:2] + row[3:4])
    print(row[1:2] + row[3:4])
ifile.close()
ofile.close()
But if I have a file containing hundreds of columns, I need a neat way to solve the problem.
Considering your data looks like this (note: no blank line between rows):
A B C D E
1 2 3 4 5
6 7 8 9 0
You can modify your program as:
import csv
ifile=open('Example.csv', 'r')
reader=csv.reader(ifile, delimiter=' ')
ofile=open('Example2.csv', 'w')
writer=csv.writer(ofile, delimiter=',')
for row in reader:
# Here you check for even
tmp_row = [col for idx, col in enumerate(row) if (idx + 1) % 2 == 0]
writer.writerow(tmp_row)
ifile.close()
ofile.close()
You loop over each row to get the column index and then keep the even columns (odd, actually, because indexing starts from 0). Also, you should specify the delimiter: reader = csv.reader(ifile, delimiter=' ').
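Since the even 1-based columns are exactly the odd 0-based indices, the comprehension can also be written as a slice:

```python
row = ["A", "B", "C", "D", "E"]
even_columns = row[1::2]  # start at index 1, take every second element
print(even_columns)       # ['B', 'D']
```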
