printing only even columns - python-3.x

I am trying to create a program in Python that opens a .csv file and prints (or, better, writes to a new file) only the even columns.
For example, if my file contains:
A B C D E
1 2 3 4 5
6 7 8 9 0
The new file would have only:
B D
2 4
7 9
So far I have this:
import csv

ifile = open('Example.csv', 'r')
reader = csv.reader(ifile)
ofile = open('Example2.csv', 'w')
writer = csv.writer(ofile, delimiter=',')
for row in reader:
    writer.writerow(row[1:2] + row[3:4])
    print(row[1:2] + row[3:4])
ifile.close()
ofile.close()
But if I have a file containing hundreds of columns, I need a neat way to solve the problem.

Considering your data looks like this (note there is no blank line between the rows):
A B C D E
1 2 3 4 5
6 7 8 9 0
You can modify your program as:
import csv

ifile = open('Example.csv', 'r')
reader = csv.reader(ifile, delimiter=' ')
ofile = open('Example2.csv', 'w')
writer = csv.writer(ofile, delimiter=',')
for row in reader:
    # keep only the even columns (odd indices, since indexing starts at 0)
    tmp_row = [col for idx, col in enumerate(row) if (idx + 1) % 2 == 0]
    writer.writerow(tmp_row)
ifile.close()
ofile.close()
You loop over each row, enumerate the columns to get their index, and keep the even ones (odd indices, really, since indexing starts at 0). Also, because the input is space-separated, you should specify the delimiter: reader = csv.reader(ifile, delimiter=' ').
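As a side note, if you literally want every second column, plain slicing gives an even shorter version; a minimal sketch, assuming the same space-separated input and comma-separated output as above:

import csv

with open('Example.csv', 'r', newline='') as ifile:
    with open('Example2.csv', 'w', newline='') as ofile:
        reader = csv.reader(ifile, delimiter=' ')
        writer = csv.writer(ofile, delimiter=',')
        for row in reader:
            # row[1::2] keeps indices 1, 3, 5, ... i.e. the 2nd, 4th, 6th, ... columns
            writer.writerow(row[1::2])

This scales to any number of columns without listing them explicitly.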

Related

How to join several data frames containing different pieces of one data into one?

I have several - let's say three - data frames that contain different rows (sometimes they can overlap) of another data frame. The columns are the same for all three dfs. I now want to create a final data frame that contains all the rows from the three data frames mentioned. Moreover, I need to add a column to the final df that records which of the first three dfs each row came from.
Example below
Original data frame:
original_df = pd.DataFrame(np.array([[1,1],[2,2],[3,3],[4,4],[5,5],[6,6]]), columns = ['label1','label2'])
Three dfs containing different pieces of the original df:
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
I want to get the following data frame:
final_df = pd.DataFrame(np.array([[1,1,'a'],[2,2,'a'],[3,3,'b'],[4,4,'c'],\
[5,5,'c'],[6,6,'c']]), columns = ['label1','label2', 'from which df this row'])
or simply use integers to mark from which df the row is:
final_df = pd.DataFrame(np.array([[1,1,1],[2,2,1],[3,3,2],[4,4,3],\
[5,5,3],[6,6,3]]), columns = ['label1','label2', 'from which df this row'])
Thank you in advance!
See this related post
IIUC, you can use pd.concat with the keys and names arguments
pd.concat(
    [a, b, c], keys=['a', 'b', 'c'],
    names=['from which df this row']
).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
However, I'd recommend that you store those dataframe pieces in a dictionary.
parts = {
    'a': original_df.loc[0:1],
    'b': original_df.loc[2:2],
    'c': original_df.loc[3:]
}
pd.concat(parts, names=['from which df this row']).reset_index(0)
  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
And as long as it is stored as a dictionary, you can also use assign like this
pd.concat(d.assign(**{'from which df this row': k}) for k, d in parts.items())
   label1  label2 from which df this row
0       1       1                      a
1       2       2                      a
2       3       3                      b
3       4       4                      c
4       5       5                      c
5       6       6                      c
Keep in mind that I used the double-splat ** because you have a column name with spaces. If you had a column name without spaces, we could do
pd.concat(d.assign(WhichDF=k) for k, d in parts.items())
   label1  label2 WhichDF
0       1       1       a
1       2       2       a
2       3       3       b
3       4       4       c
4       5       5       c
5       6       6       c
Just create a list and concatenate at the end:
list_df = []
list_df.append(df1)
list_df.append(df2)
list_df.append(df3)
df = pd.concat(list_df)
Perhaps this can work / add value for you :)
import pandas as pd
# from your post
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]
# create new column to label the datasets
a['label'] = 'a'
b['label'] = 'b'
c['label'] = 'c'
# add each df to a list
combined_l = []
combined_l.append(a)
combined_l.append(b)
combined_l.append(c)
# concat all dfs into 1
df = pd.concat(combined_l)

In python, how to locate the position of the empty rows in the middle of the file and skip some rows at the beginning dynamically

The data in an excel file looks like this
A B C
1 1 1
1 1 1
D E F G H
1 1 1 1 1
1 1 1 1 1
The file is separated into two parts by one empty row in the middle. The parts have different column names and a different number of columns. I only need the second part of the file, and I want to read it as a pandas dataframe. The number of rows in the first part is not fixed and varies from file to file, so using skiprows=4 will not work.
I actually already have a solution for that. But I want to know whether there is a better solution.
import pandas as pd
path = r'C:\Users\'
file = 'test-file.xlsx'
# Read the whole file without skipping
df_temp = pd.read_excel(path + '/' + file)
The data looks like this in pandas; the empty row has null values in all columns.
     A    B    C Unnamed: 3 Unnamed: 4
0    1    1    1        NaN        NaN
1    1    1    1        NaN        NaN
2  NaN  NaN  NaN        NaN        NaN
3    D    E    F          G          H
4    1    1    1          1          1
5    1    1    1          1          1
I find all empty rows and take the index of the first one:
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
del df_temp
Then I read the file again, skipping the rows before the second part by using the index found above:
df= pd.read_excel(path + '/' + file, skiprows=first_empty_row+2)
print(df)
The drawback of this solution is that I need to read the file twice. If the first part has a lot of rows, reading those useless rows could take a long time. I could also loop over the rows readline-style until I reach an empty row, but that would be inefficient.
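Roughly, that row-scanning idea might look like the sketch below (assuming the input is an .xlsx file that openpyxl can read); it avoids parsing the first part with pandas, but it still opens the file twice:

from openpyxl import load_workbook

# scan cheaply for the blank separator row, then let pandas skip straight past it
wb = load_workbook(path + '/' + file, read_only=True)
n_skip = 0
for values in wb.active.iter_rows(values_only=True):
    n_skip += 1
    if all(v is None for v in values):  # the blank row between the two parts
        break
wb.close()
df = pd.read_excel(path + '/' + file, skiprows=n_skip)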
Does anyone have a better solution? Thanks
Find the position of the first empty row:
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]
Then select everything after that position:
df = df_temp.iloc[pos+1:]
df.columns = df.iloc[0]
df.columns.name = ''
df = df.iloc[1:]
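If you also want the row numbers to start from zero again, a final df = df.reset_index(drop=True) (my addition, not strictly required) tidies it up.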
Your first line looks across the entire row for all null. Would it be possible to just look for the first null in the first column?
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
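Something along those lines might be (just a sketch, assuming the first column is always populated within each data block):

first_empty_row = df_temp[df_temp.iloc[:, 0].isnull()].index[0]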
How does the loop-based approach below compare in performance?
import pandas as pd
import numpy as np
data1 = {'A' : [1,1, np.NaN, 'D', 1,1],
         'B' : [1,1, np.NaN, 'E', 1,1],
         'C' : [1,1, np.NaN, 'F', 1,1],
         'Unnamed: 3' : [np.NaN,np.NaN,np.NaN, 'G', 1,1],
         'Unnamed: 4' : [np.NaN,np.NaN,np.NaN, 'H', 1,1]}
df1 = pd.DataFrame(data1)
print(df1)
     A    B    C Unnamed: 3 Unnamed: 4
0    1    1    1        NaN        NaN
1    1    1    1        NaN        NaN
2  NaN  NaN  NaN        NaN        NaN
3    D    E    F          G          H
4    1    1    1          1          1
5    1    1    1          1          1
# create empty list to append the rows that need to be deleted
list1 = []
# loop through the first column of the dataframe and append the index to a list until the row is null
for index, row in df1.iterrows():
    if (pd.isnull(row[0])):
        list1.append(index)
        break
    else:
        list1.append(index)
# drop the rows based on list created from for loop
df1 = df1.drop(df1.index[list1])
# reset index so you can replace the old columns names
# with the secondary column names easier
df1 = df1.reset_index(drop = True)
# create empty list to append the new column names to
temp = []
# loop through dataframe and append the new column names
for label in df1.columns:
    temp.append(df1[label][0])
# replace column names with the desired names
df1.columns = temp
# drop the old column names which are always going to be at row 0
df1 = df1.drop(df1.index[0])
# reset index so it doesn't start at 1
df1 = df1.reset_index(drop = True)
print(df1)
D E F G H
0 1 1 1 1 1
1 1 1 1 1 1

Renaming Duplicates in CSV column in sequence by python

I have CSV data where a particular column contains duplicate entries, say a,b,c,a,b,c,v,f,c... I want to replace the values with a,b,c,a_1,b_1,c_1,v,f,c_2...
I have written the code below to find the duplicates:
import csv
from collections import Counter
import pandas as pd

duplicate_names = []
file = '2018_Akola_August.csv'
with open(file, 'r', newline='') as csv_file:
    occurrences = Counter()
    for line in csv.reader(csv_file):
        email = line[3]
        if email in occurrences:
            print(email)
            duplicate_names.append(email)
            occurrences[email] += 1
        else:
            occurrences[email] = 1
Also, to replace a string in a CSV column I wrote the code below, but it does not behave as desired when there are two duplicate values.
df = pd.read_csv(file, index_col=False, header=0)
#Finds 'a' and replaces it with 'a_1'
df.loc[df['Circle'] == 'a' , 'Circle']= 'a_1'
print(df)
df.to_csv(file)
It is also not clear to me what effect this statement would have:
df.loc[df['Circle'] == 'a' , 'Circle'] = 'a_1'
How to go about renaming such duplicates in sequence?
here is a way in 2 steps:
>>> df
Circle
0 a
1 b
2 c
3 a
4 b
5 c
6 v
7 f
8 c
dups = (df.loc[df['Circle'].duplicated(), 'Circle'] + '_' +
        df.groupby('Circle').cumcount().astype(str))
df.loc[dups.notnull(),'Circle'] = dups
>>> df
Circle
0 a
1 b
2 c
3 a_1
4 b_1
5 c_1
6 v
7 f
8 c_2
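As a usage note, the same two lines can be applied to the file from the question and written back out; a small sketch, assuming the 'Circle' column and file name from the post:

import pandas as pd

file = '2018_Akola_August.csv'
df = pd.read_csv(file, index_col=False, header=0)
# rename later duplicates as value_1, value_2, ... and leave first occurrences alone
dups = (df.loc[df['Circle'].duplicated(), 'Circle'] + '_' +
        df.groupby('Circle').cumcount().astype(str))
df.loc[dups.notnull(), 'Circle'] = dups
df.to_csv(file, index=False)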
In answer to your second question, the line:
df.loc[df['Circle'] == 'a' , 'Circle']= 'a_1'
will take all rows where Circle equals 'a' and change that value to 'a_1'.

Python/Pandas return column and row index of found string

I've searched previous answers relating to this, but those answers seem to utilize numpy because the array contains numbers. I am trying to search a dataframe for a keyword ('Timeframe') in a sentence, where the full sentence is 'Timeframe for wave in ____', and I would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding hard-coding the position for dynamic reasons. Is there a way to return [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To check the index you need str.contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        # only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,'Timeframe'],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 Timeframe 0 4 b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns to look into, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask==True)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]
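Another option, offered only as a sketch and assuming every cell can safely be cast to a string, is to stack the frame so each cell gets a (row, column) label and then filter on that:

import pandas as pd

df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
# stack() turns the frame into a Series indexed by (row label, column label)
cells = df.astype(str).stack()
hits = cells[cells.str.contains("Timeframe")]
print(list(hits.index))   # [(1, 2)] - row 1, column 2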

Linux join column files of different lengths

I've seen a lot of similar questions to this but I haven't found an answer. I have several text files, each with two columns, but the columns in each file are different lengths, e.g.
file1:
type val
1 2
2 4
3 2
file2:
type val
1 9
2 8
3 9
4 7
I want:
type val   type val
1    2     1    9
2    4     2    8
3    2     3    9
           4    7
'join' gives something like this:
type val type val
1 2 1 9
2 4 2 8
3 2 3 9
4 7
I could write a script but I'm wondering if there is a simple command.
Thanks,
OK, I couldn't wait for an answer, so I wrote a Python script. Here it is in case it's useful to anyone.
import sys
import os

# joins all the tab delimited column files in a folder into one file with multiple columns
# usage: joincolfiles.py /folder_with_files outputfile num_columns

folder = sys.argv[1]     # working folder, put all the files to be joined in here
outfile = sys.argv[2]    # output file
cols = int(sys.argv[3])  # number of columns, only works if each file has same number

g = open(outfile, 'w')
a = []
b = []
c = 0
for files in os.listdir(folder):
    f = open(folder + "/" + files, 'r')
    b = []
    c = c + 1
    t = 0
    for line in f:
        t = t + 1
        if t == 1:
            b.append(str(files) + line.rstrip('\n'))
        else:
            b.append(line.rstrip('\n'))  # list of lines
    a.append(b)                          # list of list of lines
    f.close()
print("num files", len(a))
x = []
for i in a:
    x.append(len(i))
maxl = max(x)  # max length of files
print('max len', maxl)
for k in range(0, maxl):  # row number
    for j in a:
        if k < len(j):
            g.write(j[k] + "\t")
        else:
            g.write("\t" * cols)
    g.write("\n")
g.close()
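For completeness: if tab-separated output is acceptable, the coreutils paste command already does this kind of column-wise merge and simply leaves fields empty once the shorter file runs out of lines, e.g.

paste file1 file2 > joined.txt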
