Delete specific columns from csv file in python3 - python-3.x

I need to delete specific columns and rows (by index) from multiple csv files, without creating new files.
The code below works, but the output has an extra blank row after each data row.
import csv

with open('file.csv') as fd:
    reader = csv.reader(fd)
    valid_rows = [row for idx, row in enumerate(reader) if idx != 0]

with open('file.csv', 'w') as out:
    csv.writer(out).writerows(valid_rows)
What is a simpler way to do this (perhaps with another Python library)?
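A quick note on the blank rows before the pandas suggestion: that symptom usually comes from opening the output file without newline='' (the csv module's documentation asks for newline='' on file objects passed to reader and writer; on Windows the writer otherwise emits an extra blank line per row). Keeping your csv-based approach otherwise unchanged, a minimal fix might look like this:
import csv

# Open both files with newline='' so the csv module controls line endings itself;
# otherwise each written row is followed by a blank line on Windows.
with open('file.csv', newline='') as fd:
    reader = csv.reader(fd)
    valid_rows = [row for idx, row in enumerate(reader) if idx != 0]

with open('file.csv', 'w', newline='') as out:
    csv.writer(out).writerows(valid_rows)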

Since you don't want to generate any new csv files and just want to operate on the data, I would suggest using pandas. Its drop function covers both cases.
Consider the following example:
Sample.csv:
col1,col2,col3,col4
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16
17,18,19,20
Code:
import pandas as pd
df = pd.read_csv('./Sample.csv')
To delete columns:
df.drop('col3', axis = 1, inplace = True)
df contents:
   col1  col2  col4
0     1     2     4
1     5     6     8
2     9    10    12
3    13    14    16
4    17    18    20
To delete rows:
df.drop(df.index[[1,4]], inplace = True)
df contents:
   col1  col2  col4
0     1     2     4
2     9    10    12
3    13    14    16
Finally to save the edited csv file:
df.to_csv('new_sample.csv', index = False)
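If the goal is to avoid creating a new file at all, you can simply write the DataFrame back over the original path, and repeat the read/drop/write cycle per file for the multiple-csv part. A rough sketch, assuming the files share the same layout; the glob pattern, the column name 'col3' and the row positions [1, 4] are only placeholders:
import glob
import pandas as pd

for path in glob.glob('*.csv'):               # placeholder pattern for the files to edit
    df = pd.read_csv(path)
    df.drop('col3', axis=1, inplace=True)     # drop the unwanted column(s)
    df.drop(df.index[[1, 4]], inplace=True)   # drop the unwanted row(s) by position
    df.to_csv(path, index=False)              # overwrite the original file in place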

Related

Pandas cannot rename columns or insert column

I am reading a csv file in Python without any prior headers or labels. I would like to add names to my columns, and I followed the documentation, but there is no change to the document. I also tried to insert a column, but again there is no change to the csv.
I am stuck as to why there is no change. See the code below:
data = pd.read_csv('data.csv', header=None, names=['a', 'b', 'label'])
bias=[1 for i in range(79)]
data.insert(0, 'bias', bias)
Sources Used:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html and Pandas read_csv usecols and names not working properly
I can't reproduce this. I created a data frame from mock data with the same length.
>>> import io
>>> data = pd.read_csv(io.StringIO('\n'.join([','.join(['1', '2', 'a'])] * 79)), header=None, names=['a', 'b', 'label'])
>>> data
    a  b label
0   1  2     a
1   1  2     a
..  .. ..   ...
77  1  2     a
78  1  2     a

[79 rows x 3 columns]
DataFrame.insert operates in place, so I don't need to reassign anything.
>>> bias = [1 for i in range(79)]
>>> data.insert(0, 'bias', bias)
>>> data
    bias  a  b label
0      1  1  2     a
1      1  1  2     a
..   ... .. ..   ...
77     1  1  2     a
78     1  1  2     a

[79 rows x 4 columns]
As far as I can tell, the data frame is updated properly here. Could you provide more information?
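One guess, though, given the remark that there is "no change to the document": read_csv and insert only modify the in-memory DataFrame, never the csv file on disk. If the expectation was for data.csv itself to gain the header and the bias column, it has to be written back explicitly, e.g.:
import pandas as pd

data = pd.read_csv('data.csv', header=None, names=['a', 'b', 'label'])
data.insert(0, 'bias', [1] * len(data))   # same bias column as in the question

# Without this step the file on disk stays exactly as it was read.
data.to_csv('data.csv', index=False)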

Combine data from two columns into one without affecting the data values

I have two columns in a data frame. I want to combine those columns into a single column.
df = pd.DataFrame({'a': [500, 200, 13, 47], 'b':['$', '€', .586,.02]})
df
Out:
     a     b
0  500     $
1  200     €
2   13  .586
3   47   .02
I want to merge those two columns without affecting the data values.
Expected output:
df
Out:
        a
0    500$
1    200€
2  13.586
3   47.02
Please help me with this...
I tried the solutions below, but they do not work for me:
df.b=np.where(df.b,df.b,df.a)
df.loc[df['b'] == '', 'b'] = df['a']
The first solution converts both columns to strings, joins them with +, and finally converts the resulting Series to a one-column DataFrame. It only works if the numbers in column b are less than 1:
df1 = df.astype(str)
df = (df1.a + df1.b.str.replace(r'^0', '')).to_frame('a')
print (df)
        a
0    500$
1    200€
2  13.586
3   47.02
Or, if you want mixed values (numeric for the last two rows and strings for the first two), use:
m = df.b.apply(lambda x: isinstance(x, str))
df.loc[m, 'a'] = df.loc[m, 'a'].astype(str) + df.b
df.loc[~m, 'a'] += df.pop('b')
print (df)
        a
0    500$
1    200€
2  13.586
3   47.02
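For completeness, another way to express the same idea (a sketch, not part of the original answers) is to format each value of b explicitly: floats are rendered without their leading zero and strings pass through untouched, so the concatenation gives the expected result for both kinds of values:
import pandas as pd

df = pd.DataFrame({'a': [500, 200, 13, 47], 'b': ['$', '€', .586, .02]})

def fmt(value):
    # 0.586 -> '.586', 0.02 -> '.02'; '$' and '€' are kept as-is.
    if isinstance(value, float):
        return format(value, 'g').lstrip('0')
    return str(value)

df = (df['a'].astype(str) + df['b'].map(fmt)).to_frame('a')
print(df)
#         a
# 0    500$
# 1    200€
# 2  13.586
# 3   47.02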

I'm not able to add column for all rows in pandas dataframe

I'm pretty new to Python/pandas, so it's probably a simple question... but I can't figure it out:
I have two dataframes loaded from Oracle SQL: one with 380 rows / 2 columns and a second with one row / one column. I would like to add the value from the second dataframe to every row of the first as a new column. But I only get it in the first row, and the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib

dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)

d_score = pd.read_sql_query(
    '''
    SELECT
        ID
        ,RESULT
        ,RATIO_A
        ,RATIO_B
    from ORCL_DATA
    ''', conn)  # return 380 rows

d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)

d_score_features = d_score.iloc[:, 2:4]
#d_train_target = d_score.iloc[:,1:2]  # target is RESULT

DM_train = xgb.DMatrix(data=d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)

i = pd.DataFrame({'ID': d_score['ID'], 'Probability': pred})
print(i)

s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''', conn)  # return only 1 row
m = pd.DataFrame(data=s, dtype=np.int64, columns=['MAX_ID_PROCESS'])
print(m)

i['new'] = m  # Trying to add MAX_ID_PROCESS to all rows
print(i)
i =
        ID  Probability
0    20101     0.663083
1    20105     0.486774
2    20106     0.441300
3    20278     0.703176
4    20221     0.539185
....
379  20480     0.671976

m =
   MAX_ID_PROCESS
0             274

i =
   ID_MATCH  Probability    new
0     20101     0.663083  274.0
1     20105     0.486774    NaN
2     20106     0.441300    NaN
3     20278     0.703176    NaN
4     20221     0.539185    NaN
I need the 'new' value filled in for all rows...
Since your second dataframe only contains one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
      ID  Probability    new
0  20101     0.663083  274.0
1  20105     0.486774  274.0
2  20106     0.441300  274.0
3  20278     0.703176  274.0
4  20221     0.539185  274.0
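A small follow-up on the same idea: since m is a 1x1 DataFrame, DataFrame.squeeze() collapses it to its single scalar value, which pandas then broadcasts to every row, so the assignment can also be written without naming the column:
i['new'] = m.squeeze()   # m is 1x1, so squeeze() returns the bare scalar (274 here)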
We can append a column of dataframe1 to dataframe2 as a new column using dataframe2["new_column_name"] = dataframe1["column_to_copy"], and we can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
   ColA  ColB  ColC
0     1    23    10
1    12    11    25
2    32     6     3
3    24    45    23
4    12    25    15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest    1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row] = val
print(df3)
Output:
   ColTest
0        1
1        1
2        1
3        1
4        1
Now we will append this column to the first dataframe and solve your problem.
df["ColTest"] = df3["ColTest"]
print(df)
Output:
   ColA  ColB  ColC  ColTest
0     1    23    10        1
1    12    11    25        1
2    32     6     3        1
3    24    45    23        1
4    12    25    15        1
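If the explicit loop feels verbose, the repeated column can also be built in one step; a small sketch under the same setup (df3 holding a single value):
# Repeat df3's single value once per row of df1, skipping the loop entirely.
df1["ColTest"] = [df3["ColTest"].iloc[0]] * len(df1)
print(df1)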

Convert a csv file to an array using pandas

I have a csv file as shown below
Year, Growth
1, 10
2, 12
3, 13
4, 15
I would like to convert the content of the csv file to an array, say my_array,
so that I can access the growth values as shown below:
my_array[1] should return 10
my_array[4] should return 15 etc
How can I convert the csv to such an array using pandas?
>>> import pandas
>>> df = pandas.read_csv('test.csv')
>>> df
   Year  Growth
0     1      10
1     2      12
2     3      13
3     4      15
>>> arr = df['Growth'].values.tolist()
>>> arr[0]
10
The above piece of code will do that for whichever column you want.
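If you want the lookup to behave exactly as described in the question (my_array[1] returning 10 and my_array[4] returning 15, i.e. keyed by Year rather than by position), one option is to index the Growth values by Year. A small sketch assuming the same test.csv; skipinitialspace handles the space after each comma in the sample file:
import pandas as pd

df = pd.read_csv('test.csv', skipinitialspace=True)   # strips the space after each comma
growth = df.set_index('Year')['Growth']               # Series keyed by Year

print(growth.loc[1])   # 10
print(growth.loc[4])   # 15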

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
   A  B   C   D   E   F
0  1  5   9  13  17  21
1  2  6  10  14  18  22
2  3  7  11  15  19  23
3  4  8  12  16  20  24
And you can append several dataframes this way, e.g. pd.concat([dfX, dfY, dfZ], axis = 1).
If you need to append, say, only columns D and E from dfY to dfX, go for:
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)
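As for why the .iloc version silently does nothing: .iloc can only write into positions that already exist, and dfX has no columns at position 3 or beyond, so there is nothing for the assignment to land in (and even then 3:4 would select a single position, while the right-hand side has two columns). If you want to keep the positional selection on dfY without typing column names, one simple sketch is to pick the columns by position and add them to dfX under their own names:
import pandas as pd

dfX = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})
dfY = pd.DataFrame({'D': [13, 14, 15, 16], 'E': [17, 18, 19, 20], 'F': [21, 22, 23, 24]})

# Columns at positions 1 and 2 of dfY are 'E' and 'F'; copy each into dfX by name.
for col in dfY.columns[1:3]:
    dfX[col] = dfY[col]

print(dfX)
#    A  B   C   E   F
# 0  1  5   9  17  21
# 1  2  6  10  18  22
# 2  3  7  11  19  23
# 3  4  8  12  20  24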
