Choose a specific column of an imported text file - python-3.x

I am trying to import a text file into Python. The first column is a date and the others are numeric. After importing the text file I want to extract each column, name them, and plot each variable vs. date (the first column). How can I extract the columns? And how can I select from the 2nd column onwards? I tried two different methods for importing the file:
btcv = np.genfromtxt('example_Feb.388.btcv.txt', dtype=None);
and
btcv = pd.read_csv('example_Feb.388.btcv.txt', header = None)
The text file looks like:
"2015-06-17 00:00" -6.830000 -5.642747 -5.642747 -4.057440 -3.867922 -4.377454
"2015-06-18 00:00" -6.830000 -5.630413 -5.630413 -4.045107 -3.855588 -4.365120
"2015-06-19 00:00" -5.245973 -5.627623 -5.627623 -3.967911 -3.836147 -4.309624
"2015-06-20 00:00" -4.568952 -5.620628 -5.620628 -3.871517 -3.837915 -4.238232
"2015-06-21 00:00" -4.620864 -5.615302 -5.615302 -3.980928 -4.001598 -4.272657
"2015-06-22 00:00" -4.673435 -5.622433 -5.622433 -4.025599 -4.071035 -4.285809
With 1000 rows and 188 columns.
I tried
btcv.date = btcv[:,0]
and it did not work; btcv[0] returns the full array.
Thanks.

Using pandas you can read it as a CSV and set the delimiter to whitespace:
pd.read_csv('example.csv', delim_whitespace=True, header=None)
This will read the file into a pandas dataframe. You can then name your columns, for example:
df.columns = ['date', 'first', 'second']
Then you can access each column by name, e.g.
date = df.date
Make the date the frame index:
df.index = df.date
and then plot the data frame with a plotting tool
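Putting the steps of this answer together, a minimal runnable sketch on two sample rows from the question (the column names follow the example above and are placeholders; the real file has 188 columns, and df.plot() would then draw each column against the date index):

```python
import io
import pandas as pd

# In-memory stand-in for the whitespace-delimited file;
# the quoted date field is kept as a single column
data = io.StringIO(
    '"2015-06-17 00:00" -6.830000 -5.642747\n'
    '"2015-06-18 00:00" -6.830000 -5.630413\n'
)

df = pd.read_csv(data, delim_whitespace=True, header=None)
df.columns = ['date', 'first', 'second']   # name the columns
df.index = pd.to_datetime(df['date'])      # date as the frame index

# each column is now addressable by name; df['first'] is safer
# than df.first, which collides with the DataFrame.first method
print(df['first'].tolist())
# df.plot() would now plot every numeric column vs. the date index
```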

Related

Merge duplicate rows in a text file using python based on a key column

I have a csv file and I need to merge rows based on a key column, Name.
a.csv
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|234|a-vf|34
Mahesh|4554|a-bg|45
Keren|344|s-bg|45
yankie|999|z-bg|34
yankie|3453|g-bgbbg|45
Expected output: records merged based on Name, so the values from both rows for Mahesh and yankie are combined:
Name|Acc#|ID|Age
Suresh|2345|a-b2|24
Mahesh|[234,4554]|[a-vf,a-bg]|[34,45]
Keren|344|s-bg|45
yankie|[999,3453]|[z-bg,g-bgbbg]|[34,45]
Can someone help me with this in Python?
import pandas as pd
df = pd.read_csv("a.csv", sep="|", dtype=str)
new_df = df.groupby('Name',as_index=False).aggregate(lambda tdf: tdf.unique().tolist() if tdf.shape[0] > 1 else tdf)
new_df.to_csv("data.csv", index=False, sep="|")
Output:
Name|Acc#|ID|Age
Keren|344|s-bg|45
Mahesh|['234', '4554']|['a-vf', 'a-bg']|['34', '45']
Suresh|2345|a-b2|24
yankie|['999', '3453']|['z-bg', 'g-bgbbg']|['34', '45']
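A self-contained sketch of the same groupby, with the file replaced by an in-memory sample (only three of the rows are shown, and the single-value branch returns a scalar rather than the sub-Series, which gives the same output):

```python
import io
import pandas as pd

# In-memory stand-in for a.csv from the question
raw = io.StringIO(
    "Name|Acc#|ID|Age\n"
    "Suresh|2345|a-b2|24\n"
    "Mahesh|234|a-vf|34\n"
    "Mahesh|4554|a-bg|45\n"
)

df = pd.read_csv(raw, sep="|", dtype=str)

# Collapse duplicate Names; columns with more than one value become lists
new_df = df.groupby("Name", as_index=False).aggregate(
    lambda s: s.unique().tolist() if s.shape[0] > 1 else s.iloc[0]
)

print(new_df.to_string(index=False))
```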

Unquoted date in first column of CSV for Python/Pandas read_csv

Incoming CSV from an American Express download looks like below. (I would prefer each field to have quotes around it, but it doesn't.) Pandas is treating the quoted long number in the second CSV column as the first column of the data frame, i.e. 320193480240275508 becomes my "Date" column:
12/13/19,'320193480240275508',Alamo Rent A Car,John
Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit']
df = pd.read_csv(filenameIn, names=colnames, header=0, delimiter=",")
pd.set_option('display.max_rows', 15)
pd.set_option('display.width', 200)
print (df)
print (df.values)
Start of the output:
Date ... DebitCredit
12/13/19 '320193480240275508' ... NaN
I have a routine to reformat the date ( to handle things like 1/3/19, and to add the century). It is called like this:
df['Date'][j] = reformatAmexDate2(df['Date'][j])
That routine shows the date as follows:
def reformatAmexDate2(oldDate):
    print("oldDate=" + oldDate)
which prints:
oldDate='320193480240275508'
I saw this post which recommended dayfirst=True, and added that, but got the same result. I never even told Pandas that column 1 is a date, so I believe it should treat it as text.
IIUC, the problem seems to be names=colnames: it sets new names for the columns being read from the csv file. As you are trying to read specific columns from the csv file, you can use usecols:
df = pd.read_csv(filenameIn,usecols=colnames, header=0, delimiter=",")
Looking at the data, I hadn't noticed the trailing comma after the last column value, i.e. the comma after "DEBIT":
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
I just added another column at the end of my columns array:
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit','NotUsed9']
and life is wonderful.
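A runnable sketch of that fix with the nine-name list, reading the sample line from memory (header=None here because this one-line snippet has no header row, unlike the real download which used header=0):

```python
import io
import pandas as pd

# The sample Amex line; the trailing comma yields a 9th, empty field
line = ("12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,"
        "-12345,178.62,Travel-Vehicle Rental,DEBIT,\n")

colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5',
            'Amount', 'AmexCategory', 'DebitCredit', 'NotUsed9']

df = pd.read_csv(io.StringIO(line), names=colnames, header=None)

# The date now lands in the right column
print(df['Date'].iloc[0])
```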

How to read a big CSV file and read each column separately?

I have a big CSV file that I read as dataframe.
But I cannot figure out how to read each column separately.
I have tried to use sep = '\' but it gives me an error.
I read my file with this code:
filename = "D:\\DLR DATA\\status maritime\\nperf-data_2019-01-01_2019-09-10_bC5HIhAfJWqdQckz.csv"
#Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv(filename)
df1 = df.head()
When I print my dataframe head, I see that something is wrong:
In the variable explorer my dataframe consists of only 1 column with all the data inside.
I tried to set sep to a space and to a comma, but it didn't work.
How can I read all data with appropriate different columns?
I would appreciate any help.
You have a tab-separated file; use \t:
df = dd.read_csv(filename, sep='\t')
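dask.dataframe mirrors the pandas read_csv signature, so the same sep='\t' fix can be sketched with plain pandas on in-memory data (the column names here are made up for illustration, not taken from the real nperf file):

```python
import io
import pandas as pd

# Tab-separated sample standing in for the big CSV
raw = io.StringIO("timestamp\tdownload\tupload\n"
                  "2019-01-01\t42.0\t7.5\n")

# With sep='\t' the fields split into separate columns
# instead of one column holding the whole line
df = pd.read_csv(raw, sep='\t')
print(list(df.columns))
```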

How to read comma separated data into data frame without column header?

I am getting a comma separated data set as bytes which I need to:
1. Convert from bytes to string.
2. Create a csv (can skip this if there is any way to jump straight to step 3).
3. Format and read as a data frame without treating the first row as column names.
(Later I will be using this df to compare with oracle db output.)
Input data:
val = '-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
type(val)
I managed to get to step 3, but my first row is becoming the header. I am fine if we can give any value as column header, e.g. a, b, c, ...
#1 code I tried to convert bytes to str
strval = val.decode('ascii').strip()
#2 code to create the csv. First I created a blank csv and later appended the data
import csv
import pandas as pd
abc = ""
with open('csvfile.csv', 'w') as csvOutput:
    testData = csv.writer(csvOutput)
    testData.writerow(abc)
with open('csvfile.csv', 'a') as csvAppend:
    csvAppend.write(val)
#3 now converting it into dataframe
df = pd.read_csv('csvfile.csv')
# hdf = pd.read_csv('csvfile.csv', column=none) -- this give NameError: name 'none' is not defined
output:
df
According to the read_csv documentation, it should be enough to add header=None as a parameter:
df = pd.read_csv('csvfile.csv', header=None)
In this way the header will be interpreted as a row of data. If you want to exclude this line then you need to add the skiprows=1 parameter:
df = pd.read_csv('csvfile.csv', header=None, skiprows=1)
You can do it like this without saving to a csv file; you don't need to convert the bytes to a string or write them to a file.
Here val is of type bytes; if it is of type string, as in your example, you can use io.StringIO instead of io.BytesIO:
import pandas as pd
import io
val = b'-8335,Q1,2017,2002-07-10 00:00:00.0,-,Mr. A,4342000,AnalystA,0,F\n-8336,Q1,2017,2002-07-11 00:00:00.0,-,Mr. B,4342001,Analyst A,0,F\n-8337,Q1,2017,2002-07-10 00:00:00.0,-,Mr. C,4342002,Analyst A,0,F\n'
buf_bytes = io.BytesIO(val)
pd.read_csv(buf_bytes, header=None)

Pandas - Adding dummy header column in csv

I am trying to concat several csv files by customer group using the below code:
files = glob.glob(file_from + "/*.csv") <<-- Path where the csv resides
df_v0 = pd.concat([pd.read_csv(f) for f in files]) <<-- Dataframe that concat all csv files from files mentioned above
The problem is that the number of columns in the csv varies by customer, and the files do not have a header row.
I am trying to see if I could add in a dummy header with labels such as col_1, col_2, ... depending on the number of columns in that csv.
Could anyone guide me on how to get this done? Thanks.
Update on trying to search for a specific string in the Dataframe:
Sample Dataframe
col_1,col_2,col_3
fruit,grape,green
fruit,watermelon,red
fruit,orange,orange
fruit,apple,red
Trying to filter for rows containing the word red, expecting it to return rows 2 and 4.
Tried the below code:
df[~df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)]
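One note on that filtering attempt: the leading ~ inverts the mask, so the expression above drops the matching rows instead of keeping them. A minimal sketch on the sample frame, without the ~:

```python
import pandas as pd

# The sample dataframe from the question
df = pd.DataFrame({'col_1': ['fruit'] * 4,
                   'col_2': ['grape', 'watermelon', 'orange', 'apple'],
                   'col_3': ['green', 'red', 'orange', 'red']})

# Without ~, the mask keeps rows where any cell contains 'red'
mask = df.apply(lambda x: x.astype(str).str.contains('red')).any(axis=1)
print(df[mask])   # the watermelon and apple rows
```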
Use the parameter header=None for default range columns 0, 1, 2, ..., and skiprows=1 if you need to remove original column names:
df_v0 = pd.concat([pd.read_csv(f, header=None, skiprows=1) for f in files])
If you also want to change the column names, add rename:
dfs = [pd.read_csv(f, header=None, skiprows=1).rename(columns = lambda x: f'col_{x + 1}')
for f in files]
df_v0 = pd.concat(dfs)
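A runnable sketch of the rename approach, with in-memory strings standing in for the glob'd files (the sample rows are made up, and skiprows=1 is omitted because these samples have no header row to drop):

```python
import io
import pandas as pd

# Two header-less "files" with different column counts
f1 = io.StringIO("fruit,grape,green\nfruit,apple,red\n")
f2 = io.StringIO("veg,carrot\n")

# header=None gives integer columns 0, 1, 2, ...;
# rename turns them into col_1, col_2, col_3, ...
dfs = [pd.read_csv(f, header=None).rename(columns=lambda x: f'col_{x + 1}')
       for f in (f1, f2)]
df_v0 = pd.concat(dfs)

# concat takes the union of columns; the narrower file gets NaN
print(list(df_v0.columns))
```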
