Writing multiple columns into csv files using python - python-3.x

I am a python beginner. I am trying to write multiple lists into separate columns in a csv file.
In my csv file, I would like to have
2.9732676520000001 0.0015852047556142669 1854.1560636319559
4.0732676520000002 0.61902245706737125 2540.1258143280334
4.4032676520000003 1.0 2745.9167395368572
Following is the code that I wrote.
df=[2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS=[1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak=[0.0015852047556142669, 0.61902245706737125, 1.0]
with open('output/'+file, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
This isn't running properly. I'm getting the error message non-empty format string passed to object.__format__. I'm having a hard time catching what is going wrong. Could anyone spot what's wrong with my code?

You are better off using pandas
import pandas as pd
df=[2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS=[1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak=[0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"
# pandas can convert a list of lists to a dataframe;
# each list becomes a row, so the dataframe is transposed
# afterwards to get the desired column layout.
df = pd.DataFrame([df, int_peak, CS])
df = df.transpose()
# write the data to the output path "output/" + file_name,
# without the dataframe index and without a header row;
# float_format='%f' rounds to six decimals to match the output below.
df.to_csv("output/"+file_name, index=False, header=False, float_format='%f')
The output CSV looks like this:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
However, to fix your original code, you need to use a file name variable other than file. I changed that in your code as follows:
df=[2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS=[1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak=[0.0015852047556142669, 0.61902245706737125, 1.0]
file_name = "your_file_name.csv"
with open('/tmp/'+file_name, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
and it works. The output is as follows:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
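As a side note, the standard-library csv module handles delimiters and quoting for you; here is a minimal sketch of the same loop (the file name is a placeholder, and values are written at full precision rather than rounded):
import csv

df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

# newline='' avoids blank lines between rows on Windows
with open('/tmp/your_file_name.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # zip pairs the three lists element-wise, one row per triple
    writer.writerows(zip(df, int_peak, CS))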

If you need to write only a few selected columns to CSV, use the columns option of to_csv:
csv_data = df.to_csv(columns=['Name', 'ID'])
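When no path is passed, to_csv returns the CSV text as a string, which is what csv_data captures above. A small sketch (the DataFrame and its column names are made up for illustration):
import pandas as pd

# hypothetical frame; only Name and ID end up in the output
df = pd.DataFrame({'Name': ['Ann', 'Bob'], 'ID': [1, 2], 'Extra': [0, 0]})
csv_data = df.to_csv(columns=['Name', 'ID'], index=False)
print(csv_data)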

Related

Better way to read columns from excel file as variables in Python

I have an .xlsx file with 5 columns (X, Y, Z, Row_Cog, Col_Cog), always in the same order. I would like to have each column as a variable in Python. I am using the method below, but would like to know if there is a better way to do it.
Also, I am writing the range manually (in the for loop), while I would like a robust way to find the length of each column in Excel (the number of rows) and use that instead.
#READ THE TEST DATA from Excel file
import xlrd
workbook = xlrd.open_workbook(r"C:\Desktop\SawToothCalib\TestData.xlsx")
worksheet = workbook.sheet_by_index(0)
X_Test=[]
Y_Test=[]
Row_Test=[]
Col_Test=[]
for i in range(1, 29):
    x_val = worksheet.cell_value(i, 0)
    X_Test.append(x_val)
    y_val = worksheet.cell_value(i, 2)
    Y_Test.append(y_val)
    row_val = worksheet.cell_value(i, 3)
    Row_Test.append(row_val)
    col_val = worksheet.cell_value(i, 4)
    Col_Test.append(col_val)
Do you really need this package? You can easily do this kind of operation with pandas.
You can read your file as a DataFrame with:
import pandas as pd
df = pd.read_excel(path + 'file.xlsx', sheet_name=the_sheet_you_want)
and access the list of columns with df.columns. You can access each column with df['column name']. If there are empty entries, they are stored as NaN; you can count them with df['column_name'].isnull().sum().
If you are uncomfortable with DataFrames, you can then convert the columns to lists or arrays, like
df['my_col'].tolist()
or
df['my_col'].to_numpy()
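Putting it together, here is a sketch of what the xlrd loop above could become (the path and the column labels are assumed from the question):
import pandas as pd

# read every data row; no hard-coded range(1, 29) needed
df = pd.read_excel(r"C:\Desktop\SawToothCalib\TestData.xlsx", sheet_name=0)

# column labels assumed to match the headers named in the question
X_Test = df['X'].tolist()
Y_Test = df['Y'].tolist()
Row_Test = df['Row_Cog'].tolist()
Col_Test = df['Col_Cog'].tolist()
print(len(X_Test))  # the row count, discovered automatically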

Dask: Add list to a column value like pandas does

I am a bit new to dask. I have a large csv file and a large list; the number of rows in the csv equals the length of the list. I am trying to create a new column in the Dask dataframe from that list. In pandas this is pretty straightforward, but in Dask I am having a hard time creating the new column. I am avoiding pandas because my data is 15GB+.
Please see my tries below.
csv Data
name,text,address
john,some text here,MD
tim,some text here too,WA
Code tried
import dask.dataframe as dd
import numpy as np
ls = ['one','two']
ddf = dd.read_csv('../data/test.csv')
ddf.head()
Try #1:
ddf['new'] = ls # TypeError: Column assignment doesn't support type list
Try #2: What should be passed here for condlist?
ddf['new'] = np.select(choicelist=ls) # TypeError: _select_dispatcher() missing 1 required positional argument: 'condlist'
Looking for this output:
name text address new
0 john some text here MD one
1 tim some text here too WA two
Try creating a dask array and then assigning it to the column like this:
# ls = dd.from_array(np.array(['one','two']))
# ddf['new'] = ls
# the dask.array version below is what the OP tested successfully
import dask.array as da
ls = da.array(['one','two','three'])
ddf['new'] = ls
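For reference, here is a minimal self-contained sketch of the same idea (it assumes your dask version supports assigning a dask array with known chunk sizes to a column; the tiny frame stands in for the real 15GB+ csv):
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# tiny stand-in for the large csv from the question
pdf = pd.DataFrame({'name': ['john', 'tim'],
                    'text': ['some text here', 'some text here too'],
                    'address': ['MD', 'WA']})
ddf = dd.from_pandas(pdf, npartitions=1)

# chunk the array so its blocks line up with the dataframe partitions
new_col = da.from_array(np.array(['one', 'two']), chunks=2)
ddf['new'] = new_col
print(ddf.compute())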

How to manipulate this csv in Python: when I input one Character like psg, the result should be 9

This is the example csv. I already tried another way, but it still prints useless things; I just want to get the values (Count).
You can just read it with pandas:
import pandas as pd
df = pd.read_csv(r'FILE.csv')
print(df['Count'])
If you need to print only the Counts:
print(*df['Count'].values.tolist(),sep="\n")
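If the goal from the title is to look up the Count for a single key such as "psg", something along these lines should work ('Character' is a guessed column name, since the example csv isn't shown):
# 'Character' is an assumed column name; adjust it to your csv header
match = df.loc[df['Character'] == 'psg', 'Count']
print(match.iloc[0])  # e.g. 9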

Why do my lists become strings after saving to csv and re-opening? Python

I have a Dataframe in which each row contains a sentence followed by a list of part-of-speech tags, created with spaCy:
df.head()
question POS_tags
0 A title for my ... [DT, NN, IN,...]
1 If one of the ... [IN, CD, IN,...]
When I write the DataFrame to a csv file (encoding='utf-8') and re-open it, it looks like the data format has changed with the POS tags now appearing between quotes ' ' like this:
df.head()
question POS_tags
0 A title for my ... ['DT', 'NN', 'IN',...]
1 If one of the ... ['IN', 'CD', 'IN',...]
When I now try to use the POS tags for some operations, it turns out they are no longer lists but have become strings that even include the quotation marks. They still look like lists but are not. This is clear when doing:
q = df['POS_tags']
q = list(q)
print(q)
Which results in:
["['DT', 'NN', 'IN']"]
What is going on here?
I either want the column 'POS_tags' to still contain lists after saving to csv and re-opening, or I want an operation on the column 'POS_tags' that recovers the lists spaCy originally created. Any advice on how to do this?
To preserve the exact structure of the DataFrame, an easy solution is to serialize it in pickle format with to_pickle instead of using csv, which always throws away all data type information and requires manual reconstruction after re-import. One drawback of pickle is that it is not human-readable.
# Save to pickle
df.to_pickle('pickle-file.pkl')
# Save with compression
df.to_pickle('pickle-file.pkl.gz', compression='gzip')
# Load pickle from disk
df = pd.read_pickle('pickle-file.pkl') # or...
df = pd.read_pickle('pickle-file.pkl.gz', compression='gzip')
Fixing lists after importing from CSV
If you've already imported from CSV, this should convert the POS_tags column from strings to python lists:
from ast import literal_eval
df['POS_tags'] = df['POS_tags'].apply(literal_eval)
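Alternatively, you can apply the conversion while reading, so the column never exists as plain strings (the file name is a placeholder):
import pandas as pd
from ast import literal_eval

# parse the stringified lists back into real lists during the read
df = pd.read_csv('your_file.csv', converters={'POS_tags': literal_eval})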

using split() to split values in an entire column in a python dataframe

I am trying to clean a list of URLs that have garbage at the end, as shown:
/gradoffice/index.aspx(
/gradoffice/index.aspx-
/gradoffice/index.aspxjavascript$
/gradoffice/index.aspx~
I have a csv file with over 190k records of different URLs. I loaded the csv into a pandas dataframe and took the entire column of URLs into a list with the statement
str = df['csuristem']
It clearly gave me all the values in the column. When I use the following code, it only prints about 40k records, and it starts somewhere in the middle. The program runs without errors but shows only a partial number of results; I don't know where I am going wrong. Any help would be much appreciated.
import pandas
table = pandas.read_csv("SS3.csv", dtype=object)
df = pandas.DataFrame(table)
str = df['csuristem']
for s in str:
    s = s.split(".")[0]
    print s
I am looking to get an output like this
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
/gradoffice/index.
Thank you,
Santhosh.
You need to call .str.split on the column and then .str[0] to take the first portion of each split string:
In [6]:
df['csuristem'].str.split('.').str[0]
Out[6]:
0 /gradoffice/index
1 /gradoffice/index
2 /gradoffice/index
3 /gradoffice/index
Name: csuristem, dtype: object
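To keep the cleaned values, assign the result back to the column (or to a new one):
df['csuristem'] = df['csuristem'].str.split('.').str[0]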
