Perform a calculation in a newly added column in a CSV in Python - python-3.x

I am trying to add a new column to a CSV file in Python 3. The CSV file has a header row, and I don't need the first two columns at this point; the other eight columns contain the four coordinates of a polygon. I am trying to add a new column that calculates the area from the points in the CSV. I have seen several similar questions on Stack Overflow and tried to use the information there in my code, but at the moment only the last line of the CSV is displayed, and I don't think the area is calculated correctly either. Any suggestions? (FYI, this is my first script working with a CSV.)
Here is my code:
import csv

with open('poly.csv', 'rU') as input:
    with open('polyout.csv', 'w') as output:
        writer = csv.writer(output, lineterminator='\n')
        reader = csv.reader(input)
        coords = []
        row = next(reader)        # skip the header row
        row = next(reader, None)  # read the first data row
        coords = row[2:]
        prev_de = coords[-2]
        prev_dn = coords[-1]
        prev_de = float(prev_de)
        prev_dn = float(prev_dn)
        areasq = float(0)
        for de, dn in zip(coords[:-1:2], coords[1::2]):
            areasq += (float(de) * float(prev_dn)) - (float(dn) * float(prev_de))
            prev_de, prev_dn = de, dn
        area = abs(areasq) / 2
        for row in reader:
            row.append(area)
            coords.append(row)
        writer.writerows(coords)
        print(row)

I would recommend you use pandas for this.
import pandas as pd

df = pd.read_csv('./poly.csv')
df['area'] = calculate_area(df)  # implement calculate_area
df.to_csv('polyout.csv', index=False)
You're probably better off just using plain NumPy; see the answers to this question: Calculate area of polygon given (x,y) coordinates.
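For completeness, here is one hedged sketch of what calculate_area could look like using NumPy's vectorized shoelace formula. The coordinate column names x1, y1, ..., x4, y4 are an assumption, not taken from the question; rename them to match the actual header:

import numpy as np

def calculate_area(df):
    # Stack the four vertices into (n_rows, 4) arrays of x and y values.
    # The column names x1..y4 are assumed; adjust to your file.
    x = df[['x1', 'x2', 'x3', 'x4']].to_numpy(dtype=float)
    y = df[['y1', 'y2', 'y3', 'y4']].to_numpy(dtype=float)
    # Shoelace formula: np.roll pairs each vertex with the next one,
    # wrapping around so the last vertex closes the polygon.
    cross = x * np.roll(y, -1, axis=1) - np.roll(x, -1, axis=1) * y
    return np.abs(cross.sum(axis=1)) / 2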

My data: the first quadrilateral is given clockwise, the second anticlockwise.
$ cat a.csv
a,b,x1,y1,x2,y2,x3,y3,x4,y4
a,b,3,3,3,9,4,9,4,3
e,f,0,0,5,0,5,5,0,5
$
Imports; I also import stdout so that I can show my results on screen:
from csv import reader, writer
from sys import stdout
Use the csv classes:
data = reader(open('a.csv'))
out = writer(stdout)
Process the headers (assuming a single header row):
headers = next(data)
headers = headers+['A']
out.writerow(headers)
Loop over the data, process it, and output the processed rows:
for row in data:
    # the list comprehension is unpacked into aptly named variables
    x1, y1, x2, y2, x3, y3, x4, y4 = [int(v) for v in row[2:]]
    # https://en.wikipedia.org/wiki/Shoelace_formula#Examples
    a = (x1*y2 + x2*y3 + x3*y4 + x4*y1 - y1*x2 - y2*x3 - y3*x4 - y4*x1) / 2
    row.append(a)
    out.writerow(row)
I have saved the above in a file named area.py, and finally we have:
$ python3 area.py
a,b,x1,y1,x2,y2,x3,y3,x4,y4,A
a,b,3,3,3,9,4,9,4,3,-6.0
e,f,0,0,5,0,5,5,0,5,25.0
$
To use the shoelace formula as is, note that it comes out positive when the points are ordered anticlockwise and negative when they are ordered clockwise (hence the -6.0 for the first quadrilateral above); if your data runs the other way, just write a = -(... or take abs(a).
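If the polygons may ever have more than four vertices, the same formula generalizes; a minimal sketch:

def shoelace(points):
    """Signed shoelace area; `points` is a list of (x, y) tuples."""
    area = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        area += x0 * y1 - x1 * y0
    return area / 2  # positive for anticlockwise input, negative for clockwise

print(shoelace([(0, 0), (5, 0), (5, 5), (0, 5)]))  # 25.0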

Related

My matplotlib plot is working too slowly; is there a way to make it more efficient?

When I open the plot or try to zoom, everything works very slowly. Can you tell me whether there is something I can change to make it more efficient?
The data is only about ~1600 numbers.
My numbers are just in plain list format, like [1,2,3]; I don't use pandas, and I get the data from a CSV file (see picture 2, please).
Picture 1 - The plot
Picture 2 - CSV file of data
with open(r"C:\Users\Idensas\PycharmProjects\pythonProject\f.csv",'r',encoding='utf-8') as f:
x = []
y = []
for i in f.readlines():
y.append(i[0:8])
x.append(float(i[9:15]))
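A guess, since the plotting code itself is not shown: the y values are kept as 8-character strings, so matplotlib treats them as ~1600 distinct categorical tick labels, which makes rendering and zooming very slow. If that field is numeric, converting it to float should restore normal speed (a sketch; if it is actually a timestamp, parse it with datetime instead):

import matplotlib.pyplot as plt

x, y = [], []
with open(r"C:\Users\Idensas\PycharmProjects\pythonProject\f.csv", 'r', encoding='utf-8') as f:
    for line in f:
        y.append(float(line[0:8]))   # floats, not strings, so the y-axis stays numeric
        x.append(float(line[9:15]))

plt.plot(x, y)  # a single plot call for all ~1600 points
plt.show()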

Pandas dropped rows showing in plot

I am trying to make a heatmap.
I get my data out of a pipeline that classifies some rows as noisy; I decided to make one plot including them and one plot without them.
The problem I have: in the plot without the noisy rows, blank lines appear (the same number of lines as rows removed).
Roughly, the code looks like this (I can expand parts if required; I am trying to keep it short).
If needed I can provide a link to similar, publicly available data.
data_frame = load_df_fromh5(file)  # load a data frame from the hdf5 output
noisy = [..]  # a 0/1 vector indicating which rows are noisy
# I believe the problem is here:
noisy = [i for (i, v) in enumerate(noisy) if v == 1]  # the indices of the rows to remove
# drop the corresponding indices
df_cells_noisy = df_cells[~df_cells.index.isin(noisy)].dropna(how="any")
# I tried an alternative method:
not_noisy = [0 if e == 1 else 1 for e in noisy]
df = df[np.array(not_noisy, dtype=bool)]
# then I made a clustering using scipy
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
df = df.reindex(hierarchy.leaves_list(Z))
# then I plot using the df variable
# (quite a long function; I believe the problem is upstream)
plot(df)
The plotting code is quite long, but I believe it works well, because the problem only shows up with the de-noised data frame.
My guess is that pandas somehow keeps information about the deleted rows and that they are plotted as blank lines. Any help is welcome.
Context:
These are single-cell data of copy-number anomalies (abnormalities in the number of copies of a genomic segment).
Rows represent individuals (here, individual cells); columns represent, for each genomic interval, the number of copies (2 for vanilla, except for the sex chromosomes).
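A likely culprit, offered as a guess since the plotting function is not shown: hierarchy.leaves_list(Z) returns positions 0..n-1, but after boolean masking the data frame keeps its original index, so df.reindex(...) inserts an all-NaN row for every label that was dropped, and those plot as blank lines. Resetting the index after filtering, or selecting by position, should avoid that:

import numpy as np
from scipy.cluster import hierarchy

# renumber the rows so that labels match positions again
df = df[np.array(not_noisy, dtype=bool)].reset_index(drop=True)
Z = hierarchy.linkage(df, method="average", metric="canberra", optimal_ordering=True)
# .iloc selects by position, sidestepping label/position mismatches entirely
df = df.iloc[hierarchy.leaves_list(Z)]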

Am I Scrambling my data when writing to netCDF?

I have created a 2D array that when plotted looks like this:
Basically it is an array of shape [101, 365], holding numbers that range from 0.0 to 1.2 plus some NaNs.
I am writing it to a netCDF4 file in this manner:
import numpy as np
from netCDF4 import Dataset

nc_out = Dataset(nc_out_file, 'w', format='NETCDF4')
# Create dimensions
y = nc_out.createDimension('y', 101)
x = nc_out.createDimension('x', 365)
# Create variables
latitudes = nc_out.createVariable('latitude', np.float32, ('y',))
days = nc_out.createVariable('days', np.float32, ('x',))
on2_climo = nc_out.createVariable('on2_climo', np.float32, ('x', 'y'))
# Fill variables
latitudes[:] = lat
days[:] = day
on2_climo[:] = data
nc_out.close()
However, when I plot the data I've saved in the file it looks nothing like the original data:
What is going on here? The faint diagonal lines make me think there is something weird going on here...
Is there a better way to code a netCDF4 file? I'd share a copy of the original data with you... but I can't seem to get a faithful version of it saved...
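A plausible explanation, assuming data really has shape (101, 365): the variable is declared with dimensions ('x', 'y'), i.e. shape (365, 101), which is the transpose of the array, so the values are laid out in the wrong order when written, and that kind of silent reshaping would produce exactly this sort of diagonal striping. Declaring the dimensions in the array's own order, or writing the transpose, should fix it:

# match the variable's shape (y=101, x=365) to the array's shape [101, 365]
on2_climo = nc_out.createVariable('on2_climo', np.float32, ('y', 'x'))
on2_climo[:] = data

# ...or keep ('x', 'y') and write the transpose instead:
# on2_climo[:] = data.T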

Pandas - iterating to fill values of a dataframe

I'm trying to build a data frame of time-series data. I have to retrieve the data from an API, and every (i, j) entry in the data frame (where i is the row and j is the column) has to be iterated over and filled individually.
Here's an idea of the type of thing I'm trying to do (note: the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time

def retrievedata(string):
    # take the string, do some stuff with the API, return a float
    ...

label_list = ['label1', 'label2', 'label3']  # etc.
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=label_list)
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now I want to iterate over it and assign a value to every (i, j) entry based on the function I wrote to pull data. Note that the function is specific to a given column (it takes the column label as input), and on top of that each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-pythonic, non-numpy, non-pandas way of doing things. I'd like to find a nicer, more efficient, less compute-intensive way of doing this.
I'm assuming it's going to involve some combination of: iterrows(), itertuples(), df.loc, df.at.
I'm stumped though.
Any ideas?
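One idea: since every (row, column) value needs its own API call, the calls will dominate the runtime no matter how the frame is filled, but the per-cell .at writes can be avoided by building each column as a list and constructing the frame in one step. A sketch, assuming retrievedata behaves as described above:

import pandas as pd

# one column per label, one API call per cell, but no per-cell DataFrame writes
df = pd.DataFrame({
    label: [retrievedata(label) for _ in range(discrete_points)]
    for label in label_list
})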

Matplotlib: Import and plot multiple time series with legends direct from .csv

I have several spreadsheets containing data saved as comma delimited (.csv) files in the following format: The first row contains column labels as strings ('Time', 'Parameter_1'...). The first column of data is Time and each subsequent column contains the corresponding parameter data, as a float or integer.
I want to plot each parameter against Time on the same plot, with parameter legends which are derived directly from the first row of the .csv file.
My spreadsheets have different numbers of (columns of) parameters to be plotted against Time; so I'd like to find a generic solution which will also derive the number of columns directly from the .csv file.
The attached minimal working example shows what I'm trying to achieve using np.loadtxt (minus the legend), but I can't find a way to import the column labels from the .csv file to make the legends with this approach.
np.genfromtxt offers more functionality, but I'm not familiar with it and am struggling to find a way of using it to do the above.
Plotting data in this style from .csv files must be a common problem, but I've been unable to find a solution on the web. I'd be very grateful for your help and suggestions.
Many thanks
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('Data.csv', skiprows=1, delimiter=',') # skip the column labels
cols = data.shape[1] # get the number of columns in the array
for n in range(1, cols):
    plt.plot(data[:, 0], data[:, n])  # plot each parameter against time
plt.xlabel('Time', fontsize=14)
plt.ylabel('Parameter values', fontsize=14)
plt.show()
Here's my minimal working example for the above using genfromtxt rather than loadtxt, in case it is helpful for anyone else.
I'm sure there are more concise and elegant ways of doing this (I'm always happy to get constructive criticism on how to improve my coding), but it makes sense and works OK:
import numpy as np
import matplotlib.pyplot as plt

# dtype=None automatically picks an appropriate type (string, int, etc.) from the cell contents
arr = np.genfromtxt('Data.csv', delimiter=',', dtype=None)
names = arr[0]  # the first row of the array holds the column names
for n in range(1, len(names)):  # plot each column in turn against column 0 (= time)
    plt.plot(arr[1:, 0], arr[1:, n], label=names[n])  # omitting the first row (= column names)
plt.legend()
plt.show()
The function numpy.genfromtxt is more for broken tables with missing values than for what you're trying to do. What you can do is simply open the file before handing it to numpy.loadtxt and read the first line; then you don't even need to skip it. Here is an edited version of the code above that reads the labels and makes the legend:
"""
Example data: Data.csv:
Time,Parameter_1,Parameter_2,Parameter_3
0,10,0,10
1,20,30,10
2,40,20,20
3,20,10,30
"""
import numpy as np
import matplotlib.pyplot as plt

# open the file
with open('Data.csv') as f:
    # read the names of the columns first
    names = f.readline().strip().split(',')
    # np.loadtxt can also handle already-open files
    data = np.loadtxt(f, delimiter=',')  # no skip needed any more

cols = data.shape[1]
for n in range(1, cols):
    # the labels go in here
    plt.plot(data[:, 0], data[:, n], label=names[n])
plt.xlabel('Time', fontsize=14)
plt.ylabel('Parameter values', fontsize=14)
# and finally the legend is made
plt.legend()
plt.show()
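As an aside, if adding a dependency is acceptable, pandas collapses the whole read-and-label step, since read_csv picks up the header row as column names and DataFrame.plot builds the legend from them:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Data.csv')  # the header row becomes the column names
df.plot(x='Time')             # one line per remaining column, legend included
plt.show()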
