I'm trying to only import part of a CSV file to a list.
In short, the CSV I recveive contains two columns [depth and speed]. Depth always starts at zero, gets larger and then back to zero again.
I would like to add the first part of the CSV to the list (depth 0-13+). I then want to add the second part of the CSV (13-0) to another list.
I assume a for loop would be the way to go, but I don't know how to check each row for ascending/descending numbers.
pullData = open("svp3.csv","r").read()
dataArray = pullData.split('\n')
depthArrayY = []
speedArrayX = []
depthArrayLength = len(depthArrayY)
for eachLine in dataArray:
if len(eachLine)>1:
x,y = eachLine.split(',')
speedArrayX.append(round(float(x), 2))
depthArrayY.append(round(float(y), 2))
I'd suggest using Pandas, I think it will allow you for much more when you need to deal with imported data.
import pandas as pd
df = pd.read_csv('svp3.csv')
tmp = df[df.depth <= df.depth.shift(-1)].values
depth_increase = tmp[:,0]
speed_while_depth_increase = tmp[:,1]
tmp = df[df.depth > df.depth.shift(-1)].values
depth_decrease = tmp[:,0]
speed_while_depth_decrease = tmp[:,1]
I assumed that your CSV has the first the depth column then the speed column.
Depth column had values from 0 to a certain max value say 14, then from 13 to 0 depth column->[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,13,12,11,10,9,8,7,6,5,4,3,2,1]
and I populated speed column with some random values.
The following code makes use of pandas library and splits the column of depth into 2 lists of ascending and descending values using a simple logic of storing the current max value to determine when the ascending part of the column ends.
import pandas as pd
data = pd.read_csv('svp3.csv')
max_val = -10000
depthArrayAscendingY = []
speedArrayX = []
depthArrayDescendingY = []
for a in data.values:
if a[0]>max_val:
depthArrayAscendingY.append(a[0])
speedArrayX.append(a[1])
max_val = a[0]
else:
depthArrayDescendingY.append(a[0])
speedArrayX.append(a[1])
The answer to this question by Baleato is more efficient and cleaner than this answer, you should definitely check their answer.
Related
I am manipulating .csv files. I have to loop through each column of numeric data in the file and enter them into different lists. The code I have is the following:
import csv
salto_linea = "\n"
csv_file = "02_CSV_data1.csv"
with open(csv_file, 'r') as csv_doc:
doc_reader = csv.reader(csv_doc, delimiter = ",")
mpg = []
cylinders = []
displacement = []
horsepower = []
weight = []
acceleration = []
year = []
origin = []
lt = [mpg, cylinders, displacement, horsepower,
weight, acceleration, year, origin]
for i,ln in zip(range (0,9),lt):
print(f"{i} -> {ln}")
for row in doc_reader:
y = row[i]
ln.append(y)
In the loop, try to have range() serve me as an index so that in the nested for loop, it loops through the first column (the first element of each row in the csv) and feeds it into the first list of 'lt'. The problem I have is that I go through the data column and enter it, but range() continues to advance in the first loop, ending the nesting, thinking that it would iterate i = 1, and that new value of 'i' would enter again. the nested loop traversing the next column and vice versa. I also tried it with some other while loop to iterate a counter that adds to each iteration and serves as an index but it didn't work either.
How I can fill the sublists in 'lt' with the data which is inside the csv file??
without seing the ontents of the CSV file itself, the best way of reading the data into a table is with the pandas module, which can be done in one line of code.
import pandas as pd
df = pd.read_csv('02_CSV_data1.csv')
this would have read all the data into a dataframe and you can work with this.
Alternatively, ammend the for loop like this:
for row in doc_reader:
for i, ln in enumerate(lt):
ln.append(row[i])
for bigger data, i would prefer pandas which has vectorised methods.
Sample DataFrame:
id date price
93 6021501535 2014-07-25 430000
93 6021501535 2014-12-23 700000
313 4139480200 2014-06-18 1384000
313 4139480200 2014-12-09 1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430,000, 1,384,000]
second_list = [700,000, 1,400,000]
After which, I'm going to plot the values from both lists on a lineplot to compare the difference in price between the first and second list.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question I think it's not necessary to save them into a list because you could also store them somewhere else (e.g. another DataFrame) and plot them. The functions below should help with filling wherever you want to store your data.
def date(your_id):
first_date = df.loc[(df['id']==your_id)].iloc[0,1]
second_date = df.loc[(df['id']==your_id)].iloc[1,1]
return first_date, second_date
def price(your_id):
first_date, second_date = date(your_id)
price_first_date = df.loc[(df['id']==6021501535) & (df['date']==first_date)].iloc[0,2]
price_second_date = df.loc[(df['id']==6021501535) & (df['date']==second_date)].iloc[0,2]
return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now for example you want to store your data in a new df you could do something like:
selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1,len(selected_ids)+1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
your_id = selected_ids[i]
new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
This is my sample code
dataset_current=dataset_seq['Motor_Current_Average']
dataset_consistency=dataset_seq['Consistency_Average']
#technique with non-overlapping the values(for current)
dataset_slide=dataset_current.tolist()
from window_slider import Slider
import numpy
list = numpy.array(dataset_slide)
bucket_size = 336
overlap_count = 0
slider = Slider(bucket_size,overlap_count)
slider.fit(list)
empty_dictionary = {}
count = 0
while True:
count += 1
window_data = slider.slide()
empty_dictionary['df_current%s'%count] = window_data
empty_dictionary['df_current%s'%count] =pd.DataFrame(empty_dictionary['df_current%s'%count])
empty_dictionary['df_current%s'%count]= empty_dictionary['df_current%s'%count].rename(columns={0: 'Motor_Current_Average'})
if slider.reached_end_of_list(): break
locals().update(empty_dictionary)
#technique with non-overlapping the values(for consistency)
dataset_slide_consistency=dataset_consistency.tolist()
list = numpy.array(dataset_slide_consistency)
slider_consistency = Slider(bucket_size,overlap_count)
slider_consistency.fit(list)
empty_dictionary_consistency = {}
count_consistency = 0
while True:
count_consistency += 1
window_data_consistency = slider_consistency.slide()
empty_dictionary_consistency['df_consistency%s'%count_consistency] = window_data_consistency
empty_dictionary_consistency['df_consistency%s'%count_consistency] =pd.DataFrame(empty_dictionary_consistency['df_consistency%s'%count_consistency])
empty_dictionary_consistency['df_consistency%s'%count_consistency]= empty_dictionary_consistency['df_consistency%s'%count_consistency].rename(columns={0: 'Consistency_Average'})
if slider_consistency.reached_end_of_list(): break
locals().update(empty_dictionary_consistency)
import pandas as pd
output_current ={}
increment = 0
while True:
increment +=1
output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%count_consistency],empty_dictionary['df_current%s'%count]],axis=1)
My question is i have two dictionaries that contains 79 data frames in each one of them namely "empty_dictionary_consistency" and "empty_dictionary" . I want to create a new data frame for each one of them so that it concatenates df1 from empty_dictionary_consistency with df1 from empty_dictionary .So , it will start from concatenating df1 from empty_dictionary_consistency with df1 from empty_dictionary till df79 from empty_dictionary_consistency with df79 from empty_dictionary . I tried using while loop to increment it but does not shows any output.
output_current ={}
increment = 0
while True:
increment +=1
output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%count_consistency],empty_dictionary['df_current%s'%count]],axis=1)
Can anyone help me regarding this? How can i do this.
I am not near my computer now, so I can not test the code, but it seems that the problem is in indices. In the last loop, on every iteration you increment a variable called 'increment', but you still use indices from previous loops for dictionaries that you want to concatenate. Try to change variables that you use for indexing all dictionaries to 'increment'.
And one more thing - I can't see when this loop is going to finish?
UPD
I mean this:
length = len(empty_dictionary_consistency)
increment = 0
while increment < length:
increment +=1
output_current['dataframe%s'%increment] = pd.concat([empty_dictionary_consistency['df_consistency%s'%increment],empty_dictionary['df_current%s'%increment]],axis=1)
While iterating over your dictionaries you should use a variable that you increment as an index in all three dictionaries. And as soon as you do not use a Slider object in the loop, you have to stop it when the first dictionary is over.
I am trying to use python to conduct a calculation which will sum the values in a column only for the time period that a certain condition is met.
However, the summation should begin when the conditions are met (runstat == 0 and oil >1). The summation should then stop at the point when oil == 0.
I am new to python so I am not sure how to do this.
I connected the code to a spreadsheet for testing purposes but the intent is to connect to live data. I figured a while loop in combination with an if function might work but I am not winning.
Basically I want to have the code start when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil row becomes zero and then it should write the data to a SQL database (this I will figure out later - for now I just want to see if it can work).
This is what code I have tried so far.
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
oil = df['oiltag']
runstat = df['runstattag']
def startup(oil,runstat):
while oil.all() > 0:
if oil > 0 and runstat == 0:
totaloil = sum(oil.all())
print(totaloil)
else:
return None
return
print(startup(oil.all(), runstat.all()))
It should sum the values in the column but it is returning: None
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.
Method: Slice the dataframe to get the relevant rows and then sum.
import numpy as np
import pandas as pd
data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
def startup(dframe):
start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
subset = dframe[start_row:end_row+1] # +1 because the end slice is non-inclusive
totaloil = subset.oiltag.sum()
return totaloil
print(startup(df))
This code will raise an error if it can't find a subset of rows which match your criteria. If you need to handle that case, then we could add some exception handling.
EDIT: Please note this assumes that your criteria is only expected to occur once per excel. If you have multiple “chunks” that you will want to sum then this will need tweaking.
Suppose I have 3 dataframe variables: res_df_union is the main dataframe and df_res and df_vacant are subdataframes created from res_df_union. They all share 2 columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_res and df_vacant, and if they match, to assign vacant_has_res in res_df_union with the value of 1.
*Note: I am using geoPandas (gpd Dataframe) instead of just pandas because I am working with spatial data but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)
df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()
df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()
vacant_res_ids = []
for id_a in unq_id_vac:
for id_b in unq_id_res:
if id_a == id_b:
vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueid's that match. Now I just want to look for those unique id's in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash, or never finishes running (after several hours). What am I doing wrong and is there a more efficient way to do this?
def u_row(row, id_val):
if row['uniqueid'] == id_val:
return 1
for item in res_df_union['uniqueid']:
if item in vacant_res_ids:
res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis = 1)