Convert a csv file to array using pandas - python-3.x

I have a csv file as shown below
Year, Growth
1, 10
2, 12
3, 13
4, 15
I would like to convert the content of the csv file to an array, say my_array, so that I can access growth as shown below:
my_array[1] should return 10
my_array[4] should return 15, etc.
How can I convert the csv to such an array using pandas?

>>> import pandas
>>> df = pandas.read_csv('test.csv')
>>> df
   Year  Growth
0     1      10
1     2      12
2     3      13
3     4      15
>>> arr = df['Growth'].values.tolist()
>>> arr[0]
10
Yes, the code above will do that for whichever column you want.
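Note that arr is 0-indexed, while the question asks for my_array[1] == 10. A small variation (a sketch assuming the same test.csv; skipinitialspace handles the space after the comma in the "Year, Growth" header) indexes by Year instead:
import pandas as pd

df = pd.read_csv('test.csv', skipinitialspace=True)
my_array = df.set_index('Year')['Growth']
print(my_array[1])  # 10
print(my_array[4])  # 15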

Related

Compare value in a dataframe to multiple columns of another dataframe to get a list of lists where entries match in an efficient way

I have two pandas dataframes and I want to find all entries of the second dataframe where a specific value occurs.
As an example:
df1:
   NID
0    1
1    2
2    3
3    4
4    5
df2:
   EID  N1  N2  N3  N4
0    1   1   2  13  12
1    2   2   3  14  13
2    3   3   4  15  14
3    4   4   5  16  15
4    5   5   6  17  16
5    6   6   7  18  17
6    7   7   8  19  18
7    8   8   9  20  19
8    9   9  10  21  20
9   10  10  11  22  21
Now, what I basically want is a list of lists with the values EID (from df2) where the values NID (from df1) occur in any of the columns N1, N2, N3, N4:
Solution would be:
sol = [[1], [1, 2], [2, 3], [3, 4], [4, 5]]
The desired solution explained:
The solution has 5 entries (len(sol) == 5) since I have 5 entries in df1.
The first entry in sol is [1] because the value NID = 1 only appears in the columns N1, N2, N3, N4 for EID = 1 in df2.
The second entry in sol refers to the value NID = 2 (of df1) and has length 2 because NID = 2 can be found in column N1 (for EID = 2) and in column N2 (for EID = 1). Therefore, the second entry in the solution is [1, 2], and so on.
What I tried so far is looping for each element in df1 and then looping for each element in df2 to see if NID is in any of the columns N1,N2,N3,N4. This solution works but for huge dataframes (each df can have up to some thousand entries) this solution becomes extremely time-consuming.
Therefore I was looking for a much more efficient solution.
My code as implemented:
Input data:
import pandas as pd

df1 = pd.DataFrame({'NID': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'EID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'N2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    'N3': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
                    'N4': [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})
Solution acquired using looping:
sol = []
for idx, node in df1.iterrows():
    x = []
    for idx2, elem in df2.iterrows():
        if node['NID'] == elem['N1']:
            x.append(elem['EID'])
        if node['NID'] == elem['N2']:
            x.append(elem['EID'])
        if node['NID'] == elem['N3']:
            x.append(elem['EID'])
        if node['NID'] == elem['N4']:
            x.append(elem['EID'])
    sol.append(x)
print(sol)
If anyone has a solution where I do not have to loop, I would be very happy. Maybe a numpy function or something like cKDTree would help, but unfortunately I have no idea how to solve this problem in a faster way.
Thank you in advance!
You can reshape to long form with melt, filter with loc, and aggregate with groupby.agg(list). Then reindex on df1['NID'] and convert with tolist:
out = (df2
       .melt('EID')  # reshape to long form
       # filter the values that are in df1['NID']
       .loc[lambda d: d['value'].isin(df1['NID'])]
       # aggregate as list
       .groupby('value')['EID'].agg(list)
       # ensure all original NID are present, in order,
       # and convert to list
       .reindex(df1['NID']).tolist()
       )
Alternative with stack:
df3 = df2.set_index('EID')
out = (df3
       .where(df3.isin(df1['NID'].tolist())).stack()
       .reset_index(name='group')
       .groupby('group')['EID'].agg(list)
       .reindex(df1['NID']).tolist()
       )
Output:
[[1], [2, 1], [3, 2], [4, 3], [5, 4]]
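If you prefer to avoid reshaping in pandas altogether, here is a minimal numpy sketch (my own alternative, not part of the answers above) that broadcasts every NID against the N1..N4 values at once and preserves the EID order within each group:
import numpy as np

vals = df2[['N1', 'N2', 'N3', 'N4']].to_numpy()   # shape (10, 4)
eids = df2['EID'].to_numpy()
# mask[i, j] is True if NID i appears in any N-column of row j of df2
mask = (vals == df1['NID'].to_numpy()[:, None, None]).any(axis=2)
sol = [eids[m].tolist() for m in mask]
print(sol)  # [[1], [1, 2], [2, 3], [3, 4], [4, 5]]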

pandas computation on rolling 1 calendar month

I have a pandas DataFrame with date as the index and a column, 'spendings'. I intend to get the rolling max() of the 'spendings' column for the trailing 1 calendar month (not 30 days or 4 weeks).
I tried to capture a snippet with custom data for addressing the problem, below (borrowed from Pandas monthly rolling operation):
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings
20210325 15
20210405 20
20210415 10
20210425 40
20210505 3
20210515 2
20210525 2
20210527 1
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True)  # raw string avoids an invalid escape warning
df.index = pd.to_datetime(df.date, format='%Y%m%d')
del df['date']
Now, to create a column 'max' to hold rolling last 1 calendar month's max() val, I use:
df['max'] = df.loc[(df.index - pd.tseries.offsets.DateOffset(months=1)):df.index, 'spendings'].max()
This raises an exception like:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [DatetimeIndex(['2021-02-25', '2021-03-05', '2021-03-15', '2021-03-25',
'2021-04-05', '2021-04-15', '2021-04-25'],
dtype='datetime64[ns]', name='date', freq=None)] of type DatetimeIndex
However, if I manually access a random month window like below, it works without exception:
>>> df['2021-04-16':'2021-05-15']
            spendings
date
2021-04-25         40
2021-05-05          3
2021-05-15          2
(I could have followed the method using a list comprehension here: https://stackoverflow.com/a/47199274/235415, but I would like to use pandas' vectorized method. I have many DataFrames and each is very large; using a list comprehension is very slow here.)
Q: How to get a vectorized method of performing the rolling 1-calendar-month max()?
The expected output, i.e. primarily the 'max' column (holding the max value of 'spendings' for the last 1 calendar month), will be something like this:
>>> df
            spendings  max
date
2021-03-25         15   15
2021-04-05         20   20
2021-04-15         10   20
2021-04-25         40   40
2021-05-05          3   40
2021-05-15          2   40
2021-05-25          2   40
2021-05-27          1    3
The answer will be:
[df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max() for x in df.index]
Out[53]: [15, 20, 20, 40, 40, 40, 40, 3]
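To reproduce the expected 'max' column, assign that same expression back to the DataFrame:
df['max'] = [df.loc[x - pd.tseries.offsets.DateOffset(months=1):x, 'spendings'].max()
             for x in df.index]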

How to access list of list values in columns in dataset

In my DataFrame I have a column containing lists of values. For example, I have columns A, B, C, and my output column. Column A has a value of 12, column B has a value of 30, and column C has a list of values like [0.01, 1.234, 2.31]. When I try to find the mean of the list values, it raises "'list' object has no attribute 'mean'". How do I convert all the list values in the dataframe to their means?
You can transform the column that contains the lists into another DataFrame and calculate the row-wise mean.
import pandas as pd
df = ... # Original df
pd.DataFrame(df['column_with_lists'].values.tolist()).mean(1)
This results in a pandas Series that looks like the following:
0 mean_of_list_row_0
1 mean_of_list_row_1
. .
. .
. .
n mean_of_list_row_n
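Note that if the lists have unequal lengths, the shorter rows are padded with NaN in the intermediate DataFrame; mean(1) skips NaN by default, so each result is still the mean of the original list.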
You can use apply(np.mean) on the column with the lists in it to get the mean. For example:
Build a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame([[2,4],[4,6]])
df[3] = [[5,7],[8,9,10]]
print(df)
0 1 3
0 2 4 [5, 7]
1 4 6 [8, 9, 10]
Use apply(np.mean):
print(df[3].apply(np.mean))
0    6.0
1    9.0
Name: 3, dtype: float64
If you want to convert that column into the mean of the lists:
df[3] = df[3].apply(np.mean)
print(df)
   0  1    3
0  2  4  6.0
1  4  6  9.0

How to do element wise operation of Pandas Series to get a new DataFrame

I have a Pandas Series of numbers, let's say [1, 2, 3, 4, 5]. I want an efficient way to create a DataFrame in which each cell is the result of a function applied to a combination of series elements.
Let's say the function is:
def f1(a, b):
    return a*b + 5
Then I want the DataFrame to be like below, with each cell being the result of the function call on the corresponding pair of series elements:
col   1   2   3   4   5
row
1     6   7   8   9  10
2     7   9  11  13  15
3     8  11  14  17  20
4     9  13  17  21  25
5    10  15  20  25  30
The series can sometimes be as much as 500 elements.
Thanks in advance!
You can do:
import pandas as pd

def f1(a, b):
    return a*b + 5

s = [1, 2, 3, 4, 5]
df = pd.DataFrame(columns=s, index=s)
df = df.apply(lambda x: f1(x.name, x.index))
Output:
    1   2   3   4   5
1   6   7   8   9  10
2   7   9  11  13  15
3   8  11  14  17  20
4   9  13  17  21  25
5  10  15  20  25  30
Using pandas
import pandas as pd

# create a pandas DataFrame with one column "col_" holding the data
df = pd.DataFrame({'col_': list(range(1, 6))})
print(df['col_'])  # this is the data you provided above

# create a new column based on a function
b = 1  # constant
df['1_'] = df['col_'].apply(lambda x: x*b + 5)
# and another column
df['2_'] = df['col_'].apply(lambda x: x*(b + b) + 5)
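Since f1 is plain arithmetic, a vectorized numpy sketch (an alternative I am adding here, assuming f1 stays element-wise) builds the whole table in one shot, which matters for a 500-element series:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
vals = s.to_numpy()
# np.outer computes a*b for every pair; adding 5 applies f1 to each cell
df = pd.DataFrame(np.outer(vals, vals) + 5, index=vals, columns=vals)
print(df)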

Delete specific columns from csv file in python3

I need to delete some specific columns and rows (by index) of multiple csv files, without creating new files.
For the code below, the output has a new blank row after each row.
import csv

with open('file.csv') as fd:
    reader = csv.reader(fd)
    valid_rows = [row for idx, row in enumerate(reader) if idx != 0]

with open('file.csv', 'w') as out:
    csv.writer(out).writerows(valid_rows)
What is a simpler way to do this (perhaps using another Python library)?
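(A side note on the blank rows: the csv module documentation recommends opening the output file with newline='' so that the writer's own \r\n line endings are not translated again on Windows, which is what doubles the rows. A minimal fix to the snippet above:)
with open('file.csv', 'w', newline='') as out:
    csv.writer(out).writerows(valid_rows)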
Since you do not wish to generate any new csv files and want to operate on the data directly, I would suggest using the pandas framework and its drop function.
Consider the following example:
Sample.csv:
col1,col2,col3,col4
1,2,3,4
5,6,7,8
9,10,11,12
13,14,15,16
17,18,19,20
Code:
import pandas as pd
df = pd.read_csv('./Sample.csv')
To delete columns:
df.drop('col3', axis=1, inplace=True)
df contents:
   col1  col2  col4
0     1     2     4
1     5     6     8
2     9    10    12
3    13    14    16
4    17    18    20
To delete rows:
df.drop(df.index[[1, 4]], inplace=True)
df contents:
   col1  col2  col4
0     1     2     4
2     9    10    12
3    13    14    16
Finally, to save the edited csv file (pass the original filename instead if you want to overwrite it in place rather than create a new file):
df.to_csv('new_sample.csv', index=False)
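To cover the multiple-files part of the question, a loop sketch (the 'data/*.csv' glob pattern is an assumption) applies the same drops to every file and writes each one back in place:
import glob
import pandas as pd

for path in glob.glob('data/*.csv'):          # assumed location of the csv files
    df = pd.read_csv(path)
    df.drop('col3', axis=1, inplace=True)     # drop a column
    df.drop(df.index[[1, 4]], inplace=True)   # drop rows by index
    df.to_csv(path, index=False)              # overwrite, no new file created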
