df2=df.drop(df[df['issue']=="prob"].index)
df2.head()
The code immediately above works fine. But why is there a need to type df[df[ rather than the below?
df2 = df.drop((df['issue'] == "prob").index)
df2.head()
I know that the version immediately above won't work while the former does. I would like to understand why, or know what exactly I should google.
Also, any advice on a more relevant title would be appreciated. Thanks!
Option 1: df[df['issue']=="prob"] produces a DataFrame with a subset of values.
Option 2: df['issue']=="prob" produces a pandas.Series with a Boolean for every row.
.drop works with Option 1 because it receives only the indices of the selected rows; with Option 2, .index is the full index of the dataframe, so every row would be dropped.
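A minimal sketch (on a small made-up frame) showing the two indexes .drop would receive:

```python
import pandas as pd

df = pd.DataFrame({'issue': ['prob', 'ok', 'prob']})

# Option 1: filtering first keeps only the labels of the matching rows
option1 = df[df['issue'] == 'prob'].index
print(option1.tolist())  # [0, 2]

# Option 2: the bare comparison is a Boolean Series over every row,
# so its .index is simply the full index of df
option2 = (df['issue'] == 'prob').index
print(option2.tolist())  # [0, 1, 2]
```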
I would use the following methods to remove rows.
Use ~ (not) to select the opposite of the Boolean selection.
df = df[~(df.treatment == 'Yes')]
Select rows with only the desired value
df = df[(df.treatment == 'No')]
import pandas as pd
import numpy as np
import random
from datetime import datetime  # needed for datetime.today() below
# sample dataframe
np.random.seed(365)
random.seed(365)
rows = 25
data = {'a': np.random.randint(10, size=(rows)),
'groups': [random.choice(['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']) for _ in range(rows)],
'treatment': [random.choice(['Yes', 'No']) for _ in range(rows)],
'date': pd.bdate_range(datetime.today(), freq='d', periods=rows).tolist()}
df = pd.DataFrame(data)
df[df.treatment == 'Yes'].index
Produces just the indices where treatment is 'Yes', therefore df.drop(df[df.treatment == 'Yes'].index) only drops the indices in the list.
df[df.treatment == 'Yes'].index
[out]:
Int64Index([0, 1, 2, 4, 6, 7, 8, 11, 12, 13, 14, 15, 19, 21], dtype='int64')
df.drop(df[df.treatment == 'Yes'].index)
[out]:
a groups treatment date
3 5 6-25 No 2020-08-15
5 2 500-1000 No 2020-08-17
9 0 500-1000 No 2020-08-21
10 3 100-500 No 2020-08-22
16 8 1-5 No 2020-08-28
17 4 1-5 No 2020-08-29
18 3 1-5 No 2020-08-30
20 6 500-1000 No 2020-09-01
22 6 6-25 No 2020-09-03
23 8 100-500 No 2020-09-04
24 9 26-100 No 2020-09-05
(df.treatment == 'Yes').index
Produces all of the indices, therefore df.drop((df.treatment == 'Yes').index) drops all of the indices, leaving an empty dataframe.
(df.treatment == 'Yes').index
[out]:
RangeIndex(start=0, stop=25, step=1)
df.drop((df.treatment == 'Yes').index)
[out]:
Empty DataFrame
Columns: [a, groups, treatment, date]
Index: []
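As a sanity check, dropping by index and the ~ Boolean selection recommended above produce the same frame (sketched here on a tiny made-up dataframe rather than the seeded one):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'treatment': ['Yes', 'No', 'Yes', 'No']})

dropped = df.drop(df[df.treatment == 'Yes'].index)  # drop the 'Yes' rows by label
masked = df[~(df.treatment == 'Yes')]               # keep the non-'Yes' rows directly

print(dropped.equals(masked))  # True
```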
list1 = [[[1,2,3],[4,5,6],[7,8,9],[10,11,12]],[[13,14,15],[16,17,18],[19,20,21],[22,23,24]]]
for i in range(0, 2):
    print(list1[:][i][1])
output =
[4, 5, 6]
[16, 17, 18]
how do I get the above code to work such that:
desired output =
2 14
5 17
What is the correct indexing notation for the list? I'm having particular trouble with the list[:] format, as it seems to be ignored by the above code. Thanks!
Something like?
>>> print(*map(' '.join, zip(*[map(str, list(zip(*x[:2]))[1]) for x in list1])), sep='\n')
2 14
5 17
>>>
Or:
>>> print(*map(' '.join, zip(*[(str(x[0][1]), str(x[1][1])) for x in list1])), sep='\n')
2 14
5 17
>>>
To make it work on your code you would have to do:
for i in range(0, 2):
    print(list1[0][i][1], list1[1][i][1])
Out:
2 14
5 17
The reason your code didn't work is that list1[:][i][1] selects whole sublists (e.g. [4, 5, 6]) from one block at a time, rather than pairing the individual elements from the two outer blocks.
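To see why the slice is a no-op: list1[:] is just a shallow copy of the outer list, so indexing the copy is identical to indexing list1 itself:

```python
list1 = [[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
         [[13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24]]]

# the slice copies only the outer list, so list1[:][i] is the same object as list1[i]
print(list1[:][1] is list1[1])  # True
print(list1[:][1][1])           # [16, 17, 18] -- the [:] changes nothing
```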
I have this dataframe :
df=pd.DataFrame({'a': [2, 6, 8, 9],
'date': ['2021-07-21 04:34:02',
'test_2022-17-21 04:54:22',
'test_2020-06-21 04:34:02',
'2023-12-01 11:54:52']})
df["date"].replace("test_", "")
df
I would like to delete 'test_' from the date column. Can you help?
Use str.strip(<unnecessary string>) to remove the unnecessary string (note that strip treats its argument as a set of characters to trim from both ends, not as a literal substring; it works here because none of the date strings start or end with those characters):
df.date = df.date.str.strip('test_')
OUTPUT:
a date
0 2 2021-07-21 04:34:02
1 6 2022-17-21 04:54:22
2 8 2020-06-21 04:34:02
3 9 2023-12-01 11:54:52
The same question was answered here; check the link. For your specific inquiry, this one line is all you want:
df['date'] = df['date'].map(lambda x: x.lstrip('test_'))
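A caveat on both answers above: strip/lstrip take a *set of characters*, not a literal substring, so they can remove more than intended when the remaining text begins with one of those characters. It happens to be harmless on this data; str.replace with the literal prefix (or str.removeprefix on Python 3.9+) is a safer sketch (the second value below is a made-up edge case):

```python
import pandas as pd

s = pd.Series(['test_2022-17-21', 'settee_time'])

# lstrip('test_') strips any leading run of the characters t, e, s, _ -- too greedy
print(s.str.lstrip('test_').tolist())                    # ['2022-17-21', 'ime']

# replacing the literal prefix leaves non-matching values untouched
print(s.str.replace('test_', '', regex=False).tolist())  # ['2022-17-21', 'settee_time']
```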
I was handling a large csv file and came across this problem. I am reading the csv file in chunks and want to extract sub-dataframes based on the values of a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example)
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the csv in chunks of 5 rows, the first column's values will be distributed across the chunks. What I want to be able to do is load in memory only the rows for a particular value.
I achieved it using the following:
import pandas as pd

list_of_ids = dict()  # this will contain all "id"s and the start and end row index for each id

# read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id', 'val'], iterator=True):
    #print(df_chunk)
    # In each chunk, get the unique id values and add to the list
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # initially new values do not have the start and end row index
    for i in list_of_ids.keys():  # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index  # get row index for particular value of id
        if len(idx) != 0:  # if id is in this chunk
            if len(list_of_ids[i]) == 0:  # if the id is new in the final dictionary
                list_of_ids[i].append(idx.tolist()[0])   # start
                list_of_ids[i].append(idx.tolist()[-1])  # end
            else:  # if the id was there in a previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]  # keep old start, add new end
            #print(df_chunk.iloc[idx, :])
            #print(df_chunk.iloc[list_of_ids[i][0]:list_of_ids[i][-1], :])

print(list_of_ids)
skip = None
rows = None

# Now from the file, I will read only a particular id group using the following.
# I can again use the chunksize argument to read the particular group in pieces.
for id, se in list_of_ids.items():
    print('Data for id: {}'.format(id))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id', 'val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
id val
0 1 10
1 1 11
id val
2 1 12
Data for id: 2
id val
0 2 13
1 2 14
id val
2 2 15
3 2 16
Data for id: 3
id val
0 3 17
1 3 18
What I want to ask is: do we have a better way of doing this? If you consider MARKER 1 in the code, it is bound to become inefficient as the size grows. I did save memory usage, but time still remains a problem. Do we have some existing method for this?
(I am looking for complete code in the answer.)
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter
def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    for _, group in groupby(reader, itemgetter(0)):
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
Output (partial)
0 1
0 1 10
1 1 11
2 1 12
---
0 1
0 2 13
1 2 14
2 2 15
3 2 16
---
0 1
0 3 17
1 3 18
2 3 19
3 3 20
---
...
This approach will read first in groups by column 1:
for _, group in groupby(reader, itemgetter(0)):
and each group will be read in chunks of 5 rows (this can be changed using chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
creates a suitable string to be passed to pandas.
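One assumption worth flagging: itertools.groupby only groups *consecutive* rows with equal keys, so this approach relies on the id column already being sorted in the file, as it is in test1.csv. A small illustration:

```python
from itertools import groupby
from operator import itemgetter

rows = [['1', '10'], ['1', '11'], ['2', '13'], ['1', '12']]  # last '1' row is out of order

# the trailing '1' row starts a new group, because groupby only merges adjacent keys
keys = [key for key, _ in groupby(rows, itemgetter(0))]
print(keys)  # ['1', '2', '1']
```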
I am trying to get the 6th element's value as True, but get NaN instead. I have an example based on Excel. When I try a rolling window of 6, I get NaN for the 6th record, but I should get False instead. However, when I try a rolling window of 5, everything seems to work. I want to understand what actually happened, and what the best way is to say that a sum-product of 6 elements means a rolling window of 6 instead of 5.
Objective : Six points in a row, all increasing or all decreasing
Code I am trying
def condition(x):
    if x.tolist()[-1] != 0:
        if sum(x.tolist()) >= 5 or sum(x.tolist()) <= -5:
            return 1
        else:
            return 0
    else:
        return 0

df_in['I GET'] = df_in[['lead_one']].rolling(window=6).apply(condition, raw=False)
Tag column is what is expected.
When you use a rolling window of 6, it takes the current value + the previous 5 values, and then tries to sum those 6 values. I say tries, because if there's any NaN value in there, ordinary Python summing will also give you a NaN.
That's also why .rolling(window=5) works: it takes the current value + the 4 previous values, and since those don't contain any NaN values, you actually get a summed value one row earlier.
You could use a different kind of summing: np.nansum()
Or use pandas summing where you specify to skip the na's, something like: df['column'].sum(skipna=True)
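A quick sketch of the difference between the two kinds of summing:

```python
import numpy as np

vals = [1.0, 1.0, np.nan, 1.0]

print(sum(vals))        # nan -- plain Python sum propagates the NaN
print(np.nansum(vals))  # 3.0 -- NaN entries are skipped
```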
However, looking at your code, I think it could be improved so you don't get the NaNs in the first place. Here's an example using np.select() and np.where():
import numpy as np
import pandas as pd
# create example dataframe
df = pd.DataFrame(
    data=[10, 10, 12, 13, 14, 15, 16, 17, 17, 10, 9],
    columns=['value']
)

# create an if/then using np.select: 1 if increasing, 0 if equal, -1 if decreasing
df['n > n+1'] = np.select(
    [df['value'] > df['value'].shift(1),
     df['value'] == df['value'].shift(1),
     df['value'] < df['value'].shift(1)],
    [1, 0, -1]
)

# take the absolute value of the rolling 6-value sum and check if it is >= 5
df['I GET'] = np.where(
    np.abs(df['n > n+1'].rolling(window=6).sum()) >= 5, 1, 0)
I'm trying to loop through a list (y) and build the output by appending a row for each item to a dataframe.
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
Desired Output:
Index Mean Last
2017-03-29 1.5 .76
2017-03-30 2.3 .4
2017-03-31 1.2 1
Here is the first and last part of the code I currently have:
import pandas as pd
import datetime
df5 = pd.DataFrame(columns=['Mean', 'Last'], index=index)
for item0 in y:
    .........
    .........
    df = df.rename(columns={0: 'Mean'})
    df4 = pd.concat([df, df3], axis=1)
    print(df4)
    df5.append(df4)
print(df5)
My code only puts one row into the dataframe, as opposed to a row for each item in y:
Index Mean Last
2017-03-29 1.5 .76
Try:
y = [datetime(2017, 3, 29), datetime(2017, 3, 30), datetime(2017, 3, 31)]
m = [1.5, 2.3, 1.2]
l = [0.76, 0.4, 1]

df = pd.DataFrame([], columns=['time', 'mean', 'last'])
for y0, m0, l0 in zip(y, m, l):
    data = {'time': y0, 'mean': m0, 'last': l0}
    df = df.append(data, ignore_index=True)
and if you want y to be the index:
df.index = df.time
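Note that DataFrame.append was later deprecated (and removed in pandas 2.0). An equivalent sketch that collects the rows first and builds the frame in one step:

```python
import datetime
import pandas as pd

y = [datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
m = [1.5, 2.3, 1.2]
l = [0.76, 0.4, 1]

# build a list of row dicts, then construct the DataFrame once
rows = [{'time': y0, 'mean': m0, 'last': l0} for y0, m0, l0 in zip(y, m, l)]
df = pd.DataFrame(rows).set_index('time')
print(df)
```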
There are a few ways to skin this, and it's hard to know which approach makes the most sense with the limited info given. But one way is to start with a dataframe that has only the index, iterate through the dataframe by row and populate the values from some other process. Here's an example of that approach:
import datetime
import numpy as np
import pandas as pd
y=[datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
main_df = pd.DataFrame(y, columns=['Index'])
#pop in the additional columns you want, but leave them blank
main_df['Mean'] = None
main_df['Last'] = None
#set the index
main_df.set_index(['Index'], inplace=True)
that gives us the following:
Mean Last
Index
2017-03-29 None None
2017-03-30 None None
2017-03-31 None None
Now let's loop and plug in some made up random values:
## loop through main_df and add values
for index, row in main_df.iterrows():
    main_df.loc[index, 'Mean'] = np.random.rand()
    main_df.loc[index, 'Last'] = np.random.rand()
this results in the following dataframe, which has the None values filled in:
Mean Last
Index
2017-03-29 0.174714 0.718738
2017-03-30 0.983188 0.648549
2017-03-31 0.07809 0.47031
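For completeness, when the new values come from a vectorized source, the row-by-row loop isn't needed at all; the same (made-up) random values could be assigned as whole columns in one step:

```python
import datetime
import numpy as np
import pandas as pd

y = [datetime.datetime(2017, 3, 29), datetime.datetime(2017, 3, 30), datetime.datetime(2017, 3, 31)]
main_df = pd.DataFrame(index=pd.Index(y, name='Index'))

# assign entire columns at once instead of iterating with iterrows
main_df['Mean'] = np.random.rand(len(main_df))
main_df['Last'] = np.random.rand(len(main_df))
print(main_df)
```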