In the following data frame in Pandas, I want to extract the rows whose dates in column A fall between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start date and an end date. How could I do so?
A B
0 01/01 56
1 02/01 54
2 03/01 66
3 04/01 77
4 05/01 66
5 06/01 72
6 07/01 132
7 08/01 127
First create a list of the dates you need using pd.date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year due to things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
'2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use the isin method and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the bracket is a boolean array of True and False. Only rows that fulfill your given condition (evaluates to True) will be returned. This is a great tool to have and it works well with pandas and numpy.
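Here is a minimal sketch of the boolean-mask approach on the sample data from the question. It assumes the strings in column A are zero-padded MM/DD values (as the date_range answer above also assumes), so plain string comparison happens to follow calendar order. The second variant parses the strings with an arbitrary year (2000, my choice, not from the question) and is the safer route if the format is not guaranteed.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["01/01", "02/01", "03/01", "04/01", "05/01", "06/01", "07/01", "08/01"],
    "B": [56, 54, 66, 77, 66, 72, 132, 127],
})
start_date, end_date = "03/01", "06/01"

# Variant 1: compare the strings directly.
# Only valid because zero-padded MM/DD strings sort in calendar order.
by_string = df[(df["A"] >= start_date) & (df["A"] <= end_date)]

# Variant 2: parse to real dates first (year 2000 is an assumption).
parsed = pd.to_datetime("2000/" + df["A"], format="%Y/%m/%d")
lower = pd.Timestamp("2000/" + start_date)
upper = pd.Timestamp("2000/" + end_date)
by_date = df[parsed.between(lower, upper)]  # inclusive on both ends

print(by_string)
```

Both variants keep the original index labels, so the two results can be compared row for row.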
I want to extract the subsequences indicated by the first and last locations in data frame 'B'.
The algorithm that I came up with is:
Identify the rows of B that fall in the locations of A
Find the relative position of the locations (i.e. shift the locations to make them start from 0)
Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is runtime. I need an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG','ACCACACCCCAAATGTGT'
],'first':[1,100], 'last':[12,117]})
B = pd.DataFrame({'first': [3,8,105,115], 'last':[5,12,111,117]})
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
suffixes=('','_y'))
out.loc[:,['first','last']] = \
out.loc[:,['first','last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(lambda row:
row['first.sequence'][row['first']:row['last']+1],
axis=1)).drop(['first.sequence','first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
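First, use pd.merge_asof to match each row of B with a row of A. The merge is on last with direction='forward', so each last value in B is paired with the nearest last value in A that is greater than or equal to it. I.e. the rows of B starting at 3 and 8 will match the sequence starting at 1, and those starting at 105 and 115 will match the sequence starting at 100. Now we know which string (sequence) needs splitting and we also know where the string starts, e.g. at index 1 or 100 instead of a normal 0.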
We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now, we can use df.apply to get the string slices for each sequence, essentially looping over each row. (If your slices had started and ended at the same indices for every row, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
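As a variation on the steps explained above, here is a rough sketch that skips the in-place subtraction and df.update by computing the offsets inside a list comprehension. Note that it merges on first with direction='backward' rather than on last with direction='forward' as the answer does; on this data both pick the same sequence. The names seq_start, matched and result are made up for illustration, and I have not benchmarked it against the apply version.

```python
import pandas as pd

A = pd.DataFrame({"first.sequence": ["AAACACCCGGAG", "ACCACACCCCAAATGTGT"],
                  "first": [1, 100], "last": [12, 117]})
B = pd.DataFrame({"first": [3, 8, 105, 115], "last": [5, 12, 111, 117]})

# Pair every interval in B with the sequence in A whose start is the
# nearest value <= B.first (both frames must be sorted on the key).
lookup = A.rename(columns={"first": "seq_start"}).drop(columns="last")
matched = pd.merge_asof(B, lookup, left_on="first", right_on="seq_start",
                        direction="backward")

# Shift the 1-based, inclusive positions so they index into the string.
result = B.assign(sequences=[
    seq[start - offset : end - offset + 1]
    for seq, start, end, offset in zip(matched["first.sequence"],
                                       matched["first"],
                                       matched["last"],
                                       matched["seq_start"])
])
print(result)
```

The loop over rows is still there, but it runs over plain Python tuples instead of going through DataFrame.apply with axis=1.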
I have a data frame like the one in the image below. I want to extract the rows of the data frame that have year and month equal to '1395/01'. I used the code below, but I know it is not correct because we cannot use a plain string slice on a series of strings. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.
You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05
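To tie this back to the attempt in the question, here is a small sketch showing why df['Date'][:7] does not work and how the .str accessor fixes it. It reuses the sample frame from the answer above; the variable names are mine.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["1395/01/01", "1395/02/01", "1395/01/01", "1395/05/01"]})

# Slicing the Series itself selects ROWS (the first 7, here all 4),
# it does not touch the characters inside each string.
first_rows = df["Date"][:7]

# The .str accessor applies the slice to every string element instead.
prefixes = df["Date"].str[:7]
result = df[prefixes == "1395/01"]
print(result)
```

This is equivalent to the str.startswith and str.match answers for a fixed-width prefix like this one.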
I have checked the other posts about IndexError: single positional indexer is out-of-bounds but could not find solutions that explain my problem.
I have a DataFrame that looks like:
Date Balance
0 2020-01-07 168.51
1 2020-02-07 179.46
2 2020-03-07 212.15
3 2020-04-07 221.68
4 2020-05-07 292.23
5 2020-06-07 321.61
6 2020-07-07 332.27
7 2020-08-07 351.63
8 2020-09-07 372.26
My problem is I want to run a script that takes in a DataFrame like the one above and returns the balance of each row using something like df.iloc[2][1]. However, the DataFrame can be anywhere from 1 to 12 rows in length. So if I call df.iloc[8][1] and the DataFrame is less than 9 rows in length, then I get the IndexError.
If I want to return the balance for every row using df.iloc[]... how can I handle the index errors without using 12 different try and except statements?
Also the problem is simplified here and the DataFrame can get rather large so I want to try and stay away from looping if possible
Thanks!!
My solution was to loop over the rows of the DataFrame and append each balance to a list. I then padded the list to a length of 12 with NaN values.
import numpy as np
num_months = len(df)
N = 12
list_balance_months = []
for month in range(num_months):
    # column 1 is Balance (column 0 is Date)
    list_balance_months.append(df.iloc[month, 1])
# pad with NaN so the list always has N entries
list_balance_months += [np.nan] * (N - len(list_balance_months))
balance_month_1, balance_month_2, balance_month_3, balance_month_4, balance_month_5, balance_month_6, balance_month_7, balance_month_8, balance_month_9, balance_month_10, balance_month_11, balance_month_12 = list_balance_months
With this solution, if balance_month_11 is called and the DataFrame only has 4 months of data, it will give np.nan (nan) instead of an IndexError.
Please let me know if you can think of a simpler solution!
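One possibly simpler option, sketched here under the assumption that a NaN-padded Series of length 12 is an acceptable result: reindex the Balance column to positions 0 to 11. The four-row frame below is just a shortened copy of the sample data, and the name balances is mine.

```python
import pandas as pd

# Shortened sample: only 4 months of data are available.
df = pd.DataFrame({
    "Date": ["2020-01-07", "2020-02-07", "2020-03-07", "2020-04-07"],
    "Balance": [168.51, 179.46, 212.15, 221.68],
})

N = 12
# reset_index makes sure the labels are 0..len-1, then reindex pads
# every missing position up to N-1 with NaN. No explicit loop needed.
balances = df["Balance"].reset_index(drop=True).reindex(range(N))

print(balances.iloc[2])   # an existing month
print(balances.iloc[10])  # a missing month, NaN rather than an IndexError
```

If separate variables are really needed, the Series can still be unpacked with balances.tolist(), but keeping it as one Series is usually easier to work with.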
I am trying to create a new column in a Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that the ids match a list of ids that I have saved previously in another place, and mark only those with 1. I have some code, but it is useless since I am dealing with almost a billion rows.
my_list_of_ids = df[df.bool_column == 1].id.values
def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var
df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that are 1 in my bool_column; then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare the month and the year of the date1 and date2 columns and, if they are the same, create a new_column with a value of 1 where they match, otherwise 0.
The pandas way to do this (assuming date1 and date2 are datetime columns) is
mask = ((df['date1'].dt.month == df['date2'].dt.month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.astype(int)
Since you have a large data set, this will take time, but it should be faster than using apply.
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way I believe there's a tiny error in your example data, the new_column value for your third row should be 0, since your bool_column value there is 0.
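For what it's worth, here is a small sketch comparing the two readings of the id condition on the rows from the example table (the incomplete id2 row is left out). It assumes date1 and date2 are already datetime columns, and it compares year as well as month, as the update to the question asks. The names flagged_ids, same_period, per_row and per_id are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["id1", "id1", "id1", "id2", "id3"],
    "date1": pd.to_datetime(["2019-02-13", "2019-03-15", "2019-04-23",
                             "2019-08-22", "2019-09-01"]),
    "date2": pd.to_datetime(["2019-04-11", "2019-04-11", "2019-04-11",
                             "2019-08-11", "2019-09-30"]),
    "bool_column": [1, 0, 0, 1, 1],
})

# Same year and same month, computed for all rows at once.
same_period = ((df["date1"].dt.year == df["date2"].dt.year)
               & (df["date1"].dt.month == df["date2"].dt.month))

# Reading 1: the flag applies per row (use bool_column directly).
per_row = (same_period & df["bool_column"].astype(bool)).astype(int)

# Reading 2: the flag applies per id (any row of that id was flagged).
flagged_ids = df.loc[df["bool_column"] == 1, "id"].unique()
per_id = (same_period & df["id"].isin(flagged_ids)).astype(int)

print(pd.DataFrame({"per_row": per_row, "per_id": per_id}))
```

On this data per_id reproduces the new_column values shown in the question, while per_row gives 0 for the third row, so which one is right depends on whether the flag is meant per row or per id.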
Please forgive any errors or shortcomings in this question, it's my first on stackoverflow.
I have two sets of data in Excel of differing lengths and frequency, and would like to be able to place a value of 0 for where they don't synchronise, and match the rest.
For example, dataset 1 could be:
Date Set1
01-01-2010 10
01-03-2010 4
01-04-2010 8
01-05-2010 5
01-06-2010 10
01-09-2010 12
01-10-2010 9
01-11-2010 4
And dataset 2 could be:
Date Set2
01-03-2010 102
01-06-2010 104
01-10-2010 102
I'm looking for an output table that displays the values alongside each other for dates matching, 0 otherwise, like so:
Date Set1 Set2
01-01-2010 10 0
01-03-2010 4 102
01-04-2010 8 0
01-05-2010 5 0
01-06-2010 10 104
01-09-2010 12 0
01-10-2010 9 102
01-11-2010 4 0
I can't seem to be able to crack this with my limited knowledge and the lack of synchronisation in the data. Any help would be much appreciated, thanks.
You can do this using a VLOOKUP nested in an IFERROR statement.
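The two equations used (and dragged down to the last unique date row) are shown below. Judging by the ranges, they expect Set1 in columns A:B, Set2 in columns D:E, and the list of unique dates in column G:
H3 = IFERROR(VLOOKUP(G3,A:B,2,0),0) and I3 = IFERROR(VLOOKUP(G3,D:E,2,0),0)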
This will not work if you have duplicate dates in the same data set with varying values since VLOOKUP will always return the first matched value (reading top down).
Place Set1 in A1:B9 (header in row 1). Add a column of zeros next to it in column C, so A2:A9 is dates, B2:B9 is values and C2:C9 is zeros.
Place Set2 (without the header) in A10:B12; move the Set2 data to column C and put zeros in column B, so A10:A12 is dates, B10:B12 is zeros, C10:C12 is values.
Sort the range A2:C12 by Date (column A).
Easier to show with a screenshot but newbies are not allowed to post images.
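This is not an Excel solution, but in case the data can be loaded into pandas, the same lookup-with-default idea can be sketched as a left merge followed by fillna(0). It assumes every date in Set2 also appears in Set1, as in the example; the frame names set1, set2 and merged are mine.

```python
import pandas as pd

set1 = pd.DataFrame({
    "Date": ["01-01-2010", "01-03-2010", "01-04-2010", "01-05-2010",
             "01-06-2010", "01-09-2010", "01-10-2010", "01-11-2010"],
    "Set1": [10, 4, 8, 5, 10, 12, 9, 4],
})
set2 = pd.DataFrame({
    "Date": ["01-03-2010", "01-06-2010", "01-10-2010"],
    "Set2": [102, 104, 102],
})

# Left merge keeps every Set1 row; dates with no match in Set2 get NaN,
# which is then replaced by 0 (the role IFERROR plays in the formula).
merged = set1.merge(set2, on="Date", how="left")
merged["Set2"] = merged["Set2"].fillna(0).astype(int)
print(merged)
```

If Set2 could contain dates that Set1 lacks, how="outer" plus fillna(0) on both value columns would be closer to the sort-based answer above.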