Search column name with multiple conditions pandas - python-3.x

I have a query to retrieve all columns with 'date' in their name, as below:
date_raw_cols = [col for col in df_raw.columns if 'date' in col]
That is also picking up columns containing 'updated', which I want to exclude. I've also tried a regex filter, as below, with the same problem of returning the 'updated' columns:
df_dates = df_raw.filter(regex='date', axis='columns')
How do I combine conditions to filter column names, i.e. where the column name contains 'date' but not 'update', while still matching names like date1, _date, date_?

Instead of searching for 'date' in a column name, you can be more explicit:
# Assume example df_raw
>>> df_raw
   date  date1  prev_date  update
0     1      2          3     200
1     4      2          5     300
2     5      5          3     100
>>> date_raw_cols = [col for col in df_raw.columns if col == 'date']
>>> print(date_raw_cols)
['date']
EDIT: If the example above fully covers your data at hand, you can add an extra condition to the list comprehension, len(col) < 6, which will only grab column names with fewer than 6 characters. This way you don't have to deal with underscores or digits explicitly.
>>> df_raw
   date  date1  prev_date  update _date date_
0     1      2          3     200     a     g
1     4      2          5     300     s     h
2     5      5          3     100     v     a
>>> date_raw_cols = [col for col in df_raw.columns if 'date' in col and len(col) < 6]
>>> print(date_raw_cols)
['date', 'date1', '_date', 'date_']
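Alternatively, you can combine the two conditions from the question directly in the list comprehension. A minimal sketch against the example df_raw above:
>>> date_raw_cols = [col for col in df_raw.columns if 'date' in col and 'update' not in col]
>>> print(date_raw_cols)
['date', 'date1', 'prev_date', '_date', 'date_']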

Try the following regex:
\b(\w*(?=[^a-z]date)|(?=date[^a-z]))\w*\b
It will find all words containing "date" where "date" is adjacent to a non-letter character (a digit, underscore, or punctuation):
import re
re.findall(r'\b\w*(?=[^a-z]date)\w*\b|\b(?=date[^a-z])\w*\b',
           'date1 date date_1 update new_date 234_date 33date datetime ')
['date1', '', 'date', '', 'date_1', 'new_date', '234_date', '33date', '']
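Note that findall also returns empty strings where the lookahead succeeds at a bare word boundary (e.g. just before a standalone 'date'); a simple truthiness filter drops them:
matches = re.findall(r'\b\w*(?=[^a-z]date)\w*\b|\b(?=date[^a-z])\w*\b',
                     'date1 date date_1 update new_date 234_date 33date datetime ')
words = [m for m in matches if m]
# ['date1', 'date', 'date_1', 'new_date', '234_date', '33date']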

Related

pandas column name search and append notes column value to previous row value python [duplicate]

I want to merge several strings in a dataframe based on a groupby in pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe (the data has no header row)
df = pd.read_csv(data, header=None, names=["name", "text", "date"], parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to have one row per name and month, with the strings in the column "text" concatenated.
I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can groupby the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda where we join the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
    name         text  month
0  name1       hej,du     11
2  name1        aj,oj     12
4  name2     fin,katt     11
6  name2  mycket,lite     12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
We can groupby the 'name' and 'month' columns, then call the agg() function of pandas DataFrame objects.
The aggregation functionality provided by agg() allows multiple statistics to be calculated per group in one calculation.
df.groupby(['name', 'month'], as_index=False).agg({'text': ' '.join})
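As a hedged sketch of that multi-statistic ability (the n_texts column alias below is illustrative, not from the original answer; named aggregation needs pandas >= 0.25), agg can join the strings and count them in one pass:
df.groupby(['name', 'month'], as_index=False).agg(
    text=('text', ','.join),   # concatenate the strings per group
    n_texts=('text', 'size'),  # illustrative second statistic: rows per group
)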
The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" in a list:
df.groupby(['name', 'month'], as_index=False).agg({'text': list})
For me the above solutions were close but added some unwanted \n's and dtype: object, so here's a modified version (regex=True is passed explicitly because newer pandas defaults str.replace to literal matching):
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('\\n', '', regex=True).reset_index()
Please try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm (it joins the text of all rows in a given month, here August, without grouping by name).
text = ''.join(df[df['date'].dt.month==8]['text'])
Thanks to all the other answers, the following is probably the most concise and feels most natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns:
import pandas
df = pandas.DataFrame({'A': ['a', 'a', 'b', 'c', 'c'],
                       'B': ['i', 'j', 'k', 'i', 'j'],
                       'X': [1, 2, 2, 1, 3]})
   A  B  X
0  a  i  1
1  a  j  2
2  b  k  2
3  c  i  1
4  c  j  3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j

I want to merge 4 rows to form 1 row with 4 sub-rows in a pandas DataFrame

This is my dataframe; it has a quarter column with values like 1947q1 and a separate year column.
I have tried this, but it didn't work:
df1['quarter'].str.contains('/^[-+](20)$/', re.IGNORECASE).groupby(df1['quarter'])
Thanks in advance
Hi and welcome to the forum! If I understood your question correctly, you want to form groups per year?
Of course, you can simply group by year, as you already have that column.
Assuming you didn't have the year column, you can simply group by the whole string except the last 2 characters of the quarter column, like this (I created a toy dataset for the answer):
import pandas as pd
d = {'quarter': pd.Series(['1947q1', '1947q2', '1947q3', '1947q4', '1948q1']),
     'some_value': pd.Series([1, 3, 2, 4, 5])}
df = pd.DataFrame(d)
df
This is our toy dataframe:
  quarter  some_value
0  1947q1           1
1  1947q2           3
2  1947q3           2
3  1947q4           4
4  1948q1           5
Now we simply group by the year, i.e. by the quarter string minus its last 2 characters:
grouped = df.groupby(df.quarter.str[:-2])
for name, group in grouped:
    print(name)
    print(group, '\n')
Output:
1947
  quarter  some_value
0  1947q1           1
1  1947q2           3
2  1947q3           2
3  1947q4           4

1948
  quarter  some_value
4  1948q1           5
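If the goal is one aggregated row per year rather than iterating over the groups, a minimal sketch (the sum is illustrative; any aggregation works):
# aggregate some_value per year derived from the quarter string
df.groupby(df.quarter.str[:-2])['some_value'].sum()
# quarter
# 1947    10
# 1948     5
# Name: some_value, dtype: int64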
Additional comment: I used an operation that you can always apply to strings. Check this, for example:
s = 'Hi there, Dhruv!'
# Prints the first 2 characters of the string
print(s[:2])
# Output: "Hi"
# Prints everything from index 3 onwards (i.e. after the first 3 characters)
print(s[3:])
# Output: "there, Dhruv!"
# Prints the characters from index 10 up to, but not including, index 15
print(s[10:15])
# Output: "Dhruv"

How to return index of a row 60 seconds before current row

I have a large (>32 M rows) Pandas dataframe.
In the column 'Time_Stamp' I have a Unix timestamp in seconds. The values are not linear; there are gaps, and some timestamps can be duplicated (e.g. 1, 2, 4, 6, 6, 9, ...).
I would like to set the column 'Result' of the current row to the index of the row that is 60 seconds before the current row (the closest match if no row is exactly 60 seconds before; if there is more than one match, take the maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on a comment below, I'm adding a sample dataset; instead of returning the index, I want to return the column Value:
In [2]: df
Out[2]:
   Time_Stamp  Value
0           1    2.4
1           2    3.1
2           4    6.3
3           6    7.2
4           6    6.1
5           9    6.0
So with the precious help of ALollz, I managed to achieve what I wanted to do in the end. The idea is to shift a copy of the dataframe forward by the time gap, so that pd.merge_asof can pair each row with the row closest to the gap earlier. Here's my code:
# time gap to look back, in seconds (60 in the question)
Time_gap = 60
# make a copy of the dataframe
df2 = df[['Time_Stamp', 'Value']].copy()
# add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
# sort df2 on Time_Stamp
df2.sort_values(by='Time_Stamp', ascending=True, inplace=True)
df2 = df2.reset_index(drop=True)
# pair each row of df with the shifted row at or just after its timestamp
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
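To see the shifting trick in isolation, here is a minimal sketch on made-up data with a smaller gap so that matches fall inside the tiny sample (Time_gap = 2 and the values below are illustrative only, not from the question):
import pandas as pd

df = pd.DataFrame({'Time_Stamp': [1, 2, 4, 7, 9],
                   'Value': [2.4, 3.1, 6.3, 7.2, 6.0]})
Time_gap = 2  # illustrative; the question uses 60 seconds

df2 = df.copy()
df2['Time_Stamp'] = df2['Time_Stamp'] + Time_gap  # shift forward by the gap

# each row of df is paired with the shifted row whose timestamp is closest
# at or after it, i.e. it picks up the Value from ~Time_gap seconds earlier
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
# Value_y now holds the value from roughly 2 seconds before each row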

I want to count the negative values in a column and put them in another column using groupby

Count the number of negative values in the delay column using groupby:
merged_inner['delayed payments'] = merged_inner.groupby('Customer Name')['delay'].apply(lambda x: x[x < 0].count())
The delayed payments column is showing null.
I believe the problem here is that you are trying to put the results back into the same dataframe you called .groupby on; the grouped result is indexed by Customer Name, which doesn't align with the original dataframe's index.
Consider the following minified example:
df = pd.DataFrame({
    'Customer Name': ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'a'],
    'Delay': [1, 2, -3, 0, -1, -2, -3, 2]
})
You can even try:
df.loc[df['Delay'] < 0].groupby('Customer Name')['Delay'].size()
Output:
Customer Name
a    1
b    1
c    2
Name: Delay, dtype: int64
You can get a dataframe back using:
df.loc[df['Delay'] < 0].groupby('Customer Name')['Delay'].size().reset_index(name='delayed_payment')
Output:
  Customer Name  delayed_payment
0             a                1
1             b                1
2             c                2
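If the goal is to assign the count back as a new column on every row of the original dataframe (as in the original attempt), transform keeps the result aligned to the original index. A minimal sketch using the example df above:
# broadcast the per-customer count of negative delays onto every row
df['delayed_payment'] = (
    df['Delay'].lt(0)
    .groupby(df['Customer Name'])
    .transform('sum')
)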

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter a DataFrame's columns based on a value.
In [41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In [42]: df
Out[42]:
   A   B
0  a   6
1  2   7
2  3   8
3  4   9
4  5  10
Filtering columns:
In [43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
   A
0  a
1  2
2  3
3  4
4  5
It works! But when I use strings,
In [44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object, and it will work:
df.B = df.B.astype(object)  # np.object is deprecated in newer NumPy; plain object is equivalent
Always check the data types of the columns before performing operations, using:
df.info()
You could do this with masks instead, for example:
df[df.A != 'a'].A
and to drop the rows where any column equals 'a':
df[df.apply(lambda x: sum([x_ == 'a' for x_ in x]) == 0, axis=1)]
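As a hedged aside, on reasonably recent pandas versions an elementwise == between strings and numbers simply returns False instead of raising, so the row filter can be written without the lambda:
# keep only the rows where no cell equals 'a'
df[~(df == 'a').any(axis=1)]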
The problem is due to the fact that there are both numeric and string objects in the dataframe.
You can loop through the columns and check each one, as a Series, for a specific value using:
(series == 'a').any()
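A minimal sketch of that loop against the example df from the question:
# find the columns that contain the value 'a' anywhere
cols_with_a = [col for col in df.columns if (df[col] == 'a').any()]
# ['A']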
