Use a split function in every row of one column of a data frame - python-3.x

I have a rather big pandas data frame (more than 1 million rows) with columns containing either strings or numbers. Now I would like to split the strings in one column before the expression "is applied".
An example to explain what I mean:
What I have:
a b description
2 4 method A is applied
10 5 titration is applied
3 1 computation is applied
What I am looking for:
a b description
2 4 method A
10 5 titration
3 1 computation
I tried the following,
df.description = df.description.str.split('is applied')[0]
But this didn't bring the desired result.
Any ideas how to do it? :-)

You are close. Indexing with [0] selects the row labelled 0, so you need .str[0] to take the first element of each row's list:
df.description = df.description.str.split(' is applied').str[0]
Alternative solution:
df.description = df.description.str.extract(r'(.*)\s+is applied', expand=False)
print (df)
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation
For better performance, use a list comprehension:
df.description = [x.split(' is applied')[0] for x in df.description]
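To see why the original attempt did not work, here is a small sketch built on the sample data from the question. It assumes every description contains the text " is applied"; the variable names parts and first_row_only are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 10, 3],
    "b": [4, 5, 1],
    "description": [
        "method A is applied",
        "titration is applied",
        "computation is applied",
    ],
})

# str.split returns one list per row
parts = df["description"].str.split(" is applied")

# Plain [0] looks up the row with index label 0, not element 0 of each list
first_row_only = parts[0]

# .str[0] takes element 0 of every row's list
df["description"] = parts.str[0]
print(df)
```

Plain [0] hands back a single list, which is why assigning it to the column did not give the expected result.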

You can also use str.replace:
df.description = df.description.str.replace(' is applied','')
df
a b description
0 2 4 method A
1 10 5 titration
2 3 1 computation

Related

How can I find the highest value between rows every time that they met a certain condition?

I have been struggling with a problem with my data frame, built in pandas, which currently looks like this:
MyDataFrame:
Index Status Value
0 A 10
1 A 8
2 A 5
3 B 9
4 B 5
5 A 1
6 B 2
7 A 3
8 A 5
9 A 1
The desired output would be:
Index Status Value
0 A 10
1 B 9
2 A 1
3 B 2
4 A 5
So far I have tried to filter with range and while loops, using a condition like this:
for i in range(len(Status)):
    if Status[i] == "A":
        print(Value[i])
    if Status[i] == "B":
        break
(The code above is only an example of what I have been trying in order to reach my goal. I also tried .iloc and range with while, but maybe in the wrong way.)
The desired output isn't printed.
One thing that complicates this filtering process is that MyDataFrame changes every time that I run the script since it uses another base of data to create this DataFrame.
I believe that I'm missing something simple, but it has been almost a week and I can't figure it out.
Thanks in advance for all your answers and support.
Let us use shift with cumsum to create the groupby key; then it is groupby + agg:
out = df.groupby(df.Status.ne(df.Status.shift()).cumsum()).agg({'Status':'first','Value':'max'})
Out[14]:
Status Value
Status
1 A 10
2 B 9
3 A 1
4 B 2
5 A 5
Very close to @BEN_YO:
grp = (df['Status'] != df['Status'].shift()).cumsum()
df.loc[df.groupby(grp)['Value'].idxmax()]
Output:
Status Value
Index
0 A 10
3 B 9
5 A 1
6 B 2
8 A 5
Create groups using shift and an inequality with cumsum, then group by them, find the index of the maximum 'Value' in each group with idxmax, and filter the dataframe with loc.
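For anyone who wants to run this end to end, here is a minimal sketch using the sample data from the question. The name run_id and the reset_index at the end are my own additions, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Status": list("AAABBABAAA"),
    "Value": [10, 8, 5, 9, 5, 1, 2, 3, 5, 1],
})

# A new run starts wherever Status differs from the previous row;
# the cumulative sum of those start flags labels each run.
run_id = df["Status"].ne(df["Status"].shift()).cumsum().rename("run")

# Keep the first Status and the largest Value of every run
out = (
    df.groupby(run_id)
      .agg(Status=("Status", "first"), Value=("Value", "max"))
      .reset_index(drop=True)
)
print(out)
```

Because the key is recomputed from the data each time, this should keep working when the DataFrame changes between runs.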

Renaming columns in dataframe w.r.t another specific column

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.
Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.
Essentially: Link the subsequent 2 columns to the "parent" one preceding them.
Sales Unnamed: 2 Unnamed: 3 Validation Unnamed: 5 Unnamed: 6
0 Commented No comment Commented No comment
1 x x
2 x x
3 x x
APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2).
Then take in the text for the next named column (e.g. Validation) and so on.
I know the rename() function for dataframes.
But I am not sure how to apply it iteratively to change the column titles.
EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.
Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.
Again, repeated for all 100 columns of file.
EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.
Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0 Commented No comment Commented No comment
1 x x
2 x
3 x x x
As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.
I would appreciate all help and guidance.
1. Make a list with the column names that you want.
2. Turn it into a dict with the old column names as keys and the new column names as values.
3. Use df.rename(columns=your_dictionary).
import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")
print(df.head())
Output>>>
Sales Unnamed : 2 Unnamed : 3 Validation Unnamed : 5 Unnamed : 6 Unnamed :7
0 NaN Commented No comment NaN Comment No comment Extra
1 1.0 2 1 1.0 1 1 1
2 3.0 1 1 1.0 1 1 1
3 4.0 3 4 5.0 5 6 6
4 5.0 1 1 1.0 21 3 6
# get new names based on the values of a previous named column
new_column_names = []
counter = 0
for col_name in df.columns:
    if (col_name[:7].strip()=="Unnamed"):
        new_column_names.append(base_name+"_"+df.iloc[0,counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)
    counter +=1
# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(),new_column_names))
# rename columns
df = df.rename(columns=dictionary)
# drop the first row (it only held the sub-headers)
df = df.iloc[1:].reset_index(drop=True)
print(df.head())
Output>>
Sales Sales_Commented Sales_No_comment Validation Validation_Comment Validation_No_comment Validation_Extra
0 1.0 2 1 1.0 1 1 1
1 3.0 1 1 1.0 1 1 1
2 4.0 3 4 5.0 5 6 6
3 5.0 1 1 1.0 21 3 6
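If you want to try the idea without the Excel file, here is a self-contained sketch on a tiny hand-made frame. The sample values are made up for illustration, and it assumes the first column is always a named one (otherwise parent would still be None when the first "Unnamed" column is reached).

```python
import pandas as pd

# Hypothetical stand-in for the mapping file: row 0 holds the sub-headers
df = pd.DataFrame(
    [
        [None, "Commented", "No comment", None, "Commented", "No comment"],
        ["x", None, None, None, "x", None],
    ],
    columns=["Sales", "Unnamed: 2", "Unnamed: 3",
             "Validation", "Unnamed: 5", "Unnamed: 6"],
)

new_names = []
parent = None
for position, name in enumerate(df.columns):
    if str(name).startswith("Unnamed"):
        # Prefix the sub-header from row 0 with the last named column
        label = str(df.iloc[0, position]).replace(" ", "_")
        new_names.append(f"{parent}_{label}")
    else:
        parent = name
        new_names.append(name)

df.columns = new_names
# Row 0 only held the sub-headers, so drop it
df = df.iloc[1:].reset_index(drop=True)
print(df.columns.tolist())
```

Nothing here depends on the number of columns or on the department names, so the same loop should cope with mapping files of different widths.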

How can the AVERAGEIFS function be translated into MATLAB?

I am working at moving my data over from Excel to Matlab. I have some data that I want to average based on multiple criteria. I can accomplish this by looping, but want to do it with matrix operations only, if possible.
Thus far, I have managed to do so with a single criterion, using accumarray as follows:
data=[
1 3
1 3
1 3
2 3
2 6
2 9];
accumarray(data(:,1),data(:,2))./accumarray(data(:,1),1);
Which returns:
3
6
Corresponding to the averages of items 1 and 2, respectively. I have at least three other columns that I need to include in this averaging but don't know how I can add that in. Any help is much appreciated.
For your single column, you don't need to call accumarray twice, you can provide a function handle to mean as the fourth input
mu = accumarray(data(:,1), data(:,2), [], @mean);
For multiple columns, you can use the row indices as the second input to accumarray and then use those from within the anonymous function to access the rows of your data to operate on.
data = [1 3 5
1 3 10
1 3 8
2 3 7
2 6 9
2 9 12];
tmp = accumarray(data(:,1), 1:size(data, 1), [], @(rows){mean(data(rows,2:end), 1)});
means = cat(1, tmp{:});
% 3.0000 7.6667
% 6.0000 9.3333

Using pandas style to give colors to some rows with a specific condition

This is the output of pandas in excel format:
Id comments number
1 so bad 1
1 so far 2
2 always 3
2 very good 4
3 very bad 5
3 very nice 6
3 so far 7
4 very far 8
4 very close 9
4 busy 10
I want to use pandas to color (for example, in gray) the rows whose Id value is even. For example, rows 3 and 4 have even Id numbers, but rows 5, 6 and 7 have odd Id numbers. Is there a way to do this with pandas?
As explained in the documentation http://pandas.pydata.org/pandas-docs/stable/style.html what you basically want to do is write a style function and apply it to the style object.
def _color_if_even(s):
    return ['background-color: grey' if val % 2 == 0 else '' for val in s]
and call it on the Styler object, i.e.,
df.style.apply(_color_if_even, subset=['Id'])
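Note that a column-wise style function with a subset only colors the cells of that column. To shade the whole row, one option is a row-wise function applied with axis=1. The following is a sketch, not tested against your file: the function name color_row_if_even_id and the file name colored.xlsx are made up, and the Styler lines are left as comments because they need jinja2 and an Excel writer such as openpyxl.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "comments": ["so bad", "so far", "always", "very good", "very bad"],
    "number": [1, 2, 3, 4, 5],
})

def color_row_if_even_id(row):
    """Return one CSS string per cell: grey for every cell if Id is even."""
    css = "background-color: grey" if row["Id"] % 2 == 0 else ""
    return [css] * len(row)

# Preview the CSS that each row would receive
row_css = [color_row_if_even_id(row) for _, row in df.iterrows()]
print(row_css)

# With jinja2 installed, apply the function row by row and export:
# styled = df.style.apply(color_row_if_even_id, axis=1)
# styled.to_excel("colored.xlsx")
```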

Efficiently concatenate a large number of columns

I tried to concatenate a large number of columns containing integers in one string.
Basically, starting from:
df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})
To obtain:
id join
0 1 481
1 2 592
2 3 603
3 4 714
I found several methods to do this (here and here):
Method 1:
conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1
This method works, but it is slow (45 min on my "test" case of 18,000 rows x 40,000 columns). I am concerned about the loop over the columns, because this program will eventually be applied to tables with 600,000 columns, and I am afraid it will take too long.
Method 2a
conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]
Method 2b
conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)
Both of these methods are 10 times faster than the previous one, iterate over rows (which is good), and work perfectly on my "debug" table df. But when I apply them to my "test" table of 18k x 40k, they raise a MemoryError (60% of my 32 GB of RAM is occupied after reading the corresponding csv file).
I can copy my DataFrame without running out of memory but, curiously, applying these methods makes the code crash.
Do you see how I can fix and improve this code to use an efficient row-based iteration? Thank you!
Appendix:
Here is the code I use on my test case:
geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})
I should probably use the chunksize option to read this file, but I haven't yet understood how to use the result after reading.
Method 1:
fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls']\
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)
This works in 45 min.
There are some rather ugly pieces of code here, like the .astype(int).astype(str). I don't know why Python doesn't recognize my integers and treats them as floats.
Method 2:
fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]]\
.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
This leads to a MemoryError.
Here's something to try. It requires converting your columns to strings, though. Your sample frame:
b c id
0 4 8 1
1 5 9 2
2 6 0 3
3 7 1 4
then
# .ix was removed in pandas 1.0, so select the columns by name instead
cols = ['b', 'c', 'id']
conc[cols] = conc[cols].astype(str)
conc['join'] = conc[cols].sum(axis=1)
Would give
a b c id join
0 0 4 8 1 481
1 1 5 9 2 592
2 2 6 0 3 603
3 3 7 1 4 714
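Since the MemoryError most likely comes from converting the whole table to strings at once, another thing to try is to do the conversion a block of rows at a time. This is only a sketch on the small debug frame from the question: the helper join_columns_in_chunks and its chunk_rows parameter are made up for illustration, and I have not measured it on an 18k x 40k table. If your values are read as floats, you would still need the .astype(int) step before .astype(str).

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7], 'c': [8, 9, 0, 1]})

def join_columns_in_chunks(frame, cols, chunk_rows=2):
    """Concatenate the values of cols into one string per row,
    converting only chunk_rows rows to strings at a time."""
    pieces = []
    for start in range(0, len(frame), chunk_rows):
        # Only this slice becomes a string array, so the temporary stays small
        block = frame.iloc[start:start + chunk_rows][cols].to_numpy().astype(str)
        pieces.extend(''.join(row) for row in block)
    return pd.Series(pieces, index=frame.index)

joined = join_columns_in_chunks(df, ['a', 'b', 'c'])
print(joined)
```

On real data a chunk_rows of a few hundred or a few thousand is probably a more sensible starting point than 2, which is only used here so that the tiny example goes through more than one chunk.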
