Problem
Let us say that I have a DataFrame df of n rows like this:
| Precipitation | Discharge |
|:-------------:|:---------:|
|      12       |    16     |
|      10       |    15     |
|      12       |    16     |
|      10       |    15     |
|      ...      |    ...    |
|      12       |    16     |
|      10       |    15     |
Rows are automatically indexed from 1 to n. When we extract one column, for example:
series = df.loc[5:,['Precipitation']]
or
series = df.Precipitation[5:]
The extracted series keeps the original index:
Precipitation
5 16
6 17
7 18
...
n 15
So the question is: how can we change the index from 5…n back to 0…n-5?
Note that I have tried series.reindex() and series.reset_index(), but neither of them works...
Currently I work around it with series.tolist(), but is there a more elegant and smarter way?
Many thanks in advance!
If I understand your question correctly, you want to select the first n-5 rows of your column. Note that .loc slicing includes the end label, so .iloc (which is end-exclusive) is the safer choice for a count-based slice:
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df.iloc[:len(df)-5][['col1']]
You may want to read up on slicing with pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
EDIT:
I think I misunderstood, and what you instead want is to 'reset' your index. You can do that with the method reset_index()
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df = df.loc[5:,['col1']]
df = df.reset_index(drop=True)
Note that reset_index returns a new DataFrame unless you pass inplace=True, so assign the result back.
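Applied back to the question's Precipitation series, the same idea is a one-liner (the numbers below are made up to stand in for the original column):

```python
import pandas as pd

# Hypothetical data standing in for the Precipitation column
df = pd.DataFrame({'Precipitation': [12, 10, 12, 10, 12, 10, 12]})

series = df['Precipitation'][5:]        # index starts at 5
series = series.reset_index(drop=True)  # index now starts at 0
print(series.index.tolist())            # → [0, 1]
```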
Related
I have a problem with a pandas DataFrame. I have two dataframes with different contents, and I want to output the words that are not in dataframe 2 and store them in a new dataframe. Can someone help me solve this problem using pandas? Thank you...
Where dataframe 1 contains:
Tweet
Bismillah for tomorrow Amin
shared location
Replying to shahrilPng
It's time to finish what's been pending
up and parallel
When you run after your dream
And dataframe 2 contains:
Words
tomorrow
shared
location
time
finish
pending
parallel
run
after
dream
The output that I want:
Results
Bismillah
for
Amin
Replying
to
shahrilPng
etc
Split and explode your tweets dataframe, then check whether each word is present in your words dataframe:
# check function
not_in_list = lambda x: ~x.str.casefold().isin(df2['Words'].str.casefold())
out = df1['Tweet'].str.split().explode().loc[not_in_list] \
.drop_duplicates().reset_index(drop=True).to_frame('Results')
print(out)
# Output
Results
0 Bismillah
1 for
2 Amin
3 Replying
4 to
5 shahrilPng
6 It's
7 what's
8 been
9 up
10 and
11 When
12 you
13 your
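A self-contained sketch of the same approach, with small made-up frames mirroring the question's layout:

```python
import pandas as pd

# Made-up data standing in for the question's frames
df1 = pd.DataFrame({'Tweet': ['shared location', 'run after your dream']})
df2 = pd.DataFrame({'Words': ['shared', 'location', 'run', 'after', 'dream']})

# Split tweets into one word per row, then keep words absent from df2
words = df1['Tweet'].str.split().explode()
mask = ~words.str.casefold().isin(df2['Words'].str.casefold())
out = words[mask].drop_duplicates().reset_index(drop=True).to_frame('Results')
print(out['Results'].tolist())  # → ['your']
```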
One way would be to turn each dataframe into a flat set of words, take the difference, and put the result into a dataframe. Since dataframe 1 holds whole tweets, they have to be split into words first:
import pandas as pd
import numpy as np
df1_set = set(df1['Tweet'].str.split().explode())
df2_set = set(np.ravel(df2.values))
pd.DataFrame(df1_set - df2_set).dropna()
Unlike the explode-based approach, a plain set difference is case-sensitive and does not preserve word order.
My excel spreadsheet has the form below.
| A                | B  |
|------------------|----|
| Part 1- 20210910 | 55 |
| Part 2- 20210829 | 45 |
| Part 3- 20210912 | 2  |
I would like to take the strings from column A, such as "Part 1- 20210910", and have pandas read the date part as "2021/09/10", a date format. How could I implement this?
IIUC:
df['A'] = pd.to_datetime(df['A'].str.extract(r'(\d{8})', expand=False), format='%Y%m%d')
print(df)
# Output:
A B
0 2021-09-10 55
1 2021-08-29 45
2 2021-09-12 2
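A self-contained sketch of this extraction, with a made-up frame standing in for the spreadsheet (newer pandas rejects a bare 'datetime64' dtype string, so parsing the extracted digits with pd.to_datetime is the safer route):

```python
import pandas as pd

# Made-up frame mirroring the spreadsheet in the question
df = pd.DataFrame({'A': ['Part 1- 20210910', 'Part 2- 20210829'], 'B': [55, 45]})

# Pull out the 8-digit date and parse it as YYYYMMDD
df['A'] = pd.to_datetime(df['A'].str.extract(r'(\d{8})', expand=False), format='%Y%m%d')
print(df['A'].dt.strftime('%Y-%m-%d').tolist())  # → ['2021-09-10', '2021-08-29']
```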
My beginner way of doing it:
import pandas as pd
df = pd.read_excel('file_name.xlsx')
df['A'] = df['A'].apply(lambda x: pd.to_datetime(x.split('-')[1].strip(), format='%Y%m%d'))
The .strip() is needed because splitting "Part 1- 20210910" on '-' leaves a leading space that would break the %Y%m%d parse.
Background:
I deal with a dataframe and want to divide the two columns of this dataframe to get a new column. The code is shown below:
import pandas as pd
df = {'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]}
df = pd.DataFrame(df)
df['price/km'] = df[['drive_mile', 'price']].apply(lambda x: x[1]/x[0])
print(df)
And I get the below result:
drive_mile price price/km
0 15.10 40 NaN
1 2.10 9 NaN
2 7.12 31 NaN
Why would this happen? And how can I fix it?
As pointed out in the comments, you missed the axis=1 parameter needed to perform the division row-wise with apply. Without it, apply runs column-wise, so the result is indexed by the column names and does not align with the DataFrame's row index when assigned back, which produces the NaNs.
However, more importantly, do not use apply to perform a division! apply is often much less efficient than vectorized operations.
Use div (note the order: the column is price per km, so divide price by drive_mile, matching your x[1]/x[0]):
df['price/km'] = df['price'].div(df['drive_mile'])
Or /:
df['price/km'] = df['price'] / df['drive_mile']
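A quick sketch on the question's own numbers, showing that apply with axis=1 and the vectorized division agree:

```python
import pandas as pd

df = pd.DataFrame({'drive_mile': [15.1, 2.1, 7.12], 'price': [40, 9, 31]})

# apply row-wise: x is now a row Series, so labels index into the row
via_apply = df[['drive_mile', 'price']].apply(lambda x: x['price'] / x['drive_mile'], axis=1)

# vectorized division, typically much faster
via_div = df['price'] / df['drive_mile']

print(via_apply.equals(via_div))  # → True
```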
I am sorry for asking a naive question, but it's driving me crazy at the moment. I have a dataframe df1, and I am creating a new dataframe df2 from it, as follows:
import pandas as pd
def NewDF(df):
df['sum']=df['a']+df['b']
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1)
df2 =NewDF(df1)
print(df1)
which gives
a b
0 1 4
1 2 5
2 3 6
a b sum
0 1 4 5
1 2 5 7
2 3 6 9
Why am I losing df1's shape and getting a third column? How can I avoid this?
DataFrames are mutable so you should either explicitly pass a copy to your function, or have the first step in your function copy the input. Otherwise, just like with lists, any modifications your functions make also apply to the original.
Your options are:
def NewDF(df):
df = df.copy()
df['sum']=df['a']+df['b']
return df
df2 = NewDF(df1)
or
df2 = NewDF(df1.copy())
Here we can see that everything in your original implementation refers to the same object
import pandas as pd
def NewDF(df):
print(id(df))
df['sum']=df['a']+df['b']
print(id(df))
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(id(df1))
#2242099787480
df2 = NewDF(df1)
#2242099787480
#2242099787480
print(id(df2))
#2242099787480
The extra unlabeled column on the left is the index: every pandas DataFrame maintains an index, but you can choose not to show it in your output.
import pandas as pd
def NewDF(df):
df['sum']=df['a']+df['b']
return df
df1 =pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
print(df1.to_string(index=False))
df2 =NewDF(df1)
print(df1.to_string(index = False))
Gives the output as
a b
1 4
2 5
3 6
a b sum
1 4 5
2 5 7
3 6 9
Now you might ask why the index exists at all. The index is backed by a hash table, which speeds up lookups and is highly desirable in many contexts. If this was just a one-off question, this should be enough; but if you are looking to learn more about pandas, I would advise you to look into indexing. You can begin here: https://stackoverflow.com/a/27238758/10953776
I got the following simple code to calculate normality over an array:
import pandas as pd
import scipy.stats as stats

df = pd.read_excel(r"directory\file.xlsx")  # raw string so the backslash is not treated as an escape
x = df.iloc[:, 1:].values.flatten()
stats.normaltest(x, axis=None)
This gives me nicely a p-value and a statistic.
The only thing I want right now is to:
Add two columns to the file with this p-value and statistic, and if I have multiple rows, do it for all of them (calculate the p-value and statistic for each row and add two columns holding these values).
Can someone help?
If you want to calculate a row-wise normaltest, you should not flatten your data into x; instead, use axis=1, such as:
df = pd.DataFrame(np.random.random(105).reshape(5, 21))  # to generate data
# calculate normaltest row-wise, skipping the first column as in your code
df['stat'], df['p'] = stats.normaltest(df.iloc[:, 1:], axis=1)
Then df contains two columns 'stat' and 'p' with the values you are looking for, IIUC.
Note: to be able to perform normaltest, you need at least 8 values per row (in my experience), so df.iloc[:,1:] needs at least 8 columns, otherwise it will raise an error. Even then, it is better to have more than 20 values in each row.
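A self-contained sketch of the row-wise call on toy random data (the exact stat and p values vary with the data, so none are shown):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 21)))  # 5 rows, 21 columns of toy data

# row-wise normaltest on everything except the first column;
# the result unpacks into a statistic array and a p-value array, one entry per row
df['stat'], df['p'] = stats.normaltest(df.iloc[:, 1:], axis=1)
print(df[['stat', 'p']].shape)  # one stat and one p per row
```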