Constructing a new dataframe and keeping the old one? - python-3.x

I am sorry for asking a naive question, but it's driving me crazy at the moment. I have a dataframe df1 and am creating a new dataframe df2 from it, as follows:
import pandas as pd

def NewDF(df):
    df['sum'] = df['a'] + df['b']
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df1)
df2 = NewDF(df1)
print(df1)
which gives
   a  b
0  1  4
1  2  5
2  3  6
   a  b  sum
0  1  4    5
1  2  5    7
2  3  6    9
Why am I losing df1's shape and getting a third column? How can I avoid this?

DataFrames are mutable, so you should either explicitly pass a copy to your function, or have the first step in your function copy the input. Otherwise, just like with lists, any modifications your function makes also apply to the original.
Your options are:
def NewDF(df):
    df = df.copy()
    df['sum'] = df['a'] + df['b']
    return df

df2 = NewDF(df1)
or
df2 = NewDF(df1.copy())
Here we can see that everything in your original implementation refers to the same object:
import pandas as pd

def NewDF(df):
    print(id(df))
    df['sum'] = df['a'] + df['b']
    print(id(df))
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(id(df1))
#2242099787480
df2 = NewDF(df1)
#2242099787480
#2242099787480
print(id(df2))
#2242099787480
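For contrast, here is a quick sketch (my addition, not part of the original answer; NewDFCopy is a hypothetical name) showing that the copying version hands back a distinct object:
def NewDFCopy(df):
    df = df.copy()                 # work on a copy so the caller's frame is untouched
    df['sum'] = df['a'] + df['b']
    return df

df3 = NewDFCopy(df1)
print(id(df1) == id(df3))
#False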

The third column that you are getting is the index column. Every pandas DataFrame always maintains an index; however, you can choose to leave it out of your output.
import pandas as pd

def NewDF(df):
    df['sum'] = df['a'] + df['b']
    return df

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df1.to_string(index=False))
df2 = NewDF(df1)
print(df1.to_string(index=False))
Gives the output as
 a  b
 1  4
 2  5
 3  6
 a  b  sum
 1  4    5
 2  5    7
 3  6    9
Now you might ask why the index exists at all. The index is backed by a hash table, which speeds up lookups and is a highly desirable feature in many contexts. If this was just a one-off question, this should be enough, but if you are looking to learn more about pandas I would advise you to look into indexing; you can begin here: https://stackoverflow.com/a/27238758/10953776
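To make that concrete, here is a small sketch (my own example, with made-up column names) of the index doing useful work: after set_index, lookups happen by label through the index rather than by scanning rows.
import pandas as pd

df = pd.DataFrame({'key': ['x', 'y', 'z'], 'value': [10, 20, 30]})
indexed = df.set_index('key')      # 'key' now backs the index
print(indexed.loc['y', 'value'])   # looked up by label via the index
#20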

Related

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider for example the following
N=10000
df = pd.DataFrame({'col_1':np.random.random(N), 'col_2': np.random.random(N) })
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1 and I follow the solution from here
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.
However, if I want the mean and use mean()
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask docs: SeriesGroupBy.apply):
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
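To illustrate that last sentence (my own sketch, not code from the question), a decomposable custom aggregation built with dask.dataframe.groupby.Aggregation reduces each partition first and then combines the partial results, so it yields one row per group. Note that a true median cannot be split into per-partition pieces like this, so it does not fit this pattern.
import dask.dataframe as dd

# Sketch: a custom per-group sum expressed as a Dask Aggregation.
# 'chunk' runs on each partition's groups, 'agg' combines the partial results.
custom_sum = dd.Aggregation(
    name='custom_sum',
    chunk=lambda grouped: grouped.sum(),
    agg=lambda partial_sums: partial_sums.sum(),
)
# usage (assuming ddf2 from the question):
# result = ddf2.groupby('bin_num')['col_1'].agg(custom_sum).compute()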
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). By casting the category column to another type (float, int, str), the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np

def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))

N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0, 1, 11)
labels = list(range(len(bins) - 1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)

Mapping dictionary values to a column using pandas [duplicate]

This question already has answers here:
Remap values in pandas column with a dict, preserve NaNs
(11 answers)
Closed 4 years ago.
I'm trying to do something that should be really simple in pandas, but it seems anything but. I'm trying to add a column to an existing pandas dataframe that is a mapped value based on another (existing) column. Here is a small test case:
import pandas as pd
equiv = {7001:1, 8001:2, 9001:3}
df = pd.DataFrame( {"A": [7001, 8001, 9001]} )
df["B"] = equiv(df["A"])
print(df)
I was hoping the following would result:
A B
0 7001 1
1 8001 2
2 9001 3
Instead, I get an error telling me that equiv is not a callable function. Fair enough, it's a dictionary, but even if I wrap it in a function I still get frustration. So I tried to use a map function that seems to work with other operations, but it also is defeated by use of a dictionary:
df["B"] = df["A"].map(lambda x:equiv[x])
In this case I just get KeyError: 8001. I've read through documentation and previous posts, but have yet to come across anything that suggests how to mix dictionaries with pandas dataframes. Any suggestions would be greatly appreciated.
The right way to do it is df["B"] = df["A"].map(equiv).
In [55]:
import pandas as pd
equiv = {7001:1, 8001:2, 9001:3}
df = pd.DataFrame( {"A": [7001, 8001, 9001]} )
df["B"] = df["A"].map(equiv)
print(df)
A B
0 7001 1
1 8001 2
2 9001 3
[3 rows x 2 columns]
It also handles the situation where a key does not exist very nicely; consider the following example:
In [56]:
import pandas as pd
equiv = {7001:1, 8001:2, 9001:3}
df = pd.DataFrame( {"A": [7001, 8001, 9001, 10000]} )
df["B"] = df["A"].map(equiv)
print(df)
A B
0 7001 1
1 8001 2
2 9001 3
3 10000 NaN
[4 rows x 2 columns]
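If you would rather not keep the NaN for unmapped keys, one option (my suggestion, not part of the original answer) is to fill them afterwards with a sentinel of your choice:
df["B"] = df["A"].map(equiv).fillna(-1).astype(int)   # -1 is an arbitrary placeholder
print(df)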

Modifying multiple columns of data using iteration, but changing increment value for each column

I'm trying to modify multiple column values in pandas DataFrames with a different increment for each column, so that the values in each column do not overlap with each other when graphed on a line graph.
Here's the end goal of what I want to do: link
Let's say I have this kind of Dataframe:
Col1  Col2  Col3
0     0.3   0.2
1     1.1   1.2
2     2.2   2.4
3     3     3.1
but with hundreds of columns and thousands of values.
When graphing this on a line-graph on excel or matplotlib, the values overlap with each other, so I would like to separate each column by adding the same values for each column like so:
Col1(+0)  Col2(+10)  Col3(+20)
0         10.3       20.2
1         11.1       21.2
2         12.2       22.4
3         13         23.1
By adding the same value to one column and increasing by an increment of 10 over each column, I am able to see each line without it overlapping in one graph.
I thought of using loops and iterations to automate this value-adding process, but I couldn't find any previous solutions on Stack Overflow that address how I could change the increment value between columns (e.g. adding 0 to Col1 in one loop iteration, then adding 10 to Col2 in the next) while keeping it constant within each column. To make things worse, I'm a beginner with no clue about programming or data manipulation.
Since the data is in a CSV format, I first used Pandas to read it and store in a Dataframe, and selected the columns that I wanted to edit:
import pandas as pd

# import CSV file
df = pd.read_csv('data.csv')
# store csv data into dataframe
df1 = pd.DataFrame(data=df)
# Locate columns that I want to edit with df.loc
columns = df1.loc[:, ' C000':]
here is where I'm stuck:
# use iteration with increments to add numbers
n = 0
for values in columns:
    values = n + 0
    print(values)
But this for-loop only adds one increment value (in this case 0), and adds it to all columns, not just the first column. Not only that, but I don't know how to add the next increment value for the next column.
Any possible solutions would be greatly appreciated.
IIUC, just use df.add() over axis=1 with a list made from the length of df.columns:
df1 = df.add(list(range(0,len(df.columns)*10))[::10],axis=1)
Or as #jezrael suggested, better:
df1=df.add(range(0,len(df.columns)*10, 10),axis=1)
print(df1)
   Col1  Col2  Col3
0     0  10.3  20.2
1     1  11.1  21.2
2     2  12.2  22.4
3     3  13.0  23.1
Details :
list(range(0,len(df.columns)*10))[::10]
#[0, 10, 20]
I would recommend avoiding loops over the data frame, as they are inefficient; instead, think of this as adding matrices.
e.g.
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# Multiply each column with an incremented value * 10
x = x * 10*np.arange(1,df.shape[1]+1)
# Add the matrix to the data
df + x
Edit: in case you do not want to increment with 10, 20, 30 but with 0, 10, 20, use this instead:
import numpy as np
import pandas as pd
# Create your example df
df = pd.DataFrame(data=np.random.randn(10,3))
# Create a Matrix of ones
x = np.ones(df.shape)
# THIS LINE CHANGED
# Omit the 1 so there is only an end value -> the default start is 0
# Adjust the length of the vector
x = x * 10*np.arange(df.shape[1])
# Add the matrix to the data
df + x

Appending values to a column in a loop

I have various files containing data. I want to extract one specific column from each file and create a new dataframe with one column containing all the extracted data.
So for example I have 3 files:
A B C
1 2 3
4 5 6
A B C
7 8 9
8 7 6
A B C
5 4 3
2 1 0
The new dataframe should only contain the values from column C:
C
3
6
9
6
3
0
So the column of the first file should be copied to the new dataframe, and the column from the second file should be appended to the new dataframe.
My code looks like this so far:
import pandas as pd
import glob

for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6)
    df1 = df["Bias"]
    print(df)
Now df1 is overwritten in each loop step. Would it be a good idea to create a temporary dataframe in each loop step and then copy the data to the new dataframe?
Any input is appreciated!
Use a list comprehension or a for loop with append to build a list of DataFrames; if you need only some columns, add the usecols parameter. Finally, concat everything together into one big DataFrame:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C']) for f in glob.glob('*.dat')]
Or:
dfs = []
for filename in glob.glob('*.dat'):
    df = pd.read_csv(filename, delimiter="\t", header=6, usecols=['C'])
    # if you need all columns
    # df = pd.read_csv(filename, delimiter="\t", header=6)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
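One small aside (my addition): glob.glob does not guarantee any particular file order, so if the row order of the combined frame matters, you may want to sort the paths first:
dfs = [pd.read_csv(f, delimiter="\t", header=6, usecols=['C'])
       for f in sorted(glob.glob('*.dat'))]
df = pd.concat(dfs, ignore_index=True)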

Re-indexing a pandas Series extracted from DataFrame

Problem
Let us say that I have a DataFrame df of n rows like this:
| Precipitation | Discharge |
|      12       |     16    |
|      10       |     15    |
|      12       |     16    |
|      10       |     15    |
...
|      12       |     16    |
|      10       |     15    |
Rows are automatically indexed from 1 to n. When we extract one column for example:
series = df.loc[5:,['Precipitation']]
or
series = df.Precipitation[5:]
The extracted series tends to be like:
Precipitation
5 16
6 17
7 18
...
n 15
So the question is: how can we change the index from running 5 to n into running 0 to n-5?
Note that I have tried series.reindex() and series.reset_index() but neither of them works...
Currently I do series.tolist() to solve the problem but is there any way more elegant and smarter?
Many thanks in advance !
If I understand your question correctly, you want to select the first n-5 rows of your column.
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df.loc[:len(df)-5,['col1']]
You may want to read up on slicing with pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges
EDIT:
I think I misunderstood, and what you instead want is to 'reset' your index. You can do that with the reset_index() method:
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7], 'col2': [3,4,3,2,5,9,7]})
df = df.loc[5:,['col1']]
df.reset_index(drop=True)
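One detail worth adding (my note): reset_index returns a new object by default, so assign the result back (or pass inplace=True alongside drop=True) if you want to keep the renumbered index:
df = df.reset_index(drop=True)
print(df)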