Combine multiindex columns with duplicate names - python-3.x

I have this multiindex dataframe:
I am trying to merge both columns named MODIS.NDVI under a single multi-index entry, so that the key MODIS.NDVI returns its corresponding p10, p25, p50, mean and std. Calling df.xs('MODIS.NDVI', axis=1) returns the expected output, so my questions are:
How do I reformat the df to remove the unnecessary duplicate column names at level=0?
Is that even necessary beyond a simple aesthetic concern, since .xs handles it? Is there any risk of index "confusion" if duplicate names exist?
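For reference, a minimal hypothetical frame with a duplicated level-0 label might look like the sketch below (every label except MODIS.NDVI and the statistic names is made up); sorting the column index is one common way to group the duplicated blocks together, and .xs collects them either way:

import numpy as np
import pandas as pd

# Hypothetical layout: 'MODIS.NDVI' appears twice at level 0, with different
# statistics under each occurrence.
cols = pd.MultiIndex.from_tuples([
    ('MODIS.NDVI', 'p10'), ('MODIS.NDVI', 'p25'),
    ('OTHER.BAND', 'mean'),
    ('MODIS.NDVI', 'p50'), ('MODIS.NDVI', 'mean'), ('MODIS.NDVI', 'std'),
])
df = pd.DataFrame(np.random.rand(3, len(cols)), columns=cols)

print(df.xs('MODIS.NDVI', axis=1))   # gathers every sub-column under that label
df = df.sort_index(axis=1)           # lexsorts the columns, grouping the duplicated blocks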

Related

Iterating through each row of a single column in a Python geopandas (or pandas) DataFrame

I am trying to iterate through this very large DataFrame that I have in python. The thing is, I only want to pull out data from one specific column that contains the names of a bunch of counties.
I have tried to use iteritems(), itertuples(), and iterrows() to no avail.
Any suggestions on how to do this?
My end goal is to have a nested dictionary with each internal dictionary's key being a name from the DataFrame column.
I also tried the method below to select a single column, but it only prints the name of the column, not its contents.
for county in map_datafile[['NAME']]:
    print(county)
If you delete one pair of square brackets, you get a Series that is iterable:
for county in map_datafile['NAME']:
    print(county)
See the difference:
print(type(map_datafile[['NAME']]))
# pandas.core.frame.DataFrame
print(type(map_datafile['NAME']))
# pandas.core.series.Series
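For the nested-dictionary end goal, an explicit loop is not strictly necessary either; a minimal sketch, assuming the names in 'NAME' are unique and the remaining columns should become each inner dictionary:

# Build {county_name: {column: value, ...}} without an explicit Python loop.
nested = (
    map_datafile
    .set_index('NAME')        # county names become the outer keys
    .to_dict(orient='index')  # each remaining row becomes an inner dict
)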

Is there a better way to replace rows from one dataframe to another based on a single column's data?

I'm attempting to replace all matching rows in one dataframe with rows from another dataframe when a specific condition is satisfied. The number of rows in both dataframes will differ, but the number of columns are the same.
Due to the sensitive nature of the data I'm dealing with, I can give an example but not the real data itself.
An example of the dataframe could be something like this:
Location  Tickets  Destination  Product  Tare  Net  Scale  Value1  Value2  ...
500       012      ID           01A      20    40   60     0.01    0.00
500       013      PA           02L      10    300  310    12.01   5
The other dataframe would have different values in each respective column, except for the 'Tickets' column. I would like to replace the data from all rows in df1 with any data from df2 that has a matching ticket number.
df1.loc[df1.Tickets.isin(df2.Tickets), :] = df2.loc[df2.Tickets.isin(df1.Tickets), :].values
When I run the code I get this error off and on:
ValueError: Must have equal len keys and value when setting with an ndarray
Not sure why or what is causing it, because the code works on certain dataframes and not on others. The dataframes it works on have different numbers of rows but the same columns, which is exactly the case for the dataframes it does not work on. I know there is something to this, but I can't seem to find it.
Any help is greatly appreciated!
Update:
Interestingly enough, when I isolate the tickets to be replaced from df1 in a separate dataframe, I get no errors when I replace them with df2. This leads me to believe I have duplicate tickets in df1 that are causing my error, but on further inspection I can find no duplicate tickets in df1.
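One way to sidestep the length-matching requirement of the positional .values assignment is to align the two frames on the ticket number instead; a rough sketch, assuming 'Tickets' is unique in df2 and both frames share the same column names:

# .update aligns on index and columns, so the row counts no longer need to match.
df1 = df1.set_index('Tickets')
df1.update(df2.set_index('Tickets'))  # overwrite matching rows with df2's values
df1 = df1.reset_index()
# note: NaN values in df2 are ignored by .update and will not overwrite df1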

Regarding avoiding using a column as index by default in Pandas [duplicate]

Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.
Any thoughts on best-practices around indexing?
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows
based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up any row where the df['index'] column equals 999.
Pandas has to loop through every value in the column to find the ones equal to 999.
df[df['index'] == 999]
# foo index
# 999 0.375489 999
Here is how you could lookup any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:
df_with_index.loc[999]
# foo 0.375489
# index 999.000000
# Name: 999, dtype: float64
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
Note however, it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
So having the index is only advantageous when you have many lookups of this type
to perform.
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt,
lreshape, and crosstab, all use or manipulate the index.
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
So in the end, I think the origin of the index's usefulness, why it shows up in so many functions, is due to its ability to perform fast hash
lookups.
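To make the joining point concrete, here is a small sketch (not from the original answer) comparing a column-based merge with an index-based join on the same data; left, right and key are placeholder names, with pd and np imported as above:

left = pd.DataFrame({'key': range(10000), 'a': np.random.random(10000)})
right = pd.DataFrame({'key': range(10000), 'b': np.random.random(10000)})

merged_on_column = left.merge(right, on='key')                         # merge on a column
joined_on_index = left.set_index('key').join(right.set_index('key'))   # join on the index

Both produce the same data; with the key in the index, the join can use the fast index lookups described above.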

reindexing dataframes replaces all my data with NaNs, why?

So I was investigating how some commands from Pandas work, and I ran into this issue; when I use the reindex command, my data is replaced by NaN values. Below is my code:
>>> import pandas as pd
>>> import numpy as np
>>> frame1 = pd.DataFrame(np.arange(365))
Then I give it an index of dates:
>>> frame1.index = pd.date_range('2017-04-06', '2018-04-05')
Then I reindex:
>>> broken_frame = frame1.reindex(np.arange(365))
aaaand all my values are erased. This example isn't particularly useful, but it happens any and every time I use the reindex command, seemingly regardless of context. Similarly, when I try to join two dataframes:
>>>big_frame=frame1.join(pd.DataFrame(np.arange(365)), lsuffix='_frame1')
all of the values in the frame being attached (np.arange(365)) are replaced with NaNs before the frames are joined. If I had to guess, I would say this is because the second frame is reindexed as part of the joining process, and reindexing erases my values.
What's going on here?
From the Docs
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
Emphasis my own.
You want either set_index
frame1.set_index(np.arange(365))
Or do what you did in the first place
frame1.index = np.arange(365)
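As a quick illustration of why the original reindex call produced only NaNs, here is a toy five-row version of the same setup (pd and np imported as in the question):

small = pd.DataFrame(np.arange(5))
small.index = pd.date_range('2017-04-06', periods=5)

small.reindex(np.arange(5))    # all NaN: the integer labels 0-4 do not exist in the date index
small.set_index(np.arange(5))  # keeps the data and simply replaces the labels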
I did not find the answer helpful in relation to what I think the question is getting at, so I am adding to this.
The key is that the initial dataframe must have the same kind of index that you are reindexing on for this to work. Even the names must be the same! So if your new MultiIndex has no names, your initial dataframe's index must also have no names.
m = pd.MultiIndex.from_product([
    df['col'].unique(),
    pd.date_range(df.date.min(),
                  df.date.max() + pd.offsets.MonthEnd(1),
                  freq='M')
])
df = df.set_index(['col', 'date']).rename_axis([None, None])
df.reindex(m)
Then you will preserve your initial data values and reindex the dataframe.
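For context, a minimal hypothetical input frame that the snippet above would work on might look like this ('value' is a placeholder column name):

df = pd.DataFrame({
    'col':   ['A', 'A', 'B'],
    'date':  pd.to_datetime(['2021-01-31', '2021-03-31', '2021-02-28']),
    'value': [1.0, 2.0, 3.0],
})
# After the set_index / rename_axis / reindex(m) steps, every ('col', month-end)
# pair is present, with NaN filled in for the months that were missing.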

Issue with dropping columns

I'm trying to read in a data set and drop its first two columns, but it seems to be dropping the wrong columns. I was looking at this thread, but their suggestion is not giving the expected answer. My data set starts with 6 columns, and I need to remove the first two. Elsewhere in threads there is the option of dropping columns by label, but I would prefer not to name columns only to drop them if I can do it in one step.
df = pd.read_excel('Data.xls', header=17, skipfooter=246)
df.drop(df.columns[[0,1]], axis=1, inplace=True)
But it is dropping columns 4 and 5 instead of the first two. Is there something with the drop function that I'm just completely missing?
If I understand your question correctly, you have a multilevel index, so dropping columns [0, 1] starts counting on the non-index columns.
If you know the position of the columns, why not select them directly, for example:
df = df.iloc[:, 3:]
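Whether the right slice starts at position 2 or 3 depends on how read_excel parsed the header, so it is worth checking df.columns first; another option is to never read the unwanted columns at all via usecols (the column range below is a guess and should be adjusted to the real sheet layout):

print(df.columns)  # confirm which positions the unwanted columns actually occupy

# Skip the first two sheet columns at read time instead of dropping them later.
df = pd.read_excel('Data.xls', header=17, skipfooter=246, usecols='C:F')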
