Issue with dropping columns - python-3.x

I'm trying to read in a data set and drop its first two columns, but it seems to be dropping the wrong columns. I was looking at this thread, but its suggestion is not giving the expected answer. My data set starts with 6 columns, and I need to remove the first two. Other threads show how to drop columns by their labels, but I would prefer not to name columns only to drop them if I can do it in one step.
df = pd.read_excel('Data.xls', header=17, skipfooter=246)
df.drop(df.columns[[0,1]], axis=1, inplace=True)
But it is dropping columns 4 and 5 instead of the first two. Is there something with the drop function that I'm just completely missing?

If I understand your question correctly, you have a multilevel index, so drop(df.columns[[0, 1]]) starts counting from the non-index columns.
If you know the position of the columns, why not select them directly, for example:
df = df.iloc[:, 3:]
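As a minimal sketch on an invented, plain single-level 6-column frame (so no multilevel index in play), dropping the first two columns by position or by label gives the same result:

import pandas as pd

# Made-up frame standing in for the 6-column Excel data
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list("abcdef"))

# Keep everything from the third column onwards...
kept = df.iloc[:, 2:]

# ...which is equivalent to dropping the first two columns by label
dropped = df.drop(df.columns[[0, 1]], axis=1)

print(kept.columns.tolist())     # ['c', 'd', 'e', 'f']
print(dropped.columns.tolist())  # ['c', 'd', 'e', 'f']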

Related

Combine multiindex columns with duplicate names

I have this multiindex dataframe:
I am trying to merge both columns named MODIS.NDVI under a single multi-index label, so that MODIS.NDVI returns its corresponding p10, p25, p50, mean and std. Calling df.xs('MODIS.NDVI', axis=1) returns the expected output, so my questions are:
How do I reformat the df to remove the unnecessary duplicate column names at level=0?
Is that even necessary beyond a simple aesthetics concern, since .xs permits it? Is there any risk of index "confusion" if duplicate names exist?
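As a rough sketch (the frame below is made up to mirror the question), sorting the columns is often enough to group duplicate level-0 labels together, and .xs gathers them either way:

import pandas as pd

# Made-up frame where 'MODIS.NDVI' appears twice at level 0
cols = pd.MultiIndex.from_tuples([("MODIS.NDVI", "p10"),
                                  ("OTHER", "p10"),
                                  ("MODIS.NDVI", "p25")])
df = pd.DataFrame([[0.1, 0.5, 0.2]], columns=cols)

# .xs collects every sub-column under that level-0 label, duplicated or not
print(df.xs("MODIS.NDVI", axis=1))

# Sorting the columns puts the duplicate level-0 labels side by side
df = df.sort_index(axis=1)
print(df)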

Spark DataFrame how to change permutation of one column without join [duplicate]

This question already has answers here:
Updating a dataframe column in spark
(5 answers)
Closed 3 years ago.
I am trying to use PySpark to permute a column in a dataframe, i.e. shuffle all the values of a single column across rows.
I am trying to avoid the solution where the column gets split off and assigned an index column before being joined back to the original dataframe (which also gets an added index column), primarily because of my understanding (which could be very wrong) that joins are expensive at runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was unclear in the question; I just wanted to replace a column with the sorted version of it without doing a join. Thank you for pointing out that dfs are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname', while sorting the column by itself shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question.)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.
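For completeness, here is a minimal PySpark sketch (column and session names are assumptions, not from the thread) of one common way to shuffle a single column across rows; it does rely on the positional join the asker hoped to avoid:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "colname"])

# Give the original rows a position, give a randomly ordered copy of the
# target column its own position, then join the two on that position.
w_orig = Window.orderBy(F.monotonically_increasing_id())
w_rand = Window.orderBy(F.rand(seed=42))

left = df.withColumn("pos", F.row_number().over(w_orig))
right = df.select("colname").withColumn("pos", F.row_number().over(w_rand))

shuffled = left.drop("colname").join(right, on="pos").drop("pos")
shuffled.show()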

Is there a better way to replace rows from one dataframe to another based on a single columns' data?

I'm attempting to replace all matching rows in one dataframe with rows from another dataframe when a specific condition is satisfied. The number of rows in the two dataframes will differ, but the number of columns is the same.
Due to the sensitive nature of the data I'm dealing with, I can give an example but not the real data itself.
An example of the dataframe could be something like this:
Location  Tickets  Destination  Product  Tare  Net  Scale  Value1  Value2  ...
500       012      ID           01A      20    40   60     0.01    0.00
500       013      PA           02L      10    300  310    12.01   5
The other dataframe would have different values in each respective column, except for the 'Tickets' column. I would like to replace the data from all rows in df1 with any data from df2 that has a matching ticket number.
df1.loc[df1.Tickets.isin(df2.Tickets), :] = df2.loc[df2.Tickets.isin(df1.Tickets), :].values
When I run the code, I get this error off and on:
ValueError: Must have equal len keys and value when setting with an ndarray
Not sure why or what is causing it, because the code works on certain dataframes and not on others. The dataframes it is working on have different numbers of rows but the same columns, which is exactly the same for the dataframes it's not working on. I know there is something to this, but I can't seem to find it.
Any help is greatly appreciated!
Update:
Interestingly enough, when I isolate the tickets to be replaced from df1 in a separate dataframe, I get no errors when I replace them with df2. This leads me to believe I have duplicate tickets in df1 that are causing my error, but on further inspection I can find no duplicate tickets in df1.
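One way to sidestep the length mismatch (a sketch with invented data, not necessarily the asker's exact frames) is to align on the Tickets key instead of assigning a raw ndarray, for example with DataFrame.update:

import pandas as pd

# Invented frames sharing a 'Tickets' key
df1 = pd.DataFrame({"Tickets": ["012", "013", "014"], "Net": [40, 300, 7]})
df2 = pd.DataFrame({"Tickets": ["013", "014"], "Net": [999, 888]})

# Aligning on the key avoids the "Must have equal len keys and value"
# error even when the two frames have different numbers of matching rows.
df1 = df1.set_index("Tickets")
df1.update(df2.set_index("Tickets"))
df1 = df1.reset_index()
print(df1)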

Pandas Pivot - Swap 'columns' and 'values'

I have a working pivot based upon the following code:
pd.pivot_table(df,
               index=["row_a", "row_b"],
               columns=["col_a"],
               values=["metric_a", "metric_b"],
               aggfunc={"metric_a": np.max, "metric_b": np.sum})
Based upon that code, I correctly receive the below output.
However, I would like to essentially swap the column with the metric to receive the below output. Is this possible?
I think all you need is a call to pandas.DataFrame.swaplevel after the initial pivot, followed by sorting the columns to group the top level (level=0):
# Assuming df holds the result of the pivot
df.swaplevel(0, 1, axis=1).sort_index(axis=1)
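End to end, that might look like the following sketch (the data is made up; only the column names follow the question):

import numpy as np
import pandas as pd

# Invented data reusing the question's column names
df = pd.DataFrame({
    "row_a": ["x", "x", "y", "y"],
    "row_b": ["m", "n", "m", "n"],
    "col_a": ["c1", "c2", "c1", "c2"],
    "metric_a": [1, 2, 3, 4],
    "metric_b": [10, 20, 30, 40],
})

pivot = pd.pivot_table(df,
                       index=["row_a", "row_b"],
                       columns=["col_a"],
                       values=["metric_a", "metric_b"],
                       aggfunc={"metric_a": np.max, "metric_b": np.sum})

# Move the metric level under col_a, then sort so each col_a value groups together
swapped = pivot.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(swapped)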

reindexing dataframes replaces all my data with NaNs, why?

So I was investigating how some commands from Pandas work, and I ran into this issue; when I use the reindex command, my data is replaced by NaN values. Below is my code:
>>>import pandas as pd
>>>import numpy as np
>>>frame1=pd.DataFrame(np.arange(365))
then, I give it an index of dates:
>>>frame1.index=pd.date_range(pd.datetime(2017, 4, 6), pd.datetime(2018, 4, 5))
then I reindex:
>>>broken_frame=frame1.reindex(np.arange(365))
aaaand all my values are erased. This example isn't particularly useful, but it happens any and every time I use the reindex command, seemingly regardless of context. Similarly, when I try to join two dataframes:
>>>big_frame=frame1.join(pd.DataFrame(np.arange(365)), lsuffix='_frame1')
all of the values in the frame being attached (np.arange(365)) are replaced with NaNs before the frames are joined. If I had to guess, I would say this is because the second frame is reindexed as part of the joining process, and reindexing erases my values.
What's going on here?
From the Docs
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
Emphasis my own.
You want either set_index
frame1.set_index(np.arange(365))
Or do what you did in the first place
frame1.index = np.arange(365)
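To make the difference concrete, here is a small reproduction using the same numbers as the question:

import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(365))
frame1.index = pd.date_range("2017-04-06", "2018-04-05")

# reindex aligns on the existing labels; none of the integers 0..364 exist
# in the DatetimeIndex, so every value comes back as NaN
print(frame1.reindex(np.arange(365)).head())

# set_index (or assigning .index) swaps the labels without aligning,
# so the data survives
print(frame1.set_index(np.arange(365)).head())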
I did not find the answer helpful in relation to what I think the question is getting at, so I am adding to this.
The key is that the initial dataframe must have the same index that you are reindexing on for this to work. Even the names must be the same! So if your new MultiIndex has no names, your initial dataframe must also have no names.
m = pd.MultiIndex.from_product([df['col'].unique(),
                                pd.date_range(df.date.min(),
                                              df.date.max() + pd.offsets.MonthEnd(1),
                                              freq='M')])
df = df.set_index(['col', 'date']).rename_axis([None, None])
df.reindex(m)
Then you will preserve your initial data values and reindex the dataframe.
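As a sketch with invented data (the 'col' and 'date' names follow the snippet above), the reindex then keeps the existing values and fills only the genuinely missing combinations with NaN:

import pandas as pd

# Invented frame matching the 'col'/'date' names used above
df = pd.DataFrame({"col": ["A", "A", "B"],
                   "date": pd.to_datetime(["2021-01-31", "2021-02-28", "2021-01-31"]),
                   "val": [1, 2, 3]})

m = pd.MultiIndex.from_product([df["col"].unique(),
                                pd.date_range(df.date.min(),
                                              df.date.max() + pd.offsets.MonthEnd(1),
                                              freq='M')])

# Unnamed index on both sides, so the existing values are preserved
df = df.set_index(["col", "date"]).rename_axis([None, None])
print(df.reindex(m))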
