reindexing dataframes replaces all my data with NaNs, why? - python-3.x

So I was investigating how some commands from Pandas work, and I ran into this issue: when I use the reindex command, my data is replaced by NaN values. Below is my code:
>>>import pandas as pd
>>>import numpy as np
>>>frame1=pd.DataFrame(np.arange(365))
then, I give it an index of dates:
>>>frame1.index=pd.date_range(pd.datetime(2017, 4, 6), pd.datetime(2018, 4, 5))
then I reindex:
>>>broken_frame=frame1.reindex(np.arange(365))
aaaand all my values are erased. This example isn't particularly useful, but it happens any and every time I use the reindex command, seemingly regardless of context. Similarly, when I try to join two dataframes:
>>>big_frame=frame1.join(pd.DataFrame(np.arange(365)), lsuffix='_frame1')
all of the values in the frame being attached (np.arange(365)) are replaced with NaNs before the frames are joined. If I had to guess, I would say this is because the second frame is reindexed as part of the joining process, and reindexing erases my values.
What's going on here?

From the Docs
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False
Emphasis my own.
You want either set_index
frame1.set_index(np.arange(365))
Or do what you did in the first place
frame1.index = np.arange(365)
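To make the quoted behavior concrete, here is the question's own example next to the fix (a minimal sketch): reindex looks each new label up in the old index, while set_index or plain assignment simply replaces the labels.
import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(365))
frame1.index = pd.date_range('2017-04-06', '2018-04-05')

# reindex looks up every new label in the existing DatetimeIndex;
# none of the integers 0..364 are dates, so every value becomes NaN
broken_frame = frame1.reindex(np.arange(365))
print(broken_frame.head())   # all NaN

# set_index (or direct assignment) replaces the labels without any lookup
fixed_frame = frame1.set_index(np.arange(365))
print(fixed_frame.head())    # original values preserved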

I did not find the answer helpful in relation to what I think the question is getting at, so I am adding to this.
The key is that the initial dataframe must have an index of the same form as the one you are reindexing on for this to work. Even the names must be the same! So if your new MultiIndex has no names, your initial dataframe's index must also have no names.
m = pd.MultiIndex.from_product([df['col'].unique(),
                                pd.date_range(df.date.min(),
                                              df.date.max() + pd.offsets.MonthEnd(1),
                                              freq='M')])
df = df.set_index(['col', 'date']).rename_axis([None, None])
df.reindex(m)
Then you will preserve your initial data values and reindex the dataframe.
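As a self-contained illustration of that recipe (the frame, the column names col and date, and the values are made up for the example):
import pandas as pd

# hypothetical input: one value per (category, month-end) pair
df = pd.DataFrame({'col': ['a', 'a', 'b'],
                   'date': pd.to_datetime(['2021-01-31', '2021-02-28', '2021-01-31']),
                   'value': [1, 2, 3]})

# target index: every category crossed with every month-end in the range
m = pd.MultiIndex.from_product([df['col'].unique(),
                                pd.date_range(df.date.min(),
                                              df.date.max() + pd.offsets.MonthEnd(1),
                                              freq='M')])

# rename_axis([None, None]) strips the level names so they match m,
# which has none (see the note about names above)
out = df.set_index(['col', 'date']).rename_axis([None, None]).reindex(m)
print(out)   # existing values kept, missing combinations filled with NaN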

Related

Pandas Dataframe: inplace column substitution vs creating new dataframe with transformed column

Whenever I want to transform an existing column of a dataframe, I tend to use apply/transform, which gives me an altogether new series and does not modify the existing column in the dataframe.
Suppose the following code performs an operation on a column and returns me a series.
new_col1 = df.col1.apply(...)
After this I have two ways of substituting the new series in the dataframe
modifying the existing col1:
df.col1 = new_col1
Or creating a new dataframe with the transformed column:
df.drop(columns=['col1']).join(new_col1)
I ask this because whenever I use mutable data structures in Python, like lists, I always try to create new lists using list comprehensions rather than in-place substitution.
Is there any benefit to following this style with pandas dataframes? Which is more pythonic, and which of the two approaches do you recommend?
Since you are modifying an existing column, the first approach would be faster. Remember that both drop and join return a copy of the data, so the second approach can be expensive if you have a big data frame with many columns.
Whenever you want to make changes to the original data frame itself, consider using the inplace=True argument in functions like drop/join, which by default return a new copy.
NOTE: Please keep in mind the cons of inplace:
inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
inplace does not work with method chaining
inplace is a common pitfall for beginners, so removing this option will simplify the API
SOURCE: In pandas, is inplace = True considered harmful, or not?
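For reference, a minimal sketch of the two approaches side by side (the frame, the column name col1, and the transformation are illustrative):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

# the transformed series; apply on df.col1 keeps the name 'col1'
new_col1 = df.col1.apply(lambda x: x * 10)

# approach 1: overwrite the existing column in place
df1 = df.copy()
df1['col1'] = new_col1

# approach 2: build a new frame by dropping and re-joining the column
df2 = df.drop(columns=['col1']).join(new_col1)

# both hold the same data, though approach 2 moves col1 to the end
print(df1)
print(df2)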

Iterating through each row of a single column in a Python geopandas (or pandas) DataFrame

I am trying to iterate through this very large DataFrame that I have in python. The thing is, I only want to pull out data from one specific column that contains the names of a bunch of counties.
I have tried to use iteritems(), itertuples(), and iterrows() to no avail.
Any suggestions on how to do this?
My end goal is to have a nested dictionary with each internal dictionary's key being a name from the DataFrame column.
I also tried the method below to select a single column, but it only prints the name of the column, not its contents.
for county in map_datafile[['NAME']]:
    print(county)
If you delete one pair of square brackets, you get a Series that is iterable:
for county in map_datafile['NAME']:
    print(county)
See the difference:
print(type(map_datafile[['NAME']]))
# pandas.core.frame.DataFrame
print(type(map_datafile['NAME']))
# pandas.core.series.Series
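Since the stated end goal is a nested dictionary keyed by the county names, here is one possible sketch (the contents of map_datafile are made up):
import pandas as pd

# hypothetical stand-in for map_datafile
map_datafile = pd.DataFrame({'NAME': ['Adams', 'Brown', 'Clark'],
                             'population': [18000, 42000, 27000]})

# iterate over the Series and build one inner dict per county
county_info = {}
for county in map_datafile['NAME']:
    county_info[county] = {}   # fill with whatever per-county data you need

# or pull in other columns at the same time with iterrows()
county_info = {row['NAME']: {'population': row['population']}
               for _, row in map_datafile.iterrows()}
print(county_info)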

Spark DataFrame how to change permutation of one column without join [duplicate]

This question already has answers here: Updating a dataframe column in spark (5 answers). Closed 3 years ago.
I am trying to use Pyspark to permute a column in a dataframe, aka shuffle all values for a single column across rows.
I am trying to avoid the solution where the column gets split off and given an index column before being joined back to the original dataframe (which also gets an added index column), primarily because of my understanding (which could be very wrong) that joins are expensive at runtime for a large dataset (millions of rows).
# for some dataframe spark_df
new_df = spark_df.select(colname).sort(colname)
new_df.show() # column values sorted nicely
spark_df.withColumn("ha", new_df[colname]).show()
# column "ha" no longer sorted and has same permutation as spark_df.colname
Thanks for any guidance in helping me understand this, I am a complete beginner with this :)
Edit: Sorry if I was being unclear in the question; I just wanted to replace a column with the sorted version of it without doing a join. Thank you for pointing out that dataframes are not mutable, but even doing spark_df.withColumn("ha", spark_df.select(colname).sort(colname)[colname]).show() shows column 'ha' as having the same permutation as 'colname', while sorting the column on its own shows a different permutation. The question is mainly about why the permutation stays the same in the new column 'ha', not about how to replace a column. Thanks again! (Also changed the title to better reflect the question.)
Spark dataframes and RDDs are immutable. Every time you make a transformation, a new one is created. Therefore, when you do new_df = spark_df.select(colname).sort(colname), spark_df remains unchanged. Only new_df is sorted. This is why spark_df.withColumn("ha", new_df[colname]) returns an unsorted dataframe.
Try new_df.withColumn("ha", new_df[colname]) instead.
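A short PySpark sketch of the immutability point (the column name and sample values are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(3,), (1,), (2,)], ["colname"])

# sort() returns a *new* dataframe; spark_df itself is untouched
sorted_df = spark_df.select("colname").sort("colname")
spark_df.show()    # still 3, 1, 2
sorted_df.show()   # 1, 2, 3

# adding the sorted column back works when both columns
# come from the same (sorted) dataframe
sorted_df.withColumn("ha", sorted_df["colname"]).show()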

Dask Dataframe View Entire Row

I want to see the entire row of a dask dataframe without the fields being cut off. In pandas the command is pd.set_option('display.max_colwidth', -1); is there an equivalent for dask? I was not able to find anything.
You can import pandas and use pd.set_option() and Dask will respect pandas' settings.
import pandas as pd
# Don't truncate text fields in the display
# (newer pandas versions want None here; -1 was the older spelling)
pd.set_option("display.max_colwidth", None)
dd.head()  # "dd" here being your Dask DataFrame
And you should see the long columns. It 'just works.'
Dask does not normally display the data in a dataframe at all, because it represents lazily-evaluated values. You may want to get a specific row by index, using the .loc accessor (same as in Pandas, but only efficient if the index is known to be sorted).
If you meant to get the whole list of columns only, you can get this by the .columns attribute.
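A runnable sketch (the data and column names are made up) showing the pandas option taking effect on a Dask dataframe's head():
import pandas as pd
import dask.dataframe as dd

pd.set_option("display.max_colwidth", None)   # don't truncate long text fields

pdf = pd.DataFrame({'id': [1, 2],
                    'notes': ['a very long free-text field ' * 5,
                              'another long description ' * 5]})
ddf = dd.from_pandas(pdf, npartitions=1)

# .head() computes a small pandas DataFrame, so the pandas display option applies
print(ddf.head())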

Issue with dropping columns

I'm trying to read in a data set and drop its first two columns, but it seems to be dropping the wrong columns. I was looking at this thread, but its suggestion is not giving the expected answer. My data set starts with 6 columns, and I need to remove the first two. Other threads give the option of dropping columns by label, but I would prefer not to name columns only to drop them if I can do it in one step.
df = pd.read_excel('Data.xls', header=17, footer=246)
df.drop(df.columns[[0,1]], axis=1, inplace=True)
But it is dropping columns 4 and 5 instead of the first two. Is there something with the drop function that I'm just completely missing?
If I understand your question correctly, you have a multilevel index, so dropping columns [0, 1] starts counting from the non-index columns.
If you know the position of the columns, why not select them directly, for example:
df = df.iloc[:, 3:]
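On a plain single-level frame, both positional approaches drop the first two columns and agree; a small sketch (column names are made up):
import pandas as pd
import numpy as np

# toy frame with six columns, mirroring the question's shape
df = pd.DataFrame(np.arange(18).reshape(3, 6),
                  columns=['a', 'b', 'c', 'd', 'e', 'f'])

# drop the first two columns by position...
dropped = df.drop(df.columns[[0, 1]], axis=1)

# ...or keep everything from the third column onwards
sliced = df.iloc[:, 2:]

print(dropped.equals(sliced))   # True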
