This question already has answers here:
How to convert index of a pandas dataframe into a column
(9 answers)
Closed 1 year ago.
I'm not sure where I've gone astray, but I cannot seem to reset the index on a dataframe.
When I run test.head(), I get the output below:
As you can see, the dataframe is a slice of a larger one, so its index is not sequential.
What I'd like to do is to reset the index for this dataframe. So I run test.reset_index(drop=True). This outputs the following:
That looks like a new index, but it isn't: running test.head() again shows the index is still the same. Attempting to use apply with a lambda, or iterrows(), creates problems with the dataframe.
How can I really reset the index?
reset_index by default does not modify the DataFrame; it returns a new DataFrame with the reset index. If you want to modify the original, use the inplace argument: df.reset_index(drop=True, inplace=True). Alternatively, assign the result of reset_index by doing df = df.reset_index(drop=True).
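For illustration, a minimal sketch with made-up data showing both options:
import pandas as pd

test = pd.DataFrame({'a': [1, 2, 3]}, index=[17, 42, 99])  # e.g. the index left over from a slice

# Option 1: assign the DataFrame that reset_index returns
test = test.reset_index(drop=True)

# Option 2: modify the original in place
# test.reset_index(drop=True, inplace=True)

print(test.index)  # RangeIndex(start=0, stop=3, step=1)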
BrenBarn's answer works.
The following, via this thread, also worked; it isn't so much troubleshooting as a plain statement of how to reset the index:
test = test.reset_index(drop=True)
As an extension of in code veritas's answer... instead of doing del at the end:
test = test.reset_index()
del test['index']
You can set drop to True.
test = test.reset_index(drop=True)
I would add to in code veritas's answer:
If you already have an index column specified, then you can save the del, of course. In my hypothetical example:
df_total_sales_customers = pd.DataFrame({'Sales': total_sales_customers['Sales'],
                                          'Customers': total_sales_customers['Customers']},
                                         index=total_sales_customers.index)
df_total_sales_customers = df_total_sales_customers.reset_index()
I'm having trouble dropping the first NaN value in my Series.
I took the difference between the original Series and the same Series shifted by one period. This is how I calculated it:
x[c] = x[c] - x[c].shift(periods=1)
When I try to drop the first value using these methods:
x[c].drop(labels=[0])
x[c].dropna()
x[c].iloc[1:]
None of them works for me when I reassign:
# these are not used all together, but separately
x[c] = x[c].dropna()
x[c] = x[c].drop(labels=['1981-09-29'])
x[c] = x[c][1:]
print(x[c])
Date
1981-09-29 NaN
1981-09-30 -0.006682
1981-10-01 -0.014575
1981-10-02 -0.004963
1981-10-05 -0.004963
However, when I call the drop or dropna function in a print statement, it works!
print(x[c].dropna())
Date
1981-09-30 -0.006682
1981-10-01 -0.014575
1981-10-02 -0.004963
1981-10-05 -0.004963
1981-10-06 -0.005514
It doesn't matter which method; I just want to get rid of the first element in my Series.
Please help.
The dataframe has multiple Series, so if I reassign just one of them it still shows a NaN, because all Series in a dataframe need to be the same length. Therefore I need to call dropna on the dataframe after the calculation is performed.
Try this?
x[c].dropna(inplace=True)
print(x[c])
If you know it is the first element you don't want, the simplest way is to use series.iloc[1:] or dataframe.iloc[1:, :].
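A minimal sketch of that, with made-up numbers shaped like the output above ('c' stands in for one of the columns):
import pandas as pd

x = pd.DataFrame(
    {'c': [100.0, 99.3, 97.9, 97.4]},
    index=pd.to_datetime(['1981-09-29', '1981-09-30', '1981-10-01', '1981-10-02']),
)
x['c'] = x['c'] - x['c'].shift(periods=1)  # the first row becomes NaN

# slice off the first row of the whole dataframe so every column keeps the same length
x = x.iloc[1:]
print(x)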
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
import pandas as pd

# loop over all UserNbr:
#   consolidate specialty fields into dict-like sets (to remove redundant codes);
#   output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
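For anyone following the same route, here is a rough sketch of the aggregation in PySpark using groupBy with collect_set rather than the Window functions mentioned above; the column names SpcltyCode and StartDt and the one-specialty-per-row layout are assumptions for illustration, not taken from the original data:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [('u1', '104', '2010-01-31'), ('u1', '215', '2014-11-21'), ('u2', '352', '2016-07-13')],
    ['UserNbr', 'SpcltyCode', 'StartDt'],
)

# one row per UserNbr, with the unique (code, date) pairs collected into a set
sdf_out = (
    sdf.groupBy('UserNbr')
       .agg(F.collect_set(F.struct('SpcltyCode', 'StartDt')).alias('Spclty'))
)
sdf_out.show(truncate=False)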
This question already has answers here:
Pandas: get second character of the string, from every row
(2 answers)
Closed 4 years ago.
I have a data frame and want to parse the 9th character into a second column. I'm missing the syntax somewhere though.
# develop the data
import pandas as pd

df = pd.DataFrame(columns=["vin"],
                  data=['LHJLC79U58B001633', 'SZC84294845693987', 'LFGTCKPA665700387', 'L8YTCKPV49Y010001',
                        'LJ4TCBPV27Y010217', 'LFGTCKPM481006270', 'LFGTCKPM581004253', 'LTBPN8J00DC003107',
                        '1A9LPEER3FC596536', '1A9LREAR5FC596814', '1A9LKEER2GC596611', '1A9L0EAH9C596099',
                        '22A000018'])
df['manufacturer'] = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'D', 'D']
def check_digit(df):
    df['check_digit'] = df['vin'][8]
    print(df['check_digit'])
For some reason, this puts the 8th row VIN in every line.
In your code, this line:
df['check_digit'] = df['vin'][8]
is only selecting the single element at index 8 of the column 'vin'. Try this instead:
for i in range(len(df['vin'])):
    df.loc[i, 'check_digit'] = df['vin'][i][8]
As a rule of thumb, whenever you are stuck, simply check the type of the variable returned. It solves a lot of small problems.
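For example, checking the types here makes the mistake visible (using the df built in the question):
print(type(df['vin']))     # <class 'pandas.core.series.Series'> -- the whole column
print(type(df['vin'][8]))  # <class 'str'> -- a single scalar, which then fills every row of 'check_digit'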
EDIT: As pointed out by @Georgy in the comments, using a loop wouldn't be Pythonic, and a more efficient way of solving this would be:
df['check_digit'] = df['vin'].str[8]
The .str does the trick. For future reference on that, I think you would find this helpful.
The correct way is:
def check_digit(df):
    df['check_digit'] = df['vin'].str[8]
    print(df)
This question is somewhat similar to: Remap values in pandas column with a dict. However, the answers there are quite dated and do not cover the "SettingWithCopyWarning".
I am simply trying to replace the original strings in a column, "col", of my dataframe, "df", using a dictionary, "q_names_dict". Here is my code, which successfully replaces the values:
temp_series = df.loc[:, col].copy()
for name in temp_series:
    for old, new in q_names_dict.items():
        if old.lower() == name.lower():
            temp_series.replace(name, new, inplace=True)
However, when I attempt to update my original dataframe with this copy, "temp_series", I get a "SettingWithCopyWarning". Here is the code which throws that warning:
df.loc[:,col] = temp_series
# The bottom three don't work either.
#df[col] = temp_series
#df.loc[:,col].update(temp_series)
#df[col].update(temp_series)
By changing the deep copy to a shallow copy, the changes carry through to the original dataframe, "df". As the docs note, a shallow copy shares the underlying data with the original: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html
Here is the code that enables shallow copy:
temp_series = df.loc[:,col].copy(deep=False)
Therefore, I don't have to explicitly update my dataframe after using the above code. This also prevents the SettingWithCopyWarning.
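As a side note, a sketch of a different route that avoids the extra copy altogether, swapping in Series.map (lower_map is an assumed helper name, not from the original code): build a lowercased mapping from q_names_dict once and apply it to the column, keeping any value that has no match.
lower_map = {old.lower(): new for old, new in q_names_dict.items()}
df[col] = df[col].map(lambda s: lower_map.get(s.lower(), s))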
Some background: My code takes user input and applies it to my DF to remove certain rows. This process repeats as many times as the user would like. Unfortunately, I am not sure how to update my DF within the while loop I have created so that it keeps the changes being made:
import pandas as pd

data = {'hello': ['the man', 'is a', 'good guy']}
df = pd.DataFrame(data)

def func():
    while True:
        n = input('Words: ')
        if n == "Done":
            break
        elif n != "Done":
            pattern = '^'+''.join('(?=.*{})'.format(word) for word in n.split())
            df[df['hello'].str.contains(pattern)==False]
How do I update the DF at the end of each loop so the changes being made stay put?
Ok, I reevaluated your problem and my old answer was totally wrong of course.
What you want is the DataFrame.drop method, which can be done in place. Note that drop takes index labels, so pass it the index of the rows you want removed:
mask = df['hello'].str.contains(pattern)
df.drop(df.index[mask], inplace=True)
This will update your DataFrame.
Looks to me like you've already done all the hard work, but there are two problems.
Your last line doesn't store the result anywhere. Most Pandas operations are not "in-place", which means you have to store the result somewhere to be able to use it later.
df is a global variable, and setting its value inside a function doesn't work, unless you explicitly have a line stating global df. See the good answers to this question for more detail.
So I think you just need to do:
df = df[df['hello'].str.contains(pattern)==False]
to fix problem one.
For problem two, give func a df parameter and return df at the end; then call it like:
df = func(df)
OR, start func with the line
global df
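Putting it together with the return-based option, a minimal sketch (the pattern logic is copied from the question):
import pandas as pd

df = pd.DataFrame({'hello': ['the man', 'is a', 'good guy']})

def func(df):
    while True:
        n = input('Words: ')
        if n == "Done":
            break
        pattern = '^' + ''.join('(?=.*{})'.format(word) for word in n.split())
        df = df[df['hello'].str.contains(pattern) == False]  # store the filtered result
    return df

df = func(df)
print(df)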