How to combine dataframes - python-3.x

I have 2 data frames (final_combine_df & acs_df) that share a column ('CBG'). Dataframe acs_df has 2 additional columns that I want to add to the combined dataframe (acs_total_persons & acs_total_housing_units). For the 'CBG' column values in acs_df that match those in final_combine_df, I want to add the acs_total_persons & acs_total_housing_units values to that row.
acs_df.head()
           CBG  acs_total_persons  acs_total_housing_units
0  10010211001             1925.0                   1013.0
1  10030114011             2668.0                   1303.0
2  10070100043              930.0                    532.0
3  10139534001             1570.0                    763.0
4  10150021023             1059.0                    379.0
I tried combine_acs_merge = pd.concat([final_combine,acs_df], sort=True) but it did not seem to match them up. I also tried combine_acs_merge = final_combine.merge(acs_df, on='CBG') and got
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
What do I need to do here?
Note: Column acs_df['CBG'] is type numpy.float64, not a string, but it should still return a match. Oddly, when I run the following: print(acs_df.loc[acs_df['CBG'] == '01030114011']) it returns an empty dataframe. I created acs_df from a csv file (see below). Is that creating the problem?
acs_df = pd.read_csv(acs_data)
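The ValueError happens because the two 'CBG' columns have different dtypes (one is an object/string column, the other numeric), so merge refuses to pair them. A minimal sketch of one way to align the key columns before merging, assuming the codes should be compared as zero-padded strings (frame and column names are taken from the question):
import pandas as pd

# Put both key columns in the same representation before merging.
# Zero-padding to 12 characters assumes full census block group GEOIDs;
# adjust the width if the real codes are shorter.
acs_df['CBG'] = acs_df['CBG'].astype('int64').astype(str).str.zfill(12)
final_combine['CBG'] = final_combine['CBG'].astype(str).str.zfill(12)

# A left merge keeps every row of final_combine and adds the two ACS
# columns wherever a matching CBG exists.
combine_acs_merge = final_combine.merge(
    acs_df[['CBG', 'acs_total_persons', 'acs_total_housing_units']],
    on='CBG',
    how='left',
)
Reading the CSV with pd.read_csv(acs_data, dtype={'CBG': str}) would also keep the codes as strings (and preserve any leading zeros) in the first place.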

Related

Python Pandas, Try to update cell value

I have 2 dataframes, both with a date column.
I need to set, in the first dataframe, the value of a specific column found in the second dataframe.
So first of all I find the correct row of the first dataframe with:
id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0]) #example: 910
and then I want to update the value of column ['search_volume'] at this row: 910
I will do this with:
dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
but I get back this error:
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
My full code is below, but it is not working and nothing is updated.
for x in range(len(searchs)):
    id_row = int(dataset.loc[dataset["time"] == str(searchs.index[x])].index[0])
    dataset['search_volume'][id_row] = searchs[kw_list[0]][x]
It works fine if I manually test the update with:
dataset['search_volume'][910] = searchs[kw_list[0]][47]
What's happening?!
Use .loc:
dataset.loc[910, 'search_volume'] = searchs.loc[47, kw_list[0]]
For more info about the error message, see the pandas documentation link included in the warning above.
Also, there are way more efficient methods for doing this. As a rule of thumb, if you are looping over a dataframe, you are generally doing something wrong. Some potential solutions: pd.DataFrame.join, pd.merge, masking, pd.DataFrame.where, etc.
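As a sketch of one such vectorized approach (my variation, not the asker's code; it assumes dataset, searchs and kw_list are the objects from the question and that dataset['time'] holds the stringified timestamps of searchs.index):
import pandas as pd

# Build a lookup Series keyed by the stringified index of `searchs`,
# then map it onto `dataset` in one pass instead of looping row by row.
lookup = pd.Series(searchs[kw_list[0]].values, index=searchs.index.astype(str))
dataset['search_volume'] = dataset['time'].map(lookup).fillna(dataset['search_volume'])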

Change the column value based on multiple columns

I am developing code for searching for a keyword in the given data.
For example, I have data in column A and I want to find whether a
substring is present in each row; if yes, give me that keyword against
the data, and if no keyword is present, give me 'blank'.
import pandas as pd
data = pd.read_excel("C:/Users/606736.CTS/Desktop/Keyword.xlsx")
# dropping null value columns to avoid errors
data.dropna(inplace = True)
# Converting the column to uppercase
data["Uppercase"]= data["Skill"].str.upper()
# Below are the keywords I want to search for in the data
sub =['MEMORY','PASSWORD','DISK','LOGIN','RESET']
# I have used the below code, which is creating multiple columns &
# giving me the boolean output
for keyword in sub:
    data[keyword] = data.astype(str).sum(axis=1).str.contains(keyword)
# what I want is: search for the keyword; if it exists, give me the keyword
# name, else blank
Try this:
import numpy as np

data['Keyword'] = np.nan
for i in sub:
    data.loc[data['Uppercase'].apply(lambda x: i in x.split(' ')) & data['Keyword'].isna(), 'Keyword'] = i
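A vectorized alternative is also possible; the sketch below (my variation, not part of the original answer) builds a whole-word regex from sub and uses str.extract, leaving an empty string where no keyword matches:
# Pattern like r'\b(MEMORY|PASSWORD|DISK|LOGIN|RESET)\b'
pattern = r'\b(' + '|'.join(sub) + r')\b'

# str.extract returns the first keyword found in each row, or NaN if none.
data['Keyword'] = data['Uppercase'].str.extract(pattern, expand=False).fillna('')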

Append each value in a DataFrame to a np vector, grouping by column

I am trying to create a list, which will be fed as input to the neural network of a Deep Reinforcement Learning model.
What I would like to achieve:
This list should have the properties of this code's output
vec = []
lines = open("data/" + "GSPC" + ".csv", "r").read().splitlines()
for line in lines[1:]:
    vec.append(float(line.split(",")[4]))
i.e. just a flat list of values.
The original dataframe looks like:
Out[0]:
   Close     sma15
0  1.26420  1.263037
1  1.26465  1.263193
2  1.26430  1.263350
3  1.26450  1.263533
but by using df.transpose() I obtained the following:
          0         1        2        3
Close   1.264200  1.264650  1.26430  1.26450
sma15   1.263037  1.263193  1.26335  1.263533
from here I would like to obtain a list grouped by column, of the type:
[1.264200, 1.263037, 1.264650, 1.263193, 1.26430, 1.26335, 1.26450, 1.263533]
I tried
x = np.array(df.values.tolist(), dtype = np.float32).reshape(1,-1)
but this gives me a float array with 1 row and 6 columns; how could I achieve a result that has the properties I am looking for?
From what I can understand, you just want a flattened version of the DataFrame's values. That can be done simply with the ndarray.flatten() method rather than reshaping it.
# Creating your DataFrame object
a = [[1.26420, 1.263037],
     [1.26465, 1.263193],
     [1.26430, 1.263350],
     [1.26450, 1.263533]]
df = pd.DataFrame(a, columns=['Close', 'sma15'])
df.values.flatten()
This gives array([1.2642, 1.263037, 1.26465, 1.263193, 1.2643, 1.26335, 1.2645, 1.263533]) as is (presumably) desired.
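Equivalently, on newer pandas versions you can write df.to_numpy().ravel(); and if you ever want the values grouped column by column instead of interleaved row by row, pass order='F' (a small aside, not part of the original question):
df.to_numpy().ravel()            # row-major: Close, sma15, Close, sma15, ...
df.to_numpy().ravel(order='F')   # column-major: all Close values, then all sma15 values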
PS: I am not sure why you have not included the last row of the DataFrame as the output of your transpose operation. Is that an error?

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique dicts like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
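For anyone who wants to stay in pandas, a groupby-based sketch is below. It is not the asker's solution; the small frame is hypothetical and only mirrors the structure described in the question, with Spclty holding lists of [code, date] pairs:
import pandas as pd

# Hypothetical frame shaped like the question's data.
df_tmp = pd.DataFrame({
    'UserNbr': [101, 101, 202],
    'Spclty': [
        [['104', '2010-01-31'], ['215', '2014-11-21']],
        [['215', '2014-11-21'], ['352', '2016-07-13']],
        [['104', '2010-01-31']],
    ],
})

def combine_spclty(group):
    # Merge every [code, date] pair for one user into a single dict,
    # which also drops codes repeated across rows.
    combined = {}
    for pairs in group:
        combined.update(dict(pairs))
    return combined

df_out = (
    df_tmp.groupby('UserNbr')['Spclty']
          .apply(combine_spclty)
          .reset_index()
)
# df_out['Spclty'] now holds one dict per UserNbr, e.g.
# {'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}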

Pandas: Merge on one column using EXACT match when column values overlap

I have two dataframes that I want to merge based on the values in one Column (SKU). The 'SKU' values are varied. For example, SKU values range from "00047" to "TPA20839". However, they are always alphanumeric.
When the dataframes are read from the csv files, I convert the 'SKU' column to strings so they merge on the same data type. The data merges correctly, EXCEPT when there are overlapping string values.
For example, there is a df_master['SKU'] value = "6748". There are two similar values in df_inv['SKU'] -> "6748" AND "9006748" (two different items: 6748 == 6748, but 9006748 != 6748).
This causes this row to NOT appear in the new dataframe.
I want an EXACT match, similar to the =MATCH('','',0) function in Excel. Can you help me achieve this?
df_master['SKU'] = df_master['SKU'].astype(str)
df_inv['SKU'] = df_inv['SKU'].astype(str)
df_new = pd.merge(df_inv, df_master, on='SKU')
df_new.to_csv('new-master.csv', sep=',', encoding='utf-8')
I think the trick may be to format the data type differently, but I'm not sure.
Try this:
vals_matched = []
haystacks = df_inv['SKU'].astype(str).tolist()
needles = df_master['SKU'].astype(str).tolist()
for needle in needles:
    for haystack in haystacks:
        if needle in haystack:
            vals_matched.append(needle)
            break
df_master = df_master[df_master.SKU.astype(str).isin(vals_matched)]
The break statement moves on to the next needle, that is, the next value you're trying to match; a single match between the two lists is sufficient.
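For what it's worth, pd.merge already compares whole key values, not substrings, so "6748" will never be paired with "9006748" by the merge itself; mismatches usually come from stray whitespace or differing dtypes. A sketch of a stricter check (names follow the question; indicator is standard pandas, the rest is an illustration):
# Normalize both key columns the same way, then let merge do exact matching.
df_master['SKU'] = df_master['SKU'].astype(str).str.strip()
df_inv['SKU'] = df_inv['SKU'].astype(str).str.strip()

# indicator=True adds a '_merge' column marking each row as 'both',
# 'left_only' or 'right_only', which makes unmatched SKUs easy to audit.
df_new = pd.merge(df_inv, df_master, on='SKU', how='outer', indicator=True)
unmatched = df_new[df_new['_merge'] != 'both']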
