How to add missing data from one dataframe to another? - python-3.x

I am working on a project that requires to fill out missing data from one Excel sheet to another. For example:
table A:
card name address zipcode
123 steve chicago 60601
321 Joy New York 10083
222 Andy San Francisco 43211
table B:
card name address zipcode
321 steve nan nan
123 Joy nan nan
123 nan nan nan
For this project, I need fill out table B according to table A. I do have idea about using Excel VLOOKUP function to fill out all of columns, but I guess if the number of data file getting huge in future, then I may use python to do this. (eg, same data format but from different branches)
In Python, the merge function can do this but it takes too much time. Is there any useful function in pandas, numpy, or any other third-party library that can help me do this? Thanks all!
Here is what I have tried:
df.merge(table A, table B, on = 'card', how = 'right')
it does work but I have to rename columns to match each features. And I also know we can do this on SQL very fast and effiency, just wanna do this on python :)

Of course pandas library can do this and more. I am currently writing a business intelligence program. And I do a lot of operations like this with pandas
There are many ways to do this, but since I don't see your code, you can do it in the simplest and most understandable way. Turn at the point where you are stuck.thank you
searchdata = Atabledata[['name','adress','zipcode']]
for i in search['name']:
Btabledata.loc[Btabledata['name']== i, Btabledata['adress']] = Atabledata['adress']
Btabledata.loc[Btabledata['name'] == i, Btabledata['zipcode']] = Atabledata['zipcode']
print(Btabledata)

Related

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

Combine multiple rows based on Id and other column using pandas python [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

How to understand python data frames better

Being a beginner in Python, I often face this problem - Let's say I am working with a data frame and want to execute an operation on one of the column. It can be just removing the decimal point from the value or maybe I want to take out the month from the date column. But often the solutions I would find online - it is generally shown with a single value or a data point like this:
a = 11.0
int(a)
11
Now, the same solution can't be applied if I have a data frame or a column. Again If I want to add time with date
d = date.today()
d
datetime.date(2018, 3, 30)
datetime.combine(d, datetime.min.time())
datetime.datetime(2018, 3, 30, 0, 0)
In the same manner, this solution can not be used for a data frame. That will throw an error. Obviously I have a lacking in knowledge here, I am not being able to make it work in terms of data frames. Can you please point me towards the topic which might help me understand these problems in terms of data frames ? or maybe show an example how its done ?
You should have a look at pandas library to manipulate dataframes : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
This is an exemple to apply a function for each value of a given column:
import pandas as pd
def myFunction(a_string):
return a_string.upper()
data = pd.read_csv('data.csv')
print(data)
data['City'] = data['City'].apply(myFunction)
print(data)
Data at beginning :
Name City Age
Robert Paris 32
Max Dallas 24
Raj Delhi 27
Data after:
Name City Age
Robert PARIS 32
Max DALLAS 24
Raj DELHI 27
Here myFunction uppercase the string but could be used the same way for other kind of operations.
Hope that helps.

Excel Vlookup Multiple Values

I am looking for a vlookup formula that returns multiple matches using two lookup values. I am currently trying to use the concatenate method, but I haven't quite figured it out. The table needs to return all of the multiple matches not just one. Currently, its only returning the last match.
For example, lets say I have a list of multiple city and states. The cities differ but the states remain the same obviously. I want to return the number of people in the each city.
City State #OfPeople
Albany NY 10
Orlando FL 5
Tampa FL 3
Seattle WA 1
Queens NY 8
So I concatenated the city and state column.
Join City State #OfPeople
Albany-NY Albany NY 10
Orlando-FL Orlando FL 5
Tampa-FL Tampa FL 3
Seattle-WA Seattle WA 1
Queens-NY Queens NY 8
The purpose of this is to create an updated log of people in each city has time progresses. I want to have a grand total amount of people in each column. (I know this requires another formula. I'm just focused on returning multiple matches for now). However, I don't want to overwrite the existing data. Hopefully, I explained this well. This is just an example of a larger project I'm working on. I need to be able to build on this list. That's why its important that I be able to return matches multiple times.
Join City State #OfPeople Total
Albany-NY Albany NY 10 10
Orlando-FL Orlando FL 5 15
Tampa-FL Tampa FL 3 18
Seattle-WA Seattle WA 1 19
Queens-NY Queens NY 8 27
Any help would be greatly appreciated!
Considering you're trying to get some grand totals based on multiple criteria, I would suggest using SUMIFS() / COUNTIFS() functions, rather than focusing on searching matching row itself.
However, if you need multiple criteria look up, for some reason, I believe INDEX() + MATCH() combination can perfectly do the job.
The table needs to return all of the multiple matches not just one.
Currently, its only returning the last match
You'll need to use SUMIFS() if there are multiple records for the same city/state combo in your people lookup.
=SUMIFS (sum_range, range1, criteria1, [range2], [criteria2], ...)
Let's assume that you have a cities tab and a people tab. Let's assume you have ten cities that you want to return the total amount of people from.
Cities Tab definition
City range: 'Cities'!A$1:A$10
State range: 'Cities'!B$1:B$10
People Tab definition
City range: 'People'!A$1:A$100
State range: 'People'!B$1:B$100
#OfPeople range: 'People'!C$1:C$100
Drop this formula in the first row of your cities tab, drag down the entire range of cities.
=SUMIFS('People'!C$1:C$100, 'Cities'!A$1, People'!A$1:A$100, 'Cities'!B$1, 'People'!B$1:B$100)

How to fill missing data in excel time series

I need a hand on this problem: In an Excel workbook I reported 10 time series (with monthly frequency) of 10 titles that should cover the past 15 years. Unfortunately, not all titles can cover the 15-year time series. For example, a title only goes up to 2003; So in the column of that title, I have the first 5 years with a "Not Available" instead of a value. Once I’have imported the data into Matlab, obviously, in the column of the title with the shorter series appears NaN where there are no values.
>> Prices = xlsread('PrezziTitoli.xls');
>> whos
Name Size Bytes Class Attributes
Prices 182x10 6360 double
My goal is to estimate the variance-covariance matrix, however, because of the lack of data, the calculation is not possible for me. I thought to an interpolation, before the calculation of the variance-covariance matrix, to cover the values that in Matlab return NaN, for example with a "fillts", but have difficulties in its use.
There is some code that can be useful to me? Can you help me?
Thanks!
Do you have the statistics toolbox installed? In that case, the solution is simple:
>> x = randn(10,4); // x is a 10x4 matrix of random numbers
>> x(randi(40,10,1)) = NaN; // set some random entries to NaN
>> disp(x)
-1.1480 NaN -2.1384 2.9080
0.1049 -0.8880 NaN 0.8252
0.7223 0.1001 1.3546 1.3790
2.5855 -0.5445 NaN -1.0582
-0.6669 NaN NaN NaN
NaN -0.6003 0.1240 -0.2725
-0.0825 0.4900 1.4367 1.0984
-1.9330 0.7394 -1.9609 -0.2779
-0.4390 1.7119 -0.1977 0.7015
-1.7947 -0.1941 -1.2078 -2.0518
>> nancov(x) // Compute covariances after removing all NaN rows
1.2977 0.0520 1.6248 1.3540
0.0520 0.5359 -0.0967 0.3966
1.6248 -0.0967 2.2940 1.6071
1.3540 0.3966 1.6071 1.9358
>> nancov(x, 'pairwise') // Compute covariances pairwise, ignoring NaNs
1.9195 -0.5221 1.4491 -0.0424
-0.5221 0.7325 -0.1240 0.2917
1.4491 -0.1240 2.1454 0.2279
-0.0424 0.2917 0.2279 2.1305
If you don't have the statistics toolbox, we need to think harder - let me know!

Resources