Using a for loop in pandas - excel

I have two tabular files in Excel format. I want to know whether an ID from the "ID" column of the first file exists in a specific column of the proteome file (take "IHD" for example) and, if so, to display the value associated with it. Is there a way to do this in pandas, possibly using a for loop?

After loading the Excel files with read_excel(), merge() the dataframes on the ID and protein columns. In pandas this is the recommended approach rather than looping.
import pandas as pd
clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')
clusters.merge(proteins, left_on='ID', right_on='protein')
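If you also want to see which IDs were not found, a left merge with indicator=True may help. The following is only a minimal sketch: the two small dataframes are made-up stand-ins for the Excel files, and it assumes the proteome table has a key column called protein (as in the answer above) plus an IHD column holding the value to display.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files
# (in practice these would come from pd.read_excel).
clusters = pd.DataFrame({"ID": ["P1", "P2", "P3"]})
proteome = pd.DataFrame({"protein": ["P1", "P3", "P4"],
                         "IHD": [0.8, 1.5, 2.1]})

# how="left" keeps every ID from the first table; indicator=True adds a
# "_merge" column telling whether a match was found in the second table.
result = clusters.merge(
    proteome[["protein", "IHD"]],
    left_on="ID",
    right_on="protein",
    how="left",
    indicator=True,
)

# True where the ID exists in the proteome table, False otherwise.
result["found"] = result["_merge"] == "both"
print(result[["ID", "IHD", "found"]])
```

IDs without a match keep their row, with NaN in the IHD column and found set to False.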

Related

How to fix NaN values in concatenated multiple tables in Jupyter Notebook

I am using Jupyter Notebook. I have concatenated multiple tables. When I run head(), I cannot see the values in the age and gender columns; instead it shows NaN against each user_id.
image_1 shows the output after I concatenated the two tables.
How can I sort out this issue, or can you suggest another way to concatenate the tables so that I can see all of the table values?
Or do I need to access the tables separately and apply operations to each one?
I am expecting to get the values in the age and gender columns rather than NaN.
When I use these tables separately, they show correct results, but I have a big data problem, so I need to concatenate the tables to access each feature column. In the end, I want to apply operations to the features of the concatenated table.
I have tested your problem with the two CSV files from your GitHub repository.
First, I imported pandas and loaded the two tables as 'df1' and 'df2'.
import pandas as pd
df1 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_1.csv')
df2 = pd.read_csv('https://raw.githubusercontent.com/HaseebTariq7/Data_analysis_dataset/main/data_2.csv')
Then you can merge both dataframes on the column they share, in this case 'user_id'.
final_df = pd.merge(df1, df2, on='user_id')
Finally, 'final_df' holds the information from both tables, without the NaN values.
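To see why concatenating gave NaN values while merging does not, here is a small sketch. The two tiny tables and their values are made up for illustration (the real files have more rows and possibly other columns); only the user_id, age and gender names come from the question.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
df1 = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 31, 47]})
df2 = pd.DataFrame({"user_id": [1, 2, 3], "gender": ["F", "M", "F"]})

# concat stacks the rows of df2 below df1. Each row only has the columns
# of the table it came from, so the other columns are filled with NaN.
stacked = pd.concat([df1, df2])

# merge lines the rows up on user_id, so each user gets one complete row.
merged = pd.merge(df1, df2, on="user_id")

print(stacked)
print(merged)
```

By default merge keeps only the user_id values present in both tables. Passing how='outer' keeps all of them, at the cost of NaN where one table has no matching row.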

Read excel using pandas and join two pandas dataframes without losing formatting styles

Initially I have two Excel files. The first input file has colors applied to some of its columns.
The other Excel file looks like this.
I have to join these two Excel files using openpyxl or xlsxwriter (Python libraries) or by any other method, and I don't want to lose the colors in the output file. The output file should look like the image below.
Please use the code below to create the pandas dataframes for the two input files.
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['rahul', 'raju', 'mohan', 'ram'],
                   'salary': [20000, 34000, 10000, 998765]
                   })
print(df)
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'state': ['gujrat', 'bhopal', 'mumbai', 'kolkata']
                    })
print(df1)

Create a Dataframe from an excel file

I want to create a DataFrame from an Excel file. I am using the pandas read_excel function. My requirement is to create a DataFrame containing all rows where a column matches some value.
For example, below is my Excel file, and I want to create the DataFrame with all rows that have Module equal to 'DC-Prod'.
Excel file image
Welcome, Saagar Sheth!
To make a DataFrame, first import pandas like so:
import pandas as pd
Then create a variable for the file to access, like this:
file_var_pandas = 'customer_data.xlsx'
Then create its dataframe using read_excel:
customers = pd.read_excel(file_var_pandas,
                          sheet_name=0,  # the keyword is sheet_name, not sheetname
                          header=0,
                          index_col=False,
                          keep_default_na=True
                          )
Finally, use head() to preview the first rows, like so:
customers.head()
if you want to know more just go to this website!
Packet Pandas Dataframe
and have fun!
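The steps above load the whole sheet; keeping only the rows where Module equals 'DC-Prod' can then be done with a boolean mask. This is a minimal sketch: the small dataframe stands in for whatever read_excel returns, and the Server column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the sheet; in practice this dataframe
# would be the result of pd.read_excel(...).
data = pd.DataFrame({
    "Module": ["DC-Prod", "DC-Test", "DC-Prod"],
    "Server": ["srv01", "srv02", "srv03"],
})

# data["Module"] == "DC-Prod" is a boolean Series; indexing with it
# keeps only the rows where the condition is True.
dc_prod = data[data["Module"] == "DC-Prod"].reset_index(drop=True)
print(dc_prod)
```

reset_index(drop=True) is optional; it just renumbers the remaining rows from 0.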

Dask Dataframe View Entire Row

I want to see the entire row of a Dask dataframe without the fields being cut off. In pandas the command is pd.set_option('display.max_colwidth', -1). Is there an equivalent for Dask? I was not able to find anything.
You can import pandas and use pd.set_option(), and Dask will respect pandas' settings.
import pandas as pd
# Don't truncate text fields in the display.
# None means "no limit"; recent pandas versions no longer accept -1 here.
pd.set_option("display.max_colwidth", None)
ddf.head()  # ddf is your Dask dataframe
And you should see the long columns. It 'just works.'
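The effect of the option can be checked with plain pandas, which is what head() on a Dask dataframe returns. This is a small sketch with a made-up one-column dataframe; it uses pd.option_context so the setting only applies inside the with block instead of globally.

```python
import pandas as pd

# A made-up dataframe with one long text cell.
long_text = "x" * 120
df = pd.DataFrame({"text": [long_text]})

# With a width limit of 50 characters the cell is cut off with "...".
with pd.option_context("display.max_colwidth", 50):
    truncated = repr(df)

# With None there is no width limit, so the whole cell is shown.
with pd.option_context("display.max_colwidth", None):
    full = repr(df)

print(truncated)
print(full)
```

pd.set_option changes the setting for the rest of the session, whereas option_context restores the previous value when the block ends.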
Dask does not normally display the data in a dataframe at all, because it represents lazily evaluated values. You may want to get a specific row by index, using the .loc accessor (same as in pandas, but only efficient if the index is known to be sorted).
If you only meant to get the whole list of columns, you can get it from the .columns attribute.

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 CSV files. I need only two columns from each CSV file, namely 'Date' and 'Close'. I tried using df.join inside the for loop, but it eats up a lot of memory and I get the error "Killed:9" after processing about 22-23 CSV files.
So now I am trying to build a list of dataframes with only those two columns inside the for loop, and then concat the dataframes outside the loop.
I have the following issues to resolve:
(i) Most of the CSV files start on 2000-01-01, but a few have later start dates. I want the main dataframe to have all the dates, with NaN or empty fields for the CSVs with a later start date.
(ii) I want to concat them with Date as the index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle","rb") as f:
        symbols=pickle.load(f)
    dfList=[]
    main_df=pd.DataFrame()
    for symbol in symbols:
        df=pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),infer_datetime_format=True,usecols=['Date','Close'],index_col=None,header=0)
        df.rename(columns={'Close':symbol}, inplace=True)
        dfList.append(df)
    main_df=pd.concat(dfList,axis=1,ignore_index=True,join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put the Date column in the index of each dataframe. Then, with pd.concat and axis=1, pandas uses its intrinsic data alignment to align all the dataframes on that index. Also drop ignore_index=True: with axis=1 it replaces the symbol column names with 0, 1, 2, and so on.
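Here is a minimal sketch of that approach. To keep it self-contained it reads two made-up CSV strings through io.StringIO instead of the real files; the symbols AAA and BBB and all prices are invented, and BBB deliberately starts one day later.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the per-symbol files.
csv_data = {
    "AAA": "Date,Open,Close\n2000-01-03,9.5,10.0\n2000-01-04,10.0,11.0\n2000-01-05,11.0,12.0\n",
    "BBB": "Date,Open,Close\n2000-01-04,19.0,20.0\n2000-01-05,20.0,21.0\n",
}

frames = []
for symbol, text in csv_data.items():
    # Read only Date and Close, and use Date (parsed as dates) as the index.
    df = pd.read_csv(io.StringIO(text), usecols=["Date", "Close"],
                     index_col="Date", parse_dates=True)
    # Name the single remaining column after the symbol.
    frames.append(df.rename(columns={"Close": symbol}))

# axis=1 places the frames side by side, aligned on the Date index;
# join="outer" keeps every date, filling gaps with NaN.
main_df = pd.concat(frames, axis=1, join="outer").sort_index()
print(main_df)
```

Because each frame holds a single column and the concat happens once, after the loop, this should also use far less memory than joining inside the loop.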
