How to access the first row of each group after using groupby - python-3.x

I am new to python and pandas and need help with the following.
I have some data in a dataframe on which I am using the groupby function. After grouping, I want to access the first row of each group and copy it to a different dataframe.
For example:
grouped = com.groupby(['Month'])
This gives me groups created month-wise. I just want to extract the first row from each of the 12 groups and copy those row details to another dataframe.
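A minimal sketch of one common approach (the Month/Sales data below is made up for illustration): groupby(...).head(1) keeps the complete first row of every group as a regular dataframe.
import pandas as pd

com = pd.DataFrame({'Month': ['Jan', 'Jan', 'Feb', 'Feb'],
                    'Sales': [10, 20, 30, 40]})

# head(1) keeps the complete first row of every Month group
first_rows = com.groupby('Month').head(1)

# first() would instead take each column's first non-null value per group,
# with Month moved into the index; reset_index() restores it as a column
# first_rows = com.groupby('Month').first().reset_index()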

Related

pyspark : Operate on column sets in parallel

If I were interested in working on groups of rows in Spark, I could, for example, add a column that denotes group membership and use a pandas groupby to operate on those rows in parallel (independently). Is there a way to do this for columns instead? I have a df with many feature columns and 1 target column, and I would like to run calculations on each set {column_i, target} for i=1...p.
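On the pandas side, one sketch of the idea (the column names here are made up, and the Spark parallelization itself is not shown): melt the feature columns into rows, so that each {column_i, target} pair becomes an ordinary row group that a row-wise groupby can process.
import pandas as pd

df = pd.DataFrame({'f1': [1, 2, 3],
                   'f2': [4, 5, 6],
                   'target': [0, 1, 0]})

# reshape wide -> long: one row per (feature, value) pair, target carried along
long_df = df.melt(id_vars='target', var_name='feature', value_name='value')

# each {column_i, target} set is now a row group; e.g. correlate each with target
results = long_df.groupby('feature').apply(lambda g: g['value'].corr(g['target']))
print(results)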

Using a for loop in pandas

I have 2 different tabular files, in Excel format. I want to know if an id number from one of the columns in the first Excel file (the "ID" column) exists in a specific column of the proteome file (take "IHD" for example) and, if so, to display the value associated with it. Is there a way to do this, specifically in pandas, possibly using a for loop?
After loading the Excel files with read_excel(), you should merge() the dataframes on the ID and protein columns. This is the recommended approach with pandas rather than looping.
import pandas as pd

clusters = pd.read_excel('clusters.xlsx')
proteins = pd.read_excel('proteins.xlsx')

# inner join (the default): keeps only the IDs found in both files
merged = clusters.merge(proteins, left_on='ID', right_on='protein')
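If you instead want to keep every row from clusters and just see where a match was found, a left merge is one option (a sketch following the column names above):
# keep all clusters rows; unmatched IDs get NaN in the proteins columns
merged = clusters.merge(proteins, left_on='ID', right_on='protein', how='left')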

Iterating through each row of a single column in a Python geopandas (or pandas) DataFrame

I am trying to iterate through this very large DataFrame that I have in python. The thing is, I only want to pull out data from one specific column that contains the names of a bunch of counties.
I have tried to use iteritems(), itertuples(), and iterrows() to no avail.
Any suggestions on how to do this?
My end goal is to have a nested dictionary with each internal dictionary's key being a name from the DataFrame column.
I also tried the method below to select a single column, but that only prints the name of the column, not its contents.
for county in map_datafile[['NAME']]:
    print(county)
If you delete one pair of square brackets, you get a Series that is iterable:
for county in map_datafile['NAME']:
    print(county)
See the difference:
print(type(map_datafile[['NAME']]))
# pandas.core.frame.DataFrame
print(type(map_datafile['NAME']))
# pandas.core.series.Series
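For the nested-dictionary end goal, a short sketch (the example data and the empty inner dictionaries are assumptions, since the question does not specify their contents):
import pandas as pd

map_datafile = pd.DataFrame({'NAME': ['Adams', 'Baker', 'Clark']})

# one outer key per county name; the inner dicts can be filled in later
county_data = {county: {} for county in map_datafile['NAME']}
# {'Adams': {}, 'Baker': {}, 'Clark': {}}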

How to properly splice entries out of a pandas dataframe?

I have a CSV loaded into pandas; assume my dataframe is mydataframe.
My data is registration data where I have a csv for:
Name, RegistrationID, DateSignedUp, Course
I want to 'clean' the data in my dataframe by removing every row for any 'Name' that has fewer than 5 registrations.
I am able to get the count of registrations per name using the following:
mydataframe.groupby('Name')['RegistrationID'].count()
How do I create a new dataframe with all rows where 'Name' has more than 5 registrations?
You can try transform, which broadcasts the per-group count back onto every row so it can be used as a boolean mask:
n = 5
mydataframe = mydataframe[mydataframe.groupby('Name')['RegistrationID'].transform('count') > n].copy()
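For comparison, groupby().filter() expresses the same condition more directly, though it is typically slower than transform on large frames:
# keep all rows of each Name whose registration count exceeds n
mydataframe = mydataframe.groupby('Name').filter(lambda g: g['RegistrationID'].count() > n)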

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 csv files. I need only two columns of the csv files, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I am getting the error "Killed: 9" after processing about 22-23 csv files.
So now I am trying to create a list of dataframes with only those 2 columns using the for loop, and then concat the dfs outside the loop.
I have the following issues to resolve:
(i) Most of the csv files start at 2000-01-01, but a few have later start dates. I want the main dataframe to have all the dates, with NaN or empty fields for the csvs with later start dates.
(ii) I want to concat them using Date as the index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)

    dfList = []
    main_df = pd.DataFrame()
    for symbol in symbols:
        df = pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
                         infer_datetime_format=True, usecols=['Date', 'Close'],
                         index_col=None, header=0)
        df.rename(columns={'Close': symbol}, inplace=True)
        dfList.append(df)

    main_df = pd.concat(dfList, axis=1, ignore_index=True, join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, using pd.concat with axis=1, pandas will use intrinsic data alignment to align all the dataframes on that index.
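A sketch of that fix applied to the loop above (the path is shortened, and ignore_index is dropped so the symbol column names survive):
import pickle
import pandas as pd

with open("nifty50.pickle", "rb") as f:
    symbols = pickle.load(f)

dfList = []
for symbol in symbols:
    # Date goes straight into the index; parse_dates makes it a DatetimeIndex
    df = pd.read_csv('stock_dfs/{}.csv'.format(symbol),
                     usecols=['Date', 'Close'],
                     index_col='Date', parse_dates=True)
    dfList.append(df.rename(columns={'Close': symbol}))

# axis=1 aligns every dataframe on the shared Date index; symbols with
# later start dates simply get NaN for the dates they are missing
main_df = pd.concat(dfList, axis=1, join='outer')
print(main_df.head())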
