How to properly splice entries out of a pandas dataframe? - python-3.x

I have a CSV loaded into pandas; assume my dataframe is mydataframe.
My data is registration data, where the csv has the columns:
Name, RegistrationID, DateSignedUp, Course
I want to 'clean' the data up in my dataframe by removing every row of any 'Name' who had fewer than 5 registrations.
I am able to get the count of registrations per name using the following:
mydataframe.groupby('Name')['RegistrationID'].count()
How do I create a new dataframe with all rows where 'Name' has more than 5 registrations?

You can try groupby with transform:
n = 5
# transform('count') broadcasts each name's registration count back onto every row,
# so the boolean mask keeps only rows whose 'Name' has more than n registrations
mydataframe = mydataframe[mydataframe.groupby('Name')['RegistrationID'].transform('count') > n].copy()
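As a quick illustration with made-up data (the names and counts here are assumptions, not from the question):

import pandas as pd

mydataframe = pd.DataFrame({
    'Name': ['Ann'] * 6 + ['Bob'] * 2,
    'RegistrationID': range(8),
})

n = 5
kept = mydataframe[mydataframe.groupby('Name')['RegistrationID'].transform('count') > n].copy()
print(kept)  # only Ann's 6 rows survive; Bob has just 2 registrations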

Related

Delimited File with Varying Number of Rows Azure Data Factory

I have a delimited file separated by hashes that looks somewhat like this,
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when separated by hashes, there are more columns in the 2nd and 3rd rows than there are in the first. I want to be able to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I always see only 7 columns (the number of columns in the first row).
Is there any way to get all of the values? As many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not directly import the schema from the row with the maximum number of columns; it infers the columns from the first row. Hence, it is important to make sure you have the same number of columns in every row of your file.
You can use Azure Functions to validate your file and rewrite it so that all rows have an equal number of columns.
You could give it a try to keep a local file whose first row has the maximum number of columns and import the schema from that file; otherwise you have to go with Azure Functions, where you convert the file and then trigger the pipeline.
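As a rough sketch of the pre-processing such a function could do (plain Python; the file names are placeholders, not from the question), you can pad every row to the width of the widest row before handing the file to ADF:

# a minimal sketch, assuming a hash-delimited file with no header row
with open('input.txt') as f:
    rows = [line.rstrip('\n').split('#') for line in f]

width = max(len(row) for row in rows)  # the widest row defines the column count

with open('output.txt', 'w') as f:
    for row in rows:
        # pad short rows with empty fields so every row has `width` columns
        f.write('#'.join(row + [''] * (width - len(row))) + '\n')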

How to access the first row of each group after using groupby in Python

I am new to python and pandas and need help with the following.
I have some data in a dataframe on which I am using the groupby function. After using groupby I want to access the first row of each group and copy it to a different dataframe.
For example:
grouped = com.groupby(['Month'])
Using this will give me groups created month-wise. I just want to extract the first row from each of the 12 groups and copy those row details to another dataframe.
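A minimal sketch of one common approach (not from the original thread): groupby().head(1) keeps the first row of each group as a new dataframe, while .first() takes the first non-null value of every column per group:

grouped = com.groupby(['Month'])

# first row of each of the 12 groups, as a regular dataframe
first_rows = grouped.head(1)

# alternatively, the first non-null value per column for each month
# first_rows = com.groupby(['Month'], as_index=False).first()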

How to automatically index DataFrame created from groupby in Pandas

Using the Kiva Loan_Data from Kaggle, I aggregated the loan amounts by country. Pandas allows them to be easily turned into a DataFrame, but it indexes on the country data. reset_index can be used to create a numerical/sequential index, but I'm guessing I am adding an unnecessary step. Is there a way to create an automatic default index when creating a DataFrame like this?
Use as_index=False with groupby (the usual split-apply-combine pattern):
df.groupby('country', as_index=False)['loan_amount'].sum()
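For comparison, the reset_index route mentioned in the question gives the same result in two steps (column names taken from the question):

df.groupby('country')['loan_amount'].sum().reset_index()

as_index=False simply skips the intermediate country-indexed Series, so both versions end up with a default RangeIndex and 'country' as an ordinary column.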

Feature extraction from data stored in PostgreSQL database

I have some data stored in PostgreSQL database, which contains fields like cost, start date, end date, country, etc. Please take a look at the data here.
Now what I want to do is extract some of the important features/fields from this data and store them in a separate CSV file or pandas data frame so I can use the extracted data for analysis.
Is there any python script to do this task? Please let me know. Thanks.
Firstly you should import your PostgreSQL table data into a dataframe. Note that pandas.io.sql.frame_query from the gist below has been removed from pandas; read_sql_query is its modern replacement:
import psycopg2 as pg
import pandas as pd
# get connected to the database
connection = pg.connect("dbname=mydatabase user=postgres")
dataframe = pd.read_sql_query("SELECT * FROM <tablename>", connection)
The original approach is explained here: https://gist.github.com/00krishna/9026574
After that you can select specific columns from the pandas dataframe:
df1 = dataframe[['projectfinancialtype','regionname']]
# you can select any number of feature columns available in your dataframe;
# only 2 fields of your data are taken here
Now, to put these feature columns into a csv (note that to_csv's old cols= argument is now columns=):
df1.to_csv("pathofoutput.csv", columns=['projectfinancialtype','regionname'], index=False)
# it will create a csv with your feature columns
Hope this helps.

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 csv files. I need only two columns of the csv files, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I am getting the error "Killed: 9" after processing almost 22-23 csv files.
So, now I am trying to create a list of dataframes with only those 2 columns using the for loop, and then concat the dfs outside the loop.
I have the following issues to resolve:
(i) Though most of the csv files have a start date of 2000-01-01, there are a few csvs which have later start dates. So, I want the main dataframe to have all the dates, with NaN or empty fields for the csvs with later start dates.
(ii) I want to concat them across the Date as index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle","rb") as f:
        symbols=pickle.load(f)
    dfList=[]
    main_df=pd.DataFrame()
    for symbol in symbols:
        df=pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),infer_datetime_format=True,usecols=['Date','Close'],index_col=None,header=0)
        df.rename(columns={'Close':symbol}, inplace=True)
        dfList.append(df)
    main_df=pd.concat(dfList,axis=1,ignore_index=True,join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, using pd.concat with axis=1, pandas will use intrinsic data alignment to align all dataframes on their index.
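Put together, a minimal sketch of the corrected loop (paths and pickle file taken from the question; ignore_index=True is dropped so the symbol column names survive):

import pickle
import pandas as pd

def compileData():
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)
    dfList = []
    for symbol in symbols:
        df = pd.read_csv(
            '/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
            usecols=['Date', 'Close'], parse_dates=['Date'])
        df = df.rename(columns={'Close': symbol}).set_index('Date')
        dfList.append(df)
    # outer join on the Date index; symbols with later start dates simply get NaN
    main_df = pd.concat(dfList, axis=1, join='outer')
    print(main_df.head())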
