Splitting a dataframe (csv) - python-3.x

How do I split a dataframe (CSV) in a 4:1 ratio randomly and store the two parts in different variables?
For example, if there are ten rows numbered 1 to 10 in the dataframe, I want any 8 of them in variable 'a' and the remaining 2 rows in variable 'b'.

I've never done this on a random basis, but the basic approach would be:
1) import pandas
2) read in your csv
3) drop empty/null columns (avoid issues with these)
4) create a new dataframe to put the split values into
5) assign names to your new columns
6) split the values and combine them (using apply/combine/lambda)
Code sample:
# import the pandas module
import pandas as pd
# read in the csv file
data = pd.read_csv("https://mydata.csv")
# drop null values
data.dropna(inplace=True)
# create a new dataframe; this 'split' code applies to splitting one column into two
new = data["ColumnName"].str.split(" ", n=1, expand=True)
# assign a new name to the first column (8 concatenated values will go here)
data["A"] = new[0]
# make a separate column from the new dataframe (the last two combined values go here)
data["B"] = new[1]
## other/different code is required for concatenation of column values - see the linked SO question ##
# display the dataframe
data
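That covers splitting one column; for the random 4:1 row split the question actually asks about, a minimal sketch using DataFrame.sample (the 0.8 fraction matches the 4:1 ratio; the random_state seed is an assumption, added only for reproducibility):
# randomly sample 80% of the rows into 'a'
a = data.sample(frac=0.8, random_state=42)
# the remaining 20% of the rows go into 'b'
b = data.drop(a.index)
With ten rows, this leaves 8 rows in a and the other 2 in b.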
Hope this helps

Related

Pandas - concatenation of multiple Excel files with different names but the same data type

I have about 50 Excel files with the '.xlsb' extension. I'd like to concatenate a specific worksheet from each into a pandas DataFrame (all worksheet names are the same). The problem is that the column names are not exactly the same in each worksheet. I wrote code using pandas, but it concatenates values into the same DataFrame column only when the column name matches. For example, sometimes I have a column called FgsNr and sometimes FgNr - the datatype and the meaning of both columns are exactly the same, and I would like them in the same DataFrame column, but pandas creates two separate columns and stacks together only those values that are listed under the same column name.
files = glob(r'C:\Users\Folder\*xlsb')
for file in files:
    Datafile = pd.concat(pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0) for file in files)
How could I correct the code so that it copies and concatenates all values based on the columns' position in Excel, ignoring the column names?
When concatenating multiple dataframes with the same format, you can use the below snippet for speed and efficiency.
The basic logic is that you put them into a list, and then concatenate at the final stage.
from glob import glob
import pandas as pd

files = glob(r'C:\Users\Folder\*xlsb')
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)
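Note that the snippet above still aligns on column names, so FgsNr and FgNr would land in separate columns. One way to force positional alignment is to overwrite each frame's header before appending - a sketch, assuming every sheet has the same column order:
dfs = []
for file in files:
    df = pd.read_excel(file, engine='pyxlsb', sheet_name='Sheet1', usecols='A:F', header=0)
    # replace the header so concat aligns columns by position, not by name
    df.columns = range(df.shape[1])
    dfs.append(df)
large_df = pd.concat(dfs, ignore_index=True)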
Also refer to the below:
Creating an empty Pandas DataFrame, then filling it?

How to read and store names of all the columns from multiple sheets in excel using Python?

I have 25 sheets in the Excel file and I want the list of column names (top row/header) from each of the sheets.
Can you specify how you want the answers collected? Do you want all the column names from each sheet in the same list or dataframe?
Assuming you want the results collected into one DataFrame, where each row represents one sheet and each column holds one column name: the general idea is to loop over the sheets, calling pd.read_excel() with a different sheet name each time.
import pandas as pd
import numpy as np

n_sheets = 25
int_sheet_names = np.arange(0, n_sheets, 1)

# read only the header row of each sheet, then combine at the end
frames = []
for i in int_sheet_names:
    sheet_i_col_names = pd.read_excel('file.xlsx', sheet_name=i, header=None, nrows=1)
    frames.append(sheet_i_col_names)
df = pd.concat(frames, ignore_index=True)
The resulting DataFrame can be further manipulated based on your specific requirements.
Output from my example Excel file, which only had 4 sheets.
Alternatively, you can pass a list to the sheet_name argument. In that case you are given a dictionary, which I find to be less useful, and int_sheet_names must be a list rather than a numpy array.
n_sheets = 25
int_sheet_names = list(range(0, n_sheets))
# passing a list returns a dict of {sheet index: one-row DataFrame}
sheets_dict = pd.read_excel('file.xlsx', sheet_name=int_sheet_names, header=None, nrows=1)
Output as a dictionary when passing a list to sheet_name kwarg
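If you would rather not hard-code the number of sheets at all, sheet_name=None reads every sheet in the workbook; a short sketch that also stacks the resulting dictionary into one DataFrame:
# sheet_name=None returns a dict of {sheet name: DataFrame}, one entry per sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None, header=None, nrows=1)
df = pd.concat(all_sheets.values(), ignore_index=True)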

Merge multiple dataframes using multiindex in python

I have 3 series generated by code like that shown below (the code for one series is shown).
I would like to merge 3 such series/dataframes on the columns (subject_id, hadm_id, icustay_id), but unfortunately these headings don't appear as column names. How do I convert them into columns so they can be used for merging with another series/dataframe of a similar datatype?
I am generating the series from another dataframe (df) based on the condition given below. Though I have already tried converting the series to a dataframe, it still doesn't display the ids as columns; instead it displays them as the index. I would like to see 'subject_id', 'hadm_id' and 'icustay_id' as column names in the dataframe, along with the other column 'val_bw_80_110', so that I can join with other dataframes using these 3 ids.
s1 = df.groupby(['subject_id','hadm_id','icustay_id'])['val_bw_80_110'].mean()
I expect an output where the ids (subject_id,hadm_id,icustay_id) are converted to column names and can be used for joining/merging with other dataframes.
You can add parameter as_index=False to DataFrame.groupby or use Series.reset_index:
df = df.groupby(['subject_id','hadm_id','icustay_id'], as_index=False)['val_bw_80_110'].mean()
Or:
df = df.groupby(['subject_id','hadm_id','icustay_id'])['val_bw_80_110'].mean().reset_index()
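Once the three ids are ordinary columns, the series/dataframes can be merged on them. A sketch, assuming s1_df, s2_df and s3_df are the three frames after reset_index (hypothetical names):
# chain merges on the three id columns
merged = s1_df.merge(s2_df, on=['subject_id', 'hadm_id', 'icustay_id'])
merged = merged.merge(s3_df, on=['subject_id', 'hadm_id', 'icustay_id'])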

Assign values to a datetime column in Pandas / Rename a datetime column to a date column

I have created the following dataframe 'user_char' in Pandas with:
## Create a new workbook User Char with empty datetime columns to import data from the ledger
user_char = all_users[['createdAt', 'uuid','gasType','role']]
## filter on consumers in the user_char table
user_char = user_char[user_char.role == 'CONSUMER']
user_char.set_index('uuid', inplace = True)
## creates datetime columns that need to be added to the existing df
user_char_rng = pd.date_range('3/1/2016', periods = 25, freq = 'MS')  # note: pd.date_range takes no dtype argument; freq='MS' gives month starts
## converts date time index to a list
user_char_rng = list(user_char_rng)
## adds empty cols
user_char = user_char.reindex(columns = user_char.columns.tolist() + user_char_rng)
user_char
and I am trying to assign a value to the first of those datetime columns using the following command:
user_char['2016-03-01 00:00:00'] = 1
but this keeps creating a new column rather than editing the existing one. How do I assign the value 1 to all the indices without adding a new column?
Also how do I rename the datetime column that excludes the timestamp and only leaves the date field in there?
Try
user_char.loc[:, '2016-03-01'] = 1
Because your column index is a DatetimeIndex, pandas is smart enough to translate the string '2016-03-01' into datetime format. Using loc[c] seems to hint to pandas to first look for c in the index, rather than create a new column named c.
Side note: the DatetimeIndex of time-series data is conventionally used as the (row) index of a DataFrame, not in the columns. (There's no technical reason why you can't use time in the columns, of course!) In my experience, most of the PyData stack is built to expect "tidy data", where each variable (like time) forms a column, and each observation (timestamp value) forms a row. The way you're doing it, you'll need to transpose your DataFrame before calling plot() on it, for example.
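For the renaming part of the question, one option is to rebuild the column labels so each datetime column shows only its date part - a sketch, assuming the datetime columns hold pd.Timestamp labels:
# keep non-datetime labels as-is; reduce timestamp labels to their date part
user_char.columns = [c.date() if isinstance(c, pd.Timestamp) else c for c in user_char.columns]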

To average a subset of one column using pandas

I have a .csv file that I'm reading from. I read only select columns from it, and I need to further process this data before I save it into an Excel sheet. The idea is to repeat this process for all the files in the folder and save each sheet with the same name as the original .csv.
As of now, I'm able to read the specific columns from the .csv and write the whole file into Excel. I have yet to figure out how to further process these columns before saving. Further processing involves:
Averaging rows 18000-20000 for each column separately.
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names.
My code is as follows. Need some help with this.
import pandas as pd
import os
from pathlib import Path

for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)  # splitting into file name and extension
    if file_ext == '.atf':
        # open the data file and get data only from specific columns
        df = pd.read_csv(f, header=9, index_col=0, usecols=[0,55,59,63,67,71,75,79,83,87,91,95,99,103], encoding="ISO-8859-1", sep='\t', dtype={'YFP.13':str,'YFP.14':str,'YFP.15':str,'YFP.16':str,'YFP.17':str,'YFP.18':str,'YFP.19':str,'YFP.20':str,'YFP.21':str,'YFP.22':str,'YFP.23':str,'YFP.24':str,'YFP.25':str,'Signals=':str})
        df.to_excel(file_name + '.xlsx', sheet_name=file_name, engine='xlsxwriter')  # writing into an excel file
Let's say your dataframe has a shape of 4 columns and 100 rows:
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
Averaging rows 18000-20000 for each column separately.
To perform the averaging on a subset, you define a logical mask based on inequalities over the index and apply the averaging function to the selected rows. The result is saved in a new variable, means, since you want to use it later:
means = data[(data.index > 60) & (data.index < 80)].mean()
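Applied to the real data from the question, the same mask with the 18000-20000 row range (assuming inclusive bounds) would be:
# average rows 18000-20000 of each column
means = df[(df.index >= 18000) & (df.index <= 20000)].mean()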
Calculating (Column value - Average)/Average
Saving these values in separate columns with different column names
For the last two steps, the code below speaks for itself:
cols = data.columns
for col in cols:
    data[col + "_calc"] = (data[col] - means[col]) / means[col]
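As a side note, the loop can be replaced by one vectorized step, since pandas aligns the means Series with the columns of the dataframe:
# subtract and divide broadcast column-wise; add_suffix renames the new columns
calc = data.sub(means).div(means).add_suffix('_calc')
data = data.join(calc)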
In the end, you may export the dataframe "data" to an excel format as you did earlier.
