How to unmerge cells and create a standard dataframe when reading an Excel file? - python-3.x

I would like to convert a dataframe read from an Excel sheet that contains merged cells into a standard flat dataframe. So far, reading the Excel file the standard way gives me the following result:
df = pd.read_excel(folder + 'abcd.xlsx', sheet_name="Sheet1")
Unnamed: 0 Unnamed: 1 T12006 T22006 T32006 \
0 Casablanca Global 100 97.27252 93.464538
1 NaN RĂ©sidentiel 100 95.883979 92.414063
2 NaN Appartement 100 95.425152 91.674379
3 NaN Maison 100 101.463607 104.039383
4 NaN Villa 100 102.45132 101.996932
Thank you

You can try the .fillna() method with the parameter method='ffill'. According to the pandas documentation, ffill propagates the last valid observation forward to the next valid one.
So, your code would be like:
df.fillna(method='ffill', inplace=True)
And rename the first two columns with this line (rename is safer than assigning into df.columns.values, which can fail silently):
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})
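Putting it together, a minimal end-to-end sketch (the file path and sheet name are placeholders from the question; df.ffill() is the modern spelling, since fillna(method='ffill') is deprecated in recent pandas):
import pandas as pd

# merged cells come back as NaN everywhere except the first cell
df = pd.read_excel('abcd.xlsx', sheet_name='Sheet1')

# forward-fill the merged-cell column so every row carries its city
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()

# give the unnamed columns meaningful labels
df = df.rename(columns={'Unnamed: 0': 'City', 'Unnamed: 1': 'Type'})
print(df.head())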

Related

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id time0 time1 ... timeN
val0 val1 ... valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time id value
0 0 13
2 0 15
3 0 20
2 1 10
3 1 12
Then, assuming the set of possible times is 0, 1, 2, and 3, the
desired output is:
id time0 time1 time2 time3
0 13 NaN 15 20
1 NaN NaN 10 12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
time 0 2 3
id
0 13.0 15.0 20.0
1 NaN 10.0 12.0
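If an (id, time) pair can occur more than once, pivot raises a ValueError; a minimal sketch with pivot_table, assuming mean is an acceptable way to aggregate the duplicates:
>>> df.pivot_table(index="id", columns="time", values="value", aggfunc="mean")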
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
time0 time1 time2 time3
id
0 13.0 NaN 15.0 20.0
1 NaN NaN 10.0 12.0
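For reference, a self-contained version of the steps above, rebuilding the sample input from the question:
import pandas as pd

df = pd.DataFrame({'time': [0, 2, 3, 2, 3],
                   'id': [0, 0, 0, 1, 1],
                   'value': [13, 15, 20, 10, 12]})

# wide layout: one row per id, one column per observed time
df = df.pivot(index='id', columns='time', values='value')

# add the missing time columns (here time 1), filled with NaN
df = df.reindex(range(df.columns.max() + 1), axis=1)

# label the columns time0..timeN and drop the column-index name
df.columns = 'time' + df.columns.astype(str)
df = df.rename_axis(None, axis=1)
print(df)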

Doubts about pandas filtering data row by row

How can I solve this pandas issue? I have a dataframe of the following form:
datetime64ns         type(int)  datetime64ns(analysis)
2019-02-02T10:02:05  4
2019-02-02T10:02:01  3
2019-02-02T10:02:02  4          2019-02-02T10:02:02
2019-02-02T10:02:04  3          2019-02-02T10:02:04
The goal is the following:
# pseudocode
for all the rows:
    if datetime(analysis) exists and type == 4:
        set a new column type4 = 1
    elif datetime(analysis) exists and type == 2:
        set a new column type2 = 1
The idea is to use this afterwards for a groupby count. I'm sure it is possible because I managed to do it in the past, but I lost my .py file. Thanks for the attention
Need this?
df = pd.concat([df,
                pd.get_dummies(df['type(int)']
                               .mask(df['datetime64ns(analysis)'].isna())
                               .astype('Int64'))
                .add_prefix('type')],
               axis=1)
OUTPUT:
datetime64ns type(int) datetime64ns(analysis) type3 type4
0 2019-02-02T10:02:05 4 NaN 0 0
1 2019-02-02T10:02:01 3 NaN 0 0
2 2019-02-02T10:02:02 4 2019-02-02T10:02:02 0 1
3 2019-02-02T10:02:04 3 2019-02-02T10:02:04 1 0
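A self-contained sketch of the same idea (pd.NaT stands in for the missing analysis timestamps; the Int64 cast keeps the dummy column names as type3/type4 instead of type3.0/type4.0, and dtype=int makes the dummies 0/1 as in the output above):
import pandas as pd

df = pd.DataFrame({
    'datetime64ns': pd.to_datetime(['2019-02-02T10:02:05', '2019-02-02T10:02:01',
                                    '2019-02-02T10:02:02', '2019-02-02T10:02:04']),
    'type(int)': [4, 3, 4, 3],
    'datetime64ns(analysis)': pd.to_datetime([pd.NaT, pd.NaT,
                                              '2019-02-02T10:02:02',
                                              '2019-02-02T10:02:04']),
})

# hide the type where no analysis timestamp exists, then one-hot encode
dummies = pd.get_dummies(df['type(int)']
                         .mask(df['datetime64ns(analysis)'].isna())
                         .astype('Int64'),
                         dtype=int).add_prefix('type')

df = pd.concat([df, dummies], axis=1)
print(df)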

Keeping columns of pandas dataframe whose substring is in the list

I have a dataframe with many columns. I only want to retain the columns that contain one of the substrings in a list. For example, the lst and dataframe are:
lst = ['col93','col71']
sample_id. col9381.3 col8371.8 col71937.9 col19993.1
1
2
3
4
Based on the substrings, the resulting dataframe will look like:
sample_id. col9381.3 col71937.9
1
2
3
4
I have code that goes through the list and filters the columns for which a substring is in the list, but I don't know how to build a dataframe from it. The code so far:
for i in lst:
    df2 = df1.filter(regex=i)
    if df2.shape[1] > 0:
        print(df2)
The above code is able to filter out the columns, but I don't know how to combine all of these into one dataframe. Insights will be appreciated.
Try with startswith which accepts a tuple of options:
df.loc[:, df.columns.str.startswith(('sample_id.',)+tuple(lst))]
Or filter which accepts a regex as you were trying:
df.filter(regex='|'.join(['sample_id']+lst))
Output:
sample_id. col9381.3 col71937.9
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
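A self-contained sketch of both options (the sample frame below only fills sample_id., matching the question; re.escape guards against regex metacharacters in the substrings):
import re
import pandas as pd

lst = ['col93', 'col71']
df = pd.DataFrame({'sample_id.': [1, 2, 3, 4]})
df = df.reindex(columns=['sample_id.', 'col9381.3', 'col8371.8',
                         'col71937.9', 'col19993.1'])

# option 1: keep columns whose name starts with an allowed prefix
out1 = df.loc[:, df.columns.str.startswith(('sample_id.',) + tuple(lst))]

# option 2: one alternation regex instead of a loop of filter() calls
out2 = df.filter(regex='|'.join(map(re.escape, ['sample_id'] + lst)))

print(out1)
print(out2)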

How in Python can I get dataframe coordinates that are not items in a column, into columns within the same dataframe?

I am using xarray to read in two different netCDF files, using combine='by_coords'. The data read in is then converted to a dataframe, and the printed output is shown below.
tag p
lat lon time
23.025642 -110.925552 2010-01-01 0 NaN
2010-01-02 0 NaN
2010-01-03 0 NaN
2010-01-04 0 NaN
2010-01-05 0 NaN
... ... ...
29.974609 -90.084259 2010-12-20 0 9.711414
2010-12-21 0 8.313345
2010-12-22 0 6.525973
2010-12-23 0 1.124200
2010-12-24 0 0.000000
[64110060 rows x 2 columns]
The variables appear as columns, but the coordinate variables do not. I have tried pulling the lat and lon out separately and appending them to the dataframe, but that does not work (size mismatches).
How might I get lat and lon as columns, so I can then use the pandas groupby function on them?
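A minimal sketch of one approach, assuming lat/lon/time form the dataframe's MultiIndex (which is what xarray's to_dataframe() typically produces): reset_index() promotes the index levels to ordinary columns.
# df has a MultiIndex (lat, lon, time) and columns tag, p
df = df.reset_index()

# lat, lon, and time are now regular columns, so groupby works on them
means = df.groupby(['lat', 'lon'])['p'].mean()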

Python - Calculating standard deviation (row level) of dataframe columns

I have created a Pandas Dataframe and am able to determine the standard deviation of one or more columns of this dataframe (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the below command I am getting "NaN" for all the records. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
It is expected; see the DataFrame.std documentation:
Normalized by N-1 by default. This can be changed using the ddof argument.
The sample standard deviation divides by N - 1, so with a single element you divide by zero. Hence, if you have one column and ask for the standard deviation across columns, you get all missing values.
Sample:
inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})
print(inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column with single [] to get a Series:
print(inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get the std of the Series - a scalar:
print(inp_df['salary'].std())
10.0
Select one column with double [] to get a one-column DataFrame:
print(inp_df[['salary']])
salary
0 10
1 20
2 30
Get the std of the DataFrame along the index (the default) - a one-element Series:
print(inp_df[['salary']].std())
# same as
# print(inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get the std of the DataFrame across columns (axis=1) - all NaNs:
print(inp_df[['salary']].std(axis=1))
0 NaN
1 NaN
2 NaN
dtype: float64
If you change the default ddof=1 to ddof=0:
print(inp_df[['salary']].std(axis=1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want the std over two or more columns:
# select 2 columns
print(inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
# std along the index
print(inp_df[['salary', 'num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
# std across columns (row level)
print(inp_df[['salary', 'no_of_hours']].std(axis=1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
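As a design note, NumPy makes the opposite choice: np.std defaults to the population formula (ddof=0), while pandas defaults to the sample formula (ddof=1). A quick cross-check:
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])
print(s.std())        # 10.0 - pandas sample std (ddof=1)
print(np.std(s))      # 8.16... - NumPy population std (ddof=0)
print(s.std(ddof=0))  # 8.16... - matches NumPy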
