Advice on populating a dataframe based on an existing one - python-3.x

I'm seeking advice on populating a dataframe in pandas. I've created a dataframe that looks like A:
Eventually, however, it should look something like B:
Could anyone suggest how to create a dataframe like B on top of A, given that I have the relevant data values?
Any comments or suggestions are highly appreciated.

I assume you have two dataframes that look like A. One for feature F1 and one for F2.
Then you create B like this:
import pandas as pd

a1 = ...  # assuming a1, a2 already have the correct index A, B, C as depicted
a2 = ...
a1['Features'] = "F1"
a2['Features'] = "F2"
b = (pd.concat([a1, a2], axis=0)
       .set_index("Features", append=True)
       # Swing the new index level - Features - around to become a column level instead.
       .unstack("Features"))
You've named the column level "Features" but I'd suggest using "Feature" instead, if you can.
There is also an alternate way to do the same thing, also seen in this question: How to make dataframe behave such as pandas_datareader
(pd.concat([a1, a2], axis='columns', keys=pd.Index(["F1", "F2"], name="Features"))
   # swap hierarchy order of column levels
   .swaplevel(-2, -1, axis=1)
   # restore sorting to that of a1 columns - assuming a1, a2 have the same cols
   .reindex(columns=a1.columns, level=0)
)
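As a minimal sketch of both approaches side by side, assuming two small made-up frames in place of the real a1 and a2 (the column names x and y and all values are hypothetical, since the original data is not shown):

```python
import pandas as pd

# Hypothetical stand-ins for the two frames that look like A.
a1 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["A", "B", "C"])
a2 = pd.DataFrame({"x": [10, 20, 30], "y": [40, 50, 60]}, index=["A", "B", "C"])

# Approach 1: tag each frame, stack them vertically, then move the tag
# from the row index into the columns.
b1 = (pd.concat([a1.assign(Features="F1"), a2.assign(Features="F2")], axis=0)
        .set_index("Features", append=True)
        .unstack("Features"))

# Approach 2: concatenate side by side with keys, then swap the column levels
# and restore the column order of a1.
b2 = (pd.concat([a1, a2], axis="columns",
                keys=pd.Index(["F1", "F2"], name="Features"))
        .swaplevel(-2, -1, axis=1)
        .reindex(columns=a1.columns, level=0))

print(b1)
print(b2)
```

Either way, a single feature of a single column can then be selected with a tuple, for example b1[("x", "F2")].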

Related

Reading an Excel file with merged cells in Python

I have an Excel table of the following type (the problem described below is caused by the presence of merged cells).
I am using read_excel from pandas to read it.
What I want: I would like to use the values in the first column as an index, and to have the values in the third column combined in one cell, e.g. like here.
What I get from directly applying read_excel can be seen here.
If needed: please see the code used to read the file below (I am reading it from google drive in google colab):
path = '/content/drive/MyDrive/ExampleFile.xlsx'
pd.read_excel(path, header = 0, index_col = 0)
Could you please help?
Please let me know if anything in the question is unclear.
Here is one way to accomplish it. I created an xls file similar to yours; the first column has the heading sno.
# fill the null values with values from previous rows
df=df.ffill()
# combine the rows where class is the same and create a new column
df=df.assign(comb=df.groupby(['class'])['type'].transform(lambda x: ','.join(x)))
# drop the duplicated rows
df2=df.drop_duplicates(subset=['class','comb'])[['class','comb']]
   class             comb
0  fruit     apple,orange
2   toys  car,truck,train
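For reference, a self-contained sketch of the same idea, assuming a made-up frame that mimics what read_excel returns for merged cells (only the first row of each merged block is filled). It uses groupby with agg as a shorter alternative to the transform plus drop_duplicates steps above; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: merged cells arrive as NaN on every row but the first.
df = pd.DataFrame({
    "sno": [1, np.nan, 2, np.nan, np.nan],
    "class": ["fruit", np.nan, "toys", np.nan, np.nan],
    "type": ["apple", "orange", "car", "truck", "train"],
})

# Fill the gaps left by the merged cells with the value from the row above.
df = df.ffill()

# Join all "type" values that share a "class" into one comma-separated cell.
combined = (df.groupby("class", sort=False)["type"]
              .agg(",".join)
              .reset_index(name="comb"))
print(combined)
```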

split a column based on a delimiter and then unpivot the result with preserving other columns

I need to split a column into multiple rows and then unpivot them while preserving one or more other columns. How can I achieve this in Python 3?
See the example below:
import numpy as np
import pandas as pd
data=np.array(['a0','a1,a2','a2,a3'])
pk=np.array([1,2,3])
df=pd.DataFrame({'data':data,'PK':pk})
df
df['data'].apply(lambda x : pd.Series(str(x).split(","))).stack()
What I need is:
data pk
a0 1
a1 2
a2 2
a2 3
a3 3
Is there any way to achieve this without merge and resetting indexes as mentioned here?
Convert column data into list and explode the data frame
Data
data=np.array(['a0','a1,a2','a2,a3'])
pk=np.array([1,2,3])
df=pd.DataFrame({'data':data,'PK':pk})
df=spark.createDataFrame(df)
Solution
from pyspark.sql import functions as F
df.withColumn('data', F.explode(F.split(F.col('data'), ','))).show()
"Explode" is the keyword to search for here (thanks to wwnde for pointing it out), and it can be done easily in Python using existing libraries.
The first step is to convert the delimited column to a list:
df=df.assign(Data=df.data.str.split(","))
and then explode
df.explode('Data')
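Put together, the two pandas steps can be sketched as follows, using the sample data from the question. Passing ignore_index=True (available in pandas 1.1 and later) is an optional addition here that renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame({"data": ["a0", "a1,a2", "a2,a3"], "PK": [1, 2, 3]})

out = (df.assign(data=df["data"].str.split(","))   # string -> list of parts
         .explode("data", ignore_index=True))      # one row per list element
print(out)
```

The PK value is repeated for every element that came from the same original row, which is the unpivoted shape asked for in the question.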
If you are reading from Excel and pandas detects a list of numbers as int, and you need to explode multiple times, then this is the code and the results.

Can I use pandas.DataFrame.apply() to make more than one column at once?

I have been given a function that takes the values in a row in a dataframe and returns a bunch of additional values.
It looks a bit like this:
my_func(row) -> (value1, value2, value3... valueN)
I'd like each of these values to be assigned to new columns in my dataframe. Can I use DataFrame.apply() to add multiple columns in one go, or do I have to add columns one at a time?
It's obvious how I can use apply to generate one column at a time:
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"] = df.apply(axis=1, func=lambda row:(row.A + row.B))
df["Y"] = df.apply(axis=1, func=lambda row:(row.A - row.B))
But what if the two columns I am adding are more easily calculated together? In this case, I already have a function that gives me everything I need in one shot. I'd rather not have to call it multiple times or add a load of caching.
Is there a syntax I can use that would allow me to use apply to generate 2 columns at the same time? Something like this, but less broken:
# Broken Pseudo-code
from eunrg_utils.testing_functions.random_dataframe import random_dataframe
df = random_dataframe(columns=2,rows=4)
df["X"], df["Y"] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A))
What is the correct way to do something like this?
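You can assign to a list of column names; pass result_type='expand' so that each returned tuple is spread across the new columns row by row:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(2,2)),columns=list('AB'))
df[["X", "Y"]] = df.apply(axis=1, func=lambda row:(row.A + row.B, row.B-row.A), result_type='expand')
print (df)
   A  B   X  Y
0  2  8  10  6
1  4  3   7 -1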
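A sketch of the same pattern with fixed input, so the result can be checked by hand. Here my_func is a hypothetical stand-in for the function described in the question, and result_type="expand" tells apply to turn each returned tuple into one row of a new DataFrame, which keeps the values aligned row by row regardless of the frame's shape.

```python
import pandas as pd

def my_func(row):
    # Hypothetical stand-in: returns several values computed together.
    return row.A + row.B, row.B - row.A

df = pd.DataFrame({"A": [2, 4, 1], "B": [8, 3, 5]})

# Each tuple returned by my_func becomes one row across the new columns.
df[["X", "Y"]] = df.apply(my_func, axis=1, result_type="expand")
print(df)
```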

Matching two columns to get the desired value

A C D
12:58:09 12:58:09 400.9
12:58:16 12:58:10 468.0
12:58:20 12:58:11 425.9
12:58:34 12:58:12 432.4
12:58:38 12:58:13 439.3
12:58:49 12:58:14 442.5
12:58:53 12:58:15 445.2
12:58:56 12:58:16 447.2
12:59:00 12:58:17 449.7
12:59:04 12:58:18 450.4
12:59:07 12:58:19 453.9
12:59:11 12:58:20 454.3
I have a data set like this. I want to make a new helper column B that looks up each value of column A in column C and returns the corresponding value from column D. So my B should look like 400.9, 447.2, 454.3, and so on. Can anyone suggest what approach I should use for this problem? Thanks!
Put this in column B and drag it down:
=VLOOKUP(A1,$C$1:$D$100,2,FALSE)
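For readers doing the same exact-match lookup in pandas rather than in a spreadsheet, a rough equivalent is sketched below. It uses the C and D values from the question and the first four values of A; treating the times as plain strings is an assumption.

```python
import pandas as pd

# Lookup table: columns C and D from the question.
c = [f"12:58:{s:02d}" for s in range(9, 21)]
d = [400.9, 468.0, 425.9, 432.4, 439.3, 442.5,
     445.2, 447.2, 449.7, 450.4, 453.9, 454.3]

# First four values of column A.
a = pd.Series(["12:58:09", "12:58:16", "12:58:20", "12:58:34"], name="A")

# Exact-match lookup: keys missing from C give NaN, where VLOOKUP gives #N/A.
b = a.map(dict(zip(c, d))).rename("B")
print(b)
```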

Creation of vector of unknown size in Excel

I am attempting to translate my existing Matlab code into Numbers (basically Excel). In Matlab, I have the following code:
clear all; clc;
n = 30
x = 1:(n-1)
T = 295;
D = T./(n-x)
E = T/n
for i=1:(n-2)
C(i) = D(i+1) - D(i)
end
hold on
plot(x(1:end-1), C, 'rx')
plot(x, D, 'bx')
I believe everything has been solved by your formulas. There are parts of them that I don't understand; otherwise I would try to figure the rest out myself. Attached is the result (you might also like to know that the formulas you gave work and are recognized in Numbers). I'm trying to figure out why x begins at 2, as I feel it should start at 1.
Also, it is clear that realistic results for the formulas exist only under certain conditions, i.e. column E > 0. That being the case, what would be the easiest way to chart the data with a filter so that only certain data is charted?
(Using Excel...)
Suppose you put your input values n & T in A1 & A2 respectively.
You could generate x, D & C In columns C,D & E with:
C1: =IF(ROW()<$A$1,ROW(),"")
D1: =IF(LEN(C1)>0,$A$2/($A$1-C1),"")
E1: =IF(LEN(D2)>0,D2-D1,"")
You could then pull all 3 columns down as far as you need to generate the full length of your vectors. If you then want to chart these, simply use these columns as inputs.
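To cross-check the spreadsheet columns, the same vectors can be computed with a short NumPy sketch of the Matlab code from the question (plotting omitted):

```python
import numpy as np

n = 30
T = 295

x = np.arange(1, n)   # 1, 2, ..., n-1 (Matlab's 1:(n-1))
D = T / (n - x)       # element-wise, like T./(n-x)
C = np.diff(D)        # C[i] = D[i+1] - D[i], one element shorter than D
E = T / n

print(len(x), len(D), len(C))
```

The spreadsheet columns C, D and E should hold the same numbers as x, D and C here, which gives an easy way to verify the formulas row by row.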
