how to convert grouped dataframe multilevel index to datadict - python-3.x

sample dataframe:
          avg
Key1 Key2
a1   b1    v1
     b2    v2
     b3    v3
a2   b4    v4
a3   b5    v5
     b6    v6
a4   b7    v7
How do I convert this to a dict like the following?
{a1:v1, a1:v2, a1:v3, a2:v4, a3:v5, a3:v6, a4:v7}
I tried this with no luck:
dict(zip(df['ColA'], df['avg']))
I'd appreciate any help!

Since it is a MultiIndex, use get_level_values:
dict(zip(df.index.get_level_values(1), df['avg']))
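For reference, a minimal runnable sketch that rebuilds the sample frame (values assumed to be strings) and applies the one-liner; level 1 (Key2) is used because its values are unique, whereas keying on level 0 (Key1) would collapse the duplicate a1/a3 entries, since dict keys must be unique:
import pandas as pd

# Rebuild the grouped frame with a two-level (Key1, Key2) index
idx = pd.MultiIndex.from_tuples(
    [('a1', 'b1'), ('a1', 'b2'), ('a1', 'b3'), ('a2', 'b4'),
     ('a3', 'b5'), ('a3', 'b6'), ('a4', 'b7')],
    names=['Key1', 'Key2'])
df = pd.DataFrame({'avg': ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7']}, index=idx)

result = dict(zip(df.index.get_level_values(1), df['avg']))
print(result)  # {'b1': 'v1', 'b2': 'v2', ..., 'b7': 'v7'}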

Related

Pandas: rename one value to another in a column and add the corresponding values in the other column

So, I have a pandas data frame:
df =
a b c
a1 b1 c1
a2 b2 c1
a2 b3 c2
a2 b4 c2
I want to rename a2 to a1, then group by a and c, and add the corresponding values of b:
df =
a b c
a1 b1+b2 c1
a1 b3+b4 c2
So, with actual numbers, something like this:
df =
a value c
a1 10 c1
a2 20 c1
a2 50 c2
a2 60 c2
df =
a value c
a1 30 c1
a1 110 c2
How to do this?
What about
>>> res = df.replace({"a": {"a2": "a1"}}).groupby(["a", "c"], as_index=False).sum()
>>> res
a c value
0 a1 c1 30
1 a1 c2 110
which first replaces "a2" with "a1" in the a column only, and then groups and sums.
To get the original column order back, we can reindex:
>>> res.reindex(df.columns, axis=1)
a value c
0 a1 30 c1
1 a1 110 c2
Try this:
df.groupby([df['a'].replace({'a2':'a1'}),'c']).sum().reset_index()
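For completeness, a self-contained sketch using the numeric example from the question; both answers give the same totals, and in the second form the value column is selected explicitly so only the numeric column is summed:
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a2', 'a2'],
                   'value': [10, 20, 50, 60],
                   'c': ['c1', 'c1', 'c2', 'c2']})

# Replace in a copy of the frame, then group and sum
res1 = df.replace({'a': {'a2': 'a1'}}).groupby(['a', 'c'], as_index=False).sum()

# Replace only in the grouping key, leaving df untouched
res2 = df.groupby([df['a'].replace({'a2': 'a1'}), 'c'])['value'].sum().reset_index()

print(res1)  # a1/c1 -> 30, a1/c2 -> 110
print(res2)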

How to split every row in dataframe into two with some features? [closed]

I have this dataframe:
A C1 C2
a1 c1 c3
a2 c2 c4
Columns C1 and C2 have the same type.
And I want to get this:
A C
a1 c1
a1 c3
a2 c2
a2 c4
How can I do this?
UPD:
From the answers I got this:
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4
But what should I do if I want to split in this way?
A B C1 C2 C3 C4
a1 b1 c1 c2 c3 c4
a2 b2 c5 c6 c7 c8
and get this:
A B C1 C2
a1 b1 c1 c2
a1 b1 c3 c4
a2 b2 c5 c6
a2 b2 c7 c8
Edit 2: If you have an even number of Cx columns, you can use numpy to keep it simple:
import numpy as np
cols = ['C1', 'C2', 'C3', 'C4']
# Repeat each (A, B) row once per pair of Cx columns
df1 = df.loc[df.index.repeat(len(cols) // 2), ['A', 'B']].reset_index(drop=True)
# Reshape the Cx values into two columns and glue them back on
df_final = df1.join(pd.DataFrame(df[cols].to_numpy().reshape(-1, 2), columns=['C1', 'C2']))
Out[698]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
Edit for the updated sample:
To split multiple Cx columns into groups of 2, you can use wide_to_long. However, before doing that, you need to pre-process the column names into a format that wide_to_long accepts:
df1 = df.set_index(['A','B'])
# Alternating '0'/'1' marks which output column each Cx feeds,
# and the pair number becomes the wide_to_long suffix
stub_cols = (np.arange(df1.columns.size) % 2).astype(str)
suff_cols = (np.arange(df1.columns.size) // 2).astype(str)
d = dict(zip(stub_cols, ['C1', 'C2']))                  # {'0': 'C1', '1': 'C2'}
df1.columns = pd.Series(stub_cols) + '_' + suff_cols    # C1..C4 -> '0_0', '1_0', '0_1', '1_1'
df_final = (pd.wide_to_long(df1.reset_index(),
                            i=['A', 'B'],
                            j='num',
                            stubnames=['0', '1'],
                            sep='_')
              .droplevel(-1)
              .rename(d, axis=1)
              .reset_index())
Out[680]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
Give this a try
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4
print(
    pd.concat([df.A, df[['C1', 'C2']].apply(list, axis=1)], axis=1)
      .explode(0)
      .rename(columns={0: 'C'})
)
Prints:
A C
0 a1 c1
0 a1 c3
1 a2 c2
1 a2 c4
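If you want to reproduce the answers above, here is a small sketch that rebuilds both sample frames from the question (values assumed to be strings; df_upd is just a name chosen here for the updated sample):
import pandas as pd

# First sample: one id column and two value columns to stack into a single C
df = pd.DataFrame({'A': ['a1', 'a2'], 'C1': ['c1', 'c2'], 'C2': ['c3', 'c4']})

# Updated sample: two id columns and four value columns to fold into pairs
df_upd = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                       'C1': ['c1', 'c5'], 'C2': ['c2', 'c6'],
                       'C3': ['c3', 'c7'], 'C4': ['c4', 'c8']})

print(df.set_index('A').stack().droplevel(1).rename('C').reset_index())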

pyspark - Read files with custom delimiter to RDD?

I am a newbie in pyspark, and I'm trying to read and merge RDD rows into one row.
Assuming that I have the following text file:
A1 B1 C1
A2 B2 C2 D3
A3 X1 YY1
DELIMITER_ROW
Z1 B1 C1 Z4
X2 V2 XC2 D3
DELIMITER_ROW
T1 R1
M2 MB2 NC2
S3 BB1
AQ3 Q1 P1
Now, I want to combine all rows that appear in each section (between DELIMITER_ROW lines) into one row, and return a list of these merged rows.
I want to create this kind of list:
[[A1 B1 C1 A2 B2 C2 D3 A3 X1 YY1]
[Z1 B1 C1 Z4 X2 V2 XC2 D3]
[T1 R1 M2 MB2 NC2 S3 BB1 AQ3 Q1 P1]]
How can it be done in pyspark using RDDs?
For now I know how to read the file and filter out the delimiter rows:
sc.textFile(pathToFile).filter(lambda line: DELIMITER_ROW not in line).collect()
but I don't know how to reduce/merge/combine/group the rows in each section into one row.
Thanks.
Rather than reading the whole file and filtering, you can use hadoopConfiguration.set to set the delimiter that separates the records, and then split each record.
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "DELIMITER_ROW")
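The line above is the Scala form. In pyspark, a rough sketch of the same idea (reusing pathToFile and sc from the question, and assuming a plain text file) passes the delimiter through newAPIHadoopFile and then splits each section on whitespace:
# Read each DELIMITER_ROW-separated section as one record (sketch, untested)
rdd = sc.newAPIHadoopFile(
    pathToFile,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "DELIMITER_ROW"},
)
# Drop the byte-offset keys, tokenize each section, and skip empty ones
sections = (rdd.map(lambda kv: kv[1].split())
               .filter(lambda tokens: len(tokens) > 0))
print(sections.collect())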
Hope this helps!

Finding minimum value based on a range of values in another column in Excel VBA

10.00 b1
11.00 b2
22.00 b3
2.00 b1
323.00 b2
1.00 b3
423.00 b1
32.00 b2
42.00 b3
43.00 b1
522.00 b2
53.00 b3
22.00 b1
344.00 b2
33.00 b3
23445.00 b1
323.00 b2
4.00 b3
How can I find the minimum value of column1 where value of column2 = b2?
Here is one for you. It is an Excel array formula (in versions without dynamic arrays, confirm it with Ctrl+Shift+Enter):
=MIN(IF(B1:B100="b2",A1:A100))
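If your version of Excel has the MINIFS function (an assumption about your setup; it is only present in newer versions), the same result can be had without an array formula:
=MINIFS(A1:A100,B1:B100,"b2")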

Sum fields in a column if there is an entry in a corresponding row in another column

Assume the following data:
  |  A   B   C
--+------------
1 |  2   3   5
2 |  2       3
3 |  4       4
4 |      2   3
5 |      5   6
In cell A6, I want Excel to add cells C1, C2 and C3 on the basis that A1, A2 and A3 have data in them. Similarly, I want B6 to add together C1, C4 and C5 because B1, B4 and B5 have data.
Can someone help?
In A6 enter:
=SUMPRODUCT(($C1:$C5)*(A1:A5<>""))
and then copy it to B6.
A simple SUMIF formula will work:
=SUMIF(A$1:A$5,"<>",$C$1:$C$5)
Place that formula in cell A6 and then copy it to B6.
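With the sample data as laid out above, either formula gives 12 in A6 (C1+C2+C3 = 5+3+4) and 14 in B6 (C1+C4+C5 = 5+3+6).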
You can create another column, e.g. AValue, with the formula =IF(ISBLANK(A1),0,A1) in it. This will return 0 if the cell in column A on the same row is empty, or the value from that cell otherwise.
Then you can just sum up the values of the new column.
