Converting list of items in pandas dataframe column to tuples - python-3.x

I have a dataframe that contains lists of items, like below.
| B | A |
+----------+------------+
|0.07 | [a,b,c] |
|0.009 | [d,e,f] |
|0.008 | [g,h,j] |
+----------+------------+
The goal is to convert the lists in column A to tuples, like below:
| B | A |
+----------+------------+
|0.07 | (a,b,c) |
|0.009 | (d,e,f) |
|0.008 | (g,h,j) |
+----------+------------+
So how can this be achieved using pandas in Python?

Good day!
As @Shubham Sharma mentions in his comment, you can solve your problem with the following code:
import pandas as pd
columnas=["B" ,"A" ]
data=[
[0.07,["a","b","c"]],
[0.009,["d","e","f"]],
[0.008,["g","h","j"]]
]
df = pd.DataFrame(data, columns=columnas)
df["A"]=df['A'].map(tuple)
print(df)
Result:
B A
0 0.070 (a, b, c)
1 0.009 (d, e, f)
2 0.008 (g, h, j)
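One caveat: map(tuple) calls tuple() on every cell, so it raises a TypeError if the column contains a missing value (NaN is a float and is not iterable). The following is a minimal sketch of a guarded variant; the sample data with a missing entry is hypothetical and not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing entry in column A.
df = pd.DataFrame({
    "B": [0.07, 0.009, 0.008],
    "A": [["a", "b", "c"], np.nan, ["g", "h", "j"]],
})

# Convert only list values; leave anything else (such as NaN) unchanged.
df["A"] = df["A"].map(lambda v: tuple(v) if isinstance(v, list) else v)
print(df)
```

If the column is known to contain only lists, the plain map(tuple) from the answer is simpler and sufficient.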

Related

Possible corner case: pandas.read_csv

Why are all dots stripped from strings that consist of numbers and dots, only when engine='python', and in the face of dtype being defined?
The unexpected behaviour occurs when processing a csv file where:
a column has strings that consist solely of numbers and single dots spread throughout the string
read_csv is called with engine='python' and thousands='.'
Sample of testcode:
import pandas as pd # version 1.5.2
import io
data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 out: col a as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 out: col a not expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str}, it seems that engine='python' handles it differently from engine='c'. dtype={'a': object} yields the same result.
I have spent quite some time getting to know all settings from the pandas read_csv and I can't see any other option I can set to alter this behaviour.
Is there anything I missed or is this behaviour 'normal'?
Looks like a bug (it wasn't reported, so I filed it). I was only able to create a clumsy workaround:
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| a | b | c |
|:---------------|------:|-----:|
| 0000.7995 | 16000 | 0 |
| 3.03.001.00514 | 0 | 4000 |
| 4923.600.041 | 23000 | 131 |

Splitting tuples in a column of a dataframe [duplicate]

This question already has answers here:
How can I split a column of tuples in a Pandas dataframe?
(6 answers)
Closed 8 months ago.
I have a dataframe df containing tuples like below in Column A
| ID | A |
+----------+------------+
|0 |(1, [a]) |
|1 |(2, [a,b]) |
|2 |(3, [c,a,b])|
+----------+------------+
I want to split the tuples in the above df and want to see my new dataframe df like below.
| ID | A | B |
+----------+------------+----------+
|0 | 1 | [a] |
|1 | 2 | [a,b] |
|2 | 3 | [c,a,b] |
+----------+------------+----------+
So, how can I split the tuple in the above dataframe df?
Use the str accessor, and create B first to avoid losing the data in A:
df['B'] = df['A'].str[1]
df['A'] = df['A'].str[0]
alternative:
df[['A', 'B']] = pd.DataFrame(df['A'].to_list(), columns=['A', 'B'], index=df.index)
output:
ID A B
0 0 1 [a]
1 1 2 [a, b]
2 2 3 [c, a, b]
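Another option, sketched here as an assumption rather than as part of the answer above, is to unzip the tuples with zip(*...) and assign the two resulting sequences by position. The sample frame and its non-default index are hypothetical; they only illustrate that positional assignment does not depend on index labels.

```python
import pandas as pd

# Hypothetical sample data mirroring the question, with a non-default index.
df = pd.DataFrame(
    {
        "ID": [0, 1, 2],
        "A": [(1, ["a"]), (2, ["a", "b"]), (3, ["c", "a", "b"])],
    },
    index=[10, 11, 12],
)

# zip(*...) transposes the tuples: one sequence of first items, one of second items.
first_items, second_items = zip(*df["A"])

# Plain lists are assigned by position, so the index labels do not matter.
df["A"] = list(first_items)
df["B"] = list(second_items)
print(df)
```

This assumes every tuple has exactly two elements; tuples of any other length would make the unpacking fail.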

Create multiple fields as arrays in Pyspark?

I have a dataframe with multiple columns as such:
| ID | Grouping | Field_1 | Field_2 | Field_3 | Field_4 |
|----|----------|---------|---------|---------|---------|
| 1 | AA | A | B | C | M |
| 2 | AA | D | E | F | N |
I want to create 2 new columns that store a list of the existing columns' values, using a group by on an existing field, such that my new dataframe would look like this:
| ID | Grouping | Group_by_list1 | Group_by_list2 |
|----|----------|----------------|----------------|
| 1 | AA | [A,B,C,M] | [D,E,F,N] |
Does Pyspark have a way of handling this kind of wrangling with a dataframe to create this kind of an expected result?
I added inline comments; check the code below.
from pyspark.sql import functions as F
(df
 .select(F.col("id"), F.col("Grouping"), F.array(F.col("Field_1"), F.col("Field_2"), F.col("Field_3"), F.col("Field_4")).alias("grouping_list"))  # Create an array of the required columns.
 .groupBy(F.col("Grouping"))  # Group on the Grouping column.
 .agg(F.first(F.col("id")).alias("id"), F.first(F.col("grouping_list")).alias("Group_by_list1"), F.last(F.col("grouping_list")).alias("Group_by_list2"))  # First id, plus the first and last grouping_list in each group.
 .select("id", "Grouping", "Group_by_list1", "Group_by_list2")  # Select the output columns.
 .show(truncate=False))
+---+--------+--------------+--------------+
|id |Grouping|Group_by_list1|Group_by_list2|
+---+--------+--------------+--------------+
|1 |AA |[A, B, C, M] |[D, E, F, N] |
+---+--------+--------------+--------------+
Note: This solution gives the correct result only if the DataFrame has two rows.

Fill dataframe cells entry using dataframe column names and index

I am trying to fill a dataframe using the following approach:
Generate a dataframe of size m x n.
Column names for the dataframe are A to N and are read from a list passed to the method.
Define the index for the dataframe.
Fill the dataframe entries with column name + _ + index.
import numpy as np
import pandas as pd
from tabulate import tabulate
def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[],'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4, 2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate with the column names to get the following output ?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a series name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
or use np.add.outer
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
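For comparison, the same table can be built directly with a dict comprehension, which avoids relying on how apply or np.add.outer treat an Index. This is only a sketch; the helper name label_frame is hypothetical and does not appear in the question or the answer.

```python
import pandas as pd

def label_frame(index, columns):
    """Return a frame whose cells read '<column>_<index label>'."""
    return pd.DataFrame(
        # One list of labels per column, in index order.
        {col: [f"{col}_{idx}" for idx in index] for col in columns},
        index=index,
    )

df_out = label_frame(["RU19-24", "RU13-18", "RU7-12", "RU1-6"], ["A", "B"])
print(df_out)
```

The result can be passed to tabulate in the same way as the frame in generate_data.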

Most efficient way to transform pandas dataframe with relational data to probability linkage

Given is a typical pandas dataframe with "relational data"
|--------------|------------|------------|
| Column1 | Column2 | Column3 |
|--------------|------------|------------|
| A | 1 | C |
|--------------|------------|------------|
| B | 2 | C |
|--------------|------------|------------|
| A | 2 | C |
|--------------|------------|------------|
| A | 1 | C |
|--------------|------------|------------|
| ... | ... | ... |
|--------------|------------|------------|
I am trying to calculate the probabilities between all column values with length 2, meaning the tuple (A,1) --> 0.66, (A,2) --> 0.33, (B,2) --> 1, (2,B) --> 0.5 and so on.
I am expecting the result back in a list similar to:
[
[A,1,0.66],
[A,2,0.33],
[B,2,1],
[2,B,0.5],
...
]
Currently, my approach is really inefficient (even when using multiprocessing). Simplified, I am iterating over all possibilities without any Cython.
# iterating through all columns
for colname in colnames:
    # evaluating all other columns except the one under assessment
    for x in [x for x in colnames if not x==colname]:
        # through groupby we get their counts
        groups = df.groupby([colname,x]).size().reset_index(name='counts')
        # for each group we
        for index,row in groups.iterrows():
            # calculate their probability over the entire population
            # of the column and push it in the result list
            result.append([row[colname],row[x],(row["counts"]/df[x].count())])
What is the most efficient way to complete this transformation?
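One possible direction, offered as a sketch and not as a benchmarked answer, is to replace the row-by-row iterrows loop with vectorized groupby operations. It assumes the intended quantity is the share of the second value among rows that have the first value, which matches the example (A,1) --> 0.66; the loop above instead divides by the total row count of the column. The function name pair_probabilities is hypothetical, and the sample frame uses only the four rows visible in the question.

```python
import itertools
import pandas as pd

# Only the four rows visible in the question; the elided rows are not guessed.
df = pd.DataFrame({
    "Column1": ["A", "B", "A", "A"],
    "Column2": [1, 2, 2, 1],
    "Column3": ["C", "C", "C", "C"],
})

def pair_probabilities(frame):
    """Return [first_value, second_value, P(second | first)] for every ordered column pair."""
    result = []
    for left, right in itertools.permutations(frame.columns, 2):
        # Count each (left, right) value combination in one pass.
        counts = frame.groupby([left, right]).size()
        # Divide by the number of rows that share the same left value.
        probs = counts / counts.groupby(level=0).transform("sum")
        result.extend([a, b, p] for (a, b), p in probs.items())
    return result

result = pair_probabilities(df)
for entry in result:
    print(entry)
```

The outer loop still visits every ordered column pair, but the per-group work is done by pandas instead of Python-level row iteration.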
