I have a dataframe that contains lists of items, like below.
| B | A |
+----------+------------+
|0.07 | [a,b,c] |
|0.009 | [d,e,f] |
|0.008 | [g,h,j] |
+----------+------------+
The goal is to convert the lists in column A to tuples, like below:
| B | A |
+----------+------------+
|0.07 | (a,b,c) |
|0.009 | (d,e,f) |
|0.008 | (g,h,j) |
+----------+------------+
So how can this be achieved using pandas in Python?
Good day!
As @Shubham Sharma mentions in his comment, you can solve your problem with the following code:
import pandas as pd

columnas = ["B", "A"]
data = [
    [0.07, ["a", "b", "c"]],
    [0.009, ["d", "e", "f"]],
    [0.008, ["g", "h", "j"]],
]
df = pd.DataFrame(data, columns=columnas)
df["A"] = df["A"].map(tuple)
print(df)
Result:
B A
0 0.070 (a, b, c)
1 0.009 (d, e, f)
2 0.008 (g, h, j)
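For what it's worth, apply works just as well here: both map(tuple) and apply(tuple) simply call tuple on every list in the column.
df["A"] = df["A"].apply(tuple)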
Why are all dots stripped from strings consisting of numbers and dots, only when engine='python', even though dtype is defined?
The unexpected behaviour occurs when processing a CSV file where:
the strings consist solely of numbers and single dots spread throughout the string
the read_csv parameters engine='python' and thousands='.' are set
Sample of testcode:
import pandas as pd # version 1.5.2
import io
data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""
df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
df1 output: column a is as desired and expected
| | a | b | c |
|---:|:---------------|------:|-----:|
| 0 | 0000.7995 | 16000 | 0 |
| 1 | 3.03.001.00514 | 0 | 4000 |
| 2 | 4923.600.041 | 23000 | 131 |
df2 output: column a is not as expected
| | a | b | c |
|---:|------------:|------:|-----:|
| 0 | 00007995 | 16000 | 0 |
| 1 | 30300100514 | 0 | 4000 |
| 2 | 4923600041 | 23000 | 131 |
Even though dtype={'a': str}, it seems that engine='python' handles it differently from engine='c'. dtype={'a': object} yields the same result.
I have spent quite some time getting to know all of the read_csv settings, and I can't see any other option I could set to alter this behaviour.
Is there anything I missed or is this behaviour 'normal'?
Looks like a bug (it wasn't reported, so I filed it). I was only able to come up with a clumsy workaround:
# read everything as strings, then strip the '.' thousands separators manually
df = pd.read_csv(io.StringIO(data), sep=';', dtype=str, engine='python')
int_columns = ['b', 'c']
df[int_columns] = df[int_columns].apply(lambda x: x.str.replace('.', '', regex=False)).astype(int)
| a              |     b |    c |
|:---------------|------:|-----:|
| 0000.7995      | 16000 |    0 |
| 3.03.001.00514 |     0 | 4000 |
| 4923.600.041   | 23000 |  131 |
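If you prefer to keep the conversion inside read_csv, another possible workaround (a sketch, using the same sample data; strip_thousands is just an illustrative helper name) is to drop the thousands option entirely and strip the separators per column with converters, which are applied only to the listed columns:
def strip_thousands(value):
    # remove the '.' thousands separators and convert to int (assumes the column holds such strings)
    return int(value.replace('.', ''))

df3 = pd.read_csv(io.StringIO(data), sep=';', engine='python',
                  dtype={'a': str},
                  converters={'b': strip_thousands, 'c': strip_thousands})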
I have a dataframe df containing tuples like below in Column A
| ID | A |
+----------+------------+
|0 |(1, [a]) |
|1 |(2, [a,b]) |
|2 |(3, [c,a,b])|
+----------+------------+
I want to split the tuples in the above df and want to see my new dataframe df like below.
| ID | A | B |
+----------+------------+----------+
|0 | 1 | [a] |
|1 | 2 | [a,b] |
|2 | 3 | [c,a,b] |
+----------+------------+----------+
So, how can I split the tuple in the above dataframe df?
Use the str accessor; make sure to create B first, so you don't lose the data in A:
df['B'] = df['A'].str[1]
df['A'] = df['A'].str[0]
Alternative:
df[['A', 'B']] = pd.DataFrame(df['A'].to_list(), columns=['A', 'B'])
output:
ID A B
0 0 1 [a]
1 1 2 [a, b]
2 2 3 [c, a, b]
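For reference, a minimal construction of the example dataframe (assuming, as in the question, that each tuple holds an int plus a list of strings) that both snippets above work on:
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'A': [(1, ['a']), (2, ['a', 'b']), (3, ['c', 'a', 'b'])]})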
I have a dataframe with multiple columns as such:
| ID | Grouping | Field_1 | Field_2 | Field_3 | Field_4 |
|----|----------|---------|---------|---------|---------|
| 1 | AA | A | B | C | M |
| 2 | AA | D | E | F | N |
I want to create 2 new columns and store a list of the existing columns in the new fields, using a group by on an existing field, such that my new dataframe would look like this:
| ID | Grouping | Group_by_list1 | Group_by_list2 |
|----|----------|----------------|----------------|
| 1 | AA | [A,B,C,M] | [D,E,F,N] |
Does PySpark have a way of handling this kind of wrangling with a dataframe to create this expected result?
I've added inline comments; check the code below.
from pyspark.sql import functions as F

(df
 # Build an array from the required columns.
 .select("id", "Grouping",
         F.array("Field_1", "Field_2", "Field_3", "Field_4").alias("grouping_list"))
 # Group on the Grouping column.
 .groupBy("Grouping")
 # Take the first id, the first grouping_list and the last grouping_list per group.
 .agg(F.first("id").alias("id"),
      F.first("grouping_list").alias("Group_by_list1"),
      F.last("grouping_list").alias("Group_by_list2"))
 .select("id", "Grouping", "Group_by_list1", "Group_by_list2")
 .show(truncate=False))
+---+--------+--------------+--------------+
|id |Grouping|Group_by_list1|Group_by_list2|
+---+--------+--------------+--------------+
|1 |AA |[A, B, C, M] |[D, E, F, N] |
+---+--------+--------------+--------------+
Note: this solution gives the correct result only if each group in the DataFrame has exactly two rows.
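If a group could hold more than two rows, one possible generalisation (a sketch, not what the answer above does; Group_by_lists is just an illustrative column name) is to collect every per-row array into a single list column per group with collect_list, which does not depend on the number of rows:
from pyspark.sql import functions as F

(df
 .select("id", "Grouping",
         F.array("Field_1", "Field_2", "Field_3", "Field_4").alias("grouping_list"))
 .groupBy("Grouping")
 # collect_list keeps one entry per input row, regardless of how many rows the group has
 .agg(F.first("id").alias("id"),
      F.collect_list("grouping_list").alias("Group_by_lists"))
 .show(truncate=False))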
I am trying to fill a dataframe using the following approach:
I generate an m x n dataframe
the column names for the dataframe are A to N and are read from a list passed to the method
define the index for the dataframe
fill the dataframe entries with column name + _ + index
import numpy as np
import pandas as pd
from tabulate import tabulate
def generate_data(N_rows, N_cols,names_df =[]):
if N_rows == 4:
d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16 ,columns=names_df)
else:
print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
df = None
# df.loc[[],'Z'] = 3
return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4,2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate it with the column names to get the following output?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a series name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
Or use np.add.outer:
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
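Plugged back into the question's generate_data (a sketch; only the fill step changes, the rest is the question's own code):
import numpy as np
import pandas as pd
from tabulate import tabulate

def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows != 4:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        return None
    d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
    df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    # fill every cell with "<column name>_<index label>"
    df = df.apply(lambda x: x.name + '_' + x.index)
    return tabulate(df, headers='keys', tablefmt='psql')

print(generate_data(4, 2, ['A', 'B']))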
Given a typical pandas dataframe with "relational data":
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| A       | 1       | C       |
| B       | 2       | C       |
| A       | 2       | C       |
| A       | 1       | C       |
| ...     | ...     | ...     |
I am trying to calculate the probabilities between all pairs of column values, meaning the tuple (A,1) --> 0.66, (A,2) --> 0.33, (B,2) --> 1, (2,B) --> 0.5, and so on.
I am expecting the result back in a list similar to:
[
[A,1,0.66],
[A,2,0.33],
[B,2,1],
[2,B,0.5],
...
]
Currently, my approach is really inefficient (even when using multiprocessing). Simplified, I am iterating over all possibilities without any Cython:
# iterating through all columns
for colname in colnames:
# evaluating all other columns except the one under assessment
for x in [x for x in colnames if not x==colname]:
# through groupby we get their counts
groups = df.groupby([colname,x]).size().reset_index(name='counts')
# for each group we
for index,row in groups.iterrows():
# calculate their probability over the entire population
# of the column and push it in the result list
result.append([row[colname],row[x],(row["counts"]/df[x].count())])
What is the most efficient way to complete this transformation?
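One way to cut out the row-level iterrows loop, keeping the same probability definition as above (pair count divided by the non-null count of the second column), would be to work directly on the groupby result; a sketch, assuming df and colnames are as in the snippet above:
result = []
colnames = df.columns
for colname in colnames:
    for x in (c for c in colnames if c != colname):
        # one Series of pair counts, normalised in a single vectorised division
        probs = df.groupby([colname, x]).size() / df[x].count()
        result.extend([a, b, p] for (a, b), p in probs.items())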