I have a dataframe df whose column A contains tuples, as below:
| ID | A |
+----------+------------+
|0 |(1, [a]) |
|1 |(2, [a,b]) |
|2 |(3, [c,a,b])|
+----------+------------+
I want to split those tuples and end up with a dataframe like below:
| ID | A | B |
+----------+------------+----------+
|0 | 1 | [a] |
|1 | 2 | [a,b] |
|2 | 3 | [c,a,b] |
+----------+------------+----------+
So, how can I split the tuples in the above dataframe df?
Use the str accessor; make sure to create B first so you don't lose the data in A:
df['B'] = df['A'].str[1]
df['A'] = df['A'].str[0]
Alternative:
df[['A', 'B']] = pd.DataFrame(df['A'].to_list(), columns=['A', 'B'])
output:
ID A B
0 0 1 [a]
1 1 2 [a, b]
2 2 3 [c, a, b]
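For reference, a minimal self-contained sketch that builds the sample frame and applies the str-accessor variant (the cells of A are assumed to be ordinary Python tuples holding an int and a list):
import pandas as pd

# sample data: column A holds (int, list) tuples
df = pd.DataFrame({'ID': [0, 1, 2],
                   'A': [(1, ['a']), (2, ['a', 'b']), (3, ['c', 'a', 'b'])]})

# create B first so the original tuples in A are not overwritten
df['B'] = df['A'].str[1]
df['A'] = df['A'].str[0]

# alternative (run on a fresh copy): expand the tuples in one step;
# index=df.index keeps the new columns aligned with the original index
# df[['A', 'B']] = pd.DataFrame(df['A'].to_list(), index=df.index, columns=['A', 'B'])

print(df)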
I have a dataframe that contains lists of items in column A, like below.
| B | A |
+----------+------------+
|0.07 | [a,b,c] |
|0.009 | [d,e,f] |
|0.008 | [g,h,j] |
+----------+------------+
The goal is to convert the lists in column A to tuples, like below:
| B | A |
+----------+------------+
|0.07 | (a,b,c) |
|0.009 | (d,e,f) |
|0.008 | (g,h,j) |
+----------+------------+
So how can this be achieved using pandas in Python?
Good day!
As @Shubham Sharma mentions in his comment, you can solve this with df['A'].map(tuple), as in the following code:
import pandas as pd

columnas = ["B", "A"]
data = [
    [0.07, ["a", "b", "c"]],
    [0.009, ["d", "e", "f"]],
    [0.008, ["g", "h", "j"]],
]
df = pd.DataFrame(data, columns=columnas)

# map(tuple) calls tuple() on every list in column A
df["A"] = df["A"].map(tuple)
print(df)
Result:
B A
0 0.070 (a, b, c)
1 0.009 (d, e, f)
2 0.008 (g, h, j)
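An equivalent alternative, if you prefer apply over map (both simply call tuple on every list in column A):
df["A"] = df["A"].apply(tuple)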
I have a dataframe with multiple columns as such:
| ID | Grouping | Field_1 | Field_2 | Field_3 | Field_4 |
|----|----------|---------|---------|---------|---------|
| 1 | AA | A | B | C | M |
| 2 | AA | D | E | F | N |
I want to create 2 new columns and store a list of existing column values in the new fields, using a group by on an existing field, such that my new dataframe would look like this:
| ID | Grouping | Group_by_list1 | Group_by_list2 |
|----|----------|----------------|----------------|
| 1 | AA | [A,B,C,M] | [D,E,F,N] |
Does PySpark have a way of handling this kind of wrangling with a dataframe to produce this expected result?
I added inline comments; check the code below.
from pyspark.sql import functions as F

(df
 # create an array out of the required columns
 .select(F.col("id"), F.col("Grouping"),
         F.array("Field_1", "Field_2", "Field_3", "Field_4").alias("grouping_list"))
 # group on Grouping; keep the first id, and the first/last array per group
 .groupBy("Grouping")
 .agg(F.first("id").alias("id"),
      F.first("grouping_list").alias("Group_by_list1"),
      F.last("grouping_list").alias("Group_by_list2"))
 # select all columns in the expected order
 .select("id", "Grouping", "Group_by_list1", "Group_by_list2")
 .show(truncate=False))
+---+--------+--------------+--------------+
|id |Grouping|Group_by_list1|Group_by_list2|
+---+--------+--------------+--------------+
|1 |AA |[A, B, C, M] |[D, E, F, N] |
+---+--------+--------------+--------------+
Note: this solution gives the correct result only if the DataFrame has exactly two rows per Grouping value (first and last then pick up both arrays).
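For reference, a minimal sketch to build the sample DataFrame used above (column names taken from the question; the SparkSession variable spark is an assumption):
data = [(1, "AA", "A", "B", "C", "M"),
        (2, "AA", "D", "E", "F", "N")]
df = spark.createDataFrame(data, ["id", "Grouping", "Field_1", "Field_2", "Field_3", "Field_4"])
df.show()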
I have a simple use case: two dataframes, df1 and df2, and I am looking for an efficient way to join them.
df1: Contains my main dataframe (billions of records)
+--------+-----------+--------------+
|doc_id |doc_name |doc_type_id |
+--------+-----------+--------------+
| 1 |doc_name_1 |[1,4] |
| 2 |doc_name_2 |[3,2,6] |
+--------+-----------+--------------+
df2: Contains the labels of the doc types (40,000 records); as it's small, I am broadcasting it.
+------------+----------------+
|doc_type_id |doc_type_name |
+------------+----------------+
| 1 |doc_type_1 |
| 2 |doc_type_2 |
| 3 |doc_type_3 |
| 4 |doc_type_4 |
| 5 |doc_type_5 |
| 6          |doc_type_6      |
+------------+----------------+
I would like to join these two dataframes so that the result looks something like this:
+--------+------------+--------------+----------------------------------------+
|doc_id |doc_name |doc_type_id |doc_type_name |
+--------+------------+--------------+----------------------------------------+
| 1 |doc_name_1 |[1,4] |["doc_type_1","doc_type_4"] |
| 2 |doc_name_2 |[3,2,6] |["doc_type_3","doc_type_2","doc_type_6"]|
+--------+------------+--------------+----------------------------------------+
Thanks
We can use array_contains + groupBy + collect_list functions for this case.
Example:
val df1=Seq(("1","doc_name_1",Seq(1,4)),("2","doc_name_2",Seq(3,2,6))).toDF("doc_id","doc_name","doc_type_id")
val df2=Seq(("1","doc_type_1"),("2","doc_type_2"),("3","doc_type_3"),("4","doc_type_4"),("5","doc_type_5"),("6","doc_type_6")).toDF("doc_type_id","doc_type_name")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df1.createOrReplaceTempView("tbl")
df2.createOrReplaceTempView("tbl2")
spark.sql("select a.doc_id,a.doc_name,a.doc_type_id,collect_list(b.doc_type_name) doc_type_name from tbl a join tbl2 b on array_contains(a.doc_type_id,int(b.doc_type_id)) = TRUE group by a.doc_id,a.doc_name,a.doc_type_id").show(false)
//+------+----------+-----------+------------------------------------+
//|doc_id|doc_name |doc_type_id|doc_type_name |
//+------+----------+-----------+------------------------------------+
//|2 |doc_name_2|[3, 2, 6] |[doc_type_2, doc_type_3, doc_type_6]|
//|1 |doc_name_1|[1, 4] |[doc_type_1, doc_type_4] |
//+------+----------+-----------+------------------------------------+
Another way to achieve this is by using explode + join + collect_list:
val df3=df1.withColumn("arr",explode(col("doc_type_id")))
df3.join(df2,df2.col("doc_type_id") === df3.col("arr"),"inner").
groupBy(df3.col("doc_id"),df3.col("doc_type_id"),df3.col("doc_name")).
agg(collect_list(df2.col("doc_type_name")).alias("doc_type_name")).
show(false)
//+------+-----------+----------+------------------------------------+
//|doc_id|doc_type_id|doc_name |doc_type_name |
//+------+-----------+----------+------------------------------------+
//|1 |[1, 4] |doc_name_1|[doc_type_1, doc_type_4] |
//|2 |[3, 2, 6] |doc_name_2|[doc_type_2, doc_type_3, doc_type_6]|
//+------+-----------+----------+------------------------------------+
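Since most of this page uses PySpark, here is a hedged PySpark sketch of the same explode + join + collect_list idea, assuming df1 and df2 exist as PySpark DataFrames with the columns shown in the question; a broadcast hint is added because df2 is small, as the question notes:
from pyspark.sql import functions as F

# explode the doc_type_id array to one row per (doc, type), join on the type id,
# then collect the matching names back into a list per document
df3 = df1.withColumn("arr", F.explode("doc_type_id"))
(df3.join(F.broadcast(df2), df3["arr"] == df2["doc_type_id"], "inner")
    .groupBy(df3["doc_id"], df3["doc_type_id"], df3["doc_name"])
    .agg(F.collect_list(df2["doc_type_name"]).alias("doc_type_name"))
    .show(truncate=False))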
Given a typical pandas dataframe with "relational" data:
| Column1 | Column2 | Column3 |
|---------|---------|---------|
| A       | 1       | C       |
| B       | 2       | C       |
| A       | 2       | C       |
| A       | 1       | C       |
| ...     | ...     | ...     |
I am trying to calculate the probability of every ordered pair of values drawn from two different columns, e.g. (A,1) --> 0.66, (A,2) --> 0.33, (B,2) --> 1, (2,B) --> 0.5, and so on.
I am expecting the result back in a list similar to:
[
[A,1,0.66],
[A,2,0.33],
[B,2,1],
[2,B,0.5],
...
]
Currently, my approach is really inefficient (even with multiprocessing). Simplified, I am iterating over all possibilities without any Cython:
colnames = df.columns
result = []

# iterating through all columns
for colname in colnames:
    # evaluating all other columns except the one under assessment
    for x in [x for x in colnames if not x == colname]:
        # through groupby we get the counts of each value pair
        groups = df.groupby([colname, x]).size().reset_index(name='counts')
        # for each group, calculate its probability over the entire
        # population of the column and push it into the result list
        for index, row in groups.iterrows():
            result.append([row[colname], row[x], row["counts"] / df[x].count()])
What is the most efficient way to complete this transformation?
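For illustration only, here is a compact restatement of the loop above with the same grouping and the same normalisation by df[x].count(); it drops the row-level iterrows but is not claimed to be the most efficient solution:
from itertools import permutations

result = []
for col_a, col_b in permutations(df.columns, 2):
    # counts of each (value_a, value_b) pair, normalised as in the original loop
    probs = df.groupby([col_a, col_b]).size() / df[col_b].count()
    result.extend([a, b, p] for (a, b), p in probs.items())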
I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace the null values only in the first 2 columns, "a" and "b":
a | b | c    |
1 | 2 | 4    |
0 | 0 | null |
0 | 3 | 4    |
Here is the code to create sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:2]).fillna(0)
df.fillna(0, subset=['a', 'b'])
There is a parameter named subset to choose the columns, available as long as your Spark version is at least 1.3.1.
Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
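A quick usage sketch against the sample df2 built above (both forms should leave column c untouched):
df2.fillna(0, subset=["a", "b"]).show()
df2.fillna({"a": 0, "b": 0}).show()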