Pivot tables in Python have different formatting compared to Excel

Let's say we have a dataframe that looks like this:
I want to use pandas pivot table to format it in this way:
However, df.pivot_table(index='Groupings', columns='columns', values=['value1','value2']) gives me an output that looks like this:
Does anyone know how to solve this problem?
You can use the following code to create the df:
import pandas as pd

data = {
    "Groupings": ["item1", "item1", "item1", "item2", "item2", "item3", "item3"],
    "columns": ["A", "B", "C", "A", "B", "C", "D"],
    "value1": [1, 2, 3, 4, 5, 6, 7],
    "value2": [11, 22, 33, 44, 55, 66, 77],
}
df = pd.DataFrame(data)
df
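Since the target layout screenshot is not shown here, the following is only a sketch, assuming the Excel-style layout you want puts the letters A-D on the top column level with value1/value2 nested under each of them. Swapping the column levels after the pivot and then sorting gives that grouping:

out = (df.pivot_table(index='Groupings', columns='columns', values=['value1', 'value2'])
         .swaplevel(0, 1, axis=1)   # move the A/B/C/D level above value1/value2
         .sort_index(axis=1))       # group value1/value2 under each letter
print(out)

If you instead need a flat, Excel-like single header row, out.columns = ['_'.join(c) for c in out.columns] collapses the two levels into names such as A_value1, A_value2.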

Related

concat_ws and coalesce in pyspark

In PySpark, I want to combine concat_ws and coalesce whilst using the list method. For example, I know this works:
from pyspark.sql.functions import concat_ws, col, coalesce, lit
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
#display(df)
df = df.withColumn("concat_ws2", concat_ws(':', coalesce('Type', lit("")), coalesce('Segment', lit(""))))
display(df)
But I want to be able to utilise the *[list] method so I don't have to list out all the columns within that bit of code, i.e. something like this instead:
from pyspark.sql.functions import concat_ws, col
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
list = ["Type", "Segment"]
df = df.withColumn("almost_desired_output", concat_ws(':', *list))
display(df)
However, as you can see, I want to coalesce NULL with a blank, but I'm not sure whether that's possible using the *[list] method, or whether I really have to list out all the columns.
This would work:
Iterate over the list of column names:
df=df.withColumn("almost_desired_output", concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
Output:
Type    Segment    almost_desired_output
A       B          A:B
C       null       C:
null    D          :D
Or, use na.fill - it fills the null values across all columns of the DataFrame (but this changes the actual columns, which may break some use cases):
df.na.fill("").withColumn("almost_desired_output", concat_ws(':', *list))
Or, use selectExpr (again, this changes the actual columns, which may break some use cases):
list = ["Type", "Segment"] # or just use df.schema.names
list2 = ["coalesce(Type, '') as Type", "coalesce(Segment, '') as Segment"]
df = df.selectExpr(list2).withColumn("almost_desired_output", concat_ws(':', *list))
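For context on why the coalesce is needed at all: concat_ws simply skips null inputs instead of keeping an empty slot, so without it the row ('C', null) would come out as 'C' rather than 'C:'. A quick way to see the difference, reusing the df above:

from pyspark.sql.functions import concat_ws, coalesce, lit

df.select(
    concat_ws(':', 'Type', 'Segment').alias('nulls_skipped'),  # row ('C', null) -> 'C'
    concat_ws(':', coalesce('Type', lit('')), coalesce('Segment', lit(''))).alias('nulls_kept'),  # -> 'C:'
).show()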

Convert pandas MultiIndex columns to uppercase

I would like to replace pandas multi index columns with uppercase names. With a normal (1D/level) index, I would do something like
df.columns = [c.upper() for c in df.columns]
When this is done on a DataFrame with a pd.MultiIndex, I get the following error:
AttributeError: 'tuple' object has no attribute 'upper'
How would I apply the same logic to a pandas multi index? Example code is below.
import pandas as pd
import numpy as np

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

arrays_upper = [
    ["BAR", "BAR", "BAZ", "BAZ", "FOO", "FOO", "QUX", "QUX"],
    ["ONE", "TWO", "ONE", "TWO", "ONE", "TWO", "ONE", "TWO"],
]
tuples_upper = list(zip(*arrays_upper))
index_upper = pd.MultiIndex.from_tuples(tuples_upper, names=["first", "second"])
df_upper = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index_upper)

print(f'Have: {df.columns}')
print(f'Want: {df_upper.columns}')
You can convert the MultiIndex to a DataFrame, uppercase the values there, and then convert it back to a MultiIndex:
df.columns = pd.MultiIndex.from_frame(df.columns.to_frame().applymap(str.upper))
print(df)
first        BAR                 BAZ                 FOO                 QUX
second       ONE       TWO       ONE       TWO       ONE       TWO       ONE       TWO
A      -0.374874  0.049597 -1.930723 -0.279234  0.235430  0.351351 -0.263074 -0.068096
B       0.040872  0.969948 -0.048848 -0.610735 -0.949685  0.336952 -0.012458 -0.258237
C       0.932494 -1.655863  0.900461  0.403524 -0.123720  0.207627 -0.372031 -0.049706
Or, following your loop idea:
df.columns = pd.MultiIndex.from_tuples([tuple(map(str.upper, c)) for c in df.columns])
Use set_levels:
df.columns = df.columns.set_levels([level.str.upper() for level in df.columns.levels])
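As a side note on the from_tuples variant above: rebuilding the MultiIndex from plain tuples drops the level names ('first' and 'second'). If you want to keep them, pass the original names back explicitly:

df.columns = pd.MultiIndex.from_tuples(
    [tuple(map(str.upper, c)) for c in df.columns],
    names=df.columns.names,  # preserve the original level names
)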

Print table in plotly dash with multiple lines in one cell

Currently I have a pandas dataframe:
df = pd.DataFrame({
    "date": ["20210613", "20210614", "20210615"],
    "user": ["A\nB", "C", "D"],
    "machine": [1, 0, 3]
})
I wonder if there is any way to print this table to my dash app like this:
Either printing it as pure text into a dcc.Textarea or using a dash_table.DataTable is fine.
Currently I still cannot figure out a good way to achieve this. Many thanks.
You can do it in a DataTable by using the style_cell property on the DataTable like this:
import dash
import dash_table
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["20210613", "20210614", "20210615"],
        "user": ["A\nB", "C", "D"],
        "machine": [1, 0, 3],
    }
)

app = dash.Dash(__name__)
app.layout = dash_table.DataTable(
    id="table",
    columns=[{"name": i, "id": i} for i in df.columns],
    data=df.to_dict("records"),
    style_cell={"whiteSpace": "pre-line"},
)

if __name__ == "__main__":
    app.run_server(debug=True)
You can make the DataTable cells break when they encounter the \n character by setting the white-space CSS attribute for the cells in the table.
I found this answer on this thread.

Pandas: Find newly added data

I have two dumps of data, the old dump and the new dump.
The old dump looks like this:
The new dump contains the changed data plus any newly added data.
My goal is to find the newly added data.
A sample of the new dump looks like this:
For the actual implementation, I am querying the old dump from the database, and the new dump comes from weekly files received with any new additions or updates.
I am using the code below to find the new additions, but something is not right: when I go to find the duplicate ids, it gives an empty dataframe even though duplicates are present.
Below is my code:
cols_to_show=["id","fname","lname","title","yearjoined"]
prev_batch=pd.read_csv("sample1.csv",encoding='cp1252',sep='\t',error_bad_lines=False, warn_bad_lines=True) #olddump
current_batch=pd.read_csv("sample2",encoding='cp1252',sep='\t',error_bad_lines=False, warn_bad_lines=True) #newdump
prev_batch['version']='v0.0'
current_batch['version']='v1.0'
full_set = pd.concat([prev_batch,current_batch],ignore_index=True)
changes = full_set.drop_duplicates(subset=cols_to_show, keep='last')
dupe=changes[changes.duplicated('id',keep=False)] #gives empty dataframe
change_new = dupe[dupe["version"] == 'v1.0']
change_old = dupe[dupe["version"] == 'v0.0']
changes['duplicate']=changes["id"].isin(dupe["id"])
new_added = full_set.drop_duplicates(subset=cols_to_show, keep='first')
new_added['duplicate']=new_added["id"].isin(dupe["id"])
added_new = new_added[(new_added["duplicate"] == False) & (new_added["version"] == "v1.0")]
I am not able to understand why dupe is giving me an empty dataframe.
I am not sure if you are looking for an answer only for the code that you have provided, but another approach could be to use the pandas compare API to get the difference between the dataframes. I am showing an example below, taken from the pandas documentation.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
df2
diff_df = df.compare(df2)
After this, you can pull out the changed values from the other dataframe using the xs API:
diff_df.xs('other', level=1, axis=1)
For this case the result will be the following:
  col1  col3
0    c   NaN
2  NaN   4.0
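Note that compare expects the two dataframes to have identical labels and shape, which may not hold for an old database dump versus a weekly file that contains brand-new rows. If the goal is strictly "rows present in the new dump but not in the old one", a rough sketch (assuming id uniquely identifies a record in both dumps) is a left merge with an indicator:

# keep only the rows of the new dump whose id does not appear in the old dump
merged = current_batch.merge(prev_batch[['id']], on='id', how='left', indicator=True)
newly_added = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')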

Pandas groupby specific range

I am trying to use groupby to group a dataset by a specific range. In the following dataframe, I'd like to group by the max_speed values that are above 150 and count the number of items at each of those speeds.
An example dataset is as follows:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 250.0),
        ("bird", "Psittaciformes", 250.0),
        ("mammal", "Carnivora", 180.2),
        ("mammal", "Primates", 159.0),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
What I've tried is:
df[['max_speed'] > 150].groupby('max_speed').size()
Expected output:
max_speed  count
    180.2      1
    250.0      2
    159.0      1
How can I do this?
Thank you
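A hedged sketch of one way to get that count: filter the rows above 150 first, then group by max_speed and take the group sizes.

# count how many rows share each max_speed value above 150
counts = (df[df['max_speed'] > 150]
            .groupby('max_speed')
            .size()
            .reset_index(name='count'))
print(counts)

This returns one row per distinct speed above 150 with its count (sorted by speed ascending, so the row order differs slightly from the expected output shown above).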
