Convert nested dict values in a pyspark/pandas dataframe to columns and rows - python-3.x

I have a pyspark dataframe with only one record. It contains an id field and a "value" field. The value field contains nested dicts like the example record shown in inputdf below. I would like to create a new dataframe like outputdf below, where the type column holds the keys from the nested dict in the value field of inputdf, and the value and active columns hold the corresponding values from the nested dicts. If it's easier, the dataframe could be converted to a pandas dataframe using .toPandas(). Does anyone have a slick way to do this?
inputdf:
id value
1 {"soda":{"value":2,"active":1},"jet":{"value":0,"active":1}}
outputdf:
type value active
soda 2 1
jet 0 1

Let us try the following; notice that I also include the id column:
yourdf = pd.DataFrame(df.value.tolist(),index=df.id).stack().apply(pd.Series).reset_index()
Out[13]:
id level_1 value active
0 1 soda 2 1
1 1 jet 0 1
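A fuller, self-contained sketch of the same idea (assuming the data is already in pandas, e.g. after inputdf.toPandas(), and renaming level_1 to type to match the desired outputdf):
import pandas as pd

# Sample record from the question, already in pandas.
df = pd.DataFrame({
    'id': [1],
    'value': [{"soda": {"value": 2, "active": 1}, "jet": {"value": 0, "active": 1}}],
})

outputdf = (
    pd.DataFrame(df.value.tolist(), index=df.id)  # one column per top-level key
      .stack()                                    # long format: (id, key) -> inner dict
      .apply(pd.Series)                           # expand inner dicts into value/active columns
      .reset_index()
      .rename(columns={'level_1': 'type'})
)
print(outputdf)
If the Spark value column is actually a JSON string rather than a map/struct, it would need to be parsed first (for example with json.loads after .toPandas(), or pyspark.sql.functions.from_json on the Spark side).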

Related

Find a value in column of lists in pandas dataframe

I have a dataframe with two columns, A and B. B is a column of lists and A is a string. I want to search for a value in column B and get the corresponding value in column A. For example:
category zones
0 category_1 [zn_1, zn_2]
1 category_2 [zn_3]
2 category_3 [zn_4]
3 category_4 [zn_5, zn_6]
If the input is 'zn_1', how can I get 'category_1' back as the response?
Use str.contains and filter the category values:
inputvalue='zn_1'
df[df.zones.str.contains(inputvalue)]['category']
# If you don't want an array
inputvalue='zn_1'
list(df[df.zones.str.contains(inputvalue)]['category'].values)[0]
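If zones holds actual Python lists, a per-row membership test is a robust alternative (a sketch with the question's data; it also avoids partial matches such as 'zn_1' matching a hypothetical 'zn_10'):
import pandas as pd

df = pd.DataFrame({
    'category': ['category_1', 'category_2', 'category_3', 'category_4'],
    'zones': [['zn_1', 'zn_2'], ['zn_3'], ['zn_4'], ['zn_5', 'zn_6']],
})

inputvalue = 'zn_1'
# Keep only the rows whose zones list contains the exact value, then take category.
matches = df.loc[df['zones'].apply(lambda zones: inputvalue in zones), 'category']
print(matches.iloc[0] if not matches.empty else None)  # -> category_1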

Pandas DataFrame won't pivot. Says duplicate indices

So basically I have 3 columns in my dataframe as follows:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 158143 entries, 0 to 203270
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   users          158143 non-null  int64
 1   dates          158143 non-null  datetime64[ns]
 2   medium_of_ans  158143 non-null  object
I want to reshape it so that each distinct medium_of_ans value becomes a separate column, the dates become the row index, and the users value for a particular answer medium on a particular date sits at the intersection of that row and column. In pandas this can be achieved by pivoting the dataframe, but my attempt:
df.pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')
throws this error:
ValueError: Index contains duplicate entries, cannot reshape
I'm not sure why, since a dataframe to be pivoted will naturally contain duplicate index entries; that is the whole point of pivoting it. Resetting the dataframe index as follows:
df.reset_index().pivot(columns= 'medium_of_ans', index = 'dates', values = 'users')
does not help either and error persists.
You have duplicates not just in the index (dates) but in the combination of index and column together, i.e. the same (dates, medium_of_ans) pair occurs more than once.
You can find these duplicates with something like this:
counts = df.groupby(['dates', 'medium_of_ans']).size().reset_index(name='n')
duplicates = counts[counts['n'] > 1]
If you want to combine the duplicates, for example by taking the mean of users for the cell, then you can use pivot_table.
df.pivot_table(columns='medium_of_ans', index='dates', values='users', aggfunc='mean')
Taking the mean is the default, but I have added the explicit parameter for clarity.
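A toy illustration of the difference (the data here is made up, not from the question):
import pandas as pd

df = pd.DataFrame({
    'dates': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02']),
    'medium_of_ans': ['email', 'email', 'phone'],
    'users': [10, 20, 5],
})

# pivot fails: the ('2021-01-01', 'email') cell would need to hold two values.
try:
    df.pivot(columns='medium_of_ans', index='dates', values='users')
except ValueError as err:
    print(err)  # Index contains duplicate entries, cannot reshape

# pivot_table aggregates the duplicates instead (mean of 10 and 20 -> 15.0).
print(df.pivot_table(columns='medium_of_ans', index='dates', values='users', aggfunc='mean'))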

Add prefix to all values in a column in the most efficient way

I have a dataframe with over 5M rows. I am concerned with just one column in this dataframe. Let's assume the dataframe is named df and the column in question is df['id']. An example of the dataframe is shown below:
df['id'] :
id
0 432000000
1 432000010
2 432000020
The column df['id'] is stored as a string.
I want to add a prefix to all the rows of a particular column in this dataframe. Below is the code I use:
for i in tqdm(range(0, len(df['id']))):
    df['id'][i] = 'ABC-1234-' + df['id'][i]
While the above code works, it shows an estimated 15 hours to complete. Is there a more efficient way to perform this task?
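One possible vectorized approach (a sketch, assuming df['id'] already holds strings) avoids the per-row Python loop entirely:
import pandas as pd

df = pd.DataFrame({'id': ['432000000', '432000010', '432000020']})

# String concatenation on the whole column is done in one vectorized pass.
df['id'] = 'ABC-1234-' + df['id']
print(df)
The concatenation happens inside pandas in a single pass rather than via per-row indexing and chained assignment, which is what makes the original loop so slow.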

Splitting dataframe based on multiple column values

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total number of distinct Type values is 2 (Phone and Computer) and the total number of distinct File values is 2 (1, 2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df into 4 dataframes based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two?
My Code:
import pandas as pd

data = {'ID': ['123', '122', '126'], 'Type': ['Phone', 'Computer', 'Computer'], 'File': [1, 2, 1]}
df=pd.DataFrame(data)
types=list(set(df['Type']))
total_splits=len(set(df['Type']))*len(set(df['File']))
cnt=1
for i in range(0, total_splits):
    for j in types:
        locals()["df" + str(cnt)] = df[df['Type'] == j]
        cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!
You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type', 'File']):
    type, file = k
    # do whatever you want here
    # d is the dataframe corresponding to (type, file)
    dfs[k] = d
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
Then, for example:
df[df['mask'].eq(3)]
gives you the first dataframe you want, i.e. Type == 'Phone' and File == 1; the other mask values (0, 1, 2) give the remaining combinations.
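Putting the groupby idea together with the sample data from the question (a sketch; each dictionary key is a (Type, File) tuple):
import pandas as pd

data = {'ID': ['123', '122', '126'], 'Type': ['Phone', 'Computer', 'Computer'], 'File': [1, 2, 1]}
df = pd.DataFrame(data)

# One sub-dataframe per (Type, File) combination that actually occurs in the data.
dfs = {key: sub for key, sub in df.groupby(['Type', 'File'])}

print(dfs[('Phone', 1)])     # rows with Type == 'Phone' and File == 1
print(dfs[('Computer', 1)])  # rows with Type == 'Computer' and File == 1
Combinations that are absent from the data (e.g. Phone with File 2 here) simply have no entry in dfs.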

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
from pyspark.sql import Row

vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: the ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map onto your solution.
Using a window function over a single partition, we can generate a sequential row_number() for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need then is to join those two dataframes on the row number, adding the new column:
df1.join(df2,
    on=col('df1.row_num').between(col('min_row_num'), col('max_row_num'))
).select('df1.*', 'df2.label')
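A concrete sketch of that idea with the question's column names (the row_num column, the rules dataframe and the label boundaries are illustrative choices):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 10 rows with ID and an empty VAL column, as in the question.
df = spark.createDataFrame([(i, None) for i in range(1, 11)], 'ID int, VAL string')

# Sequential row numbers over a single (global) window.
df1 = df.withColumn('row_num', F.row_number().over(Window.orderBy('ID')))

# The "rules" dataframe: which row-number ranges receive which label.
rules = spark.createDataFrame(
    [(1, 3, 'lets'), (4, 6, 'bucket'), (7, 10, 'this')],
    ['min_row_num', 'max_row_num', 'label'],
)

result = (
    df1.join(rules, on=df1.row_num.between(rules.min_row_num, rules.max_row_num))
       .select('ID', F.col('label').alias('VAL'))
)
result.show()
For large dataframes a single-partition window is a bottleneck, so this pattern is best kept to small data like the 10-row example here.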
