From dict to pandas dataframe as rows - python-3.x

I am sure I must be missing something basic here. As far as I know, you can create a dataframe from a dict with pd.DataFrame.from_dict(), but I am not sure how to make the key-value pairs of the dict become rows of the dataframe.
For instance, given this example
d = {'a':1,'b':2}
the desired output would be:
col1 col2
0 a 1
1 b 2
I know that the index might be a problem, but that can be handled with a simple index = [0]

Duplicate of Convert Python dict into a dataframe.
Simple answer for Python 3:
import pandas as pd
d = {'a': 1, 'b': 2, 'c': 3}
df = pd.DataFrame(list(d.items()), columns=['col1', 'col2'])
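Since the question mentions from_dict, here is a minimal sketch of the same result with orient='index' (the columns= argument assumes pandas >= 0.23):
import pandas as pd
d = {'a': 1, 'b': 2}
df = (pd.DataFrame.from_dict(d, orient='index', columns=['col2'])
        .reset_index()
        .rename(columns={'index': 'col1'}))
#   col1  col2
# 0    a     1
# 1    b     2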

This code should help you.
# wrap each value in a list, build a one-row frame, then transpose
d = {k: [v] for k, v in d.items()}
pd.DataFrame(d).T.reset_index().rename(columns={'index': 'col1', 0: 'col2'})

Related

Finding common rows between two dataframes based on a column using pandas

I have two dataframes. I need to extract rows based on common values in column 'a'. However, instead of creating one merged data frame at the end, I want to retain the two data frames.
For example:
###Consider the following input
df1 = pd.DataFrame({'a':[0,1,1,2,3,4], 'b':['q','r','s','t','u','v'],'c':['a','b','c','d','e','f']})
df2 = pd.DataFrame({'a':[1,4,5,6], 'b':['qq','rr','ss','tt'],'c':[1,2,3,4]})
The expected output is:
###df1:
   a  b  c
0  1  r  b
1  1  s  c
2  4  v  f
###df2:
   a   b  c
0  1  qq  1
1  4  rr  2
How can I achieve this result? Any insights will be appreciated.
You can generalize this with NumPy's intersect1d:
import numpy as np
# values of 'a' that appear in both frames
intersection_arr = np.intersect1d(df1['a'], df2['a'])
df1 = df1.loc[df1['a'].isin(intersection_arr), :]
df2 = df2.loc[df2['a'].isin(intersection_arr), :]
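Note that the filtered frames keep their original row labels; to get the 0-based index shown in the expected output, reset it afterwards:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)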
For more than two dataframes:
import numpy as np
from functools import reduce
intersection_arr = reduce(np.intersect1d, (df1['a'], df2['a'], df3['a']))
df1 = df1.loc[df1['a'].isin(intersection_arr), :]
df2 = df2.loc[df2['a'].isin(intersection_arr), :]
df3 = df3.loc[df3['a'].isin(intersection_arr), :]
A plain pandas alternative with isin (note the second line filters df2 against the already-filtered df1):
df1 = df1[df1['a'].isin(df2['a'])].reset_index(drop=True)
df2 = df2[df2['a'].isin(df1['a'])].reset_index(drop=True)
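Another sketch, not from the original answers, that computes the shared keys with an inner merge on the deduplicated key column ('a' is the join key from the question):
# inner merge keeps only keys present in both frames
common = df1[['a']].merge(df2[['a']]).drop_duplicates()
df1 = df1.merge(common, on='a').reset_index(drop=True)
df2 = df2.merge(common, on='a').reset_index(drop=True)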

Drop columns in a pandas dataframe

col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
Given this dataframe:
id prim_numerator sec_Numerator tern_Numerator tern_Denominator final_denominator Result
1 12 23 45 54 56 Fail
The final output should contain only id and Result.
Using regex:
import re
pat = re.compile('|'.join(col_exclusions), flags=re.IGNORECASE)
final_cols = [c for c in df.columns if not pat.search(c)]
#out:
['id', 'Result']
print(df[final_cols])
   id Result
0   1   Fail
If you want to drop the matching columns instead:
df = df.drop([c for c in df.columns if pat.search(c)], axis=1)
Or the pure pandas approach, thanks to @Anky_91:
df.loc[:, ~df.columns.str.contains('|'.join(col_exclusions), case=False)]
You can be explicit and use del for columns that contain the suffixes in your input list:
for column in list(df.columns):  # iterate over a copy, since we delete from df
    if any(column.endswith(suffix) for suffix in col_exclusions):
        del df[column]
You can also use the following approach, where each column name is split on "_" and its last piece is matched against col_exclusions:
df.drop(columns=[i for i in df.columns if i.split("_")[-1] in col_exclusions], inplace=True)
print(df.head())
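For reference, a self-contained run of the regex approach, with the one-row frame transcribed from the question's table:
import re
import pandas as pd

col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
df = pd.DataFrame({'id': [1], 'prim_numerator': [12], 'sec_Numerator': [23],
                   'tern_Numerator': [45], 'tern_Denominator': [54],
                   'final_denominator': [56], 'Result': ['Fail']})

pat = re.compile('|'.join(col_exclusions), flags=re.IGNORECASE)
print(df[[c for c in df.columns if not pat.search(c)]])
#    id Result
# 0   1   Fail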

Using List Comprehension with Pandas Series and Dataframes

I have written the code below, which accepts a pandas series (dataframe column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v)
    return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series.
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
    df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?
It is possible, but you then need concat to rebuild the DataFrame, because the comprehension produces a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
But maybe you just need replace with a dictionary:
df = df.replace(repl_dict)
df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict,inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
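If you still prefer the comprehension form from the original function, here is a minimal sketch that rebuilds the frame from a dict comprehension instead of concat (regex=False is assumed, to keep the replacements literal):
import pandas as pd

def phrase_replace(repl_dict, str_series):
    # apply each literal replacement in turn
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v, regex=False)
    return str_series

df = pd.DataFrame({'key1': ['apple', 'banana'], 'key2': ['banana', 'apple']})
repl_dict = {'an': 'foo', 'pp': 'zz'}
df = pd.DataFrame({c: phrase_replace(repl_dict, df[c]) for c in df.columns})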
The whole point of pandas is to not use for loops... it is optimized to use its built-in methods for dataframes and series.

Pyspark dataframe: split json column values into multiple top-level columns

I have a json column which can contain any number of key:value pairs. I want to create new top-level columns for these key:value pairs.
For example, if I have this data
A B
"{\"C\":\"c\" , \"D\":\"d\"...}" b
This is the output that i want
B C D ...
b c d
There are a few similar questions about splitting a column into multiple columns, but none of them work in this case. Can anyone please help? Thanks in advance!
You are looking for org.apache.spark.sql.functions.from_json: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$#from_json(e:org.apache.spark.sql.Column,schema:String,options:java.util.Map[String,String]):org.apache.spark.sql.Column
Here's the python code commit related to SPARK-17699: https://github.com/apache/spark/commit/fe33121a53384811a8e094ab6c05dc85b7c7ca87
Sample Usage from commit:
>>> from pyspark.sql.functions import from_json
>>> from pyspark.sql.types import *
>>> data = [(1, '''{"a": 1}''')]
>>> schema = StructType([StructField("a", IntegerType())])
>>> df = spark.createDataFrame(data, ("key", "value"))
>>> df.select(from_json(df.value, schema).alias("json")).collect()
[Row(json=Row(a=1))]
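To promote the parsed fields to top-level columns, as the question asks, you can expand the struct with a "json.*" selection; here is a sketch with the schema and the A/B column names assumed from the question's example:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("C", StringType()), StructField("D", StringType())])
df = spark.createDataFrame([('{"C": "c", "D": "d"}', 'b')], ("A", "B"))
# parse the json string into a struct, then flatten its fields
df.select("B", from_json(col("A"), schema).alias("json")).select("B", "json.*").show()
# +---+---+---+
# |  B|  C|  D|
# +---+---+---+
# |  b|  c|  d|
# +---+---+---+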

How to modify a column value in a row of a spark dataframe?

I am working with a data frame with the following structure:
Here I need to modify each record so that if a column is listed in post_event_list, that column is populated with the corresponding post_ column value. So in the above example, for both records, I need to populate col4 and col5 with the post_col4 and post_col5 values. Can someone please help me do this in pyspark?
Maybe this is what you want, in PySpark 2. Suppose df is the DataFrame:
row = df.rdd.first()
d = row.asDict()
d['col4'] = d['post_col4']
new_row = pyspark.sql.types.Row(**d)
Now we have a new Row object; putting this logic into a map function lets you change every row of df.
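A sketch of that idea end-to-end (assuming post_event_list is a string that names the affected columns, as in the question):
from pyspark.sql import Row

def fix_row(row):
    d = row.asDict()
    # copy the post_ value over whenever the column is listed
    for c in ('col4', 'col5'):
        if c in d['post_event_list']:
            d[c] = d['post_' + c]
    return Row(**d)

new_df = df.rdd.map(fix_row).toDF()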
You can use when/otherwise from pyspark.sql.functions. Something like:
import pyspark.sql.functions as sf
from pyspark.sql.types import BooleanType
contains_col4_udf = sf.udf(lambda x: 'col4' in x, BooleanType())
df.select(sf.when(contains_col4_udf('post_event_list'), sf.col('post_col4')).otherwise(sf.col('col4')).alias('col4'))
Here is the doc: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.otherwise
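If post_event_list is an array column rather than a string, the UDF can be avoided entirely with the built-in array_contains (a sketch under that assumption):
import pyspark.sql.functions as sf

# replace col4 with post_col4 only for rows whose array lists 'col4'
df = df.withColumn(
    'col4',
    sf.when(sf.array_contains('post_event_list', 'col4'), sf.col('post_col4'))
      .otherwise(sf.col('col4'))
)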
