I have a Pandas dataframe created from CSV with the following headers:
podcast_name,user_name,description,image,ratings,category,itunes_link,rss,email,latest_date,listener_1,listener_2,listener_3,listener_4,listener_5,listener_6,listener_7,listener_8,listener_9,listener_10,listener_11,listener_12,listener_13,listener_14,listener_15,listener_16,listener_17,listener_18
This dataframe was loaded from several files and cleared of duplicates:
all_files = glob.glob(os.path.join("data/*.csv"))
df = pandas.concat((pandas.read_csv(f) for f in all_files))
df.drop_duplicates(keep=False, inplace=True)
Now i want to check and replace some values from category. For example i have keywords dict:
categories = {
"Comedy": ["Comedy Interviews", "Improv", "Stand-Up"],
"Fiction": ["Comedy Fiction", "Drama", "Science Fiction"]
}
So i want to check if value in category is equal to one of values from the list. For example i have line with Improv in caterogy column and i want to replace Improv with Comedy.
Honestly, I have no idea how to do this.
Create helper dictionary and replace:
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in categories.items() for k in oldv}
print (d)
{'Comedy Interviews': 'Comedy', 'Improv': 'Comedy',
'Stand-Up': 'Comedy', 'Comedy Fiction': 'Fiction',
'Drama': 'Fiction', 'Science Fiction': 'Fiction'}
df['category'] = df['category'].replace(d)
Related
I currently have a list and a Spark dataframe:
['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
I am having a tough time figuring out a way to create a new column in the dataframe that takes the first matching word from the tags column for each row and puts it in the newly created column for that row.
For example, lets say the first row in the tags column has only "murder" in it, I would want that to show in the new column. Then, if the next row had "boring", "silly" and "cult" in it, I would want it to show cult in the new column since it matches the list. If the last row in tags column had "revenge", "cult" in it, I would want it to only show revenge, since its the first word that matches the list.
from pyspark.sql import functions as F
df = spark.createDataFrame([('murder',), ('boring silly cult',), ('revenge cult',)], ['tags'])
mylist = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
pattern = '|'.join([f'({x})' for x in mylist])
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 0))
df.show()
# +-----------------+---------------+
# | tags|first_from_list|
# +-----------------+---------------+
# | murder| murder|
# |boring silly cult| cult|
# | revenge cult| revenge|
# +-----------------+---------------+
You could use a PySpark UDF (User Defined Function).
First, let's write a python function to find a first match between a list (in this case the list you provided) and a string, that is, the value of the tags column:
def find_first_match(tags):
first_match = ''
genres= ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
for tag in tags.split():
for genre in genres:
if tag==genre:
first_match=tag
return first_match
Then, we need to convert this function into a PySpark udf so that we can use it in combination with the .withColumn() operation:
find_first_matchUDF = udf(lambda z:find_first_match(z))
Now we can apply the udf function to generate a new column. Assuming df is the name of your DataFrame:
from pyspark.sql.functions import col
new_df = df.withColumn("first_match", find_first_matchUDF(col("tags")))
This approach only works if all tag in your tags column are separated by white spaces.
P.S
You can avoid the second step using annotation:
from pyspark.sql.functions import col
#udf(returnType=StringType())
def find_first_match(tags):
first_match = ''
genres= ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
for tag in tags.split():
for genre in genres:
if tag==genre:
first_match=tag
return first_match
new_df = df.withColumn("first_match", find_first_match(col("tags")))
I quite often write a function to return different dataframes based on the parameters I enter. Here's an example dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2018',
freq='M'), 10000)})
I then created a function to perform sub-totals for me like this:
def some_fun(DF1, agg_column, myList=[], *args):
y = pd.concat([
DF1.assign(**{x:'[Total]' for x in myList[i:]})\
.groupby(myList).agg(sumz = (agg_column,'sum')) for i in range(1,len(myList)+1)]).sort_index().unstack(0)
return y
I then write out lists that I'll pass as arguments to the function:
list_one = [pd.Grouper(key='Date',freq='A'),'Category','Product']
list_two = [pd.Grouper(key='Date',freq='A'),'Category','Sub-Category','Sub-Category-2']
list_three = [pd.Grouper(key='Date',freq='A'),'Sub-Category','Product']
I then have to run each list through my function creating new dataframes:
df1 = some_fun(df,'Units_Sold',list_one)
df2 = some_fun(df,'Dollars_Sold',list_two)
df3 = some_fun(df,'Units_Sold',list_three)
I then use a function to write each of these dataframes to an Excel worksheet. This is just an example - I perform this same exercise 10+ times.
My question - is there a better way to perform this task than to write out df1, df2, df3 with the function information applied? Should I be looking at using a dictionary or some other data type to do this my pythonically with a function?
A dictionary would be my first choice:
variations = ([('Units Sold', list_one), ('Dollars_Sold',list_two),
..., ('Title', some_list)])
df_variations = {}
for i, v in enumerate(variations):
name = v[0]
data = v[1]
df_variations[i] = some_fun(df, name, data)
You might further consider setting the keys to unique / helpful titles for the variations, that goes beyond something like 'Units Sold', which isn't unique in your case.
IIUC,
as Thomas has suggested we can use a dictionary to parse through your data, but with some minor modifications to your function, we can use the dictionary to hold all the required data then pass that through to your function.
the idea is to pass two types of keys, the list of columns and the arguments to your pd.Grouper call.
data_dict = {
"Units_Sold": {"key": "Date", "freq": "A"},
"Dollars_Sold": {"key": "Date", "freq": "A"},
"col_list_1": ["Category", "Product"],
"col_list_2": ["Category", "Sub-Category", "Sub-Category-2"],
"col_list_3": ["Sub-Category", "Product"],
}
def some_fun(dataframe, agg_col, dictionary,column_list, *args):
key = dictionary[agg_col]["key"]
frequency = dictionary[agg_col]["freq"]
myList = [pd.Grouper(key=key, freq=frequency), *dictionary[column_list]]
y = (
pd.concat(
[
dataframe.assign(**{x: "[Total]" for x in myList[i:]})
.groupby(myList)
.agg(sumz=(agg_col, "sum"))
for i in range(1, len(myList) + 1)
]
)
.sort_index()
.unstack(0)
)
return y
Test.
df1 = some_fun(df,'Units_Sold',data_dict,'col_list_3')
print(df1)
sumz
Date 2016-12-31 2017-12-31 2018-12-31
Sub-Category Product
X Product 1 18308 17839 18776
Product 2 18067 19309 18077
Product 3 17943 19121 17675
[Total] 54318 56269 54528
Y Product 1 20699 18593 18103
Product 2 18642 19712 17122
Product 3 17701 19263 20123
[Total] 57042 57568 55348
Z Product 1 19077 17401 19138
Product 2 17207 21434 18817
Product 3 18405 17300 17462
[Total] 54689 56135 55417
[Total] [Total] 166049 169972 165293
as you want to automate the writing of the 10x worksheets, we can again do that with a dictionary call over your function:
matches = {'Units_Sold': ['col_list_1','col_list_3'],
'Dollars_Sold' : ['col_list_2']}
then a simple for loop to write all the files to a single excel sheet, change this to match your required behavior.
writer = pd.ExcelWriter('finished_excel_file.xlsx')
for key,value in matches.items():
for items in value:
dataframe = some_fun(df,k,data_dict,items)
dataframe.to_excel(writer,f'{key}_{items}')
writer.save()
I'm working on PySpark. I have a data frame which I need to dump as a JSON file but the the JSON file should have the following format for example -
{"Column 1": [9202, 9202, 9202, ....], "Column 2": ["FEMALE", "No matching concept", "MALE", ....]}
So there should be 1 key for each column and the corresponding value should have a list of all the values in that column
I tried to convert this to a Pandas data frame and then convert to a dict before dumping it as a JSON and was successful in doing that but as the data volume is very I want to do it directly on the PySpark data frame
One way is to collect each column values as array before you write to JSON. Try this:
column_arrays = [collect_list(c).alias(c) for c in df.columns]
df2 = df.groupBy().agg(*column_arrays)
df2.coalesce(1).write.mode("overwrite").json("/path")
L = []
for j in range(0, len(df.columns)):
arr = []
for i in range(0, df.count()):
arr.append(df.collect()[i][j])
L.append(arr)
columns = df.columns
data_dict = dict(zip(columns, L))
print(data_dict)
I have a dataframe with column which contains two different column values and their name as follows:
How Do I transform it into separate columns?
So far, I tried Following:
use df[col].apply(pd.Series) - It didn't work since data in column is not in dictionary format.
Tried separating columns by a semi-colon (";") sign but It is not a good idea since the given dataframe might have n number of column based on response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
.str.replace("Date:", "")
.str.replace("Bucket:", "")
.str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
df_temptemp = df_temp[col].str.split(":", expand=True)
df_temp[col] = df_temptemp[1]
cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
So .. maybe like this ...
Setup the data frame
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Parse over the dataframe breaking apart by colon and semi-colon
ls = []
for index, row in df.iterrows():
splits = row['ClusterName'].split(';')
print(splits[0].split(':')[1],splits[1].split(':')[1])
ls.append([splits[0].split(':')[1],splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns =['Date', 'Bucket'])
We handle data from volunteers, that data is entered in to a form using ODK. When the data is downloaded the header (column names) row contains a lot of 'stuff' we don' t need. The pattern is as follows:
'Group1/most_common/G27'
I want to replace the column names (there can be up to 200) or create a copy of the DataFrame with column names that just contain the G-code (Gxxx). I think I got it.
What is the faster or better way to do this?
IS the output reliable in terms of sort order? As of now it appears that the results list is in the same order as the original list.
y = ['Group1/most common/G95', 'Group1/most common/G24', 'Group3/plastics/G132']
import re
r = []
for x in y:
m = re.findall(r'G\d+', x)
r.append(m)
# the comprehension below is to flatten it
# append.m gives me a list of lists (each list has one item)
results = [q for t in r for q in t]
print(results)
['G95', 'G24', 'G132']
The idea would be to iterate through the column names in the DataFrame (or a copy), delete what I don't need and replace (inplace=True).
Thanks for your input.
You can use str.extract:
df = pd.DataFrame(columns=['Group1/most common/G95',
'Group1/most common/G24',
'Group3/plastics/G132'])
print (df)
Empty DataFrame
Columns: [Group1/most common/G95, Group1/most common/G24, Group3/plastics/G132]
Index: []
df.columns = df.columns.str.extract('(G\d+)', expand=False)
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []
Another solution with rsplit and select last values with [-1]:
df.columns = df.columns.str.rsplit('/').str[-1]
print (df)
Empty DataFrame
Columns: [G95, G24, G132]
Index: []