One Hot Encoding a composite field - apache-spark

I want to transform multiple columns with the same categorical values using a OneHotEncoder. I created a composite field and tried to use OneHotEncoder on it as below (items 1-3 are from the same list of items):
import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer, OneHotEncoder
def myConcat(*cols):
    return F.concat(*[F.coalesce(c, F.lit("*")) for c in cols])
df = df.withColumn("basket", myConcat("item1", "item2", "item3"))
indexer = StringIndexer(inputCol="basket", outputCol="basketIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="basketIndex", outputCol="basketVec")
encoded = encoder.transform(indexed)
I am getting an out-of-memory error.
Does this approach work? How do I one-hot encode a composite field, or multiple columns whose categorical values come from the same list?

If you have an array of categorical values, why not try CountVectorizer:
import pyspark.sql.functions as F
from pyspark.ml.feature import CountVectorizer
# CountVectorizer expects an array-of-strings column, so build the basket
# as an array of items rather than a concatenated string
df = df.withColumn("basket", F.array("item1", "item2", "item3"))
vectorizer = CountVectorizer(inputCol="basket", outputCol="basketVec")
model = vectorizer.fit(df)
vectorized = model.transform(df)
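To see which vector slot corresponds to which item, the fitted model's vocabulary can be inspected (a small sketch continuing from the code above):
print(model.vocabulary)  # element i of basketVec counts occurrences of model.vocabulary[i]
vectorized.select("basket", "basketVec").show(truncate=False)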

Note: I can't comment yet (I'm a new user).
What is the cardinality of your "item1", "item2" and "item3" columns?
More specifically, what values does the following print?
k1 = df.select("item1").distinct().count()
k2 = df.select("item2").distinct().count()
k3 = df.select("item3").distinct().count()
k = k1 * k2 * k3
print(k1, k2, k3, k)
One-hot encoding the composite field basically creates a very sparse matrix with the same number of rows as your original dataframe and k additional columns, where k is the product of the three cardinalities above.
Therefore, if those three numbers are large, you get an out-of-memory error.
The only solutions are to:
(1) increase your memory, or
(2) introduce a hierarchy among the categories and use the higher-level categories to limit k (see the sketch below).
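To make option (2) concrete, here is a minimal sketch of collapsing fine-grained items into a handful of coarse groups before indexing and encoding. The group names, item values, and column names (item1_group, groupIndex, groupVec) are placeholders, not values from the question:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
import pyspark.sql.functions as F
# Hypothetical coarse grouping: collapse many item values into a few buckets
df = df.withColumn(
    "item1_group",
    F.when(F.col("item1").isin("apple", "banana", "pear"), "fruit")
     .when(F.col("item1").isin("soap", "detergent"), "household")
     .otherwise("other")
)
indexer = StringIndexer(inputCol="item1_group", outputCol="groupIndex")
indexed = indexer.fit(df).transform(df)
encoder = OneHotEncoder(inputCol="groupIndex", outputCol="groupVec")
encoded = encoder.transform(indexed)  # on Spark 3+: encoder.fit(indexed).transform(indexed)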

Related

Trying to use pandas to group all data in Column B [duplicate]

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max; both take some time because they skip missing values (compare with size, which does not).
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
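As a tiny illustration of the count/size distinction (the data here is made up for the sketch):
import numpy as np
import pandas as pd
small = pd.DataFrame({'word': ['a', 'a', 'b'], 'frequency': [1, np.nan, 3]})
grouped = small.groupby('word')
print(grouped['frequency'].count())  # non-null values per group: a -> 1, b -> 1
print(grouped['frequency'].size())   # rows per group:            a -> 2, b -> 1
print(small['word'].value_counts())  # a -> 2, b -> 1, sorted by count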
When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()
Just an addition to the previous answers: let's not forget that real data can contain null values, so it's useful to include those in the counting as well by passing dropna=False (the default is True).
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
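If one of the alternatives above is used but sorted counts are still wanted, a sort_values call restores that ordering (a one-line follow-up to the snippets above):
out = out.sort_values(ascending=False)  # largest counts first, like value_counts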
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2 * n, size=n))
    return s.astype(str) if dt == 'obj' else s

def plot_perfplot(datatype):
    perfplot.show(
        setup=lambda n: creator(n, datatype),
        kernels=[
            lambda s: s.value_counts(),
            lambda s: pd.Series(Counter(s)),
            lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index=ic[0]),
            lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
        ],
        labels=['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range=[2 ** k for k in range(5, 25)],
        equality_check=lambda *x: (d := pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel='~len(s)',
        title=f'dtype {datatype}'
    )

plot_perfplot('obj')
plot_perfplot('int')

How to average groups of columns

Given the following pandas dataframe:
I am trying to get to point b (shown in image 2), where I want to use the 'class' row to identify column names and average the columns that share the same class. I have been trying to use setdefault to create a dictionary, but I am not having much luck. I aim to achieve the final result shown in fig 2.
Since this is a representative example (the actual dataframe is huge), please let me know of a loop-based approach if possible.
Any help or pointers in the right direction is immensely appreciated.
Imports and Test DataFrame
import pandas as pd
from string import ascii_lowercase # for test data
import numpy as np # for test data
np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
df.index.name = 'Class'
a b c d e f
Class
0 941.455743 641.602705 684.610467 588.562066 543.887219 368.070913
1 766.625774 305.012427 442.085972 110.443337 438.373785 752.615799
2 291.626250 885.722745 996.691261 486.568378 349.410194 151.412764
3 891.947611 773.542541 780.213921 489.000349 532.862838 189.855095
4 958.551868 882.662907 86.499676 243.609553 279.726092 215.662172
Create a DataFrame of column pair means
# use list slicing to select even and odd columns
even_cols = df.columns[0::2]
odd_cols = df.columns[1::2]
# zip the two lists into pairs
# zip creates tuples, but pandas requires list of columns, so we map the tuples into lists
col_pairs = list(map(list, zip(even_cols, odd_cols)))
# in a list comprehension iterate through each column pair, get the mean, and concat the results into a dataframe
df_means = pd.concat([df[pairs].mean(axis=1) for pairs in col_pairs], axis=1)
# in a list comprehension create column header names with a string join
df_means.columns = [' & '.join(pair) for pair in col_pairs]
# display(df_means)
a & b c & d e & f
Class
0 791.529224 636.586267 455.979066
1 535.819101 276.264655 595.494792
2 588.674498 741.629819 250.411479
3 832.745076 634.607135 361.358966
4 920.607387 165.054615 247.694132
Try this:
df['A B'] = df[['A', 'B']].mean(axis=1)
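For the general case in the question, where the classes may cover more than pairs of columns, one option is to group the columns themselves. This is a sketch assuming the class labels are available as a column-name-to-class mapping (col_class and its values are made up; in the original data they come from the 'class' row):
import numpy as np
import pandas as pd
from string import ascii_lowercase
np.random.seed(365)
df = pd.DataFrame(np.random.rand(5, 6) * 1000, columns=list(ascii_lowercase[:6]))
# Hypothetical column -> class mapping
col_class = {'a': 'x', 'b': 'x', 'c': 'y', 'd': 'y', 'e': 'z', 'f': 'z'}
# Transpose so columns become rows, group by class label, average, transpose back
# (equivalent to df.groupby(col_class, axis=1).mean() on older pandas versions)
df_means = df.T.groupby(col_class).mean().T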

create MultiIndex columns based on "lookup"

I'd like to take an existing DataFrame with a single level of columns and modify it to use a MultiIndex based on a reference list of tuples and have the proper ordering/alignment. To illustrate by example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10,5), columns = ['nyc','london','canada','chile','earth'])
coltuples = [('cities','nyc'),('countries','canada'),('countries','usa'),('countries','chile'),('planets','earth'),('planets','mars'),('cities','sf'),('cities','london')]
I'd like to create a new DataFrame which has a top level consisting of 'cities', 'countries', and 'planets' with the corresponding original columns underneath. I am not concerned about order but definitely proper alignment.
It can be assumed that 'coltuples' will not be missing any of the columns from 'df', but may have extraneous pairs, and the ordering of the pairs can be random.
I am trying something along the lines of:
coltuplesuse = [x for x in coltuples if x[1] in df.columns]
cols = pd.MultiIndex.from_tuples(coltuplesuse, names=['level1','level2'])
df.reindex(columns=cols)
which seems to be on the right track, but the underlying data in the resulting DataFrame is all NaN.
thanks in advance!
Two things to notice: you want set_axis rather than reindex (reindex aligns the new labels against the existing ones, and since none of the tuples match the old flat column names the data comes back as NaN, while set_axis simply relabels the existing columns), and sorting by the original column order ensures the correct label is assigned to the correct column (this is done in the sorted ... key= bit).
use_cols = [tup for tup in coltuples if tup[1] in df.columns]
use_cols = sorted(use_cols, key=lambda x: list(df.columns).index(x[1]))
multi_index = pd.MultiIndex.from_tuples(use_cols, names=['level1', 'level2'])
df.set_axis(multi_index, axis=1)
output:
level1 cities countries planets
level2 nyc london canada chile earth
0 0.028033 0.540977 -0.056096 1.675698 -0.328630
1 1.170465 -1.003825 0.882126 0.453294 -1.127752
2 -0.187466 -0.192546 0.269802 -1.225172 -0.548491
3 2.272900 -0.085427 0.029242 -2.258696 1.034485
4 -1.243871 -1.660432 -0.051674 2.098602 -2.098941
5 -0.820820 -0.289754 0.019348 0.176778 0.395959
6 1.346459 -0.260583 0.212008 -1.071501 0.945545
7 0.673351 1.133616 1.117379 -0.531403 1.467604
8 0.332187 -3.541103 -0.222365 1.035739 -0.485742
9 -0.605965 -1.442371 -1.628210 -0.711887 -2.104755

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
    formatter = "%.{}f".format(row.precision)
    return row + [formatter % (row.amount_raw / 10 ** row.precision)]
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
from pyspark.sql.types import StringType

def format_amounts_with_precision(df, all_precisions_set):
    # base case: precision 0 means no decimal places, just cast to string
    amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
    for precision in all_precisions_set:
        if precision != 0:  # this is a messy way of having a base case above
            fmt_str = '%.{}f'.format(precision)
            amt_col = amt_col.when(df['precision'] == precision,
                                   PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision))
    return df.withColumn('amount', amt_col)
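The precision set mentioned above can be collected from the column itself; a hypothetical usage sketch (column names as in the question):
all_precisions = {row['precision'] for row in df.select('precision').distinct().collect()}
df = format_amounts_with_precision(df, all_precisions)
df.select('precision', 'amount_raw', 'amount').show()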
You can do it with a Python UDF. A UDF can take any number of input values (values from columns of a Row) and spit out a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.functions import udf, col
# Create example data frame
schema = T.StructType([
    T.StructField('precision', T.IntegerType(), False),
    T.StructField('value', T.FloatType(), False)
])
data = [
    (1, 0.123456),
    (2, 0.123456),
    (3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
    format_str = "{:." + str(precision) + "f}"
    return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if you wanted to use a single global precision instead of the per-row column value, you could pass it in with the lit(..) function when you call the UDF, like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))

How to split column of vectors into two columns?

I use PySpark.
Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.
I've tried the following:
output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))
but I get the error that 'col should be Column'.
Any suggestions on how to transform a column of vectors into columns of its values?
I figured out the problem with the suggestion above. In pyspark, "dense vectors are simply represented as NumPy array objects", so the issue is with python and numpy types. Need to add .item() to cast a numpy.float64 to a python float.
The following code works:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))
Or to append these columns to the original dataframe:
randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Got the same problem; below is the code adjusted for the situation when you have an n-length vector.
# bind i as a default argument so each UDF extracts its own index
splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias("Column" + str(i)) for i, s in enumerate(splits)])
You may want to use one UDF to extract the first value and another to extract the second. You can then use the UDFs with a select call on the output of the random forest data frame. Example:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType
split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
                                     split2_udf(col("probability")).alias("c2"))
This should give you a dataframe output2 which has columns c1 and c2 corresponding to the first and second values in the list stored in the column probability.
I tried @Rookie Boy's loop, but the splits UDF loop didn't work for me, so I modified it a bit.
out = df
for i in range(n):
    # bind i as a default argument so each UDF keeps its own index
    splits_i = udf(lambda x, i=i: x[i].item(), FloatType())
    out = out.withColumn('col_{}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(n)]).show()
