How can I get the ".describe()" statistics over all numerical columns, nested or not? - python-3.x

What is the best method to get simple descriptive statistics for any column in a dataframe (or a list or array), be it nested or not: a sort of advanced df.describe() that also covers nested structures with numerical values?
In my case, I have a dataframe with many columns. Some columns have a numerical list in each row (in my case a time series structure), which is a nested structure.
By nested structures I mean:
a list of arrays,
an array of arrays,
a Series of lists,
a dataframe with nested lists of numerical values in some columns (my case).
How to get the simple descriptive statistics from any level of the nested structure in one go?
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values.
I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, since that is the solution from "How can I get descriptive statistics of a NumPy array?" and only covers a non-nested array.

My first approach would be to get the statistics of each numerical list first, and then take the statistics of those again; e.g. the mean of the means or the mean of the variances would already give me some information.
In the approach below, I first convert a specific column that holds a nested list of numerical values to a Series of nested lists. Nested arrays or lists might need a small adjustment; not tested.
NESTEDSTRUCTURE = df['nestedColumn']
# describe every inner list, then describe each DescribeResult field (index 0-5) across the rows
inner = [stats.describe(row) for row in NESTEDSTRUCTURE]
[stats.describe([d[i] for d in inner]) for i in range(6)]
gives you the stats of the stats for a nested-structure column. If you want the mean of all means of a column, you can use
stats.describe([d[2] for d in inner])
as index 2 stands for "mean" in DescribeResult(nobs=, minmax=(, ), mean=, variance=, skewness=, kurtosis=); you can also access the field by name, e.g. d.mean.
I expect that there is a better descriptive statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
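
If the nested column holds plain numeric lists, one possible shortcut (just a sketch, not from the original question; the example data and the column name nestedColumn are assumptions) is to explode the column so that pandas' own describe() can handle all values at once:
import pandas as pd

# hypothetical frame: one flat numeric column, one column of numeric lists
df = pd.DataFrame({
    "flat": [1.0, 2.0, 3.0],
    "nestedColumn": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
})

# one row per inner value; explode() yields object dtype, so cast to float
nested_values = df["nestedColumn"].explode().astype(float)

print(df.drop(columns="nestedColumn").describe())  # stats of the flat columns
print(nested_values.describe())                    # stats over all nested values
This gives the statistics over all values in the nested column in one go, rather than statistics of per-row statistics.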

Related

Iterating through each row of a single column in a Python geopandas (or pandas) DataFrame

I am trying to iterate through this very large DataFrame that I have in python. The thing is, I only want to pull out data from one specific column that contains the names of a bunch of counties.
I have tried to use iteritems(), itertuples(), and iterrows() to no avail.
Any suggestions on how to do this?
My end goal is to have a nested dictionary with each internal dictionary's key being a name from the DataFrame column.
I also tried the method below to select a single column, but it only prints the name of the column, not its contents.
for county in map_datafile[['NAME']]:
    print(county)
If you delete one pair of square brackets, you get a Series that is iterable:
for county in map_datafile['NAME']:
    print(county)
See the difference:
print(type(map_datafile[['NAME']]))
# pandas.core.frame.DataFrame
print(type(map_datafile['NAME']))
# pandas.core.series.Series
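The end goal in the question was a nested dictionary keyed by the names from that column. A minimal sketch of that last step (the inner dictionaries are left empty here, since the question does not say what they should contain):
# hypothetical: build {county_name: {...}} from the iterable Series
nested_dict = {}
for county in map_datafile['NAME']:
    nested_dict[county] = {}  # fill with per-county data as needed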

PySpark - A more efficient method to count common elements

I have two dataframes, say dfA and dfB.
I want to take their intersection and then count the number of unique user_ids in that intersection.
I've tried the following which is very slow and it crashes a lot:
dfA.join(broadcast(dfB), ['user_id'], how='inner').select('user_id').dropDuplicates().count()
I need to run many such lines, in order to get a plot.
How can I perform such a query efficiently?
As described in the question, the only relevant part of the dataframes is the user_id column (you describe that you join on user_id and afterwards use only the user_id field).
The source of the performance problem is joining two big dataframes when you need only the distinct values of one column in each dataframe.
In order to improve the performance I'd do the following:
Create two small DataFrames that hold only the user_id column of each dataframe
This dramatically reduces the size of each dataframe, as it keeps only one column (the only relevant one).
dfAuserid = dfA.select("user_id")
dfBuserid = dfB.select("user_id")
Get the distinct values of each dataframe (note: distinct() is equivalent to dropDuplicates())
This dramatically reduces the size of each dataframe, as each new dataframe holds only the distinct values of the user_id column.
dfAuseridDist = dfA.select("user_id").distinct()
dfBuseridDist = dfB.select("user_id").distinct()
Perform the join on the above two minimalist dataframes in order to get the unique values in the intersection
You can select the necessary columns first and perform the join afterwards. It should also be beneficial to move dropDuplicates() before the join, since that gets rid of user_ids that appear multiple times in one of the dataframes.
The resulting query could look like:
dfA.select("user_id").join(broadcast(dfB.select("user_id")), ['user_id'], how='inner')\
.select('user_id').dropDuplicates().count()
OR:
dfA.select("user_id").dropDuplicates(["user_id",]).join(broadcast(dfB.select("user_id")\
.dropDuplicates(["user_id",])), ['user_id'], how='inner').select('user_id').count()
OR the version with distinct should work as well.
dfA.select("user_id").distinct().join(broadcast(dfB.select("user_id").distinct()),\
['user_id'], how='inner').select('user_id').count()
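As a further sketch (not part of the original answer), the same count can also be written with DataFrame.intersect, which keeps the two single-column frames small and already returns distinct rows:
# intersect the two one-column frames and count the distinct common user_ids
common_users = dfA.select("user_id").intersect(dfB.select("user_id"))
print(common_users.count())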

Does it make sense to do VectorIndexer after StringIndexer on categorical features?

Say I have a bunch of categorical string columns in my dataframe. Then I do the transform below:
StringIndexer on the columns,
then VectorAssembler to assemble all the transformed columns into one vector feature column,
then VectorIndexer on the new vector feature column.
Question: for step 3, does it make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you're going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to integers in the range 1 to 65000. It saves this in the column metadata as a NominalAttribute, and RandomForestClassifier then extracts it from the metadata and treats those columns as categorical features.
For tree-based algorithms you have to specify the maxBins parameter, which
"Must be >= 2 and >= number of categories in any categorical feature."
A maxBins value that is too high leads to slow performance. To solve this, use VectorIndexer with, for example, .setMaxCategories(64); it will then treat as categorical only those features that have at most 64 unique values.
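A minimal sketch of the three-step pipeline from the question, with hypothetical input columns cat1 and cat2 (the column names and the maxCategories value are illustrative only):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

cat_cols = ["cat1", "cat2"]  # assumed categorical string columns

# 1) index each categorical string column
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cat_cols]

# 2) assemble the indexed columns into one vector feature column
assembler = VectorAssembler(inputCols=[c + "_idx" for c in cat_cols],
                            outputCol="features_raw")

# 3) re-index the vector, treating only low-cardinality slots as categorical
vec_indexer = VectorIndexer(inputCol="features_raw", outputCol="features",
                            maxCategories=64)

pipeline = Pipeline(stages=indexers + [assembler, vec_indexer])
# model = pipeline.fit(df)  # df is the DataFrame holding cat1 and cat2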

PySpark: filter DataFrame where column value equals some value in list of Row objects

I have a list of pyspark.sql.Row objects as follows:
[Row(artist=1255340), Row(artist=942), Row(artist=378), Row(artist=1180), Row(artist=813)]
From a DataFrame having schema (id, name), I want to pull out the rows where id equals some artist in the given list of Rows. What is the correct way to go about it?
To clarify further, I want to do something like: select row from dataframe where row.id is in list_of_row_objects
The main question is how big list_of_row_objects is. If it is small, then the approach from the link provided by @Karthik Ravindra works fine.
If it is big, you can instead use dataframe_of_row_objects: do an inner join between your dataframe and dataframe_of_row_objects, matching the artist column in dataframe_of_row_objects with the id column in your original dataframe. This basically removes any id that is not in dataframe_of_row_objects.
Of course, using a join is slower, but it is more flexible. For lists that are not small but still small enough to fit into memory, you can use the broadcast hint to get better performance.
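A minimal sketch of the join approach, assuming the list of Rows is called rows, the SparkSession is spark, and the original DataFrame df has columns id and name (names taken from the question, adjust as needed):
from pyspark.sql.functions import broadcast

# turn the list of Row(artist=...) objects into a one-column DataFrame
dataframe_of_row_objects = spark.createDataFrame(rows)  # single column: artist

# keep only the rows of df whose id appears among the artists
filtered = df.join(
    broadcast(dataframe_of_row_objects),
    df.id == dataframe_of_row_objects.artist,
    how="inner",
).select(df.id, df.name)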

combining rows/columns from spark data frames by mathematical operation

I have two spark data frames (A and B) with respective sizes a x m and b x m, containing floating point values.
Additionally, each data frame has a column 'ID', that is a string identifier. A and B have exactly the same set of 'ID's (i.e. contain information about the same group of customers.)
I'd like to combine a column of A with a column of B by some function.
More specifically, I'd like to compute the scalar product of a column of A with a column of B, with the entries ordered according to the ID.
Even more specifically I'd like to calculate the correlation between columns of A and B.
Performing this operation on all pairs of columns would be the same as a matrix multiplication: A_transposed x B.
However, for now I'm only interested in correlations of a small subset of pairs.
I have two approaches in mind, but I struggle to implement them. (And don't know whether either is possible or advisable, at all.)
(1) Take the column of interest of each data frame and combine each entry into a key-value pair, where the key is the ID. Then do something like reduceByKey() on the two columns of key-value pairs and a subsequent summation.
(2) Take the column of interest of each data frame, sort it by its ID, cast it to an RDD (I haven't figured out how to do this) and simply apply
Statistics.corr(rdd1,rdd2) from pyspark.mllib.stat.
Also I wonder: is it generally computationally preferable to operate on columns rather than rows (since Spark data frames are column-oriented), or does that make no difference?
Starting from Spark 1.4, if all you need is the Pearson correlation, you can do it like this:
cor = (dfA.join(dfB, dfA.id == dfB.id, how='inner')
          .select(dfA.value.alias('aval'), dfB.value.alias('bval'))
          .corr('aval', 'bval'))
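If several column pairs are needed, a variant of the same idea (a sketch under the assumptions of the answer above: both frames share an id column and each has a value column; it is not from the original thread) uses the corr aggregate function, so multiple correlations can be computed in one pass over the joined frame:
from pyspark.sql.functions import corr

# join once on the shared ID, then aggregate as many column pairs as needed
joined = dfA.join(dfB, dfA.id == dfB.id, how='inner')
result = joined.agg(corr(dfA.value, dfB.value).alias('corr_value')).collect()
print(result[0]['corr_value'])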
