Avoiding using a column as the index by default in Pandas [duplicate] - python-3.x

Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so setting up an index feels a bit cumbersome given that I don't think I need one.
Any thoughts on best-practices around indexing?

Like a dict, a DataFrame's index is backed by a hash table. Looking up rows
based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up the rows where the df['index'] column equals 999.
Pandas has to scan every value in the column to find the matches:
df[df['index'] == 999]
# foo index
# 999 0.375489 999
Here is how you could look up the row where the index equals 999. With an index, Pandas uses the hash table to find the matching rows:
df_with_index.loc[999]
# foo 0.375489
# index 999.000000
# Name: 999, dtype: float64
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
Note however, it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
So having the index is only advantageous when you have many lookups of this type
to perform.
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt,
lreshape, and crosstab, use or manipulate the index.
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
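As a rough illustration of the join point (a sketch; the frame and column names are made up for this example), the same rows can be obtained by merging on a column or by joining on the index after set_index; the index-based form is the one that can reuse the hash-backed index:
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': range(10000), 'a': np.random.random(10000)})
right = pd.DataFrame({'key': range(10000), 'b': np.random.random(10000)})

# Merge on a regular column
merged_on_column = left.merge(right, on='key')

# The equivalent join on the index, after setting 'key' as the index on both sides
joined_on_index = left.set_index('key').join(right.set_index('key'))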
Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
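For instance, a small sketch (the date range and frequency here are arbitrary): resample groups rows through a DatetimeIndex.
import numpy as np
import pandas as pd

ts = pd.Series(np.random.random(365),
               index=pd.date_range('2017-01-01', periods=365, freq='D'))

# Monthly means: resample uses the DatetimeIndex to assign each row to a month
monthly_means = ts.resample('M').mean()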
So in the end, I think the index's usefulness, and the reason it shows up in so many functions, comes down to its ability to perform fast hash lookups.

Related

How can I get the ".describe()" statistics over all numerical columns, nested or not?

What is the best method to get simple descriptive statistics for any column in a dataframe (or list or array), nested or not: a sort of advanced df.describe() that also covers nested structures with numerical values?
In my case, I have a dataframe with many columns. Some columns have a numerical list in each row (in my case a time series structure), which is a nested structure.
By nested structures I mean:
list of arrays,
array of arrays,
series of lists,
dataframe with nested lists of numerical values in some columns (my case)
How to get the simple descriptive statistics from any level of the nested structure in one go?
Asking for
df.describe()
will give me just the statistics of the numerical columns, but not those of the columns that include a list with numerical values.
I cannot get the statistics just by applying
from scipy import stats
stats.describe(arr)
either, as that is the solution given in How can I get descriptive statistics of a NumPy array? for a non-nested array.
My first approach would be to get the statistics of each numerical list first, and then take the statistics of those again; e.g. the mean of the means or the mean of the variances would already give me some information.
In my first approach here, I convert a specific column that has a nested list of numerical values to a series of nested lists first. Nested arrays or lists might need a small adjustment, not tested.
NESTEDSTRUCTURE = df['nestedColumn']
per_row_stats = [stats.describe(row) for row in NESTEDSTRUCTURE]  # one DescribeResult per nested list
[stats.describe([s[i] for s in per_row_stats]) for i in range(6)]
gives you the stats of the stats for a nested structure column. If you want the mean of all means of a column, you can use
stats.describe([s[2] for s in per_row_stats])
as position 2 stands for "mean" in
DescribeResult(nobs=..., minmax=(..., ...), mean=..., variance=..., skewness=..., kurtosis=...)
I expect there is a better descriptive-statistics approach that automatically understands nested structures with numerical values; this is just a workaround.
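As a possibly simpler alternative (a sketch, assuming the nested column holds plain Python lists or 1-D arrays of numbers and that statistics over all nested values pooled together are acceptable), pandas' explode can flatten the nested column so the ordinary describe() applies:
import pandas as pd

# Hypothetical frame: one plain numeric column, one nested column
df = pd.DataFrame({
    'plain': [1.0, 2.0, 3.0],
    'nestedColumn': [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
})

# Flatten the nested column into one long numeric Series and describe it
flat = df['nestedColumn'].explode().astype(float)
print(flat.describe())

# The usual describe() still covers the non-nested columns
print(df.drop(columns=['nestedColumn']).describe())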

How to get a list of columns that will give me a unique record in Pyspark Dataframe

My intention is to write a Python function that takes a PySpark DataFrame as input and outputs a list of columns (possibly multiple lists) that gives a unique record when combined together.
So, if you take a set of values for the columns in the list, you would always get just 1 record from the DataFrame.
Example:
Input Dataframe
Name   Role  id
--------------------
Tony   Dev   130
Stark  Qa    131
Steve  Prod  132
Roger  Dev   133
--------------------
Output:
Name,Role
Name,id
Name,id,Role
Why is the output what it is?
For any Name,Role combination I will always get just 1 record
And, for any Name, id combination I will always get just 1 record.
There are ways to define a function that will do exactly what you are asking for.
I will only show one possibility, and it is a very naive solution: iterate through all combinations of columns and check whether each combination forms a unique entry in the table:
import itertools as it

def find_all_unique_columns_naive(df):
    cols = df.columns
    res = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                res.append(comb)
    return res
With a result for your example being:
[('Name',), ('id',), ('Name', 'Role'), ('Name', 'id'), ('Role', 'id'),
('Name', 'Role', 'id')]
There is obviously a performance issue, since this function's runtime grows exponentially with the number of columns, i.e. O(2^N); even a table with just 20 columns already takes quite a long time.
There are, however, some obvious ways to speed this up. For example, if you already know that the column Name is unique, then any combination that includes a known unique combination is itself unique, so you can immediately deduce that (Name, Role), (Name, id) and (Name, Role, id) are unique as well, which prunes the search space considerably. The worst case remains the same, though: if the table has no unique combination of columns at all, you still have to exhaust the entire search space to reach that conclusion.
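A minimal sketch of that pruning idea (it reuses the same naive groupBy-based uniqueness check; find_minimal_unique_columns is a hypothetical name, not an established API, and it returns only the minimal unique combinations):
import itertools as it

def find_minimal_unique_columns(df):
    """Return only minimal unique combinations, skipping supersets of known ones."""
    cols = df.columns
    minimal = []
    for num_of_cols in range(1, len(cols) + 1):
        for comb in it.combinations(cols, num_of_cols):
            # Any superset of an already-found unique combination is unique too; skip it
            if any(set(found).issubset(comb) for found in minimal):
                continue
            num_of_nonunique = df.groupBy(*comb).count().where("count > 1").count()
            if not num_of_nonunique:
                minimal.append(comb)
    return minimal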
As a conclusion, I'd suggest thinking about why you want this function in the first place. There might be specific use cases for small tables, I agree, just to save some time, but honestly this is not how one should treat a table. If a table exists, there should be a purpose for it and a proper table design, i.e. a clear picture of how the data inside it are structured and updated, and that should be the starting point when looking for unique identifiers. Even if you can find other unique identifiers now with this method, the next update may well destroy them, because the table was never designed around them. I'd much rather suggest using the table's metadata and documentation: then you can be sure you are treating the table the way it was designed to be used, and, when the table has many columns, it is also faster.

Finding a row number in a large pandas dataframe is very slow using np.where

I am trying to find the row number of a value in a dataframe with 55k rows. I have multiple indices, and hence I tried searching with np.where, but this is very slow. Is there a better solution that takes less than a millisecond? I tried loc and iloc, but they do not seem to be better options.
lprowcount = np.where(dfreglist['lineports'] == lineport)[0]
Also, what is the best method to locate and update a row in a large dataframe, other than the loc and iloc methods?
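In the spirit of the first answer above, a minimal sketch (assuming the question's dfreglist and lineport, that the 'lineports' values are hashable, and that many such lookups are needed): building a hash-backed mapping once lets each lookup avoid scanning the whole column.
import pandas as pd

# One-time setup: map each 'lineports' value to its positional row number(s)
row_numbers = pd.Series(range(len(dfreglist)), index=dfreglist['lineports'])

# Fast hash-based lookup (a scalar, or a Series if the value occurs more than once)
lprowcount = row_numbers.loc[lineport]

# Updating that row by position afterwards ('somecol' and new_value are placeholders)
# dfreglist.iloc[lprowcount, dfreglist.columns.get_loc('somecol')] = new_value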

PySpark 2.1.1 groupby + approx_count_distinct giving counts of 0

I'm using Spark 2.1.1 (pyspark), doing a groupby followed by an approx_count_distinct aggregation on a DataFrame with about 1.4 billion rows. The groupby operation results in about 6 million groups to perform the approx_count_distinct operation on. The expected distinct counts for the groups range from single-digits to the millions.
Here is the code snippet I'm using, with column 'item_id' containing the ID of items, and 'user_id' containing the ID of users. I want to count the distinct users associated with each item.
>>> distinct_counts_df = data_df.groupby(['item_id']).agg(approx_count_distinct(data_df.user_id).alias('distinct_count'))
In the resulting DataFrame, I'm getting about 16,000 items with a count of 0:
>>> distinct_counts_df.filter(distinct_counts_df.distinct_count == 0).count()
16032
When I checked the actual distinct count for a few of these items, I got numbers between 20 and 60. Is this a known issue with the accuracy of the HLL approximate counting algorithm or is this a bug?
Although I am not sure where the actual problem lies, since approx_count_distinct relies on approximation (https://stackoverflow.com/a/40889920/7045987), HLL may well be the issue.
You can try this:
There is a parameter 'rsd' which you can pass to approx_count_distinct and which determines the error margin. If rsd = 0, it will give you accurate results, although the time increases significantly, and in that case countDistinct becomes the better option. Nevertheless, you can try decreasing rsd to, say, 0.008 at the cost of increased time; this may give slightly more accurate results.
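For reference, a small sketch of both options (reusing the question's data_df, item_id and user_id; the rsd value is just an example):
from pyspark.sql.functions import approx_count_distinct, countDistinct

# Approximate count with a tighter error margin (slower, more accurate)
approx_df = data_df.groupby('item_id').agg(
    approx_count_distinct(data_df.user_id, rsd=0.008).alias('distinct_count'))

# Exact distinct count, if the extra runtime is acceptable
exact_df = data_df.groupby('item_id').agg(
    countDistinct(data_df.user_id).alias('distinct_count'))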

More efficient way to Iterate & compute over columns [duplicate]

I have a very wide dataframe > 10,000 columns and I need to compute the percent of nulls in each. Right now I am doing:
threshold = 0.9
for c in df_a.columns[:]:
    if df_a[df_a[c].isNull()].count() >= (df_a.count() * threshold):
        # print(c)
        df_a = df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are a few strategies you can take, depending on the size of the dataframe. The code looks good to me: you need to go through each column and count the number of null values.
One strategy is to cache the input dataframe, which enables faster filtering. This only works, however, if the dataframe is not huge.
Also, I am a little skeptical about
df_a = df_a.drop(c)
as it reassigns the dataframe inside the loop. It is better to collect the names of the mostly-null columns and drop them from the dataframe afterwards in a single step.
If the dataframe is huge and you can't cache it completely, you can partition it into manageable batches of columns: take, say, 100 columns at a time, cache that smaller dataframe, and run the analysis batch by batch (see the sketch below).
In that case you might also want to keep track of the columns already analyzed, separately from those yet to be analyzed, so that even if the job fails you can resume from the remaining columns.
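A minimal sketch of that batching idea (assuming the question's df_a and threshold; the batch size and select-based slicing are just one way to do it):
threshold = 0.9
batch_size = 100
total_rows = df_a.count()

cols_to_drop = []
all_columns = df_a.columns
for start in range(0, len(all_columns), batch_size):
    batch_cols = all_columns[start:start + batch_size]
    batch_df = df_a.select(*batch_cols).cache()
    for c in batch_cols:
        null_count = batch_df.filter(batch_df[c].isNull()).count()
        if null_count >= total_rows * threshold:
            cols_to_drop.append(c)
    batch_df.unpersist()

# Drop all mostly-null columns in one step, outside the loop
if cols_to_drop:
    df_a = df_a.drop(*cols_to_drop)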
You should avoid iterating column by column in PySpark: each per-column count launches a separate job, so the work is no longer done in one distributed pass.
Using count on a column computes the number of non-null elements, so all columns can be counted in a single aggregation:
import pyspark.sql.functions as psf

threshold = 0.9
count_df = df_a\
    .agg(*([psf.count("*").alias("count")] + [psf.count(c).alias(c) for c in df_a.columns]))\
    .toPandas().transpose()
The first element is the number of lines in the dataframe:
total_count = count_df.iloc[0, 0]
# Keep the columns whose non-null count is above the threshold, dropping the total-count row
kept_cols = count_df[count_df[0] > (1 - threshold) * total_count].iloc[1:, :]
df_a = df_a.select(list(kept_cols.index))
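Because all the per-column counts come from a single agg call, Spark scans the data once and sends one small row back to the driver, which is the main reason this scales better than counting nulls column by column in a loop.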
