speed up pandas search for a certain value not in the whole df - python-3.x

I have a large pandas DataFrame consisting of some 100k rows and ~100 columns with different dtypes and arbitrary content.
I need to assert that it does not contain a certain value, let's say -1.
Using assert(not (any(test1.isin([-1]).sum() > 0))) results in a processing time of a few seconds.
Any idea how to speed it up?

Just to make a full answer out of my comment:
With -1 not in test1.values you can check whether -1 occurs anywhere in your DataFrame.
Performance-wise, this still has to look at every single value, which in your case means
10^5 * 10^2 = 10^7 comparisons.
All you save is the cost of the per-column summation and the extra comparison of those results.
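For illustration, a minimal sketch of both checks (the toy frame below is just a stand-in for the real 100k x 100 DataFrame; actual timings depend on the dtypes and memory layout):
import numpy as np
import pandas as pd

# stand-in for the real DataFrame
test1 = pd.DataFrame(np.random.randint(0, 10, size=(100_000, 100)))

# original check: per-column sums plus a final reduction
assert not any(test1.isin([-1]).sum() > 0)

# membership check on the underlying NumPy values, skipping those extra reductions
assert -1 not in test1.values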

Related

Faster way to apply a function to every pair of columns in pandas data frame

I am trying to apply a function to every possible pair of columns in a data frame. The iteration method works, but since the data frame is huge it takes a lot of time. My data frame has around 10,000 columns and 1,000 rows.
Is there a faster way of doing this? Given below is a toy example.
TOY EXAMPLE
My function is something like this:
def foo(x, y):
    # True when some row in columns (x, y) equals the target values
    if (df[[x, y]] == ['alpha', 'beta']).all(axis=1).any():
        print(x, y)

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
for i in df.columns:
    for j in df.columns:
        foo(i, j)
I have also tried a list comprehension with itertools.combinations, but it also takes a lot of time.
z = [foo(i,j) for i,j in itertools.combinations(df.columns,2)]
My actual function is essentially the same: it checks whether 3-4 particular rows are present in the pair of columns and writes the column information to a file.
I also tried using NumPy matrices instead of the data frame, but did not see any significant improvement. All of the above approaches work but take a lot of time (obviously due to the huge size of the data frame), so I need some help optimizing the runtime.
Any suggestions would be highly appreciated. Thanks a lot.
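One possible direction, as a rough sketch that is not from the original thread: build boolean masks for the target values once and let a single matrix product report which column pairs contain a matching row (the target values below are made up for the numeric toy frame):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
target = (2, 3)  # hypothetical pair of values to look for in columns (i, j)

values = df.to_numpy()
a = (values == target[0]).astype(np.int32)  # rows where a column holds target[0]
b = (values == target[1]).astype(np.int32)  # rows where a column holds target[1]
# counts[i, j] = number of rows where column i equals target[0] and column j equals target[1]
counts = a.T @ b
for i, j in zip(*np.nonzero(counts)):
    print(df.columns[i], df.columns[j])
For 10,000 columns this produces a 10,000 x 10,000 count matrix (about 400 MB as int32), which is usually far cheaper than 10^8 Python-level function calls.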

pyarrow append and read row/columns for time series data

I am looking to use pyarrow to do memory-mapped reads, both by row and by column, for time-series data with multiple columns. I don't really care about writing historical data at a slower speed. My main aim is the fastest read speed (for a single row, single columns, multiple rows/columns) and, after that, the fastest possible append speed (with rows appended periodically). Here is the code that generates the data I am looking to test on. It builds a DataFrame with the fields (open, high, low, ...) as columns and a two-level MultiIndex with datetime and symbol as the levels. Comments on this particular layout are also welcome.
import time
import psutil, os
import numpy as np
import pandas as pd

KB = 1 << 10
MB = 1024 * KB
GB = 1024 * MB

idx = pd.date_range('20150101', '20210613', freq='T')
df = {}
for j in range(10):
    df[j] = pd.DataFrame(np.random.randn(len(idx), 6), index=idx, columns=[i for i in 'ohlcvi'])
df = pd.concat(df, axis=1)
df = df.stack(level=0)
df.index.names = ['datetime', 'sym']
df.columns.name = 'field'
print(df.memory_usage().sum() / GB)
Now I am looking for the most efficient code to do the following:
Write this data in a memory-mapped format on disk so that it can be used to read rows/columns or do some random access.
Append another row to this dataset at the end.
Query the last 5 rows.
Query a few random columns for a given set of contiguous rows.
Query non-contiguous rows and columns.
If the moderators want to see what I have tried before anyone answers, please say so and I will post all the preliminary code I wrote; I am not including it here because it would probably clutter the question without adding much information. I did not get the speeds promised in blog posts about pyarrow, and I am sure I am doing something wrong, hence this request for guidance.
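Not from the original thread, but as a minimal sketch of one possible starting point, using uncompressed Feather (Arrow IPC) files so they can be memory mapped; the file name is made up and the append strategy is only described in the comment:
import pyarrow.feather as feather

# Feather cannot serialize a MultiIndex, so move it into ordinary columns first;
# writing uncompressed keeps the file memory-mappable without a decoding step.
feather.write_feather(df.reset_index(), 'ohlcv.feather', compression='uncompressed')

# memory-mapped read of the whole table (data is paged in lazily by the OS)
table = feather.read_table('ohlcv.feather', memory_map=True)

last_5 = table.slice(len(table) - 5, 5)            # the last 5 rows
random_rows = table.take([0, 10, 10_000])          # non-contiguous rows
some_cols = feather.read_table('ohlcv.feather',    # only a few columns
                               columns=['datetime', 'sym', 'o', 'c'],
                               memory_map=True)

# Appending in place is not supported by the Feather format; one option is to
# write each periodic batch of new rows to its own file and combine the files
# at read time (for example with pyarrow.dataset).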

In tensorflow, how to compute the mean for each column of a batch generated from a csv that has NaNs in multiple columns?

I am reading a csv in batches, and each batch has nulls in various places. I don't want to use tensorflow transform because it requires loading the entire dataset in memory. Currently I cannot ignore the NaNs present in each column when computing the means for the whole batch at once. I could loop through each column and find its mean that way, but that seems like an inelegant solution.
Can somebody help me find the right way to compute the per-column mean of a csv batch that has NaNs present in multiple columns? Also, [1, 2, np.nan] should produce 1.5, not 1.
I am currently doing this, given a tensor a of rank 2:
tf.math.divide_no_nan(
    tf.reduce_sum(tf.where(tf.math.is_finite(a), a, 0.), axis=0),
    tf.reduce_sum(tf.cast(tf.math.is_finite(a), tf.float32), axis=0))
Let me know if somebody has a better option.
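For reference, a self-contained version of that computation (the sample values are made up; with eager execution the expected per-column means are 1.5, 3.0 and 4.0):
import numpy as np
import tensorflow as tf

# a rank-2 batch with NaNs scattered across columns
a = tf.constant([[1.0, np.nan, 3.0],
                 [2.0, 2.0, np.nan],
                 [np.nan, 4.0, 5.0]], dtype=tf.float32)

finite = tf.math.is_finite(a)                                    # mask of non-NaN entries
col_sums = tf.reduce_sum(tf.where(finite, a, tf.zeros_like(a)), axis=0)
col_counts = tf.reduce_sum(tf.cast(finite, tf.float32), axis=0)  # non-NaN count per column
col_means = tf.math.divide_no_nan(col_sums, col_counts)          # NaN-ignoring column means
print(col_means.numpy())                                         # [1.5 3.  4. ]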

PySpark 2.1.1 groupby + approx_count_distinct giving counts of 0

I'm using Spark 2.1.1 (pyspark), doing a groupby followed by an approx_count_distinct aggregation on a DataFrame with about 1.4 billion rows. The groupby operation results in about 6 million groups to perform the approx_count_distinct operation on. The expected distinct counts for the groups range from single-digits to the millions.
Here is the code snippet I'm using, with column 'item_id' containing the ID of items, and 'user_id' containing the ID of users. I want to count the distinct users associated with each item.
>>> distinct_counts_df = data_df.groupby(['item_id']).agg(approx_count_distinct(data_df.user_id).alias('distinct_count'))
In the resulting DataFrame, I'm getting about 16,000 items with a count of 0:
>>> distinct_counts_df.filter(distinct_counts_df.distinct_count == 0).count()
16032
When I checked the actual distinct count for a few of these items, I got numbers between 20 and 60. Is this a known issue with the accuracy of the HLL approximate counting algorithm or is this a bug?
I am not sure where the actual problem lies, but since approx_count_distinct relies on approximation (https://stackoverflow.com/a/40889920/7045987), HLL may well be the issue.
You can try this:
approx_count_distinct accepts a parameter rsd that controls the maximum estimation error allowed. With rsd = 0 it gives exact results, but the runtime increases significantly, and in that case countDistinct becomes the better option. Nevertheless, you can try lowering rsd to, say, 0.008 at the cost of a longer runtime; this may give somewhat more accurate results.
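As a sketch, reusing the names from the question (0.008 is just an example value for rsd):
from pyspark.sql.functions import approx_count_distinct

distinct_counts_df = data_df.groupby('item_id').agg(
    approx_count_distinct(data_df.user_id, rsd=0.008).alias('distinct_count'))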

More efficient way to Iterate & compute over columns [duplicate]

This question already has answers here:
Spark columnar performance
(2 answers)
Closed 5 years ago.
I have a very wide dataframe (> 10,000 columns) and I need to compute the percentage of nulls in each column. Right now I am doing:
threshold = 0.9
for c in df_a.columns[:]:
    if df_a[df_a[c].isNull()].count() >= (df_a.count() * threshold):
        # print(c)
        df_a = df_a.drop(c)
Of course this is a slow process and crashes on occasion. Is there a more efficient method I am missing?
Thanks!
There are a few strategies you can take depending on the size of the dataframe. The code itself looks fine to me: you have to go through each column and count its null values.
One strategy is to cache the input dataframe; that enables faster filtering. This, however, only works if the dataframe is not huge.
Also, I am a little skeptical of
df_a = df_a.drop(c)
because it changes the dataframe inside the loop. It is better to collect the names of the mostly-null columns and drop them from the dataframe afterwards in a separate step.
If the dataframe is huge and you cannot cache it completely, you can partition it into manageable groups of columns: take, say, 100 columns at a time, cache that smaller dataframe, and run the analysis in a loop (a sketch of this follows below).
In that case you might also want to keep track of the already-analyzed columns separately from the yet-to-be-analyzed ones, so that even if the job fails you can resume the analysis from the remaining columns.
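A rough sketch of that chunked strategy (df_a and threshold come from the question; the 100-column batch size and the helper names are just illustrative):
threshold = 0.9
batch_size = 100
total_rows = df_a.count()
null_cols = []

cols = df_a.columns
for start in range(0, len(cols), batch_size):
    batch = cols[start:start + batch_size]
    small_df = df_a.select(batch).cache()   # cache only a manageable slice of columns
    for c in batch:
        null_count = small_df.filter(small_df[c].isNull()).count()
        if null_count >= total_rows * threshold:
            null_cols.append(c)              # record the column instead of dropping it mid-loop
    small_df.unpersist()

# drop in one pass at the end by selecting the surviving columns
df_a = df_a.select([c for c in df_a.columns if c not in set(null_cols)])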
You should avoid iterating when using pyspark, since it does not distribute the computations anymore.
Using count on a column will compute the count of non-null elements.
import pyspark.sql.functions as psf

threshold = 0.9
count_df = df_a\
    .agg(*([psf.count("*").alias("count")] + [psf.count(c).alias(c) for c in df_a.columns]))\
    .toPandas().transpose()
The first element is the total number of rows in the dataframe:
total_count = count_df.iloc[0, 0]
kept_cols = count_df[count_df[0] > (1 - threshold) * total_count].iloc[1:, :]
df_a.select(list(kept_cols.index))
