I'm currently trying to assign a unique indexer across rows, rather than alongside columns. The main problem is these values can never repeat, and must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an indexer to that, but my concern is that I won't be able to easily modify the dataframe and still preserve the same index values for each cell with this method.
I'm expecting my df to look something like this below:
[Image: Sample DataFrame]
I haven't found a viable solution yet, so I don't have any code to show. Any solutions would be much appreciated. Thank you.
I am reading in a CSV in batches, and each batch has nulls in various places. I don't want to use TensorFlow Transform, as it requires loading the entire dataset into memory. Currently I cannot ignore the NaNs present in each column when computing the means for the entire batch at once. I can loop through each column and find the mean per column that way, but that seems an inelegant solution.
Can somebody help me find the right way to compute the mean per column of a CSV batch that has NaNs in multiple columns? Note that [1,2,np.nan] should produce 1.5, not 1.
I am currently doing this, given a rank-2 tensor a:
tf.math.divide_no_nan(tf.reduce_sum(tf.where(tf.math.is_finite(a),a,0.),axis=0),tf.reduce_sum(tf.cast(tf.math.is_finite(a),tf.float32),axis=0))
Let me know if somebody has a better option.
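Because the CSV arrives in batches, one way to think about the sum-over-count expression above is as a pair of running totals per column that are only divided at the end. The following is a minimal plain-Python sketch of that idea, not TensorFlow code; the helper names update_totals and column_means and the sample batches are made up for illustration.

```python
import math

def update_totals(batch, sums, counts):
    """Add one batch (a list of rows) to the running per-column totals."""
    for row in batch:
        for j, value in enumerate(row):
            if not math.isnan(value):  # NaNs contribute to neither sum nor count
                sums[j] += value
                counts[j] += 1

def column_means(sums, counts):
    """Mean per column; a column with no valid values yields NaN."""
    return [s / c if c else math.nan for s, c in zip(sums, counts)]

nan = math.nan
# Two hypothetical batches with NaNs in different columns
batches = [
    [[1.0, nan], [2.0, 4.0]],
    [[nan, 6.0]],
]

sums, counts = [0.0, 0.0], [0, 0]
for batch in batches:
    update_totals(batch, sums, counts)

means = column_means(sums, counts)
print(means)  # [1.5, 5.0]
```

The first column holds 1, 2 and NaN across the batches and comes out as 1.5, which matches the requirement in the question. In TensorFlow, the per-batch sums and counts could presumably come from the two reduce_sum terms in the expression above, accumulated in the same way.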
Is there a way to add values in Excel based on values that already appear earlier in the table?
For example, in the table I currently have, is there a way to exclude adding the 1 from the "Attended" column in the "Sonics and Cold Cash" row because I already had a row with "Sonics" and "1" in attended? I don't want to add a 1 to the SUMIF function if I have already attended that team once before.
I hope this is clear enough for some help. Thank you!
edit: So far, I have a table that tracks how many times a team has been "attended". This works, however I am trying to use linear optimization for scheduling, and using the results table has some linearity problems. I'm trying to find a way to only use the table instead of a second, results table.
I have this issue using boolean masks in Pandas. It has to do with the integer index labels that seem to carry over from one DataFrame into another after applying a boolean mask. I applied a boolean mask to a DataFrame called f500 and stored that in another one called null_previous_rank. The index labels of the first DataFrame survive in the filtered DataFrame, but I'm not sure what role they play there. They're not row indices of the new df. I played with a few different ways of making sense of them, which you'll see in the screenshots with comments. Hopefully, they'll make sense.
Thanks in advance.
I think the answer to the question is that the "residual" indices are row labels in the new DataFrame.
I guess the index doesn't appear when selecting a single entry but does appear when selecting a list or slice because when there are multiple entries, we need to keep track of them, but when displaying a single entry, there's no need to keep track?
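To illustrate the point about row labels, here is a small sketch with pandas. The DataFrame below is a made-up stand-in for f500 (the real data is only visible in the screenshots), so the column values are assumptions; only the names f500, null_previous_rank and previous_rank come from the question.

```python
import pandas as pd

# Hypothetical stand-in for f500; the default index labels are 0..3
f500 = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "previous_rank": [1.0, None, 3.0, None],
    }
)

# A boolean mask keeps the matching rows together with their original labels
null_previous_rank = f500[f500["previous_rank"].isnull()]
labels = list(null_previous_rank.index)

# .loc selects by label, .iloc selects by position
by_label = null_previous_rank.loc[1, "company"]
by_position = null_previous_rank.iloc[0]["company"]

# reset_index(drop=True) discards the old labels and renumbers from 0
renumbered = null_previous_rank.reset_index(drop=True)

print(labels)                  # [1, 3]
print(by_label, by_position)   # B B
print(list(renumbered.index))  # [0, 1]
```

Here the surviving labels 1 and 3 are the row labels of the filtered DataFrame, so .loc[1] and .iloc[0] both reach the same row. Selecting a single cell returns a plain scalar, which has no index to display.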
I have a dataframe that I am working with in a Python based Jupyter notebook. I want to add an additional column based on the content of an existing column, where the content of the new column is derived from running an external API call on the original column.
The solution I attempted was to use a Python based UDF. The first cell contains something like this:
from pyspark.sql.functions import udf

def analysis(old_column):
    # Call the external API on the text from the original column
    new_column = myapi.analyze(text=old_column)
    return new_column

analysis_udf = udf(analysis)
and the second cell this:
df2 = df1.withColumn("col2",analysis_udf('col1'))
df2.select('col2').show(n=5)
My dataframe is relatively large, with some 70000 rows, and col1 can hold anywhere from 100 to 10000+ characters of text. When I ran the code above in cell 2, it actually seemed to run fairly quickly (minutes) and dumped out the 5 rows of the df2 dataframe, so I thought I was in business. However, my next cell had the following code:
df2.cache()
df2.filter(col('col2').isNull()).count()
The intent of this code is to cache the contents of the new dataframe to improve access time, and then count how many entries have null values generated by the UDF. To my surprise, this took many hours to run and eventually output 6. It's not clear to me why the second cell ran quickly and the third was slow. I would have thought that the df2.select('col2').show(n=5) call would cause the UDF to run on all of the rows, so that call would be the slow one and subsequent accesses to the new column would be quick. That wasn't the case, so I supposed that the cache call was the one actually causing the UDF to run on all of the rows, and that any subsequent calls should now be quick. So I added another cell with:
df2.show(n=5)
I assumed it would run quickly, but again it took much longer than I expected, and it seems that perhaps the UDF was running again.
My questions are:
Which Spark API calls actually cause the UDF to run (or re-run), and how should I structure the calls so that the UDF runs only once and the new column is created with the text output by the UDF's Python function?
I have read that Python UDFs should be avoided because they are slow (seems correct) so what alternatives do I have when I need to use an API call to generate the new column?
I would have thought that the df2.select('col2').show(n=5) call would have caused the UDF to run on all of the rows
That is not a correct assumption. Spark evaluates as little data as possible, given the limitations of the API. Because you use a Python UDF, it evaluates only the minimum number of partitions required to collect 5 rows.
Which spark api calls actually cause the udf to run (or re-run), and how to structure the calls to run the UDF only once so that the new column is created with the text output by the UDF's python function.
Any evaluation, if the data is no longer cached (for example, because it was evicted from memory).
Possibly any usage of the resulting column, unless the UDF is marked as non-deterministic.
I have read that Python UDFs should be avoided because they are slow (seems correct) so what alternatives do I have when I need to use an API call to generate the new column?
Unless you want to switch to Scala or the RDD API, the only alternative is pandas_udf, which is somewhat more efficient but supports only a limited subset of types.
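The behaviour described in this answer can be mimicked without Spark. The sketch below is only a plain-Python analogy using generators, not Spark code: analyze stands in for the slow API call, and every name in it is hypothetical. It shows why taking 5 rows is cheap, why the first full pass is expensive, and why reusing a materialized result costs nothing extra.

```python
from itertools import islice

calls = 0

def analyze(text):
    """Stand-in for the slow external API call; counts its invocations."""
    global calls
    calls += 1
    return text.upper()

rows = [f"row {i}" for i in range(1000)]

def lazy_column():
    # Nothing is computed here; work happens only as results are pulled
    return (analyze(text) for text in rows)

# Like show(n=5): only as much work as is needed to produce 5 results
first_five = list(islice(lazy_column(), 5))
calls_after_show = calls

# Like the first full action over the data: every row is processed
materialized = list(lazy_column())
calls_after_full = calls

# Reading from the materialized copy does not call analyze again
again = materialized[:5]
calls_after_reuse = calls

print(calls_after_show, calls_after_full, calls_after_reuse)  # 5 1005 1005
```

Spark differs in the details. It works in whole partitions rather than single rows, and cache() only marks the data for caching, so it is populated by whichever partitions a later action actually computes.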
Problem statement:
I have this crummy file from a government department that lists the operating schedules of 500+ bus routes across multiple sheets in a single Excel workbook. There is really no structure here, and the author seems to have had a single objective: pack everything tight into a single file!
Now, what am I trying to do:
Do extensive text analysis to extract the starting time of each run on the route. Please note there are multiple routes on a single sheet and then there are around 12 sheets in all.
I am cutting my teeth with the pandas library and stuck at this point:
Have a dictionary where
Key: sheet name (random str to identify the route sequence)
Value: DataFrame created with all cell data on that sheet.
What I would like to know:
How do I create one gigantic DataFrame that has all the rows from across the 12 sheets? I would start my text analysis after this step.
Is the above the right way forward?
Thanks in advance.
AT
Could try a multi-index dataframe:
df_3d = pd.concat(dfs,              # list of DataFrames
                  keys=sheetnames,  # list of sheet names
                  axis=1)
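where dfs would be something like the following (recent pandas versions spell the argument sheet_name; sheetname is the older spelling):
dfs = [pd.read_excel(io, sheet_name=i) for i in sheetnames]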
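Since the question asks for all rows in one DataFrame, a row-wise variant may be closer to what is wanted than axis=1. The sketch below assumes the dictionary of sheet name to DataFrame described in the question; the sheet names, columns and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the {sheet name: DataFrame} dictionary
sheets = {
    "Sheet1": pd.DataFrame({"route": ["1A", "1B"], "start": ["06:00", "06:30"]}),
    "Sheet2": pd.DataFrame({"route": ["2A"], "start": ["07:15"]}),
}

# Passing a dict stacks the frames row-wise and uses the keys
# as the outer level of a MultiIndex
all_rows = pd.concat(sheets, names=["sheet", "row"])

# Rows from one sheet can still be pulled out by its name
sheet2_rows = all_rows.loc["Sheet2"]

# Flat alternative: turn the sheet name into an ordinary column
flat = all_rows.reset_index(level="sheet").reset_index(drop=True)

print(all_rows.shape)        # (3, 2)
print(list(flat["sheet"]))   # ['Sheet1', 'Sheet1', 'Sheet2']
```

Keeping the sheet name as an index level or a column preserves where each row came from, which may help when the text analysis needs to trace a start time back to its sheet. As far as I know, pd.read_excel with sheet_name=None returns such a dictionary for all sheets in one call.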