randomly split DataFrame by group? - python-3.x

I have a DataFrame where multiple rows share group_id values (a very large number of groups).
Is there an elegant way to randomly split this data into training and test sets so that the two sets do not share any group_id values?
The best process I can come up with right now is:
- create a mask with msk = np.random.rand(len(df)) < 0.8
- apply it to the DataFrame
- check the test set for rows that share a group_id with the training set and move those rows to the training set.
This is clearly inelegant and has multiple issues (including the possibility that the test set ends up empty). I feel like there must be a better way; is there?
Thanks

Oh, there is an easy way!
- create a list / array of the unique group_ids
- create a random mask for this list
- use the mask to split the DataFrame, as sketched below
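A minimal sketch of that approach, assuming the DataFrame is called df and an 80/20 split (both are placeholders):
import numpy as np

rng = np.random.default_rng(seed=0)              # seed only for reproducibility

unique_groups = df["group_id"].unique()          # one entry per group
mask = rng.random(len(unique_groups)) < 0.8      # ~80% of the groups go to training
train_groups = unique_groups[mask]

is_train = df["group_id"].isin(train_groups)     # boolean row mask over the original DataFrame
train_df = df[is_train]
test_df = df[~is_train]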

Related

In TensorFlow, how to compute the mean of each column of a batch generated from a CSV that has NaNs in multiple columns?

I am reading in a CSV in batches, and each batch has nulls in various places. I don't want to use TensorFlow Transform, as it requires loading the entire data set into memory. Currently I cannot ignore the NaNs present in each column while computing the means if I try to do it for the entire batch at once. I can loop through each column and find the mean per column that way, but that seems like an inelegant solution.
Can somebody help in finding the right way to compute the mean per column of a CSV batch that has NaNs present in multiple columns? Also, [1, 2, np.nan] should produce 1.5, not 1.
I am currently doing this, given a tensor a of rank 2:
tf.math.divide_no_nan(
    tf.reduce_sum(tf.where(tf.math.is_finite(a), a, 0.), axis=0),
    tf.reduce_sum(tf.cast(tf.math.is_finite(a), tf.float32), axis=0))
Let me know if somebody has a better option.
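For reference, here is that same expression wrapped up as a function, a sketch assuming TensorFlow 2.x eager execution and a rank-2 float batch:
import numpy as np
import tensorflow as tf

def nan_mean_per_column(a):
    # Column-wise mean of a rank-2 float tensor, ignoring NaN entries.
    finite = tf.math.is_finite(a)
    col_sums = tf.reduce_sum(tf.where(finite, a, tf.zeros_like(a)), axis=0)
    col_counts = tf.reduce_sum(tf.cast(finite, a.dtype), axis=0)
    # Columns that contain no finite values come back as 0 rather than NaN.
    return tf.math.divide_no_nan(col_sums, col_counts)

batch = tf.constant([[1., 2., np.nan],
                     [3., np.nan, np.nan]])
print(nan_mean_per_column(batch).numpy())   # [2. 2. 0.]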

pandas: merge multiple DataFrames and then do text analysis?

Problem statement:
I have this crummy file from a government department that lists the operation schedules of 500+ bus routes across multiple sheets in a single Excel workbook. There is really no structure here, and the author seems to have had a single objective: pack everything as tightly as possible into one file!
Now, what am I trying to do:
Extensive text analysis to extract the starting time of each run on a route. Please note there are multiple routes on a single sheet, and there are around 12 sheets in all.
I am cutting my teeth with the pandas library and am stuck at this point:
I have a dictionary where
Key: sheet name (a random string identifying the route sequence)
Value: DataFrame created from all cell data on that sheet.
What I would like to know:
Create one gigantic DataFrame that has all the rows from across the 12 sheets, then start my text analysis after this step.
Is the above the right way forward?
Thanks in advance.
AT
Could try a multi-index DataFrame:
df_3d = pd.concat(dfs,             # list of DataFrames
                  keys=sheetnames, # list of sheet names
                  axis=1)
Where dfs would be something like
dfs = [pd.read_excel(io, sheet_name=i) for i in sheetnames]
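If the goal is one tall DataFrame with all the rows stacked (rather than sheets placed side by side), pd.concat also works directly on the dictionary described in the question with the default axis=0. A sketch, where sheets and the file name are placeholders:
import pandas as pd

# pd.read_excel(path, sheet_name=None) returns exactly such a dict: {sheet name: DataFrame}
sheets = pd.read_excel("schedules.xlsx", sheet_name=None)

all_rows = pd.concat(sheets)               # sheet names become the outer level of a MultiIndex
all_rows = (all_rows.reset_index(level=0)  # move the sheet name into a regular column
                    .rename(columns={"level_0": "sheet_name"}))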

spark dataset: how to get the count of occurrences of unique values from a column

Trying the Spark Dataset APIs, which read a CSV file and count the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count") // WORKS
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient. My guess would be the first one (the reason I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the Dataset APIs?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
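A quick way to see this for yourself, shown here as a PySpark sketch (the question uses the Scala API, but the calls and the resulting plans are the same; the file path is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.read.option("header", "true").csv("people.csv")

# Both variants should show the same projection + aggregation in the physical plan.
data.select("profession").groupBy("profession").count().explain()
data.groupBy("profession").count().explain()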
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count()

Custom parallel extractor - U-SQL

I am trying to create a custom parallel extractor, but I have no idea how to do it correctly. I have big files (more than 250 MB) where the data for each row is stored across 4 lines; each file line stores the data for one column. Is it possible to create an extractor that works in parallel for large files? I am afraid that the data for one row will end up in different extents after the file is split.
Example:
...
Data for first row
Data for first row
Data for first row
Data for first row
Data for second row
Data for second row
Data for second row
Data for second row
...
Sorry for my English.
I think you can process this data using U-SQL sequentially, not in parallel. You would have to write a custom applier that takes single/multiple rows and returns single/multiple rows, and then invoke it with CROSS APPLY. You can take help from this applier.
U-SQL Extractors by default are scaled out to work in parallel over smaller parts of the input files, called extents. These extents are about 250MB in size each.
Today, you have to upload your files as row-structured files to make sure that the rows are aligned with the extent boundaries (although we are going to provide support for rows spanning extent boundaries in the near future). Either way, though, the extractor UDO model would not know whether your 4 rows are all inside the same extent or span extents.
So you have two options:
- Mark the extractor as operating on the whole file by adding the following line before the extractor class:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
Now the extractor will see the full file, but you lose the scale-out of the file processing.
- Extract one row per line and use a U-SQL statement (e.g. using window functions or a custom REDUCER) to merge the rows into a single row.
I have discovered that I cannot use a static method to get an instance of the IExtractor implementation in the USING statement if I want AtomicFileProcessing set to true.

Group analysis in Microsoft Azure

I have a data set, and I would like to produce a prediction model based on that data set, using Microsoft Azure.
This data set contains groups of events that together make up a bigger event; for example, a few lines in the data set that are close in time (there is a time column) together form one event in time.
Does anybody know a method for how I can do this? Is there any way to create a prediction model that learns not from a certain column, but from a different data set (of results, for that matter)?
Thanks
Yes, this can be done using Azure Machine Learning; have a look in the Gallery for various examples where this has been shown.
I think for your specific question it will be good to have a look at the Bike regression example. In that example they show how they aggregate information from multiple rows to score on a single feature.
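Outside of Azure ML Studio, the row-aggregation idea itself looks something like the following pandas sketch (the file name, column names, and the 5-minute window are made-up assumptions), where events that are close in time are rolled up into one feature row:
import pandas as pd

# Hypothetical input: one row per low-level event, with a timestamp and a value column.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

features = (events.set_index("timestamp")
                  .resample("5min")["value"]       # bucket events that are close in time
                  .agg(["count", "mean", "max"])   # one feature row per bucket
                  .dropna())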
