pandas: merge multiple DataFrames and then do text analysis? - excel

Problem statement:
I have this crummy file from a government department that lists the operation schedules of 500+ bus routes across multiple sheets in a single Excel workbook. There is really no structure here, and the author seems to have had a single objective: pack everything tight into a single file!
Now, what I am trying to do:
Do extensive text analysis to extract the starting time of each run on the route. Please note there are multiple routes on a single sheet, and there are around 12 sheets in all.
I am cutting my teeth on the pandas library and am stuck at this point:
Have a dictionary where
Key: sheet name (random str to identify the route sequence)
Value: DataFrame created with all cell data on that sheet.
What I would like to know:
Create one gigantic DataFrame that has all the rows from across the 12 sheets, then start my text analysis after that step.
Is the above the right way forward?
Thanks in advance.
AT

Could try a multi-index DataFrame:
df_3d = pd.concat(dfs,             # list of DataFrames
                  keys=sheetnames, # list of sheet names
                  axis=0)          # axis=0 stacks the rows; axis=1 would put sheets side by side
Where dfs would be something like
dfs = [pd.read_excel(io, sheet_name=i) for i in sheetnames]
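For a fuller picture, here is a minimal sketch, assuming a hypothetical workbook path and using sheet_name=None, which returns every sheet at once as a dict of DataFrames:

import pandas as pd

# Hypothetical path; replace with the actual workbook.
path = "bus_schedules.xlsx"

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}.
# header=None because the sheets reportedly have no consistent structure.
sheets = pd.read_excel(path, sheet_name=None, header=None)

# Stack all rows into one big DataFrame; the sheet name becomes an outer
# index level, so each row can still be traced back to its route sheet.
big = pd.concat(sheets, names=["sheet", "row"]).reset_index(level="sheet")

Each row then carries its sheet name in a column, so the text analysis can run over one frame without losing the route grouping.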

Related

Comparing two tables of different size, with multiple columns in VBA

Looking to use VBA to compare two tables, with three columns each, against each other. Beginner here and very lost.
They may have a different number of entries each, and there may be some in table A that aren't in table B, and vice versa.
Some of the individual columns may match, but I'm trying to work out how to make sure all three columns are compared as one unit against all three columns in the other table.
For example
xyz123 55.50 12/07/21, if compared with XYZ123 54.55 12/07/21, will show up as not a match, because the middle column is a different number.
Have attached a picture below. For the most part, and unlike the photo, each table will be in a completely random order, and it's unlikely that there will be the same entry in table 1, row 1, as in table 2, row 1.
Ideally, I'm trying to create two new tables to the right of the original tables: the first one being the entries table 1 has that table 2 does not have, and the second one being the entries table 2 has that table 1 does not have.
Have attached an example below of the end result I'm looking for. The four rows on the left are entries that the first table has but the second table doesn't, and the rows to the right are all entries that the second table has but the first table does not.
I've tried to search on this but haven't found something that matches what I've got, and I'm struggling to adapt someone else's code to my specific problem.
Any help on this would be greatly appreciated.
Maybe not a direct answer to your problem, but is this data also in a database somewhere, or are you familiar with MS Access? You could open the tables in Access, and it is pretty easy to do this kind of thing with databases.
If not, then yes, it is doable with VBA. There are numerous ways of doing it.
The simplest is to scroll through one table a line at a time and compare each row against every row in the other table, marking it as a match or not. This will work with small tables and be easy and quick, but for large tables it would be wasteful and may take a long time to complete.
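If you are open to Python instead of VBA, here is a minimal pandas sketch of the same comparison; the file and sheet names are hypothetical, and it assumes each table's three columns sit on its own sheet with matching headers:

import pandas as pd

# Hypothetical file and sheet names; adjust to the real workbook.
t1 = pd.read_excel("tables.xlsx", sheet_name="Table1")
t2 = pd.read_excel("tables.xlsx", sheet_name="Table2")
cols = list(t1.columns)  # compare all three columns as one unit

# An outer merge with indicator=True tags each row by where it appears.
merged = t1.merge(t2, on=cols, how="outer", indicator=True)
only_in_1 = merged.loc[merged["_merge"] == "left_only", cols]   # in table 1, not table 2
only_in_2 = merged.loc[merged["_merge"] == "right_only", cols]  # in table 2, not table 1

The comparison is exact and case-sensitive as written; normalize the text columns first (e.g. with str.upper()) if case should be ignored.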

Levenshtein distance algorithm on Spark

I'm getting started with the Hadoop ecosystem, and I'm facing some questions and need your help.
I have two HDFS files and need to compute the Levenshtein distance between a group of columns of the first one and another group of columns of the second one.
This process will be executed each day with a quite considerable amount of data (150M rows in the first file vs. 11M rows in the second one).
I would appreciate some guidance (code examples, references, etc.) on how I can read my two files from HDFS, compute the Levenshtein distance (using Spark?) as described, and save the results to a third HDFS file.
Thank you very much in advance.
I guess you have CSV files, so you can read them directly into a DataFrame:
val df1 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_1")
The org.apache.spark.sql.functions object contains a levenshtein(l: Column, r: Column): Column function, so you need to pass it DataFrame columns of String type. If you want to pass a group of columns, you can use concat('col1, 'col2, ...) to concatenate multiple columns and pass the result to the previous function. If you have two or more DataFrames, you have to join them into one DataFrame and then perform the distance calculation. Finally, you can save your results to CSV using df.write.csv("path").
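The same pipeline end to end in PySpark (the Python counterpart of the API the answer above uses), as a minimal sketch; the HDFS paths and the column names a1, a2, b1, b2 are hypothetical. Note that scoring every pair of 150M x 11M rows via a cross join is astronomically expensive, so in practice you would narrow the candidate pairs first:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, levenshtein

spark = SparkSession.builder.appName("levenshtein-distance").getOrCreate()

# Hypothetical HDFS paths; adjust to the real files.
df1 = spark.read.option("header", "true").csv("hdfs:///path/to/file_1")
df2 = spark.read.option("header", "true").csv("hdfs:///path/to/file_2")

# Concatenate each group of columns into a single string per side.
left = df1.withColumn("key1", concat_ws("|", col("a1"), col("a2")))
right = df2.withColumn("key2", concat_ws("|", col("b1"), col("b2")))

# Caution: a full cross join is 150M x 11M pairs; block or filter first
# (e.g. join on a shared key or prefix) before scoring at this scale.
pairs = left.crossJoin(right)
scored = pairs.withColumn("distance", levenshtein(col("key1"), col("key2")))

scored.write.option("header", "true").csv("hdfs:///path/to/output")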

Excel sheets with scores by same ID of person (Kahoot) - How to extract and summarize scores from several quizzes?

I've used Kahoot in the classroom and have several Excel files with scores from quizzes.
Students attended quizzes using unique IDs. In each file, scores are visible for each ID (but ordered by success on each quiz). There are also some students missing or stating wrong IDs (I'll ignore that).
Now I would like to accumulate all scores for all student IDs in one sheet and summarize them by student ID.
How can I do that most efficiently?
Any pointer or advice is appreciated.
Thanks,
B.
Here's a high-level guide to getting what you want, along with a sample in this file.
Step 1 - Combine Files to Sheet with Unified Columns
Objective
The goal here is to:
Combine all of your data from the other files into a single sheet.
Merge the data so there is a single column for each field (i.e. column A has ID, column B has score).
No breaks in rows.
No formulas.
To illustrate, I made this fake list based loosely on your description.
Method
You probably could do this manually, but a macro could also be used. If you expect to do this year over year, you might look into VBA to open and close files in a folder. However, since that wasn't part of the question, you can do copy-paste (better yet, make a kid do it!). Just make sure there's only one header for each column and that all of the data records align. You should probably paste values if you have any formulas.
Step 2 - Show Summation
There are a couple of ways this could be done. A pivot table is probably the most sensible because you could include each quiz as a column to see the total. You could also use a pivot table to do averages by student, etc.
To make a pivot table, I would recommend going on YouTube; they will do a better job of explaining than me.
On that same example file, I included some tabs to illustrate the power of pivot tables, plus a couple of graphs.
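If Python is on the table, both steps can also be done in a short pandas script; a minimal sketch, assuming a hypothetical folder of exports and the column headers "Player ID" and "Total Score" (Kahoot export layouts vary):

import pandas as pd
from pathlib import Path

# Step 1: combine every export into one frame with unified columns.
# The folder name and column headers here are hypothetical.
frames = [
    pd.read_excel(f)[["Player ID", "Total Score"]]
    for f in Path("kahoot_exports").glob("*.xlsx")
]
combined = pd.concat(frames, ignore_index=True)

# Step 2: summarize by student ID, as the pivot table would.
totals = combined.groupby("Player ID")["Total Score"].sum().reset_index()
totals.to_excel("summary.xlsx", index=False)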
Hope that helps. If you have specific technical questions on this, you might consider asking separately.

Assigning indexes across rows within python DataFrame

I'm currently trying to assign a unique index across rows, rather than down columns. The main problem is that these values can never repeat, and they must be preserved with every monthly report that I run.
I've thought about merging the columns and assigning an index to that, but my concern is that I won't be able to easily modify the DataFrame and still preserve the same index values for each cell with this method.
I'm expecting my df to look something like this below:
Sample DataFrame
I haven't yet found a viable solution so haven't got any code to show yet. Any solutions would be much appreciated. Thank you.

Google Sheets Query double group by for data

I was wondering what the Excel or Google Sheets function would be for taking a set of data with two labels, aggregating it by one of them, and transposing it for graphing. To illustrate the problem, I have an image below:
I am interested in taking the 3 columns on the left and converting them to be viewed like the 3 columns on the right.
I found the answer - it was using pivot and group by at the same time:
=query(AR6:AT15,"select AS, avg(AT) group by AS pivot AR")
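For reference, the same group-by-one-label-and-pivot-the-other shape can be produced in pandas with pivot_table; a minimal sketch with made-up data standing in for the AR/AS/AT range:

import pandas as pd

# Made-up data: two label columns and one value column.
df = pd.DataFrame({
    "series": ["x", "x", "y", "y"],
    "group":  ["a", "b", "a", "b"],
    "value":  [1.0, 2.0, 3.0, 4.0],
})

# Group by one label (rows), pivot the other (columns), average the values.
wide = df.pivot_table(index="group", columns="series", values="value", aggfunc="mean")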
