Updating column value with alias name pySpark

I'm trying to update the values of a column that has an aliased name by using .withColumn (I'm using PySpark).
For example, the data frame has the columns source1.NAME and source2.NAME. I want to update the values of source1.NAME and keep the same alias names for the columns.
When I tried
df.withColumn('source1.NAME', lit('MyName'))
it created a new column instead.
Example input:
source1.NAME    source2.NAME
Thomas          Martin
Kris            Johan
Peter           Rolls
Desired output:
source1.NAME    source2.NAME
New Name        Martin
New Name        Johan
New Name        Rolls
I want to retain the alias names and only update the source1.NAME values, based on conditions.

Related

Excel - Auto increment value based on string search

I've got a set of data that needs updating so that records whose names share the same first two words use the same "reference" value, auto-incrementing by .1 with each record found.
Please see the example table below, which shows the desired result. Is there a way to do this with an automated function in Excel?
Alternatively, I can use SQL Server to do this if easier.
Many thanks in advance
Name                                        Current Reference    DESIRED Reference
First Name                                  1001.1               1001.1
First Name Also                             2123.1               1001.2
First Name Also With More Text              3456.1               1001.3
Second Name                                 4567.1               4567.1
Second Name Also                            1232.1               4567.2
Second Name Also With More Text & Symbols   5890.1               4567.3

setting a variable equal to the value in the adjacent column in pandas dataframe

I have a dataframe that I need to use to get the full name of an object from its abbreviation, in order to search for it in a different dataframe.
These are the first few lines of the simple dataframe; it lists all of the national parks in the US. I need this for input menus and decision trees in the program.
In bad pseudocode, I need something like:
my_var = next column over from park_abbrev in df
So if park_abbrev = DENA, then my_var = Denali National Park and Preserve.
I need this because I use the initials for user input, and that leads to this function, which picks trails from a separate, very large dataframe depending on the difficulty level the user selects. That dataframe only has the full name of the park, not the abbreviation, and I need the full name to get only the trails in the park of interest.
thank you for any suggestions.

You can get the park_name where park_abbrev is DENA using df.loc:
df.loc[df['park_abbrev']=='DENA','park_name']
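Note that the df.loc expression above returns a Series, not a single string. A minimal sketch of pulling out the scalar value follows, assuming the column names park_abbrev and park_name from the question; the helper park_name_for and the second sample row are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample of the parks table; only the DENA row comes from the question.
parks = pd.DataFrame({
    "park_abbrev": ["DENA", "YELL"],
    "park_name": ["Denali National Park and Preserve", "Yellowstone National Park"],
})

def park_name_for(df, abbrev):
    """Return the full park name for an abbreviation, or None if it is absent."""
    matches = df.loc[df["park_abbrev"] == abbrev, "park_name"]
    # .loc with a boolean mask yields a Series; take its first element if any.
    return matches.iloc[0] if not matches.empty else None

my_var = park_name_for(parks, "DENA")
print(my_var)
```

The returned string can then be used to filter the trails dataframe on its full park name column.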

Compare Excel rows within a table

I have an Excel table (formatted as a table and named "table 1"), so the table columns are named like #LastName, #FirstName and so on.
This table contains information on when people attended a seminar. They can attend any number of seminars, and they can also attend the same seminar twice or more.
I now want to find the most recent date on which each person attended a particular seminar.
Let me give an example.
Table:
#FirstName  #LastName  #Seminar    #Date
Frank       Mayer      Workshop 1  2017/01/15
Frank       Mayer      Workshop 2  2019/05/27
Sabine      Adams      Workshop 1  2017/01/15
Volker      Mueller    Workshop 1  2017/01/15
Frank       Mayer      Workshop 1  2018/04/23
As you can see from this simple example, Frank Mayer attended Workshop 1 twice. All others attended each workshop only once.
The goal is a list of name, workshop and last attendance, so the final list should look like this:
#FirstName  #LastName  #Seminar    #Date
Frank       Mayer      Workshop 2  2019/05/27
Sabine      Adams      Workshop 1  2017/01/15
Volker      Mueller    Workshop 1  2017/01/15
Frank       Mayer      Workshop 1  2018/04/23
I really have no idea how to solve this with Excel formulas, since it involves not only comparing dates but also finding duplicate entries that differ only in the date. If possible, I'd like to NOT use VBA programming.
Do you have any ideas? My table has 1500 lines, so doing this by hand is not an option...
Maybe there is a way to create a new sheet or table with the results?
Best Regards
Olaf

A more dynamic approach
(i.e. you would not have to go through the entire process each time more data is added):
1) Add a helper column, named e.g. "ID", that combines first name, last name and workshop in each cell. Use either Concatenate or =Name & Lastname & Workshop to join them.
2) Create a pivot table from your table with the following field list. Rows: add "ID". Values: add "Date" and change the value field settings (right-click) to "Max of Date".
3) Remember to refresh the pivot table when adding new data, or change the pivot settings so that it updates automatically when the workbook is opened.

There are a thousand and one ways to do this, and you might be surprised how easy it is to do manually too.
If you're reducing the table, meaning removing rows from it, then you could:
1) Add a helper column which concatenates the first name, last name, and workshop.
2) Sort by this column and then by date in descending order.
3) Remove duplicates based only on the helper column.
4) Remove the helper column.
5) Sort as desired.
The trick is the sort. Sorting by the helper column groups the repeated names per workshop, and sorting the dates in descending order ensures the most recent one is at the top of each group. So when you remove duplicates, the first occurrence is the most recent and is retained, while subsequent entries are removed.
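To make the sort-then-deduplicate logic concrete, here is a minimal sketch in Python rather than Excel, using the sample rows from the question. The function name latest_attendance is hypothetical; the tuple (first, last, seminar) plays the role of the helper column.

```python
from datetime import date

# Sample rows from the question: first name, last name, seminar, date.
rows = [
    ("Frank", "Mayer", "Workshop 1", date(2017, 1, 15)),
    ("Frank", "Mayer", "Workshop 2", date(2019, 5, 27)),
    ("Sabine", "Adams", "Workshop 1", date(2017, 1, 15)),
    ("Volker", "Mueller", "Workshop 1", date(2017, 1, 15)),
    ("Frank", "Mayer", "Workshop 1", date(2018, 4, 23)),
]

def latest_attendance(rows):
    # Sort newest first so the first row seen per key is the most recent one.
    ordered = sorted(rows, key=lambda r: r[3], reverse=True)
    seen = set()
    kept = []
    for first, last, seminar, when in ordered:
        key = (first, last, seminar)  # equivalent of the helper column
        if key not in seen:           # "remove duplicates" keeps the first hit
            seen.add(key)
            kept.append((first, last, seminar, when))
    return kept

result = latest_attendance(rows)
for row in result:
    print(row)
```

Only the 2017 entry for Frank Mayer in Workshop 1 is dropped, because a later entry with the same key was seen first.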

I'd advise you to use the "Subtotals" Excel feature, choosing the Max function.
You might encounter problems because the first entry of the dialog takes only a single column, but this can be solved by creating an extra column that appends first name and last name, and basing the subtotal on that (hidden) column (sorry for the Dutch, I don't have an English Excel).
The corresponding entries for the other columns can be found using basic lookup formulas (Match, VLookup, ...).

I would recommend using a pivot table as the simplest and fastest way.

How to make a fuzzy filter in Power Query according to a list

I want to filter a Power Query table according to another list:
The fact table is:
Location    Name
MEL/1F/101  zmel
SHA         zsha
BKK/2F      zbkk
SGN         zsgn
And the lookup list is:
{"BKK","SHA"}
The result I want is:
Location    Name
SHA         zsha
BKK/2F      zbkk
Currently I use
l={"SHA","BKK"},
b=Table.SelectRows(#"Expanded Column1", each List.Contains(l,[location]))
but BKK/2F is omitted and only SHA shows.
Does anyone know how to correct this? Thanks.

You can add a new column using Conditional Column, referencing the table column that contains SHA and BKK. Replace the column name with your own.
You can use the Fill Down function if you want to get rid of the nulls.
Update
In your case you might want to use the "begins with" operator, since your BKK value has extra text after it.
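The reason List.Contains misses BKK/2F is that it tests for an exact match against the list items, while the desired filter is a prefix match. A minimal sketch of that "begins with any item in the list" logic is shown here in Python, using the sample data from the question; the function name filter_by_prefix is hypothetical. In Power Query itself, a similar test could presumably be built from Text.StartsWith combined with List.Transform and List.AnyTrue, though that variant is not shown or tested here.

```python
# Sample fact table and lookup list from the question.
rows = [
    {"Location": "MEL/1F/101", "Name": "zmel"},
    {"Location": "SHA", "Name": "zsha"},
    {"Location": "BKK/2F", "Name": "zbkk"},
    {"Location": "SGN", "Name": "zsgn"},
]
lookup = ["BKK", "SHA"]

def filter_by_prefix(rows, prefixes):
    # str.startswith accepts a tuple and is True if any of the prefixes matches.
    wanted = tuple(prefixes)
    return [r for r in rows if r["Location"].startswith(wanted)]

result = filter_by_prefix(rows, lookup)
for r in result:
    print(r["Location"], r["Name"])
```

Row order follows the fact table, so SHA comes before BKK/2F as in the desired result.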

Need Help in Excel Pivot Table

I am working in Excel 2007 and I need help with creating a pivot table.
My Excel sheet looks somewhat like this:
Name  Date        Team         Location
John  2011-05-01  Project      NY
John  2010-10-12  Information  NY
John  2010-02-04  Development  CA
Sam   2011-05-01  Development  CA
Sam   2010-01-01  Project      NY
Sam   2008-01-01  Programmer   NY
Brad  2011-04-03  Project      NY
Brad  2009-01-01  Info         NY
Brad  2007-01-01  Designer     CA
Now, suppose I create a pivot table based on the data above and put a filter on "Date" to see who worked where ("Location") and under what "Team", let's say between 2010-01-01 and 2011-12-31.
Then it will count "John" three times, "Sam" twice and "Brad" once, for a total of 6 employees working between 2010-01-01 and 2011-12-31.
I want to remove these duplicates so that once "John" is counted, he is not counted again, even if he switched to a different "Team" or "Location". That way I can count the total number of employees between 2010-01-01 and 2011-12-31 without any duplicates.
I understand that to make the pivot table count unique values and remove these duplicates, I need to add another column. But I need help creating this column.
Could anyone help me out here?
Thanks a lot guys!

Anyway, tell me if this would work for you.
1) Sort your spreadsheet by 'Name' first and by 'Date' second (newest first).
2) Add an extra column called 'Old Position'.
3) Go down the sorted list and, for every name with duplicate rows, leave the first occurrence alone but add an 'X' to the 'Old Position' column for all of the older duplicates.
Now you can filter by keeping only the rows whose 'Old Position' column is not equal to 'X'. This should give you just the most recent position for each employee.
As long as there are not two distinct employees with the exact same name, I think this should work (otherwise try to use an employee ID or something else unique to each individual instead of their name).
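The marking procedure above can be sketched as follows, in Python rather than Excel, using the sample data from the question. The function name mark_old_positions is hypothetical, and the sketch assumes the dates are ISO strings (YYYY-MM-DD), which sort correctly as plain text.

```python
# Sample rows from the question: name, date, team, location.
rows = [
    ("John", "2011-05-01", "Project", "NY"),
    ("John", "2010-10-12", "Information", "NY"),
    ("John", "2010-02-04", "Development", "CA"),
    ("Sam", "2011-05-01", "Development", "CA"),
    ("Sam", "2010-01-01", "Project", "NY"),
    ("Sam", "2008-01-01", "Programmer", "NY"),
    ("Brad", "2011-04-03", "Project", "NY"),
    ("Brad", "2009-01-01", "Info", "NY"),
    ("Brad", "2007-01-01", "Designer", "CA"),
]

def mark_old_positions(rows):
    # Step 1: sort by date (newest first), then by name; the second sort is
    # stable, so rows stay newest-first within each name.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ordered.sort(key=lambda r: r[0])
    # Steps 2-3: append an 'Old Position' flag, "X" for every repeat of a name.
    seen = set()
    marked = []
    for name, day, team, location in ordered:
        marked.append((name, day, team, location, "X" if name in seen else ""))
        seen.add(name)
    return marked

marked = mark_old_positions(rows)
# Keep only rows not flagged as old positions: one row per employee.
current = [r for r in marked if r[4] != "X"]
for r in current:
    print(r)
```

Each employee then appears exactly once, with their most recent team and location.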

Put "Date" in the report filter and "Name" in the row labels, and set the filter for "Location" to "NY". "Location" can be placed in either the report filter or the row labels, depending on how you want to see the data.
