Pandas sort row by values of a column - python-3.x

My data looks something like this (image: given dataframe), just with many more rows (and different timestamps).
I now want to transform the table so that I have one row per timestamp, with the row sorted by the JI value and each JI value containing X, Y, Z, RX, RY, RZ and confidence. It should look like this:
timestamp | JI0 | JI1 | JI2 | ... | JI21 | phase
timestamp1 | (confidence|X|Y|Z|RX|RY|RZ) of JI0 | ... | (confidence|X|Y|Z|RX|RY|RZ) of JI21 | phasevalue
timestamp2 |
The ID value is not needed for now. I already tried to get this result using pivot (and pivot_table), but I could not get all values for one joint after another. Instead I got all values of RX for every joint (actually not even all of them, for some reason), then all values of RY for every joint, and so on.
The code:
df.pivot_table(index='timestamp', columns='JI', values=['confidence','X','Y','Z','RX','RY','RZ'])
The result using the code from above: (image: result)
I hope my question is understandable, otherwise I am of course willing to answer any questions.
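One possible way to get all values for one joint next to each other is to pivot as above and then reorder the two column levels. The following is only a sketch, assuming the long-format columns named in the question (timestamp, JI, confidence, X, Y, Z, RX, RY, RZ, phase) and one phase value per timestamp; the sample values and the flattened column names such as JI0_X are made up for illustration.

```python
import pandas as pd

# Made-up sample in the long format described in the question:
# one row per (timestamp, JI) pair.
df = pd.DataFrame({
    "timestamp": [1, 1, 2, 2],
    "JI": [0, 1, 0, 1],
    "confidence": [0.9, 0.8, 0.7, 0.6],
    "X": [1.0, 2.0, 3.0, 4.0],
    "Y": [1.1, 2.1, 3.1, 4.1],
    "Z": [1.2, 2.2, 3.2, 4.2],
    "RX": [0.1, 0.2, 0.3, 0.4],
    "RY": [0.5, 0.6, 0.7, 0.8],
    "RZ": [0.9, 1.0, 1.1, 1.2],
    "phase": ["a", "a", "b", "b"],
})

values = ["confidence", "X", "Y", "Z", "RX", "RY", "RZ"]

# pivot_table yields two-level columns ordered as (value, JI).
wide = df.pivot_table(index="timestamp", columns="JI", values=values)

# Swap the levels to (JI, value), then reindex so that each joint's
# values sit together and in the requested order.
wide = wide.swaplevel(0, 1, axis=1)
joints = sorted(df["JI"].unique())
wide = wide.reindex(columns=pd.MultiIndex.from_product([joints, values]))

# Flatten to single-level names like "JI0_X" and attach the phase
# (assumption: phase is constant within a timestamp).
wide.columns = [f"JI{joint}_{value}" for joint, value in wide.columns]
wide["phase"] = df.groupby("timestamp")["phase"].first()

print(wide)
```

The explicit reindex is used instead of sort_index because sorting would order the inner level alphabetically (RX, RY, RZ, X, ...) rather than in the order listed in the question.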

Related

How to convert an array of values so that each value is closer to the mean, but with a similarly shaped distribution (i.e. reduce the stdev) in PySpark

I hope I've described the job I need to do in the correct terms. Essentially, I need to 'compress' a series of values so that all the values are closer to the mean, but their values should be reduced (or increased) relative to their distance from the mean...
The dataframe looks like this:
>>> df[['population', 'postalCode']].show(10)
+----------+----------+
|population|postalCode|
+----------+----------+
| 1464| 96028|
| 465| 96015|
| 366| 96016|
| 5490| 96101|
| 183| 96068|
| 569| 96009|
| 366| 96054|
| 90| 96119|
| 557| 96006|
| 233| 96116|
+----------+----------+
only showing top 10 rows
>>> df.describe().show()
+-------+------------------+------------------+
|summary| population| postalCode|
+-------+------------------+------------------+
| count| 1082| 1082|
| mean|23348.511090573014| 93458.60813308688|
| stddev|21825.045923603615|1883.6307236060127|
+-------+------------------+------------------+
The population mean is about right for my purposes, but I need the variance around it to be smaller...
Hope that makes sense, any help performing this job either in pyspark or node.js greatly appreciated.
The general idea is to:
1. Translate the mean to zero.
2. Rescale to the new standard deviation.
3. Translate to the desired mean (in this case, the original mean).
In pseudo-code, if your values are stored in the variable x:
x.scaled = new.mean + (x - mean(x)) * new.SD/sd(x)
Or, for the specific case of, say, SD=1000 and no change to the mean:
x.scaled = mean(x) + (x - mean(x)) * 1000/sd(x)
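The pseudo-code above can be sketched in plain Python as follows. This is only an illustration using the standard library, not a PySpark implementation; the function name rescale is made up here, and the sample list is the ten population values shown in the question. It uses the sample standard deviation, which is an assumption about which definition is wanted.

```python
from statistics import mean, stdev

def rescale(values, new_sd, new_mean=None):
    """Shift and scale values to the requested mean and standard deviation."""
    old_mean = mean(values)
    old_sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if new_mean is None:
        new_mean = old_mean  # keep the original mean
    # Centre on zero, rescale the spread, then move to the target mean.
    return [new_mean + (v - old_mean) * new_sd / old_sd for v in values]

# The ten population values shown in the question.
population = [1464, 465, 366, 5490, 183, 569, 366, 90, 557, 233]
scaled = rescale(population, new_sd=1000)

# The mean is unchanged and the standard deviation is now 1000
# (up to floating-point rounding).
print(mean(population), mean(scaled), stdev(scaled))
```

Because the transformation is linear, the ordering of the values and the shape of the distribution are preserved; only the spread changes.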

Find average and total time between dates in a row string in pyspark?

The Spark df column has a list of ordered dates delimited by '|'. It looks like this:
+------------------------------------------------------------+
| date |
+------------------------------------------------------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |
+------------------------------------------------------------+
I want to calculate the average time for each row and the total difference (last date - first date) in days, like this:
+------------------------------------------------------------+---------------+---------------+
| date | average time | total time |
+------------------------------------------------------------+---------------+---------------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-08-01 | 584 | 2920 |
|2016-08-11|2013-08-25|2013-04-01|2012-08-11 | 365 | 1460 |
+------------------------------------------------------------+---------------+---------------+
We can use the built-in array functions of the DataFrame API for this case.
Example:
import org.apache.spark.sql.functions._ // array_sort, split, col, reverse, datediff, size
val df = Seq(("2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02"), ("2016-08-11|2013-08-25|2013-04-01|2012-01-01")).toDF("id") // sample data
df.withColumn("arr_id", array_sort(split(col("id"), "\\|"))). // split into an array and sort it ascending
withColumn("first_ele", col("arr_id")(0)). // first (earliest) element of the array
withColumn("last_ele", reverse(col("arr_id"))(0)). // last (latest) element of the array
withColumn("total_diff", datediff(col("last_ele").cast("date"), col("first_ele").cast("date"))). // days between the first and last dates
withColumn("avg_time", col("total_diff") / size(col("arr_id"))). // total difference divided by the number of dates
select("id", "total_diff", "avg_time").
show(false)
Result:
+------------------------------------------------------+----------+--------+
|id |total_diff|avg_time|
+------------------------------------------------------+----------+--------+
|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
+------------------------------------------------------+----------+--------+
I think the output in the question differs from the output in this answer because the start and end dates in the question's input and expected output are slightly different.
@Shu gave an awesome answer in Scala.
This is just the PySpark version. Feel free to give him the accepted answer.
from pyspark.sql import functions as F
df.withColumn("date", F.split("date", r"\|"))\
.withColumn("start", F.reverse(F.col("date"))[0]).withColumn("end",F.col("date")[0])\
.withColumn("total_time", F.datediff("end","start"))\
.withColumn("Average", F.col("total_time")/(F.size(F.col("date"))))\
.drop("start","end").show()
+--------------------+----------+-------+
| date|total_time|Average|
+--------------------+----------+-------+
|[2018-08-01, 2017...| 3043| 608.6|
|[2016-08-11, 2013...| 1684| 421.0|
+--------------------+----------+-------+
Here is a simplified version of the other answers using array built-in functions in Spark 2.4+.
First, split the dates string to get an array.
Since the dates are already sorted from newest to oldest, use the element_at function to get the first and last dates: when the index is negative, it accesses the elements of the array from last to first. datediff(end, start) returns end minus start, so the newest date (index 1) goes first and the oldest (index -1) second. The rest is just dividing the result by the size of the array to get the average:
from pyspark.sql.functions import col, datediff, element_at, size, split
df.withColumn("array_dates", split(col("date"), "[|]"))\
.withColumn("total_time", datediff(element_at("array_dates", 1), element_at("array_dates", -1))) \
.withColumn("average_time", col("total_time") / size(col("array_dates"))) \
.drop("array_dates")\
.show(2, False)
#+------------------------------------------------------+----------+------------+
#|date |total_time|average_time|
#+------------------------------------------------------+----------+------------+
#|2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02|3043 |608.6 |
#|2016-08-11|2013-08-25|2013-04-01|2012-01-01 |1684 |421.0 |
#+------------------------------------------------------+----------+------------+
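If the dates in the string aren't ordered as in your example, you could sort the array after the split using array_sort. Note that array_sort sorts in ascending order, so the oldest date then comes first and the newest last; make sure datediff still receives the newest date as its first argument.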
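The arithmetic used by these answers can be checked without Spark. The following plain-Python sketch (the helper name date_stats is made up for illustration) sorts the dates, takes the difference between the latest and earliest, and divides by the number of dates, which is the same definition of "average" the answers above use.

```python
from datetime import date

def date_stats(dates_string, sep="|"):
    """Return (total_days, average_days) for a delimited string of ISO dates."""
    # Sorting makes the result independent of the order in the string.
    dates = sorted(date.fromisoformat(d.strip()) for d in dates_string.split(sep))
    total_days = (dates[-1] - dates[0]).days
    # Same convention as the answers above: divide by the number of dates.
    return total_days, total_days / len(dates)

rows = [
    "2018-08-01|2017-06-01|2015-11-01|2012-08-21|2010-04-02",
    "2016-08-11|2013-08-25|2013-04-01|2012-01-01",
]
for row in rows:
    print(date_stats(row))
# (3043, 608.6)
# (1684, 421.0)
```

If the average gap between consecutive dates is wanted instead, the divisor would be len(dates) - 1; that is a different definition from the one in the question's expected output.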

Comparing two cells and show both of them when it is not equal

I have a table in EXCEL that looks like this:
ATTRIBUTE_IN_1|TYPE_IN_1|ATTRIBUTE_IN_2 |TYPE_IN_2 |
________________________________________________________
advance |String | accounts |decimal(20,10) |
education |String | Job_category |String |
date of birth |date | advance |decimal(20,10) |
and so on.
I want to compare TYPE_IN_1 and TYPE_IN_2 (2nd and 4th columns) wherever ATTRIBUTE_IN_1 and ATTRIBUTE_IN_2 (1st and 3rd columns) have equal values.
In the result I want to see the values where ATTRIBUTE_IN_1 and ATTRIBUTE_IN_2 are equal but TYPE_IN_1 and TYPE_IN_2 are not.
For this example, the result should be:
ATTRIBUTE_IN_1|TYPE_IN_1|ATTRIBUTE_IN_2 |TYPE_IN_2 |
_______________________________________________________
advance |String | advance |decimal(20,10) |
What formula can I use in MS Excel to solve this problem?
Excel doesn't really lend itself to this type of question. You can cobble something together, perhaps along these lines (for attribute 1):
=IFERROR(INDEX($A$2:$A$10,INT(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1))/ROWS($A$2:$A$10))+1),"")
This sets up a 2D array where the rows are the attribute 1's and the columns are the attribute 2's. The array elements where there is a conflict are set to 1. The rest of the formula is to get the corresponding values out in row/column order.
For type 1:
=IFERROR(INDEX($B$2:$B$10,INT(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1))/ROWS($A$2:$A$10))+1),"")
For type 2:
=IFERROR(INDEX($D$2:$D$10,MOD(SMALL(IF(($A$2:$A$10=TRANSPOSE($C$2:$C$10))*($B$2:$B$10<>TRANSPOSE($D$2:$D$10)),
ROWS($A$2:$A$10)*(ROW($A$2:$A$10)-ROW($A$2))+TRANSPOSE(ROW($A$2:$A$10)-ROW($A$2))),ROW(1:1)),ROWS($A$2:$A$10))+1),"")
It's interesting to compare this with the SQL for doing the same thing e.g. in SQLite:
.mode csv
.import book2.csv test1
select a.attribute_in_1,a.type_in_1,b.type_in_2
from
test1 a inner join test1 b
on a.attribute_in_1=b.attribute_in_2 and a.type_in_1<>b.type_in_2;
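The same self-join can be tried from Python with the standard-library sqlite3 module instead of the SQLite shell. This is a sketch under the assumption that the sheet has been loaded into a table named test1 with the four columns from the question; the three rows are the ones shown there.

```python
import sqlite3

# The three example rows from the question.
rows = [
    ("advance", "String", "accounts", "decimal(20,10)"),
    ("education", "String", "Job_category", "String"),
    ("date of birth", "date", "advance", "decimal(20,10)"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test1 (attribute_in_1 TEXT, type_in_1 TEXT, "
    "attribute_in_2 TEXT, type_in_2 TEXT)"
)
conn.executemany("INSERT INTO test1 VALUES (?, ?, ?, ?)", rows)

# Pair every row with every row whose second attribute matches its first
# attribute, and keep only the pairs whose types differ.
conflicts = conn.execute(
    """
    SELECT a.attribute_in_1, a.type_in_1, b.type_in_2
    FROM test1 a
    INNER JOIN test1 b
      ON a.attribute_in_1 = b.attribute_in_2
     AND a.type_in_1 <> b.type_in_2
    """
).fetchall()
conn.close()

print(conflicts)  # [('advance', 'String', 'decimal(20,10)')]
```

The comparison is exact, so stray leading or trailing spaces in the attribute cells would prevent a match and would need trimming first.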

Excel: Create new Row for each column

I have a data set as follows:
A | B | C |...
1abc | 1def | 1ghi |...
2abc | 2def | 2ghi |...
...
This is a collection of memo notes for each account in our database. The 1 in "1abc" represents the account ID and the letters represent some text. There are a total of 177 columns and 1866 rows but not every row has values up to column 177, some may only have two columns worth of data.
I need each column to drop down to the bottom of column A so that all data only occupies column A. For example, all the data in column B would be inserted after the last value in A. Data in column C would be inserted after the last value in column A after it's been populated with the data from B, and so on.
In the end, it should look something like this (from which point I can simply sort it in ascending order).
A | B | C |
1abc | | |
2abc | | |
...
1def | | |
2def | | |
...
1ghi | | |
2ghi | | |
...
Can anyone help out with this? I am assuming that I need a macro and have tried working with the solution found in this post but couldn't figure it out. Oh, and I'm using Excel 2003 if that helps.
I am not sure if this will help, but from what you have given me, it sounds like you have 3 columns of data that you would like to put into one column. If I am correct in my thinking, you can do this a few different ways. To preserve the original document in its entirety while adding an extended sorting system, you could:
1. Create a new tab.
2. Highlight range A with your mouse, then right-click and select Copy (try to keep in mind the size of the range, e.g. how many items you have copied).
3. Paste it into your new tab.
4. Do this for the remaining ranges.
Or maybe use VLOOKUP. If you would like, I have a schedule module I have created that has many different types of range associations. You can also split strings in Excel to divide the ID from the string.
Without VBA, add a blank row at the top and a column on the left populated throughout with say x and then follow the steps detailed here. Then delete the left-hand two columns and filter the remaining one to remove any blanks.
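The reshaping itself is simple to express outside Excel. As a rough sketch only, assuming the sheet has been exported and read into a list of rows (for example with the csv module), the following Python function walks the data column by column and collects every non-empty cell into a single list; the name stack_columns is made up for illustration.

```python
def stack_columns(rows):
    """Return all non-empty cells, column by column, as one list."""
    width = max((len(row) for row in rows), default=0)
    stacked = []
    for col in range(width):
        for row in rows:
            # Rows may be shorter than the widest row or contain blanks.
            if col < len(row) and row[col] not in ("", None):
                stacked.append(row[col])
    return stacked

sheet = [
    ["1abc", "1def", "1ghi"],
    ["2abc", "2def"],  # this row only has two columns of data
]
print(stack_columns(sheet))  # ['1abc', '2abc', '1def', '2def', '1ghi']
```

The resulting list can then be sorted and written back as a single column.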

Automatically Show (Un-hide) Columns in Excel

Is there a method whereby columns in Excel (2003, 2007 and/or 2010) can be automatically shown (un-hidden) when the column to the left contains data?
Something like this:
IF column to the left contains data
THEN show column
+-----+-----+
| C | C | //If column1 contains data
| O | O | //Then reveal/show (unhide) column2
| L | L |
| U | U |
| M | M |
| N | N |
| 1 | 2 |
+-----+-----+
I'm guessing that VB code is required but am unsure as to what this would be.
Further to this, is there a way to automatically show the column going by the date (first day of each month)? This is a little more complicated. For example:
FOR all dates
IF system date = year(month.day1) //If it is the first day of a new month
THEN show column(month) //Then show the corresponding column for that month
ENDIF
ENDFOR
i.e. IF system date = 01/09/2012
THEN show column(September)
Is this possible?
Thank you.
Correct, you need VBA to achieve that. Use the Worksheet_Change event, which fires whenever something changes. Then use one of the various methods to determine whether a column is not empty (http://ewbi.blogs.com/develops/2006/03/determine_if_a_.html, or just google). Alternatively, if this is too slow because it fires almost all the time, you could use the Worksheet_Activate() event (an overview of all Excel events can be found here).
If your column 7 contains data, you can unhide column 8 using MyWorksheet.Columns(8).Hidden = False.
Your second problem can be solved in a similar way. In Worksheet_Activate(), check whether today is the first day of a month using Day(Date) = 1 (I guess it needs to take into account that the user may not be using Excel on that day, so the code should be a little more complex) and show that column using MyWorksheet.Columns(12).Hidden = False (or whichever it is) for December 1st.
All this code assumes that the columns are already there, just hidden.
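The decision logic, separated from the Excel calls, might look like the following Python sketch. It is not VBA and does not touch a workbook; both function names are hypothetical. The month check deliberately tests "is this month reached yet" rather than "is today the 1st", which is one way to handle the case where the workbook is not opened on the first day of a month.

```python
from datetime import date

def columns_to_unhide(column_has_data):
    """Return 1-based indexes of columns whose left neighbour contains data."""
    # Column i (0-based) having data means column i + 1 (0-based),
    # i.e. i + 2 (1-based), should be shown.
    return [i + 2 for i, has_data in enumerate(column_has_data[:-1]) if has_data]

def months_to_show(today):
    """Return the month numbers (1-12) whose columns should be visible on `today`."""
    # Every month up to and including the current one, so a missed
    # 1st of the month is caught the next time the check runs.
    return list(range(1, today.month + 1))

print(columns_to_unhide([True, True, False, False]))  # [2, 3]
print(months_to_show(date(2012, 9, 1)))               # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the workbook, each returned index would then be passed to the Hidden = False call described above.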
