I have a PySpark DataFrame like the one pictured below. I would like to group every 2 consecutive rows in a sliding fashion, so that:
the first output row combines that user's rows 1 and 2, and
the second output row combines rows 2 and 3, etc.
Something like this:
| CustomerID | previous_stockcodes | stock_codes |
Prices and quantities are not used; the previous basket and the current basket are combined into one row. For example, the first row for CustomerID 12347 would be:
| 12347 | [85116, 22375, 71...] | [84625A, 84625C, ...] |
I have written loops to do this, but that is really inefficient and slow. I suspect this can be done efficiently in PySpark, but I am having trouble figuring it out. Thanks a lot in advance.
You could get the next row by using the lead function provided by Spark SQL.
lead is a window function.
Syntax: lead(column_name, int_value, default_value) over (partition by column_name order by column_name)
int_value is the number of rows to look ahead from the current row.
default_value is what is returned when no leading row exists.
>>> input_df.show()
+----------+---------+----------------+
|customerID|invoiceNo|  stockCode_list|
+----------+---------+----------------+
|     12347|   537626|  [85116, 22375]|
|     12347|   542237|[84625A, 84625C]|
|     12347|   549222|  [22376, 22374]|
|     12347|   556201|  [23084, 23162]|
|     12348|   539318|  [84992, 22951]|
|     12348|   541998|  [21980, 21985]|
|     12348|   548955|  [23077, 23078]|
+----------+---------+----------------+
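For reproducibility, the sample input could be built like this (a sketch, assuming an active SparkSession named spark):
# A sketch of the sample data; assumes an active SparkSession named `spark`.
input_df = spark.createDataFrame(
    [
        (12347, 537626, ["85116", "22375"]),
        (12347, 542237, ["84625A", "84625C"]),
        (12347, 549222, ["22376", "22374"]),
        (12347, 556201, ["23084", "23162"]),
        (12348, 539318, ["84992", "22951"]),
        (12348, 541998, ["21980", "21985"]),
        (12348, 548955, ["23077", "23078"]),
    ],
    ["customerID", "invoiceNo", "stockCode_list"],
)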
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import lead,col
>>> win_func = Window.partitionBy("customerID").orderBy("invoiceNo")
>>> new_col = lead("stockCode_list",1,None).over(win_func)
>>> req_df = input_df.select(col("customerID"),col("invoiceNo"),col("stockCode_list"),new_col.alias("req_col"))
>>> req_df.orderBy("customerID","invoiceNo").show()
+----------+---------+----------------+----------------+
|customerID|invoiceNo|  stockCode_list|         req_col|
+----------+---------+----------------+----------------+
|     12347|   537626|  [85116, 22375]|[84625A, 84625C]|
|     12347|   542237|[84625A, 84625C]|  [22376, 22374]|
|     12347|   549222|  [22376, 22374]|  [23084, 23162]|
|     12347|   556201|  [23084, 23162]|            null|
|     12348|   539318|  [84992, 22951]|  [21980, 21985]|
|     12348|   541998|  [21980, 21985]|  [23077, 23078]|
|     12348|   548955|  [23077, 23078]|            null|
+----------+---------+----------------+----------------+
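To match the layout asked for (previous basket next to current basket), a lag-based variant could look like this; a sketch building on the win_func above, dropping rows that have no previous invoice:
# Sketch: lag(...) gives the previous invoice's basket within each customer window.
from pyspark.sql.functions import lag

prev_col = lag("stockCode_list", 1).over(win_func)
baskets_df = (input_df
    .withColumn("previous_stockcodes", prev_col)
    .where("previous_stockcodes is not null")
    .selectExpr("customerID", "previous_stockcodes", "stockCode_list as stock_codes"))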
My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is person_id. Events and Activities both have a time index along with event_id and activity_id, each of which has a few features.
Members of the People entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1, 2, 3], how do I pass cutoff times that create deep features for (person, cutoff_time) pairs like [1, January 2], [1, January 3]?
If I have only 3 people, it seems I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include a time index in the People entity? That would leave my parent entity with multiple rows per person in the index, although each would have a different time index. My instinct is that the People entity should not include any datetime column. I would like to pass cutoff times to the DFS function instead.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
   person_id        time  label
0  f_GZSVLYU  2019-12-06    0.0
1  f_ATBJEQS  2019-12-06    1.0
2  f_GLFYVAY  2019-12-06    0.5
3  f_DIHPTPA  2019-12-06    0.5
4  f_GZSVLYU  2019-12-02    1.0
The Parent People Entity is like this:
   person_id
0  f_GZSVLYU
1  f_ATBJEQS
2  f_GLFYVAY
3  f_DIHPTPA
4  f_DVOYHRQ
How can I make featuretools understand what I'm trying to do?
Regarding 'Duplicated rows in cutoff time dataframe': I have explored my cutoff_times df and there are no duplicate rows. person_id, time, and label values each occur multiple times, but no two rows are identical. Could the duplicates the error refers to be somewhere else in the EntitySet?
The answer: two rows of the cutoff_df shared the same ID and time but had different labels. Featuretools treats that as a duplicate, and that was the problem.
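A quick way to surface such rows is to check for duplicates on the id/time pair only, ignoring the label; a minimal sketch, assuming the columns are named person_id and time:
# Rows sharing person_id AND time (regardless of label) count as duplicates.
dupes = cutoff_times[cutoff_times.duplicated(subset=['person_id', 'time'], keep=False)]
print(dupes)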
I have a data set (Excel) with hundreds of entries. Most of the information sits in one string column. The information is separated by '_' and was typed in by humans, so it is not possible to rely on fixed index positions.
To create a usable data basis, I need to extract information from this column into another column.
The search pattern '*v*' alone is not enough, but combined with the condition that the first character has to be a digit, it works.
I tried to get it to work with iterrows, iteritems, str.strip, str.extract and many more, but the best solution I came up with uses a for loop.
pattern = '_*v*_'  # the intended search pattern; not actually used below
test = []
for i in df['col']:
    # Split the string into substrings
    for c in i.split('_'):
        if 'v' in c:
            if c[0].isdigit():
                # print(c)
                test.append(c)
            else:
                # To be able to fix a few rows manually
                test.append(0)
test = ['22v3', '33v55', '4v2']
#Input
+-----------+-----------+
| col       | targetcol |
+-----------+-----------+
| as_22v3   |           |
| 33v55_bdd |           |
| Ave_4v2   |           |
+-----------+-----------+
#Output
+-----------+-----------+
| col       | targetcol |
+-----------+-----------+
| as_22v3   | 22v3      |
| 33v55_bdd | 33v55     |
| Ave_4v2   | 4v2       |
+-----------+-----------+
My code does work, but only for the first few rows. It stops after 36 values and I can't figure out why. There is no error message, apart from the fact that the list cannot be assigned to a DataFrame column because it does not have the same length.
pandas.Series.str.extract should help:
>>> df['col'].str.extract(r'(\d+v+\d+)')
       0
0   22v3
1  33v55
2    4v2
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2']
})
df['targetcol'] = df['col'].str.extract(r'(\d+v+\d+)')
EDIT
df = pd.DataFrame({
'col': ['as_22v3', '33v55_bdd', 'Ave_4v2', '_22 v3', 'space 2,2v3', '2.v3',
'2.111v999', 'asd.123v77', '1 v7', '123 v 8135']
})
pattern = r'(\d+(\,[0-9]+)?(\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
           col   result
0      as_22v3     22v3
1    33v55_bdd    33v55
2      Ave_4v2      4v2
3       _22 v3    22 v3
4  space 2,2v3    2,2v3
5         2.v3      NaN
6    2.111v999  111v999
7   asd.123v77   123v77
8         1 v7     1 v7
9   123 v 8135      NaN
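A side note on the [0] indexing above: str.extract returns one column per capture group, and the pattern has three groups. If the inner groups are made non-capturing, you can get a Series directly; a sketch of that variant:
# (?:...) makes the inner groups non-capturing; expand=False then returns a Series.
pattern = r'(\d+(?:,[0-9]+)?(?:\s+)?v\d+)'
df['result'] = df['col'].str.extract(pattern, expand=False)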
You say it stops after 36 values, and that it is an Excel file you are processing? One thing you could try is to save the data set to a .csv file and read that file in with the pd.read_csv function. Excel files sometimes contain extra characters that are not easily visible.
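A minimal sketch of that round trip (data.csv is a hypothetical name for the exported sheet):
import pandas as pd

# 'data.csv' is a hypothetical name for the sheet exported from Excel.
df = pd.read_csv('data.csv')
print(len(df))  # verify that all rows made it through the import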
After looking for answers and trying everything, I could not figure out a way out, so here it goes.
I have a list of *.txt files that I want to merge by column. I am 100% sure that they have the same structure, as follows
File1
date | time | model_name1
1850-01-16 | 12:00:00 | 0.10
File2
date | time | model_name2
1850-01-16 | 12:00:00 | 0.50
File3..... and so on
Note: the vertical bars are just for clarity here.
Now my output should look like this:
Output
date | time | model_name1 | model_name2
1850-01-16 | 12:00:00 | 0.10 | 0.50
With the following piece of code
out_list4 = os.listdir(out_directory)
df_list = [pd.read_table(out_path + os.fsdecode(file_x), sep=r'\s+') for file_x in out_list4]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['date'], how='outer'), df_list)
pd.DataFrame.to_csv(df_merged, out_path + 'merged.txt', sep='\t', index=False)
I manage the following output:
Output
date | time_x | model_name1 |time_y | model_name2
1850-01-16 | 12:00:00 | 0.10 |12:00:00| 0.50
As expected, since I only merge on the key on=['date'].
Now if I add time as a second key, on=['date','time'], it crashes with the following error:
KeyError: 'time'
and a long list of tracebacks.
I tried passing left_on/right_on in case 'date' was being handled as an index. No use. I know the problem does not lie in the files; the structure is right, it is the code. Any help will be much appreciated. And sorry for the readability.
So, the problem was earlier in the code. I had defined out_list4 as a list beforehand:
out_list4 = list()
and it was making a mess at the end. Each element in the list should have shape 1872 x 3, but the old contents were being added all together again, making the last entry 1872 x 12 with no 'time' header.
Changing the definition of out_list4 to:
out_list4 = []
did the trick. The tip came from "Combine a list of pandas dataframes to one pandas dataframe".
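For reference, a sketch of the full fixed pipeline merging on both keys (assuming out_directory and out_path are defined as before):
import os
from functools import reduce
import pandas as pd

out_list4 = os.listdir(out_directory)  # fresh list each run
df_list = [pd.read_table(out_path + os.fsdecode(f), sep=r'\s+') for f in out_list4]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['date', 'time'], how='outer'), df_list)
df_merged.to_csv(out_path + 'merged.txt', sep='\t', index=False)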
Suppose I have a DataFrame of events with the time difference between consecutive rows. The main rule is that an event counts toward a visit only if it is within 5 minutes of the previous or next event:
+--------+-------------------+--------+
|userid  |eventtime          |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60      |
|37397e29|2017-06-04 03:01:00|60      |
|37397e29|2017-06-04 03:02:00|60      |
|37397e29|2017-06-04 03:03:00|180     |
|37397e29|2017-06-04 03:06:00|60      |
|37397e29|2017-06-04 03:07:00|420     |
|37397e29|2017-06-04 03:14:00|60      |
|37397e29|2017-06-04 03:15:00|1140    |
|37397e29|2017-06-04 03:34:00|540     |
|37397e29|2017-06-04 03:53:00|540     |
+--------+-------------------+--------+
The challenge is to group consecutive events into visits, with start_time and end_time being the earliest and latest eventtime that satisfy the 5-minute condition. The output should look like this table:
+--------+-------------------+--------------------+-----------+
|userid  |start_time         |end_time            |events     |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6          |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2          |
+--------+-------------------+--------------------+-----------+
So far I have used lag window functions and some conditions; however, I do not know where to go from here:
%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
windowSpecDesc = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"].desc())
# The windows are between the current row and following row. e.g: 3:00pm and 3:03pm
nextEventTime = F.lag(col("eventtime"), -1).over(windowSpec)
# The window is between the current row and the preceding row, e.g.: 3:03pm and 3:00pm
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
diffEventTime = nextEventTime - col("eventtime")
nextTimeDiff = F.coalesce((F.unix_timestamp(nextEventTime)
- F.unix_timestamp('eventtime')), F.lit(0))
previousTimeDiff = F.coalesce((F.unix_timestamp('eventtime') -F.unix_timestamp(previousEventTime)), F.lit(0))
# Check if the next POI is equal to the current POI and has a time difference of less than 5 minutes.
validation = F.coalesce(( (nextTimeDiff < 300) | (previousTimeDiff < 300) ), F.lit(False))
# Change True to 1
visitCheck = F.coalesce((validation == True).cast("int"), F.lit(1))
result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")
My questions: Is this a viable approach, and if so, how can I "go forward" and look for the maximum eventtime that fulfills the 5-minute condition? Is it even possible to iterate through the values of a Spark SQL Column, and wouldn't that be too expensive? Is there another way to achieve this result?
Result of the solution suggested by @Aku:
+--------+--------+---------------------+---------------------+------+
|userid  |subgroup|start_time           |end_time             |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0       |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5     |
|37397e29|1       |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2     |
|37397e29|2       |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1     |
|37397e29|3       |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2     |
+--------+--------+---------------------+---------------------+------+
It doesn't give the expected result: 03:07-03:14 and 03:34-03:43 are being counted as ranges within 5 minutes, which shouldn't happen. Also, 03:07 should be the end_time of the first row, as it is within 5 minutes of the previous event at 03:06.
You'll need one extra window function and a groupBy to achieve this.
What we want is for every line with timeDiff greater than 300 to be the end of a group and the start of a new one. Aku's solution should work, except that there the indicators mark the start of a group instead of the end. To change this, you have to do a cumulative sum up to n-1 instead of n (n being your current row):
from pyspark.sql import functions as func
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("eventtime")
# Flag rows that close a group, then shift the boundary by subtracting the flag itself.
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
DF = DF.groupBy("subgroup").agg(
    func.min("eventtime").alias("start_time"),
    func.max("eventtime").alias("end_time"),
    func.count("*").alias("events")
)
+--------+-------------------+-------------------+------+
|subgroup|         start_time|           end_time|events|
+--------+-------------------+-------------------+------+
|       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
|       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
|       2|2017-06-04 03:34:00|2017-06-04 03:34:00|     1|
|       3|2017-06-04 03:53:00|2017-06-04 03:53:00|     1|
+--------+-------------------+-------------------+------+
It seems that you also want to filter out groups with only one event, hence:
DF = DF.filter("events != 1")
+--------+-------------------+-------------------+------+
|subgroup|         start_time|           end_time|events|
+--------+-------------------+-------------------+------+
|       0|2017-06-04 03:00:00|2017-06-04 03:07:00|     6|
|       1|2017-06-04 03:14:00|2017-06-04 03:15:00|     2|
+--------+-------------------+-------------------+------+
So if I understand this correctly, you essentially want to end each group when timeDiff > 300? This seems relatively straightforward with window functions:
First some imports
from pyspark.sql.window import Window
import pyspark.sql.functions as func
Then set up the window; I assumed you would partition by userid:
w = Window.partitionBy("userid").orderBy("eventtime")
Then figure out which subgroup each observation falls into, by first marking the first member of each group and then summing the column:
indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")
Then some aggregation functions and you should be done
DF = DF.select("*", subgroup)\
.groupBy("subgroup")\
.agg(
func.min("eventtime").alias("start_time"),
func.max("eventtime").alias("end_time"),
func.count(func.lit(1)).alias("events")
)
Another approach is to group the dataframe based on your timeline criteria.
You can flag the rows that break the 5-minute timeline.
Those rows become the criteria for grouping the records, and
they set the start time and end time for each group.
Then find the count and the max timestamp (the end time) for each group.
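A compact sketch of this idea (assuming a DataFrame df with userid, eventtime and timeDiff columns); it comes down to the same indicator-plus-running-sum pattern shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("eventtime")
# Rows breaking the 5-minute timeline open a new group id.
visits = (df
    .withColumn("brk", (F.col("timeDiff") > 300).cast("int"))
    .withColumn("group_id", F.sum("brk").over(w) - F.col("brk"))
    .groupBy("userid", "group_id")
    .agg(F.min("eventtime").alias("start_time"),
         F.max("eventtime").alias("end_time"),
         F.count("*").alias("events")))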
I have a working function that takes a list made up of lists and outputs it as a table. I am just missing certain spacing and newlines. I'm pretty new to formatting strings (and Python in general). How do I use the format function to fix my output?
For examples:
>>> show_table([['A','BB'],['C','DD']])
'| A | BB |\n| C | DD |\n'
>>> print(show_table([['A','BB'],['C','DD']]))
| A | BB |
| C | DD |
>>> show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C |\n| 1 | 22 | 3333 |\n'
>>> print(show_table([['A','BBB','C'],['1','22','3333']]))
| A | BBB | C |
| 1 | 22 | 3333 |
What I am actually outputting though:
>>> show_table([['A','BB'],['C','DD']])
'| A | BB | C | DD |\n'
>>> show_table([['A','BBB','C'],['1','22','3333']])
'| A | BBB | C | 1 | 22 | 3333 |\n'
>>> print(show_table([['A','BBB','C'],['1','22','3333']]))
| A | BBB | C | 1 | 22 | 3333 |
I will definitely need to use the format function but I'm not sure how?
This is my current code (my indenting is actually correct but I'm horrible with stackoverflow format):
def show_table(table):
    if table is None:
        table = []
    new_table = ""
    for row in table:
        for val in row:
            new_table += ("| " + val + " ")
    new_table += "|\n"
    return new_table
You do actually have an indentation error in your function: the line
new_table += "|\n"
should be indented further so that it happens at the end of each row, not at the end of the table.
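In other words, a minimal sketch with only that one line moved:
def show_table(table):
    if table is None:
        table = []
    new_table = ""
    for row in table:
        for val in row:
            new_table += ("| " + val + " ")
        new_table += "|\n"  # now runs once per row, not once per table
    return new_table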
Side note: you'll catch this kind of thing more easily if you stick to 4 spaces per indent. This and other conventions are there to help you, and it's a very good idea to learn the discipline of keeping to them early in your progress with Python. PEP 8 is a great resource to familarise yourself with.
The spacing on your "what I need" examples is also rather messed up, which is unfortunate since spacing is the subject of your question, but I gather from this question that you want each column to be properly aligned, e.g.
>>> print(show_table([['10','2','300'],['4000','50','60'],['7','800','90000']]))
| 10   | 2   | 300   |
| 4000 | 50  | 60    |
| 7    | 800 | 90000 |
In order to do that, you'll need to know in advance the maximum width of the items in each column. That's actually a little tricky, because your table is organised into rows rather than columns, but the zip() function can help. Here's an example of what zip() does:
>>> table = [['10', '2', '300'], ['4000', '50', '60'], ['7', '800', '90000']]
>>> from pprint import pprint
>>> pprint(table, width=30)
[['10', '2', '300'],
['4000', '50', '60'],
['7', '800', '90000']]
>>> flipped = list(zip(*table))
>>> pprint(flipped, width=30)
[('10', '4000', '7'),
('2', '50', '800'),
('300', '60', '90000')]
As you can see, zip() turns rows into columns and vice versa. (don't worry too much about the * before table right now; it's a bit advanced to explain for the moment. Just remember that you need it).
You get the length of a string with len():
>>> len('800')
3
You get the maximum of the items in a list with max():
>>> max([2, 4, 1])
4
You can put all these together in a list comprehension, which is like a compact for loop that builds a list:
>>> widths = [max([len(x) for x in col]) for col in zip(*table)]
>>> widths
[4, 3, 5]
If you look carefully, you'll see that there are actually two list comprehensions in that line:
[len(x) for x in col]
makes a list with the lengths of each item x in a list col, and
[max(something) for col in zip(*table)]
makes a list with the maximum value of something for each column in the flipped (with zip) table … where something is the other list comprehension.
That's all kinda complicated the first time you see it, so spend a little while making sure you understand what's going on.
Now that you have your maximum widths for each column, you can use them to format your output. In order to do so, though, you need to keep track of which column you're in, and to do that, you need enumerate(). Here's an example of enumerate() in action:
>>> for i, x in enumerate(['a', 'b', 'c']):
... print("i is", i, "and x is", x)
...
i is 0 and x is a
i is 1 and x is b
i is 2 and x is c
As you can see, iterating over the result of enumerate() gives you two values: the position in the list, and the item itself.
Still with me? Fun, isn't it? Pressing on ...
The only thing left is the actual formatting. Python's str.format() method is pretty powerful, and too complex to explain thoroughly in this answer. One of the things you can use it for is to pad things out to a given width:
>>> "{val:5s}".format(val='x')
'x    '
In the example above, {val:5s} says "insert the value of val here as a string, padded to a width of 5 characters". You can also specify the width as a variable, like this:
>>> "{val:{width}s}".format(val='x', width=3)
'x  '
These are all the pieces you need … and here's a function that uses all those pieces:
def show_table(table):
    if table is None:
        table = []
    new_table = ""
    widths = [max([len(x) for x in c]) for c in zip(*table)]
    for row in table:
        for i, val in enumerate(row):
            new_table += "| {val:{width}s} ".format(val=val, width=widths[i])
        new_table += "|\n"
    return new_table
… and here it is in action:
>>> table = [['10','2','300'],['4000','50','60'],['7','800','90000']]
>>> print(show_table(table))
| 10   | 2   | 300   |
| 4000 | 50  | 60    |
| 7    | 800 | 90000 |
I've covered a fair bit of ground in this answer. Hopefully if you study the final version of show_table() given here in detail (as well as the docs linked to throughout the answer), you'll be able to see how all the pieces described earlier on fit together.