Finding a special row in a data frame in Rcpp (filtering in Rcpp corresponding to filter() in R)

I am very new to Rcpp. Assume we have two data frames, edge and ref. edge has three columns: time, sender, and receiver. ref has three columns: sender, receiver, and teller, where teller holds the row index, running from 1 to nrow(ref). You can see an example below. I want to go through each row of edge and find which row of ref has the same sender and receiver. Say the matching row in ref has index 10. I then create a data frame, dat, with two columns, time and status, and set the corresponding entry of dat$status to one, that is, dat$status[10] <- 1. I wrote the code in R as follows:
cdata <- lapply(2:nrow(edge), function(z) {
  welke <- filter(ref, sender == edge[z, "sender"], receiver == edge[z, "receiver"])$teller
  dat <- matrix(0, nrow = nrow(ref), ncol = 2) %>%
    as.data.frame() %>%
    set_colnames(c("time", "status"))
  dat$status[welke] <- 1
  return(dat)
}) %>%
  dplyr::bind_rows()
I do not know how to translate this into Rcpp.
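The matching step described above is language independent, so here is a minimal sketch of the logic in Python rather than Rcpp. It is not a translation of the R code: the toy rows are hypothetical, and it assumes each (sender, receiver) pair occurs at most once in ref and that teller equals the 1-based row position. The idea is to build a lookup table once instead of filtering ref for every row of edge; the same idea could be carried over to a hash map in C++.

```python
# Hypothetical toy data; column names follow the question.
ref = [
    {"sender": 1, "receiver": 2, "teller": 1},
    {"sender": 1, "receiver": 3, "teller": 2},
    {"sender": 2, "receiver": 3, "teller": 3},
]
edge = [
    {"time": 0.5, "sender": 2, "receiver": 3},
    {"time": 1.2, "sender": 1, "receiver": 2},
]


def build_index(ref_rows):
    """Map each (sender, receiver) pair to its 1-based teller."""
    return {(r["sender"], r["receiver"]): r["teller"] for r in ref_rows}


def status_vectors(edge_rows, ref_rows):
    """For every edge row, return a 0/1 list with a 1 at the matching ref row."""
    index = build_index(ref_rows)
    result = []
    for e in edge_rows:
        status = [0] * len(ref_rows)
        teller = index.get((e["sender"], e["receiver"]))
        if teller is not None:        # leave all zeros when nothing matches
            status[teller - 1] = 1    # teller is 1-based, lists are 0-based
        result.append(status)
    return result


vectors = status_vectors(edge, ref)
print(vectors)
```

Each inner list plays the role of the status column of one dat block in the R version.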

Related

Sliding window over a string using python

I am working on a dataset as part of my course practice and am stuck on a particular step. I have tried this in R, but I want to do the same in Python. I am comparatively new to Python, so I need help.
The dataset has a column named 'Seq' with 5000+ records. Another column, 'MainSeq', contains the full sequence in which each Seq value occurs as a substring. I need to check that Seq is present in MainSeq at the given start position and then print the 7 letters before and after each letter of Seq. For example:
I have a value in column 'MainSeq' of 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.
Column 'Seq' contains the value JKLMNO.
The start position of J is 10 and the position of O is 15.
I need to create a new column that takes the 7 letters before and after each letter from J to O, giving a total length of 15:
CDEFGHI**J**KLMNOPQ
DEFGHIJ**K**LMNOPQR
EFGHIJK**L**MNOPQRS
FGHIJKL**M**NOPQRST
GHIJKLM**N**OPQRSTU
HIJKLMN**O**PQRSTUV
I know how to apply the logic to a specific seq. But since I have 5000+ seq records, I need a way to apply the same logic to all of them.
seq = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
i = seq.index('J')
j = seq.index('O')
value = 7
for mid in range(i, 1+j):
    print(seq[mid-value:mid+value+1])
I'm not sure this will do exactly what you want, since you haven't supplied much data to test with, but it might work or at least give you a start.
import pandas as pd
df = pd.DataFrame({'MainSeq':['ABCDEFGHIJKLMNOPQRSTUVWXYZ','ABCDEFGHIJKLMNOPQRSTUVWXYZ'], 'Seq':'JKLMNO'})
def get_sequences(seq, letters, value):
    sequences = [seq[seq.index(letter)-value:seq.index(letter)+value+1] for letter in letters]
    return sequences
df['new_seq'] = df.apply(lambda row : get_sequences(row['MainSeq'], row['Seq'], 7), axis = 1)
df = df.explode('new_seq')
print(df)
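One caveat with looking up each letter through seq.index(letter) is that it always finds the first occurrence, so sequences with repeated letters would give misleading windows. A minimal sketch that locates the whole substring once and then walks its positions is shown below. The function name flanking_windows and its flank parameter are illustrative and not part of the original answer.

```python
def flanking_windows(main_seq, sub_seq, flank=7):
    """Return one window per letter of sub_seq, with `flank` letters on each side."""
    start = main_seq.find(sub_seq)
    if start == -1:
        return []                     # sub_seq does not occur in main_seq
    windows = []
    for mid in range(start, start + len(sub_seq)):
        # Clamp at 0: a negative slice start would wrap around to the end.
        left = max(0, mid - flank)
        windows.append(main_seq[left:mid + flank + 1])
    return windows


windows = flanking_windows("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "JKLMNO")
for window in windows:
    print(window)
```

To apply it to every record, something like df.apply(lambda row: flanking_windows(row['MainSeq'], row['Seq']), axis=1) should work in the same way as in the answer above.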

Splitting the output obtained by Counter in Python and pushing it to Excel

I am using Counter to count every word in the descriptions of 20000 products and see how many times each word repeats; for example, 'pipette' repeats 1282 times. To do this I have split column A into several columns: P, Q, R, S, T, U, and V.
df["P"] = df["A"].str.split(n=10).str[0]
df["Q"] = df["A"].str.split(n=10).str[1]
df["R"] = df["A"].str.split(n=10).str[2]
df["S"] = df["A"].str.split(n=10).str[3]
df["T"] = df["A"].str.split(n=10).str[4]
df["U"] = df["A"].str.split(n=10).str[5]
df["V"] = df["A"].str.split(n=10).str[6]
This shows the split products.
I then count each of the columns individually and add the counts to get the total number of words.
d = Counter(df['P'])
e = Counter(df['Q'])
f = Counter(df['R'])
g = Counter(df['S'])
h = Counter(df['T'])
i = Counter(df['U'])
j = Counter(df['V'])
m = d+e+f+g+h+i+j
print(m)
This is the image of the output I obtained using Counter.
Now I want to transfer the output into an Excel sheet with the keys in one column and the values in another.
Am I using the right method to do so? If yes, how should I push them into different columns?
Note: the length of each key is different.
I also want to convert all the items of column 'A' to lower case so that the counter does not count the same item twice. How should I go about it?
I've been learning Python for just a couple of months, but I'll give it a shot. I'm sure there are better ways to do this. Maybe we can both learn something from this question. Let me know how it turns out. Good luck.
import pandas as pd
num = len(m.keys())
df = pd.DataFrame(columns=['Key', 'Value'])
for i,j,k in zip(range(num), m.keys(), m.values()):
    df.loc[i] = [j, k]
df.to_csv('Your_Project.csv')
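As a possible alternative to splitting into fixed columns, the words can be counted directly from the description strings, which also covers the lower-casing part of the question. The sketch below uses only the standard library and made-up sample descriptions. It writes CSV text (which Excel can open) into an in-memory buffer; writing to a real file would just replace the buffer with open(path, 'w', newline='').

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for column 'A'.
descriptions = ["Pipette tip box", "pipette stand", "Glass beaker"]


def count_words(texts):
    """Count every whitespace-separated word, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text with one Key column and one Value column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Key", "Value"])
    writer.writerows(counts.most_common())   # most frequent words first
    return buffer.getvalue()


word_counts = count_words(descriptions)
csv_text = counts_to_csv(word_counts)
print(csv_text)
```

Because every description is split in full, this does not depend on how many words a description has.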

Find distinct values for each column in an RDD in PySpark

I have an RDD that is both very long (a few billion rows) and decently wide (a few hundred columns). I want to create sets of the unique values in each column (these sets don't need to be parallelized, as they will contain no more than 500 unique values per column).
Here is what I have so far:
data = sc.parallelize([["a", "one", "x"], ["b", "one", "y"], ["a", "two", "x"], ["c", "two", "x"]])
num_columns = len(data.first())
empty_sets = [set() for index in xrange(num_columns)]
d2 = data.aggregate((empty_sets), (lambda a, b: a.add(b)), (lambda x, y: x.union(y)))
What I am doing here is trying to initiate a list of empty sets, one for each column in my RDD. For the first part of the aggregation, I want to iterate row by row through data, adding the value in column n to the nth set in my list of sets. If the value already exists, it doesn't do anything. Then, it performs the union of the sets afterwards so only distinct values are returned across all partitions.
When I try to run this code, I get the following error:
AttributeError: 'list' object has no attribute 'add'
I believe the issue is that I am not accurately making it clear that I am iterating through the list of sets (empty_sets) and that I am iterating through the columns of each row in data. I believe in (lambda a, b: a.add(b)) that a is empty_sets and b is data.first() (the entire row, not a single value). This obviously doesn't work, and isn't my intended aggregation.
How can I iterate through my list of sets, and through each row of my dataframe, to add each value to its corresponding set object?
The desired output would look like:
[set(['a', 'b', 'c']), set(['one', 'two']), set(['x', 'y'])]
P.S. I've looked at this example here, which is extremely similar to my use case (it's where I got the idea to use aggregate in the first place). However, I find the code very difficult to convert into PySpark, and I'm very unclear what the case and zip code are doing.
There are two problems. One, your combiner functions assume each row is a single set, but you're operating on a list of sets. Two, add doesn't return anything (try a = set(); b = a.add('1'); print b), so your first combiner function returns a list of Nones. To fix this, make your first combiner function non-anonymous and have both of them loop over the lists of sets:
def set_plus_row(sets, row):
    for i in range(len(sets)):
        sets[i].add(row[i])
    return sets
unique_values_per_column = data.aggregate(
    empty_sets,
    set_plus_row, # can't be lambda b/c add doesn't return anything
    lambda x, y: [a.union(b) for a, b in zip(x, y)]
)
I'm not sure what zip does in Scala, but in Python, it takes two lists and puts each corresponding element together into tuples (try x = [1, 2, 3]; y = ['a', 'b', 'c']; print zip(x, y);) so you can loop over two lists simultaneously.
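To see how the two functions cooperate without a Spark cluster, here is a small plain-Python sketch that imitates the two phases of aggregate: the first function folds the rows inside each partition, and the second merges the per-partition results. The helper local_aggregate and the hand-made partitions are illustrative stand-ins and not Spark APIs.

```python
from functools import reduce


def set_plus_row(sets, row):
    """Per-partition step: add each value of the row to its column's set."""
    for column_set, value in zip(sets, row):
        column_set.add(value)
    return sets


def merge_sets(left, right):
    """Cross-partition step: union the sets column by column."""
    return [a | b for a, b in zip(left, right)]


def local_aggregate(partitions, num_columns, seq_op, comb_op):
    """Imitate aggregate: fold each partition, then merge the partial results."""
    partials = [
        reduce(seq_op, rows, [set() for _ in range(num_columns)])
        for rows in partitions
    ]
    return reduce(comb_op, partials, [set() for _ in range(num_columns)])


# The four rows from the question, split by hand into two pretend partitions.
partitions = [
    [["a", "one", "x"], ["b", "one", "y"]],
    [["a", "two", "x"], ["c", "two", "x"]],
]
unique_values = local_aggregate(partitions, 3, set_plus_row, merge_sets)
print(unique_values)
```

Each partition starts from its own fresh list of empty sets here, which mirrors the fact that the zero value is used once per partition.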

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id   event   timestamp
-------------------------------
1    A       1
1    B       2
1    C       4
1    D       7
1    E       15
1    F       16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key-value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in my anonymous function object, PairFunction that I pass to mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this were not Spark, I could simply create an array timeDifferences, store the differences between adjacent timestamps in it, and split the array into parts wherever a difference in timeDifferences is larger than TIMEOUT. (Although there's actually no need to explicitly create an array.)
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x=>(x._2,x._1))
val B = A.map(x=>(x._1-1,x._2))
val C = A.leftOuterJoin(B).map(x=>(x._2._1,x._2._1._3 - (x._2._2 match{
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x=>(x._2 <= 5))
val group2 = C.filter(x=>(x._2 > 5))
So the concept is: you zip with index to create val A (which assigns a serial Long number to each entry of your RDD), and duplicate the RDD with each index shifted by one to create val B (by subtracting 1 from the index), then use a join to work out the time difference between consecutive entries. Then use filter. This method uses RDDs. An easier way is to collect the entries on the master and use map or a zipped mapping, but that would be plain Scala rather than Spark, I guess.
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})
  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })
  // collecting (to driver memory!) cutoff points - timestamp of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
  // going back to original input, grouping by each event's nearest cutoffPoint (i.e. beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer - joining the input with itself with 1's offset to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.
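For comparison, the non-distributed version of the idea mentioned in the question (look at the gap between adjacent timestamps and start a new group whenever it exceeds the timeout) can be sketched in a few lines of plain Python. This is only an illustration of the splitting rule on the sample data and not a Spark solution; the function name split_sessions is made up.

```python
def split_sessions(events, timeout):
    """Group (event, timestamp) pairs; a gap larger than `timeout` starts a new group."""
    sessions = []
    previous_ts = None
    for event in sorted(events, key=lambda e: e[1]):
        # Open a new session for the first event or after a long silence.
        if previous_ts is None or event[1] - previous_ts > timeout:
            sessions.append([])
        sessions[-1].append(event)
        previous_ts = event[1]
    return sessions


events = [("A", 1), ("B", 2), ("C", 4), ("D", 7), ("E", 15), ("F", 16)]
sessions = split_sessions(events, timeout=5)
for number, session in enumerate(sessions, start=1):
    print(f"id1_{number}: {session}")
```

The joins in the Spark answers above exist because an RDD has no notion of a previous element, which this loop gets for free.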

Sorting Excel rows alphabetically in F# (Office.Interop)

I am using the Excel interop in Visual Studio 2010 to try to sort all of these rows of data alphabetically. Some are already in alphabetical order.
Accountancy Graduate, Trainees Banking, Insurance, Finance
Accountancy Graduate, Trainees Customer Services
Accountancy Graduate, Trainees Education
Accountancy Graduate, Trainees Health, Nursing
Accountancy Graduate, Trainees Legal
Accountancy Graduate, Trainees Management Consultancy
Accountancy Graduate, Trainees Media, New Media, Creative
Accountancy Graduate, Trainees Oil, Gas, Alternative Energy
Accountancy Graduate, Trainees Public Sector & Services
Accountancy Graduate, Trainees Recruitment Sales
Accountancy Graduate, Trainees Secretarial, PAs, Administration
Accountancy Graduate, Trainees Telecommunications
Accountancy Graduate, Trainees Transport, Logistics
The current version of my code is as follows (I'm getting my code to work in interactive before putting it into an fs file).
#r "office.dll"
#r "Microsoft.Office.Interop.Excel.dll"
open System;;
open System.IO;;
open Microsoft.Office.Interop.Excel;;
let app = new ApplicationClass(Visible = true)
let inputBook = app.Workbooks.Open @"C:\Users\simon.hayward\Dropbox\F# Scripts\TotalJobsSort\SortData.xlsx" //work
//let inputBook = app.Workbooks.Open @"C:\Users\Simon Hayward\Dropbox\F# Scripts\TotalJobsSort\SortData.xlsx" //home
let outputBook = app.Workbooks.Add()
let inSheet = inputBook.Worksheets.[1] :?> _Worksheet
let outSheet = outputBook.Worksheets.[1] :?> _Worksheet
let rows = inSheet.UsedRange.Rows.Count;;
let toSeq (range : Range) =
    seq {
        for r in 1 .. range.Rows.Count do
            for c in 1 .. range.Columns.Count do
                let cell = range.Item(r, c) :?> Range
                yield cell
    }
for i in 1 .. rows do
    let mutable row = inSheet.Cells.Rows.[i] :?> Range
    row |> toSeq |> Seq.map (fun x -> x.Value2.ToString()) |> Seq.sort |>
    (outSheet.Cells.Rows.[i] :?> Range).Value2 <- row.Value2;;
app.Quit();;
But there is a problem with types. The final line before the quit command
(outSheet.Cells.Rows.[i] :?> Range).Value2 <- row.Value2;;
Is red underlined by intellisense and the error I get is
"This expression is expected to have type seq -> 'a but here has type unit".
I get what VS is trying to tell me, but I have made several attempts to fix this now and I can't seem to get around the type issue.
Can anyone please advise how I can get the pipeline to the correct type so that the output will write to my output sheet?
EDIT 1: This is the full error message that I get with the sorted variable commented out as follows
let sorted = row |> toSeq //|> Seq.map (fun x -> x.Value2.ToString()) |> Seq.sort
The error message is:-
System.Runtime.InteropServices.COMException (0x800A03EC): Exception from HRESULT: 0x800A03EC
at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
at Microsoft.Office.Interop.Excel.Range.get_Item(Object RowIndex, Object ColumnIndex)
at FSI_0122.toSeq@34-47.Invoke(Int32 c) in C:\Users\Simon Hayward\Dropbox\F# Scripts\TotalJobsSort\sortExcelScript.fsx:line 36
at Microsoft.FSharp.Collections.IEnumerator.map@109.DoMoveNext(b& )
at Microsoft.FSharp.Collections.IEnumerator.MapEnumerator1.System-Collections-IEnumerator-MoveNext()
at Microsoft.FSharp.Core.CompilerServices.RuntimeHelpers.takeOuter@651[T,TResult](ConcatEnumerator2 x, Unit unitVar0)
at Microsoft.FSharp.Core.CompilerServices.RuntimeHelpers.takeInner@644[T,TResult](ConcatEnumerator2 x, Unit unitVar0)
at <StartupCode$FSharp-Core>.$Seq.MoveNextImpl@751.GenerateNext(IEnumerable1& next)
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase1.MoveNextImpl()
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase1.System-Collections-IEnumerator-MoveNext()
at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable1 source)
at Microsoft.FSharp.Collections.ArrayModule.OfSeq[T](IEnumerable1 source)
at .$FSI_0122.main@() in C:\Users\Simon Hayward\Dropbox\F# Scripts\TotalJobsSort\sortExcelScript.fsx:line 42
Stopped due to error
EDIT 2: Could this problem be due to the toSeq function being designed to turn a whole sheet into a sequence? Where I apply it I only want it to apply to one row.
I have tried limiting the r variable in toSeq to 1, but this didn't help.
Does the fact that my actual data is a jagged array matter? It does not always have 3 entries in each row, it varies between 1 and 4.
EDIT 3:
Here is the current iteration of my code, based on Tomas' suggestions
#r "office.dll"
#r "Microsoft.Office.Interop.Excel.dll"
open System;;
open System.IO;;
open Microsoft.Office.Interop.Excel;;
let app = new ApplicationClass(Visible = true);;
let inputBook = app.Workbooks.Open @"SortData.xlsx" //workbook
let outputBook = app.Workbooks.Add();;
let inSheet = inputBook.Worksheets.[1] :?> _Worksheet
let outSheet = outputBook.Worksheets.[1] :?> _Worksheet
let rows = inSheet.UsedRange.Rows.Count;;
let columns = inSheet.UsedRange.Columns.Count;;
// Get the row count and calculate the name of the last cell e.g. "A13"
let rangeEnd = sprintf "A%d" columns
// Get values in the range A1:A13 as 2D object array of size 13x1
let values = inSheet.Range("A1", rangeEnd).Value2 :?> obj[,]
// Read values from the first (and only) column into 1D string array
let data = [| for i in 1 .. columns -> values.[1, i] :?> string |]
// Sort the array and get a new sorted 1D array
let sorted1D = data |> Array.sort
// Turn the 1D array into 2D array (13x1), so that we can write it back
let sorted2D = Array2D.init 1 columns (fun i _ -> data.[i])
// Write the data to the output sheet in Excel
outSheet.Range("A1", rangeEnd).Value2 <- sorted2D
But because the actual data has a variable number of entries in each row I am getting the standard range exception error (this is an improvement on the HRESULT exception errors of the last few days at least).
So I need to define columns for each individual row, or just bind the length of the row to a variable in the for loop. (I would guess).
It looks like you have an additional |> operator at the end of the line with Seq.sort - this means that the list is sorted and then, the compiler tries to pass it to the expression that performs the assignment (which does not take any parameter and has a type unit).
Something like this should compile (though there may be some other runtime issues):
for i in 1 .. rows do
    let row = inSheet.Cells.Rows.[i] :?> Range
    let sorted = row |> toSeq |> Seq.map (fun x -> x.Value2.ToString()) |> Seq.sort
    (outSheet.Cells.Rows.[i] :?> Range).Value2 <- Array.ofSeq sorted
Note that you do not need to mark row as mutable, because the code creates a copy (and - in my version - assigns it to a new variable sorted).
I also use Array.ofSeq to convert the sorted sequence to an array, because I think the Excel interop works better with arrays.
When setting the Value2 property on a range, the size of the range should be the same as the size of the array that you're assigning to it. Also, depending on the range you want to set, you might need a 2D array.
EDIT: Regarding runtime errors, I'm not entirely sure what is wrong with your code, but here is how I would do the sorting (assuming you have just one column with string values and you want to sort the rows):
// Get the row count and calculate the name of the last cell e.g. "A13"
let rows = inSheet.UsedRange.Rows.Count
let rangeEnd = sprintf "A%d" rows
// Get values in the range A1:A13 as 2D object array of size 13x1
let values = inSheet.Range("A1", rangeEnd).Value2 :?> obj[,]
// Read values from the first (and only) column into 1D string array
let data = [| for i in 1 .. rows -> values.[i, 1] :?> string |]
// Sort the array and get a new sorted 1D array
let sorted1D = data |> Array.sort
// Turn the 1D array into 2D array (13x1), so that we can write it back
let sorted2D = Array2D.init rows 1 (fun i _ -> sorted1D.[i])
// Write the data to the output sheet in Excel
outSheet.Range("A1", rangeEnd).Value2 <- sorted2D
