I'm not sure if this is the correct place to ask this, but basically I have a .txt file containing values that came from 2 separate sensors.
Example of some data:
{"t":3838202,"s":0,"n":"x1","v":-1052}
{"t":3838203,"s":0,"n":"y1","v":44}
{"t":3838204,"s":0,"n":"z1","v":-84}
{"t":3838435,"s":0,"n":"x1","v":-1052}
{"t":3838436,"s":0,"n":"y1","v":36}
{"t":3838437,"s":0,"n":"z1","v":-80}
{"t":3838670,"s":0,"n":"x1","v":-1056}
{"t":3838671,"s":0,"n":"y1","v":52}
{"t":3838672,"s":0,"n":"z1","v":-88}
{"t":3838902,"s":0,"n":"x1","v":-1052}
{"t":3838903,"s":0,"n":"y1","v":48}
{"t":3838904,"s":0,"n":"z1","v":-80}
{"t":3839136,"s":0,"n":"x1","v":-1056}
{"t":3839137,"s":0,"n":"y1","v":40}
{"t":3839138,"s":0,"n":"z1","v":-80}
x2:-944
y2:108
z2:-380
{"t":3839841,"s":0,"n":"x1","v":-1052}
{"t":3839842,"s":0,"n":"y1","v":44}
{"t":3839843,"s":0,"n":"z1","v":-80}
x2:-948
y2:100
z2:-380
{"t":3840541,"s":0,"n":"x1","v":-1052}
{"t":3840542,"s":0,"n":"y1","v":40}
{"t":3840543,"s":0,"n":"z1","v":-84}
{"t":3840774,"s":0,"n":"x1","v":-1052}
{"t":3840775,"s":0,"n":"y1","v":40}
{"t":3840776,"s":0,"n":"z1","v":-84}
x2:-948
y2:108
z2:-368
I'm trying to get the data into excel, so that for each "chunk" of data in the x1y1z1 section, I take the last set of recorded data and discard the rest and "pair" it with the next set of x2y2z2 data. I don't think I'm explaining it very well, but I basically want to take that text file and get this in excel:
+---------+-------+----+-----+------+-----+------+
| t | x1 | y1 | z1 | x2 | y2 | z2 |
+---------+-------+----+-----+------+-----+------+
| 3839138 | -1056 | 40 | -80 | -944 | 100 | -380 |
| 3839843 | -1052 | 44 | -80 | -948 | 100 | -380 |
| 3840776 | -1052 | 40 | -84 | -948 | 108 | -368 |
+---------+-------+----+-----+------+-----+------+
I'm really stuck as to where I should even start
I think like a programmer, so I would approach this problem in steps. If you are not a programmer, this might not be so helpful to you, and I am sorry for that.
First, define the data. How does each line of data get read and understood.
Second, write a parsing utility. A piece of code which interprets the data as it is read in and stores it in the form you want for your output
Third, import data into Excel.
So, based on the limited data you provided, I am not sure how you are able to determine the x1,y1,z1,x2,y2,z2 for each t, but I assume that the values enclosed in curly braces have something to do with that based on the values for s, n, and v I'm seeing in there. So, first of all you need to clearly determine the way you read the data. Take it one line at a time, and determine how you would build your output table based on each line of data. I assume you would treat the lines enclosed in curly braces differently from the lines with standalone x/y/z values for example.
I hope this points you in the right direction.
I am printing a "Table" to the console. I will be using this same table structure for several different variables. However as you can see from Output below, the lines don't all align.
One way to resolve it would be to increase the number of decimal places (e.g. 6.730000 for Standard Deviation) which would push the line into place.
However, I do not want this many decimal places.
Is it possible to add extra 0s to the end of a number, and make these invisible?
I am planning on using this table structure for several variables, and the length of Mean, Stddev, and Median will likely never be more than 6 characters.
EDIT - I would really like to ensure that each value which appears in the table will be 6 characters long, and if it is not 6 characters long, add additional "invisible" zeros.
Input
# Create and structure Table to store descriptive statistics for each variable.
subtitle = "| Mean | Stddev | Median |"
structure = '| {:0.2f} | {:0.2f} | {:0.2f} |'
lines = '=' * len(subtitle)
# Print table.
print(lines)
print(subtitle)
print(lines)
print(structure.format(mean, std, median))
print(lines)
Output:
======================================
| Mean | Stddev | Median |
======================================
| 181.26 | 6.73 | 180.34 |
======================================
Didn't really figure this out - but found a workaround.
I just did the following:
"| {:^6} | {:^6} | {:^6} | {:^6} | {:^6} |"
This keeps the width between | consistent.
I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when then next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more thanone row has the samevisit_numandevent_num`. I've tried using the windowing function as below, but due to this assigning the same rank to two identical columns, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.
I have some returns data on 1000+ firms that I want to convert into panel form.
From my understanding, it is neither truly wide nor long form (at least from the examples I've seen).
I have attached an example of the original data set and what I want it to look like. Is there a way to achieve this? I am intermediate with Excel/VBA, and new to SAS/Stata but can use them and self-teach myself.
Consider this example using reshape in Stata:
clear *
input float(date FIRM_A FIRM_B FIRM_C FIRM_D)
1 .14304407 .8583148 .3699433 .7310092
2 .34405795 .9531917 .6376472 .2895169
3 .04766626 .6588161 .6988417 .5564945
4 .21615694 .18380463 .4781089 .3058527
5 .709911 .85116 .14080866 .10687433
6 .3805699 .070911616 .55129284 .8039169
7 .1680727 .7267236 .1779183 .51454383
8 .3610604 .1578059 .15383714 .9001798
9 .7081585 .9755411 .28951603 .20034006
10 .27780765 .8351805 .04982195 .3929535
end
reshape long FIRM_, i(date) j(Firm_ID) string
rename FIRM_ return
replace Firm_ID = "Firm " + Firm_ID
list in 1/8, sepby(date)
+---------------------------+
| date Firm_ID return |
|---------------------------|
1. | 1 Firm A .1430441 |
2. | 1 Firm B .8583148 |
3. | 1 Firm C .3699433 |
4. | 1 Firm D .7310092 |
|---------------------------|
5. | 2 Firm A .3440579 |
6. | 2 Firm B .9531917 |
7. | 2 Firm C .6376472 |
8. | 2 Firm D .2895169 |
+---------------------------+
see help reshape for more on the topic.
This can be done very easily with proc transpose in SAS. All you will need to add is a column name for column A. This will be your by variable so that the following variables will be transposed along each specific date. Other than that just make sure your data is sorted by the date column. The code would look similar to this:
proc sort data=have;
by date;
run;
proc transpose data=have out=want; /* you could add a name= or prefix= statement here to rename your variables */
by date;
run;
I have used two RFID readers (different vendors), which provide two different identifiers for the very same RFID tag:
Reader A shows 5BFA0746 (decimal 1543112518)
Reader B shows 4607FA5B (decimal 1174927963)
Can you explain why?
There are no similarities with the last bytes nor prefixes.
I'm not sure that I completely understand your question, but the two values are identical except for their byte-order. Hence, both readers do read the same value (possibly an ISO/IEC 14443-3 UID/anti-collsion identifier?). They just present them in reversed byte order:
+--------+--------+--------+--------+
Reader A: | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
| 5B | FA | 07 | 46 |
+--------+--------+--------+--------+
Reader B: | Byte 3 | Byte 2 | Byte 1 | Byte 0 |
| 46 | 07 | FA | 5B |
+--------+--------+--------+--------+
I can think of two reasons this may be happening:
1) CRC or checksum calculations at the start vs. end of the tag ID (vendors may implement this differently) but it sounds like you've already investigated that
2) The readers are configured to read different areas of the tag. For example, are you sure you are getting the tag ID in both cases? For passive UHF RFID Tags, you might be configured to read the TID Serial # vs. EPC. For HF MiFare readers, perhaps you are reading a data bank on one reader and the ID in the other. This is a long way of saying, are you sure both of your readers are actually configured the same way?