Analyse log data from a CSV file written in one long row - python-3.x

I have a set of a couple hundred sensors whose data is recorded in a log file. Each measurement cycle of all sensors is written as one line of the log file in CSV format. I need to be able to structure the log file so I can do some analysis with plots and calculations of the values.
The format of the CSV is like the following:
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD[3..400]SensorID;ValueA;ValueB;ValueC;ValueD
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;[...]92345670;5.9;60.4;55;-33.6;
So I have a very long table with about 5000 or 6000 columns which contain my values, but I'm not sure of the right way to extract them so I can easily perform some analysis. The table has about 600 rows.
I have written a report function in Python with the help of pandas. The format I can already analyse is like the following:
Time;SensorID;ValueA;ValueB;ValueC;ValueD
11:12:123456;12345678;5.3;53.4;53;-36.6;
11:12:123457;12345679;5.5;55;54;-40;
So the time is slightly different and the sensor ID will be different.
I use groupby(SensorID) and plots of the groups, and after that I perform some value_counts() within some columns.
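For illustration, the analysis on the already-supported format boils down to something like the following minimal sketch (the file name and the choice of columns are placeholders, not from the question):
import pandas as pd

# sketch of the per-sensor analysis on the simple one-sensor-per-line format
df = pd.read_csv('sensors.csv', sep=';')                  # 'sensors.csv' is a placeholder name
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]     # drop the empty column from the trailing ';'
for sensor_id, group in df.groupby('SensorID'):
    group.plot(x='Time', y=['ValueA', 'ValueB', 'ValueC', 'ValueD'], title=str(sensor_id))
print(df.groupby('SensorID')['ValueC'].value_counts())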

If I understand you correctly, each data line in the CSV contains ValueA-D for a set of sensors, sharing the same ID and Time columns. Also, your data line ends with a semicolon, which will throw pandas off.
[...]92345670;5.9;60.4;55;-33.6;
This answer leaves the trailing semicolon in place, since I assume you can't change the process that produces the CSV file.
import pandas as pd
from io import StringIO

string = """
ID;Time;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD;SensorID;ValueA;ValueB;ValueC;ValueD
11234;11:12:123456;12345678;5.3;53.4;53;-36.6;72345670;5.8;57.4;56;-39.6;92345670;5.9;60.4;55;-33.6;
"""
df = pd.read_csv(StringIO(string), sep=';', engine='python')
# Shift the column names. We need this because of the extra semicolon: each data
# line has one more field than the header, so pandas uses the first field (the
# cycle ID) as the index and every column name ends up off by one.
columns = df.columns[1:]
df = df.iloc[:, :-1]   # drop the empty last column created by the trailing ';'
df.columns = columns
df = df.set_index('Time')
# n is how many groups of sensor measurements are stored in each line
n = df.shape[1] // 5
idx = pd.MultiIndex.from_product([range(n), ['SensorID', 'ValueA', 'ValueB', 'ValueC', 'ValueD']])
df.columns = idx   # label the columns as (group number, field name)
# stack the group level into rows and drop the group number
result = df.stack(level=0).droplevel(-1).reset_index()
Result:
Time SensorID ValueA ValueB ValueC ValueD
0 11:12:123456 12345678 5.3 53.4 53 -36.6
1 11:12:123456 72345670 5.8 57.4 56 -39.6
2 11:12:123456 92345670 5.9 60.4 55 -33.6
Now you can send it to your analysis function.
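If you want quick per-sensor summaries of the reshaped frame, a small sketch (coercing to numeric first, in case the reshape left object columns):
# sketch: make the value columns numeric, then summarise per sensor as before
value_cols = ['ValueA', 'ValueB', 'ValueC', 'ValueD']
result[value_cols] = result[value_cols].apply(pd.to_numeric)
print(result.groupby('SensorID')[value_cols].mean())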

Thanks for thinking about it. Have you tested this code? I'm getting this table as output:
Time 0
0 11234 11:12:123456
1 11234 12345678
2 11234 5.3
3 11234 53.4
4 11234 53
5 11234 -36.6
6 11234 72345670
7 11234 5.8
8 11234 57.4
9 11234 56
10 11234 -39.6
11 11234 92345670
12 11234 5.9
13 11234 60.4
14 11234 55
15 11234 -33.6

Related

Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, i.e.
start end total
9 60 51
200 260 60
I'm trying to keep the start and end separate so I can call on them independently later.
I have this to open the .csv file and read its contents:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
df.groupby(0).filter(lambda x: len(x) > 0)
Which gives me this:
0
0 52.0
1 65.0
2 52.0
3 52.0
4 52.0
... ...
4995 8.0
4996 8.0
4997 8.0
4998 8.0
4999 8.0
5000 rows × 1 columns
I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:
df.groupby(0).count()
gives me this:
0
0.0
1.0
2.0
3.0
4.0
...
68.0
69.0
70.0
71.0
73.0
65 rows × 0 columns
Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do
df.groupby(0).filter(lambda x: len(x) > 0)
but if I change the 0 to anything else then it screws up my generated list.
Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.
Can I get some help, please?
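(For reference, not from the thread: a plain-Python sketch of the run counting described above, assuming the labels sit in the single column of df.)
# sketch: collect runs of identical consecutive values that are at least 50 long
values = df[0].tolist()
runs = []                          # (start_row, end_row, run_length), 0-based rows
start = 0
for i in range(1, len(values) + 1):
    if i == len(values) or values[i] != values[start]:
        if i - start >= 50:
            runs.append((start, i - 1, i - start))
        start = i
print(runs)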
UPDATE
Thanks! So this is what I have thanks to @piterbarg:
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df
           .reset_index()
           .shift(periods=-1)
           .groupby((df[0].diff() != 0).cumsum())  #if the diff between a row and the prev row is not 0, increase cumulative sum
           .agg({0: 'mean', 'index': ['first', 'last', len]}))  #mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()  #drop N/A
              .astype(int)  #type = int
              .reset_index(drop=True))
    print(df3)
out:
0 index
mean first last len
0 7 32 87 56
1 19 277 333 57
2 1 785 940 156
3 30 4062 4125 64
4 29 4214 4269 56
5 7 4450 4599 150
6 1 4612 4775 164
7 7 4778 4882 105
8 8 4945 4999 56
The current issue is trying to make it so the dataframe includes the last row of my .csv. If anyone happens to see this, I would love your input!
Let's start by mocking a df:
import pandas as pd
import numpy as np

np.random.seed(314)
df = pd.DataFrame({0: np.random.randint(10, size=5000)})
# make sure we have a couple of large blocks
df.loc[300:400,0] = 5
df.loc[600:660,0] = 4
First we identify where the changes in the consecutive numbers occur and group by each such run. We record where each run starts, where it finishes, and its size.
df2 = (df.reset_index()
         .groupby((df[0].diff() != 0).cumsum())
         .agg({'index': ['first', 'last', len]})
       )
Then we only pick those groups that are longer than 50
(df2.where(df2[('index', 'len')] > 50)
    .dropna()
    .astype(int)
    .reset_index(drop=True)
)
output:
index
first last len
0 300 400 101
1 600 660 61
As for your question about what df.groupby(0).filter(lambda x: len(x) > 0) does: as far as I can tell, it does nothing. It groups by the different values in column 0 and then discards those groups whose size is 0, which by definition is none of them. So this returns your full df.
Edit
Your code is not quite right; it should be:
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())
             .agg({0: 'mean', 'index': ['first', 'last', len]}))
    df3 = (df2.where(df2[('index', 'len')] > 50)
              .dropna()
              .astype(int)
              .reset_index(drop=True))
    print(df3)
Note that we define and print df3, not df2. I also amended the code so that the value that is repeated is returned in the mean column (sorry, the names are not very intuitive, but you can change them if you want).
first is the index where the repetition starts, last is the last index, and len is how many elements there are.
#function to identify which behaviors have at least 49 frames, and give the starting, ending, and number of frames
def behavior():
    df2 = (df.reset_index()
             .groupby((df[0].diff() != 0).cumsum())  #if the diff between a row and the prev row is not 0, increase cumulative sum
             .agg({0: 'mean', 'index': ['first', 'last', len]}))  #mean is the behavior category
    df3 = (df2.where(df2[('index', 'len')] > 49)
              .dropna()  #drop N/A
              .astype(int)  #type = int
              .reset_index(drop=True))
    print(df3)
yields this:
0 index
mean first last len
0 7 31 86 56
1 19 276 332 57
2 1 784 939 156
3 31 4061 4124 64
4 29 4213 4268 56
5 8 4449 4598 150
6 1 4611 4774 164
7 8 4777 4881 105
8 8 4944 4999 56
Which I love. I did notice that the group with 56x duplicates of '7' actually starts on row 32, and ends on row 87 (just one later in both cases, and the pattern is consistent throughout the sheet). Am I right in believing that this can be fixed with the shift() function somehow? I'm toying around with this still :D
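(If the mismatch really is just pandas' 0-based positions versus 1-based sheet rows, a simpler alternative to shifting the whole frame might be to bump the reported positions; a sketch:)
# sketch, assuming the off-by-one is only 0-based vs 1-based row numbering
df3[('index', 'first')] += 1
df3[('index', 'last')] += 1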

New dataframe row by row from a different format dataframe

I'm wondering if something like this could be achieved with Python.
I currently have the following dataframe (df1):
A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5
And I would like to get this output in a new dataframe (df2):
ID User 131 50
1.1.1 amba 1 4
2.2.2 erto 8 7
3.3.3 gema 2 5
Bear in mind that df1 has an undetermined number of rows and df2 should have the same number of rows as df1. The first and second columns do not change and stay the same. Columns C and E in df1 store attribute IDs, while columns D and F store the attributes' values. For example, in the first row of df1, 131=1 and 50=4. Also, the attribute IDs are not always in the same column: an attribute ID could be placed in column C or in column E.
I am thinking of creating df2 using a loop and analyzing rows with a lambda, but at the moment I'm having trouble getting anything to work. Any idea?
I have understood every part of the code and I am now adding columns, but I am wondering if this could be done with a loop or something similar. This is how the code looks after adding 4 extra columns:
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 131 1 50 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")

df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
                  df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
                  df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
                  ])
       .rename(columns={"A":"ID","B":"User"})
       .set_index(["ID","User","key"])
       .unstack(2)
       .reset_index()
       )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
print(df2)
And this is the output:
ID User 40 50 131 150
0 1.1.1 amba 3 4 1 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
So yes, everything is working fine, but I would like to find a way to do this with a loop instead of having tons of lines (I have about 70 columns per row). Thank you very much for the help.
I have just one extra question and then I will have everything working fine. In my actual table I have some rows with 60 columns and others with just 30 or so. This means that I have tons of NaN in the rows with fewer columns, so I am getting an error when I try to unstack. I have read about pivot_table, drop_duplicates, etc., but I'm not sure how to make any of these options work with this code. Thanks!
Logically you have a mix of keys, with part stored in the rows and part in the columns. Construct a df by concat() that has the whole key as part of the row; then it's a simple case of using unstack() to get what you want.
import pandas as pd
import io

df1 = pd.read_csv(io.StringIO(""" A B C D E F
1.1.1 amba 131 1 50 4
2.2.2 erto 50 7 131 8
3.3.3 gema 131 2 50 5"""), sep=r"\s+")

df2 = (pd.concat([df1.drop(columns=["C","D"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F"]).rename(columns={"C":"key","D":"val"}),
                  ])
       .rename(columns={"A":"ID","B":"User"})
       .set_index(["ID","User","key"])
       .unstack(2)
       .reset_index()
       )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
output
ID User 50 131
1.1.1 amba 4 1
2.2.2 erto 7 8
3.3.3 gema 5 2
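On the two follow-up questions above (building the pairs in a loop, and the NaN/unstack error on shorter rows), a hedged sketch, not from the thread: the per-pair frames can be generated by looping over the key/value column pairs, and dropping rows whose key is NaN before unstacking should take care of the rows that have fewer attributes.
# sketch: build the concat pieces in a loop instead of writing each drop/rename by hand
key_cols = df1.columns[2::2]   # every second column starting at C holds attribute IDs
val_cols = df1.columns[3::2]   # the column right after each key holds its value

pieces = [df1[["A", "B", k, v]].rename(columns={"A": "ID", "B": "User", k: "key", v: "val"})
          for k, v in zip(key_cols, val_cols)]

df2 = (pd.concat(pieces)
         .dropna(subset=["key"])      # shorter rows leave NaN keys; drop them before unstacking
         .set_index(["ID", "User", "key"])
         .unstack(2)
         .reset_index())
df2.columns = [c[1] if c[0] == "val" else c[0] for c in df2.columns.to_flat_index()]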
......

pandas: get rows from one dataframe that exist in another dataframe

I have two dataframes. The dataframes are as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
Here in df1, the column numbers holds multiple unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all the data in df2 that matches the numbers available in df1.
Here is the code I've tried, but I'm not able to figure out how this can be solved using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique values of the numbers column that are in df2, but I need to get all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
import pandas as pd

df1 = pd.DataFrame({"numbers": [123, 1234, 12345, 5421]})
df2 = pd.DataFrame({"numbers": [123, 1234, 12345, 123, 123, 45643], "B": [1, 2, 3, 4, 5, 6], "C": [2, 3, 4, 5, 6, 7]})
final_df = df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
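(Applied to the frames in the question, the same pattern would be the sketch below; it assumes numbers is an ordinary column in both frames.)
# sketch: keep every row of df2 whose number also appears in df1
result = df2[df2['numbers'].isin(df1['numbers'])]
# if the dtypes differ (e.g. strings vs integers), align them first:
# result = df2[df2['numbers'].astype(str).isin(df1['numbers'].astype(str))]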

Pandas : merge dataframes with conditions

I'd like something pretty complicated, I think.
So I have 2 pandas DataFrames:
contact_extrafields (which is a CSV file converted to a DataFrame):
contact_id departement age region size
0 17068CE3 5 19.5
1 788159ED 59 18 ABC
2 4796EDA9 69 100.0
3 2BB080E4 32 DEF 50.5
4 8562B30E 10 GHI 79.95
5 9602758E 67 JKL 23.7
6 3CBBA9F7 65 MNO 14.7
7 DAE5EE44 75 98 159.6
8 5B9E3410 49 10 PQR 890.1
...
datafield_types (which is a dictionary converted to a DataFrame):
name datatype_id datafield_id datatype_name
0 size 1 4 float
1 region 2 3 string
2 age 3 2 integer
3 departement 3 1 integer
I would like a new DataFrame like this :
contact_id datafield_id string_value integer_value boolean_value float_value
0 17068CE3 4 19.5
1 17068CE3 3
2 17068CE3 2 5
3 17068CE3 1
4 788159ED 4
5 788159ED 3 ABC
6 788159ED 2 18
7 788159ED 1 59
....
The DataFrame contact_extrafields contains about 3 million lines.
EDIT (example):
If I take contact_id 788159ED from the DataFrame contact_extrafields,
I take the name of each column and its value,
and check the type of the value in the DataFrame datafield_types using the column name.
For example, for the column departement the value is 59 and its type is integer according to the DataFrame datafield_types, so the datatype_id is 3.
It should insert a line into the new DataFrame that I will create, like this:
contact_id datafield_id string_value integer_value boolean_value float_value
0 788159ED 1 59
....
The datafield_id is retrieved from the DataFrame datafield_types; this will let me know that the contact 788159ED had the value 59 for the column departement, which is of integer type.
Each column creates a row in the DataFrame I want to build.
Is it possible to do it with pandas?
How to do it?
The columns in contact_extrafields can change (so I will change the datafield_types names too).
I've tried a lot of things, but they have all led to memory saturation.
My code is running on a machine with 16 GB of RAM.
Thanks a lot!
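No answer is recorded here, but for what it's worth, one memory-conscious route would be melt plus merge; a sketch, using the frame and column names from the question and assuming both DataFrames are already loaded:
import pandas as pd

# sketch: wide -> long, then attach the type info and route each value into a typed column
long = contact_extrafields.melt(id_vars='contact_id', var_name='name', value_name='value')
long = long.merge(datafield_types[['name', 'datafield_id', 'datatype_name']],
                  on='name', how='left')
for dtype in long['datatype_name'].dropna().unique():
    long[dtype + '_value'] = long['value'].where(long['datatype_name'] == dtype)
result = long.drop(columns=['name', 'value', 'datatype_name'])
If memory is still tight, the melt can be done in chunks of contact_ids and the pieces concatenated.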

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves 2 problems as I see it.
1) Printing the highlighting depends on the output method you're trying to get to, be it STDOUT, a file, or something program-specific.
2) Identification of outliers based on the column data. It's hard to tell whether you want it based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the example below I provide a method that goes at the data with standard deviation / z-scoring based on the mean of each entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies in dealing with this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The pandas styling method (https://pandas.pydata.org/pandas-docs/stable/style.html) works in a few environments.
To get at the original item, identifying outliers in your data, you could use something like the code below, which flags values based on standard deviation and z-score.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)

for col in original:
    col_zscore = col + "_zscore"
    df[col + "_std"] = df[col].std()  # per-column standard deviation, kept for reference
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
    # flag values whose z-score is above 1.5 or below -0.5
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])

print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
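To actually color the flagged cells rather than just print them, the pandas styling approach linked above can be pointed at the same thresholds; a sketch that writes HTML, since styles do not render in the terminal (it reuses df and original from the sample code above):
# sketch: highlight cells whose column z-score is above 1.5 or below -0.5
def highlight_outliers(col):
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if (v > 1.5 or v < -0.5) else '' for v in z]

styled = df[original].style.apply(highlight_outliers, axis=0)
styled.to_html('outliers.html')   # needs a reasonably recent pandas; in a notebook, just display `styled`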
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
