I have data in a text file with the following format:
{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
I need to read it into a pandas DataFrame, with the keys as column names and the values as items. This is my code to read the data in:
columns = ['date', 'time', 'ip', 'user', 'country', 'city', 'lat', 'lon']
df = pd.read_csv("log.txt", sep=';', header=None, names=columns)
A bit frustrated since all I've managed to get is this:
date time ... lat lon
0 {"date":"Jan 6" "time":"07:00:01" ... "lat":48.1059 "lon":17.3}
1 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
2 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
I've read the docs from top to bottom, but am still unable to achieve the required result, like below:
date time ... lat lon
0 Jan 6 07:00:01 ... 48.1059 17.3
1 Jan 6 07:05:26 ... 57.7072 11.9668
2 Jan 6 07:05:26 ... 57.7072 11.9668
Is it possible at all? Any advice will be much appreciated. Thanks.
If, as it appears, you don't have any ; inside the string values, you could use string replacement to turn it into valid (line-separated) JSON:
In [11]: text
Out[11]: '{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}'
In [12]: pd.read_json(text.replace(";", ","), lines=True)
Out[12]:
city country date ip lat lon time user
0 Miloslavov Slovakia Jan 6 178.41.163.99 48.1059 17.3000 07:00:01 null
1 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 postgres
2 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 null
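Reading from the file works the same way; a minimal sketch (the filename log.txt comes from the question, and io.StringIO is used because newer pandas versions expect a file-like object rather than a raw string):

```python
import io
import pandas as pd

# sample lines in the question's format; in practice read them from log.txt
text = (
    '{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; '
    '"country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}\n'
    '{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; '
    '"country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}'
)

# turn the ;-separated pseudo-JSON into valid line-delimited JSON, then parse
df = pd.read_json(io.StringIO(text.replace(";", ",")), lines=True)
print(df[['date', 'time', 'lat', 'lon']])
```

With a real file you would read the contents first, e.g. `text = open("log.txt").read()`, and apply the same replacement.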
The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list has 100 values; the timestamp is the same.
The problem is that I don't know how to access/index the values inside each row's list [price, volume] of each column.
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid a loop... there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to the native pandas datetime type,
but bids is of object type (actually, a string),
and, as I suppose, the same happens when reading from your input file.
And now the main task: the first step is to extract both numbers from bids,
convert them to float and int, and save them in respective columns:
orderbook = orderbook.join(orderbook.bids.str.extract(
r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or other plotting function.
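The whole pipeline above can be run end to end; a sketch using an in-memory stand-in for Input.csv (the values are the ones from the prepared sample):

```python
import pandas as pd

# stand-in for reading Input.csv with sep=';' as in the answer
orderbook = pd.DataFrame({
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': ['[123.12, 300]', '[135.40, 220]', '[130.76, 20]', '[123.12, 180]'],
})
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')

# pull price and volume out of the string with named capture groups
orderbook = orderbook.join(orderbook.bids.str.extract(
    r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
print(orderbook[['timestamp', 'bidprice', 'bidvolume']])
```

The named groups in the regex become the new column names directly, which is why no rename step is needed after the join.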
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
pythonic list.
Then you can also convert timestamp column to datetime just like
above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with respective components of the
source column and you can create your graphics jus like above.
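The dictionary case can be sketched the same way; passing the original index to the helper DataFrame is a small safety measure so the assignment aligns even if the index is not the default 0..n-1:

```python
import pandas as pd

orderbook_data = {
    'timestamp': [1579082401, 1579082461],
    'bids': [[123.12, 300], [135.40, 220]],
}
orderbook = pd.DataFrame(orderbook_data)
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')

# each cell of 'bids' is a real Python list, so tolist() yields one row per pair
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(
    orderbook.bids.tolist(), index=orderbook.index)
print(orderbook)
```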
I have stock data set like
Date Open High ... Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 ... 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 ... 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 ... 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 ... 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 ... 398.821014 398.821014 26580100
I need to cumulatively sum the columns Open, High, Close, Adj Close and Volume.
I tried df.cumsum(), but it raises an error about the timestamp column.
I think for processing trade data it is best to create a DatetimeIndex:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
Then, if needed, take the cumulative sum of all columns:
df = df.cumsum()
If you want the cumulative sum only for some columns:
cols = ['Open','High','Close','Adj Close','Volume']
df[cols] = df[cols].cumsum()
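Putting the steps together on a small hypothetical sample (two rows from the question's data; note the cumulative sum is taken on the selected columns only, so the DatetimeIndex is left untouched):

```python
import pandas as pd

# hypothetical two-row sample mimicking the stock data
df = pd.DataFrame({
    'Date': ['2014-09-17', '2014-09-18'],
    'Open': [465.864014, 456.859985],
    'Volume': [21056800, 34483200],
})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

cols = ['Open', 'Volume']
df[cols] = df[cols].cumsum()   # running totals per selected column
print(df)
```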
I am trying to convert these strings to Python datetime objects, but the letters T and Z are causing a problem.
[datetime.strptime(i, '%Y-%m-%d').date() for i in df['dateUpdated']]
0 2014-02-01T04:41:06Z
1 2013-12-11T05:28:28Z
2 2015-11-19T22:20:29Z
3 2014-02-01T05:22:01Z
4 2016-05-19T14:31:26Z
ValueError: unconverted data remains: T04:41:06Z
Assuming you are working with Pandas, use pd.to_datetime:
import pandas as pd
df = pd.DataFrame({'dateUpdated': ['2014-02-01T04:41:06Z', '2013-12-11T05:28:28Z']})
df['dateUpdated'] = pd.to_datetime(df['dateUpdated'])
Results:
0 2014-02-01 04:41:06+00:00
1 2013-12-11 05:28:28+00:00
Name: dateUpdated, dtype: datetime64[ns, UTC]
If you then want to access only the date part of your new column, you can use:
df['dateUpdated'].dt.date
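Putting the answer together, a minimal runnable sketch: to_datetime understands the ISO 8601 'T' separator and the trailing 'Z' (UTC) without any format string, and .dt.date then yields plain datetime.date objects:

```python
import pandas as pd

df = pd.DataFrame({'dateUpdated': ['2014-02-01T04:41:06Z', '2013-12-11T05:28:28Z']})
# 'Z' marks UTC, so the result is a tz-aware datetime64 column
df['dateUpdated'] = pd.to_datetime(df['dateUpdated'])
dates = df['dateUpdated'].dt.date   # drop the time part
print(dates)
```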
I have a csv file with columns like this
I need to separate column (B) values into separate columns and multiple rows like this
This is what I tried (the data in the code below is the same as the csv data above), and it did not work:
data = [{"latlong":'{lat: 15.85173248 , lng: 78.6216129},{lat: 15.85161765 , lng: 78.61982138},{lat: 15.85246304 , lng: 78.62031075},{lat: 15.85250474 , lng: 78.62034441},{lat: 15.85221891 , lng: 78.62174507},', "Id": 1},
{"latlong": '{lat: 15.8523723 , lng: 78.62177758},{lat: 15.85236637 , lng: 78.62179098},{lat: 15.85231281 , lng: 78.62238316},{lat: 15.8501259 , lng: 78.62201676},', "Id":2}]
df = pd.DataFrame(data)
df
df.latlong.apply(pd.Series)
This works in this case
data1 = [{'latlong':[15.85173248, 78.6216129, 1]},{'latlong': [15.85161765, 78.61982138, 1]},{'latlong': [15.85246304, 78.62031075, 1]},
{'latlong': [15.85250474, 78.62034441, 1]}, {'latlong': [15.85221891, 78.62174507, 1]},{'latlong': [15.8523723, 78.62177758, 2]},
{'latlong': [15.85236637, 78.62179098, 2]}, {'latlong': [15.85231281, 78.62238316, 2]},{'latlong': [15.8501259,78.62201676, 2]}]
df1 = pd.DataFrame(data1)
df1
df1 = df1['latlong'].apply(pd.Series)
df1.columns = ['lat', 'long', 'Id']
df1
How can I achieve this with Python?
I'm new to Python. I tried the following links but could not understand how to apply them to my case:
Splitting dictionary/list inside a Pandas Column into Separate Columns
python split data frame columns into multiple rows
Your data is in a very strange format: the entries of latlong aren't actually valid JSON (there is a trailing comma at the end, and there are no quotes around the field names). I would therefore use a regular expression to split out the columns and a list comprehension to split out the rows:
In [39]: pd.DataFrame(
    [{'Id': r['Id'], 'lat': lat, 'long': long}
     for r in data
     for lat, long in re.findall(r"lat: ([\d.]+).*?lng: ([\d.]+)",
                                 r['latlong'])])
Out[39]:
Id lat long
0 1 15.85173248 78.6216129
1 1 15.85161765 78.61982138
2 1 15.85246304 78.62031075
3 1 15.85250474 78.62034441
4 1 15.85221891 78.62174507
5 2 15.8523723 78.62177758
6 2 15.85236637 78.62179098
7 2 15.85231281 78.62238316
8 2 15.8501259 78.62201676
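As a self-contained sketch (with a shortened version of the question's data, and noting that re.findall returns strings, so the lat/long values stay str unless you cast them):

```python
import re
import pandas as pd

# shortened hypothetical sample in the question's non-JSON format
data = [
    {'latlong': '{lat: 15.85173248 , lng: 78.6216129},'
                '{lat: 15.85161765 , lng: 78.61982138},', 'Id': 1},
    {'latlong': '{lat: 15.8523723 , lng: 78.62177758},', 'Id': 2},
]

# one output row per {lat, lng} pair, tagged with its record's Id
df = pd.DataFrame(
    [{'Id': r['Id'], 'lat': lat, 'long': long}
     for r in data
     for lat, long in re.findall(r'lat: ([\d.]+).*?lng: ([\d.]+)', r['latlong'])])
print(df)
```

Appending `.astype({'lat': float, 'long': float})` would convert the coordinates to numbers if needed.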
I want to save the dataframe df to the .h5 file MainDataFile.h5 :
df.to_hdf ("c:/Temp/MainDataFile.h5", "MainData", mode = "w", format = "table", data_columns=['_FirstDayOfPeriod','Category','ChannelId'])
and get the following error :
*** Exception: cannot find the correct atom type -> [dtype->object,items->Index(['Libellé_Article', 'Libellé_segment'], dtype='object')]
Now if I drop the column 'Libellé_Article' from df (which is a string column), I don't get the error message anymore.
What could be wrong with this column ? I suspect a special, forbidden, character in it, but unable to find which so far.
UPDATE 1
Following Jeff's comment I have tried to encode the column 'Libellé_Article' :
df['Libellé_Article'] = df['Libellé_Article'].str.encode('utf-8')
The column now appears like this :
df['Libellé_Article']
0 b'PAPETERIE'
2 b'NR CONTRIBUTION DEEE'
4 b'NON UTILISE 103'
7 b"L'ENFANT SOUS TERREUR/MILLER A."
10 b'ENERGIE VITALE ET AUTOGUERISON/CHIA M.'
12 b'ENERGIE COSMIQUE CETTE PUISSANCE QUI EST EN ...
13 b'ENERGIE COSMIQUE CETTE PUISSANCE QUI EST EN ...
18 b"COMMENT ATTIRER L'ARGENT/MURPHY J."
19 b"COMMENT ATTIRER L'ARGENT/MURPHY J."
and when I execute the command to_hdf, I get :
*** TypeError: Cannot serialize the column [Libellé_Article] because
its data contents are [mixed] object dtype
This will work in py2; in py3, this should work without the encoding step.
This is actually a 'mixed' column, as it includes both strings and unicode.
In [24]: from pandas.compat import u
In [25]: df = DataFrame({'unicode':[u('\u03c3')] * 5 + list('abc') })
In [26]: df
Out[26]:
unicode
0 ?
1 ?
2 ?
3 ?
4 ?
5 a
6 b
7 c
In [27]: df['unicode'] = df.unicode.str.encode('utf-8')
In [28]: df.to_hdf('test.h5','df',mode='w',data_columns=['unicode'],format='table')
In [29]: pd.read_hdf('test.h5','df')
Out[29]:
unicode
0 ?
1 ?
2 ?
3 ?
4 ?
5 a
6 b
7 c
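On Python 3, the usual cause of the "mixed object dtype" error is a column holding more than one Python type (for example str mixed with bytes after a partial encode step). A hedged sketch of detecting and normalizing such a column before to_hdf (the sample values are hypothetical):

```python
import pandas as pd

# hypothetical column mixing str and bytes, as after a partial encode step
df = pd.DataFrame({'Libellé_Article': ['PAPETERIE', b'NON UTILISE 103']})
types_before = set(df['Libellé_Article'].map(type))

# make the column uniformly str so HDFStore can pick a single atom type;
# decode bytes rather than astype(str), which would keep the b'...' repr
df['Libellé_Article'] = df['Libellé_Article'].map(
    lambda v: v.decode('utf-8') if isinstance(v, bytes) else str(v))
types_after = set(df['Libellé_Article'].map(type))
print(types_before, types_after)
```

Once `types_after` is just {str}, the to_hdf call from the question should no longer hit the mixed-dtype serialization error.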