Read multiple values into pandas DataFrame - python-3.x

I have data in a text file with the following format:
{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
I need to read it into a pandas DataFrame with the keys as column names and the values as items. This is my code to read the data in:
columns = ['date', 'time', 'ip', 'user', 'country', 'city', 'lat', 'lon']
df = pd.read_csv("log.txt", sep=';', header=None, names=columns)
A bit frustrated since all I've managed to get is this:
date time ... lat lon
0 {"date":"Jan 6" "time":"07:00:01" ... "lat":48.1059 "lon":17.3}
1 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
2 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
I've read the docs from top to bottom, but I'm still unable to achieve the required result, shown below:
date time ... lat lon
0 Jan 6 07:00:01 ... 48.1059 17.3
1 Jan 6 07:05:26 ... 57.7072 11.9668
2 Jan 6 07:05:26 ... 57.7072 11.9668
Is it possible at all? Any advice will be much appreciated. Thanks.

If, as it appears, you don't have any ; inside the string values, you can use string replacement to turn the text into valid (line-separated) JSON:
In [11]: text
Out[11]: '{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}'
In [12]: pd.read_json(text.replace(";", ","), lines=True)
Out[12]:
city country date ip lat lon time user
0 Miloslavov Slovakia Jan 6 178.41.163.99 48.1059 17.3000 07:00:01 null
1 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 postgres
2 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 null
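Newer pandas versions deprecate passing a literal JSON string to read_json, so a file-like wrapper keeps the same idea runnable (a sketch using two of the sample lines):

```python
import io
import pandas as pd

# Two sample lines from the question; ";" separates the key/value pairs,
# so replacing it with "," yields valid line-delimited JSON
text = (
    '{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; '
    '"country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}\n'
    '{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; '
    '"country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}\n'
)

df = pd.read_json(io.StringIO(text.replace(";", ",")), lines=True)
```

With a real log file you would read the file's contents into `text` first and apply the same replacement.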

Related

How to graph Binance API Orderbook with Pandas-matplotlib?

The data comes in 3 columns after orderbook = pd.DataFrame(orderbook_data):
timestamp bids asks
UNIX timestamp [bidprice, bidvolume] [askprice, askvolume]
Each list has 100 values; the timestamp is the same for all of them.
The problem is that I don't know how to access/index the [price, volume] values inside each row's list in each column.
I know that by running ---> bids = orderbook["bids"]
I get the list of 100 lists ---> [bidprice, bidvolume]
I'm looking to avoid doing a loop... there has to be a way to just plot the data.
I hope someone can understand my problem. I just want to plot price on x and volume on y. The goal is to make it live.
As you didn't present your input file, I prepared it on my own:
timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
To read it I used:
orderbook = pd.read_csv('Input.csv', sep=';')
orderbook.timestamp = pd.to_datetime(orderbook.timestamp, unit='s')
Its content is:
timestamp bids
0 2020-01-15 10:00:01 [123.12, 300]
1 2020-01-15 10:01:01 [135.40, 220]
2 2020-01-15 10:05:36 [130.76, 20]
3 2020-01-15 10:06:41 [123.12, 180]
Now:
timestamp has been converted to native pandasonic type of datetime,
but bids is of object type (actually, a string).
and, as I suppose, this is the same when read from your input file.
And now the main task: The first step is to extract both numbers from bids,
convert them to float and int and save in respective columns:
orderbook = orderbook.join(orderbook.bids.str.extract(
r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook.bidprice = orderbook.bidprice.astype(float)
orderbook.bidvolume = orderbook.bidvolume.astype(int)
Now orderbook contains:
timestamp bids bidprice bidvolume
0 2020-01-15 10:00:01 [123.12, 300] 123.12 300
1 2020-01-15 10:01:01 [135.40, 220] 135.40 220
2 2020-01-15 10:05:36 [130.76, 20] 130.76 20
3 2020-01-15 10:06:41 [123.12, 180] 123.12 180
and you can generate e.g. a scatter plot, calling:
orderbook.plot.scatter('bidprice', 'bidvolume');
or other plotting function.
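Putting the CSV path above together as one runnable sketch (the sample Input.csv content is inlined via io.StringIO instead of being read from disk):

```python
import io
import pandas as pd

# Inlined version of the prepared Input.csv from the answer
csv_text = """timestamp;bids
1579082401;[123.12, 300]
1579082461;[135.40, 220]
1579082736;[130.76, 20]
1579082801;[123.12, 180]
"""

orderbook = pd.read_csv(io.StringIO(csv_text), sep=';')
orderbook['timestamp'] = pd.to_datetime(orderbook['timestamp'], unit='s')

# Pull both numbers out of the "bids" strings, then fix the dtypes
orderbook = orderbook.join(
    orderbook['bids'].str.extract(r'\[(?P<bidprice>\d+\.\d+), (?P<bidvolume>\d+)]'))
orderbook['bidprice'] = orderbook['bidprice'].astype(float)
orderbook['bidvolume'] = orderbook['bidvolume'].astype(int)
```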
Another possibility
Or maybe your orderbook_data is a dictionary? Something like:
orderbook_data = {
'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]] }
In this case, when you create a DataFrame from it, the column types
are initially:
timestamp - int64,
bids - also object, but this time each cell contains a plain
pythonic list.
Then you can also convert timestamp column to datetime just like
above.
But to split bids (a column of lists) into 2 separate columns,
you should run:
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook.bids.tolist())
Then you have 2 new columns with respective components of the
source column and you can create your graphics just like above.
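The dictionary variant can be sketched end-to-end as follows (the sample values are the ones used above, not real Binance data):

```python
import pandas as pd

# Dictionary input as hypothesized above
orderbook_data = {
    'timestamp': [1579082401, 1579082461, 1579082736, 1579082801],
    'bids': [[123.12, 300], [135.40, 220], [130.76, 20], [123.12, 180]],
}

orderbook = pd.DataFrame(orderbook_data)
orderbook['timestamp'] = pd.to_datetime(orderbook['timestamp'], unit='s')

# Split the column of 2-element lists into two numeric columns
orderbook[['bidprice', 'bidvolume']] = pd.DataFrame(orderbook['bids'].tolist())
```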

Cumulative sum of all columns except the date column in Python with cumsum()

I have stock data set like
Date Open High ... Close Adj Close Volume
0 2014-09-17 465.864014 468.174011 ... 457.334015 457.334015 21056800
1 2014-09-18 456.859985 456.859985 ... 424.440002 424.440002 34483200
2 2014-09-19 424.102997 427.834991 ... 394.795990 394.795990 37919700
3 2014-09-20 394.673004 423.295990 ... 408.903992 408.903992 36863600
4 2014-09-21 408.084991 412.425995 ... 398.821014 398.821014 26580100
I need to take the cumulative sum of the columns Open, High, Close, Adj Close, Volume.
I tried df.cumsum(), but it shows a timestamp error.
I think for processing trade data it is best to create a DatetimeIndex:
#if necessary
#df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
And then, if necessary, a cumulative sum over all columns:
df = df.cumsum()
If you want the cumulative sum only for some columns:
cols = ['Open','High','Close','Adj Close','Volume']
df[cols] = df[cols].cumsum()
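A minimal runnable sketch of the per-column variant (with made-up prices and only two of the listed columns):

```python
import pandas as pd

# Tiny stand-in for the stock data, with 'Date' set as the index first
df = pd.DataFrame({
    'Date': pd.to_datetime(['2014-09-17', '2014-09-18', '2014-09-19']),
    'Open': [465.86, 456.86, 424.10],
    'Volume': [21056800, 34483200, 37919700],
}).set_index('Date')

cols = ['Open', 'Volume']
df[cols] = df[cols].cumsum()  # running totals, per column
```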

How to parse this type of datetime data

I am trying to convert these string objects to Python datetime objects, but the letters T and Z are creating a problem.
datetime.strptime(i,'%Y-%m-%d').date() for i in df['dateUpdated']
0 2014-02-01T04:41:06Z
1 2013-12-11T05:28:28Z
2 2015-11-19T22:20:29Z
3 2014-02-01T05:22:01Z
4 2016-05-19T14:31:26Z
ValueError: unconverted data remains: T04:41:06Z
Assuming you are working with Pandas, use pd.to_datetime:
import pandas as pd
df = pd.DataFrame({'dateUpdated': ['2014-02-01T04:41:06Z', '2013-12-11T05:28:28Z']})
df['dateUpdated'] = pd.to_datetime(df['dateUpdated'])
Results:
0 2014-02-01 04:41:06+00:00
1 2013-12-11 05:28:28+00:00
Name: dateUpdated, dtype: datetime64[ns, UTC]
If you then want to access only the date part of your new column, you can use:
df['dateUpdated'].dt.date
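For comparison, the strptime route from the question also works once the format string accounts for the literal T and Z; pd.to_datetime remains the simpler option:

```python
from datetime import datetime

import pandas as pd

# strptime needs the full format, including the literal 'T' and 'Z'
dt = datetime.strptime('2014-02-01T04:41:06Z', '%Y-%m-%dT%H:%M:%SZ')

# The pandas route from the answer, then extracting just the date part
df = pd.DataFrame({'dateUpdated': ['2014-02-01T04:41:06Z', '2013-12-11T05:28:28Z']})
df['dateUpdated'] = pd.to_datetime(df['dateUpdated'])
dates = df['dateUpdated'].dt.date
```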

How to separate column values into multiple rows & multiple columns with Python

I have a csv file with columns like this
I need to separate column (B) values into separate columns and multiple rows like this
This is what I tried (the data in the code below is the same as the csv data above), and it did not work:
data = [{"latlong":'{lat: 15.85173248 , lng: 78.6216129},{lat: 15.85161765 , lng: 78.61982138},{lat: 15.85246304 , lng: 78.62031075},{lat: 15.85250474 , lng: 78.62034441},{lat: 15.85221891 , lng: 78.62174507},', "Id": 1},
{"latlong": '{lat: 15.8523723 , lng: 78.62177758},{lat: 15.85236637 , lng: 78.62179098},{lat: 15.85231281 , lng: 78.62238316},{lat: 15.8501259 , lng: 78.62201676},', "Id":2}]
df = pd.DataFrame(data)
df
df.latlong.apply(pd.Series)
This works in this case
data1 = [{'latlong':[15.85173248, 78.6216129, 1]},{'latlong': [15.85161765, 78.61982138, 1]},{'latlong': [15.85246304, 78.62031075, 1]},
{'latlong': [15.85250474, 78.62034441, 1]}, {'latlong': [15.85221891, 78.62174507, 1]},{'latlong': [15.8523723, 78.62177758, 2]},
{'latlong': [15.85236637, 78.62179098, 2]}, {'latlong': [15.85231281, 78.62238316, 2]},{'latlong': [15.8501259,78.62201676, 2]}]
df1 = pd.DataFrame(data1)
df1
df1 = df1['latlong'].apply(pd.Series)
df1.columns = ['lat', 'long', 'Id']
df1
How can I achieve this with Python ?
New to Python. I tried the following links, but could not understand how to apply them to my case:
Splitting dictionary/list inside a Pandas Column into Separate Columns
python split data frame columns into multiple rows
Your data is in a very strange format ... the entries of latlong aren't actually valid JSON (there is a trailing comma at the end, and there are no quotes around the field names), so I would probably actually use a regular expression to split out the columns, and a list comprehension to split out the rows:
In [39]: pd.DataFrame(
[{'Id':r['Id'], 'lat':lat, 'long':long}
for r in data
for lat,long in re.findall(r"lat: ([\d.]+).*?lng: ([\d.]+)",
r['latlong'])])
Out[39]:
Id lat long
0 1 15.85173248 78.6216129
1 1 15.85161765 78.61982138
2 1 15.85246304 78.62031075
3 1 15.85250474 78.62034441
4 1 15.85221891 78.62174507
5 2 15.8523723 78.62177758
6 2 15.85236637 78.62179098
7 2 15.85231281 78.62238316
8 2 15.8501259 78.62201676
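As a self-contained version of the same approach (with `import re` added and a raw string for the pattern; only the first few points from the question's data are kept):

```python
import re

import pandas as pd

# Shortened form of the question's input
data = [
    {"latlong": '{lat: 15.85173248 , lng: 78.6216129},'
                '{lat: 15.85161765 , lng: 78.61982138},', "Id": 1},
    {"latlong": '{lat: 15.8523723 , lng: 78.62177758},', "Id": 2},
]

# findall returns the captured groups as strings, one row per {lat, lng} pair
df = pd.DataFrame(
    [{'Id': r['Id'], 'lat': lat, 'long': long}
     for r in data
     for lat, long in re.findall(r"lat: ([\d.]+).*?lng: ([\d.]+)",
                                 r['latlong'])])
```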

Error message "Exception: cannot find the correct atom type" when executing pandas to_hdf

I want to save the dataframe df to the .h5 file MainDataFile.h5 :
df.to_hdf ("c:/Temp/MainDataFile.h5", "MainData", mode = "w", format = "table", data_columns=['_FirstDayOfPeriod','Category','ChannelId'])
and get the following error :
*** Exception: cannot find the correct atom type -> [dtype->object,items->Index(['Libellé_Article', 'Libellé_segment'], dtype='object')]
Now if I drop the column 'Libellé_Article' from df (which is a string column), I don't get the error message anymore.
What could be wrong with this column? I suspect a special, forbidden character in it, but I have been unable to find it so far.
UPDATE 1
Following Jeff's comment I have tried to encode the column 'Libellé_Article' :
df['Libellé_Article'] = df['Libellé_Article'].str.encode('utf-8')
The column now appears like this :
df['Libellé_Article']
0 b'PAPETERIE'
2 b'NR CONTRIBUTION DEEE'
4 b'NON UTILISE 103'
7 b"L'ENFANT SOUS TERREUR/MILLER A."
10 b'ENERGIE VITALE ET AUTOGUERISON/CHIA M.'
12 b'ENERGIE COSMIQUE CETTE PUISSANCE QUI EST EN ...
13 b'ENERGIE COSMIQUE CETTE PUISSANCE QUI EST EN ...
18 b"COMMENT ATTIRER L'ARGENT/MURPHY J."
19 b"COMMENT ATTIRER L'ARGENT/MURPHY J."
and when I execute the command to_hdf, I get :
*** TypeError: Cannot serialize the column [Libellé_Article] because
its data contents are [mixed] object dtype
This will work in py2. For py3, this should work without the encoding step.
This is actually a 'mixed' column as it includes strings and unicode.
In [24]: from pandas.compat import u
In [25]: df = DataFrame({'unicode':[u('\u03c3')] * 5 + list('abc') })
In [26]: df
Out[26]:
unicode
0 σ
1 σ
2 σ
3 σ
4 σ
5 a
6 b
7 c
In [27]: df['unicode'] = df.unicode.str.encode('utf-8')
In [28]: df.to_hdf('test.h5','df',mode='w',data_columns=['unicode'],format='table')
In [29]: pd.read_hdf('test.h5','df')
Out[29]:
unicode
0 σ
1 σ
2 σ
3 σ
4 σ
5 a
6 b
7 c
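One quick way to diagnose such columns (a sketch using pandas' type-inference helper, with made-up values; not part of the original answer): a column mixing str and bytes is reported as 'mixed', which is what both error messages above are reacting to.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# A column mixing a plain string with an encoded (bytes) value
s = pd.Series(['PAPETERIE', 'Libellé'.encode('utf-8')])

kind = infer_dtype(s)  # reported as a mixed-type column
```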
