How to convert bytes data into a python pandas dataframe? - python-3.x

I would like to convert 'bytes' data into a Pandas dataframe.
The data looks like this (few first lines):
(b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'
The column headers appear at the top; each subsequent line is a timestamp and numbers.
Is there a straightforward way to do this?
Thank you very much
#Paula Livingstone:
This seems to work:
import pandas as pd

s = str(bytes_data, 'utf-8')
with open("data.txt", "w") as file:   # close the file before pandas reads it back
    file.write(s)
df = pd.read_csv('data.txt')
Maybe this can be done without using a file in between.

I had the same issue and found the StringIO library (https://docs.python.org/2/library/stringio.html) via the answer here: How to create a Pandas DataFrame from a string
Try something like:
import pandas as pd
from io import StringIO

s = str(bytes_data, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)

You can also use BytesIO directly:
import pandas as pd
from io import BytesIO

df = pd.read_csv(BytesIO(bytes_data))
This will save you the step of transforming bytes_data to a string.

Ok cool, your input formatting is quite awkward but the following works:
import pandas as pd

with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read in file as a string
df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)
print(df)
this produces the following:
                  0                  1     2    3     4        5     6   7  \
0  #Settlement Date  Settlement Period  CCGT  OIL  COAL  NUCLEAR  WIND  PS
1        2017-01-01                  1  7727    0  3815     7404  3923   0
2        2017-01-01                  2  8338    0  3815     7403  3658  16
3        2017-01-01                  3  7927    0  3801     7408  3925   0

        8     9     10     11      12      13     14       15
0  NPSHYD  OCGT  OTHER  INTFR  INTIRL  INTNED  INTEW  BIOMASS
1     944     0   2123    948     296     856    238
2     909     0   2124    998     298     874    288
3     864     0   2122    998     298     816    286     None
In order for this to work you will need to ensure that your input file contains only a collection of complete rows. For this reason I removed the partial row for the purposes of the test.
Since you have said that the data source is an HTTP GET request, the initial read would take place using pandas.read_html.
More detail on this can be found here. Note specifically the section on io (io : str or file-like).
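If the bytes do indeed come from an HTTP GET request that returns CSV text, the intermediate file can be skipped entirely. Here is a minimal sketch combining requests with the BytesIO approach above; the URL is a placeholder, not the actual endpoint from the question:
import requests
import pandas as pd
from io import BytesIO

resp = requests.get("https://example.com/generation.csv")  # placeholder URL for the GET request
resp.raise_for_status()                                     # fail early on a bad response
df = pd.read_csv(BytesIO(resp.content))                     # resp.content is already bytes
print(df.head())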

Related

How to improve Beautiful soup webscrape loop?

I'm dipping my toes into webscraping with beautiful soup, and to do so I'm doing a small project where I'm looking at a pokemon fansite and getting the pokemon moves from a table. I'm going for the move name and nothing else. Currently my code does that poorly and incorrectly until the very bottom of its output.
It looks something like this.
It eventually does what I anticipate at the end there (starting with pound).
Here is what the table looks like on the webpage.
What I've got:
import requests
from bs4 import BeautifulSoup as bs
# Load page
r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
# Convert to soup object
soup = bs(r.content)
# Get first table (aka the one we need)
first_table = soup.find('table')
# Loop and grab what we want
for td in first_table.find_all('td', style=False, align=False):
    download = td.find_all('a', href=True, title=True, style=False, align=False)
    for a in download:
        text = a.string
        print(text)
input()
All of this is not even necessary. You can simply use pandas to scrape the entire table:
import requests
import pandas as pd
r = requests.get("https://bulbapedia.bulbagarden.net/wiki/List_of_moves")
df = pd.read_html(r.content)[1]
print(df)
Output:
       #             Name      Type  Category Contest  PP Power Accuracy   Gen
0      1            Pound    Normal  Physical   Tough  35    40     100%     I
1      2     Karate Chop*  Fighting  Physical   Tough  25    50     100%     I
2      3      Double Slap    Normal  Physical    Cute  10    15      85%     I
3      4      Comet Punch    Normal  Physical   Tough  15    18      85%     I
4      5       Mega Punch    Normal  Physical   Tough  20    80      85%     I
..   ...              ...       ...       ...     ...  ..   ...      ...   ...
821  822      Fiery Wrath      Dark   Special     ???  10    90     100%  VIII
822  823  Thunderous Kick  Fighting  Physical     ???  10    90     100%  VIII
823  824    Glacial Lance       Ice  Physical     ???   5   130     100%  VIII
824  825   Astral Barrage     Ghost   Special     ???   5   120     100%  VIII
825  826      Eerie Spell   Psychic   Special     ???   5    80     100%  VIII

[826 rows x 9 columns]
You can also send these values to a neat csv file by adding this line to your code:
df.to_csv('Moves.csv', index = False)
Screenshot of csv file:

Count Occurrences for Objects in a Column of Lists for Really Large CSV File

I have a huge CSV file (8 GB) containing multiple columns. One of the columns is a column of lists that looks like this:
YEAR WIN_COUNTRY_ISO3
200 2017 ['BEL', 'FRA', 'ESP']
201 2017 ['BEL', 'LTU']
202 2017 ['POL', 'BEL']
203 2017 ['BEL']
204 2017 ['GRC', 'DEU', 'FRA', 'LVA']
205 2017 ['LUX']
206 2017 ['BEL', 'SWE', 'LUX']
207 2017 ['BEL']
208 2017 []
209 2017 []
210 2017 []
211 2017 ['BEL']
212 2017 ['SWE']
213 2017 ['LUX', 'LUX']
214 2018 ['DEU', 'LUX']
215 2018 ['ESP', 'PRT']
216 2018 ['AUT']
217 2018 ['DEU', 'BEL']
218 2009 ['ESP']
219 2009 ['BGR']
Each 3-letter code represents a country. I would like to create a frequency table for each country so I can count the occurrences of each country in the entire column. Since the file is really large and my PC can't load the whole CSV as a dataframe, I try to read the file lazily, iterate through the lines, get the last column, and add each item of the WIN_COUNTRY_ISO3 column (which happens to be the last column) to a dictionary of counts.
import sys
from itertools import islice
n=100
i = 0
col_dict={}
with open(r"filepath.csv") as file:
    for nline in iter(lambda: tuple(islice(file, n)), ()):
        row = nline.splitline
        WIN_COUNTRY_ISO3 = row[-1]
        for iso3 in WIN_COUNTRY_ISO3:
            if iso3 in col_dict.keys():
                col_dict[iso3]+=1
            else:
                col_dict[iso3]=1
        i+=1
        sys.stdout.write("\rDoing thing %i" % i)
        sys.stdout.flush()
print(col_dict)
However, this process takes a really long time. I tried to iterate through multiple lines at a time by using
for nline in iter(lambda: tuple(islice(file, n)), ())
Q1:
However, this doesn't seem to work and Python still processes the file line by line. Does anybody know the most efficient way for me to generate the count of each country for a really large file like mine?
The resulting table would look like this:
Country Freq
BEL 4543
FRA 4291
ESP 3992
LTU 3769
POL 3720
GRC 3213
DEU 3119
LVA 2992
LUX 2859
SWE 2802
PRT 2584
AUT 2374
BGR 1978
RUS 1770
TUR 1684
I would also like to create the frequency table for each year (in the YEAR column), if anybody can help me with that. Thank you.
Try this:
from collections import defaultdict
import re

result = defaultdict(int)
f = open(r"filepath.csv")
next(f)  # skip the header line
for row in f:
    data = re.sub(r'[\s\d\'\[\]]', '', row)  # strip whitespace, digits, quotes and brackets
    if data:
        for x in data.split(','):
            result[x] += 1
print(result)
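The question also asks for a per-year table, which the answers here don't cover. Below is a minimal pandas sketch using chunked reading so the 8 GB file is never loaded at once; it assumes the relevant columns are named exactly YEAR and WIN_COUNTRY_ISO3 and that every list cell is a well-formed Python list literal (no NaN values):
from ast import literal_eval
from collections import Counter
import pandas as pd

counts = Counter()          # overall frequency per country
counts_by_year = {}         # YEAR -> Counter of countries

# chunksize of 100,000 rows is an arbitrary choice; tune it to your memory
for chunk in pd.read_csv("filepath.csv", usecols=["YEAR", "WIN_COUNTRY_ISO3"], chunksize=100_000):
    for year, cell in zip(chunk["YEAR"], chunk["WIN_COUNTRY_ISO3"]):
        codes = literal_eval(cell)              # "['BEL', 'FRA']" -> ['BEL', 'FRA']
        counts.update(codes)
        counts_by_year.setdefault(year, Counter()).update(codes)

print(counts.most_common(15))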
If you can handle awk, here's one:
$ cat program.awk
{
    while(match($0,/'[A-Z]{3}'/)) {
        a[substr($0,RSTART+1,RLENGTH-2)]++
        $0=substr($0,RSTART+RLENGTH)
    }
}
END {
    for(i in a)
        print a[i],i
}
Execute it:
$ awk -f program.awk file
Output:
1 AUT
3 DEU
3 ESP
1 BGR
1 LTU
2 FRA
1 PRT
5 LUX
8 BEL
1 POL
1 GRC
1 LVA
2 SWE
The program matches against $0, the whole record (row) of data, so it might pick up false hits from elsewhere in the record. You can tighten that with proper field separation, but as the real field layout wasn't provided I can't help any further. See GNU awk, FS and maybe FPAT on Google.
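For what it's worth, here is a hedged sketch of what that field separation could look like with GNU awk's FPAT, assuming the file is comma-separated and the country list lives in the last column (double-quoted whenever it contains commas):
$ cat program2.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]*\")" }   # treat quoted CSV fields as single fields
{
    s = $NF                               # assumption: the list is the last column
    while (match(s, /'[A-Z]{3}'/)) {
        a[substr(s, RSTART+1, RLENGTH-2)]++
        s = substr(s, RSTART+RLENGTH)
    }
}
END {
    for (i in a)
        print a[i], i
}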

Bokeh Dodge Chart using Different Pandas DataFrame

Hi everyone! So I have 2 dataframes extracted from Pro-Football-Reference as CSVs and run through pandas with the aid of StringIO.
I'm pasting only the header and a row of the info right below:
data_1999 = StringIO("""Tm,W,L,W-L%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,13,3,.813,423,333,90,5.6,0.5,6.1,6.6,-0.5""")
data = StringIO("""Tm,W,L,T,WL%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,10,6,0,.625,433,344,89,5.6,-2.2,3.4,3.9,-0.6""")
And then interpreted normally using pandas.read_csv, creating 2 different dataframes called df_nfl_1999 and df_nfl respectively.
So I was trying to use Bokeh and do something like here, except that instead of 'apples' and 'pears' the team names would be the main grouping. I tried to emulate it using only pandas DataFrame data:
p9=figure(title='Comparison 1999 x 2018',background_fill_color='#efefef',x_range=df_nfl_1999['Tm'])
p9.xaxis.axis_label = 'Team'
p9.yaxis.axis_label = 'Variable'
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.25,range=p9.x_range),top=df_nfl['PF'],legend='PF in 2018', width=0.3, color='#A6CEE3')
show(p9)
And the error I got was:
ValueError: expected an element of either String, Dict(Enum('expr',
'field', 'value', 'transform'), Either(String, Instance(Transform),
Instance(Expression), Float)) or Float, got {'field': 0
Washington Redskins
My initial idea was to group by Team Name (df_nfl['Tm']), analyzing the points in favor in each year (so df_nfl['PF'] for 2018 and df_nfl_1999['PF'] for 1999). A simple offset of the columns could solve it, but I can't seem to find a way to do this other than the dodge chart, and it's not really working (I'm a newbie).
By the way, the error is reported as happening on this line:
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
I could use a scatter plot, for example, and both charts would coexist, and in some cases overlap (if the data is the same), but I was really aiming at plotting it side by side. The other answers related to the subject usually have older versions of Bokeh with deprecated functions.
Any way I can solve this? Thanks!
Edit:
Here is the output of the .head() method. The other dataframe returns exactly the same categories, columns and rows, except that obviously the data changes since it's from a different season.
                    Tm   W   L   W-L%   PF   PA   PD  MoV  SoS  SRS  OSRS  \
0  Washington Redskins  10   6  0.625  443  377   66  4.1 -1.3  2.9   6.8
1       Dallas Cowboys   8   8  0.500  352  276   76  4.8 -1.6  3.1  -0.3
2      New York Giants   7   9  0.438  299  358  -59 -3.7  0.7 -3.0  -1.8
3    Arizona Cardinals   6  10  0.375  245  382 -137 -8.6 -0.2 -8.8  -5.5
4  Philadelphia Eagles   5  11  0.313  272  357  -85 -5.3  1.1 -4.2  -3.3

   DSRS
0  -3.9
1   3.4
2  -1.2
3  -3.2
4  -0.9
And executing with just x=dodge returns:
dodge() missing 1 required positional argument: 'value'
Adding the argument value=0.0 or value=0.2 gives the same error as in the original post.
The first argument to dodge should be a single column name of a column in a ColumnDataSource. The effect is then that any values from that column are dodged by the specified amount when used as coordinates.
You are trying to pass the contents of a column, which is not expected. It's hard to say for sure without complete code to test, but you most likely want
x=dodge('Tm', ...)
However, you will also need to actually use an explicit Bokeh ColumnDataSource and pass it as source to vbar, as is done in the example you link. You can construct one explicitly, but often you can also just pass the dataframe directly as source=df and it will be adapted.
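Putting that together, here is a minimal sketch of what the dodge call could look like, assuming both dataframes list the same teams in the same order and using the column names from the question (legend_label is the newer-Bokeh spelling of the legend argument):
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.transform import dodge

# one source holding both seasons, keyed by team name
source = ColumnDataSource(data=dict(
    Tm=list(df_nfl_1999['Tm']),
    PF_1999=list(df_nfl_1999['PF']),
    PF_2018=list(df_nfl['PF']),
))

p9 = figure(title='Comparison 1999 x 2018', x_range=list(df_nfl_1999['Tm']),
            background_fill_color='#efefef')
p9.vbar(x=dodge('Tm', -0.15, range=p9.x_range), top='PF_1999', width=0.3,
        source=source, legend_label='PF in 1999')
p9.vbar(x=dodge('Tm', 0.15, range=p9.x_range), top='PF_2018', width=0.3,
        source=source, color='#A6CEE3', legend_label='PF in 2018')
show(p9)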

Convert .txt file into multi index dataframe pandas

I have a very unorganized dataset located in a text file, say file.txt.
The sample looks something like this:
TYPE Invoice C AC DATE TIME Total Invoice Qty1 ITEMVG By Total 3,000.00
Piece Item
5696 01/03/2018 09:21 32,501.35 1 Golden Plate ÞÔÞæÇä ÈÞÑ 6,517.52
1 áÈä ÑæÇÈí ÊÚäÇíá 2 ßÛ 4,261.45
1 Magic chef pop corn 907g 3,509.43
1 áÈäÉ ÊÚäÇíá ÔÝÇÝÉ 1 ßíáæ 9,525.60
1 KHOURY UHT 1 L 2,506.74
1 ÎÈÒ ÔãÓíä ÕÛíÑ 1,002.69
2 Almera 200Tiss 2,506.74
1.55 VG Potato 1,550.17
0.41 VG Eggplant 619.67
1 Delivery Charge 501.35
5697 01/03/2018 09:31 15,751.35 0.5 Halloum 1K. 4,476.03
0.59 Cheese double Cream 3,253.75
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
3 ãæáÇä ÏæÑ ÎÈÒ æÓØ 32 3,760.11
1 Delivery Charge 501.35
I want to import it into a pandas dataframe using a multi-index. Can someone help me with this?
In fact, it cannot even be read cleanly as a txt file:
# Obtain the Unorganized data from txt
file1=open('file.txt','r')
UnOrgan=file1.read()
You should be able to just read it in using read_table.
import pandas as pd
df = pd.read_table(<your file>, sep="\t", header=<row with column info>)
I'm guessing that the separator is a tab.
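For the multi-index part of the question, here is a rough sketch under the same tab-separated assumption; the column names passed to set_index are guesses and would need to match whatever read_table actually produces:
import pandas as pd

df = pd.read_table("file.txt", sep="\t", header=0)
# hypothetical level names -- replace with the real column labels
df = df.set_index(["Invoice", "DATE"]).sort_index()
print(df.head())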

How can I swap numbers inside data block of repeating format using linux commands?

I have a huge data file, and I hope to swap some numbers in the 2nd column only, in the following format. The file has 25,000,000 datasets, with 8,768 lines each.
%% Edited: shorter 10-line example. Sorry for the inconvenience. This is one typical data block.
# Dataset 1
#
# Number of lines 10
#
# header lines
5 11 3 10 120 90 0 0.952 0.881 0.898 2.744 0.034 0.030
10 12 3 5 125 112 0 0.952 0.897 0.905 2.775 0.026 0.030
50 10 3 48 129 120 0 1.061 0.977 0.965 3.063 0.001 0.026
120 2 4 5 50 186 193 0 0.881 0.965 0.899 0.917 3.669 0.000 -0.005
125 3 4 10 43 186 183 0 0.897 0.945 0.910 0.883 3.641 0.000 0.003
186 5 4 120 125 249 280 0 0.899 0.910 0.931 0.961 3.727 0.000 -0.001
193 6 4 120 275 118 268 0 0.917 0.895 0.897 0.937 3.799 0.000 0.023
201 8 4 278 129 131 280 0 0.921 0.837 0.870 0.934 3.572 0.000 0.008
249 9 4 186 355 179 317 0 0.931 0.844 0.907 0.928 3.615 0.000 0.008
280 10 4 186 201 340 359 0 0.961 0.934 0.904 0.898 3.700 0.000 0.033
#
# Dataset 1
#
# Number of lines 10
...
As you can see, there are 7 repeating header lines at the top and 1 trailing line at the end of the dataset. The header and trailing lines all begin with #. As a result, each data block has 7 header lines, 8,768 data lines, and 1 trailing line, 8,776 lines in total. That one trailing line contains only a single '#'.
I want to swap some numbers in the 2nd column only. First, I want to replace
1, 9, 10, 11 => 666
2, 6, 7, 8 => 333
3, 4, 5 => 222
of the 2nd column, and then,
666 => 6
333 => 3
222 => 2
of the 2nd column. I hope to perform this replacement for every repeated dataset.
I tried this with Python, but the data is too big and it causes a memory error. How can I perform this swapping with Linux commands like sed, awk or cat?
Thanks
Best,
This might work for you, but you'd have to use GNU awk, as it uses the gensub() function and $0 reassignment.
Put the following into an executable awk file ( like script.awk ):
#!/usr/bin/awk -f

BEGIN {
    a[1] = a[9] = a[10] = a[11] = 6
    a[2] = a[6] = a[7] = a[8] = 3
    a[3] = a[4] = a[5] = 2
}

function swap( c2, val ) {
    val = a[c2]
    return( val=="" ? c2 : val )
}

/^( [0-9]+ )/ { $0 = gensub( /^( [0-9]+)( [0-9]+)/, "\\1 " swap($2), 1 ) }

47 # print the line
Here's the breakdown:
BEGIN - set up an array a with mappings of the new values.
create a user defined function swap to provide values for the 2nd column from the a array, or the value itself. The c2 element is passed in, while the val element is a local variable (because no 2nd argument is passed in).
when a line starts with a space followed by a number and a space (the pattern), then use gensub to replace the first occurrence of the first number pattern with itself concatenated with a space and the return from swap (the action). In this case, I'm using gensub's replacement text to preserve the first column data. The second column is passed to swap using the field data identifier $2. Using gensub should preserve the formatting of the data lines.
47 - an expression that evaluates to true provides the default action of printing $0, which for data lines might have been modified. Any line that wasn't "data" will be printed out here w/o modifications.
The provided data doesn't show all the cases, so I made up my own test file:
# 2 skip me
9 2 not going to process me
1 1 don't change the for matting
2 2 4 23242.223 data
3 3 data that's formatted
4 4 7 that's formatted
5 5 data that's formatted
6 6 data that's formatted
7 7 data that's formatted
8 8 data that's formatted
9 9 data that's formatted
10 10 data that's formatted
11 11 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
Running the executable awk (like ./script.awk data) gives the following output:
# 2 skip me
9 2 not going to process me
1 6 don't change the for matting
2 3 4 23242.223 data
3 2 data that's formatted
4 2 7 that's formatted
5 2 data that's formatted
6 3 data that's formatted
7 3 data that's formatted
8 3 data that's formatted
9 6 data that's formatted
10 6 data that's formatted
11 6 data that's formatted
12 12 data that's formatted
13 13 data that's formatted
14 s data that's formatted
# some other data
which looks alright to me, but I'm not the one with 25 million datasets.
You'd also most definitely want to try this on a smaller sample of your data first (the first few datasets?) and redirect stdout to a temp file, perhaps like:
head -n 26328 data | ./script.awk - > tempfile
You can learn more about the elements used in this script here:
awk basics (the man page)
Arrays
User defined functions
String functions - gensub()
And of course, you should spend some quality time reviewing awk related questions and answers on Stack Overflow ;)
