How to extract data using Python from a text file

I have been having trouble reading, manipulating, and extracting data from a txt file. The text file has a general header with various information, set up something like the example below:
~ECOLOGY
~LOCATION
LAT: 59
LONG: 23
~PARAMETERS
Area. 8
Distribution. 3
Diversity. 5
~DATA X Y CONF DECID PEREN
3 6 1 3 0
7 2 4 2 1
4 8 0 6 2
9 9 6 2 0
2 3 2 5 4
6 5 0 2 7
7 1 2 4 2
I want to extract the column headers and use them as an index or key, since the types of column data can change between files and the number of data rows can fluctuate as well. I want to read the data in each column so that, depending on location, I can sum columns as shown below and export the result as a separate file:
~DATA X Y CONF DECID PEREN TOTAL
3 6 1 3 0 4
7 2 4 2 1 7
4 8 0 6 2 8
9 9 6 2 0 8
2 3 2 5 4 11
6 5 0 2 7 9
7 1 2 4 2 8
Any suggestions?
This is what I have so far:
with open("ECOLOGY.txt", "r") as E:
    for i, line in enumerate(E):
        sep_lines = line.rsplit()
        if "~DATA" in sep_lines:
            key = line.rsplit()
            key.remove('~DATA')
            for j, value in enumerate(key):
                print(j, value)
            print(key)
            # map each column header to its position
            index_of = {L: v for v, L in enumerate(key)}
            print(index_of)

Life would be much easier for you if you learned a smidgen of Pandas. But you can do it without.
with open('ttl.txt') as ttl:
    # skip the lines that precede the ~DATA line (adjust the count to your file)
    for _ in range(10):
        next(ttl)
    first = True
    for line in ttl:
        line = line.rstrip()
        if first:
            # header line: append the new column name
            first = False
            labels = line.split() + ['TOTAL']
            fmt = 7 * '{:<9s}'
            print(fmt.format(*labels))
        else:
            # data line: TOTAL is the sum of the last three columns
            numbers = [int(_) for _ in line.split()]
            total = sum(numbers[-3:])
            other_items = numbers + [total]
            fmt = 6 * '{:<9d}'
            fmt = '{:<9s}' + fmt
            print(fmt.format('', *other_items))
~DATA X Y CONF DECID PEREN TOTAL
3 6 1 3 0 4
7 2 4 2 1 7
4 8 0 6 2 8
9 9 6 2 0 8
2 3 2 5 4 11
6 5 0 2 7 9
7 1 2 4 2 8
next skips lines in the input file. You can use split() to split input lines on whitespace, then use string formatting to put the items back together the way you want them.
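A fixed skip count breaks as soon as the header grows or shrinks. As a minimal sketch, assuming the data block always starts at a line beginning with ~DATA, the header can be located by scanning instead. The function name read_data_section and the in-memory SAMPLE text are illustrative, not part of the original answer.

```python
import io

# Hypothetical sample mirroring the layout shown in the question.
SAMPLE = """~ECOLOGY
~LOCATION
LAT: 59
LONG: 23
~PARAMETERS
Area. 8
Distribution. 3
Diversity. 5
~DATA X Y CONF DECID PEREN
3 6 1 3 0
7 2 4 2 1
"""

def read_data_section(lines):
    """Return (headers, rows) for the block that starts at the ~DATA line."""
    lines = iter(lines)
    for line in lines:
        if line.startswith("~DATA"):
            headers = line.split()[1:]  # drop the ~DATA marker itself
            break
    else:
        raise ValueError("no ~DATA line found")
    # The same iterator continues with the rows that follow the header.
    rows = [[int(tok) for tok in line.split()] for line in lines if line.strip()]
    return headers, rows

headers, rows = read_data_section(io.StringIO(SAMPLE))
for row in rows:
    record = dict(zip(headers, row))  # header name -> value for this row
    print(record, "TOTAL =", record["CONF"] + record["DECID"] + record["PEREN"])
```

Because each row becomes a dictionary keyed by header name, the sum does not depend on the column order of a particular file.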

This is a very basic, fragile, format-dependent solution, but I hope it helps.
with open("test.txt") as f:
    data_part_reached = False
    for line in f:
        if "~DATA" in line:
            # one list per column, seeded with its header
            columns = [[elem] for elem in line.split() if elem != "~DATA"]
            data_part_reached = True
        elif data_part_reached:
            values = [int(elem) for elem in line.split()]
            for i in range(len(columns)):
                columns[i].append(values[i])
columns =
[['X', 3, 7, 4, 9, 2, 6, 7],
['Y', 6, 2, 8, 9, 3, 5, 1],
['CONF', 1, 4, 0, 6, 2, 0, 2],
['DECID', 3, 2, 6, 2, 5, 2, 4],
['PEREN', 0, 1, 2, 0, 4, 7, 2],
['TOTAL', 4, 7, 8, 8, 11, 9, 8]]
This gives you a list of lists in which the first element of each list is the header and the rest are the values. I cast the values to int since you said you want to do arithmetic with them. If you want, you can turn this list into a dict in which each key is a header and each value is that column's list of values, like this:
d = {}
for column in columns:
    # pop(0) removes the header, leaving only the values in the list
    d[column.pop(0)] = column
d =
{'DECID': [3, 2, 6, 2, 5, 2, 4],
'PEREN': [0, 1, 2, 0, 4, 7, 2],
'CONF': [1, 4, 0, 6, 2, 0, 2],
'X': [3, 7, 4, 9, 2, 6, 7],
'TOTAL': [4, 7, 8, 8, 11, 9, 8],
'Y': [6, 2, 8, 9, 3, 5, 1]}
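If the input file has no TOTAL column yet, it can be derived from a column dictionary of this shape. The following is a small sketch rather than part of the original answer; the helper name add_total is hypothetical, and the values are copied from the output above.

```python
# Column dictionary in the shape produced above (values copied from the output).
d = {
    "X": [3, 7, 4, 9, 2, 6, 7],
    "Y": [6, 2, 8, 9, 3, 5, 1],
    "CONF": [1, 4, 0, 6, 2, 0, 2],
    "DECID": [3, 2, 6, 2, 5, 2, 4],
    "PEREN": [0, 1, 2, 0, 4, 7, 2],
}

def add_total(columns, names, target="TOTAL"):
    """Add a column holding the row-wise sum of the named columns."""
    # zip(*...) walks the selected columns in parallel, one row at a time.
    columns[target] = [sum(vals) for vals in zip(*(columns[n] for n in names))]
    return columns

add_total(d, ["CONF", "DECID", "PEREN"])
print(d["TOTAL"])  # [4, 7, 8, 8, 11, 9, 8]
```

Selecting the columns by name keeps the sum correct even when a file lists its columns in a different order.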

Create an empty dictionary to store all the needed data.
Read from the file object E and loop until you reach a line starting with ~DATA.
Split that line into header items, append TOTAL, and break out of the loop.
Create a list to store the remaining data.
Loop over the remaining lines, split each one, and append its sum as the total.
Append each row's list to the data list.
When the loop ends, add the list of lists to the dictionary.
dic = {}
with open("ECOLOGY.txt") as E:
    # read up to the ~DATA line and keep its column names
    for line in E:
        if line[:5] == '~DATA':
            dic['header'] = line.split()[1:] + ['TOTAL']
            break
    data = []
    # the file iterator resumes at the first data row
    for line in E:
        cols = line.split()
        cols.append(sum([int(num) for num in cols[2:]]))
        data.append(cols)
dic['data'] = data
The dictionary will have the form {'header': [...], 'data': [[...], ...]}.
edit: Added missing dic declaration at the beginning of code.
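The question also asks to export the result as a separate file. A minimal sketch of that step, assuming a dictionary with the 'header' and 'data' keys described above, could look like this. The function name write_data_section is hypothetical, and an in-memory buffer stands in for a real output file.

```python
import io

# Assumed shape of the dictionary built above; values taken from the question's data.
dic = {
    "header": ["X", "Y", "CONF", "DECID", "PEREN", "TOTAL"],
    "data": [["3", "6", "1", "3", "0", 4], ["7", "2", "4", "2", "1", 7]],
}

def write_data_section(dic, out):
    """Write the header line (with the ~DATA marker restored) and each row."""
    out.write(" ".join(["~DATA"] + dic["header"]) + "\n")
    for row in dic["data"]:
        # rows mix strings and an int total, so convert every item to str
        out.write(" ".join(str(item) for item in row) + "\n")

buffer = io.StringIO()  # stand-in for a file opened with mode "w"
write_data_section(dic, buffer)
print(buffer.getvalue(), end="")
```

Passing a file object opened for writing instead of the buffer would produce the separate output file.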

Related

Print a groupby object for a specific group/groups only

I need to print the result of groupby object in Python for a specific group/groups only.
Below is the dataframe:
import pandas as pd
df = pd.DataFrame({'ID' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
'Entry' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6]})
print("\n df = \n",df)
To group the dataframe by ID and print the result, I used this code:
grouped_by_unit = df.groupby(by="ID")
print("\n", grouped_by_unit.apply(print))
Can somebody please let me know below two things:
How can I print the data frame grouped by 'ID=1' only?
I need to get the below output:
Likewise, how can I print the data frame grouped by 'ID=1' and 'ID=4' together?
I need to get the below output:
You can iterate over the groups, for example with a for loop:
grouped_by_unit = df.groupby(by="ID")
for id_, g in grouped_by_unit:
    if id_ == 1 or id_ == 4:
        print(g)
        print()
Prints:
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
ID Entry
12 4 1
13 4 2
14 4 3
15 4 4
16 4 5
17 4 6
You can use the get_group method:
df.groupby(by="ID").get_group(1)
which prints
ID Entry
0 1 1
1 1 2
2 1 3
3 1 4
You can use the same method to print the group for the key 4.
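To get groups 1 and 4 together as one frame, two options are sketched here under the assumption that the frame from the question is used: concatenating the individual groups, or filtering the original frame with isin. The variable names wanted and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4],
    "Entry": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6],
})

grouped = df.groupby("ID")

# Option 1: fetch each group and stack them in the requested order.
wanted = pd.concat([grouped.get_group(key) for key in (1, 4)])

# Option 2: skip groupby and keep the rows whose ID is in the list.
filtered = df[df["ID"].isin([1, 4])]

print(wanted)
```

Both variables hold the same ten rows with their original index labels; the isin filter is the shorter choice when no per-group processing is needed.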

Python Pandas Dataframe: How to join 3 rows of data, separated by columns, then repeat this result for the 3 rows of data involved

Here's the pandas dataframe that I'm using to learn how to do this:
import pandas as pd
test_list = pd.DataFrame()
test_list["Item"] = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
test_list["Number"] = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"]
test_list["Combined Numbers"]= ""
Based on that dataframe above, I intend to combine up to 3 numbers, separated by commas.
Following that, I intend to repeat this combined value I now have, for each of the test_list["Item"] and test_list["Number"] involved.
I've been scratching my head figuring it out so far. So far I've seen examples of groupby() function for situations like combining information based on a given criteria, like a duplicate value from a column. I'm learning to explore if I don't have anything to refer to, how can I work this out instead?
Here's my intended goal:
Item  Number  Combined Numbers
A     1       1, 2, 3
B     2       1, 2, 3
C     3       1, 2, 3
D     4       4, 5, 6
E     5       4, 5, 6
F     6       4, 5, 6
G     7       7, 8, 9
H     8       7, 8, 9
I     9       7, 8, 9
J     10      10, 11
K     11      10, 11
Thank you
In your case, since you don't have a column that defines the groups, you can generate a range the length of your dataframe (with np.arange) and floor-divide it by 3 (//3). Use groupby.transform to keep the original shape of your data, and do the join operation on the column.
import numpy as np
test_list["Combined Numbers"] = (
    test_list.groupby(np.arange(len(test_list))//3)
    ['Number'].transform(', '.join)
)
print(test_list)
# Item Number Combined Numbers
# 0 A 1 1, 2, 3
# 1 B 2 1, 2, 3
# 2 C 3 1, 2, 3
# 3 D 4 4, 5, 6
# 4 E 5 4, 5, 6
# 5 F 6 4, 5, 6
# 6 G 7 7, 8, 9
# 7 H 8 7, 8, 9
# 8 I 9 7, 8, 9
# 9 J 10 10, 11
# 10 K 11 10, 11
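To make the grouping key visible, the following sketch prints the array that np.arange and floor division produce before it is passed to groupby. It rebuilds the frame from the question; the name group_key is illustrative.

```python
import numpy as np
import pandas as pd

test_list = pd.DataFrame({
    "Item": list("ABCDEFGHIJK"),
    "Number": [str(n) for n in range(1, 12)],
})

# Row positions 0..10 floor-divided by 3 give one label per block of three rows.
group_key = np.arange(len(test_list)) // 3
print(group_key)  # [0 0 0 1 1 1 2 2 2 3 3]

# transform broadcasts each block's joined string back to every row in the block.
test_list["Combined Numbers"] = (
    test_list.groupby(group_key)["Number"].transform(", ".join)
)
print(test_list)
```

Changing the divisor changes the block size, and the last block simply holds whatever rows remain.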

How to aggregate n previous rows as list in Pandas DataFrame?

As the title says:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
Having a dataframe with 10 values we want to aggregate say last 5 rows and put them as list into a new column:
>>> a new_col
0
0 1
1 2
2 3
3 4
4 5 [1,2,3,4,5]
5 6 [2,3,4,5,6]
6 7 [3,4,5,6,7]
7 8 [4,5,6,7,8]
8 9 [5,6,7,8,9]
9 10 [6,7,8,9,10]
How?
Because of how rolling windows are implemented, you cannot aggregate the windows into lists directly, but you can still reach the desired result by iterating over each window and storing its values as a list:
>>> new_col_values = [
...     window.to_list() if len(window) == 5 else None
...     for window in df["column"].rolling(5)
... ]
>>> df["new_col"] = new_col_values
>>> df
column new_col
0 1 None
1 2 None
2 3 None
3 4 None
4 5 [1, 2, 3, 4, 5]
5 6 [2, 3, 4, 5, 6]
6 7 [3, 4, 5, 6, 7]
7 8 [4, 5, 6, 7, 8]
8 9 [5, 6, 7, 8, 9]
9 10 [6, 7, 8, 9, 10]
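If iterating over a rolling object is not available in your pandas version, the same column can be built with plain list slicing. This is an alternative sketch, not the answer's method; it assumes the column is named "column" as in the answer, and the window size n is a free parameter.

```python
import pandas as pd

df = pd.DataFrame({"column": range(1, 11)})
n = 5  # window size

values = df["column"].tolist()
# For each position i, take the n values ending at i; leave None until a full window exists.
df["new_col"] = [
    values[i - n + 1 : i + 1] if i >= n - 1 else None
    for i in range(len(values))
]
print(df)
```

The slice bounds are the only part that depends on the window size, so the same comprehension works for any n.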

How to create a separate df after applying groupby?

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is to:
For each Product, every Step must be grabbed and the order must not be changed, that is, if we look at Product 1, after Step 8, there is a 1 coming and that 1 must be after 8 only. So, the expected output for product 1 and product 3 should be of the order: 1, 3, 6, 8, 1, 4; for the product 2 it must be: 2, 4, 8, 3, 1.
Update:
Here, I only want one value of 6 for product 1 and 3, since in the main df both the 6 next to each other, but both the values of 1 must be present since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the below example: Product 1 and 3 have same Steps, so they must be grouped together)
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].
If I understand correctly, you can create a Newkey column using diff and cumsum:
df['Newkey']=df.groupby('Product').Step.apply(lambda x : x.diff().ne(0).cumsum())
df.drop_duplicates(['Product','Newkey'],inplace=True)
s=df.groupby('Product').Step.apply(tuple)
s.reset_index().groupby('Step').Product.apply(list)
Step
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object
groupby preserves the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, though not especially performant, solution is to apply(tuple): tuples are hashable, so you can group on them to see which Products are identical. form_seq ensures that consecutive repeated values appear only once in the list of steps before the tuple is formed.
def form_seq(x):
    # keep a value only if it differs from the one directly before it
    x = x[x != x.shift()]
    return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
s.groupby(s).groups
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or if you'd like a DataFrame:
s.reset_index().groupby('Step').Product.apply(list)
#Step
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
In the dictionary returned by s.groupby(s).groups, the values are the groupings of products that share a step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.
Another very similar way:
df_no_dups=df[df.shift()!=df].dropna(how='all').ffill()
df_no_dups_grouped=df_no_dups.groupby('Product')['Step'].apply(list)
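For comparison, the same two steps (collapse consecutive repeats, then group products by their sequence) can be sketched without pandas using itertools.groupby from the standard library. This is an illustrative alternative, not one of the answers above; the rows list reproduces the question's data.

```python
from collections import defaultdict
from itertools import groupby

# (Product, Step) pairs from the question, in their original order.
rows = [
    (1, 1), (1, 3), (1, 6), (1, 6), (1, 8), (1, 1), (1, 4),
    (2, 2), (2, 4), (2, 8), (2, 8), (2, 3), (2, 1),
    (3, 1), (3, 3), (3, 6), (3, 6), (3, 8), (3, 1), (3, 4),
]

# Collect each product's steps, preserving row order.
steps_by_product = defaultdict(list)
for product, step in rows:
    steps_by_product[product].append(step)

# Collapse runs of equal consecutive steps, then group products by the result.
products_by_sequence = defaultdict(list)
for product, steps in steps_by_product.items():
    # groupby yields one key per run of equal consecutive values
    sequence = tuple(key for key, _ in groupby(steps))
    products_by_sequence[sequence].append(product)

print(dict(products_by_sequence))
# {(1, 3, 6, 8, 1, 4): [1, 3], (2, 4, 8, 3, 1): [2]}
```

Only adjacent duplicates are collapsed, so the second 1 for products 1 and 3 survives, which is the behavior the question asks for.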

Set 3 level of column names in pandas DataFrame

I'm trying to have a frame with the following structure
h/a totales
sub1 sub2 sub1 sub2
a b ... f g ....m a b ... f g ....m
That is, 2 labels for the first layer, again 2 labels for the second one, and then a subset of column names, where sub1 and sub2 don't have the same column names.
In order to do so I did the following:
columnas=pd.MultiIndex.from_product([['h/a','totals'],['means','percentages'],
[('means','a'),('means','b'),....('percentage','g'),....],
names=['data level 1','data level 2','data level 3']])
data=[data,pata,......]
newframe=pd.DataFrame(data,columns=columnas)
What I get is this error:
>ValueError: Shape of passed values is (1, 21), indices imply (84, 21)
How can I fix this to have a multi leveled frame by column names?
Thank you
I think you need MultiIndex.from_tuples, with the tuples built by list comprehensions:
import numpy as np
import pandas as pd
L1 = list('abc')
L2 = list('ghi')
# one 3-level tuple per column; each sub-level gets its own set of names
tups = ([('h/a','means', x) for x in L1] +
        [('h/a','percentage', x) for x in L2] +
        [('totals','means', x) for x in L1] +
        [('totals','percentage', x) for x in L2])
columnas=pd.MultiIndex.from_tuples(tups, names=['data level 1','data level 2','data level 3'])
print (columnas)
MultiIndex(levels=[['h/a', 'totals'],
['means', 'percentage'],
['a', 'b', 'c', 'g', 'h', 'i']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
names=['data level 1', 'data level 2', 'data level 3'])
#some random data
np.random.seed(785)
data = np.random.randint(10, size=(3, 12))
print (data)
[[8 0 4 1 2 5 4 1 4 1 1 8]
[1 5 0 7 4 8 4 1 3 8 0 2]
[5 9 4 9 4 6 3 7 0 5 2 1]]
newframe=pd.DataFrame(data,columns=columnas)
print (newframe)
data level 1 h/a totals
data level 2 means percentage means percentage
data level 3 a b c g h i a b c g h i
0 8 0 4 1 2 5 4 1 4 1 1 8
1 1 5 0 7 4 8 4 1 3 8 0 2
2 5 9 4 9 4 6 3 7 0 5 2 1
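One plausible reading of the original error is that from_product builds every combination of its input lists, so the index ends up with far more columns than the data has. The sketch below compares the sizes of the two constructions; the level names follow the answer, and the variable names full and ragged are illustrative.

```python
import pandas as pd

level1 = ["h/a", "totals"]
level2 = ["means", "percentage"]
level3 = list("abcghi")

# from_product: every combination of the three lists.
full = pd.MultiIndex.from_product([level1, level2, level3])
print(len(full))  # 2 * 2 * 6 = 24

# from_tuples: only the combinations that are listed explicitly.
ragged = pd.MultiIndex.from_tuples(
    [(a, "means", x) for a in level1 for x in "abc"]
    + [(a, "percentage", x) for a in level1 for x in "ghi"]
)
print(len(ragged))  # 12
```

The number of entries in the column index has to match the number of data columns, which is why listing the tuples explicitly fits a layout where the sub-levels have different column names.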
