limit number of words in a column in a DataFrame - python-3.x

My dataframe looks like
Abc XYZ
0 Hello How are you doing today
1 Good This is a
2 Bye See you
3 Books Read chapter 1 to 5 only
max_size = 3
I want to truncate column XYZ to a maximum of 3 words (max_size). Rows with fewer than max_size words should be left as they are.
Desired output:
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1

Use str.split with a limit of max_size splits, slice off the trailing remainder, and then join the words back together:
max_size = 3
df['XYZ'] = df['XYZ'].str.split(n=max_size).str[:max_size].str.join(' ')
print (df)
Abc XYZ
0 Hello How are you
1 Good This is a
2 Bye See you
3 Books Read chapter 1
Another solution with a lambda function:
df['XYZ'] = df['XYZ'].apply(lambda x: ' '.join(x.split(maxsplit=max_size)[:max_size]))
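The per-row logic in both solutions can be checked without pandas. The sketch below is only an illustration: the helper name truncate_words and the plain list of strings are assumptions, not part of the original answer.

```python
def truncate_words(text, max_size):
    """Return at most the first `max_size` whitespace-separated words of `text`."""
    # maxsplit=max_size leaves everything past the limit in one trailing item,
    # which the slice then drops.
    return ' '.join(text.split(maxsplit=max_size)[:max_size])


# Sample values copied from the XYZ column in the question.
rows = ['How are you doing today', 'This is a', 'See you', 'Read chapter 1 to 5 only']
truncated = [truncate_words(row, 3) for row in rows]
print(truncated)
```

With a DataFrame, the same helper could presumably be passed to apply, for example df['XYZ'].apply(truncate_words, max_size=3).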

Related

Finding biggest element index in a dictionary?

Sample test cases:
Input
3 ------> length of dictionary
abcd 5 3 6 2 //dict elements (key=list[0], value=list[1:])
ok 1 8 5 3 //
best 9 1 3 5 //
Expected output
abcd 2 ----> key and index of max ele of list
ok 1
best 0
Are you looking for just the highest key?
d = {15:3, 4:1, 11:2, 14:9}
print(max(d)) # gives the max key (15)
print(d[max(d)]) # gives the value stored under the max key (3), not the max value
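If the goal is what the sample test cases describe (for each key, the index of the largest element in its list), a minimal sketch could look like this. The function name max_index_per_key and the way the input lines are parsed are assumptions made for illustration.

```python
def max_index_per_key(lines):
    """Map each key to the index of the largest number that follows it."""
    result = {}
    for line in lines:
        key, *values = line.split()
        numbers = [int(v) for v in values]
        # list.index returns the first position, so ties resolve to the earliest index.
        result[key] = numbers.index(max(numbers))
    return result


# Sample input lines from the question, without the leading length line.
sample = ['abcd 5 3 6 2', 'ok 1 8 5 3', 'best 9 1 3 5']
indices = max_index_per_key(sample)
for key, idx in indices.items():
    print(key, idx)
```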

Highest frequency in a dataframe

I am looking for a way to get the value with the highest frequency in an entire pandas DataFrame, not just in a particular column. I have looked at value_counts, but it seems to work only per column. Is there a way to do that?
Use DataFrame.stack with Series.mode to get the most frequent values, then select the first one by position:
import pandas as pd
df = pd.DataFrame({
    'B':[4,5,4,5,4,4],
    'C':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
})
a = df.stack().mode().iat[0]
print (a)
4
Or, if you also need the frequency, use Series.value_counts:
s = df.stack().value_counts()
print (s)
4 6
5 4
3 3
9 2
7 2
2 2
1 2
8 1
6 1
0 1
dtype: int64
print (s.index[0])
4
print (s.iat[0])
6
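For comparison, the same whole-table count can be sketched with the standard library alone. This is not the pandas approach from the answer: it assumes the columns are available as plain lists, and the name columns is hypothetical.

```python
from collections import Counter

# Same sample values as the DataFrame in the answer, as plain lists.
columns = {
    'B': [4, 5, 4, 5, 4, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
}

# Flatten every column into one stream of values and count them.
counts = Counter(value for values in columns.values() for value in values)
top_value, top_count = counts.most_common(1)[0]
print(top_value, top_count)
```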

using index match with sum if

I need to combine SUMIF() with INDEX/MATCH (I'm guessing here) but don't really know where to start.
Basically I have a table with different classes of pets, their species, and quantities. There are 3 stores. I need an output that gives the quantity of each species for a chosen store dynamically.
data table:
"A1" Pet Stores
Species Class a b c
cat Fluffy1 1 0 0
cat Fluffy2 3 0 0
cat Fluffy3 5 7 1
cat Fluffy4 6 0 7
dog Barky1 7 6 9
dog Barky2 1 3 9
dog Barky3 0 2 8
dog Barky4 0 2 3
fish Swimmy1 0 0 0
fish Swimmy2 1 3 0
fish Swimmy3 0 2 3
fish Swimmy4 0 0 0
Output:
Pet Store a <--change this
cat 15 <--output
dog 8 <--output
fish 1 <--output
Right now my formula for "cat" is =SUMIF($A$3:$A$14,A17,$C$3:$C$14). However, it only looks down the one column that I've set. How do I change it so that it looks up the "Pet Store" and returns the sum of the respective column?
How about this:
Formula in cell H3 copied down is
=SUMIF($A$2:$A$13,G3,INDEX($C$2:$E$13,,MATCH(H$2,$C$1:$E$1,0)))
Slightly shorter than @teylyn's version:
=SUMIF(A$2:A$13,A16,OFFSET(C$2:C$13,,CODE(B$15)-97))
but less versatile, as it relies on the shop names being single lowercase letters starting at "a" (which matches the example and makes sense for column labels).
However, my preference would be for a PivotTable.
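To make the logic of the SUMIF plus INDEX/MATCH formula explicit, here is a rough Python analogue: pick the store column by its header (the MATCH step), then sum that column per species (the SUMIF step). The names header, rows, and sum_by_species are assumptions for illustration only.

```python
header = ['Species', 'Class', 'a', 'b', 'c']
rows = [
    ('cat', 'Fluffy1', 1, 0, 0), ('cat', 'Fluffy2', 3, 0, 0),
    ('cat', 'Fluffy3', 5, 7, 1), ('cat', 'Fluffy4', 6, 0, 7),
    ('dog', 'Barky1', 7, 6, 9), ('dog', 'Barky2', 1, 3, 9),
    ('dog', 'Barky3', 0, 2, 8), ('dog', 'Barky4', 0, 2, 3),
    ('fish', 'Swimmy1', 0, 0, 0), ('fish', 'Swimmy2', 1, 3, 0),
    ('fish', 'Swimmy3', 0, 2, 3), ('fish', 'Swimmy4', 0, 0, 0),
]


def sum_by_species(store):
    """Sum the quantities of each species for the given store column."""
    col = header.index(store)  # plays the role of MATCH on the header row
    totals = {}
    for row in rows:
        # Accumulate per species, like one SUMIF per output row.
        totals[row[0]] = totals.get(row[0], 0) + row[col]
    return totals


print(sum_by_species('a'))
```

Changing the argument to 'b' or 'c' corresponds to editing the "Pet Store" cell in the output table.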

Creating a Two-Mode Network

Using Python 3.2 I am trying to turn data from a CSV file into a two-mode network. For those who do not know what that means, the idea is simple:
This is a snippet of my dataset:
Project_ID Name_1 Name_2 Name_3 Name_4 ... Name_150
1 Jean Mike
2 Mike
3 Joe Sarah Mike Jean Nick
4 Sarah Mike
5 Sarah Jean Mike Joe
I want to create a CSV that puts the Project_IDs across the first row of the CSV and each unique name down the first column (with cell A1 blank) and then a 1 in the i,j cell if that person worked on a given project. NOTE: My data has full names (with middle initial), with no two people having the same name so there will not be any duplicates.
The final data output would look like this:
1 2 3 4 5
Jean 1 0 1 0 1
Mike 1 1 1 1 1
Joe 0 0 1 0 1
Sarah 0 0 1 1 1
... ... ... ... ... ...
Nick 0 0 1 0 0
Start by using the csv reader:
import csv
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
Note that each row is read as a list of strings, one list per line.
The output array should probably be created before you start. As in this question, here is how you could do that:
buckets = [[0 for col in range(5)] for row in range(10)]
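Putting the two steps together, one possible sketch of the whole conversion is shown below. It assumes a comma-separated file with a header row and empty cells for missing names; the inline sample, the name build_two_mode, and the reliance on dict insertion order (Python 3.7+) are all assumptions, not part of the original answer.

```python
import csv
import io

# Hypothetical sample mirroring the snippet in the question.
raw = """Project_ID,Name_1,Name_2,Name_3,Name_4,Name_5
1,Jean,Mike,,,
2,Mike,,,,
3,Joe,Sarah,Mike,Jean,Nick
4,Sarah,Mike,,,
5,Sarah,Jean,Mike,Joe,
"""


def build_two_mode(rows):
    """Return a table with project ids across the top and names down the side."""
    projects = []
    members = {}  # name -> set of project ids that person worked on
    for row in rows:
        project_id = row[0]
        projects.append(project_id)
        for name in row[1:]:
            if name:  # skip empty cells
                members.setdefault(name, set()).add(project_id)
    table = [[''] + projects]  # cell A1 stays blank
    for name, worked_on in members.items():
        table.append([name] + [1 if p in worked_on else 0 for p in projects])
    return table


reader = csv.reader(io.StringIO(raw))
next(reader)  # skip the header row
table = build_two_mode(reader)
for line in table:
    print(line)
```

The resulting table could then be written out with csv.writer(...).writerows(table).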

Pattern decoding

I need a little help with the following. I have this kind of data file:
0 0 # <--- Group 1 -- 1 house (0) and 1 room (0)

0 0 # <--- Group 2 -- 2 houses (0;1) and 3,2 rooms (0,1,2;0,1)
0 1
0 2
1 0 # <--- house 2 in Group 2, with the first room (0)
1 1 # <--- house 2 in Group 2, with the second room (1)

0 0 # <--- Group 3
0 1 # <--- house 1 in Group 3, with the second room (1)
0 2

0 0 # <--- Group 4
1 0 # <--- house 2 in Group 4, with one room only (0)
2 0
3 0 # <--- house 4 in Group 4, with one room only (0)

0 0 # <--- Group 5

0 0 # <--- Group 6
There are several cases that need to be answered.
The example contains groups; a group is a block separated from the others by empty lines, so in this case we have 6 groups. We have to determine the following:
Get the ordinal number of the group (the counter starts, for example, from 1)
if the 1st column = 0 and the 2nd column = 0 and the next line is empty
So the desired output according to the above example would be
1
5
6
if the first column = 0 and the 2nd column can vary and the next line is empty
So the desired output according to the above example would be
3
... etc. How can this be generalized so that we can choose at the beginning which case we would like to get?
There might be many cases according to the values of the columns in a group.
We can picture it like this: the first column means the number of houses in a street, and the second column means the number of rooms in a house. Now I would like to find all possible kinds of streets in a city, which means, for example:
let us pick those streets in which there are two houses with different numbers of rooms: 3 rooms in the first house and 2 rooms in the second. The output should be 2, because this group in the file fulfills the requirement:
0 0
0 1
0 2
1 0
1 1
Important: 0 0 means there is one house with one room
Correction: if there is only one house, then it always has just one room, as in Group 1, Group 5, and Group 6. Remember that the second column is the room number: 0 means "1 room", 1 means "2 rooms", etc. It is just a counter that starts from 0 instead of 1; sorry if that is a little confusing.
I don't know what your expected output would be; however, I have decoded your number pattern into a meaningful group/house/rooms format. Any further "query" could be run on this content.
See below:
kent$ cat file
0 0

0 0
0 1
0 2
1 0
1 1

0 0
0 1
0 2

0 0
1 0
2 0
3 0

0 0

0 0
awk:
kent$ awk 'BEGIN{RS=""}
{ print "\ngroup "++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "House#: %s , Room(s): %s \n", x, a[x]; }' file
we get output:
group 1
House#: 0 , Room(s): 1
group 2
House#: 0 , Room(s): 3
House#: 1 , Room(s): 2
group 3
House#: 0 , Room(s): 3
group 4
House#: 0 , Room(s): 1
House#: 1 , Room(s): 1
House#: 2 , Room(s): 1
House#: 3 , Room(s): 1
group 5
House#: 0 , Room(s): 1
group 6
House#: 0 , Room(s): 1
note that the generated format could be changed to fit your "filter" or "query"
UPDATE
OP's comment:
I need to know, the number of the group(s) which have/has for example
1 house with one room. The output would be in the above case: 1, 5 ,6
as I said, based on your query criteria, we could adjust the awk output for the next step. now I change the awk above to:
awk 'BEGIN{RS=""}
{print ""; gid=++g;
delete a;
for(i=1;i<=NF;i++) if(i%2) a[$i]++;
for(x in a) printf "%s %s %s\n", gid,x, a[x]; }' file
this will output:
1 0 1

2 0 3
2 1 2

3 0 3

4 0 1
4 1 1
4 2 1
4 3 1

5 0 1

6 0 1
the format is groupIdx houseIdx numberOfRooms and there is a blank line between groups. we save the text above to a file named decoded.txt
so your query could be done on this text:
kent$ awk 'BEGIN{RS="\n\n"}{if (NF==3 && $3==1)print $1}' decoded.txt
1
5
6
the last awk line above means, print the group number, if room number ($3) = 1 and there is only one line in the group block.
I would first define a House class and a Group class:
class House:
    def __init__(self, rooms):
        self.rooms = rooms
class Group:
    def __init__(self, index, houses):
        self.index = index
        # houses.values() is a list with number of rooms for each house.
        self.houses = [House(houses[house_nr]) for house_nr in sorted(houses)]
    def __str__(self):
        return 'Group {}'.format(self.index)
    def __repr__(self):
        return 'Group {}'.format(self.index)
Then parse the data into this hierarchical structure:
import collections
with open('in.txt') as f:
    groups = []
    # Variable to accumulate current group.
    group = collections.defaultdict(int)
    i = 1
    for line in f:
        if not line.strip():
            # Empty line found, create a new group.
            groups.append(Group(i, group))
            # Reset accumulator.
            group = collections.defaultdict(int)
            i += 1
            continue
        house_nr, room_nr = line.split()
        group[house_nr] += 1
    # Create the last group at EOF
    groups.append(Group(i, group))
Then you can do stuff like this:
found = filter(
    lambda g:
        len(g.houses) == 1 and # Group contains one house
        g.houses[0].rooms == 1, # First house contains one room
    groups)
print(list(found)) # Prints [Group 1, Group 5, Group 6]
found = filter(
    lambda g:
        len(g.houses) == 2 and # Group contains two houses
        g.houses[0].rooms == 3 and # First house contains three rooms
        g.houses[1].rooms == 2, # Second house contains two rooms
    groups)
print(list(found)) # Prints [Group 2]
Perl solution. It converts the input into this format:
1|0
2|1 2
3|2
4|0 0 0 0
5|0
6|0
The first column is the group number; the second column lists the number of rooms (minus one) of each of its houses, sorted. To search for groups with two houses having 2 and 3 rooms, you can just grep '|1 2$'; to search for groups with just one house with one room, grep '|0$'.
#!/usr/bin/perl
#-*- cperl -*-
#use Data::Dumper;
use warnings;
use strict;
sub report {
    print join ' ', sort {$a <=> $b} @_;
    print "\n";
}
my $group = 1;
my @last = (0);
print '1|';
my @houses = ();
while (<>) {
    if (/^$/) { # group end
        report(@houses, $last[1]);
        undef @houses;
        print ++$group, '|';
        @last = (0);
    } else {
        my @tuple = split;
        if ($tuple[0] != $last[0]) { # new house
            push @houses, $last[1];
        }
        @last = @tuple;
    }
}
report(@houses, $last[1]);
It is based on the fact that for each house, only the last line is important.
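The same "signature" idea can be sketched in Python: reduce every group to the sorted list of each house's last room index, then compare that list against the pattern you are looking for. The function name group_signatures and the inline sample are assumptions, and the sketch expects groups to be separated by exactly one empty line.

```python
def group_signatures(text):
    """Map each group number to the sorted last room index of each of its houses."""
    signatures = {}
    blocks = [b for b in text.strip().split('\n\n') if b.strip()]
    for number, block in enumerate(blocks, start=1):
        rooms = {}
        for line in block.splitlines():
            house, room = line.split()
            # Only the last line of each house matters, so later lines overwrite.
            rooms[house] = int(room)
        signatures[number] = sorted(rooms.values())
    return signatures


# Sample data from the question, with blank lines between groups.
data = """0 0

0 0
0 1
0 2
1 0
1 1

0 0
0 1
0 2

0 0
1 0
2 0
3 0

0 0

0 0"""

signatures = group_signatures(data)
# One house with one room corresponds to the signature [0].
one_house_one_room = [g for g, sig in signatures.items() if sig == [0]]
# Two houses with 2 and 3 rooms correspond to the signature [1, 2].
two_and_three_rooms = [g for g, sig in signatures.items() if sig == [1, 2]]
print(one_house_one_room, two_and_three_rooms)
```

Any other case from the question becomes a different list to compare against, which is what makes the approach easy to generalize.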

Resources