Excel to pandas, whitespace as row delimiter - excel

I have some badly formatted Excel data that looks like this:
aaaa
bbbb
cccc

aaaa
bbbb
cccc
dddd
Is there a decent way, using the whitespace as a delimiter, to turn each segment into a pandas row, with a default value to fix the raggedness?
I'd like the end result to be something like this:
aaaa bbbb cccc ""
aaaa bbbb cccc dddd
Thanks!

I think I was able to get something that works by bringing the column in as a list, and then trying to pull out each section of the list separated by any number of '' values. There may be a better way to do these things, but maybe the idea is helpful (and it does work for my example at least).
The sample data I used here is like yours, with another little chunk at the end. Hopefully I'm understanding it right.
l = ['aaaa',
'bbbb',
'cccc',
'',
'',
'',
'aaaa',
'bbbb',
'cccc',
'dddd',
'',
'asdf',
'badfd'
]
With that list l, the idea is to loop once for each '' item, pulling out the elements between lastspace (the index of the previous blank value) and thisspace (the index of the current blank value).
It needs some special handling for runs of consecutive '' values. That is the while thisspace == lastspace+1... piece: when that happens, it skips ahead to the next '' value. Because we increment the loop counter whenever we hit these consecutive blanks, the outer loop has to be a while loop rather than a for loop (whose counter can't be advanced manually).
l.index() raises a ValueError when it can't find the value, so once we have gone past the last '' we need that error handling to collect the final chunk.
import pandas as pd

d = []
lastspace = -1   # index of the last '' seen (-1 before the first chunk)
i = 0
while i <= l.count(''):
    try:
        thisspace = l.index('', lastspace+1)
        # skip over runs of consecutive '' values
        while thisspace == lastspace+1:
            lastspace = thisspace
            thisspace = l.index('', lastspace+1)
            i += 1
        d.append(l[lastspace+1:thisspace])   # one chunk becomes one row
        lastspace = thisspace
        i += 1
    except ValueError:
        # no more '' left: everything remaining is the final chunk
        d.append(l[lastspace+1:])
        i += 1
df = pd.DataFrame(d)
The dataframe I get in the end looks like this:
0 1 2 3
0 aaaa bbbb cccc None
1 aaaa bbbb cccc dddd
2 asdf badfd None None
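For what it's worth, the same splitting can also be done with itertools.groupby, which avoids the manual index bookkeeping; a short sketch using the sample list above (not part of the original answer):

from itertools import groupby
import pandas as pd

l = ['aaaa', 'bbbb', 'cccc', '', '', '',
     'aaaa', 'bbbb', 'cccc', 'dddd', '',
     'asdf', 'badfd']

# group consecutive items by "is this an empty string?" and keep only the non-empty groups
d = [list(g) for empty, g in groupby(l, key=lambda x: x == '') if not empty]
df = pd.DataFrame(d)   # ragged rows are padded with None, as in the result above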

I have used the following text file test.txt to simulate your input (note the two consecutive spaces between the first cccc and the second aaaa, which produce the empty field):
aaaa bbbb cccc  aaaa bbbb cccc dddd
Is this correct?
import pandas as pd
import numpy as np

df = pd.read_csv('test.txt', delimiter=' ', header=None)
0 1 2 3 4 5 6 7
0 aaaa bbbb cccc NaN aaaa bbbb cccc dddd
And to further process the dataframe
df = df.fillna('')
print(np.array(df.loc[0,:].to_list()).reshape(-1,4))
[['aaaa' 'bbbb' 'cccc' '']
['aaaa' 'bbbb' 'cccc' 'dddd']]
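If the end result should be a DataFrame rather than a NumPy array, the reshaped values can be wrapped back up; a small sketch building on the df above:

rows = np.array(df.loc[0, :].to_list()).reshape(-1, 4)
result = pd.DataFrame(rows)   # one pandas row per four-value segment
print(result)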

Related

How to read a file and remove the first field of each line, then add an extra field to each line

I have two shell scripts: the first reads a file (File1) and removes the first field of each line (redirecting the result to File2),
and the second reads that modified file (File2) and adds an extra field to each line (writing File3).
How can I do this in a single script instead of using two shell scripts?
#!/bin/bash
# loop on all .txt files
for i in File1.txt; do
    # remove first column
    cut -d' ' -f2- < $i > File2.txt
done

#!/bin/bash
filename='File2.txt'
while read line; do
    # reading each line
    echo "$RANDOM $line" >> File3.txt
done < $filename
File1.txt
Date Field2 Field3
20111 aaaa bbbb
33111 bbbb vvvv
44444 cccc gggg
File2.txt
Field2 Field3
aaaa bbbb
bbbb vvvv
cccc gggg
File3.txt
New Field2 Field3
1 aaaa bbbb
2 bbbb vvvv
1 cccc gggg
One idea:
new="New"
while read -r ignore rest_of_line
do
echo "$new $rest_of_line"
new=$RANDOM
done < file1.txt > file3.txt
This generates:
$ cat file3.txt
New Field2 Field3
29258 aaaa bbbb
31885 bbbb vvvv
15550 cccc gggg
NOTE: it's not clear (to me) what the input/output field delimiters are, so for now I'm assuming any whitespace on input and a single space on output; this should be (relatively) easy to modify per the OP's requirements.
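For comparison, the same single-pass idea rendered in Python (purely an illustrative sketch, assuming whitespace-separated fields, the file names from the question, and that the header line should get the literal value New in place of its first field):

import random

with open("File1.txt") as src, open("File3.txt", "w") as dst:
    for n, line in enumerate(src):
        fields = line.split()
        # replace the first field: "New" on the header line, a random number elsewhere
        first = "New" if n == 0 else str(random.randint(0, 32767))
        dst.write(" ".join([first] + fields[1:]) + "\n")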

Regex for simple pattern in python

I have a string containing exactly one pair of parentheses (and some words between them), and lots of other words.
How would one create a regex to split the string into [ words before (, words between (), words after )]?
e.g.
line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
would be split into
[ "a bbbb cccc dd", "ee fff ggg", "hhh iii jk" ]
I've tried
line = re.compile("[^()]+").split(line)
but it doesn't work.
It seems that in the process you want to remove the leading and trailing whitespace, i.e., the whitespace before and after ( and ). You could try:
>>> line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
>>> re.split(r'\s*[\(\)]\s*', line)
['a bbbb cccc dd', 'ee fff ggg', 'hhh iii jk']
>>>
>>> # to make it look as in your description ...
>>> line = re.compile(r'\s*[\(\)]\s*').split(line)
>>> line
['a bbbb cccc dd', 'ee fff ggg', 'hhh iii jk']
To split the output in three, I think the simplest option is to use three capture groups: (some_regex)(another_regex)(yet_another_regex). In your case, the first part is any character that is not a (, followed by (, then any character that is not ), followed by ), and finally followed by any remaining characters.
Therefore the regex is ([^(]*)\(([^)]*)\)(.*), which you can then use to retrieve groups (your desired output):
>>> import re
>>> pattern = re.compile(r'([^(]*)\(([^)]*)\)(.*)')
>>> pattern.match(line).groups()
('a bbbb cccc dd ', ' ee fff ggg ', ' hhh iii jk')
With:
([^(]*) the first group
([^)]*) the second group
(.*) the last group
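If you also want the leading and trailing spaces removed from each group, as in the desired output, a small follow-up sketch is to strip each captured group:

import re

line = "a bbbb cccc dd ( ee fff ggg ) hhh iii jk"
pattern = re.compile(r'([^(]*)\(([^)]*)\)(.*)')
parts = [part.strip() for part in pattern.match(line).groups()]
# ['a bbbb cccc dd', 'ee fff ggg', 'hhh iii jk']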

Linux grep, how can I display lines that don't contain word 1 and word 2 but still display the lines that have both words in them

I need some help displaying all lines that contain neither word1 nor word2, while lines that contain both of them still have to be shown.
Example:
aaaa bbbb cccc
bbbb bbbb bbbb
cccc cccc cccc
dddd dddd aaaa
if word1 = aaaa and word2 = bbbb then output should be:
aaaa bbbb cccc
cccc cccc cccc
Tried
grep -Ewv "word1/word2" file.txt
but this shows only lines that don't contain them; it doesn't show the lines containing both.
I need to do this with the grep command; I forgot to mention this.
Grep version of both or none of each:
grep -v -P '((?=.*aaaa)(?!.*bbbb))|((?=.*bbbb)(?!.*aaaa))'
But please do not use grep in this case. Negative and positive lookaheads can easily lead to Catastrophic Backtracking.
GNU grep knows Perl-compatible regular expression (PCRE) syntax (option -P). This thing is still called a "regular" expression, although it is not regular anymore. Other people are more explicit and call such backtracking patterns irregular expressions.
How it works:
(?=.*aaaa) matches aaaa anywhere in the line, but does not move the cursor. After the match the next search starts at the beginning of the line.
(?!.*bbbb) matches when no bbbb is in the line and does not move the cursor either.
Both together match lines which include aaaa but do not include bbbb.
This is one of the cases you want to exclude from your search results. The second alternative behind the or condition (|) is the other one you want to exclude: any bbbb without an aaaa.
With the above, you have defined what you do not want. Next, use -v to invert the search and get what you want.
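If it helps to see the lookahead logic in isolation, the same pattern also works with Python's re module; this is purely an illustration using the sample lines from the question:

import re

pattern = re.compile(r'((?=.*aaaa)(?!.*bbbb))|((?=.*bbbb)(?!.*aaaa))')
lines = ["aaaa bbbb cccc", "bbbb bbbb bbbb", "cccc cccc cccc", "dddd dddd aaaa"]

# keep the lines the pattern does NOT match, mirroring grep -v
kept = [line for line in lines if not pattern.search(line)]
# kept == ['aaaa bbbb cccc', 'cccc cccc cccc']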
Bash version of both or none of each:
#! /bin/bash
word1=${1:-aaaa}
word2=${2:-bbbb}
while read -r line; do
    if [[ $line =~ $word1 ]]; then
        if [[ $line =~ $word2 ]]; then
            printf "%s\n" "$line"
        fi
    else
        if [[ $line =~ $word2 ]]; then
            :
        else
            printf "%s\n" "$line"
        fi
    fi
done
In my opinion, the simplest way (even though possibly not the fastest) is to find separately the lines that contain neither word and the lines that contain both words, and to concatenate the results. For example (assuming file.txt is a text file in directory test, and I pass the input values as environment variables for generality - and we are only looking for full words, not word fragments):
[mathguy@localhost test]$ more file.txt
aaaa bbbb cccc
bbbb bbbb bbbb
cccc cccc cccc
dddd dddd aaaa
[mathguy@localhost test]$ word1=aaaa
[mathguy@localhost test]$ word2=bbbb
[mathguy@localhost test]$ ( grep "\b$word1\b" file.txt | grep "\b$word2\b" ; \
> grep -v "\b$word1\b" file.txt | grep -v "\b$word2\b" ) | cat
aaaa bbbb cccc
cccc cccc cccc

how to insert a line 2-3 lines before a matching pattern

Input file:file
aaaa
bbbb
cccc
dddd
ffff *
==================
Schedule
end of file
I want to insert zzzz 2-3 lines before 'Schedule',
but it must check whether any word is present on that line or not; if a word is present, then insert zzzz on the next line.
Expected output file:
aaaa
bbbb
cccc
dddd
ffff *
zzzz
==================
Schedule
end of file
It's not clear what "before 2-3 lines" really means, but I think this is probably what you want:
$ cat tst.awk
NR==FNR {                       # first pass over the file
    if (/Schedule/) {
        tgts[NR-2]              # remember the line number two above each 'Schedule'
    }
    next
}
{ print }                       # second pass: print every line as-is
(FNR in tgts) && /ffff/ { print "zzzz" }   # and add zzzz after a remembered line that matches ffff
$ awk -f tst.awk file file
aaaa
bbbb
cccc
dddd
ffff *
zzzz
==================
Schedule
end of file
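For readers less familiar with the two-pass awk idiom, here is the same idea as a Python sketch (an illustration only, assuming the file fits in memory):

with open("file") as f:
    lines = f.read().splitlines()

# first pass: remember the line numbers two lines above each 'Schedule' line
targets = {i - 2 for i, line in enumerate(lines, start=1) if "Schedule" in line}

# second pass: print every line, appending zzzz after a remembered line that contains 'ffff'
for i, line in enumerate(lines, start=1):
    print(line)
    if i in targets and "ffff" in line:
        print("zzzz")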

In AWK, how to split consecutive rows that have the same string as a "record"?

Let's say I have below text.
aaaaaaa
aaaaaaa
bbb
bbb
bbb
ccccccccccccc
ddddd
ddddd
Is there a way to modify the text as follows?
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd
You could use something like this in awk:
$ awk '{print ($0!=p?++i:i),$0;p=$0}' file
1 aaaaaaa
1 aaaaaaa
2 bbb
2 bbb
2 bbb
3 ccccccccccccc
4 ddddd
4 ddddd
i is incremented whenever the current line differs from the previous line. p holds the value of the previous line, $0.
Alternatively, as suggested by JID:
awk '$0!=p{p=$0;i++}{print i,$0}' file
When the current line differs from p, replace p and increment i. See the comments for discussion of the pros and cons of either approach :)
A further contribution (and even shorter!) by NeronLeVelu
$ awk '{print i+=($0!=p),p=$0}' file
This version performs the addition assignment and basic assignment within the print statement. This works because the return value of each assignment is the value that has been assigned.
As pointed out in the comments, if the first line of the file is empty, the behaviour changes slightly. Assuming that the first line should always begin with a 1, the following block can be added to the start of any of the one-liners:
NR==1{p=$0;i=1}
i.e. on the first line, initialise p to the contents of the line (empty or not) and i to 1. Thanks to Wintermute for this suggestion.
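The same counting idea written out in Python, for comparison (an illustrative sketch, not part of the awk answers above):

def number_groups(lines):
    # prefix each line with a counter that increases whenever the line changes
    i, prev = 0, None
    for line in lines:
        if line != prev:      # new group: this line differs from the previous one
            i += 1
            prev = line
        yield f"{i} {line}"

for out in number_groups(["aaaaaaa", "aaaaaaa", "bbb", "bbb", "bbb",
                          "ccccccccccccc", "ddddd", "ddddd"]):
    print(out)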
