Splitting a pandas column every n characters - python-3.x

I have a dataframe where some columns contain long strings (e.g. 30000 characters). I would like to split these columns every 4000 characters so that I end up with a range of new columns containing strings of length at most 4000. I have an upper bound on the string lengths so I know there should be at most 9 new columns. I would like there to always be 9 new columns, having None/NaN in columns where the string is shorter.
As an example (with n = 10 instead of 4000 and 3 columns instead of 9), let's say I have the dataframe:
df_test = pd.DataFrame({'id': [1, 2, 3],
'str_1': ['This is a long string', 'This is an even longer string', 'This is the longest string of them all'],
'str_2': ['This is also a long string', 'a short string', 'mini_str']})
id str_1 str_2
0 1 This is a long string This is also a long string
1 2 This is an even longer string a short string
2 3 This is the longest string of them all mini_str
In this case I want to get the result
   id     str_1_1     str_1_2     str_1_3     str_1_4     str_2_1     str_2_2     str_2_3
0   1  This is a   long strin           g         NaN  This is al  so a long       string
1   2  This is an   even long   er string         NaN  a short st        ring         NaN
2   3  This is th  e longest   string of     them all    mini_str         NaN         NaN
Here, I want e.g. first row, column str_1_3 to be a string of length 1.
I tried using
df_test['str_1'].str.split(r".{10}", expand=True, n=10)
but that didn't work. It gave this as result
0 1 2 3
0 g None
1 er string None
2 them all
where the first columns aren't filled.
I also tried looping through every row and inserting '|' every 10 characters and then splitting on '|' but that seems tedious and slow.
Any help is appreciated.

The answer is quite simple: insert a delimiter every n characters and split on it.
For example, with | as the delimiter, chunks of 10 characters, and at most n = 4 splits:
import pandas as pd
series = pd.Series(['This is an even longer string', 'This is the longest string of them all'], name='str1')
name = series.name
# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal by default
cols = series.str.replace('(.{10})', r'\1|', regex=True).str.split('|', n=4, expand=True).add_prefix(f'{name}_')
That is, use str.replace to insert the delimiter, str.split to split the chunks apart, and add_prefix to add the column prefixes.
The output will be:
       str1_0      str1_1      str1_2    str1_3
0  This is an   even long   er string      None
1  This is th  e longest   string of   them all
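The reason str.split('.{10}') doesn't work is that the pat parameter of str.split is the pattern for the delimiters, which are discarded, not for the pieces that should appear in the result. With str.split('.{10}'), every 10-character chunk is consumed as a delimiter, so only the empty strings between chunks and the leftover tail remain.
UPDATE: According to the suggestion from @AKX, use \x1F (the ASCII unit separator) as a safer delimiter, since it is unlikely to occur in the data:
cols = series.str.replace('(.{10})', '\\1\x1F', regex=True).str.split('\x1F', n=4, expand=True).add_prefix(f'{name}_')
Note the absence of the r prefix on the replacement string: it is a regular string so that \x1F is interpreted as the separator character, which is why the backreference is written as \\1.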
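A delimiter-free variant may also be worth considering. The sketch below slices each column directly with the .str accessor, so no character has to be reserved as a separator and the number of output columns is fixed up front (9 in the real case, 4 here). The helper name split_fixed_width and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def split_fixed_width(df, columns, width, n_parts):
    """Replace each column in `columns` by `n_parts` columns of at most `width` characters.

    Hypothetical helper for illustration; pieces past the end of a string become NaN.
    """
    out = df.drop(columns=columns)
    for col in columns:
        for k in range(n_parts):
            # Slicing past the end of a string yields '', which is turned into NaN below.
            piece = df[col].str[k * width:(k + 1) * width]
            out[f"{col}_{k + 1}"] = piece.where(piece.str.len() > 0)
    return out


df_test = pd.DataFrame({
    'id': [1, 2, 3],
    'str_1': ['This is a long string',
              'This is an even longer string',
              'This is the longest string of them all'],
    'str_2': ['This is also a long string', 'a short string', 'mini_str'],
})

result = split_fixed_width(df_test, ['str_1', 'str_2'], width=10, n_parts=4)
print(result)
```

Unlike the delimiter approach, a string whose length is an exact multiple of the width does not leave an empty string in the following column here, because empty pieces are masked out explicitly.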

Related

Remove 1-3 length character from string in sql

From a space-delimited string, I would like to remove all words that are 1 to 3 characters long.
For example: this string
LCCPIT A2 LCCMAD B JBPM_JIT CCC
should become
LCCPIT LCCMAD JBPM_JIT
So the words A2, B and CCC are removed (since they are 2, 1 and 3 characters long). Is there a way to do it? I think I could use REGEXP_REPLACE, but I couldn't find the right regular expression to get this result.
Split the string into words and aggregate back only those whose length is greater than 3.
Sample data and query:
with test (col) as
  -- sample data
  (select 'LCCPIT A2 LCCMAD B JBPM_JIT CCC' from dual)
-- query begins here
select listagg(val, ' ') within group (order by lvl) result
from (select regexp_substr(col, '[^ ]+', 1, level) val,
             level lvl
      from test
      connect by level <= regexp_count(col, ' ') + 1
     )
where length(val) > 3;
RESULT
--------------------------------------------------------------------------------
LCCPIT LCCMAD JBPM_JIT
I prefer a regex replacement trick:
SELECT TRIM(REGEXP_REPLACE('LCCPIT A2 LCCMAD B JBPM_JIT CCC', '(^|\s+)\w{1,3}(\s+|$)', ' ')) AS result
FROM dual;
-- output is 'LCCPIT LCCMAD JBPM_JIT'
The strategy above is to match any word of 1, 2, or 3 characters, along with any surrounding whitespace, and to replace it with a single space. The outer call to TRIM() is necessary to remove the dangling spaces that arise when the first or last word is removed.
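One caveat is worth checking before relying on the pattern: each match consumes the whitespace after the short word, so a short word that immediately follows another short word has no leading whitespace left to match against. The sketch below reproduces the pattern with Python's re module as a stand-in (an assumption: Oracle's engine is expected to behave the same way for this pattern, but that is not verified here) and compares it with a plain split-and-filter. The function names are hypothetical.

```python
import re


def drop_short_regex(text):
    # Same pattern as in the REGEXP_REPLACE call, followed by the equivalent of TRIM().
    return re.sub(r'(^|\s+)\w{1,3}(\s+|$)', ' ', text).strip()


def drop_short_split(text, max_len=3):
    # Split on whitespace and keep only the words longer than max_len characters.
    return ' '.join(word for word in text.split() if len(word) > max_len)


sample = 'LCCPIT A2 LCCMAD B JBPM_JIT CCC'
print(drop_short_regex(sample))   # LCCPIT LCCMAD JBPM_JIT
print(drop_short_split(sample))   # LCCPIT LCCMAD JBPM_JIT

adjacent = 'A2 B CCC LCCPIT'
print(drop_short_regex(adjacent))  # B LCCPIT  (the word B survives)
print(drop_short_split(adjacent))  # LCCPIT
```

For input where short words never sit next to each other, both approaches agree; with adjacent short words, the split-and-aggregate approach is the safer of the two.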

what will be the dp and transitions in this problem

Vasya has a string s of length n consisting only of digits 0 and 1. Also he has an array a of length n.
Vasya performs the following operation until the string becomes empty: choose some consecutive substring of equal characters, erase it from the string and glue together the remaining parts (any of them can be empty). For example, if he erases substring 111 from string 111110 he will get the string 110. Vasya gets a_x points for erasing a substring of length x.
Vasya wants to maximize his total points, so help him with this!
https://codeforces.com/problemset/problem/1107/E
I was trying to get my head around the editorial, but couldn't understand it. Can anyone explain an easy way to do it?
input:
7
1101001
3 4 9 100 1 2 3
output:
109
Explanation
the optimal sequence of erasings is: 1101001 → 111001 → 11101 → 1111 → ∅.
Here, we consider removing prefixes instead of arbitrary substrings. Why?
Because the prefix we remove in a particular state is itself a consecutive run of equal characters in the main string. So our DP state is (start index, end index, prefix length), where the prefix length counts the characters equal to str[start] that have been glued together at the front of the current segment, including str[start] itself.
Let's consider an example, str = "1010110". Initially start = 0, end = 7, and prefix = 1 (the first '1' is the whole prefix for now). We iterate over all the indices i in the current state except the starting index and check whether str[i] == str[start]. Here, for example, str[4] == str[0]. We then divide the string into "010" (indices 1 to 3) with prefix = 1 and "110" (indices 4 to 6) with prefix = 2, because erasing "010" first glues str[0] onto str[4]. These are now two independent subproblems. When a string of length 1 remains, we return a_prefix.
Here is my code.
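The code itself is not included above. As a stand-in, here is a minimal Python sketch of the DP just described; it is a reconstruction, not the answerer's original code. It assumes one transition the text does not spell out, namely that the current prefix may also be erased immediately for a_prefix points, and it uses a half-open range [start, end) as in the example. The function name max_points is hypothetical.

```python
from functools import lru_cache


def max_points(s, a):
    """Maximum score for erasing s, where a[x - 1] is the score for a run of length x."""

    @lru_cache(maxsize=None)
    def solve(start, end, prefix):
        # State: segment s[start:end], with `prefix` characters equal to s[start]
        # glued together at its front (s[start] itself is counted).
        if start >= end:
            return 0
        # Option 1 (assumed transition): erase the glued prefix right now.
        best = a[prefix - 1] + solve(start + 1, end, 1)
        # Option 2: clear s[start+1:i] first so the prefix merges with s[i].
        for i in range(start + 1, end):
            if s[i] == s[start]:
                best = max(best, solve(start + 1, i, 1) + solve(i, end, prefix + 1))
        return best

    return solve(0, len(s), 1)


print(max_points("1101001", [3, 4, 9, 100, 1, 2, 3]))  # 109, the sample answer
```

With a segment of length 1, option 1 reduces to a[prefix - 1], which matches the base case in the description. There are O(n^3) states with O(n) transitions each, so for n close to 100 a compiled language is likely needed to fit the time limit.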

Python: setting a string so it always has 2 decimals after comma

I'm looking to write a Python function that adds a number to the back of the string. However, I want it in a way that the string always has 2 characters after the comma.
I believe a string makes it easier to remove and skip characters. I will be converting the result with the float() function.
As an example:
I start at the string "0.00"
Adding a 5 will make it "0.05"
Adding a 5 and a 6 will make it "5.56" etc
Another example:
Again we start at "0.00". Consecutively adding the characters "5", "4", "3", "2", "1" will ultimately result in "543.21".
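Why not just convert the digits typed so far to an integer and divide by 100?
num = int(input())
# format with two decimals so that e.g. 50 prints as 0.50 rather than 0.5
print(f"{num / 100:.2f}")
E.g. for the input '5': converted to an integer it is 5, and divided by 100 it is 0.05.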
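If the digits arrive one at a time, as in the question, the same idea can be sketched as a small helper that drops the decimal point, appends the new digit, and re-inserts the point two places from the right. The function name append_digit is hypothetical, and integer arithmetic is used instead of float() so the result is exact.

```python
def append_digit(amount: str, digit: str) -> str:
    """Append one digit to a two-decimal amount string, e.g. "0.05" + "5" -> "0.55"."""
    if len(digit) != 1 or digit not in "0123456789":
        raise ValueError("digit must be a single character 0-9")
    # Drop the decimal point, append the digit, and read the result as hundredths.
    hundredths = int(amount.replace(".", "") + digit)
    whole, frac = divmod(hundredths, 100)
    return f"{whole}.{frac:02d}"


amount = "0.00"
for d in "54321":
    amount = append_digit(amount, d)
print(amount)  # 543.21
```

The returned string can still be passed to float() afterwards if a number is needed.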

Find index of cells containing my string

I have a cell array C which contains numbers and strings like this:
1 0 'C:\user' 41.57
2 0 'C:\user' 46.25
3 0 'C:\user' 48
4 0 'C:\user' 48.33
I want to get the index of the cell which is equal to a specified name.
I have tried something like this, but it didn't work:
idx=find(strcmp([C{:,:}],'C:\User\..'))
I need help please
To use strcmp, first convert the doubles to strings with num2str. Set UniformOutput to false, since your C holds both numbers and strings.
idx = find(strcmp(cellfun(@num2str, C, 'UniformOutput', false), 'C:\user'));
[row, col] = ind2sub(size(C), idx);

matlab string vector / array handling (multiplication u and str2num)

I would like to understand whether this is really correct, or whether it might be an issue in MATLAB.
I create a string vector/array via:
>>a=['1','2';'3','4']
It returns:
a =
12
34
Now I would like to convert the content from string to number and multiply this with a number:
>>6*str2num(a)
The result looks like this:
ans =
72
204
I don't understand why the comma-separated elements (strings) are concatenated instead of being handled separately. If you use numbers instead of strings, they are handled separately. Then it looks like this:
>> a=[1,2;3,4]
a =
1 2
3 4
>> 6*a
ans =
6 12
18 24
I would expect the same results. Any ideas ?
Thanks
Have you read about how string handling is done in MATLAB?
Basically, multiple strings can only be stored as a column vector (of strings). If attempted to store as a row vector, they will be concatenated. This is why strings '1' and '2' are being concatenated, as well as '3' and '4'. Also note, that this is only possible if all resulting strings are of the same length.
I'm not sure what you're trying to do, but if you want to store strings as a matrix (that is, multiple strings in a row), consider storing them in a cell array, for instance:
>> A = {'1', '2'; '3', '4'}
A =
'1' '2'
'3' '4'
>> cellfun(@str2num, A)
ans =
1 2
3 4
I would say that using a cell array as @EitanT suggests would probably be the best solution for you.
However, it is possible to handle strings (or rather characters) like the way you tried by manually inserting spaces and lining up the number of characters.
For example
>> a=['1 2';'3 4']
produces
a =
1 2
3 4
and using
>> 6*str2num(a)
produces
ans =
6 12
18 24
When converting a matrix to a string using
b=[1,2;3,10000];
num2str(b)
spaces are inserted automatically and the characters are lined up properly. This produces
ans =
1 2
3 10000
