How to remove the FIRST whitespace from a Python dataframe on a certain column

I extracted a PDF table using tabula.read_pdf, but some of the data entries a) contain a whitespace inside a value and b) hold two sets of values in one column, as shown in the columns "Sports 2019/2018" and "Total 2019/2018": https://imgur.com/a/MviV6N9
In order to use df_1 = df1["Sports 2019/2018"].str.split(expand=True) to split the two values, which are separated by a space, I need to remove the FIRST space (the one inside the first value) so that the result doesn't split into three columns.
I've tried df1["Sports 2019/2018"] = df1["Sports 2019/2018"].str.replace(" ", "") but this removes all the spaces, which then merges the two values.
Is there a way to remove only the first whitespace in column "Sports 2019/2018" so that its values resemble those in "Internet 2019/2018"?

df1["Sports 2019/2018"] = df1["Sports 2019/2018"].str.replace(" ", "", n=1)
The n=1 argument limits the replacement to the first match in each value, so only the first space is removed.
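A minimal sketch of the full round trip, assuming pandas is installed. The sample values and the new column names are made up for illustration, since the real data comes from the PDF table; regex=False is passed so the space is treated as a literal string.

```python
import pandas as pd

# Hypothetical sample data; the real values come from the extracted PDF table.
df1 = pd.DataFrame({"Sports 2019/2018": ["1 234 987", "2 050 1875"]})

# Remove only the first space in each cell (n=1), treating " " as a literal.
cleaned = df1["Sports 2019/2018"].str.replace(" ", "", n=1, regex=False)

# Each cell now holds exactly two space-separated values, so split gives two columns.
df_1 = cleaned.str.split(expand=True)
df_1.columns = ["Sports 2019", "Sports 2018"]  # illustrative names
print(df_1)
```

Note that this only works if every row has the stray space inside the first value; rows that already contain just one space would lose their separator.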

Related

How to check for words in two excel columns?

I have two columns in my Excel sheet, and I want to put the result of comparing them in a third column. The formula should check the tags in both columns.
If the tags in the two columns do not match at all, the output should be "inaccurate" or "failed".
If every tag matches, the output should be "accurate".
If only some of the tags match, the output should be "incomplete".
EXAMPLE SHOWN:
Another example, where I am unable to get "incomplete" as output. It only checks for inaccurate and accurate.
You can use LET and FILTERXML to test whether all, some, or none of the tags in one cell appear in the other. For example:
=LET(x, FILTERXML("<t><s>"&SUBSTITUTE(I1, ",", "</s><s>")&"</s></t>", "//s"),
y, FILTERXML("<t><s>"&SUBSTITUTE(J1, ",", "</s><s>")&"</s></t>", "//s"),
z, COUNT(MATCH(x,y,0)),
IF(z=MIN(COUNTA(x), COUNTA(y)), "Accurate", IF(z>0, "Partial", "Failed")))
Here we first create a dynamic array of the contents of each cell and call them x and y respectively. Next, we count the number of matches in each array. Finally, we test this count: if the count is equal to the size of the smallest dynamic array, then all the tags match and we call it "Accurate". Otherwise, if there is any match in the dynamic arrays, we call it "Partial". Finally, if there are no matches, we call it "Failed".
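For readers who want to check the logic outside Excel, here is a rough Python sketch of the same three-way test. The function name classify_tags is hypothetical, and the sketch assumes tags are separated by plain commas with no surrounding spaces, as in the formula.

```python
def classify_tags(left: str, right: str) -> str:
    """Compare two comma-separated tag strings, mirroring the formula's logic."""
    x = left.split(",")
    y = right.split(",")
    # Count how many tags from the first cell also appear in the second.
    matches = sum(1 for tag in x if tag in y)
    # All tags of the shorter list matched: treat as fully matching.
    if matches == min(len(x), len(y)):
        return "Accurate"
    return "Partial" if matches > 0 else "Failed"


print(classify_tags("a,b,c", "a,b,c"))
print(classify_tags("a,b,c", "a,x,y"))
print(classify_tags("a,b", "x,y"))
```

Because the count is compared against the shorter list, a cell whose tags are a strict subset of the other cell's tags is also reported as "Accurate".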

Extract last word in string in R - error faced

First, I wish to extract the first word and the last word of the Description column (this column contains at least 3 words) into newly created columns firstword and lastword. However, the word() function is not applied to all the rows. As a result, many rows have an empty lastword, even though these rows actually have a last word (as you can see from the Description column). This is shown in the first two lines of code.
Second, I am also trying to use the third line of code to replace lastword with firstword if lastword is empty. However, it isn't working.
Is there a way to rectify this?
c1$lastword = word(c1$Description,start=-1) #extract last word
c1$firstword = word(c1$Description,start=1) #extract first word
c1$lastword=ifelse(c1$lastword == " ", c1$firstword, c1$lastword)
I realise that there is white space at the beginning of some of the rows of the Description variable, which isn't shown when viewed in R.
Removing the whitespace using stri_trim() solved the issue.
c1$Description = stri_trim(c1$Description, "left") #remove whitespace

Processing TSV Files in Lua

I have a very, very large TSV file. The first line contains the headers. Each following line contains data fields separated by tabs, with a double tab where a field is blank; otherwise the fields can contain alphanumerics or alphanumerics plus punctuation marks.
for example:
Field1<tab>Field2<tab>FieldN<newline>
The fields may contain spaces, punctuation or alphanumerics. The only things that always hold true are:
each field is followed by a tab except the last one
the last field is followed by a newline
blank fields are filled with a tab. Like all other fields they are followed by a tab. This makes them double-tab.
I've tried many combinations of pattern matching in lua and never get it quite right. Typically the fields with punctuation (time and date fields) are the ones that get me.
I need the blank fields (the ones with double-tab) preserved so that the rest of the fields are always at the same index value.
Thanks in Advance!
Try the code below:
function test(s)
  local n = 0
  s = s .. '\t' -- make sure the last field also ends with a tab
  for w in s:gmatch("(.-)\t") do
    n = n + 1
    print(n, "[" .. w .. "]")
  end
end
test("10\t20\t30\t\t50")
test("100\t200\t300\t\t500\t")
It adds a tab to the end of the string so that all fields are followed by a tab, even the last one. Empty fields are captured as empty strings, so the field indexes stay aligned.
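The same append-a-tab idea can be sketched in Python for comparison. This is only an illustration of the technique, not part of the Lua answer, and the function name split_fields is hypothetical.

```python
import re


def split_fields(line):
    # Append a tab so that every field, including the last, is followed by one,
    # then capture the (possibly empty) text in front of each tab.
    return re.findall(r"(.*?)\t", line + "\t")


print(split_fields("10\t20\t30\t\t50"))  # ['10', '20', '30', '', '50']
```

In Python the built-in line.split("\t") gives the same list, which makes it a convenient cross-check for the pattern.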
Rows and columns are separated:
local filename = "big_tables.tsv" -- tab separated values
-- local filename = "big_tables.csv" -- comma separated values
local tables = {} -- columns and rows stored as tables[n_column][n_row] = value
for line in io.lines(filename) do -- row iterator
  local i = 1 -- first column
  -- append a tab so every field, including the last, ends with one;
  -- "(.-)\t" also captures empty fields, which keeps the column indexes aligned
  for value in (line .. "\t"):gmatch("(.-)\t") do -- tab separated values
  -- for value in (line .. ","):gmatch("(.-),") do -- comma separated values (no quoting support)
    tables[i] = tables[i] or {} -- create the column on first use
    tables[i][#tables[i] + 1] = tonumber(value) or value -- keep non-numeric text as a string
    i = i + 1 -- next column
  end
end

Openrefine: Split multi-valued cells by token/word count?

I have a large corpus of text data that I'm pre-processing for document classification with MALLET using openrefine.
Some of the cells are long (>150,000 characters) and I'm trying to split them into <1,000 word/token segments.
I'm able to split long cells into 6,000 character chunks using the "Split multi-valued cells" by field length, which roughly translates to 1,000 word/token chunks, but it splits words across rows, so I'm losing some of my data.
Is there a function I could use to split long cells by the first whitespace (" ") after every 6,000th character, or even better, split every 1,000 words?
Here is my simple solution:
Go to Edit cells -> Transform and enter
value.replace(/((\s+\S+?){999})\s+/,"$1###")
This will replace every 1000th whitespace (consecutive whitespaces are counted as one and replaced if they appear at the split border) with ### (you can choose any token you like, as long as it doesn't appear in the original text).
Then go to Edit cells -> Split multi-valued cells and split using the token ### as separator.
The simplest way is probably to split your text by spaces, to insert a very rare character (or group of characters) after each group of 1000 elements, to reconcatenate, then to use "Split multivalued cells" with your weird character(s).
You can do that in GREL, but it will be much clearer by choosing "Python/Jython" as script language.
So: Edit cells -> Transform -> Python/Jython:
my_list = value.split(' ')
n = 1000
i = n
while i < len(my_list):
    my_list.insert(i, '|||')  # separator after every n words
    i += (n + 1)  # skip past the separator that was just inserted
return " ".join(my_list)
(For an explanation of this script, see here)
Here is a more compact version:
text = value.split(' ')
n = 1000
return "|||".join([' '.join(text[i:i+n]) for i in range(0,len(text),n)])
You can then split using ||| as separator.
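The OpenRefine snippets rely on the value variable and a top-level return, so they cannot be run as-is in a plain Python interpreter. To try the chunking logic outside OpenRefine, a small standalone sketch of the compact version might look like this (the function name insert_separators and its parameters are hypothetical):

```python
def insert_separators(text, n=1000, sep="|||"):
    """Join every group of n space-separated words and place sep between groups."""
    words = text.split(" ")
    chunks = [" ".join(words[i:i + n]) for i in range(0, len(words), n)]
    return sep.join(chunks)


# Small n so the effect is easy to see.
print(insert_separators("a b c d e", n=2))  # a b|||c d|||e
```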
If you prefer to split by characters instead of words, it looks like you can do that in two lines with textwrap:
import textwrap
return "|||".join(textwrap.wrap(value, 6000))

powerquery split column by variable field lengths

In PowerQuery I need to import a fixed width txt file (each line is the concatenation of a number of fields, each field has a fixed specific length).
When I import it I get a table with one single column that contains the txt lines, e.g. in the following format:
AAAABBCCCCCDDD
I want to add more columns in this way:
Column1: AAAA
Column2: BB
Column3: CCCCC
Column4: DDD
In other words the fields composing the source column are of known length, but this length is not the same for all fields (in the example above the lengths are: 4,2,5,3).
I'd like to use the "Split Column">"By number of character" utility but I can only insert one single length at a time, and to get the desired output I'd have to repeat the process 3 times, adding one column each time and using the "Once, as far left as possible" option for the "Split Column">"By number of character" utility.
My real-life case has many different line types (files) to import and convert, each with more than 20 fields, so a less manual approach is needed; I'd like to somehow specify the record structure (the length of each field) and get the lines split automagically :)
There would probably be the need for some M code, which I know nothing about: can anybody point me to the right direction?
Thanks!
Create a query with the formula below. Let's call this query SplitText:
let
    SplitText = (text, lengths) =>
        let
            LengthsCount = List.Count(lengths),
            // Keep track of the index in the lengths list and the position in the text to take the next characters from. Use this information to get the next segment and put it into a list.
            Split = List.Generate(
                () => {0, 0},
                each _{0} < LengthsCount,
                each {_{0} + 1, _{1} + lengths{_{0}}},
                each Text.Range(text, _{1}, lengths{_{0}})
            )
        in
            Split,
    // Convert the list to a record
    ListToRecord = (text, lengths) =>
        let
            List = SplitText(text, lengths),
            Record = Record.FromList(List, List.Transform({1 .. List.Count(List)}, each Number.ToText(_)))
        in
            Record
in
    ListToRecord
Then, in your table, add a custom column that uses this formula:
each SplitText([Column1], {4, 2, 5, 3})
The first argument is the text to split, and the second argument is a list of lengths to split by.
Finally, expand the column to get the split text values into your table. You may want to rename the columns since they will be named 1, 2, etc.
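The slicing logic itself is language-independent. As a rough illustration in Python rather than M, a running total of the lengths gives the start and end offset of each field (the function name split_fixed_width is hypothetical):

```python
from itertools import accumulate


def split_fixed_width(text, lengths):
    """Cut text into consecutive fields of the given lengths."""
    ends = list(accumulate(lengths))  # running totals: end offset of each field
    starts = [0] + ends[:-1]          # each field starts where the previous one ends
    return [text[s:e] for s, e in zip(starts, ends)]


print(split_fixed_width("AAAABBCCCCCDDD", [4, 2, 5, 3]))  # ['AAAA', 'BB', 'CCCCC', 'DDD']
```

Keeping the lengths in a list per file type means each record layout is described once, which matches the goal of avoiding repeated manual splits.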
