Deleting columns of large CSV files - linux

I have a large CSV file of around 2 GB, containing 7 columns. I want to delete its 4th column, which is a text snippet. I used the "cut" command like this:
cut -d, -f 4 --complement file
But it does not remove the right column, because cut treats every comma in a row as a delimiter, including the commas inside the quoted text field, and then deletes the 4th field of that row. Following an answer here, I used csvquote like this:
csvquote file | cut -d "," -f 4 --complement | uniq -c | csvquote -u
It worked for a small file, but it throws an error for large files:
errno: Value too large for defined data type
I want to know some solutions for deleting columns of the large data file. Thanks.
Edit: head output of the file:
funny,user_id,review_id,text,business_id,stars,date,useful,type,cool
0,WV5XKbgVHJXEgw7f-b6PVA,hhmpSM4LcHQv6noXlYYCgw,"Went out of our way to find this place because I read they had amazing poutine. Worth the traveling. It really was spot on amazing. Served out of a storage container this place is hip. $10 for two huge portions of poutine. The fries were crisp and held up to the creamy gravy well. Topped with a huge portion of squeaky white cheese curds this was a fantastic meal.

Have you tried telling cut to use the other fields instead?
Like this:
csvquote file | cut -d, -f 1-3,5- | uniq -c | csvquote -u
I tested it on my machine and it seems to work. But I didn't see a sample of your data, and you didn't note which program in the pipeline is throwing the
errno: Value too large for defined data type
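As a side note, "Value too large for defined data type" (EOVERFLOW) is usually what a program built without large-file support reports when it is pointed at a file bigger than 2 GB. If csvquote is the stage that fails, two things worth trying, as assumptions about your particular build rather than verified fixes, are feeding the file on stdin instead of naming it as an argument, and recompiling csvquote with 64-bit file offsets (-D_FILE_OFFSET_BITS=64). A minimal sketch of the stdin variant:
# read the 2 GB file via a redirect so csvquote never opens it by name; the output name is just an example
csvquote < file | cut -d, -f 1-3,5- | csvquote -u > trimmed.csv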

Related

Trouble pipelining grep into a sort

I have data in file in the form:
Torch Lake township | Antrim | 1194
I'm able to use grep to look for keywords and pipe that into sort, but the sort isn't behaving how I intended.
This is what I have:
grep '| Saginaw' data | sort -k 5,5
I would like to sort by the numerical value in the last column, but it currently isn't sorting that way and I'm unsure what I'm actually doing wrong.
A few things seem to be bogging you down.
First, the vertical bar can be a special character in grep. In extended regular expressions (grep -E) it means OR. Ex:
A|B
could be interpreted as A or B, and not A vertical bar B.
With plain grep (basic regular expressions), a bare | is treated as a literal character, so your pattern is fine as written:
grep '| Saginaw' data
If you ever switch to grep -E, escape it as \| instead. Or simply remove the bar from the pattern altogether, if your data format allows that.
Second, the sort command needs to know what your column separator is. By default, it uses a space character (actually, any whitespace), so sort -k 5,5 says "sort on the 5th word".
To specify that your column separator is actually the vertical bar, use the -t option:
sort -t'|' -k 5,5
alternately,
sort --field-separator='|' -k 5,5
Third, you've got a bit of a sticky wicket now. Your data is formatted as:
Field1 | Field2 | Field3
...and not...
Field1|Field2|Field3
You may have issues with that additional space. Or maybe not. If all of your data has EXACTLY the same white-space, you'll be fine. If some have a single space, some have 2 spaces, and others have a tab, your sort will get jacked up.
Fourth, sorting numbers may not behave the way you expect: by default sort compares them as text, so 10 comes after 1 and before 2.
To sort the way you think it ought to be, where 10 comes after 9, use the -n option for a numeric sort.
grep '| Saginaw' data | sort -t'|' -n -k 5,5
The entire field #5 will be sorted numerically. Thus, 10 Abbington will come before 10 Andover.
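As a concrete sketch (the counts here are invented, and the field number is an assumption: in the one sample line shown in the question, the number is the 3rd pipe-separated field, so adjust -k to wherever it sits in your real file):
$ cat data
Torch Lake township | Antrim | 1194
Buena Vista township | Saginaw | 10
Spaulding township | Saginaw | 9
Saginaw township | Saginaw | 2
$ grep '| Saginaw' data | sort -t'|' -n -k 3,3
Saginaw township | Saginaw | 2
Spaulding township | Saginaw | 9
Buena Vista township | Saginaw | 10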

Extracting two columns and searching for specific words in the first column, without cutting the ones remaining

I have a .csv file filled with names of people, their group, the day they are able to work, and the city they live in; these 4 pieces of information are separated by ":".
For example:
Dennis:GR1:Thursday:Paris
Charles:GR3:Monday:Levallois
Hugues:GR2:Friday:Ivry
Michel:GR2:Tuesday:Paris
Yann:GR1:Monday:Pantin
I'd like to cut the 2nd and the 3rd columns and print all the lines containing names ending with "s", but without cutting the 2nd remaining column (the city).
For example, I would like to have something like this:
Dennis:Paris
Charles:Levallois
Hugues:Ivry
I tried to do this with grep and cut, but using cut I end up with just the 1st column remaining.
I hope that I've been able to make myself understood !
It sounds like all you need is:
$ awk 'BEGIN{FS=OFS=":"} $1~/s$/{print $1, $4}' file
Dennis:Paris
Charles:Levallois
Hugues:Ivry
To address your comment requesting a grep+cut solution:
$ grep -E '^[^:]+s:' file | cut -d':' -f1,4
Dennis:Paris
Charles:Levallois
Hugues:Ivry
but awk is the right way to do this.
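In case the BEGIN block looks cryptic: FS is awk's input field separator and OFS its output field separator. If OFS were left at its default, the printed fields would be joined by spaces rather than colons:
$ awk -F: '$1~/s$/{print $1, $4}' file
Dennis Paris
Charles Levallois
Hugues Ivry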

convert one single column to multiple columns in linux

I am trying to convert a single column to multiple columns using
grep -v '^\s*$' $1 | pr -ts" " --columns $2
but I get this error:
pr: page width too narrow
could someone help me with this error?
The question isn't fully clear to me. Maybe you can provide some sample input data as well. If you want to convert 1 column to multiple columns, the xargs command can be used.
xargs -n10 will convert 1 column to 10, i.e. it will take 10 input rows and join them into one output line.
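For example, a quick sketch using seq as stand-in input:
$ seq 1 12 | xargs -n 4
1 2 3 4
5 6 7 8
9 10 11 12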
BR,

Subtracting a CSV List from Another?

I have a huge list of broken links I generated with Screaming Frog and started to fix a lot of them. I ran the CSV file back through Screaming Frog to see which broken links I had left, so now I have 2 CSV files. How do I subtract the newer list from the old list so I can see which links I already fixed?
The following method assumes:
A: All broken links are in oldfile.
B: Some broken links are in newfile.
C: The shared lines are exact duplicates.
sort newfile oldfile | uniq -d > filesThatAreStillBroken
or
sort newfile oldfile | uniq -u > filesThatAreFixed
Sort merges the files together into a single sorted list. It doesn't matter if newfile or oldfile is first.
uniq -d puts out only lines that occur more than once. Since they were in both lists, they are still broken.
uniq -u puts out only lines that are unique.
Note: This won't catch new errors that you have introduced while fixing the old errors. New errors will only be in newfile and so will be falsely reported as fixed in the second invocation, and not reported at all in the first invocation.
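A tiny worked example (the file names and link values are made up for illustration):
$ printf 'a\nb\nc\n' > oldfile    # previously broken links
$ printf 'b\nc\nd\n' > newfile    # still broken now (d is a new break)
$ sort newfile oldfile | uniq -d  # in both lists: still broken
b
c
$ sort newfile oldfile | uniq -u  # in only one list: a was fixed, d is new
a
d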
Type
man sort
man uniq
for more details about these two command line utilities.
If you are on a Windows box, you can install the Cygwin environment, or possibly Windows now has a POSIX command set.
Import both CSVs into Excel.
Add the formula into cells in column B of the larger list: =COUNTIF(Sheet2!A:A,A1)
This will give you a count of how many times that cell appears in the other list.
Now you just have to delete any that have a count > 0.
Tip: To delete the rows easily: Add a header row, turn on auto filters, deselect count of 0, delete the rows, turn off auto filters. (Alternatively, you could sort your list if you don't mind the order getting messed up)
Try this function in Excel:
=IF(COUNTIFS($B$1:$B$6, A1), "Borked", "Fixed")
Just make sure that the A value points to something in the smaller list (still broken) and the B range covers the original set of broken links

How to delete duplicate lines in bash

Given a long text file like this one (that we will call file.txt):
EDITED
1 AA
2 ab
3 azd
4 ab
5 AA
6 aslmdkfj
7 AA
How to delete the lines that appear at least twice in the same file in bash? What I mean is that I want to have this result:
1 AA
2 ab
3 azd
6 aslmdkfj
I do not want to have duplicate lines in a given text file. Could you show me the command, please?
Assuming whitespace is significant, the typical solution is:
awk '!x[$0]++' file.txt
(e.g., the line "ab " is not considered the same as "ab". It is probably simplest to pre-process the data if you want to treat whitespace differently.)
--EDIT--
Given the modified question, which I'll interpret as only wanting to check uniqueness after a given column, try something like:
awk '!x[ substr( $0, 2 )]++' file.txt
This will only compare characters 2 through the end of the line, ignoring the first character (the single-digit line number in your sample). This is a typical awk idiom: we are simply building an array named x (one-letter variable names are a terrible idea in a script, but are reasonable for a one-liner on the command line) which holds the number of times a given string has been seen. The first time a string is seen, its line is printed. In the first case, the key is the entire input line contained in $0. In the second case the key is only the substring consisting of everything from the 2nd character onward.
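Applied to the sample file above, this produces the requested output:
$ awk '!x[ substr( $0, 2 )]++' file.txt
1 AA
2 ab
3 azd
6 aslmdkfj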
Try this simple script:
cat file.txt | sort | uniq
cat will output the contents of the file,
sort will put duplicate entries adjacent to each other
uniq will remove adjacent duplicate entries.
Hope this helps!
The uniq command will do what you want.
But make sure the file is sorted first; it only checks consecutive lines.
Like this:
sort file.txt | uniq
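As a small aside, sort can also de-duplicate on its own:
sort -u file.txt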
