I have a CSV file of 100,000 IDs
wef7efwe1fwe8
wef7efwe1fwe3
ewefwefwfwgrwergrgr
that are being transformed into a JSON object using jq:
output=$(jq -Rsn '
{"id":
[inputs
| . / "\n"
| (.[] | select(length > 0) | . / ";") as $input
| $input[0]]}
' <$FILE)
output
{
  "id": [
    "wef7efwe1fwe8",
    "wef7efwe1fwe3",
    ....
  ]
}
Currently, I need to manually split the file into smaller 10,000-line files because the API call has a limit.
I would like a way to automatically loop through the large file, using only 10,000 lines at a time as $FILE, until the end of the list is reached.
I would use the split command and write a little shell script around it:
#!/bin/bash
input_file=ids.txt
temp_dir=splits
api_limit=10000
# Make sure that there are no leftovers from previous runs
rm -rf "${temp_dir}"
# Create temporary folder for splitting the file
mkdir "${temp_dir}"
# Split the input file based on the api limit
split --lines "${api_limit}" "${input_file}" "${temp_dir}/"
# Iterate through splits and make an api call per split
for split in "${temp_dir}"/* ; do
    jq -Rsn '
    {"id":
      [inputs
       | . / "\n"
       | (.[] | select(length > 0) | . / ";") as $input
       | $input[0]]
    }' "${split}" > api_payload.json

    # now do something ...
    # curl -d @api_payload.json http://...

    rm -f api_payload.json
done
# Clean up
rm -rf "${temp_dir}"
Here's a simple and efficient solution that, at its core, just uses jq. It takes advantage of the -c command-line option. I've used xargs printf ... for illustration, mainly to show how easy it is to set up a shell pipeline.
< data.txt jq -Rnc '
def batch($n; stream):
def b: [limit($n; stream)]
| select(length > 0)
| (., b);
b;
{id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | xargs printf "%s\n"
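Since -c emits each batch as a single compact JSON object per line, the batches could just as easily be fed straight to the API call from the question instead of xargs. A minimal sketch (the endpoint URL is a placeholder, not part of the original question):
< data.txt jq -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
      | select(length > 0)
      | (., b);
    b;
  {id: batch(10000; inputs | select(length>0) | (. / ";")[0])}
' | while read -r payload; do
    # POST one batch per request; replace the URL with your real endpoint
    curl -s -X POST -H 'Content-Type: application/json' \
         -d "$payload" 'https://api.example.com/bulk'
done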
Parameterizing batch size
It might make sense to set things up so that the batch size is specified outside the jq program. This could be done in numerous ways, e.g. by invoking jq along the lines of:
jq --argjson n 10000 ....
and of course using $n instead of 10000 in the jq program.
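For example, the full invocation could look like this (the same program as above, just with the batch size passed in from the shell):
jq --argjson n 10000 -Rnc '
  def batch($n; stream):
    def b: [limit($n; stream)]
      | select(length > 0)
      | (., b);
    b;
  {id: batch($n; inputs | select(length>0) | (. / ";")[0])}
' < data.txt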
Why “def b:”?
For efficiency. jq’s TCO (tail recursion optimization) only works for arity-0 filters.
Note on -s
In the Q as originally posted, the command-line options -sn are used in conjunction with inputs. Using -s with inputs defeats the whole purpose of inputs, which is to make it possible to process input in a stream-oriented way (i.e. one line of input or one JSON entity at a time).
How do I structure tests for the following program:
I'm writing a unit test framework for simulated combinatorial circuits. This framework will support multiple digital logic simulators (JLS, Logisim, TKGate, etc.). Thus, each test should be run once for each supported simulator.
My first idea is to do something like this:
Scenario Outline: Test of valid circuit
When I run DLUnit with "testCircuit1.<type> testFile"
Then I should see "All tests (4) passed." on stdout
Examples:
| type |
| jls | # extension for JLS files
| circ | # extension for Logisim files
| v | # extension for tkgate files
Scenario Outline: Test of invalid circuit
When I run DLUnit with "brokenCircuit1.<type> testFile"
Then I should see "There were failures" on stdout
Examples:
| type |
| jls |
| circ |
| v |
# Many more tests to follow
Although this will technically work, it results in feature code that may be difficult to maintain: each scenario is followed by the same list of supported simulators, so adding support for an additional simulator would require adding the same line to each test.
I could also create jls.feature, then use sed to automatically create logisim.feature and tkgate.feature; but, I'd like to avoid that type of complexity if Cucumber offers a simpler built-in solution.
Perhaps you could do something like this in RSpec:
describe DLUnit do
  [
    ['jls', 'testCircuit1', true],
    ['jls', 'brokenCircuit1', false],
    # ...
  ].each do |simulator, circuit, expected_validity|
    it "with the #{simulator} simulator finds the #{circuit} circuit #{expected_validity ? 'valid' : 'invalid'}" do
      actual_output = DLUnit.run "#{circuit}.#{simulator} testFile" # obviously I'm making this part up
      expect(actual_output).to include(expected_validity ? 'passed' : 'failures')
    end
  end
end
The test code itself is a little involved, but you only have to write it once, and the RSpec output should be clear.
Not much of an upgrade on what you already have, but how about merging everything into one scenario outline?
To add a new simulator you only need to add two rows to a single Examples table. You can also make it more configurable, e.g. using a different valid test file or a different result message for particular simulators. The same could be done with your existing solution, although you would have to change both the steps and the examples.
Scenario Outline: Testing circuit
When I run <kind> DLUnit with "<circuit>.<type> testFile"
Then I should see <result> on stdout
Examples:
| type | kind | circuit | result |
| jls | valid | testCircuit1 | All tests (4) passed |
| jls | invalid | brokenCircuit1 | There were failures |
| circ | valid | testCircuit1 | All tests (4) passed |
| circ | invalid | brokenCircuit1 | There were failures |
| v | valid | testCircuit1 | All tests (4) passed |
| v | invalid | brokenCircuit1 | There were failures |
I want to parse the compute zones from the OpenStack command output below
+-----------------------+----------------------------------------+
| Name                  | Status                                 |
+-----------------------+----------------------------------------+
| internal              | available                              |
| |- controller         |                                        |
| | |- nova-conductor   | enabled :-) 2016-07-07T08:09:57.000000 |
| | |- nova-consoleauth | enabled :-) 2016-07-07T08:10:01.000000 |
| | |- nova-scheduler   | enabled :-) 2016-07-07T08:10:00.000000 |
| | |- nova-cert        | enabled :-) 2016-07-07T08:10:00.000000 |
| Compute01             | available                              |
| |- compute01          |                                        |
| | |- nova-compute     | enabled :-) 2016-07-07T08:09:53.000000 |
| Compute02             | available                              |
| |- compute02          |                                        |
| | |- nova-compute     | enabled :-) 2016-07-07T08:10:00.000000 |
| nova                  | not available                          |
+-----------------------+----------------------------------------+
I want to parse the result as below, taking only the nodes that have nova-compute:
Compute01;Compute02
I used the command below:
nova availability-zone-list | awk 'NR>2 {print $2}' | grep -v '|' | tr '\n' ';'
but it returns output like this
;internal;Compute01;Compute02;nova;;
In Perl (and written rather more verbosely than is really necessary):
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $node;          # Store current node name
my @compute_nodes; # Store known nova-compute nodes

while (<>) { # Read from STDIN
    # If we find the start of line, followed by a pipe, a space and
    # a series of word characters...
    if (/^\| (\w+)/) {
        # Store the series of word characters (i.e. the node name) in $node
        $node = $1;
    }

    # If we find a line that contains "nova-compute", add the current
    # node name to @compute_nodes
    push @compute_nodes, $node if /nova-compute/;
}

# Print out all of the values in @compute_nodes
say join ';', @compute_nodes;
I detest one-line programs except for the most simple of applications. They are unnecessarily cryptic, they have none of the usual programming support, and they are stored only in the terminal buffer. Want to do the same thing tomorrow? You must start coding again.
Here's a Perl solution. Run it as
$ perl nova-compute.pl command-output.txt
use strict;
use warnings 'all';
my ($node, @nodes);

while ( <> ) {
    $node = $1 if /^ \| \s* (\w+) /x;
    push @nodes, $node if /nova-compute/;
}

print join(';', @nodes), "\n";
output
Compute01;Compute02
Now all of that is saved on disk. It may be run again at any time, modified for similar results, or fixed if you got it wrong. It is also readable. No contest.
$ nova availability-zone-list | awk '/^[|] [^|]/{node=$2} node && /nova-compute/ {s=s ";" node} END{print substr(s,2)}'
Compute01;Compute02
How it works:
/^[|] [^|]/{node=$2}
Any time a line begins with | followed by a space followed by a character other than |, save the second field as the node name.
node && /nova-compute/ {s=s ";" node}
If node is non-empty and the current line contains nova-compute, then append node to the string s.
END{print substr(s,2)}
After we have read all the lines, print out string s minus its first character which is a superfluous ;.
Anyone know of a command-line CSV viewer for Linux/OS X? I'm thinking of something like less but that spaces out the columns in a more readable way. (I'd be fine with opening it with OpenOffice Calc or Excel, but that's way too overpowered for just looking at the data like I need to.) Having horizontal and vertical scrolling would be great.
You can also use this:
column -s, -t < somefile.csv | less -#2 -N -S
column is a standard unix program that is very convenient -- it finds the appropriate width of each column, and displays the text as a nicely formatted table. (In the less invocation, -S chops long lines instead of wrapping them, -N shows line numbers, and -#2 sets the horizontal scroll step to 2 positions.)
Note: whenever you have empty fields, you need to put some kind of placeholder in it, otherwise the column gets merged with following columns. The following example demonstrates how to use sed to insert a placeholder:
$ cat data.csv
1,2,3,4,5
1,,,,5

$ column -s, -t < data.csv
1  2  3  4  5
1  5

$ sed 's/,,/, ,/g;s/,,/, ,/g' data.csv | column -s, -t
1  2  3  4  5
1           5
Note that the substitution of ,, with , , is done twice. If you do it only once, 1,,,4 will become 1, ,,4, since the second comma has already been consumed by the first match.
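To see why the second pass is needed, here is the substitution applied once and then twice to the 1,,,4 example:
$ printf '1,,,4\n' | sed 's/,,/, ,/g'
1, ,,4
$ printf '1,,,4\n' | sed 's/,,/, ,/g;s/,,/, ,/g'
1, , ,4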
You can install csvtool (on Ubuntu) via
sudo apt-get install csvtool
and then run:
csvtool readable filename | view -
This will make it nice and pretty inside of a read-only vim instance, even if you have some cells with very long values.
Have a look at csvkit. It provides a set of tools that adhere to the UNIX philosophy (meaning they are small, simple, single-purposed and can be combined).
Here is an example that extracts the ten most populated cities in Germany from the free Maxmind World Cities database and displays the result in a console-readable format:
$ csvgrep -e iso-8859-1 -c 1 -m "de" worldcitiespop | csvgrep -c 5 -r "\d+" |
    csvsort -r -c 5 -l | csvcut -c 1,2,4,6 | head -n 11 | csvlook
---------------------------------------------------
| line_number | Country | AccentCity | Population |
---------------------------------------------------
| 1           | de      | Berlin     | 3398362    |
| 2           | de      | Hamburg    | 1733846    |
| 3           | de      | Munich     | 1246133    |
| 4           | de      | Cologne    | 968823     |
| 5           | de      | Frankfurt  | 648034     |
| 6           | de      | Dortmund   | 594255     |
| 7           | de      | Stuttgart  | 591688     |
| 8           | de      | Düsseldorf | 577139     |
| 9           | de      | Essen      | 576914     |
| 10          | de      | Bremen     | 546429     |
---------------------------------------------------
Csvkit is platform independent because it is written in Python.
Tabview: a lightweight Python curses command-line CSV file viewer (which can also display other tabular Python data, like a list of lists) is here on GitHub
Features:
Python 2.7+, 3.x
Unicode support
Spreadsheet-like view for easily visualizing tabular data
Vim-like navigation (h,j,k,l, g(top), G(bottom), 12G goto line 12, m - mark,
' - goto mark, etc.)
Toggle persistent header row
Dynamically resize column widths and gap
Sort ascending or descending by any column. 'Natural' order sort for numeric values.
Full-text search, n and p to cycle between search results
'Enter' to view the full cell contents
Yank cell contents to clipboard
F1 or ? for keybindings
Can also use from python command line to visualize any tabular data (e.g.
list-of-lists)
If you're a vimmer, use the CSV plugin, which is juuust beautiful.
The nodejs package tecfu/tty-table can be globally installed to do precisely this:
apt-get install nodejs
npm i -g tty-table
cat data.csv | tty-table
It can also handle streams.
For more info, see the docs for terminal usage here.
xsv is more than a viewer. I recommend it for most CSV tasks on the command line, especially when dealing with large datasets.
I used pisswillis's answer for a long time.
csview()
{
    local file="$1"
    sed "s/,/\t/g" "$file" | less -S
}
But then I combined it with some code I found at http://chrisjean.com/2011/06/17/view-csv-data-from-the-command-line which works better for me:
csview()
{
    local file="$1"
    cat "$file" | sed -e 's/,,/, ,/g' | column -s, -t | less -#5 -N -S
}
The reason it works better for me is that it handles wide columns better.
Ofri's answer gives you everything you asked for.
But.. if you don't want to remember the command you can add this to your ~/.bashrc (or equivalent):
csview()
{
    local file="$1"
    sed "s/,/\t/g" "$file" | less -S
}
This is exactly the same as Ofri's answer, except I have wrapped it in a shell function and am using the less -S option to stop the wrapping of lines (which makes less behave more like office/oocalc).
Open a new shell (or type source ~/.bashrc in your current shell) and run the command using:
csview <filename>
Here's a (probably too) simple option:
sed "s/,/\t/g" filename.csv | less
Yet another multi-functional tool for manipulating CSV (and not only CSV): Miller. From its own description, it is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. (Link to the GitHub repository: https://github.com/johnkerl/miller)
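For the viewing use case asked about here, a minimal sketch would be something like the following (assuming mlr is installed; data.csv is just an example filename for a CSV with a header row):
mlr --icsv --opprint cat data.csv | less -S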
tblless in the Tabulator package wraps the unix column command, and also aligns numeric columns.
I've created tablign for these (and other) purposes. Install with
pip install tablign
and
$ cat test.csv
Header1,Header2,Header3
Pizza,Artichoke dip,Bob's Special of the Day
BLT,Ham on rye with the works,
$ tablign test.csv
Header1 , Header2 , Header3
Pizza , Artichoke dip , Bob's Special of the Day
BLT , Ham on rye with the works ,
It also works if the data is separated by something other than commas. Most importantly, it preserves the delimiters, so you can also use it to style your ASCII tables without sacrificing your [Markdown,CSV,LaTeX] syntax.
I wrote this csv_view.sh to format CSVs from the command line. It reads the entire file to figure out the optimal width of each column (it requires Perl, assumes there are no commas within fields, and also uses less):
#!/bin/bash
perl -we '
sub max( @ ) {
    my $max = shift;
    map { $max = $_ if $_ > $max } @_;
    return $max;
}

sub transpose( @ ) {
    my @matrix = @_;
    my $width  = scalar @{ $matrix[ 0 ] };
    my $height = scalar @matrix;
    return map { my $x = $_; [ map { $matrix[ $_ ][ $x ] } 0 .. $height - 1 ] } 0 .. $width - 1;
}

# Read all lines, as arrays of fields
my @lines = map { s/\r?\n$//; [ split /,/ ] } <>;

my $widths =
    # Build a pack expression based on column lengths
    join "",
    # For each column get the longest length plus 1
    map { 'A' . ( 1 + max map { length } @$_ ) }
    # Get arrays of columns
    transpose
    @lines
    ;

# Format all lines with pack
map { print pack( $widths, @$_ ) . "\n" } @lines;
' $1 | less -NS
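Assuming the script is saved as csv_view.sh and made executable (data.csv is just an example filename), you would run it as:
./csv_view.sh data.csv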
Tabview is really good. It worked with 200+ MB files that displayed nicely, where LibreOffice as well as the csv plugin in gvim were buggy.
The Anaconda version is available here: https://anaconda.org/bioconda/tabview
Using TxtSushi you can do:
csvtopretty filename.csv | less -S
I wrote a script, viewtab, in Groovy for just this purpose. You invoke it like:
viewtab filename.csv
It is basically a super-lightweight spreadsheet that can be invoked from the command line, handles CSV and tab separated files, can read VERY large files that Excel and Numbers choke on, and is very fast. It's not command-line in the sense of being text-only, but it is platform independent and will probably fit the bill for many people looking for a solution to the problem of quickly inspecting many or large CSV files while working in a command line environment.
The script and how to install it are described here:
http://bayesianconspiracy.blogspot.com/2012/06/quick-csvtab-file-viewer.html
There's this short command-line script in Python: https://github.com/rgrp/csv2ascii/blob/master/csv2ascii.py
Just download and place in your path. Usage is like
csv2ascii.py [options] csv-file-path
Convert csv file at csv-file-path to ascii form returning the result on
stdout. If csv-file-path = '-' then read from stdin.
Options:
-h, --help show this help message and exit
-w WIDTH, --width=WIDTH
Width of ascii output
-c COLUMNS, --columns=COLUMNS
Only display this number of columns