NAME
Miller -- like awk, sed, cut, join, and sort for name-indexed data such as CSV and tabular JSON.
SYNOPSIS
Usage: mlr [flags] {verb} [verb-dependent options ...] {zero or more file names}
If zero file names are provided, standard input is read, e.g.
  mlr --csv sort -f shape example.csv
Output of one verb may be chained as input to another using "then", e.g.
  mlr --csv stats1 -a min,mean,max -f quantity then sort -f color example.csv
Please see 'mlr help topics' for more information. Please also see
https://miller.readthedocs.io
DESCRIPTION
Miller operates on key-value-pair data while the familiar Unix tools operate on integer-indexed fields: if the natural data structure for the latter is the array, then Miller’s natural data structure is the insertion-ordered hash map. This encompasses a variety of data formats, including but not limited to the familiar CSV, TSV, and JSON. (Miller can handle positionally-indexed data as a special case.) This manpage documents mlr 6.12.0.
EXAMPLES
mlr --icsv --opprint cat example.csv
mlr --icsv --opprint sort -f shape example.csv
mlr --icsv --opprint sort -f shape -nr index example.csv
mlr --icsv --opprint cut -f flag,shape example.csv
mlr --csv filter '$color == "red"' example.csv
mlr --icsv --ojson put '$ratio = $quantity / $rate' example.csv
mlr --icsv --opprint --from example.csv sort -nr index then cut -f shape,quantity
FILE FORMATS
CSV/CSV-lite: comma-separated values with separate header line
TSV: same but with tabs in place of commas
+---------------------+
| apple,bat,cog       |
| 1,2,3               | Record 1: "apple":"1", "bat":"2", "cog":"3"
| 4,5,6               | Record 2: "apple":"4", "bat":"5", "cog":"6"
+---------------------+
JSON (array of objects):
+---------------------+
| [                   |
| {                   |
|  "apple": 1,        | Record 1: "apple":"1", "bat":"2", "cog":"3"
|  "bat": 2,          |
|  "cog": 3           |
| },                  |
| {                   |
|  "dish": {          | Record 2: "dish.egg":"7", "dish.flint":"8", "garlic":""
|   "egg": 7,         |
|   "flint": 8        |
|  },                 |
|  "garlic": ""       |
| }                   |
| ]                   |
+---------------------+
JSON Lines (sequence of one-line objects):
+------------------------------------------------+
| {"apple": 1, "bat": 2, "cog": 3}               |
| {"dish": {"egg": 7, "flint": 8}, "garlic": ""} |
+------------------------------------------------+
  Record 1: "apple":"1", "bat":"2", "cog":"3"
  Record 2: "dish.egg":"7", "dish.flint":"8", "garlic":""
PPRINT: pretty-printed tabular
+---------------------+
| apple bat cog       |
| 1     2   3         | Record 1: "apple":"1", "bat":"2", "cog":"3"
| 4     5   6         | Record 2: "apple":"4", "bat":"5", "cog":"6"
+---------------------+
Markdown tabular:
+-----------------------+
| | apple | bat | cog | |
| | --- | --- | --- |   |
| | 1 | 2 | 3 |         | Record 1: "apple":"1", "bat":"2", "cog":"3"
| | 4 | 5 | 6 |         | Record 2: "apple":"4", "bat":"5", "cog":"6"
+-----------------------+
XTAB: pretty-printed transposed tabular
+---------------------+
| apple 1             | Record 1: "apple":"1", "bat":"2", "cog":"3"
| bat   2             |
| cog   3             |
|                     |
| dish 7              | Record 2: "dish":"7", "egg":"8"
| egg  8              |
+---------------------+
DKVP: delimited key-value pairs (Miller default format)
+---------------------+
| apple=1,bat=2,cog=3 | Record 1: "apple":"1", "bat":"2", "cog":"3"
| dish=7,egg=8,flint  | Record 2: "dish":"7", "egg":"8", "3":"flint"
+---------------------+
NIDX: implicitly numerically indexed (Unix-toolkit style)
+---------------------+
| the quick brown     | Record 1: "1":"the", "2":"quick", "3":"brown"
| fox jumped          | Record 2: "1":"fox", "2":"jumped"
+---------------------+
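Since DKVP is Miller's default format, a quick way to see these formats in
action is to pipe a DKVP line through mlr with and without an output-format
flag (a minimal illustration; any of the format flags below may be swapped in):
  echo 'x=1,y=2' | mlr cat
  echo 'x=1,y=2' | mlr --ojson cat
The first passes the record through unchanged; the second re-renders it as JSON.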
HELP OPTIONS
Type 'mlr help {topic}' for any of the following:
Essentials:
mlr help topics
mlr help basic-examples
mlr help file-formats
Flags:
mlr help flags
mlr help flag
mlr help list-separator-aliases
mlr help list-separator-regex-aliases
mlr help comments-in-data-flags
mlr help compressed-data-flags
mlr help csv/tsv-only-flags
mlr help file-format-flags
mlr help flatten-unflatten-flags
mlr help format-conversion-keystroke-saver-flags
mlr help json-only-flags
mlr help legacy-flags
mlr help miscellaneous-flags
mlr help output-colorization-flags
mlr help pprint-only-flags
mlr help profiling-flags
mlr help separator-flags
Verbs:
mlr help list-verbs
mlr help usage-verbs
mlr help verb
Functions:
mlr help list-functions
mlr help list-function-classes
mlr help list-functions-in-class
mlr help usage-functions
mlr help usage-functions-by-class
mlr help function
Keywords:
mlr help list-keywords
mlr help usage-keywords
mlr help keyword
Other:
mlr help auxents
mlr help terminals
mlr help mlrrc
mlr help output-colorization
mlr help type-arithmetic-info
mlr help type-arithmetic-info-extended
Shorthands:
mlr -g = mlr help flags
mlr -l = mlr help list-verbs
mlr -L = mlr help usage-verbs
mlr -f = mlr help list-functions
mlr -F = mlr help usage-functions
mlr -k = mlr help list-keywords
mlr -K = mlr help usage-keywords
Lastly, 'mlr help ...' will search for your exact text '...' using the
sources of 'mlr help flag', 'mlr help verb', 'mlr help function', and
'mlr help keyword'. Use 'mlr help find ...' for approximate (substring)
matches, e.g. 'mlr help find map' for all things with "map" in their names.
VERB LIST
altkv bar bootstrap case cat check clean-whitespace count-distinct count
count-similar cut decimate fill-down fill-empty filter flatten format-values
fraction gap grep group-by group-like gsub having-fields head histogram
json-parse json-stringify join label latin1-to-utf8 least-frequent
merge-fields most-frequent nest nothing put regularize remove-empty-columns
rename reorder repeat reshape sample sec2gmtdate sec2gmt seqgen shuffle
skip-trivial-records sort sort-within-records sparsify split ssub stats1
stats2 step sub summary tac tail tee template top utf8-to-latin1 unflatten
uniq unspace unsparsify
FUNCTION LIST
abs acos acosh antimode any append apply arrayify asin asinh asserting_absent
asserting_array asserting_bool asserting_boolean asserting_empty
asserting_empty_map asserting_error asserting_float asserting_int
asserting_map asserting_nonempty_map asserting_not_array asserting_not_empty
asserting_not_map asserting_not_null asserting_null asserting_numeric
asserting_present asserting_string atan atan2 atanh bitcount boolean
capitalize cbrt ceil clean_whitespace collapse_whitespace concat contains cos
cosh count depth dhms2fsec dhms2sec distinct_count erf erfc every exec exp
expm1 flatten float floor fmtifnum fmtnum fold format fsec2dhms fsec2hms
get_keys get_values gmt2localtime gmt2nsec gmt2sec gssub gsub haskey hexfmt
hms2fsec hms2sec hostname index int invqnorm is_absent is_array is_bool
is_boolean is_empty is_empty_map is_error is_float is_int is_map is_nan
is_nonempty_map is_not_array is_not_empty is_not_map is_not_null is_null
is_numeric is_present is_string joink joinkv joinv json_parse json_stringify
kurtosis latin1_to_utf8 leafcount leftpad length localtime2gmt localtime2nsec
localtime2sec log log10 log1p logifit lstrip madd mapdiff mapexcept mapselect
mapsum max maxlen md5 mean meaneb median mexp min minlen mmul mode msub
nsec2gmt nsec2gmtdate nsec2localdate nsec2localtime null_count os percentile
percentiles pow qnorm reduce regextract regextract_or_else rightpad round
roundm rstrip sec2dhms sec2gmt sec2gmtdate sec2hms sec2localdate sec2localtime
select sgn sha1 sha256 sha512 sin sinh skewness sort sort_collection splita
splitax splitkv splitkvx splitnv splitnvx sqrt ssub stddev strfntime
strfntime_local strftime strftime_local string strip strlen strmatch strmatchx
strpntime strpntime_local strptime strptime_local sub substr substr0 substr1
sum sum2 sum3 sum4 sysntime system systime systimeint tan tanh tolower toupper
truncate typeof unflatten unformat unformatx upntime uptime urand urand32
urandelement urandint urandrange utf8_to_latin1 variance version
! != !=~ % & && * ** + - . .* .+ .- ./ / // < << <= <=> == =~ > >= >> >>>
?: ?? ??? ^ ^^ | || ~
COMMENTS-IN-DATA FLAGS
Miller lets you put comments in your data, such as
  # This is a comment for a CSV file
  a,b,c
  1,2,3
  4,5,6
Notes:
* Comments are only honored at the start of a line.
* In the absence of any of the below four options, comments are data like any
  other text. (The comments-in-data feature is opt-in.)
* When '--pass-comments' is used, comment lines are written to standard
  output immediately upon being read; they are not part of the record stream.
  Results may be counterintuitive. A suggestion is to place comments at the
  start of data files.
--pass-comments      Immediately print commented lines (prefixed by '#')
                     within the input.
--pass-comments-with {string}
                     Immediately print commented lines within input, with
                     specified prefix.
--skip-comments      Ignore commented lines (prefixed by '#') within the
                     input.
--skip-comments-with {string}
                     Ignore commented lines within input, with specified
                     prefix.
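As an illustration, assuming the five lines above are saved to a file named
commented.csv (a hypothetical name), either of the following reads the three
data records while handling the comment line:
  mlr --csv --skip-comments cat commented.csv
  mlr --csv --pass-comments cat commented.csv
The first drops the comment; the second prints it to standard output before
the record stream.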
COMPRESSED-DATA FLAGS
Miller offers a few different ways to handle reading data files which have
been compressed.
* Decompression done within the Miller process itself: '--bz2in' '--gzin'
  '--zin' '--zstdin'
* Decompression done outside the Miller process: '--prepipe' '--prepipex'
Using '--prepipe' and '--prepipex' you can specify an action to be taken on
each input file. The prepipe command must be able to read from standard
input; it will be invoked with '{command} < {filename}'. The prepipex command
must take a filename as argument; it will be invoked with
'{command} {filename}'.
Examples:
  mlr --prepipe gunzip
  mlr --prepipe zcat -cf
  mlr --prepipe xz -cd
  mlr --prepipe cat
Note that this feature is quite general and is not limited to decompression
utilities. You can use it to apply per-file filters of your choice. For
output compression (or other) utilities, simply pipe the output:
'mlr ... | {your compression command} > outputfilenamegoeshere'
Lastly, note that if '--prepipe' or '--prepipex' is specified, it replaces
any decisions that might have been made based on the file suffix. Likewise,
'--gzin'/'--bz2in'/'--zin'/'--zstdin' are ignored if '--prepipe' is also
specified.
--bz2in              Uncompress bzip2 within the Miller process. Done by
                     default if file ends in '.bz2'.
--gzin               Uncompress gzip within the Miller process. Done by
                     default if file ends in '.gz'.
--prepipe {decompression command}
                     You can, of course, already do without this for single
                     input files, e.g. 'gunzip < myfile.csv.gz | mlr ...'.
                     Allowed at the command line, but not in '.mlrrc' to
                     avoid unexpected code execution.
--prepipe-bz2        Same as '--prepipe bz2', except this is allowed in
                     '.mlrrc'.
--prepipe-gunzip     Same as '--prepipe gunzip', except this is allowed in
                     '.mlrrc'.
--prepipe-zcat       Same as '--prepipe zcat', except this is allowed in
                     '.mlrrc'.
--prepipe-zstdcat    Same as '--prepipe zstdcat', except this is allowed in
                     '.mlrrc'.
--prepipex {decompression command}
                     Like '--prepipe' with one exception: doesn't insert '<'
                     between command and filename at runtime. Useful for some
                     commands like 'unzip -qc' which don't read standard
                     input. Allowed at the command line, but not in '.mlrrc'
                     to avoid unexpected code execution.
--zin                Uncompress zlib within the Miller process. Done by
                     default if file ends in '.z'.
--zstdin             Uncompress zstd within the Miller process. Done by
                     default if file ends in '.zstd'.
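As an illustration, assuming a gzip-compressed file named example.csv.gz, the
following are equivalent, the first decompressing within the Miller process
and the second via an external command:
  mlr --csv --gzin cat example.csv.gz
  mlr --csv --prepipe gunzip cat example.csv.gz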
CSV/TSV-ONLY FLAGS
These are flags which are applicable to CSV format.
--allow-ragged-csv-input or --ragged or --allow-ragged-tsv-input
                     If a data line has fewer fields than the header line,
                     fill remaining keys with empty string. If a data line
                     has more fields than the header line, use integer field
                     labels as in the implicit-header case.
--csv-trim-leading-space
                     Trims leading spaces in CSV data. Use this for data like
                     '"foo", "bar"' which is non-RFC-4180 compliant, but
                     common.
--headerless-csv-output or --ho or --headerless-tsv-output
                     Print only CSV/TSV data lines; do not print CSV/TSV
                     header lines.
--implicit-csv-header or --headerless-csv-input or --hi or --implicit-tsv-header
                     Use 1,2,3,... as field labels, rather than from line 1
                     of input files. Tip: combine with 'label' to recreate
                     missing headers.
--lazy-quotes        Accepts quotes appearing in unquoted fields, and
                     non-doubled quotes appearing in quoted fields.
--no-implicit-csv-header or --no-implicit-tsv-header
                     Opposite of '--implicit-csv-header'. This is the default
                     anyway -- the main use is for the flags to 'mlr join' if
                     you have main file(s) which are headerless but you want
                     to join in on a file which does have a CSV/TSV header.
                     Then you could use 'mlr --csv --implicit-csv-header join
                     --no-implicit-csv-header -l your-join-in-with-header.csv
                     ... your-headerless.csv'.
--quote-all          Force double-quoting of CSV fields.
-N                   Keystroke-saver for '--implicit-csv-header
                     --headerless-csv-output'.
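For example, assuming a file ragged.csv (hypothetical) whose header line is
a,b,c but which contains a data line with only two fields, the following
fills the missing c with the empty string rather than erroring:
  mlr --csv --allow-ragged-csv-input cat ragged.csv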
FILE-FORMAT FLAGS
See the File formats doc page, and/or 'mlr help file-formats', for more about
file formats Miller supports.
Examples: '--csv' for CSV-formatted input and output; '--icsv --opprint' for
CSV-formatted input and pretty-printed output.
Please use '--iformat1 --oformat2' rather than '--format1 --oformat2'. The
latter sets up input and output flags for 'format1', not all of which are
overridden in all cases by setting output format to 'format2'.
--asv or --asvlite   Use ASV format for input and output data.
--csv or -c          Use CSV format for input and output data.
--csvlite            Use CSV-lite format for input and output data.
--dkvp               Use DKVP format for input and output data.
--gen-field-name     Specify field name for --igen. Defaults to "i".
--gen-start          Specify start value for --igen. Defaults to 1.
--gen-step           Specify step value for --igen. Defaults to 1.
--gen-stop           Specify stop value for --igen. Defaults to 100.
--iasv or --iasvlite Use ASV format for input data.
--icsv               Use CSV format for input data.
--icsvlite           Use CSV-lite format for input data.
--idkvp              Use DKVP format for input data.
--igen               Ignore input files and instead generate sequential
                     numeric input using --gen-field-name, --gen-start,
                     --gen-step, and --gen-stop values. See also the seqgen
                     verb, which is more useful/intuitive.
--ijson              Use JSON format for input data.
--ijsonl             Use JSON Lines format for input data.
--imd or --imarkdown Use markdown-tabular format for input data.
--inidx              Use NIDX format for input data.
--io {format name}   Use format name for input and output data. For example:
                     '--io csv' is the same as '--csv'.
--ipprint            Use PPRINT format for input data.
--itsv               Use TSV format for input data.
--itsvlite           Use TSV-lite format for input data.
--iusv or --iusvlite Use USV format for input data.
--ixtab              Use XTAB format for input data.
--json or -j         Use JSON format for input and output data.
--jsonl              Use JSON Lines format for input and output data.
--nidx               Use NIDX format for input and output data.
--oasv or --oasvlite Use ASV format for output data.
--ocsv               Use CSV format for output data.
--ocsvlite           Use CSV-lite format for output data.
--odkvp              Use DKVP format for output data.
--ojson              Use JSON format for output data.
--ojsonl             Use JSON Lines format for output data.
--omd or --omarkdown Use markdown-tabular format for output data.
--onidx              Use NIDX format for output data.
--opprint            Use PPRINT format for output data.
--otsv               Use TSV format for output data.
--otsvlite           Use TSV-lite format for output data.
--ousv or --ousvlite Use USV format for output data.
--oxtab              Use XTAB format for output data.
--pprint             Use PPRINT format for input and output data.
--tsv or -t          Use TSV format for input and output data.
--tsvlite            Use TSV-lite format for input and output data.
--usv or --usvlite   Use USV format for input and output data.
--xtab               Use XTAB format for input and output data.
--xvright            Right-justify values for XTAB format.
-i {format name}     Use format name for input data. For example: '-i csv' is
                     the same as '--icsv'.
-o {format name}     Use format name for output data. For example: '-o csv'
                     is the same as '--ocsv'.
FLATTEN-UNFLATTEN FLAGS
These flags control how Miller converts record values which are maps or arrays, when input is JSON and output is non-JSON (flattening) or input is non-JSON and output is JSON (unflattening).
See the Flatten/unflatten doc page for more information.
--flatsep or --jflatsep {string}
                     Separator for flattening multi-level JSON keys, e.g.
                     '{"a":{"b":3}}' becomes 'a.b => 3' for non-JSON formats.
                     Defaults to '.'.
--no-auto-flatten    When output is non-JSON, suppress the default
                     auto-flatten behavior. Default: if '$y = [7,8,9]' then
                     this flattens to 'y.1=7,y.2=8,y.3=9', and similarly for
                     maps. With '--no-auto-flatten', instead we get
                     '$y=[7, 8, 9]'.
--no-auto-unflatten  When input is non-JSON and output is JSON, suppress the
                     default auto-unflatten behavior. Default: if the input
                     has 'y.1=7,y.2=8,y.3=9' then this unflattens to
                     '$y=[7,8,9]'. With '--no-auto-unflatten', instead we get
                     '${y.1}=7,${y.2}=8,${y.3}=9'.
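As an illustration, assuming a file nested.json (hypothetical) containing
'{"a": {"b": 3}}':
  mlr --ijson --ocsv cat nested.json
auto-flattens to CSV with header 'a.b' and data line '3', while adding
'--no-auto-flatten' suppresses this and emits the map value JSON-encoded
instead.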
FORMAT-CONVERSION KEYSTROKE-SAVER FLAGS
As keystroke-savers for format-conversion you may use the following. The
letters c, t, j, l, d, n, x, p, and m refer to formats CSV, TSV, JSON,
JSON Lines, DKVP, NIDX, XTAB, PPRINT, and markdown, respectively.
| In\out   | CSV   | TSV   | JSON  | JSONL | DKVP  | NIDX  | XTAB  | PPRINT | Markdown |
+----------+-------+-------+-------+-------+-------+-------+-------+--------+----------+
| CSV      |       | --c2t | --c2j | --c2l | --c2d | --c2n | --c2x | --c2p  | --c2m    |
| TSV      | --t2c |       | --t2j | --t2l | --t2d | --t2n | --t2x | --t2p  | --t2m    |
| JSON     | --j2c | --j2t |       | --j2l | --j2d | --j2n | --j2x | --j2p  | --j2m    |
| JSONL    | --l2c | --l2t |       |       | --l2d | --l2n | --l2x | --l2p  | --l2m    |
| DKVP     | --d2c | --d2t | --d2j | --d2l |       | --d2n | --d2x | --d2p  | --d2m    |
| NIDX     | --n2c | --n2t | --n2j | --n2l | --n2d |       | --n2x | --n2p  | --n2m    |
| XTAB     | --x2c | --x2t | --x2j | --x2l | --x2d | --x2n |       | --x2p  | --x2m    |
| PPRINT   | --p2c | --p2t | --p2j | --p2l | --p2d | --p2n | --p2x |        | --p2m    |
| Markdown | --m2c | --m2t | --m2j | --m2l | --m2d | --m2n | --m2x | --m2p  |          |
-p  Keystroke-saver for '--nidx --fs space --repifs'.
-T  Keystroke-saver for '--nidx --fs tab'.
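For example, '--c2p' is shorthand for '--icsv --opprint', so these two
commands are equivalent:
  mlr --c2p cat example.csv
  mlr --icsv --opprint cat example.csv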
JSON-ONLY FLAGS
These are flags which are applicable to JSON output format.
--jlistwrap or --jl  Wrap JSON output in outermost '[ ]'. This is the default
                     for JSON output format.
--jvquoteall         Force all JSON values -- recursively into lists and
                     objects -- to string.
--jvstack            Put one key-value pair per line for JSON output
                     (multi-line output). This is the default for JSON output
                     format.
--no-jlistwrap       Do not wrap JSON output in outermost '[ ]'. This is the
                     default for JSON Lines output format.
--no-jvstack         Put objects/arrays all on one line for JSON output. This
                     is the default for JSON Lines output format.
LEGACY FLAGS
These are flags which don't do anything in the current Miller version. They
are accepted as no-op flags in order to keep old scripts from breaking.
--jknquoteint        Type information from JSON input files is now preserved
                     throughout the processing stream.
--jquoteall          Type information from JSON input files is now preserved
                     throughout the processing stream.
--json-fatal-arrays-on-input
                     Miller now supports arrays as of version 6.
--json-map-arrays-on-input
                     Miller now supports arrays as of version 6.
--json-skip-arrays-on-input
                     Miller now supports arrays as of version 6.
--jsonx              The '--jvstack' flag is now default true in Miller 6.
--mmap               Miller no longer uses memory-mapping to access data
                     files.
--no-mmap            Miller no longer uses memory-mapping to access data
                     files.
--ojsonx             The '--jvstack' flag is now default true in Miller 6.
--quote-minimal      Ignored as of version 6. Types are inferred/retained
                     through the processing flow now.
--quote-none         Ignored as of version 6. Types are inferred/retained
                     through the processing flow now.
--quote-numeric      Ignored as of version 6. Types are inferred/retained
                     through the processing flow now.
--quote-original     Ignored as of version 6. Types are inferred/retained
                     through the processing flow now.
--vflatsep           Ignored as of version 6. This functionality is subsumed
                     into JSON formatting.
MISCELLANEOUS FLAGS
These are flags which don't fit into any other category.
--fflush Force buffered output to be written after every
output record. The default is flush output after
every record if the output is to the terminal, or
less often if the output is to a file or a pipe. The
default is a significant performance optimization for
large files. Use this flag to force frequent updates
even when output is to a pipe or file, at a
performance cost.
--files {filename}   Use this to specify a file which itself contains, one
                     per line, names of input files. May be used more than
                     once.
--from {filename}    Use this to specify an input file before the verb(s),
                     rather than after. May be used more than once. Example:
                     'mlr --from a.dat --from b.dat cat' is the same as
                     'mlr cat a.dat b.dat'.
--hash-records       This is an internal parameter which normally does not
                     need to be modified. It controls the mechanism by which
                     Miller accesses fields within records. In general
                     --no-hash-records is faster, and is the default. For
                     specific use-cases involving data having many fields,
                     and many of them being processed during a given
                     processing run, --hash-records might offer a slight
                     performance benefit.
--infer-int-as-float or -A
                     Cast all integers in data files to floats.
--infer-none or -S   Don't treat values like 123 or 456.7 in data files as
                     int/float; leave them as strings.
--infer-octal or -O  Treat numbers like 0123 in data files as numeric;
                     default is string. Note that 00-07 etc. scan as int;
                     08-09 scan as float.
--load {filename}    Load DSL script file for all put/filter operations on
                     the command line. If the name following '--load' is a
                     directory, load all '*.mlr' files in that directory.
                     This is just like 'put -f' and 'filter -f' except it's
                     up-front on the command line, so you can do something
                     like 'alias mlr='mlr --load ~/myscripts'' if you like.
--mfrom {filenames}  Use this to specify one or more input files before the
                     verb(s), rather than after. May be used more than once.
                     The list of filenames must end with '--'. This is useful
                     for example since '--from *.csv' doesn't do what you
                     might hope but '--mfrom *.csv --' does.
--mload {filenames}  Like '--load' but works with more than one filename,
                     e.g. '--mload *.mlr --'.
--no-dedupe-field-names
                     By default, if an input record has a field named 'x' and
                     another also named 'x', the second will be renamed
                     'x_2', and so on. With this flag provided, the second
                     'x''s value will replace the first 'x''s value when the
                     record is read. This flag has no effect on JSON input
                     records, where duplicate keys always result in the last
                     one's value being retained.
--no-fflush          Let buffered output not be written after every output
                     record. The default is flush output after every record
                     if the output is to the terminal, or less often if the
                     output is to a file or a pipe. The default is a
                     significant performance optimization for large files.
                     Use this flag to allow less-frequent updates when output
                     is to the terminal. This is unlikely to be a noticeable
                     performance improvement, since direct-to-screen output
                     for large files has its own overhead.
--no-hash-records    See --hash-records.
--norc               Do not load a .mlrrc file.
--nr-progress-mod {m}
                     With m a positive integer: print filename and record
                     count to os.Stderr every m input records.
--ofmt {format}      E.g. '%.18f', '%.0f', '%9.6e'. Please use sprintf-style
                     codes (https://pkg.go.dev/fmt) for floating-point
                     numbers. If not specified, default formatting is used.
                     See also the 'fmtnum' function and the 'format-values'
                     verb.
--ofmte {n}          Use --ofmte 6 as shorthand for --ofmt %.6e, etc.
--ofmtf {n}          Use --ofmtf 6 as shorthand for --ofmt %.6f, etc.
--ofmtg {n}          Use --ofmtg 6 as shorthand for --ofmt %.6g, etc.
--records-per-batch {n}
                     This is an internal parameter for maximum number of
                     records in a batch size. Normally this does not need to
                     be modified, except when input is from 'tail -f'.
See also
https://miller.readthedocs.io/en/latest/reference-main-flag-list/.
--s-no-comment-strip {file name}
                     Take command-line flags from file name, like -s, but
                     with no comment-stripping. For more information please
                     see https://miller.readthedocs.io/en/latest/scripting/.
--seed {n}           With 'n' of the form '12345678' or '0xcafefeed'. For
                     'put'/'filter' 'urand', 'urandint', and 'urand32'.
--tz {timezone}      Specify timezone, overriding '$TZ' environment variable
                     (if any).
-I Process files in-place. For each file name on the
command line, output is written to a temp file in the
same directory, which is then renamed over the
original. Each file is processed in isolation: if the
output format is CSV, CSV headers will be present in
each output file, statistics are only over each
file’s own records; and so on.
-n                   Process no input files, nor standard input either.
                     Useful for 'mlr put' with 'begin'/'end' statements only.
                     (Same as '--from /dev/null'.) Also useful in
                     'mlr -n put -v '...'' for analyzing abstract syntax
                     trees (if that's your thing).
-s {file name}       Take command-line flags from file name. For more
                     information please see
                     https://miller.readthedocs.io/en/latest/scripting/.
-x                   If any record has an error value in it, report it and
                     stop the process. The default is to print the field
                     value as '(error)' and continue.
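As an illustration of '-n' with an end-block-only expression, the following
reads no input at all and simply prints 3:
  mlr -n put 'end { print 1 + 2 }'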
OUTPUT-COLORIZATION FLAGS
Miller uses colors to highlight outputs. You can specify color preferences.
Note: output colorization does not work on Windows.
Things having colors:
* Keys in CSV header lines, JSON keys, etc.
* Values in CSV data lines, JSON scalar values, etc. in regression-test
  output
* Some online-help strings
Rules for coloring:
* By default, colorize output only if writing to stdout and stdout is a TTY.
  * Example: color: 'mlr --csv cat foo.csv'
  * Example: no color: 'mlr --csv cat foo.csv > bar.csv'
  * Example: no color: 'mlr --csv cat foo.csv | less'
* The default colors were chosen since they look OK with white or black
  terminal background, and are differentiable with common varieties of human
  color vision.
Mechanisms for coloring:
* Miller uses ANSI escape sequences only. This does not work on Windows
  except within Cygwin.
* Requires 'TERM' environment variable to be set to non-empty string.
* Doesn't try to check to see whether the terminal is capable of 256-color
  ANSI vs 16-color ANSI. Note that if colors are in the range 0..15 then
  16-color ANSI escapes are used, so this is in the user's control.
How you can control colorization:
* Suppression/unsuppression:
  * Environment variable 'export MLR_NO_COLOR=true' means don't color even
    if stdout+TTY.
  * Environment variable 'export MLR_ALWAYS_COLOR=true' means do color even
    if not stdout+TTY. For example, you might want to use this when piping
    mlr output to 'less -r'.
  * Command-line flags '--no-color' or '-M', '--always-color' or '-C'.
* Color choices can be specified by using environment variables, or
  command-line flags, with values 0..255:
  * 'export MLR_KEY_COLOR=208', 'MLR_VALUE_COLOR=33', etc.:
    'MLR_KEY_COLOR' 'MLR_VALUE_COLOR' 'MLR_PASS_COLOR' 'MLR_FAIL_COLOR'
    'MLR_REPL_PS1_COLOR' 'MLR_REPL_PS2_COLOR' 'MLR_HELP_COLOR'
  * Command-line flags '--key-color 208', '--value-color 33', etc.:
    '--key-color' '--value-color' '--pass-color' '--fail-color'
    '--repl-ps1-color' '--repl-ps2-color' '--help-color'
  * This is particularly useful if your terminal's background color clashes
    with current settings.
If environment-variable settings and command-line flags are both provided,
the latter take precedence.
Colors can be specified using names such as "red" or "orchid": please see
'mlr --list-color-names' to see available names. They can also be specified
using numbers in the range 0..255, like 170: please see
'mlr --list-color-codes'. You can also use "bold", "underline", and/or
"reverse". Additionally, combinations of those can be joined with a "-",
like "red-bold", "bold-170", "bold-underline", etc.
--always-color or -C Instructs Miller to colorize output even when it
                     normally would not. Useful for piping output to
                     'less -r'.
--fail-color         Specify the color (see '--list-color-codes' and
                     '--list-color-names') for failing cases in
                     'mlr regtest'.
--help-color         Specify the color (see '--list-color-codes' and
                     '--list-color-names') for highlights in 'mlr help'
                     output.
--key-color          Specify the color (see '--list-color-codes' and
                     '--list-color-names') for record keys.
--list-color-codes   Show the available color codes in the range 0..255,
                     such as 170 for example.
--list-color-names   Show the names for the available color codes, such as
                     'orchid' for example.
--no-color or -M     Instructs Miller to not colorize any output.
--pass-color         Specify the color (see '--list-color-codes' and
                     '--list-color-names') for passing cases in
                     'mlr regtest'.
--value-color        Specify the color (see '--list-color-codes' and
                     '--list-color-names') for record values.
PPRINT-ONLY FLAGS
These are flags which are applicable to PPRINT format.
--barred or --barred-output
                     Prints a border around PPRINT output.
--barred-input       When used in conjunction with --pprint, accepts barred
                     input.
--right              Right-justifies all fields for PPRINT output.
PROFILING FLAGS
These are flags for profiling Miller performance.
--cpuprofile {CPU-profile file name}
                     Create a CPU-profile file for performance analysis.
                     Instructions will be printed to stderr. This flag must
                     be the very first thing after 'mlr' on the command line.
--time               Print elapsed execution time in seconds to stderr at the
                     end of the execution of the program.
--traceprofile       Create a trace-profile file for performance analysis.
                     Instructions will be printed to stderr. This flag must
                     be the very first thing after 'mlr' on the command line.
SEPARATOR FLAGS
See the Separators doc page for more about record separators, field
separators, and pair separators. Also see the File formats doc page, or
'mlr help file-formats', for more about the file formats Miller supports.
In brief:
* For DKVP records like 'x=1,y=2,z=3', the fields are separated by a comma,
  each key is separated from its value by an equals sign, and each record is
  separated from the next by a newline.
* Each file format has its own default separators.
* Most formats, such as CSV, don't support pair-separators: keys are on the
  CSV header line and values are on each CSV data line; keys and values are
  not placed next to one another.
* Some separators are not programmable: for example JSON uses a colon as a
  pair separator but this is non-modifiable in the JSON spec.
* You can set separators differently between Miller's input and output --
  hence '--ifs' and '--ofs', etc.
Notes about line endings:
* Default line endings ('--irs' and '--ors') are newline, which is
  interpreted to accept carriage-return/newline files (e.g. on Windows) for
  input, and to produce platform-appropriate line endings on output.
Notes about all other separators:
* IPS/OPS are only used for DKVP and XTAB formats, since only in these
  formats do key-value pairs appear juxtaposed.
* IRS/ORS are ignored for XTAB format. Nominally IFS and OFS are newlines;
  XTAB records are separated by two or more consecutive IFS/OFS -- i.e. a
  blank line. Everything above about '--irs/--ors/--rs auto' becomes
  '--ifs/--ofs/--fs auto' for XTAB format. (XTAB's default IFS/OFS are
  "auto".)
* OFS must be single-character for PPRINT format. This is because it is used
  with repetition for alignment; multi-character separators would make
  alignment impossible.
* OPS may be multi-character for XTAB format, in which case alignment is
  disabled.
* FS/PS are ignored for markdown format; RS is used.
* All FS and PS options are ignored for JSON format, since they are not
  relevant to the JSON format.
* You can specify separators in any of the following ways, shown by example:
  - Type them out, quoting as necessary for shell escapes, e.g.
    '--fs '|' --ips :'
  - C-style escape sequences, e.g. '--rs '\r\n' --fs '\t''.
  - To avoid backslashing, you can use any of the following names:
      ascii_esc  = "\x1b"
      ascii_etx  = "\x04"
      ascii_fs   = "\x1c"
      ascii_gs   = "\x1d"
      ascii_null = "\x01"
      ascii_rs   = "\x1e"
      ascii_soh  = "\x02"
      ascii_stx  = "\x03"
      ascii_us   = "\x1f"
      asv_fs     = "\x1f"
      asv_rs     = "\x1e"
      colon      = ":"
      comma      = ","
      cr         = "\r"
      crcr       = "\r\r"
      crlf       = "\r\n"
      crlfcrlf   = "\r\n\r\n"
      equals     = "="
      lf         = "\n"
      lflf       = "\n\n"
      newline    = "\n"
      pipe       = "|"
      semicolon  = ";"
      slash      = "/"
      space      = " "
      tab        = "\t"
      usv_fs     = "\xe2\x90\x9f"
      usv_rs     = "\xe2\x90\x9e"
  - Similarly, you can use the following for '--ifs-regex' and '--ips-regex':
      spaces     = "( )+"
      tabs       = "(\t)+"
      whitespace = "([ \t])+"
* Default separators by format:
    Format   FS    PS   RS
    csv      ","   N/A  "\n"
    csvlite  ","   N/A  "\n"
    dkvp     ","   "="  "\n"
    gen      ","   N/A  "\n"
    json     N/A   N/A  N/A
    markdown " "   N/A  "\n"
    nidx     " "   N/A  "\n"
    pprint   " "   N/A  "\n"
    tsv      "\t"  N/A  "\n"
    xtab     "\n"  " "  "\n\n"
--fs {string}        Specify FS for input and output.
--ifs {string}       Specify FS for input.
--ifs-regex {string} Specify FS for input as a regular expression.
--ips {string}       Specify PS for input.
--ips-regex {string} Specify PS for input as a regular expression.
--irs {string}       Specify RS for input.
--ofs {string}       Specify FS for output.
--ops {string}       Specify PS for output.
--ors {string}       Specify RS for output.
--ps {string}        Specify PS for input and output.
--repifs             Let IFS be repeated: e.g. for splitting on multiple
                     spaces.
--rs {string}        Specify RS for input and output.
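For example, assuming a file data.txt (hypothetical) whose fields are
separated by semicolons, the following reads it as NIDX and writes
tab-separated output, using the 'semicolon' and 'tab' aliases from the list
above:
  mlr --inidx --ifs semicolon --onidx --ofs tab cat data.txt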
AUXILIARY COMMANDS
Available entries:
  mlr aux-list
  mlr hex
  mlr lecat
  mlr termcvt
  mlr unhex
For more information, please invoke mlr {subcommand} --help.
MLRRC
You can set up personal defaults via a $HOME/.mlrrc and/or ./.mlrrc.
For example, if you usually process CSV, then you can put "--csv" in your
.mlrrc file and that will be the default input/output format unless otherwise
specified on the command line.
The .mlrrc file format is one "--flag" or "--option value" per line, with the
leading "--" optional. Hash-style comments and blank lines are ignored.
Sample .mlrrc:
  # Input and output formats are CSV by default (unless otherwise specified
  # on the mlr command line):
  csv
  # These are no-ops for CSV, but when I do use JSON output, I want these
  # pretty-printing options to be used:
  jvstack
  jlistwrap
How to specify location of .mlrrc:
* If $MLRRC is set:
  o If its value is "__none__" then no .mlrrc files are processed.
  o Otherwise, its value (as a filename) is loaded and processed. If there
    are syntax errors, they abort mlr with a usage message (as if you had
    mistyped something on the command line). If the file can't be loaded at
    all, though, it is silently skipped.
  o Any .mlrrc in your home directory or current directory is ignored
    whenever $MLRRC is set in the environment.
* Otherwise:
  o If $HOME/.mlrrc exists, it's then processed as above.
  o If ./.mlrrc exists, it's then also processed as above.
  (I.e. current-directory .mlrrc defaults are stacked over home-directory
  .mlrrc defaults.)
* The command-line flag "--norc" can be used to suppress loading the .mlrrc
  file even when other conditions are met.
See also: https://miller.readthedocs.io/en/latest/customization.html
REPL
Usage: mlr repl [options] {zero or more data-file names}
-v Prints the expression's AST (abstract syntax tree), which gives full
   transparency on the precedence and associativity rules of Miller's
   grammar, to stdout.
-d Like -v but uses a parenthesized-expression format for the AST.
-D Like -d but with output all on one line.
-w Show warnings about uninitialized variables.
-q Don't show startup banner.
-s Don't show prompts.
--load {DSL script file} Load script file before presenting the prompt. If
   the name following --load is a directory, load all "*.mlr" files in that
   directory.
--mload {DSL script files} -- Like --load but works with more than one
   filename, e.g. '--mload *.mlr --'.
-h|--help Show this message.
Or any --icsv, --ojson, etc. reader/writer options as for the main Miller
command line.
Any data-file names are opened just as if you had waited and typed
:open {filenames} at the Miller REPL prompt.
VERBS
altkv
Usage: mlr altkv [options]
Given fields with values of the form a,b,c,d,e,f emits a=b,c=d,e=f pairs.
Options:
-h|--help Show this message.
bar
Usage: mlr bar [options]
Replaces a numeric field with a number of asterisks, allowing for cheesy bar
plots. These align best with --opprint or --oxtab output format.
Options:
-f {a,b,c}     Field names to convert to bars.
--lo {lo}      Lower-limit value for min-width bar: default '0.000000'.
--hi {hi}      Upper-limit value for max-width bar: default '100.000000'.
-w {n}         Bar-field width: default '40'.
--auto         Automatically computes limits, ignoring --lo and --hi. Holds
               all records in memory before producing any output.
-c {character} Fill character: default '*'.
-x {character} Out-of-bounds character: default '#'.
-b {character} Blank character: default '.'.
Nominally the fill, out-of-bounds, and blank characters will be strings of
length 1. However you can make them all longer if you so desire.
-h|--help Show this message.
bootstrap
Usage: mlr bootstrap [options]
Emits an n-sample, with replacement, of the input records.
See also mlr sample and mlr shuffle.
Options:
-n Number of samples to output. Defaults to number of input records. Must be
   non-negative.
-h|--help Show this message.
case
Usage: mlr case [options]
Uppercases, lowercases, sentence-cases, or title-cases strings in record keys
and/or values.
Options:
-k Case only keys, not keys and values.
-v Case only values, not keys and values.
-f {a,b,c} Specify which field names to case (default: all)
-u Convert to uppercase
-l Convert to lowercase
-s Convert to sentence case (capitalize first letter)
-t Convert to title case (capitalize words)
-h|--help Show this message.
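For example, to uppercase only the values (not the keys) of the shape field
in example.csv:
  mlr --icsv --opprint case -u -v -f shape example.csv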
cat
Usage: mlr cat [options]
Passes input records directly to output. Most useful for format conversion.
Options:
-n         Prepend field "n" to each record with record-counter starting
           at 1.
-N {name}  Prepend field {name} to each record with record-counter starting
           at 1.
-g {a,b,c} Optional group-by-field names for counters, e.g. a,b,c.
--filename Prepend current filename to each record.
--filenum Prepend current filenum (1-up) to each record.
-h|--help Show this message.
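For example, to number the records of example.csv during CSV-to-PPRINT
conversion:
  mlr --icsv --opprint cat -n example.csv
This prepends a field n=1, n=2, n=3, ... to each record.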
check
Usage: mlr check [options]
Consumes records without printing any output, except that warnings are
printed to stderr. Useful for doing a well-formedness check on input data.
Current checks are:
* Data are parseable
* Whether any key is the empty string
Options:
-h|--help Show this message.
clean-whitespace
Usage: mlr clean-whitespace [options]
For each record, for each field in the record, whitespace-cleans the keys
and/or values. Whitespace-cleaning entails stripping leading and trailing
whitespace, and replacing multiple whitespace with singles. For
finer-grained control, please see the DSL functions lstrip, rstrip, strip,
collapse_whitespace, and clean_whitespace.
Options:
-k|--keys-only   Do not touch values.
-v|--values-only Do not touch keys.
It is an error to specify -k as well as -v -- to clean keys and values,
leave off -k as well as -v.
-h|--help Show this message.
count-distinct
Usage: mlr count-distinct [options]
Prints number of records having distinct values for specified field names.
Same as uniq -c.
Options:
-f {a,b,c} Field names for distinct count.
-x {a,b,c} Field names to exclude for distinct count: use each record's
           others instead.
-n         Show only the number of distinct values. Not compatible with -u.
-o {name}  Field name for output count. Default "count". Ignored with -u.
-u         Do unlashed counts for multiple field names. With -f a,b and
           without -u, computes counts for distinct combinations of a and b
           field values. With -f a,b and with -u, computes counts for
           distinct a field values and counts for distinct b field values
           separately.
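For example, to count how many records of example.csv share each distinct
(shape, color) pair:
  mlr --icsv --opprint count-distinct -f shape,color example.csv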
count
Usage: mlr count [options]
Prints number of records, optionally grouped by distinct values for
specified field names.
Options:
-g {a,b,c} Optional group-by-field names for counts, e.g. a,b,c.
-n         Show only the number of distinct values. Not interesting
           without -g.
-o {name}  Field name for output-count. Default "count".
-h|--help Show this message.
count-similar
Usage: mlr count-similar [options]
Ingests all records, then emits each record augmented by a count of the
number of other records having the same group-by field values.
Options:
-g {a,b,c} Group-by-field names for counts, e.g. a,b,c.
-o {name}  Field name for output-counts. Defaults to "count".
-h|--help Show this message.
cut
Usage: mlr cut [options]
Passes through input records with specified fields included/excluded.
Options:
-f {a,b,c}      Comma-separated field names for cut, e.g. a,b,c.
-o              Retain fields in the order specified here in the argument
                list. Default is to retain them in the order found in the
                input data.
-x|--complement Exclude, rather than include, field names specified by -f.
-r              Treat field names as regular expressions. "ab", "a.*b" will
                match any field name containing the substring "ab" or
                matching "a.*b", respectively; anchors of the form "^ab$",
                "^a.*b$" may be used. The -o flag is ignored when -r is
                present.
-h|--help Show this message.
Examples:
  mlr cut -f hostname,status
  mlr cut -x -f hostname,status
  mlr cut -r -f '^status$,sda[0-9]'
  mlr cut -r -f '^status$,"sda[0-9]"'
  mlr cut -r -f '^status$,"sda[0-9]"i' (this is case-insensitive)
decimate
Usage: mlr decimate [options]
Passes through one of every n records, optionally by category.
Options:
-b Decimate by printing first of every n.
-e Decimate by printing last of every n (default).
-g {a,b,c} Optional group-by-field names for decimate counts, e.g. a,b,c.
-n {n} Decimation factor (default 10).
-h|--help Show this message.
fill-down
Usage: mlr fill-down [options]
If a given record has a missing value for a given field, fill that from the
corresponding value from a previous record, if any. By default, a 'missing'
field either is absent, or has the empty-string value. With -a, a field is
'missing' only if it is absent.
Options:
--all               Operate on all fields in the input.
-a|--only-if-absent Treat a field as 'missing' only if it is absent; by
                    default, fields with the empty-string value are also
                    filled.
-f {a,b,c}          Field names for fill-down.
-h|--help Show this message.
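As an illustration, assuming a file gaps.csv (hypothetical) in which some
cells of column x are empty:
  mlr --csv fill-down -f x gaps.csv
copies each non-empty x value downward into the empty cells that follow it.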
fill-empty
Usage: mlr fill-empty [options]
Fills empty-string fields with specified fill-value.
Options:
-v {string} Fill-value: defaults to "N/A".
-S          Don't infer type -- so '-v 0' would fill string 0 not int 0.
filter
Usage: mlr filter [options] {DSL expression}
Lets you use a domain-specific language to programmatically filter which
stream records will be output.
See also: https://miller.readthedocs.io/en/latest/reference-verbs
Options:
-f {file name} File containing a DSL expression (see examples below). If the
   filename is a directory, all *.mlr files in that directory are loaded.
-e {expression} You can use this after -f to add an expression. Example use
   case: define functions/subroutines in a file you specify with -f, then
   call them with an expression you specify with -e.
(If you mix -e and -f then the expressions are evaluated in the order
encountered. Since the expression pieces are simply concatenated, please be
sure to use intervening semicolons to separate expressions.)
-s name=value: Predefines out-of-stream variable @name to have value "value".
   Thus mlr put -s foo=97 '$column += @foo' is like
   mlr put 'begin {@foo = 97} $column += @foo'.
   The value part is subject to type-inferencing.
   May be specified more than once, e.g. -s name1=value1 -s name2=value2.
   Note: the value may be an environment variable, e.g. -s sequence=$SEQUENCE
-x (default false) Prints records for which {expression} evaluates to false,
   not true, i.e. invert the sense of the filter expression.
-q Does not include the modified record in the output stream. Useful for
   when all desired output is in begin and/or end blocks.
-S and -F: These are no-ops in Miller 6 and above, since type-inferencing is
   now done by the record-readers before filter/put is executed. Supported
   as no-op pass-through flags for backward compatibility.
-h|--help Show this message.
Parser-info options:
-w Print warnings about things like uninitialized variables.
-W Same as -w, but exit the process if there are any warnings.
-p Prints the expression's AST (abstract syntax tree), which gives full
   transparency on the precedence and associativity rules of Miller's
   grammar, to stdout.
-d Like -p but uses a parenthesized-expression format for the AST.
-D Like -d but with output all on one line.
-E Echo DSL expression before printing parse-tree.
-v Same as -E -p.
-X Exit after parsing but before stream-processing. Useful with -v/-d/-D, if
   you only want to look at parser information.
Records will pass the filter depending on the last bare-boolean statement in
the DSL expression. That can be the result of <, ==, >, etc., the return
value of a function call which returns boolean, etc.
Examples:
  mlr --csv --from example.csv filter '$color == "red"'
  mlr --csv --from example.csv filter '$color == "red" && $flag == true'
More example filter expressions:
First record in each file:
  'FNR == 1'
Subsampling:
  'urand() < 0.001'
Compound booleans:
  '$color != "blue" && $value > 4.2'
  '($x < 0.5 && $y < 0.5) || ($x > 0.5 && $y > 0.5)'
Regexes with case-insensitive flag:
  '($name =~ "^sys.*east$") || ($name =~ "^dev.[0-9]+"i)'
Assignments, then bare-boolean filter statement:
  '$ab = $a+$b; $cd = $c+$d; $ab != $cd'
Bare-boolean filter statement within a conditional:
  'if (NR < 100) {
    $x > 0.3;
  } else {
    $x > 0.002;
  }'
Using 'any' higher-order function to see if $index is 10, 20, or 30:
  'any([10,20,30], func(e) {return $index == e})'
See also https://miller.readthedocs.io/reference-dsl for more context.
flatten
Usage: mlr flatten [options]
Flattens multi-level maps to single-level ones. Example: field with name 'a'
and value '{"b": { "c": 4 }}' becomes name 'a.b.c' and value 4.
Options:
-f Comma-separated list of field names to flatten (default all).
-s Separator, defaulting to mlr --flatsep value.
-h|--help Show this message.
format-values
Usage: mlr format-values [options]
Applies format strings to all field values, depending on autodetected type.
* If a field value is detected to be integer, applies integer format.
* Else, if a field value is detected to be float, applies float format.
* Else, applies string format.
Note: this is a low-keystroke way to apply formatting to many fields. To get
finer control, please see the fmtnum function within the mlr put DSL.
Note: this verb lets you apply arbitrary format strings, which can produce
undefined behavior and/or program crashes. See your system's "man printf".
Options:
-i {integer format} Defaults to "%d". Examples: "%06lld", "%08llx".
   Note that Miller integers are long long so you must use formats which
   apply to long long, e.g. with ll in them. Undefined behavior results
   otherwise.
-f {float format} Defaults to "%f". Examples: "%8.3lf", "%.6le".
   Note that Miller floats are double-precision so you must use formats
   which apply to double, e.g. with l[efg] in them. Undefined behavior
   results otherwise.
-s {string format} Defaults to "%s". Examples: "_%s", "%08s".
   Note that you must use formats which apply to string, e.g. with s in
   them. Undefined behavior results otherwise.
-n Coerce field values autodetected as int to float, and then apply the
   float format.
fraction
Usage: mlr fraction [options]
For each record's value in specified fields, computes the ratio of that
value to the sum of values in that field over all input records. E.g. with
input records x=1 x=2 x=3 and x=4, emits output records x=1,x_fraction=0.1
x=2,x_fraction=0.2 x=3,x_fraction=0.3 and x=4,x_fraction=0.4.
Note: this is internally a two-pass algorithm: on the first pass it retains
input records and accumulates sums; on the second pass it computes quotients
and emits output records. This means it produces no output until all input
is read.
Options:
-f {a,b,c} Field name(s) for fraction calculation.
-g {d,e,f} Optional group-by-field name(s) for fraction counts.
-p Produce percents [0..100], not fractions [0..1]. Output field names end
   with "_percent" rather than "_fraction".
-c Produce cumulative distributions, i.e. running sums: each output value
   folds in the sum of the previous for the specified group. E.g. with input
   records x=1 x=2 x=3 and x=4, emits output records
   x=1,x_cumulative_fraction=0.1 x=2,x_cumulative_fraction=0.3
   x=3,x_cumulative_fraction=0.6 and x=4,x_cumulative_fraction=1.0.
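For example, to add a quantity_fraction column showing each record's share
of the total quantity in example.csv:
  mlr --icsv --opprint fraction -f quantity example.csv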
gap
Usage: mlr gap [options]
Emits an empty record every n records, or when certain values change.
Options:
-g {a,b,c} Print a gap whenever values of these fields (e.g. a,b,c) change.
-n {n} Print a gap every n records.
One of -n or -g is required.
-n is ignored if -g is present.
-h|--help Show this message.
grep
Usage: mlr grep [options] {regular expression}
Passes through records which match the regular expression.
Options:
-i Use case-insensitive search.
-v Invert: pass through records which do not match the
regex.
-a Only grep for values, not keys and values.
-h|--help Show this message.
Note that "mlr filter" is more powerful, but
requires you to know field names.
By contrast, "mlr grep" allows you to regex-match
the entire record. It does this
by formatting each record in memory as DKVP (or NIDX, if -a
is supplied), using
OFS "," and OPS "=", and matching the
resulting line against the regex specified
here. In particular, the regex is not applied to the input
stream: if you have
CSV with header line "x,y,z" and data line
"1,2,3" then the regex will be
matched, not against either of these lines, but against the
DKVP line
"x=1,y=2,z=3". Furthermore, not all the options to
system grep are supported,
and this command is intended to be merely a keystroke-saver.
To get all the
features of system grep, you can do
"mlr --odkvp ... | grep ... | mlr --idkvp ..."
group-by
Usage: mlr group-by [options] {comma-separated field names}
Outputs records in batches having identical values at specified field names.
Options:
-h|--help Show this message.
group-like
Usage: mlr group-like [options]
Outputs records in batches having identical field names.
Options:
-h|--help Show this message.
gsub
Usage: mlr gsub [options]
Replaces old string with new string in specified field(s), with regex
support for the old string and handling multiple matches, like the 'gsub'
DSL function. See also the 'sub' and 'ssub' verbs.
Options:
-f {a,b,c} Field names to convert.
-h|--help Show this message.
having-fields
Usage: mlr having-fields [options]
Conditionally passes through records depending on each record's field names.
Options:
--at-least {comma-separated names}
--which-are {comma-separated names}
--at-most {comma-separated names}
--all-matching {regular expression}
--any-matching {regular expression}
--none-matching {regular expression}
Examples:
  mlr having-fields --which-are amount,status,owner
  mlr having-fields --any-matching 'sda[0-9]'
  mlr having-fields --any-matching '"sda[0-9]"'
  mlr having-fields --any-matching '"sda[0-9]"i' (this is case-insensitive)
head
Usage: mlr head [options]
Passes through the first n records, optionally by category. Without -g,
ceases consuming more input (i.e. is fast) when n records have been read.
Options:
-g {a,b,c} Optional group-by-field names for head counts, e.g. a,b,c.
-n {n} Head-count to print. Default 10.
-h|--help Show this message.
histogram
Usage: mlr histogram [options]
Just a histogram. Input values < lo or > hi are not counted.
Options:
-f {a,b,c}  Value-field names for histogram counts.
--lo {lo}   Histogram low value.
--hi {hi}   Histogram high value.
--nbins {n} Number of histogram bins. Defaults to 20.
--auto      Automatically computes limits, ignoring --lo and --hi. Holds all
            values in memory before producing any output.
-o {prefix} Prefix for output field name. Default: no prefix.
-h|--help Show this message.
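As an illustration, assuming the quantity values in example.csv lie between
0 and 100:
  mlr --icsv --opprint histogram -f quantity --lo 0 --hi 100 --nbins 10 example.csv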
json-parse
Usage: mlr json-parse [options]
Tries to convert string field values to parsed JSON, e.g. "[1,2,3]" ->
[1,2,3].
Options:
-f {...} Comma-separated list of field names to json-parse (default all).
-k If supplied, then on parse fail for any cell, keep the (unparsable) input
   value for the cell.
-h|--help Show this message.
json-stringify
Usage: mlr json-stringify [options]
Produces string field values from field-value data, e.g. [1,2,3] ->
"[1,2,3]".
Options:
-f {...} Comma-separated list of field names to json-stringify (default all).
--jvstack Produce multi-line JSON output.
--no-jvstack Produce single-line JSON output per record (default).
-h|--help Show this message.
join
Usage: mlr join [options]
Joins records from specified left file name with records from all file names
at the end of the Miller argument list. Functionality is essentially the
same as the system "join" command, but for record streams.
Options:
-f {left file name}
-j {a,b,c} Comma-separated join-field names for output.
-l {a,b,c} Comma-separated join-field names for left input file; defaults to
   -j values if omitted.
-r {a,b,c} Comma-separated join-field names for right input file(s);
   defaults to -j values if omitted.
--lk|--left-keep-field-names {a,b,c} If supplied, this means keep only the
   specified field names from the left file. Automatically includes the
   join-field name(s). Helpful for when you only want a limited subset of
   information from the left file. Tip: you can use --lk "": this means the
   left file becomes solely a row-selector for the input files.
--lp {text} Additional prefix for non-join output field names from the left
   file.
--rp {text} Additional prefix for non-join output field names from the right
   file(s).
--np Do not emit paired records.
--ul Emit unpaired records from the left file.
--ur Emit unpaired records from the right file(s).
-s|--sorted-input Require sorted input: records must be sorted lexically by
   their join-field names, else not all records will be paired. The only
   likely use case for this is with a left file which is too big to fit into
   system memory otherwise.
-u Enable unsorted input. (This is the default even without -u.) In this
   case, the entire left file will be loaded into memory.
--prepipe {command} As in main input options; see mlr --help for details. If
   you wish to use a prepipe command for the main input as well as here, it
   must be specified there as well as here.
--prepipex {command} Likewise.
File-format options default to those for the right file names on the Miller
argument list, but may be overridden for the left file as follows. Please
see the main "mlr --help" for more information on syntax for these
arguments:
  -i {one of csv,dkvp,nidx,pprint,xtab}
  --irs {record-separator character}
  --ifs {field-separator character}
  --ips {pair-separator character}
  --repifs
  --implicit-csv-header
  --implicit-tsv-header
  --no-implicit-csv-header
  --no-implicit-tsv-header
For example, if you have 'mlr --csv ... join -l foo ...' then the left-file
format will be specified CSV as well unless you override with
'mlr --csv ... join --ijson -l foo' etc. Likewise, if you have
'mlr --csv --implicit-csv-header ...' then the join-in file will be expected
to be headerless as well unless you put '--no-implicit-csv-header' after
'join'.
Please use "mlr --usage-separator-options" for information on specifying
separators.
Please see
https://miller.readthedocs.io/en/latest/reference-verbs.html#join
for more information including examples.
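For example, assuming files left.csv and right.csv (hypothetical) which
share an id column, the following emits paired records, including unpaired
left-file records via --ul:
  mlr --icsv --opprint join --ul -f left.csv -j id right.csv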
label
Usage: mlr label [options] {new1,new2,new3,...}
Given n comma-separated names, renames the first n fields of each record to
have the respective name. (Fields past the nth are left with their original
names.) Particularly useful with --inidx or --implicit-csv-header, to give
useful names to otherwise integer-indexed fields.
Options:
-h|--help Show this message.
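For example, assuming a headerless file data.csv (hypothetical) with three
columns:
  mlr --csv --implicit-csv-header label host,status,count data.csv
replaces the implicit field labels 1,2,3 with host,status,count.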
latin1-to-utf8
Usage: mlr latin1-to-utf8, with no options.
Recursively converts record strings from Latin-1 to UTF-8.
For field-level control, please see the latin1_to_utf8 DSL function.
Options:
-h|--help Show this message.
least-frequent
Usage: mlr least-frequent [options]
Shows the least frequently occurring distinct values for specified field
names. The first entry is the statistical anti-mode; the remaining are
runners-up.
Options:
-f {one or more comma-separated field names}. Required flag.
-n {count}. Optional flag defaulting to 10.
-b Suppress counts; show only field values.
-o {name} Field name for output count. Default "count".
See also "mlr most-frequent".
merge-fields
Usage: mlr merge-fields [options]
Computes univariate statistics for each input record, accumulated across
specified fields.
Options:
-a {sum,count,...} Names of accumulators. One or more of:
   count          Count instances of fields
   null_count     Count number of empty-string/JSON-null instances per field
   distinct_count Count number of distinct values per field
   mode           Find most-frequently-occurring values for fields;
                  first-found wins tie
   antimode       Find least-frequently-occurring values for fields;
                  first-found wins tie
   sum            Compute sums of specified fields
   mean           Compute averages (sample means) of specified fields
   var            Compute sample variance of specified fields
   stddev         Compute sample standard deviation of specified fields
   meaneb         Estimate error bars for averages (assuming no sample
                  autocorrelation)
   skewness       Compute sample skewness of specified fields
   kurtosis       Compute sample kurtosis of specified fields
   min            Compute minimum values of specified fields
   max            Compute maximum values of specified fields
   minlen         Compute minimum string-lengths of specified fields
   maxlen         Compute maximum string-lengths of specified fields
-f {a,b,c} Value-field names on which to compute statistics.
Requires -o.
-r {a,b,c} Regular expressions for value-field names on
which to compute
statistics. Requires -o.
-c {a,b,c} Substrings for collapse mode. All fields which
have the same names
after removing substrings will be accumulated together.
Please see
examples below.
-i Use interpolated percentiles, like R’s type=7;
default like type=1.
Not sensical for string-valued fields.
-o {name} Output field basename for -f/-r.
-k Keep the input fields which contributed to the output
statistics;
the default is to omit them.
String-valued data make sense unless arithmetic on them is required,
e.g. for sum, mean, interpolated percentiles, etc. In case of mixed
data, numbers are less than strings.
Example input data: "a_in_x=1,a_out_x=2,b_in_y=4,b_out_x=8".
Example: mlr merge-fields -a sum,count -f a_in_x,a_out_x -o foo
produces "b_in_y=4,b_out_x=8,foo_sum=3,foo_count=2" since
"a_in_x,a_out_x" are summed over.
Example: mlr merge-fields -a sum,count -r in_,out_ -o bar
produces "bar_sum=15,bar_count=4" since all four fields are summed over.
Example: mlr merge-fields -a sum,count -c in_,out_
produces "a_x_sum=3,a_x_count=2,b_y_sum=4,b_y_count=1,b_x_sum=8,b_x_count=1"
since "a_in_x" and "a_out_x" both collapse to "a_x", "b_in_y" collapses
to "b_y", and "b_out_x" collapses to "b_x".
most-frequent
Usage: mlr most-frequent [options]
Shows the most frequently occurring distinct values for
specified field names.
The first entry is the statistical mode; the remaining are
runners-up.
Options:
-f {one or more comma-separated field names}. Required flag.
-n {count}. Optional flag defaulting to 10.
-b Suppress counts; show only field values.
-o {name} Field name for output count. Default
"count".
See also "mlr least-frequent".
nest
Usage: mlr nest [options]
Explodes specified field values into separate
fields/records, or reverses this.
Options:
--explode,--implode One is required.
--values,--pairs One is required.
--across-records,--across-fields One is required.
-f {field name} Required.
--nested-fs {string} Defaults to ";". Field
separator for nested values.
--nested-ps {string} Defaults to ":". Pair
separator for nested key-value pairs.
--evar {string} Shorthand for --explode --values
--across-records --nested-fs {string}
--ivar {string} Shorthand for --implode --values
--across-records --nested-fs {string}
Please use "mlr --usage-separator-options" for
information on specifying separators.
Examples:
mlr nest --explode --values --across-records -f x
with input record "x=a;b;c,y=d" produces output records
"x=a,y=d"
"x=b,y=d"
"x=c,y=d"
Use --implode to do the reverse.
mlr nest --explode --values --across-fields -f x
with input record "x=a;b;c,y=d" produces output records
"x_1=a,x_2=b,x_3=c,y=d"
Use --implode to do the reverse.
mlr nest --explode --pairs --across-records -f x
with input record "x=a:1;b:2;c:3,y=d" produces output records
"a=1,y=d"
"b=2,y=d"
"c=3,y=d"
mlr nest --explode --pairs --across-fields -f x
with input record "x=a:1;b:2;c:3,y=d" produces output records
"a=1,b=2,c=3,y=d"
Notes:
* With --pairs, --implode doesn’t make sense since the
original field name has
been lost.
* The combination "--implode --values
--across-records" is non-streaming:
no output records are produced until all input records have
been read. In
particular, this means it won’t work in ’tail
-f’ contexts. But all other flag
combinations result in streaming (’tail -f’
friendly) data processing.
If input is coming from ’tail -f’, be sure to
use ’--records-per-batch 1’.
* It’s up to you to ensure that the nested-fs is
distinct from your data’s IFS:
e.g. by default the former is semicolon and the latter is
comma.
See also mlr reshape.
nothing
Usage: mlr nothing [options]
Drops all input records. Useful for testing, or after
tee/print/etc. have
produced other output.
Options:
-h|--help Show this message.
put
Usage: mlr put [options] {DSL expression}
Lets you use a domain-specific language to programmatically alter
stream records.
See also:
https://miller.readthedocs.io/en/latest/reference-verbs
Options:
-f {file name} File containing a DSL expression (see examples below).
If the filename is a directory, all *.mlr files in that directory are
loaded.
-e {expression} You can use this after -f to add an expression.
Example use case: define functions/subroutines in a file you specify
with -f, then call them with an expression you specify with -e.
(If you mix -e and -f then the expressions are evaluated in the order
encountered. Since the expression pieces are simply concatenated,
please be sure to use intervening semicolons to separate expressions.)
-s name=value: Predefines out-of-stream variable @name to have value
"value". Thus mlr put -s foo=97 ’$column += @foo’ is like
mlr put ’begin {@foo = 97} $column += @foo’. The value part is
subject to type-inferencing. May be specified more than once, e.g.
-s name1=value1 -s name2=value2. Note: the value may be an
environment variable, e.g. -s sequence=$SEQUENCE.
-x (default false) Prints records for which {expression} evaluates to
false, not true; i.e. inverts the sense of the filter expression.
-q Does not include the modified record in the output stream. Useful
for when all desired output is in begin and/or end blocks.
-S and -F: These are no-ops in Miller 6 and above, since
type-inferencing is now done by the record-readers before filter/put
is executed. Supported as no-op pass-through flags for backward
compatibility.
-h|--help Show this message.
Parser-info options:
-w Print warnings about things like uninitialized variables.
-W Same as -w, but exit the process if there are any warnings.
-p Prints the expression’s AST (abstract syntax tree), which gives
full transparency on the precedence and associativity rules of
Miller’s grammar, to stdout.
-d Like -p but uses a parenthesized-expression format for the AST.
-D Like -d but with output all on one line.
-E Echo DSL expression before printing parse-tree
-v Same as -E -p.
-X Exit after parsing but before stream-processing. Useful with
-v/-d/-D, if you only want to look at parser information.
Examples:
mlr --from example.csv put ’$qr = $quantity * $rate’
More example put expressions:
If-statements:
’if ($flag == true) { $quantity *= 10 }’
’if ($x > 0.0) { $y = log10($x); $z = sqrt($y) } else { $y = 0.0; $z = 0.0 }’
Newly created fields can be read after being written:
’$new_field = $index**2; $qn = $quantity * $new_field’
Regex-replacement:
’$name = sub($name, "http.*com"i, "")’
Regex-capture:
’if ($a =~ "([a-z]+)_([0-9]+)") { $b = "left_\1"; $c = "right_\2" }’
Built-in variables:
’$filename = FILENAME’
Aggregations (use mlr put -q):
’@sum += $x; end {emit @sum}’
’@sum[$shape] += $quantity; end {emit @sum, "shape"}’
’@sum[$shape][$color] += $x; end {emit @sum, "shape", "color"}’
’@min = min(@min,$x); @max = max(@max,$x); end {emitf @min, @max}’
See also https://miller.readthedocs.io/reference-dsl for more context.
regularize
Usage: mlr regularize [options]
For records seen earlier in the data stream with the same field names
in a different order, outputs them with field names in the previously
encountered order.
Options:
-h|--help Show this message.
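Example (a minimal sketch of the reordering behavior described above):
printf 'a=1,b=2\nb=3,a=4\n' | mlr regularize
produces "a=1,b=2" then "a=4,b=3", since the key set {a,b} was first
seen in a,b order.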
remove-empty-columns
Usage: mlr remove-empty-columns [options]
Omits fields which are empty on every input row.
Non-streaming.
Options:
-h|--help Show this message.
rename
Usage: mlr rename [options] {old1,new1,old2,new2,...}
Renames specified fields.
Options:
-r Treat old field names as regular expressions.
"ab", "a.*b"
will match any field name containing the substring
"ab" or
matching "a.*b", respectively; anchors of the form
"^ab$",
"^a.*b$" may be used. New field names may be plain
strings,
or may contain capture groups of the form "\1"
through
"\9". Wrapping the regex in double quotes is
optional, but
is required if you wish to follow it with ’i’ to
indicate
case-insensitivity.
-g Do global replacement within each field name rather than
first-match replacement.
-h|--help Show this message.
Examples:
mlr rename old_name,new_name
mlr rename old_name_1,new_name_1,old_name_2,new_name_2
mlr rename -r ’Date_[0-9]+,Date’    Rename all such fields to be "Date"
mlr rename -r ’"Date_[0-9]+",Date’  Same
mlr rename -r ’Date_([0-9]+).*,\1’  Rename all such fields to be of
                                    the form 20151015
mlr rename -r ’"name"i,Name’        Rename "name", "Name", "NAME",
                                    etc. to "Name"
reorder
Usage: mlr reorder [options]
Moves specified names to start of record, or end of record.
Options:
-e Put specified field names at record end: default is to
put them at record start.
-f {a,b,c} Field names to reorder.
-b {x} Put field names specified with -f before field name
specified by {x},
if any. If {x} isn’t present in a given record, the
specified fields
will not be moved.
-a {x} Put field names specified with -f after field name
specified by {x},
if any. If {x} isn’t present in a given record, the
specified fields
will not be moved.
-h|--help Show this message.
Examples:
mlr reorder -f a,b sends input record "d=4,b=2,a=1,c=3" to "a=1,b=2,d=4,c=3".
mlr reorder -e -f a,b sends input record "d=4,b=2,a=1,c=3" to "d=4,c=3,a=1,b=2".
repeat
Usage: mlr repeat [options]
Copies input records to output records multiple times.
Options must be exactly one of the following:
-n {repeat count} Repeat each input record this many times.
-f {field name} Same, but take the repeat count from the
specified
field name of each input record.
-h|--help Show this message.
Example:
echo x=0 | mlr repeat -n 4 then put ’$x=urand()’
produces:
x=0.488189
x=0.484973
x=0.704983
x=0.147311
Example:
echo a=1,b=2,c=3 | mlr repeat -f b
produces:
a=1,b=2,c=3
a=1,b=2,c=3
Example:
echo a=1,b=2,c=3 | mlr repeat -f c
produces:
a=1,b=2,c=3
a=1,b=2,c=3
a=1,b=2,c=3
reshape
Usage: mlr reshape [options]
Wide-to-long options:
-i {input field names} -o {key-field name,value-field name}
-r {input field regex} -o {key-field name,value-field name}
These pivot/reshape the input data such that the input
fields are removed
and separate records are emitted for each key/value pair.
Note: if you have multiple regexes, please specify them
using multiple -r,
since regexes can contain commas within them.
Note: this works with tail -f and produces output records
for each input
record seen. If input is coming from ’tail -f’,
be sure to use
’--records-per-batch 1’.
Long-to-wide options:
-s {key-field name,value-field name}
These pivot/reshape the input data to undo the wide-to-long
operation.
Note: this does not work with tail -f; it produces output
records only after
all input records have been read.
Examples:
Input file "wide.txt":
time X Y
2009-01-01 0.65473572 2.4520609
2009-01-02 -0.89248112 0.2154713
2009-01-03 0.98012375 1.3179287
mlr --pprint reshape -i X,Y -o item,value wide.txt
time item value
2009-01-01 X 0.65473572
2009-01-01 Y 2.4520609
2009-01-02 X -0.89248112
2009-01-02 Y 0.2154713
2009-01-03 X 0.98012375
2009-01-03 Y 1.3179287
mlr --pprint reshape -r ’[A-Z]’ -o item,value wide.txt
time item value
2009-01-01 X 0.65473572
2009-01-01 Y 2.4520609
2009-01-02 X -0.89248112
2009-01-02 Y 0.2154713
2009-01-03 X 0.98012375
2009-01-03 Y 1.3179287
Input file "long.txt":
time item value
2009-01-01 X 0.65473572
2009-01-01 Y 2.4520609
2009-01-02 X -0.89248112
2009-01-02 Y 0.2154713
2009-01-03 X 0.98012375
2009-01-03 Y 1.3179287
mlr --pprint reshape -s item,value long.txt
time X Y
2009-01-01 0.65473572 2.4520609
2009-01-02 -0.89248112 0.2154713
2009-01-03 0.98012375 1.3179287
See also mlr nest.
sample
Usage: mlr sample [options]
Reservoir sampling (subsampling without replacement),
optionally by category.
See also mlr bootstrap and mlr shuffle.
Options:
-g {a,b,c} Optional: group-by-field names for samples, e.g.
a,b,c.
-k {k} Required: number of records to output in total, or by
group if using -g.
-h|--help Show this message.
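Example (illustrative, assuming example.csv has a shape field):
mlr --icsv --opprint sample -k 2 -g shape example.csv
emits up to 2 randomly sampled records for each distinct shape.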
sec2gmtdate
Usage: mlr sec2gmtdate {comma-separated list of field names}
Replaces a numeric field representing seconds since the epoch with the
corresponding GMT year-month-day timestamp; leaves non-numbers as-is.
This is nothing more than a keystroke-saver for the sec2gmtdate
function:
mlr sec2gmtdate time1,time2
is the same as
mlr put ’$time1=sec2gmtdate($time1);$time2=sec2gmtdate($time2)’
sec2gmt
Usage: mlr sec2gmt [options] {comma-separated list of field
names}
Replaces a numeric field representing seconds since the
epoch with the
corresponding GMT timestamp; leaves non-numbers as-is. This
is nothing
more than a keystroke-saver for the sec2gmt function:
mlr sec2gmt time1,time2
is the same as
mlr put ’$time1 = sec2gmt($time1); $time2 =
sec2gmt($time2)’
Options:
-1 through -9: format the seconds using 1..9 decimal places,
respectively.
--millis Input numbers are treated as milliseconds since the
epoch.
--micros Input numbers are treated as microseconds since the
epoch.
--nanos Input numbers are treated as nanoseconds since the
epoch.
-h|--help Show this message.
seqgen
Usage: mlr seqgen [options]
Produces a sequence of counters, discarding the input record stream.
Produces output as specified by the options below.
Options:
-f {name} (default "i") Field name for counters.
--start {value} (default 1) Inclusive start value.
--step {value} (default 1) Step value.
--stop {value} (default 100) Inclusive stop value.
-h|--help Show this message.
Start, stop, and/or step may be floating-point. Output is
integer if start,
stop, and step are all integers. Step may be negative. It
may not be zero
unless start == stop.
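Example (with the default field name and default DKVP output):
mlr seqgen --start 1 --stop 5
produces:
i=1
i=2
i=3
i=4
i=5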
shuffle
Usage: mlr shuffle [options]
Outputs records randomly permuted. No output records are
produced until
all input records are read. See also mlr bootstrap and mlr
sample.
Options:
-h|--help Show this message.
skip-trivial-records
Usage: mlr skip-trivial-records [options]
Passes through all records except those with zero fields,
or those for which all fields have empty value.
Options:
-h|--help Show this message.
sort
Usage: mlr sort {flags}
Sorts records primarily by the first specified field,
secondarily by the second
field, and so on. (Any records not having all specified sort
keys will appear
at the end of the output, in the order they were
encountered, regardless of the
specified sort order.) The sort is stable: records that
compare equal will sort
in the order they were encountered in the input record
stream.
Options:
-f {comma-separated field names} Lexical ascending
-r {comma-separated field names} Lexical descending
-c {comma-separated field names} Case-folded lexical
ascending
-cr {comma-separated field names} Case-folded lexical
descending
-n {comma-separated field names} Numerical ascending; nulls
sort last
-nf {comma-separated field names} Same as -n
-nr {comma-separated field names} Numerical descending;
nulls sort first
-t {comma-separated field names} Natural ascending
-tr|-rt {comma-separated field names} Natural descending
-h|--help Show this message.
Example:
mlr sort -f a,b -nr x,y,z
which is the same as:
mlr sort -f a -f b -nr x -nr y -nr z
sort-within-records
Usage: mlr sort-within-records [options]
Outputs records sorted lexically ascending by keys.
Options:
-r Recursively sort subobjects/submaps, e.g. for JSON input.
-h|--help Show this message.
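Example (a minimal sketch):
echo 'c=3,a=1,b=2' | mlr sort-within-records
produces "a=1,b=2,c=3".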
sparsify
Usage: mlr sparsify [options]
Unsets fields for which the value is the empty string (or, optionally,
another specified value). Only makes sense when the output format is
not CSV or TSV.
Options:
-s {filler string} What values to remove. Defaults to the
empty string.
-f {a,b,c} Specify field names to be operated on; any other
fields won’t be
modified. The default is to modify all fields.
-h|--help Show this message.
Example: if input is a=1,b=,c=3 then output is a=1,c=3.
split
Usage: mlr split [options] {filename}
Options:
-n {n}: Cap file sizes at N records.
-m {m}: Produce M files, round-robining records among them.
-g {a,b,c}: Write separate files with records having
distinct values for fields named a,b,c.
Exactly one of -m, -n, or -g must be supplied.
--prefix {p} Specify filename prefix; default
"split".
--suffix {s} Specify filename suffix; default is from mlr
output format, e.g. "csv".
-a Append to existing file(s), if any, rather than
overwriting.
-v Send records along to downstream verbs as well as
splitting to files.
-e Do NOT URL-escape names of output files.
-j {J} Use string J to join filename parts; default
"_".
-h|--help Show this message.
Any of the output-format command-line flags (see mlr -h).
For example, using
mlr --icsv --from myfile.csv split --ojson -n 1000
the input is CSV, but the output files are JSON.
Examples: Suppose myfile.csv has 1,000,000 records.
100 output files, 10,000 records each. First 10,000 records in
split_1.csv, next in split_2.csv, etc.
mlr --csv --from myfile.csv split -n 10000
10 output files, 100,000 records each. Records 1,11,21, etc. in
split_1.csv, records 2,12,22, etc. in split_2.csv, etc.
mlr --csv --from myfile.csv split -m 10
Same, but with JSON output.
mlr --csv --from myfile.csv split -m 10 -o json
Same, but instead of split_1.csv, split_2.csv, etc. there are
test_1.dat, test_2.dat, etc.
mlr --csv --from myfile.csv split -m 10 --prefix test --suffix dat
Same, but written to the /tmp/ directory.
mlr --csv --from myfile.csv split -m 10 --prefix /tmp/test --suffix dat
If the shape field has values triangle and square, then there will be
split_triangle.csv and split_square.csv.
mlr --csv --from myfile.csv split -g shape
If the color field has values yellow and green, and the shape field
has values triangle and square, then there will be
split_yellow_triangle.csv, split_yellow_square.csv, etc.
mlr --csv --from myfile.csv split -g color,shape
See also the "tee" DSL function which lets you do more ad-hoc customization.
ssub
Usage: mlr ssub [options]
Replaces old string with new string in specified field(s),
without regex support for
the old string, like the ’ssub’ DSL function.
See also the ’gsub’ and ’sub’ verbs.
Options:
-f {a,b,c} Field names to convert.
-h|--help Show this message.
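Example (a minimal sketch of the equivalent operation via the ssub DSL
function; the field name here is hypothetical):
echo 'name=a.b.c' | mlr put '$name = ssub($name, ".", "X")'
produces "name=aXb.c": the first "." is replaced literally, with no
regex interpretation.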
stats1
Usage: mlr stats1 [options]
Computes univariate statistics for one or more given fields,
accumulated across
the input record stream.
Options:
-a {sum,count,...} Names of accumulators: one or more of:
median This is the same as p50
p10 p25.2 p50 p98 p100 etc.
count Count instances of fields
null_count Count number of empty-string/JSON-null instances
per field
distinct_count Count number of distinct values per field
mode Find most-frequently-occurring values for fields;
first-found wins tie
antimode Find least-frequently-occurring values for fields;
first-found wins tie
sum Compute sums of specified fields
mean Compute averages (sample means) of specified fields
var Compute sample variance of specified fields
stddev Compute sample standard deviation of specified fields
meaneb Estimate error bars for averages (assuming no sample
autocorrelation)
skewness Compute sample skewness of specified fields
kurtosis Compute sample kurtosis of specified fields
min Compute minimum values of specified fields
max Compute maximum values of specified fields
minlen Compute minimum string-lengths of specified fields
maxlen Compute maximum string-lengths of specified
fields
-f {a,b,c}
Value-field names on which to compute statistics
--fr {regex} Regex for value-field names on which to compute statistics
(compute statistics on values in all field names matching regex)
--fx {regex} Inverted regex for value-field names on which
to compute statistics
(compute statistics on values in all field names not
matching regex)
-g {d,e,f}
Optional group-by-field names
--gr {regex} Regex for optional group-by-field names
(group by values in field names matching regex)
--gx {regex} Inverted regex for optional group-by-field
names
(group by values in field names not matching regex)
--grfx {regex} Shorthand for --gr {regex} --fx {that same regex}
-i Use interpolated percentiles, like R’s type=7; default like type=1.
Not sensical for string-valued fields.
-s Print iterative stats. Useful in tail -f contexts, in
which
case please avoid pprint-format output since end of input
stream will never be seen. Likewise, if input is coming from
’tail -f’
be sure to use ’--records-per-batch 1’.
-h|--help Show this message.
Example: mlr stats1 -a min,p10,p50,p90,max -f value -g
size,shape
Example: mlr stats1 -a count,mode -f size
Example: mlr stats1 -a count,mode -f size -g shape
Example: mlr stats1 -a count,mode --fr ’^[a-h].*$’ --gr ’^k.*$’
This computes count and mode statistics on all field names
beginning
with a through h, grouped by all field names starting with
k.
Notes:
* p50 and median are synonymous.
* min and max output the same results as p0 and p100,
respectively, but use
less memory.
* String-valued data make sense unless arithmetic on them is
required,
e.g. for sum, mean, interpolated percentiles, etc. In case
of mixed data,
numbers are less than strings.
* count and mode allow text input; the rest require numeric
input.
In particular, 1 and 1.0 are distinct text for count and
mode.
* When there are mode ties, the first-encountered datum
wins.
stats2
Usage: mlr stats2 [options]
Computes bivariate statistics for one or more given
field-name pairs,
accumulated across the input record stream.
-a {linreg-ols,corr,...} Names of accumulators: one or more
of:
linreg-ols Linear regression using ordinary least squares
linreg-pca Linear regression using principal component
analysis
r2 Quality metric for linreg-ols (linreg-pca emits its own)
logireg Logistic regression
corr Sample correlation
cov Sample covariance
covx Sample-covariance matrix
-f {a,b,c,d} Value-field name-pairs on which to compute
statistics.
There must be an even number of names.
-g {e,f,g} Optional group-by-field names.
-v Print additional output for linreg-pca.
-s Print iterative stats. Useful in tail -f contexts, in
which
case please avoid pprint-format output since end of input
stream will never be seen. Likewise, if input is coming from
’tail -f’, be sure to use
’--records-per-batch 1’.
--fit Rather than printing regression parameters, applies
them to
the input data to compute new fit fields. All input records
are
held in memory until end of input stream. Has effect only
for
linreg-ols, linreg-pca, and logireg.
Only one of -s or --fit may be used.
Example: mlr stats2 -a linreg-pca -f x,y
Example: mlr stats2 -a linreg-ols,r2 -f x,y -g size,shape
Example: mlr stats2 -a corr -f x,y
step
Usage: mlr step [options]
Computes values dependent on earlier/later records,
optionally grouped by category.
Options:
-a {delta,rsum,...} Names of steppers: comma-separated, one
or more of:
counter Count instances of field(s) between successive
records
delta Compute differences in field(s) between successive
records
ewma Exponentially weighted moving average over successive
records
from-first Compute differences in field(s) from first record
ratio Compute ratios in field(s) between successive records
rprod Compute running products of field(s) between
successive records
rsum Compute running sums of field(s) between successive
records
shift Alias for shift_lag
shift_lag Include value(s) in field(s) from the previous
record, if any
shift_lead Include value(s) in field(s) from the next
record, if any
slwin Sliding-window averages over m records back and n
forward. E.g. slwin_7_2 for 7 back and 2 forward.
-f {a,b,c}
Value-field names on which to compute statistics
-g {d,e,f} Optional group-by-field names
-F Computes integerable things (e.g. counter) in floating
point.
As of Miller 6 this happens automatically, but the flag is
accepted
as a no-op for backward compatibility with Miller 5 and
below.
-d {x,y,z} Weights for EWMA. 1 means current sample gets all
weight (no
smoothing), near under 1 is light smoothing, near over 0 is
heavy smoothing. Multiple weights may be specified, e.g.
"mlr step -a ewma -f sys_load -d 0.01,0.1,0.9".
Default if omitted
is "-d 0.5".
-o {a,b,c} Custom suffixes for EWMA output fields. If
omitted, these default to
the -d values. If supplied, the number of -o values must be
the same
as the number of -d values.
-h|--help Show this message.
Examples:
mlr step -a rsum -f request_size
mlr step -a delta -f request_size -g hostname
mlr step -a ewma -d 0.1,0.9 -f x,y
mlr step -a ewma -d 0.1,0.9 -o smooth,rough -f x,y
mlr step -a ewma -d 0.1,0.9 -o smooth,rough -f x,y -g
group_name
mlr step -a slwin_9_0,slwin_0_9 -f x
Please see
https://miller.readthedocs.io/en/latest/reference-verbs.html#filter
or
https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
for more information on EWMA.
sub
Usage: mlr sub [options]
Replaces old string with new string in specified field(s),
with regex support
for the old string and not handling multiple matches, like
the ’sub’ DSL function.
See also the ’gsub’ and ’ssub’
verbs.
Options:
-f {a,b,c} Field names to convert.
-h|--help Show this message.
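Example (a minimal sketch of the equivalent operation via the sub DSL
function; the field name here is hypothetical):
echo 'name=abc.def' | mlr put '$name = sub($name, "c.d", "X")'
produces "name=abXef": "c.d" is a regex, so "." matches any character,
and only the first match is replaced.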
summary
Usage: mlr summary [options]
Show summary statistics about the input data.
All summarizers:
field_type string, int, etc. -- if a column has mixed types,
all encountered types are printed
count +1 for every instance of the field across all records
in the input record stream
null_count count of field values either empty string or JSON
null
distinct_count count of distinct values for the field
mode most-frequently-occurring value for the field
sum sum of field values
mean mean of the field values
stddev standard deviation of the field values
var variance of the field values
skewness skewness of the field values
minlen length of shortest string representation for the
field
maxlen length of longest string representation for the field
min minimum field value
p25 first-quartile field value
median median field value
p75 third-quartile field value
max maximum field value
iqr interquartile range: p75 - p25
lof lower outer fence: p25 - 3.0 * iqr
lif lower inner fence: p25 - 1.5 * iqr
uif upper inner fence: p75 + 1.5 * iqr
uof upper outer fence: p75 + 3.0 * iqr
Default summarizers:
field_type count mean min max null_count distinct_count
Notes:
* min, p25, median, p75, and max work for strings as well as
numbers
* Distinct-counts are computed on string representations --
so 4.1 and 4.10 are counted as distinct here.
* If the mode is not unique in the input data, the
first-encountered value is reported as the mode.
Options:
-a {mean,sum,etc.} Use only the specified summarizers.
-x {mean,sum,etc.} Use all summarizers, except the specified
ones.
--all Use all available summarizers.
-h|--help Show this message.
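Example (illustrative):
mlr --icsv --opprint summary -a mean,min,max example.csv
shows only the mean, min, and max summarizers for each column of
example.csv.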
tac
Usage: mlr tac [options]
Prints records in reverse order from the order in which they
were encountered.
Options:
-h|--help Show this message.
tail
Usage: mlr tail [options]
Passes through the last n records, optionally by category.
Options:
-g {a,b,c} Optional group-by-field names for tail counts, e.g. a,b,c.
-n {n} Tail-count to print. Default 10.
-h|--help Show this message.
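Example (illustrative, assuming example.csv has a shape field):
mlr --icsv --opprint tail -n 2 -g shape example.csv
passes through the last 2 records for each distinct shape.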
tee
Usage: mlr tee [options] {filename}
Options:
-a Append to existing file, if any, rather than overwriting.
-p Treat filename as a pipe-to command.
Any of the output-format command-line flags (see mlr -h).
Example: using
mlr --icsv --opprint put ’...’ then tee --ojson ./mytap.dat then stats1 ...
the input is CSV, the output is pretty-print tabular, but the tee-file
output is written in JSON format.
-h|--help Show this message.
template
Usage: mlr template [options]
Places input-record fields in the order specified by list of
column names.
If the input record is missing a specified field, it will be
filled with the fill-with.
If the input record possesses an unspecified field, it will
be discarded.
Options:
-f {a,b,c} Comma-separated field names for template, e.g.
a,b,c.
-t {filename} CSV file whose header line will be used for
template.
--fill-with {filler string} What to fill absent fields with.
Defaults to the empty string.
-h|--help Show this message.
Example:
* Specified fields are a,b,c.
* Input record is c=3,a=1,f=6.
* Output record is a=1,b=,c=3.
top
Usage: mlr top [options]
-f {a,b,c} Value-field names for top counts.
-g {d,e,f} Optional group-by-field names for top counts.
-n {count} How many records to print per category; default
1.
-a Print all fields for top-value records; default is
to print only value and group-by fields. Requires a single
value-field name only.
--min Print top smallest values; default is top largest
values.
-F Keep top values as floats even if they look like
integers.
-o {name} Field name for output indices. Default
"top_idx".
This is ignored if -a is used.
Prints the n records with smallest/largest values at
specified fields,
optionally by category. If -a is given, then the top records
are emitted
with the same fields as they appeared in the input. Without
-a, only fields
from -f, fields from -g, and the top-index field are
emitted. For more information
please see
https://miller.readthedocs.io/en/latest/reference-verbs#top
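Example (illustrative, assuming example.csv has a numeric quantity
field and a shape field):
mlr --icsv --opprint top -n 2 -f quantity -g shape example.csv
prints the two largest quantity values for each distinct shape.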
utf8-to-latin1
Usage: mlr utf8-to-latin1, with no options.
Recursively converts record strings from UTF-8 to Latin-1.
For field-level control, please see the utf8_to_latin1 DSL
function.
Options:
-h|--help Show this message.
unflatten
Usage: mlr unflatten [options]
Reverses flatten. Example: field with name
’a.b.c’ and value 4
becomes name ’a’ and value
’{"b": { "c": 4 }}’.
Options:
-f {a,b,c} Comma-separated list of field names to unflatten
(default all).
-s {string} Separator, defaulting to mlr --flatsep value.
-h|--help Show this message.
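Example (a minimal sketch, with the default "." separator):
echo 'a.b.c=4' | mlr --ojson unflatten
produces the nested record {"a": {"b": {"c": 4}}}.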
uniq
Usage: mlr uniq [options]
Prints distinct values for specified field names. With -c,
same as
count-distinct. For uniq, -f is a synonym for -g.
Options:
-g {d,e,f} Group-by-field names for uniq counts.
-x {a,b,c} Field names to exclude for uniq: use each
record’s others instead.
-c Show repeat counts in addition to unique values.
-n Show only the number of distinct values.
-o {name} Field name for output count. Default
"count".
-a Output each unique record only once. Incompatible with
-g.
With -c, produces unique records, with repeat counts for
each.
With -n, produces only one record which is the unique-record
count.
With neither -c nor -n, produces unique records.
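Example (illustrative, assuming example.csv has a color field):
mlr --icsv --opprint uniq -g color -c example.csv
shows each distinct color along with its repeat count.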
unspace
Usage: mlr unspace [options]
Replaces spaces in record keys and/or values with _. This is
helpful for PPRINT output.
Options:
-f {x} Replace spaces with specified filler character.
-k Unspace only keys, not keys and values.
-v Unspace only values, not keys and values.
-h|--help Show this message.
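Example (a minimal sketch):
echo 'field one=hello world' | mlr unspace
produces "field_one=hello_world".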
unsparsify
Usage: mlr unsparsify [options]
Prints records with the union of field names over all input
records.
For field names absent in a given record but present in
others, fills in
a value. This verb retains all input before producing any
output.
Options:
--fill-with {filler string} What to fill absent fields with.
Defaults to
the empty string.
-f {a,b,c} Specify field names to be operated on. Any other
fields won’t be
modified, and operation will be streaming.
-h|--help Show this message.
Example: if the input is two records, one being ’a=1,b=2’ and the
other being ’b=3,c=4’, then the output is the two records
’a=1,b=2,c=’ and ’a=,b=3,c=4’.
FUNCTIONS FOR FILTER/PUT
abs
(class=math #args=1) Absolute value.
acos
(class=math #args=1) Inverse trigonometric cosine.
acosh
(class=math #args=1) Inverse hyperbolic cosine.
antimode
(class=stats #args=1) Returns the least frequently occurring
value in an array or map. Returns error for
non-array/non-map types. Values are stringified for
comparison, so for example string "1" and integer
1 are not distinct. In cases of ties, first-found wins.
Examples:
antimode([3,3,4,4,4]) is 3
antimode([3,3,4,4]) is 3
any
(class=higher-order-functions #args=2) Given a map or array
as first argument and a function as second argument, yields
a boolean true if the argument function returns true for any
array/map element, false otherwise. For arrays, the function
should take one argument, for array element; for maps, it
should take two, for map-element key and value. In either
case it should return a boolean.
Examples:
Array example: any([10,20,30], func(e) {return $index == e})
Map example: any({"a": "foo",
"b": "bar"}, func(k,v) {return $[k] ==
v})
append
(class=collections #args=2) Appends second argument to end
of first argument, which must be an array.
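Example (illustrative):
append([1,2,3], 4) is [1,2,3,4]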
apply
(class=higher-order-functions #args=2) Given a map or array
as first argument and a function as second argument, applies
the function to each element of the array/map. For arrays,
the function should take one argument, for array element; it
should return a new element. For maps, it should take two
arguments, for map-element key and value; it should return a
new key-value pair (i.e. a single-entry map).
Examples:
Array example: apply([1,2,3,4,5], func(e) {return e ** 3})
returns [1, 8, 27, 64, 125].
Map example: apply({"a":1, "b":3,
"c":5}, func(k,v) {return {toupper(k): v ** 2}})
returns {"A": 1, "B":9, "C":
25}",
arrayify
(class=collections #args=1) Walks through a nested
map/array, converting any map with consecutive keys
"1", "2", ... into an array. Useful to
wrap the output of unflatten.
asin
(class=math #args=1) Inverse trigonometric sine.
asinh
(class=math #args=1) Inverse hyperbolic sine.
asserting_absent
(class=typing #args=1) Aborts with an error if is_absent on
the argument returns false, else returns its argument.
asserting_array
(class=typing #args=1) Aborts with an error if is_array on
the argument returns false, else returns its argument.
asserting_bool
(class=typing #args=1) Aborts with an error if is_bool on
the argument returns false, else returns its argument.
asserting_boolean
(class=typing #args=1) Aborts with an error if is_boolean on
the argument returns false, else returns its argument.
asserting_empty
(class=typing #args=1) Aborts with an error if is_empty on
the argument returns false, else returns its argument.
asserting_empty_map
(class=typing #args=1) Aborts with an error if is_empty_map
on the argument returns false, else returns its
argument.
asserting_error
(class=typing #args=1) Aborts with an error if is_error on
the argument returns false, else returns its argument.
asserting_float
(class=typing #args=1) Aborts with an error if is_float on
the argument returns false, else returns its argument.
asserting_int
(class=typing #args=1) Aborts with an error if is_int on the
argument returns false, else returns its argument.
asserting_map
(class=typing #args=1) Aborts with an error if is_map on the
argument returns false, else returns its argument.
asserting_nonempty_map
(class=typing #args=1) Aborts with an error if
is_nonempty_map on the argument returns false, else returns
its argument.
asserting_not_array
(class=typing #args=1) Aborts with an error if is_not_array
on the argument returns false, else returns its
argument.
asserting_not_empty
(class=typing #args=1) Aborts with an error if is_not_empty
on the argument returns false, else returns its
argument.
asserting_not_map
(class=typing #args=1) Aborts with an error if is_not_map on
the argument returns false, else returns its argument.
asserting_not_null
(class=typing #args=1) Aborts with an error if is_not_null
on the argument returns false, else returns its
argument.
asserting_null
(class=typing #args=1) Aborts with an error if is_null on
the argument returns false, else returns its argument.
asserting_numeric
(class=typing #args=1) Aborts with an error if is_numeric on
the argument returns false, else returns its argument.
asserting_present
(class=typing #args=1) Aborts with an error if is_present on
the argument returns false, else returns its argument.
asserting_string
(class=typing #args=1) Aborts with an error if is_string on
the argument returns false, else returns its argument.
atan
(class=math #args=1) One-argument arctangent.
atan2
(class=math #args=2) Two-argument arctangent.
atanh
(class=math #args=1) Inverse hyperbolic tangent.
bitcount
(class=arithmetic #args=1) Count of 1-bits.
boolean
(class=conversion #args=1) Convert int/float/bool/string to
boolean.
capitalize
(class=string #args=1) Convert string’s first
character to uppercase.
cbrt
(class=math #args=1) Cube root.
ceil
(class=math #args=1) Ceiling: nearest integer at or
above.
clean_whitespace
(class=string #args=1) Same as collapse_whitespace and
strip, followed by type inference.
collapse_whitespace
(class=string #args=1) Strip repeated whitespace from
string.
concat
(class=collections #args=variadic) Returns the array
concatenation of the arguments. Non-array arguments are
treated as single-element arrays.
Examples:
concat(1,2,3) is [1,2,3]
concat([1,2],3) is [1,2,3]
concat([1,2],[3]) is [1,2,3]
contains
(class=string #args=2) Returns true if the first argument contains the
second as a substring. This is like saying ’index(arg1, arg2) >= 0’
but with less keystroking.
Examples:
contains("abcde", "e") gives true
contains("abcde", "x") gives false
contains(12345, 34) gives true
contains("forêt", "ê") gives
true
cos
(class=math #args=1) Trigonometric cosine.
cosh
(class=math #args=1) Hyperbolic cosine.
count
(class=stats #args=1) Returns the length of an array or map.
Returns error for non-array/non-map types.
Examples:
count([7,8,9]) is 3
count({"a":7,"b":8,"c":9}) is
3
depth
(class=collections #args=1) Prints maximum depth of
map/array. Scalars have depth 0.
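Examples (illustrative):
depth(1) is 0
depth({"a":{"b":3}}) is 2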
dhms2fsec
(class=time #args=1) Recovers floating-point seconds as in
dhms2fsec("5d18h53m20.250000s") =
500000.250000
dhms2sec
(class=time #args=1) Recovers integer seconds as in
dhms2sec("5d18h53m20s") = 500000
distinct_count
(class=stats #args=1) Returns the number of distinct values
in an array or map. Returns error for non-array/non-map
types. Values are stringified for comparison, so for example
string "1" and integer 1 are not distinct.
Examples:
distinct_count([7,8,9,7]) is 3
distinct_count([1,"1"]) is 1
distinct_count([1,1.0]) is 2
erf
(class=math #args=1) Error function.
erfc
(class=math #args=1) Complementary error function.
every
(class=higher-order-functions #args=2) Given a map or array
as first argument and a function as second argument, yields
a boolean true if the argument function returns true for
every array/map element, false otherwise. For arrays, the
function should take one argument, for array element; for
maps, it should take two, for map-element key and value. In
either case it should return a boolean.
Examples:
Array example: every(["a", "b",
"c"], func(e) {return $[e] >= 0})
Map example: every({"a": "foo",
"b": "bar"}, func(k,v) {return $[k] ==
v})
exec
(class=system #args=variadic) ’$output = exec("command", ["arg1",
"arg2"], {"env": ["ENV_VAR=ENV_VALUE", "ENV_VAR2=ENV_VALUE2"],
"dir": "/tmp/run_command_here", "stdin_string": "this is input fed to
program", "combined_output": true})’ Runs a command via executable,
path, args and environment, yielding its stdout minus final carriage
return.
Example:
exec("echo", ["I don’t do", "$SHELL things"], {"env": "SHELL=sh"})
outputs "I don’t do $SHELL things"
exp
(class=math #args=1) Exponential function e**x.
expm1
(class=math #args=1) e**x - 1.
flatten
(class=collections #args=2,3) Flattens multi-level maps to
single-level ones. Useful for nested JSON-like structures
for non-JSON file formats like CSV. With two arguments, the
first argument is a map (maybe $*) and the second argument
is the flatten separator. With three arguments, the first
argument is prefix, the second is the flatten separator, and
the third argument is a map; flatten($*, ".") is
the same as flatten("", ".", $*). See
"Flatten/unflatten: converting between JSON and tabular
formats" at https://miller.readthedocs.io for more
information.
Examples:
flatten({"a":[1,2],"b":3},
".") is {"a.1": 1, "a.2": 2,
"b": 3}.
flatten("a", ".", {"b": {
"c": 4 }}) is {"a.b.c" : 4}.
flatten("", ".", {"a": {
"b": 3 }}) is {"a.b" : 3}.
float
(class=conversion #args=1) Convert int/float/bool/string to
float.
floor
(class=math #args=1) Floor: nearest integer at or below.
fmtifnum
(class=conversion #args=2) Identical to fmtnum, except returns the
first argument as-is if the output would be an error.
Examples:
fmtifnum(3.4, "%.6f") gives 3.400000
fmtifnum("abc", "%.6f") gives abc
$* = fmtifnum($*, "%.6f") formats numeric fields in the current
record, leaving non-numeric ones alone
fmtnum
(class=conversion #args=2) Convert int/float/bool to string
using printf-style format string (https://pkg.go.dev/fmt),
e.g. ’$s = fmtnum($n, "%08d")’ or
’$t = fmtnum($n, "%.6e")’.
Miller-specific extension: "%_d" and
"%_f" for comma-separated thousands. This function
recurses on array and map values.
Examples:
$y = fmtnum($x, "%.6f")
$o = fmtnum($n, "%d")
$o = fmtnum($n, "%12d")
$y = fmtnum($x, "%.6_f")
$o = fmtnum($n, "%_d")
$o = fmtnum($n, "%12_d")
fold
(class=higher-order-functions #args=3) Given a map or array
as first argument and a function as second argument,
accumulates entries into a final output -- for example, sum
or product. For arrays, the function should take two
arguments, for accumulated value and array element. For
maps, it should take four arguments, for accumulated key and
value, and map-element key and value; it should return the
updated accumulator as a new key-value pair (i.e. a
single-entry map). The start value for the accumulator is
taken from the third argument.
Examples:
Array example: fold([1,2,3,4,5], func(acc,e) {return acc +
e**3}, 10000) returns 10225.
Map example: fold({"a":1, "b":3,
"c": 5}, func(acck,accv,ek,ev) {return
{"sum": accv+ev**2}}, {"sum":10000})
returns 10035.
format
(class=string #args=variadic) Using first argument as format
string, interpolate remaining arguments in place of each
"{}" in the format string. Too-few arguments are
treated as the empty string; too-many arguments are
discarded.
Examples:
format("{}:{}:{}", 1,2) gives "1:2:".
format("{}:{}:{}", 1,2,3) gives "1:2:3".
format("{}:{}:{}", 1,2,3,4) gives
"1:2:3".
fsec2dhms
(class=time #args=1) Formats floating-point seconds as in
fsec2dhms(500000.25) = "5d18h53m20.250000s"
fsec2hms
(class=time #args=1) Formats floating-point seconds as in
fsec2hms(5000.25) = "01:23:20.250000"
get_keys
(class=collections #args=1) Returns array of keys of map or
array
get_values
(class=collections #args=1) Returns array of values of map
or array -- in the latter case, returns a copy of the
array
gmt2localtime
(class=time #args=1,2) Convert from a GMT-time string to a local-time
string. Consults $TZ unless second argument is supplied.
Examples:
gmt2localtime("1999-12-31T22:00:00Z") =
"2000-01-01 00:00:00" with
TZ="Asia/Istanbul"
gmt2localtime("1999-12-31T22:00:00Z",
"Asia/Istanbul") = "2000-01-01
00:00:00"
gmt2nsec
(class=time #args=1) Parses GMT timestamp as integer
nanoseconds since the epoch.
Example:
gmt2nsec("2001-02-03T04:05:06Z") =
981173106000000000
gmt2sec
(class=time #args=1) Parses GMT timestamp as integer seconds
since the epoch.
Example:
gmt2sec("2001-02-03T04:05:06Z") = 981173106
gssub
(class=string #args=3) Like gsub but does no regexing. No
characters are special.
Example:
gssub("ab.d.fg", ".", "X")
gives "abXdXfg"
gsub
(class=string #args=3) ’$name = gsub($name,
"old", "new")’: replace all, with
support for regular expressions. Capture groups \1 through
\9 in the new part are matched from (...) in the old part,
and must be used within the same call to gsub -- they
don’t persist for subsequent DSL statements. See also
=~ and regextract. See also "Regular expressions"
at https://miller.readthedocs.io.
Examples:
gsub("ababab", "ab", "XY")
gives "XYXYXY"
gsub("abc.def", ".", "X")
gives "XXXXXXX"
gsub("abc.def", "\.", "X")
gives "abcXdef"
gsub("abcdefg", "[ce]", "X")
gives "abXdXfg"
gsub("prefix4529:suffix8567",
"(....ix)([0-9]+)", "[\1 : \2]") gives
"[prefix : 4529]:[suffix : 8567]"
haskey
(class=collections #args=2) True/false if map
has/hasn’t key, e.g. ’haskey($*,
"a")’ or ’haskey(mymap, mykey)’,
or true/false if array index is in bounds / out of bounds.
Error if 1st argument is not a map or array. Note -n..-1
alias to 1..n in Miller arrays.
hexfmt
(class=conversion #args=1) Convert int to hex string, e.g.
255 to "0xff".
hms2fsec
(class=time #args=1) Recovers floating-point seconds as in
hms2fsec("01:23:20.250000") = 5000.250000
hms2sec
(class=time #args=1) Recovers integer seconds as in
hms2sec("01:23:20") = 5000
hostname
(class=system #args=0) Returns the hostname as a string.
index
(class=string #args=2) Returns the index (1-based) of the
second argument within the first. Returns -1 if the second
argument isn’t a substring of the first. Stringifies
non-string inputs. Uses UTF-8 encoding to count characters,
not bytes.
Examples:
index("abcde", "e") gives 5
index("abcde", "x") gives -1
index(12345, 34) gives 3
index("forêt", "t") gives 5
int
(class=conversion #args=1,2) Convert int/float/bool/string
to int. If the second argument is omitted and the first
argument is a string, base is inferred from the first
argument’s prefix. If the second argument is provided
and the first argument is a string, the second argument is
used as the base. If the second argument is provided and the
first argument is not a string, the second argument is
ignored.
Examples:
int("345") gives decimal 345 (base-10/decimal
input is inferred)
int("0xff") gives decimal 255 (base-16/hexadecimal
input is inferred)
int("0377") gives decimal 255 (base-8/octal input
is inferred)
int("0b11010011") gives decimal 211 which is
hexadecimal 0xd3 (base-2/binary input is inferred)
int("0377", 10) gives decimal 377
int(345, 16) gives decimal 345
int(string(345), 16) gives decimal 837
invqnorm
(class=math #args=1) Inverse of normal cumulative distribution
function. Note that invqnorm(urand()) is normally distributed.
is_absent
(class=typing #args=1) False if field is present in input,
true otherwise
is_array
(class=typing #args=1) True if argument is an array.
is_bool
(class=typing #args=1) True if field is present with boolean
value. Synonymous with is_boolean.
is_boolean
(class=typing #args=1) True if field is present with boolean
value. Synonymous with is_bool.
is_empty
(class=typing #args=1) True if field is present in input
with empty string value, false otherwise.
is_empty_map
(class=typing #args=1) True if argument is a map which is
empty.
is_error
(class=typing #args=1) True if argument is an error, such as taking
string length of an integer.
is_float
(class=typing #args=1) True if field is present with value
inferred to be float
is_int
(class=typing #args=1) True if field is present with value
inferred to be int
is_map
(class=typing #args=1) True if argument is a map.
is_nan
(class=typing #args=1) True if the argument is the NaN
(not-a-number) floating-point value. Note that NaN has the
property that NaN != NaN, so you need
’is_nan(x)’ rather than ’x ==
NaN’.
is_nonempty_map
(class=typing #args=1) True if argument is a map which is
non-empty.
is_not_array
(class=typing #args=1) True if argument is not an array.
is_not_empty
(class=typing #args=1) True if field is present in input
with non-empty value, false otherwise
is_not_map
(class=typing #args=1) True if argument is not a map.
is_not_null
(class=typing #args=1) False if argument is null (empty,
absent, or JSON null), true otherwise.
is_null
(class=typing #args=1) True if argument is null (empty,
absent, or JSON null), false otherwise.
is_numeric
(class=typing #args=1) True if field is present with value
inferred to be int or float
is_present
(class=typing #args=1) True if field is present in input,
false otherwise.
is_string
(class=typing #args=1) True if field is present with string
(including empty-string) value
joink
(class=conversion #args=2) Makes string from map/array keys.
First argument is map/array; second is separator string.
Examples:
joink({"a":3,"b":4,"c":5},
",") = "a,b,c".
joink([1,2,3], ",") = "1,2,3".
joinkv
(class=conversion #args=3) Makes string from map/array
key-value pairs. First argument is map/array; second is
pair-separator string; third is field-separator string.
Mnemonic: the "=" comes before the ","
in the output and in the arguments to joinkv.
Examples:
joinkv([3,4,5], "=", ",") =
"1=3,2=4,3=5"
joinkv({"a":3,"b":4,"c":5},
":", ";") = "a:3;b:4;c:5"
joinv
(class=conversion #args=2) Makes string from map/array
values. First argument is map/array; second is separator
string.
Examples:
joinv([3,4,5], ",") = "3,4,5"
joinv({"a":3,"b":4,"c":5},
",") = "3,4,5"
json_parse
(class=collections #args=1) Converts value from
JSON-formatted string.
json_stringify
(class=collections #args=1,2) Converts value to
JSON-formatted string. Default output is single-line. With
optional second boolean argument set to true, produces
multiline output.
kurtosis
(class=stats #args=1) Returns the sample kurtosis of values
in an array or map. Returns empty string AKA void for
array/map of length less than two; returns error for
non-array/non-map types.
Example:
kurtosis([4,5,9,10,11]) is -1.6703688
latin1_to_utf8
(class=string #args=1) Tries to convert Latin-1-encoded
string to UTF-8-encoded string. If argument is array or map,
recurses into it.
Examples:
$y = latin1_to_utf8($x)
$* = latin1_to_utf8($*)
leafcount
(class=collections #args=1) Counts total number of terminal
values in map/array. For single-level map/array, same as
length.
leftpad
(class=string #args=3) Left-pads first argument to at most
the specified length (second, integer argument) using
specified pad value (third, string argument). If the first
argument is not a string, it will be stringified first.
Examples:
leftpad("abcdefg", 10 , "*") gives
"***abcdefg".
leftpad("abcdefg", 10 , "XY") gives
"XYabcdefg".
leftpad("1234567", 10 , "0") gives
"0001234567".
length
(class=collections #args=1) Counts number of top-level
entries in array/map. Scalars have length 1.
localtime2gmt
(class=time #args=1,2) Convert from a local-time string to a
GMT-time string. Consults $TZ unless second argument is
supplied.
Examples:
localtime2gmt("2000-01-01 00:00:00") =
"1999-12-31T22:00:00Z" with
TZ="Asia/Istanbul"
localtime2gmt("2000-01-01 00:00:00",
"Asia/Istanbul") =
"1999-12-31T22:00:00Z"
localtime2nsec
(class=time #args=1,2) Parses local timestamp as integer
nanoseconds since the epoch. Consults $TZ environment
variable, unless second argument is supplied.
Examples:
localtime2nsec("2001-02-03 04:05:06") =
981165906000000000 with TZ="Asia/Istanbul"
localtime2nsec("2001-02-03 04:05:06",
"Asia/Istanbul") = 981165906000000000"
localtime2sec
(class=time #args=1,2) Parses local timestamp as integer
seconds since the epoch. Consults $TZ environment variable,
unless second argument is supplied.
Examples:
localtime2sec("2001-02-03 04:05:06") = 981165906
with TZ="Asia/Istanbul"
localtime2sec("2001-02-03 04:05:06",
"Asia/Istanbul") = 981165906"
log
(class=math #args=1) Natural (base-e) logarithm.
log10
(class=math #args=1) Base-10 logarithm.
log1p
(class=math #args=1) log(1+x).
logifit
(class=math #args=3) Given m and b from logistic regression,
compute fit: $yhat=logifit($x,$m,$b).
lstrip
(class=string #args=1) Strip leading whitespace from
string.
madd
(class=arithmetic #args=3) a + b mod m (integers)
mapdiff
(class=collections #args=variadic) With 0 args, returns
empty map. With 1 arg, returns copy of arg. With 2 or more,
returns copy of arg 1 with all keys from any of remaining
argument maps removed.
mapexcept
(class=collections #args=variadic) Returns a map with keys
from remaining arguments, if any, unset. Remaining arguments
can be strings or arrays of string. E.g.
’mapexcept({1:2,3:4,5:6}, 1, 5, 7)’ is
’{3:4}’ and ’mapexcept({1:2,3:4,5:6}, [1,
5, 7])’ is ’{3:4}’.
mapselect
(class=collections #args=variadic) Returns a map with only
keys from remaining arguments set. Remaining arguments can
be strings or arrays of string. E.g.
’mapselect({1:2,3:4,5:6}, 1, 5, 7)’ is
’{1:2,5:6}’ and ’mapselect({1:2,3:4,5:6},
[1, 5, 7])’ is ’{1:2,5:6}’.
mapsum
(class=collections #args=variadic) With 0 args, returns
empty map. With >= 1 arg, returns a map with key-value
pairs from all arguments. Rightmost collisions win, e.g.
’mapsum({1:2,3:4},{1:5})’ is
’{1:5,3:4}’.
max
(class=math #args=variadic) Max of n numbers; null loses.
The min and max functions also recurse into arrays and maps,
so they can be used to get min/max stats on array/map
values.
maxlen
(class=stats #args=1) Returns the maximum string length of values in
an array or map. Returns empty string AKA void for empty array/map;
returns error for non-array/non-map types.
Example:
maxlen(["año", "alto"]) is 4
md5
(class=hashing #args=1) MD5 hash.
mean
(class=stats #args=1) Returns the arithmetic mean of values
in an array or map. Returns empty string AKA void for empty
array/map; returns error for non-array/non-map types.
Example:
mean([4,5,7,10]) is 6.5
meaneb
(class=stats #args=1) Returns the error bar for arithmetic
mean of values in an array or map, assuming the values are
independent and identically distributed. Returns empty
string AKA void for array/map of length less than two;
returns error for non-array/non-map types.
Example:
meaneb([4,5,7,10]) is 1.3228756
median
(class=stats #args=1,2) Returns the median of values in an
array or map. Returns empty string AKA void for empty
array/map; returns error for non-array/non-map types. Please
see the percentiles function for information on optional
flags, and on performance for large inputs.
Examples:
median([3,4,5,6,9,10]) is 6
median([3,4,5,6,9,10],{"interpolate_linearly":true})
is 5.5
median(["abc", "def", "ghi",
"ghi"]) is "ghi"
mexp
(class=arithmetic #args=3) a ** b mod m (integers)
min
(class=math #args=variadic) Min of n numbers; null loses.
The min and max functions also recurse into arrays and maps,
so they can be used to get min/max stats on array/map
values.
minlen
(class=stats #args=1) Returns the minimum string length of values in
an array or map. Returns empty string AKA void for empty array/map;
returns error for non-array/non-map types.
Example:
minlen(["año", "alto"]) is 3
mmul
(class=arithmetic #args=3) a * b mod m (integers)
mode
(class=stats #args=1) Returns the most frequently occurring
value in an array or map. Returns error for
non-array/non-map types. Values are stringified for
comparison, so for example string "1" and integer
1 are not distinct. In cases of ties, first-found wins.
Examples:
mode([3,3,4,4,4]) is 4
mode([3,3,4,4]) is 3
msub
(class=arithmetic #args=3) a - b mod m (integers)
nsec2gmt
(class=time #args=1,2) Formats integer nanoseconds since
epoch as GMT timestamp. Leaves non-numbers as-is. With
second integer argument n, includes n decimal places for the
seconds part.
Examples:
nsec2gmt(1234567890000000000) =
"2009-02-13T23:31:30Z"
nsec2gmt(1234567890123456789) =
"2009-02-13T23:31:30Z"
nsec2gmt(1234567890123456789, 6) =
"2009-02-13T23:31:30.123456Z"
nsec2gmtdate
(class=time #args=1) Formats integer nanoseconds since epoch
as GMT timestamp with year-month-date. Leaves non-numbers
as-is.
Example:
nsec2gmtdate(1440768801700000000) = "2015-08-28".
nsec2localdate
(class=time #args=1,2) Formats integer nanoseconds since
epoch as local timestamp with year-month-date. Leaves
non-numbers as-is. Consults $TZ environment variable unless
second argument is supplied.
Examples:
nsec2localdate(1440768801700000000) = "2015-08-28"
with TZ="Asia/Istanbul"
nsec2localdate(1440768801700000000,
"Asia/Istanbul") = "2015-08-28"
nsec2localtime
(class=time #args=1,2,3) Formats integer nanoseconds since
epoch as local timestamp. Consults $TZ environment variable
unless third argument is supplied. Leaves non-numbers as-is.
With second integer argument n, includes n decimal places
for the seconds part
Examples:
nsec2localtime(1234567890000000000) = "2009-02-14
01:31:30" with TZ="Asia/Istanbul"
nsec2localtime(1234567890123456789) = "2009-02-14
01:31:30" with TZ="Asia/Istanbul"
nsec2localtime(1234567890123456789, 6) = "2009-02-14
01:31:30.123456" with TZ="Asia/Istanbul"
nsec2localtime(1234567890123456789, 6,
"Asia/Istanbul") = "2009-02-14
01:31:30.123456"
null_count
(class=stats #args=1) Returns the number of values in an
array or map which are empty-string (AKA void) or JSON null.
Returns error for non-array/non-map types. Values are
stringified for comparison, so for example string
"1" and integer 1 are not distinct.
Example:
null_count(["a", "", "c"]) is
1
os
(class=system #args=0) Returns the operating-system name as
a string.
percentile
(class=stats #args=2,3) Returns the given percentile of
values in an array or map. Returns empty string AKA void for
empty array/map; returns error for non-array/non-map types.
Please see the percentiles function for information on
optional flags, and on performance for large inputs.
Examples:
percentile([3,4,5,6,9,10], 90) is 10
percentile([3,4,5,6,9,10], 90,
{"interpolate_linearly":true}) is 9.5
percentile(["abc", "def",
"ghi", "ghi"], 90) is
"ghi"
percentiles
(class=stats #args=2,3) Returns the given percentiles of
values in an array or map. Returns empty string AKA void for
empty array/map; returns error for non-array/non-map types.
See examples for information on the three option flags.
Examples:
Defaults are to not interpolate linearly, to produce a map keyed by percentile name, and to sort the input before computing percentiles:
percentiles([3,4,5,6,9,10], [25,75]) is { "25": 4, "75": 9 }
percentiles(["abc", "def", "ghi", "ghi"], [25,75]) is { "25": "def", "75": "ghi" }
Use "output_array_not_map" (or shorthand "oa") to get the outputs as an array:
percentiles([3,4,5,6,9,10], [25,75], {"output_array_not_map":true}) is [4, 9]
Use "interpolate_linearly" (or shorthand "il") to do linear interpolation -- note this produces error values on string inputs:
percentiles([3,4,5,6,9,10], [25,75], {"interpolate_linearly":true}) is { "25": 4.25, "75": 8.25 }
The percentiles function always sorts its inputs before computing percentiles. If you know your input is already sorted -- see also the sort_collection function -- then computation will be faster on large input if you pass in "array_is_sorted" (shorthand: "ais"):
x = [6,5,9,10,4,3]
percentiles(x, [25,75], {"ais":true}) gives { "25": 5, "75": 4 }, which is incorrect
x = sort_collection(x)
percentiles(x, [25,75], {"ais":true}) gives { "25": 4, "75": 9 }, which is correct
You can also leverage this feature to compute percentiles on a sort of your choosing. For example:
Non-sorted input:
x = splitax("the quick brown fox jumped loquaciously over the lazy dogs", " ")
x is: ["the", "quick", "brown", "fox", "jumped", "loquaciously", "over", "the", "lazy", "dogs"]
Percentiles are taken over the original positions of the words in the array -- "dogs" is last and hence appears as p99:
percentiles(x, [50, 99], {"oa":true, "ais":true}) gives ["loquaciously", "dogs"]
With sorting done inside percentiles, "the" is alphabetically last and is therefore the p99:
percentiles(x, [50, 99], {"oa":true}) gives ["loquaciously", "the"]
With default sorting done outside percentiles, the same:
x = sort(x) # or x = sort_collection(x)
x is: ["brown", "dogs", "fox", "jumped", "lazy", "loquaciously", "over", "quick", "the", "the"]
percentiles(x, [50, 99], {"oa":true, "ais":true}) gives ["loquaciously", "the"]
percentiles(x, [50, 99], {"oa":true}) gives ["loquaciously", "the"]
Now sorting by word length, "loquaciously" is longest and hence is the p99:
x = sort(x, func(a,b) { return strlen(a) <=> strlen(b) })
x is: ["fox", "the", "the", "dogs", "lazy", "over", "brown", "quick", "jumped", "loquaciously"]
percentiles(x, [50, 99], {"oa":true, "ais":true}) gives ["over", "loquaciously"]
pow
(class=arithmetic #args=2) Exponentiation. Same as **, but
as a function.
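Example (illustrative; not from the original help text):
pow(2, 10) is 1024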
qnorm
(class=math #args=1) Normal cumulative distribution
function.
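Example (illustrative; not from the original help text; the value follows from the symmetry of the standard normal distribution):
qnorm(0) is 0.5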
reduce
(class=higher-order-functions #args=2) Given a map or array
as first argument and a function as second argument,
accumulates entries into a final output -- for example, sum
or product. For arrays, the function should take two
arguments, for accumulated value and array element, and
return the accumulated element. For maps, it should take
four arguments, for accumulated key and value, and
map-element key and value; it should return the updated
accumulator as a new key-value pair (i.e. a single-entry
map). The start value for the accumulator is the first
element for arrays, or the first element’s key-value
pair for maps.
Examples:
Array example: reduce([1,2,3,4,5], func(acc,e) {return acc + e**3}) returns 225.
Map example: reduce({"a":1, "b":3, "c": 5}, func(acck,accv,ek,ev) {return {"sum_of_squares": accv + ev**2}}) returns {"sum_of_squares": 35}.
regextract
(class=string #args=2) Extracts a substring (the first, if
there are multiple matches), matching a regular expression,
from the input. Does not use capture groups; see also the =~
operator which does.
Examples:
regextract("index ab09 file",
"[a-z][a-z][0-9][0-9]") gives "ab09"
regextract("index a999 file",
"[a-z][a-z][0-9][0-9]") gives (absent), which will
result in an assignment not happening.
regextract_or_else
(class=string #args=3) Like regextract but the third
argument is the return value in case the input string (first
argument) doesn’t match the pattern (second argument).
Examples:
regextract_or_else("index ab09 file",
"[a-z][a-z][0-9][0-9]", "nonesuch")
gives "ab09"
regextract_or_else("index a999 file",
"[a-z][a-z][0-9][0-9]", "nonesuch")
gives "nonesuch"
rightpad
(class=string #args=3) Right-pads first argument to at most
the specified length (second, integer argument) using
specified pad value (third, string argument). If the first
argument is not a string, it will be stringified first.
Examples:
rightpad("abcdefg", 10 , "*") gives
"abcdefg***".
rightpad("abcdefg", 10 , "XY") gives
"abcdefgXY".
rightpad("1234567", 10 , "0") gives
"1234567000".
round
(class=math #args=1) Round to nearest integer.
roundm
(class=math #args=2) Round to nearest multiple of m:
roundm($x,$m) is the same as round($x/$m)*$m.
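Examples (illustrative; not from the original help text, values follow from the formula above):
roundm(7, 5) is 5
roundm(8, 5) is 10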
rstrip
(class=string #args=1) Strip trailing whitespace from
string.
sec2dhms
(class=time #args=1) Formats integer seconds as in
sec2dhms(500000) = "5d18h53m20s"
sec2gmt
(class=time #args=1,2) Formats seconds since epoch as GMT
timestamp. Leaves non-numbers as-is. With second integer
argument n, includes n decimal places for the seconds part.
Examples:
sec2gmt(1234567890) = "2009-02-13T23:31:30Z"
sec2gmt(1234567890.123456) =
"2009-02-13T23:31:30Z"
sec2gmt(1234567890.123456, 6) =
"2009-02-13T23:31:30.123456Z"
sec2gmtdate
(class=time #args=1) Formats seconds since epoch (integer
part) as GMT timestamp with year-month-date. Leaves
non-numbers as-is.
Example:
sec2gmtdate(1440768801.7) = "2015-08-28".
sec2hms
(class=time #args=1) Formats integer seconds as in
sec2hms(5000) = "01:23:20"
sec2localdate
(class=time #args=1,2) Formats seconds since epoch (integer
part) as local timestamp with year-month-date. Leaves
non-numbers as-is. Consults $TZ environment variable unless
second argument is supplied.
Examples:
sec2localdate(1440768801.7) = "2015-08-28" with
TZ="Asia/Istanbul"
sec2localdate(1440768801.7, "Asia/Istanbul") =
"2015-08-28"
sec2localtime
(class=time #args=1,2,3) Formats seconds since epoch
(integer part) as local timestamp. Consults $TZ environment
variable unless third argument is supplied. Leaves
non-numbers as-is. With second integer argument n, includes
n decimal places for the seconds part.
Examples:
sec2localtime(1234567890) = "2009-02-14 01:31:30"
with TZ="Asia/Istanbul"
sec2localtime(1234567890.123456) = "2009-02-14
01:31:30" with TZ="Asia/Istanbul"
sec2localtime(1234567890.123456, 6) = "2009-02-14
01:31:30.123456" with TZ="Asia/Istanbul"
sec2localtime(1234567890.123456, 6,
"Asia/Istanbul") = "2009-02-14
01:31:30.123456"
select
(class=higher-order-functions #args=2) Given a map or array
as first argument and a function as second argument,
includes each input element in the output if the function
returns true. For arrays, the function should take one
argument, for array element; for maps, it should take two,
for map-element key and value. In either case it should
return a boolean.
Examples:
Array example: select([1,2,3,4,5], func(e) {return e >=
3}) returns [3, 4, 5].
Map example: select({"a":1, "b":3,
"c":5}, func(k,v) {return v >= 3}) returns
{"b":3, "c": 5}.
sgn
(class=math #args=1) +1, 0, -1 for positive, zero, negative
input respectively.
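Examples (illustrative; not from the original help text):
sgn(-7) is -1
sgn(0) is 0
sgn(8) is 1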
sha1
(class=hashing #args=1) SHA1 hash.
sha256
(class=hashing #args=1) SHA256 hash.
sha512
(class=hashing #args=1) SHA512 hash.
sin
(class=math #args=1) Trigonometric sine.
sinh
(class=math #args=1) Hyperbolic sine.
skewness
(class=stats #args=1) Returns the sample skewness of values
in an array or map. Returns empty string AKA void for
array/map of length less than two; returns error for
non-array/non-map types.
Example:
skewness([4,5,9,10,11]) is -0.2097285
sort
(class=higher-order-functions #args=1-2) Given a map or
array as first argument and string flags or function as
optional second argument, returns a sorted copy of the
input. With one argument, sorts array elements with numbers
first numerically and then strings lexically, and map
elements likewise by map keys. If the second argument is a
string, it can contain any of "f" for lexical
("n" is for the above default), "c" for
case-folded lexical, or "t" for natural sort
order. An additional "r" in that string is for
reverse. An additional "v" in that string means
sort maps by value, rather than by key. If the second
argument is a function, then for arrays it should take two
arguments a and b, returning < 0, 0, or > 0 as a <
b, a == b, or a > b respectively; for maps the function
should take four arguments ak, av, bk, and bv, again
returning < 0, 0, or > 0, using a and b’s keys
and values.
Examples:
Default sorting: sort([3,"A",1,"B",22])
returns [1, 3, 20, "A", "B"].
Note that this is numbers before strings.
Default sorting:
sort(["E","a","c","B","d"])
returns ["B", "E", "a",
"c", "d"].
Note that this is uppercase before lowercase.
Case-folded ascending:
sort(["E","a","c","B","d"],
"c") returns ["a", "B",
"c", "d", "E"].
Case-folded descending:
sort(["E","a","c","B","d"],
"cr") returns ["E", "d",
"c", "B", "a"].
Natural sorting:
sort(["a1","a10","a100","a2","a20","a200"],
"t") returns ["a1", "a2",
"a10", "a20", "a100",
"a200"].
Array with function: sort([5,2,3,1,4], func(a,b) {return b
<=> a}) returns [5,4,3,2,1].
Map with function:
sort({"c":2,"a":3,"b":1},
func(ak,av,bk,bv) {return bv <=> av}) returns
{"a":3,"c":2,"b":1}.
Map without function:
sort({"c":2,"a":3,"b":1})
returns {"a":3,"b":1,"c":2}.
Map without function:
sort({"c":2,"a":3,"b":1},
"v") returns
{"b":1,"c":2,"a":3}.
Map without function:
sort({"c":2,"a":3,"b":1},
"vnr") returns
{"a":3,"c":2,"b":1}.
sort_collection
(class=stats #args=1) This is a helper function for the
percentiles function; please see its online help for
details.
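Example (illustrative; not from the original help text, consistent with its use under percentiles above):
sort_collection([6,5,9,10,4,3]) is [3, 4, 5, 6, 9, 10]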
splita
(class=conversion #args=2) Splits string into array with
type inference. First argument is string to split; second is
the separator to split on.
Example:
splita("3,4,5", ",") = [3,4,5]
splitax
(class=conversion #args=2) Splits string into array without
type inference. First argument is string to split; second is
the separator to split on.
Example:
splitax("3,4,5", ",") =
["3","4","5"]
splitkv
(class=conversion #args=3) Splits string by separators into
map with type inference. First argument is string to split;
second argument is pair separator; third argument is field
separator.
Example:
splitkv("a=3,b=4,c=5", "=",
",") =
{"a":3,"b":4,"c":5}
splitkvx
(class=conversion #args=3) Splits string by separators into
map without type inference (keys and values are strings).
First argument is string to split; second argument is pair
separator; third argument is field separator.
Example:
splitkvx("a=3,b=4,c=5", "=",
",") =
{"a":"3","b":"4","c":"5"}
splitnv
(class=conversion #args=2) Splits string by separator into
integer-indexed map with type inference. First argument is
string to split; second argument is separator to split on.
Example:
splitnv("a,b,c", ",") =
{"1":"a","2":"b","3":"c"}
splitnvx
(class=conversion #args=2) Splits string by separator into
integer-indexed map without type inference (values are
strings). First argument is string to split; second argument
is separator to split on.
Example:
splitnvx("3,4,5", ",") =
{"1":"3","2":"4","3":"5"}
sqrt
(class=math #args=1) Square root.
ssub
(class=string #args=3) Like sub but does no regexing. No
characters are special.
Example:
ssub("abc.def", ".", "X")
gives "abcXdef"
stddev
(class=stats #args=1) Returns the sample standard deviation
of values in an array or map. Returns empty string AKA void
for array/map of length less than two; returns error for
non-array/non-map types.
Example:
stddev([4,5,9,10,11]) is 3.1144823
strfntime
(class=time #args=2) Formats integer nanoseconds since the
epoch as timestamp. Format strings are as at
https://pkg.go.dev/github.com/lestrrat-go/strftime, with the
Miller-specific addition of "%1S" through
"%9S" which format the seconds with 1 through 9
decimal places, respectively. ("%S" uses no
decimal places.) See also
https://miller.readthedocs.io/en/latest/reference-dsl-time/
for more information on the differences from the C library
("man strftime" on your system). See also
strftime_local.
Examples:
strfntime(1440768801123456789,"%Y-%m-%dT%H:%M:%SZ")
= "2015-08-28T13:33:21Z"
strfntime(1440768801123456789,"%Y-%m-%dT%H:%M:%3SZ")
= "2015-08-28T13:33:21.123Z"
strfntime(1440768801123456789,"%Y-%m-%dT%H:%M:%6SZ")
= "2015-08-28T13:33:21.123456Z"
strfntime_local
(class=time #args=2,3) Like strfntime but consults the $TZ
environment variable to get local time zone.
Examples:
strfntime_local(1440768801123456789, "%Y-%m-%d %H:%M:%S
%z") = "2015-08-28 16:33:21 +0300" with
TZ="Asia/Istanbul"
strfntime_local(1440768801123456789, "%Y-%m-%d
%H:%M:%3S %z") = "2015-08-28 16:33:21.123
+0300" with TZ="Asia/Istanbul"
strfntime_local(1440768801123456789, "%Y-%m-%d
%H:%M:%3S %z", "Asia/Istanbul") =
"2015-08-28 16:33:21.123 +0300"
strfntime_local(1440768801123456789, "%Y-%m-%d
%H:%M:%9S %z", "Asia/Istanbul") =
"2015-08-28 16:33:21.123456789 +0300"
strftime
(class=time #args=2) Formats seconds since the epoch as
timestamp. Format strings are as at
https://pkg.go.dev/github.com/lestrrat-go/strftime, with the
Miller-specific addition of "%1S" through
"%9S" which format the seconds with 1 through 9
decimal places, respectively. ("%S" uses no
decimal places.) See also
https://miller.readthedocs.io/en/latest/reference-dsl-time/
for more information on the differences from the C library
("man strftime" on your system). See also
strftime_local.
Examples:
strftime(1440768801.7,"%Y-%m-%dT%H:%M:%SZ") =
"2015-08-28T13:33:21Z"
strftime(1440768801.7,"%Y-%m-%dT%H:%M:%3SZ") =
"2015-08-28T13:33:21.700Z"
strftime_local
(class=time #args=2,3) Like strftime but consults the $TZ
environment variable to get local time zone.
Examples:
strftime_local(1440768801.7, "%Y-%m-%d %H:%M:%S
%z") = "2015-08-28 16:33:21 +0300" with
TZ="Asia/Istanbul"
strftime_local(1440768801.7, "%Y-%m-%d %H:%M:%3S
%z") = "2015-08-28 16:33:21.700 +0300" with
TZ="Asia/Istanbul"
strftime_local(1440768801.7, "%Y-%m-%d %H:%M:%3S
%z", "Asia/Istanbul") = "2015-08-28
16:33:21.700 +0300"
string
(class=conversion #args=1) Convert
int/float/bool/string/array/map to string.
strip
(class=string #args=1) Strip leading and trailing whitespace
from string.
strlen
(class=string #args=1) String length.
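Example (illustrative; not from the original help text; the count is of UTF-8 characters rather than bytes, consistent with the minlen example above):
strlen("año") is 3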
strmatch
(class=string #args=2) Boolean yes/no for whether the
stringable first argument matches the regular-expression
second argument. No regex captures are provided; please see
’strmatchx’.
Examples:
strmatch("a", "abc") is false
strmatch("abc", "a") is true
strmatch("abc", "a[a-z]c") is true
strmatch("abc", "(a).(c)") is true
strmatch(12345, "34") is true
strmatchx
(class=string #args=2) Extended information for whether the
stringable first argument matches the regular-expression
second argument. Regex captures are provided in the
return-value map; \1, \2, etc. are not set, in contrast to
the ’=~’ operator. As well, while the
’=~’ operator limits matches to \1 through \9,
an arbitrary number are supported here.
Examples:
strmatchx("a", "abc") returns:
{
"matched": false
}
strmatchx("abc", "a") returns:
{
"matched": true,
"full_capture": "a",
"full_start": 1,
"full_end": 1
}
strmatchx("[zy:3458]",
"([a-z]+):([0-9]+)") returns:
{
"matched": true,
"full_capture": "zy:3458",
"full_start": 2,
"full_end": 8,
"captures": ["zy", "3458"],
"starts": [2, 5],
"ends": [3, 8]
}
strpntime
(class=time #args=2) Parses timestamp as integer nanoseconds
since the epoch. See also strpntime_local.
Examples:
strpntime("2015-08-28T13:33:21Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440768801000000000
strpntime("2015-08-28T13:33:21.345Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440768801345000000
strpntime("1970-01-01 00:00:00 -0400",
"%Y-%m-%d %H:%M:%S %z") = 14400000000000
strpntime("1970-01-01 00:00:00 +0200",
"%Y-%m-%d %H:%M:%S %z") = -7200000000000
strpntime_local
(class=time #args=2,3) Like strpntime but consults the $TZ
environment variable to get local time zone.
Examples:
strpntime_local("2015-08-28T13:33:21Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440758001000000000 with
TZ="Asia/Istanbul"
strpntime_local("2015-08-28T13:33:21.345Z","%Y-%m-%dT%H:%M:%SZ")
= 1440758001345000000 with TZ="Asia/Istanbul"
strpntime_local("2015-08-28 13:33:21",
"%Y-%m-%d %H:%M:%S") = 1440758001000000000 with
TZ="Asia/Istanbul"
strpntime_local("2015-08-28 13:33:21",
"%Y-%m-%d %H:%M:%S", "Asia/Istanbul") =
1440758001000000000
strptime
(class=time #args=2) Parses timestamp as floating-point
seconds since the epoch. See also strptime_local.
Examples:
strptime("2015-08-28T13:33:21Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440768801.000000
strptime("2015-08-28T13:33:21.345Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440768801.345000
strptime("1970-01-01 00:00:00 -0400",
"%Y-%m-%d %H:%M:%S %z") = 14400
strptime("1970-01-01 00:00:00 +0200",
"%Y-%m-%d %H:%M:%S %z") = -7200
strptime_local
(class=time #args=2,3) Like strptime but consults the $TZ
environment variable to get local time zone.
Examples:
strptime_local("2015-08-28T13:33:21Z",
"%Y-%m-%dT%H:%M:%SZ") = 1440758001 with
TZ="Asia/Istanbul"
strptime_local("2015-08-28T13:33:21.345Z","%Y-%m-%dT%H:%M:%SZ")
= 1440758001.345 with TZ="Asia/Istanbul"
strptime_local("2015-08-28 13:33:21",
"%Y-%m-%d %H:%M:%S") = 1440758001 with
TZ="Asia/Istanbul"
strptime_local("2015-08-28 13:33:21",
"%Y-%m-%d %H:%M:%S", "Asia/Istanbul") =
1440758001
sub
(class=string #args=3) ’$name = sub($name,
"old", "new")’: replace once
(first match, if there are multiple matches), with support
for regular expressions. Capture groups \1 through \9 in the
new part are matched from (...) in the old part, and must be
used within the same call to sub -- they don’t persist
for subsequent DSL statements. See also =~ and regextract.
See also "Regular expressions" at
https://miller.readthedocs.io.
Examples:
sub("ababab", "ab", "XY")
gives "XYabab"
sub("abc.def", ".", "X") gives
"Xbc.def"
sub("abc.def", "\.", "X")
gives "abcXdef"
sub("abcdefg", "[ce]", "X")
gives "abXdefg"
sub("prefix4529:suffix8567",
"suffix([0-9]+)", "name\1") gives
"prefix4529:name8567"
substr
(class=string #args=3) substr is an alias for substr0. See
also substr1. Miller is generally 1-up with all array and
string indices, but substr remains 0-up for backward
compatibility with Miller 5 and below. Arrays are new in
Miller 6; the substr function is older.
substr0
(class=string #args=3) substr0(s,m,n) gives substring of s
from 0-up position m to n inclusive. Negative indices -len
.. -1 alias to 0 .. len-1. See also substr and substr1.
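Example (illustrative; not from the original help text):
substr0("hello", 1, 3) is "ell"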
substr1
(class=string #args=3) substr1(s,m,n) gives substring of s
from 1-up position m to n inclusive. Negative indices -len
.. -1 alias to 1 .. len. See also substr and substr0.
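Example (illustrative; not from the original help text):
substr1("hello", 1, 3) is "hel"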
sum
(class=stats #args=1) Returns the sum of values in an array
or map. Returns error for non-array/non-map types.
Example:
sum([1,2,3,4,5]) is 15
sum2
(class=stats #args=1) Returns the sum of squares of values
in an array or map. Returns error for non-array/non-map
types.
Example:
sum2([1,2,3,4,5]) is 55
sum3
(class=stats #args=1) Returns the sum of cubes of values in
an array or map. Returns error for non-array/non-map types.
Example:
sum3([1,2,3,4,5]) is 225
sum4
(class=stats #args=1) Returns the sum of fourth powers of
values in an array or map. Returns error for
non-array/non-map types.
Example:
sum4([1,2,3,4,5]) is 979
sysntime
(class=time #args=0) Returns the system time in 64-bit
nanoseconds since the epoch.
system
(class=system #args=1) Run command string, yielding its
stdout minus final carriage return.
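Example (illustrative; not from the original help text):
’$out = system("echo hello")’ sets $out to "hello", the command’s stdout minus the trailing line ending.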
systime
(class=time #args=0) Returns the system time in
floating-point seconds since the epoch.
systimeint
(class=time #args=0) Returns the system time in integer
seconds since the epoch.
tan
(class=math #args=1) Trigonometric tangent.
tanh
(class=math #args=1) Hyperbolic tangent.
tolower
(class=string #args=1) Convert string to lowercase.
toupper
(class=string #args=1) Convert string to uppercase.
truncate
(class=string #args=2) Truncates string first argument to
max length of int second argument.
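Examples (illustrative; not from the original help text):
truncate("abcdefg", 3) is "abc"
truncate("ab", 3) is "ab"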
typeof
(class=typing #args=1) Returns the type of its argument as a
string (e.g. "str"). Intended for debugging.
unflatten
(class=collections #args=2) Reverses flatten. Useful for
nested JSON-like structures for non-JSON file formats like
CSV. The first argument is a map, and the second argument is
the flatten separator. See also arrayify. See
"Flatten/unflatten: converting between JSON and tabular
formats" at https://miller.readthedocs.io for more
information.
Example:
unflatten({"a.b.c" : 4}, ".") is
{"a": "b": { "c": 4 }}.
unformat
(class=string #args=2) Using first argument as format
string, unpacks second argument into an array of matches,
with type-inference. On non-match, returns error -- use
is_error() to check.
Examples:
unformat("{}:{}:{}", "1:2:3") gives [1,
2, 3].
unformat("{}h{}m{}s", "3h47m22s") gives
[3, 47, 22].
is_error(unformat("{}h{}m{}s",
"3:47:22")) gives true.
unformatx
(class=string #args=2) Same as unformat, but without
type-inference.
Examples:
unformatx("{}:{}:{}", "1:2:3") gives
["1", "2", "3"].
unformatx("{}h{}m{}s", "3h47m22s") gives
["3", "47", "22"].
is_error(unformatx("{}h{}m{}s",
"3:47:22")) gives true.
upntime
(class=time #args=0) Returns the time in 64-bit nanoseconds
since the current Miller program was started.
uptime
(class=time #args=0) Returns the time in floating-point
seconds since the current Miller program was started.
urand
(class=math #args=0) Floating-point numbers uniformly
distributed on the unit interval.
Example (int-valued): ’$n=floor(20+urand()*11)’.
urand32
(class=math #args=0) Integer uniformly distributed between 0
and 2**32-1 inclusive.
urandelement
(class=math #args=1) Random sample from the first argument,
which must be a non-empty array.
urandint
(class=math #args=2) Integer uniformly distributed between
inclusive integer endpoints.
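Example (illustrative; not from the original help text):
urandint(1, 6) yields one of the integers 1 through 6.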
urandrange
(class=math #args=2) Floating-point numbers uniformly
distributed on the interval [a, b).
utf8_to_latin1
(class=string #args=1) Tries to convert UTF-8-encoded string
to Latin-1-encoded string. If argument is array or map,
recurses into it.
Examples:
$y = utf8_to_latin1($x)
$* = utf8_to_latin1($*)
variance
(class=stats #args=1) Returns the sample variance of values
in an array or map. Returns empty string AKA void for
array/map of length less than two; returns error for
non-array/non-map types.
Example:
variance([4,5,9,10,11]) is 9.7
version
(class=system #args=0) Returns the Miller version as a
string.
!
(class=boolean #args=1) Logical negation.
!=
(class=boolean #args=2) String/numeric inequality. Mixing
number and string results in string compare.
!=~
(class=boolean #args=2) String (left-hand side) does not
match regex (right-hand side), e.g. ’$name !=~
"^a.*b$"’.
%
(class=arithmetic #args=2) Remainder; never negative-valued
(pythonic).
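Examples (illustrative; not from the original help text; the second shows the non-negative result):
7 % 5 is 2
-1 % 5 is 4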
&
(class=arithmetic #args=2) Bitwise AND.
&&
(class=boolean #args=2) Logical AND.
*
(class=arithmetic #args=2) Multiplication, with
integer*integer overflow to float.
**
(class=arithmetic #args=2) Exponentiation. Same as pow, but
as an infix operator.
+
(class=arithmetic #args=1,2) Addition as binary operator;
unary plus operator.
-
(class=arithmetic #args=1,2) Subtraction as binary operator;
unary negation operator.
.
(class=string #args=2) String concatenation. Non-strings are
coerced, so you can do ’"ax".98’
etc.
.*
(class=arithmetic #args=2) Multiplication, with
integer-to-integer overflow.
.+
(class=arithmetic #args=2) Addition, with integer-to-integer
overflow.
.-
(class=arithmetic #args=2) Subtraction, with
integer-to-integer overflow.
./
(class=arithmetic #args=2) Integer division, rounding toward
zero.
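Example (illustrative; not from the original help text):
-7 ./ 2 is -3, rounding toward zero.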
/
(class=arithmetic #args=2) Division. Integer / integer is
integer when exact, else floating-point: e.g. 6/3 is 2 but
6/4 is 1.5.
//
(class=arithmetic #args=2) Pythonic integer division,
rounding toward negative.
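Example (illustrative; not from the original help text):
-7 // 2 is -4, rounding toward negative.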
<
(class=boolean #args=2) String/numeric less-than. Mixing
number and string results in string compare.
<<
(class=arithmetic #args=2) Bitwise left-shift.
<=
(class=boolean #args=2) String/numeric less-than-or-equals.
Mixing number and string results in string compare.
<=>
(class=boolean #args=2) Comparator, nominally for sorting.
Given a <=> b, returns <0, 0, >0 as a < b, a
== b, or a > b, respectively.
==
(class=boolean #args=2) String/numeric equality. Mixing
number and string results in string compare.
=~
(class=boolean #args=2) String (left-hand side) matches
regex (right-hand side), e.g. ’$name =~
"^a.*b$"’. Capture groups \1 through \9 are
matched from (...) in the right-hand side, and can be used
within subsequent DSL statements. See also "Regular
expressions" at https://miller.readthedocs.io.
Examples:
With if-statement: if ($url =~ "http.*com") { ...
}
Without if-statement: given $line = "index ab09
file", and $line =~
"([a-z][a-z])([0-9][0-9])", then $label =
"[\1:\2]", $label is "[ab:09]"
>
(class=boolean #args=2) String/numeric greater-than. Mixing
number and string results in string compare.
>=
(class=boolean #args=2) String/numeric
greater-than-or-equals. Mixing number and string results in
string compare.
>>
(class=arithmetic #args=2) Bitwise signed right-shift.
>>>
(class=arithmetic #args=2) Bitwise unsigned right-shift.
?:
(class=boolean #args=3) Standard ternary operator.
??
(class=boolean #args=2) Absent-coalesce operator. $a ?? 1
evaluates to 1 if $a isn’t defined in the current
record.
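Example (illustrative; not from the original help text):
’$y = $x ?? 0’ assigns 0 to $y when $x is absent from the current record, else assigns $x.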
???
(class=boolean #args=2) Absent/empty-coalesce operator. $a
??? 1 evaluates to 1 if $a isn’t defined in the
current record, or has empty value.
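Example (illustrative; not from the original help text):
’$y = $x ??? 0’ assigns 0 to $y when $x is absent or empty, else assigns $x.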
^
(class=arithmetic #args=2) Bitwise XOR.
^^
(class=boolean #args=2) Logical XOR.
|
(class=arithmetic #args=2) Bitwise OR.
||
(class=boolean #args=2) Logical OR.
~
(class=arithmetic #args=1) Bitwise NOT. Beware
’$y=~$x’ since =~ is the regex-match operator:
try ’$y = ~$x’.
KEYWORDS FOR PUT AND FILTER
all
all: used in "emit1", "emit",
"emitp", and "unset" as a synonym for
@*
begin
begin: defines a block of statements to be executed before
input records
are ingested. The body statements must be wrapped in curly
braces.
Example: ’begin { @count = 0 }’
bool
bool: declares a boolean local variable in the current
curly-braced scope.
Type-checking happens at assignment: ’bool b =
1’ is an error.
break
break: causes execution to continue after the body of the
current for/while/do-while loop.
call
call: used for invoking a user-defined subroutine.
Example: ’subr s(k,v) { print k . " is " . v} call s("a", $a)’
continue
continue: causes execution to skip the remaining statements
in the body of
the current for/while/do-while loop. For-loop increments are
still applied.
do
do: with "while", introduces a do-while loop. The
body statements must be wrapped
in curly braces.
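Example (illustrative; not from the original help text):
’var i = 0; do { print i; i += 1 } while (i < 3)’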
dump
dump: prints all currently defined out-of-stream variables
immediately
to stdout as JSON.
With >,
>>, or |, the data do not go directly to stdout but
are instead
redirected.
The > and
>> are for write and append, as in the shell, but (as
with awk) the
file-overwrite for > is on first write, not per record.
The | is for piping to
a process which will process the data. There will be one
open file for each
distinct file name (for > and >>) or one
subordinate process for each distinct
value of the piped-to command (for |). Output-formatting
flags are taken from
the main command line.
Example: mlr
--from f.dat put -q ’@v[NR]=$*; end { dump }’
Example: mlr --from f.dat put -q ’@v[NR]=$*; end {
dump > "mytap.dat"}’
Example: mlr --from f.dat put -q ’@v[NR]=$*; end {
dump >> "mytap.dat"}’
Example: mlr --from f.dat put -q ’@v[NR]=$*; end {
dump | "jq .[]"}’
edump
edump: prints all currently defined out-of-stream variables
immediately
to stderr as JSON.
Example: mlr --from f.dat put -q ’@v[NR]=$*; end { edump }’
elif
elif: the way Miller spells "else if". The body
statements must be wrapped
in curly braces.
else
else: terminates an if/elif/elif chain. The body statements
must be wrapped
in curly braces.
emit1
emit1: inserts an out-of-stream variable into the output
record stream. Unlike
the other map variants, side-by-sides, indexing, and
redirection are not supported,
but you can emit any map-valued expression.
Example: mlr
--from f.dat put ’emit1 $*’
Example: mlr --from f.dat put ’emit1
mapsum({"id": NR}, $*)’
Please see https://miller.readthedocs.io for more information.
emit
emit: inserts an out-of-stream variable into the output
record stream. Hashmap
indices present in the data but not slotted by emit
arguments are not output.
With >,
>>, or |, the data do not become part of the output
record stream but
are instead redirected.
The > and
>> are for write and append, as in the shell, but (as
with awk) the
file-overwrite for > is on first write, not per record.
The | is for piping to
a process which will process the data. There will be one
open file for each
distinct file name (for > and >>) or one
subordinate process for each distinct
value of the piped-to command (for |). Output-formatting
flags are taken from
the main command line.
You can use any
of the output-format command-line flags, e.g. --ocsv, --ofs,
etc., to control the format of the output if the output is
redirected. See also mlr -h.
Example: mlr
--from f.dat put ’emit > "/tmp/data-".$a,
$*’
Example: mlr --from f.dat put ’emit >
"/tmp/data-".$a, mapexcept($*,
"a")’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
@sums’
Example: mlr --from f.dat put --ojson
’@sums[$a][$b]+=$x; emit >
"tap-".$a.$b.".dat", @sums’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
@sums, "index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
@*, "index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
> "mytap.dat", @*, "index1",
"index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
>> "mytap.dat", @*, "index1",
"index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
| "gzip > mytap.dat.gz", @*,
"index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
> stderr, @*, "index1",
"index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x; emit
| "grep somepattern", @*, "index1",
"index2"’
Please see https://miller.readthedocs.io for more information.
emitf
emitf: inserts non-indexed out-of-stream variable(s)
side-by-side into the
output record stream.
With >,
>>, or |, the data do not become part of the output
record stream but
are instead redirected.
The > and
>> are for write and append, as in the shell, but (as
with awk) the
file-overwrite for > is on first write, not per record.
The | is for piping to
a process which will process the data. There will be one
open file for each
distinct file name (for > and >>) or one
subordinate process for each distinct
value of the piped-to command (for |). Output-formatting
flags are taken from
the main command line.
You can use any
of the output-format command-line flags, e.g. --ocsv, --ofs,
etc., to control the format of the output if the output is
redirected. See also mlr -h.
Example: mlr
--from f.dat put ’@a=$i;@b+=$x;@c+=$y; emitf @a’
Example: mlr --from f.dat put --oxtab
’@a=$i;@b+=$x;@c+=$y; emitf >
"tap-".$i.".dat", @a’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf @a, @b, @c’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf > "mytap.dat", @a, @b, @c’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf >> "mytap.dat", @a, @b, @c’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf > stderr, @a, @b, @c’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf | "grep somepattern", @a, @b, @c’
Example: mlr --from f.dat put ’@a=$i;@b+=$x;@c+=$y;
emitf | "grep somepattern > mytap.dat", @a, @b,
@c’
Please see https://miller.readthedocs.io for more information.
emitp
emitp: inserts an out-of-stream variable into the output
record stream.
Hashmap indices present in the data but not slotted by emitp
arguments are
output concatenated with ":".
With >,
>>, or |, the data do not become part of the output
record stream but
are instead redirected.
The > and
>> are for write and append, as in the shell, but (as
with awk) the
file-overwrite for > is on first write, not per record.
The | is for piping to
a process which will process the data. There will be one
open file for each
distinct file name (for > and >>) or one
subordinate process for each distinct
value of the piped-to command (for |). Output-formatting
flags are taken from
the main command line.
You can use any
of the output-format command-line flags, e.g. --ocsv, --ofs,
etc., to control the format of the output if the output is
redirected. See also mlr -h.
Example: mlr
--from f.dat put ’@sums[$a][$b]+=$x; emitp
@sums’
Example: mlr --from f.dat put --opprint
’@sums[$a][$b]+=$x; emitp >
"tap-".$a.$b.".dat", @sums’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp @sums, "index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp @*, "index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp > "mytap.dat", @*, "index1",
"index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp >> "mytap.dat", @*,
"index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp | "gzip > mytap.dat.gz", @*,
"index1", "index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp > stderr, @*, "index1",
"index2"’
Example: mlr --from f.dat put ’@sums[$a][$b]+=$x;
emitp | "grep somepattern", @*,
"index1", "index2"’
Please see https://miller.readthedocs.io for more information.
end
end: defines a block of statements to be executed after
input records
are ingested. The body statements must be wrapped in curly
braces.
Example:
’end { emit @count }’
Example: ’end { eprint "Final count is " .
@count }’
eprint
eprint: prints expression immediately to stderr.
Example: mlr
--from f.dat put -q ’eprint "The sum of x and y
is ".($x+$y)’
Example: mlr --from f.dat put -q ’for (k, v in $*) {
eprint k . " => " . v }’
Example: mlr --from f.dat put ’(NR % 1000 == 0) {
eprint "Checkpoint ".NR}’
eprintn
eprintn: prints expression immediately to stderr, without
trailing newline.
Example: mlr --from f.dat put -q ’eprintn "The sum of x and y is ".($x+$y); eprint ""’
false
false: the boolean literal value.
filter
filter: includes/excludes the record in the output record
stream.
Example: mlr --from f.dat put ’filter (NR == 2 || $x > 5.4)’
Instead of put
with ’filter false’ you can simply use put -q.
The following
uses the input record to accumulate data but only prints the
running sum
without printing the input record:
Example: mlr --from f.dat put -q ’@running_sum += $x * $y; emit @running_sum’
float
float: declares a floating-point local variable in the
current curly-braced scope.
Type-checking happens at assignment: ’float x =
0’ is an error.
for
for: defines a for-loop using one of three styles. The body
statements must
be wrapped in curly braces.
For-loop over stream record:
Example: ’for (k, v in $*) { ... }’
For-loop over out-of-stream variables:
Example:
’for (k, v in @counts) { ... }’
Example: ’for ((k1, k2), v in @counts) { ... }’
Example: ’for ((k1, k2, k3), v in @*) { ...
}’
C-style for-loop:
Example: ’for (var i = 0, var b = 1; i < 10; i += 1, b *= 2) { ... }’
func
func: used for defining a user-defined function.
Example: ’func f(a,b) { return sqrt(a**2+b**2)} $d = f($x, $y)’
funct
funct: used for saying that a function argument is a
user-defined function.
Example: ’func g(num a, num b, funct f) :num { return f(a**2+b**2) }’
if
if: starts an if/elif/elif chain. The body statements must
be wrapped
in curly braces.
in
in: used in for-loops over stream records or out-of-stream
variables.
int
int: declares an integer local variable in the current
curly-braced scope.
Type-checking happens at assignment: ’int x =
0.0’ is an error.
map
map: declares a map-valued local variable in the current
curly-braced scope.
Type-checking happens at assignment: ’map b = 0’
is an error. map b = {} is
always OK. map b = a is OK or not depending on whether a is
a map.
num
num: declares an int/float local variable in the current
curly-braced scope.
Type-checking happens at assignment: ’num b =
true’ is an error.
print
print: prints expression immediately to stdout.
Example: mlr
--from f.dat put -q ’print "The sum of x and y is
".($x+$y)’
Example: mlr --from f.dat put -q ’for (k, v in $*) {
print k . " => " . v }’
Example: mlr --from f.dat put ’(NR % 1000 == 0) {
print > stderr, "Checkpoint ".NR}’
printn
printn: prints expression immediately to stdout, without
trailing newline.
Example: mlr --from f.dat put -q ’printn "."; end { print "" }’
return
return: specifies the return value from a user-defined
function.
Omitted return statements (including via if-branches) result
in an absent-null return value, which in turn results in a
skipped assignment to an LHS.
stderr
stderr: Used for tee, emit, emitf, emitp, print, and dump in
place of filename
to print to standard error.
stdout
stdout: Used for tee, emit, emitf, emitp, print, and dump in
place of filename
to print to standard output.
str
str: declares a string local variable in the current
curly-braced scope.
Type-checking happens at assignment.
subr
subr: used for defining a subroutine.
Example: ’subr s(k,v) { print k . " is " . v} call s("a", $a)’
tee
tee: prints the current record to specified file.
This is an immediate print to the specified file (except for
pprint format
which of course waits until the end of the input stream to
format all output).
The > and
>> are for write and append, as in the shell, but (as
with awk) the
file-overwrite for > is on first write, not per record.
The | is for piping to
a process which will process the data. There will be one
open file for each
distinct file name (for > and >>) or one
subordinate process for each distinct
value of the piped-to command (for |). Output-formatting
flags are taken from
the main command line.
You can use any
of the output-format command-line flags, e.g. --ocsv, --ofs,
etc., to control the format of the output. See also mlr
-h.
emit with
redirect and tee with redirect are identical, except tee can
only
output $*.
Example: mlr
--from f.dat put ’tee > "/tmp/data-".$a,
$*’
Example: mlr --from f.dat put ’tee >>
"/tmp/data-".$a.$b, $*’
Example: mlr --from f.dat put ’tee > stderr,
$*’
Example: mlr --from f.dat put -q ’tee | "tr
\[a-z\\] \[A-Z\\]", $*’
Example: mlr --from f.dat put -q ’tee | "tr
\[a-z\\] \[A-Z\\] > /tmp/data-".$a, $*’
Example: mlr --from f.dat put -q ’tee | "gzip
> /tmp/data-".$a.".gz", $*’
Example: mlr --from f.dat put -q --ojson ’tee |
"gzip > /tmp/data-".$a.".gz",
$*’
true
true: the boolean literal value.
unset
unset: clears field(s) from the current record, or an
out-of-stream or local variable.
Example: mlr
--from f.dat put ’unset $x’
Example: mlr --from f.dat put ’unset $*’
Example: mlr --from f.dat put ’for (k, v in $*) { if
(k =~ "a.*") { unset $[k] } }’
Example: mlr --from f.dat put ’...; unset @sums’
Example: mlr --from f.dat put ’...; unset
@sums["green"]’
Example: mlr --from f.dat put ’...; unset
@*’
var
var: declares an untyped local variable in the current
curly-braced scope.
Examples: ’var a=1’, ’var xyz=""’
while
while: introduces a while loop, or with "do",
introduces a do-while loop.
The body statements must be wrapped in curly braces.
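Example (illustrative; not from the original help text):
’var i = 0; while (i < 3) { print i; i += 1 }’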
ENV
ENV: access to environment variables by name, e.g.
’$home = ENV["HOME"]’
FILENAME
FILENAME: evaluates to the name of the current file being
processed.
FILENUM
FILENUM: evaluates to the number of the current file being
processed,
starting with 1.
FNR
FNR: evaluates to the number of the current record within
the current file
being processed, starting with 1. Resets at the start of
each file.
IFS
IFS: evaluates to the input field separator from the command
line.
IPS
IPS: evaluates to the input pair separator from the command
line.
IRS
IRS: evaluates to the input record separator from the
command line,
or to LF or CRLF from the input data if in autodetect mode
(which is
the default).
M_E
M_E: the mathematical constant e.
M_PI
M_PI: the mathematical constant pi.
NF
NF: evaluates to the number of fields in the current
record.
NR
NR: evaluates to the number of the current record over all
files
being processed, starting with 1. Does not reset at the
start of each file.
OFS
OFS: evaluates to the output field separator from the
command line.
OPS
OPS: evaluates to the output pair separator from the command
line.
ORS
ORS: evaluates to the output record separator from the
command line,
or to LF or CRLF from the input data if in autodetect mode
(which is
the default).
AUTHOR
Miller is written by John Kerl <kerl.john.r [AT] gmail.com>.
This manual page has been composed from Miller’s help output by Eric MSP Veith <eveith [AT] veith-m.de>.
SEE ALSO
awk(1), sed(1), cut(1), join(1), sort(1), RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files, the Miller docsite https://miller.readthedocs.io