Manpages

NAME

kmer-mask - mask and filter a set of nucleotide sequences by kmer content

SYNOPSIS

kmer-mask {-novel|-confirmed} [-mdb mer-database] [-ms mer-size] [-edb exist-database] [-m min-size] [-e extend-size] [-lowthreshold l] [-highthreshold h] [-t threads] [-v] [-h histogram] [-promote|-demote|-discard] -1 in.1.fastq [-2 in.2.fastq] -o output-prefix

DESCRIPTION

Mask and filter a set of sequences (presumed to be reads) by kmer content. Masking can either retain novel sequence that is not in the database, or retain confirmed sequence that is present in the database. Filtering then segregates the sequences into fully masked, partially masked, and retained (unmasked) groups.

OPTIONS

-mdb mer-database

load masking kmers from meryl(1) mer-database

-ms mer-size

kmer size

-edb exist-database

save masking kmers to an existDB(1) file exist-database for faster restarts

-1 in.1.fastq

-2 in.2.fastq

input read files in fastq, fastq.gz, fastq.bz2 or fastq.xz format. The second file is optional, but the output classification is unreliable without it.

-o out

prefix for output reads

out.fullymasked.[12].fastq

reads with less than 'lowthreshold' of their bases retained

out.partiallymasked.[12].fastq

reads with a retained fraction between 'lowthreshold' and 'highthreshold'

out.retained.[12].fastq

reads with more than 'highthreshold' of their bases retained

out.discarded.[12].fastq

read pairs whose two reads have conflicting status

-m min-size

ignore database hits shorter than this many consecutive kmers (default: 0)

-e extend-size

extend database hits across this many missing kmers (default: 0)
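
The effect of -m and -e on a run of kmer hits can be sketched as below. This is only an illustration under stated assumptions, not the tool's implementation: the function names, the per-kmer boolean representation, and the order of the two steps (extend first, then drop short hits) are all hypothetical.

```python
from itertools import groupby


def runs(flags):
    """Yield (value, start, length) for each run of equal values."""
    pos = 0
    for value, group in groupby(flags):
        length = len(list(group))
        yield value, pos, length
        pos += length


def smooth_hits(hits, min_size=0, extend_size=0):
    """Apply -e style gap filling, then -m style short-hit removal.

    hits is a list of booleans, one per kmer position (True = database hit).
    The order of the two steps is an assumption made for this sketch.
    """
    out = list(hits)
    # Extend: fill gaps of at most extend_size missing kmers between two hits.
    for value, start, length in runs(hits):
        interior = start > 0 and start + length < len(hits)
        if not value and interior and length <= extend_size:
            out[start:start + length] = [True] * length
    # Ignore: drop hits shorter than min_size consecutive kmers.
    extended = list(out)
    for value, start, length in runs(extended):
        if value and length < min_size:
            out[start:start + length] = [False] * length
    return out
```

With the defaults of 0 for both values the hits are returned unchanged, which matches the documented defaults.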

-novel

RETAIN novel sequence not present in the database

-confirmed

RETAIN confirmed sequence present in the database

-promote

promote the lesser RETAINED read to the status of the more RETAINED read
read1=fullymasked and read2=partiallymasked -> both are partiallymasked

-demote

demote the more RETAINED read to the status of the lesser RETAINED read
read1=fullymasked and read2=partiallymasked -> both are fullymasked

-discard

discard pairs with conflicting status (DEFAULT)
read1=fullymasked and read2=partiallymasked -> both are discarded
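
The three pair-handling modes can be summarised in a short sketch. The function and constant names are hypothetical and the code only mirrors the examples given above; it is not taken from the kmer-mask source.

```python
# Statuses ordered from least to most RETAINED.
ORDER = ["fullymasked", "partiallymasked", "retained"]


def resolve_pair(status1, status2, mode="discard"):
    """Return the final status of both reads in a pair."""
    if status1 == status2:
        return status1, status2  # no conflict, nothing to resolve
    if mode == "promote":
        best = max(status1, status2, key=ORDER.index)
        return best, best
    if mode == "demote":
        worst = min(status1, status2, key=ORDER.index)
        return worst, worst
    if mode == "discard":
        return "discarded", "discarded"
    raise ValueError("mode must be promote, demote or discard")
```

For example, resolve_pair("fullymasked", "partiallymasked", "promote") gives both reads the status partiallymasked.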

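Statistics are reported on stderr as the number of sequences by amount RETAINED, using these thresholds:

-lowthreshold t

reads with less than this fraction of bases RETAINED are fullymasked (default: 0.3333)

-highthreshold t

reads with more than this fraction of bases RETAINED are retained (default: 0.6667)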
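
As a rough illustration of how the two thresholds separate reads into the output classes, consider the sketch below. The function name is hypothetical, and the treatment of values exactly equal to a threshold (counted as partiallymasked here) is an assumption.

```python
def classify(retained_fraction, low=0.3333, high=0.6667):
    """Map the fraction of bases RETAINED in a read to an output class."""
    if retained_fraction < low:
        return "fullymasked"
    if retained_fraction > high:
        return "retained"
    # Everything in between, including the boundary values (assumption).
    return "partiallymasked"
```

A read with half of its bases RETAINED would be classified as partiallymasked under the default thresholds.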

-h histogram

write a histogram of the amount of sequence RETAINED

-t t

use t compute threads

-v

show progress
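
A command line following the SYNOPSIS can be assembled programmatically, for instance when driving kmer-mask from a pipeline script. The helper below is a hypothetical sketch that only builds the argument list from options documented above; the file names in the usage note are placeholders.

```python
def build_command(mode, read1, out_prefix, read2=None, mer_db=None, threads=None):
    """Build an argument list for kmer-mask from documented options."""
    if mode not in ("novel", "confirmed"):
        raise ValueError("mode must be 'novel' or 'confirmed'")
    cmd = ["kmer-mask", "-" + mode]
    if mer_db is not None:
        cmd += ["-mdb", mer_db]
    if threads is not None:
        cmd += ["-t", str(threads)]
    cmd += ["-1", read1]
    if read2 is not None:
        cmd += ["-2", read2]
    cmd += ["-o", out_prefix]
    return cmd
```

Calling build_command("novel", "in.1.fastq", "out", read2="in.2.fastq", mer_db="db", threads=4) returns a list that could be passed to subprocess.run.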

SEE ALSO

meryl(1) existDB(1)