NAME

xtract - NCBI Entrez Direct XML conversion and transformation tool

SYNOPSIS

xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-tag tag] [-att key value] [-cls] [-slf] [-end tag] [-element element] [-first element] [-last element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element] [-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] 
[-histogram] [-e2index [extras]] [-indices element] [-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]

DESCRIPTION

xtract converts an XML document into a table of data values according to user-specified rules.

OPTIONS

Processing Flags
-strict

Remove HTML and MathML tags.

-mixed

Allow mixed content XML.

-self

Allow detection of empty self-closing tags.

-accent

Delete Unicode accents and diacritical marks.

-ascii

Convert Unicode to numeric HTML character entities.

-compress

Compress runs of spaces.

-stops

Retain stop words in selected phrases.

Data Source
-input filename

Read XML from file instead of standard input.

-transform filename

File of substitutions for -translate.

-aliases filename

Mappings file for -classify operation.

Exploration Argument Hierarchy
-pattern expr
-group expr
-block expr
-subset expr

Name of record within set. Use of different argument names allows command-line control of nested looping.

Path Navigation
-path path

Explore by list of adjacent object names.

Exploration Constructs

Object

DateRevised

Parent/Child

Book/AuthorList

Path

MedlineCitation/Article/Journal/JournalIssue/PubDate

Heterogeneous

"PubmedArticleSet/*"

Exhaustive

"History/**"

Nested

"*/Taxon"

Conditional Execution
-if expr [constraint]

Element (or @attribute) must exist and satisfy any specified constraint.

-unless expr [constraint]

Skip if element matches.

-and condition

Preceding and following tests must both pass.

-or condition

Any passing test suffices.

-else

Execute if conditional test failed.

-position pos

first/last/outer/inner/even/odd/all.

String Constraints
-equals str

String must match exactly.

-contains str

Substring must be present.

-includes str

Substring must match at word boundaries.

-is-within str

String must be present.

-starts-with str

Substring must be at beginning.

-ends-with str

Substring must be at end.

-is-not str

String must not match.

-is-before str

First string < second string.

-is-after str

First string > second string.

-matches str

Matches without commas or semicolons.

-resembles str

Requires all words, but in any order.
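
The string constraints above can be read as simple predicates. The following is a minimal Python sketch of one plausible reading, not xtract's implementation. The function names are hypothetical, the case-insensitive comparison follows the NOTES section, and the interpretations of -matches (commas and semicolons ignored) and -resembles (every word of the argument present, in any order) are assumptions drawn only from the one-line descriptions.

```python
import re

def _fold(s: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore case."""
    return " ".join(s.lower().split())

def equals(value: str, s: str) -> bool:
    return _fold(value) == _fold(s)

def contains(value: str, s: str) -> bool:
    return _fold(s) in _fold(value)

def starts_with(value: str, s: str) -> bool:
    return _fold(value).startswith(_fold(s))

def ends_with(value: str, s: str) -> bool:
    return _fold(value).endswith(_fold(s))

def is_before(value: str, s: str) -> bool:
    # "First string < second string", compared after case folding.
    return _fold(value) < _fold(s)

def includes(value: str, s: str) -> bool:
    # Substring must begin and end at word boundaries.
    pattern = r"\b" + re.escape(_fold(s)) + r"\b"
    return re.search(pattern, _fold(value)) is not None

def matches(value: str, s: str) -> bool:
    # Assumption: commas and semicolons are ignored on both sides.
    def strip(t: str) -> str:
        return _fold(re.sub(r"[,;]", " ", t))
    return strip(value) == strip(s)

def resembles(value: str, s: str) -> bool:
    # Assumption: every word of s must occur in value, in any order.
    return set(_fold(s).split()) <= set(_fold(value).split())
```

For example, includes("heart attack risk", "Attack") holds under this reading, while includes("heartattack", "attack") does not, which is the difference between -includes and -contains.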

Object Constraints
-is-equal-to expr

Object values must match.

-differs-from expr

Object values must differ.

Numeric Constraints

-gt N

Greater than.

-ge N

Greater than or equal to.

-lt N

Less than.

-le N

Less than or equal to.

-eq N

Equal to.

-ne N

Not equal to.

Format Customization
-ret str

Override line break between patterns.

-tab str

Replace tab character between fields.

-sep str

Separator between group members.

-pfx str

Prefix to print before group.

-sfx str

Suffix to print after group.

-rst

Reset -sep through -elg.

-clr

Clear queued tab separator.

-pfc str

Preface combines -clr and -pfx.

-deq str

Delete and replace queued tab separator.

-def str

Default placeholder for missing fields.

-lbl str

Insert arbitrary text.
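
How the separators combine may be easier to see in code. The sketch below is a simplified Python model of assembling one output row, not xtract's implementation. The function render_row and its parameter names are hypothetical, the tab and newline defaults are assumptions, and this page does not say whether the -def placeholder is wrapped in -pfx and -sfx (here it is).

```python
def render_row(groups, tab="\t", sep="\t", pfx="", sfx="", default=None, ret="\n"):
    """Assemble one output line.

    groups  -- list of lists; each inner list holds the values matched
               by one element argument (one "group")
    tab     -- text placed between fields (-tab)
    sep     -- text placed between members of one group (-sep)
    pfx/sfx -- text printed before/after each group (-pfx, -sfx)
    default -- placeholder for a group with no values (-def)
    ret     -- text that ends the line (-ret)
    """
    fields = []
    for members in groups:
        if not members:
            if default is None:
                continue  # a missing field produces no column
            members = [default]
        fields.append(pfx + sep.join(members) + sfx)
    return tab.join(fields) + ret
```

With these assumptions, render_row([["12345"], ["Smith", "Jones"]], sep=", ") yields the identifier, a tab, and then "Smith, Jones" on one line.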

XML Generation
-set tag

XML tag for entire set.

-rec tag

XML tag for each record.

-wrp tag

Wrap elements in XML object.

-enc tag

Encase instance in XML object.

-plg str

Prologue to print before instance.

-elg str

Epilogue to print after instance.

-pkg tag

Package subset in XML object.

-fwd str

Foreword to print before subset.

-awd str

Afterword to print after subset.

Tag and Attribute Construction
-tag tag

Start with <tag.

-att key value

Attribute key and value.

-cls

Close with >.

-slf

Self-close with />.

-end tag

End contents with </tag>.

Element Selection
-element element

Print all items that match tag name.

-first element

Only print value of first item.

-last element

Only print value of last item.

-backward element

Print values in reverse order.

-NAME

Record value in named variable.

--STATS

Accumulate values into variable.

-element Constructs

Tag

Caption

Group

Initials,LastName

Parent/Child

MedlineCitation/PMID

Recursive

"**/Gene-commentary_accession"

Unrestricted

PubDate/*

Attribute

DescriptorName@MajorTopicYN

Range

MedlineDate[1:4]

Substring

"Title[phospholipase | rattlesnake]"

Object Count

"#Author"

Item Length

"%Title"

Element Depth

"^PMID"

Variable

"&NAME"

Special -element Operations

Parent Index

"+"

Object Name

"?"

Object Value

"~"

XML Subtree

"*"

Children

"$"

Attributes

"@"

ASN.1 Record

"."

JSON Record

"%"

Numeric Processing
-num element

Count.

-len element

Length.

-sum element

Sum.

-acc element

Accumulator.

-min element

Minimum.

-max element

Maximum.

-inc element

Increment.

-dec element

Decrement.

-sub element

Difference.

-avg element

Average.

-dev element

Deviation.

-med element

Median.

-mul element

Product.

-div element

Quotient.

-mod element

Remainder.

-bin element

Binary.

-oct element

Octal.

-hex element

Hexadecimal.

-bit element

Bit count.

-pad element

Zero-pad to eight digits.
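
For the base-conversion and padding operations, the short Python sketch below shows the arithmetic that the descriptions suggest. It is illustrative only. The function names are hypothetical, and details such as the letter case of hexadecimal digits are assumptions that this page does not specify.

```python
def to_bin(n: int) -> str:
    """Binary digits of n, without a prefix (cf. -bin)."""
    return format(n, "b")

def to_oct(n: int) -> str:
    """Octal digits of n, without a prefix (cf. -oct)."""
    return format(n, "o")

def to_hex(n: int) -> str:
    """Hexadecimal digits of n; lowercase is an assumption (cf. -hex)."""
    return format(n, "x")

def bit_count(n: int) -> int:
    """Number of 1 bits in n (cf. -bit)."""
    return bin(n).count("1")

def pad8(n: int) -> str:
    """Zero-pad n to eight digits (cf. -pad)."""
    return f"{n:08d}"
```

For instance, pad8(42) gives "00000042", which keeps numeric identifiers in the right order when they are later sorted as text.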

Character Processing
-encode element

XML-encode <, >, &, ", and ' characters.

-upper element

Convert text to uppercase.

-lower element

Convert text to lowercase.

-chain element

Change spaces to underscores.

-title element

Capitalize initial letters of words.

-mirror element

Reverse order of letters.

-alnum element

Non-alphanumeric characters to space.
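
The character operations above are small text transformations. As a rough illustration, each can be approximated in Python as follows. The function names are hypothetical, this is not xtract's code, and edge cases (for example, how -title treats letters after the first) are assumptions.

```python
import re
from xml.sax.saxutils import escape

def encode(s: str) -> str:
    """XML-encode <, >, &, double and single quotes (cf. -encode)."""
    return escape(s, {'"': "&quot;", "'": "&apos;"})

def chain(s: str) -> str:
    """Change spaces to underscores (cf. -chain)."""
    return s.replace(" ", "_")

def title(s: str) -> str:
    """Capitalize the first letter of each word, leaving the rest as-is."""
    return " ".join(w[:1].upper() + w[1:] for w in s.split(" "))

def mirror(s: str) -> str:
    """Reverse the order of letters (cf. -mirror)."""
    return s[::-1]

def alnum(s: str) -> str:
    """Replace each non-alphanumeric character with a space (cf. -alnum)."""
    return re.sub(r"[^0-9A-Za-z]", " ", s)
```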

String Processing
-basic element

Convert superscripts and subscripts.

-plain element

Remove embedded mixed-content markup tags.

-simple element

Normalize accented letters; spell Greek letters.

-author element

Multi-step author cleanup.

-prose element

Text conversion to ASCII.

Text Processing
-terms element

Partition text at spaces.

-words element

Split at punctuation marks.

-pairs element

Adjacent informative words.

-order element

Rearrange words in sorted order.

-reverse element

Reverse words in string.

-letters element

Separate individual letters.

-clauses element

Break at phrase separators.

Citation Functions
-year element

Extract first 4-digit year from string.

-month element

Match first month name and return a corresponding integer.

-date element

YYYY/MM/DD from -unit "PubDate" -date "*"

-page element

Get digits (and letters) of first page number.

-auth element

Change GenBank authors to Medline form.

-initials element

Parse initials from forename or given name.

-jour element

Clean up journal name punctuation.

-trim element

Remove extra spaces and leading zeros.

-wct element

Count number of -words in a string.

-doi element

Add https://doi.org/ prefix, URL encode.
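
Three of the citation functions have descriptions precise enough to sketch. The Python below is a hedged approximation of -year, -month, and -doi, not the actual implementation. The function names are hypothetical, matching months by three-letter English prefix is an assumption, and exactly which characters xtract URL-encodes is not stated on this page.

```python
import re
from urllib.parse import quote

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def year(s: str):
    """Return the first standalone run of exactly four digits, or None."""
    m = re.search(r"(?<!\d)\d{4}(?!\d)", s)
    return m.group(0) if m else None

def month(s: str):
    """Return 1-12 for the first month name found at a word start, or None."""
    m = re.search(r"\b(" + "|".join(MONTHS) + ")", s.lower())
    return MONTHS.index(m.group(1)) + 1 if m else None

def doi(s: str) -> str:
    """Prefix with https://doi.org/ and percent-encode unsafe characters."""
    return "https://doi.org/" + quote(s, safe="/")
```

Applied to a MedlineDate-style string such as "1998 Dec-1999 Jan", this sketch returns "1998" for the year and 12 for the month.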

Value Transformation
-translate element

Substitute values with -transform table.

-classify element

Substring word or phrase matches to -aliases table.

Regular Expression
-replace

Substitute text using regular expressions.

-reg target

Target expression.

-exp replacement

Replacement pattern.

Sequence Processing
-revcomp

Reverse complement nucleotide sequence.

-nucleic

Subrange determines forward or revcomp.

-fasta

Split sequence into blocks of 70 uppercase letters.

-ncbi2na

Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)

-ncbi4na

Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)

-molwt

Calculate molecular weight of peptide.
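
To make -revcomp and -fasta concrete, here is a small Python sketch of the two operations as described above. It is an illustration under stated assumptions rather than xtract's code: the function names are hypothetical, and the complement table covers the standard IUPAC nucleotide codes, which may differ from what xtract accepts.

```python
# Complement table for IUPAC nucleotide codes, upper and lower case.
_SRC = "ACGTRYKMBDHVSWN"
_DST = "TGCAYRMKVHDBSWN"
_COMPLEMENT = str.maketrans(_SRC + _SRC.lower(), _DST + _DST.lower())

def revcomp(seq: str) -> str:
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fasta(seq: str, width: int = 70) -> str:
    """Uppercase the sequence and split it into lines of `width` letters."""
    s = seq.upper()
    return "\n".join(s[i:i + width] for i in range(0, len(s), width))
```

Under these assumptions, revcomp("ATGC") gives "GCAT", and a 160-letter sequence passed to fasta() comes back as lines of 70, 70, and 20 letters.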

Sequence Coordinates
-0-based element

Zero-based.

-1-based element

One-based.

-ucsc-based element

Half-open.
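
The three coordinate conventions describe the same interval in different ways. The Python sketch below shows the usual arithmetic, starting from a one-based closed interval. The function names are hypothetical, and this page does not state which convention xtract assumes for the input element, so the starting point here is an assumption.

```python
def to_0_based(start: int, stop: int):
    """One-based closed -> zero-based closed: shift both ends down by one."""
    return (start - 1, stop - 1)

def to_ucsc(start: int, stop: int):
    """One-based closed -> zero-based half-open: shift only the start."""
    return (start - 1, stop)

def length_1_based(start: int, stop: int) -> int:
    """Interval length when both ends are inclusive."""
    return stop - start + 1

def length_ucsc(start: int, stop: int) -> int:
    """Interval length when the end is exclusive."""
    return stop - start
```

For the one-based interval 11..20, to_0_based gives (10, 19) and to_ucsc gives (10, 20). All three describe the same ten positions.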

Command Generator
-insd arg ...

Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:

Descriptor(s)

INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]

Completeness

complete/partial

Feature(s)

CDS/mRNA/...[,...]

Qualifier(s)

INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]

Frequency Table
-histogram

Collects data for sort-uniq-count(1) on entire set of records.

Entrez Indexing
-e2index [extras]

Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.

-indices element

Index normalized words.

-article element

Title positional index.

-abstract element

Abstract positional index.

-paragraph element

Index text paragraphs.

-stemmed element

Apply Porter2 algorithm.

Output Organization
-head str

Print before everything else.

-tail str

Print after everything else.

-hd str

Print before each record.

-tl str

Print after each record.

Record Selection
-select condition

Select record subset by conditions.

-in filename

File of identifiers to use for selection.

Record Rearrangement
-sort[-fwd] element

Element to use as sort key.

-sort-rev element

Sort records in reverse order.

Reformatting
-format fmt

copy

Fast block copy (still applies processing flags).

compact

Compress runs of spaces.

flush

Suppress line indentation.

indent

Indent according to nesting depth.

expand

Place each attribute on a separate line.

Validation
-verify

Report XML data integrity problems.

Summary
-outline

Display outline of XML structure.

-synopsis

Display individual XML paths.

-contour [delimiter]

Display XML paths to leaf nodes (delimited by / by default).

Full Exploration Command Precedence
-pattern

-path

-division

-group

-branch

-block

-section

-subset

-unit

Documentation

-help

Print usage information and some example argument combinations.

-examples

Complete usage examples, involving additional Entrez Direct tools.

-unix

Illustrate common Unix command arguments.

-version

Print version number.

NOTES

String constraints use case-insensitive comparisons.

Numeric constraints and selection arguments use integer values.

-num and -len selections are synonyms for Object Count (#) and Item Length (%).

-words, -pairs, and -indices convert to lower case.

SEE ALSO

archive-pmc(1), archive-pubmed(1), custom-index(1), disambiguate-nucleotides(1), download-ncbi-data(1), ds2pme(1), esample(1), fetch-pmc(1), fetch-pubmed(1), find-in-gene(1), fuse-segments(1), gene2range(1), hgvs2spdi(1), index-extras(1), index-pubmed(1), pma2pme(1), rchive(1), snp2hgvs(1), snp2tbl(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xml2fsa(1), xml2tbl(1), xy-plot(1).