NAME
xtract - NCBI Entrez Direct XML conversion and transformation tool
SYNOPSIS
xtract [-help] [-strict] [-mixed] [-self] [-accent] [-ascii] [-compress] [-stops] [-input filename] [-transform filename] [-aliases filename] [-pattern expr] [-group expr] [-block expr] [-subset expr] [-path path] [-if expr [constraint]] [-unless expr [constraint]] [-and condition] [-or condition] [-else] [-position pos] [-equals str] [-contains str] [-includes str] [-is-within str] [-starts-with str] [-ends-with str] [-is-not str] [-is-before str] [-is-after str] [-matches str] [-resembles str] [-is-equal-to expr] [-differs-from expr] [-gt N] [-ge N] [-lt N] [-le N] [-eq N] [-ne N] [-ret str] [-tab str] [-sep str] [-pfx str] [-sfx str] [-rst] [-clr] [-pfc str] [-deq str] [-def str] [-lbl str] [-set tag] [-rec tag] [-wrp tag] [-enc tag] [-plg str] [-elg str] [-pkg tag] [-fwd str] [-awd str] [-tag tag] [-att key value] [-cls] [-slf] [-end tag] [-element element] [-first element] [-last element] [-backward element] [-NAME] [--STATS] [-num element] [-len element] [-sum element] [-acc element] [-min element] [-max element] [-inc element] [-dec element] [-sub element] [-avg element] [-dev element] [-med element] [-mul element] [-div element] [-mod element] [-bin element] [-oct element] [-hex element] [-bit element] [-pad element] [-encode element] [-upper element] [-lower element] [-chain element] [-title element] [-mirror element] [-alnum element] [-basic element] [-plain element] [-simple element] [-author element] [-prose element] [-terms element] [-words element] [-pairs element] [-order element] [-reverse element] [-letters element] [-clauses element] [-year element] [-month element] [-date element] [-page element] [-auth element] [-initials element] [-jour element] [-trim element] [-wct element] [-doi element] [-translate element] [-classify element] [-replace -reg target -exp replacement] [-revcomp] [-nucleic] [-fasta] [-ncbi2na] [-ncbi4na] [-molwt] [-0-based element] [-1-based element] [-ucsc-based element] [-insd arg ...] 
[-histogram] [-e2index [extras]] [-indices element] [-article element] [-abstract element] [-paragraph element] [-stemmed element] [-head str] [-tail str] [-hd str] [-tl str] [-select condition] [-in filename] [-sort[-fwd] element] [-sort-rev element] [-format fmt [-unicode style]] [-verify] [-outline] [-synopsis] [-contour [delimiter]] [-examples] [-unix] [-version]
DESCRIPTION
xtract converts an XML document into a table of data values according to user-specified rules.
OPTIONS
Processing Flags
-strict
Remove HTML and MathML tags.
-mixed
Allow mixed content XML.
-self
Allow detection of empty self-closing tags.
-accent
Delete Unicode accents and diacritical marks.
-ascii
Convert Unicode to numeric HTML character entities.
-compress
Compress runs of spaces.
-stops
Retain stop words in selected phrases.
Data Source
-input filename
Read XML from file instead of standard input.
-transform filename
File of substitutions for -translate.
-aliases filename
Mappings file for -classify operation.
Exploration Argument Hierarchy
-pattern expr
-group expr
-block expr
-subset expr
Name of record within set. Use of different argument names allows command-line control of nested looping.
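The nested looping described above can be pictured with a rough Python analogue. This is only a sketch, not xtract's implementation: the sample XML, the `table` helper, and its parameter names are all hypothetical, and it assumes one output row per record with the values of the nested objects appended as further columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; element names are illustrative only.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <Author><LastName>Smith</LastName></Author>
    <Author><LastName>Jones</LastName></Author>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <Author><LastName>Lee</LastName></Author>
  </PubmedArticle>
</PubmedArticleSet>"""


def table(xml_text, pattern, field, block, block_field):
    """Outer loop over records, inner loop over objects nested in each record."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):        # outer level, like -pattern
        row = [record.findtext(field, default="")]
        for obj in record.iter(block):       # inner level, like -block
            row.append(obj.findtext(block_field, default=""))
        rows.append(row)
    return rows


for row in table(SAMPLE, "PubmedArticle", "PMID", "Author", "LastName"):
    print("\t".join(row))
```

Each distinct argument name (-pattern, -group, -block, -subset) would correspond to one more level of `for` loop in a sketch like this.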
Path Navigation
-path path
Explore by list of adjacent object names.
Exploration Constructs
Object
DateRevised
Parent/Child
Book/AuthorList
Path
MedlineCitation/Article/Journal/JournalIssue/PubDate
Heterogeneous
"PubmedArticleSet/*"
Exhaustive
"History/**"
Nested
"*/Taxon"
Conditional Execution
-if expr [constraint]
Element (or @attribute) must exist and satisfy any specified constraint.
-unless expr [constraint]
Skip if element matches.
-and condition
Preceding and following tests must both pass.
-or condition
Any passing test suffices.
-else
Execute if conditional test failed.
-position pos
first/last/outer/inner/even/odd/all.
String Constraints
-equals str
String must match exactly.
-contains str
Substring must be present.
-includes str
Substring must match at word boundaries.
-is-within str
String must be present.
-starts-with str
Substring must be at beginning.
-ends-with str
Substring must be at end.
-is-not str
String must not match.
-is-before str
First string < second string.
-is-after str
First string > second string.
-matches str
Matches without commas or semicolons.
-resembles str
Requires all words, but in any order.
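A few of the string constraints above can be approximated in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, comparisons are case-insensitive as the NOTES section says, and the word-boundary rule for -includes is modeled with the regular-expression `\b` anchor, which may not match xtract's own definition of a word boundary.

```python
import re


def equals(value, s):
    """Whole-string match, ignoring case (like -equals)."""
    return value.casefold() == s.casefold()


def is_not(value, s):
    """Negation of equals (like -is-not)."""
    return not equals(value, s)


def contains(value, s):
    """Substring match anywhere in the value (like -contains)."""
    return s.casefold() in value.casefold()


def includes(value, s):
    """Substring match that must start and end at word boundaries (like -includes)."""
    pattern = r"\b" + re.escape(s) + r"\b"
    return re.search(pattern, value, re.IGNORECASE) is not None


def starts_with(value, s):
    return value.casefold().startswith(s.casefold())


def ends_with(value, s):
    return value.casefold().endswith(s.casefold())


title = "Snake venom phospholipase A2"
print(contains(title, "lipase"))   # substring is present
print(includes(title, "lipase"))   # but not as a whole word
```

The last two calls show the difference between -contains and -includes: "lipase" occurs inside "phospholipase" but never as a word of its own.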
Object Constraints
-is-equal-to expr
Object values must match.
-differs-from expr
Object values must differ.
Numeric Constraints
-gt N
Greater than.
-ge N
Greater than or equal to.
-lt N
Less than.
-le N
Less than or equal to.
-eq N
Equal to.
-ne N
Not equal to.
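The six numeric constraints map directly onto ordinary integer comparisons. The sketch below is illustrative only: the `NUMERIC` table and `passes` helper are hypothetical names, and values are converted with `int` because the NOTES section states that numeric constraints use integer values.

```python
import operator

# Flag name -> comparison applied to (element value, N).
NUMERIC = {
    "-gt": operator.gt,
    "-ge": operator.ge,
    "-lt": operator.lt,
    "-le": operator.le,
    "-eq": operator.eq,
    "-ne": operator.ne,
}


def passes(value, flag, n):
    """Return True if the element value satisfies the numeric constraint."""
    return NUMERIC[flag](int(value), int(n))


print(passes("2015", "-ge", 2010))
print(passes("7", "-lt", 5))
```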
Format Customization
-ret str
Override line break between patterns.
-tab str
Replace tab character between fields.
-sep str
Separator between group members.
-pfx str
Prefix to print before group.
-sfx str
Suffix to print after group.
-rst
Reset -sep through -elg.
-clr
Clear queued tab separator.
-pfc str
Preface combines -clr and -pfx.
-deq str
Delete and replace queued tab separator.
-def str
Default placeholder for missing fields.
-lbl str
Insert arbitrary text.
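How the main formatting arguments interact can be sketched as a small joining function. Everything here is an assumption made for illustration: the `format_record` name, the tab default for both separators, and the rule that a missing field produces no column unless a default placeholder is given. xtract's real queuing of separators is more involved than this.

```python
def format_record(columns, tab="\t", sep="\t", ret="\n", pfx="", sfx="", default=None):
    """Join one record's output.

    columns: list of lists; each inner list holds the values found for one
    selected element (the "group members").
    """
    parts = []
    for values in columns:
        if not values:
            if default is None:
                continue              # missing field, no placeholder requested
            values = [default]        # like -def
        # like -pfx and -sfx around the group, -sep between its members
        parts.append(pfx + sep.join(values) + sfx)
    # like -tab between fields and -ret at the end of the record
    return tab.join(parts) + ret


print(format_record([["1"], ["Smith", "Jones"], []], sep=", ", default="-"), end="")
```

In this sketch the author names share one column because they are members of the same group, while the empty third column is filled with the placeholder.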
XML Generation
-set tag
XML tag for entire set.
-rec tag
XML tag for each record.
-wrp tag
Wrap elements in XML object.
-enc tag
Encase instance in XML object.
-plg str
Prologue to print before instance.
-elg str
Epilogue to print after instance.
-pkg tag
Package subset in XML object.
-fwd str
Foreword to print before subset.
-awd str
Afterword to print after subset.
Tag and Attribute Construction
-tag tag
Start with <tag.
-att key value
Attribute key and value.
-cls
Close with >.
-slf
Self-close with />.
-end tag
End contents with </tag>.
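The pieces above assemble an XML element step by step. The following Python sketch mirrors that sequence; `build_element` is a hypothetical helper, and the escaping of content and attribute values is an assumption of the sketch rather than documented xtract behavior.

```python
from xml.sax.saxutils import escape, quoteattr


def build_element(tag, attrs=(), content=None):
    """Assemble an element in the order -tag, -att ..., then -cls/-end or -slf."""
    out = "<" + tag                                   # like -tag
    for key, value in attrs:
        out += " " + key + "=" + quoteattr(value)     # like -att key value
    if content is None:
        return out + "/>"                             # like -slf
    out += ">"                                        # like -cls
    out += escape(content)                            # element contents
    return out + "</" + tag + ">"                     # like -end


print(build_element("Item", [("type", "journal")], "Cell & Tissue"))
print(build_element("Flag", [("set", "Y")]))
```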
Element Selection
-element element
Print all items that match tag name.
-first element
Only print value of first item.
-last element
Only print value of last item.
-backward element
Print values in reverse order.
-NAME
Record value in named variable.
--STATS
Accumulate values into variable.
-element Constructs
Tag
Caption
Group
Initials,LastName
Parent/Child
MedlineCitation/PMID
Recursive
"**/Gene-commentary_accession"
Unrestricted
PubDate/*
Attribute
DescriptorName@MajorTopicYN
Range
MedlineDate[1:4]
Substring
"Title[phospholipase | rattlesnake]"
Object Count
"#Author"
Item Length
"%Title"
Element Depth
"^PMID"
Variable
"&NAME"
Special -element Operations
Parent Index
"+"
Object Name
"?"
Object Value
"~"
XML Subtree
"*"
Children
"$"
Attributes
"@"
ASN.1 Record
"."
JSON Record
"%"
Numeric Processing
-num element
Count.
-len element
Length.
-sum element
Sum.
-acc element
Accumulator.
-min element
Minimum.
-max element
Maximum.
-inc element
Increment.
-dec element
Decrement.
-sub element
Difference.
-avg element
Average.
-dev element
Deviation.
-med element
Median.
-mul element
Product.
-div element
Quotient.
-mod element
Remainder.
-bin element
Binary.
-oct element
Octal.
-hex element
Hexadecimal.
-bit element
Bit count.
-pad element
Zero-pad to eight digits.
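Several of these operations have direct Python counterparts, shown here as a hedged sketch. The helper names are hypothetical, the inputs are assumed to be integers, and details such as the letter case of hexadecimal output or how xtract treats an even-length list for the median are not taken from this page.

```python
import statistics


def summary(values):
    """Aggregate a list of integers (compare -sum, -min, -max, -med)."""
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "med": statistics.median(values),
    }


def radix(n):
    """Render an integer in other bases (compare -bin, -oct, -hex)."""
    return {"bin": format(n, "b"), "oct": format(n, "o"), "hex": format(n, "x")}


def bit_count(n):
    """Number of set bits (compare -bit)."""
    return bin(n).count("1")


def pad(n):
    """Zero-pad to eight digits (compare -pad)."""
    return str(n).zfill(8)


print(summary([3, 9, 4]))
print(radix(10), bit_count(10), pad(42))
```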
Character Processing
-encode element
XML-encode <, >, &, ", and ' characters.
-upper element
Convert text to uppercase.
-lower element
Convert text to lowercase.
-chain element
Change spaces to underscores.
-title element
Capitalize initial letters of words.
-mirror element
Reverse order of letters.
-alnum element
Non-alphanumeric characters to space.
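As an illustration, the character-level operations can be imitated with short Python functions. These are approximations with hypothetical names: the choice of `&quot;` and `&apos;` for quotes, the ASCII-only definition of "alphanumeric", and the lowercasing of non-initial letters in `title` are assumptions of the sketch.

```python
import re
from xml.sax.saxutils import escape


def encode(text):
    """Escape <, >, &, and both quote characters (compare -encode)."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def chain(text):
    """Spaces to underscores (compare -chain)."""
    return text.replace(" ", "_")


def title(text):
    """Capitalize the first letter of each space-separated word (compare -title)."""
    return " ".join(word.capitalize() for word in text.split(" "))


def mirror(text):
    """Reverse the order of letters (compare -mirror)."""
    return text[::-1]


def alnum(text):
    """Replace each non-alphanumeric character with a space (compare -alnum)."""
    return re.sub(r"[^A-Za-z0-9]", " ", text)


print(encode('a<b & "c"'))
print(chain("gene expression"), title("gene expression"), mirror("abc"))
```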
String Processing
-basic element
Convert superscripts and subscripts.
-plain element
Remove embedded mixed-content markup tags.
-simple element
Normalize accented letters; spell Greek letters.
-author element
Multi-step author cleanup.
-prose element
Text conversion to ASCII.
Text Processing
-terms element
Partition text at spaces.
-words element
Split at punctuation marks.
-pairs element
Adjacent informative words.
-order element
Rearrange words in sorted order.
-reverse element
Reverse words in string.
-letters element
Separate individual letters.
-clauses element
Break at phrase separators.
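The difference between splitting at spaces and splitting at punctuation is easiest to see side by side. The sketch below uses hypothetical helper names and treats any run of non-alphanumeric ASCII characters as punctuation, which is an assumption; the lowercasing in `words` follows the NOTES section.

```python
import re


def terms(text):
    """Partition at whitespace only (compare -terms)."""
    return text.split()


def words(text):
    """Split at punctuation as well, and lowercase (compare -words)."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


def order(text):
    """Rearrange words in sorted order (compare -order)."""
    return " ".join(sorted(text.split()))


def reverse(text):
    """Reverse the order of words (compare -reverse)."""
    return " ".join(reversed(text.split()))


sample = "Beta-blocker therapy, revisited"
print(terms(sample))
print(words(sample))
```

Here `terms` keeps "Beta-blocker" and "therapy," intact, while `words` yields four lowercase tokens.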
Citation Functions
-year element
Extract first 4-digit year from string.
-month element
Match first month name and return a corresponding integer.
-date element
YYYY/MM/DD from -unit "PubDate" -date "*"
-page element
Get digits (and letters) of first page number.
-auth element
Change GenBank authors to Medline form.
-initials element
Parse initials from forename or given name.
-jour element
Clean up journal name punctuation.
-trim element
Remove extra spaces and leading zeros.
-wct element
Count number of -words in a string.
-doi element
Add https://doi.org/ prefix, URL encode.
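Three of the citation helpers are simple enough to sketch in Python. The function names, the three-letter month prefixes, and the exact set of characters left unescaped in the DOI are all assumptions made for this illustration; xtract's own rules may differ.

```python
import re
from urllib.parse import quote

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def year(text):
    """First run of exactly four digits, or None (compare -year)."""
    match = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    return match.group(0) if match else None


def month(text):
    """Integer for the first month name found, or None (compare -month)."""
    match = re.search(r"\b(" + "|".join(MONTHS) + r")", text.lower())
    return MONTHS.index(match.group(1)) + 1 if match else None


def doi(identifier):
    """Add the https://doi.org/ prefix and URL-encode (compare -doi)."""
    return "https://doi.org/" + quote(identifier, safe="/")


print(year("1998 Dec-1999 Jan"), month("1998 Dec-1999 Jan"))
print(doi("10.1000/xyz(123)"))
```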
Value Transformation
-translate element
Substitute values with -transform table.
-classify element
Match word or phrase substrings against -aliases table.
Regular Expression
-replace
Substitute text using regular expressions.
-reg target
Target expression.
-exp replacement
Replacement pattern.
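The target and replacement arguments play the same roles as the pattern and replacement passed to a regular-expression substitution. The Python sketch below shows the idea with a hypothetical `replace` helper; note that regular-expression dialects differ (for example in how capture groups are referenced), so patterns written for this sketch may not carry over to xtract unchanged.

```python
import re


def replace(value, reg, exp):
    """Substitute every match of reg in value with exp."""
    return re.sub(reg, exp, value)


# Swap a year-month pair; \1 and \2 are Python's capture group references.
print(replace("2021-03", r"(\d+)-(\d+)", r"\2/\1"))
```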
Sequence Processing
-revcomp
Reverse complement nucleotide sequence.
-nucleic
Subrange determines forward or revcomp.
-fasta
Split sequence into blocks of 70 uppercase letters.
-ncbi2na
Expand ncbi2na to IUPAC. (May need to truncate result to actual sequence length.)
-ncbi4na
Expand ncbi4na to IUPAC. (May need to truncate result to actual sequence length.)
-molwt
Calculate molecular weight of peptide.
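Reverse complementing and fixed-width line splitting can be sketched in a few lines of Python. The helper names are hypothetical and the complement table covers only A, C, G, and T; handling of IUPAC ambiguity codes is left out of this sketch.

```python
# Complement table for unambiguous bases only (an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(sequence):
    """Complement each base, then reverse the string (compare -revcomp)."""
    return sequence.translate(COMPLEMENT)[::-1]


def fasta(sequence, width=70):
    """Uppercase the sequence and cut it into lines of `width` letters (compare -fasta)."""
    sequence = sequence.upper()
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


print(revcomp("ATGC"))
print([len(line) for line in fasta("acgt" * 40)])
```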
Sequence Coordinates
-0-based element
Zero-based.
-1-based element
One-based.
-ucsc-based element
Half-open.
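The three conventions differ only in where counting starts and whether the stop position is included. The sketch below converts one interval between them, assuming the input is zero-based with an inclusive stop. That input convention and the helper names are assumptions for illustration; this page does not say which convention xtract expects in the source data.

```python
def to_one_based(start, stop):
    """Zero-based inclusive -> one-based inclusive: shift both ends by one."""
    return start + 1, stop + 1


def to_half_open(start, stop):
    """Zero-based inclusive -> zero-based half-open: stop becomes exclusive."""
    return start, stop + 1


def length(start, stop):
    """Number of positions covered by a zero-based inclusive interval."""
    return stop - start + 1


interval = (9, 19)                 # zero-based, inclusive
print(to_one_based(*interval))
print(to_half_open(*interval))
```

In the half-open form the interval length is simply stop minus start, which is one reason that convention is popular for genome browsers.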
Command Generator
-insd arg ...
Generate INSDSeq extraction commands. Print them if invoked standalone; run them if invoked as part of a pipeline. Requires one or more arguments, which may appear in the following order:
Descriptor(s)
INSDSeq_sequence/INSDSeq_definition/INSDSeq_division/... [...]
Completeness
complete/partial
Feature(s)
CDS/mRNA/...[,...]
Qualifier(s)
INSDFeature_key/"#INSDInterval"/gene/product/feat_location/sub_sequence/... [...]
Frequency Table
-histogram
Collects data for sort-uniq-count(1) on entire set of records.
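A frequency table of this kind amounts to counting how often each value occurs across the whole set. The Python sketch below shows the counting step only; the `histogram` name and the (count, value) output order are assumptions, not a description of the actual output of -histogram or sort-uniq-count(1).

```python
from collections import Counter


def histogram(values):
    """Count occurrences of each value and return (count, value) pairs sorted by value."""
    counts = Counter(values)
    return [(counts[value], value) for value in sorted(counts)]


for count, value in histogram(["2019", "2021", "2019", "2020", "2019"]):
    print(count, value, sep="\t")
```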
Entrez Indexing
-e2index [extras]
Create Entrez index XML. extras (true or false; false by default) indicates whether to index extra fields.
-indices element
Index normalized words.
-article element
Title positional index.
-abstract element
Abstract positional index.
-paragraph element
Index text paragraphs.
-stemmed element
Apply Porter2 algorithm.
Output Organization
-head str
Print before everything else.
-tail str
Print after everything else.
-hd str
Print before each record.
-tl str
Print after each record.
Record Selection
-select condition
Select record subset by conditions.
-in filename
File of identifiers to use for selection.
Record Rearrangement
-sort[-fwd] element
Element to use as sort key.
-sort-rev element
Sort records in reverse order.
Reformatting
-format fmt
copy
Fast block copy (still applies processing flags).
compact
Compress runs of spaces.
flush
Suppress line indentation.
indent
Indent according to nesting depth.
expand
Place each attribute on a separate line.
Validation
-verify
Report XML data integrity problems.
Summary
-outline
Display outline of XML structure.
-synopsis
Display individual XML paths.
-contour [delimiter]
Display XML paths to leaf nodes (delimited by / by default).
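Listing paths to leaf nodes is a plain depth-first walk of the XML tree. The sketch below is a rough Python analogue with a hypothetical `leaf_paths` helper; whether xtract includes the root element in each path or removes duplicate paths is not stated here, so the sketch simply reports every leaf once, root included.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text, delimiter="/"):
    """Return the path from the root to every element that has no children."""
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix):
        path = prefix + [node.tag]
        children = list(node)
        if not children:
            paths.append(delimiter.join(path))   # leaf node: record its path
        for child in children:
            walk(child, path)

    walk(root, [])
    return paths


print(leaf_paths("<A><B><C>1</C></B><D/></A>"))
print(leaf_paths("<A><B><C>1</C></B><D/></A>", delimiter="."))
```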
Full Exploration Command Precedence
-pattern
-path
-division
-group
-branch
-block
-section
-subset
-unit
Documentation
-help
Print usage information and some example argument combinations.
-examples
Complete usage examples, involving additional Entrez Direct tools.
-unix
Illustrate common Unix command arguments.
-version
Print version number.
NOTES
String constraints use case-insensitive comparisons.
Numeric constraints and selection arguments use integer values.
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
SEE ALSO
archive-pmc(1), archive-pubmed(1), custom-index(1), disambiguate-nucleotides(1), download-ncbi-data(1), ds2pme(1), esample(1), fetch-pmc(1), fetch-pubmed(1), find-in-gene(1), fuse-segments(1), gene2range(1), hgvs2spdi(1), index-extras(1), index-pubmed(1), pma2pme(1), rchive(1), snp2hgvs(1), snp2tbl(1), sort-uniq-count(1), spdi2tbl(1), tbl2prod(1), transmute(1), uniq-table(1), xml2fsa(1), xml2tbl(1), xy-plot(1).