Natural Language Processing¶
BatteryDataExtractor also includes state-of-the-art Natural Language Processing (NLP) facilities, as described here.
Tokenization¶
Sentence Tokenization
Use the sentences property on a text-based document element to perform sentence segmentation. The sentence tokenizer is based on the en_core_sci_sm model, which is trained specifically on scientific text and provided by scispaCy:
>>> from batterydataextractor.doc import Paragraph
>>> para = Paragraph('The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known. It develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 ± 2.4 mAh g−1.')
>>> para.sentences
[Sentence('The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known.', 0, 123),
Sentence('It develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 ± 2.4 mAh g−1.', 124, 344)]
Each sentence object is a document element in itself, and additionally contains the start and end character offsets within its parent element.
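As a quick plain-Python illustration (no library needed), the offsets reported for a sentence can be used to slice the parent element's text and recover the sentence verbatim. The text and the (0, 123) offsets below come from the example above:

```python
# The sentence tokenizer reports (start, end) character offsets relative
# to the parent element, so slicing the original text with those offsets
# recovers the sentence string exactly.
text = (
    "The mechanism of lithium intercalation in the so-called \u2018soft\u2019 "
    "anodes, i.e. graphite or graphitable carbons, is well known. "
    "It develops through well-identified, reversible stages, ..."
)

start, end = 0, 123  # offsets reported for the first Sentence above
sentence = text[start:end]
print(sentence.endswith("is well known."))  # True
```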
Word Tokenization
Use the tokens property to get the word tokens:
>>> para.tokens
[[Token('The', 0, 3),
Token('mechanism', 4, 13),
Token('of', 14, 16),
Token('lithium', 17, 24),
Token('intercalation', 25, 38),
Token('in', 39, 41),
Token('the', 42, 45),
Token('so', 46, 48),
Token('-', 48, 49),
...
]]
>>> para.sentences[0].tokens
[Token('The', 0, 3),
Token('mechanism', 4, 13),
Token('of', 14, 16),
Token('lithium', 17, 24),
...
Token('known', 117, 122),
Token('.', 122, 123)]
There are also raw_sentences and raw_tokens properties that return strings instead of Sentence and Token objects.
Set CPU/GPU Device¶
Each document element is assigned a default device value of -1 (CPU). You can set device to a local GPU rank (e.g. 0 or 1) to accelerate the NLP pipeline:
>>> para.device = 0
>>> s = Sentence("Li-ion battery")
>>> s.device = 1
>>> print(para.device, s.device)
0 1
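By convention in this kind of pipeline, a rank of -1 selects the CPU and any non-negative rank selects the corresponding CUDA device. A minimal sketch of that mapping (the helper name to_torch_device is hypothetical, not part of the library's API):

```python
def to_torch_device(device: int) -> str:
    """Map a device rank to a torch-style device string.

    -1 selects the CPU; a non-negative integer selects that GPU rank.
    (Illustrative convention only, not a library function.)
    """
    return "cpu" if device < 0 else f"cuda:{device}"

print(to_torch_device(-1))  # cpu
print(to_torch_device(0))   # cuda:0
```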
Part-of-speech Tagging¶
BatteryDataExtractor contains a chemistry-aware part-of-speech tagger based on a fine-tuned BERT-cased model. Use the pos_tagged_tokens property on a document element to get the tagged tokens:
>>> s = Sentence('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
>>> s.pos_tagged_tokens
[('1H', 'CD'),
('NMR', 'NNP'),
('spectra', 'NN'),
('were', 'VBD'),
('recorded', 'VBN'),
('on', 'IN'),
('a', 'DT'),
('300', 'CD'),
('MHz', '.'),
('BRUKER', 'NNP'),
('DPX300', 'NNP'),
('spectrometer', 'NN'),
('.', '.')]
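Because pos_tagged_tokens is a plain list of (token, tag) tuples, ordinary Python filtering applies. For example, pulling out the cardinal numbers (Penn Treebank tag 'CD') from the output shown above:

```python
# Filter a (token, tag) list for cardinal numbers ('CD' in the Penn
# Treebank tagset). The list below is the tagger output shown above.
pos_tagged_tokens = [
    ('1H', 'CD'), ('NMR', 'NNP'), ('spectra', 'NN'), ('were', 'VBD'),
    ('recorded', 'VBN'), ('on', 'IN'), ('a', 'DT'), ('300', 'CD'),
    ('MHz', '.'), ('BRUKER', 'NNP'), ('DPX300', 'NNP'),
    ('spectrometer', 'NN'), ('.', '.'),
]

numbers = [token for token, tag in pos_tagged_tokens if tag == 'CD']
print(numbers)  # ['1H', '300']
```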
Using Taggers Directly
All taggers have a tag method that takes a list of RichToken instances and returns a list of (token, tag) tuples. For more information on how to use these taggers directly, see the documentation for BaseTagger.
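To illustrate the shape of the tag interface only (a toy stand-in, not the library's BaseTagger), a tagger just needs a tag method that maps a token list to (token, tag) pairs:

```python
# A toy tagger illustrating the tag() interface: take a list of tokens,
# return a list of (token, tag) tuples. Real taggers subclass BaseTagger
# and receive RichToken instances; plain strings stand in for them here.
class UppercaseTagger:
    """Tags fully upper-case tokens as 'ABBR' and everything else as 'WORD'."""

    def tag(self, tokens):
        return [(t, 'ABBR' if t.isupper() else 'WORD') for t in tokens]

tagger = UppercaseTagger()
print(tagger.tag(['NMR', 'spectra']))  # [('NMR', 'ABBR'), ('spectra', 'WORD')]
```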
Lexicon¶
As BatteryDataExtractor processes documents, it adds each unique word that it encounters to the Lexicon as a Lexeme. Each Lexeme stores various word features, so they don’t have to be re-calculated for every occurrence of that word. You can access the Lexeme for a token using the lex property:
>>> s = Sentence('Sulphur and Oxygen.')
>>> s.tokens[0]
Token('Sulphur', 0, 7)
>>> s.tokens[0].lex.normalized
'sulfur'
>>> s.tokens[0].lex.is_hyphenated
False
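The caching idea behind the Lexicon can be sketched in a few lines of plain Python (the classes below are illustrative, not the library's internals): features are computed once per unique word, then reused for every later occurrence:

```python
class Lexeme:
    """Toy lexeme: computes word features once, at construction time."""

    def __init__(self, text):
        self.text = text
        self.lower = text.lower()          # the real Lexeme also stores a
        self.is_hyphenated = '-' in text   # normalized spelling, among others

class Lexicon:
    """Toy lexicon: one Lexeme per unique word, created on first sight."""

    def __init__(self):
        self._lexemes = {}

    def __getitem__(self, word):
        if word not in self._lexemes:
            self._lexemes[word] = Lexeme(word)
        return self._lexemes[word]

lexicon = Lexicon()
print(lexicon['Sulphur'].is_hyphenated)          # False
print(lexicon['Sulphur'] is lexicon['Sulphur'])  # True: cached, not rebuilt
```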
Abbreviation Detection¶
Abbreviation detection is based on the fine-tuned BatteryOnlyBERT-cased model:
>>> p = Paragraph('Dye-sensitized solar cells (DSSCs) with ZnTPP = Zinc tetraphenylporphyrin.')
>>> p.abbreviation_definitions
[[('Abbr: ', 'DSSCs'), ('Abbr: ', 'ZnTPP')],
[('LF: ', 'Dye - sensitized solar cells'),
('LF: ', 'Zinc tetraphenylporphyrin')]]
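Given the structure shown above (a list of abbreviations and a parallel list of long forms), the pairs can be zipped into a lookup dict with ordinary Python:

```python
# Pair abbreviations with their long forms. The nested structure is the
# abbreviation_definitions output shown above: a list of ('Abbr: ', ...)
# tuples followed by a parallel list of ('LF: ', ...) tuples.
abbreviation_definitions = [
    [('Abbr: ', 'DSSCs'), ('Abbr: ', 'ZnTPP')],
    [('LF: ', 'Dye - sensitized solar cells'),
     ('LF: ', 'Zinc tetraphenylporphyrin')],
]

abbrs, long_forms = abbreviation_definitions
lookup = {abbr: lf for (_, abbr), (_, lf) in zip(abbrs, long_forms)}
print(lookup['ZnTPP'])  # Zinc tetraphenylporphyrin
```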
Chemical Named Entity Recognition (CNER)¶
Chemical Named Entity Recognition (CNER) is based on the fine-tuned BatteryOnlyBERT-uncased model. Using the paragraph from the tokenization example above:
>>> para.cems
[Span('lithium', 17, 24),
Span('graphite', 76, 84),
Span('carbons', 100, 107),
Span('graphene', 239, 247),
Span('LiC6', 282, 286)]
Each mention is returned as a Span, which contains the mention text, as well as the start and end character offsets within the containing document element.
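As with sentences, a Span's offsets index into the parent element's text, so each mention can be recovered by slicing. A plain-Python check using the paragraph text and two of the spans above:

```python
# Span offsets index into the parent element's text: slicing with
# (start, end) recovers the mention. Offsets are from the cems output above.
text = (
    "The mechanism of lithium intercalation in the so-called \u2018soft\u2019 "
    "anodes, i.e. graphite or graphitable carbons, is well known."
)

for mention, start, end in [('lithium', 17, 24), ('graphite', 76, 84)]:
    print(text[start:end] == mention)  # True for every span
```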