openmolecules.org Forum: Functionality » feature suggestion "pdf2dwar"


feature suggestion "pdf2dwar" [message #573]

Thu, 06 June 2019 11:37

nbehrnd

Dear Thomas,

perhaps the following feature could be implemented in a future version
of the program.

Among the complementary .dwar files is the one about reactions
associated with US patents. Besides these, publications in scientific
journals represent an additional source of information, frequently
available as .pdf files. Given Lowe's thesis "Extraction of chemical
structures and reactions from the literature", already mentioned, and
picture-to-SMILES converters like OSRA on

https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi

perhaps DataWarrior could be enabled to harvest their information
as well.

I speculate that current publications, which are published as .pdf from
the start, might be easier to work with than those scanned after their
publication in print (e.g., Acta Chemica Scandinavica).

Well, it may sound like a resurrection of MDL's IsisBase seen in the
late 1990s. To some extent, it is tangential to webreactions, too.
The idea surfaced (again) while accessing my literature reference
program, Zotero. So far, however, Zotero's indexing is limited to
text-only information; adding a key reaction is restricted to pasting
a figure from the publication as an annotation, which is then accessible
in its browser or in the report by entry (cf. the two example files attached).

Harvesting this information might be eased if the relevant .pdf files
were all deposited in a dedicated partition, rather than in multiple
sub-folders of the webbrowser directory, which requires an os.walk.
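
To illustrate what I mean by the os.walk, here is a minimal Python sketch
of collecting the .pdf files scattered over nested sub-folders. The
function name find_pdfs and the example path are hypothetical, chosen only
for this illustration; nothing here is part of DataWarrior or Zotero.

```python
import os


def find_pdfs(root):
    """Return the sorted paths of all .pdf files found anywhere below root.

    Hypothetical helper for illustration only; `root` is whatever top-level
    folder holds the literature collection.
    """
    hits = []
    # os.walk visits root and every sub-folder beneath it, one level at a time.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so that "paper.PDF" is found as well.
            if name.lower().endswith(".pdf"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

Called as find_pdfs("/path/to/literature") (an assumed path), it returns one
flat list of files which could then be passed on to a structure recognition
step. With a single dedicated folder, a plain directory listing would do.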

Norwid

Attachment: example_report_zotero.html
(Size: 24.13KB, Downloaded 995 times)
Attachment: Weiberth-2011.pdf
(Size: 767.82KB, Downloaded 907 times)