Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. CiteScore 3.200; Safety, Risk, Reliability and Quality 61 out of 165. Source-normalized Impact per Paper (SNIP) 1.022. SCImago Journal Rank (SJR) 0.518; Safety, Risk, Reliability and Quality 81 out of 1080.
Peer Reviewed
Electronic tabular forms are an intuitive way for organisations to collect, present and store structured information for human readers. Forms use features such as fonts, colours and cell positioning to help readers navigate and find information. Millions of forms, typically in Portable Document Format (PDF), are generated by businesses as part of routine operations. Unlike human readers, machines cannot directly 'understand' the implicit cues carried by fonts, colours and boxes without explicit processing. In this paper, a supervised computer vision model is proposed to decompose a PDF form document into nested microtables. The cells within these microtables are then processed using a customisable rule bank to extract meaningful table content and semantic relationships. The process is demonstrated on an industry dataset of 37 maintenance procedure documents comprising 373 pages and 1016 unique microtables. A web application, EMU (Extracting Machine Understandable Semantics from Forms), demonstrates how data captured in tables of different dimensions in procedural forms can be automatically extracted and stored in JavaScript Object Notation (JSON). Identifying and extracting nested tables is a critical, fundamental step towards future applications that support machine-automated search and extraction of data at scale, for both maintenance and other procedural documentation.
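To make the rule-bank idea concrete, the following is a minimal sketch of how cells from one detected microtable might be turned into a JSON record. It is not the EMU implementation: the `Cell` structure, the two rules (`label_to_the_left`, `header_above`) and the use of a bold flag as the formatting cue are all assumptions chosen for illustration, and the microtable detection step performed by the computer vision model is taken as already done.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Cell:
    """One cell of a detected microtable (hypothetical structure)."""
    row: int
    col: int
    text: str
    bold: bool = False  # assumed formatting cue marking a label or header


Rule = Callable[[List[Cell]], Dict[str, str]]


def _index(cells: List[Cell]) -> Dict[tuple, Cell]:
    # Look cells up by (row, col) so rules can inspect neighbours.
    return {(c.row, c.col): c for c in cells}


def label_to_the_left(cells: List[Cell]) -> Dict[str, str]:
    """Hypothetical rule: a bold cell labels the plain cell to its right."""
    grid = _index(cells)
    out = {}
    for c in cells:
        right = grid.get((c.row, c.col + 1))
        if c.bold and right is not None and not right.bold:
            out[c.text.rstrip(": ")] = right.text
    return out


def header_above(cells: List[Cell]) -> Dict[str, str]:
    """Hypothetical rule: a bold cell is a header for the plain cell below."""
    grid = _index(cells)
    out = {}
    for c in cells:
        below = grid.get((c.row + 1, c.col))
        if c.bold and below is not None and not below.bold:
            out[c.text.rstrip(": ")] = below.text
    return out


# The "rule bank": an ordered, user-editable list of rules.
RULE_BANK: List[Rule] = [label_to_the_left, header_above]


def extract(cells: List[Cell], rules: List[Rule] = RULE_BANK) -> Dict[str, str]:
    """Apply every rule in order; earlier rules win when keys collide."""
    record: Dict[str, str] = {}
    for rule in rules:
        for key, value in rule(cells).items():
            record.setdefault(key, value)
    return record


# Two illustrative microtables with different layouts (made-up content).
form_fields = [
    Cell(0, 0, "Equipment ID:", bold=True), Cell(0, 1, "PUMP-101"),
    Cell(1, 0, "Task:", bold=True), Cell(1, 1, "Inspect seal"),
]
step_table = [
    Cell(0, 0, "Step", bold=True), Cell(0, 1, "Action", bold=True),
    Cell(1, 0, "1"), Cell(1, 1, "Isolate pump"),
]

fields_record = extract(form_fields)
step_record = extract(step_table)
fields_json = json.dumps(fields_record, indent=2)
```

Keeping each rule as an independent function means a rule can be added, removed or reordered without touching the others, which is one plausible reading of "customisable" here; a production rule bank would likely also need cues such as colour, borders and cell spans.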