Predicting the effectiveness of pattern-based entity extractor inference
Introduction
An essential component of any workflow leveraging digital data consists in the identification and extraction of relevant patterns from a data stream. This task occurs routinely in virtually every sector of business, government, science, technology, and so on. In this work we are concerned with extraction from an unstructured text stream of entities that adhere to a syntactic pattern. We consider a scenario in which an extractor is obtained by tailoring a generic tool to a specific problem instance. The extractor may consist, e.g., of a regular expression, or of an expression in a more general formalism [1], or of full programs suitable to be executed by NLP tools [2], [3]. The problem instance is characterized by a dataset from which a specified entity type is to be extracted, e.g., VAT numbers, IP addresses, or more complex entities.
The difficulty of generating an extractor clearly depends on the specific problem. However, we are not aware of any methodology for providing a practically useful answer to questions of this sort: is generating an extractor for IP addresses more or less difficult than generating one for email addresses? Is it possible to generate an extractor for drug dosages in medical prescriptions, or for ingredients in cake recipes, with a specified accuracy level? Does the difficulty of generating an extractor for a specified entity type depend on the properties of the text that is not to be extracted? Answering such questions may not only provide crucial insights into extractor generation techniques, it may also be of practical interest to end users. For example, a prediction of low effectiveness could be addressed by providing more examples of the desired extraction behavior; the user might even decide to adopt a manual approach, perhaps in crowdsourcing, for problems that appear to be beyond the scope of the extractor generation technique being used.
In this work we propose an approach for addressing questions of this sort systematically. We consider a scenario of increasing interest in which the problem instance is specified by examples of the desired behavior and the target extractor is generated automatically from those examples [4], [5], [6], [7], [8], [9], [10], [11], [12]. We propose a methodology for predicting the accuracy of the extractor that a given extraction inference engine may infer from the available examples. Our prediction methodology does not depend on the inference engine internals and can in principle be applied to any inference engine: indeed, we validate it on two different engines which infer different forms of extractors.
The basic idea is to use string similarity metrics to characterize the examples. In this respect, an “easy” problem instance is one in which (i) strings to be extracted are “similar” to each other, (ii) strings not to be extracted are “similar” to each other, and (iii) strings to be extracted are not “similar” to strings not to be extracted. Despite its apparent simplicity, implementing this idea is highly challenging for several reasons.
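As a concrete illustration of this intuition, conditions (i)-(iii) can be quantified with intra-class and cross-class similarity statistics. The sketch below uses Python's stdlib `SequenceMatcher` ratio as a stand-in metric; the metric choice and the statistics are illustrative assumptions, not the specific ones evaluated in this work.

```python
import itertools
from difflib import SequenceMatcher

def sim(a, b):
    # stand-in string similarity in [0, 1] (the paper evaluates several metrics)
    return SequenceMatcher(None, a, b).ratio()

def mean_intra(strings):
    # average similarity among strings of the same class: conditions (i), (ii)
    pairs = list(itertools.combinations(strings, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def mean_cross(xs, ys):
    # average similarity across classes: condition (iii)
    return sum(sim(a, b) for a in xs for b in ys) / (len(xs) * len(ys))

pos = ["2-3-1979", "7-2-2011", "12-12-2000"]  # snippets to be extracted
neg = ["19.79$", "foo@bar.com", "3.14"]       # snippets not to be extracted
# an "easy" instance has high intra-class and low cross-class similarity
print(mean_intra(pos), mean_intra(neg), mean_cross(pos, neg))
```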
To be practically useful, a prediction methodology shall satisfy these requirements: (a) the prediction must be reliable; (b) it must be computed without actually generating the extractor; (c) it must be computed very quickly w.r.t. the time taken for inferring the extractor. First and foremost, predicting the performance of a solution without actually generating the solution is clearly very difficult (see also the related work section).
Second, it is not clear to which degree a string similarity metric can capture the actual difficulty of inferring an extractor for a given problem instance. Consider, for instance, the Levenshtein distance (string edit distance) applied to a problem instance in which the entities to be extracted are dates. Two dates (e.g., 2-3-1979 and 7-2-2011, whose edit distance is 6) could be as distant as a date and a snippet not to be extracted (e.g., 2-3-1979 and 19.79$, whose edit distance is 6 too); yet dates could be extracted by an extractor in the form of a regular expression that is very compact, does not extract any of the other snippets, and could be very easy to generate (\d+-\d+-\d+). However, many string similarity metrics exist and their effectiveness depends tightly on the specific application [13], [14]. Indeed, one of the contributions of our proposal is precisely to investigate which metric is the most suitable for assessing the difficulty of extractor inference.
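The date example can be reproduced with a textbook dynamic-programming implementation of the Levenshtein distance (a sketch, not the paper's code):

```python
import re

def levenshtein(a: str, b: str) -> int:
    # classic DP edit distance: unit-cost insert, delete, substitute
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# two dates are as distant from each other as a date is from a non-date...
print(levenshtein("2-3-1979", "7-2-2011"))  # 6
print(levenshtein("2-3-1979", "19.79$"))    # 6

# ...yet a very compact regular expression separates them perfectly
pattern = re.compile(r"\d+-\d+-\d+")
print(bool(pattern.fullmatch("2-3-1979")))  # True
print(bool(pattern.fullmatch("19.79$")))    # False
```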
Third, the number of snippets in an input text grows quadratically with the text size and becomes huge very quickly: e.g., a text composed of just 10^5 characters includes ≈10^10 snippets. It follows that computing forms of similarity between all pairs of snippets may be feasible for snippets that are to be extracted, but is not practically feasible for snippets that are not to be extracted.
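The quadratic growth is immediate to check: an n-character text contains n(n+1)/2 non-empty contiguous snippets, so pairwise similarity over all of them is out of reach.

```python
def snippet_count(n: int) -> int:
    # number of non-empty contiguous snippets (substrings) of an n-character text
    return n * (n + 1) // 2

print(snippet_count(3))      # 6, e.g. "abc" -> a, b, c, ab, bc, abc
print(snippet_count(10**5))  # 5000050000, i.e. on the order of 10^10
```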
We propose several prediction techniques and analyze our proposals experimentally in great depth, with reference to a number of different similarity metrics and of challenging problem instances. We validate our techniques with respect to a state-of-the-art extractor generator approach that we have recently proposed [9], [5], [6]; we further validate our predictor on a worse-performing alternative extractor generator [15] which works internally in a different way. The results are highly encouraging, suggesting that reliable predictions for tasks of practical complexity may indeed be obtained quickly.
Related work
Although we are not aware of any work specifically devoted to predicting the effectiveness of a pattern-based entity extractor inference method, there are several research fields that addressed similar issues. The underlying common motivation is twofold: inferring a solution to a given problem instance may be a lengthy procedure; and, the inference procedure is based on heuristics that cannot provide any optimality guarantees. Consequently, lightweight methods for estimating the quality of a
Pattern-based entity extraction
The application problem consists in extracting entities that follow a syntactic pattern from a potentially large text. Extraction is performed by means of an extractor tailored to the specific pattern of interest. We consider a scenario in which the extractor is generated automatically by an extraction inference engine, based on examples of the desired behavior in the form of snippets to be extracted (i.e., the entities) and of snippets not to be extracted. Such examples usually consist of
Our prediction method
Our prediction method consists of three steps. First, we transform the input (s, X, s′) into an intermediate representation which is suitable for processing with string similarities. Second, we extract a set of numerical features consisting of several statistics of the similarities among strings of the intermediate representation. Finally, we apply a regressor to the feature vector and obtain an estimate of the F-measure f′ which an extractor would have on X′.
In the following sections, we
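A minimal sketch of the three-step pipeline follows. The feature definitions and the placeholder regressor are illustrative assumptions: the intermediate-representation step, the actual feature set, and the trained regressor of the method are omitted here.

```python
import statistics
from difflib import SequenceMatcher

def sim(a, b):
    # stand-in string similarity metric (the method evaluates several)
    return SequenceMatcher(None, a, b).ratio()

def features(pos, neg):
    # step 2: summary statistics of pairwise similarities; this feature
    # set is illustrative, not the paper's
    def stats(vals):
        return [min(vals), statistics.mean(vals), max(vals)]
    intra_pos = [sim(a, b) for i, a in enumerate(pos) for b in pos[i + 1:]]
    intra_neg = [sim(a, b) for i, a in enumerate(neg) for b in neg[i + 1:]]
    cross = [sim(a, b) for a in pos for b in neg]
    return stats(intra_pos) + stats(intra_neg) + stats(cross)

def predict_f_measure(pos, neg, regressor):
    # step 3: a trained regressor maps the feature vector to an F-measure estimate
    return regressor(features(pos, neg))

# toy "regressor" (mean of the features), purely for illustration
est = predict_f_measure(["2-3-1979", "7-2-2011"], ["19.79$", "foo"],
                        regressor=lambda x: sum(x) / len(x))
print(est)
```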
Experimental evaluation
We constructed and experimentally assessed all 48 prediction model variants resulting from the combination of: 2 feature set construction methods (Sample and Rep, Section 4.2); 8 string similarity metrics (Section 4.2.1); and 3 regressors (LM, RF, and SVM, Section 4.3). We trained each model variant on a set of solved problem instances and assessed the resulting predictor on a disjoint set of solved problem instances, as detailed in the next sections.
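The count of 48 variants follows from the Cartesian product of the three design choices; the metric names below are placeholders for the 8 metrics of Section 4.2.1.

```python
from itertools import product

feature_sets = ["Sample", "Rep"]
metrics = [f"m{i}" for i in range(1, 9)]  # placeholders for the 8 similarity metrics
regressors = ["LM", "RF", "SVM"]

# 2 x 8 x 3 = 48 prediction model variants
variants = list(product(feature_sets, metrics, regressors))
print(len(variants))  # 48
```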
Concluding remarks
We have considered a scenario in which an extraction inference engine generates an extractor automatically from user-provided examples of the entities to be extracted from a dataset. We have addressed the problem of predicting the accuracy of the extractor that may be inferred from the available examples, by requiring that the prediction be obtained very quickly w.r.t. the time required for actually inferring the extractor. This problem is highly challenging and we are not aware of any earlier
Acknowledgements
The authors are grateful to the anonymous reviewers for their constructive comments.
References

- et al., Measuring instance difficulty for combinatorial optimization problems, Comput. Oper. Res. (2012)
- TokensRegex (2011)
- Apache Project, UIMA-Ruta rule-based text annotation
- et al., UIMA Ruta: rapid development of rule-based information extraction applications, Nat. Lang. Eng. FirstView (2015)
- et al., Data quality challenge: toward a tool for string processing by examples, J. Data Inf. Qual. (2015)
- et al., Learning text patterns using separate-and-conquer genetic programming
- et al., Inference of regular expressions for text extraction from examples, IEEE Trans. Knowl. Data Eng. (2016)
- et al., Program boosting: program synthesis via crowd-sourcing
- et al., FlashExtract: a framework for data extraction by examples
- et al., Automatic synthesis of regular expressions from examples, Computer (2014)
- Smart Autofill – Harnessing the Predictive Power of Machine Learning in Google Sheets
- Enabling information extraction by inference of regular expressions from sample entities
- Regular expression learning for information extraction
- String similarity metrics for ontology alignment
- Learning deterministic finite automata with a smart state labeling evolutionary algorithm, IEEE Trans. Pattern Anal. Mach. Intell.
- Application of machine learning to algorithm selection for TSP