Artificial intelligence algorithm tests, trusted AI, black box effect: interview with Guillaume Avrin from LNE

In the context of our dossier Trusted Artificial Intelligence: From Critical Systems to the Common Good published in issue 3 of the magazine ActuIA currently on newsstands and available in our online shop, we spoke with Guillaume Avrin, Head of the “Artificial Intelligence Evaluation” department at the Laboratoire National de Métrologie et d’Essais (LNE). In February 2021, LNE, which has established itself for several years as a trusted third-party assessor of artificial intelligence, announced that it had obtained government funding for the creation of the world’s first generic platform dedicated to the evaluation of artificial intelligence, called LEIA.

In this interview, Guillaume Avrin explains LNE’s missions in the field of artificial intelligence, what artificial intelligence algorithm testing and the “black box” effect are, but also the challenges of certification.

ActuIA: Could you give us a brief presentation of the LNE?

Guillaume Avrin: The Laboratoire national de métrologie et d’essais (LNE) is a public establishment of an industrial and commercial nature (EPIC) attached to the Ministry of Industry. It is the national reference body for testing, evaluation and metrology. In support of public policies, its action aims to assess and structure the supply of new products, seeking both to protect and meet the needs of consumers and to develop and promote the competitiveness of national industry (cf. Article L823-1 of the Consumer Code). It also assists manufacturers in their efforts to innovate and become more competitive in many fields of expertise and sectors of activity. LNE carries out work on the characterisation, qualification and certification of systems and technologies to support all breakthrough innovations (artificial intelligence, nanotechnologies, additive manufacturing, radioactivity measurement, hydrogen storage, etc.) for the benefit of the scientific, regulatory and industrial community.

In the emerging field of artificial intelligence, LNE has metrological and methodological expertise in evaluation that has no equivalent at European level. It has carried out more than 950 evaluations of AI systems since 2008, notably in language processing (translation, transcription, speaker recognition, etc.), image processing (person recognition, object recognition, etc.) and robotics (autonomous vehicles, service robots, agricultural robots, collaborative robots, intelligent medical devices, etc.). It is involved in all the major cross-cutting challenges of AI and, in parallel, ensures the roll-out of a system for qualifying AI solutions, based on an efficient and partly internalized national network, particularly in the upstream stages of this deployment and in the development of the metrological, methodological and instrumental models to be implemented.

Are the tests of artificial intelligence algorithms similar to the tests you are used to doing in other specialties, or have they required the development of new skills and protocols? How are they different? How many artificial intelligence experts work for LNE? Do they all work at LNE or do you use independent experts?

The tests we conduct in AI help to estimate, fairly and in absolute terms (for purposes of development, performance characterization, benchmarking, certification, etc.), a system's use value, performance, hazards and environmental impact, whether positive or negative, as well as its impact, even indirect, on societies and individual lifestyles: socio-economic consequences, ethical, legal and sociological questions, etc.

This is a largely new issue, linked to AI's strong capacity to substitute for professional and social activities, but it also has a metrological specificity: the ability of intelligent systems is measured mainly at the functional level and lies above all in their adaptability (a conventional system, by contrast, is judged quantitatively on its performance within an operating framework that is perfectly defined at the design stage). It is therefore not only a question of objectively quantifying functions and performance, but also of validating and characterising operating environments (perimeters of use) which are by nature variable, often highly so, particularly in the case of so-called “open” environments. It is this variability of the situations to be handled, characteristic of the terrain of the human mind, that confers the quality of intelligence on a system and even measures its degree.

This extensive field of use and experimentation and the at least partially autonomous and often non-convex, non-linear, non-deterministic behaviour of AI systems require the development of sui generis protocols and measurement instruments:

  • The measurement of intelligent systems is based on a so-called “soft” metrology that is more functional than quantitative, more attentive to robustness than to raw performance, and that relies on composite, multidimensional metrics capable of accurately and precisely reflecting the extent and sensitivities of the operating environment of the component under test.
  • It is essentially a question of covering a field of use with a sampling fineness that is necessarily limited but sufficient to guarantee the absence of aberrant reactions. The test scenarios to be presented to the system under evaluation are therefore potentially very numerous; some of them may lead to accident or near-accident situations and can only be generated by means of simulation. Since simulation implies a necessarily reductive modelling of reality, compromises must be found between the needs for completeness and realism.
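To give a concrete idea of how a field of use can be sampled into test scenarios, here is a minimal sketch in Python. The influencing factors, their values and the step sizes are hypothetical choices for illustration only, not LNE's actual protocol; the sketch simply enumerates combinations of environmental conditions on a coarse grid, producing the kind of scenario list that would then be fed to a simulator or test bench.

```python
from itertools import product

# Hypothetical influencing factors for an autonomous-driving use case.
# Factors, values and granularity are illustrative assumptions.
influencing_factors = {
    "weather": ["clear", "rain", "snow", "fog"],
    "luminosity_lux": [10, 1_000, 10_000, 100_000],   # night -> bright daylight
    "pedestrian_density": ["none", "sparse", "dense"],
    "road_type": ["urban", "rural", "highway"],
}

def enumerate_scenarios(factors):
    """Yield every combination of factor values as one test scenario."""
    names = list(factors)
    for values in product(*(factors[n] for n in names)):
        yield dict(zip(names, values))

scenarios = list(enumerate_scenarios(influencing_factors))
print(f"{len(scenarios)} scenarios to present to the system under test")
print(scenarios[0])  # e.g. {'weather': 'clear', 'luminosity_lux': 10, ...}
```

Even on this toy grid the combinatorics grow quickly, which is precisely why the trade-off between completeness and realism, and the recourse to simulation, matter.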

This compromise imperative directly justifies and structures the “LEIA” (Laboratory for the Evaluation of Artificial Intelligence) initiative coordinated by LNE, which will eventually bring together all national players (research laboratories, test centres, equipment manufacturers and test-bench architects, data owners, standardisation bodies, administrations, investors, etc.) to build an initial integrated platform for evaluating artificial intelligence on an international scale. This system will be based on a network that takes into account the existing landscape (HR, resources, know-how, statutory missions), operating in a distributed but structured way (at least by AI application sector) to jointly meet the need for system evaluation. This network is being built over time through various collaborative projects at national (ANR, Grand défi), European (H2020, CHIST-ERA, COVR, etc.) and international levels (in particular through LNE's strategic partnership with NIST dedicated to AI), as well as through LNE's participation in various standardization committees dealing with AI and robotics (Afnor's Strategic Orientation Committee for Information and Digital Communication, Afnor's AI committee, CEN-CENELEC's Focus Group on AI and Section 81 of the Union de Normalisation de la Mécanique on industrial robotics).

To meet this practical and programmatic need, LNE now has some fifteen in-house engineers and PhDs specializing in AI, who are regularly accompanied by post-docs, PhD students and trainees, as well as its own experts in cybersecurity, biology, mathematics and statistics, medical devices, etc. Even if this number of staff is eventually to be doubled, LNE will continue to work to the maximum extent possible in co-contracting with its academic, scientific, industrial and administrative partners, if only to promote technical appropriation and technology transfer.

How many years have you been involved in the measurement of artificial intelligence algorithms and how is this demand evolving? Can we have figures? What types of actors are you working with?

The activity was launched in 2008 as a result of guidance given by the public authorities, in the context of specific sovereignty issues, but it is only in the last three years that artificial intelligence has become one of the top national and international priorities for technological and industrial development, and thus for LNE.

Since all professional and domestic sectors (medical devices and diagnostics, autonomous mobility, industrial and agricultural robotics, legaltech, fintech, assurtech, etc.) are now gradually becoming automated, the spectrum of LNE's partners and customers has broadened in parallel to include a wide range of institutional (Ministry of the Interior, Ministry of the Armed Forces, Ministry of Research, Ministry of Agriculture and Food, Ministry of Ecological Transition, High Authority for Health, DG CONNECT of the European Commission, INC, CEA, etc.) and industrial (Thales, Dassault, Airbus, Facebook, Numalis, etc.) players, and continues to grow.

The difficulty will be precisely to manage this transition from AI under development to commercial AI, which could be massive and for which it is therefore necessary to prepare actively, well ahead of commercial maturity and qualification needs. In programmatic terms, this difficulty has been recognized by the European Commission in a recent White Paper and in its “TEF” (Testing and Experimentation Facilities) test centre project of the order of 3 billion euros, which could be launched at the beginning of 2021, as well as in France through its decision to retain trusted AI as one of its rare “major challenges” financed by the Innovation and Industry Fund.

Data is very important in artificial intelligence: how can you ensure that training data is not biased? Building such datasets, especially for Deep Learning, requires very large volumes of data that can be extremely expensive to collect. Is it realistic to expect you to compile such datasets for testing on your own, and if you rely on your clients' data, aren't the test results likely to be very far from reality?

The qualification of the data is aimed at verifying their representativeness, i.e. their completeness with regard to the targeted application and their realism, in order to limit the potential associated biases (selection bias, classification bias, etc.).

The completeness of the data is quantified in particular through coverage rate calculations, while compliance with realism requirements is assessed thanks to qualification procedures that are currently being standardised, in particular within the framework of the ISO/IEC JTC 1/SC 42 Artificial Intelligence standardisation committee to which we contribute (cf. ISO/IEC WD 5259 – in preparation – “Data quality for analytics and ML”).

Evaluating completeness requires, in particular, formalizing the use cases, identifying the conditions of use and contraindications, and formalizing the influencing factors (weather conditions, luminosity, temperature, etc.) that affect the performance of the system. A corpus analysis can then be carried out to verify that application needs are properly covered.
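As an illustration of such a corpus analysis, the following minimal Python sketch computes a simple coverage rate: the fraction of influencing-factor combinations that appear at least once in an annotated corpus. The factor grid, the toy corpus and the very notion of counting "at least one sample per combination" are illustrative assumptions, not an LNE metric.

```python
from itertools import product

# Hypothetical influencing factors for the targeted application.
factor_values = {
    "weather": ["clear", "rain", "snow"],
    "luminosity": ["day", "dusk", "night"],
}

# Toy corpus: each sample carries the conditions under which it was acquired.
corpus = [
    {"weather": "clear", "luminosity": "day"},
    {"weather": "clear", "luminosity": "night"},
    {"weather": "rain", "luminosity": "day"},
]

def coverage_rate(corpus, factor_values):
    """Share of factor combinations covered by at least one corpus sample."""
    names = list(factor_values)
    required = set(product(*(factor_values[n] for n in names)))
    observed = {tuple(sample[n] for n in names) for sample in corpus}
    return len(required & observed) / len(required)

print(f"coverage rate: {coverage_rate(corpus, factor_values):.0%}")  # 3 of 9 -> 33%
```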

Several methods exist to generate or artificially augment data in order to improve the coverage rate of training and test scenarios, including the addition of perturbations (rain, snow, etc.) or of model-based noise and sensor defects, the application of metamorphic transformations (inversion, rotation, etc.), the automatic production of corner cases by auto-encoders or GANs, etc.
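A minimal sketch of the metamorphic-transformation idea follows (Python with NumPy; the specific transformations and noise level are illustrative assumptions): each transformed image keeps the original label, which enlarges the training or test corpus without new data acquisition.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> list:
    """Return simple metamorphic variants of an image (label is preserved)."""
    return [
        np.fliplr(image),                                          # horizontal inversion
        np.rot90(image),                                           # 90-degree rotation
        np.clip(image + rng.normal(0, 10, image.shape), 0, 255),   # simulated sensor noise
    ]

# Toy 8x8 grayscale "image"
img = rng.integers(0, 256, size=(8, 8)).astype(float)
augmented = augment(img)
print(f"1 original image -> {1 + len(augmented)} images after augmentation")
```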

The realism assessment includes checks on the meta-information potentially associated with the raw data, called annotations. When these are produced by humans, they can be subjected to specific analyses to verify the existence of a ground truth associated with the data (complete manual examination or random sampling, selection by majority vote, calculation of inter- and intra-annotator agreement rates, etc.). It also includes the identification and analysis of outliers (atypical values) in order to decide whether they should be kept, deleted or adapted, as well as checks on the consistency between the means (sensors, pre-processing chains, etc.) used to acquire the training and test data and those implemented in real conditions, once the system has been deployed.
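To make the annotator-agreement check concrete, here is a minimal sketch (Python): a two-annotator Cohen's kappa on hypothetical labels, one common way of quantifying inter-annotator agreement, not necessarily the exact metric LNE applies.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of the same 8 images by two annotators.
annotator_1 = ["car", "car", "person", "sign", "car", "person", "sign", "car"]
annotator_2 = ["car", "car", "person", "car",  "car", "person", "sign", "sign"]
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")  # 0.60
```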

Concerning the tools available for corpus exploration, LNE has been developing since 2015 the Matics software platform, which is free and open source (CeCILL-B licence) and dedicated to data visualization and the evaluation of automatic information processing systems. Under constant development, the software can be used, for example, to automate the evaluation of automatic language processing tasks (translation, transcription, speaker verification, named entity recognition, tokenization, lemmatization, etc.) and image recognition (computer vision) tasks.

Both AI system developers and their assessors are concerned by these requirements of realism and completeness for the test databases to be used. Both the physical and logical separation of the assessor's test data and its pre-eminence must be guaranteed. Depending on the application area, several approaches are currently being considered:

  • If the application concerns very rare data (for example, medical data on orphan diseases), the AI developer is generally the only one, or one of the only ones, in the world to have data on the subject. The trusted third-party evaluator will therefore need to be able to access corpora of the developer's that were not used during training, which it can “augment” (noise, disturbances, transformations, etc.) to limit bias, especially overfitting, before using them in evaluations.
  • If the application concerns data that are common, reusable across applications and/or inexpensive to produce (recognition of everyday objects such as humans, animals, road signs, office equipment, etc.), then it is relevant to set up reference databases (measurement standards for AI) independently of the developer.

These processes are obviously long and costly, and must therefore be anticipated. Their strategic nature cannot be over-emphasised, and some even believe that they could go so far as to call into question the principle of European solidarity as an exception to sovereignty.

In any case, and even though such data may often be of a sensitive nature (personal, professional, military), the LEIA initiative leaves open at this stage the question of the best form of collection and aggregation to adopt: centralised or distributed, public or private. However, since the need is unavoidable, it should above all be covered as soon as possible, without waiting for it to start being met across the Atlantic or elsewhere. Because, if behind the race for AI there is a race for data, then, as the fable goes, there is no point in running; what matters is to start on time.

We often talk about the “black box” effect of Deep Learning. Does it seem to you to be a real obstacle, or do you believe that a Deep Learning algorithm can still be qualified as “trusted”, i.e. that it meets performance, robustness, explainability and ethical requirements, at the end of the tests carried out?

The term “black box” refers to the inaccessibility (physical, logical, or simply to human understanding) of the internal workings of a system. For AIs of this type, the decisions made are generally not explainable, the decision rules applied are not interpretable, and the functionality cannot be formally demonstrated.

Only the experimental route remains, which incidentally obliges us to identify the conditions of use better, since we have to reproduce them concretely. It is the whole system that is then tested, and one can even argue that intelligence is a holistic property of such a black box, i.e. there would be no sense in apportioning it among its possible components.

Input-output tests (also called “black box” tests) are then set up to characterize the AI system by stimulating it according to a scenario combining various stimuli and evaluating the quality of its behavior (for example via comparisons to references or to a ground truth). The confidence to be placed in these AIs thus depends directly on the results of evaluations conducted on representative test data.
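A minimal sketch of such an input-output evaluation is given below (Python). The system under test stands in for any opaque model exposed only as a prediction function, and the scenarios, ground truth and toy decision rule are all hypothetical.

```python
def evaluate_black_box(system_under_test, scenarios, ground_truth):
    """Characterize an opaque system purely from its input-output behavior."""
    errors = []
    for scenario, expected in zip(scenarios, ground_truth):
        predicted = system_under_test(scenario)   # no access to internals
        if predicted != expected:
            errors.append((scenario, expected, predicted))
    accuracy = 1 - len(errors) / len(scenarios)
    return accuracy, errors

# Hypothetical toy system: decides from a (distance, speed) pair.
def toy_system(scenario):
    distance, speed = scenario
    return "brake" if distance / max(speed, 1e-6) < 2.0 else "keep_going"

scenarios = [(10, 10), (50, 10), (5, 1), (30, 20)]
ground_truth = ["brake", "keep_going", "brake", "brake"]
accuracy, errors = evaluate_black_box(toy_system, scenarios, ground_truth)
print(f"accuracy: {accuracy:.0%}, failing scenarios: {errors}")
```

The confidence one can place in the verdict depends entirely on how representative the scenario set is, which brings us back to the coverage question above.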

To be reassured, we can note that human beings often function in “black box” mode: the recognition of their environment, for example, appeals more to intuition than to reasoning (we immediately recognize a person by their particular features without formally detailing them).

Some tests, usually involving real-time decision-making capabilities, are also used to assess these skills. This is the case of the driving test. It can also be noted that, while a human being passes a driving test in thirty minutes, an autonomous vehicle is required to cover several thousand kilometres before being approved. It is therefore not so much the “black box” nature of the autonomous vehicle that raises questions as the identification of the right size of test environment (which influencing factors to vary, which sampling step, etc.) needed to account for its performance.

Do you have any other examples of areas where you are being asked for help and where this “black box” effect also exists?

Because of its missions, LNE conducts its assessments mainly through tests. The hardness of a material, for example, is assessed experimentally, in addition to, or when it cannot be, calculated from its chemical composition. The very notion of measurement refers to this usual practice of a posteriori evaluation.

What are the main weaknesses of the artificial intelligence algorithms that you notice during the tests?

The main weaknesses we find are directly related to whether measurement instruments exist to identify, quantify and correct them. As previously mentioned, a whole metrology of AI has yet to be established, one that makes it possible to characterize the operating environments of AI systems beyond the error rate calculations currently carried out. We thus note the lack of robustness (to meteorological variations, rare events, etc.) and of resilience (to adversarial attacks, sensor perturbations, etc.) of the systems we evaluate.

In the same way, the explainable or ethical nature of decisions made by AI is difficult to assess objectively today, due to the lack of a reference framework and of tools to quantify them. And we know the weaknesses of current AIs on these two aspects.

Deep Learning algorithms have proven to be very sensitive to adversarial attacks, which make it possible to deceive a machine vision AI by adding a few almost imperceptible pixels to an image. Have you seen any progress in this area, and if not, isn't this likely to undermine the reliability of AI systems in an uncontrolled environment, such as an autonomous vehicle?

New approaches have appeared that are intended to mitigate the “fragility” of Deep Learning algorithms, including the concepts of robust learning and robust data. To date, we have not seen any real improvement with respect to adversarial attacks as a whole.

Nevertheless, it seems relevant to us to differentiate between adversarial examples that correspond to realistic corner cases (the scenario that the data represents could indeed occur in real and nominal operating conditions) and adversarial examples that would result from a malicious attack. While the robustness of Deep Learning algorithms improves when they are trained on databases containing perturbed/augmented/transformed data, their resilience to attacks seems to be primarily a matter of other types of protection, including cybersecurity.
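As a rough illustration of the first half of that distinction, here is a minimal sketch (Python with NumPy; the classifier, data and perturbation level are all hypothetical) that measures how a model's accuracy degrades between clean inputs and inputs with realistic sensor-like noise, a simple robustness check that is quite different from defending against a deliberately crafted attack.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical linear classifier on 2-D features (stands in for any opaque model).
weights, bias = np.array([1.5, -2.0]), 0.1
def predict(x):
    return (x @ weights + bias > 0).astype(int)

# Toy evaluation set with known ground truth.
features = rng.normal(size=(200, 2))
labels = (features @ np.array([1.4, -2.1]) > 0).astype(int)

def accuracy(x, y):
    return float(np.mean(predict(x) == y))

clean_acc = accuracy(features, labels)
# Realistic corner-case-style perturbation: additive sensor noise.
noisy_acc = accuracy(features + rng.normal(scale=0.5, size=features.shape), labels)
print(f"clean accuracy: {clean_acc:.2f}, noisy accuracy: {noisy_acc:.2f}")
```

A marked drop between the two figures flags a robustness problem even before any malicious adversarial input is considered.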

Are you in favour of making the certification of artificial intelligence algorithms mandatory, including in systems that are not considered “critical” but whose consequences may be significant for society (for example in the case of recruitment algorithms)? Is there not a risk that the cost of these tests could be a brake on innovation and the emergence of new startups?

In a context of proliferating development, it seems indispensable to give users of AI systems guarantees and objective criteria of choice. Certification, which is based on a reference system shared by all and on evaluation results produced by independent third-party bodies, is an effective tool for building this confidence. LNE, which combines expertise in AI assessment with that of a certification body, has therefore chosen to build a voluntary certification reference system on the subject. This is a first step that will enable developers to adopt and demonstrate good practices, attesting to the reliability of the AI systems they develop.

Rather than a hindrance, this certification will be a competitive advantage. The working group in charge of building the reference system, made up of representative players in the field (Arcure, Axionable, IRT Railenium, Kickmaker, Michelin, Orange, Proxinnov, Scortex, Thales, etc.), has moreover chosen to frame the requirements in terms of results to be achieved rather than means to be implemented, thus avoiding slowing down innovation. The cost of certification should be quite reasonable in view of what is at stake in the conformity assessment carried out.

It is in fact wealth creation that our economy needs, both for the domestic market and for our international trade. Wealth is directly linked to demand; it is only a measure of its satisfaction. In order to extend to commercial success an AI market that has so far been driven mainly by supply, it is imperative to anticipate market reactions and to put in place as soon as possible the conditions for bringing supply and demand closer together, which is the main purpose of certification.

Thank you to Guillaume Avrin for agreeing to answer our questions.

Translated from Tests d’algorithmes d’intelligence artificielle, IA de confiance, effet black box : entretien avec Guillaume Avrin du LNE