[By Jonathan Wright]
After spending a year in Silicon Valley helping Apple and PayPal build next-generation AI platforms, I’m back in the UK assisting the government in preparing for Brexit and helping start-ups create bleeding-edge AI as a Service (AIaaS).
How do you know when your AI is ready “to be unleashed into the wild”? How does testing AIaaS compare to traditional (non-AI) platforms?
Let me start by defining “black box” test methods. Their focus, based on what I found when I started as an automation engineer back in the 1990s, was:
- Functional Testing – User interface and application programming interface
- Non-functional Testing – Performance, load, and security testing
The logical move is to combine both “black” and “white” boxes to create various shades of grey; how grey depends on how many levels of the solution architecture you need to understand before you can begin testing.
Traditional (non-AI) platforms like VoIP back in 1998 were built on a typical multi-tier architecture, with the presentation/UI layer delivered as a Java applet, which test tools of the era (e.g., WinRunner) supported natively through Java hooks.
Note that Oracle has recently started to charge for Java, so OpenJDK distributions like Amazon Corretto have emerged.
At first glance, AIaaS can feel similar: it may even have a dynamic UI layer through a programmable implementation like GraphQL (above), hosted as a local Kubernetes cluster or in a multi-cloud environment. For example, a simple GraphQL contract query is based on a type system (schema registry) that covers tree-based read queries, mutations for updates, and subscriptions for live updates from the graph database, which can dynamically change the UI (Apollo). However, hiding behind the AIaaS wrapper, away behind the Amazon gateway, is the enterprise AI implementation, and that is where we start our GATED.AI testing journey.
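To make the GraphQL contract idea concrete, here is a minimal Python sketch of what such a read query and live-update subscription might look like for the product-listing example. The schema, field names, and request envelope are illustrative assumptions, not the actual AIaaS contract:

```python
# Hypothetical GraphQL operations for the product-listing example.
# All type and field names here are assumptions for illustration.

# Tree-based read query: fetch products filtered by status.
PRODUCTS_QUERY = """
query UnpublishedProducts($status: String!) {
  products(filter: { status: $status }) {
    id
    title
    category
    imageUrl
  }
}
"""

# Subscription for live updates that could drive a dynamic UI (Apollo).
STATUS_SUBSCRIPTION = """
subscription OnStatusChange {
  productStatusChanged {
    id
    status
  }
}
"""

def build_request(query: str, variables: dict) -> dict:
    """Wrap a GraphQL operation in the standard JSON request envelope."""
    return {"query": query, "variables": variables}

# Body that would be POSTed to the (hypothetical) GraphQL endpoint.
request_body = build_request(PRODUCTS_QUERY, {"status": "unpublished"})
```

The same envelope shape works for queries, mutations, and (over a websocket transport) subscriptions, which is what makes a contract-level test so cheap to write.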
What does good look like?
In our example GATED.AI scenario (below), a client has asked us to list 1,000 products onto the Amazon marketplace, and we have been provided with images of each product. After running a simple GraphQL test, I discover that I have 95 unpublished products. Should I pass or fail the test? The question is: if we don’t know what the expected outcomes are, how can we simply pass or fail the output? How do we know what good looks like, and when to stop? With the GATED.AI approach, it makes sense to switch to a more goal-oriented view (e.g., the capability to identify X), with a defined acceptance/accuracy level such as a classification rate (e.g., 7 out of 10 times), within a defined timeframe/training time (seconds/minutes/hours), in a defined type of environment (compute/IOPS), with a defined size of training data (hundreds or even millions of records).
This allows us to define GATED.AI scenarios ahead of time, during the ideation phase:
- GOAL: CAPABILITY to identify and correctly categorize images of products
- ACCURACY: REQUIREMENT to successfully categorize women’s fashion (1,000+ subcategories on the Amazon channel) with a CLASSIFICATION rate of over 70%
- TIME: TIMEFRAME of one day to process over 10,000 product images
- ENRICHMENT: SEMANTIC MODELS applying data engineering (Extract, Transform, and Load) heuristics for mining ecosystems to enable AI data lakes
- DATA: CLUSTERING (percentage split) into a development training set (60%), a testing set (30%), and a proving set (10%), for training set sizes of 5,000/10,000
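The DATA clustering split above could be sketched like this in Python; the 60/30/10 percentages come from the scenario, while the seed and record format are illustrative assumptions:

```python
import random

def split_training_data(records, seed=42):
    """Split a dataset into development (60%), testing (30%) and
    proving (10%) sets, per the GATED.AI DATA clustering percentages.
    The proportions are the scenario's, not a general rule."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    dev_end = int(n * 0.6)
    test_end = dev_end + int(n * 0.3)
    return shuffled[:dev_end], shuffled[dev_end:test_end], shuffled[test_end:]

# e.g., a 10,000-image training set (records stand in for image IDs)
dev_set, test_set, prove_set = split_training_data(list(range(10_000)))
```

Shuffling before splitting matters: a dataset harvested in category order would otherwise leave whole subcategories out of the development set.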
Now that we have an idea of what good looks like, we can start to prepare the GATED.AI data lake, e.g., the baseline training dataset dependencies for our GATED.AI tests.
The main challenge around the GATED.AI approach is producing a realistic baseline training dataset that is representative of the target AI consumer needs.
- AI CAPABILITY (GOAL): Identify and correctly categorize images of products
The temptation is for the test data engineer to use a generic dataset (for example, stock images returned by a search engine), which would be more suitable for a more generalized (unsupervised) AI.
In previous GATED.AI scenarios, we have seen much higher accuracy (80%+ classification rates) when the training dataset was harvested from the target AI consumer itself, for example, using a spider to transpose a data source such as a website containing existing product images with associated metadata.
- AI REQUIREMENT (ACCURACY): To successfully categorize women’s fashion
- AI CLARIFICATION RATE (ACCURACY): Over 70%
NOTE – Data Visualisation platforms will help to identify a subset of the test training dataset and avoid cognitive bias/overfitting of the training data.
If I were going to list a competition on a crowdsourced AI platform with gamification, say Kaggle.com, I would be looking for accuracy as close to 90% as possible, based on the industry understanding that enterprise AI scores above 95% are nearly impossible with current cognitive technology capabilities. However, I would be equally interested in the time and data variance, along with the associated compute power consumed.
- AI TIMEFRAME: Process 10,000 images/day
- AI COMPUTE (TIME): Auto-scale training nodes/clusters (COMPUTE/GPU/IOPS) based on AI Benchmarks/Performance Index (PI)
- AI VARIANCE (TIME): Future growth (customers utilizing the service/increase in product images)
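One way to sketch the TIME and VARIANCE criteria above is a simple capacity estimate for auto-scaling: given a target daily image volume and a per-node throughput from benchmark runs, how many nodes are needed now and under future growth? The per-node throughput figure here is an assumed benchmark value, not a real measurement:

```python
import math

def required_training_nodes(images_per_day: int,
                            node_throughput_per_day: int,
                            growth_factor: float = 1.0) -> int:
    """Estimate how many compute nodes to auto-scale to for a target
    daily image volume, with headroom for future growth (AI VARIANCE).
    node_throughput_per_day would come from AI benchmark/Performance
    Index (PI) runs; here it is just an assumed parameter."""
    target = images_per_day * growth_factor
    return max(1, math.ceil(target / node_throughput_per_day))

# e.g., 10,000 images/day against an assumed 3,000 images/node/day benchmark
nodes_now = required_training_nodes(10_000, 3_000)
nodes_future = required_training_nodes(10_000, 3_000, growth_factor=2.0)
```

Rounding up with `math.ceil` is deliberate: an under-provisioned cluster would silently miss the one-day timeframe gate.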
Check out my previous blog, on Digital Performance Lifecycle – Cognitive Learning (AIOps) for further detail.
Optionally, the baseline training dataset can also be enriched by applying various semantic maps, for example, cross-validation from different ecosystems of ecosystems.
- AI SEMANTIC MODEL (ENRICHMENT): Data engineering (Enhance, Transform & Load) heuristics
The above approach enables the creation of an AI-ready data lake with an improved level of data hygiene; this can be process-mined to identify business process flows for robotic process automation, along with the associated datasets.
In our experience in test automation, we explore the use of model-based testing (MBT) to support our business process testing efforts, modeling out something as simple as microservice testing based on a specification (Swagger/OpenAPI).
In the example above, we have modeled (MBT) a simple business process flow with acceptance criteria (e.g., login success) using cause-and-effect modeling, so that we can fully understand the relationship between the A-to-B mappings (inputs/outputs) based on the dataset used. It is therefore important not only to model the business process flows but also to model the associated test dataset (MBD).
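A minimal sketch of this cause-and-effect modeling idea in Python: the model pairs each input dataset with its expected effect, and test cases are derived by executing the model against the system under test. The login model and the stubbed system are assumptions for illustration:

```python
# MBT-style sketch: model a login step as cause-and-effect pairs
# (input dataset -> expected outcome). Credentials are illustrative.
LOGIN_MODEL = [
    ({"user": "alice", "password": "correct"}, "login success"),
    ({"user": "alice", "password": "wrong"},   "login failure"),
    ({"user": "",      "password": ""},        "validation error"),
]

def run_model(model, system_under_test):
    """Execute each modeled input against the system and compare the
    observed effect with the modeled (expected) one."""
    results = []
    for inputs, expected in model:
        observed = system_under_test(inputs)
        results.append((inputs, expected, observed, observed == expected))
    return results

# Stubbed system under test that satisfies the model above.
def stub_login(inputs):
    if not inputs["user"]:
        return "validation error"
    return "login success" if inputs["password"] == "correct" else "login failure"
```

The point of pairing data with expectations in the model itself is exactly the MBD argument above: the test dataset is part of the model, not an afterthought.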
As previously mentioned, this requires the test data engineer to fully understand not only the data quality of the unstructured/structured dataset but the data coverage and the context-sensitive nature of the domain dataset.
- AI TRAINING DATA – mining for clustering, number of variables and associated test training data set size
Data engineering is as important as the data science activities: the ability to establish the data pipework to funnel unstructured data from heritage platforms into structured data, through a combination of business process automation and test data management (TDM) capabilities, is essential.
In the above GATED.AI scenario, we define the high-level goal of identifying a product image (e.g., pair of trousers) and the image category mapping (e.g., product type).
- GOAL – AI CAPABILITY: Image category mapping
- ACCURACY – AI CLASSIFICATION RATE: >70%
- TIME – AI TIMEFRAME: <1 day for 10,000
- ENRICHMENT – AI SEMANTIC: Harvest the manufacturer’s website and internal ERP platform
- DATA – AI TRAINING DATA: Training images dataset size (5,000/10,000), semantic model parameters (500+) & training cluster nodes (4/8)
Now, the acceptance criteria for the GATED.AI scenario are that the classification rate is over 70% and that the platform can process 10,000 images per day.
In the above three cases, only one AI model passes the GATED.AI scenario: Model 1 fails on classification rate, and Model 3 fails on training time (even with 4 cluster nodes, it exceeds the one-day timeframe).
NOTE: If we had not used the GATED.AI approach, the temptation would have been to select Model 3, as it has the highest classification rate but takes 5 times longer than Model 2.
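The gating logic described above can be sketched as follows. The per-model metrics are illustrative assumptions (the text does not give exact figures), chosen so that Model 1 fails on classification rate and Model 3 on throughput:

```python
def passes_gated_ai(classification_rate: float, images_per_day: int) -> bool:
    """Gate a model on the GATED.AI acceptance criteria from the scenario:
    classification rate over 70% AND at least 10,000 images per day."""
    return classification_rate > 0.70 and images_per_day >= 10_000

# Illustrative metrics only; the real Model 1-3 figures are not given.
models = {
    "Model 1": {"classification_rate": 0.65, "images_per_day": 40_000},
    "Model 2": {"classification_rate": 0.74, "images_per_day": 25_000},
    "Model 3": {"classification_rate": 0.82, "images_per_day": 5_000},
}

passing = [name for name, m in models.items()
           if passes_gated_ai(m["classification_rate"], m["images_per_day"])]
# Only Model 2 clears both gates in this illustration.
```

Encoding the gates as code makes the trap explicit: sorting by classification rate alone would pick Model 3, which the TIME gate rejects.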
GATED.AI Performance Engineering
The above GATED.AI scenario demonstrates the importance of effective performance engineering in AIaaS: assuring not only that it can handle the expected GATED.AI volumetric models, but also that the underlying enterprise AI platform can scale (e.g., auto-scale compute nodes to handle future growth variations) and be resilient (e.g., self-healing, Chaos Engineering). At the start of this blog, we mentioned that traditional testing was no longer going to cut it, so we needed to adopt a “grey box” testing approach to improve visibility of individual components so that we could identify bottlenecks throughout the target AIaaS architecture.
Therefore, I am going to refer to a keynote that I gave to the British Computer Society (BCS) back in 2011. In this keynote, I proposed that user load, similar to the “automation pyramid” was only the tip of the iceberg and that interface (messaging/APIs) combined with an ambient background (traffic/noise) should be the real focus to achieve accurate performance engineering.
In the example above, for a sizeable E-SAP migration program, we identified over 500 interfaces in the enterprise architecture diagram, which either needed to be stubbed out with service virtualization or used to generate bulk transactions as messages or traffic (e.g., iDocs).
Like our GATED.AI scenario, the business process flow to trigger a simple report may only be a couple of steps through the SAP GUI (learn more about SAP testing), and the response for submitting the report may be a couple of seconds.
However, in the background, a vast number of transactions are going on across the ecosystem (e.g., internal and external systems, both upstream and downstream), as monitored above by the application performance management (APM) platform.
Keeping this in mind, if we return to the GATED.AI scenario, our AIaaS platforms are built on Kubernetes clusters, which can be deployed locally or multi-cloud.
For this example, I will be deploying the following Kubernetes cluster locally (on an Alienware m51, due to the amount of memory required):
- API Gateway (REST API)
- Apache Kafka (Streaming Service)
- Avro Schemas (Schema Registry)
- Neo4J (Graph Database)
- MongoDB (NoSQL)
- Postgres (RDBMS)
- Apache Zookeeper (Coordination Service)
In the above screenshot, I’m sending a simple JSON message to the ML microservice (where I can intercept/manipulate the request/response pairs), which triggers a number of Cypher queries and sets a flag in MongoDB indicating that the product is ready to be listed to the channel.
Now, if the product successfully matches a valid women’s fashion category on the specified channel (e.g., eBay vs. Amazon), the status will change from “received” to “ready to list.”
Once the “channel listing service” identifies a cube of 10,000 products that are “ready to list” to a channel, it publishes them to the appropriate marketplace every 15 minutes.
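The cube-based listing behaviour could be sketched like this in Python; the status value comes from the text, while the record shape and cube helper are assumptions:

```python
def cube_ready_to_list(products, cube_size=10_000):
    """Collect products flagged 'ready to list' and group them into
    cubes of cube_size for publishing to the marketplace (the listing
    service runs on a 15-minute cycle; record shape is assumed)."""
    ready = [p for p in products if p["status"] == "ready to list"]
    return [ready[i:i + cube_size] for i in range(0, len(ready), cube_size)]
```

A final partial cube (fewer than 10,000 items) is still returned, which is the behaviour you would need to test explicitly: does the real service publish a partial cube or hold it until it fills?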
So as a performance engineer, where do I focus my efforts to prove the system is performant, resilient, and can scale?
- Observing the behavior of the front end or API?
- Interpreting the interactions between node/endpoints within the ecosystem (upstream & downstream), e.g., Kafka Producers & Consumers?
- Modeling the sentiment/context of the business processes (cause & effect modeling), e.g., How long does it take images to list on the channel?
Traditional performance testing focused on observing the behavior of the endpoint (UI/API), so if the response time for a REST call took longer than a few hundred milliseconds, then the microservice was worth investigating.
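That traditional endpoint check can be sketched as a simple timing wrapper; the 300 ms threshold is an illustrative stand-in for the “few hundred milliseconds” mentioned in the text:

```python
import time

def time_call(fn, threshold_ms=300):
    """Time a service call and flag it for investigation if it exceeds
    the threshold. The 300 ms default is an assumed value standing in
    for 'a few hundred milliseconds'."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms, elapsed_ms > threshold_ms

# Example: a stubbed "microservice call" that returns immediately.
result, elapsed_ms, investigate = time_call(lambda: "ok")
```

The limitation is exactly the one the next paragraph raises: a fast response at the endpoint says nothing about the cascade of events the call triggered behind it.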
However, that is no longer the case: system dynamics (epistemic and systemic entropy) lead to so many unknown unknowns. For example, in our case above, the REST call triggers a number of Cypher queries and sets a flag in MongoDB (which, minus the overhead of establishing and closing the MongoDB connection, is lightning fast).
In our example, our understanding is that this REST call triggers (or causes) a number of cascading events (e.g., Kafka producers & consumers, Cypher queries (GraphDB), schema registry calls, drools, and TensorFlow).
So how can we efficiently discover the other connected events (e.g., unknown unknowns)? Reverse discovery based on dynamic requests is one approach that intelligent APMs take to understand the full-stack topology, auto-tagging discovered nodes (e.g., traffic and ambient noise).
In the example above, we can overlay the dynamic requests, with a specific request attribute to filter down to the individual transaction level and then break down the service flow of that request and how long each service call took (e.g., root cause analysis).
This multidimensional transactional correlation, in the above case PurePath™ from Dynatrace, allows us to establish the baseline blueprint/schematic of the flow of events (e.g., cause and effect modeling) along with associated time series data for every request/response over time.
Finally, the slide above, taken from a recent keynote on intelligent operations (AIOps) (https://youtu.be/1kyNIOFY6s), demonstrates how you can “Shift Right” by pulling the BAU/digital experience (DX) sessions from the APM, in this case Dynatrace, using Probit to generate semantic maps (user journeys, digital interactions, and transactions) through the open testing API available in the intelligent APM. You can then use an enterprise performance engineering tool like NeoLoad to generate performance as code (PaC) by reverse engineering the BAU/DX sessions (either through captured request/response pairs or DX sessions replayed against the NeoLoad recorder port).
Alternatively, you could use a proxy like PortSwigger’s Burp Suite to intercept (man-in-the-middle) messages between endpoints, using a Raspberry Pi Zero with Kali installed.
In summary, when testing complex ecosystems of ecosystems (such as an AIaaS platform), intelligent application performance management (APM) toolchains such as Dynatrace, combined with load testing tools such as those from Neotys, are essential.
Learn More about the Performance Advisory Council and AI as a Service
Want to learn more about this event? See Jonathan’s presentation here.