Introduction

This page presents WebMall, a benchmark for evaluating the capability of Web agents to find and compare product offers across multiple e-shops. WebMall is the first simulated-environment benchmark whose tasks require navigating multiple web shops to collect and aggregate information at varying levels of user-query specificity, as well as performing actions such as adding items to carts and finalizing a purchase by checking out. The WebMall task set covers searching for and comparing offers for specific products, adding offers to the shopping cart, completing the checkout procedure, product searches given vague user requirements, searches for compatible or cheaper substitute products, and full search-to-checkout workflows.

Given a user task, the agent is asked to visit four e-shops that expose heterogeneous product offers through distinct HTML interfaces. WebMall differs from existing e-commerce benchmarks, such as WebShop, WebArena, or Mind2Web, by 1) requiring the agent to visit multiple e-shops, 2) featuring heterogeneous product offers from different real-world sources, and 3) its increased task complexity, which mirrors comparison-shopping customer journeys.

News

01-08-2025: Release of WebMall-Interfaces: MCP vs RAG vs NLWeb vs HTML - A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web.
29-07-2025: Version 1.0 released, featuring additional product offers, the new task category "End to End", and a cleanup of tasks and solutions.
05-06-2025: Initial release of Version 0.7.

Screencasts

The following screencasts show an agent performing different tasks across the four shops.

An agent finding the cheapest product that meets specific technical requirements.

An agent interpreting vague user requirements to find suitable products.

An agent finding substitute products for a given item.

An agent completing the checkout process for selected products.

The WebMall Task Set

The WebMall benchmark covers 91 tasks distributed over 11 task categories, which are grouped into five task groups: Specific Product Search, Vague Product Search, Cheapest Product Search, Action & Transaction, and End-to-End. Ten of the 11 task categories require agents to visit four different webshops to find relevant product offers; four of these ten additionally require comparing product prices across shops in order to find the cheapest offer(s).

Specific Product Search tasks represent overview searches where users know the exact product or can formulate specific requirements and want to find all fitting offers in all shops. This includes Find Specific Product tasks (locating all offers for a named product across all shops) and Products Fulfilling Specific Requirements tasks (searches with constraints on attributes such as display size without naming the specific product).

Vague Product Search tasks incorporate the uncertainty present in real-world shopping when users are not aware of the available options or lack domain expertise. The tasks encompass Products Satisfying Vague Requirements (requiring reasoning about vague descriptions to return relevant offers), Find Substitutes (suggesting cheaper alternatives when items are unavailable or unsatisfactory), and Find Compatible Products (reasoning over compatibility, e.g., finding compatible CPUs for a motherboard).

Cheapest Product Search tasks require agents to find the cheapest offer(s) rather than all fitting offers and include Find Cheapest Offer (examining all shops for the lowest price on a named product), Cheapest Offer with Specific Requirements (constraint-based search with price comparison), and Cheapest Offer with Vague Requirements (reasoning about vague descriptions while comparing prices).

Action & Transaction tasks include Add To Cart (adding offers for specific products to the cart) and Checkout tasks (adding an offer to the cart and proceeding through checkout, including providing shipping and billing details via HTML forms).

Finally, End-to-End tasks combine searching for the cheapest offer, adding it to the cart, and completing checkout into a single workflow.
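
For quick reference, the grouping can be written down as a simple mapping. This is a sketch mirroring the descriptions above and the overview table below, not a data structure taken from the benchmark code:

```python
# The five task groups and their 11 task categories, as described above.
TASK_GROUPS = {
    "Specific Product Search": [
        "Find Specific Product",
        "Products Fulfilling Specific Requirements",
    ],
    "Vague Product Search": [
        "Products Satisfying Vague Requirements",
        "Find Substitutes",
        "Find Compatible Products",
    ],
    "Cheapest Product Search": [
        "Find Cheapest Offer",
        "Cheapest Offer Specific Requirements",
        "Cheapest Offer Vague Requirements",
    ],
    "Action & Transaction": ["Add to Cart", "Checkout"],
    "End to End": ["End To End"],
}
assert sum(len(v) for v in TASK_GROUPS.values()) == 11
```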

Each task is defined by an instruction to the agent in string format as well as the expected answers (the URLs of the correct product offers). The instruction consists of a general part, which is the same for all tasks, and a task-specific part. The general part contains the links to the four webshops as well as instructions on how to submit the final solution after completing a task. An example of a complete instruction string is found here; a minimal sketch of such a task record is shown below.
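
For illustration, a task could be represented as in the following sketch. The field names and the wording of the general part are simplified assumptions, not the benchmark's actual schema, and the elided product URLs are placeholders:

```python
# Illustrative task record; field names and the GENERAL_PART wording are
# simplified assumptions, not the benchmark's actual schema.
GENERAL_PART = (
    "You can access the following four shops: "
    "https://webmall-1.[local_path].de, https://webmall-2.[local_path].de, "
    "https://webmall-3.[local_path].de, https://webmall-4.[local_path].de. "
    "Submit your final answer as described once the task is completed."
)

task = {
    "category": "Find Specific Product",
    "instruction": "Find all offers for the AMD Ryzen 9 5900X.",
    "expected_answers": [
        "https://webmall-1.[local_path].de/product/...",  # placeholder URLs
        "https://webmall-2.[local_path].de/product/...",
    ],
}

# The full instruction passed to the agent is the shared general part
# followed by the task-specific part.
full_instruction = GENERAL_PART + "\n\n" + task["instruction"]
```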

The table below gives an overview of the 11 task categories and includes an example task from each category. A list containing all 91 tasks of the WebMall benchmark is provided here.

Task Categories Overview

Each task category is listed below with its task group, task count, and two example tasks.

Find Specific Product (Specific Product Search, 12 tasks)
- Find all offers for the AMD Ryzen 9 5900X.
- Find all offers for the Canon EOS R5 Mark II.

Products Fulfilling Specific Requirements (Specific Product Search, 11 tasks)
- Find all offers for orange straps that fit with the Apple Watch Series 6.
- Find all offers for Samsung Tablets which support 5G and come with an S-Pen stylus.

Products Satisfying Vague Requirements (Vague Product Search, 8 tasks)
- Find all offers for the largest available MX500 model by Crucial.
- Find all offers for an adapter so I can connect my monitor, which does not support HDMI, to an HDMI cable. The monitor's connector looks quite similar to HDMI.

Find Substitutes (Vague Product Search, 6 tasks)
- Find the cheapest alternative for this item: https://webmall-3.[local_path].de/product/arctic-liquid-freezer-iii-360mm-liquid-cpu-cooler-p12-pwm-pst-fans-pwm-controlled-pump .
- Find the cheapest alternative with at least the same capacity and speed for this product: https://webmall-3.[local_path].de/product/corsair-1tb-mp600-core-xt-m-2-nvme-ssd-m-2-2280-pcie4-3d-qlc-nand-r-w-5000-3500-mb-s-700k-900k-iops .

Find Compatible Products (Vague Product Search, 5 tasks)
- Find all offers for compatible CPUs for this motherboard: https://webmall-3.[local_path].de/product/asus-pro-ws-wrx80e-sage-se-wifi-ii-workstation-amd-wrx80-swrx8-eatx-8-ddr4-sli-wi-fi-6e-dual-10g-lan-hyper-m-2-card-3x-m-2 .
- Find kits with single or multiple 32 GB RAM sticks compatible with this motherboard: https://webmall-4.[local_path].de/product/asus-rog-strix-z790-e-gaming-wifi-intel-z790-1700-atx-4-ddr5-hdmi-dp-wi-fi-6e-2-5g-lan-pcie5-rgb-5x-m-2 .

Find Cheapest Offer (Cheapest Product Search, 10 tasks)
- Find the cheapest offer for the Samsung Galaxy S24 Plus.
- Find the cheapest offer for the Netac Z Slim 1TB M.2 External SSD.

Cheapest Offer Specific Requirements (Cheapest Product Search, 10 tasks)
- Find the cheapest offer for a new Xbox gaming console with at least 512gb disk space in white.
- Find the cheapest offer for a Samsung Galaxy smartphone from the S24 series which has a camera with 200 Megapixel resolution.

Cheapest Offer Vague Requirements (Cheapest Product Search, 6 tasks)
- Find the cheapest offer for each Smartphone model of Samsungs budget-friendly smartphone series.
- Find the cheapest offers for each model of mid-tier nVidia gaming GPUs in the 4000 series.

Add to Cart (Action & Transaction, 7 tasks)
- Find all offers for the GameMax Iceburg 360mm ARGB Liquid CPU Cooler and add each of them to the respective shopping cart of the shop where you found the offer.
- Find all offers for the Asus DUAL RTX4070 SUPER OC White and add each of them to the respective shopping cart of the shop where you found the offer.

Checkout (Action & Transaction, 8 tasks)
- Add the product on page https://webmall-3.[local_path].de/product/trust-tk-350-wireless-membrane-keyboard-spill-proof-silent-keys-media-keys-black to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.
- Add the product on page https://webmall-1.[local_path].de/product/palit-rtx3050-dual-v2-pcie4-8gb-ddr6-dvi-hdmi-dp-1777mhz-clock-rgb-lighting to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.

End To End (End to End, 8 tasks)
- Find the cheapest offer for the Asrock B550 PHANTOM GAMING 4, add it to the shopping cart and complete the checkout process. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.
- Find the cheapest offer for the Asus ROG Ryuo III 360 ARGB 360mm Liquid CPU Cooler and the cheapest offer for the Corsair Vengeance LPX 16GB Kit (2 x 8GB), add the respective cheapest offers to the shopping cart and complete the checkout process. If they are found in the same shop, put both in the shopping cart and checkout only once. Pay via credit card using the following information: Address: Jessica Morgan, jessica.morgan@yahoo.com, Maple Avenue, 742, 60614, IL, USA, Credit card number: 4242424242424242, CVV: 123, expiry date: 12/28.

The WebMall Shops

The WebMall benchmark asks agents to search for products in four distinct webshops which provide heterogeneous user interfaces. The webshops are implemented using the WordPress plugin WooCommerce and can be hosted via Docker, either locally or on a remote machine.

Screenshot Shop 1: E-Store Athletics

Screenshot Shop 2: TechTalk


Each shop contains heterogeneous product offers originating from a wide range of real-world e-shops that annotate the offers within their pages using the schema.org vocabulary. The product offers were extracted from the October 2024 version of the Common Crawl by the Web Data Commons project [Brinkmann2023].

The four WebMall shops contain a total of 4,421 product offers distributed across three main categories: PC Components, PC Peripherals, and Other Electronics. The distribution varies across shops to create diverse shopping environments for agent evaluation. The PC Components category includes internal computer parts such as CPUs, RAM, and motherboards. PC Peripherals covers external devices like monitors, keyboards, and external hard drives, while Other Electronics features consumer tech products such as gaming consoles, headphones, and smartwatches.

Product Distribution Across All Shops

| Product Category | Total Offers | Total % | Shop 1 Offers | Shop 1 % | Shop 2 Offers | Shop 2 % | Shop 3 Offers | Shop 3 % | Shop 4 Offers | Shop 4 % |
|---|---|---|---|---|---|---|---|---|---|---|
| PC Components | 1,477 | 33.4 | 348 | 30.2 | 369 | 33.7 | 430 | 37.2 | 330 | 32.4 |
| PC Peripherals | 1,388 | 31.4 | 432 | 37.5 | 255 | 23.3 | 336 | 29.1 | 365 | 35.8 |
| Other Electronics | 1,556 | 35.2 | 370 | 32.3 | 471 | 43.0 | 390 | 33.7 | 325 | 31.9 |
| Total | 4,421 | 100.0 | 1,150 | 100.0 | 1,095 | 100.0 | 1,156 | 100.0 | 1,020 | 100.0 |

Baseline Experiments

We conduct a series of baseline experiments using web agents implemented with the AgentLab library that accompanies BrowserGym [Chezelles2025]. We test 8 agent setups along three dimensions: (1) the observation space (AX-Tree, screenshots, or AX-Tree plus screenshots), (2) the availability of short-term memory, and (3) the LLM used (GPT4.1 or Claude Sonnet 4). The agent's observation space thus consists of either the AX-Tree of the visited webpages, a screenshot of the currently visible page, or both. In the screenshot, each element of the visible page is annotated with a number that corresponds to the element's AX-Tree id. If short-term memory is activated, the agent can note down information it deems relevant to remember at each step. An example of the full final message passed to the agent, which also contains an action history, for two experimental settings with the GPT4.1 model can be found here (AXTree only) and here (AXTree+Memory).
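
The eight setups can be enumerated as in the following sketch; the configuration names and flags are our own shorthand, not AgentLab identifiers:

```python
# Sketch of the 8 evaluated agent setups: 4 observation configurations x 2 LLMs.
# The flag and model names are our own shorthand, not AgentLab identifiers.
from itertools import product

OBS_CONFIGS = {
    "AX-Tree":          {"use_axtree": True,  "use_screenshot": False, "use_memory": False},
    "AX-Tree + Memory": {"use_axtree": True,  "use_screenshot": False, "use_memory": True},
    "AX-Tree + Vision": {"use_axtree": True,  "use_screenshot": True,  "use_memory": False},
    "Vision":           {"use_axtree": False, "use_screenshot": True,  "use_memory": False},
}
MODELS = ["gpt-4.1", "claude-sonnet-4"]

SETUPS = [
    {"model": model, "observation": name, **flags}
    for model, (name, flags) in product(MODELS, OBS_CONFIGS.items())
]
assert len(SETUPS) == 8
```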

Agent effectiveness is measured in terms of task completion rate, precision, recall, and F1 score. The completion rate is calculated as the fraction of tasks for which the agent returns all relevant answers without adding any non-relevant ones. The completion rate for add-to-cart and checkout tasks follows the same logic: a task is marked as complete only if all relevant offers are added to the cart or checked out. To capture partial task completion, precision, recall, and F1 are computed per task and subsequently macro-averaged. Efficiency is measured as the average number of steps taken by the agent per task, the average number of input and output tokens used per task, and the average runtime and cost per task. The results are presented in the tables below.
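
These effectiveness metrics can be summarized in a short sketch, assuming each task result is a pair of URL sets (returned, expected). This is an illustration of the definitions above, not the official WebMall evaluation code:

```python
# Per-task effectiveness metrics as described above; illustration only,
# not the official WebMall evaluation code.

def task_metrics(returned: set[str], expected: set[str]) -> tuple[float, float, float, bool]:
    """Per-task precision, recall, F1, and binary task completion."""
    tp = len(returned & expected)  # correctly returned offer URLs
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    complete = returned == expected  # all relevant answers, no non-relevant ones
    return precision, recall, f1, complete

def macro_average(results: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """Macro-average per-task metrics over a non-empty list of task results."""
    per_task = [task_metrics(returned, expected) for returned, expected in results]
    n = len(per_task)
    return {
        "completion_rate": sum(c for *_, c in per_task) / n,
        "precision": sum(p for p, *_ in per_task) / n,
        "recall": sum(r for _, r, *_ in per_task) / n,
        "f1": sum(f for _, _, f, _ in per_task) / n,
    }
```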

Completion Rates and F1 Score by Aggregate Task Groups

All values are percentages. Column groups from left to right: AX-Tree, AX-Tree + Memory, AX-Tree + Vision, Vision; within each group: Completion Rate (CR), Precision (P), Recall (R), F1.

| Model | Task Group | CR | P | R | F1 | CR | P | R | F1 | CR | P | R | F1 | CR | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT4.1 | Specific Product Search | 30.30 | 67.71 | 53.54 | 59.79 | 51.52 | 86.74 | 70.35 | 77.32 | 39.39 | 67.95 | 55.79 | 61.27 | 34.47 | 61.82 | 47.23 | 53.47 |
| GPT4.1 | Cheapest Product Search | 38.89 | 51.39 | 49.54 | 50.41 | 45.56 | 62.22 | 54.81 | 57.78 | 28.89 | 41.67 | 40.37 | 40.99 | 23.33 | 30.00 | 26.94 | 28.24 |
| GPT4.1 | Vague Product Search | 34.17 | 58.01 | 48.25 | 52.48 | 32.78 | 51.14 | 46.20 | 48.42 | 39.44 | 47.73 | 48.09 | 47.86 | 21.94 | 32.36 | 28.37 | 30.05 |
| GPT4.1 | Action & Transaction | 92.86 | 92.86 | 92.86 | 92.86 | 100.00 | 100.00 | 100.00 | 100.00 | 92.86 | 100.00 | 96.43 | 98.15 | 49.11 | 56.25 | 52.68 | 54.40 |
| GPT4.1 | End To End | 37.50 | 50.00 | 43.75 | 46.67 | 62.50 | 62.50 | 62.50 | 62.50 | 75.00 | 75.00 | 75.00 | 75.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Specific Product Search | 56.06 | 73.48 | 65.86 | 69.41 | 60.23 | 82.58 | 69.39 | 75.08 | 60.23 | 73.48 | 68.22 | 70.74 | 4.55 | 47.35 | 24.22 | 31.46 |
| Claude Sonnet 4 | Cheapest Product Search | 54.44 | 62.59 | 58.52 | 60.29 | 51.11 | 51.11 | 51.11 | 51.11 | 48.89 | 48.89 | 48.89 | 48.89 | 16.67 | 23.33 | 20.00 | 21.52 |
| Claude Sonnet 4 | Vague Product Search | 53.61 | 67.98 | 72.92 | 70.16 | 41.39 | 64.18 | 61.44 | 62.49 | 38.06 | 45.05 | 46.27 | 45.64 | 6.67 | 17.08 | 10.31 | 12.07 |
| Claude Sonnet 4 | Action & Transaction | 79.46 | 79.46 | 79.46 | 79.46 | 86.61 | 86.61 | 86.61 | 86.61 | 86.61 | 86.61 | 86.61 | 86.61 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | End To End | 62.50 | 62.50 | 62.50 | 62.50 | 75.00 | 87.50 | 81.25 | 84.26 | 37.50 | 37.50 | 37.50 | 37.50 | 0.00 | 0.00 | 0.00 | 0.00 |

Completion Rates and F1 Score per Task Category

All values are percentages. Column groups from left to right: AX-Tree, AX-Tree + Memory, AX-Tree + Vision, Vision; within each group: Completion Rate (CR), Precision (P), Recall (R), F1.

| Model | Task Category | CR | P | R | F1 | CR | P | R | F1 | CR | P | R | F1 | CR | P | R | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT4.1 | Find Specific Product | 33.33 | 85.42 | 66.48 | 74.77 | 66.67 | 88.64 | 81.69 | 85.02 | 33.33 | 67.71 | 54.61 | 60.46 | 41.67 | 69.10 | 56.44 | 62.13 |
| GPT4.1 | Find Cheapest Offer | 60.00 | 60.00 | 60.00 | 60.00 | 90.00 | 90.00 | 90.00 | 90.00 | 40.00 | 42.50 | 42.50 | 42.50 | 50.00 | 63.33 | 57.50 | 60.28 |
| GPT4.1 | Products Fulfilling Specific Requirements | 27.27 | 50.00 | 40.61 | 44.82 | 36.36 | 84.85 | 59.01 | 69.61 | 45.45 | 68.18 | 56.97 | 62.07 | 27.27 | 54.55 | 38.03 | 44.81 |
| GPT4.1 | Add to Cart | 85.71 | 85.71 | 85.71 | 85.71 | 100.00 | 100.00 | 100.00 | 100.00 | 85.71 | 100.00 | 92.86 | 96.30 | 85.71 | 100.00 | 92.86 | 96.30 |
| GPT4.1 | Checkout | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 12.50 | 12.50 | 12.50 | 12.50 |
| GPT4.1 | Cheapest Offer Specific Requirements | 40.00 | 40.00 | 40.00 | 40.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 20.00 | 20.00 | 20.00 | 20.00 |
| GPT4.1 | Products Satisfying Vague Requirements | 12.50 | 64.03 | 48.09 | 54.93 | 25.00 | 80.09 | 65.28 | 71.93 | 25.00 | 39.87 | 44.27 | 41.95 | 12.50 | 43.75 | 31.77 | 36.81 |
| GPT4.1 | Cheapest Offer Vague Requirements | 16.67 | 54.17 | 48.61 | 51.24 | 16.67 | 66.67 | 44.44 | 53.33 | 16.67 | 52.50 | 48.61 | 50.48 | 0.00 | 6.67 | 3.33 | 4.44 |
| GPT4.1 | Find Substitutes | 50.00 | 50.00 | 50.00 | 50.00 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 |
| GPT4.1 | Find Compatible Products | 40.00 | 60.00 | 46.67 | 52.50 | 40.00 | 40.00 | 40.00 | 40.00 | 60.00 | 70.00 | 66.67 | 68.29 | 20.00 | 20.00 | 20.00 | 20.00 |
| GPT4.1 | End To End | 37.50 | 50.00 | 43.75 | 46.67 | 62.50 | 62.50 | 62.50 | 62.50 | 75.00 | 75.00 | 75.00 | 75.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Find Specific Product | 66.67 | 83.33 | 78.41 | 80.80 | 75.00 | 83.33 | 79.17 | 81.20 | 75.00 | 83.33 | 79.17 | 81.20 | 0.00 | 58.33 | 22.98 | 32.97 |
| Claude Sonnet 4 | Find Cheapest Offer | 70.00 | 75.00 | 75.00 | 75.00 | 70.00 | 70.00 | 70.00 | 70.00 | 80.00 | 80.00 | 80.00 | 80.00 | 40.00 | 60.00 | 50.00 | 54.55 |
| Claude Sonnet 4 | Products Fulfilling Specific Requirements | 45.45 | 63.64 | 53.31 | 58.01 | 45.45 | 81.82 | 59.61 | 68.97 | 45.45 | 63.64 | 57.27 | 60.29 | 9.09 | 36.36 | 25.45 | 29.95 |
| Claude Sonnet 4 | Add to Cart | 71.43 | 71.43 | 71.43 | 71.43 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Checkout | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Cheapest Offer Specific Requirements | 60.00 | 60.00 | 60.00 | 60.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 10.00 | 10.00 | 10.00 | 10.00 |
| Claude Sonnet 4 | Products Satisfying Vague Requirements | 37.50 | 68.39 | 68.75 | 68.57 | 37.50 | 71.88 | 57.64 | 63.97 | 37.50 | 58.48 | 62.15 | 60.26 | 0.00 | 31.25 | 10.94 | 16.20 |
| Claude Sonnet 4 | Cheapest Offer Vague Requirements | 33.33 | 52.78 | 40.56 | 45.87 | 33.33 | 33.33 | 33.33 | 33.33 | 16.67 | 16.67 | 16.67 | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Find Substitutes | 83.33 | 83.33 | 83.33 | 83.33 | 66.67 | 66.67 | 66.67 | 66.67 | 16.67 | 16.67 | 16.67 | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 |
| Claude Sonnet 4 | Find Compatible Products | 40.00 | 52.22 | 66.67 | 58.57 | 20.00 | 54.00 | 60.00 | 56.84 | 60.00 | 60.00 | 60.00 | 60.00 | 20.00 | 20.00 | 20.00 | 20.00 |
| Claude Sonnet 4 | End To End | 62.50 | 62.50 | 62.50 | 62.50 | 75.00 | 87.50 | 81.25 | 84.26 | 37.50 | 37.50 | 37.50 | 37.50 | 0.00 | 0.00 | 0.00 | 0.00 |

Token Usage, Cost and Runtime across all Tasks

| Model | Observation Space | Avg. Steps | Avg. Input Tokens | Avg. Output Tokens | Avg. Runtime | Avg. Cost |
|---|---|---|---|---|---|---|
| GPT4.1 | AX-Tree | 23.83 | 146,112 | 2,642 | 144.8 s | $0.31 |
| GPT4.1 | AX-Tree + Memory | 22.53 | 154,609 | 4,084 | 159.7 s | $0.34 |
| GPT4.1 | AX-Tree + Vision | 22.33 | 152,659 | 2,184 | 171.6 s | $0.32 |
| GPT4.1 | Vision | 30.94 | 119,294 | 2,786 | 196.3 s | $0.26 |
| Claude Sonnet 4 | AX-Tree | 26.67 | 239,563 | 8,427 | 277.2 s | $0.85 |
| Claude Sonnet 4 | AX-Tree + Memory | 24.68 | 300,745 | 16,628 | 377.8 s | $1.15 |
| Claude Sonnet 4 | AX-Tree + Vision | 31.44 | 361,398 | 9,442 | 375.7 s | $1.23 |
| Claude Sonnet 4 | Vision | 45.57 | 393,199 | 15,696 | 491.6 s | $1.42 |

Token Usage, Cost and Runtime by Aggregate Task Groups

| Model | Task Group | Observation Space | Avg. Steps | Avg. Input Tokens | Avg. Output Tokens | Avg. Runtime | Avg. Cost |
|---|---|---|---|---|---|---|---|
| GPT4.1 | Specific Product Search | AX-Tree | 24.10 | 150,386 | 2,670 | 144.3 s | $0.32 |
| GPT4.1 | Specific Product Search | AX-Tree + Memory | 21.94 | 146,238 | 3,920 | 154.8 s | $0.32 |
| GPT4.1 | Specific Product Search | AX-Tree + Vision | 22.23 | 151,173 | 2,197 | 175.0 s | $0.32 |
| GPT4.1 | Specific Product Search | Vision | 27.75 | 101,728 | 2,571 | 183.1 s | $0.22 |
| Claude Sonnet 4 | Specific Product Search | AX-Tree | 26.27 | 239,064 | 6,812 | 263.6 s | $0.82 |
| Claude Sonnet 4 | Specific Product Search | AX-Tree + Memory | 24.22 | 310,928 | 20,503 | 428.1 s | $1.24 |
| Claude Sonnet 4 | Specific Product Search | AX-Tree + Vision | 28.01 | 296,043 | 7,641 | 333.7 s | $1.00 |
| Claude Sonnet 4 | Specific Product Search | Vision | 44.16 | 371,030 | 13,746 | 463.7 s | $1.32 |
| GPT4.1 | Cheapest Product Search | AX-Tree | 22.98 | 146,274 | 2,906 | 156.2 s | $0.32 |
| GPT4.1 | Cheapest Product Search | AX-Tree + Memory | 23.45 | 183,770 | 5,148 | 184.4 s | $0.41 |
| GPT4.1 | Cheapest Product Search | AX-Tree + Vision | 22.48 | 163,864 | 2,538 | 190.1 s | $0.35 |
| GPT4.1 | Cheapest Product Search | Vision | 20.90 | 65,711 | 1,758 | 135.1 s | $0.15 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree | 18.80 | 118,040 | 3,786 | 148.8 s | $0.41 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree + Memory | 23.88 | 292,819 | 15,763 | 374.2 s | $1.11 |
| Claude Sonnet 4 | Cheapest Product Search | AX-Tree + Vision | 22.50 | 181,149 | 4,451 | 236.8 s | $0.61 |
| Claude Sonnet 4 | Cheapest Product Search | Vision | 44.73 | 385,126 | 17,170 | 531.3 s | $1.41 |
| GPT4.1 | Vague Product Search | AX-Tree | 24.45 | 167,834 | 3,024 | 148.9 s | $0.36 |
| GPT4.1 | Vague Product Search | AX-Tree + Memory | 21.90 | 150,835 | 4,152 | 156.9 s | $0.33 |
| GPT4.1 | Vague Product Search | AX-Tree + Vision | 19.45 | 132,548 | 2,211 | 159.2 s | $0.28 |
| GPT4.1 | Vague Product Search | Vision | 30.03 | 117,495 | 3,000 | 216.7 s | $0.26 |
| Claude Sonnet 4 | Vague Product Search | AX-Tree | 28.22 | 283,376 | 8,944 | 313.2 s | $0.98 |
| Claude Sonnet 4 | Vague Product Search | AX-Tree + Memory | 25.04 | 316,236 | 19,723 | 431.2 s | $1.24 |
| Claude Sonnet 4 | Vague Product Search | AX-Tree + Vision | 33.24 | 438,647 | 13,033 | 457.7 s | $1.51 |
| Claude Sonnet 4 | Vague Product Search | Vision | 48.46 | 433,979 | 17,340 | 516.6 s | $1.56 |
| GPT4.1 | Action & Transaction | AX-Tree | 20.65 | 107,453 | 1,895 | 114.9 s | $0.23 |
| GPT4.1 | Action & Transaction | AX-Tree + Memory | 20.73 | 120,193 | 3,081 | 132.5 s | $0.27 |
| GPT4.1 | Action & Transaction | AX-Tree + Vision | 20.88 | 129,358 | 1,671 | 141.2 s | $0.27 |
| GPT4.1 | Action & Transaction | Vision | 34.29 | 131,994 | 2,702 | 189.6 s | $0.29 |
| Claude Sonnet 4 | Action & Transaction | AX-Tree | 23.33 | 162,441 | 9,198 | 215.8 s | $0.63 |
| Claude Sonnet 4 | Action & Transaction | AX-Tree + Memory | 21.46 | 184,701 | 10,181 | 248.0 s | $0.71 |
| Claude Sonnet 4 | Action & Transaction | AX-Tree + Vision | 24.23 | 205,604 | 5,396 | 228.4 s | $0.70 |
| Claude Sonnet 4 | Action & Transaction | Vision | 46.26 | 419,990 | 16,154 | 468.9 s | $1.50 |

Cost vs Performance Analysis

The following scatter plot visualizes the trade-off between cost per task and completion rate for different agent configurations, averaged across all tasks. Each point represents an agent setup with different observation spaces (AX-Tree, AX-Tree + Memory, AX-Tree + Vision, Vision) and models (GPT4.1, Claude Sonnet 4). The plot uses a logarithmic scale for cost to better show the range of values.
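
The sketch below reproduces the plot's construction with matplotlib, using the average costs from the table "Token Usage, Cost and Runtime across all Tasks"; the overall completion rates are placeholder values, since the page reports completion rates per task group rather than averaged over all tasks:

```python
# Sketch of the cost-vs-performance scatter plot. Costs are taken from the
# token-usage table above; the completion rates are PLACEHOLDERS, as the
# page does not report a single overall completion rate per setup.
import matplotlib.pyplot as plt

setups = ["AX-Tree", "AX-Tree + Memory", "AX-Tree + Vision", "Vision"]
cost = {"GPT4.1": [0.31, 0.34, 0.32, 0.26],
        "Claude Sonnet 4": [0.85, 1.15, 1.23, 1.42]}
completion = {"GPT4.1": [0.45, 0.55, 0.48, 0.28],           # placeholder values
              "Claude Sonnet 4": [0.58, 0.60, 0.52, 0.06]}  # placeholder values

fig, ax = plt.subplots()
for model, marker in [("GPT4.1", "o"), ("Claude Sonnet 4", "s")]:
    ax.scatter(cost[model], completion[model], marker=marker, label=model)
    for x, y, name in zip(cost[model], completion[model], setups):
        ax.annotate(name, (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xscale("log")  # logarithmic cost axis, as in the original plot
ax.set_xlabel("Avg. cost per task (USD)")
ax.set_ylabel("Completion rate")
ax.legend()
plt.show()
```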

Running the WebMall Benchmark

Running the benchmark assumes a Unix-like operating system for Docker. If you are using Windows, please refer to the WSL setup for Docker.

How to Set Up the Shops

The setup consists of two Docker containers per shop (the shop itself and its database), which allows for a simple deployment using Docker Compose. Please refer to the installation guide on GitHub for setting up the shops.
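
Once the containers are running, a quick reachability check such as the following sketch can confirm that all four shops respond; the host names below are placeholders and must be replaced with the URLs of your own deployment:

```python
# Reachability check for the four shop containers; the URLs below are
# placeholders and must be replaced with your own deployment's host names.
import requests

SHOP_URLS = [f"http://webmall-{i}.example.local" for i in range(1, 5)]

for url in SHOP_URLS:
    try:
        status = requests.get(url, timeout=10).status_code
        print(f"{url}: HTTP {status}")
    except requests.RequestException as exc:
        print(f"{url}: unreachable ({exc})")
```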

Performing Experiments

Instructions on how to run experiments on single tasks as well as full studies over the benchmark task set are found in the GitHub installation guide. The system writes comprehensive logs during execution, including agent actions, observations, and performance metrics. All logs, including summary results, are stored in the output directory set in the .env file for later analysis.
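
As a starting point for post-hoc analysis, the summary files in the output directory can be loaded as in the sketch below; the environment variable name and the file name pattern are assumptions, not the documented layout:

```python
# Hypothetical loading of summary results; WEBMALL_OUTPUT_DIR and the
# "*summary*.json" pattern are assumptions, not the documented layout.
import json
import os
from pathlib import Path

output_dir = Path(os.environ.get("WEBMALL_OUTPUT_DIR", "./results"))

for summary_file in sorted(output_dir.glob("**/*summary*.json")):
    summary = json.loads(summary_file.read_text())
    print(summary_file, summary)
```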

Related Work

[Yehudai2025] survey benchmarks for the evaluation of LLM agents and categorize them according to the agent's application domain as well as the agent capabilities that are evaluated. Other benchmarks that also evaluate the capability of Web agents to perform online shopping are the WebShop benchmark as well as the WebArena [Zhou2023] and Mind2Web benchmarks, which feature e-commerce tasks as part of a wider task set. Compared to these benchmarks, WebMall requires agents to perform longer-running tasks (due to visiting multiple shops), to deal with heterogeneous product data originating from different real-world sources, and to perform advanced searches such as finding compatible or substitute products. A shopping benchmark that requires agents to perform product searches including conditions on user ratings and shipping details is DeepShop [Lyu2025]. DeepShop requires agents to search the live Web, which limits the reproducibility of experimental results due to the evolving nature of the Web. In contrast, WebMall provides a simulated environment that allows agents to be compared using exactly the same experimental setup.

Feedback

We welcome feedback and contributions via GitHub issues and discussions. Alternatively, you can also contact the authors of the benchmark directly via email.

References

[Brinkmann2023] Brinkmann, Alexander, et al.: The Web Data Commons Schema.org Data Set Series. Companion Proceedings of the ACM Web Conference, 2023.

[Chezelles2025] Le Sellier De Chezelles, Thibault, et al.: The BrowserGym Ecosystem for Web Agent Research. arXiv:2412.05467, 2025.

[Yehudai2025] Yehudai, Asaf, et al.: Survey on Evaluation of LLM-based Agents. arXiv:2503.16416, 2025.

[Zhou2023] Zhou, Shuyan, et al.: WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023.

[Lyu2025] Lyu, Yougang, et al.: DeepShop: A Benchmark for Deep Research Shopping Agents. arXiv:2506.02839, 2025.