
Stripe Inc.

November 12, 2024

How we built it: Payment method A/B testing

A/B testing is the gold-standard approach to evaluating impact, with many online businesses running experiments to optimize everything from their website to their product experience. The same is true for Stripe: we run A/B tests and share our insights with users to help improve their checkout experience and increase their bottom line. However, users also tell us that they want to set up and run their own experiments that are tailored to their unique business. In particular, users want concrete data to understand how introducing new payment methods will impact their conversion and revenue.

Getting the data required to make strategic payment decisions like this one is hard. You either have to integrate with third-party experimentation software or allocate significant internal resources to running the experiments yourself. Then, you have to understand checkout click-through rates, payment conversion rates, and the frequency of refunds and disputes. All these factors play a part in determining a payment method's overall revenue and conversion impact, and it's more information than most businesses have the time and resources to collect and analyze on their own.

At Sessions earlier this year, we launched our no-code payment method A/B testing tool to give you a better way of determining which payment methods to add to your checkout page.

This feature, which is part of our Optimized Checkout Suite, allows you to select different payment methods to be part of your control and treatment buckets (the A and B buckets of A/B testing). You specify a traffic allocation split between 0.01% and 99.99% to indicate what percentage of your traffic should be exposed to the treatment bucket (the rest see the control). Then, based on the performance of the treatment and control buckets, we surface relevant metrics such as conversion rate, average order value, and average revenue per session so you can understand which bucket performed better.

This tool was an industry first, and building it meant finding creative solutions to some of the most complex scenarios in A/B testing, such as improving time to statistical significance with a small sample size, avoiding dilution, and calculating uplift. We'll take you through some of the key decisions we made and lessons we learned along the way, which can also help inform and improve your business's experiments, regardless of what you're testing.

Lesson one: Improve time to statistical significance by creating new data points

The time for an A/B experiment to reach statistical significance is inversely proportional to the number of data points observed during the experiment. Depending on the size and scale of a user's transactions and customer base, it can take a long time for some experiments to reach statistical significance since we count each checkout session as a data point in the experiment.

This presented a challenge: how could we increase the sample size so all users, even those with a small pool of customers, could get their results faster? Of course, there is no way to artificially increase transaction volume. However, we realized there was another way to grow the sample size: adding a time window component that allows the same customer to appear in both the control and treatment groups over time. For a specific time interval, a customer sees the same treatment (that is, the same set of payment methods) for all their purchases at a given business, but once that interval elapses, they can see a different set of payment methods on their next purchase at the same business. This increases the number of experiment data points, because each customer's activity in each time interval now counts as a separate data point, reducing the time to statistical significance.

To implement this methodology, we used UserAgent, IP address, and the time window component* as inputs to a deterministic hash function. With this hash function, each customer is randomly assigned a number between 1 and 10,000, and that number is compared to the experiment's session allocation split to determine the experiment outcome. For example, if a user sets up a 90/10 experiment split, any customer with a number above 1,000 sees the treatment and any customer with a number of 1,000 or below sees the control (or vice versa).
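To make the mechanics concrete, here is a minimal sketch of this kind of deterministic bucketing in Python. It is not Stripe's implementation: the hash function, the window length, and names like `assign_bucket` are assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional

# Hypothetical sketch, not Stripe's implementation: hash choice, window length,
# and function names are assumptions.
WINDOW_SECONDS = 7 * 24 * 60 * 60  # assumed one-week time window component


def assign_bucket(user_agent: str, ip_address: str, treatment_bps: int,
                  now: Optional[datetime] = None) -> str:
    """Deterministically assign a checkout session to 'treatment' or 'control'.

    treatment_bps is the treatment allocation in basis points out of 10,000
    (e.g., 9_000 means 90% of traffic sees the treatment).
    """
    now = now or datetime.now(timezone.utc)
    # The window index changes once per interval, so a returning customer keeps
    # the same assignment within a window but can flip in the next one.
    window_index = int(now.timestamp()) // WINDOW_SECONDS

    key = f"{user_agent}|{ip_address}|{window_index}".encode()
    digest = hashlib.sha256(key).digest()
    # Map the hash to a number between 1 and 10,000.
    bucket_number = int.from_bytes(digest[:8], "big") % 10_000 + 1

    return "treatment" if bucket_number <= treatment_bps else "control"


# Example: a 90/10 split where 90% of sessions see the treatment.
print(assign_bucket("Mozilla/5.0 ...", "203.0.113.7", treatment_bps=9_000))
```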

This approach keeps a customer's exposure to the A/B test stable across Stripe's systems: a returning customer using the same IP address and UserAgent will see the same set of payment methods for a given time window. Keeping the randomization key simple also means we can ensure a consistent payment experience no matter which combination of Stripe UIs is used (for example, Stripe Checkout and the Payment Element).

Lesson two: Avoid dilution by testing eligibility before triggering an experiment

All A/B tests, regardless of what you're testing, need to be controlled for dilution. Dilution, which happens when customers in the treatment group don't actually see anything different from the control group, adds noise to the experiment dataset and results in the experiment taking longer to reach statistical significance. In our case, avoiding dilution is tricky because payment methods have constraints that affect whether a customer can use them to check out.

For example, let's say a business wants to run an A/B test with a buy now, pay later (BNPL) payment method that is only available for transactions of $50 or more. If the majority of the business's transactions are less than $50, those payment sessions would only display the control behavior and not the BNPL option, even when customers are assigned to the treatment group. If these transactions get included in the experiment analysis, they push the metrics difference (or "effect size") between the control and treatment buckets toward zero, reducing the "statistical power" and increasing the time to statistical significance. Other variables that can lead to dilution include transaction type (one time or recurring), currency, merchant category, and custom payment method rules. These can all prevent a payment method from being displayed on the checkout page even when a session is assigned to the treatment group.
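To see why this matters, here is a rough back-of-the-envelope calculation (illustrative numbers only, not figures from a real experiment): the measured effect shrinks in proportion to the share of eligible sessions, and the sample size required to detect it grows roughly with the square of that shrinkage.

```python
# Illustrative numbers only: how dilution shrinks effect size and inflates sample size.
eligible_share = 0.20   # suppose only 20% of transactions meet the BNPL's $50 minimum
true_lift = 0.05        # hypothetical 5% conversion lift among eligible sessions

# Ineligible sessions see no difference, so they drag the measured lift toward zero.
diluted_lift = eligible_share * true_lift   # about 0.01, i.e., a 1% measured lift

# Required sample size scales roughly with 1 / effect_size**2, so a 5x smaller
# effect needs roughly 25x as many sessions to reach the same statistical power.
sample_inflation = (true_lift / diluted_lift) ** 2
print(round(diluted_lift, 4), round(sample_inflation, 1))   # 0.01 25.0
```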

We can avoid dilution by testing the eligibility of a transaction before triggering an experiment. This way, we can ensure that only eligible transactions are counted toward the experiment's outcome. Checking whether a single payment method is eligible is straightforward: compare that payment method's constraints against the transaction. But it becomes much harder to manage when multiple payment methods are A/B tested simultaneously.

To compute the payment method eligibility at scale, we synchronously validate all control and treatment payment methods for that session to see which ones are eligible to be displayed. Once we confirm that the payment methods being A/B tested are eligible for the session, we then randomly assign an experiment outcome to the transaction and return the corresponding set of payment methods.

The algorithm follows these steps, with a code sketch after the list:

  1. Create a superset of control- and treatment-enabled payment methods.
  2. Filter the superset by removing all payment methods that don't meet general constraints.
  3. Create subsets from the filtered superset for control and treatment.
  4. Filter each subset by removing all payment methods that don't meet their respective payment method rule constraints.
  5. Compare filtered control and treatment subsets. If at least one payment method is different, mark the session as eligible for the experiment and count its outcome toward the experiment results.
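Here is a simplified sketch of that eligibility check in Python. The PaymentMethod and Transaction structures, the constraint fields, and the single is_eligible check (which folds the general constraints and per-method payment method rules into one function for brevity) are assumptions for illustration, not Stripe's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentMethod:
    name: str
    min_amount: int = 0                                    # minimum charge in cents; 0 = no minimum
    currencies: set = field(default_factory=lambda: {"usd"})


@dataclass
class Transaction:
    amount: int                                            # amount in cents
    currency: str = "usd"


def is_eligible(method: PaymentMethod, txn: Transaction) -> bool:
    """Stand-in for both the general constraints and payment method rules."""
    return txn.amount >= method.min_amount and txn.currency in method.currencies


def session_in_experiment(control, treatment, txn):
    """Return (counts_toward_experiment, eligible_control, eligible_treatment)."""
    # Steps 1-2: filter the combined superset by general constraints.
    superset = {m.name: m for m in control + treatment}
    valid = {name for name, m in superset.items() if is_eligible(m, txn)}

    # Steps 3-4: derive the control and treatment subsets from the filtered superset.
    eligible_control = {m.name for m in control if m.name in valid}
    eligible_treatment = {m.name for m in treatment if m.name in valid}

    # Step 5: the session only counts if the two buckets would render differently.
    return eligible_control != eligible_treatment, eligible_control, eligible_treatment


# Example: BNPL requires a $50 minimum, so a $30 cart is not an experiment data point.
card = PaymentMethod("card")
bnpl = PaymentMethod("bnpl", min_amount=5_000)
print(session_in_experiment([card], [card, bnpl], Transaction(amount=3_000)))
# -> (False, {'card'}, {'card'})
print(session_in_experiment([card], [card, bnpl], Transaction(amount=8_000)))
# -> (True, {'card'}, {'card', 'bnpl'}) (set ordering may vary)
```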

This approach introduces additional latency because we compute both results up front, but it allows us to ensure that everyone in the treatment group sees a different set of payment methods from the control group, which reduces dilution so users can get the results of their A/B test faster.

Lesson three: Calculate results by indirectly joining events

We wanted A/B testing to be supported out of the box for all user integration types, including for users who finalize payments using calls from their own servers. However, unlike with payments confirmed client-side using Stripe.js, for these server-side confirmations we don't receive certain information (UserAgent or IP address) that we need to define control and treatment groups.

This prevents us from connecting two important events, which are required to calculate the conversion uplift for each treatment:

  • The "render event," which logs which payment methods were shown in a session. This is where the treatment assignment happens (which denotes whether the session was in the control or treatment group).
  • The "confirm event," which indicates that the customer made a payment during that session. This logs which payment method was used when the session was confirmed.

To solve this problem, we use PaymentMethod metadata to facilitate the joining between "render" and "confirm" events. It works like this:

  • When payment methods are rendered to the customer, a "render" event is fired, which includes the treatment assignment, UserAgent, IP address, and a unique ID created when Stripe.js was loaded on the customer's page.
  • The customer chooses a payment method, fills in the necessary details to check out, then submits the payment. A PaymentMethod object is created that includes the same unique ID in its metadata.
  • The business confirms the transaction server-side using the PaymentIntents API with the PaymentMethod's ID as a parameter.

We then built a data pipeline that queries server-side "confirm" events, joins them with payment methods by PaymentMethod ID, and then joins the "confirm" events with "render" events using the unique ID from the Stripe.js load. The pipeline stores the final join results in a single table that we use to aggregate experiment summary results and that users can export as a downloadable report.
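As a rough illustration of this two-step join, here is a sketch that uses in-memory records in place of the real event tables. The field names (stripejs_id, payment_method_id, and so on) are assumptions for illustration, not Stripe's actual schema.

```python
# Hypothetical sketch of the join; field names and record shapes are assumptions.
render_events = [
    {"stripejs_id": "uid_1", "bucket": "treatment", "methods_shown": ["card", "bnpl"]},
    {"stripejs_id": "uid_2", "bucket": "control", "methods_shown": ["card"]},
]

payment_methods = [
    {"payment_method_id": "pm_123", "metadata": {"stripejs_id": "uid_1"}, "type": "bnpl"},
]

confirm_events = [
    {"payment_method_id": "pm_123", "amount": 8_000, "status": "succeeded"},
]


def join_experiment_results():
    pm_by_id = {pm["payment_method_id"]: pm for pm in payment_methods}
    render_by_uid = {r["stripejs_id"]: r for r in render_events}

    results = []
    for confirm in confirm_events:
        # Join 1: "confirm" event -> PaymentMethod, via the PaymentMethod ID.
        pm = pm_by_id.get(confirm["payment_method_id"])
        if pm is None:
            continue
        # Join 2: PaymentMethod -> "render" event, via the unique Stripe.js load ID
        # carried in the PaymentMethod's metadata.
        render = render_by_uid.get(pm["metadata"].get("stripejs_id"))
        if render is None:
            continue
        results.append({
            "bucket": render["bucket"],
            "payment_method_type": pm["type"],
            "amount": confirm["amount"],
            "converted": confirm["status"] == "succeeded",
        })
    return results


print(join_experiment_results())
# -> [{'bucket': 'treatment', 'payment_method_type': 'bnpl', 'amount': 8000, 'converted': True}]
```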

This approach allows us to evaluate and surface the right data to users, no matter how they choose to integrate with Stripe.

Driving more revenue for businesses

We love that our work abstracts away this complexity so businesses can make strategic payment decisions that increase their bottom line. Stripe users such as Mixtiles and Indiegogo have seen a significant increase in conversion after using our A/B testing tool to introduce new payment methods.

"The A/B testing feature allowed us to carefully experiment with new payment methods like Cash App Pay, WeChat Pay, and Amazon Pay," said John Stokvis, principal product manager at Indiegogo. "The results were impressive: we saw a 2% increase in conversion in our first experiment with Cash App Pay. This insight not only gave us the confidence to implement these new payment options but also provided compelling proof to share with our team and get buy-in for supporting more options. Now, we're able to continuously optimize our checkout and back every decision with solid data."

We are continuing to invest in payment method A/B testing to help you capture even more revenue. At Stripe Tour New York, we introduced the ability to run A/B tests on popular wallets such as Apple Pay and Google Pay. And today, we're announcing payment method A/B testing for businesses using Stripe Checkout, expanding on our goal to make our A/B testing tool work on every prebuilt payment UI on Stripe.

You can set up an A/B test directly from the Stripe Dashboard if you use Payment Element or Checkout with dynamic payment methods. To learn more about our payment method A/B testing tool, read our docs. If solving tough, meaningful engineering, data science, and design challenges excites you, consider becoming part of our Payments team.

*We collect UserAgent and IP address based on the user's instruction and where permitted by applicable law.