Warning: long comment ahead! Read at your own risk. If you ask me for a tl;dr I'll kick you in the shins.
I'm so happy to see them measuring "median item impressions" rather than "mean item impressions." Many of the underlying variables describing consumer behavior aren't normally distributed. Talking about "average number of friends invited" when 50% of your users invite 0, 25% invite 1, 13% invite 2, etc. but 3 users invite 5,000 will necessarily lead you to propose bad product ideas and decisions.
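To make the mean-versus-median point concrete, here is a minimal sketch. The population below is made up for illustration (1,000 hypothetical users roughly following the shape described above); it is not Etsy data.

```python
from statistics import mean, median

# Hypothetical population of 1,000 users: half invite nobody,
# most of the rest invite one to three friends, and 3 users invite 5,000 each.
invites = [0] * 500 + [1] * 250 + [2] * 130 + [3] * 117 + [5000] * 3

mean_invites = mean(invites)      # pulled far upward by the 3 heavy inviters
median_invites = median(invites)  # reflects what a typical user actually does

print(f"mean={mean_invites:.2f} median={median_invites}")
```

In this toy population the mean lands above 15 invites per user while the median stays below 1, so a product decision based on the "average user" would describe almost nobody.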
However, I'm curious about the experimental design. For example, was this tested on new users, old users, or both? Changing a fundamental part of a site's experience like this will have some cost as users acclimate. I'd wager Etsy's audience is less technically inclined, too, so it might take them longer to acclimate.[1]
They also commit a small fallacy when they talk about how they should have done it instead, and IMO it's a fallacy that frequent A/B testing encourages people to commit. They suggest that instead they should have determined first whether more items are better and faster items are better.
On the most surface level, perhaps there's something about more items AND faster items that outperforms either one or the other in isolation. That's easy enough to accomplish, technically. You use different statistical tests, but it's possible at the cost of perhaps a larger sample size.
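As a rough sketch of what "more AND faster" means in a 2x2 design, the interaction can be read off as a difference of differences between the four cells. The conversion rates and the `interaction` helper below are hypothetical, and this only gives a point estimate; deciding whether it is real would still need a proper significance test and, as noted, probably a larger sample.

```python
# Hypothetical conversion rates for each cell of a 2x2 factorial test.
# Keys are (more_items, faster_items). All numbers are invented for illustration.
cell_rate = {
    (False, False): 0.100,  # control
    (True, False): 0.102,   # more items only
    (False, True): 0.103,   # faster items only
    (True, True): 0.115,    # both together
}


def interaction(rates):
    """Difference of differences: how much the effect of 'more items'
    changes depending on whether 'faster items' is also switched on."""
    effect_more_when_slow = rates[(True, False)] - rates[(False, False)]
    effect_more_when_fast = rates[(True, True)] - rates[(False, True)]
    return effect_more_when_fast - effect_more_when_slow


estimate = interaction(cell_rate)
print(f"interaction estimate: {estimate:+.3f}")
```

A clearly nonzero interaction would mean that testing "more items" and "faster items" separately cannot tell you how the combination performs.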
On a deeper level, you're providing the users with a fairly different overall user experience. Their sense of where things are placed, what they're supposed to do when they want to "see more," how they know they have the opportunity to "see more," etc. are aspects of the infinite scroll design that aren't encapsulated in either rendering more items or rendering those items more quickly.
For example, can users bookmark specific search result pages under the current design? Can they still do the same thing under the "infinite scroll" design? I imagine there are lots of little things like this and that the UX difference alone would have a larger impact on the results than just changing the number of products per page.
To get more meaningful results from this, I'd run this experiment under the following assumptions.
1. Assume that existing users will be more impacted by this change than new users. Therefore the cost of "failure" for existing users is higher.
2. Assume that at the end of the day the #1 thing Etsy cares about is "dollar throughput" of the Etsy platform. Engagement, favoriting, searching, etc. are all positive indicators of an increased dollar velocity.
3. Assume they have information about what aspects of a user's first visit are indicators of their long-term ability to contribute to Etsy's dollar throughput.
4. Assume that eventually every user will have the same experience, new or existing alike.
So, I'd start by running the experiment with new users only. Over the course of a week or a month I'd put a % of the users who joined each day into the "infinite scroll" bucket. I'd then run the study as a longitudinal study.
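One way the daily bucketing could be done is to hash each new user's ID so the assignment is stable across visits. This is only a sketch: the function name, the salt, and the bucket labels are my own placeholders, not anything Etsy has described.

```python
import hashlib


def assign_bucket(user_id: str, treatment_pct: float,
                  salt: str = "infinite-scroll-v1") -> str:
    """Deterministically place a new user in 'infinite_scroll' or 'control'.

    treatment_pct is the fraction of new users (0.0 to 1.0) who should see
    the new design. The salt is a hypothetical per-experiment string so that
    different experiments do not reuse the same split.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    position = int(digest[:8], 16) / 0x100000000
    return "infinite_scroll" if position < treatment_pct else "control"


# Example: roughly 10% of the users who join on a given day get the new design.
todays_new_users = [f"user-{n}" for n in range(10_000)]
buckets = [assign_bucket(uid, 0.10) for uid in todays_new_users]
treated_share = buckets.count("infinite_scroll") / len(buckets)
print(f"share in infinite scroll bucket: {treated_share:.3f}")
```

Because the bucket depends only on the user ID and the salt, a user stays in the same group for the whole longitudinal study without any extra state being stored.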
Assumption (3) can guide us as to whether we need to cut off the experiment early. The length of the study would be determined by the particularities of an Etsy user's life-cycle, e.g., maybe given a cohort of users, we care about the length of time it takes 75% of the eventual purchasers to make their first purchase.
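The "time until 75% of eventual purchasers have bought" figure is just a percentile over a past cohort. A minimal sketch, using invented numbers and a hypothetical helper name, with a simple nearest-rank percentile:

```python
import math


def days_until_fraction_purchased(days_to_first_purchase, fraction=0.75):
    """Return the number of days by which `fraction` of a cohort's eventual
    purchasers had made their first purchase (nearest-rank percentile)."""
    if not days_to_first_purchase:
        raise ValueError("cohort has no purchasers")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(days_to_first_purchase)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Hypothetical historical cohort: days from signup to first purchase,
# one entry per user who eventually bought something.
cohort = [0, 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 60]
study_length_days = days_until_fraction_purchased(cohort, 0.75)
print(study_length_days)
```

Note that "eventual purchasers" can only be counted for a cohort that is already old enough, so this would have to be estimated from past cohorts before the experiment starts.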
Because of assumption (4) we know that if the "infinite scroll" design is terrible for new users, we never have to bother testing it on existing users.
[1]: Non-technical users, in particular, are sensitive to sudden change. I forget where, but I read a research paper once that implied that the worst thing you can do to harm a person's user experience is change the placement of links, buttons, etc. You can change the color, text, icons, etc. but if you change the placement, they essentially have to "re-learn" the interface.
IIRC, the users were given a task (e.g., "create a document") and they measured two core variables: time to accomplish the given task and time until their "performance" at a given task was equivalent to the control user interface.
Changing the placement of a certain action in the UI had a deeper and longer-lasting impact on users' ability to perform tasks than changing anything else about the UI by a large margin.
From the article's statement that the point of the experiment was to prove it was good and then celebrate, I got the impression that it had very little planning. This probably means it was a straight X% of traffic A/B test and the results were only analysed deeply when they turned out not to be positive. This is only speculation based on the tone of the article.
I didn't rewatch the entire presentation, but McKinley does discuss the user makeup of the experimental groups. Yes, they do account for different kinds of users, and the most drastic difference in user behavior is between sellers and non-sellers. Someone from Etsy would have to talk about how much slicing and dicing of the demographics they do. But even if infinite scroll were good for some users (new users without pagination-related habits) and not for others, it's probably not a good idea to have two kinds of search experience in the hopes that the "oldies" eventually figure it out, based solely on how hard it is to implement infinite scroll in the technical sense.
It's pretty common for us to look at new vs. returning users and Etsy sellers vs. others (sellers are obviously really engaged users, and behave differently). Occasionally one group will stand out, but not in this case.