Global Voices - RSS http://feedraider.com/rss-feed/9zh8f/
What is the benefit of freaking customers out? http://glinden.blogspot.com/2010/08/what-is-benefit-of-freaking-customers.html Retargeting Ads Follow Surfers to Other Sites", on a form of personalized web advertising now being called retargeting.

An excerpt:
People have grown accustomed to being tracked online and shown ads for categories of products they have shown interest in, be it tennis or bank loans.

Increasingly, however, the ads tailored to them are for specific products that they have perused online. While the technique, which the ad industry calls personalized retargeting or remarketing, is not new, it is becoming more pervasive as companies like Google and Microsoft have entered the field. And retargeting has reached a level of precision that is leaving consumers with the palpable feeling that they are being watched as they roam the virtual aisles of online stores.

In remarketing, when a person visits an e-commerce site and looks at say, an Etienne Aigner Athena satchel on eBags.com, a cookie is placed into that person’s browser, linking it with the handbag. When that person, or someone using the same computer, visits another site, the advertising system creates an ad for that very purse.
The article later goes on to contrast this technique of following you around with products you looked at before with behavioral targeting like Google is doing, which learns your broader category interests and shows ads from those categories.

If the goal of the advertising is to be useful and relevant, though, I think both of these are missing the mark. What you want to do is help people discover something they want to buy. Since the item they looked at before obviously wasn't quite right -- they didn't buy it after all -- showing that again doesn't help. Showing closely related alternatives, items that people might buy after rejecting the first item, could be quite useful though.
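
To make the contrast concrete, here is a minimal sketch of the two choices: showing the viewed item again versus showing what other shoppers bought after viewing but not buying it. The session format, the item names, and both function names are hypothetical and made up for illustration; no real ad system's API is implied.

```python
from collections import Counter


def retargeted_ad(viewed_item):
    """Retargeting as described in the excerpt: show the same item again."""
    return viewed_item


def related_alternatives(viewed_item, sessions, k=2):
    """Rank items that other shoppers bought after viewing, but not buying, viewed_item."""
    counts = Counter()
    for session in sessions:
        rejected = viewed_item in session["viewed"] and viewed_item not in session["bought"]
        if rejected:
            counts.update(session["bought"])
    return [item for item, _ in counts.most_common(k)]


# Hypothetical browsing sessions (made-up data, for illustration only).
sessions = [
    {"viewed": {"satchel_a", "tote_b"}, "bought": {"tote_b"}},
    {"viewed": {"satchel_a", "tote_b", "clutch_c"}, "bought": {"tote_b"}},
    {"viewed": {"satchel_a", "clutch_c"}, "bought": {"clutch_c"}},
    {"viewed": {"satchel_a"}, "bought": {"satchel_a"}},
]

print(retargeted_ad("satchel_a"))
print(related_alternatives("satchel_a", sessions))
```

The second function only counts sessions where the item was looked at and passed over, which is the signal the argument above cares about.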

As marketing exec Alan Pearlstein says at the end of the NYT article, "What is the benefit of freaking customers out?" Remarketing freaks people out. If we are going to do personalized advertising, the goal should be to have the advertising be useful, either by sharing value with consumers using coupons as Pearlstein suggests, or by helping consumers find something interesting that they wouldn't have discovered on their own.

But, publishers should be careful when working with these new ad startups. A startup has a huge incentive to maximize short-term revenue and little incentive to maximize relevance. For the startup, as long as it brings in more immediate revenue, it is perfectly fine to show annoying ads that freak customers out and drive many away. Publishers need to force the focus to be on the value of the ads to the consumer so their customers are happy, satisfied, and keep coming back.
Mon, 30 Aug 2010 17:01:00 GMT
Measuring online brand advertising without experiments http://glinden.blogspot.com/2010/08/measuring-online-brand-advertising.html abstract, PDF), at the KDD 2010 conference.

The paper turns out to be a quite interesting attempt to measure the impact of online display advertising -- a notoriously difficult problem -- by looking at how it changes people's searching and browsing online. That's hard enough, but these crazy Googlers also are trying to do this without using A/B testing. To do that last trick, they separate people into the exposed, who have seen the ad, and the controls, who have not, while carefully limiting the controls to people who are similar to the exposed.

From the paper:
Traditionally, online campaign effectiveness has been measured by "clicks" ... However, many display ads are not click-able ... and some campaigns hope to build longer-term interest in the brand rather than drive immediate response. Counting clicks alone then misses much of the value of the campaign.

Better measures of campaign effectiveness are based on the change in online brand awareness ... [due] to the display ad campaign alone. We ... [find] the change in probability that a user searches for brand terms or navigates to brand sites that can be attributed to an online ad campaign.

Randomized experiments ... are the gold standard for estimating treatment effects ... [but it] requires an advertiser to forego showing ads to some users ... [which] advertisers are not keen to [do] ... Estimation without randomization is more difficult but not always impossible .... Simply put, the controls [we pick] were eligible to be served campaign ads but were not.

Our estimates require summary (not personally identifiable) data on exposed and controls. The summary data are obtained from several sources, including the advertiser's own campaign information, ad serving logs, and sampled data from users who have installed Google toolbar and opted in to enhanced features.
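
The exposed-versus-controls comparison can be sketched roughly as below. This is not the paper's estimator; it is a deliberately simplified stand-in that assumes "similar" users can be grouped by a single hypothetical `stratum` key, compares brand-search rates within each group, and weights the differences by the number of exposed users. All field names and data are invented for illustration.

```python
from collections import defaultdict


def attributable_lift(users):
    """Estimate the change in P(brand search) for the exposed versus similar controls."""
    strata = defaultdict(lambda: {"exposed": [], "control": []})
    for user in users:
        group = "exposed" if user["exposed"] else "control"
        strata[user["stratum"]][group].append(1 if user["searched_brand"] else 0)

    weighted_diff = 0.0
    total_exposed = 0
    for groups in strata.values():
        exposed, control = groups["exposed"], groups["control"]
        if not exposed or not control:
            continue  # no comparable users in this stratum, so skip it
        diff = sum(exposed) / len(exposed) - sum(control) / len(control)
        weighted_diff += diff * len(exposed)
        total_exposed += len(exposed)

    if total_exposed == 0:
        raise ValueError("no stratum contains both exposed users and controls")
    return weighted_diff / total_exposed


def make_users(stratum, exposed, outcomes):
    """Build made-up user records; outcomes are 1 if the user searched for the brand."""
    return [
        {"stratum": stratum, "exposed": exposed, "searched_brand": bool(o)}
        for o in outcomes
    ]


users = (
    make_users("news_readers", True, [1, 1, 0, 0])
    + make_users("news_readers", False, [1, 0, 0, 0])
    + make_users("sports_readers", True, [1, 0])
    + make_users("sports_readers", False, [1, 0])
)

print(round(attributable_lift(users), 4))
```

Without randomization, the estimate is only as good as the notion of similarity used to pick the controls, which is why the paper puts so much care into that step.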
By the way, some have speculated in the past ([1] [2]) that Google toolbar data is being used for Google's advertising, but there was no public confirmation of that from Google. To my knowledge, this is the first public confirmation that data from Google's ubiquitous toolbar is being used by them in at least some way in their advertising.

For more on related topics, please see also my November 2008 post, "Measuring offline ads by their online impact", and my July 2008 post, "Google Toolbar data and the actual surfer model".
Thu, 19 Aug 2010 17:52:00 GMT
Human computation and lemons http://glinden.blogspot.com/2010/08/human-computation-and-lemons.html Mechanical Turk, Low Wages, and the Market for Lemons", that looks at why wages are so low, usually well below minimum wage, on Amazon's MTurk.

His theory is that spammers and cheaters have turned MTurk into a market for lemons. The quality is now so bad that buyers demand a risk premium and require redundant work for quality checks, splitting what might be a risk-reduced fair wage three to five ways among the workers.

An excerpt from his post:
A market for lemons is a market where the sellers cannot evaluate beforehand the quality of the goods that they are buying. So, if you have two types of products (say good workers and low quality workers) and cannot tell who is whom, the price that the buyer is willing to pay will be proportional to the average quality of the worker.

So the offered price will be between the price of a good worker and a low quality worker. What a good worker would do? Given that good workers will not get enough payment for their true quality, they leave the market. This leads the buyer to lower the price even more towards the price for low quality workers. At the end, we only have low quality workers in the market (or workers willing to work for similar wages) and the offered price reflects that.

This is exactly what is happening on Mechanical Turk today. Requesters pay everyone as if they are low quality workers, assuming that extra quality assurance techniques will be required on top of Mechanical Turk.

So, how can someone resolve such issues? The basic solution is the concept of signalling. Good workers need a method to signal to the buyer their higher quality. In this way, they can differentiate themselves from low quality workers.

Unfortunately, Amazon has not implemented a good reputation mechanism. The "number of HITs worked" and the "acceptance percentage" are simply not sufficient signalling mechanisms.
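
The unraveling described in the excerpt can be sketched as a small simulation. It assumes, purely for illustration, that buyers offer the average quality of whoever is still in the market and that each worker leaves once the offer drops below a reservation wage; the numbers and the function name are hypothetical.

```python
def lemons_market(qualities, reservation_wages, max_rounds=10):
    """Simulate a market where buyers cannot tell workers apart.

    Each round the offered price equals the average quality of the workers
    still present; workers whose reservation wage exceeds the offer leave.
    Returns the sequence of offers and the indices of workers who remain.
    """
    active = list(range(len(qualities)))
    offers = []
    for _ in range(max_rounds):
        if not active:
            break
        offer = sum(qualities[i] for i in active) / len(active)
        offers.append(offer)
        staying = [i for i in active if reservation_wages[i] <= offer]
        if staying == active:
            break  # nobody left this round, so the market has settled
        active = staying
    return offers, active


# Two good workers (quality 10) and two low quality workers (quality 2),
# each unwilling to work for less than their own quality. Made-up numbers.
qualities = [10, 10, 2, 2]
reservation_wages = [10, 10, 2, 2]

offers, remaining = lemons_market(qualities, reservation_wages)
print(offers, remaining)
```

A credible signal of quality would let buyers price the two groups separately, which is the fix the excerpt argues for.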
If you like Panos' post, you might also be interested in GWAP guru and CMU Professor Luis von Ahn's recent post, "Work and the Internet", where Luis bemoans the low wages on MTurk and questions whether they amount to exploitation. Panos' post is a response to Luis'.

Please see also my 2005 post, "Amazon Mechanical Turk?", where I wrote, "If I scale up by doing cheaper answers, I won't be able to filter experts as carefully, and quality of the answers will be low. Many of the answers will be utter crap, just made up, quick bluffs in an attempt to earn money from little or no work. How will they deal with this?"
Tue, 17 Aug 2010 01:29:00 GMT
Big redesign at Google News http://glinden.blogspot.com/2010/07/big-redesign-at-google-news.html widely reported that Google News has done a major redesign -- its first since 2002 apparently -- to more prominently feature personalization and customization.

Before I comment on it, in the interest of full disclosure, I should say that I am absurdly biased on this particular topic, having run Findory and talked at length over the years with Google, their partners, and their competitors about news personalization.

That said, I don't like what they've done. And I'm not the only one. Thomas Claburn at InformationWeek catalogs the complaints he is seeing at InformationWeek and elsewhere, summarizing it all by comparing it to the "New Coke" flop.

I think what the Google team has done is a lovely example of personalization done poorly, by people who really should know better. They change navigation links based on personalization even when confidence is low (one of my links in the left hand nav is for "Lindsay Lohan", which is hard to stomach). The article recommendations are often off, cannot be corrected, do not change in real time as you read articles, and there is no explanation of why something was recommended. There is no ability to see, edit, or rate your reading history. The ability to exclude or favor sources appears to be hacked on; the only way to do it is to manually type in the names of sources.

Under the surface, there appears to still be a lot of implicit personalization based on past behavior, but, from what someone using it sees, the focus is entirely on customization. I can "edit personalization" and "add sections" to put categories on my page. And that is about the limit of my control and the limit of the explanations of why articles are appearing. People like to be in control. They like to understand why something happens, especially if they don't agree with it. And Google News offers very little control or explanations.

Adding to the other problems, the design seems really busy and confused to me, like the Googlers can't decide what they are doing and -- in a fashion more typical of Microsoft -- just keep adding features. Hey, look, it's your fast flipping, clustered, personalized, customizable, widget-complete newspaper! Love it, it's Googly! C'mon, Google, what happened to keeping it clean, simple, and relevant?
Sat, 10 Jul 2010 00:34:00 GMT
Google to personalize metashopping http://glinden.blogspot.com/2010/06/google-to-personalize-metashopping.html interview with CNet, Googler Sameer Samat talks about Google's future plans for shopping search, including personalization and recommendations.

An excerpt:
One thing Google doesn't do very well is provide the shopping-as-adventure experience ... You might go to the mall with a specific product in mind, but a well-designed mall ... forces you to discover -- and hopefully purchase -- other products that you might not have even known you wanted: the marketing types like to call this "serendipity." Google wants to be known as a destination for that kind of experience, said Sameer Samat, director of product management.

After years of trying and failing to reach that goal, Google plans to give it another go over the coming months. Don't expect Google to turn into a full-blown online retailer among the likes of Amazon.com or Buy.com just yet. But the combination of personalized features for product search pages and what Samat thinks is "the largest database of products that has been created" could entice people to actually shop on Google.

Google's current approach works best for those who are on a mission when they shop, shoppers who already know what they want and are just looking for additional information before sealing the deal ... [But] there are millions of other people who treat shopping as leisure, rather than a simple transaction. These are people who ... prefer browsing to targeted shopping, knowing that every now and then they'll discover something totally unique or completely unexpected.

Google wants to serve more of those people ... [by making] recommendations based on that list of products and lists submitted by others to help you discover new products: sort of like Amazon's recommendations page meets Pandora's radio stations meets Google.

"Shopping is not just about search, it's not just about intent, it's about discovery," Samat said. "If we can do it, and do it well, we will have built something that's really amazing; it should be the most comprehensive experience for shopping you could ever find."
On a related note, Google is pushing aggressively to get retailers to use Google's commerce search engine to run their search experience. Each deal Google signs gives them more detailed information about another retailing vertical. It's all about the data.
Tue, 22 Jun 2010 17:44:00 GMT
Google on presentation bias in search http://glinden.blogspot.com/2010/06/google-on-presentation-bias-in-search.html PDF), that explores how much people tend to click on eye-catching search results rather than seeking the most relevant search results.

The work itself was pretty simple -- just looking at how bolding title and abstract terms changes clickthrough rates in A/B tests -- but I think the paper is worth a peek for two reasons. First, it is a decent survey of some of the current work on position and presentation bias. Second, it exposes some of Google's struggles with the difficulty of deriving searcher satisfaction from the noisy proxies that we have available like click data.

By the way, I love the fact, noted in the paper, that people tend to click on the last result much more than you would expect. The reason is that people don't linearly scan down a page, but often jump to the bottom and focus attention there. A decade ago at Amazon, the personalization team exploited this effect and seized the space at the bottom of most pages on the site for our features. You see, when we saw no one had built tools to track click and conversion data, we built them, and then we used them. No one else realized the value of the space at the bottom of the page, but we did.

For more on the struggle to evaluate search results from noisy click data, please see some of my older posts, "Modeling how searchers look at search results", "Finding task boundaries in search logs", and "Testing rankers by interleaving search results".
Mon, 14 Jun 2010 18:17:00 GMT
Travel itineraries from Flickr photo trails http://glinden.blogspot.com/2010/06/travel-itineraries-from-flickr-photo.html
The paper, "Automatic Construction of Travel Itineraries using Social Breadcrumbs" (PDF), cleverly uses the data often embedded in Flickr photos (e.g. timestamp, tags, sometimes GPS) to produce trails of where people have been in their travels. Then, they combine all those past trails to generate high quality itineraries for future tourists that tell them what to see, where to go, how long to expect to spend at each sight, and how long to allow for travel times between the sights.

Some excerpts from the paper:
Shared photos can be seen as billions of geo-temporal breadcrumbs that can promisingly serve as a latent source reflecting the trips of millions of users ... [We] automatically construct travel itineraries at large scale from those breadcrumbs.

By analyzing these breadcrumbs associated with a person's photo stream, one can deduce the cities visited by a person, which Points of Interest (POI) that the person took photos at, how long that person spent at each POI, and what the transit time was between POIs visited in succession.

By aggregating such timed paths of many users, one can construct itineraries that reflect the "wisdom" of touring crowds. Each such itinerary is composed of a sequence of POIs, with recommended visit times and approximate transit times between them.

[In surveys] users perceive our automatically generated itineraries to be as good as (or even slightly better than) itineraries provided by professional tour companies.
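
The per-user step described in the excerpts, turning a timestamped photo stream into visit durations and transit times, might look roughly like this. It is only a sketch under simplifying assumptions: each photo is already labeled with a POI (the paper has to infer this from tags and GPS), the tuple format and function names are hypothetical, and the aggregation across many users is left out.

```python
from datetime import datetime


def visits_from_photos(photos):
    """Collapse one user's photos into (poi, first_photo_time, last_photo_time) visits."""
    visits = []
    for poi, taken_at in sorted(photos, key=lambda photo: photo[1]):
        if visits and visits[-1][0] == poi:
            # Still at the same POI: extend the current visit.
            visits[-1] = (poi, visits[-1][1], taken_at)
        else:
            visits.append((poi, taken_at, taken_at))
    return visits


def stay_and_transit_minutes(visits):
    """Return minutes spent at each POI and minutes between consecutive POIs."""
    stays = [
        (poi, (last - first).total_seconds() / 60)
        for poi, first, last in visits
    ]
    transits = [
        ((a[0], b[0]), (b[1] - a[2]).total_seconds() / 60)
        for a, b in zip(visits, visits[1:])
    ]
    return stays, transits


# Made-up photo stream for one traveler, deliberately out of order.
photos = [
    ("Louvre", datetime(2010, 6, 1, 13, 15)),
    ("Eiffel Tower", datetime(2010, 6, 1, 9, 0)),
    ("Louvre", datetime(2010, 6, 1, 11, 15)),
    ("Eiffel Tower", datetime(2010, 6, 1, 10, 30)),
]

stays, transits = stay_and_transit_minutes(visits_from_photos(photos))
print(stays)
print(transits)
```

Time between the first and last photo at a POI is only a lower bound on the visit, one reason combining many travelers' trails matters.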
This reminds me quite a bit of the work on using GPS trails from mobile devices like phones (e.g. [1] or [2]) or search histories on maps (e.g. [3]). But, the use of Flickr photos as the data source is clever, especially for this application where the photos are also useful in the final output and the gaps in the data stream are not important.

Fun idea, nicely implemented, and very convincing results. Definitely worth a read. Don't miss the thoughts at the end on expansions to the idea, such as changing how the trails are filtered and aggregated based on individual preferences to generate personalized itineraries.
Tue, 8 Jun 2010 23:39:00 GMT
A Findory buyout offer from Yahoo? http://glinden.blogspot.com/2010/06/findory-buyout-offer-from-yahoo.html
Mon, 7 Jun 2010 18:38:00 GMT
How Bing predicts the CTR of ads http://glinden.blogspot.com/2010/06/how-bing-predicts-ctr-of-ads.html Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine", describes the algorithm actually used in the Bing search engine to predict the clickthrough rates of ads.

From the paper:
Recognising the importance of CTR estimation for online advertising ... Bing/adCenter decided to run a competition to entice people across the company to develop the most accurate and scalable CTR predictor.

The algorithm described in this publication tied for first place in the first competition and won the subsequent competition based on prediction accuracy. As a consequence, it was chosen to replace Bing's previous CTR prediction algorithm, a transition that was completed in the summer of 2009.
The paper goes on to describe why the problem is important, the algorithm used, and some of the nastiness of getting something that works in the lab to run on the live site.

Don't miss the tidbit at the end where they say that they are "investigating the use of more powerful models, such as the feature-based collaborative filtering method Matchbox (Stern, Herbrich, & Graepel, 2009) for latent feature discovery and personalisation."
Tue, 1 Jun 2010 23:28:00 GMT
Geeking with Greg administrivia http://glinden.blogspot.com/2010/05/geeking-with-greg-administrivia.html
When I started this blog in 2004, I wanted to bring personalization to information. Just as personalization and recommendations help people discover what they want in a massive product catalog, I thought personalization and recommendations could help tame the information streams that flood over us. Over the next five years, I wrote on this blog as I worked on personalized news, personalized search, and personalized advertising.

Regular readers may have noticed that posts on this blog have slowed down a fair bit over the last year. In large part, this is because I am no longer working on personalized search or personalized advertising, nor does my work still benefit from tracking what is going on in the information retrieval research community.

I try hard to keep this blog on topic. My plan is to have posts here continue to focus on personalized information, perhaps a bit on research papers, but mostly tracking the increasingly aggressive moves of the internet giants toward personalized search and advertising. But, that means posts will continue to be fairly infrequent.

If for some reason you can't get enough of geeking with me, if you really must have more, you might consider tracking what I am reading more broadly by following me in Google Reader. I use the shared items there as a link blog and comment broadly there on many topics.

Also, on this sixth anniversary, I welcome feedback on what you might enjoy seeing more of on this blog, especially from long-time readers. Do you mostly like the reviews of research papers (which, many have told me, are a great time-saver)? Comments on events in the industry (which might be snark, but perhaps useful snark)? Do you find the posts to be too long or too infrequent? Do you care if this blog stays on the topic of personalized information? If there is anything you want more of, please let me know in the comments!
Thu, 20 May 2010 18:17:00 GMT
Yahoo as an internet information filter http://glinden.blogspot.com/2010/05/yahoo-as-internet-information-filter.html Esquire article:
When you talk about the Internet growing to 225 million sites, you've got to ask: Who's parsing all that? How do you make sense of all that stuff?

I mean, who has time to wander all over the Internet?

Tomorrow's Yahoo! is going to be really tailored. I'm not talking about organization — organizing means that you already know what you want and somebody's just putting it in shape for you. I'm talking about both smart science and people culling through masses of information on the fly and figuring out what people want to know.

We will be delivering your interests to you. For instance, if you're a sports fan but have no interest in tennis, we won't show you tennis. We would know that you do things in a certain sequence, so we'd say, "Here's your portfolio. Here's some news you might like. Oh, you went to this movie last week, here's some other movies you might want to check out."

I call it the Internet of One. I want it to be mine, and I don't want to work too hard to get what I need. In a way, I want it to be HAL. I want it to learn about me, to be me, and cull through the massive amount of information that's out there to find exactly what I want.
Please see also my June 2009 post, "Yahoo CEO Carol Bartz on personalization", for more on Yahoo's vision of recommending information.

[Esquire article found via Nick Carr]
Fri, 14 May 2010 18:28:00 GMT
Google tries to save the news http://glinden.blogspot.com/2010/05/google-tries-to-save-news.html How to Save the News", on what Googlers think about the future of the news industry.

Some key excerpts:
"If you were starting from scratch, you could never possibly justify [the current] business model," [Google Chief Economist] Hal Varian said ... "Grow trees -- then grind them up, and truck big rolls of paper down from Canada? Then run them through enormously expensive machinery, hand-deliver them overnight to thousands of doorsteps, and leave more on newsstands, where the surplus is out of date immediately and must be thrown away? Who would say that made sense?" The old-tech wastefulness of the process is obvious, but Varian added a less familiar point. Burdened as they are with these "legacy" print costs, newspapers typically spend about 15 percent of their revenue on what, to the Internet world, are their only valuable assets: the people who report, analyze, and edit the news.

"Nothing that I see suggests the 'death of newspapers,'" [Google CEO] Eric Schmidt told me. The problem was the high cost and plummeting popularity of their print versions. "Today you have a subscription to a print newspaper," he said. "In the future model, you'll have subscriptions to information sources that will have advertisements embedded in them, like a newspaper. You'll just leave out the print part. I am quite sure that this will happen ... As print circulation falls, the growth of the online audience is dramatic ... Newspapers don't have a demand problem; they have a business-model problem." Many of his company’s efforts are attempts to solve this, so that newspaper companies can survive, as printed circulation withers away.

The three pillars of the new online business model, as I heard them invariably described, are distribution, engagement, and monetization. That is: getting news to more people, and more people to news-oriented sites; making the presentation of news more interesting, varied, and involving; and converting these larger and more strongly committed audiences into revenue, through both subscription fees and ads.

The best monetizing schemes are of course ones that people like -- ads they enjoy seeing, products for which they willingly pay. Online display ads should be better on these counts too, [Google VP Neal] Mohan said. "There are things we can do online that we simply can't do in print," he said. An ad is "intrusive" mainly if it is not related to what you care about at that time ... "The online world will be a lot more attuned to who you are and what you care about" ... Advertising has been around forever, Mohan said, "but until now it has always been a one-way conversation."
The entire article is well worth reading. It gives a great feel for how Googlers are thinking about the future of news (and is mostly in line with my own thoughts).

Please see also my Oct 2009 post, "Google CEO on personalized news".
Wed, 12 May 2010 04:20:00 GMT
Facebook's moves and personalized advertising http://glinden.blogspot.com/2010/04/facebooks-moves-and-personalized.html widely reported that Facebook has launched Open Graph and Implicit Personalization, which, among other things, give Facebook information about people's movements and what they like on the web. The service was launched as opt-out and, if you do want to opt out, requires diving into confusing privacy settings to do so.

The prolific discussion of this elsewhere has thoroughly exhausted most of what there is to say, but I wanted to emphasize two things about this launch.

First, the fact that Facebook is so aggressively seeking this "treasure trove" of browsing behavior data may signal a major shift in its revenue model. Prior claims aside, the company now may be realizing that it is hard to target advertising to profile information and status updates because there is no commercial intent. This new source of data -- the websites people are visiting and what they like -- contains the purchase intent that Facebook so desperately needs.

Second, as Steve Lohr at the NYT reported today, other companies considering heavy use of personalized advertising have been waiting for someone else to take the first step and bear the brunt of any privacy-related backlash. It will be interesting to see if Facebook's latest move -- which probably is aggressive enough to count as the first step everyone was waiting for -- will result in a backlash or will open the floodgates.
Fri, 30 Apr 2010 23:07:00 GMT
Google launches web search similarities http://glinden.blogspot.com/2010/04/google-launches-web-search-similarities.html reports.

For example, if I search for [engadget], at the bottom of the page, I see:
Pages similar to: www.engadget.com
Gizmodo ... gizmodo.com
Ubergizmo ... ubergizmo.com
Wired ... wired.com
Lifehacker ... lifehacker.com
The similarity algorithm also appears to have changed, with noticeably better quality in the spot checks I did.

This is a fairly big deal. Similarities based on aggregate behavior and targeted to the immediate context are a big step toward personalization. The next step is to tie the data to individual history and target similarities and recommendations both to the context and to each person's past history.

Google has done a version of personalized search for some time, but the technique used mostly was based on biasing search results toward people's long-term interests. More recently, they also started boosting previously clicked search results to help support re-finding.

Google's latest move should let Google add fine-grained personalization based on current missions and short-term trends, which, in combination with their current search personalization, is likely to improve Google's ability to help people find what they need.

Update: Here is the announcement of the new feature on the Official Google Blog.
Wed, 28 Apr 2010 23:18:00 GMT
Google News hybrid recommendations http://glinden.blogspot.com/2010/04/google-news-hybrid-recommendations.html ACM), at the recent IUI 2010 conference that describes a hybrid recommender system combining user-based and content-based recommendations. This new hybrid recommender now appears to be deployed on Google News.

Some excerpts from the paper:
[The] previous Google News recommendation system was developed using a collaborative filtering method. It recommends news stories that were read by users with similar click history. This method has two major drawbacks ... First, the system cannot recommend stories that have not yet been read by other users ... Second ... news stories [that] are generally very popular ... are constantly recommended to most of the users, even for those users who never [are interested because] ... there are always enough clicks ... to make the recommendation.

A solution to these two problems would be to build profiles of user's genuine interests and use them to make news recommendations. The profiles ... filter out the stories that are not of interest ... [and recommend stories] even if [they have] not been clicked on by other users ... Based on a user's news reading history, the recommender predicts the topic categories of interest ... News articles in those categories are ranked higher in the candidate list.

On average, the hybrid method ... improves the CTR [of] the existing collaborative method by 30.9% ... [and increased] the frequency of website visits in the test group [by] 14.1%.
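
As a rough illustration of the idea in these excerpts, the sketch below ranks candidate stories by multiplying a collaborative score by the user's predicted interest in the story's topic. The combination rule, the `cf_floor` value given to stories nobody has clicked yet, and all names and numbers are assumptions made for this example, not details confirmed by the excerpts.

```python
def hybrid_rank(candidates, cf_scores, topic_interest, cf_floor=0.1):
    """Rank articles by collaborative score times predicted topic interest.

    Articles with no clicks from other users get cf_floor instead of zero,
    so a brand-new story in a topic the user cares about can still surface.
    """
    def score(article):
        cf = cf_scores.get(article["id"], cf_floor)
        interest = topic_interest.get(article["topic"], 0.0)
        return cf * interest

    return sorted(candidates, key=score, reverse=True)


# Made-up candidates: one very popular story, one moderately popular story,
# and one new story that no similar user has read yet.
candidates = [
    {"id": "celebrity_story", "topic": "entertainment"},
    {"id": "chip_launch", "topic": "technology"},
    {"id": "new_startup", "topic": "technology"},
]
cf_scores = {"celebrity_story": 0.9, "chip_launch": 0.4}
topic_interest = {"technology": 0.8, "entertainment": 0.05}

for article in hybrid_rank(candidates, cf_scores, topic_interest):
    print(article["id"])
```

With these numbers the popular celebrity story drops to the bottom and the unread technology story moves above it, which mirrors the two drawbacks the paper sets out to fix.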
Hybrid recommenders are not that new. In the past, as in this paper, they usually were motivated by trying to deal with the sparsity and cold start problems that challenge collaborative filtering recommenders. Hybrid systems also have been used to deal with the so-called Harry Potter problem -- recommendations that focus too much on popular items -- by constraining the collaborative recommendations to the interests expressed in the profile, though that often can be better dealt with by tuning a collaborative recommender to discourage correlations between unpopular and popular items.

One thing that is surprising in this paper is the use of high-level topics rather than fine-grained topics. I would think that you would be better off getting as specific as possible on the profile, then branching out to related topics. The paper briefly addresses this, arguing that "specializing the user profile may limit the recommendations to news that the user already knew", but that seems like it would only happen if you rather foolishly only used read topics rather than including topics that appear to be related to read topics.

By the way, when you have as much data as Google should have, it is not at all clear you want to fall back on a content approach as they did in this paper. Yehuda Koren, for example, has convincingly argued that, when you have big data, latent factor models extract these content-based relationships automatically in much more detail and much more accurately than you could hope to do with a manually constructed model.

Finally, I cannot quite let this one go by without mentioning that Findory was a hybrid news recommender, launched in January 2004, that dealt with the cold start and sparsity problems of a collaborative recommender, the same problems the Google News team apparently is still struggling with six years later. Findory is not mentioned in this paper in the related work, but I know the Google team is quite aware of Findory.
Mon, 26 Apr 2010 00:45:00 GMT
Asking questions of your social network http://glinden.blogspot.com/2010/03/asking-questions-of-your-social-network.html for $50 million. The idea behind Aardvark is to provide a way to ask complicated or subjective questions of your friends and colleagues.

There are two papers that will be published at upcoming conferences that provide useful details on this idea. The first is actually by two members of the Aardvark team -- co-founder and Aardvark CTO Damon Horowitz and ex-Googler and Aardvark advisor Sep Kamvar -- and will be published at the upcoming WWW 2010 conference. The paper is called, "The Anatomy of a Large-Scale Social Search Engine" (PDF). An excerpt:
With Aardvark, users ask a question, either by instant message, email, web input, text message, or voice. Aardvark then routes the question to the person in the user's extended social network most likely to be able to answer that question.

Aardvark queries tend to be long, highly contextualized, and subjective -- in short, they tend to be the types of queries that are not well-serviced by traditional search engines. We also find that the vast majority of questions get answered promptly and satisfactorily, and that users are surprisingly active, both in asking and answering.
The paper is well written and convincing, establishing that this idea works reasonably well for a small (50k) group of enthusiastic early adopters. The paper does not answer whether this will work at large scale on a less motivated, lower quality mainstream audience. It also does not provide data to be able to evaluate the common criticism of asking questions of social networks, which is that, at large scale, the burden from a flood of often irrelevant incoming questions creates too much pain for too little benefit.

To get a bit more illumination on that question, turn to another upcoming paper, this one out of Microsoft Research and to be published at CHI 2010. The paper is "What Do People Ask Their Social Networks, and Why? A Survey Study of Status Message Q&A Behavior" (PDF). Some excerpts:
50.6% ... used their status messages to ask a question .... [on] sites like Facebook and Twitter .... 24.3% received a response in 30 minutes or less, 42.8% in one hour or less, and 90.1% within one day .... 69.3% ... who received responses reported they found the responses helpful.

The most common reason to search socially, rather than through a search engine, was that participants had more trust in the responses provided by their friends [24.8%]. A belief that social networks were better than search engines for subjective questions, such as seeking opinions or recommendations, was also a common explanation [21.5%].

The most common motivation given for responding to a question was altruism [37.0%]. Expertise was the next biggest factor [31.9%], with respondents being motivated because they felt they had special knowledge of the topic ... Nature of the relationship with the asker was an important motivator [13.7%], with closer friends more likely to get answers ... [as well as] the desire to connect socially [13.5%] ... free time [12.3%] ... [and] earning social capital [10.5%].

Many indicated they would prefer a face-to-face or personal request, and ignored questions directed broadly to the network-at-large .... [But] participants enjoyed the fun and social aspects of posing questions to their networks.
The key insight in the CHI paper is that people view asking questions of their social network as fun, entertaining, part of building relationships, and as a form of a gift exchange. The Aardvark paper focuses on a topical relevance rank of your social network, but maintaining relevance is going to be difficult at large scale when you have an unmotivated, lower quality, mainstream audience. The CHI paper might offer a path forward, suggesting we instead focus on game playing, entertainment, and the social rewards people enjoy when answering questions from their network.]]>
Wed, 24 Mar 2010 15:23:00 GMT
Security advice is wrong http://glinden.blogspot.com/2010/03/security-advice-is-wrong.html PDF), looks at why people ignore security advice.

The surprising conclusion is that some security advice we give to people -- such as inspecting URLs carefully, paying attention to https certificate warnings, and using complicated passwords that change frequently -- does more harm than good. Following the advice actually costs someone far more than the benefit that person should expect to get.

Extended excerpts from the paper:
It is often suggested that users are hopelessly lazy and unmotivated on security ... [This] is entirely rational from an economic perspective ... Most security advice simply offers a poor cost-benefit tradeoff to users and is rejected.

[Security] advice offers to shield [people] from the direct costs of attacks, but burdens them with increased indirect costs ... Since victimization is rare, and imposes a one-time cost, while security advice applies to everyone and is an ongoing cost, the burden ends up being larger than that caused by the ill it addresses.

To make this concrete, consider an exploit that affects 1% of users annually, and they waste 10 hours clearing up when they become victims. Any security advice should place a daily burden of no more than 10/(365 * 100) hours or 0.98 seconds per user in order to reduce rather than increase the amount of user time consumed. This generated the profound irony that much security advice ... does more harm than good.

We estimate US annual phishing losses at $60 million ... Even for minimum wage any advice that consumes more than ... 2.6 minutes per year to follow is unprofitable [for users] from a cost benefit point of view ... Banks [also] have more to fear from ... indirect losses such as support costs ... than direct losses. For example ... an agent-assisted password reset by 10% of their users would cost $48 million, easily dwarfing Wells Fargo's share of the overall $60 million in phishing losses.

Users are effectively trained to ignore certificate warnings by seeing them repeatedly when there is no real security threat .... As far as we can determine, there is no evidence of a single user being saved from harm by a certificate error, anywhere, ever ... The idea that certificate errors are a useful tool in protecting [users] from harm ... is entirely abstract and not evidence-based. The effort we ask of [users] is real, while the harm we warn them of is theoretical.

Advice almost always ignores the cost of user effort ... The main reason security advice is ignored is that it makes an enormous miscalculation: it treats as free a resource that is actually worth $2.6 billion an hour ... Advice-givers and policy-mandaters demand far more effort than any user can give .... User education is a cost borne by the whole population, while offering benefit only to the fraction that fall victim ... The cost of any security advice should be in proportion to the victimization rate .... [lest] in trying to defend against everything we end up defending nothing.

It is not users who need to be better educated on the risks of various attacks, but the security community. Security advice simply offers a bad cost-benefit tradeoff to users .... We must respect users' time and effort. Viewing the user's time as worth $2.6 billion an hour is a better starting point than valuing it at zero ... When we exaggerate all dangers we simply train users to ignore us.
The paper also has a great discussion of password policies and how they appear to be counter-productive. When system administrators require passwords with weird special characters that need to be changed regularly, they make passwords difficult to remember and impose a substantial burden on users, but the benefit from this policy appears to be minimal.
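
To make the arithmetic in the excerpt's 1% example concrete, here is a minimal sketch in Python. The function name and parameters are mine, not the paper's, and it assumes (as the excerpt's example implicitly does) that the advice would completely prevent the attack, so the result is an upper bound on how much daily effort the advice could justify.

```python
def break_even_daily_seconds(victim_fraction, hours_lost_per_victim, days_per_year=365):
    """Largest daily per-user cost, in seconds, that advice can impose
    before it consumes more user time than the attack it prevents.

    Assumes the advice eliminates the attack entirely (an optimistic bound).
    """
    # Expected hours lost per user per year, averaged over victims and non-victims.
    expected_hours_per_user = victim_fraction * hours_lost_per_victim
    # Spread that yearly budget over every day and convert hours to seconds.
    return expected_hours_per_user * 3600 / days_per_year


# The excerpt's example: 1% of users hit each year, 10 hours of cleanup each.
budget = break_even_daily_seconds(victim_fraction=0.01, hours_lost_per_victim=10)
print(f"{budget:.3f} seconds per user per day")
```

This is the same calculation as the excerpt's 10/(365 * 100) hours, which works out to just under one second per user per day. Any advice that takes longer than that to follow every day costs users more time in aggregate than the exploit does.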

[Paper found via Bruce Schneier]]]>
Tue, 23 Mar 2010 15:33:00 GMT
World of Goo on Google http://glinden.blogspot.com/2010/03/world-of-goo-on-google.html World of Goo, in one of its last levels, has a hilarious jab at the search engines, personalized search, and personalized advertising.

An excerpt from the dialogue with "MOM ... [the] Automated Search Engine Companion":
What would you like to ask MOM today?

Do I smell cookies? ... Yes! I am baking your personal information into each one.

What personal information is in your cookies? ... Cookies may contain:
your location
complete search history
whipped soy product
online purchases
medical records
telephone logs
web logs
email logs
streaming video from your current location
and more!
That's MOM's special recipe for offering free, convenient, relevant information to valued users like you.

Is my personal information safe with you? ... Your personal information is stored in a secure database and will never be shared with anyone*
* Unless they ask.
** Or if someone says they are you and takes your cookies
*** Or if the venture firm finds out it's profitable.
**** Or if unhappy employees release copies of your cookies to other online databases.
***** Or if outsourced data centers sell illegal copies of your cookies to other sites.
****** Or if my parent corporation is acquired and my data including your cookies becomes part of a larger aggregate system without your knowledge or consent.

Delete my cookies ... Are you sure? ... Yes ... Your cookies have been deleted*.
* Cookies may not actually be deleted.
** Cookies may be stored indefinitely for evaluation and training purposes to better serve you.
*** MOM knows best.

Everyone loves receiving special offers from MOM and the MOM's affiliate network of adver-bots.
Video is available as well as a full transcript (search for [Conversation with MOM transcript]).]]>
Mon, 22 Mar 2010 00:36:00 GMT
Designing search for re-finding http://glinden.blogspot.com/2010/03/designing-search-for-re-finding.html PDF), is notable not so much for the statistics on how much people search again for what they have searched for before, but for its fascinating list of suggestions (in Section 6, "Design Implications") of what search engines should do to support re-finding.

An extended excerpt from that section:
The most obvious way that a search tool can improve the user experience given the prevalence of re-finding is for the tool to explicitly remember and expose that user's search history.

Certain aspects of a person's history may be more useful to expose ... For example, results that are re-found often may be worth highlighting ... The result that is clicked first ... is more likely to be useful later, and thus should be emphasized, while results that are clicked in the middle may be worth forgetting ... to reduce clutter. Results found at the end of a query session are more likely to be re-found.

The query used to re-find a URL is often better than the query used initially to find it ... [because of] how the person has come to understand this result. [Emphasize] re-finding queries ... in the history ... The previous query may even be worth forgetting to reduce clutter.

When exposing previously found results, it is sometimes useful to label or name those results, particularly when those results are exposed as a set. Re-finding queries may make useful labels. A Web browser could even take these bookmark queries and make them into real bookmarks.

A previously found result ... may be what the person is looking for ... even when the result [normally] is not going to be returned ... For example, [if] the user's current query is a substring of a previous query, the search engine may want to suggest the results from the history that were clicked from the longer query. In contrast, queries that overlap with but are longer than previous queries may be intended to find new results.

[An] identical search [is] highly predictive of a repeat click ... [We] can treat the result specially and, for example, [take] additional screen real estate to try to meet the user's information need with that result ... [with] deep functionality like common paths and uses in [an expanded] snippet. For results that are re-found across sessions, it may make sense instead to provide the user with deep links to [some] new avenues within the result to explore.

At the beginning of a session, when people are more likely to be picking up a previous task, a search engine should provide access into history. In the middle of the session ... focus on providing access to new information or new ways to explore previously viewed results. At the end of a session ... suggest storing any valuable information that has been found for future use.
Great advice from Jaime Teevan at Microsoft Research. For more on this, please see my earlier post, "People often repeat web searches", which summarizes a 2007 paper by Teevan and others on the prevalence of re-finding.]]>
Tue, 16 Mar 2010 17:13:00 GMT
GFS and its evolution http://glinden.blogspot.com/2010/03/gfs-and-its-evolution.html GFS: Evolution on Fast-Forward", in the latest CACM magazine interviews Googler Sean Quinlan and exposes the problems Google has had with the legendary Google File System as the company has grown.

Some key excerpts:
The decision to build the original GFS around [a] single master really helped get something out the door ... more rapidly .... [But] problems started to occur ... going from a few hundred terabytes up to petabytes, and then up to tens of petabytes ... [because of] the amount of metadata the master had to maintain.

[Also] when you have thousands of clients all talking to the same master at the same time ... the average client isn't able to command all that many operations per second. There are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests and the master would be under a fair amount of duress.

64MB [was] the standard chunk size ... As the application mix changed over time, however, ways had to be found to let the system deal efficiently with large numbers of files [of] far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all those [small] files made on the centralized master .... There are only a finite number of files you can accommodate before the master runs out of memory.

Many times, the most natural design for some application just wouldn't fit into GFS -- even though at first glance you would think the file count would be perfectly acceptable, it would turn out to be a problem .... BigTable ... [is] one potential remedy ... [but] I'd say that the people who have been using BigTable purely to deal with the file-count problem probably haven't been terribly happy.

The GFS design model from the get-go was all about achieving throughput, not about the latency at which it might be achieved .... Generally speaking, a hiccup on the order of a minute over the course of an hour-long batch job doesn't really show up. If you are working on Gmail, however, and you're trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up. We had similar issues with our master failover. Initially, GFS had no provision for automatic master failover. It was a manual process ... Our initial [automated] master-failover implementation required on the order of minutes .... Trying to build an interactive database on top of a file system designed from the start to support more batch-oriented operations has certainly proved to be a pain point.

They basically try to hide that latency since they know the system underneath isn't really all that great. The guys who built Gmail went to a multihomed model, so if one instance of your GMail account got stuck, you would basically just get moved to another data center ... That capability was needed ... [both] to ensure availability ... [and] to hide the GFS [latency] problems.

The model in GFS is that the client just continues to push the write until it succeeds. If the client ends up crashing in the middle of an operation, things are left in a bit of an indeterminate state ... RecordAppend does not offer any replay protection either. You could end up getting the data multiple times in a file. There were even situations where you could get the data in a different order ... [and then] discover the records in different orders depending on which chunks you happened to be reading ... At the time, it must have seemed like a good idea, but in retrospect I think the consensus is that it proved to be more painful than it was worth. It just doesn't meet the expectations people have of a file system, so they end up getting surprised. Then they had to figure out work-arounds.
It is interesting to see the warts of the Google File System and Bigtable exposed. I remember being surprised, when reading the Bigtable paper, that it was layered on top of GFS. Those early decisions to use a file system designed for logs and batch processing of logs as the foundation for Google's interactive databases appear to have caused a lot of pain and workarounds over the years.
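
The excerpt does not say what the work-arounds for duplicated RecordAppend data were. As an illustration only, here is a minimal Python sketch of one common way applications cope with at-least-once appends: the writer tags each record with a unique ID and the reader drops any ID it has already seen. The record format and the function names are hypothetical, not anything from GFS.

```python
import uuid


def make_record(payload):
    """Writer side: tag a record with a unique ID before appending it.

    The dict layout here is a made-up format for illustration.
    """
    return {"id": uuid.uuid4().hex, "payload": payload}


def read_unique(records):
    """Reader side: yield each record once, skipping replays of the same ID."""
    seen = set()
    for record in records:
        if record["id"] in seen:
            continue  # duplicate left behind by a retried append
        seen.add(record["id"])
        yield record


first = make_record("user 42 archived message 7")
second = make_record("user 42 starred message 9")
chunk = [first, second, first]  # the first record was appended twice by a retry

unique = list(read_unique(chunk))
print([record["payload"] for record in unique])
```

Note that this only handles duplicates. It does nothing about records appearing in different orders in different replicas, which would need something more, such as an application-level sequence number.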

On a related topic, a recent paper out of Google, "Using a Market Economy to Provision Compute Resources Across Planet-wide Clusters" (PDF), looks at another problem Google is having, prioritizing all the MapReduce batch jobs at Google and trying to maximize utilization of their cluster. The paper only describes a test of one promising solution, auctioning off the cluster time to incent developers to move their jobs to non-peak times and idle compute resources, but still an interesting read.]]>
Sun, 14 Mar 2010 23:31:00 GMT
The Onion on Google's data http://glinden.blogspot.com/2010/03/onion-on-googles-data.html Google Responds To Privacy Concerns With Unsettlingly Specific Apology", that should be enjoyable for this crowd. An excerpt as a teaser:
Acknowledging that Google hasn't always been open about how it mines the roughly 800 terabytes of personal data it has gathered since 1998, [CEO Eric] Schmidt apologized to users -- particularly the 1,237,948 who take daily medication to combat anxiety --for causing any unnecessary distress, and he expressed regret -- especially to Patricia Fort, a single mother taking care of Jordan, Sam, and Rebecca, ages 3, 7, and 9 -- for not doing more to ensure that private information remains private.

Monday's apology comes after the controversial launch of Google Buzz, a social networking platform that publicly linked Gmail users to their most e-mailed contacts by default.

"I'd like nothing more than to apologize in person to everyone we've let down, but as you can see, many of our users are rarely home at this hour," said Google cofounder and president Sergey Brin, pointing to several Google Map street-view shots of empty bedroom and living room windows on a projection screen behind him. "And, if last night's searches are any indication, Boston's Robert Hornick is probably out shopping right now for the spaghetti and clam sauce he'll be cooking tonight ... Either that, or hunting down that blond coworker of his, Samantha, whose Picasa photos he stares at every night."
[Article found via Bruce Schneier]]]>
Fri, 12 Mar 2010 03:04:00 GMT
Personalization and differential pricing http://glinden.blogspot.com/2010/02/personalization-and-differential.html PDF). An excerpt of his predictions on personalization:
Instead of a "one size fits all" model, the web offers a "market of one" ... [powered by] suggestions of things to buy based on your previous purchases, or on purchases of customers like you.

Not only content, but prices may also be personalized, leading to various forms of differential pricing ... [But] the ability of firms to extract surplus [may be] quite limited when consumers are sophisticated ... [And] perfect price discrimination and free entry ... pushes profits to zero, conferring all benefits to the customers.

The same sort of personalization can occur in advertising ... Google and Yahoo ... [already] allow users to specify their areas of interest and then see ads related to those interests. It is also relatively common for advertisers ... to show ads based on previous responses of users to related ads.
Back in 2000, Amazon got slammed (e.g. [1]) for an experiment with differential pricing, but Hal appears to be predicting differential pricing will rise again.

The paper also talks briefly about how experimentation changes how companies make decisions ("when experiments are cheap, they are likely to provide more reliable answers than opinions"), data mining, online advertising, legal contracts that use computer monitoring to enforce their terms, and cloud computing. The paper is from the 2010 Ely Lecture at the American Economic Association, and video of the talk is available.]]>
Sat, 27 Feb 2010 16:55:00 GMT
How we all teach Google to Google http://glinden.blogspot.com/2010/02/how-we-all-teach-google-to-google.html How Google's Algorithm Rules the Web", with some fun details on how Google uses constant experimentation, logs of searches and clicks, and many small tweaks to keep improving their search results.

Well worth reading. Some excerpts as a teaser:
[Google Fellow Amit] Singhal notes that the engineers in Building 43 are exploiting ... the hundreds of millions who search on Google. The data people generate when they search -- what results they click on, what words they replace in the query when they're unsatisfied, how their queries match with their physical locations -- turns out to be an invaluable resource in discovering new signals and improving the relevance of results.

"On most Google queries, you're actually in multiple control or experimental groups simultaneously," says search quality engineer Patrick Riley. Then he corrects himself. "Essentially," he says, "all the queries are involved in some test." In other words, just about every time you search on Google, you're a lab rat.

This flexibility -- the ability to add signals, tweak the underlying code, and instantly test the results -- is why Googlers say they can withstand any competition from Bing or Twitter or Facebook. Indeed, in the last six months, Google has [found and] made more than 200 improvements.
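
The article does not describe how Google assigns queries to tests. As a rough sketch of how a query could sit in "multiple control or experimental groups simultaneously", here is one common approach in Python: hash the query ID together with each experiment's name, so that assignments are stable for a given query but independent across experiments. The experiment names and the function names are invented for illustration.

```python
import hashlib

# Hypothetical experiment names, not from the article.
EXPERIMENTS = ["snippet_length", "synonym_expansion", "freshness_boost"]


def assign_group(query_id, experiment, groups=("control", "treatment")):
    """Deterministically place a query in one group of one experiment.

    Salting the hash with the experiment name keeps the assignments for
    different experiments independent of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{query_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(groups)
    return groups[bucket]


def assignments(query_id):
    """Return this query's group in every running experiment."""
    return {name: assign_group(query_id, name) for name in EXPERIMENTS}


result = assignments("query-12345")
print(result)
```

Because the assignment is a pure function of the query ID and the experiment name, nothing needs to be stored to remember which group a query was in, and adding a new experiment does not disturb the existing ones.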
Even so, this raises the question of where the point of diminishing returns is with more data and more users. While startups lack Google's heft, Yahoo and Bing are big enough that -- if they continuously experiment, tweak, and learn from their data as much as Google does -- search quality differences likely would be in an imperceptibly small chunk of long tail queries.]]>
Tue, 23 Feb 2010 23:23:00 GMT
Google Reader recommends articles http://glinden.blogspot.com/2010/02/google-reader-recommends-articles.html May we recommend...", Laurence Gonsalves describes a new recommendation feature for Google Reader that recommends articles based on what you have read in the past. An excerpt:
Many of you wanted to see even more personalized recommendations ... [Now], we've started inserting items selected just for you inside the Recommended items section. This is great if you've got interests that are less mainstream. If you love Lego robots, for example, then you should start to notice more of them in your Recommended items.
Sadly, no additional details appear to be available. In my usage, there were rare gems in the recommendations, but a lot of randomness, and a strong bias toward very popular items. The lack of explanation -- why was this item recommended? -- and lack of a way to correct the recommendations likely will make people less forgiving of these problems. I also saw recommendations for items I had already read recently; items you have already seen should always be filtered from recommendations.
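
Filtering out already-seen items is cheap to do as a last step after ranking. A minimal sketch in Python, with made-up item IDs and a hypothetical function name (this says nothing about how Google Reader works internally):

```python
def filter_seen(ranked_items, seen_ids):
    """Drop items the user has already read, keeping the ranking order."""
    seen = set(seen_ids)  # set membership keeps the filter linear in list length
    return [item for item in ranked_items if item not in seen]


recommended = ["lego-robots", "popular-story", "search-paper"]
already_read = {"popular-story"}

fresh = filter_seen(recommended, already_read)
print(fresh)  # ['lego-robots', 'search-paper']
```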

For more on that, you might enjoy some of my previous posts on this topic, such as the Mar 2009 "What is a good recommendation algorithm?" and the much older Dec 2006 "The RSS beast".]]>
Tue, 23 Feb 2010 14:26:00 GMT
New details on LinkedIn architecture http://glinden.blogspot.com/2010/02/new-details-on-linkedin-architecture.html LinkedIn Search: A Look Beneath the Hood", that has slides from a talk by LinkedIn engineers along with some commentary on LinkedIn's search architecture.

What makes LinkedIn search so interesting is that the search does real-time updates (the "time between when user updates a profile and being able to find him/herself by that update need to be near-instantaneous"), faceted search (">100 OR clauses", "NOT support", complex boolean logic, some facets are hierarchical, some are dynamic over time), and personalized relevance ranking of search results (ordered by distance in your LinkedIn social graph).
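
To illustrate what ordering results by distance in the social graph means, here is a toy Python sketch: a breadth-first search from the searcher gives each person's distance in hops, and matches are sorted by that distance. This is only an illustration on a tiny in-memory graph with invented names, not LinkedIn's implementation, which has to do this at far larger scale.

```python
from collections import deque


def graph_distances(graph, source):
    """Breadth-first search: hops from source to every reachable person."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for neighbor in graph.get(person, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[person] + 1
                queue.append(neighbor)
    return distances


def rank_by_distance(matches, graph, searcher):
    """Order search matches by social distance; unreachable people go last."""
    distances = graph_distances(graph, searcher)
    return sorted(matches, key=lambda person: distances.get(person, float("inf")))


graph = {
    "me": ["ann"],
    "ann": ["me", "bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob"],
    "dan": [],  # not connected to anyone
}

ranked = rank_by_distance(["dan", "cat", "ann", "bob"], graph, "me")
print(ranked)  # ['ann', 'bob', 'cat', 'dan']
```

The same query therefore returns a different order for each searcher, which is part of why results like these are hard to cache.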

LinkedIn appears to use a combination of aggressive partitioning, keeping data in-memory, and a lot of custom code (mostly modifications to Lucene, some of which have been released open source) to handle these challenges. One interesting tidbit is that, going against current conventional wisdom, LinkedIn appears to use caching only minimally, preferring to spend their efforts and machine resources on making sure they can recompute results quickly rather than on hiding poor performance behind caching layers.]]>
Tue, 2 Feb 2010 22:40:00 GMT