Douglas Adams had an idea about technology: whatever technology existed at the time you were born is normal and ordinary; anything invented between when you’re fifteen and thirty-five is new and exciting, and you can probably make a career of it; anything invented after the age of thirty-five is against the natural order of things.
Internet search, specifically Google Search, was a technology largely developed while I was in my gee-whiz stage of life.
As a young adult, it was pretty magical to be able to (seemingly) find whatever I was looking for, quickly. This only became more incredible after the smartphone revolution, when this capability went mobile.
The search (especially Google Search) of today isn’t what it once was… so much so that I don’t even use Google anymore. I can still find what I seek, mostly, though I now need much more specialized skills, a wider panoply of tools, and a tolerance for wading through far more onerous ads.
If Google Search were still what it was in the early aughts, I wonder if I’d be as enamored of generative AI.
---
There is a very popular idea right now of ‘enshittification’, attributed to Cory Doctorow. His point is that services get worse as stakeholders (especially investors) seek to extract more value to the detriment of everyone else. I think this is helpful, and often true, but insufficiently nuanced; it does that modern trick of laying all problems at the feet of the especially (dis)favored outgroup, in this case a (certain kind of) capitalist. But I think the reason Google Search has gotten worse is more nuanced and complicated than that; it has at least partially grown worse because it’s evolving alongside other interrelated systems, which are also evolving.
Much as Google Search really, truly was better twenty years ago, I wonder if we’ll see a similar cycle with generative AI (or at least LLMs). I wonder if the contradictions & tensions in the (eco)systems in and around the technology (corporate structure, advertising needs, intelligence/disinformation campaigns, et al.) will see its use & power hobbled. When I think about why Google Search got so much worse, I wonder how many of these issues will also apply to generative AI development.
Audience/Feedstock Dilution aka Worse Training Data
The internet used to be primarily the output of passionate nerds; now it’s an aggregate of something closer to a (biased) version of pretty much everybody (at least everyone in the West). The average quality of information posted on the internet is lower now than when there were only 4, or 40, or 400,000 servers connected. Search is less useful because there’s more, lower-quality material to sort through.
Less Accessible Feedstock
Somewhat ironically, it used to be that a greater percentage of (higher-quality) information on the internet was accessible to search algorithms, but eventually more and more of it was siloed where it would be inaccessible to (or at least hardened against) harvesting by those same algorithms. Some of this is a change in technical capabilities (more video), some the rise of walled gardens (social media generally). I can see a similar thing happening with the training data for AI, as new monetization issues change the legal landscape.
A Particular Profit Motive/Enshittification
Google simply shows more ads now, by what feels like an order of magnitude, and the pH balance of ‘sorting information to be most valuable for the user’ vs. ‘sorting information to be most profitable for Alphabet’ skews, heavily, towards the latter. I can see similar forces changing LLM utility (assuming the corporate/for-profit versions maintain a consistent lead over open-source alternatives).
Data Set Poisoning
Search is worse because it has had to evolve defenses against malicious attacks, like, say, deliberately hacking WordPress sites to install a thousand hidden porn links. I don’t think LLMs are as vulnerable to this as the open web, but I can imagine similar incentives to poison training data at a level below human perceptibility. And, somewhat ironically, LLMs’ ability to generate additional training data quickly means this might be easier, and there might be more options for doing it that aren’t conceivable within the current frame.
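As a toy illustration of what ‘below human perceptibility’ could mean (a hypothetical sketch, not a description of any documented attack): text can be laced with zero-width Unicode characters that render invisibly on screen but are fully visible to a scraper or tokenizer.

```python
# Hypothetical sketch: two strings that look identical to a human
# reader but differ at the byte level, via zero-width characters.
# An illustration of imperceptible text perturbation, nothing more.

ZWSP = "\u200b"  # zero-width space: renders as nothing in most fonts

clean = "the quick brown fox"
poisoned = ZWSP.join(clean)  # invisible character between every letter

print(clean)                       # the quick brown fox
print(poisoned)                    # visually identical in most renderers
print(len(clean), len(poisoned))   # 19 vs. 37 -- the machine sees the difference
print(clean == poisoned)           # False
```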
Top-Down Control
Google Search is also different (worse, IMO) because of top-down control attempts. These come in many flavors. On the one hand, advertisers find certain information distasteful and do not want their ads associated with various forms of ‘profanity’, and so profane things (but also things that are simply a particular flavor of unfashionable) are generally downranked. There are also explicit asks from government agencies to censor; I’m not sure how much of an effect that has on Google Search, though it’s well documented for social media, and it certainly affects YouTube monetization, which by extension affects cultural production.
Recursion
It’s currently unknown whether we can usefully train generative AI on AI output. I would venture a guess that every level of recursion is only 70% as useful as the level that spawned it. This isn’t as big an issue with search, but I imagine it could hobble the future utility of LLMs.
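To make that guess concrete (the 70% figure is pure speculation on my part, not a measurement), here’s what geometric decay at that rate looks like after a few generations:

```python
# Toy calculation of the speculative "70% per generation" rule above.
# RETENTION is an assumption, not a measured property of any model.
RETENTION = 0.70  # usefulness retained per recursive generation

utility = 1.0  # baseline: training on original, human-made data
for generation in range(1, 6):
    utility *= RETENTION
    print(f"generation {generation}: {utility:.2f} of original utility")

# By generation 5, only ~0.17 of the original utility remains.
```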
Perceptions of Fairness
People are more likely to give gifts if no one is clearly profiting from them (which is a large part of why Wikipedia continues to function). Meanwhile, if someone is clearly making a lot of money, people are much less likely to donate their time/energy/attention without receiving ‘fair’ compensation. This cultural perception of reasonable compensation is upstream of law. This is related to the changing demographics of cultural production: when a scene is just geeks and fanatics creating for the joy of it, they can continue potentially indefinitely. Once a labor of (mostly) love is (overly, aggressively) monetized via any mechanism, there’s a strong risk of stymying production, if not stopping it. Currently, with LLMs, there’s no compensation mechanism for the millions whose work makes them possible.
Goodharting
What gets measured gets managed. In the earlier days of the web, the focus was, to a much greater degree, on sharing interesting information; that was probably your best chance of getting your information seen. Eventually, a new factor entered the consciousness: getting a higher search ranking in Google, or ‘pleasing’ the various social media algorithms (which is not quite the same thing), with an accompanying arms race (and the development of superstitions and a new segment of specialists). However we develop the various generative AIs, we are constantly at risk of incentivizing the wrong behavior.
---
To put this in the naturalistic metaphors I’m fond of: it’s entirely possible there are hypothetical organisms that would be much more efficient at some core functionality (photosynthesizing, running) if they didn’t also have to evolve capabilities to fend off parasites, predators, or environmental degradation/variance. It’s possible early(ier) Google was something like this: a unique bloom that was only possible in a less challenging environment, where there hadn’t been as much environmental variation to survive, nor as many pathogens and hazards, and where it hadn’t yet decimated its foodstuff species (possibly to the point that certain types of cultural/information producers have gone effectively extinct, from the web’s perspective).
I don’t believe there will ever be another training data set like the one we’re currently benefitting from. That is to say: every scannable, machine-translatable book written before 1929, plus everything scrapeable from the web from the era when (we assume) a majority of the information offered to the digital networks was not hobbled by one of the forces mentioned above, nor suspect of being recursive, machine-generated slop that (if used for training) will lead to crappier, crazier outcomes. From here on out, our information faces at least the possibility of being suspect.
It’s possible that, via methods like Monte Carlo sampling or mimicking evolutionary processes, we can produce novel information configurations without the contributions of millions (billions) of clever, motivated humans; but if we run an entirely virtual evolutionary process, I wonder how compatible it will be with existing evolved systems, which all developed deeply entrained with one another.
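To gesture at what that might look like (a minimal sketch under toy assumptions, not a serious proposal): a hill-climbing loop can ‘evolve’ novel strings toward a fitness function. Note that the fitness function here is hand-written against a known target, which is exactly the kind of entrainment with existing, human-made reference points the worry above is about.

```python
import random

# Toy evolutionary search: mutate a random string until it matches a
# hand-chosen target. TARGET and fitness() are toy stand-ins; defining
# fitness without human-made reference data is the unsolved part.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
TARGET = "novel information"

def fitness(candidate: str) -> int:
    # Number of positions that already match the target.
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate: str) -> str:
    # Replace one randomly chosen character.
    i = random.randrange(len(candidate))
    return candidate[:i] + random.choice(ALPHABET) + candidate[i + 1:]

candidate = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
while fitness(candidate) < len(TARGET):
    mutant = mutate(candidate)
    if fitness(mutant) >= fitness(candidate):  # keep neutral or better
        candidate = mutant

print(candidate)  # converges to "novel information"
```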
Even with their explosive growth in utility, and even for all we’ve learned about how to push the models further (with perhaps even fewer resources), it’s entirely possible we’re at the peak of easy utility for LLMs, or far closer to it than we care to believe, in the same way I didn’t think Google Search was near the peak of its simple utility twenty years ago. Not least because we tend to focus attention on the performance of whatever centralized bureaucracy is capturing most of the value, rather than understanding that it will be the target of inevitable parasitism and hacking (benign or otherwise), and that it can only function as it does because it sits on top of a fragile ecosystem it will eventually, inevitably alter… potentially to the point that it can no longer function.