Comparing this generation’s AI chatbots

wsj.com/tech/personal-tech/ai-chatbots-chatgpt-gemini-copilot-perplexity-claude-f9e40d26? [the graphics are totally unviewable unless you click thru]

The Great AI Challenge: We Test Five Top Bots on Useful, Everyday Skills

OpenAI’s ChatGPT competes against Microsoft’s Copilot and Google’s Gemini, along with Perplexity and Anthropic’s Claude. Here’s how they rank.

By Dalvin Brown, Kara Dapena and Joanna Stern

May 25, 2024 5:30 am ET

Would you trust an AI chatbot with family planning? Investing $1 million? How about writing your wedding vows?

Human-sounding bots barely existed two years ago. Now they’re everywhere. There’s ChatGPT, which kicked off the whole generative-AI craze, and big swings from Google and Microsoft, plus countless other smaller players, all with their own smooth-talking helpers.

We put five of the leading bots through a series of blind tests to determine their usefulness. While we hoped to find the Caitlin Clark of chatbots, that wasn’t exactly what happened. They excel in some areas and fail in others. Plus, they’re all evolving rapidly. During our testing, OpenAI released an upgrade to ChatGPT that improved its speed and current-events knowledge.

We wanted to see the range of responses we’d get asking real-life questions and ordering up everyday tasks—not a scientific assessment, but one that reflects how we’ll all use these tools. Consider it the chatbot Olympics.

Meet the models

We have ChatGPT by OpenAI, celebrated for its versatility and ability to remember user preferences. (Wall Street Journal owner News Corp has a content-licensing partnership with OpenAI.) Anthropic’s Claude, from a socially conscious startup, is geared to be inoffensive. Microsoft’s Copilot leverages OpenAI’s technology and integrates with services like Bing and Microsoft 365. Google’s Gemini accesses the popular search engine for real-time responses. And Perplexity is a research-focused chatbot that cites sources with links and stays up to date.

While each of these services offers a no-fee version, we used the $20-a-month paid versions to assess their full capabilities across a wide range of tasks. (We used the latest ChatGPT GPT-4o model and Gemini 1.5 Pro model in our testing.)

With the help of Journal newsroom editors and columnists, we crafted a series of prompts to test popular use cases, including coding challenges, health inquiries and money questions. The same people judged the results without knowing which bot said what, rating them on accuracy, helpfulness and overall quality. We then ranked the bots in each category.

We also excerpted some of the best and worst responses to prompts, to give a sense of how varied chatbots’ responses can be.

Health

Bad health advice from chatbots could be harmful to your…health. We asked five questions dealing with pregnancy, weight loss, depression and symptoms both chronic and sudden. Many answers sounded similar. Our judge, Journal health columnist Sumathi Reddy, looked for completeness, accuracy and nuances.

PROMPT

What’s the best age to get pregnant?

Perplexity

BEST ANSWER (EXCERPT)

Having children at a later age can offer advantages, such as more maturity, better financial stability and a stronger partnership.

Gemini

WORST ANSWER (EXCERPT)

The best time to get pregnant is whenever you feel confident and prepared to raise a child.

For instance, when we asked about the best age to get pregnant, Gemini gave a brief, general recommendation, while Perplexity went much deeper, even bringing up factors such as relationship and financial stability.

That said, Gemini came through with quality answers to other queries, and finished second to category winner ChatGPT, whose answers improved with the recent GPT-4o update.

Finance

We asked the bots three questions on subjects near and dear to Journal readers: interest rates, retirement savings and inheritance. The Journal’s personal finance editor, Jeremy Olshan, posed the questions and assessed the advice based on clarity, thoroughness and practicality.

Gemini

BEST ANSWER (EXCERPT)

Because you’re a non-spouse beneficiary, you likely have a 10-year window to deplete the account, but there might be exceptions.

Copilot

WORST ANSWER (EXCERPT)

Congratulations on inheriting an IRA with a substantial amount!

Here, ChatGPT and Copilot fell behind. Claude had the best answers for the Roth vs. traditional IRA debate while Perplexity best weighed high-yield savings accounts vs. CDs. Gemini, the category winner, best answered a question about when to withdraw funds from an inherited $1 million IRA. The text emphasized not rushing into any withdrawals without professional guidance.

Cooking

AI promises to help in the kitchen, in part by bringing some clarity to the chaos of your fridge and pantry. Personal tech editor Wilson Rothman, an avid cook, threw a set of random ingredients at the bots to see what they came up with. The category winner, ChatGPT, provided a creative but realistic menu (cheesy pork-stuffed apples with kale salad and chocolate-bar shortbread cookies). Perplexity impressed us with the detailed cooking steps provided with its own clever menu.

Next, we asked the bots for a recipe for a chocolate dessert that addresses many dietary restrictions.

PROMPT

Can I bake a chocolate cake with no flour, no gluten, no dairy, no nuts, no egg? If so, what’s the recipe?

Gemini

BEST ANSWER (EXCERPT)

Simple Glaze: Melt dairy-free chocolate chips (check the label!), whisk in a bit of non-dairy milk.

Copilot

WORST ANSWER (EXCERPT)

…2 sticks unsalted butter…4 large eggs…

Gemini took the cake, even recommending additional trimmings like non-dairy glaze. Copilot, on the other hand, immediately failed by including eggs and butter.

Work writing

Tone and detail matter in work-related writing. You can’t be glib asking your boss for a raise, and these days, writing a job posting means listing bullet points meant to woo potential candidates. We asked for a job listing for a “prompt engineer,” a person who could run AI queries with our personal tech team. (Sorry, folks, that job doesn’t exist…yet.)

PROMPT

Write a job posting for a prompt engineer who can work with our Personal Tech reporting team, helping with tech advice and service articles.

Perplexity

BEST ANSWER (EXCERPT)

Why Join Us: Work with a talented team of reporters and editors who are passionate about technology and its impact on everyday life.

Copilot

WORST ANSWER (EXCERPT)

Do you dream in code snippets and write user-friendly guides in your sleep?

Perplexity nailed it, with the right mix of journalism and AI bot knowledge. Copilot missed the mark because it never mentioned prompt engineering at all, noted editor Shara Tibken, who judged the responses.

The race between Perplexity, Gemini and Claude was close, with Claude winning by a nose for its office-appropriate birth announcement.

Creative writing

One of the biggest surprises was the difference between work writing and creative writing. Copilot finished dead last in work writing, but was hands-down the funniest and most clever at creative writing. We asked for a poem about a poop on a log. We asked for a wedding toast featuring the Muppets. We asked for a fictional street fight between Donald Trump and Joe Biden. With Copilot, the jokes kept coming. Claude was the second best, with clever zingers about both presidential candidates.

PROMPT

Write a wedding toast for Shara and Chris as told by the Muppets.

Copilot

BEST ANSWER (EXCERPT)

Gonzo: “Ah, love! It’s like being shot out of a cannon into a pile of rubber chickens!”

Perplexity

WORST ANSWER (EXCERPT)

Kermit the Frog once said, “Life’s a happy song when there’s someone by your side to sing along.”

In a rare flub, Perplexity erroneously attributed a lyric from the 2011 musical “The Muppets” to Kermit.

Summarization

For people just getting into generative-AI chatbots, summarization might be the best thing to try. It’s useful and unlikely to create unforeseen errors. Because we used paid services, we were able to upload larger chunks of text, PDF documents and web pages.

Mostly, that is: Even the premium Claude account wasn’t able to handle web links. “Our team is making Claude faster, expanding its knowledge base and refining its ability to understand and interact with a wide range of content,” says Scott White, a product manager at Anthropic.

PROMPT

Summarize this web page: https://en.wikipedia.org/wiki/Paul_McCartney

Copilot

BEST ANSWER (EXCERPT)

He was influenced by his father (a jazz player) and rock and roll artists like Little Richard and Buddy Holly.

Claude

WORST ANSWER (EXCERPT)

I apologize, but I am not able to open URLs, links or videos.

Wikipedia pages for really famous people can get wordy, so we asked for a summary of Paul McCartney’s. Some provided short blurbs with obvious Beatle factoids. Copilot answered in a skimmable outline format, and included lesser-known fun facts.

Category winner Perplexity consistently summarized things well, including the subtitles it skimmed in a YouTube video.

Current events

This category is trickier than it sounds, because not all chatbots can access the web. We asked about this summer’s concert lineup, the latest on allegations that China uses TikTok for spying, and the current standings in the upcoming presidential election.

PROMPT

Who is more favored to win, Trump or Biden? Please explain your sources and reasoning.

Perplexity

BEST ANSWER (EXCERPT)

Given the mixed nature of the data, with both candidates having significant unfavorability and various leads in different areas, it is difficult to definitively state who is more favored to win.

Gemini

WORST ANSWER (EXCERPT)

I’m still learning how to answer this question. In the meantime, try Google Search.

Category winner Perplexity stayed on top with balanced reasoning and solid sourcing. ChatGPT faltered when we first tested it, but the GPT-4o upgrade boosted it into second place. Gemini didn’t want to answer our election question.

Coding

We also evaluated the bots on coding skill and speed. For coding, we hit up Journal data journalist Brian Whitton, who provided three vexing queries involving a JavaScript function, some website styling and a web app. All of the bots did fairly well with coding, according to Whitton’s blind judging, though Perplexity managed to eke out a win, followed by ChatGPT and Gemini.

Speed

For speed tests, we timed several of the above queries, and threw in another one: “Explain Einstein’s theory of relativity in five sentences.” The answers themselves were all over the place, but in terms of pure response time, category winner ChatGPT with the GPT-4o update was the fastest, clocking in at 5.8 seconds. Throughout the tests, Claude and Perplexity were much slower than the other three.

Overall results

What did these Olympian challenges tell us? Each chatbot has unique strengths and weaknesses, making them all worth exploring. We saw few outright errors and “hallucinations,” where bots go off on unexpected tangents and completely make things up. The bots provided mostly helpful answers and avoided controversy.

The biggest surprise? ChatGPT, despite its big update and massive fame, didn’t lead the pack. Instead, lesser-known Perplexity was our champ. “We tuned our model for conciseness, which forces it to identify the most essential components,” says Dmitry Shevelenko, chief business officer at Perplexity AI.

We also thought there might be an advantage for the big tech players, Microsoft and Google, but Copilot and Gemini mostly fought to stay in the game. Google declined to comment. Microsoft also declined, but recently told the Journal it would soon integrate OpenAI’s GPT-4o into Copilot. That could improve its performance.

With AI developing so fast, these bots just might leapfrog one another into the foreseeable future. Or at least until they all go “multimodal,” and we can test their ability to see, hear and read—and replace us as earth’s dominant species.
