ChatGPT fails for Research
edited February 14 in Research & Reading
I tried to use ChatGPT for several research queries. I am currently processing the articles it gave me on maximum strength predictions based on repetition maximums (basically: perform a set to failure, then calculate the maximum weight you could lift for a single repetition).
5/5 studies it gave me don't exist.
EDIT: Luckily, I have a textbook with extensive references provided by a human.
I am a Zettler
ChatGPT will not replace human writers. It will augment, but more for the weak than the gifted.
It's axiomatic. ChatGPT feeds off content it finds. The more content it replaces, the more it feeds off its own output. Eventually, that leads to an unwholesome alimentary loop that I dare not describe in a genteel environment.
I am not afraid of CGPT replacing anything that is already behaving algorithmically.
It fails at the most basic requests. It seems that giving correct bibliographical data is more difficult than writing a basic article on topic XY, which says a lot about those basic articles.
My personal hope was that it would give me a jump start on my tasks. It is annoying to find reviews that are up to date and employ proper methods. At the very least, I'd expect CGPT to give me up-to-date reviews on a given topic.
Had similar results for programming packages. It invented packages, attributed to authors who do exist and who were plausibly likely to have created them. While the authors were real, none of the packages existed.
ChatGPT is weird since it's only good at one thing: "guessing" which word should likely come next, given the words already produced and the input query.
It's absurd that coherent sentences come out of this "guess what's next" approach at all.
It can't do any fact-checking. A simple web search produces better results for things that do exist. In that sense, ChatGPT is chiefly generative.
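That "guess what's next" idea can be sketched with a toy bigram model. This is a deliberate oversimplification: real LLMs use transformer networks over subword tokens, and the corpus here is made up.

```python
# Toy illustration of "guessing the next word": count which word follows
# which in a tiny corpus, then predict the most frequent continuation.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# For each word, count the words that follow it.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # → "cat" ("the" is followed by "cat" twice, "mat" once)
```

Nothing in this loop knows or checks facts; it only tracks which continuations are frequent, which is why fluency and truthfulness come apart.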
Author at Zettelkasten.de • https://christiantietze.de/
Many subscription academic databases already allow you to search for reviews, and even for different types of reviews. I imagine we will see more AI capabilities added to such databases in the future.
Have you tried scite.ai?
Following up on this thought, I googled: "ChatGPT" + "scite.ai" OR "scite". One of the top results was a blog post by (of course) the always-interesting Aaron Tay: "Are we undervaluing Open Access by not correctly factoring in the potentially huge impacts of machine learning? — an academic librarian's view (I)" (6 December 2022). It's something to read if you're interested in the use of AI in academic literature discovery.
And try scite.ai if you haven't tried it recently. They implemented natural-language search late last year.
EDIT 1: See also Tay's earlier post that has a more extensive comparison of scite.ai with four other services: "Q&A academic systems – Elicit.org, Scispace, Consensus.app, Scite.ai and Galactica" (27 November 2022).
EDIT 2: I discovered that four days ago, Aaron Tay, who wrote the blog posts that I mentioned above, responded in horror on Twitter to a tweet that said: "The user started by asking Bing to find 5 studies about aerobic exercise that were conducted in the last 5 years." Tay responded: "This is horrible. Based on links gpt is extracting from popular websites that describe papers and we know how accurate that is. As the thread goes on they find it hallucinates. Btw @elicitorg exists for over a year using LLMs for this use case and it extracts from source papers." Now you know how to horrify librarians: Tell them that you used ChatGPT to do your literature review!
That was my understanding of how it worked, too, which doesn't make sense to me: how do you get coherent paragraphs out of that? I like https://www.perplexity.ai/ because it cites where it found the information, so you can get a rough estimate of how much it should be trusted.
The trustworthiness of the information is a big deal. I'm not sure if they've publicly talked about how they are going to deal with that. I remember seeing something about how they think it's going to solve itself if they are able to train it on a bigger data set, which doesn't make sense to me.
Are you referring to ChatGPT or Perplexity.ai?
Anyway, ChatGPT is a language model: its knowledge base is its input, and modulated human-like text is its output. There is no credibility at all.
my first Zettel uid: 202008120915
My work depends on quite a number of fields, so I am often in the situation of needing to build up some orientation on how to penetrate a field in the first place. The bottleneck is not finding reviews, or reviews of certain methods, but filtering the ones I find.
My favorite address for all things health and fitness is Semantic Scholar, btw.
I don't typically trust information I get from human friends, but prefer to confirm by researching several credible sources and also making sure the information isn't listed on sites that detect false news. Having said that, why would I even think of trusting an AI source that adds at least one (if not many) more layers of stupidity onto the information it dispenses?
scite.ai and elicit.org will be a big improvement over semantic scholar for helping you filter articles, if their coverage is adequate for you.
Playing around with Elicit. So far, I might lack the skill of making the right queries. Something feels off.
Dear all - Aaron Tay here
My view is that the world is overly obsessed with ChatGPT.
Asking ChatGPT to write an essay with references is like asking an unaided human to write a paper with references purely from memory. The human brain / ChatGPT's neural nets might be trained with sufficient weights to occasionally pop out a real reference, but mostly they will misremember, and this is what you see when ChatGPT is asked to do so. Similarly, Meta's Galactica.org, despite being trained only on academic content, is in the same boat.
Search+LLMs (large language models) are the future. They work similarly to how humans use a search engine: look for relevant documents, then extract the parts that may answer the question. Tools like the new Bing chat and Perplexity.ai do this on the web; Elicit.org and scite.ai (its "ask a question" feature) work on the academic side.
I predict that in one year's time, ChatGPT will be forgotten and most people will be using search enhanced by LLMs.
Bing+chat is especially impressive in my testing. Such systems no longer make up false references, but their interpretation of a paper can of course still be off. Elicit.org is maybe 70% reliable. But the best of these systems let you see which document and which specific sentences generated the answer, so you can always check for yourself.
There are a ton of interesting implications when/if such systems become the norm, both for information literacy and for how search engines work.
Here's just one: Google Scholar now gets free access to index the full text of publishers' papers, even those behind paywalls. The logic for publishers was that you want to be discoverable by the many searchers on such an academic search engine, and you're not really giving anything away, because people still need subscriptions to read. But once Bing+LLM-type systems are the norm, these systems will tell you what is in those papers, and you can even ask specific questions about them...
Increasingly, I actually get far better results with Elicit than with Google Scholar, which is shocking to me since Google Scholar has the advantage in size and in the amount of full text indexed.
For example, I was writing a blog post on techniques to find seminal papers and in Elicit I got this
compare to the results I get with Google Scholar.
In fact, if I change the search wording, the GS results get better, but Elicit.org is clearly always as good and usually better. This is the power of Semantic Scholar, which Elicit and a few other new search engines build on.
One of the lesser-noticed things is the power that large language models give to search.
If you think about it, when you use ChatGPT you will notice that you can type roughly, with tons of typos, and it still understands you. In essence, large language models have achieved a high, near-human level in NLP (natural language processing).
If ChatGPT can understand you, surely this can be used to enhance the quality of your search.
Standard search engines, even Google, are still mostly keyword-based, typically running on Elasticsearch with TF-IDF, BM25, etc. They may do stemming, query expansion, and so on, but it is still mostly keyword matching. This is true even in Google Scholar (less so in Google).
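For the curious, BM25-style keyword scoring can be sketched in a few lines. The documents, query, and parameter values below are toy assumptions, not any engine's actual internals:

```python
# Minimal BM25 sketch over a toy corpus of tokenized "abstracts".
import math

docs = [
    "anxiety treatment with cognitive behavioral therapy".split(),
    "panic attack symptoms and causes".split(),
    "strength training and one repetition maximum".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def bm25_score(query, doc, k1=1.5, b=0.75):
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)  # how many docs contain the term
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarer term = higher weight
        tf = doc.count(term)                              # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "anxiety therapy".split()
ranked = sorted(docs, key=lambda d: bm25_score(query, d), reverse=True)
print(ranked[0])  # the CBT document: it matches both query terms
```

Note that the "panic attack" document scores exactly zero for this query, even though it is topically related. That keyword-matching blind spot is precisely what the embedding-based approaches discussed next try to fix.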
In the last few years, more advanced techniques based on neural nets, in particular transformer-based models that produce representations in the form of contextual embeddings (e.g. encoder models like BERT, and specialized variants like SciBERT, PubMedBERT, etc.), have tended to produce state-of-the-art results.
Such search algorithms tend to be more expensive than standard keyword searches (sparse representations), so most search engines do two-stage ranking: the first stage uses a cheaper method to rank, say, the top 100, and then the "neural" part reranks those 100.
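A minimal sketch of that two-stage idea, with made-up stand-ins for both stages (a real first stage would be BM25-like, and a real second stage would be a neural reranker):

```python
def cheap_score(query, doc):
    # Stage 1: sparse keyword overlap -- fast enough to run over the whole corpus.
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    # Stage 2 stand-in for an expensive neural scorer: fraction of the
    # document made up of query terms (a toy density heuristic).
    words = doc.split()
    return sum(words.count(t) for t in query.split()) / len(words)

def search(query, corpus, k=100):
    # Cheap pass keeps only the top-k candidates...
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # ...then the expensive scorer reranks just those k.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)

corpus = [
    "anxiety is common",
    "treating anxiety with therapy",
    "unrelated gardening tips",
]
print(search("anxiety therapy", corpus, k=2))
```

The design trade-off is exactly the one described above: the expensive scorer never sees documents the cheap stage discarded, so a miss in stage one cannot be recovered in stage two.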
Elicit.org initially used the Semantic Scholar keyword API (a very capable search algorithm in its own right) before applying embedding matching between the top-ranked documents and the query. But of course, if the first step misses relevant results, no amount of advanced processing can help in the second step.
This is why Elicit.org recently switched to doing embedding matching even in the first step:
"We search our corpus of 115M papers from the Semantic Scholar Academic Graph dataset. We search for papers that are semantically similar to your question. This means that if you enter in the keyword “anxiety”, we’ll also return papers that include similar words, like “panic attack”, so you don’t need to know exactly the right keywords to search.
We perform semantic search by storing embeddings of the titles and abstracts using paraphrase-mpnet-base-v2 in a vector database; when you enter a question, we embed it using the same model then ask the vector database to return the 400 closest embeddings."
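The vector-database step in that quote can be illustrated with cosine similarity over made-up low-dimensional vectors. A real system would store ~768-dimensional embeddings produced by the named model; the 3-d numbers here are invented:

```python
import math

# Made-up 3-d "embeddings" of titles; a real system embeds title+abstract
# with a model like paraphrase-mpnet-base-v2 and stores the vectors.
paper_embeddings = {
    "Anxiety disorders: a review":  [0.9, 0.1, 0.0],
    "Panic attacks in adolescents": [0.8, 0.2, 0.1],  # related topic, different words
    "Soil chemistry of wetlands":   [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def nearest(query_vec, k=2):
    # "Ask the vector database to return the k closest embeddings."
    return sorted(paper_embeddings,
                  key=lambda t: cosine(query_vec, paper_embeddings[t]),
                  reverse=True)[:k]

query_vec = [0.85, 0.15, 0.05]  # pretend: the embedding of the query "anxiety"
print(nearest(query_vec))  # both anxiety-related papers, no keyword overlap needed
```

This is how "anxiety" can retrieve a "panic attack" paper: the query and the paper sit close together in embedding space even though they share no keywords.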
The top 400 then get reranked with even more advanced and expensive processing steps, notably including OpenAI's GPT-3 Babbage model, and then Elicit does a variety of things to extract various characteristics of papers, such as “outcome measured”, “intervention”, and “sample size”...
This is the layperson explanation of how it works, but it is more or less accurate.
If you go deeper into it, you will learn about positional embeddings, self-attention mechanisms (encoder models), and masked self-attention mechanisms (decoder-only models), which give you an idea of how it tries to "understand".
To be fair, depending on the type of model, there are other types of training, like instruction-based learning, etc.
It does appear magical. All we know is that when we train neural nets using transformer-type models with self-attention mechanisms etc., such magic appears when the dataset and compute are big enough.
It's hard to believe it is just "predicting the next word", or that learning how probable words are can let it seemingly understand things like "pretending..." to bypass guardrails.
Which is why there are quite a lot of papers arguing about whether these LLMs really "understand", whatever that means.
Perplexity is not just an LLM; it combines a search engine with an LLM. Very roughly speaking, the search engine ranks a set of documents that might answer the query. From the top-ranked documents, different passages are compared with the query to see which are likely relevant.
The top-ranked passages are then sent to the LLM with a prompt like "answer the query given the information below".
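That prompt-assembly step might look roughly like this. The prompt wording and the passages are invented for illustration, not Perplexity's actual prompt:

```python
# Paste the retrieved passages into the prompt, numbered so the LLM's
# answer can cite them. The LLM call itself is out of scope here.
def build_prompt(query, passages):
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the sources below, "
        "citing them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

passages = [
    "Passage one, extracted from the top-ranked document.",
    "Passage two, extracted from the second-ranked document.",
]
prompt = build_prompt("Does aerobic exercise improve fitness?", passages)
print(prompt)
```

Because the answer is generated from (and numbered against) the retrieved passages, the user can click through and verify each claim, which is exactly the advantage over a bare LLM.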
LLMs, like most machine learning systems, are tested against a series of standard benchmark suites; https://paperswithcode.com/area/natural-language-processing shows many of them, e.g. TruthfulQA.
There has been a gold rush toward making models bigger and bigger, because so far they have found no diminishing returns in performance on such test suites, which do include benchmarks on factual accuracy. There is debate over whether scaling up alone is all you need...
One of the most interesting unreleased systems by OpenAI was something called WebGPT https://openai.com/research/webgpt
They hooked the LLM up to a Bing API, and it was trained to search the web! It learned what keywords to use, when to go to the next page of results, when to change keywords, etc., until it was satisfied.
And yes, every new LLM now is tested on factual accuracy etc.
This isn't a problem with ChatGPT itself. This is Bing+GPT; my "horror" is that it is extracting results about studies from random layperson pages rather than the actual studies! It's like seeing papers that cite random blogs.
This could be fixed by restricting extraction to scholarly domains like Sciencedirect.com, Wiley.com, or Arxiv.org (if you're okay with preprints). In Bing you can even give natural-language commands like "restrict results to xyz.com only".
In Perplexity you need to do site:xyz.com
My view is that it will make generalists less and less valuable. I was going to write an article on something for work, but I quickly realized that because the topic was so mundane, ChatGPT or LLM-enhanced search engines could do it as well as I could, and much faster. So why bother?
But when I had an idea for a blog post, ChatGPT, despite my coaching, couldn't come up with anything much better than what was already on the net. This is because the ideas I want to write up are relatively novel and not known to the general web.
The trouble is, I predict people will get really lazy. Search+LLM is the next step beyond "just google it". People will start faking expertise, or at least gain a super-shallow understanding of things, which means they can't go beyond it...
Honestly, this fails to see that you won't be using ChatGPT or most language models alone.
The initial training used to create LLMs like ChatGPT is not for them to learn facts; it is for the NLP capabilities.
The facts and information will come from the search-retrieval part. Granted, there is fear that generative AI will mess up search-engine results, but you could always "teach" the LLM to value results from certain domains, or even set up your own whitelists or blacklists of domains you trust or don't trust.
Many thanks for your many elaborate thoughts on this topic!
Similarity is a very tricky concept, but central to the whole topic. It is actually not trivial why a similar-looking bibliographical data set is not the same as a bibliographical data set with similar entries.
What you have written is a technical perspective on how AI tries to deal with the problem of similarity.
Thanks so much for all your responses, @aarontay! I always feel more knowledgeable about library tools after I read your blog!
There are features I like about Google Scholar, and it is still my standard tool for the task of getting an initial overview of a particular academic author's publications, but it is generally very incompetent about weighting search results. I have to use another tool if what I want are the most relevant results about a topic or question. Search+LLM shows promise for the latter task.
I posted with feathers ruffled. I hate to see human work devalued, and your prediction about laziness is probably very accurate.
AI will be very useful until it's overused, which is probably already happening.
I read an article in The Daily Mail this morning about contraband substances. There were errors that might have been from machine reasoning.
The one I remember was a statement that something was "highly sort after." Clearly, "highly sought after" was the intent. I would think it was just a spellcheck thing, but there is the ordinal meaning of "highly" and "after", and I suppose "sort" could be assumed to fit that context.
Or, I could be spooked and see AI crouching in the shadows everywhere.
Your points are well taken.