11 thoughts on “Natural Language Processing has been overrun by large neural language models! What should we make of that?”
Hello Chris, thank you for the wonderful talk. I was wondering whether, with these hidden layers representing structures, there is any sense in which syntactic ambiguity is represented – ambiguities such as attachment ambiguity or garden path effects. Thank you!
Hi Don, thanks! I unfortunately don’t know a good answer to your question. I guess this partly shows that big transformer models were only first built in 2018, and there are still a lot of unexplored questions. The success of recovering dependency parses from high levels of a transformer model shows that they have converged to a particular syntactic structure for a sentence in context. You’d tend to think that in lower levels of the network you would see the network considering (probably simultaneously) multiple possible analyses, but I can’t think of any work that has actually shown that happening. It could be a research project for you!
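If you wanted to start poking at this yourself, a very rough first experiment might look something like the sketch below. This assumes the HuggingFace transformers library and a BERT checkpoint, and the particular sentences and the mean-pooled cosine-similarity comparison are purely illustrative choices of mine, not anything from the talk:

```python
# Rough sketch: compare layer-wise representations of an ambiguous sentence
# against two disambiguated paraphrases, to see at which layers the ambiguous
# sentence sits "between" the two readings.
# Assumes: pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_vectors(sentence):
    """Mean-pooled hidden state of the sentence at every layer of the model."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: tuple of (num_layers + 1) tensors of shape (1, seq_len, dim)
    return [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]

ambiguous = "The spy saw the cop with the binoculars."
reading_a = "Using the binoculars, the spy saw the cop."            # instrument attachment
reading_b = "The spy saw the cop who was holding the binoculars."   # NP attachment

cos = torch.nn.CosineSimilarity(dim=0)
for layer, (v_amb, v_a, v_b) in enumerate(
        zip(layer_vectors(ambiguous), layer_vectors(reading_a), layer_vectors(reading_b))):
    print(f"layer {layer:2d}  sim(ambiguous, A) = {cos(v_amb, v_a).item():.3f}"
          f"  sim(ambiguous, B) = {cos(v_amb, v_b).item():.3f}")
```

If lower layers really do keep multiple analyses in play, you might hope to see the ambiguous sentence sit roughly equidistant between the two readings low in the network and drift towards one of them higher up – though mean pooling is a blunt instrument, and a proper probing classifier would be a better tool.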
Thanks – it’s another interesting project to add to the list!
If I may ask another question – it is not always clear (at least to me) whether these models can be “degraded” in some way to more accurately represent human processing, or, if they can, which parameters would make the most sense to impair. I have heard that in some types of neural networks there are ways to degrade specific parameters to simulate a variety of (human) disorder states. Do you know if anyone has looked into how we could make one of these more complex Transformer language models less perfect, to reflect actual/disordered human language processing?
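To make concrete the kind of “degrading” I have in mind, here is a rough sketch of the crudest version I can imagine – it assumes a HuggingFace GPT-2 checkpoint, and adding Gaussian noise to the attention weights of a few middle layers is just one arbitrary choice of perturbation, not something I have seen done in the literature:

```python
# Crude illustration of "impairing" a pretrained language model by perturbing
# specific parameters -- here, adding Gaussian noise to the attention projection
# weights of a few GPT-2 layers -- and measuring how its predictions change.
# Assumes: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def lesion(model, layers=(4, 5, 6), noise_std=0.05):
    """Add Gaussian noise to the attention weights of the chosen transformer blocks."""
    with torch.no_grad():
        for i in layers:
            w = model.transformer.h[i].attn.c_attn.weight
            w.add_(noise_std * torch.randn_like(w))

def surprisal(sentence):
    """Average per-token negative log-likelihood under the (possibly lesioned) model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

sentence = "The keys to the cabinet are on the table."
print("intact:  ", surprisal(sentence))
lesion(model)
print("lesioned:", surprisal(sentence))
```

Obviously which parameters to perturb, and how, is exactly the question – noise on attention weights is only one of many possibilities (pruning heads, zeroing whole layers, truncating the context window, etc.).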
I was interested in your point in the first half of the talk about further investigation of the Chomsky hierarchy. There is a lot of interesting work in mathematical/computational linguistics, especially in phonology, but also in syntax, looking at subregular languages (e.g., the work of Jeffrey Heinz, Jane Chandlee, Adam Jardine, Bill Idsardi, Thomas Graf, Greg Kobele, Jim Rogers, etc.). I’d be curious to hear your thoughts on the relative merits of work like this that is more mathematical/algebraic in nature vs. trying to investigate the Chomsky hierarchy further with neural models of language. Given that neural models of language are largely black boxes (i.e., it’s not clear what they learn), it strikes me that the former kind of investigation is much more likely to provide insights into and understanding of further aspects of the Chomsky hierarchy, but I’d love to hear your thoughts on the matter.
I think this work has a lot of merit. I don’t know all the work as well as I maybe should. Nevertheless, to the extent that the high-level direction is that we should be examining subclasses of regular languages that are appropriate for modeling human linguistic production, that is also exactly what they are doing with things like strictly piecewise languages. Our emphasis was indeed more neural-network-first, since the starting point was that these recurrent neural networks seem to do great at modeling recursive language constructions, when you might have thought that they can’t, and we wanted to explain that. But I think there is equal value in trying to describe appropriate subsets as formal languages and modeling them as regular languages. Indeed, to the extent that I suggested that it’s useful to examine memory-bounded handling of language classes, this is exactly the question that Heinz and Rogers are also asking with their Factored Deterministic Automata.
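For anyone who hasn’t run into strictly piecewise languages: a string is in the language just in case it contains none of a finite set of forbidden subsequences – subsequences rather than substrings, so arbitrary material may intervene. A toy membership checker looks like the sketch below (the alphabet and the forbidden pattern, loosely inspired by sibilant harmony, are my own toy example rather than anything from the papers mentioned):

```python
# Toy membership test for a strictly piecewise (SP) language:
# a string is in the language iff it contains NO forbidden subsequence.
# Subsequences, unlike substrings, allow arbitrary intervening material.

def contains_subsequence(string, pattern):
    """True if the symbols of `pattern` occur in `string` in order, gaps allowed."""
    it = iter(string)
    return all(symbol in it for symbol in pattern)

def in_sp_language(string, forbidden_factors):
    """Membership in the SP language defined by a set of forbidden subsequences."""
    return not any(contains_subsequence(string, f) for f in forbidden_factors)

# Toy "sibilant harmony": forbid an s followed anywhere later by S, and vice versa.
forbidden = {("s", "S"), ("S", "s")}

print(in_sp_language("sokosu", forbidden))   # True: all sibilants agree
print(in_sp_language("sokoSu", forbidden))   # False: s ... S, however far apart
```

Part of the appeal is that such languages can be recognized, and learned, while tracking only which of the relevant subsequences have been seen so far – a very modest, memory-bounded kind of computation.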
Back in the 80s, Fodor & Pylyshyn criticized the connectionist models of that era on a priori grounds. Especially relevant to your argument that the new language models might be ‘discovering’ syntactic structure is their argument that the old PDP models could not in principle represent any kind of constituency structure, even if they could represent a similar-looking network of causal or probabilistic dependencies. Do you think the architectures of the newer models overcome those concerns in a principled way, or is it just that their performance looks so good that the a priori issues seem moot?
Hi Roman, well, this isn’t an issue that you can do full justice to in a short response, but basically I do think they were wrong. I think even contemporaneously there were fairly convincing arguments that they were wrong. I remember Smolensky’s paper The Constituent Structure of Connectionist Mental States: A Reply to Fodor and Pylyshyn, and Chalmers’s paper Connectionism and compositionality: Why Fodor and Pylyshyn were wrong. But I also think that there are now very different models. While recurrent neural networks were already available in the 1980s, much of the debate was relative to feedforward neural networks. At any rate, since 2014, neural networks in NLP have been transformed in functionality by the introduction of the idea of attention – used very extensively by transformer networks – which gives a much more direct means of encoding dependencies, or notions like symbol binding, than is present in feedforward networks.
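For concreteness, the core attention operation is only a few lines – roughly the sketch below, in plain NumPy, with a single head and no masking or learned projections – and I think seeing it written out makes it easier to read the attention weights as soft, content-based links between positions, which is much closer to something like symbol binding than anything a plain feedforward network offers:

```python
# Minimal single-head scaled dot-product attention (no projections, no masking).
# Each position's output is a weighted mix of all positions' values, with the
# weights given by a softmax over query-key similarities -- a soft, content-based
# "link" from one token to every other token.
import numpy as np

def attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d). Returns (outputs, attention weights)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V, weights

# Tiny example: 4 token vectors of dimension 8 attending to one another.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, w = attention(X, X, X)
print(w.round(2))   # each row sums to 1: how strongly each token attends to the others
```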