A systems perspective for why prompt engineering matters
aiIt finally clicked for me just why (at the systems level) transformers are so effective and how this is related to why prompt engineering is so important (in the sense that it actually yields massive performance differences):
yadda yadda yadda yes attention mechanism is important but why? because it enables weight sharing! in prev neural networks, once a model was trained, the weights were static and the only thing that was dynamic at inference time was the input, but because of the key, query, value construction (recall we do key * query for all of the neighboring context’s queries, then weight the values by these affinities) in the self-attention block, the weights are in an essence dynamically computed from the input (i.e. the context!), they are not static! why is this important?
well, we know in the abstract sense that these networks contain all these smaller sub-systems that capture their learned knowledge of different things. this construction (i.e. dynamically computing weights given the context) means that given good context, we can steer the network to more precisely (in a sense) “locate” the correct knowledge by “activating” the part (learned weights for the given context) that contains this knowledge! which brings us to why at the systems-level prompt engineering matters..
you have all these smaller sub-systems that live in the model’s weights that contain all this knowledge about different stuff, prompt engineering is really the art about (trying to) correctly trigger the subsystems (which have the corresponding knowledge for this particular input) to “fire”. Concretely, “fire” here refers to getting activated by drawing on whatever the corresponding weights are in the learned internal representation of the network. In some sense, at the systems level, good context produces good outputs become it means the models can tap into the correct corresponding learned representations for that particular input.
These crazy prompts you see “work” shift the distribution of produced results into ones where models can take advantage of the correct corresponding (given the context) learned representations in the network, and hence produce better outputs, very neat!