For the sake of illustration, this issue is similar in nature to scaling graphics: when you scale a small image up to a higher resolution, more pixel data is required than actually exists, so the missing data has to be made up out of thin air by some sort of algorithm... which works fine for some graphics and not so well for others.
It's the same here. audio_sound_pitch (assuming that's what you're using) changes not only the pitch of the sound but also its tempo, so lowering the pitch also slows the sound down, increasing its length and therefore the amount of data needed to represent it. That extra data, again, has to be created out of thin air by an algorithm that may or may not produce the desired result in any given case, and can lead to artifacts like the one you're experiencing when there isn't enough data to accurately calculate the pitched-down version.
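To make the "data out of thin air" point concrete, here's a minimal Python sketch of one common approach to pitch shifting (resampling with linear interpolation). This is just an illustration of the general technique, not necessarily what GameMaker actually does internally:

```python
# Hypothetical sketch: resampling a sound to change its pitch.
# Pitching down by a factor of 0.5 doubles the length, so half the
# output samples don't exist in the source and must be interpolated.

def pitch_shift(samples, pitch):
    """Resample with linear interpolation; pitch < 1.0 lowers pitch
    and stretches the sound, pitch > 1.0 raises it and shortens it."""
    out_len = int(len(samples) / pitch)
    out = []
    for i in range(out_len):
        pos = i * pitch                    # fractional read position in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # A sample at a non-integer position doesn't exist in the source;
        # it's invented by blending the two nearest real samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

original = [0.0, 1.0, 0.0, -1.0]      # tiny made-up waveform
halved = pitch_shift(original, 0.5)   # twice as long; half the samples are guessed
print(len(original), len(halved))     # prints "4 8"
```

The interpolated values are only a guess at what the "in-between" samples would have been, which is exactly where the artifacts come from when the source doesn't contain enough detail.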
This should be much less of an issue the other way around: if your sound file naturally starts at a lower pitch and you raise its pitch programmatically, no data needs to be created out of thin air, only discarded. Using anything other than the source material still means data is either made up or truncated, and neither is guaranteed to work out perfectly, but the effects of truncation are far less noticeable.