While building parser, we found that we are creating huge list of tokens/nodes to build abstract syntax tree, and from abstract syntax tree we are creating IL code. Each of this list are very huge in size. And while trying to parse script with over 100K lines, parser would take more than 30 seconds.
And while profiling, we found two hot paths taking maximum time.
String.Substring by introducing
StringSpan, this will be covered in a different post.
Array.Resize is called by
List<T>.SetCapacity method. So we looked up source code of
List<T> to investigate further.
Internals of List
List<T> uses an internal array to store items, and when there is no space left in the array, it resizes the array (which basically creates a larger array and copies all items from old array to new array). Default size of the array is 16, and as the list grows, it keeps on doubling the capacity. So when you are adding 17th item on the array, the capacity is set to 32. And when you are adding 33rd item, the capacity is set to 64 and so on with little variance.
This does not hurt performance for small scripts. But when we started parsing larger scripts, it started slowing down exponentially. As an entire script with too many closures and many nested functions, the tree is very large and memory usage and GC pressure is very high.
So we decided to drop
List<T> and we created
Sequence<T> which is suitable for append and read once scenario. Most of the times, in parser we are creating a list by appending items but we are only reading them once. However, when array is resized, every node is copied from old array to new array making many enumerations.
Sequence<T> stores nodes of smaller array, and when there is no more free space, it creates a new node with a smaller array. Basically
Sequence<T> is a linked list of variable sized arrays. Default size is 16, but if you specify an initial capacity, it will create first node of same capacity. `Sequence will not resize any array. The only drawback is indexed access is slower. When a sequence needs to be read, it starts from first node till last and enumerates all array items in the node.
- It does not resize existing array
- It does not increase exponentially
- Appending is fast
- Enumerating is marginally slow but not at all noticeable.
- Indexed access is slow
- Insert is slow
Sequence is a basically a linked list of small arrays. Linked List are not suitable as smaller arrays will be localized while accessing and single node can have multiple array items. When
AddRange is called with known size, Sequence will allocate all items at once if they are more than capacity.