A guide to garbage collection in python

Christian Abrokwa
5 min readDec 14, 2022

Programmer: “Ah crap, system memory is full time to dealloacte pointers. *Deallocates pointers*. That was draining, back to programming. Oh no, I deallocated a pointer that was still in session. Damn, I have to reallocate it all over again. Can we automate this process, so I only have to focus on code?

Garbage Collection:

Outline

  • What is Garbage Collection?
  • Why do we need Garbage Collection?
  • Types of Garbage Collection
  • Garbage Collection in Python
  • Did you know?
  • Glossary
  • Links
  • Contact

What is garbage collection?

Garbage collection is an automatic memory recovery feature built into programming languages. It relinquishes memory space that has been allocated to objects which are no longer needed by the program. This ensures that a program does not exceed its memory quota nor reach a point where it can no longer function. Also, it allows the developer to focus on programming, which reduces the potential for memory-related bugs.

Garbage collection is triggered when one of the following conditions is met:

  • The system has low physical memory.
  • Memory that’s used by allocated objects on a heap(memory stack) surpasses an acceptable threshold.
  • The collection method in the garbage collector module is called.

Why do we need Garbage Collection?

Before GC (Garbage Collection), developers had to manually allocate and deallocate memory for objects. This led to errors such as:

  • Dangling pointers: When a programmer deallocates a memory that still has pointers referencing it.
  • Double-free bugs: When a programmer attempts to deallocate a region of memory that has already been freed.
  • Memory leaks: When a programmer fails to deallocate a memory occupied by objects that have become unreachable. This often leads to memory exhaustion.

These errors made it a pressing need for programmers to develop an automatic way of deallocating and allocating memory and that’s what led to the Garbage Collection we see today. Although most programming languages implement GC, there are a few which still require the programmer to allocate and deallocate memory, popular among those are C and C++.

Types of Garbage Collection

There are two main types. Tracing GC and Reference Counting.

Tracing GC

Tracing GC is so popular that it’s the de facto term for Garbage Collection. When an individual mentions Garbage Collection is more than likely, they are referencing Tracing Garbage Collection.

Under Tracing GC, there are two main algorithms.

Mark & Sweep

Mark & Sweep algorithm
Each object in memory has a flag reserved for garbage collection. At the init state, every object in the system has an “in-use” value of False. The mark stage does a tree traversal of the entire root and marks the flag of each reachable object as True. In the sweep stage, all memory is scanned from start to finish and unreachable objects are collected. Unreachable objects have an “in-use” value of True. After this, the flag of reachable objects still in the system is updated to False.

Tri-color marking

There are three sets. White set (Objects that are candidates for having their memory recycled). Black set (Objects that can be referenced from the root but have no references to the white set). Gray set (Objects reachable from the root but yet to be scanned for references to white objects). At the init state, the black set is empty, the gray set is filled with objects which are directly referenced from the roots. The white set contains the rest of the objects. The algorithm picks an object from the gray set and moves it to the black set. Move each white object it references to the gray set. Repeat the last two steps until the gray set is empty. When the gray set is empty, the scan is complete. Black objects are reachable from the roots, while white objects aren’t and can be garbage-collected.

There are many implementations of Tracing Garbage collection, but the most popular is the generational Garbage Collection. It works by utilizing the infant mortality principle.

“Infant mortality or the generational hypothesis is the observation that, in most cases, young objects are much more likely to die than old objects.” — Anonymous

In a programming context, infant mortality says that most objects are discarded shortly after being used. Generational GC implements this by partitioning objects into different generations (time intervals) based on the time of allocation, and giving them different GC policies depending on age.

Reference Counting

With reference counting, every object has a count of the number of references to it. Garbage is identified by having a reference count of zero. When the count reaches zero, the object’s memory is automatically reclaimed.

The reference count of an object increases when

  • An object is assigned to a literal with the use of the assignment operator.
  • An object is passed as an argument to a function.
  • An object is appended to a list. (The objects reference count increases not the list).

The reference count of an object decreases when

  • Del is used on an object.
  • An object is assigned to the None keyword.
  • The object referencing it reassigns the reference to a different object.
  • The object referencing it goes out of scope.

Advantage of Reference Counting

  • Objects are reclaimed as soon as they can no longer be referenced, without long pauses in the system. This makes it ideal for real-time applications, such as air traffic control systems and autonomous driving systems.

Disadvantages of Reference Counting

  • Cycles: Two objects referencing each other can create a cycle whereby neither will be collected as their mutual references never let their reference counts become zero. If object A refers to object B and vice versa. Even if A and B become unreachable from the rest of the program, their reference count will never reach zero and won’t be collected, leading to a potential memory leak.
  • Space overhead: It requires space to be allocated to each object to store its reference count.
  • Speed overhead: Incrementing and decrementing a reference required modifications to the reference counter.

Due to the limitations of both approaches, many Garbage Collectors use a reference counter implementation with a Tracing Garbage Collector as a backup.

Garbage Collection in Python

In python, everything is an object, and each object consists of an object type, object value, and a reference count. When assigning a value to a variable, python automatically detects the type and increments the reference count. Python uses both reference counting and generational garbage collector known as GC module. The reference counting module is fundamental to Python and can’t be disabled, whereas the cyclic GC is optional and can be triggered manually. We’ll look at both of them.

Reference Counting

Garbage collection with GC module

Did you know?

Because of the inefficiencies(memory) of Garbage Collectors, Apple has never implemented it in their programming languages.

Glossary

Reachable objects: Objects that can be accessed by a reference object, either directly or through references from other reachable objects. So, objects are reachable if it’s a GC root or is referenced from a reachable object.

Unreachable objects. There are no references to that object.

Garbage: Useless/unreachable objects.

GC roots: Special objects to which all other objects are referenced and considered alive.

Further Reading

Official Documentation

Cyclic Referencing

Python Garbage Collection

Python Memory Management

Why do variables have a reference count of 4

Contact me

If you found this article helpful, give me some claps and if you want to be updated with my latest articles, follow me on Medium. You can connect with me on LinkedIn, and email me at cabrokwa11@gmail.com!

--

--

Christian Abrokwa

I write about Software[Backend], System Architecture and Algorithms mixed with a bit of philosophy.