TIL: DataFrame reshaping in Pandas - melt, unstack

Written on February 20, 2020
As a data engineer, part of my daily work involves performing data processing and manipulation on raw data into data that is ready for analysis. As my development team primarily uses Python for our data science workflow, we often use Pandas to perform operations and transformations on datasets before analysing the data. While we primarily use Pandas for data cleaning and engineering as part of the data science process, sometimes we also have to perform complex data transformations to obtain actionable insights that business users can leverage on to improve their processes.
Read More

Contributing to pandas documentation for the first time - and lessons learnt

Written on November 8, 2019
To mark my first year as a data engineer, I started thinking "How can I contribute back to the community that enables my work for the past year?" I came across an open issue on documentation for pandas, a popular open-source Python library for data analysis and manipulation, and decided to give it a try. Here's a work-in-progress developer log on the lessons learnt through contributing documentation to an open-source Python library for the first time - and how contributing to open-source projects is not as scary as it might seem to be.
Read More

Understanding Python Dependency Management using pideptree

Written on October 13, 2019
Dependency management is important, as packages depend on versions of other core packages in order to run as intended. Typically in a Python project, dependencies are downloaded using a requirements.txt file, which lists the packages and their dependencies as a flat file. While the package versions are included in the requirements.txt file, the dependency relationships are not explicitly stated.
Read More

TIL: Migrating Git repositories manually from GitLab to Azure DevOps (TFS)

Written on October 12, 2019
My development team has been using GitLab on-premise to manage their code repositories. As we are moving development work to our new on-premise development cloud with expanded processing capabilities, we also need to migrate our code repositories to the new development cloud which uses Azure DevOps Team Foundation Server (TFS) for Git workflows. To support the chief architect in the migration, I prepared a quick migration guide for the team's move to Azure DevOps TFS.
Read More

Musings about Remote Development with Visual Studio Code

Written on August 22, 2019
Keeping codes and configuration files in sync between client machine and remote server used to be a drawn-out exercise in personal responsibility via SFTP/SCP. VS Code Remote Development looks to change that - for the better. Here's my notes on VS Code Remote Development.
Read More

Accelerating Batch Processing of Images in Python — with gsutil, numba and concurrent.futures

Written on May 27, 2019
In a data science project, one of the biggest bottlenecks (in terms of time) is the constant wait for the data processing code to finish executing. Sometimes, the gigantic execution times even end up making the project infeasible and often forces a data scientist to work with only a subset of the entire dataset, depriving the data scientist of insights and performance improvements that could be obtained with a larger dataset.
Read More