Talk: Speed Up Your Data Processing: Parallel and Asynchronous Programming in Python
Quick speaker notes on parallel and asynchronous programming in Python + FOSSASIA Summit 2020 talk reflection
Where: FOSSASIA Summit 2020
When: 20 March 2020
Location: Lifelong Learning Institute, Singapore
Resources used:
Recap
A recap of what I went through during my talk:
- Bottlenecks in a data science project
- Challenges in data processing in Python
  - For loops in Python are slow
  - List comprehensions are slightly better
  - Pandas is great for analytics but doesn’t quite scale for larger data
  - Spark cluster? Problem of “Small Big Data” + communication overhead across compute nodes
- What is parallel processing? What do you mean by asynchronous?
  - Sequential vs parallel processing
  - Synchronous vs asynchronous execution
- When should you go for parallelism?
  - Practical considerations
  - Is your code already optimised?
  - Problem architecture
  - Overhead in parallelism
  - Amdahl’s Law - not everything can be parallelized
- Multithreading vs Multiprocessing
  - Practical considerations
- Parallel + asynchronous programming for data processing
  - Data processing tends to be more compute-intensive
  - Most data folks use Python – the GIL does not allow threads to execute Python bytecode in parallel
- How to do parallel + asynchronous processing in Python?
  - concurrent.futures module in Python
  - ThreadPoolExecutor vs ProcessPoolExecutor
  - executor.submit() and executor.map() in concurrent.futures
- Putting Them All Together
  - Case: Network I/O Operations
  - Case: Image Processing
- Key Takeaways
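To make the `concurrent.futures` points in the recap a little more concrete, here is a minimal sketch of the two executors and the two ways of handing them work. It is not the code from the talk's demo repository: `fake_download`, `count_primes`, `run_with_map` and `run_with_submit` are hypothetical names made up for illustration, the "download" is simulated with `time.sleep()` rather than a real network call, and the worker counts and input sizes are arbitrary.

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def fake_download(item_id):
    """Hypothetical stand-in for a network call: sleeping releases the GIL, like waiting on I/O."""
    time.sleep(0.1)
    return item_id, f"payload-{item_id}"


def count_primes(limit):
    """Hypothetical CPU-bound task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count


def run_with_map(executor_cls, func, items, max_workers=4):
    """executor.map() yields results in the same order as `items`."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


def run_with_submit(executor_cls, func, items, max_workers=4):
    """executor.submit() returns Futures; as_completed() yields them as they finish."""
    results = {}
    with executor_cls(max_workers=max_workers) as executor:
        futures = {executor.submit(func, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    # I/O-bound work: threads can overlap the time spent waiting.
    start = time.perf_counter()
    downloads = run_with_submit(ThreadPoolExecutor, fake_download, range(8))
    print(f"threads: {len(downloads)} downloads in {time.perf_counter() - start:.2f}s")

    # CPU-bound work: separate processes are not limited by a shared GIL.
    limits = [20_000, 30_000, 40_000, 50_000]
    start = time.perf_counter()
    counts = run_with_map(ProcessPoolExecutor, count_primes, limits)
    print(f"processes: {counts} in {time.perf_counter() - start:.2f}s")
```

Both helpers take the executor class as an argument, so the same code can be tried with `ThreadPoolExecutor` and `ProcessPoolExecutor` and timed on your own workload. Whether the process pool actually wins depends on how much work each task does compared with the cost of starting processes and pickling arguments and results.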
Slides
GitHub Repository for Demo
hweecat/talk_parallel-async-python
Video
Speed Up Your Data Processing: Parallel and Asynchronous Programming in Python
Quick reflection
Year 2020 started out bad. Real bad. The COVID-19 pandemic led to travel restrictions and a string of cancellations for tech events (including PyCon). Because of travel restrictions and Business Continuity Plans, many speakers found at the eleventh hour that they could not deliver their talks in person, participation dropped by around 90%, and even the FOSSASIA organizers could not make it to the venue. Still, the organizers decided to proceed with the event as a mix of offline and online talks, with live streaming and chats.
The first version of this talk was given as a gentle introduction to parallel processing for data science at the Python User Group Singapore meetup last August. Based on the warm reception and feedback from the audience, I revamped the talk content to include more in-depth points on parallel programming for network I/O operations and multithreading. When crafting the revamped talk, I assumed that the audience had a basic understanding of Python and of data processing in data science: for loops, list comprehensions, pandas DataFrames, etc.
It did feel a bit weird giving a talk to a room of fewer than 10 people and a livestream audience of I-have-no-idea-how-many, but I did my show anyway, complete with loads of animated hand-flailing and coffee analogies.
Things that could have been improved
- I may have gone slightly beyond 25 minutes for my talk - but still within schedule!
- Can I just emphasize that it feels kinda weird not being able to gauge audience feedback when you’re speaking to a livestream audience and you don’t know how many people are actually watching your talk?
- Slight technical issues with the livestream caused a brief interruption in the middle.
- I wish there were a way to engage the livestream audience too, but I guess I’ll have to get used to speaking without knowing whether the audience is following or enjoying my talk.
- I swear I use so many “so……” in my talks that I suspect it’s becoming a hindrance to my speaking goals. Time to fix that.
Things that went well
- My talk went well without major hiccups and there were more than 5 persons in the room. Considering the greatly reduced physical attendance and social distancing measures, I guess having more than 5 persons in the room isn’t too bad.
- Ben Sadeghi (the really cool Databricks solution architect who spoke after me) likes my talk!
- Questions! Having people in the audience asking relevant questions is a good indicator of interest and engagement - I like that.