Watch the live stream:
Watch on YouTube
About the show
Sponsored by FusionAuth: pythonbytes.fm/fusionauth
Special guest: Ian Hellen
Brian #1: gensim.parsing.preprocessing
- Problem I’m working on
- Turn a blog title into a possible url
- example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”
- would like, perhaps: “twisted-testing-event-driven-asynchronous-applications”
- Sub-problem: remove stop words ← this is the hard part
- I started with an article called Removing Stop Words from Strings in Python
- It covered how to do this with NLTK, Gensim, and SpaCy
- I was most successful with remove_stopwords() from Gensim
- from gensim.parsing.preprocessing import remove_stopwords
- It’s part of the gensim.parsing.preprocessing module
- I wonder what’s all in there?
- a treasure trove
- gensim.parsing.preprocessing.preprocess_string is one
- this function applies filters to a string, with the defaults almost being just what I want:
- strip_tags()
- strip_punctuation()
- strip_multiple_whitespaces()
- strip_numeric()
- remove_stopwords()
- strip_short()
- stem_text() ← I think I want everything except this
- this one turns “Twisted” into “Twist”, not good.
- There are lots of other text processing goodies in there, too.
- Oh, yeah, and Gensim is also cool.
- topic modeling for training semantic NLP models
- So, I think I found a really big hammer for my little problem.
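To make the title-to-URL idea concrete, here is a minimal stdlib-only sketch of the same pipeline (lowercase, strip punctuation and digits, drop stop words and short words). It does not use Gensim: `slugify_title`, `min_len`, and the tiny `STOP_WORDS` set are hypothetical names made up for illustration, and Gensim's real stop word list is much larger.

```python
import re

# Hypothetical, deliberately tiny stop word list (an assumption for this
# sketch); Gensim's remove_stopwords() uses a far larger list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "or"}


def slugify_title(title: str, min_len: int = 3) -> str:
    """Turn a title into a URL slug, roughly mirroring the filters listed above."""
    # Lowercase, then replace anything that is not a-z or whitespace with a
    # space. This drops punctuation and digits (and non-ASCII letters too).
    cleaned = re.sub(r"[^a-z\s]", " ", title.lower())
    # split() collapses repeated whitespace; then drop stop words and short words.
    words = [w for w in cleaned.split() if w not in STOP_WORDS and len(w) >= min_len]
    return "-".join(words)


title = "Twisted and Testing Event Driven / Asynchronous Applications - Glyph"
slug = slugify_title(title)
print(slug)
```

Note that the trailing author name stays in the slug; trimming it would need a separate rule, which this sketch does not attempt. There is also no stemming step, so “twisted” stays “twisted”.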
Michael #2: DevDocs
- via Loic Thomson
- Gather and search a bunch of technology docs together at once
- For example: Python + Flask + JavaScript + Vue + CSS
- Has an offline mode for laptops / tablets
- Installs as a PWA (sadly not on Firefox)
Ian #3: MSTICPy
- MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks.
- What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it’s a real threat or not.
- Why Jupyter notebooks?
- SOC (Security Ops Center) tools can be excellent but all have limitations
- You can get data from anywhere
- Use custom analysis and visualizations
- Control the workflow, and the workflow is repeatable
- Open source pkg - created originally to support MS Sentinel Notebooks but now supports lots of providers. When I started this 3+ yrs ago I thought a lot of this would already be on PyPI - but no 😞
- MSTICPy has 4 main functional areas:
- Data querying - import log data (Sentinel, Splunk, MS Defender, others; Elasticsearch is in the works)
- Enrichment - is this IP Address or domain known to be malicious?
- Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)
- Visualization - more specialized than traditional graphs - timelines, process trees.
- All components use pandas, Bokeh for visualizations
- Current focus on usability, discovery of functionality, and being able to chain operations together
- Always looking for collaborators and contributors - code, docs, queries, critiques
- https://github.com/microsoft/msticpy
- https://msticpy.readthedocs.io/
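As a rough illustration of the “spike in logon failures” example mentioned under Analysis, here is a stdlib-only sketch. It does not use MSTICPy or its API; the `events` data is invented, and `find_failure_spikes` and its `threshold` parameter are hypothetical names for a simple mean-plus-standard-deviation cutoff.

```python
from collections import Counter
from statistics import mean, pstdev

# Invented log records for illustration: (hour of day, logon result).
events = (
    [(h, "failure") for h in range(8) for _ in range(2)]    # baseline: 2 failures/hour
    + [(8, "failure")] * 40                                 # burst of failures at hour 8
    + [(h, "success") for h in range(9) for _ in range(5)]  # normal successful logons
)


def find_failure_spikes(records, threshold=2.0):
    """Return hours whose failure count exceeds mean + threshold * std dev."""
    failures = Counter(hour for hour, result in records if result == "failure")
    counts = list(failures.values())
    cutoff = mean(counts) + threshold * pstdev(counts)
    return sorted(hour for hour, n in failures.items() if n > cutoff)


spikes = find_failure_spikes(events)
print(spikes)
```

Real investigations would pull the records from a log source and likely use pandas, as the notes above say MSTICPy does; this only shows the shape of the idea.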
Brian #4: The Right Way To Compare Floats in Python
- David Amos
- Definitely an easier read than the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic
- What many of us remember
- floating point numbers aren’t exact due to representation limitations and rounding error
- errors can accumulate
- comparison is tricky
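A small sketch of those three points, using only the standard library. `math.isclose` is one stdlib option for tolerant comparison; the notes above don't name a specific fix, so treat the choice and the tolerance values here as assumptions.

```python
import math

# Errors accumulate: adding 0.1 ten times does not land exactly on 1.0,
# because 0.1 has no exact binary representation and each addition rounds.
total = sum([0.1] * 10)
print(total == 1.0)                 # exact comparison

# math.isclose compares within a tolerance (default rel_tol=1e-09).
print(math.isclose(total, 1.0))

# Near zero a relative tolerance alone is not enough (abs_tol defaults to 0.0),
# so an absolute tolerance has to be passed explicitly.
diff = (0.1 + 0.2) - 0.3
print(math.isclose(diff, 0.0))
print(math.isclose(diff, 0.0, abs_tol=1e-9))
```

The right tolerance depends on the scale of the numbers involved, so the `1e-9` above is only a placeholder.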
- Be careful when comparing floating point numbers, even simple comparisons, like:
>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2