Using a Python Markdown AST to Find All Paragraphs
==================================================

Date: February 5, 2022

While looking for a way to automatically generate descriptions for pages, I stumbled onto a Markdown AST in Python. It lets me walk the Markdown page and collect only the paragraph text. This skips headings and code fences. Paragraphs nested inside blockquotes or list items are still paragraph nodes, so their text is included.

``` python
import commonmark
import frontmatter

post = frontmatter.load("post.md")
parser = commonmark.Parser()
ast = parser.parse(post.content)

paragraphs = ''
for node in ast.walker():
    # the walker yields tuples; the node itself is the first item
    # literal is None when a paragraph opens with emphasis, a link, etc.
    if node[0].t == "paragraph" and node[0].first_child.literal is not None:
        paragraphs += " "
        paragraphs += node[0].first_child.literal
```

It's also super fast. Previously I was rendering to HTML and using BeautifulSoup to get only the paragraphs. Using the commonmark AST was about 5x faster on my site.

### Duplicate Paragraphs

When I originally wrote this post, I did not realize that commonmark duplicates nodes. I still do not understand why, but I have had success deduplicating them based on the source position of the node with the snippet below.
``` python
from itertools import compress

import commonmark
import frontmatter

post = frontmatter.load("post.md")
parser = commonmark.Parser()
ast = parser.parse(post.content)

# find all paragraph nodes
paragraph_nodes = [
    n[0]
    for n in ast.walker()
    if n[0].t == "paragraph" and n[0].first_child.literal is not None
]

# for reasons unknown to me commonmark duplicates nodes, dedupe based on sourcepos
sourcepos = [p.sourcepos for p in paragraph_nodes]
# find first occurrence of node based on source position
unique_mask = [sourcepos.index(s) == i for i, s in enumerate(sourcepos)]
# deduplicate paragraph_nodes based on unique source position
unique_paragraph_nodes = list(compress(paragraph_nodes, unique_mask))
paragraphs = " ".join([p.first_child.literal for p in unique_paragraph_nodes])
```
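
A possible explanation for the duplicates, offered as an assumption rather than something confirmed in this post: the tuples coming out of `ast.walker()` appear to be `(node, entering)` pairs, and container nodes such as paragraphs are visited once on the way in and once on the way out. If that holds, keeping only the entering visit avoids the duplicates without comparing source positions. The sketch below is written against any iterable of such pairs. `paragraph_text` and the stand-in nodes are hypothetical names used only for illustration.

``` python
from types import SimpleNamespace


def paragraph_text(events):
    """Join the leading text of each paragraph, counting each paragraph once.

    `events` is any iterable of (node, entering) pairs. Assumption: this is
    the shape the commonmark walker yields, with container nodes visited
    on entry and again on exit.
    """
    texts = []
    for node, entering in events:
        # skip the exit visit so each paragraph is handled only once
        if not entering or node.t != "paragraph":
            continue
        first = node.first_child
        # literal is None when the paragraph opens with emphasis, a link, etc.
        if first is not None and first.literal is not None:
            texts.append(first.literal)
    return " ".join(texts)


# stand-in nodes (hypothetical) that mimic what the walker is assumed to yield
def _para(text):
    return SimpleNamespace(t="paragraph", first_child=SimpleNamespace(literal=text))


heading = SimpleNamespace(t="heading", first_child=SimpleNamespace(literal="Title"))
first, second = _para("Hello there."), _para("Second one.")
events = [
    (heading, True), (heading, False),
    (first, True), (first, False),
    (second, True), (second, False),
]
description = paragraph_text(events)
```

With a real document, the same function would be called as `paragraph_text(ast.walker())`. Like the snippets above, it only reads the first child of each paragraph, so text after an inline link or emphasis is not included.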