pdoc.markdown2
A fast and complete Python implementation of Markdown.
[from http://daringfireball.net/projects/markdown/]
Markdown is a text-to-HTML filter; it translates an easy-to-read / easy-to-write structured text format into HTML. Markdown's text format is most similar to that of plain text email, and supports features such as headers, emphasis, code blocks, blockquotes, and links.
Markdown's syntax is designed not as a generic markup language, but specifically to serve as a front-end to (X)HTML. You can use span-level HTML tags anywhere in a Markdown document, and you can use block level HTML tags (like <div> and <table> as well).
Module usage:
    >>> import markdown2
    >>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
    u'<p><em>boo!</em></p>\n'

    >>> markdowner = Markdown()
    >>> markdowner.convert("*boo!*")
    u'<p><em>boo!</em></p>\n'
    >>> markdowner.convert("**boom!**")
    u'<p><strong>boom!</strong></p>\n'
This implementation of Markdown implements the full "core" syntax plus a number of extras (e.g., code syntax coloring, footnotes) as described on https://github.com/trentm/python-markdown2/wiki/Extras.
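For example, extras are enabled by passing the documented `extras` argument to `markdown()` or the `Markdown` constructor (an illustrative sketch; the exact HTML formatting may vary by version):

    >>> markdown2.markdown("# Section\n", extras=["header-ids"])
    u'<h1 id="section">Section</h1>\n'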
# fmt: off
# flake8: noqa
# type: ignore
# Taken from here: https://github.com/trentm/python-markdown2/tree/6269c1f5f5e812f85ffb8524b8bf10b615579abf

#!/usr/bin/env python
# Copyright (c) 2012 Trent Mick.
# Copyright (c) 2007-2008 ActiveState Corp.
# License: MIT (http://www.opensource.org/licenses/mit-license.php)

r"""A fast and complete Python implementation of Markdown.

[from http://daringfireball.net/projects/markdown/]
> Markdown is a text-to-HTML filter; it translates an easy-to-read /
> easy-to-write structured text format into HTML. Markdown's text
> format is most similar to that of plain text email, and supports
> features such as headers, *emphasis*, code blocks, blockquotes, and
> links.
>
> Markdown's syntax is designed not as a generic markup language, but
> specifically to serve as a front-end to (X)HTML. You can use span-level
> HTML tags anywhere in a Markdown document, and you can use block level
> HTML tags (like <div> and <table> as well).

Module usage:

    >>> import markdown2
    >>> markdown2.markdown("*boo!*")  # or use `html = markdown_path(PATH)`
    u'<p><em>boo!</em></p>\n'

    >>> markdowner = Markdown()
    >>> markdowner.convert("*boo!*")
    u'<p><em>boo!</em></p>\n'
    >>> markdowner.convert("**boom!**")
    u'<p><strong>boom!</strong></p>\n'

This implementation of Markdown implements the full "core" syntax plus a
number of extras (e.g., code syntax coloring, footnotes) as described on
<https://github.com/trentm/python-markdown2/wiki/Extras>.
"""

cmdln_desc = """A fast and complete Python implementation of Markdown, a
text-to-HTML conversion tool for web writers.

Supported extra syntax options (see -x|--extras option below and
see <https://github.com/trentm/python-markdown2/wiki/Extras> for details):

* break-on-newline: Replace single new line characters with <br> when True
* code-friendly: Disable _ and __ for em and strong.
* cuddled-lists: Allow lists to be cuddled to the preceding paragraph.
* fenced-code-blocks: Allows a code block to not have to be indented
  by fencing it with '```' on a line before and after. Based on
  <http://github.github.com/github-flavored-markdown/> with support for
  syntax highlighting.
* footnotes: Support footnotes as in use on daringfireball.net and
  implemented in other Markdown processors (tho not in Markdown.pl v1.0.1).
* header-ids: Adds "id" attributes to headers. The id value is a slug of
  the header text.
* highlightjs-lang: Allows specifying the language which is used for syntax
  highlighting when using fenced-code-blocks and highlightjs.
* html-classes: Takes a dict mapping html tag names (lowercase) to a
  string to use for a "class" tag attribute. Currently only supports "img",
  "table", "pre" and "code" tags. Add an issue if you require this for other
  tags.
* link-patterns: Auto-link given regex patterns in text (e.g. bug number
  references, revision number references).
* markdown-in-html: Allow the use of `markdown="1"` in a block HTML tag to
  have markdown processing be done on its contents. Similar to
  <http://michelf.com/projects/php-markdown/extra/#markdown-attr> but with
  some limitations.
* metadata: Extract metadata from a leading '---'-fenced block.
  See <https://github.com/trentm/python-markdown2/issues/77> for details.
* nofollow: Add `rel="nofollow"` to all `<a>` tags with an href. See
  <http://en.wikipedia.org/wiki/Nofollow>.
* numbering: Support of generic counters. Non standard extension to
  allow sequential numbering of figures, tables, equations, exhibits etc.
* pyshell: Treats unindented Python interactive shell sessions as <code>
  blocks.
* smarty-pants: Replaces ' and " with curly quotation marks or curly
  apostrophes. Replaces --, ---, ..., and . . . with en dashes, em dashes,
  and ellipses.
* spoiler: A special kind of blockquote commonly hidden behind a
  click on SO. Syntax per <http://meta.stackexchange.com/a/72878>.
* strike: text inside of double tilde is ~~strikethrough~~
* tag-friendly: Requires atx style headers to have a space between the # and
  the header text. Useful for applications that require twitter style tags to
  pass through the parser.
* tables: Tables using the same format as GFM
  <https://help.github.com/articles/github-flavored-markdown#tables> and
  PHP-Markdown Extra <https://michelf.ca/projects/php-markdown/extra/#table>.
* toc: The returned HTML string gets a new "toc_html" attribute which is
  a Table of Contents for the document. (experimental)
* use-file-vars: Look for an Emacs-style markdown-extras file variable to turn
  on Extras.
* wiki-tables: Google Code Wiki-style tables. See
  <http://code.google.com/p/support/wiki/WikiSyntax#Tables>.
* xml: Passes one-liner processing instructions and namespaced XML tags.
"""

# Dev Notes:
# - Python's regex syntax doesn't have '\z', so I'm using '\Z'. I'm
#   not yet sure if there are implications with this. Compare 'pydoc sre'
#   and 'perldoc perlre'.

__version_info__ = (2, 4, 3)
__version__ = '.'.join(map(str, __version_info__))
__author__ = "Trent Mick"

import sys
import re
import logging
from hashlib import sha256
import optparse
from random import random, randint
import codecs
from collections import defaultdict


# ---- Python version compat

# Use `bytes` for byte strings and `unicode` for unicode strings (str in Py3).
if sys.version_info[0] <= 2:
    py3 = False
    try:
        bytes
    except NameError:
        bytes = str
    base_string_type = basestring
elif sys.version_info[0] >= 3:
    py3 = True
    unicode = str
    base_string_type = str

# ---- globals

DEBUG = False
log = logging.getLogger("markdown")

DEFAULT_TAB_WIDTH = 4


SECRET_SALT = bytes(randint(0, 1000000))
# MD5 function was previously used for this; the "md5" prefix was kept for
# backwards compatibility.
def _hash_text(s):
    return 'md5-' + sha256(SECRET_SALT + s.encode("utf-8")).hexdigest()[32:]

# Table of hash values for escaped characters:
g_escape_table = dict([(ch, _hash_text(ch))
                       for ch in '\\`*_{}[]()>#+-.!'])

# Ampersand-encoding based entirely on Nat Irons's Amputator MT plugin:
# http://bumppo.net/projects/amputator/
_AMPERSAND_RE = re.compile(r'&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)')


# ---- exceptions
class MarkdownError(Exception):
    pass


# ---- public api

def markdown_path(path, encoding="utf-8",
                  html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
                  safe_mode=None, extras=None, link_patterns=None,
                  footnote_title=None, footnote_return_symbol=None,
                  use_file_vars=False):
    fp = codecs.open(path, 'r', encoding)
    text = fp.read()
    fp.close()
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars).convert(text)


def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
             safe_mode=None, extras=None, link_patterns=None,
             footnote_title=None, footnote_return_symbol=None,
             use_file_vars=False, cli=False):
    return Markdown(html4tags=html4tags, tab_width=tab_width,
                    safe_mode=safe_mode, extras=extras,
                    link_patterns=link_patterns,
                    footnote_title=footnote_title,
                    footnote_return_symbol=footnote_return_symbol,
                    use_file_vars=use_file_vars, cli=cli).convert(text)


class Markdown(object):
    # The dict of "extras" to enable in processing -- a mapping of
    # extra name to argument for the extra. Most extras do not have an
    # argument, in which case the value is None.
    #
    # This can be set via (a) subclassing and (b) the constructor
    # "extras" argument.
    extras = None

    urls = None
    titles = None
    html_blocks = None
    html_spans = None
    html_removed_text = "{(#HTML#)}"  # placeholder removed text that does not trigger bold
    html_removed_text_compat = "[HTML_REMOVED]"  # for compat with markdown.py

    _toc = None

    # Used to track when we're inside an ordered or unordered list
    # (see _ProcessListItems() for details):
    list_level = 0

    _ws_only_line_re = re.compile(r"^[ \t]+$", re.M)

    def __init__(self, html4tags=False, tab_width=4, safe_mode=None,
                 extras=None, link_patterns=None,
                 footnote_title=None, footnote_return_symbol=None,
                 use_file_vars=False, cli=False):
        if html4tags:
            self.empty_element_suffix = ">"
        else:
            self.empty_element_suffix = " />"
        self.tab_width = tab_width
        self.tab = tab_width * " "

        # For compatibility with earlier markdown2.py and with
        # markdown.py's safe_mode being a boolean,
        #   safe_mode == True -> "replace"
        if safe_mode is True:
            self.safe_mode = "replace"
        else:
            self.safe_mode = safe_mode

        # Massaging and building the "extras" info.
        if self.extras is None:
            self.extras = {}
        elif not isinstance(self.extras, dict):
            self.extras = dict([(e, None) for e in self.extras])
        if extras:
            if not isinstance(extras, dict):
                extras = dict([(e, None) for e in extras])
            self.extras.update(extras)
        assert isinstance(self.extras, dict)

        if "toc" in self.extras:
            if "header-ids" not in self.extras:
                self.extras["header-ids"] = None  # "toc" implies "header-ids"

            if self.extras["toc"] is None:
                self._toc_depth = 6
            else:
                self._toc_depth = self.extras["toc"].get("depth", 6)
        self._instance_extras = self.extras.copy()

        self.link_patterns = link_patterns
        self.footnote_title = footnote_title
        self.footnote_return_symbol = footnote_return_symbol
        self.use_file_vars = use_file_vars
        self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M)
        self.cli = cli

        self._escape_table = g_escape_table.copy()
        self._code_table = {}
        if "smarty-pants" in self.extras:
            self._escape_table['"'] = _hash_text('"')
            self._escape_table["'"] = _hash_text("'")
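
    # Illustrative note (not in the original source): `extras` may be passed
    # as a list or a dict; __init__ normalizes both to a dict mapping extra
    # name -> argument, and "toc" pulls in "header-ids". For example:
    #   >>> Markdown(extras=["toc"]).extras
    #   {'toc': None, 'header-ids': None}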

    def reset(self):
        self.urls = {}
        self.titles = {}
        self.html_blocks = {}
        self.html_spans = {}
        self.list_level = 0
        self.extras = self._instance_extras.copy()
        if "footnotes" in self.extras:
            self.footnotes = {}
            self.footnote_ids = []
        if "header-ids" in self.extras:
            self._count_from_header_id = defaultdict(int)
        if "metadata" in self.extras:
            self.metadata = {}
        self._toc = None

    # Per <https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel"
    # should only be used in <a> tags with an "href" attribute.

    # Opens the linked document in a new window or tab;
    # should only be used in <a> tags with an "href" attribute.
    # same with _a_nofollow
    _a_nofollow_or_blank_links = re.compile(r"""
        <(a)
        (
            [^>]*
            href=   # href is required
            ['"]?   # HTML5 attribute values do not have to be quoted
            [^#'"]  # We don't want to match href values that start with # (like footnotes)
        )
        """,
        re.IGNORECASE | re.VERBOSE
    )

    def convert(self, text):
        """Convert the given text."""
        # Main function. The order in which other subs are called here is
        # essential. Link and image substitutions need to happen before
        # _EscapeSpecialChars(), so that any *'s or _'s in the <a>
        # and <img> tags get encoded.

        # Clear the global hashes. If we don't clear these, you get conflicts
        # from other articles when generating a page which contains more than
        # one article (e.g. an index page that shows the N most recent
        # articles):
        self.reset()

        if not isinstance(text, unicode):
            # TODO: perhaps shouldn't presume UTF-8 for string input?
            text = unicode(text, 'utf-8')

        if self.use_file_vars:
            # Look for emacs-style file variable hints.
            emacs_vars = self._get_emacs_vars(text)
            if "markdown-extras" in emacs_vars:
                splitter = re.compile("[ ,]+")
                for e in splitter.split(emacs_vars["markdown-extras"]):
                    if '=' in e:
                        ename, earg = e.split('=', 1)
                        try:
                            earg = int(earg)
                        except ValueError:
                            pass
                    else:
                        ename, earg = e, None
                    self.extras[ename] = earg

        # Standardize line endings:
        text = text.replace("\r\n", "\n")
        text = text.replace("\r", "\n")

        # Make sure $text ends with a couple of newlines:
        text += "\n\n"

        # Convert all tabs to spaces.
        text = self._detab(text)

        # Strip any lines consisting only of spaces and tabs.
        # This makes subsequent regexen easier to write, because we can
        # match consecutive blank lines with /\n+/ instead of something
        # contorted like /[ \t]*\n+/ .
        text = self._ws_only_line_re.sub("", text)

        # strip metadata from head and extract
        if "metadata" in self.extras:
            text = self._extract_metadata(text)

        text = self.preprocess(text)

        if "fenced-code-blocks" in self.extras and not self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        if self.safe_mode:
            text = self._hash_html_spans(text)

        # Turn block-level HTML blocks into hash entries
        text = self._hash_html_blocks(text, raw=True)

        if "fenced-code-blocks" in self.extras and self.safe_mode:
            text = self._do_fenced_code_blocks(text)

        # Because numbering references aren't links (yet?) then we can do everything associated with counters
        # before we get started
        if "numbering" in self.extras:
            text = self._do_numbering(text)

        # Strip link definitions, store in hashes.
        if "footnotes" in self.extras:
            # Must do footnotes first because an unlucky footnote defn
            # looks like a link defn:
            #   [^4]: this "looks like a link defn"
            text = self._strip_footnote_definitions(text)
        text = self._strip_link_definitions(text)

        text = self._run_block_gamut(text)

        if "footnotes" in self.extras:
            text = self._add_footnotes(text)

        text = self.postprocess(text)

        text = self._unescape_special_chars(text)

        if self.safe_mode:
            text = self._unhash_html_spans(text)
            # return the removed text warning to its markdown.py compatible form
            text = text.replace(self.html_removed_text, self.html_removed_text_compat)

        do_target_blank_links = "target-blank-links" in self.extras
        do_nofollow_links = "nofollow" in self.extras

        if do_target_blank_links and do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text)
        elif do_target_blank_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text)
        elif do_nofollow_links:
            text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text)

        if "toc" in self.extras and self._toc:
            self._toc_html = calculate_toc_html(self._toc)

            # Prepend toc html to output
            if self.cli:
                text = '{}\n{}'.format(self._toc_html, text)

        text += "\n"

        # Attach attrs to output
        rv = UnicodeWithAttrs(text)

        if "toc" in self.extras and self._toc:
            rv.toc_html = self._toc_html

        if "metadata" in self.extras:
            rv.metadata = self.metadata
        return rv

    def postprocess(self, text):
        """A hook for subclasses to do some postprocessing of the html, if
        desired. This is called before unescaping of special chars and
        unhashing of raw HTML spans.
        """
        return text

    def preprocess(self, text):
        """A hook for subclasses to do some preprocessing of the Markdown, if
        desired. This is called after basic formatting of the text, but prior
        to any extras, safe mode, etc. processing.
        """
        return text

    # Is metadata if the content starts with optional '---'-fenced `key: value`
    # pairs. E.g. (indented for presentation):
    #     ---
    #     foo: bar
    #     another-var: blah blah
    #     ---
    #     # header
    # or:
    #     foo: bar
    #     another-var: blah blah
    #
    #     # header
    _meta_data_pattern = re.compile(r'^(?:---[\ \t]*\n)?((?:[\S\w]+\s*:(?:\n+[ \t]+.*)+)|(?:.*:\s+>\n\s+[\S\s]+?)(?=\n\w+\s*:\s*\w+\n|\Z)|(?:\s*[\S\w]+\s*:(?! >)[ \t]*.*\n?))(?:---[\ \t]*\n)?', re.MULTILINE)
    _key_val_pat = re.compile(r"[\S\w]+\s*:(?! >)[ \t]*.*\n?", re.MULTILINE)
    # this allows key: >
    #     value
    #     continues over multiple lines
    _key_val_block_pat = re.compile(
        r"(.*:\s+>\n\s+[\S\s]+?)(?=\n\w+\s*:\s*\w+\n|\Z)", re.MULTILINE
    )
    _key_val_list_pat = re.compile(
        r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?",
        re.MULTILINE,
    )
    _key_val_dict_pat = re.compile(
        r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE
    )  # grp0: key, grp1: value, grp2: multiline value
    _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE)
    _meta_data_newline = re.compile("^\n", re.MULTILINE)

    def _extract_metadata(self, text):
        if text.startswith("---"):
            fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2)
            metadata_content = fence_splits[1]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = fence_splits[2]
        else:
            metadata_split = re.split(self._meta_data_newline, text, maxsplit=1)
            metadata_content = metadata_split[0]
            match = re.findall(self._meta_data_pattern, metadata_content)
            if not match:
                return text
            tail = metadata_split[1]

        def parse_structured_value(value):
            vs = value.lstrip()
            vs = value.replace(v[: len(value) - len(vs)], "\n")[1:]

            # List
            if vs.startswith("-"):
                r = []
                for match in re.findall(self._key_val_list_pat, vs):
                    if match[0] and not match[1] and not match[2]:
                        r.append(match[0].strip())
                    elif match[0] == ">" and not match[1] and match[2]:
                        r.append(match[2].strip())
                    elif match[0] and match[1]:
                        r.append({match[0].strip(): match[1].strip()})
                    elif not match[0] and not match[1] and match[2]:
                        r.append(parse_structured_value(match[2]))
                    else:
                        # Broken case
                        pass

                return r

            # Dict
            else:
                return {
                    match[0].strip(): (
                        match[1].strip()
                        if match[1]
                        else parse_structured_value(match[2])
                    )
                    for match in re.findall(self._key_val_dict_pat, vs)
                }

        for item in match:

            k, v = item.split(":", 1)

            # Multiline value
            if v[:3] == " >\n":
                self.metadata[k.strip()] = _dedent(v[3:]).strip()

            # Empty value
            elif v == "\n":
                self.metadata[k.strip()] = ""

            # Structured value
            elif v[0] == "\n":
                self.metadata[k.strip()] = parse_structured_value(v)

            # Simple value
            else:
                self.metadata[k.strip()] = v.strip()

        return tail

    _emacs_oneliner_vars_pat = re.compile(r"-\*-\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?-\*-", re.UNICODE)
    # This regular expression is intended to match blocks like this:
    #     PREFIX Local Variables: SUFFIX
    #     PREFIX mode: Tcl SUFFIX
    #     PREFIX End: SUFFIX
    # Some notes:
    # - "[ \t]" is used instead of "\s" to specifically exclude newlines
    # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does
    #   not like anything other than Unix-style line terminators.
    _emacs_local_vars_pat = re.compile(r"""^
        (?P<prefix>(?:[^\r\n|\n|\r])*?)
        [\ \t]*Local\ Variables:[\ \t]*
        (?P<suffix>.*?)(?:\r\n|\n|\r)
        (?P<content>.*?\1End:)
        """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

    def _get_emacs_vars(self, text):
        """Return a dictionary of emacs-style local variables.

        Parsing is done loosely according to this spec (and according to
        some in-practice deviations from this):
        http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables
        """
        emacs_vars = {}
        SIZE = pow(2, 13)  # 8kB

        # Search near the start for a '-*-'-style one-liner of variables.
        head = text[:SIZE]
        if "-*-" in head:
            match = self._emacs_oneliner_vars_pat.search(head)
            if match:
                emacs_vars_str = match.group(1)
                assert '\n' not in emacs_vars_str
                emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';')
                                  if s.strip()]
                if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]:
                    # While not in the spec, this form is allowed by emacs:
                    #   -*- Tcl -*-
                    # where the implied "variable" is "mode". This form
                    # is only allowed if there are no other variables.
                    emacs_vars["mode"] = emacs_var_strs[0].strip()
                else:
                    for emacs_var_str in emacs_var_strs:
                        try:
                            variable, value = emacs_var_str.strip().split(':', 1)
                        except ValueError:
                            log.debug("emacs variables error: malformed -*- "
                                      "line: %r", emacs_var_str)
                            continue
                        # Lowercase the variable name because Emacs allows "Mode"
                        # or "mode" or "MoDe", etc.
                        emacs_vars[variable.lower()] = value.strip()

        tail = text[-SIZE:]
        if "Local Variables" in tail:
            match = self._emacs_local_vars_pat.search(tail)
            if match:
                prefix = match.group("prefix")
                suffix = match.group("suffix")
                lines = match.group("content").splitlines(0)
                # print "prefix=%r, suffix=%r, content=%r, lines: %s"\
                #      % (prefix, suffix, match.group("content"), lines)

                # Validate the Local Variables block: proper prefix and suffix
                # usage.
                for i, line in enumerate(lines):
                    if not line.startswith(prefix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper prefix '%s'"
                                  % (line, prefix))
                        return {}
                    # Don't validate suffix on last line. Emacs doesn't care,
                    # neither should we.
                    if i != len(lines)-1 and not line.endswith(suffix):
                        log.debug("emacs variables error: line '%s' "
                                  "does not use proper suffix '%s'"
                                  % (line, suffix))
                        return {}

                # Parse out one emacs var per line.
                continued_for = None
                for line in lines[:-1]:  # no var on the last line ("PREFIX End:")
                    if prefix: line = line[len(prefix):]  # strip prefix
                    if suffix: line = line[:-len(suffix)]  # strip suffix
                    line = line.strip()
                    if continued_for:
                        variable = continued_for
                        if line.endswith('\\'):
                            line = line[:-1].rstrip()
                        else:
                            continued_for = None
                        emacs_vars[variable] += ' ' + line
                    else:
                        try:
                            variable, value = line.split(':', 1)
                        except ValueError:
                            log.debug("local variables error: missing colon "
                                      "in local variables entry: '%s'" % line)
                            continue
                        # Do NOT lowercase the variable name, because Emacs only
                        # allows "mode" (and not "Mode", "MoDe", etc.) in this block.
                        value = value.strip()
                        if value.endswith('\\'):
                            value = value[:-1].rstrip()
                            continued_for = variable
                        else:
                            continued_for = None
                        emacs_vars[variable] = value

        # Unquote values.
        for var, val in list(emacs_vars.items()):
            if len(val) > 1 and (val.startswith('"') and val.endswith('"')
                                 or val.startswith('"') and val.endswith('"')):
                emacs_vars[var] = val[1:-1]

        return emacs_vars
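
    # Illustrative note (not in the original source): with the "use-file-vars"
    # option, a document beginning with a line such as
    #   <!-- -*- markdown-extras: footnotes, wiki-tables -*- -->
    # yields {"markdown-extras": "footnotes, wiki-tables"} here, which
    # convert() then splits on "[ ,]+" to enable those extras.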

    def _detab_line(self, line):
        r"""Recursively convert tabs to spaces in a single line.

        Called from _detab()."""
        if '\t' not in line:
            return line
        chunk1, chunk2 = line.split('\t', 1)
        chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width))
        output = chunk1 + chunk2
        return self._detab_line(output)

    def _detab(self, text):
        r"""Iterate text line by line and convert tabs to spaces.

            >>> m = Markdown()
            >>> m._detab("\tfoo")
            '    foo'
            >>> m._detab("  \tfoo")
            '    foo'
            >>> m._detab("\t  foo")
            '      foo'
            >>> m._detab("  foo")
            '  foo'
            >>> m._detab("  foo\n\tbar\tblam")
            '  foo\n    bar blam'
        """
        if '\t' not in text:
            return text
        output = []
        for line in text.splitlines():
            output.append(self._detab_line(line))
        return '\n'.join(output)

    # I broke out the html5 tags here and add them to _block_tags_a and
    # _block_tags_b. This way html5 tags are easy to keep track of.
    _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption'

    _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del'
    _block_tags_a += _html5tags

    _strict_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            </\2>               # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_a,
        re.X | re.M)

    _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math'
    _block_tags_b += _html5tags

    _liberal_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            .*</\2>             # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_b,
        re.X | re.M)

    _html_markdown_attr_re = re.compile(
        r'''\s+markdown=("1"|'1')''')
    def _hash_html_block_sub(self, match, raw=False):
        html = match.group(1)
        if raw and self.safe_mode:
            html = self._sanitize_html(html)
        elif 'markdown-in-html' in self.extras and 'markdown=' in html:
            first_line = html.split('\n', 1)[0]
            m = self._html_markdown_attr_re.search(first_line)
            if m:
                lines = html.split('\n')
                middle = '\n'.join(lines[1:-1])
                last_line = lines[-1]
                first_line = first_line[:m.start()] + first_line[m.end():]
                f_key = _hash_text(first_line)
                self.html_blocks[f_key] = first_line
                l_key = _hash_text(last_line)
                self.html_blocks[l_key] = last_line
                return ''.join(["\n\n", f_key,
                                "\n\n", middle, "\n\n",
                                l_key, "\n\n"])
        key = _hash_text(html)
        self.html_blocks[key] = html
        return "\n\n" + key + "\n\n"

    def _hash_html_blocks(self, text, raw=False):
        """Hashify HTML blocks

        We only want to do this for block-level HTML tags, such as headers,
        lists, and tables. That's because we still want to wrap <p>s around
        "paragraphs" that are wrapped in non-block-level tags, such as anchors,
        phrase emphasis, and spans. The list of tags we're looking for is
        hard-coded.

        @param raw {boolean} indicates if these are raw HTML blocks in
            the original source. It makes a difference in "safe" mode.
        """
        if '<' not in text:
            return text

        # Pass `raw` value into our calls to self._hash_html_block_sub.
        hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw)

        # First, look for nested blocks, e.g.:
        #   <div>
        #       <div>
        #       tags for inner block must be indented.
        #       </div>
        #   </div>
        #
        # The outermost tags must start at the left margin for this to match, and
        # the inner nested divs must be indented.
        # We need to do this before the next, more liberal match, because the next
        # match will start at the first `<div>` and stop at the first `</div>`.
        text = self._strict_tag_block_re.sub(hash_html_block_sub, text)

        # Now match more liberally, simply from `\n<tag>` to `</tag>\n`
        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)

        # Special case just for <hr />. It was easier to make a special
        # case than to make the other regex more complicated.
        if "<hr" in text:
            _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width)
            text = _hr_tag_re.sub(hash_html_block_sub, text)

        # Special case for standalone HTML comments:
        if "<!--" in text:
            start = 0
            while True:
                # Delimiters for next comment block.
                try:
                    start_idx = text.index("<!--", start)
                except ValueError:
                    break
                try:
                    end_idx = text.index("-->", start_idx) + 3
                except ValueError:
                    break

                # Start position for next comment block search.
                start = end_idx

                # Validate whitespace before comment.
                if start_idx:
                    # - Up to `tab_width - 1` spaces before start_idx.
                    for i in range(self.tab_width - 1):
                        if text[start_idx - 1] != ' ':
                            break
                        start_idx -= 1
                        if start_idx == 0:
                            break
                    # - Must be preceded by 2 newlines or hit the start of
                    #   the document.
                    if start_idx == 0:
                        pass
                    elif start_idx == 1 and text[0] == '\n':
                        start_idx = 0  # to match minute detail of Markdown.pl regex
                    elif text[start_idx-2:start_idx] == '\n\n':
                        pass
                    else:
                        break

                # Validate whitespace after comment.
                # - Any number of spaces and tabs.
                while end_idx < len(text):
                    if text[end_idx] not in ' \t':
                        break
                    end_idx += 1
                # - Must be followed by 2 newlines or hit end of text.
                if text[end_idx:end_idx+2] not in ('', '\n', '\n\n'):
                    continue

                # Escape and hash (must match `_hash_html_block_sub`).
                html = text[start_idx:end_idx]
                if raw and self.safe_mode:
                    html = self._sanitize_html(html)
                key = _hash_text(html)
                self.html_blocks[key] = html
                text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:]

        if "xml" in self.extras:
            # Treat XML processing instructions and namespaced one-liner
            # tags as if they were block HTML tags. E.g., if standalone
            # (i.e. are their own paragraph), the following do not get
            # wrapped in a <p> tag:
            #   <?foo bar?>
            #
            #   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/>
            _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width)
            text = _xml_oneliner_re.sub(hash_html_block_sub, text)

        return text

    def _strip_link_definitions(self, text):
        # Strips link definitions from text, stores the URLs and titles in
        # hash references.
        less_than_tab = self.tab_width - 1

        # Link defs are in the form:
        #   [id]: url "optional title"
        _link_def_re = re.compile(r"""
            ^[ ]{0,%d}\[(.+)\]: # id = \1
              [ \t]*
              \n?               # maybe *one* newline
              [ \t]*
            <?(.+?)>?           # url = \2
              [ \t]*
            (?:
                \n?             # maybe one newline
                [ \t]*
                (?<=\s)         # lookbehind for whitespace
                ['"(]
                ([^\n]*)        # title = \3
                ['")]
                [ \t]*
            )?  # title is optional
            (?:\n+|\Z)
            """ % less_than_tab, re.X | re.M | re.U)
        return _link_def_re.sub(self._extract_link_def_sub, text)

    def _extract_link_def_sub(self, match):
        id, url, title = match.groups()
        key = id.lower()  # Link IDs are case-insensitive
        self.urls[key] = self._encode_amps_and_angles(url)
        if title:
            self.titles[key] = title
        return ""

    def _do_numbering(self, text):
        ''' We handle the special extension for generic numbering for
            tables, figures etc.
        '''
        # First pass to define all the references
        self.regex_defns = re.compile(r'''
            \[\#(\w+)    # the counter. Open square plus hash plus a word \1
            ([^@]*)      # Some optional characters, that aren't an @. \2
            @(\w+)       # the id. Should this be normed? \3
            ([^\]]*)\]   # The rest of the text up to the terminating ] \4
            ''', re.VERBOSE)
        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
        counters = {}
        references = {}
        replacements = []
        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
        for match in self.regex_defns.finditer(text):
            # We must have four match groups otherwise this isn't a numbering reference
            if len(match.groups()) != 4:
                continue
            counter = match.group(1)
            text_before = match.group(2).strip()
            ref_id = match.group(3)
            text_after = match.group(4)
            number = counters.get(counter, 1)
            references[ref_id] = (number, counter)
            replacements.append((match.start(0),
                                 definition_html.format(counter,
                                                        ref_id,
                                                        text_before,
                                                        number,
                                                        text_after),
                                 match.end(0)))
            counters[counter] = number + 1
        for repl in reversed(replacements):
            text = text[:repl[0]] + repl[1] + text[repl[2]:]

        # Second pass to replace the references with the right
        # value of the counter
        # Fwiw, it's vaguely annoying to have to turn the iterator into
        # a list and then reverse it but I can't think of a better thing to do.
        for match in reversed(list(self.regex_subs.finditer(text))):
            number, counter = references.get(match.group(1), (None, None))
            if number is not None:
                repl = reference_html.format(counter,
                                             match.group(1),
                                             number)
            else:
                repl = reference_html.format(match.group(1),
                                             'countererror',
                                             '?' + match.group(1) + '?')
            if "smarty-pants" in self.extras:
                repl = repl.replace('"', self._escape_table['"'])

            text = text[:match.start()] + repl + text[match.end():]
        return text
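
    # Illustrative note (not in the original source): with the "numbering"
    # extra, a definition such as
    #   [#figure @fig1 Dataset sizes]
    # becomes a <figcaption class="figure" id="counter-ref-fig1"> carrying the
    # current counter value, and a later reference [@fig1] becomes
    #   <a class="figure" href="#counter-ref-fig1">1</a>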

    def _extract_footnote_def_sub(self, match):
        id, text = match.groups()
        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
        normed_id = re.sub(r'\W', '-', id)
        # Ensure footnote text ends with a couple newlines (for some
        # block gamut matches).
        self.footnotes[normed_id] = text + "\n\n"
        return ""

    def _strip_footnote_definitions(self, text):
        """A footnote definition looks like this:

            [^note-id]: Text of the note.

                May include one or more indented paragraphs.

        Where,
        - The 'note-id' can be pretty much anything, though typically it
          is the number of the footnote.
        - The first paragraph may start on the next line, like so:

            [^note-id]:
                Text of the note.
        """
        less_than_tab = self.tab_width - 1
        footnote_def_re = re.compile(r'''
            ^[ ]{0,%d}\[\^(.+)\]:   # id = \1
            [ \t]*
            (                       # footnote text = \2
              # First line need not start with the spaces.
              (?:\s*.*\n+)
              (?:
                (?:[ ]{%d} | \t)    # Subsequent lines must be indented.
                .*\n+
              )*
            )
            # Lookahead for non-space at line-start, or end of doc.
            (?:(?=^[ ]{0,%d}\S)|\Z)
            ''' % (less_than_tab, self.tab_width, self.tab_width),
            re.X | re.M)
        return footnote_def_re.sub(self._extract_footnote_def_sub, text)

    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)

    def _run_block_gamut(self, text):
        # These are all the transformations that form block-level
        # tags like paragraphs, headers, and list items.

        if "fenced-code-blocks" in self.extras:
            text = self._do_fenced_code_blocks(text)

        text = self._do_headers(text)

        # Do Horizontal Rules:
        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
        # you wish, you may use spaces between the hyphens or asterisks."
        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
        # hr chars to one or two. We'll reproduce that limit here.
        hr = "\n<hr"+self.empty_element_suffix+"\n"
        text = re.sub(self._hr_re, hr, text)

        text = self._do_lists(text)

        if "pyshell" in self.extras:
            text = self._prepare_pyshell_blocks(text)
        if "wiki-tables" in self.extras:
            text = self._do_wiki_tables(text)
        if "tables" in self.extras:
            text = self._do_tables(text)

        text = self._do_code_blocks(text)

        text = self._do_block_quotes(text)

        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
        # was to escape raw HTML in the original Markdown source. This time,
        # we're escaping the markup we've just created, so that we don't wrap
        # <p> tags around block-level tags.
        text = self._hash_html_blocks(text)

        text = self._form_paragraphs(text)

        return text

    def _pyshell_block_sub(self, match):
        if "fenced-code-blocks" in self.extras:
            dedented = _dedent(match.group(0))
            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
        lines = match.group(0).splitlines(0)
        _dedentlines(lines)
        indent = ' ' * self.tab_width
        s = ('\n'  # separate from possible cuddled paragraph
             + indent + ('\n'+indent).join(lines)
             + '\n\n')
        return s

    def _prepare_pyshell_blocks(self, text):
        """Ensure that Python interactive shell sessions are put in
        code blocks -- even if not properly indented.
        """
        if ">>>" not in text:
            return text

        less_than_tab = self.tab_width - 1
        _pyshell_block_re = re.compile(r"""
            ^([ ]{0,%d})>>>[ ].*\n  # first line
            ^(\1[^\S\n]*\S.*\n)*    # any number of subsequent lines with at least one character
            ^\n                     # ends with a blank line
            """ % less_than_tab, re.M | re.X)

        return _pyshell_block_re.sub(self._pyshell_block_sub, text)

    def _table_sub(self, match):
        trim_space_re = '^[ \t\n]+|[ \t\n]+$'
        trim_bar_re = r'^\||\|$'
        split_bar_re = r'^\||(?<![\`\\])\|'
        escape_bar_re = r'\\\|'

        head, underline, body = match.groups()

        # Determine aligns for columns.
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))]
        align_from_col_idx = {}
        for col_idx, col in enumerate(cols):
            if col[0] == ':' and col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:center;"'
            elif col[0] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:left;"'
            elif col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:right;"'

        # thead
        hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>']
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))]
        for col_idx, col in enumerate(cols):
            hlines.append('  <th%s>%s</th>' % (
                align_from_col_idx.get(col_idx, ''),
                self._run_span_gamut(col)
            ))
        hlines.append('</tr>')
        hlines.append('</thead>')

        # tbody
        hlines.append('<tbody>')
        for line in body.strip('\n').split('\n'):
            hlines.append('<tr>')
            cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))]
            for col_idx, col in enumerate(cols):
                hlines.append('  <td%s>%s</td>' % (
                    align_from_col_idx.get(col_idx, ''),
                    self._run_span_gamut(col)
                ))
            hlines.append('</tr>')
        hlines.append('</tbody>')
        hlines.append('</table>')

        return '\n'.join(hlines) + '\n'

    def _do_tables(self, text):
        """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from
        https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538
        """
        less_than_tab = self.tab_width - 1
        table_re = re.compile(r'''
                (?:(?<=\n\n)|\A\n?)             # leading blank line

                ^[ ]{0,%d}                      # allowed whitespace
                (.*[|].*)  \n                   # $1: header row (at least one pipe)

                ^[ ]{0,%d}                      # allowed whitespace
                (                               # $2: underline row
                    # underline row with leading bar
                    (?:  \|\ *:?-+:?\ *  )+  \|?  \s?  \n
                    |
                    # or, underline row without leading bar
                    (?:  \ *:?-+:?\ *\|  )+  (?:  \ *:?-+:?\ *  )?  \s?  \n
                )

                (                               # $3: data rows
                    (?:
                        ^[ ]{0,%d}(?!\ )        # ensure line begins with 0 to less_than_tab spaces
                        .*\|.*  \n
                    )+
                )
            ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X)
        return table_re.sub(self._table_sub, text)
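
    # Illustrative note (not in the original source): with the "tables" extra,
    # input such as
    #   | Name | Score |
    #   | ---- | ----: |
    #   | foo  |    42 |
    # is matched by table_re and rendered by _table_sub as a <table> with a
    # <thead>/<tbody>, the second column right-aligned via its ":"-suffixed
    # underline cell.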

    def _wiki_table_sub(self, match):
        ttext = match.group(0).strip()
        # print('wiki table: %r' % match.group(0))
        rows = []
        for line in ttext.splitlines(0):
            line = line.strip()[2:-2].strip()
            row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
            rows.append(row)
        # from pprint import pprint
        # pprint(rows)
        hlines = []

        def add_hline(line, indents=0):
            hlines.append((self.tab * indents) + line)

        def format_cell(text):
            return self._run_span_gamut(re.sub(r"^\s*~", "", cell).strip(" "))

        add_hline('<table%s>' % self._html_class_str_from_tag('table'))
        # Check if first cell of first row is a header cell. If so, assume the whole row is a header row.
        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
            add_hline('<thead>', 1)
            add_hline('<tr>', 2)
            for cell in rows[0]:
                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
            add_hline('</tr>', 2)
            add_hline('</thead>', 1)
            # Only one header row allowed.
            rows = rows[1:]
        # If no more rows, don't create a tbody.
        if rows:
            add_hline('<tbody>', 1)
            for row in rows:
                add_hline('<tr>', 2)
                for cell in row:
                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
                add_hline('</tr>', 2)
            add_hline('</tbody>', 1)
        add_hline('</table>')
        return '\n'.join(hlines) + '\n'

    def _do_wiki_tables(self, text):
        # Optimization.
        if "||" not in text:
            return text

        less_than_tab = self.tab_width - 1
        wiki_table_re = re.compile(r'''
            (?:(?<=\n\n)|\A\n?)            # leading blank line
            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n  # first line
            (^\1\|\|.+?\|\|\n)*            # any number of subsequent lines
            ''' % less_than_tab, re.M | re.X)
        return wiki_table_re.sub(self._wiki_table_sub, text)

    def _run_span_gamut(self, text):
        # These are all the transformations that occur *within* block-level
        # tags like paragraphs, headers, and list items.

        text = self._do_code_spans(text)

        text = self._escape_special_chars(text)

        # Process anchor and image tags.
        if "link-patterns" in self.extras:
            text = self._do_link_patterns(text)

        text = self._do_links(text)

        # Make links out of things like `<http://example.com/>`
        # Must come after _do_links(), because you can use < and >
        # delimiters in inline links like [this](<url>).
        text = self._do_auto_links(text)

        text = self._encode_amps_and_angles(text)

        if "strike" in self.extras:
            text = self._do_strike(text)

        if "underline" in self.extras:
            text = self._do_underline(text)

        text = self._do_italics_and_bold(text)

        if "smarty-pants" in self.extras:
            text = self._do_smart_punctuation(text)

        # Do hard breaks:
        if "break-on-newline" in self.extras:
            text = re.sub(r" *\n", "<br%s\n" % self.empty_element_suffix, text)
        else:
            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)

        return text

    # "Sorta" because auto-links are identified as "tag" tokens.
    _sorta_html_tokenize_re = re.compile(r"""
        (
            # tag
            </?
            (?:\w+)                                     # tag name
            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
            \s*/?>
            |
            # auto-link (e.g., <http://www.activestate.com/>)
            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
            |
            <!--.*?-->      # comment
            |
            <\?.*?\?>       # processing instruction
        )
        """, re.X)

    def _escape_special_chars(self, text):
        # Python markdown note: the HTML tokenization here differs from
        # that in Markdown.pl, hence the behaviour for subtle cases can
        # differ (I believe the tokenizer here does a better job because
        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
        # Note, however, that '>' is not allowed in an auto-link URL
        # here.
        escaped = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup:
                # Within tags/HTML-comments/auto-links, encode * and _
                # so they don't conflict with their use in Markdown for
                # italics and strong. We're replacing each such
                # character with its corresponding MD5 checksum value;
                # this is likely overkill, but it should prevent us from
                # colliding with the escape values by accident.
                escaped.append(token.replace('*', self._escape_table['*'])
                                    .replace('_', self._escape_table['_']))
            else:
                escaped.append(self._encode_backslash_escapes(token))
            is_html_markup = not is_html_markup
        return ''.join(escaped)

    def _hash_html_spans(self, text):
        # Used for safe_mode.

        def _is_auto_link(s):
            if ':' in s and self._auto_link_re.match(s):
                return True
            elif '@' in s and self._auto_email_link_re.match(s):
                return True
            return False

        tokens = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup and not _is_auto_link(token):
                sanitized = self._sanitize_html(token)
                key = _hash_text(sanitized)
                self.html_spans[key] = sanitized
                tokens.append(key)
            else:
                tokens.append(self._encode_incomplete_tags(token))
            is_html_markup = not is_html_markup
        return ''.join(tokens)

    def _unhash_html_spans(self, text):
        for key, sanitized in list(self.html_spans.items()):
            text = text.replace(key, sanitized)
        return text

    def _sanitize_html(self, s):
        if self.safe_mode == "replace":
            return self.html_removed_text
        elif self.safe_mode == "escape":
            replacements = [
                ('&', '&amp;'),
                ('<', '&lt;'),
                ('>', '&gt;'),
            ]
            for before, after in replacements:
                s = s.replace(before, after)
            return s
        else:
            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
                                "'escape' or 'replace')" % self.safe_mode)

    _inline_link_title = re.compile(r'''
            (                   # \1
              [ \t]+
              (['"])            # quote char = \2
              (?P<title>.*?)
              \2
            )?                  # title is optional
          \)$
        ''', re.X | re.S)
    _tail_of_reference_link_re = re.compile(r'''
          # Match tail of: [text][id]
          [ ]?          # one optional space
          (?:\n[ ]*)?   # one optional newline followed by spaces
          \[
            (?P<id>.*?)
          \]
        ''', re.X | re.S)

    _whitespace = re.compile(r'\s*')

    _strip_anglebrackets = re.compile(r'<(.*)>.*')

    def _find_non_whitespace(self, text, start):
        """Returns the index of the first non-whitespace character in text
        after (and including) start
        """
        match = self._whitespace.match(text, start)
        return match.end()

    def _find_balanced(self, text, start, open_c, close_c):
        """Returns the index where the open_c and close_c characters balance
        out - the same number of open_c and close_c are encountered - or the
        end of string if it's reached before the balance point is found.
        """
        i = start
        l = len(text)
        count = 1
        while count > 0 and i < l:
            if text[i] == open_c:
                count += 1
            elif text[i] == close_c:
                count -= 1
            i += 1
        return i

    def _extract_url_and_title(self, text, start):
        """Extracts the url and (optional) title from the tail of a link"""
        # text[start] equals the opening parenthesis
        idx = self._find_non_whitespace(text, start+1)
        if idx == len(text):
            return None, None, None
        end_idx = idx
        has_anglebrackets = text[idx] == "<"
        if has_anglebrackets:
            end_idx = self._find_balanced(text, end_idx+1, "<", ">")
        end_idx = self._find_balanced(text, end_idx, "(", ")")
        match = self._inline_link_title.search(text, idx, end_idx)
        if not match:
            return None, None, None
        url, title = text[idx:match.start()], match.group("title")
        if has_anglebrackets:
            url = self._strip_anglebrackets.sub(r'\1', url)
        return url, title, end_idx

    _safe_protocols = re.compile(r'(https?|ftp):', re.I)
    def _do_links(self, text):
        """Turn Markdown link shortcuts into XHTML <a> and <img> tags.

        This is a combination of Markdown.pl's _DoAnchors() and
        _DoImages(). They are done together because that simplified the
        approach. It was necessary to use a different approach than
        Markdown.pl because of the lack of atomic matching support in
        Python's regex engine used in $g_nested_brackets.
        """
        MAX_LINK_TEXT_SENTINEL = 3000  # markdown2 issue 24

        # `anchor_allowed_pos` is used to support img links inside
        # anchors, but not anchors inside anchors. An anchor's start
        # pos must be `>= anchor_allowed_pos`.
        anchor_allowed_pos = 0

        curr_pos = 0
        while True:  # Handle the next link.
            # The next '[' is the start of:
            # - an inline anchor:   [text](url "title")
            # - a reference anchor: [text][id]
            # - an inline img:      ![text](url "title")
            # - a reference img:    ![text][id]
            # - a footnote ref:     [^id]
            #   (Only if 'footnotes' extra enabled)
            # - a footnote defn:    [^id]: ...
            #   (Only if 'footnotes' extra enabled) These have already
            #   been stripped in _strip_footnote_definitions() so no
            #   need to watch for them.
            # - a link definition:  [id]: url "title"
            #   These have already been stripped in
            #   _strip_link_definitions() so no need to watch for them.
            # - not markup:         [...anything else...
            try:
                start_idx = text.index('[', curr_pos)
            except ValueError:
                break
            text_length = len(text)

            # Find the matching closing ']'.
            # Markdown.pl allows *matching* brackets in link text so we
            # will here too. Markdown.pl *doesn't* currently allow
            # matching brackets in img alt text -- we'll differ in that
            # regard.
            bracket_depth = 0
            for p in range(start_idx+1, min(start_idx+MAX_LINK_TEXT_SENTINEL,
                                            text_length)):
                ch = text[p]
                if ch == ']':
                    bracket_depth -= 1
                    if bracket_depth < 0:
                        break
                elif ch == '[':
                    bracket_depth += 1
            else:
                # Closing bracket not found within sentinel length.
                # This isn't markup.
                curr_pos = start_idx + 1
                continue
            link_text = text[start_idx+1:p]

            # Fix for issue 341 - Injecting XSS into link text
            if self.safe_mode:
                link_text = self._hash_html_spans(link_text)
                link_text = self._unhash_html_spans(link_text)

            # Possibly a footnote ref?
            if "footnotes" in self.extras and link_text.startswith("^"):
                normed_id = re.sub(r'\W', '-', link_text[1:])
                if normed_id in self.footnotes:
                    self.footnote_ids.append(normed_id)
                    result = '<sup class="footnote-ref" id="fnref-%s">' \
                             '<a href="#fn-%s">%s</a></sup>' \
                             % (normed_id, normed_id, len(self.footnote_ids))
                    text = text[:start_idx] + result + text[p+1:]
                else:
                    # This id isn't defined, leave the markup alone.
                    curr_pos = p+1
                continue

            # Now determine what this is by the remainder.
            p += 1
            if p == text_length:
                return text

            # Inline anchor or img?
            if text[p] == '(':  # attempt at perf improvement
                url, title, url_end_idx = self._extract_url_and_title(text, p)
                if url is not None:
                    # Handle an inline anchor or img.
                    is_img = start_idx > 0 and text[start_idx-1] == "!"
                    if is_img:
                        start_idx -= 1

                    # We've got to encode these to avoid conflicting
                    # with italics/bold.
                    url = url.replace('*', self._escape_table['*']) \
                             .replace('_', self._escape_table['_'])
                    if title:
                        title_str = ' title="%s"' % (
                            _xml_escape_attr(title)
                                .replace('*', self._escape_table['*'])
                                .replace('_', self._escape_table['_']))
                    else:
                        title_str = ''
                    if is_img:
                        img_class_str = self._html_class_str_from_tag("img")
                        result = '<img src="%s" alt="%s"%s%s%s' \
                            % (_html_escape_url(url, safe_mode=self.safe_mode),
                               _xml_escape_attr(link_text),
                               title_str,
                               img_class_str,
                               self.empty_element_suffix)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        curr_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    elif start_idx >= anchor_allowed_pos:
                        safe_link = self._safe_protocols.match(url) or url.startswith('#')
                        if self.safe_mode and not safe_link:
                            result_head = '<a href="#"%s>' % (title_str)
                        else:
                            result_head = '<a href="%s"%s>' % (_html_escape_url(url, safe_mode=self.safe_mode), title_str)
                        result = '%s%s</a>' % (result_head, link_text)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        # <img> allowed from curr_pos on, <a> from
                        # anchor_allowed_pos on.
                        curr_pos = start_idx + len(result_head)
                        anchor_allowed_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    else:
                        # Anchor not allowed here.
                        curr_pos = start_idx + 1
                    continue

            # Reference anchor or img?
            else:
                match = self._tail_of_reference_link_re.match(text, p)
                if match:
                    # Handle a reference-style anchor or img.
                    is_img = start_idx > 0 and text[start_idx-1] == "!"
                    if is_img:
                        start_idx -= 1
                    link_id = match.group("id").lower()
                    if not link_id:
                        link_id = link_text.lower()  # for links like [this][]
                    if link_id in self.urls:
                        url = self.urls[link_id]
                        # We've got to encode these to avoid conflicting
                        # with italics/bold.
                        url = url.replace('*', self._escape_table['*']) \
                                 .replace('_', self._escape_table['_'])
                        title = self.titles.get(link_id)
                        if title:
                            title = _xml_escape_attr(title) \
                                .replace('*', self._escape_table['*']) \
                                .replace('_', self._escape_table['_'])
                            title_str = ' title="%s"' % title
                        else:
                            title_str = ''
                        if is_img:
                            img_class_str = self._html_class_str_from_tag("img")
                            result = '<img src="%s" alt="%s"%s%s%s' \
                                % (_html_escape_url(url, safe_mode=self.safe_mode),
                                   _xml_escape_attr(link_text),
                                   title_str,
                                   img_class_str,
                                   self.empty_element_suffix)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            curr_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        elif start_idx >= anchor_allowed_pos:
                            if self.safe_mode and not self._safe_protocols.match(url):
                                result_head = '<a href="#"%s>' % (title_str)
                            else:
                                result_head = '<a href="%s"%s>' % (_html_escape_url(url, safe_mode=self.safe_mode), title_str)
                            result = '%s%s</a>' % (result_head, link_text)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            # <img> allowed from curr_pos on, <a> from
                            # anchor_allowed_pos on.
                            curr_pos = start_idx + len(result_head)
                            anchor_allowed_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        else:
                            # Anchor not allowed here.
                            curr_pos = start_idx + 1
                    else:
                        # This id isn't defined, leave the markup alone.
                        curr_pos = match.end()
                    continue

            # Otherwise, it isn't markup.
            curr_pos = start_idx + 1

        return text

    def header_id_from_text(self, text, prefix, n):
        """Generate a header id attribute value from the given header
        HTML content.

        This is only called if the "header-ids" extra is enabled.
        Subclasses may override this for different header ids.

        @param text {str} The text of the header tag
        @param prefix {str} The requested prefix for header ids. This is the
            value of the "header-ids" extra key, if any. Otherwise, None.
        @param n {int} The <hN> tag number, i.e. `1` for an <h1> tag.
        @returns {str} The value for the header tag's "id" attribute. Return
            None to not have an id attribute and to exclude this header from
            the TOC (if the "toc" extra is specified).
        """
        header_id = _slugify(text)
        if prefix and isinstance(prefix, base_string_type):
            header_id = prefix + '-' + header_id

        self._count_from_header_id[header_id] += 1
        if 0 == len(header_id) or self._count_from_header_id[header_id] > 1:
            header_id += '-%s' % self._count_from_header_id[header_id]

        return header_id

    def _toc_add_entry(self, level, id, name):
        if level > self._toc_depth:
            return
        if self._toc is None:
            self._toc = []
        self._toc.append((level, id, self._unescape_special_chars(name)))

    _h_re_base = r'''
        (^(.+)[ \t]{0,99}\n(=+|-+)[ \t]*\n+)
        |
        (^(\#{1,6})  # \1 = string of #'s
        [ \t]%s
# \2 = Header text 1622 [ \t]{0,99} 1623 (?<!\\) # ensure not an escaped trailing '#' 1624 \#* # optional closing #'s (not counted) 1625 \n+ 1626 ) 1627 ''' 1628 1629 _h_re = re.compile(_h_re_base % '*', re.X | re.M) 1630 _h_re_tag_friendly = re.compile(_h_re_base % '+', re.X | re.M) 1631 1632 def _h_sub(self, match): 1633 if match.group(1) is not None and match.group(3) == "-": 1634 return match.group(1) 1635 elif match.group(1) is not None: 1636 # Setext header 1637 n = {"=": 1, "-": 2}[match.group(3)[0]] 1638 header_group = match.group(2) 1639 else: 1640 # atx header 1641 n = len(match.group(5)) 1642 header_group = match.group(6) 1643 1644 demote_headers = self.extras.get("demote-headers") 1645 if demote_headers: 1646 n = min(n + demote_headers, 6) 1647 header_id_attr = "" 1648 if "header-ids" in self.extras: 1649 header_id = self.header_id_from_text(header_group, 1650 self.extras["header-ids"], n) 1651 if header_id: 1652 header_id_attr = ' id="%s"' % header_id 1653 html = self._run_span_gamut(header_group) 1654 if "toc" in self.extras and header_id: 1655 self._toc_add_entry(n, header_id, html) 1656 return "<h%d%s>%s</h%d>\n\n" % (n, header_id_attr, html, n) 1657 1658 def _do_headers(self, text): 1659 # Setext-style headers: 1660 # Header 1 1661 # ======== 1662 # 1663 # Header 2 1664 # -------- 1665 1666 # atx-style headers: 1667 # # Header 1 1668 # ## Header 2 1669 # ## Header 2 with closing hashes ## 1670 # ... 1671 # ###### Header 6 1672 1673 if 'tag-friendly' in self.extras: 1674 return self._h_re_tag_friendly.sub(self._h_sub, text) 1675 return self._h_re.sub(self._h_sub, text) 1676 1677 _marker_ul_chars = '*+-' 1678 _marker_any = r'(?:[%s]|\d+\.)' % _marker_ul_chars 1679 _marker_ul = '(?:[%s])' % _marker_ul_chars 1680 _marker_ol = r'(?:\d+\.)' 1681 1682 def _list_sub(self, match): 1683 lst = match.group(1) 1684 lst_type = match.group(3) in self._marker_ul_chars and "ul" or "ol" 1685 result = self._process_list_items(lst) 1686 if self.list_level: 1687 return "<%s>\n%s</%s>\n" % (lst_type, result, lst_type) 1688 else: 1689 return "<%s>\n%s</%s>\n\n" % (lst_type, result, lst_type) 1690 1691 def _do_lists(self, text): 1692 # Form HTML ordered (numbered) and unordered (bulleted) lists. 1693 1694 # Iterate over each *non-overlapping* list match. 1695 pos = 0 1696 while True: 1697 # Find the *first* hit for either list style (ul or ol). We 1698 # match ul and ol separately to avoid adjacent lists of different 1699 # types running into each other (see issue #16). 1700 hits = [] 1701 for marker_pat in (self._marker_ul, self._marker_ol): 1702 less_than_tab = self.tab_width - 1 1703 whole_list = r''' 1704 ( # \1 = whole list 1705 ( # \2 1706 [ ]{0,%d} 1707 (%s) # \3 = first list item marker 1708 [ \t]+ 1709 (?!\ *\3\ ) # '- - - ...' isn't a list. See 'not_quite_a_list' test case. 1710 ) 1711 (?:.+?) 1712 ( # \4 1713 \Z 1714 | 1715 \n{2,} 1716 (?=\S) 1717 (?! 
# Negative lookahead for another list item marker 1718 [ \t]* 1719 %s[ \t]+ 1720 ) 1721 ) 1722 ) 1723 ''' % (less_than_tab, marker_pat, marker_pat) 1724 if self.list_level: # sub-list 1725 list_re = re.compile("^"+whole_list, re.X | re.M | re.S) 1726 else: 1727 list_re = re.compile(r"(?:(?<=\n\n)|\A\n?)"+whole_list, 1728 re.X | re.M | re.S) 1729 match = list_re.search(text, pos) 1730 if match: 1731 hits.append((match.start(), match)) 1732 if not hits: 1733 break 1734 hits.sort() 1735 match = hits[0][1] 1736 start, end = match.span() 1737 middle = self._list_sub(match) 1738 text = text[:start] + middle + text[end:] 1739 pos = start + len(middle) # start pos for next attempted match 1740 1741 return text 1742 1743 _list_item_re = re.compile(r''' 1744 (\n)? # leading line = \1 1745 (^[ \t]*) # leading whitespace = \2 1746 (?P<marker>%s) [ \t]+ # list marker = \3 1747 ((?:.+?) # list item text = \4 1748 (\n{1,2})) # eols = \5 1749 (?= \n* (\Z | \2 (?P<next_marker>%s) [ \t]+)) 1750 ''' % (_marker_any, _marker_any), 1751 re.M | re.X | re.S) 1752 1753 _task_list_item_re = re.compile(r''' 1754 (\[[\ xX]\])[ \t]+ # tasklist marker = \1 1755 (.*) # list item text = \2 1756 ''', re.M | re.X | re.S) 1757 1758 _task_list_warpper_str = r'<input type="checkbox" class="task-list-item-checkbox" %sdisabled> %s' 1759 1760 def _task_list_item_sub(self, match): 1761 marker = match.group(1) 1762 item_text = match.group(2) 1763 if marker in ['[x]','[X]']: 1764 return self._task_list_warpper_str % ('checked ', item_text) 1765 elif marker == '[ ]': 1766 return self._task_list_warpper_str % ('', item_text) 1767 1768 _last_li_endswith_two_eols = False 1769 def _list_item_sub(self, match): 1770 item = match.group(4) 1771 leading_line = match.group(1) 1772 if leading_line or "\n\n" in item or self._last_li_endswith_two_eols: 1773 item = self._run_block_gamut(self._outdent(item)) 1774 else: 1775 # Recursion for sub-lists: 1776 item = self._do_lists(self._outdent(item)) 1777 if item.endswith('\n'): 1778 item = item[:-1] 1779 item = self._run_span_gamut(item) 1780 self._last_li_endswith_two_eols = (len(match.group(5)) == 2) 1781 1782 if "task_list" in self.extras: 1783 item = self._task_list_item_re.sub(self._task_list_item_sub, item) 1784 1785 return "<li>%s</li>\n" % item 1786 1787 def _process_list_items(self, list_str): 1788 # Process the contents of a single ordered or unordered list, 1789 # splitting it into individual list items. 1790 1791 # The $g_list_level global keeps track of when we're inside a list. 1792 # Each time we enter a list, we increment it; when we leave a list, 1793 # we decrement. If it's zero, we're not in a list anymore. 1794 # 1795 # We do this because when we're not inside a list, we want to treat 1796 # something like this: 1797 # 1798 # I recommend upgrading to version 1799 # 8. Oops, now this line is treated 1800 # as a sub-list. 1801 # 1802 # As a single paragraph, despite the fact that the second line starts 1803 # with a digit-period-space sequence. 1804 # 1805 # Whereas when we're inside a list (or sub-list), that line will be 1806 # treated as the start of a sub-list. What a kludge, huh? This is 1807 # an aspect of Markdown's syntax that's hard to parse perfectly 1808 # without resorting to mind-reading. Perhaps the solution is to 1809 # change the syntax rules such that sub-lists must start with a 1810 # starting cardinal number; e.g. "1." or "a.". 
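        # NOTE: an illustrative sketch of the end result (not from the
        # vendored source); the exact whitespace is indicative, assuming
        # default options:
        #
        #     >>> markdown2.markdown("* a\n* b\n")
        #     u'<ul>\n<li>a</li>\n<li>b</li>\n</ul>\n'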
1811 self.list_level += 1 1812 self._last_li_endswith_two_eols = False 1813 list_str = list_str.rstrip('\n') + '\n' 1814 list_str = self._list_item_re.sub(self._list_item_sub, list_str) 1815 self.list_level -= 1 1816 return list_str 1817 1818 def _get_pygments_lexer(self, lexer_name): 1819 try: 1820 from pygments import lexers, util 1821 except ImportError: 1822 return None 1823 try: 1824 return lexers.get_lexer_by_name(lexer_name) 1825 except util.ClassNotFound: 1826 return None 1827 1828 def _color_with_pygments(self, codeblock, lexer, **formatter_opts): 1829 import pygments 1830 import pygments.formatters 1831 1832 class HtmlCodeFormatter(pygments.formatters.HtmlFormatter): 1833 def _wrap_code(self, inner): 1834 """A function for use in a Pygments Formatter which 1835 wraps in <code> tags. 1836 """ 1837 yield 0, "<code>" 1838 for tup in inner: 1839 yield tup 1840 yield 0, "</code>" 1841 1842 def wrap(self, source, outfile=None): 1843 """Return the source with a code, pre, and div.""" 1844 if outfile is None: 1845 # pygments >= 2.12 1846 return self._wrap_pre(self._wrap_code(source)) 1847 else: 1848 # pygments < 2.12 1849 return self._wrap_div(self._wrap_pre(self._wrap_code(source))) 1850 1851 formatter_opts.setdefault("cssclass", "codehilite") 1852 formatter = HtmlCodeFormatter(**formatter_opts) 1853 return pygments.highlight(codeblock, lexer, formatter) 1854 1855 def _code_block_sub(self, match, is_fenced_code_block=False): 1856 lexer_name = None 1857 if is_fenced_code_block: 1858 lexer_name = match.group(2) 1859 if lexer_name: 1860 formatter_opts = self.extras['fenced-code-blocks'] or {} 1861 codeblock = match.group(3) 1862 codeblock = codeblock[:-1] # drop one trailing newline 1863 else: 1864 codeblock = match.group(1) 1865 codeblock = self._outdent(codeblock) 1866 codeblock = self._detab(codeblock) 1867 codeblock = codeblock.lstrip('\n') # trim leading newlines 1868 codeblock = codeblock.rstrip() # trim trailing whitespace 1869 1870 # Note: "code-color" extra is DEPRECATED. 1871 if "code-color" in self.extras and codeblock.startswith(":::"): 1872 lexer_name, rest = codeblock.split('\n', 1) 1873 lexer_name = lexer_name[3:].strip() 1874 codeblock = rest.lstrip("\n") # Remove lexer declaration line. 
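            # NOTE (illustrative, not from the vendored source): with the
            # deprecated "code-color" extra, an indented code block whose
            # first line is ":::" plus a pygments lexer name, e.g.
            #
            #     :::python
            #     print("colorized")
            #
            # has that declaration line stripped above; the remainder of the
            # block is handed to pygments below.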
1875                formatter_opts = self.extras['code-color'] or {}
1876
1877        # Use pygments only if not using the highlightjs-lang extra
1878        if lexer_name and "highlightjs-lang" not in self.extras:
1879            def unhash_code(codeblock):
1880                for key, sanitized in list(self.html_spans.items()):
1881                    codeblock = codeblock.replace(key, sanitized)
1882                replacements = [
1883                    ("&amp;", "&"),
1884                    ("&lt;", "<"),
1885                    ("&gt;", ">")
1886                ]
1887                for old, new in replacements:
1888                    codeblock = codeblock.replace(old, new)
1889                return codeblock
1890            lexer = self._get_pygments_lexer(lexer_name)
1891            if lexer:
1892                codeblock = unhash_code(codeblock)
1893                colored = self._color_with_pygments(codeblock, lexer,
1894                                                    **formatter_opts)
1895                return "\n\n%s\n\n" % colored
1896
1897        codeblock = self._encode_code(codeblock)
1898        pre_class_str = self._html_class_str_from_tag("pre")
1899
1900        if "highlightjs-lang" in self.extras and lexer_name:
1901            code_class_str = ' class="%s language-%s"' % (lexer_name, lexer_name)
1902        else:
1903            code_class_str = self._html_class_str_from_tag("code")
1904
1905        return "\n\n<pre%s><code%s>%s\n</code></pre>\n\n" % (
1906            pre_class_str, code_class_str, codeblock)
1907
1908    def _html_class_str_from_tag(self, tag):
1909        """Get the appropriate ' class="..."' string (note the leading
1910        space), if any, for the given tag.
1911        """
1912        if "html-classes" not in self.extras:
1913            return ""
1914        try:
1915            html_classes_from_tag = self.extras["html-classes"]
1916        except TypeError:
1917            return ""
1918        else:
1919            if tag in html_classes_from_tag:
1920                return ' class="%s"' % html_classes_from_tag[tag]
1921        return ""
1922
1923    def _do_code_blocks(self, text):
1924        """Process Markdown `<pre><code>` blocks."""
1925        code_block_re = re.compile(r'''
1926            (?:\n\n|\A\n?)
1927            (               # $1 = the code block -- one or more lines, starting with a space/tab
1928              (?:
1929                (?:[ ]{%d} | \t)  # Lines must start with a tab or a tab-width of spaces
1930                .*\n+
1931              )+
1932            )
1933            ((?=^[ ]{0,%d}\S)|\Z)   # Lookahead for non-space at line-start, or end of doc
1934            # Lookahead to make sure this block isn't already in a code block.
1935            # Needed when syntax highlighting is being used.
1936            (?!([^<]|<(/?)span)*\</code\>)
1937            ''' % (self.tab_width, self.tab_width),
1938            re.M | re.X)
1939        return code_block_re.sub(self._code_block_sub, text)
1940
1941    _fenced_code_block_re = re.compile(r'''
1942        (?:\n+|\A\n?|(?<=\n))
1943        (^`{3,})\s{0,99}?([\w+-]+)?\s{0,99}?\n  # $1 = opening fence (captured for back-referencing), $2 = optional lang
1944        (.*?)                                   # $3 = code block content
1945        \1[ \t]*\n                              # closing fence
1946        ''', re.M | re.X | re.S)
1947
1948    def _fenced_code_block_sub(self, match):
1949        return self._code_block_sub(match, is_fenced_code_block=True)
1950
1951    def _do_fenced_code_blocks(self, text):
1952        """Process ```-fenced unindented code blocks ('fenced-code-blocks' extra)."""
1953        return self._fenced_code_block_re.sub(self._fenced_code_block_sub, text)
1954
1955    # Rules for a code span:
1956    # - backslash escapes are not interpreted in a code span
1957    # - to include one backtick or a run of backticks the delimiters must
1958    #   be a longer run of backticks
1959    # - cannot start or end a code span with a backtick; pad with a
1960    #   space and that space will be removed in the emitted HTML
1961    # See `test/tm-cases/escapes.text` for a number of edge-case
1962    # examples.
1963    _code_span_re = re.compile(r'''
1964            (?<!\\)
1965            (`+)        # \1 = Opening run of `
1966            (?!`)       # See Note A test/tm-cases/escapes.text
1967            (.+?)       # \2 = The code block
1968            (?<!`)
1969            \1          # Matching closer
1970            (?!`)
1971            ''', re.X | re.S)

1973    def _code_span_sub(self, match):
1974        c = match.group(2).strip(" \t")
1975        c = self._encode_code(c)
1976        return "<code>%s</code>" % c

1978    def _do_code_spans(self, text):
1979        #   * Backtick quotes are used for <code></code> spans.
1980        #
1981        #   * You can use multiple backticks as the delimiters if you want to
1982        #     include literal backticks in the code span. So, this input:
1983        #
1984        #         Just type ``foo `bar` baz`` at the prompt.
1985        #
1986        #     Will translate to:
1987        #
1988        #         <p>Just type <code>foo `bar` baz</code> at the prompt.</p>
1989        #
1990        #     There's no arbitrary limit to the number of backticks you
1991        #     can use as delimiters. If you need three consecutive backticks
1992        #     in your code, use four for delimiters, etc.
1993        #
1994        #   * You can use spaces to get literal backticks at the edges:
1995        #
1996        #         ... type `` `bar` `` ...
1997        #
1998        #     Turns to:
1999        #
2000        #         ... type <code>`bar`</code> ...
2001        return self._code_span_re.sub(self._code_span_sub, text)

2003    def _encode_code(self, text):
2004        """Encode/escape certain characters inside Markdown code runs.
2005        The point is that in code, these characters are literals,
2006        and lose their special Markdown meanings.
2007        """
2008        replacements = [
2009            # Encode all ampersands; HTML entities are not
2010            # entities within a Markdown code span.
2011            ('&', '&amp;'),
2012            # Do the angle bracket song and dance:
2013            ('<', '&lt;'),
2014            ('>', '&gt;'),
2015        ]
2016        for before, after in replacements:
2017            text = text.replace(before, after)
2018        hashed = _hash_text(text)
2019        self._code_table[text] = hashed
2020        return hashed

2022    _strike_re = re.compile(r"~~(?=\S)(.+?)(?<=\S)~~", re.S)
2023    def _do_strike(self, text):
2024        text = self._strike_re.sub(r"<strike>\1</strike>", text)
2025        return text

2027    _underline_re = re.compile(r"--(?=\S)(.+?)(?<=\S)--", re.S)
2028    def _do_underline(self, text):
2029        text = self._underline_re.sub(r"<u>\1</u>", text)
2030        return text

2032    _strong_re = re.compile(r"(\*\*|__)(?=\S)(.+?[*_]*)(?<=\S)\1", re.S)
2033    _em_re = re.compile(r"(\*|_)(?=\S)(.+?)(?<=\S)\1", re.S)
2034    _code_friendly_strong_re = re.compile(r"\*\*(?=\S)(.+?[*_]*)(?<=\S)\*\*", re.S)
2035    _code_friendly_em_re = re.compile(r"\*(?=\S)(.+?)(?<=\S)\*", re.S)
2036    def _do_italics_and_bold(self, text):
2037        # <strong> must go first:
2038        if "code-friendly" in self.extras:
2039            text = self._code_friendly_strong_re.sub(r"<strong>\1</strong>", text)
2040            text = self._code_friendly_em_re.sub(r"<em>\1</em>", text)
2041        else:
2042            text = self._strong_re.sub(r"<strong>\2</strong>", text)
2043            text = self._em_re.sub(r"<em>\2</em>", text)
2044        return text

2046    # "smarty-pants" extra: Very liberal in interpreting a single prime as an
2047    # apostrophe; e.g. ignores the fact that "round", "bout", "twer", and
2048    # "twixt" can be written without an initial apostrophe. This is fine because
2049    # using scare quotes (single quotation marks) is rare.
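    # NOTE: an illustrative sketch of the smarty-pants behavior implemented
    # below (not from the vendored source); output is indicative, assuming
    # default options:
    #
    #     >>> markdown2.markdown("dashes --- here", extras=["smarty-pants"])
    #     u'<p>dashes &#8212; here</p>\n'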
2050    _apostrophe_year_re = re.compile(r"'(\d\d)(?=(\s|,|;|\.|\?|!|$))")
2051    _contractions = ["tis", "twas", "twer", "neath", "o", "n",
2052                     "round", "bout", "twixt", "nuff", "fraid", "sup"]
2053    def _do_smart_contractions(self, text):
2054        text = self._apostrophe_year_re.sub(r"&#8217;\1", text)
2055        for c in self._contractions:
2056            text = text.replace("'%s" % c, "&#8217;%s" % c)
2057            text = text.replace("'%s" % c.capitalize(),
2058                                "&#8217;%s" % c.capitalize())
2059        return text

2061    # Substitute double-quotes before single-quotes.
2062    _opening_single_quote_re = re.compile(r"(?<!\S)'(?=\S)")
2063    _opening_double_quote_re = re.compile(r'(?<!\S)"(?=\S)')
2064    _closing_single_quote_re = re.compile(r"(?<=\S)'")
2065    _closing_double_quote_re = re.compile(r'(?<=\S)"(?=(\s|,|;|\.|\?|!|$))')
2066    def _do_smart_punctuation(self, text):
2067        """Fancifies 'single quotes', "double quotes", and apostrophes.
2068        Converts --, ---, and ... into en dashes, em dashes, and ellipses.

2070        Inspiration is: <http://daringfireball.net/projects/smartypants/>
2071        See "test/tm-cases/smarty_pants.text" for a full discussion of the
2072        support here and
2073        <http://code.google.com/p/python-markdown2/issues/detail?id=42> for a
2074        discussion of some diversion from the original SmartyPants.
2075        """
2076        if "'" in text:  # guard for perf
2077            text = self._do_smart_contractions(text)
2078            text = self._opening_single_quote_re.sub("&#8216;", text)
2079            text = self._closing_single_quote_re.sub("&#8217;", text)

2081        if '"' in text:  # guard for perf
2082            text = self._opening_double_quote_re.sub("&#8220;", text)
2083            text = self._closing_double_quote_re.sub("&#8221;", text)

2085        text = text.replace("---", "&#8212;")
2086        text = text.replace("--", "&#8211;")
2087        text = text.replace("...", "&#8230;")
2088        text = text.replace(" . . . ", "&#8230;")
2089        text = text.replace(". . .", "&#8230;")

2091        # TODO: Temporary hack to fix https://github.com/trentm/python-markdown2/issues/150
2092        if "footnotes" in self.extras and "footnote-ref" in text:
2093            # Quotes in the footnote back ref get converted to "smart" quotes
2094            # Change them back here to ensure they work.
2095            text = text.replace('class="footnote-ref&#8221;', 'class="footnote-ref"')

2097        return text

2099    _block_quote_base = r'''
2100        (                           # Wrap whole match in \1
2101          (
2102            ^[ \t]*>%s[ \t]?
# '>' at the start of a line 2103 .+\n # rest of the first line 2104 (.+\n)* # subsequent consecutive lines 2105 )+ 2106 ) 2107 ''' 2108 _block_quote_re = re.compile(_block_quote_base % '', re.M | re.X) 2109 _block_quote_re_spoiler = re.compile(_block_quote_base % '[ \t]*?!?', re.M | re.X) 2110 _bq_one_level_re = re.compile('^[ \t]*>[ \t]?', re.M) 2111 _bq_one_level_re_spoiler = re.compile('^[ \t]*>[ \t]*?![ \t]?', re.M) 2112 _bq_all_lines_spoilers = re.compile(r'\A(?:^[ \t]*>[ \t]*?!.*[\n\r]*)+\Z', re.M) 2113 _html_pre_block_re = re.compile(r'(\s*<pre>.+?</pre>)', re.S) 2114 def _dedent_two_spaces_sub(self, match): 2115 return re.sub(r'(?m)^ ', '', match.group(1)) 2116 2117 def _block_quote_sub(self, match): 2118 bq = match.group(1) 2119 is_spoiler = 'spoiler' in self.extras and self._bq_all_lines_spoilers.match(bq) 2120 # trim one level of quoting 2121 if is_spoiler: 2122 bq = self._bq_one_level_re_spoiler.sub('', bq) 2123 else: 2124 bq = self._bq_one_level_re.sub('', bq) 2125 # trim whitespace-only lines 2126 bq = self._ws_only_line_re.sub('', bq) 2127 bq = self._run_block_gamut(bq) # recurse 2128 2129 bq = re.sub('(?m)^', ' ', bq) 2130 # These leading spaces screw with <pre> content, so we need to fix that: 2131 bq = self._html_pre_block_re.sub(self._dedent_two_spaces_sub, bq) 2132 2133 if is_spoiler: 2134 return '<blockquote class="spoiler">\n%s\n</blockquote>\n\n' % bq 2135 else: 2136 return '<blockquote>\n%s\n</blockquote>\n\n' % bq 2137 2138 def _do_block_quotes(self, text): 2139 if '>' not in text: 2140 return text 2141 if 'spoiler' in self.extras: 2142 return self._block_quote_re_spoiler.sub(self._block_quote_sub, text) 2143 else: 2144 return self._block_quote_re.sub(self._block_quote_sub, text) 2145 2146 def _form_paragraphs(self, text): 2147 # Strip leading and trailing lines: 2148 text = text.strip('\n') 2149 2150 # Wrap <p> tags. 2151 grafs = [] 2152 for i, graf in enumerate(re.split(r"\n{2,}", text)): 2153 if graf in self.html_blocks: 2154 # Unhashify HTML blocks 2155 grafs.append(self.html_blocks[graf]) 2156 else: 2157 cuddled_list = None 2158 if "cuddled-lists" in self.extras: 2159 # Need to put back trailing '\n' for `_list_item_re` 2160 # match at the end of the paragraph. 2161 li = self._list_item_re.search(graf + '\n') 2162 # Two of the same list marker in this paragraph: a likely 2163 # candidate for a list cuddled to preceding paragraph 2164 # text (issue 33). Note the `[-1]` is a quick way to 2165 # consider numeric bullets (e.g. "1." and "2.") to be 2166 # equal. 2167 if (li and len(li.group(2)) <= 3 2168 and ( 2169 (li.group("next_marker") and li.group("marker")[-1] == li.group("next_marker")[-1]) 2170 or 2171 li.group("next_marker") is None 2172 ) 2173 ): 2174 start = li.start() 2175 cuddled_list = self._do_lists(graf[start:]).rstrip("\n") 2176 assert cuddled_list.startswith("<ul>") or cuddled_list.startswith("<ol>") 2177 graf = graf[:start] 2178 2179 # Wrap <p> tags. 2180 graf = self._run_span_gamut(graf) 2181 grafs.append("<p%s>" % self._html_class_str_from_tag('p') + graf.lstrip(" \t") + "</p>") 2182 2183 if cuddled_list: 2184 grafs.append(cuddled_list) 2185 2186 return "\n\n".join(grafs) 2187 2188 def _add_footnotes(self, text): 2189 if self.footnotes: 2190 footer = [ 2191 '<div class="footnotes">', 2192 '<hr' + self.empty_element_suffix, 2193 '<ol>', 2194 ] 2195 2196 if not self.footnote_title: 2197 self.footnote_title = "Jump back to footnote %d in the text." 
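            # NOTE (illustrative, not from the vendored source): input like
            #
            #     Hi.[^1]
            #
            #     [^1]: A note.
            #
            # gets a <sup class="footnote-ref"> reference in the body, and the
            # loop below appends a <div class="footnotes"> footer containing
            # the note text and a back-link to that reference.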
2198            if not self.footnote_return_symbol:
2199                self.footnote_return_symbol = "&#8617;"

2201            for i, id in enumerate(self.footnote_ids):
2202                if i != 0:
2203                    footer.append('')
2204                footer.append('<li id="fn-%s">' % id)
2205                footer.append(self._run_block_gamut(self.footnotes[id]))
2206                try:
2207                    backlink = ('<a href="#fnref-%s" ' +
2208                                'class="footnoteBackLink" ' +
2209                                'title="' + self.footnote_title + '">' +
2210                                self.footnote_return_symbol +
2211                                '</a>') % (id, i+1)
2212                except TypeError:
2213                    log.debug("Footnote error. `footnote_title` "
2214                              "must include parameter. Using defaults.")
2215                    backlink = ('<a href="#fnref-%s" '
2216                                'class="footnoteBackLink" '
2217                                'title="Jump back to footnote %d in the text.">'
2218                                '&#8617;</a>' % (id, i+1))

2220                if footer[-1].endswith("</p>"):
2221                    footer[-1] = footer[-1][:-len("</p>")] \
2222                        + '&#160;' + backlink + "</p>"
2223                else:
2224                    footer.append("\n<p>%s</p>" % backlink)
2225                footer.append('</li>')
2226            footer.append('</ol>')
2227            footer.append('</div>')
2228            return text + '\n\n' + '\n'.join(footer)
2229        else:
2230            return text

2232    _naked_lt_re = re.compile(r'<(?![a-z/?\$!])', re.I)
2233    _naked_gt_re = re.compile(r'''(?<![a-z0-9?!/'"-])>''', re.I)

2235    def _encode_amps_and_angles(self, text):
2236        # Smart processing for ampersands and angle brackets that need
2237        # to be encoded.
2238        text = _AMPERSAND_RE.sub('&amp;', text)

2240        # Encode naked <'s
2241        text = self._naked_lt_re.sub('&lt;', text)

2243        # Encode naked >'s
2244        # Note: Other markdown implementations (e.g. Markdown.pl, PHP
2245        # Markdown) don't do this.
2246        text = self._naked_gt_re.sub('&gt;', text)
2247        return text

2249    _incomplete_tags_re = re.compile(r"<(/?\w+?(?!\w).+?[\s/]+?)")

2251    def _encode_incomplete_tags(self, text):
2252        if self.safe_mode not in ("replace", "escape"):
2253            return text

2255        if text.endswith(">"):
2256            return text  # this is not an incomplete tag, this is a link in the form <http://x.y.z>

2258        return self._incomplete_tags_re.sub("&lt;\\1", text)

2260    def _encode_backslash_escapes(self, text):
2261        for ch, escape in list(self._escape_table.items()):
2262            text = text.replace("\\"+ch, escape)
2263        return text

2265    _auto_link_re = re.compile(r'<((https?|ftp):[^\'">\s]+)>', re.I)
2266    def _auto_link_sub(self, match):
2267        g1 = match.group(1)
2268        return '<a href="%s">%s</a>' % (g1, g1)

2270    _auto_email_link_re = re.compile(r"""
2271          <
2272           (?:mailto:)?
2273          (
2274              [-.\w]+
2275              \@
2276              [-\w]+(\.[-\w]+)*\.[a-z]+
2277          )
2278          >
2279        """, re.I | re.X | re.U)
2280    def _auto_email_link_sub(self, match):
2281        return self._encode_email_address(
2282            self._unescape_special_chars(match.group(1)))

2284    def _do_auto_links(self, text):
2285        text = self._auto_link_re.sub(self._auto_link_sub, text)
2286        text = self._auto_email_link_re.sub(self._auto_email_link_sub, text)
2287        return text

2289    def _encode_email_address(self, addr):
2290        #  Input: an email address, e.g. "foo@example.com"
2291        #
2292        #  Output: the email address as a mailto link, with each character
2293        #      of the address encoded as either a decimal or hex entity, in
2294        #      the hopes of foiling most address harvesting spam bots. E.g.:
2295        #
2296        #     <a href="mailto:foo@e
2297        #        xample.com">foo
2298        #        @example.com</a>  (with each character entity-encoded)
2299        #
2300        #  Based on a filter by Matthew Wickline, posted to the BBEdit-Talk
2301        #  mailing list: <http://tinyurl.com/yu7ue>
2302        chars = [_xml_encode_email_char_at_random(ch)
2303                 for ch in "mailto:" + addr]
2304        # Strip the mailto: from the visible part.
2305        addr = '<a href="%s">%s</a>' \
2306               % (''.join(chars), ''.join(chars[7:]))
2307        return addr

2309    def _do_link_patterns(self, text):
2310        link_from_hash = {}
2311        for regex, repl in self.link_patterns:
2312            replacements = []
2313            for match in regex.finditer(text):
2314                if hasattr(repl, "__call__"):
2315                    href = repl(match)
2316                else:
2317                    href = match.expand(repl)
2318                replacements.append((match.span(), href))
2319            for (start, end), href in reversed(replacements):

2321                # Do not match against links inside brackets.
2322                if text[start - 1:start] == '[' and text[end:end + 1] == ']':
2323                    continue

2325                # Do not match against links in the standard markdown syntax.
2326                if text[start - 2:start] == '](' or text[end:end + 2] == '")':
2327                    continue

2329                # Do not match against links which are escaped.
2330                if text[start - 3:start] == '"""' and text[end:end + 3] == '"""':
2331                    text = text[:start - 3] + text[start:end] + text[end + 3:]
2332                    continue

2334                escaped_href = (
2335                    href.replace('"', '&quot;')  # b/c of attr quote
2336                        # To avoid markdown <em> and <strong>:
2337                        .replace('*', self._escape_table['*'])
2338                        .replace('_', self._escape_table['_']))
2339                link = '<a href="%s">%s</a>' % (escaped_href, text[start:end])
2340                hash = _hash_text(link)
2341                link_from_hash[hash] = link
2342                text = text[:start] + hash + text[end:]
2343        for hash, link in list(link_from_hash.items()):
2344            text = text.replace(hash, link)
2345        return text

2347    def _unescape_special_chars(self, text):
2348        # Swap back in all the special characters we've hidden.
2349        for ch, hash in list(self._escape_table.items()) + list(self._code_table.items()):
2350            text = text.replace(hash, ch)
2351        return text

2353    def _outdent(self, text):
2354        # Remove one level of line-leading tabs or spaces
2355        return self._outdent_re.sub('', text)


2358class MarkdownWithExtras(Markdown):
2359    """A markdowner class that enables most extras:

2361    - footnotes
2362    - code-color (only has effect if 'pygments' Python module on path)

2364    These are not included:
2365    - pyshell (specific to Python-related documenting)
2366    - code-friendly (because it *disables* part of the syntax)
2367    - link-patterns (because you need to specify some actual
2368      link-patterns anyway)
2369    """
2370    extras = ["footnotes", "code-color"]


2373# ---- internal support functions


2376def calculate_toc_html(toc):
2377    """Return the HTML for the current TOC.

2379    This expects the `_toc` attribute to have been set on this instance.
2380    """
2381    if toc is None:
2382        return None

2384    def indent():
2385        return '  ' * (len(h_stack) - 1)
2386    lines = []
2387    h_stack = [0]  # stack of header-level numbers
2388    for level, id, name in toc:
2389        if level > h_stack[-1]:
2390            lines.append("%s<ul>" % indent())
2391            h_stack.append(level)
2392        elif level == h_stack[-1]:
2393            lines[-1] += "</li>"
2394        else:
2395            while level < h_stack[-1]:
2396                h_stack.pop()
2397                if not lines[-1].endswith("</li>"):
2398                    lines[-1] += "</li>"
2399                lines.append("%s</ul></li>" % indent())
2400        lines.append('%s<li><a href="#%s">%s</a>' % (
2401            indent(), id, name))
2402    while len(h_stack) > 1:
2403        h_stack.pop()
2404        if not lines[-1].endswith("</li>"):
2405            lines[-1] += "</li>"
2406        lines.append("%s</ul>" % indent())
2407    return '\n'.join(lines) + '\n'


2410class UnicodeWithAttrs(unicode):
2411    """A subclass of unicode used for the return value of conversion to
2412    possibly attach some attributes. E.g.
the "toc_html" attribute when 2413 the "toc" extra is used. 2414 """ 2415 metadata = None 2416 toc_html = None 2417 2418## {{{ http://code.activestate.com/recipes/577257/ (r1) 2419_slugify_strip_re = re.compile(r'[^\w\s-]') 2420_slugify_hyphenate_re = re.compile(r'[-\s]+') 2421def _slugify(value): 2422 """ 2423 Normalizes string, converts to lowercase, removes non-alpha characters, 2424 and converts spaces to hyphens. 2425 2426 From Django's "django/template/defaultfilters.py". 2427 """ 2428 import unicodedata 2429 value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode() 2430 value = _slugify_strip_re.sub('', value).strip().lower() 2431 return _slugify_hyphenate_re.sub('-', value) 2432## end of http://code.activestate.com/recipes/577257/ }}} 2433 2434 2435# From http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52549 2436def _curry(*args, **kwargs): 2437 function, args = args[0], args[1:] 2438 def result(*rest, **kwrest): 2439 combined = kwargs.copy() 2440 combined.update(kwrest) 2441 return function(*args + rest, **combined) 2442 return result 2443 2444 2445# Recipe: regex_from_encoded_pattern (1.0) 2446def _regex_from_encoded_pattern(s): 2447 """'foo' -> re.compile(re.escape('foo')) 2448 '/foo/' -> re.compile('foo') 2449 '/foo/i' -> re.compile('foo', re.I) 2450 """ 2451 if s.startswith('/') and s.rfind('/') != 0: 2452 # Parse it: /PATTERN/FLAGS 2453 idx = s.rfind('/') 2454 _, flags_str = s[1:idx], s[idx+1:] 2455 flag_from_char = { 2456 "i": re.IGNORECASE, 2457 "l": re.LOCALE, 2458 "s": re.DOTALL, 2459 "m": re.MULTILINE, 2460 "u": re.UNICODE, 2461 } 2462 flags = 0 2463 for char in flags_str: 2464 try: 2465 flags |= flag_from_char[char] 2466 except KeyError: 2467 raise ValueError("unsupported regex flag: '%s' in '%s' " 2468 "(must be one of '%s')" 2469 % (char, s, ''.join(list(flag_from_char.keys())))) 2470 return re.compile(s[1:idx], flags) 2471 else: # not an encoded regex 2472 return re.compile(re.escape(s)) 2473 2474 2475# Recipe: dedent (0.1.2) 2476def _dedentlines(lines, tabsize=8, skip_first_line=False): 2477 """_dedentlines(lines, tabsize=8, skip_first_line=False) -> dedented lines 2478 2479 "lines" is a list of lines to dedent. 2480 "tabsize" is the tab width to use for indent width calculations. 2481 "skip_first_line" is a boolean indicating if the first line should 2482 be skipped for calculating the indent width and for dedenting. 2483 This is sometimes useful for docstrings and similar. 2484 2485 Same as dedent() except operates on a sequence of lines. Note: the 2486 lines list is modified **in-place**. 
2487 """ 2488 DEBUG = False 2489 if DEBUG: 2490 print("dedent: dedent(..., tabsize=%d, skip_first_line=%r)"\ 2491 % (tabsize, skip_first_line)) 2492 margin = None 2493 for i, line in enumerate(lines): 2494 if i == 0 and skip_first_line: continue 2495 indent = 0 2496 for ch in line: 2497 if ch == ' ': 2498 indent += 1 2499 elif ch == '\t': 2500 indent += tabsize - (indent % tabsize) 2501 elif ch in '\r\n': 2502 continue # skip all-whitespace lines 2503 else: 2504 break 2505 else: 2506 continue # skip all-whitespace lines 2507 if DEBUG: print("dedent: indent=%d: %r" % (indent, line)) 2508 if margin is None: 2509 margin = indent 2510 else: 2511 margin = min(margin, indent) 2512 if DEBUG: print("dedent: margin=%r" % margin) 2513 2514 if margin is not None and margin > 0: 2515 for i, line in enumerate(lines): 2516 if i == 0 and skip_first_line: continue 2517 removed = 0 2518 for j, ch in enumerate(line): 2519 if ch == ' ': 2520 removed += 1 2521 elif ch == '\t': 2522 removed += tabsize - (removed % tabsize) 2523 elif ch in '\r\n': 2524 if DEBUG: print("dedent: %r: EOL -> strip up to EOL" % line) 2525 lines[i] = lines[i][j:] 2526 break 2527 else: 2528 raise ValueError("unexpected non-whitespace char %r in " 2529 "line %r while removing %d-space margin" 2530 % (ch, line, margin)) 2531 if DEBUG: 2532 print("dedent: %r: %r -> removed %d/%d"\ 2533 % (line, ch, removed, margin)) 2534 if removed == margin: 2535 lines[i] = lines[i][j+1:] 2536 break 2537 elif removed > margin: 2538 lines[i] = ' '*(removed-margin) + lines[i][j+1:] 2539 break 2540 else: 2541 if removed: 2542 lines[i] = lines[i][removed:] 2543 return lines 2544 2545 2546def _dedent(text, tabsize=8, skip_first_line=False): 2547 """_dedent(text, tabsize=8, skip_first_line=False) -> dedented text 2548 2549 "text" is the text to dedent. 2550 "tabsize" is the tab width to use for indent width calculations. 2551 "skip_first_line" is a boolean indicating if the first line should 2552 be skipped for calculating the indent width and for dedenting. 2553 This is sometimes useful for docstrings and similar. 2554 2555 textwrap.dedent(s), but don't expand tabs to spaces 2556 """ 2557 lines = text.splitlines(1) 2558 _dedentlines(lines, tabsize=tabsize, skip_first_line=skip_first_line) 2559 return ''.join(lines) 2560 2561 2562class _memoized(object): 2563 """Decorator that caches a function's return value each time it is called. 2564 If called later with the same arguments, the cached value is returned, and 2565 not re-evaluated. 2566 2567 http://wiki.python.org/moin/PythonDecoratorLibrary 2568 """ 2569 def __init__(self, func): 2570 self.func = func 2571 self.cache = {} 2572 2573 def __call__(self, *args): 2574 try: 2575 return self.cache[args] 2576 except KeyError: 2577 self.cache[args] = value = self.func(*args) 2578 return value 2579 except TypeError: 2580 # uncachable -- for instance, passing a list as an argument. 2581 # Better to not cache than to blow up entirely. 2582 return self.func(*args) 2583 2584 def __repr__(self): 2585 """Return the function's docstring.""" 2586 return self.func.__doc__ 2587 2588 2589def _xml_oneliner_re_from_tab_width(tab_width): 2590 """Standalone XML processing instruction regex.""" 2591 return re.compile(r""" 2592 (?: 2593 (?<=\n\n) # Starting after a blank line 2594 | # or 2595 \A\n? 
                        # the beginning of the doc
2596        )
2597        (                           # save in $1
2598            [ ]{0,%d}
2599            (?:
2600                <\?\w+\b\s+.*?\?>   # XML processing instruction
2601                |
2602                <\w+:\w+\b\s+.*?/>  # namespaced single tag
2603            )
2604            [ \t]*
2605            (?=\n{2,}|\Z)           # followed by a blank line or end of document
2606        )
2607        """ % (tab_width - 1), re.X)
2608_xml_oneliner_re_from_tab_width = _memoized(_xml_oneliner_re_from_tab_width)


2611def _hr_tag_re_from_tab_width(tab_width):
2612    return re.compile(r"""
2613        (?:
2614            (?<=\n\n)       # Starting after a blank line
2615            |               # or
2616            \A\n?           # the beginning of the doc
2617        )
2618        (                       # save in \1
2619            [ ]{0,%d}
2620            <(hr)               # start tag = \2
2621            \b                  # word break
2622            ([^<>])*?           #
2623            /?>                 # the matching end tag
2624            [ \t]*
2625            (?=\n{2,}|\Z)       # followed by a blank line or end of document
2626        )
2627        """ % (tab_width - 1), re.X)
2628_hr_tag_re_from_tab_width = _memoized(_hr_tag_re_from_tab_width)


2631def _xml_escape_attr(attr, skip_single_quote=True):
2632    """Escape the given string for use in an HTML/XML tag attribute.

2634    By default this doesn't bother with escaping `'` to `&#39;`, presuming that
2635    the tag attribute is surrounded by double quotes.
2636    """
2637    escaped = _AMPERSAND_RE.sub('&amp;', attr)

2639    escaped = (escaped
2640        .replace('"', '&quot;')
2641        .replace('<', '&lt;')
2642        .replace('>', '&gt;'))
2643    if not skip_single_quote:
2644        escaped = escaped.replace("'", "&#39;")
2645    return escaped


2648def _xml_encode_email_char_at_random(ch):
2649    r = random()
2650    # Roughly 10% raw, 45% hex, 45% dec.
2651    # '@' *must* be encoded. I [John Gruber] insist.
2652    # Issue 26: '_' must be encoded.
2653    if r > 0.9 and ch not in "@_":
2654        return ch
2655    elif r < 0.45:
2656        # The [1:] is to drop leading '0': 0x63 -> x63
2657        return '&#%s;' % hex(ord(ch))[1:]
2658    else:
2659        return '&#%s;' % ord(ch)


2662def _html_escape_url(attr, safe_mode=False):
2663    """Replace special characters that are potentially malicious in url string."""
2664    escaped = (attr
2665        .replace('"', '&quot;')
2666        .replace('<', '&lt;')
2667        .replace('>', '&gt;'))
2668    if safe_mode:
2669        escaped = escaped.replace('+', ' ')
2670        escaped = escaped.replace("'", "&#39;")
2671    return escaped


2674# ---- mainline

2676class _NoReflowFormatter(optparse.IndentedHelpFormatter):
2677    """An optparse formatter that does NOT reflow the description."""
2678    def format_description(self, description):
2679        return description or ""


2682def _test():
2683    import doctest
2684    doctest.testmod()


2687def main(argv=None):
2688    if argv is None:
2689        argv = sys.argv
2690    if not logging.root.handlers:
2691        logging.basicConfig()

2693    usage = "usage: %prog [PATHS...]"
2694    version = "%prog "+__version__
2695    parser = optparse.OptionParser(prog="markdown2", usage=usage,
2696        version=version, description=cmdln_desc,
2697        formatter=_NoReflowFormatter())
2698    parser.add_option("-v", "--verbose", dest="log_level",
2699        action="store_const", const=logging.DEBUG,
2700        help="more verbose output")
2701    parser.add_option("--encoding",
2702        help="specify encoding of text content")
2703    parser.add_option("--html4tags", action="store_true", default=False,
2704        help="use HTML 4 style for empty element tags")
2705    parser.add_option("-s", "--safe", metavar="MODE", dest="safe_mode",
2706        help="sanitize literal HTML: 'escape' escapes "
2707             "HTML meta chars, 'replace' replaces with an "
2708             "[HTML_REMOVED] note")
2709    parser.add_option("-x", "--extras", action="append",
2710        help="Turn on specific extra features (not part of "
2711 "the core Markdown spec). See above.") 2712 parser.add_option("--use-file-vars", 2713 help="Look for and use Emacs-style 'markdown-extras' " 2714 "file var to turn on extras. See " 2715 "<https://github.com/trentm/python-markdown2/wiki/Extras>") 2716 parser.add_option("--link-patterns-file", 2717 help="path to a link pattern file") 2718 parser.add_option("--self-test", action="store_true", 2719 help="run internal self-tests (some doctests)") 2720 parser.add_option("--compare", action="store_true", 2721 help="run against Markdown.pl as well (for testing)") 2722 parser.set_defaults(log_level=logging.INFO, compare=False, 2723 encoding="utf-8", safe_mode=None, use_file_vars=False) 2724 opts, paths = parser.parse_args() 2725 log.setLevel(opts.log_level) 2726 2727 if opts.self_test: 2728 return _test() 2729 2730 if opts.extras: 2731 extras = {} 2732 for s in opts.extras: 2733 splitter = re.compile("[,;: ]+") 2734 for e in splitter.split(s): 2735 if '=' in e: 2736 ename, earg = e.split('=', 1) 2737 try: 2738 earg = int(earg) 2739 except ValueError: 2740 pass 2741 else: 2742 ename, earg = e, None 2743 extras[ename] = earg 2744 else: 2745 extras = None 2746 2747 if opts.link_patterns_file: 2748 link_patterns = [] 2749 f = open(opts.link_patterns_file) 2750 try: 2751 for i, line in enumerate(f.readlines()): 2752 if not line.strip(): continue 2753 if line.lstrip().startswith("#"): continue 2754 try: 2755 pat, href = line.rstrip().rsplit(None, 1) 2756 except ValueError: 2757 raise MarkdownError("%s:%d: invalid link pattern line: %r" 2758 % (opts.link_patterns_file, i+1, line)) 2759 link_patterns.append( 2760 (_regex_from_encoded_pattern(pat), href)) 2761 finally: 2762 f.close() 2763 else: 2764 link_patterns = None 2765 2766 from os.path import join, dirname, abspath, exists 2767 markdown_pl = join(dirname(dirname(abspath(__file__))), "test", 2768 "Markdown.pl") 2769 if not paths: 2770 paths = ['-'] 2771 for path in paths: 2772 if path == '-': 2773 text = sys.stdin.read() 2774 else: 2775 fp = codecs.open(path, 'r', opts.encoding) 2776 text = fp.read() 2777 fp.close() 2778 if opts.compare: 2779 from subprocess import Popen, PIPE 2780 print("==== Markdown.pl ====") 2781 p = Popen('perl %s' % markdown_pl, shell=True, stdin=PIPE, stdout=PIPE, close_fds=True) 2782 p.stdin.write(text.encode('utf-8')) 2783 p.stdin.close() 2784 perl_html = p.stdout.read().decode('utf-8') 2785 if py3: 2786 sys.stdout.write(perl_html) 2787 else: 2788 sys.stdout.write(perl_html.encode( 2789 sys.stdout.encoding or "utf-8", 'xmlcharrefreplace')) 2790 print("==== markdown2.py ====") 2791 html = markdown(text, 2792 html4tags=opts.html4tags, 2793 safe_mode=opts.safe_mode, 2794 extras=extras, link_patterns=link_patterns, 2795 use_file_vars=opts.use_file_vars, 2796 cli=True) 2797 if py3: 2798 sys.stdout.write(html) 2799 else: 2800 sys.stdout.write(html.encode( 2801 sys.stdout.encoding or "utf-8", 'xmlcharrefreplace')) 2802 if extras and "toc" in extras: 2803 log.debug("toc_html: " + 2804 str(html.toc_html.encode(sys.stdout.encoding or "utf-8", 'xmlcharrefreplace'))) 2805 if opts.compare: 2806 test_dir = join(dirname(dirname(abspath(__file__))), "test") 2807 if exists(join(test_dir, "test_markdown2.py")): 2808 sys.path.insert(0, test_dir) 2809 from test_markdown2 import norm_html_from_html 2810 norm_html = norm_html_from_html(html) 2811 norm_perl_html = norm_html_from_html(perl_html) 2812 else: 2813 norm_html = html 2814 norm_perl_html = perl_html 2815 print("==== match? 
%r ====" % (norm_perl_html == norm_html)) 2816 2817 2818if __name__ == "__main__": 2819 sys.exit(main(sys.argv)) class MarkdownError(builtins.Exception): Common base class for all non-exit exceptions.
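markdown_path() below is a thin convenience over markdown(): it reads the file at `path` with the given encoding and converts its contents with the same options. An illustrative sketch (the file name is hypothetical; the u'' output style follows the module docstring's examples and is absent on Python 3):

    >>> with open("notes.md", "w") as f:    # hypothetical scratch file
    ...     _ = f.write("**hi**")
    >>> markdown2.markdown_path("notes.md")
    u'<p><strong>hi</strong></p>\n'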
def markdown_path(path, encoding='utf-8', html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False)

165def markdown_path(path, encoding="utf-8",
166                  html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
167                  safe_mode=None, extras=None, link_patterns=None,
168                  footnote_title=None, footnote_return_symbol=None,
169                  use_file_vars=False):
170    fp = codecs.open(path, 'r', encoding)
171    text = fp.read()
172    fp.close()
173    return Markdown(html4tags=html4tags, tab_width=tab_width,
174                    safe_mode=safe_mode, extras=extras,
175                    link_patterns=link_patterns,
176                    footnote_title=footnote_title,
177                    footnote_return_symbol=footnote_return_symbol,
178                    use_file_vars=use_file_vars).convert(text)

def markdown(text, html4tags=False, tab_width=4, safe_mode=None, extras=None, link_patterns=None, footnote_title=None, footnote_return_symbol=None, use_file_vars=False, cli=False)

181def markdown(text, html4tags=False, tab_width=DEFAULT_TAB_WIDTH,
182             safe_mode=None, extras=None, link_patterns=None,
183             footnote_title=None, footnote_return_symbol=None,
184             use_file_vars=False, cli=False):
185    return Markdown(html4tags=html4tags, tab_width=tab_width,
186                    safe_mode=safe_mode, extras=extras,
187                    link_patterns=link_patterns,
188                    footnote_title=footnote_title,
189                    footnote_return_symbol=footnote_return_symbol,
190                    use_file_vars=use_file_vars, cli=cli).convert(text)

class Markdown:

193class Markdown(object):
194    # The dict of "extras" to enable in processing -- a mapping of
195    # extra name to argument for the extra. Most extras do not have an
196    # argument, in which case the value is None.
197    #
198    # This can be set via (a) subclassing and (b) the constructor
199    # "extras" argument.
200    extras = None

202    urls = None
203    titles = None
204    html_blocks = None
205    html_spans = None
206    html_removed_text = "{(#HTML#)}"  # placeholder removed text that does not trigger bold
207    html_removed_text_compat = "[HTML_REMOVED]"  # for compat with markdown.py

209    _toc = None

211    # Used to track when we're inside an ordered or unordered list
212    # (see _ProcessListItems() for details):
213    list_level = 0

215    _ws_only_line_re = re.compile(r"^[ \t]+$", re.M)

217    def __init__(self, html4tags=False, tab_width=4, safe_mode=None,
218                 extras=None, link_patterns=None,
219                 footnote_title=None, footnote_return_symbol=None,
220                 use_file_vars=False, cli=False):
221        if html4tags:
222            self.empty_element_suffix = ">"
223        else:
224            self.empty_element_suffix = " />"
225        self.tab_width = tab_width
226        self.tab = tab_width * " "

228        # For compatibility with earlier markdown2.py and with
229        # markdown.py's safe_mode being a boolean,
230        #   safe_mode == True -> "replace"
231        if safe_mode is True:
232            self.safe_mode = "replace"
233        else:
234            self.safe_mode = safe_mode

236        # Massaging and building the "extras" info.
237 if self.extras is None: 238 self.extras = {} 239 elif not isinstance(self.extras, dict): 240 self.extras = dict([(e, None) for e in self.extras]) 241 if extras: 242 if not isinstance(extras, dict): 243 extras = dict([(e, None) for e in extras]) 244 self.extras.update(extras) 245 assert isinstance(self.extras, dict) 246 247 if "toc" in self.extras: 248 if "header-ids" not in self.extras: 249 self.extras["header-ids"] = None # "toc" implies "header-ids" 250 251 if self.extras["toc"] is None: 252 self._toc_depth = 6 253 else: 254 self._toc_depth = self.extras["toc"].get("depth", 6) 255 self._instance_extras = self.extras.copy() 256 257 self.link_patterns = link_patterns 258 self.footnote_title = footnote_title 259 self.footnote_return_symbol = footnote_return_symbol 260 self.use_file_vars = use_file_vars 261 self._outdent_re = re.compile(r'^(\t|[ ]{1,%d})' % tab_width, re.M) 262 self.cli = cli 263 264 self._escape_table = g_escape_table.copy() 265 self._code_table = {} 266 if "smarty-pants" in self.extras: 267 self._escape_table['"'] = _hash_text('"') 268 self._escape_table["'"] = _hash_text("'") 269 270 def reset(self): 271 self.urls = {} 272 self.titles = {} 273 self.html_blocks = {} 274 self.html_spans = {} 275 self.list_level = 0 276 self.extras = self._instance_extras.copy() 277 if "footnotes" in self.extras: 278 self.footnotes = {} 279 self.footnote_ids = [] 280 if "header-ids" in self.extras: 281 self._count_from_header_id = defaultdict(int) 282 if "metadata" in self.extras: 283 self.metadata = {} 284 self._toc = None 285 286 # Per <https://developer.mozilla.org/en-US/docs/HTML/Element/a> "rel" 287 # should only be used in <a> tags with an "href" attribute. 288 289 # Opens the linked document in a new window or tab 290 # should only used in <a> tags with an "href" attribute. 291 # same with _a_nofollow 292 _a_nofollow_or_blank_links = re.compile(r""" 293 <(a) 294 ( 295 [^>]* 296 href= # href is required 297 ['"]? # HTML5 attribute values do not have to be quoted 298 [^#'"] # We don't want to match href values that start with # (like footnotes) 299 ) 300 """, 301 re.IGNORECASE | re.VERBOSE 302 ) 303 304 def convert(self, text): 305 """Convert the given text.""" 306 # Main function. The order in which other subs are called here is 307 # essential. Link and image substitutions need to happen before 308 # _EscapeSpecialChars(), so that any *'s or _'s in the <a> 309 # and <img> tags get encoded. 310 311 # Clear the global hashes. If we don't clear these, you get conflicts 312 # from other articles when generating a page which contains more than 313 # one article (e.g. an index page that shows the N most recent 314 # articles): 315 self.reset() 316 317 if not isinstance(text, unicode): 318 # TODO: perhaps shouldn't presume UTF-8 for string input? 319 text = unicode(text, 'utf-8') 320 321 if self.use_file_vars: 322 # Look for emacs-style file variable hints. 323 emacs_vars = self._get_emacs_vars(text) 324 if "markdown-extras" in emacs_vars: 325 splitter = re.compile("[ ,]+") 326 for e in splitter.split(emacs_vars["markdown-extras"]): 327 if '=' in e: 328 ename, earg = e.split('=', 1) 329 try: 330 earg = int(earg) 331 except ValueError: 332 pass 333 else: 334 ename, earg = e, None 335 self.extras[ename] = earg 336 337 # Standardize line endings: 338 text = text.replace("\r\n", "\n") 339 text = text.replace("\r", "\n") 340 341 # Make sure $text ends with a couple of newlines: 342 text += "\n\n" 343 344 # Convert all tabs to spaces. 
345 text = self._detab(text) 346 347 # Strip any lines consisting only of spaces and tabs. 348 # This makes subsequent regexen easier to write, because we can 349 # match consecutive blank lines with /\n+/ instead of something 350 # contorted like /[ \t]*\n+/ . 351 text = self._ws_only_line_re.sub("", text) 352 353 # strip metadata from head and extract 354 if "metadata" in self.extras: 355 text = self._extract_metadata(text) 356 357 text = self.preprocess(text) 358 359 if "fenced-code-blocks" in self.extras and not self.safe_mode: 360 text = self._do_fenced_code_blocks(text) 361 362 if self.safe_mode: 363 text = self._hash_html_spans(text) 364 365 # Turn block-level HTML blocks into hash entries 366 text = self._hash_html_blocks(text, raw=True) 367 368 if "fenced-code-blocks" in self.extras and self.safe_mode: 369 text = self._do_fenced_code_blocks(text) 370 371 # Because numbering references aren't links (yet?) then we can do everything associated with counters 372 # before we get started 373 if "numbering" in self.extras: 374 text = self._do_numbering(text) 375 376 # Strip link definitions, store in hashes. 377 if "footnotes" in self.extras: 378 # Must do footnotes first because an unlucky footnote defn 379 # looks like a link defn: 380 # [^4]: this "looks like a link defn" 381 text = self._strip_footnote_definitions(text) 382 text = self._strip_link_definitions(text) 383 384 text = self._run_block_gamut(text) 385 386 if "footnotes" in self.extras: 387 text = self._add_footnotes(text) 388 389 text = self.postprocess(text) 390 391 text = self._unescape_special_chars(text) 392 393 if self.safe_mode: 394 text = self._unhash_html_spans(text) 395 # return the removed text warning to its markdown.py compatible form 396 text = text.replace(self.html_removed_text, self.html_removed_text_compat) 397 398 do_target_blank_links = "target-blank-links" in self.extras 399 do_nofollow_links = "nofollow" in self.extras 400 401 if do_target_blank_links and do_nofollow_links: 402 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow noopener" target="_blank"\2', text) 403 elif do_target_blank_links: 404 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="noopener" target="_blank"\2', text) 405 elif do_nofollow_links: 406 text = self._a_nofollow_or_blank_links.sub(r'<\1 rel="nofollow"\2', text) 407 408 if "toc" in self.extras and self._toc: 409 self._toc_html = calculate_toc_html(self._toc) 410 411 # Prepend toc html to output 412 if self.cli: 413 text = '{}\n{}'.format(self._toc_html, text) 414 415 text += "\n" 416 417 # Attach attrs to output 418 rv = UnicodeWithAttrs(text) 419 420 if "toc" in self.extras and self._toc: 421 rv.toc_html = self._toc_html 422 423 if "metadata" in self.extras: 424 rv.metadata = self.metadata 425 return rv 426 427 def postprocess(self, text): 428 """A hook for subclasses to do some postprocessing of the html, if 429 desired. This is called before unescaping of special chars and 430 unhashing of raw HTML spans. 431 """ 432 return text 433 434 def preprocess(self, text): 435 """A hook for subclasses to do some preprocessing of the Markdown, if 436 desired. This is called after basic formatting of the text, but prior 437 to any extras, safe mode, etc. processing. 438 """ 439 return text 440 441 # Is metadata if the content starts with optional '---'-fenced `key: value` 442 # pairs. E.g. 
(indented for presentation): 443 # --- 444 # foo: bar 445 # another-var: blah blah 446 # --- 447 # # header 448 # or: 449 # foo: bar 450 # another-var: blah blah 451 # 452 # # header 453 _meta_data_pattern = re.compile(r'^(?:---[\ \t]*\n)?((?:[\S\w]+\s*:(?:\n+[ \t]+.*)+)|(?:.*:\s+>\n\s+[\S\s]+?)(?=\n\w+\s*:\s*\w+\n|\Z)|(?:\s*[\S\w]+\s*:(?! >)[ \t]*.*\n?))(?:---[\ \t]*\n)?', re.MULTILINE) 454 _key_val_pat = re.compile(r"[\S\w]+\s*:(?! >)[ \t]*.*\n?", re.MULTILINE) 455 # this allows key: > 456 # value 457 # conutiues over multiple lines 458 _key_val_block_pat = re.compile( 459 r"(.*:\s+>\n\s+[\S\s]+?)(?=\n\w+\s*:\s*\w+\n|\Z)", re.MULTILINE 460 ) 461 _key_val_list_pat = re.compile( 462 r"^-(?:[ \t]*([^\n]*)(?:[ \t]*[:-][ \t]*(\S+))?)(?:\n((?:[ \t]+[^\n]+\n?)+))?", 463 re.MULTILINE, 464 ) 465 _key_val_dict_pat = re.compile( 466 r"^([^:\n]+)[ \t]*:[ \t]*([^\n]*)(?:((?:\n[ \t]+[^\n]+)+))?", re.MULTILINE 467 ) # grp0: key, grp1: value, grp2: multiline value 468 _meta_data_fence_pattern = re.compile(r'^---[\ \t]*\n', re.MULTILINE) 469 _meta_data_newline = re.compile("^\n", re.MULTILINE) 470 471 def _extract_metadata(self, text): 472 if text.startswith("---"): 473 fence_splits = re.split(self._meta_data_fence_pattern, text, maxsplit=2) 474 metadata_content = fence_splits[1] 475 match = re.findall(self._meta_data_pattern, metadata_content) 476 if not match: 477 return text 478 tail = fence_splits[2] 479 else: 480 metadata_split = re.split(self._meta_data_newline, text, maxsplit=1) 481 metadata_content = metadata_split[0] 482 match = re.findall(self._meta_data_pattern, metadata_content) 483 if not match: 484 return text 485 tail = metadata_split[1] 486 487 def parse_structured_value(value): 488 vs = value.lstrip() 489 vs = value.replace(v[: len(value) - len(vs)], "\n")[1:] 490 491 # List 492 if vs.startswith("-"): 493 r = [] 494 for match in re.findall(self._key_val_list_pat, vs): 495 if match[0] and not match[1] and not match[2]: 496 r.append(match[0].strip()) 497 elif match[0] == ">" and not match[1] and match[2]: 498 r.append(match[2].strip()) 499 elif match[0] and match[1]: 500 r.append({match[0].strip(): match[1].strip()}) 501 elif not match[0] and not match[1] and match[2]: 502 r.append(parse_structured_value(match[2])) 503 else: 504 # Broken case 505 pass 506 507 return r 508 509 # Dict 510 else: 511 return { 512 match[0].strip(): ( 513 match[1].strip() 514 if match[1] 515 else parse_structured_value(match[2]) 516 ) 517 for match in re.findall(self._key_val_dict_pat, vs) 518 } 519 520 for item in match: 521 522 k, v = item.split(":", 1) 523 524 # Multiline value 525 if v[:3] == " >\n": 526 self.metadata[k.strip()] = _dedent(v[3:]).strip() 527 528 # Empty value 529 elif v == "\n": 530 self.metadata[k.strip()] = "" 531 532 # Structured value 533 elif v[0] == "\n": 534 self.metadata[k.strip()] = parse_structured_value(v) 535 536 # Simple value 537 else: 538 self.metadata[k.strip()] = v.strip() 539 540 return tail 541 542 _emacs_oneliner_vars_pat = re.compile(r"-\*-\s*(?:(\S[^\r\n]*?)([\r\n]\s*)?)?-\*-", re.UNICODE) 543 # This regular expression is intended to match blocks like this: 544 # PREFIX Local Variables: SUFFIX 545 # PREFIX mode: Tcl SUFFIX 546 # PREFIX End: SUFFIX 547 # Some notes: 548 # - "[ \t]" is used instead of "\s" to specifically exclude newlines 549 # - "(\r\n|\n|\r)" is used instead of "$" because the sre engine does 550 # not like anything other than Unix-style line terminators. 551 _emacs_local_vars_pat = re.compile(r"""^ 552 (?P<prefix>(?:[^\r\n|\n|\r])*?) 
553 [\ \t]*Local\ Variables:[\ \t]* 554 (?P<suffix>.*?)(?:\r\n|\n|\r) 555 (?P<content>.*?\1End:) 556 """, re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE) 557 558 def _get_emacs_vars(self, text): 559 """Return a dictionary of emacs-style local variables. 560 561 Parsing is done loosely according to this spec (and according to 562 some in-practice deviations from this): 563 http://www.gnu.org/software/emacs/manual/html_node/emacs/Specifying-File-Variables.html#Specifying-File-Variables 564 """ 565 emacs_vars = {} 566 SIZE = pow(2, 13) # 8kB 567 568 # Search near the start for a '-*-'-style one-liner of variables. 569 head = text[:SIZE] 570 if "-*-" in head: 571 match = self._emacs_oneliner_vars_pat.search(head) 572 if match: 573 emacs_vars_str = match.group(1) 574 assert '\n' not in emacs_vars_str 575 emacs_var_strs = [s.strip() for s in emacs_vars_str.split(';') 576 if s.strip()] 577 if len(emacs_var_strs) == 1 and ':' not in emacs_var_strs[0]: 578 # While not in the spec, this form is allowed by emacs: 579 # -*- Tcl -*- 580 # where the implied "variable" is "mode". This form 581 # is only allowed if there are no other variables. 582 emacs_vars["mode"] = emacs_var_strs[0].strip() 583 else: 584 for emacs_var_str in emacs_var_strs: 585 try: 586 variable, value = emacs_var_str.strip().split(':', 1) 587 except ValueError: 588 log.debug("emacs variables error: malformed -*- " 589 "line: %r", emacs_var_str) 590 continue 591 # Lowercase the variable name because Emacs allows "Mode" 592 # or "mode" or "MoDe", etc. 593 emacs_vars[variable.lower()] = value.strip() 594 595 tail = text[-SIZE:] 596 if "Local Variables" in tail: 597 match = self._emacs_local_vars_pat.search(tail) 598 if match: 599 prefix = match.group("prefix") 600 suffix = match.group("suffix") 601 lines = match.group("content").splitlines(0) 602 # print "prefix=%r, suffix=%r, content=%r, lines: %s"\ 603 # % (prefix, suffix, match.group("content"), lines) 604 605 # Validate the Local Variables block: proper prefix and suffix 606 # usage. 607 for i, line in enumerate(lines): 608 if not line.startswith(prefix): 609 log.debug("emacs variables error: line '%s' " 610 "does not use proper prefix '%s'" 611 % (line, prefix)) 612 return {} 613 # Don't validate suffix on last line. Emacs doesn't care, 614 # neither should we. 615 if i != len(lines)-1 and not line.endswith(suffix): 616 log.debug("emacs variables error: line '%s' " 617 "does not use proper suffix '%s'" 618 % (line, suffix)) 619 return {} 620 621 # Parse out one emacs var per line. 622 continued_for = None 623 for line in lines[:-1]: # no var on the last line ("PREFIX End:") 624 if prefix: line = line[len(prefix):] # strip prefix 625 if suffix: line = line[:-len(suffix)] # strip suffix 626 line = line.strip() 627 if continued_for: 628 variable = continued_for 629 if line.endswith('\\'): 630 line = line[:-1].rstrip() 631 else: 632 continued_for = None 633 emacs_vars[variable] += ' ' + line 634 else: 635 try: 636 variable, value = line.split(':', 1) 637 except ValueError: 638 log.debug("local variables error: missing colon " 639 "in local variables entry: '%s'" % line) 640 continue 641 # Do NOT lowercase the variable name, because Emacs only 642 # allows "mode" (and not "Mode", "MoDe", etc.) in this block. 643 value = value.strip() 644 if value.endswith('\\'): 645 value = value[:-1].rstrip() 646 continued_for = variable 647 else: 648 continued_for = None 649 emacs_vars[variable] = value 650 651 # Unquote values. 
        for var, val in list(emacs_vars.items()):
            # Strip matching surrounding quotes (double or single).
            if len(val) > 1 and (val.startswith('"') and val.endswith('"')
                                 or val.startswith("'") and val.endswith("'")):
                emacs_vars[var] = val[1:-1]

        return emacs_vars

    def _detab_line(self, line):
        r"""Recursively convert tabs to spaces in a single line.

        Called from _detab()."""
        if '\t' not in line:
            return line
        chunk1, chunk2 = line.split('\t', 1)
        chunk1 += (' ' * (self.tab_width - len(chunk1) % self.tab_width))
        output = chunk1 + chunk2
        return self._detab_line(output)

    def _detab(self, text):
        r"""Iterate text line by line and convert tabs to spaces.

            >>> m = Markdown()
            >>> m._detab("\tfoo")
            '    foo'
            >>> m._detab("  \tfoo")
            '    foo'
            >>> m._detab("\t  foo")
            '      foo'
            >>> m._detab("  foo")
            '  foo'
            >>> m._detab("  foo\n\tbar\tblam")
            '  foo\n    bar blam'
        """
        if '\t' not in text:
            return text
        output = []
        for line in text.splitlines():
            output.append(self._detab_line(line))
        return '\n'.join(output)

    # The html5 tags are broken out here and added to _block_tags_a and
    # _block_tags_b. This way html5 tags are easy to keep track of.
    _html5tags = '|article|aside|header|hgroup|footer|nav|section|figure|figcaption'

    _block_tags_a = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math|ins|del'
    _block_tags_a += _html5tags

    _strict_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            </\2>               # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_a,
        re.X | re.M)

    _block_tags_b = 'p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math'
    _block_tags_b += _html5tags

    _liberal_tag_block_re = re.compile(r"""
        (                       # save in \1
            ^                   # start of line  (with re.M)
            <(%s)               # start tag = \2
            \b                  # word break
            (.*\n)*?            # any number of lines, minimally matching
            .*</\2>             # the matching end tag
            [ \t]*              # trailing spaces/tabs
            (?=\n+|\Z)          # followed by a newline or end of document
        )
        """ % _block_tags_b,
        re.X | re.M)

    _html_markdown_attr_re = re.compile(
        r'''\s+markdown=("1"|'1')''')
    def _hash_html_block_sub(self, match, raw=False):
        html = match.group(1)
        if raw and self.safe_mode:
            html = self._sanitize_html(html)
        elif 'markdown-in-html' in self.extras and 'markdown=' in html:
            first_line = html.split('\n', 1)[0]
            m = self._html_markdown_attr_re.search(first_line)
            if m:
                lines = html.split('\n')
                middle = '\n'.join(lines[1:-1])
                last_line = lines[-1]
                first_line = first_line[:m.start()] + first_line[m.end():]
                f_key = _hash_text(first_line)
                self.html_blocks[f_key] = first_line
                l_key = _hash_text(last_line)
                self.html_blocks[l_key] = last_line
                return ''.join(["\n\n", f_key,
                                "\n\n", middle, "\n\n",
                                l_key, "\n\n"])
        key = _hash_text(html)
        self.html_blocks[key] = html
        return "\n\n" + key + "\n\n"

    def _hash_html_blocks(self, text, raw=False):
        """Hashify HTML blocks

        We only want to do this for block-level HTML tags, such as headers,
        lists, and tables.
        That's because we still want to wrap <p>s around
        "paragraphs" that are wrapped in non-block-level tags, such as anchors,
        phrase emphasis, and spans. The list of tags we're looking for is
        hard-coded.

        @param raw {boolean} indicates if these are raw HTML blocks in
            the original source. It makes a difference in "safe" mode.
        """
        if '<' not in text:
            return text

        # Pass `raw` value into our calls to self._hash_html_block_sub.
        hash_html_block_sub = _curry(self._hash_html_block_sub, raw=raw)

        # First, look for nested blocks, e.g.:
        #   <div>
        #       <div>
        #       tags for inner block must be indented.
        #       </div>
        #   </div>
        #
        # The outermost tags must start at the left margin for this to match, and
        # the inner nested divs must be indented.
        # We need to do this before the next, more liberal match, because the next
        # match will start at the first `<div>` and stop at the first `</div>`.
        text = self._strict_tag_block_re.sub(hash_html_block_sub, text)

        # Now match more liberally, simply from `\n<tag>` to `</tag>\n`
        text = self._liberal_tag_block_re.sub(hash_html_block_sub, text)

        # Special case just for <hr />. It was easier to make a special
        # case than to make the other regex more complicated.
        if "<hr" in text:
            _hr_tag_re = _hr_tag_re_from_tab_width(self.tab_width)
            text = _hr_tag_re.sub(hash_html_block_sub, text)

        # Special case for standalone HTML comments:
        if "<!--" in text:
            start = 0
            while True:
                # Delimiters for next comment block.
                try:
                    start_idx = text.index("<!--", start)
                except ValueError:
                    break
                try:
                    end_idx = text.index("-->", start_idx) + 3
                except ValueError:
                    break

                # Start position for next comment block search.
                start = end_idx

                # Validate whitespace before comment.
                if start_idx:
                    # - Up to `tab_width - 1` spaces before start_idx.
                    for i in range(self.tab_width - 1):
                        if text[start_idx - 1] != ' ':
                            break
                        start_idx -= 1
                        if start_idx == 0:
                            break
                    # - Must be preceded by 2 newlines or hit the start of
                    #   the document.
                    if start_idx == 0:
                        pass
                    elif start_idx == 1 and text[0] == '\n':
                        start_idx = 0  # to match minute detail of Markdown.pl regex
                    elif text[start_idx-2:start_idx] == '\n\n':
                        pass
                    else:
                        break

                # Validate whitespace after comment.
                # - Any number of spaces and tabs.
                while end_idx < len(text):
                    if text[end_idx] not in ' \t':
                        break
                    end_idx += 1
                # - Must be followed by 2 newlines or hit end of text.
                if text[end_idx:end_idx+2] not in ('', '\n', '\n\n'):
                    continue

                # Escape and hash (must match `_hash_html_block_sub`).
                html = text[start_idx:end_idx]
                if raw and self.safe_mode:
                    html = self._sanitize_html(html)
                key = _hash_text(html)
                self.html_blocks[key] = html
                text = text[:start_idx] + "\n\n" + key + "\n\n" + text[end_idx:]

        if "xml" in self.extras:
            # Treat XML processing instructions and namespaced one-liner
            # tags as if they were block HTML tags. E.g., if standalone
            # (i.e. are their own paragraph), the following do not get
            # wrapped in a <p> tag:
            #    <?foo bar?>
            #
            #    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="chapter_1.md"/>
            _xml_oneliner_re = _xml_oneliner_re_from_tab_width(self.tab_width)
            text = _xml_oneliner_re.sub(hash_html_block_sub, text)

        return text

    def _strip_link_definitions(self, text):
        # Strips link definitions from text, stores the URLs and titles in
        # hash references.
        less_than_tab = self.tab_width - 1

        # Link defs are in the form:
        #   [id]: url "optional title"
        _link_def_re = re.compile(r"""
            ^[ ]{0,%d}\[(.+)\]: # id = \1
              [ \t]*
              \n?               # maybe *one* newline
              [ \t]*
            <?(.+?)>?           # url = \2
              [ \t]*
            (?:
                \n?             # maybe one newline
                [ \t]*
                (?<=\s)         # lookbehind for whitespace
                ['"(]
                ([^\n]*)        # title = \3
                ['")]
                [ \t]*
            )?  # title is optional
            (?:\n+|\Z)
            """ % less_than_tab, re.X | re.M | re.U)
        return _link_def_re.sub(self._extract_link_def_sub, text)

    def _extract_link_def_sub(self, match):
        id, url, title = match.groups()
        key = id.lower()  # Link IDs are case-insensitive
        self.urls[key] = self._encode_amps_and_angles(url)
        if title:
            self.titles[key] = title
        return ""

    def _do_numbering(self, text):
        ''' We handle the special extension for generic numbering for
            tables, figures, etc.
        '''
        # First pass to define all the references
        self.regex_defns = re.compile(r'''
            \[\#(\w+)       # the counter. Open square plus hash plus a word \1
            ([^@]*)         # Some optional characters, that aren't an @. \2
            @(\w+)          # the id. Should this be normed? \3
            ([^\]]*)\]      # The rest of the text up to the terminating ] \4
            ''', re.VERBOSE)
        self.regex_subs = re.compile(r"\[@(\w+)\s*\]")  # [@ref_id]
        counters = {}
        references = {}
        replacements = []
        definition_html = '<figcaption class="{}" id="counter-ref-{}">{}{}{}</figcaption>'
        reference_html = '<a class="{}" href="#counter-ref-{}">{}</a>'
        for match in self.regex_defns.finditer(text):
            # We must have four match groups; otherwise this isn't a numbering reference.
            if len(match.groups()) != 4:
                continue
            counter = match.group(1)
            text_before = match.group(2).strip()
            ref_id = match.group(3)
            text_after = match.group(4)
            number = counters.get(counter, 1)
            references[ref_id] = (number, counter)
            replacements.append((match.start(0),
                                 definition_html.format(counter,
                                                        ref_id,
                                                        text_before,
                                                        number,
                                                        text_after),
                                 match.end(0)))
            counters[counter] = number + 1
        for repl in reversed(replacements):
            text = text[:repl[0]] + repl[1] + text[repl[2]:]

        # Second pass to replace the references with the right
        # value of the counter.
        # Fwiw, it's vaguely annoying to have to turn the iterator into
        # a list and then reverse it but I can't think of a better thing to do.
        for match in reversed(list(self.regex_subs.finditer(text))):
            number, counter = references.get(match.group(1), (None, None))
            if number is not None:
                repl = reference_html.format(counter,
                                             match.group(1),
                                             number)
            else:
                repl = reference_html.format(match.group(1),
                                             'countererror',
                                             '?' + match.group(1) + '?')
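            # For illustration (added comment; `fig9` is just an example id):
            # with the "numbering" extra, an unresolved reference like
            # `[@fig9]` falls into this branch and renders as
            #   <a class="fig9" href="#counter-ref-countererror">?fig9?</a>
            # so broken references stay visible in the output.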
            if "smarty-pants" in self.extras:
                repl = repl.replace('"', self._escape_table['"'])

            text = text[:match.start()] + repl + text[match.end():]
        return text

    def _extract_footnote_def_sub(self, match):
        id, text = match.groups()
        text = _dedent(text, skip_first_line=not text.startswith('\n')).strip()
        normed_id = re.sub(r'\W', '-', id)
        # Ensure footnote text ends with a couple newlines (for some
        # block gamut matches).
        self.footnotes[normed_id] = text + "\n\n"
        return ""

    def _strip_footnote_definitions(self, text):
        """A footnote definition looks like this:

            [^note-id]: Text of the note.

                May include one or more indented paragraphs.

        Where,
        - The 'note-id' can be pretty much anything, though typically it
          is the number of the footnote.
        - The first paragraph may start on the next line, like so:

            [^note-id]:
                Text of the note.
        """
        less_than_tab = self.tab_width - 1
        footnote_def_re = re.compile(r'''
            ^[ ]{0,%d}\[\^(.+)\]:   # id = \1
            [ \t]*
            (                       # footnote text = \2
              # First line need not start with the spaces.
              (?:\s*.*\n+)
              (?:
                (?:[ ]{%d} | \t)    # Subsequent lines must be indented.
                .*\n+
              )*
            )
            # Lookahead for non-space at line-start, or end of doc.
            (?:(?=^[ ]{0,%d}\S)|\Z)
            ''' % (less_than_tab, self.tab_width, self.tab_width),
            re.X | re.M)
        return footnote_def_re.sub(self._extract_footnote_def_sub, text)

    _hr_re = re.compile(r'^[ ]{0,3}([-_*])[ ]{0,2}(\1[ ]{0,2}){2,}$', re.M)

    def _run_block_gamut(self, text):
        # These are all the transformations that form block-level
        # tags like paragraphs, headers, and list items.

        if "fenced-code-blocks" in self.extras:
            text = self._do_fenced_code_blocks(text)

        text = self._do_headers(text)

        # Do Horizontal Rules:
        # On the number of spaces in horizontal rules: The spec is fuzzy: "If
        # you wish, you may use spaces between the hyphens or asterisks."
        # Markdown.pl 1.0.1's hr regexes limit the number of spaces between the
        # hr chars to one or two. We'll reproduce that limit here.
        hr = "\n<hr"+self.empty_element_suffix+"\n"
        text = re.sub(self._hr_re, hr, text)

        text = self._do_lists(text)

        if "pyshell" in self.extras:
            text = self._prepare_pyshell_blocks(text)
        if "wiki-tables" in self.extras:
            text = self._do_wiki_tables(text)
        if "tables" in self.extras:
            text = self._do_tables(text)

        text = self._do_code_blocks(text)

        text = self._do_block_quotes(text)

        # We already ran _HashHTMLBlocks() before, in Markdown(), but that
        # was to escape raw HTML in the original Markdown source. This time,
        # we're escaping the markup we've just created, so that we don't wrap
        # <p> tags around block-level tags.
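        # For illustration (added comment, assuming the default
        # empty_element_suffix of " />"): at this point a "---" line has
        # already become "\n<hr />\n"; hashing that markup now means
        # _form_paragraphs() below sees only an opaque key and will not
        # produce "<p><hr /></p>".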
        text = self._hash_html_blocks(text)

        text = self._form_paragraphs(text)

        return text

    def _pyshell_block_sub(self, match):
        if "fenced-code-blocks" in self.extras:
            dedented = _dedent(match.group(0))
            return self._do_fenced_code_blocks("```pycon\n" + dedented + "```\n")
        lines = match.group(0).splitlines(0)
        _dedentlines(lines)
        indent = ' ' * self.tab_width
        s = ('\n'  # separate from possible cuddled paragraph
             + indent + ('\n'+indent).join(lines)
             + '\n\n')
        return s

    def _prepare_pyshell_blocks(self, text):
        """Ensure that Python interactive shell sessions are put in
        code blocks -- even if not properly indented.
        """
        if ">>>" not in text:
            return text

        less_than_tab = self.tab_width - 1
        _pyshell_block_re = re.compile(r"""
            ^([ ]{0,%d})>>>[ ].*\n  # first line
            ^(\1[^\S\n]*\S.*\n)*    # any number of subsequent lines with at least one character
            ^\n                     # ends with a blank line
            """ % less_than_tab, re.M | re.X)

        return _pyshell_block_re.sub(self._pyshell_block_sub, text)

    def _table_sub(self, match):
        trim_space_re = '^[ \t\n]+|[ \t\n]+$'
        trim_bar_re = r'^\||\|$'
        split_bar_re = r'^\||(?<![\`\\])\|'
        escape_bar_re = r'\\\|'

        head, underline, body = match.groups()

        # Determine aligns for columns.
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", underline)))]
        align_from_col_idx = {}
        for col_idx, col in enumerate(cols):
            if col[0] == ':' and col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:center;"'
            elif col[0] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:left;"'
            elif col[-1] == ':':
                align_from_col_idx[col_idx] = ' style="text-align:right;"'

        # thead
        hlines = ['<table%s>' % self._html_class_str_from_tag('table'), '<thead>', '<tr>']
        cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", head)))]
        for col_idx, col in enumerate(cols):
            hlines.append('  <th%s>%s</th>' % (
                align_from_col_idx.get(col_idx, ''),
                self._run_span_gamut(col)
            ))
        hlines.append('</tr>')
        hlines.append('</thead>')

        # tbody
        hlines.append('<tbody>')
        for line in body.strip('\n').split('\n'):
            hlines.append('<tr>')
            cols = [re.sub(escape_bar_re, '|', cell.strip()) for cell in re.split(split_bar_re, re.sub(trim_bar_re, "", re.sub(trim_space_re, "", line)))]
            for col_idx, col in enumerate(cols):
                hlines.append('  <td%s>%s</td>' % (
                    align_from_col_idx.get(col_idx, ''),
                    self._run_span_gamut(col)
                ))
            hlines.append('</tr>')
        hlines.append('</tbody>')
        hlines.append('</table>')

        return '\n'.join(hlines) + '\n'

    def _do_tables(self, text):
        """Copying PHP-Markdown and GFM table syntax. Some regex borrowed from
        https://github.com/michelf/php-markdown/blob/lib/Michelf/Markdown.php#L2538
        """
        less_than_tab = self.tab_width - 1
        table_re = re.compile(r'''
            (?:(?<=\n\n)|\A\n?)             # leading blank line

            ^[ ]{0,%d}                      # allowed whitespace
            (.*[|].*)  \n                   # $1: header row (at least one pipe)

            ^[ ]{0,%d}                      # allowed whitespace
            (                               # $2: underline row
                # underline row with leading bar
                (?:  \|\ *:?-+:?\ *  )+  \|?  \s? \n
                |
                # or, underline row without leading bar
                (?:  \ *:?-+:?\ *\|  )+  (?:  \ *:?-+:?\ *  )?  \s? \n
            )

            (                               # $3: data rows
                (?:
                    ^[ ]{0,%d}(?!\ )        # ensure line begins with 0 to less_than_tab spaces
                    .*\|.*  \n
                )+
            )
            ''' % (less_than_tab, less_than_tab, less_than_tab), re.M | re.X)
        return table_re.sub(self._table_sub, text)

    def _wiki_table_sub(self, match):
        ttext = match.group(0).strip()
        # print('wiki table: %r' % match.group(0))
        rows = []
        for line in ttext.splitlines(0):
            line = line.strip()[2:-2].strip()
            row = [c.strip() for c in re.split(r'(?<!\\)\|\|', line)]
            rows.append(row)
        # from pprint import pprint
        # pprint(rows)
        hlines = []

        def add_hline(line, indents=0):
            hlines.append((self.tab * indents) + line)

        def format_cell(text):
            # Strip a leading header marker ('~') and surrounding spaces
            # before running the span gamut on the cell content.
            return self._run_span_gamut(re.sub(r"^\s*~", "", text).strip(" "))

        add_hline('<table%s>' % self._html_class_str_from_tag('table'))
        # Check if first cell of first row is a header cell. If so, assume the whole row is a header row.
        if rows and rows[0] and re.match(r"^\s*~", rows[0][0]):
            add_hline('<thead>', 1)
            add_hline('<tr>', 2)
            for cell in rows[0]:
                add_hline("<th>{}</th>".format(format_cell(cell)), 3)
            add_hline('</tr>', 2)
            add_hline('</thead>', 1)
            # Only one header row allowed.
            rows = rows[1:]
        # If no more rows, don't create a tbody.
        if rows:
            add_hline('<tbody>', 1)
            for row in rows:
                add_hline('<tr>', 2)
                for cell in row:
                    add_hline('<td>{}</td>'.format(format_cell(cell)), 3)
                add_hline('</tr>', 2)
            add_hline('</tbody>', 1)
        add_hline('</table>')
        return '\n'.join(hlines) + '\n'

    def _do_wiki_tables(self, text):
        # Optimization.
        if "||" not in text:
            return text

        less_than_tab = self.tab_width - 1
        wiki_table_re = re.compile(r'''
            (?:(?<=\n\n)|\A\n?)                 # leading blank line
            ^([ ]{0,%d})\|\|.+?\|\|[ ]*\n       # first line
            (^\1\|\|.+?\|\|\n)*                 # any number of subsequent lines
            ''' % less_than_tab, re.M | re.X)
        return wiki_table_re.sub(self._wiki_table_sub, text)

    def _run_span_gamut(self, text):
        # These are all the transformations that occur *within* block-level
        # tags like paragraphs, headers, and list items.

        text = self._do_code_spans(text)

        text = self._escape_special_chars(text)

        # Process anchor and image tags.
        if "link-patterns" in self.extras:
            text = self._do_link_patterns(text)

        text = self._do_links(text)

        # Make links out of things like `<http://example.com/>`
        # Must come after _do_links(), because you can use < and >
        # delimiters in inline links like [this](<url>).
        text = self._do_auto_links(text)

        text = self._encode_amps_and_angles(text)

        if "strike" in self.extras:
            text = self._do_strike(text)

        if "underline" in self.extras:
            text = self._do_underline(text)

        text = self._do_italics_and_bold(text)

        if "smarty-pants" in self.extras:
            text = self._do_smart_punctuation(text)

        # Do hard breaks:
        if "break-on-newline" in self.extras:
            text = re.sub(r" *\n", "<br%s\n" % self.empty_element_suffix, text)
        else:
            text = re.sub(r" {2,}\n", " <br%s\n" % self.empty_element_suffix, text)

        return text

    # "Sorta" because auto-links are identified as "tag" tokens.
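    # For illustration (added comment): because the whole pattern is one
    # capturing group, re.split() keeps the markup tokens, and the
    # odd-indexed pieces of the split are the "HTML markup" tokens, e.g.:
    #     >>> Markdown._sorta_html_tokenize_re.split('a <b>bold</b> word')
    #     ['a ', '<b>', 'bold', '</b>', ' word']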
    _sorta_html_tokenize_re = re.compile(r"""
        (
            # tag
            </?
            (?:\w+)                                     # tag name
            (?:\s+(?:[\w-]+:)?[\w-]+=(?:".*?"|'.*?'))*  # attributes
            \s*/?>
            |
            # auto-link (e.g., <http://www.activestate.com/>)
            <[\w~:/?#\[\]@!$&'\(\)*+,;%=\.\\-]+>
            |
            <!--.*?-->      # comment
            |
            <\?.*?\?>       # processing instruction
        )
        """, re.X)

    def _escape_special_chars(self, text):
        # Python markdown note: the HTML tokenization here differs from
        # that in Markdown.pl, hence the behaviour for subtle cases can
        # differ (I believe the tokenizer here does a better job because
        # it isn't susceptible to unmatched '<' and '>' in HTML tags).
        # Note, however, that '>' is not allowed in an auto-link URL
        # here.
        escaped = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup:
                # Within tags/HTML-comments/auto-links, encode * and _
                # so they don't conflict with their use in Markdown for
                # italics and strong. We're replacing each such
                # character with its corresponding MD5 checksum value;
                # this is likely overkill, but it should prevent us from
                # colliding with the escape values by accident.
                escaped.append(token.replace('*', self._escape_table['*'])
                                    .replace('_', self._escape_table['_']))
            else:
                escaped.append(self._encode_backslash_escapes(token))
            is_html_markup = not is_html_markup
        return ''.join(escaped)

    def _hash_html_spans(self, text):
        # Used for safe_mode.

        def _is_auto_link(s):
            if ':' in s and self._auto_link_re.match(s):
                return True
            elif '@' in s and self._auto_email_link_re.match(s):
                return True
            return False

        tokens = []
        is_html_markup = False
        for token in self._sorta_html_tokenize_re.split(text):
            if is_html_markup and not _is_auto_link(token):
                sanitized = self._sanitize_html(token)
                key = _hash_text(sanitized)
                self.html_spans[key] = sanitized
                tokens.append(key)
            else:
                tokens.append(self._encode_incomplete_tags(token))
            is_html_markup = not is_html_markup
        return ''.join(tokens)

    def _unhash_html_spans(self, text):
        for key, sanitized in list(self.html_spans.items()):
            text = text.replace(key, sanitized)
        return text

    def _sanitize_html(self, s):
        if self.safe_mode == "replace":
            return self.html_removed_text
        elif self.safe_mode == "escape":
            replacements = [
                ('&', '&amp;'),
                ('<', '&lt;'),
                ('>', '&gt;'),
            ]
            for before, after in replacements:
                s = s.replace(before, after)
            return s
        else:
            raise MarkdownError("invalid value for 'safe_mode': %r (must be "
                                "'escape' or 'replace')" % self.safe_mode)

    _inline_link_title = re.compile(r'''
        (                   # \1
          [ \t]+
          (['"])            # quote char = \2
          (?P<title>.*?)
          \2
        )?                  # title is optional
        \)$
        ''', re.X | re.S)
    _tail_of_reference_link_re = re.compile(r'''
        # Match tail of: [text][id]
        [ ]?          # one optional space
        (?:\n[ ]*)?   # one optional newline followed by spaces
        \[
            (?P<id>.*?)
        \]
        ''', re.X | re.S)

    _whitespace = re.compile(r'\s*')

    _strip_anglebrackets = re.compile(r'<(.*)>.*')

    def _find_non_whitespace(self, text, start):
        """Returns the index of the first non-whitespace character in text
        after (and including) start.
        """
        match = self._whitespace.match(text, start)
        return match.end()

    def _find_balanced(self, text, start, open_c, close_c):
        """Returns the index where the open_c and close_c characters balance
        out -- the same number of open_c and close_c are encountered -- or the
        end of string if it's reached before the balance point is found.
        """
        i = start
        l = len(text)
        count = 1
        while count > 0 and i < l:
            if text[i] == open_c:
                count += 1
            elif text[i] == close_c:
                count -= 1
            i += 1
        return i

    def _extract_url_and_title(self, text, start):
        """Extracts the url and (optional) title from the tail of a link"""
        # text[start] equals the opening parenthesis
        idx = self._find_non_whitespace(text, start+1)
        if idx == len(text):
            return None, None, None
        end_idx = idx
        has_anglebrackets = text[idx] == "<"
        if has_anglebrackets:
            end_idx = self._find_balanced(text, end_idx+1, "<", ">")
        end_idx = self._find_balanced(text, end_idx, "(", ")")
        match = self._inline_link_title.search(text, idx, end_idx)
        if not match:
            return None, None, None
        url, title = text[idx:match.start()], match.group("title")
        if has_anglebrackets:
            url = self._strip_anglebrackets.sub(r'\1', url)
        return url, title, end_idx

    _safe_protocols = re.compile(r'(https?|ftp):', re.I)

    def _do_links(self, text):
        """Turn Markdown link shortcuts into XHTML <a> and <img> tags.

        This is a combination of Markdown.pl's _DoAnchors() and
        _DoImages(). They are done together because that simplified the
        approach. It was necessary to use a different approach than
        Markdown.pl because of the lack of atomic matching support in
        Python's regex engine used in $g_nested_brackets.
        """
        MAX_LINK_TEXT_SENTINEL = 3000  # markdown2 issue 24

        # `anchor_allowed_pos` is used to support img links inside
        # anchors, but not anchors inside anchors. An anchor's start
        # pos must be `>= anchor_allowed_pos`.
        anchor_allowed_pos = 0

        curr_pos = 0
        while True:  # Handle the next link.
            # The next '[' is the start of:
            # - an inline anchor:   [text](url "title")
            # - a reference anchor: [text][id]
            # - an inline img:      ![text](url "title")
            # - a reference img:    ![text][id]
            # - a footnote ref:     [^id]
            #   (Only if 'footnotes' extra enabled)
            # - a footnote defn:    [^id]: ...
            #   (Only if 'footnotes' extra enabled) These have already
            #   been stripped in _strip_footnote_definitions() so no
            #   need to watch for them.
            # - a link definition:  [id]: url "title"
            #   These have already been stripped in
            #   _strip_link_definitions() so no need to watch for them.
            # - not markup:         [...anything else...
            try:
                start_idx = text.index('[', curr_pos)
            except ValueError:
                break
            text_length = len(text)

            # Find the matching closing ']'.
            # Markdown.pl allows *matching* brackets in link text so we
            # will here too. Markdown.pl *doesn't* currently allow
            # matching brackets in img alt text -- we'll differ in that
            # regard.
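            # For illustration (added comment; the text is just an example):
            # scanning '[a [b] c](/x)' from start_idx 0, bracket_depth rises
            # to 1 at the inner '[', returns to 0 at its ']', and the loop
            # breaks at the outer ']' (where depth would go negative),
            # leaving link_text == 'a [b] c'.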
            bracket_depth = 0
            for p in range(start_idx+1, min(start_idx+MAX_LINK_TEXT_SENTINEL,
                                            text_length)):
                ch = text[p]
                if ch == ']':
                    bracket_depth -= 1
                    if bracket_depth < 0:
                        break
                elif ch == '[':
                    bracket_depth += 1
            else:
                # Closing bracket not found within sentinel length.
                # This isn't markup.
                curr_pos = start_idx + 1
                continue
            link_text = text[start_idx+1:p]

            # Fix for issue 341 - Injecting XSS into link text
            if self.safe_mode:
                link_text = self._hash_html_spans(link_text)
                link_text = self._unhash_html_spans(link_text)

            # Possibly a footnote ref?
            if "footnotes" in self.extras and link_text.startswith("^"):
                normed_id = re.sub(r'\W', '-', link_text[1:])
                if normed_id in self.footnotes:
                    self.footnote_ids.append(normed_id)
                    result = '<sup class="footnote-ref" id="fnref-%s">' \
                             '<a href="#fn-%s">%s</a></sup>' \
                             % (normed_id, normed_id, len(self.footnote_ids))
                    text = text[:start_idx] + result + text[p+1:]
                else:
                    # This id isn't defined, leave the markup alone.
                    curr_pos = p+1
                continue

            # Now determine what this is by the remainder.
            p += 1
            if p == text_length:
                return text

            # Inline anchor or img?
            if text[p] == '(':  # attempt at perf improvement
                url, title, url_end_idx = self._extract_url_and_title(text, p)
                if url is not None:
                    # Handle an inline anchor or img.
                    is_img = start_idx > 0 and text[start_idx-1] == "!"
                    if is_img:
                        start_idx -= 1

                    # We've got to encode these to avoid conflicting
                    # with italics/bold.
                    url = url.replace('*', self._escape_table['*']) \
                             .replace('_', self._escape_table['_'])
                    if title:
                        title_str = ' title="%s"' % (
                            _xml_escape_attr(title)
                                .replace('*', self._escape_table['*'])
                                .replace('_', self._escape_table['_']))
                    else:
                        title_str = ''
                    if is_img:
                        img_class_str = self._html_class_str_from_tag("img")
                        result = '<img src="%s" alt="%s"%s%s%s' \
                            % (_html_escape_url(url, safe_mode=self.safe_mode),
                               _xml_escape_attr(link_text),
                               title_str,
                               img_class_str,
                               self.empty_element_suffix)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        curr_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    elif start_idx >= anchor_allowed_pos:
                        safe_link = self._safe_protocols.match(url) or url.startswith('#')
                        if self.safe_mode and not safe_link:
                            result_head = '<a href="#"%s>' % (title_str)
                        else:
                            result_head = '<a href="%s"%s>' % (_html_escape_url(url, safe_mode=self.safe_mode), title_str)
                        result = '%s%s</a>' % (result_head, link_text)
                        if "smarty-pants" in self.extras:
                            result = result.replace('"', self._escape_table['"'])
                        # <img> allowed from curr_pos on, <a> from
                        # anchor_allowed_pos on.
                        curr_pos = start_idx + len(result_head)
                        anchor_allowed_pos = start_idx + len(result)
                        text = text[:start_idx] + result + text[url_end_idx:]
                    else:
                        # Anchor not allowed here.
                        curr_pos = start_idx + 1
                    continue

            # Reference anchor or img?
            else:
                match = self._tail_of_reference_link_re.match(text, p)
                if match:
                    # Handle a reference-style anchor or img.
                    is_img = start_idx > 0 and text[start_idx-1] == "!"
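                    # For illustration (added comment; `md` is just an example
                    # id): given a definition `[md]: http://example.com`
                    # collected earlier, both `[text][md]` and the implicit
                    # form `[md][]` land here; the lookup below is
                    # case-insensitive because link ids were lowercased when
                    # stored.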
                    if is_img:
                        start_idx -= 1
                    link_id = match.group("id").lower()
                    if not link_id:
                        link_id = link_text.lower()  # for links like [this][]
                    if link_id in self.urls:
                        url = self.urls[link_id]
                        # We've got to encode these to avoid conflicting
                        # with italics/bold.
                        url = url.replace('*', self._escape_table['*']) \
                                 .replace('_', self._escape_table['_'])
                        title = self.titles.get(link_id)
                        if title:
                            title = _xml_escape_attr(title) \
                                .replace('*', self._escape_table['*']) \
                                .replace('_', self._escape_table['_'])
                            title_str = ' title="%s"' % title
                        else:
                            title_str = ''
                        if is_img:
                            img_class_str = self._html_class_str_from_tag("img")
                            result = '<img src="%s" alt="%s"%s%s%s' \
                                % (_html_escape_url(url, safe_mode=self.safe_mode),
                                   _xml_escape_attr(link_text),
                                   title_str,
                                   img_class_str,
                                   self.empty_element_suffix)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            curr_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        elif start_idx >= anchor_allowed_pos:
                            if self.safe_mode and not self._safe_protocols.match(url):
                                result_head = '<a href="#"%s>' % (title_str)
                            else:
                                result_head = '<a href="%s"%s>' % (_html_escape_url(url, safe_mode=self.safe_mode), title_str)
                            result = '%s%s</a>' % (result_head, link_text)
                            if "smarty-pants" in self.extras:
                                result = result.replace('"', self._escape_table['"'])
                            # <img> allowed from curr_pos on, <a> from
                            # anchor_allowed_pos on.
                            curr_pos = start_idx + len(result_head)
                            anchor_allowed_pos = start_idx + len(result)
                            text = text[:start_idx] + result + text[match.end():]
                        else:
                            # Anchor not allowed here.
                            curr_pos = start_idx + 1
                    else:
                        # This id isn't defined, leave the markup alone.
                        curr_pos = match.end()
                    continue

            # Otherwise, it isn't markup.
            curr_pos = start_idx + 1

        return text

    def header_id_from_text(self, text, prefix, n):
        """Generate a header id attribute value from the given header
        HTML content.

        This is only called if the "header-ids" extra is enabled.
        Subclasses may override this for different header ids.

        @param text {str} The text of the header tag
        @param prefix {str} The requested prefix for header ids. This is the
            value of the "header-ids" extra key, if any. Otherwise, None.
        @param n {int} The <hN> tag number, i.e. `1` for an <h1> tag.
        @returns {str} The value for the header tag's "id" attribute. Return
            None to not have an id attribute and to exclude this header from
            the TOC (if the "toc" extra is specified).