Metadata-Version: 2.1
Name: simplify-docx
Version: 0.1.1
Summary: A utility for simplifying python-docx document objects
Home-page: https://microsofteconomics.visualstudio.com/EconTools/_git/loadify?_a=history
Author: Microsoft Research
Author-email: jathorpe@microsoft.com
License: UNKNOWN
Keywords: sql,csv
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown
License-File: LICENSE

# Overview

DOCX files are complex, and their complexity makes scraping documents
for their content difficult. The aim of this package is to simplify
`.docx` files to just the components which carry meaning, thereby easing the
process of pattern matching and data extraction by converting a `.docx`
file into a predictable and *human readable* JSON file.

Simplifying a complex document down to it's *meaningful* parts of course
requires taking a position on what does and does-not convey meaning in a
document. Generally, this package takes the stance that the document
structure (body, paragraphs, tables, etc.) are meaningful as is the text
itself, whereas text styling (font, font-weight, etc.) is ignored almost
entirely, with the exception of paragraph indentation and numbering which
is often used to create lists, block quotes, etc.  Furthermore, the
opinions expressed by this package are explained in the Options section
below and can be changed to suite your needs.

# Usage
```python
import docx
from simplify_docx import simplify

# read in a document 
my_doc = docx.Document("/path/to/my/favorite/file.docx")

# coerce to JSON using the standard options
my_doc_as_json = simplify(my_doc)

# or with non-standard options
my_doc_as_json = simplify(my_doc,{"remove-leading-white-space":False})
```

# Installation

This project relies on the `python-docx` package which can be installed via
`pip install python-docx`. **However**, as of this writing, if you wish to
scrape documents which contain (A) form fields such as drop down lists,
checkboxes and text inputs or (B) nested documents (subdocs, altChunks,
etc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx) of the python-docx package.

# Options

### General

* **"friendly-name"**: (*Default = `True`*): Use user-friendly type names
	such as "table-cell", over standard element names like "CT_Tc"

* **"merge-consecutive-text"**: (*Default = `True`*): Sentences and even single
	words can be represented by multiple text elements. If `True`,
	concatenate consecutive text elements into a single text element.

### Ignoring Invisible things

* **"ignore-empty-paragraphs"**: (*Default = `True`*): Empty paragraphs are
	often used for styling purpose and rarely have significance in the
	meaning of the document.
* **"ignore-empty-text"**: (*Default = `True`*): Empty text runs can make an
	otherwise empty paragraph appear to contain data.
* **"remove-leading-white-space"**: (*Default = `True`*): Leading white-space
	at the start of a paragraph is ocassionaly used for styling purposes
	and rarely has significance in the interpretation of a document.
* **"remove-trailing-white-space"**: (*Default = `True`*): Trailing white-space
	at the end of a paragraph rarely has significance in the interpretation
	of a document.
* **"flatten-inner-spaces"**: (*Default = `False`*): Collapse multiple
	space characters between words to a single space.
* **"ignore-joiners"**: (*Default = `False`*): Zero width joiner and non-joiner 
	characters are special characters used to create ligatures in displayed
	text and don't typically convey meaning (at least in alphabet based
	languages).

### Special symbols

* **"dumb-quotes"**: (*Default = `True`*): Replace smart quotes with
	dumb quotes.
* **"dumb-hyphens"**: (*Default = `True`*): Replace en-dash, em-dash,
	figure-dash, horizontal bar, and non-breaking hyphens with ordinary hyphens.
* **"dumb-spaces"**: (*Default = `True`*): Replace zero width spaces, hair 
	spaces, thin spaces, punctuation spaces, figure spaces, six per em
	spaces, four per em spaces, three per em spaces, em spaces, en spaces,
	em quad spaces, and en quad spaces with ordinary spaces.
* **"special-characters-as-text"**: (*Default = `True`*): Coerce special
	characters into text equivalents according to the following table:

| Character | Text Equivalent | 
| --------- | --------------- | 
| CarriageReturn | `\n` |
| Break | `\r` |
| TabChar | `\t` |
| PositionalTab | `\t` |
| NoBreakHyphen | `-` |
| SoftHyphen | `-` |

* **"symbol-as-text"**: (*Default = `True`*): Special symbols often cary
	meaning other than the underlying unicode character, especially when
	the font is a special font such as `Wingdings`. If `True` these are
	included as ordinary text and their font information is omitted.
* **"empty-as-text"**: (*Default = `False`*): There are a variety of "Empty"
	tags such as the `<"w:yearLong">` tag which cause the current year to
	be inserted into the document text. If `True`, include these as text
	formatted as `"[yearLong]"`.
* **"ignore-left-to-right-mark"**: (*Default = `False`*): Ignore the left-to-right
	mark, which is not writeable by pythons csv writer.
* **"ignore-right-to-left-mark"**: (*Default = `False`*): Ignore the right-to-left
	mark which is not writeable by pythons csv writer.

### Paragraph style:

Paragraph style markup are one exception to the styling vs. content
dichotomy. For example, block quotes are often indicated by indenting whole
paragraphs, and Ordered lists, Unordered lists and nesting of lists is
often used to divide sections of a document into logical components. 

* **"include-paragraph-indent"**: (*Default = `True`*): Include the
	indentation markup on paragraph (`CT_P`) elements. Indentation is
	measured in twips
* **"include-paragraph-numbering"**: (*Default = `True`*): Include the
	numbering styles, which are included in the `CT_P.pPr.numPr` element.
	The `ilvl` attribute indicates the level of nesting (zero based index)
	and the `numId` attribute refers to a specific numbering style
	included in the document's internal styles sheet. 

### Form Elements

* **"simplify-dropdown"**: (*Default = `True`*): Include just the selected
	and default values, the available options, and the name and label attributes in the form element.
* **"simplify-textinput"**: (*Default = `True`*): Include just the current
	and default values, and the name and label attributes in the form element.
* **"greedy-text-input"**: (*Default = `True`*): Continue consuming run
	elements when the text-input has not ended at the end of a paragraph,
	and the next block level element is also a paragraph. This typically
	occurs when the user preses the return key while editing a text input
	field.
* **"simplify-checkbox"**: (*Default = `True`*): Include just the current
	and default values, and the name and label attributes in the form element.
* **"use-checkbox-default"**: (*Default = `True`*): If the checkbox has no
	`value` attribute (typically because the user has not interacted with
	it), report the default value as the checkbox value.
* **"checkbox-as-text"**: (*Default = `False`*): Coerce the value of the
	checkbox to text, represented as either `"[CheckBox:True]"` or `"[CheckBox:False]"`
* **"dropdown-as-text"**: (*Default = `False`*): Coerce the value of the
	checkbox to text, represented as `"[DropDown:<selected value>]"`
* **"trim-dropdown-options"**: (*Default = `True`*): Remove white-space on
	the left and right of drop down option items.
* **"flatten-generic-field"**: (*Default = `True`*): `generic-fields` are
	`CT_FldChar` runs which are not marked as a drop-down, text-input, or
	checkbox. These may include special instructions which apply special
	formatting to a text run (e.g. a hyper link). If `True`, the contents
	of generic-fields are included in the normal flow of text

### Special content

* **"flatten-hyperlink"**: (*Default = `True`*): Flatten hyperlinks, including
	their contents in the flow of normal text.
* **"flatten-smartTag"**: (*Default = `True`*): Flatten smartTag elements, 
	including their contents in the flow of normal text.
* **"flatten-customXml"**: (*Default = `True`*): Flatten customXml elements, 
	including their contents in the flow of normal text.
* **"flatten-simpleField"**: (*Default = `True`*): Flatten simpleField elements, 
	including their contents in the flow of normal text.

# Contributing

This project welcomes contributions and suggestions.  Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.



