feat: Add GitHub data connector; add Markdown partitioner (#284) · ajaycode/unstructured@ded60af

Commit ded60af

feat: Add GitHub data connector; add Markdown partitioner (Unstructured-IO#284)
1 parent c89bba1 commit ded60af

27 files changed: +872 -24 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
@@ -108,6 +108,7 @@ jobs:
 make test
 make check-coverage
 make install-ingest-s3
+make install-ingest-github
 ./test_unstructured_ingest/test-ingest.sh

 changelog:

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.4.16-dev4
+## 0.4.16-dev5

 ### Enhancements

@@ -7,6 +7,8 @@
 ### Features

 * Added setup script for Ubuntu
+* Added GitHub connector for ingest cli.
+* Added `partition_md` partitioner.
 * Added Reddit connector for ingest cli.

 ### Fixes

Makefile

Lines changed: 5 additions & 0 deletions
@@ -54,6 +54,10 @@ install-build:
 install-ingest-s3:
 	pip install -r requirements/ingest-s3.txt

+.PHONY: install-ingest-github
+install-ingest-github:
+	pip install -r requirements/ingest-github.txt
+
 .PHONY: install-ingest-reddit
 install-ingest-reddit:
 	pip install -r requirements/ingest-reddit.txt
@@ -88,6 +92,7 @@ pip-compile:
 	cp requirements/build.txt docs/requirements.txt
 	pip-compile --upgrade --extra=s3 --output-file=requirements/ingest-s3.txt requirements/base.txt setup.py
 	pip-compile --upgrade --extra=reddit --output-file=requirements/ingest-reddit.txt requirements/base.txt setup.py
+	pip-compile --upgrade --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py

 ## install-project-local: install unstructured into your local python environment
 .PHONY: install-project-local

examples/ingest/github/ingest.sh

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+
+# Processes the Unstructured-IO/unstructured repository
+# through Unstructured's library in 2 processes.
+
+# Structured outputs are stored in github-ingest-output/
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+cd "$SCRIPT_DIR"/../../.. || exit 1
+
+PYTHONPATH=. ./unstructured/ingest/main.py \
+    --github-url Unstructured-IO/unstructured \
+    --github-branch main \
+    --structured-output-dir github-ingest-output \
+    --num-processes 2 \
+    --verbose
+
+# Alternatively, you can call it using:
+# unstructured-ingest --github-url ...
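
The example script passes the repository as the shorthand `Unstructured-IO/unstructured`. As a rough illustration of how such a value could be split into an owner and a repository name, here is a minimal Python sketch. The helper `split_github_url` is hypothetical and is not taken from this commit; accepting a full `https://github.com/...` URL as well is an assumption.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_github_url(github_url: str) -> Tuple[str, str]:
    """Hypothetical helper: return (owner, repo) for a --github-url value."""
    parsed = urlparse(github_url)
    # Assumption: only a bare "owner/repo" or an https://github.com URL is valid.
    if parsed.scheme and parsed.scheme != "https":
        raise ValueError(f"unsupported scheme in {github_url!r}")
    if parsed.netloc and parsed.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {github_url!r}")
    # Drop empty fragments produced by leading or trailing slashes.
    fragments = [part for part in parsed.path.split("/") if part]
    if len(fragments) != 2:
        raise ValueError(f"expected owner/repo, got {github_url!r}")
    owner, repo = fragments
    return owner, repo


if __name__ == "__main__":
    print(split_github_url("Unstructured-IO/unstructured"))
```

Validating the value up front would let a connector fail fast with a clear message instead of surfacing an API error later in the run.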

requirements/base.txt

Lines changed: 10 additions & 0 deletions
@@ -20,6 +20,10 @@ charset-normalizer==3.0.1
     # via requests
 click==8.1.3
     # via nltk
+colorama==0.4.6
+    # via
+    #   click
+    #   tqdm
 deprecated==1.2.13
     # via argilla
 et-xmlfile==1.1.0
@@ -35,13 +39,17 @@ idna==3.4
     #   anyio
     #   requests
     #   rfc3986
+importlib-metadata==6.0.0
+    # via markdown
 joblib==1.2.0
     # via nltk
 lxml==4.9.2
     # via
     #   python-docx
     #   python-pptx
     #   unstructured (setup.py)
+markdown==3.4.1
+    # via unstructured (setup.py)
 monotonic==1.6
     # via argilla
 nltk==3.8.1
@@ -101,3 +109,5 @@ wrapt==1.14.1
     #   deprecated
 xlsxwriter==3.0.8
     # via python-pptx
+zipp==3.15.0
+    # via importlib-metadata

requirements/build.txt

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ certifi==2022.12.7
     #   requests
 charset-normalizer==3.0.1
     # via requests
+colorama==0.4.6
+    # via sphinx
 docutils==0.18.1
     # via
     #   sphinx

requirements/dev.txt

Lines changed: 5 additions & 10 deletions
@@ -6,10 +6,6 @@
 #
 anyio==3.6.2
     # via jupyter-server
-appnope==0.1.3
-    # via
-    #   ipykernel
-    #   ipython
 argon2-cffi==21.3.0
     # via
     #   jupyter-server
@@ -35,6 +31,11 @@ cffi==1.15.1
     # via argon2-cffi-bindings
 click==8.1.3
     # via pip-tools
+colorama==0.4.6
+    # via
+    #   build
+    #   click
+    #   ipython
 comm==0.1.2
     # via ipykernel
 debugpy==1.6.6
@@ -181,8 +182,6 @@ pandocfilters==1.5.0
     # via nbconvert
 parso==0.8.3
     # via jedi
-pexpect==4.8.0
-    # via ipython
 pickleshare==0.7.5
     # via ipython
 pip-tools==6.12.2
@@ -202,10 +201,6 @@ prompt-toolkit==3.0.37
     #   jupyter-console
 psutil==5.9.4
     # via ipykernel
-ptyprocess==0.7.0
-    # via
-    #   pexpect
-    #   terminado
 pure-eval==0.2.2
     # via stack-data
 pycparser==2.21

requirements/huggingface.txt

Lines changed: 10 additions & 0 deletions
@@ -22,6 +22,10 @@ click==8.1.3
     # via
     #   nltk
     #   sacremoses
+colorama==0.4.6
+    # via
+    #   click
+    #   tqdm
 deprecated==1.2.13
     # via argilla
 et-xmlfile==1.1.0
@@ -43,6 +47,8 @@ idna==3.4
     #   anyio
     #   requests
     #   rfc3986
+importlib-metadata==6.0.0
+    # via markdown
 joblib==1.2.0
     # via
     #   nltk
@@ -54,6 +60,8 @@ lxml==4.9.2
     #   python-docx
     #   python-pptx
     #   unstructured (setup.py)
+markdown==3.4.1
+    # via unstructured (setup.py)
 monotonic==1.6
     # via argilla
 nltk==3.8.1
@@ -146,3 +154,5 @@ wrapt==1.14.1
     #   deprecated
 xlsxwriter==3.0.8
     # via python-pptx
+zipp==3.15.0
+    # via importlib-metadata

requirements/ingest-github.txt

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+#
+# This file is autogenerated by pip-compile with Python 3.8
+# by the following command:
+#
+#    pip-compile --extra=github --output-file=requirements/ingest-github.txt requirements/base.txt setup.py
+#
+anyio==3.6.2
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+argilla==1.3.0
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+backoff==2.2.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+certifi==2022.12.7
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+    #   httpx
+    #   requests
+    #   unstructured (setup.py)
+cffi==1.15.1
+    # via pynacl
+charset-normalizer==3.0.1
+    # via
+    #   -r requirements/base.txt
+    #   requests
+click==8.1.3
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+colorama==0.4.6
+    # via
+    #   click
+    #   tqdm
+deprecated==1.2.13
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   pygithub
+et-xmlfile==1.1.0
+    # via
+    #   -r requirements/base.txt
+    #   openpyxl
+h11==0.14.0
+    # via
+    #   -r requirements/base.txt
+    #   httpcore
+httpcore==0.16.3
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+httpx==0.23.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+idna==3.4
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   requests
+    #   rfc3986
+joblib==1.2.0
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+lxml==4.9.2
+    # via
+    #   -r requirements/base.txt
+    #   python-docx
+    #   python-pptx
+    #   unstructured (setup.py)
+monotonic==1.6
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+nltk==3.8.1
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+numpy==1.23.5
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   pandas
+openpyxl==3.1.1
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+packaging==23.0
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+pandas==1.5.3
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   unstructured (setup.py)
+pillow==9.4.0
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx
+    #   unstructured (setup.py)
+pycparser==2.21
+    # via cffi
+pydantic==1.10.4
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+pygithub==1.57.0
+    # via unstructured (setup.py)
+pyjwt==2.6.0
+    # via pygithub
+pynacl==1.5.0
+    # via pygithub
+python-dateutil==2.8.2
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+python-docx==0.8.11
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-magic==0.4.27
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+python-pptx==0.6.21
+    # via
+    #   -r requirements/base.txt
+    #   unstructured (setup.py)
+pytz==2022.7.1
+    # via
+    #   -r requirements/base.txt
+    #   pandas
+regex==2022.10.31
+    # via
+    #   -r requirements/base.txt
+    #   nltk
+requests==2.28.2
+    # via
+    #   -r requirements/base.txt
+    #   pygithub
+    #   unstructured (setup.py)
+rfc3986[idna2008]==1.5.0
+    # via
+    #   -r requirements/base.txt
+    #   httpx
+six==1.16.0
+    # via
+    #   -r requirements/base.txt
+    #   python-dateutil
+sniffio==1.3.0
+    # via
+    #   -r requirements/base.txt
+    #   anyio
+    #   httpcore
+    #   httpx
+tqdm==4.64.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   nltk
+typing-extensions==4.4.0
+    # via
+    #   -r requirements/base.txt
+    #   pydantic
+urllib3==1.26.14
+    # via
+    #   -r requirements/base.txt
+    #   requests
+wrapt==1.14.1
+    # via
+    #   -r requirements/base.txt
+    #   argilla
+    #   deprecated
+xlsxwriter==3.0.8
+    # via
+    #   -r requirements/base.txt
+    #   python-pptx
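
Each pin in the autogenerated file above is followed by `# via` comments naming what pulled it in, and pins inherited from the base requirements carry `-r requirements/base.txt`. The sketch below is not part of the commit; it parses that layout to list the pins that do not come from base.txt. The names `parse_pins`, `extra_only`, and `SAMPLE` are illustrative, and the sample is a three-entry excerpt of the file.

```python
import re
from typing import Dict, List, Optional

# Matches pinned lines such as "pygithub==1.57.0" or "rfc3986[idna2008]==1.5.0".
PIN_RE = re.compile(r"^([A-Za-z0-9_.\-\[\]]+)==(\S+)$")

SAMPLE = """\
deprecated==1.2.13
    # via
    #   -r requirements/base.txt
    #   argilla
    #   pygithub
pygithub==1.57.0
    # via unstructured (setup.py)
pyjwt==2.6.0
    # via pygithub
"""


def parse_pins(text: str) -> Dict[str, List[str]]:
    """Map each 'name==version' pin to the sources listed in its '# via' comments."""
    pins: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if PIN_RE.match(line):
            current = line
            pins[current] = []
        elif current is not None and line.startswith("#"):
            source = line.lstrip("#").strip()
            if source == "via":
                continue  # multi-line form: the sources follow on later lines
            if source.startswith("via "):
                source = source[4:].strip()  # single-line form: "# via pkg"
            if source:
                pins[current].append(source)
    return pins


def extra_only(pins: Dict[str, List[str]]) -> List[str]:
    """Return pins that are not marked as inherited from requirements/base.txt."""
    return sorted(p for p, src in pins.items() if "-r requirements/base.txt" not in src)


if __name__ == "__main__":
    print(extra_only(parse_pins(SAMPLE)))
```

Header comments before the first pin are ignored because no pin is active yet, so the function can be fed the whole file as well as an excerpt.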
