Merge pull request #218 from pyexcel/dev · pyexcel@f0e0613 · GitHub

Commit f0e0613

Merge pull request pyexcel#218 from pyexcel/dev

release 0.6.3

2 parents 12688ed + bd60888


76 files changed: +811 -315 lines

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 2 deletions
@@ -3,10 +3,10 @@ With your PR, here is a check list:
 - [ ] Has Test cases written
 - [ ] Has all code lines tested
 - [ ] Has `make format` been run?
-- [ ] Has `moban` been run?
+- [ ] Please update CHANGELOG.yml
 - [ ] Passes all Travis CI builds
 - [ ] Has fair amount of documentation if your change is complex
 - [ ] run 'make format' so as to confirm the pyexcel organisation's coding style
-- [ ] Please update CHANGELOG.rst
 - [ ] Please add yourself to CONTRIBUTORS.rst
 - [ ] Agree on NEW BSD License for your contribution
+- [ ] Has `moban` been run?

.isort.cfg

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+[settings]
+line_length=79
+known_first_party=
+known_third_party=mock,nose,pyexcel_io
+indent=' '
+multi_line_output=3
+length_sort=1
+default_section=FIRSTPARTY
+no_lines_before=LOCALFOLDER
+sections=FUTURE,STDLIB,FIRSTPARTY,THIRDPARTY,LOCALFOLDER

.moban.d/CONTRIBUTORS.rst.jj2

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+Contributors
+================================================================================
+
+In alphabetical order:
+
+{% for contributor in contributors|sort(attribute='name') %}
+* `{{ contributor.name }} <{{ contributor.url }}>`_
+{% endfor %}
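The template above renders contributors sorted by their `name` attribute via Jinja2's `sort` filter. As a rough pure-Python sketch of the same rendering logic (the contributor data below is made up for illustration; it is not the real contributor list):

```python
# Mirror of the Jinja2 template: sort contributors by name, then emit
# reStructuredText bullet items using the `text <url>`_ link syntax.
contributors = [
    {"name": "Zed", "url": "https://example.com/zed"},   # illustrative data
    {"name": "Ann", "url": "https://example.com/ann"},
]

def render(contributors):
    lines = ["Contributors", "=" * 80, "", "In alphabetical order:", ""]
    # sorted(key=...) is the Python equivalent of |sort(attribute='name')
    for c in sorted(contributors, key=lambda c: c["name"]):
        lines.append("* `%s <%s>`_" % (c["name"], c["url"]))
    return "\n".join(lines)

print(render(contributors))
```

Note how the `moban` workflow the checklist mentions is just this: a data file plus a Jinja2 template producing CONTRIBUTORS.rst.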

.moban.d/custom_setup.py.jj2

Lines changed: 4 additions & 0 deletions
@@ -13,3 +13,7 @@
 {% block additional_classifiers %}
     'Development Status :: 3 - Alpha',
 {% endblock %}}
+
+{%- block morefiles %}
+"CONTRIBUTORS.rst",
+{%- endblock %}

.moban.d/docs/source/bigdata.rst.jj2

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+================================================================================
+Read partial data
+================================================================================
+
+
+{% include "partial-data.rst.jj2" %}

.moban.d/docs/source/pyexcel-index.rst.jj2

Lines changed: 2 additions & 2 deletions
@@ -263,7 +263,7 @@ Design
    architecture

 New tutorial
----------
+--------------
 .. toctree::

    quickstart
@@ -276,7 +276,7 @@ New tutorial
    database

 Old tutorial
----------
+--------------
 .. toctree::

    tutorial_file

.moban.d/one-liners.rst.jj2

Lines changed: 3 additions & 9 deletions
@@ -65,11 +65,8 @@ And let's check what do we have:

 .. code-block:: python

-    >>> for record in records:
-    ...     print("%s of %s has %s mg" % (
-    ...         record['Serving Size'],
-    ...         record['Coffees'],
-    ...         record['Caffeine (mg)']))
+    >>> for r in records:
+    ...     print(f"{r['Serving Size']} of {r['Coffees']} has {r['Caffeine (mg)']} mg")
     venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
     large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
     grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
@@ -84,10 +81,7 @@ Instead, what if you have to use `pyexcel.get_array` to do the same:
 .. code-block:: python

     >>> for row in p.get_array(file_name="your_file.xls", start_row=1):
-    ...     print("%s of %s has %s mg" % (
-    ...         row[1],
-    ...         row[0],
-    ...         row[2]))
+    ...     print(f"{row[1]} of {row[0]} has {row[2]} mg")
     venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
     large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
     grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
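The hunk above swaps old-style `%` formatting for f-strings. The two forms produce identical output; a self-contained check using one made-up record in the same shape as the documentation's coffee data:

```python
# One illustrative record shaped like the docs' `records`; the real data
# comes from an excel file that is not available here.
record = {"Serving Size": "venti(20 oz)",
          "Coffees": "Starbucks Coffee Blonde Roast",
          "Caffeine (mg)": 475}

# Old style, as removed by the diff.
old = "%s of %s has %s mg" % (
    record["Serving Size"], record["Coffees"], record["Caffeine (mg)"])

# New style, as added by the diff.
new = (f"{record['Serving Size']} of {record['Coffees']} "
       f"has {record['Caffeine (mg)']} mg")

assert old == new
print(new)  # → venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
```

The change is purely cosmetic, which is why the doctest's expected output lines are untouched.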

.moban.d/partial-data.rst.jj2

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
+When you are dealing with a huge amount of data, e.g. 64GB, you obviously would
+not like to fill up your memory with all of it. What you may want to do is:
+start at the Nth line, take M records and stop, using your memory only for
+those M records, not for the beginning nor for the tail of the file.
+
+Hence the partial read feature was developed: it reads only a slice of the
+data into memory for processing.
+
+You can paginate by row, by column or by both, so you dictate what portion of
+the data to read back. But remember that only the row-limit features help you
+save memory. If you use this feature to start from the Nth column, take M
+columns and skip the rest, you are not going to reduce your memory footprint.
+
+Why did I not see the above benefit?
+--------------------------------------------------------------------------------
+
+This feature depends heavily on the implementation details.
+
+`pyexcel-xls`_ (xlrd), `pyexcel-xlsx`_ (openpyxl), `pyexcel-ods`_ (odfpy) and
+`pyexcel-ods3`_ (pyexcel-ezodf) will read all data into memory. Because xls,
+xlsx and ods files are effectively zipped folders, all four will unzip the
+folder and read the xml content in **full**, so as to make sense of all the
+details.
+
+Hence, while the partial data is being assembled, the memory consumption won't
+differ from reading the whole data back. Only after the partial data has been
+returned does the memory consumption curve fall off the cliff. So the
+pagination code here only limits the data returned to your program.
+
+With that said, `pyexcel-xlsxr`_, `pyexcel-odsr`_ and `pyexcel-htmlr`_ DO read
+partial data into memory. Those three are implemented in such a way that they
+consume the xml (html) only when needed: once they have read the designated
+portion of the data, they stop, even if they are half way through.
+
+In addition, pyexcel's csv readers can read partial data into memory too.
+
+{% if sphinx %}
+
+.. testcode::
+    :hide:
+
+    >>> import sys
+    >>> if sys.version_info[0] < 3:
+    ...     from StringIO import StringIO
+    ... else:
+    ...     from io import StringIO
+    >>> from pyexcel_io._compact import OrderedDict
+
+{% endif %}
+
+Let's assume the following file is a huge csv file:
+
+.. code-block:: python
+
+    >>> import datetime
+    >>> import pyexcel as pe
+    >>> data = [
+    ...     [1, 21, 31],
+    ...     [2, 22, 32],
+    ...     [3, 23, 33],
+    ...     [4, 24, 34],
+    ...     [5, 25, 35],
+    ...     [6, 26, 36]
+    ... ]
+    >>> pe.save_as(array=data, dest_file_name="your_file.csv")
+
+
+And let's pretend to read partial data:
+
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv", start_row=2, row_limit=3)
+    your_file.csv:
+    +---+----+----+
+    | 3 | 23 | 33 |
+    +---+----+----+
+    | 4 | 24 | 34 |
+    +---+----+----+
+    | 5 | 25 | 35 |
+    +---+----+----+
+
+And you could as well do the same for columns:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv", start_column=1, column_limit=2)
+    your_file.csv:
+    +----+----+
+    | 21 | 31 |
+    +----+----+
+    | 22 | 32 |
+    +----+----+
+    | 23 | 33 |
+    +----+----+
+    | 24 | 34 |
+    +----+----+
+    | 25 | 35 |
+    +----+----+
+    | 26 | 36 |
+    +----+----+
+
+Obviously, you could do both at the same time:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv",
+    ...              start_row=2, row_limit=3,
+    ...              start_column=1, column_limit=2)
+    your_file.csv:
+    +----+----+
+    | 23 | 33 |
+    +----+----+
+    | 24 | 34 |
+    +----+----+
+    | 25 | 35 |
+    +----+----+
+
+
+The pagination support is available across all pyexcel plugins.
+
+.. note::
+
+    No column pagination support for query sets as data source.
+
+
+Formatting while transcoding a big data file
+--------------------------------------------------------------------------------
+
+If you are transcoding a big data set, the conventional formatting method would
+not help unless a large amount of free RAM is available on demand. However,
+there is a way to minimize the memory footprint of pyexcel while the
+formatting is performed.
+
+Let's continue from the previous example. Suppose we want to transcode
+"your_file.csv" to "your_file.xlsx" but increase each element by 1.
+
+What we can do is to define a row renderer function as the following:
+
+.. code-block:: python
+
+    >>> def increment_by_one(row):
+    ...     for element in row:
+    ...         yield element + 1
+
+Then pass it on to the save_as function using row_renderer:
+
+.. code-block:: python
+
+    >>> pe.isave_as(file_name="your_file.csv",
+    ...             row_renderer=increment_by_one,
+    ...             dest_file_name="your_file.xlsx")
+
+
+.. note::
+
+    If the data content is from a generator, isave_as has to be used.
+
+We can verify if it was done correctly:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.xlsx")
+    your_file.csv:
+    +---+----+----+
+    | 2 | 22 | 32 |
+    +---+----+----+
+    | 3 | 23 | 33 |
+    +---+----+----+
+    | 4 | 24 | 34 |
+    +---+----+----+
+    | 5 | 25 | 35 |
+    +---+----+----+
+    | 6 | 26 | 36 |
+    +---+----+----+
+    | 7 | 27 | 37 |
+    +---+----+----+
+
+{% if sphinx %}
+
+.. testcode::
+    :hide:
+
+    >>> import os
+    >>> os.unlink("your_file.csv")
+    >>> os.unlink("your_file.xlsx")
+
+{% endif %}
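The `increment_by_one` row renderer added above is an ordinary Python generator that pyexcel applies to each row during `isave_as`. Independent of pyexcel, the lazy per-row transformation pattern can be sketched in pure Python (the `transcode` driver below is a stand-in for what the library does internally, not its actual code):

```python
def increment_by_one(row):
    # Yield a transformed copy of each element; nothing is materialised
    # until the consumer iterates, which keeps memory use per-row.
    for element in row:
        yield element + 1

def transcode(rows, row_renderer):
    # Stand-in for isave_as with row_renderer: apply the renderer lazily
    # to every row as it streams through, never holding the whole table.
    for row in rows:
        yield list(row_renderer(row))

data = [[1, 21, 31], [2, 22, 32]]
result = list(transcode(iter(data), increment_by_one))
print(result)  # → [[2, 22, 32], [3, 23, 33]]
```

Because each row is consumed and discarded as it passes through, peak memory stays at one row regardless of file size, which is exactly the benefit the section claims for the streaming savers.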

.moban.d/pyexcel-README.rst.jj2

Lines changed: 10 additions & 3 deletions
@@ -15,22 +15,29 @@ Feature Highlights
    * SQLAlchemy table
    * Django Model
    * Python data structures: dictionary, records and array
+
 2. One API to read and write data in various excel file formats.
 3. For large data sets, data streaming is supported. A generator can be returned to you. Check out iget_records, iget_array, isave_as and isave_book_as.

 {% endblock %}

 {%block usage%}

-{%include "one-liners.rst.jj2"%}
+{%include "one-liners.rst.jj2" %}
+
+Hidden feature: partial read
+===============================================
+
+Most pyexcel users do not know it, but users of other libraries were requesting `similar features <https://github.com/jazzband/tablib/issues/467>`_

-{%include "two-liners.rst.jj2"%}
+{%include "partial-data.rst.jj2" %}

+{%include "two-liners.rst.jj2" %}

 Available Plugins
 =================

-{% include "plugins-list.rst.jj2"%}
+{% include "plugins-list.rst.jj2" %}


 Acknowledgement

.moban.d/two-liners.rst.jj2

Lines changed: 7 additions & 11 deletions
@@ -4,6 +4,8 @@ Stream APIs for big file : A set of two liners
 This section shows you how to get data from your **BIG** excel files and how to
 export data to excel files in **two lines** at most.

+Please use the dedicated readers to gain the extra memory savings.
+

 Two liners for get data from big excel files
 --------------------------------------------------------------------------------
@@ -66,17 +68,14 @@ And let's check what do we have:

 .. code-block:: python

-    >>> for record in records:
-    ...     print("%s of %s has %s mg" % (
-    ...         record['Serving Size'],
-    ...         record['Coffees'],
-    ...         record['Caffeine (mg)']))
+    >>> for r in records:
+    ...     print(f"{r['Serving Size']} of {r['Coffees']} has {r['Caffeine (mg)']} mg")
     venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
     large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
     grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
     regular(16 oz.) of Panera Coffee Light Roast has 300 mg

-Please do not forget the second line:
+Please do not forget the second line to close the opened file handle:

 .. code-block:: python

@@ -90,10 +89,7 @@ Instead, what if you have to use `pyexcel.get_array` to do the same:
 .. code-block:: python

     >>> for row in p.iget_array(file_name="your_file.xls", start_row=1):
-    ...     print("%s of %s has %s mg" % (
-    ...         row[1],
-    ...         row[0],
-    ...         row[2]))
+    ...     print(f"{row[1]} of {row[0]} has {row[2]} mg")
     venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
     large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
     grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
@@ -326,7 +322,7 @@ Again it is really simple. Let's verify what we have gotten:
     | Smith | 4.2    | 12/11/14 |
     +-------+--------+----------+

-.. NOTE::
+.. note::

     Please note that a csv (comma separated values) file is a pure text file. Formulas, charts, images and formatting in an xls file will disappear no matter which transcoding tool you use. Hence, pyexcel is a quick alternative for this transcoding job.
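The "second line" the hunk above insists on (pyexcel's `free_resources()`) exists because the streaming getters return a lazy generator over a still-open file. The same discipline can be sketched in pure Python with a `try`/`finally` that releases the handle even when iteration stops early (the reader below is illustrative, not pyexcel's implementation):

```python
import os
import tempfile

def iget_rows(path):
    # A lazy csv-ish reader in the spirit of iget_array/iget_records:
    # the file stays open for as long as the caller keeps iterating.
    f = open(path)
    try:
        for line in f:
            yield line.rstrip("\n").split(",")
    finally:
        # Runs when iteration finishes OR when the generator is closed
        # early -- the moral equivalent of pyexcel's free_resources().
        f.close()

# Build a small throwaway file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,21,31\n2,22,32\n")

rows = iget_rows(tmp.name)
first = next(rows)   # iterate lazily...
rows.close()         # ...then the second line: release the handle
os.unlink(tmp.name)
print(first)  # → ['1', '21', '31']
```

Forgetting to close the generator leaks an open file descriptor for as long as the generator object lives, which is why the docs call the close-out line out explicitly.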
