@@ -525,6 +525,170 @@ Suppose you just want to extract one sheet from many sheets that exists in a wor
 for the output file, you can specify any of the supported formats


+
+Hidden feature: partial read
+===============================================
+
+Most pyexcel users do not know about it, but users of other libraries have been requesting `similar features <https://github.com/jazzband/tablib/issues/467>`_.
+
+
+When you are dealing with a huge amount of data, e.g. 64GB, you obviously would not
+like to fill up your memory with all of it. What you may want to do instead is start
+reading at the Nth row, take M records and stop, using your memory only for
+those M records, not for the beginning part nor for the tail part.
+
+Hence the partial read feature was developed: it reads only part of the data into memory
+for processing. You can paginate by row, by column, or by both, so you dictate what portion
+of the data to read back. But remember that only the row limit helps you save memory. If
+you use this feature to start at the Nth column, take M columns and skip
+the rest, you are not going to reduce your memory footprint.
+
+Why might you not see the above benefit?
+
+This feature depends heavily on the implementation details of the reader plugins.
+
+`pyexcel-xls`_ (xlrd), `pyexcel-xlsx`_ (openpyxl), `pyexcel-ods`_ (odfpy) and `pyexcel-ods3`_ (pyexcel-ezodf)
+will read all data into memory. Because an xlsx or ods file is effectively a zipped folder of xml
+(and an xls file is a single binary container), all four read the content in **full**
+in order to make sense of all the details.
+
+Hence, while the partial data is being assembled, the memory
+consumption won't differ from reading the whole data set back. Only after the partial
+data has been returned does the memory consumption curve fall off the cliff. So the pagination
+code here only limits the data returned to your program.
+
+With that said, `pyexcel-xlsxr`_, `pyexcel-odsr`_ and `pyexcel-htmlr`_ DO read partial data into memory.
+Those three are implemented in such a way that they consume the xml (or html) only when needed. Once they
+have read the designated portion of the data, they stop, even if they are only half way through the file.
+
+In addition, pyexcel's csv readers can read partial data into memory too.
+
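+To make the memory argument concrete, the sketch below shows what row pagination
+amounts to in plain Python, outside pyexcel, and why a column window cannot save
+memory in the same way. The file name and window sizes are illustrative only and
+are not used in the examples that follow.
+
+.. code-block:: python
+
+    import csv
+    from itertools import islice
+
+    # row pagination: skip the first 2 rows, take 3, stop reading afterwards;
+    # rows outside the window never accumulate in memory
+    with open("huge.csv", newline="") as handle:
+        window = list(islice(csv.reader(handle), 2, 2 + 3))
+
+    # column pagination: every row still has to be visited; the unwanted
+    # cells are only discarded after they have been read
+    with open("huge.csv", newline="") as handle:
+        columns = [row[1:1 + 2] for row in csv.reader(handle)]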
+
+
+Let's assume the following file is a huge csv file:
+
+.. code-block:: python
+
+    >>> import datetime
+    >>> import pyexcel as pe
+    >>> data = [
+    ...     [1, 21, 31],
+    ...     [2, 22, 32],
+    ...     [3, 23, 33],
+    ...     [4, 24, 34],
+    ...     [5, 25, 35],
+    ...     [6, 26, 36]
+    ... ]
+    >>> pe.save_as(array=data, dest_file_name="your_file.csv")
+
+
+Now let's read part of the data back:
+
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv", start_row=2, row_limit=3)
+    your_file.csv:
+    +---+----+----+
+    | 3 | 23 | 33 |
+    +---+----+----+
+    | 4 | 24 | 34 |
+    +---+----+----+
+    | 5 | 25 | 35 |
+    +---+----+----+
+
+And you can do the same for columns:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv", start_column=1, column_limit=2)
+    your_file.csv:
+    +----+----+
+    | 21 | 31 |
+    +----+----+
+    | 22 | 32 |
+    +----+----+
+    | 23 | 33 |
+    +----+----+
+    | 24 | 34 |
+    +----+----+
+    | 25 | 35 |
+    +----+----+
+    | 26 | 36 |
+    +----+----+
+
+Obviously, you can do both at the same time:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.csv",
+    ...              start_row=2, row_limit=3,
+    ...              start_column=1, column_limit=2)
+    your_file.csv:
+    +----+----+
+    | 23 | 33 |
+    +----+----+
+    | 24 | 34 |
+    +----+----+
+    | 25 | 35 |
+    +----+----+
+
+
+The pagination support is available across all pyexcel plugins.
+
+.. note::
+
+   Column pagination is not supported when a query set is the data source.
+
+
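+If you want the memory saving as well as the smaller result, reach for the streaming
+variants of the above calls. A minimal sketch, assuming that ``iget_array`` honours
+the same ``start_row``/``row_limit`` keywords as ``get_sheet`` (check the plugin you
+have installed if in doubt):
+
+.. code-block:: python
+
+    import pyexcel as pe
+
+    # returns a generator of rows instead of a fully built sheet
+    rows = pe.iget_array(file_name="your_file.csv", start_row=2, row_limit=3)
+    for row in rows:
+        print(row)
+    pe.free_resources()  # the i-functions keep the file handle open until freed
+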
+Formatting while transcoding a big data file
+--------------------------------------------------------------------------------
+
+If you are transcoding a big data set, the conventional formatting methods would not
+help unless plenty of free RAM is available on demand. However, there is a way to minimize
+the memory footprint of pyexcel while the formatting is performed.
+
+Let's continue from the previous example. Suppose we want to transcode "your_file.csv"
+to "your_file.xlsx" but increase each element by 1.
+
+What we can do is define a row renderer function like the following:
+
+>>> def increment_by_one(row):
+...     for element in row:
+...         yield element + 1
+
+Then pass it to the isave_as function via the row_renderer parameter:
+
+>>> pe.isave_as(file_name="your_file.csv",
+...             row_renderer=increment_by_one,
+...             dest_file_name="your_file.xlsx")
+
+
+.. note::
+
+   If the data content comes from a generator, isave_as has to be used.
+
+We can verify that it was done correctly:
+
+.. code-block:: python
+
+    >>> pe.get_sheet(file_name="your_file.xlsx")
+    your_file.csv:
+    +---+----+----+
+    | 2 | 22 | 32 |
+    +---+----+----+
+    | 3 | 23 | 33 |
+    +---+----+----+
+    | 4 | 24 | 34 |
+    +---+----+----+
+    | 5 | 25 | 35 |
+    +---+----+----+
+    | 6 | 26 | 36 |
+    +---+----+----+
+    | 7 | 27 | 37 |
+    +---+----+----+
+
+
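+The row renderer is not limited to arithmetic: any callable that accepts a row and
+yields the transformed cells can be plugged in the same way. A small additional
+sketch; the renderer and the destination file name are made up for illustration and
+are not used elsewhere in this documentation:
+
+.. code-block:: python
+
+    def stringify(row):
+        # yield the transformed cells one by one, just like increment_by_one
+        for element in row:
+            yield str(element)
+
+    pe.isave_as(file_name="your_file.csv",
+                row_renderer=stringify,
+                dest_file_name="string_copy.csv")
+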
 Stream APIs for big file: A set of two liners
 ================================================================================

@@ -829,170 +993,6 @@ Again let's verify what we have gotten:
     +-------+--------+----------+


-Hidden feature: partial read
-===============================================
-
-Most pyexcel users do not know, but other library users were requesting `the similar features <https://github.com/jazzband/tablib/issues/467>`_
-
-
-When you are dealing with huge amount of data, e.g. 64GB, obviously you would not
-like to fill up your memory with those data. What you may want to do is, record
-data from Nth line, take M records and stop. And you only want to use your memory
-for the M records, not for beginning part nor for the tail part.
-
-Hence partial read feature is developed to read partial data into memory for processing.
-You can paginate by row, by column and by both, hence you dictate what portion of the
-data to read back. But remember only row limit features help you save memory. Let's
-you use this feature to record data from Nth column, take M number of columns and skip
-the rest. You are not going to reduce your memory footprint.
-
-Why did not I see above benefit?
-
-This feature depends heavily on the implementation details.
-
-`pyexcel-xls(xlrd)`, `pyexcel-xlsx(openpyxl)`, `pyexcel-ods(odfpy)` and `pyexcel-ods3(pyexcel-ezodf)`
-will read all data into memory. Because xls, xlsx and ods file are effective a zipped folder,
-all four will unzip the folder and read the content in xml format in **full**, so as to make sense
-of all details.
-
-Hence, during the partial data is been returned, the memory
-consumption won't differ from reading the whole data back. Only after the partial
-data is returned, the memory comsumption curve shall jump the cliff. So pagination
-code here only limits the data returned to your program.
-
-With that said, `pyexcel-xlsxr`, `pyexcel-odsr` and `pyexcel-htmlr` DOES read partial data into memory.
-Those three are implemented in such a way that they consume the xml(html) when needed. When they
-have read designated portion of the data, they stop, even if they are half way through.
-
-In addition, pyexcel's csv readers can read partial data into memory too.
-
-
-
-Let's assume the following file is a huge csv file:
-
-.. code-block:: python
-
-    >>> import datetime
-    >>> import pyexcel as pe
-    >>> data = [
-    ...     [1, 21, 31],
-    ...     [2, 22, 32],
-    ...     [3, 23, 33],
-    ...     [4, 24, 34],
-    ...     [5, 25, 35],
-    ...     [6, 26, 36]
-    ... ]
-    >>> pe.save_as(array=data, dest_file_name="your_file.csv")
-
-
-And let's pretend to read partial data:
-
-
-.. code-block:: python
-
-    >>> pe.get_sheet(file_name="your_file.csv", start_row=2, row_limit=3)
-    your_file.csv:
-    +---+----+----+
-    | 3 | 23 | 33 |
-    +---+----+----+
-    | 4 | 24 | 34 |
-    +---+----+----+
-    | 5 | 25 | 35 |
-    +---+----+----+
-
-And you could as well do the same for columns:
-
-.. code-block:: python
-
-    >>> pe.get_sheet(file_name="your_file.csv", start_column=1, column_limit=2)
-    your_file.csv:
-    +----+----+
-    | 21 | 31 |
-    +----+----+
-    | 22 | 32 |
-    +----+----+
-    | 23 | 33 |
-    +----+----+
-    | 24 | 34 |
-    +----+----+
-    | 25 | 35 |
-    +----+----+
-    | 26 | 36 |
-    +----+----+
-
-Obvious, you could do both at the same time:
-
-.. code-block:: python
-
-    >>> pe.get_sheet(file_name="your_file.csv",
-    ...              start_row=2, row_limit=3,
-    ...              start_column=1, column_limit=2)
-    your_file.csv:
-    +----+----+
-    | 23 | 33 |
-    +----+----+
-    | 24 | 34 |
-    +----+----+
-    | 25 | 35 |
-    +----+----+
-
-
-The pagination support is available across all pyexcel plugins.
-
-.. note::
-
-   No column pagination support for query sets as data source.
-
-
-Formatting while transcoding a big data file
---------------------------------------------------------------------------------
-
-If you are transcoding a big data set, conventional formatting method would not
-help unless a on-demand free RAM is available. However, there is a way to minimize
-the memory footprint of pyexcel while the formatting is performed.
-
-Let's continue from previous example. Suppose we want to transcode "your_file.csv"
-to "your_file.xls" but increase each element by 1.
-
-What we can do is to define a row renderer function as the following:
-
->>> def increment_by_one(row):
-...     for element in row:
-...         yield element + 1
-
-Then pass it onto save_as function using row_renderer:
-
->>> pe.isave_as(file_name="your_file.csv",
-...             row_renderer=increment_by_one,
-...             dest_file_name="your_file.xlsx")
-
-
-.. note::
-
-   If the data content is from a generator, isave_as has to be used.
-
-We can verify if it was done correctly:
-
-.. code-block:: python
-
-    >>> pe.get_sheet(file_name="your_file.xlsx")
-    your_file.csv:
-    +---+----+----+
-    | 2 | 22 | 32 |
-    +---+----+----+
-    | 3 | 23 | 33 |
-    +---+----+----+
-    | 4 | 24 | 34 |
-    +---+----+----+
-    | 5 | 25 | 35 |
-    +---+----+----+
-    | 6 | 26 | 36 |
-    +---+----+----+
-    | 7 | 27 | 37 |
-    +---+----+----+
-
-
-
 Available Plugins
 =================
