Reading CSV files #14
Comments
I would really like to see something like this: while the csv and json importers are useful, they are not generic enough to import arbitrary data: I was playing around recently and wanted to do a bulk import of data, roughly 2GB. It would be nice to be able to load and process such a file immediately from within arango, but the File.read function in the It would be great to have a more versatile |
I pre-process my raw data to csv and then use the importer. Works fine. ;) What would the benefit be from moving this to ArangoDb? |
Hi, would like to chime in, but I am not sure what you mean by "What would the benefit be from moving this to ArangoDb?" 😄 Can you elaborate, on what you want to do, what you did, and what the last sentence means? 😃 |
Hi Frank, If I understood correctly, @a2800276 wants to be able to process his raw data and enter it into the db from within arangosh. I was wondering why the devs should invest time in this feature, since one can easily process raw data into csv/json (via any language, say PHP, Python, or Bash) and use the already working importer. |
Oh, yes, didn't notice the different users 😄 . Yes of course, I totally agree with @rotatingJazz on that. |
To me it seems very elaborate to preprocess data that may or may not be in a form suitable for CSV/JSON, transform it to a different format, and throw that against a --functionally restricted-- import script which then uses HTTP to import individual records into the database. Instead, I could be reading and transforming arbitrarily formatted files from within the DB and have a much more efficient workflow, both from the "programmer efficiency" point of view and in terms of performance. What I was trying to do concretely: re-implement in arango a toy project for playing with graph functionality that I already have working for neo4j. I'd like to import the Wikipedia inter-page links and play around with that dataset. The dump of that data is 4GB (in the form of mysql INSERT statements). If I can avoid it, I don't want to preprocess 4GB of data into 3GB of some other data that I can import when I could import directly in ~half the time. More generally: since arango wants to become a general purpose deployment platform with Foxx, it will certainly need some rudimentary file io implementation. As it's currently implemented, File.read is utterly useless apart from reading tiny toy files. |
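To make the "transform arbitrarily formatted files" idea concrete, here is a minimal sketch of pulling value tuples out of a single MySQL `INSERT` statement. It is plain JavaScript and does not use any ArangoDB API. `parseInsertValues` and `convert` are hypothetical names, the sample statement is made up, and the parser assumes single-quoted strings with backslash escapes; it keeps the escaped character literally instead of decoding sequences such as `\n`.

```javascript
// Sketch only: hypothetical helpers, not part of ArangoDB or any library.

// Turn an unquoted SQL literal into a JavaScript value.
function convert(text) {
  if (text === "NULL") return null;
  const n = Number(text);
  return text !== "" && !Number.isNaN(n) ? n : text;
}

// Extract the tuples that follow VALUES in one INSERT statement.
function parseInsertValues(statement) {
  const start = statement.indexOf("VALUES");
  if (start === -1) return [];
  const rows = [];
  let row = null;        // current tuple, or null when outside parentheses
  let field = "";        // text of the field being read
  let quoted = false;    // true while inside a '...' string
  let wasQuoted = false; // true if the current field was a quoted string
  for (let i = start + "VALUES".length; i < statement.length; i++) {
    const ch = statement[i];
    if (quoted) {
      if (ch === "\\") {
        // Simplification: keep the escaped character as-is.
        if (i + 1 < statement.length) field += statement[++i];
      } else if (ch === "'") {
        quoted = false;
      } else {
        field += ch;
      }
    } else if (ch === "'") {
      quoted = true;
      wasQuoted = true;
    } else if (ch === "(" && row === null) {
      row = [];
      field = "";
      wasQuoted = false;
    } else if ((ch === "," || ch === ")") && row !== null) {
      row.push(wasQuoted ? field : convert(field.trim()));
      field = "";
      wasQuoted = false;
      if (ch === ")") {
        rows.push(row);
        row = null;
      }
    } else if (row !== null) {
      field += ch;
    }
  }
  return rows;
}

// Made-up sample resembling a dump line.
const sample = "INSERT INTO `links` VALUES (1,0,'Main_Page'),(2,NULL,'It\\'s, (odd)');";
console.log(parseInsertValues(sample));
```

Each returned tuple could then be mapped to a document or edge before saving, which is the step a fixed CSV/JSON importer cannot customize.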
It might be interesting to have some reference data that one could try to
|
We'll eventually have an implementation of Buffer, which will allow us to read binary files and process them in chunks from JavaScript. Until that's available, I think there are two alternatives available at least for processing CSV and JSON files. Example invocation for CSV files:
And for p
```
var internal = require("internal");
```
|
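As an illustration of what chunked, Buffer-based processing can look like, here is a minimal sketch that runs under Node.js rather than arangosh. It reads a file in fixed-size chunks and hands complete lines to a callback, so the whole file never has to fit in memory. `processLines` and its parameters are hypothetical names, not ArangoDB's API, and the sketch assumes UTF-8 input with `\n` line endings.

```javascript
// Sketch only: Node.js, not arangosh. processLines is a hypothetical helper.
const fs = require("fs");
const { StringDecoder } = require("string_decoder");

function processLines(path, onLine, chunkSize = 64 * 1024) {
  const fd = fs.openSync(path, "r");
  const buffer = Buffer.alloc(chunkSize);
  // StringDecoder holds back multi-byte characters split across chunks.
  const decoder = new StringDecoder("utf8");
  let pending = ""; // partial line carried over between chunks
  let lineNo = 0;
  try {
    let bytesRead;
    // position = null reads from the current file position.
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      pending += decoder.write(buffer.subarray(0, bytesRead));
      const lines = pending.split("\n");
      pending = lines.pop(); // last element is an incomplete line (or "")
      for (const line of lines) {
        onLine(line, lineNo++);
      }
    }
    pending += decoder.end();
    if (pending.length > 0) {
      onLine(pending, lineNo++); // final line without trailing newline
    }
  } finally {
    fs.closeSync(fd);
  }
  return lineNo; // number of lines delivered to the callback
}
```

A caller would pass a callback such as `(line, index) => { /* split fields, save a document */ }`; memory use stays bounded by the chunk size plus the longest line.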
Closed because processCsvFile and processJsonFile are doing what I intended. |
TODO:
- Tests are not yet fixed
- ASAN failure

[ RUN ] IResearchAnalyzerFeatureTest.test_emplace_valid
AddressSanitizer:DEADLYSIGNAL
=================================================================
==247960==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x618333055636 bp 0x7ffee439d260 sp 0x7ffee439d250 T0)
==247960==The signal is caused by a READ memory access.
==247960==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
    #0 0x618333055636 in std::__shared_ptr<arangodb::async_registry::ThreadRegistry, (__gnu_cxx::_Lock_policy)2>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/shared_ptr_base.h:1667
    #1 0x6183330555f9 in std::__shared_ptr_access<arangodb::async_registry::ThreadRegistry, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/shared_ptr_base.h:1364
    #2 0x618333053636 in std::__shared_ptr_access<arangodb::async_registry::ThreadRegistry, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() const /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/shared_ptr_base.h:1358
    #3 0x6183483a2cd4 in arangodb::async_registry::PromiseInList::mark_for_deletion() /home/jvolmer/code/arangodb/lib/Async/Registry/promise.cpp:17
    #4 0x61832bae6ddd in arangodb::futures::detail::SharedState<arangodb::Result>::detachOne() /home/jvolmer/code/arangodb/lib/Futures/include/Futures/SharedState.h:276
    #5 0x61832bae66b6 in arangodb::futures::detail::SharedState<arangodb::Result>::detachFuture() /home/jvolmer/code/arangodb/lib/Futures/include/Futures/SharedState.h:226
    #6 0x61832bae6519 in arangodb::futures::Future<arangodb::Result>::detach() /home/jvolmer/code/arangodb/lib/Futures/include/Futures/Future.h:531
    #7 0x61832ba8db76 in arangodb::futures::Future<arangodb::Result>::~Future() /home/jvolmer/code/arangodb/lib/Futures/include/Futures/Future.h:218
    #8 0x6183337bbb44 in arangodb::futures::FutureAwaitable<arangodb::Result>::~FutureAwaitable() /home/jvolmer/code/arangodb/lib/Futures/include/Futures/coro-helper.h:63
    #9 0x618339fdbef9 in arangodb::transaction::Methods::commitInternal(arangodb::transaction::MethodsApi) /home/jvolmer/code/arangodb/arangod/Transaction/Methods.cpp:3787
    #10 0x618339fdc8f4 in arangodb::transaction::Methods::commitAsync() /home/jvolmer/code/arangodb/arangod/Transaction/Methods.cpp:1925
    #11 0x61833e23903d in arangodb::aql::Query::cleanupTrxAndEngines() /home/jvolmer/code/arangodb/arangod/Aql/Query.cpp:1972
    #12 0x61833e20838f in arangodb::aql::Query::cleanupPlanAndEngine(bool) /home/jvolmer/code/arangodb/arangod/Aql/Query.cpp:1642
    #13 0x61833e21ccc8 in arangodb::aql::Query::finalize(arangodb::velocypack::Builder&) /home/jvolmer/code/arangodb/arangod/Aql/Query.cpp:985
    #14 0x61833e218ddc in arangodb::aql::Query::execute(arangodb::aql::QueryResult&) /home/jvolmer/code/arangodb/arangod/Aql/Query.cpp:704
    #15 0x61833e21dfc0 in arangodb::aql::Query::executeSync() /home/jvolmer/code/arangodb/arangod/Aql/Query.cpp:756
    #16 0x61833c6843b2 in (anonymous namespace)::visitAnalyzers(TRI_vocbase_t&, std::function<arangodb::Result (arangodb::velocypack::Slice)> const&, arangodb::transaction::OperationOrigin) /home/jvolmer/code/arangodb/arangod/IResearch/IResearchAnalyzerFeature.cpp:631
    #17 0x61833c65a2a8 in arangodb::iresearch::IResearchAnalyzerFeature::loadAnalyzers(arangodb::transaction::OperationOrigin, std::basic_string_view<char, std::char_traits<char>>) /home/jvolmer/code/arangodb/arangod/IResearch/IResearchAnalyzerFeature.cpp:2282
    #18 0x61833c64f58c in arangodb::iresearch::IResearchAnalyzerFeature::emplace(std::pair<std::shared_ptr<arangodb::iresearch::AnalyzerPool>, bool>&, std::basic_string_view<char, std::char_traits<char>>, std::basic_string_view<char, std::char_traits<char>>, arangodb::velocypack::Slice, arangodb::transaction::OperationOrigin, arangodb::iresearch::Features) /home/jvolmer/code/arangodb/arangod/IResearch/IResearchAnalyzerFeature.cpp:1389
    #19 0x61832b9065b2 in IResearchAnalyzerFeatureTest_test_emplace_valid_Test::TestBody() /home/jvolmer/code/arangodb/tests/IResearch/IResearchAnalyzerFeatureTest.cpp:601
    #20 0x61833bcd323f in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2599
    #21 0x61833bc7f59c in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2635
    #22 0x61833bc1e3fe in testing::Test::Run() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2674
    #23 0x61833bc20a1c in testing::TestInfo::Run() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2853
    #24 0x61833bc2282f in testing::TestSuite::Run() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:3012
    #25 0x61833bc566db in testing::internal::UnitTestImpl::RunAllTests() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:5870
    #26 0x61833bcd46cf in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2599
    #27 0x61833bc868ac in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:2635
    #28 0x61833bc553c0 in testing::UnitTest::Run() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/src/gtest.cc:5444
    #29 0x6183338f239b in RUN_ALL_TESTS() /home/jvolmer/code/arangodb/3rdParty/gtest/googletest/include/gtest/gtest.h:2293
    #30 0x6183338f18b7 in main::$_0::operator()(int, char**) const /home/jvolmer/code/arangodb/tests/main.cpp:150
    #31 0x6183338f1674 in TestThread<main::$_0>::run() /home/jvolmer/code/arangodb/tests/main.cpp:61
    #32 0x6183338f1050 in TestThread<main::$_0>::TestThread(arangodb::application_features::ApplicationServerT<arangodb::ArangodFeatures>&, main::$_0&&, int, char**) /home/jvolmer/code/arangodb/tests/main.cpp:48
    #33 0x6183338f0656 in main /home/jvolmer/code/arangodb/tests/main.cpp:151
    #34 0x789583e2a1c9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #35 0x789583e2a28a in __libc_start_main csu/../csu/libc-start.c:360:3
    #36 0x61832b54b024 in _start (/home/jvolmer/code/arangodb/build-presets/my-alubsan/bin/arangodbtests+0x3b9a8024) (BuildId: 4c994b7192b507c6ed2fbee420b30cfaa0527976)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/shared_ptr_base.h:1667 in std::__shared_ptr<arangodb::async_registry::ThreadRegistry, (__gnu_cxx::_Lock_policy)2>::get() const
==247960==ABORTING
The two methods processCsvFile and processJsonFile are very useful. Maybe we will also need something for writing later.
Perhaps it makes sense to put them into a module? In Node, a similar module is simply called fs, which is certainly not a bad name!