
Node: inserting large data sets into MongoDB with mongoose

Asked by: it1352

Question

I am trying to insert a large data set into MongoDB with mongoose. But before that I need to make sure my for loop is working correctly.

// basic schema settings
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var TempcolSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});
TempcolSchema.index({
    'loc': "2dsphere"
});


// we can easily see from the output that the for loop runs correctly
mongoose.connect('mongodb://localhost/mean-dev', function(err){
    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
    }
});

The output is 1, 2, 3, 4, ... etc.

Now I want to add a mongoose save statement into the for loop.

mongoose.connect('mongodb://localhost/mean-dev', function(err){
    var Tempcol = mongoose.model('Tempcol', TempcolSchema);

    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;
        new Tempcol({cid: Math.random(), loc: [lon, lat]}).save(function(err){});
    }
});

The output is still 1, 2, 3, 4, ..... However, after a while the for loop stops, saying the maximum call stack size is reached, and there is some kind of memory problem. Also, when I check the collection I realize no data points are being inserted at all.

So does anyone know what might be happening? Thanks.

Accepted answer

#1

The problem here is that the loop you are running does not wait for each operation to complete. So in fact you are just queuing up thousands of .save() requests and trying to run them all concurrently. You can't do that within reasonable limits, hence you get the error response.

The async module has various methods for iterating while processing a callback for each iteration; probably the simplest direct form is whilst. Mongoose also handles the connection management for you without needing to embed everything within the connect callback, as the models are connection-aware:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

var i = 0;
async.whilst(
    function() { return i < 10000000; },
    function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;
        new TempCol({cid: Math.random(), loc: [lon, lat]}).save(function(err){
            callback(err);
        });
    },
    function(err) {
       // When the loop is complete or on error
    }
);

Not the most fantastic way to do it, as it is still one write at a time, and you could use other methods to "govern" the concurrent operations, but at least this will not blow up the call stack.
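To illustrate what "governing" the concurrent operations means, here is a minimal sketch of a concurrency limiter in plain Node. This is a hypothetical helper, not part of mongoose or the async module (which offers the same idea as `async.eachLimit`); the `done` callbacks stand in for `.save()` completions and are invoked by hand so the whole demonstration stays synchronous and observable:

```javascript
'use strict';

// Hypothetical limiter: run callback-style tasks with at most `limit` in flight.
function makeLimiter(limit) {
  const queue = [];
  let inFlight = 0;
  let peak = 0;
  function next() {
    while (inFlight < limit && queue.length > 0) {
      const task = queue.shift();
      inFlight += 1;
      if (inFlight > peak) peak = inFlight;
      // Each task receives a `done` callback to signal completion.
      task(function done() {
        inFlight -= 1;
        next(); // a finished task frees a slot for the next queued one
      });
    }
  }
  return {
    push(task) { queue.push(task); next(); },
    stats() { return { inFlight: inFlight, peak: peak, queued: queue.length }; }
  };
}

// Demo: queue 10 simulated "saves" under a limit of 3; we capture each
// task's `done` so we can complete them manually.
const limiter = makeLimiter(3);
const pendingDones = [];
for (let i = 0; i < 10; i++) {
  limiter.push(function (done) { pendingDones.push(done); });
}
console.log(limiter.stats()); // { inFlight: 3, peak: 3, queued: 7 }
pendingDones.shift()();       // completing one save starts the next queued one
console.log(limiter.stats()); // { inFlight: 3, peak: 3, queued: 6 }
```

With a real mongoose model, each task would call `doc.save()` and invoke `done` in its callback; the point is simply that no more than `limit` saves are ever pending at once.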

From MongoDB 2.6 and greater you can make use of the Bulk Operations API in order to process more than one write at a time on the server. The process is similar, but this time you can send 1000 operations to the server in a single write and response, which is much faster:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

mongoose.connection.on("open", function(err,conn) {

    var i = 0;
    var bulk = TempCol.collection.initializeOrderedBulkOp();

    async.whilst(
      function() { return i < 10000000; },
      function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;

        bulk.insert({ "cid": Math.random(), "loc": [ lon, lat ] });

        if ( i % 1000 == 0 ) {
            bulk.execute(function(err,result) {
                bulk = TempCol.collection.initializeOrderedBulkOp();
                callback(err);
            });
        } else {
            process.nextTick(callback);
        }
      },
      function(err) {
        // When the loop is complete or on error

        // If you had a count not evenly divisible by 1000
        if ( i % 1000 != 0 )
            bulk.execute(function(err,result) {
                // possibly check for errors here
            });
      }
    );

});

这实际上是使用了尚未在mongoose中公开的本机驱动程序方法,因此需要额外的关注正在采取措施,以确保连接可用。这是一个例子,但不是唯一的方法,但重点是连接的猫鼬魔术不是在这里建立的,所以你应该确定它已经建立。

That is actually using the native driver methods which are not yet exposed in mongoose, so the additional care there is being taken to make sure the connection is available. That's an example but not the only way, but the main point is the mongoose "magic" for connections is not built in here so you should be sure it is established.

Here you have a total count that divides evenly into the batch size, but where that is not the case you should also call bulk.execute() in the final block, as shown, depending on the remainder from the modulo.
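The flush-plus-remainder bookkeeping can be seen on its own if the driver's bulk object is replaced by a hypothetical in-memory recorder. This sketch (names are illustrative, not part of any library) counts how many documents each `execute` call would carry, including the final partial batch:

```javascript
'use strict';

// Sketch of the "flush every 1000, then flush the remainder" logic from the
// answer above. Returns the size of each simulated bulk.execute() call.
function batchCounts(totalDocs, batchSize) {
  const flushes = [];
  let buffered = 0;
  for (let i = 1; i <= totalDocs; i++) {
    buffered += 1;
    if (i % batchSize === 0) {   // same test as `i % 1000 == 0` in the loop
      flushes.push(buffered);
      buffered = 0;
    }
  }
  if (buffered > 0) flushes.push(buffered); // the final, partial batch
  return flushes;
}

console.log(batchCounts(10500, 1000)); // ten full batches plus a 500-doc remainder
console.log(batchCounts(3000, 1000));  // three full batches, no remainder flush
```

This is why the completion callback in the answer re-checks `i % 1000 != 0`: without that final `bulk.execute()`, up to `batchSize - 1` documents would sit in the buffer and never reach the server.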

The main point is not to grow a stack of operations to an unreasonable size, and to keep the processing limited. The flow control here allows operations that take some time to actually complete before moving on to the next iteration. So either batched writes or some additional parallel queuing is what you want for best performance.

There is also the .initializeUnorderedBulkOp() form if you don't want write errors to be fatal but would rather handle them in a different way. Mostly, see the official documentation on the Bulk API and its responses for how to interpret the response given.

Reposted from 学新通技术网.