
Node: inserting large data sets into MongoDB with mongoose

Asked by: it1352

Question

I am trying to insert a large data set into MongoDB with mongoose. But before that I need to make sure my for loop is working correctly.

// basic schema settings
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var TempcolSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});
TempcolSchema.index({
    'loc': "2dsphere"
});


// we can easily see from the output that the for loop runs correctly
mongoose.connect('mongodb://localhost/mean-dev', function(err){
    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
    }
});

The output is 1, 2, 3, 4, ... etc.

Now I want to add a mongoose save statement into the for loop.

mongoose.connect('mongodb://localhost/mean-dev', function(err){
    var Tempcol = mongoose.model('Tempcol', TempcolSchema);

    for (var i = 0; i < 10000000; i++) {
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;
        new Tempcol({cid: Math.random(), loc: [lon, lat]}).save(function(err){});
    }
});

The output is still 1, 2, 3, 4, ..... However, after a while the for loop stops, saying the maximum call stack size is reached, and there is some kind of memory problem. Also, when I check the collection I realize no data points are being inserted at all.

So does anyone know what might be happening? Thanks.

Accepted answer

#1

The problem here is that the loop you are running does not wait for each operation to complete. So in fact you are just queuing up thousands of .save() requests and trying to run them all concurrently. You can't do that within reasonable limits, hence you get the error response.

The async module has various methods for iterating while processing a callback for each iteration; probably the simplest direct form is whilst. Mongoose also handles the connection management for you without needing to embed everything within the connect callback, as the models are connection-aware:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

var i = 0;
async.whilst(
    function() { return i < 10000000; },
    function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;
        new TempCol({cid: Math.random(), loc: [lon, lat]}).save(function(err){
            callback(err);
        });
    },
    function(err) {
       // When the loop is complete or on error
    }
);

Not the most fantastic way to do it, as it is still one write at a time, and you could use other methods to "govern" the concurrent operations, but at least this will not blow up the call stack.
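To illustrate what "governing" the concurrent operations means, here is a minimal sketch of a concurrency limiter in plain Node. This is a hypothetical helper, not part of mongoose or the async module (which offers the same idea as `async.eachLimit`); the `done` callbacks stand in for `.save()` completions and are invoked by hand so the whole demonstration stays synchronous and observable:

```javascript
'use strict';

// Hypothetical limiter: run callback-style tasks with at most `limit` in flight.
function makeLimiter(limit) {
  const queue = [];
  let inFlight = 0;
  let peak = 0;
  function next() {
    while (inFlight < limit && queue.length > 0) {
      const task = queue.shift();
      inFlight += 1;
      if (inFlight > peak) peak = inFlight;
      // Each task receives a `done` callback to signal completion.
      task(function done() {
        inFlight -= 1;
        next(); // a finished task frees a slot for the next queued one
      });
    }
  }
  return {
    push(task) { queue.push(task); next(); },
    stats() { return { inFlight: inFlight, peak: peak, queued: queue.length }; }
  };
}

// Demo: queue 10 simulated "saves" under a limit of 3; we capture each
// task's `done` so we can complete them manually.
const limiter = makeLimiter(3);
const pendingDones = [];
for (let i = 0; i < 10; i++) {
  limiter.push(function (done) { pendingDones.push(done); });
}
console.log(limiter.stats()); // { inFlight: 3, peak: 3, queued: 7 }
pendingDones.shift()();       // completing one save starts the next queued one
console.log(limiter.stats()); // { inFlight: 3, peak: 3, queued: 6 }
```

With a real mongoose model, each task would call `doc.save()` and invoke `done` in its callback; the point is simply that no more than `limit` saves are ever pending at once.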

From MongoDB 2.6 and greater you can make use of the Bulk Operations API in order to process more than one write at a time on the server. The process is similar, but this time you can send 1000 operations to the server in a single write and response, which is much faster:

var async = require('async');
var mongoose = require('mongoose');
var Schema = mongoose.Schema;

var tempColSchema = new Schema({
    cid: {
        type: Number,
        required: true
    },
    loc: []
});

var TempCol = mongoose.model( "TempCol", tempColSchema );

mongoose.connect( 'mongodb://localhost/mean-dev' );

mongoose.connection.on("open", function(err,conn) {

    var i = 0;
    var bulk = TempCol.collection.initializeOrderedBulkOp();

    async.whilst(
      function() { return i < 10000000; },
      function(callback) {
        i++;
        var c = i;
        console.log(c);
        var lon = parseInt(c / 100000);
        var lat = c % 100000;

        bulk.insert({ "cid": Math.random(), "loc": [ lon, lat ] });

        if ( i % 1000 == 0 ) {
            bulk.execute(function(err,result) {
                bulk = TempCol.collection.initializeOrderedBulkOp();
                callback(err);
            });
        } else {
            process.nextTick(callback);
        }
      },
      function(err) {
        // When the loop is complete or on error

        // If you had a count not evenly divisible by 1000
        if ( i % 1000 != 0 )
            bulk.execute(function(err,result) {
                // possibly check for errors here
            });
      }
    );

});

这实际上是使用了尚未在mongoose中公开的本机驱动程序方法,因此需要额外的关注正在采取措施,以确保连接可用。这是一个例子,但不是唯一的方法,但重点是连接的猫鼬魔术不是在这里建立的,所以你应该确定它已经建立。

That is actually using the native driver methods which are not yet exposed in mongoose, so the additional care there is being taken to make sure the connection is available. That's an example but not the only way, but the main point is the mongoose "magic" for connections is not built in here so you should be sure it is established.

Here you have a total count that divides evenly into the batch size, but where that is not the case you should also call bulk.execute() in the final block, as shown, depending on the remainder from the modulo.
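The flush-plus-remainder bookkeeping can be seen on its own if the driver's bulk object is replaced by a hypothetical in-memory recorder. This sketch (names are illustrative, not part of any library) counts how many documents each `execute` call would carry, including the final partial batch:

```javascript
'use strict';

// Sketch of the "flush every 1000, then flush the remainder" logic from the
// answer above. Returns the size of each simulated bulk.execute() call.
function batchCounts(totalDocs, batchSize) {
  const flushes = [];
  let buffered = 0;
  for (let i = 1; i <= totalDocs; i++) {
    buffered += 1;
    if (i % batchSize === 0) {   // same test as `i % 1000 == 0` in the loop
      flushes.push(buffered);
      buffered = 0;
    }
  }
  if (buffered > 0) flushes.push(buffered); // the final, partial batch
  return flushes;
}

console.log(batchCounts(10500, 1000)); // ten full batches plus a 500-doc remainder
console.log(batchCounts(3000, 1000));  // three full batches, no remainder flush
```

This is why the completion callback in the answer re-checks `i % 1000 != 0`: without that final `bulk.execute()`, up to `batchSize - 1` documents would sit in the buffer and never reach the server.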

The main point is not to grow a stack of operations to an unreasonable size, and to keep the processing limited. The flow control here allows operations that take some time to actually complete before moving on to the next iteration. So either batched writes or some additional parallel queuing is what you want for best performance.

There is also the .initializeUnorderedBulkOp() form if you don't want write errors to be fatal but would rather handle them in a different way. Mostly, see the official documentation on the Bulk API and its responses for how to interpret the response given.

Reposted from 学新通技术网.