Context
When it comes to scraping, Scrapy is one of the best-known and most widely used frameworks. Its large community and extensive collection of third-party extensions are only a few of the reasons to choose Scrapy. You can find a Scrapy extension for almost anything.
File storing
As I have already discussed elsewhere, storing a large number of files is quite challenging. Always keep in mind that putting a large number of files in a single folder is not a good idea.
In Scrapy's case, all downloaded files are stored in a single folder. So I decided to implement a Scrapy pipeline extension that stores files more efficiently using folder trees.
scrapy-files-hierarchy
sp1thas / scrapy-folder-tree
A scrapy pipeline which stores files using folder trees.
scrapy-folder-tree
This is a scrapy pipeline that provides an easy way to store files and images using various folder structures.
Supported folder structures:
Given this scraped file: 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, you can choose one of the following folder structures:
Using the file name
class: scrapy_folder_tree.ImagesHashTreePipeline
full
├── 0
. ├── 5
. . ├── b
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
Using the crawling time
class: scrapy_folder_tree.ImagesTimeTreePipeline
full
├── 0
. ├── 11
. . ├── 48
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
Using the crawling date
class: scrapy_folder_tree.ImagesDateTreePipeline
full
├── 2022
. ├── 1
. . ├── 24
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
This Scrapy pipeline provides various ways to store your crawled files. Currently, given this scraped file: 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg, the following three folder structures are supported:
Using the file name (hash)
full
├── 0
. ├── 5
. . ├── b
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
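As a rough illustration of the file-name layout above, the sketch below derives the nested path from the leading characters of the name. The function name `hash_tree_path` and its `depth` and `root` parameters are hypothetical and not part of scrapy-folder-tree; the package's actual implementation may differ.

```python
from pathlib import PurePosixPath


def hash_tree_path(file_name: str, depth: int = 3, root: str = "full") -> str:
    """Build a nested path from the leading characters of the file name.

    Hypothetical helper for illustration only, not the package's API.
    """
    stem = PurePosixPath(file_name).stem  # file name without its extension
    if depth < 0 or depth > len(stem):
        raise ValueError("depth must be between 0 and the length of the file stem")
    # One folder level per leading character, e.g. "05b4..." -> "0/5/b"
    return str(PurePosixPath(root, *stem[:depth], file_name))


print(hash_tree_path("05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"))
# full/0/5/b/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```

Scrapy's default file names are SHA-1 hex digests, so each level can have at most 16 subfolders. A depth of 3 therefore spreads the files over up to 16**3 = 4096 leaf folders.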
Using the crawling time
full
├── 0
. ├── 11
. . ├── 48
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
Using the crawling date
full
├── 2022
. ├── 1
. . ├── 24
. . . ├── 05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
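The two timestamp-based layouts can be sketched the same way. This assumes the 0/11/48 tree is hour/minute/second and the 2022/1/24 tree is year/month/day, both without zero padding. The helper names `time_tree_path` and `date_tree_path` are hypothetical and only illustrate the idea; they are not the package's API.

```python
from datetime import datetime
from pathlib import PurePosixPath


def time_tree_path(file_name: str, when: datetime, root: str = "full") -> str:
    """Nest the file under hour/minute/second of the crawl (assumed layout)."""
    parts = (str(when.hour), str(when.minute), str(when.second))
    return str(PurePosixPath(root, *parts, file_name))


def date_tree_path(file_name: str, when: datetime, root: str = "full") -> str:
    """Nest the file under year/month/day of the crawl (assumed layout)."""
    parts = (str(when.year), str(when.month), str(when.day))
    return str(PurePosixPath(root, *parts, file_name))


# Example crawl timestamp chosen to match the trees shown in this post
crawled_at = datetime(2022, 1, 24, 0, 11, 48)
name = "05b40af07cb3284506acbf395452e0e93bfc94c8.jpg"

print(time_tree_path(name, crawled_at))
# full/0/11/48/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
print(date_tree_path(name, crawled_at))
# full/2022/1/24/05b40af07cb3284506acbf395452e0e93bfc94c8.jpg
```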
Installation
pip install scrapy-folder-tree
Usage
Use the following settings in your project:
ITEM_PIPELINES = {
    'scrapy_folder_tree.FilesHashTreePipeline': 300
}
FOLDER_TREE_DEPTH = 3
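For context, here is a slightly fuller settings sketch. It assumes the pipeline builds on Scrapy's built-in files pipeline, which stays disabled unless `FILES_STORE` is set. The `downloads` directory is a placeholder, and reading `FOLDER_TREE_DEPTH` as the number of nested levels is an assumption based on the three-level trees shown here.

```python
# settings.py (sketch)

ITEM_PIPELINES = {
    "scrapy_folder_tree.FilesHashTreePipeline": 300,
}

# Standard Scrapy setting: base directory for downloaded files.
# "downloads" is a placeholder path, not something the package requires.
FILES_STORE = "downloads"

# Assumed to control how many folder levels are created (3 -> full/0/5/b/...).
FOLDER_TREE_DEPTH = 3
```

With Scrapy's files pipeline, items list the URLs to download in a `file_urls` field.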
Feel free to give it a try and share your feedback.
Future work
- Support more folder structures
- Parameterize folder structure