... all pixels in U in the input image I.
A spatial filter is a transformation T with the following characteristics:
- The location of the output doesn't change.
- The operation is repeated for each pixel.
- T can be linear or nonlinear.
A linear transformation can be written as a weighted sum over the neighborhood U:
$$T[I](r, c) = \sum_{(u, v) \in U} w(u, v) \, I(r + u, c + v)$$
and we can consider the weights as an image, or a filter w, that entirely defines the operation.
We define the correlation between a filter w and an image I as the following operation:
$$(I \otimes w)(r, c) = \sum_{(u, v) \in U} w(u, v) \, I(r + u, c + v)$$
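As a minimal sketch of the correlation between a filter and an image, the following NumPy code slides a filter over a grayscale image and computes the weighted sum at each location. The function name `correlate2d`, the restriction to the "valid" region (no padding) and the averaging filter are assumptions made for illustration; they do not come from the notes.

```python
import numpy as np

def correlate2d(image, w):
    """Correlation of a 2D image with a filter w, computed only where
    the filter fits entirely inside the image (no padding)."""
    fh, fw = w.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # weighted sum of the neighborhood, weights given by the filter
            out[r, c] = np.sum(w * image[r:r + fh, c:c + fw])
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
box = np.ones((3, 3)) / 9.0                       # averaging filter (assumed example)
smoothed = correlate2d(image, box)
```

The same operation is repeated for every pixel, and the filter `box` alone defines what the transformation does.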
4.2 Linear Classifier
Dimensionality prevents us from using a deep NN such as those seen so far. A 1-layer NN to classify images is a feasible but poor solution. We can use a fully connected network where w_{i,j} is the weight associated with the i-th neuron of the input when computing the j-th output neuron.
Note that we “define” an output node for each class of the classifier. We can arrange the weights in a matrix W; then the score of the i-th class is given by the product of the i-th row of W with the input: s_i = W[i,:]*x + b_i. This is equal to a linear classifier K(x) = Wx + b. W[i,:] can be seen as a template for the i-th class.
The classification score is then computed as a correlation between each input and the class template.
This kind of model is however too simple to achieve good performance.
Moreover, stacking multiple linear layers is equivalent to a single linear layer, so adding depth without nonlinearities does not help.
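The linear classifier K(x) = Wx + b, and the fact that two stacked linear layers collapse into one, can be checked numerically. This is a sketch with toy sizes and random weights; all sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_classes = 12, 3                 # toy sizes (assumed)
x = rng.random(n_pixels)                    # flattened input image
W = rng.standard_normal((n_classes, n_pixels))
b = rng.standard_normal(n_classes)

scores = W @ x + b                          # K(x) = Wx + b, one score per class
predicted = int(np.argmax(scores))          # class with the highest score

# Two stacked linear layers (no activation in between) ...
W1, b1 = rng.standard_normal((5, n_pixels)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((n_classes, 5)), rng.standard_normal(n_classes)
stacked = W2 @ (W1 @ x + b1) + b2

# ... are equivalent to a single linear layer with these parameters.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
single = W_eq @ x + b_eq
```

Here `stacked` and `single` agree up to floating-point error, which is why nonlinearities are needed between layers.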
4.3 Image Classification Problem
Image classification is a very unusual setting for Deep Learning:
- Images are very high-dimensional data, so we have few samples but high computational cost.
- Label Ambiguity: A Label might not uniquely identify the images.
- There are many transformations that change the image, but not its label.
- Inter-Class variability.
- Perceptual Similarity in images is not related to (data) pixel-similarity.
Convolutional Neural Network
Images cannot be fed directly to a classifier; they need some intermediate steps to:
- Extract meaningful information.
- Reduce Data Dimension.
We have 2 ways to extract features:
- "By Hand": where we exploit a priori information, features are interpretable and we need a limited amount of training data.
- Once the image has been reduced to a vector, this is fed to a traditional neural network.
CNNs are made of blocks that include:
- Convolutional Layers
- Nonlinearities (Activation Functions)
- Pooling Layers (SubSampling, MaxPooling)
- Dense Layers
An image passing through a CNN is transformed into a sequence of volumes. As the depth increases, the height and width of the volume decrease.
5.2 Convolutional Layer
Convolutional layers “mix” all the input components. The output is a linear combination of all the values in a region of the input, considering all the channels (e.g. the RGB components of the image). Filters need to have the same number of channels as the input. Different filters compute different channels of the output volume. Typically filters have a very small spatial extent and a large depth extent.
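A minimal NumPy sketch of a convolutional layer acting on a volume may help to see how each filter spans all the input channels and produces one output channel. The function `conv_layer`, the stride of 1, the absence of padding and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def conv_layer(volume, filters, biases):
    """volume: (H, W, C); filters: (K, F, F, C); biases: (K,).
    Stride 1, no padding (assumed for simplicity)."""
    H, W, C = volume.shape
    K, F, _, Cf = filters.shape
    assert C == Cf, "filters must have as many channels as the input"
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                      # one output channel per filter
        for r in range(H - F + 1):
            for c in range(W - F + 1):
                region = volume[r:r + F, c:c + F, :]   # F x F x C region
                out[r, c, k] = np.sum(region * filters[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
rgb = rng.random((8, 8, 3))                     # toy RGB image
filters = rng.standard_normal((4, 3, 3, 3))     # 4 filters of size 3x3x3
biases = np.zeros(4)
out = conv_layer(rgb, filters, biases)          # shape (6, 6, 4)
n_params = filters.size + biases.size           # 4 * (3*3*3) + 4 = 112
```

Note that the number of parameters depends only on the filter size and on the number of input and output channels, not on the spatial size of the image.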
5.3 Other Layers
Activation layers introduce nonlinearities in the network; otherwise the CNN would be equivalent to a linear classifier. They are scalar functions applied element-wise and do not change the volume size.
The most used activation function is ReLU, together with its variants.
Pooling layers reduce the spatial size of the volume (height, width). They operate independently on every depth slice of the input and resize it spatially.
Common pooling operations are max and average.
Note that in this phase the number of channels (depth) remains the same.
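The effect of activation and pooling layers on the volume size can be sketched in NumPy as follows. The helper names `relu` and `max_pool`, the non-overlapping 2x2 windows and the toy input are assumptions for illustration.

```python
import numpy as np

def relu(volume):
    """Element-wise ReLU: the volume size is unchanged."""
    return np.maximum(volume, 0.0)

def max_pool(volume, size=2):
    """Non-overlapping max pooling on an (H, W, C) volume.
    Assumes H and W are divisible by `size`."""
    H, W, C = volume.shape
    blocks = volume.reshape(H // size, size, W // size, size, C)
    return blocks.max(axis=(1, 3))          # each depth slice is pooled independently

volume = np.array([[ 1., -2.,  3.,  0.],
                   [-1.,  5., -4.,  2.],
                   [ 0.,  1., -3., -1.],
                   [ 2., -6.,  4.,  1.]]).reshape(4, 4, 1)
activated = relu(volume)        # still (4, 4, 1)
pooled = max_pool(activated)    # (2, 2, 1): height and width halved, depth unchanged
```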
5.4 Dense Layers
This is the last part of the architecture, where the spatial dimension is lost and the input has become a vector.
It is called dense because each output neuron is connected to every input neuron.
The output of this layer has the same size as the number of classes, and provides a score for the input image to belong to each class.
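As a rough sketch of the dense part, the code below flattens a toy volume into a vector and computes one score per class. The sizes, the random weights and the final softmax are assumptions; the notes only state that the layer outputs a score for each class.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((4, 4, 8))     # toy output of the convolutional part (assumed size)
n_classes = 10                          # assumed number of classes

x = feature_map.reshape(-1)             # spatial structure is lost: 4*4*8 = 128 values
W = rng.standard_normal((n_classes, x.size)) * 0.01   # every output sees every input
b = np.zeros(n_classes)

scores = W @ x + b                      # one score per class
# Softmax turns scores into probabilities (an assumption, not stated in the notes).
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

With 128 inputs and 10 outputs the dense layer alone has 1280 weights, which illustrates why the spatial size is reduced before reaching it.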
In a neural network, a feature map is the set of values output by a convolutional layer.
Note that the fully connected part takes as input a feature map, which is the result of the convolutional part.
A feature map is a condensed representation of the input data that highlights the presence of the features detected by the filter.
5.5 Useful Formulas
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by (W-F+2P)/S+1. In general, setting zero padding to be P=(F-1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially.
- w_new = floor((W - F + 2P)/S + 1)
- Padding = "valid": P = 0
- Padding = "same": P = (F-1)/2
- Padding = "causal": P = (F-1), all added on one side, so the total padding is F-1 rather than 2P
Note that the layer "tfkl.Conv2DTranspose" is the opposite of the convolution layer, so we have to multiply instead of divide when computing the output size.
6. CNN Parameter and Training
Convolution is a linear operator, but, if we unroll the input ...
To compute the receptive field of a neuron in a CNN, you need to consider the size of the receptive field of the previous layer and the size of the filter used in the current layer. The receptive field of a neuron is the region in the input that affects the computation of its output.
- Pooling: pooling layers are used in CNNs to reduce the spatial dimensions of the input. They do this by downsampling the input using operations like max pooling or average pooling. Pooling helps in reducing the number of parameters and controlling overfitting.
- Stride: the number of pixels by which the filter is moved horizontally or vertically when scanning the input. A larger stride results in a smaller output size and less computation.
- Padding: the process of adding extra pixels to the input image to preserve its spatial dimensions after convolution. It avoids the reduction of the output size and retains more information from the borders of the input.
- Activation Function: activation functions introduce nonlinearity to the output of a neuron. Common activation functions used in CNNs include ReLU (Rectified Linear Unit), sigmoid, and tanh.
- Dropout: a regularization technique used in CNNs to prevent overfitting. It randomly sets a fraction of the input units to 0 during training, which reduces the reliance of the network on specific features.
- Batch Normalization: a technique used to normalize the inputs of each layer in a CNN. It helps in stabilizing the learning process and improving the overall performance of the network.
These are some of the key concepts and techniques used in CNNs for image processing and analysis.
Receptive field, with padding "same" (starting from the end with Rec = 1 and moving backward through the layers):
- with convolution: Rec += F - 1
- with pooling: Rec *= PoolFactor
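The output-size formula floor((W - F + 2P)/S + 1) and the backward receptive-field rule (start from 1 at the output, add F - 1 for each convolution, multiply by the pooling factor for each pooling layer) can be sketched as small helper functions. The function names and the layer encoding as `("conv", F)` / `("pool", factor)` tuples are hypothetical, chosen only for this example; the receptive-field rule assumes stride-1 convolutions.

```python
import math

def conv_output_size(W, F, S=1, P=0):
    """Spatial output size for input size W, filter size F, stride S, zero padding P."""
    return math.floor((W - F + 2 * P) / S + 1)

def same_padding(F):
    """Zero padding that keeps the spatial size unchanged when S = 1 (odd F)."""
    return (F - 1) // 2

def receptive_field(layers):
    """layers: list of ("conv", F) or ("pool", factor), in input-to-output order.
    Walks the network backward starting from a receptive field of 1."""
    rec = 1
    for kind, value in reversed(layers):
        if kind == "conv":
            rec += value - 1
        elif kind == "pool":
            rec *= value
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    return rec

valid_size = conv_output_size(32, 5)                        # "valid" padding
same_size = conv_output_size(32, 3, S=1, P=same_padding(3)) # "same" padding
strided_size = conv_output_size(227, 11, S=4, P=0)          # large filter, stride 4
rec = receptive_field([("conv", 3), ("pool", 2), ("conv", 3)])
```

For the last example the backward walk gives 1 -> 3 -> 6 -> 8, so each output neuron sees an 8x8 region of the input.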
6.1 Training a CNN
Each CNN can be seen as an MLP, so it can in principle be trained by gradient descent to minimize a loss function over a batch. The gradient can be computed by backpropagation, with a few precautions:
- Weight sharing needs to be taken into account while computing derivatives: the gradient of a shared weight is the sum of the contributions from all the positions where it is used.
- The gradient is only routed through the input pixels that contribute to the output values (e.g. if max pooling is used with 4 input nodes, the gradient is backpropagated only to 1 of the 4 nodes, the one holding the maximum value).
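Both precautions can be illustrated with a small NumPy sketch: the gradient of a max pooling window reaches only the maximum entry, and the gradient of a shared filter weight sums the contributions of every position where the filter is applied. The function name, the toy values and the 1D filter are assumptions for illustration.

```python
import numpy as np

def max_pool_backward(window, upstream_grad):
    """Gradient of max(window) with respect to each entry of the window:
    only the entry holding the maximum receives the upstream gradient."""
    grad = np.zeros_like(window)
    idx = np.unravel_index(np.argmax(window), window.shape)
    grad[idx] = upstream_grad
    return grad

window = np.array([[1.0, 3.0],
                   [2.0, 0.0]])
grad = max_pool_backward(window, upstream_grad=5.0)

# Weight sharing: a 1D filter w is applied at every position of x,
# y[i] = w[0] * x[i] + w[1] * x[i + 1], so dL/dw sums over all positions i.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0])
upstream = np.ones(3)                                  # dL/dy, assumed all ones
grad_w = np.array([np.sum(upstream * x[0:3]),          # dL/dw[0]
                   np.sum(upstream * x[1:4])])         # dL/dw[1]
```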
6.2 Data Scarcity
Deep Learning models are very data hungry, but often only a limited amount of labeled data is available, so we use some tricks.
6.2.1 Data Augmentation
Data augmentation is the technique of generating additional samples by applying to the initial images transformations that do not change their label and are meaningful for the task. We can have 2 types of transformations:
- Geometric Transformation: shift, rotation, scaling, flip, …
- Photometric Transformation: Adding noise, modifying contrast or brightness, …
This technique is also used to push the network to become invariant to some transformations, even if this does not work all the time. Test Time Augmentation (TTA), i.e. averaging the predictions over several augmented versions of a test image, can be used to improve the accuracy and invariance of our model.
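A minimal NumPy sketch of geometric and photometric augmentation, together with test time augmentation, could look like the following. The specific transformations, their ranges, the wrap-around shift and the placeholder `toy_model` are assumptions for illustration; whether a flip preserves the label depends on the task.

```python
import numpy as np

def augment(image, rng):
    """Random transformation of an (H, W, C) image with values in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                       # geometric: horizontal flip
    shift = int(rng.integers(-2, 3))
    out = np.roll(out, shift, axis=1)               # geometric: shift (wrap-around for simplicity)
    out = out * rng.uniform(0.8, 1.2)               # photometric: brightness
    out = out + rng.normal(0.0, 0.01, out.shape)    # photometric: additive noise
    return np.clip(out, 0.0, 1.0)

def predict_tta(model, image, rng, n_views=8):
    """Test time augmentation: average the model output over augmented views."""
    views = [augment(image, rng) for _ in range(n_views)]
    return np.mean([model(v) for v in views], axis=0)

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
# Placeholder standing in for a trained classifier with 2 output scores.
toy_model = lambda img: np.array([img.mean(), 1.0 - img.mean()])
augmented = augment(image, rng)
avg_scores = predict_tta(toy_model, image, rng)
```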
6.2.2 Transfer Learning
We can divide a CNN into 2 parts:
- Data-driven feature extraction (convolutional part): quite generic
- Feature classifier (the FC layers, starting from the vector input): task-specific
We can take the convolutional part of a CNN well trained on a big and generic problem and modify the FC layers to match the new problem. After that we freeze the weights in the convolutional layers and train the remaining part of the network. We have 2 different approaches:
- Transfer Learning: only the FC part is trained again
- Fine Tuning: a part of the convolutional part is trained as well
The two are typically used in sequence.
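The idea of freezing the feature extractor and training only the new classifier can be sketched without any deep learning framework. In this toy example a fixed random projection followed by ReLU stands in for the pretrained convolutional part, and only a new softmax head is trained by full-batch gradient descent; all sizes, data and names are assumptions for illustration, not the notes' own setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: its weights are frozen.
W_feat = rng.standard_normal((16, 64)) * 0.1
W_feat_before = W_feat.copy()

# Toy dataset for the new task: 30 samples, 3 classes (random labels).
X = rng.random((30, 64))
y = rng.integers(0, 3, size=30)

features = np.maximum(X @ W_feat.T, 0.0)     # computed once, extractor never updated
W_head = np.zeros((3, 16))                   # new task-specific classifier

def loss_and_grad(W_head):
    logits = features @ W_head.T
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(len(y)), y]))   # cross-entropy
    P[np.arange(len(y)), y] -= 1.0                     # gradient w.r.t. the logits
    return loss, P.T @ features / len(y)

loss_before, _ = loss_and_grad(W_head)
for _ in range(200):
    _, grad = loss_and_grad(W_head)
    W_head -= 0.5 * grad                     # only the head is updated
loss_after, _ = loss_and_grad(W_head)
```

Fine tuning would additionally update (part of) `W_feat`, usually with a smaller learning rate, after the head has been trained.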
Famous CNN Architectures
7.1 AlexNet
This architecture is famous because it popularized ReLU, Dropout and MaxPooling; Dropout in particular is used to reduce overfitting. The network was also split into 2 parts to be trained on 2 parallel GPUs. Most connections are among feature maps on the same GPU, which are mixed in the last layers. Large filters are used (11x11 in the first layer).
7.2 VGG16
Variant of AlexNet with smaller filters and a deeper network. It increases the depth of the network by adding more convolutional layers, which is feasible thanks to the use of very small (3x3) convolutional filters in all layers. Stacking several small filters gives the same receptive field as a single larger filter, with fewer parameters to train.
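The trade-off between stacked small filters and a single large filter can be checked with a quick count. The sketch below assumes the same number of input and output channels in every layer, stride 1, and ignores biases; the helper names and the choice of 64 channels are illustrative.

```python
def conv_params(filter_size, channels, n_layers=1):
    """Weights (biases ignored) of n_layers conv layers, each with
    `channels` input channels and `channels` output channels."""
    return n_layers * filter_size * filter_size * channels * channels

def stacked_receptive_field(filter_size, n_layers):
    """Receptive field of n_layers stacked stride-1 convolutions."""
    return 1 + n_layers * (filter_size - 1)

C = 64                                        # assumed number of channels
three_3x3 = conv_params(3, C, n_layers=3)     # 27 * C^2 weights
one_7x7 = conv_params(7, C)                   # 49 * C^2 weights
rf_small = stacked_receptive_field(3, 3)      # 7
rf_large = stacked_receptive_field(7, 1)      # 7
```

Three 3x3 layers cover the same 7x7 region as one 7x7 layer with roughly 45% fewer weights, and they also apply a nonlinearity after each layer.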
7.3 Network In Network
Instead of a convolution layer, use a sequence of FC+ReLU. NiN uses a stack of FC layers followed by ReLU in a sliding manner over the entire image. This corresponds to MLP networks used convolutionally. The key idea behind NiN is to use "micro" neural networks, called NiN blocks (or mlpconv layers), which process spatial and channel-wise information simultaneously.
By using the MLP layers in the mlpconv blocks, implemented as 1x1 convolutions, the model is able to capture complex patterns and relationships within the input data. The global average pooling layer reduces the spatial dimensions of the feature maps to a single value per channel while retaining important information. This gives a more compact representation of the input, which can improve efficiency and reduce overfitting. The NiN architecture has been shown to be effective in computer vision tasks such as image classification, and its ideas (1x1 convolutions and global average pooling) have been widely adopted in later architectures.
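The equivalence between a 1x1 convolution and a fully connected layer applied at every pixel, followed by global average pooling, can be sketched in NumPy. The sizes, the random weights and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((6, 6, 8))          # toy (H, W, C) volume
W = rng.standard_normal((8, 4))              # 1x1 convolution: 8 -> 4 channels
b = rng.standard_normal(4)

# 1x1 convolution + ReLU, written as a matrix product over the channel axis.
conv1x1 = np.maximum(feature_map @ W + b, 0.0)          # shape (6, 6, 4)

# The same computation as an FC layer applied independently at each pixel.
per_pixel = np.zeros((6, 6, 4))
for r in range(6):
    for c in range(6):
        per_pixel[r, c] = np.maximum(W.T @ feature_map[r, c] + b, 0.0)

# Global average pooling: one value per channel, spatial dimensions removed.
gap = conv1x1.mean(axis=(0, 1))                          # shape (4,)
```

The two results coincide, which is what "MLP used convolutionally" means in practice; the pooled vector can then be used directly as class scores when the number of channels equals the number of classes.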